Finetuning large language models for software development
Introduction
One Friday afternoon, while planning the following week's software development work, a thought crossed my mind: "Wouldn't it be nice if I could issue a set of instructions about the intended feature and have the machine take at least a first pass at writing the relevant functions for me?"
Large language models (from here on just referred to as LMs) got a lot of attention in 2023. So the idea was to see how well these LMs, finetuned on our company's code (which focuses on predicting energy output from PV plants), perform on a much simpler task.
First, to get it out of the way: I am of course familiar with GitHub Copilot. But Copilot is paid, and I would also like control over the internals of the LMs rather than just a black box.
Designing a system that creates interrelated blocks of code that integrate into a functioning codebase in response to a user command is a very challenging endeavor. As such, I limited the scope to something much more manageable: generating detailed code from Python function documentation (from here on referred to as docstrings).
Background
In our codebase, we strive to adhere to standards for both docstrings and functions. Every docstring has, at a minimum, the same sections: a description of what the function does, its inputs, and its outputs. We intentionally write layperson-friendly explanations of the pertinent engineering and solar concepts (although we don't repeat these detailed explanations across functions).
In our code, we follow the Google coding standard. We strive for consistent variable naming and a certain coding style (e.g. Pandas/NumPy-heavy vectorization, writing for humans, DRY, etc.).
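For illustration, here is a stripped-down skeleton of the docstring structure every function carries. The function and its parameters are hypothetical, invented only to show the format; real examples from the codebase appear later in the article.

```python
import pandas as pd


def calc_soiling_loss(irradiance, soiling_rate):
    """Calculate the energy loss factor due to soiling on the panels.

    Soiling is the dust and dirt that accumulates on the modules; it blocks
    part of the incoming sunlight and therefore reduces energy output.

    Parameters
    ----------
    irradiance : pd Series
        Plane-of-array irradiance, in W/m2, indexed by timestamp.
    soiling_rate : float
        Fraction of irradiance lost to soiling, between 0 and 1.

    Returns
    -------
    pd Series
        Loss factor for each timestamp.
    """
    # Hypothetical example, not a function from the actual codebase.
    return pd.Series(soiling_rate, index=irradiance.index, name='lf_soiling')
```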
So we can use the docstrings, which in essence describe what the function does, to generate the function itself. The way we achieve that is by finetuning existing LMs trained on code (ideally Python).
The questions
- What LMs can we test?
- How much do they improve if we finetune them, as opposed to just using them out of the box?
- How good (or bad) is the code they generate? Does it even run?
Models
I decided to use the following code-specific models. Note the values refer to the number of parameters in the model:
- SalesForce Codegen 350M
- Trained on 71.5B Python tokens
- Decicoder 1B
- Trained on 446B tokens from the Python, Java, and JavaScript subset of the StarCoder Training Dataset
- CodeParrot 1.5B
- Based on GPT-2
My original intent was to also finetune CodeLlama, released by Meta in August 2023. It is a 7B-parameter model and has achieved top performance metrics on code generation tasks. However, I encountered memory issues training on expensive GPUs of various sizes and had to halt that work temporarily. I'll detail my efforts and the results in a future article.
Data preparation
Our codebase consists of about 10 modules (aka Python files), some of which contain classes. In total there are approximately 200 functions. The functions are of course connected to each other semantically (i.e. pertaining to meaning).
To simplify the problem, though, I basically ignored class definitions and the connections between functions.
I then separated each function into an input section for the docstring and the output section for the function itself.
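As a rough sketch of that preprocessing step, the split can be done with Python's ast module. The function and field names below are illustrative, not the actual pipeline from the repo:

```python
import ast


def make_pairs(module_source):
    """Split each function in a module into a (docstring, function) pair."""
    tree = ast.parse(module_source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node)
            if docstring:
                # The docstring is the model input and the full function
                # source is the training target (class context and
                # inter-function links are ignored, as described above).
                func_source = ast.get_source_segment(module_source, node)
                pairs.append({"input": docstring, "output": func_source})
    return pairs
```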
I left 3 functions out of the dataset used for finetuning in order to test how well the models perform. This is a very small number to base metrics on - I sacrificed metric generalization in favor of using as much data as possible to get the best model.
Modeling
I modeled using AzureML in order to make the experiment architecture transparent and reproducible and to leverage cloud compute. Details are in the Github repo. I finetuned on Codegen and Decicoder for 10 epochs with a batch size of 20, and CodeParrot for 6 epochs with a batch size of 100. For all models, I used a sequence length of 500 tokens with a Standard_E8s_v3 machine (64 GB RAM, 128 GB storage, 16 cores, $0.64/hr). The training took around 10.5 hours.
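For reference, here is a minimal sketch of what the finetuning step looks like with the Hugging Face Trainer API, using the Codegen hyperparameters mentioned above. The actual runs were orchestrated through AzureML (see the repo); the checkpoint name, output path, and the `pairs` variable from the preprocessing sketch are assumptions for illustration.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "Salesforce/codegen-350M-mono"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def tokenize(example):
    # Concatenate docstring (input) and function (output) into one
    # causal-LM training sequence, capped at 500 tokens.
    text = example["input"] + "\n" + example["output"]
    return tokenizer(text, truncation=True, max_length=500,
                     padding="max_length")


train_ds = Dataset.from_list(pairs).map(tokenize)  # `pairs` from the prep step

args = TrainingArguments(
    output_dir="finetuned-codegen",
    num_train_epochs=10,
    per_device_train_batch_size=20,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    # mlm=False makes the collator build causal-LM labels from input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```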
Baseline predictions
In order to ascertain that finetuning really has an effect, it's instructive to predict on our test functions using the LMs out-of-the-box.
Note that for the test set, I chose 3 functions that represent the range of complexity within our code. I will share 2 of them - the 3rd one has our secret sauce for uncertainty quantification in energy losses.
Function 1: Calculate PV efficiency loss
None of the models produce useful, let alone correct, output. But they are amusing.
Codegen's prediction shows the model's affinity for dashes, equal signs and asterisks. It proceeds to a random assortment of Greek, what looks to be some central European language (maybe Czech?), and concludes with 3 solar-related terms. CodeParrot in turn generates a class that seems to be a hodge-podge of programming languages. Surprisingly, there's a big docstring section within the function definition itself.
Function 2: Get distance
```python
def get_distance(x, site_lat, site_long):
    """Get distance between two geographical coordinates.

    Parameters
    ----------
    x : pd Series
        Pandas Series containing information about a neighbouring site.
    site_lat : float
        Site latitude, in decimal degrees.
    site_long : float
        Site longitude, in decimal degrees.

    Returns
    -------
    float
        Distance between site and neighbouring site.

    See Also
    --------
    data_import.legacy_get_nearest_site_nrel_info
    """
    d_lat = math.radians(site_lat) - x['lat_rad']
    d_lng = math.radians(site_long) - x['long_rad']
    temp = (math.sin(d_lat / 2)**2 +
            math.cos(x['lat_rad']) * math.cos(32.03914409) *
            math.sin(d_lng / 2)**2)
    return 6373.0 * (2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp)))
```

This is a fairly straightforward function, but the results are bad. It appears the docstring format is doing more harm than good.
Finetuned predictions
Now let's examine the predictions from the finetuned models. Note that I actually had to manually tweak the parameters min_new_tokens and max_new_tokens at inference time to keep the models from generating extremely short results.
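As a sketch of what that inference call looks like with the transformers generate API (the checkpoint path and the `docstring` variable are illustrative; the bounds match the 200/1000 token lengths used for the metrics later on):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative path to a finetuned checkpoint saved by the training step.
tokenizer = AutoTokenizer.from_pretrained("finetuned-codegen")
model = AutoModelForCausalLM.from_pretrained("finetuned-codegen")

prompt = docstring  # the held-out function's docstring, as in the examples below
inputs = tokenizer(prompt, return_tensors="pt")

# Without a floor on the generation length, the models tended to stop
# after only a handful of tokens.
outputs = model.generate(**inputs, min_new_tokens=200, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```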
Function 1 - Calculate PV efficiency loss
```python
def calc_lf_pv_eff(self):
    """Calculate photovoltaic conversion losses.

    Assumes uncertainty consists of a fixed fractional component of the
    location value, and a variable component that increases with the age
    of the plant. As degradation proceeds, we have decreasing confidence
    in its actual level. Both the actual degradation and degradation
    uncertainty are re-calculated at a daily level so avoid quantum jumps
    in values at the beginning of every year.

    Note that the user should also specifiy a positive asymmetry factor.
    This ensures that even though our uncertainty increases with time, the
    increase in uncertainty is asymmetrical - it is higher above the
    location (estimated value), since below the location, we have a floor
    on the value, since we know that losses due to PV efficiency cannot be
    lower than 1 - (pv efficiency + some uncertainty).

    The efficiency and degradation methodology when running on simulated
    POA data entails calculating the starting efficiency of the prediction
    time period based on the number of operational years. Since we run
    trials to predict future outcomes, we then sample from a normal
    distribution using the calculated starting efficiency as the mean. For
    the standard deviation we currently assume a normal distribution and
    assume that the location value +/- the uncertainty covers x% of the
    data, where x is the confidence level specified by the user. We
    calculate the value of 1 standard deviation under these assumptions.
    We then also sample for the degradation and its associated uncertainty
    in the same way.

    Returns
    -------
    pd dataframe
        For each timestamp, contains the location, lower and upper
        uncertainty bounds, and probabilities at lower and upper bounds.

    Notes
    ------
    "In PVsyst, the evaluation of the "Losses" of a PV array (as for the
    definition of the normalized performance ratio), takes as starting
    point the energy which would be produced if the system worked always
    at STC conditions (1000 W/m², 25°C, AM1.5)."
    Source : https://www.pvsyst.com/help/irradiance_loss.html

    Loss is 1 - sum(efficiency factors)
    Efficiency factors:
        pv_eff - base efficiency
        pv_lid_coef - light induced degredation
        pv_degrad_coef - degredation coefficient
        pv_mql - module quality loss

    Future
    ------
    Create IV curve to compare V-dc mes to V_dc theoretical (10hrs)
    """
    name = 'lf_pv_eff'
    logging.info('Calculating %s', name)
    pv_eff = sui_configs.module_info.pv_eff
    pv_eff_unc = sui_configs.module_info.pv_eff_unc
    pv_lid_coef = sui_configs.module_info.pv_lid_coef
    pv_lid_coef_unc = sui_configs.module_info.pv_lid_coef_unc
    pv_degrad_coef = sui_configs.module_info.pv_degrad_coef
    pv_degrad_coef_unc = sui_configs.module_info.pv_degrad_coef_unc
    pv_mql = sui_configs.module_info.pv_mql
    pv_mql_unc = sui_configs.module_info.pv_mql_unc
    conf_level = user_settings.variable_uncertainty.general.conf_level
    #### -------------------------------------------------------------------
    if self.calc_type == 'measured':
        day_diff = self.index - first_op_day
    else:  # simulated option
        day_diff = self.index - first_sim_day
        # note first_sim and first_opday are global variables created in
        # config
        n_years = (first_sim_day - first_op_day) / np.timedelta64(
            1, 'D') / 365
        pv_eff_mu = pv_eff - (pv_degrad_coef * n_years)
        # Number of standard deviations
        # from mean to cover conf_level% of data assuming normal distribution
        zscore = stats.norm.ppf(1 - (1 - conf_level) / 2)
        # Divide the uncertainty by # of std
        # to get the value of 1 standard deviation
        # uncertainty is two way and zscore is one way divide by 2
        sigma_eff = pv_eff_unc / zscore / 2
        pv_eff = stats.norm.rvs(pv_eff_mu, sigma_eff)
        # eff at beginning of simulation period
        sigma_degrad = pv_degrad_coef_unc / zscore
        pv_degrad_coef = stats.norm.rvs(pv_degrad_coef, sigma_degrad)
    time_degrad = day_diff / np.timedelta64(1, 'D') * pv_degrad_coef / 365
    pv_eff = pv_eff - pv_lid_coef - pv_mql - time_degrad
    lf = pd.Series(1 - pv_eff, index=self.index, name=name)
    # multiply number of days by the daily increase in uncertainty
    # based on temp value
    unc_var = (self.index - first_op_day) / np.timedelta64(
        1, 'D') * pv_degrad_coef_unc / 365
    unc_fixed = pv_eff_unc + pv_lid_coef_unc + pv_mql_unc
    lf_unc = pd.Series(unc_var + unc_fixed, index=self.index, name='lf_unc')
    self.validate_assert_write_helper(name, lf, lf_unc)
```

Function 2 - Get distance
```python
def get_distance(x, site_lat, site_long):
    """Get distance between two geographical coordinates.

    Parameters
    ----------
    x : pd Series
        Pandas Series containing the indices lat_rad and long_rad with the
        latitude and longitude, respectively, in radians of a comparison
        site.
    site_lat : float
        Site latitude, in decimal degrees.
    site_long : float
        Site longitude, in decimal degrees.

    Returns
    -------
    float
        Distance between site and neighboring site.

    See Also
    --------
    data_import.legacy_get_nearest_site_nrel_info
    """
    d_lat = math.radians(site_lat) - x['lat_rad']
    d_lng = math.radians(site_long) - x['long_rad']
    temp = (math.sin(d_lat / 2)**2 +
            math.cos(x['lat_rad']) * math.cos(32.03914409) *
            math.sin(d_lng / 2)**2)
    return 6373.0 * (2 * math.atan2(math.sqrt(temp), math.sqrt(1 - temp)))
```

Codegen and Decicoder make a valiant effort at generating the correct code. All the models get tripped up on Function 1. For Function 2, Codegen and Decicoder approach the correct methodology, but they confuse the trigonometric functions and mix up how to manipulate the input variables.
Metrics
LM metrics are an active research field. You will see terms such as 'state-of-the-art' being thrown around in reference to the latest model. However, the metrics aren't yet standardized enough to allow model performance comparisons without delving into the details of how the metric was set up.
Metrics can be divided into 2 categories: human and automated. Human evaluation is reliable, but difficult and expensive to scale. The popular automated metrics include BLEU, ChrF, and RUBY, among others. These are all variants of computing statistics on what share of predicted characters or n-grams match the ground truth. Currently, a popular benchmark you'll see is HumanEval - misnamed, since it's actually an automated procedure. Its approach differs from the metrics referenced above: it consists of function prompts, each with an associated unit test that a successful output would pass. So we can feed these prompts to the model we're testing and see how many of the generated solutions pass the unit tests.
Researchers have noted limitations - HumanEval functions are mostly short, self-contained computer-science tasks, so it is unclear how the scores generalize to other domains. Additionally, the evaluation is binary, so it cannot gauge the quality of a result that fails the unit test.
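For the string-similarity metrics, the computation itself is straightforward. Here's a minimal sketch using the sacrebleu library - one possible implementation, not necessarily the exact harness used for the table below, and note that different libraries scale scores differently (0-1 vs 0-100):

```python
import sacrebleu

# `prediction` is the generated function, `reference` the ground-truth
# function from our codebase; both are plain strings.
bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
chrf = sacrebleu.corpus_chrf([prediction], [[reference]])
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```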
Note that the metrics below were computed with a minimum generation length of 200 tokens and a maximum of 1000.
| Model | HumanEval@1 (reference) | BLEU - Baseline | BLEU - Finetuned | ChrF - Baseline | ChrF - Finetuned |
|---|---|---|---|---|---|
| Codegen | 12.76 | 0 | 0.08 | 8.67 | 20.98 |
| Decicoder | 19.1 | 0 | 0.10 | 6.11 | 30.48 |
| CodeParrot | 3.99 | 0 | 0.006 | 19.1 | 18.77 |
The low BLEU scores are because BLEU is actually a pretty strict metric - at least one 4-gram (a sequence of 4 consecutive tokens) needs to match to score above 0. The ChrF scores do suggest that Decicoder performs best out of the three models. However, we only have 3 samples, and it's dubious to base conclusions on such a small sample size. Researchers have shown that even with a large number of samples, a difference in metrics between models of under 2% isn't meaningful (i.e. statistically significant). Refer to page 11 of the Evtikhiev paper linked at the end of the article.
Conclusion
The main takeaway is that even if we tweak the generation parameters, the results are not practical. I included the published HumanEval scores for reference, though the benchmark isn't actually useful in our scenario: we care whether the model can generate the functions we already have, with all of the idiosyncrasies of our project, not whether it can solve generic computer science problems.
Some ideas for improvement / thoughts:
- Experiment with basic prompts instead of the raw docstring (see the sketch after this list).
- I treated all functions as separate inputs, so the model has no concept of how functions relate to each other. One way to address this is to use agents that mimic the software development cycle when automating the creation of a new function or set of functions. Other approaches rely on recreating the Python Abstract Syntax Tree. I include some relevant papers below.
- Train on larger models (CodeLlama, CodeWizard).
- I noticed the docstrings themselves can be improved to leave no room for ambiguity about the inputs. This is tricky, because a developer may not need every last detail spelled out, and so tweaking these docstrings comes with a cost.
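On the first bullet, the idea would be to wrap the docstring in a short instruction-style prompt rather than feeding it in raw. A hypothetical template (none of the runs reported above used this; the `docstring` variable is assumed from the earlier sketches):

```python
# Hypothetical prompt template; the reported runs used the raw docstring only.
PROMPT_TEMPLATE = (
    "# Complete the following function from a PV energy-prediction codebase.\n"
    "# Write the full Python implementation for this docstring.\n"
    '"""{docstring}"""\n'
    "def "
)

prompt = PROMPT_TEMPLATE.format(docstring=docstring)
```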
Returning to the questions posed at the outset: finetuning definitely improves the predictions, with Decicoder performing the best out of the three. But the functions do not run, and they are pretty far from being correct. I would really like to see how CodeLlama performs on this. Stay tuned!
References
A Syntactic Neural Model for General-Purpose Code Generation. Yin, et al. 2017.04.06
Can AI Code metrics on HuggingFace
Evaluating Large Language Models Trained on Code. Chen, et al. 2021.07.14
Out of the BLEU: How Should We Assess Quality of the Code Generation Models. Evtikhiev, et al. 2023.05.10
SkCoder: A Sketch-based Approach for Automatic Code Generation. Li, et al. 2023.07.09
WizardCoder: Empowering Code Large Language Models with Evol-Instruct. Luo, et al. 2023.06.14