o RThуЕу@s8ddlZddlZddlmZmZmZmZddlmZddlm Z GddДdeГZddДZd d ДZ ddgfd dДZdщN)┌Any┌Callable┌ NamedTuple┌Optional)┌assert_has_pandas)┌pandasc@sКeZdZUeed<dZeeed<dZeeed<dZee e ge fed<dZeeed<dZee e ge fed<dZ eeed<dS) ┌Remediation┌nameN┌ immediate_msg┌ necessary_msg┌necessary_fn┌optional_msg┌optional_fn┌ error_msg)┌__name__┌ __module__┌__qualname__┌str┌__annotations__r rrrrrr rrйrr·T/home/air/segue/gemini/backup/venv/lib/python3.10/site-packages/openai/validators.pyr s rcCs8d}t|Г|kr dnd}dt|ГЫd|ЫЭ}td|dНS)zТ This validator will only print out the number of examples and recommend to the user to increase the number of examples if less than 100. щd┌zз. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examplesz - Your file contains z prompt-completion pairs┌num_examplesйr r )┌lenr)┌df┌MIN_EXAMPLES┌optional_suggestionr rrr┌num_examples_validators ¤ rcsАddДЙd}d}d}d}И|jvr7ИddД|jDГvr1ЗЗfddД}|}dИЫd Э}d ИЫdЭ}ndИЫdЭ}td ||||dНS)z[ This validator will ensure that the necessary column is present in the dataframe. cs2ЗfddД|jDГ}|j|dИабiddН|S)Ncs g|]}t|ГабИkr|СqSrйr┌lowerй┌.0┌cй┌columnrr┌ )s zInecessary_column_validator..lower_case_column..rT)┌columns┌inplace)r(┌renamer!)rr&┌colsrr%r┌lower_case_column(sz5necessary_column_validator..lower_case_columnNcSsg|]}t|ГабСqSrr r"rrrr'3єz.necessary_column_validator..cє И|ИГSйNr)rйr,┌necessary_columnrr┌lower_case_column_creator5є z=necessary_column_validator..lower_case_column_creatorz - The `z ` column/key should be lowercasezLower case column name to `·`z^` column/key is missing. Please make sure you name your columns/keys appropriately, then retryr1)r r rrr)r(r)rr1r rrrr2rr0r┌necessary_column_validator#s( √r5┌prompt┌ completioncsиg}d}d}d}t|jГdkrLЗfddД|jDГ}d}|D]ЙЗfddД|DГ}t|Гdkr9|dИЫd ИЫd Э7}qd|Ы|ЫЭ}d|ЫЭ}Зfd dД}td|||dНS)zK This validator will remove additional columns from the dataframe. Nщcsg|]}|Иvr|СqSrrr"й┌fieldsrrr'Rr-z/additional_column_validator..rcsg|]}И|vr|СqSrrr")┌acrrr'Ur-rz9 WARNING: Some of the additional columns/keys contain `z<` in their name. These will be ignored, and the column/key `z`` will be used instead. This could also result from a duplicate column/key in the provided file.zh - The input file should contain exactly two columns/keys per row. Additional columns/keys present are: z Remove additional columns/keys: cs|ИSr/rй┌xr9rrr[sz1additional_column_validator..necessary_fn┌additional_columnйr r rr)rr(r)rr:┌additional_columnsrr r┌warn_message┌dupsr)r;r:r┌additional_column_validatorIs*А №rCcsдd}d}d}|ИаddДбабs|ИабабrG|Иdk|ИабB}|абj|аб}dИЫd|ЫЭ}ЗfddД}d t|ГЫd ИЫdЭ}tdИЫЭ|||d НS)zA This validator will ensure that no completion is empty. NcSs|dkS)Nrrr<rrr┌nsz+non_empty_field_validator..rz - `z?` column/key should not contain empty strings. These are rows: cs||ИdkjИgdНS)Nrй┌subset)┌dropnar<й┌fieldrrrssz/non_empty_field_validator..necessary_fn·Remove z rows with empty ┌s┌empty_r?)┌apply┌any┌isnull┌reset_index┌index┌tolistrr)rrIrrr ┌ empty_rows┌ empty_indexesrrHr┌non_empty_field_validatorfs&№rUcsВ|jИdН}|абj|аб}d}d}d}t|Гdkr9dt|ГЫddаИбЫd|ЫЭ}dt|ГЫd Э}Зfd dД}td|||d НS)zY This validator will suggest to the user to remove duplicate rows if they exist. rENr· - There are z duplicated ·-z sets. These are rows: rJz duplicate rowscs|jИdНS)NrE)┌drop_duplicatesr<r9rrrНєz.duplicated_rows_validator..optional_fn┌duplicated_rowsйr r r r)┌ duplicatedrPrQrRr┌joinr)rr:rZ┌duplicated_indexesr r rrr9r┌duplicated_rows_validators №r_cs|d}d}d}t|Г}|dkr6ddДЙИ|ГЙtИГdkr6dtИГЫdИЫdЭ}d tИГЫd Э}ЗЗfddД}td |||dНS)zW This validator will suggest to the user to remove examples that are too long. N·open-ended generationcSs$|jddДddН}|абj|абS)NcSst|jГt|jГdkS)Ni')rr6r7r<rrrrDеr-zClong_examples_validator..get_long_indexes..щ)┌axis)rMrPrQrR)┌d┌ long_examplesrrr┌get_long_indexesгs z1long_examples_validator..get_long_indexesrrVz. examples that are very long. These are rows: zf For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.rJz long examplescs8И|Г}И|krtjаdt|ГЫd|ЫdЭб|а|бS)NzeThe indices of the long examples has changed as a result of a previously applied recommendation. The z? long examples to be dropped are now at the following indices: ┌ )┌sys┌stdout┌writer┌drop)r=┌long_indexes_to_dropйre┌long_indexesrrrпs z,long_examples_validator..optional_fnrdr[)┌infer_task_typerr)rr r r┌ft_typerrlr┌long_examples_validatorШs" №rpcsld}d}d}d}dЙgdв}|D]}|dkr |jjаdбабr q|jjj|ddНабr,q|ЙИаddб}t|Г}|d krBtd dНSdd ДЙt|jddН} |j| kабr`d| ЫdЭ}td |dНS| dkrЪ| аddб} d| ЫdЭ}t | Гdkr~|d|ЫdЭ7}|jjdt | ГЕjj| ddНабrЩ|d| ЫdЭ7}nd}| dkrнd|ЫdЭ}ЗЗfddД}td||||d НS)!zЬ This validator will suggest to add a common suffix to the prompt if one doesn't already exist in case of classification or conditional generation. Nz ### => )· ->z ### z === z --- z ===> z ---> rqrfFй┌regex·\nr`┌ common_suffixйr cSє|d|7<|SйNr6rйr=┌suffixrrr┌ add_suffixсєz2common_prompt_suffix_validator..add_suffixrzй┌xfixzAll prompts are identical: `zt` Consider leaving the prompts blank if you want to do open-ended generation, otherwise ensure prompts are differentйr rrz - All prompts end with suffix `r4щ ·R. This suffix seems very long. Consider replacing with a shorter suffix, such as `z5 WARNING: Some of your prompts contain the suffix `zZ` more than once. We strongly suggest that you review your prompts and add a unique suffixaФ - Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts emptyzAdd a suffix separator `z` to all promptscr.r/rr<йr{┌suggested_suffixrrrr3z3common_prompt_suffix_validator..optional_fn┌common_completion_suffixйr r r rr) r6r┌containsrN┌replacernr┌get_common_xfix┌allr)rrr r r┌suffix_options┌ suffix_option┌display_suggested_suffixroru┌common_suffix_new_line_handledrrВr┌common_prompt_suffix_validator└s` ¤А √rОcsвd}d}d}d}t|jddНЙИdkrtddНSdd ДЙ|jИkабr(tddНSИdkrId ИЫdЭ}|tИГkrI|d7}d ИЫdЭ}ЗЗfddД}td|||dНS)zd This validator will suggest to remove a common prefix from the prompt if a long one exist. щN┌prefixr}r┌ common_prefixrvcSs|djt|ГdЕ|d<|Srxйrr)r=rРrrr┌remove_common_prefixsz.remove_common_prefixz" - All prompts start with prefix `r4z╥. Fine-tuning doesn't require the instruction specifying the task, or a few-shot example scenario. Most of the time you should only add the input data into the prompt, and the desired output into the completion·Remove prefix `z` from all promptscs И|ИГSr/rr<йrСrУrrr(r3z3common_prompt_prefix_validator..optional_fn┌common_prompt_prefixr[)rИr6rrЙrйr┌MAX_PREFIX_LENr r rrrХr┌common_prompt_prefix_validators, №rЩcsШd}t|jddНЙtИГdkoИddkЙtИГ|kr tddНSdd ДЙ|jИkабr0tddНSd ИЫdЭ}dИЫd Э}ЗЗЗfddД}td|||dНS)zh This validator will suggest to remove a common prefix from the completion if a long one exist. щrРr}r· rСrvcSs2|djt|ГdЕ|d<|rd|d|d<|S)Nr7rЫrТ)r=rР┌ ws_prefixrrrrУ>sz@common_completion_prefix_validator..remove_common_prefixz& - All completions start with prefix `z_`. Most of the time you should only add the output data into the completion, without any prefixrФz` from all completionscsИ|ИИГSr/rr<йrСrУrЬrrrLrYz7common_completion_prefix_validator..optional_fn┌common_completion_prefixr[)rИr7rrrЙrЧrrЭr┌"common_completion_prefix_validator3s" №rЯcs^d}d}d}d}t|Г}|dks|dkrtddНSt|jddН}|j|kабr6d|Ыd |Ыd Э}td|dНSdЙgd в}|D]}|jjj|ddНабrLq>|ЙИаddб} ddДЙ|dkrУ|аddб} d| Ыd Э}t |Гdkrw|d| Ыd Э7}|jjdt |ГЕjj|ddНабrТ|d|ЫdЭ7}nd}|dkrжd| ЫdЭ}ЗЗfddД}td||||d НS)!zа This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation. Nr`┌classificationrurvrzr}z All completions are identical: `zJ` Ensure completions are different, otherwise the model will just repeat `r4rz [END]) rf┌.z ENDz***z+++z&&&z$$$z@@@z%%%FrrrfrtcSrwйNr7rryrrrr{}r|z6common_completion_suffix_validator..add_suffixrz$ - All completions end with suffix `rАrБz9 WARNING: Some of your completions contain the suffix `zU` more than once. We suggest that you review your completions and add a unique endingaH - Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.zAdd a suffix ending `z` to all completionscr.r/rr<rВrrrЧr3z7common_completion_suffix_validator..optional_fnrДrЕ) rnrrИr7rЙrrЖrNrЗr)rrr r rrorurКrЛrМrНrrВr┌"common_completion_suffix_validatorWsZ ¤А √rгcCs\ddД}d}d}d}|jjddЕабdks |jjdddkr&d}d}|}td |||d НS)zО This validator will suggest to add a space at the start of the completion if it doesn't already exist. This helps with tokenization. cSs|dаddДб|d<|S)Nr7cSs|ddkr d|Sd|S)NrrЫrrr<rrrrDкszLcompletions_space_start_validator..add_space_start..)rMr<rrr┌add_space_startиs z:completions_space_start_validator..add_space_startNrarrЫzц - The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detailsz=Add a whitespace character to the beginning of the completion┌completion_space_startr[)r7r┌nunique┌valuesr)rrдr rr rrr┌!completions_space_start_validatorгs,№rиcsnЗfddД}|ИаddДбаб}|ИаddДбаб}|d|kr5tddИЫd ИЫd ЭdИЫdЭ|d НSdS)zt This validator will suggest to lowercase the column values, if more than a third of letters are uppercase. cs|Иjаб|И<|Sr/r r<r%rr┌ lower_case├sz(lower_case_validator..lower_casecSєtddД|DГГS)Ncsє$Б|] }|абr|абrdVqdSйraN)┌isalpha┌isupperr"rrr┌ ╔єА"·9lower_case_validator....й┌sumr<rrrrD╔єz&lower_case_validator..cSrк)Ncsrлrм)rн┌islowerr"rrrrп╬r░r▒r▓r<rrrrD╬r┤r8rйz - More than a third of your `z%` column/key is uppercase. Uppercase zўs tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detailsz'Lowercase all your data in column/key `r4r[N)rMr│r)rr&rй┌count_upper┌count_lowerrr%r┌lower_case_validator╛s" ¤ ¤ № r╕c Cs╠tГd}d}d}d}d}tjа|бРrTРz|абаdбs$|абаdбrI|абаdбr-dnd\}}d|ЫdЭ}d|Ыd Э}tj||td На dб}nш|абаdбrqd }d}tа |б} | j} t| Гdkrf|d7}tj |tdНа dб}n└|абаdбrйd}d}t|dГП}|аб}tjddД|аdбDГ|tdНа dб}WdГn1sгwYnИ|абаdбr╥tj|dtdНа dб}t|Гdkr╨d}d}tj|tdНа dб}na n_|абаdбРrz"tj|dtdНа dб}t|Гdkrўtj|tdНа dб}nd }d}Wn4tРytj|tdНа dб}Yn!wd!}d"|vРr)|d#|Ыd$|аd"бd%Ыd&Э7}n|d#|Ыd'Э7}Wn'ttfРyS|аd"бd%аб}d(|Ыd)|Ыd*|Ыd+Э}Ynwd,|Ыd-Э}td.|||d/Н}||fS)0z╒ This function will read a file saved in .csv, .json, .txt, .xlsx or .tsv format using pandas. - for .xlsx it will read the first sheet - for .txt it will assume completions and split on newline Nz.csvz.tsv)┌CSV·,)┌TSV· z= - Based on your file extension, your file is formatted as a z filez Your format `z` will be converted to `JSONL`)┌sep┌dtyperz.xlsxzH - Based on your file extension, your file is formatted as an Excel filez/Your format `XLSX` will be converted to `JSONL`razе - Your Excel file contains more than one sheet. Please either save as csv or ensure all data is present in the first sheet. WARNING: Reading only the first sheet...)r╛z.txtz9 - Based on your file extension, you provided a text filez.Your format `TXT` will be converted to `JSONL`┌rcSsg|]}d|gСqS)rr)r#┌linerrrr'sz#read_any_format..rf)r(r╛·.jsonlT)┌linesr╛z^ - Your JSONL file appears to be in a JSON format. Your file will be converted to JSONL formatz/Your format `JSON` will be converted to `JSONL`z.jsonz^ - Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL formatz]Your file must have one of the following extensions: .CSV, .TSV, .XLSX, .TXT, .JSON or .JSONLrбz Your file `z` ends with the extension `.щ z` which is not supported.z` is missing a file extension.zYour file `z!` does not appear to be in valid z9 format. Please ensure your file is formatted as a valid z file.zFile z does not exist.┌read_any_format)r rr r)r┌os┌path┌isfiler!┌endswith┌pd┌read_csvr┌fillna┌ ExcelFile┌sheet_namesr┌ read_excel┌open┌read┌ DataFrame┌split┌ read_json┌ ValueError┌ TypeError┌upperr) ┌fnamer:┌remediationrr rr┌file_extension_str┌ separator┌xls┌sheets┌f┌contentrrrr─█sФ ¤№■А А■ "А■№r─cCs,t|Г}d}|dkrd|ЫdЭ}td|dНS)z╙ This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification. It will also suggest to use ada and explain train/validation split benefits. NrаzK - Based on your data it seems like you're trying to fine-tune a model for zу - For classification, we recommend you try one of the faster and cheaper models, such as `ada` - For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for trainingrr)rnr)rror rrr┌format_inferrer_validator7s r▀cCsb|jdurtjаd|jЫd|jЫdЭбtаdб|jdur%tjа|jб|jdur/|а|б}|S)zs This function will apply a necessary remediation to a dataframe, or print an error message if one exists. Nz ERROR in z validator: z Aborting...ra) rrg┌stderrrir ┌exitr rhr)rr╪rrr┌apply_necessary_remediationCs rтcCs.tjа|б|rtjаdбdStГабdkS)NzY T┌n)rgrhri┌inputr!)┌ input_text┌auto_acceptrrr┌accept_suggestionSs rчcCs\d}d|jЫdЭ}|jdurt||Гr|а|б}d}|jdur*tjаd|jЫdЭб||fS)zc This function will apply an optional remediation to a dataframe, based on the user input. Fz- [Recommended] z [Y/n]: NTz- [Necessary] rf)r rчrrrgrhri)rr╪rц┌optional_appliedrхrrr┌apply_optional_remediation[s rщcCsjt|Г}d}|dkrt|Г}|d}n|jddНаб}|d}ddД}||d Г}tjаd |ЫdЭбdS) z? Estimate the time it'll take to fine-tune the dataset gЁ?rаg ╫гp= ў?T)rQgСэ|?5^к?cSsd|dkrt|dГЫdЭS|dkrt|ddГЫdЭS|dkr(t|ddГЫdЭSt|ddГЫdЭS) Nщ<r8z secondsiz minutesiАQz hoursz days)┌round)┌timerrr┌format_timewsz.estimate_fine_tuning_time..format_timeщМz:Once your model starts training, it'll approximately take z~ to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you. N)rnr┌memory_usager│rgrhri)r┌ ft_format┌ expected_timer┌sizerэ┌time_stringrrr┌estimate_fine_tuning_timejs rЇcsd|rddgndg}d} |dkrd|ЫdЭndЙЗЗfdd Д|DГ}td dД|DГГs-|S|d7}q) N┌_train┌_validrrTz (·)cs,g|]}tjаИбdd|ИdСqS)r┌ _preparedr┴)r┼r╞┌splitext)r#rzйr╫┌index_suffixrrr'Мs z!get_outfnames..cssБ|] }tjа|бVqdSr/)r┼r╞r╟)r#r▌rrrrпРsАz get_outfnames..ra)rN)r╫r╥┌suffixes┌i┌candidate_fnamesrr·r┌ get_outfnamesЗs■°r cCs.|jаб}d}|dkr|jабjd}||fS)Nr8r)r7rж┌value_countsrQ)r┌ n_classes┌ pos_classrrr┌get_classification_hyperparamsХs rc Cs~t|Г}t|jddН}t|jddН}d}d}|dkr!t||Гr!d}d} |аdd б} |аdd б}t|Гd kr;d|ЫdЭnd}d }|s\|s\tjа d|Ыd| Ыd| Ыd|ЫdЭ бt |Гd*St||ГРr7t||Г} |r╪t| Гdkr{d| d vr{d| dvs}JВd}tt|Г|t t|ГdГГ}|j|ddН}|а|jб}|ddgj| d ddddН|ddgj| dddddНt|Г\}}| d7} |dkr╨| d|ЫdЭ7} n| d |ЫЭ7} nt| ГdksрJВ|ddgj| d ddddН|rєd!ndd"d#а| б}|Рrd$| dЫdЭnd}t| Гd kРrdnd%| ЫdЭ}tjа d&|Ыd'| d Ыd|Ы| Ыd(|Ы|ЫdЭбt |Гd*Stjа d)бd*S)+aQ This function will write out a dataframe to a file, if the user would like to proceed, and also offer a fine-tuning command with the newly created file. For classification it will optionally ask the user if they would like to split the data into train/valid files, and modify the suggested command to include the valid set. rzr}FzQ- [Recommended] Would you like to split into training and validation set? [Y/n]: rаTrrfrtrz Make sure to include `stop=["z;"]` so that the generated texts ends at the expected place.z@ Your data will be written to a new JSONL file. Proceed [Y/n]: zK You can use your file for fine-tuning: > openai api fine_tunes.create -t "·"ue After youтАЩve fine-tuned a model, remember that your prompt has to end with the indicator string `zX` for the model to start generating completions, rather than continuing with the prompt.r8┌train┌validraiшgЪЩЩЩЩЩщ?щ*)rу┌random_stater6r7┌records)r┬┌orient┌force_asciiz! --compute_classification_metricsz" --classification_positive_class "z --classification_n_classes rKz to `z` and `z -v "ucAfter youтАЩve fine-tuned a model, remember that your prompt has to end with the indicator string `z Wrote modified filezd` Feel free to take a look! Now use that file when fine-tuning: > openai api fine_tunes.create -t "z z#Aborting... did not write the file N)rnrИr6r7rчrЗrrgrhrirЇr ┌max┌int┌samplerjrQ┌to_jsonrr])rr╫┌any_remediationsrцrЁ┌common_prompt_suffixrДr╥rх┌additional_params┌%common_prompt_suffix_new_line_handled┌)common_completion_suffix_new_line_handled┌optional_ending_string┌fnames┌MAX_VALID_EXAMPLES┌n_train┌df_train┌df_validrr┌files_string┌valid_string┌separator_reminderrrr┌write_out_fileЭsr ¤ ( ¤( rcCs>d}t|jjабГdkrdSt|jабГt|Г|krdSdS)z> Infer the likely fine-tuning task type from the data щrr`rаzconditional generation)r│r6rrr7┌unique)r┌CLASSIFICATION_THRESHOLDrrrrnъsrnrzcCsnd} |dkr|jt|ГddЕn |jdt|ГdЕ}|абdkr' |S||jdkr1 |S|jd}q)zQ Finds the longest common suffix or prefix of all the values in a series rTrzraNr)rrrжrз)┌seriesr~┌common_xfix┌ common_xfixesrrrrИ°s ¤√ ёrИcCs2tddДddДtttttddДddДtttt t gS)NcSє t|dГSrxйr5r<rrrrDє z get_validators..cSr%rвr&r<rrrrDr'cSr%rxйr╕r<rrrrDr'cSr%rвr(r<rrrrDr')rrCrUr▀r_rprОrЩrЯrгrиrrrr┌get_validatorss ёr)c Cs╞g}|dur|а|б|D]}||Г}|dur!|а|бt||Г}q tddД|DГГ}tddД|DГГ} d} |rPtjаdб|D]}t|||Г\}}| pM|} q@ntjаdб| pY| }|||||ГdS)NcSs$g|]}|jdus|jdur|СqSr/)r rйr#r╪rrrr'6s ¤z$apply_validators..cSsg|] }|jdur|СqSr/)rr*rrrr'>s ■Fz? Based on the analysis we will perform the following actions: z No remediations found. )┌appendrтrNrgrhrirщ) rr╫r╪┌ validatorsrц┌write_out_file_func┌optional_remediations┌ validator┌&any_optional_or_necessary_remediations┌any_necessary_applied┌any_optional_appliedrш┌!any_optional_or_necessary_appliedrrr┌apply_validators$sB А■ ■ №r4)r7)rz)%r┼rg┌typingrrrr┌openai.datalib.pandas_helperrrr╔rrr5rCrUr_rprОrЩrЯrгrиr╕r─r▀rтrчrщrЇr rrrnrИr)r4rrrr┌s> & (L'$L\M