Session 35 – Coding with Artificial Intelligence

Artificial intelligence (AI) is a fast-developing area of computer science that focuses on building systems able to perform tasks usually associated with human intelligence. These tasks are diverse and include recognizing patterns, understanding written language, learning from data, making predictions, solving problems, and supporting decisions.

In coding, AI can help researchers move quickly from an idea written in plain language to working computer code, improve existing scripts, explain error messages, convert code between languages, document functions, and suggest better workflows. However, AI does not replace responsible implementation and an understanding of its output, including detecting errors, debugging, and interpreting the biological implications of the results. For now, AI can be considered an effective assistant, but it is not a replacement for your own understanding.

IMPORTANT: A common mistake is to trust AI output as free of errors, biases, and contradictions. In other words, copying and pasting AI output must never bypass evaluation by a human.

35.1 What is artificial intelligence?

Artificial intelligence (AI) is a broad area of computing designed to emulate and perform tasks that usually require human-like reasoning. Current AI systems can understand language, learn from data and examples, and adapt and help develop new tools rather than only follow fixed instructions.

As of now, AI is expected to keep improving and becoming more general, while its limits and potential dangers remain to be defined. Some practical examples of AI use in biology include: (1) classification of nucleotide and protein sequences or biological images; (2) pattern recognition, such as predicting structure or function from large datasets; (3) summarizing text, papers, or documentation; (4) helping with decisions and predictions, explaining statistical or biological concepts, and generating text or images; and (5) suggesting code from natural-language instructions.

For bioinformatics students, AI can help translate a biological question into computational steps that can be implemented in a runnable script. For example, you can ask AI to develop a script or a guided workflow that imports a tab-delimited file, removes or handles missing values, suggests a statistical test, interprets the results against known outcomes, and creates plots using ggplot2 in an R script that you can modify.

35.2 What are Large Language Models (LLMs)?

Large language models (LLMs) are artificial intelligence (AI) systems trained on very large amounts of text so that they can learn patterns in language. Because of this training, they can generate text, answer questions, summarize information, translate languages, help with coding, and explain ideas. LLMs can also be understood as advanced prediction systems for language, so an LLM reads the words that came before and predicts what is likely to come next.

Modern LLMs are typically based on the Transformer architecture, which uses attention mechanisms to learn statistical structure across syntax, semantics, discourse, and style. They are often further refined through instruction tuning and reinforcement learning from human/AI feedback. At very large scales, these models can support not only text generation, but also computer code writing, document/research article analysis, image understanding and generation, and automated task execution.

However, LLMs should not be treated as conscious agents. They can still make mistakes, invent facts, or sound confident when they are wrong. Important limitations include hallucination, sensitivity to prompting, bias inherited from training data, and occasional unreliability. Therefore, humans (users) must evaluate LLM outputs carefully whenever accuracy matters.

Here is a table with important LLM families and some examples.

| Model family | Brief explanation | Main use | Link |
| --- | --- | --- | --- |
| OpenAI GPT-5.4 / GPT-4.1 / GPT-4o | General-purpose frontier models. GPT-5.4 is positioned for complex reasoning and professional workflows; GPT-4.1 is strong for instruction following and coding; GPT-4o is a versatile multimodal model. | Research assistance, writing, coding, agents, multimodal applications | OpenAI Models |
| Google Gemini 2.5 Pro / Flash | Gemini 2.5 Pro is Google’s advanced reasoning model, while Flash variants emphasize speed and efficiency. Gemini is also strong for long-context and document-centered workflows. | Long-context analysis, STEM reasoning, document understanding, multimodal applications | Gemini Models |
| Anthropic Claude Sonnet 4.6 / Opus 4.6 | Claude models are known for strong writing quality, long-context work, coding, and agent-style task execution. Sonnet is the practical workhorse; Opus is the higher-end line. | Knowledge work, coding, long documents, enterprise assistants | Claude Models |
| Meta Llama 4 | A major open-weight family with strong ecosystem adoption. Llama 4 includes models such as Scout and Maverick and is important for research, customization, and self-hosted deployment. | Research, fine-tuning, private deployment, open-model experimentation | Llama 4 |
| Mistral Large 3 | A leading open-weight multilingual and multimodal model from Mistral AI. It is important because it offers strong performance with more flexible deployment than fully closed models. | Enterprise deployment, multilingual tasks, open-weight production use | Mistral Models |
| Qwen3 family | Qwen has become one of the most important model families for open and commercial use, with strong coding, reasoning, multilingual, and multimodal variants. | Research, multilingual NLP, coding, open-model development | Qwen |
| DeepSeek V3.1 / R1 | DeepSeek is especially notable for efficient high-performance models and strong reasoning-focused releases such as R1, which became influential in discussions of open models and cost efficiency. | Reasoning tasks, coding, lower-cost deployment, open-model benchmarking | DeepSeek |

During code development, you can ask the LLM to annotate the script by adding comments, more arguments, or command-line options. LLMs can also suggest test cases, example inputs, and debugging strategies. For example, instead of searching many webpages for separate answers, you can ask a tool such as ChatGPT, when web search is enabled, to search the web for you. You can also ask it to produce code in R that you can copy and paste, yet detailed prompting is key to achieving the desired outcome. In other words, you must know what the output should look like so that you can describe it to the LLM.

# Example of a natural-language coding request to search the web
"Find papers and their citations relevant to pigment production in birds, compare the results with the human literature, and provide links and citations of relevant papers."

# Example of a natural-language coding request for R code
"Write an R script that imports a tab-delimited file, removes rows with missing values in a column called expression, and makes a boxplot of expression by tissue. Explain each step for a beginner."
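A plausible first draft of the script such a prompt might produce is sketched below in base R. The file name and the tiny example table are invented here so the sketch runs on its own; an actual LLM answer may differ and should still be checked line by line.

```r
# Hypothetical answer sketch for the prompt above (base R).
# The column names 'expression' and 'tissue' come from the prompt;
# the example input file is created here only so the script is runnable.
example_file <- tempfile(fileext = ".txt")
writeLines(c("gene\ttissue\texpression",
             "g1\tliver\t5.2",
             "g2\tliver\tNA",
             "g3\tbrain\t7.9",
             "g4\tbrain\t6.4"),
           example_file)

# 1) import the tab-delimited file (first row is the header)
expr <- read.table(example_file, header = TRUE, sep = "\t")

# 2) remove rows with missing values in the 'expression' column
expr_clean <- expr[!is.na(expr$expression), ]

# 3) draw a boxplot of expression grouped by tissue
boxplot(expression ~ tissue, data = expr_clean,
        main = "Expression by tissue", ylab = "Expression")
```

A beginner-oriented answer would also explain each step in prose, as the prompt requests.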

35.3 Prompting as the basis of interaction with LLMs

A prompt is a set of instructions, usually text, that you give to an LLM to perform a useful task or produce a response. A prompt can be a question, a command, a request for explanation, a task description, or a combination of these.

For example, if you write:

Explain natural selection for both first-year biology and master-level students.

This sentence is the prompt, and the LLM uses the information and instructions it contains to generate an answer. Current LLMs are flexible, and you can specify a task so that the model can better align its response with your needs. In coding, the quality of the prompt strongly affects the quality of the answer. A vague prompt usually gives a vague answer.

35.4 Anatomy of a prompt

A prompt, as defined above, is a structured request in text form that frames a task, provides the necessary context, and specifies the desired form of the expected answer. Effective prompt engineering will be central to the future of LLM interaction. In general, a good prompt contains a goal; enough context to achieve it; constraints or limits on its generality; and a desired output format such as text, code, or an image. As a researcher, you should aim for prompts that reduce ambiguity, increase reproducibility, and improve the relevance, accuracy, and usability of the generated output.

More generally, a prompt may contain one or more of the following components that help define a clear and effective prompt for an LLM:

| Element | Description | Example |
| --- | --- | --- |
| Task | Defines the main action the model should perform. A strong task statement makes the goal explicit and avoids ambiguity. | Summarize the results section of this paper. |
| Context | Provides the background, purpose, or situation needed so the model can produce a more relevant and accurate response. | This summary will be used in an undergraduate genetics lecture. |
| Input data | Specifies the material the model should analyze, transform, compare, rewrite, or emulate. This may include raw text, code, tables, or an example to follow. | Use the following abstract and discussion section as input. |
| Constraints | Establishes the rules or limits the response must follow, such as length, scope, allowed sources, or things to avoid. | Do not use bullet points; keep the answer under 200 words. |
| Output format | States how the response should be structured or presented so it is easy to use directly. | Return the answer as a three-column table. |
| Audience | Identifies who the response is intended for, which helps determine the level of detail, terminology, and explanation. | Write for graduate students in evolutionary biology. |
| Tone or style | Describes the voice, level, and presentation style of the response. This helps shape whether the output is formal, technical, concise, accessible, or persuasive. | Use a formal, clear, and beginner-friendly style. |

35.5 Why does prompting matter?

A model can only respond based on the instructions it receives. If the prompt is vague, the answer will often be vague. If the prompt is precise, the answer is more likely to be accurate, relevant, and easy to use.

Some of the characteristics of good prompts include:

| Characteristic | Description | Bad example | Good example |
| --- | --- | --- | --- |
| Clear | The request is easy to understand and does not force the model to guess what is being asked. | Tell me something about evolution. | Explain natural selection in simple terms. |
| Specific | The task is defined in enough detail to reduce ambiguity and improve relevance. | Write about amphibians. | Write a 200-word summary of chemical defenses in amphibians. |
| Context-rich | The model is given relevant background information so the response matches the user’s purpose. | Summarize this paper. | Summarize this paper for an undergraduate lecture in genomics. |
| Goal-oriented | The prompt explains the intended outcome or purpose of the response. | Describe this dataset. | Describe this dataset so I can decide whether it is suitable for differential gene expression analysis. |
| Constrained when needed | The prompt includes rules such as word limits, file formats, assumptions, or methods to avoid. | Write a report on AI. | Write a 300-word report on AI in biology without using jargon or bullet points. |
| Explicit about output format | The prompt states how the answer should be organized, such as a table, code, bullet points, summary, or full report. | Compare these methods. | Compare these methods in a four-column table with strengths, weaknesses, assumptions, and applications. |
| Audience-aware | The prompt identifies the intended audience so the explanation level and terminology are appropriate. | Explain CRISPR. | Explain CRISPR for first-year undergraduate biology students. |
| Testable or evaluable | A good prompt makes it easier to judge whether the response satisfies the request because the expected result is concrete. | Help me understand this better. | List three reasons why this analysis may be biased and give one way to test each reason. |

35.6 Comparing strong and weak prompts

A weak prompt often has one or more of the following problems:

  • It is too brief to define the task clearly.
  • It does not specify the kind of output that is needed, such as a summary, table, code, report, or presentation.
  • It leaves out important context that would help the LLM understand the purpose, background, or scope of the request.
  • It does not identify the intended audience, such as undergraduates, graduate students, researchers, or the general public.
  • It uses vague instructions such as “analyze this,” “explain it,” or “make it better” without stating what kind of analysis, explanation, or improvement is expected.
  • It may ignore useful constraints, such as length, format, tone, level of detail, citation style, file type, or assumptions to follow.

Because of all these weaknesses, any vague prompt often leads to responses that are too general, incomplete, incorrect, or difficult to evaluate.

A strong prompt, in contrast, gives enough information for the task to be understood and completed effectively. It usually:

  • Is long and detailed enough to state the intended task and its goals clearly
  • Describes the relevant context or background
  • Describes the expected output format; if you can provide an example of the desired output, the model will know what type of result you expect
  • States any important constraints or requirements as guidelines on generality or specificity
  • Identifies the intended audience if the output is a presentation, a summary, an image, a figure, a table, etc.

35.7 Prompting for research

You should consider LLMs an extension of your capability to do targeted research. In other words, you need to know your topic in detail before you have an AI perform tasks for you, such as improving your data gathering, analysis, reporting, and summarization.

Some examples of prompts for biology research are shown below.

| Prompt quality | Good example | Bad example (vague) | Expected results |
| --- | --- | --- | --- |
| Biological system or organism | Compare potential immunity gene candidates across Platyrrhini, using amino acid positions homologous to human IL-7. | Study IL-7 in monkeys. | The response focuses on the correct organism group and biological problem. |
| Research task | Summarize which residues differ at candidate functional positions and identify patterns by Platyrrhini clades. | Analyze these sequences. | The output addresses a specific analytical goal rather than giving a generic summary. |
| Input format | Use the attached file of amino acid sequences in FASTA format. | Use my data. | The model knows what data structure to expect and can organize the analysis appropriately. |
| Desired output format | Return a tab-delimited summary table with the most common polarity estimate per residue, and give a numeric value such as the Grantham polarity metric. | Summarize amino acid properties. | The output is directly reusable in a manuscript or downstream analysis. |
| Important constraints | Do not overstate causation; distinguish clearly between observed patterns and functional hypotheses. Use terminology appropriate for comparative genomics. | Be scientific. | The answer becomes more rigorous, cautious, and suitable for research use. |
| Level of explanation you want | Explain first at an undergraduate level and then provide a more technical interpretation for graduate students. | Explain the results. | The answer becomes layered, accessible, and appropriately detailed for different audiences. |

35.8 Prompting for building a presentation

It has become very common for students to use AI output for presentations. Yet the quality of such presentations varies, and they should be considered first drafts rather than final products. Again, the quality of the prompts determines how good this first draft will be. Prompts should specify the audience, presentation length, learning goals, slide structure, and preferred visual style. This is especially important in research presentations where the audience is known (e.g., your fellow classmates or colleagues). In this case, it is desirable to aim the prompting at producing one slide or figure at a time, and then progressively improve them with your feedback. It is also very important to proofread the results and treat them as drafts. LLMs do produce factual errors and inventions (hallucinations), and they tend to be repetitive or shallow.

Here are some examples of good prompts for research presentation building.

| Prompt quality | Good example | Bad example (vague) | Expected results |
| --- | --- | --- | --- |
| Audience | Build a presentation for upper-level undergraduate biology majors on genome assembly using PacBio HiFi and Hi-C data. | Make a biology presentation about genomics. | The slides are pitched at the correct level of background knowledge. |
| Presentation task | Create a 12-slide lecture explaining genome assembly workflow, contamination screening, annotation, and quality statistics. | Make slides about genomic workflows. | The presentation has a defined scope and logical structure. |
| Input format or source material | Use the attached text, convert sections that have a title into 2-3 slides with one key message per slide, and also add speaker notes. | Use this text for a genomics presentation. | The model organizes the material rather than simply paraphrasing it. |
| Desired output format | Return the presentation as PDF slide outline text, with slide titles, bullet points, figure suggestions, and speaker notes. | Make this presentation nice for my Bioinformatics class. | The output is structured and ready to transfer into slides. |
| Important constraints | Keep terminology accurate, revise it before producing a final output, avoid crowded slides by limiting them to no more than 30 words per slide, define technical terms the first time they appear, and suggest only figures that would genuinely help explain the concept. If citations are pertinent, include them as notes in a separate text file. | Make my presentation professional. | The presentation is clearer, more teachable, and better designed. |
| Level of explanation you want | Provide content understandable to undergraduates, but add optional advanced notes for graduate students. | Explain it well for my class of bioinformatics. | The slides can serve mixed audiences more effectively. |

35.9 Prompting for data mining

You should think of LLMs as tools that can help you structure, refine, and accelerate data mining tasks. The task of searching for data requires defining the scope of the search such as the taxonomic/gene/function topic, selecting the type of data to search, deciding which databases to search, identifying filtering criteria, and specifying how results should be returned.

For that reason, good prompting for data mining is not simply asking an LLM to “find genes” or “find cases.” A good prompt tells the model:

  • What biological target you want
  • Where the information should come from
  • How the information should be filtered
  • What output structure you need
  • What assumptions or limitations must be respected

Likewise, the quality of the prompt strongly affects:

  • Precision: whether the model retrieves the correct biological target rather than a large amount of irrelevant material.
  • Reproducibility: whether another person could repeat the same search using the same criteria.
  • Interpretability: whether the output is organized in a way that can be checked, edited, and incorporated into a manuscript or report.
  • Scope: whether the model distinguishes between an annotation, a similarity-based inference, and a demonstrated biological function.
  • Downstream usability: whether the result can be reused in a spreadsheet, script, figure legend, methods section, or supplementary table.

This is especially important in bioinformatics and research literature review. Again, vague prompts can lead to incorrect answers or incorrectly retrieved data (e.g., the wrong gene search, overly broad literature suggestions, wrong or invented references, or outputs in an incorrect format).

Some examples of prompt characteristics for data mining are shown below.

| Prompt quality | Good example | Bad example (vague) | Expected results |
| --- | --- | --- | --- |
| Biological target | Search for candidate chitinase-like protein sequences in reptile proteomes, focusing on genes annotated as CHIT1, CHIA, CHID1, or chitotriosidase-like proteins. | Find chitinases in reptiles. | The model focuses on a defined gene family instead of broadly collecting unrelated proteins. |
| Taxonomic scope | Restrict the search to Squamata, Testudines, Crocodylia, and representative non-avian reptiles; exclude birds and mammals. | Search vertebrates. | The results match the intended clade and avoid unnecessary noise. |
| Research task | Identify candidate sequences, summarize annotation names, sequence lengths, and species, and flag possible duplicates or partial records. | Analyze these genes. | The task becomes operational and produces a reusable summary. |
| Input or data source | Use the attached FASTA file, plus NCBI protein annotations if needed for cross-checking names. | Use available data. | The model knows what evidence to rely on and can separate uploaded data from external annotations. |
| Search or filtering criteria | Keep only sequences with annotations suggesting chitinase or chitinase-like function, and note whether the entry appears complete or partial. | Keep the best ones. | The selection process becomes transparent and easier to reproduce. |
| Desired output format | Return a tab-delimited table with Species, Accession, Annotation, Length, Completeness, and Notes. | Summarize the results. | The output can be reused directly in spreadsheets or downstream analyses. |
| Important constraints | Do not assume homology from annotation alone; flag uncertain entries and separate observed annotations from functional interpretation. | Be accurate. | The response becomes more cautious and scientifically appropriate. |
| Level of explanation you want | First explain the workflow for undergraduate students, then provide a more technical version for graduate students. | Explain it well. | The response becomes layered and useful for teaching and research. |

Here is an example of data mining for chitinase sequences in reptiles. It shows how a weak prompt can be improved into a strong prompt for a sequence-based data mining task.

## Weak prompt

"Find chitinase sequences in reptiles."

The most important problem with this prompt is that it is too broad. It does not specify:

  • Which reptiles
  • Which gene names or gene family members
  • Whether the task uses uploaded files or online databases
  • How candidate sequences should be filtered
  • How the results should be returned.

Because of that, the response may be incomplete and overly broad. Here is an improved prompt.

## Strong prompt

"Use the attached FASTA file and identify candidate chitinase-related protein sequences in reptiles. Restrict the analysis to non-avian reptiles and organize the results by major clade when possible (for example, Squamata, Testudines, and Crocodylia). Search for annotations or headers suggesting CHIT1, CHIA, CHID1, chitotriosidase, acidic mammalian chitinase, chitinase-like proteins, or related naming variants.

For each candidate sequence, return a tab-delimited table with the following columns:

Species  
Major_clade  
FASTA_ID  
Annotation  
Sequence_length_aa  
Complete_or_partial  
Evidence_type  
Notes

Use the following rules:

1. Separate direct evidence from inferred evidence.  
2. If the sequence name clearly indicates a chitinase-related gene, label the evidence as annotation-based.  
3. If the entry is ambiguous, place it in the Notes column rather than over-interpreting it.  
4. Do not assume that all glycosyl hydrolase family proteins are true chitinases.  
5. If multiple nearly identical entries occur for the same species and gene, flag them as possible duplicates.  
6. After the table, provide a short summary indicating which reptile groups appear to have more candidate entries and where uncertainty remains.

Explain first at an undergraduate level and then add a more technical interpretation suitable for graduate students."

This prompt improves the LLM task because the resulting answer is easier to verify and much more useful for downstream work. For example, the output might include:

  • A candidate list of reptile sequences with chitinase-related annotations
  • A clear indication of whether evidence comes from explicit annotation or weaker inference
  • Identification of duplicate or partial entries
  • A short summary of patterns by reptile clade

35.11 Prompting for coding

Prompts for coding should follow guidelines similar to those for research prompting in general. However, the LLM needs further details, such as the programming language; the exact task dissected into logical and usually chronological steps (e.g., a protocol); the structure of the input dataset; whether data need to be downloaded from the web, for example from NCBI, UniProt, NCBI-SRA, or PubMed; the desired output; and any constraints such as speed or package restrictions.

Here are some examples of good prompts for coding.

| Prompt quality | Good example | Bad example (vague) | Expected results |
| --- | --- | --- | --- |
| Programming language | Write a Python 3 script that reads a FASTA file and returns a tab-delimited table with sequence ID and nucleotide length. | Write code for FASTA lengths. | The model produces Python code instead of guessing another language. |
| Task to perform | Create a script that removes sequences shorter than 300 bp from a multi-FASTA file. | Filter this file. | The task is explicit, so the output is targeted and usable. |
| Input format | Input is a multi-FASTA file provided with the argument --input_fasta_file. Headers should be preserved exactly. | Use my file. | The model knows how data enter the script and can design argument parsing correctly. |
| Desired output format | Return a tab-delimited file with columns FASTA_ID and N_nucleotides, and also save a filtered FASTA file. | Give me the output. | The result includes the exact files and column names requested. |
| Important constraints | Do not use pandas. Use only Biopython and argparse. Sequences must remain on one line in the output FASTA. Include error handling for empty files. | Make it efficient. | The code follows package, formatting, and robustness requirements more closely. |
| Level of explanation you want | Explain the code at a level suitable for undergraduate biology students learning Python 3, and include comments in the script. | Explain it. | The explanation matches the user’s background and is more educational. |

35.12 A basic prompting template for coding

The following template can be adapted for most beginner coding tasks:

"Write a [LANGUAGE] script that does the following:
1) [TASK 1]
2) [TASK 2]
3) [TASK 3]

The input file looks like this: [DESCRIBE INPUT]
The output should be: [DESCRIBE OUTPUT]
Please add comments explaining each step.
Assume the user is a beginner.
If there are possible edge cases, explain them after the code."

Here is an example of a basic analysis and plot using R:

"Write an R script that does the following:
1) imports a tab-delimited file called expression_data.txt
2) removes rows where column TPM is NA
3) calculates the mean TPM per tissue
4) plots the results as a barplot

The input file is tab-delimited, has a header, and contains columns named gene_id, tissue, and TPM.
The output should be a plot and a table printed in the console; provide code using ggplot2.
Please add comments explaining each step.
Assume the user is a beginner."
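A hypothetical first-draft answer to the request above is sketched below in base R. The prompt asks for ggplot2; base R is used here only to keep the sketch free of package dependencies, and a tiny expression_data-style input file is generated inside the script so it can run on its own.

```r
# Hypothetical first-draft answer (base R; a real LLM answer would
# likely use ggplot2 as the prompt requests).
# Create a small example input file so the sketch is runnable.
input_file <- tempfile(fileext = ".txt")
writeLines(c("gene_id\ttissue\tTPM",
             "g1\tliver\t10",
             "g2\tliver\tNA",
             "g3\tbrain\t4",
             "g4\tbrain\t8"),
           input_file)

# 1) import the tab-delimited file
counts <- read.table(input_file, header = TRUE, sep = "\t")

# 2) remove rows where TPM is NA
counts <- counts[!is.na(counts$TPM), ]

# 3) calculate the mean TPM per tissue and print the table to the console
mean_tpm <- aggregate(TPM ~ tissue, data = counts, FUN = mean)
print(mean_tpm)

# 4) plot the results as a barplot
barplot(mean_tpm$TPM, names.arg = mean_tpm$tissue,
        main = "Mean TPM per tissue", ylab = "Mean TPM")
```

Running the sketch prints the per-tissue means and draws the barplot, which is exactly what the prompt asks the console output to contain.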

35.13 Prompting for bioinformatics in R

Bioinformatics prompts are usually better when they include a biological context, file structure, and expected output. This is important because the same coding task may be implemented differently depending on whether you are working with FASTA, data matrices, molecular data, or experimental result tables.

For example, a weak bioinformatics prompt would be:

"Write R code for my sequences."

A stronger prompt would be:

"Write an R script that reads a FASTA file using Biostrings, calculates the sequence length of each entry, removes sequences shorter than 300 base pairs, and writes the retained sequences to a new FASTA file. Add comments and explain what packages need to be installed."

Another good example for tabular bioinformatics data is:

"Write an R script that imports a tab-delimited differential expression table with columns gene_id, log2FC, and padj. Keep only rows where padj < 0.05 and absolute log2FC > 1. Then create a volcano plot with ggplot2 and label the 10 genes with the smallest padj values. Explain the code for a starting graduate student in bioinformatics."
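For the differential expression prompt above, the core filtering logic might be sketched as follows. The example table is invented here for illustration; a full answer would read the real file and add the ggplot2 volcano plot with labels for the top genes.

```r
# Sketch of the filtering step only (base R); column names gene_id,
# log2FC, and padj come from the prompt. The table below is invented.
de_table <- data.frame(gene_id = c("g1", "g2", "g3", "g4"),
                       log2FC  = c( 2.3, -0.4, -1.8,  1.2),
                       padj    = c(0.001, 0.20, 0.03, 0.30))

# keep significant genes: padj < 0.05 and |log2FC| > 1
de_sig <- de_table[de_table$padj < 0.05 & abs(de_table$log2FC) > 1, ]

# order by padj so the smallest values come first (used later for labeling)
de_sig <- de_sig[order(de_sig$padj), ]
print(de_sig)
```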

A good prompt for sequence summaries in R might be:

"Write an R script that imports a DNA FASTA file, calculates GC content for each sequence, stores sequence names, lengths, and GC content in a data.frame, and writes the result as a tab-delimited file. Use Biostrings if needed and comment the script."
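The GC-content part of this task can be sketched in plain base R, as below. The prompt suggests Biostrings, which a real answer would use to read the FASTA file; the sequences here are invented so the example runs without Bioconductor.

```r
# Dependency-free sketch of the GC-content summary. In a real answer,
# 'seqs' would come from reading a FASTA file (e.g., with Biostrings).
seqs <- c(seq1 = "ATGCGC", seq2 = "ATATAT", seq3 = "GGGCCC")

# fraction of G and C bases in one sequence
gc_content <- function(s) {
  bases <- strsplit(toupper(s), "")[[1]]
  sum(bases %in% c("G", "C")) / length(bases)
}

# collect names, lengths, and GC content in a data.frame
gc_table <- data.frame(name       = names(seqs),
                       length_bp  = nchar(seqs),
                       GC_content = sapply(seqs, gc_content))

# write the summary as a tab-delimited file
write.table(gc_table, file = tempfile(fileext = ".txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
print(gc_table)
```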

Below is another example of a prompt, followed by the type of code structure that is expected as output.

"Write an R script for a beginner that imports a tab-delimited file called my_counts.txt, filters rows where count is greater than or equal to 10, and makes a histogram of count values. Include comments and use base R only."

A typical answer from an LLM may resemble the following:

# import the table
my_counts <- read.table(file = "my_counts.txt",
                        header = TRUE,
                        sep = "\t",
                        stringsAsFactors = FALSE)

# inspect the first rows
head(my_counts)

# keep only rows where count >= 10
my_counts_filtered <- subset(my_counts, count >= 10)

# inspect the filtered result
head(my_counts_filtered)

# draw a histogram of count values
hist(my_counts_filtered$count,
     main = "Histogram of filtered counts",
     xlab = "Count")

This result is useful as a starting point, but it should be carefully revised for accuracy. If you find an error, you can ask for a fix: describe the error or any further desired improvements in follow-up prompts while pasting the original code.

Prompt: "Improve the following R code by adding a line that writes the filtered table to a tab-delimited file. Then, rewrite the same script using dplyr and ggplot2. Include an explanation of what subset() does in this script. Provide notes on what will happen if the file contains NA values in the count column."

Here is the original code:

# import the table
my_counts <- read.table(file = "my_counts.txt",
                        header = TRUE,
                        sep = "\t",
                        stringsAsFactors = FALSE)

# inspect the first rows
head(my_counts)

# keep only rows where count >= 10
my_counts_filtered <- subset(my_counts, count >= 10)

# inspect the filtered result
head(my_counts_filtered)

# draw a histogram of count values
hist(my_counts_filtered$count,
     main = "Histogram of filtered counts",
     xlab = "Count")
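The first improvement the follow-up prompt asks for, exporting the filtered table, might look like the sketch below. The small input file is created here only so the sketch runs on its own; the dplyr/ggplot2 rewrite and the subset() explanation are left to the LLM answer.

```r
# Sketch of one requested improvement: write the filtered table to disk.
# The example input file is generated here only to make the sketch runnable.
counts_file <- tempfile(fileext = ".txt")
writeLines(c("gene\tcount", "g1\t25", "g2\t3", "g3\t12"), counts_file)

my_counts <- read.table(counts_file, header = TRUE, sep = "\t")
my_counts_filtered <- subset(my_counts, count >= 10)

# new line requested in the follow-up prompt: export the filtered table
out_file <- tempfile(fileext = ".txt")
write.table(my_counts_filtered, file = out_file,
            sep = "\t", quote = FALSE, row.names = FALSE)

# note: subset(my_counts, count >= 10) silently drops rows where count
# is NA, because an NA comparison is neither TRUE nor FALSE
```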

35.14 Iterative prompting

One of the most powerful ways to use LLMs is to improve code progressively by requesting successive (i.e., iterative) improvements. This means you refine the code through multiple prompts, including your previous code in each successive prompt; you then run the updated code, inspect its output, find errors, suggest new improvements, and add analyses or graphs.

A very effective iterative prompting workflow is the following:

  1. Ask for a first working version
  2. Run the code and inspect errors or weak points
  3. Return to the LLM with the code and explain what must be improved
  4. Repeat until the script is correct, clear, and efficient

For example, imagine that the first code works but lacks flexibility. You can improve it step by step.

Step 1: Ask for the first version

"Write an R script that imports a tab-delimited file and filters rows where value > 5."

Step 2: Ask for comments and explanation after the first draft code has been output

"Add comments for each line and explain the code for a beginner."

Step 3: Ask for generalization by pasting the first draft code (second iteration)

"Now modify the script so the user can define the input file name, output file name, and threshold value at the top of the script."

Step 4: Ask for better error handling by pasting the second draft code (third iteration)

"Add checks so the script stops with a clear message if the input file does not exist or if the column named value is missing."

Step 5: Ask for style improvement by pasting the third draft code (fourth iteration)

"Rewrite the script to be cleaner and more readable, but keep base R only and do not change the output."
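As a sketch of where these iterations converge, here is what the final script might look like. The file names, column name (value), and threshold are assumptions taken from the example prompts; a small example input is written first so the sketch runs end to end.

```r
# --- user-defined parameters (added in Step 3) ---
input_file  <- "my_data.txt"
output_file <- "filtered_data.txt"
threshold   <- 5

# create a small example input so the sketch runs end to end
writeLines(c("gene\tvalue", "geneA\t3", "geneB\t8", "geneC\t12"),
           input_file)

# --- error handling (added in Step 4) ---
if (!file.exists(input_file)) {
  stop("Input file not found: ", input_file)
}
my_data <- read.table(input_file, header = TRUE, sep = "\t")
if (!"value" %in% colnames(my_data)) {
  stop("Column 'value' is missing from ", input_file)
}

# --- core task (Step 1): keep rows where value > threshold ---
my_filtered <- my_data[my_data$value > threshold, ]

# write the filtered table as a tab-delimited file
write.table(my_filtered, file = output_file,
            sep = "\t", row.names = FALSE, quote = FALSE)
```

Note how each iteration left a visible trace in the final script: configurable parameters at the top, explicit checks before the core task, and a documented output step.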

This type of progressive prompting usually gives much better results than trying to get the perfect script in a single request. Here is another example of a sequence of prompts for improving a beginner script in R.

Original prompt

"Write an R script that imports a tab-delimited file with columns gene and expression, removes rows with NA in expression, and makes a boxplot of expression."

First improvement

"Modify the script so the boxplot shows expression by tissue, where tissue is another column in the table. Use ggplot2 instead of base R."

Second improvement

"Add a step that writes the cleaned data to a file named cleaned_expression.txt as a tab-delimited table."

Third improvement

"Add checks so the script verifies that the columns gene, expression, and tissue exist before plotting."

Fourth improvement

"Now explain which parts of this script are data import, data cleaning, and plotting, and describe why each step is needed."
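A sketch of the script this second prompt sequence converges on is shown below. The input file name and example rows are assumptions for illustration; a small example file is created first so the sketch runs end to end, and the ggplot2 call is wrapped so the sketch still works when the package is absent.

```r
# create a small example input (hypothetical data)
writeLines(c("gene\texpression\ttissue",
             "g1\t2.5\tleaf", "g2\tNA\troot", "g3\t7.1\troot"),
           "expression.txt")

# data import
expr <- read.table("expression.txt", header = TRUE, sep = "\t")

# third improvement: verify that the required columns exist
needed <- c("gene", "expression", "tissue")
missing_cols <- setdiff(needed, colnames(expr))
if (length(missing_cols) > 0) {
  stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
}

# original prompt: remove rows with NA in expression
expr_clean <- expr[!is.na(expr$expression), ]

# second improvement: write the cleaned table
write.table(expr_clean, "cleaned_expression.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)

# first improvement: boxplot of expression by tissue with ggplot2
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(expr_clean, aes(x = tissue, y = expression)) + geom_boxplot()
}
```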

35.15 Prompting to debug code

Code might contain errors (i.e., bugs) and fail to run or return undesired results. These errors usually stem from some mistake that the user must locate, often starting from the reported error message (if the program did not run). In the past, this activity was very tedious and required knowledge of the coding language plus several rounds of testing. With LLMs, much of it can be outsourced: these tools can be very effective at debugging, but your prompt should include the original code, the error message, and what your expected output should look like (or a description of it).

A weak debugging prompt:

"My code does not work."

A much better debugging prompt:

"This R code returns an error 'object tissue not found'. Based on this, improve the following code:

my_data <- read.table('my_file.txt', header = TRUE, sep = '\t')
boxplot(expression ~ tissue, data = my_data)

Explain the possible causes of this error, how to inspect the imported columns, and then provide a corrected version of the code."

This helps the LLM focus on the actual problem rather than guessing.
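Before re-prompting, it often helps to reproduce the problem yourself. A frequent cause of "object 'tissue' not found" is a column-name mismatch; the toy table below (hypothetical data, just to illustrate the inspection steps) reproduces it:

```r
# toy table with a miscapitalized column name (hypothetical data)
my_data <- data.frame(gene = c("g1", "g2"),
                      expression = c(2.5, 7.1),
                      Tissue = c("leaf", "root"))  # capital T

names(my_data)   # shows "Tissue", not "tissue"
str(my_data)     # structure: column names and types

# check before plotting instead of letting boxplot() fail
has_tissue <- "tissue" %in% names(my_data)
has_tissue   # FALSE here: R column names are case sensitive
```

Pasting the output of names() or str() into your debugging prompt gives the LLM the information it needs to diagnose this kind of mismatch instead of guessing.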

35.16 Asking an LLM to explain code

You should ask LLMs not only to generate code, but also to annotate and explain it. This transforms the tool from a simple code generator into a teaching assistant.

Here are some prompts aimed at improving code annotation:

"Explain the following R function line by line for an undergraduate bioinformatics student. After the explanation, define what each argument does and provide a simple example of use."

"Rewrite this code with better comments and then explain what subset(), grepl(), and complete.cases() do in this context."

These explanation prompts are especially useful when you reuse pre-generated code from another source, such as GitHub, the supplementary material of an article, a lab script used for a bioinformatics protocol, or any other type of code example.
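The three functions named in the second prompt can be demonstrated on a toy table (hypothetical data):

```r
df <- data.frame(gene  = c("atpB", "rbcL", "matK"),
                 count = c(15, NA, 42))

subset(df, count >= 10)   # keeps rows meeting a condition (drops the NA row)
grepl("^atp", df$gene)    # TRUE/FALSE pattern match for each element
complete.cases(df)        # TRUE for rows with no NA in any column
```

Asking an LLM to walk through output like this, line by line, is a quick way to verify that its explanation matches what the functions actually do.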

35.17 Using multiple LLMs to improve a starting code

There are diverse strategies for improving or generating complex bioinformatic pipelines. As expected, an LLM can sometimes get stuck and return ineffective code, or progressively generate a ‘Frankenstein’ patchwork of code that becomes too long or too complex to understand. An alternative strategy is to use more than one LLM for different purposes, or to ask two or more LLMs to generate the same code. You can then produce a third version by asking one of them to compare both codes, find the strengths and weaknesses of each, and generate a hybrid that combines the best of both.

Here is an example of such a prompt, aimed at comparing two versions and generating a hybrid code:

"Compare these two uploaded versions of this Python code to annotate proteins using BLASTp with a user-provided library. Provide in a table the strengths and weaknesses of each. Then, using this information, generate a third version that combines the best of both."

You can think of this as using different reviewers with different strengths. However, you need to test each version of the code to see whether the desired output is generated or errors have been introduced. As with text generation, AI-generated code can also hallucinate, producing numeric calculations, classifications, or images that are factually wrong or misleading. A common symptom of this type of error is that numeric values for a property or measurement differ substantially between the two versions of the code. If you have knowledge of what is being measured, you can spot such discrepancies by eye and confirm the miscalculation in a specific version. In other cases, the judging LLM will determine that an error was introduced and proceed to correct it by transplanting part of the code from one version to the other.

There are several reasons why different LLMs produce different output code. Redundant functions or packages may exist for the same task, and each LLM may choose a different one. Some LLMs spend less effort on reasoning and produce a shorter version of the code, or prefer shortcuts. Finally, the requested calculations may have been performed with different metrics when the prompt did not specify which one is needed (e.g., the Kyte-Doolittle versus Hopp-Woods amino acid hydrophobicity scales).

Here is an example of a workflow that will allow some comparison and hybrid code construction:

  1. Use one LLM to generate the first script
  2. Use a second LLM to review the script for logic, edge cases, and clarity
  3. Use a third LLM to explain the revised script in simpler language or to compare the first and second versions
  4. Use this LLM to generate a hybrid code with the best of both previous versions
  5. Manually test every version of the script yourself. If you detect an error or miscalculation, report its explicit location and type (rounding, differing values, etc.)
  6. Guide the LLM by indicating what type of output you expect, for example by providing a sample of the input data and the corresponding expected output

Here is an example multi-LLM workflow.

Prompt to LLM 1: generate code in ChatGPT (thinking):

"Write an R script that imports a FASTA file, calculates sequence lengths, and outputs a tab-delimited table with sequence name and length. Assume the user is a beginner and add comments."

Prompt to LLM 2: generate the same code in DeepSeek:

"Write an R script that imports a FASTA file, calculates sequence lengths, and outputs a tab-delimited table with sequence name and length. Assume the user is a beginner and add comments."

Prompt to LLM 3: Review both versions of the code in ChatGPT (thinking):

"Review the two uploaded versions of the following R script for correctness, readability, and edge cases. Identify similarities, strengths, and weaknesses of each in a table format. Suggest at least three possible improvements, then provide a revised version."

Prompt to LLM 4: Generate a hybrid final version in ChatGPT (thinking):

"Generate a third version of the code using the best of both uploaded versions. Explain this revised R script step by step for a starting graduate student in bioinformatics. Define the purpose of each package and function, and mention possible input problems the user should check before running it."
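For reference, the task given to both LLMs can be done in base R alone. The sketch below is one possible solution, not what any particular LLM would return; the file names are assumptions, and a small example FASTA file is written first so the sketch runs end to end.

```r
# create a small example FASTA file (hypothetical sequences)
writeLines(c(">seq1", "ATGCGT", "ACGT", ">seq2", "TTT"),
           "sequences.fasta")

# read the file as plain lines
fasta_lines <- readLines("sequences.fasta")

# header lines start with ">"
is_header <- grepl("^>", fasta_lines)
seq_names <- sub("^>", "", fasta_lines[is_header])

# assign each sequence line to its most recent header,
# then sum the character counts per sequence
group <- cumsum(is_header)
len_by_seq <- tapply(nchar(fasta_lines[!is_header]),
                     group[!is_header], sum)

# build and write the name/length table
lengths_table <- data.frame(name   = seq_names,
                            length = as.integer(len_by_seq))
write.table(lengths_table, "sequence_lengths.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)
```

Comparing each LLM's version against a simple reference like this (e.g., does it handle sequences wrapped over several lines?) makes the review prompts above much more concrete.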

35.18 Comparing answers from multiple LLMs

When comparing outputs from different LLMs, consider the following:

  1. Does the code actually run?
  2. Does it solve the requested problem?
  3. Is the code readable and well commented?
  4. Does it make unnecessary assumptions?
  5. Does it handle missing values, wrong file names, or absent columns?
  6. Does it use packages that are appropriate for the task?
  7. Does it explain the logic clearly?

It is often useful to paste the answers into a text file and compare them side by side. You can then create a combined prompt such as:

"I have two versions of an R script below. Compare them for correctness, readability, and performance. Then create a third version that combines the best parts of both. Explain why your final version is better."

This is an excellent strategy for learning because it forces the student to think critically about code quality rather than accepting the first answer.

35.19 Good practices when using LLMs for coding

When using LLMs for coding in class, in lab work, or in bioinformatics projects, it is good practice to store your prompts (for example, as comments in the script) as if they were entries in a lab notebook.

  1. Keep a record of the prompts you used
  2. Keep the original version of the code and the revised versions
  3. Annotate what you changed manually
  4. Test the code on a small example dataset first
  5. Verify that packages and functions are real (i.e., exist) and they are correctly used
  6. Inspect the output carefully
  7. Make sure that you do not upload sensitive or restricted data, as many LLM services retain such data
  8. Treat your prompts as part of the record: they support code reproducibility and make your workflow more transparent
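A script header like the following is one way to keep this record. All names, dates, and prompt texts below are hypothetical, purely to show the format:

```r
# Script:  filter_counts.R
# Date:    2024-05-10 (hypothetical)
# LLM prompts used:
#   v1: "Write an R script that imports counts.txt and keeps count >= 10"
#   v2: "Add a check that the input file exists and write the output file"
# Manual changes:
#   - corrected the column name from "counts" to "count"
#   - reduced the threshold from 20 to 10 after inspecting the data
# Tested on: example_counts.txt (20 rows)
```

Because the header is only comments, it costs nothing at run time but preserves exactly how the script was produced and verified.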

35.20 Limitations of LLMs for coding

LLMs can be very useful for coding because they can generate scripts quickly, explain syntax, suggest workflows, and help debug errors. However, one major limitation is that LLMs do not truly understand the data, software environment, or biological system in the way a researcher does. LLMs predict the next text based on patterns used in training. As a result, they may generate code that is syntactically valid but logically wrong, especially when the task involves unusual input formats, hidden assumptions, or edge cases. For example, a script may run without errors but parse the wrong column, mishandle missing values, reverse sequence orientation incorrectly, or apply the wrong filtering threshold.

Another limitation is that LLMs can hallucinate technical details. They may invent package names, functions, arguments, command-line flags, file formats, or even expected outputs. This can be especially problematic in bioinformatics, where tools often have highly specific options and version-dependent behavior. A generated command may look realistic but fail immediately because the option does not exist.

LLMs may also misunderstand the structure of the user’s data. If the prompt does not fully specify limits, headers, sequence formats, missing-value conventions, column meanings, encoding, or naming rules, the LLMs may fill in the gaps in logic or data with assumptions. In biological datasets, these assumptions can be dangerous because a small mismatch between what the user intended and what the script does can alter downstream conclusions.

A further issue is incomplete treatment of edge cases. These are unusual data points whose handling usually requires specific instructions. LLM-generated code often works best for the typical case that resembles examples in its training data. However, it may fail on empty files, duplicated identifiers, malformed FASTA headers, multistate amino acid entries, unexpected characters, uneven tabular structures, or very large files. In research workflows, these edge cases and dirty data are the norm, not the exception.

LLMs can also produce overconfident explanations. They may describe an approach as correct, optimal, or standard even when it is only one possible option, or when it contains hidden errors. This can create a false sense of security for users who are still learning programming or who are unfamiliar with the data or research question. The LLM’s explanation may sound authoritative while masking uncertainty, omitted assumptions, or incorrect reasoning.

For bioinformatics, small coding errors can result in major analytical mistakes. Therefore, LLM-generated code should always be treated as a draft to be checked, not as a final answer to be trusted automatically. In practice, an LLM should be used as an assistant rather than as an authority: it can speed up drafting, debugging, and brainstorming, but the user must still verify correctness, confirm assumptions, and decide whether the approach is biologically appropriate.

Here are common limitations of LLMs for coding:

Limitation: Code may look correct but fail
  Description: An LLM can generate code that appears polished and well-structured but does not run or does not solve the problem correctly.
  Example: A Python script is generated with valid syntax, but it uses the wrong column name and crashes when run on the real dataset.

Limitation: Invented functions or packages
  Description: The model may hallucinate functions, arguments, package names, or command-line options that do not exist.
  Example: An R answer suggests a function such as read_fasta_table() from a package that is not real.

Limitation: Misunderstanding input structure
  Description: The model may assume the wrong delimiter, header format, missing-value style, or column meaning.
  Example: A script assumes a comma-separated file when the real file is tab-delimited, causing the entire row to be read as one column.

Limitation: Poor handling of edge cases
  Description: LLM-generated code often works only for the most typical input and ignores unusual but important cases.
  Example: A FASTA parser works for simple headers but fails when sequence names contain spaces or repeated identifiers.

Limitation: Omission of assumptions
  Description: The code may ignore biological, statistical, or methodological assumptions needed for valid interpretation.
  Example: A filtering script removes short sequences without considering that biologically meaningful loci may differ in expected length.

Limitation: Overconfident explanations
  Description: The model may present uncertain or flawed reasoning as if it were definitely correct.
  Example: It states that a given alignment strategy is “the best method” without discussing alternatives or limitations.

Limitation: Version-specific errors
  Description: LLMs may suggest syntax or options that apply only to different software versions.
  Example: A command for a bioinformatics tool uses an argument that existed in an older release but not in the installed version.

Limitation: Shallow debugging
  Description: The model may explain an error message in a generic way without identifying the true source of the problem.
  Example: It blames file corruption when the real issue is an empty variable or wrong command-line argument.

Limitation: Weak reproducibility practices
  Description: Generated code may omit logging, validation, comments, or clear file naming, reducing reproducibility.
  Example: A script writes output files but does not document parameters or record which filters were applied.

Limitation: Hidden inefficiency
  Description: The model may generate code that works on small examples but performs poorly on large biological datasets.
  Example: A script repeatedly loops through a large FASTA file in memory instead of streaming records efficiently.