Bioinformatics Workshop Gitbook
2025-01-22
Session 1 – Getting Started with R
1.1 Why do we use R?
R is at the core of modern, user-friendly bioinformatics. It is a programming language and free software environment for applying diverse methods for statistics, data mining, file processing, and graphics. R is extremely flexible and provides easily customizable data processing that ranges from simple to very complex analyses. Here are some advantages of R:
- R is free to download and use while other similar environments (e.g., SAS, SPSS, Matlab) require licenses and fees.
- R is available across platforms including Windows, Mac and Linux.
- R is always evolving, and being open source is part of its core nature. The R community is huge and global, yet the core base software tends to be relatively stable, while add-on packages expand the capabilities of core R.
- R has extensive package repositories such as the Comprehensive R Archive Network (CRAN), which hosts diverse packages that extend functionality with specialized statistical methods, machine learning, and data manipulation.
- R provides easy and accessible data processing. You can import diverse types of datasets from text files, Excel sheets, sequence data, images, etc.
- R provides easy data manipulation including sub-setting (i.e., you select parts of your source data using filters), transformation, encryption, and export to other formats.
- R provides one of the most extensive sets of freely available data visualization tools, with extensive manuals, vignettes and online boards. Such graphics range from simple boxplots to multipanel and tiled figures, and even GIS maps.
- R statistical applications are diverse and easy to implement. Many basic univariate methods are preloaded, and the newest tools and models are usually provided as add-ons developed by the authors of such methods.
- R provides an easy way to share your methods (scripts) and reproduce illustrative examples of their implementation. Many of these reside as add-on packages, others in GitHub pages and supplementary materials of peer-reviewed publications.
- R allows easy scripting and functions to automate repetitive tasks and build customizable workflows.
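As a minimal sketch of several of the points above (sub-setting, transformation, export, and automating a repetitive task with a function), the following uses the iris data set that ships with base R. The function name summarize_column and the temporary output file are illustrative choices only and are not part of the workshop materials.

```r
# Load a data set that ships with base R
data(iris)

# Sub-setting: keep only the rows for one species
setosa <- iris[iris$Species == "setosa", ]

# Transformation: add a new column computed from existing ones
setosa$Petal.Ratio <- setosa$Petal.Length / setosa$Petal.Width

# A small function (hypothetical name) to automate a repetitive summary
summarize_column <- function(df, column) {
  values <- df[[column]]
  c(mean = mean(values), sd = sd(values), n = length(values))
}

# Apply the same summary to several columns without repeating code
summaries <- sapply(c("Sepal.Length", "Sepal.Width"),
                    function(col) summarize_column(setosa, col))
print(summaries)

# Export: write the subset to a comma-separated file in a temporary folder
out_file <- file.path(tempdir(), "setosa_subset.csv")
write.csv(setosa, out_file, row.names = FALSE)
```

Writing the summary once as a function and applying it to many columns is the same pattern used later to build larger, customizable workflows.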
A) The R project:
This is the main repository for R, its add-on packages, documentation, and source code.
B) Looking for help when using R (searching websites):
Probably one of the most useful Q/A sites for quick and practical responses to specific questions about programming and R use. This site follows a format where someone asks a question or presents a problem, and the community tries to respond with suggestions and practical applications of R code. Some answers use the data from the question to illustrate how to fix the problem. The beauty of this approach is that readers vote on the provided answers, and those with the most votes move up among competing responses (i.e., an answer with a score of 50 is considered better than one with a score of 10). Readers can also add discussions, comments, and other examples to complement responses. Likewise, you can “google” your questions, and it is very likely that you will be pointed to a “stackoverflow” response, which reflects the value that most R users place on the answers on that site.
Moreover, the stackoverflow website answers questions about most other programming languages (e.g., Python, Java, etc.). Therefore, I suggest ending your questions/queries with “… in R” or using the tag [r] (e.g., “[r] weighted mean”), so that the search is limited to answers applicable to R.
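When you post a question on such a site, a small reproducible example makes it much easier for others to answer. The following is a minimal sketch that assumes a made-up data frame named my_data. The base R function dput() prints R code that recreates an object, so readers can paste it directly into their own console.

```r
# A small, made-up data frame that reproduces the problem
my_data <- data.frame(
  sample = c("A", "B", "C"),
  value  = c(1.5, NA, 3.2)
)

# dput() prints R code that rebuilds the object
dput(my_data)

# Show the unexpected behaviour: mean() returns NA when a value is missing
mean(my_data$value)

# Report your R version so that others can match your setup
R.version.string
```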
https://stackoverflow.com/tags/r/info
An excellent guide to R-related examples, references, and an index of answers relevant to R. This is a search engine limited mostly to R-based websites.
You can also use artificial intelligence (AI) platforms to ask questions and obtain suggestions for implementing R code, enabling you to translate verbal descriptions of data manipulation into R scripts. This approach represents the likely future of coding, as human-machine interactions increasingly embrace natural (ordinary) language for implementing computing algorithms and functions.
Here is an example in ‘chatgpt’ with a basic natural language prompt
in R create a basic plot of two vectors
Then ‘chatgpt’ will return something like this
Here’s an example of how to create a basic plot of two vectors in R:
# Define two vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
# Plot the vectors
plot(x, y, type = "o", col = "blue",
main = "Basic Plot of Two Vectors",
xlab = "X Values", ylab = "Y Values")
# Add grid lines
grid()
Here is an example in ‘github: Copilot’ with a basic natural language prompt
in R create a basic plot of two vectors
Then ‘github: Copilot’ will return something like this
Sure! Here's a simple example of how to create a basic plot of two vectors in R:
# Create two vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
# Create a basic plot
plot(x, y, type="b", col="blue", main="Basic Plot of Vectors", xlab="X Axis", ylab="Y Axis")
In the case of ‘github: Copilot’, the response might include a brief explanation and how to run the code example.
C) Repositories for R-packages (add-ons to basic R):
This is the main repository. To get the packages that you need, look on the left side of the page for Software and then click on Packages. You can browse packages using either Table of available packages, sorted by date of publication or Table of available packages, sorted by name.
This archive provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor houses R packages (i.e., open source and open development) as well as older packages that cannot be found on CRAN. Bioconductor provides two releases each year and also has a very active user community.
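A minimal sketch of how packages from these two repositories are commonly installed from the R console. install.packages() and BiocManager::install() are the standard calls; the package names ggplot2 and Biostrings and the helper name missing_packages are only illustrative. The install lines are commented out so that nothing is downloaded until you choose to run them.

```r
# Helper (hypothetical name): report which packages are not yet installed
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

# Example package names, chosen only for illustration
cran_pkgs <- c("ggplot2")
bioc_pkgs <- c("Biostrings")

# Packages from CRAN are installed with install.packages()
# install.packages(missing_packages(cran_pkgs))

# Packages from Bioconductor are installed through the BiocManager package,
# which is itself installed from CRAN
# install.packages("BiocManager")
# BiocManager::install(missing_packages(bioc_pkgs))

# Check what is still missing on this computer
print(missing_packages(c(cran_pkgs, bioc_pkgs)))
```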
D) Looking for suggestions of what R-packages to use:
http://cran.r-project.org/web/views
This site provides guidance about which add-on packages in the CRAN repository are relevant to a given topic. Each topic gives a brief overview of the included packages. The views explain which packages are included (or excluded), and they are not meant to endorse the “best” packages for a given task.
http://search.bioconductor.jp/
This provides a search tool for functions in Bioconductor packages.
This is a standard repository where news, conferences and other summaries relevant to the R community are posted.
This graphics gallery provides a collection of plots, charts and illustrations made with R. As indicated by this website, hundreds of charts are displayed in several sections, always with their accompanying code. This gallery focuses on the tidyverse and ggplot2.
1.2 Get a good text editor
Any scripting language such as the R language requires you to write code that can be read and annotated (i.e., you add notes that will help you understand what that piece of code is doing). Therefore, finding a good text editor is essential.
Here are several suggestions for free text editors:
1) Sublime (macOS, Windows, Linux):
https://www.sublimetext.com/download
It is free to download, but once in a while it will remind you to pay for a license (i.e., … Sublime Text may be downloaded and evaluated for free, however a license must be purchased for continued use. There is currently no enforced time limit for the evaluation).
It is incredibly customizable, and you can select the type of language that you are using to help you read your code. Personally, I use Sublime.
To improve visualization, you can start by installing Package Control: click Tools in the menu and then select Install Package Control. You can change the colors of this text editor by clicking Preferences, then Color Scheme…, and picking from the available options (I use Monokai). In the bottom-right corner you can also select the language by clicking on the current one (e.g., Plain Text) and changing it to R.
You can further customize the color scheme with different add-on packages for Sublime. One of those is Rainglow. To install this package, start by clicking on the menu tab as follows:
- Go to Sublime Text -> Preferences -> Package Control.
- Select Package Control: Install Package.
- Type Rainglow and select it.
- Go to Sublime Text -> Preferences -> Package Control -> Color Scheme….
- You can now select from hundreds of color schemes and try one that looks nice or makes the text editor easier to use.
2) Atom (macOS, Windows, Linux):
It is free and customizable. I am not familiar with this one, but my students have found it more appealing for its truly free and open-source nature. Note that GitHub archived the Atom project in December 2022, so it no longer receives updates.
3) BBEdit (macOS)
https://www.barebones.com/products/bbedit/
It is useful for lots of find and replace actions, or when you need a quick file exploration. It provides a free trial that ends after 30 days. For my needs, I do not need the extra features after the trial has ended (i.e., this text editor will continue to work after the trial, but without some of its more sophisticated features).
4) Notepad++ (Windows)
https://notepad-plus-plus.org/
It is free. I am not very familiar with this editor, yet it is considered one of the best for Windows (PC) computers.
1.3 RStudio
https://rstudio.com/products/rstudio/download/
This software application, defined as an integrated development environment (IDE), is designed to work and interact with R. Many people who use R encourage new users to use RStudio. Personally, I do not use this application, as I find that it obfuscates my coding. However, most of the applications and materials developed in this workshop should be easy to implement within RStudio. If you are already familiar with RStudio or decide to use it, you can do so. However, I am less familiar with how these materials run in this IDE and will be less helpful in fixing errors that arise within RStudio. If you use this software, choose the free version.
1.4 Installing R
Here are instructions to guide you in this process:
1) You need to define where you want to download R and where to keep your personalized R library (more on this later). I recommend the desktop or some easy-to-access folder (e.g., in Documents).
2) To install R, you need to access a repository by clicking on the mirrors page:
https://cran.r-project.org/mirrors.html
3) Try selecting the mirror closest to your location. For NYC, you can choose
Case Western Reserve University, Cleveland, OH: https://cran.case.edu/
0-Cloud: https://cloud.r-project.org/
4) You will see a list of links under Download and Install R. These include links to precompiled binary distributions of the base system for Windows and Mac.
5a) macOS: For Macs (macOS 10.13 – High Sierra and higher) click on Download R for (Mac) OS X. You will need to download the latest version, e.g., R-4.1.2.pkg (notarized and signed) for R 4.1.2 “Bird Hippie”, released on 2021/11/01.
For recent Macs, you also need to install XQuartz, since it is no longer part of OS X.
For older Macs (macOS 10.11 – El Capitan) click on Download R for (Mac) OS X and then on R-3.6.3.nn.pkg (signed).
If you have a Mac with an Apple M1 processor, do not yet use R-4.1.2-arm64.pkg (notarized and signed) to install R. I have noticed that this installation package tends to cause problems with some R packages, and such incompatibilities are sometimes very hard to fix. We will be using the version installed from R-4.1.2.pkg (notarized and signed), which works just fine.
5b) Windows: For computers running Windows, click on Download R for Windows, then on install R for the first time, and then on Download R 4.1.2 for Windows.
5c) Linux: For computers running Linux (or its variants), click on Download R for Linux and choose your operating system (distribution). Ask Randy Ortiz (Santos Lab) for further help.
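Once R is installed, a quick check from the R console can confirm which version and build you are running. This is a minimal sketch that uses base R functions only; the comparison against version 4.1.2 simply mirrors the release mentioned above, and the variable name is_recent is illustrative.

```r
# Print the full version string of the running R installation
print(R.version.string)

# Compare the running version against the release mentioned in this section
is_recent <- getRversion() >= "4.1.2"
cat("R is at least version 4.1.2:", is_recent, "\n")

# Show the processor architecture that this copy of R was built for
# (typically x86_64 for Intel builds and aarch64 for Apple silicon builds)
print(R.version$arch)

# List the folders where R looks for installed packages (your R libraries)
print(.libPaths())
```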
1.5 Annotating code
Throughout this gitbook, I will use # to indicate annotations for the user (you) that describe what the computer will be doing, comments, or results that you are expected to obtain. Annotating code is fundamental and a good habit to have from the beginning. Annotation helps you understand what a section of code is trying to accomplish. Any text after the # is ignored by the computer when you copy and paste your code from the text editor into the console.
## the next chunk of code will print on the screen "DO NOT FORGET TO ANNOTATE YOUR CODE"
cat("\nDO NOT FORGET TO ANNOTATE YOUR CODE\n")
#DO NOT FORGET TO ANNOTATE YOUR CODE
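As a further illustration, here is a short annotated chunk. The function name gc_content and the example sequence are made up for this sketch; the point is only to show how comments can explain the purpose of each step.

```r
## Calculate the GC content (fraction of G and C bases) of a DNA sequence
gc_content <- function(dna) {
  # Split the sequence into single characters and ignore letter case
  bases <- strsplit(toupper(dna), "")[[1]]
  # Count the bases that are G or C and divide by the total length
  sum(bases %in% c("G", "C")) / length(bases)
}

## Try the function on a short, made-up sequence
gc_content("ATGCGC")
#[1] 0.6666667
```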
1.6 Defining a working directory in R
R has a default directory on your computer, which might not be easily accessible, or you might want one that is specific to your project. You can get the current working directory by typing getwd()
## Print my current working directory
getwd()
#[1] "/Users/santosj"
In most cases, you might want to use another directory as your working directory. There are different approaches to change the working directory, and some are operating system specific.
You can create your desired directory or select a specific directory as your output by:
macOS using R menu: On the R console, click on Misc then Change Working Directory… or press command-D.
To make sure that the working directory has changed, use:
getwd()
#[1] "/Users/santosj/Desktop/Teach_R/my_working_directory"
macOS using drag & drop: Select your output directory and drag & drop this folder into the R console to get its path (i.e., the specific address of the folder on your hard drive). For example, you will see the path of the selected working directory
~/Desktop/Teach_R/my_working_directory
#Error: unexpected '/' in "~/"
Next, you can copy the path name ~/Desktop/Teach_R/my_working_directory and place it within the function setwd() within quotations
## Change my working directory to the following path
setwd("~/Desktop/Teach_R/my_working_directory")
To make sure that the working directory has changed, use:
getwd()
#[1] "/Users/santosj/Desktop/Teach_R/my_working_directory"
Windows: For PCs, you need to find the path of the desired working directory. To do this, highlight the folder, hold down the Shift key, and right-click it.
Then click on Properties and copy the text under Location, adding the folder name at the end. For example, if you want to change your working directory to the folder my_working_directory
Find its location: C:\Users\myPC\Desktop
You will get its path name as: C:\Users\myPC\Desktop\my_working_directory
Then use the setwd() function as follows
setwd("C:\Users\myPC\Desktop\my_working_directory")
You might expect this to change the working directory to the desired path. However, R does not accept single backslashes written this way, and you will get the following error
setwd("C:\Users\myPC\Desktop\my_working_directory")
#Error: '\U' used without hex digits in character string starting ""C:\U"
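This indicates that R is reading the \ (backslash) as the start of an escape sequence inside the character string (here \U, which R expects to be followed by hexadecimal digits), not as a path separator. To make it work, you need to change each \ character to / (forward slash).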
setwd("C:/Users/myPC/Desktop/my_working_directory")
getwd()
#[1] "C:/Users/myPC/Desktop/my_working_directory"
Another reported solution is to use double backslashes \\
setwd("C:\\Users\\myPC\\Desktop\\my_working_directory")
getwd()
#[1] "C:/Users/myPC/Desktop/my_working_directory"
This will indicate that you have successfully changed the working directory.
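The steps in this section can be combined into a minimal, platform-independent sketch. It assumes a hypothetical project folder named my_working_directory created inside R's temporary folder so that it runs on any computer; in practice you would replace tempdir() with your own location. The function file.path() builds paths with forward slashes, and raw strings of the form r"(...)", available from R 4.0.0, let you paste a Windows path without changing its backslashes.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Build the path with file.path(), which joins the parts with forward slashes
# (tempdir() stands in for a folder such as your Desktop)
new_wd <- file.path(tempdir(), "my_working_directory")

# Create the folder if it does not exist yet
if (!dir.exists(new_wd)) {
  dir.create(new_wd, recursive = TRUE)
}

# Change the working directory and confirm the change
setwd(new_wd)
print(getwd())

# From R 4.0.0, a raw string keeps backslashes as typed (Windows example path)
win_path <- r"(C:\Users\myPC\Desktop\my_working_directory)"
cat(win_path, "\n")

# Return to the original working directory
setwd(old_wd)
```

Restoring the original directory at the end keeps the example from affecting the rest of your R session.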