Jeff Grover. Bioinformatics Scientist.

Polyglot R and Python Bioinformatics and Data Science Projects Using Jupyter Notebooks

2024-06-02T00:00:00-04:00

TLDR - Check out this github repo for a (still really wordy) example: polyglot_jupyter_example

If you're anything like me, and there are probably tens of you out there, you enjoy working in multiple programming languages for your bioinformatics/data science work. Perhaps you love the tidyverse R ecosystem for data manipulation but prefer packages from a Python library like Scikit-learn. Or, as is becoming increasingly common, you're working on single-cell RNAseq analysis and you like normalization provided by Seurat, need to load data types only supported in the Python scverse, and want to use the Bioconductor SingleCellExperiment class to store your data.

If you've tried to use both Python and R in the same project, maybe you've already realized that Jupyter Notebook and JupyterLab support many kernel types, including R. By installing the R kernel in addition to the default IPython kernel you can use Jupyter for both R and Python notebooks, harmonizing your workflow across both languages. However, there is another consideration: having a reproducible environment.

Enter virtualenv and renv

Python virtual environments have been around for a very long time. They're a great way to make sure that your package versions for a given project are recorded, frozen, and don't conflict with the packages you've installed for other projects. The standard library module venv and the installable package virtualenv allow you to manage them. I typically use pyenv to install and manage Python versions, and it has an extension for managing virtual environments as well: pyenv-virtualenv. Confusingly, pyenv-virtualenv actually uses venv (mostly).

On the R side of things, it seems virtual environments, or project environments, haven't had as much focus. Generally, packages are very backwards compatible in R land. However, the package renv is getting more attention. I think it's great for the R community. It's long been commonplace to run sessionInfo() at the end of a notebook or script to make sure you know which packages are in use and their versions. Instead, renv allows creating a project-specific library and tracks versions of all packages used (though you should still run sessionInfo() in your notebooks for completeness, I think).

There are lots of guides for Python virtual environments, but fewer at this time for renv. It's pretty easy to start using though: install it with install.packages("renv"), restart R, initialize it in your project directory with renv::init(), install the packages you want, then update the lockfile with renv::snapshot(). There are some quirks to it, so I recommend perusing the docs.

The Goal: Use Both virtualenvs and renvs in Jupyter

I'm a big fan of JupyterLab, and I use it for most of my downstream analysis tasks in both R and Python. I use R more frequently despite liking Jupyter, which I guess makes me kind of weird. I'm not using RStudio (which is a great IDE for R too) because I want to use the same editor for notebooks in both languages. There are ways to use reproducible environments for both languages in Jupyter notebooks, so I thought "Why not find a configuration that allows the use of reproducible R and Python environments in the same project?"

This required a bit of tweaking, but I'm fairly happy with the result. To do this you need to be a bit adventurous, but I promise it's not that hard.

Global Configuration

This has worked well for me in Ubuntu 22.04 (bare metal as well as WSL2) and Manjaro Linux.

Install pyenv.
- Don't use your system python, especially on Linux, lots of the system can depend on it.
- Using the system python makes managing packages a nightmare.
Install pyenv-virtualenv.
Install the version of python you wish to use and set it as your global version.
- Make sure to set PYTHON_CONFIGURE_OPTS="--enable-shared" at a minimum (it is an environment variable, not a command-line flag).
- The full command with pyenv that I use at the time of writing this is (you can use another version if you wish): env PYTHON_CONFIGURE_OPTS="--enable-shared --enable-optimizations --with-lto" PYTHON_CFLAGS='-march=native -mtune=native' pyenv install 3.12.3
- pyenv global 3.12.3
Install jupyterlab and any other Python packages you want.
- pip install jupyterlab
Install R using rig
- rig is optional but I really like it.
- Installing R from the CRAN repository is also perfectly reasonable.
Install the R kernel for Jupyter.
- Follow the instructions in the R Kernel docs.
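If you want to confirm that the Python you built with pyenv actually picked up --enable-shared, a quick check from inside that interpreter can help. This is only a minimal sketch: build_summary is a hypothetical helper name, and it relies on the standard library sysconfig module reporting the Py_ENABLE_SHARED and CONFIG_ARGS build variables, which may be missing on some platforms.

```python
import sys
import sysconfig


def build_summary() -> dict:
    """Collect a few facts about how the running interpreter was built."""
    return {
        "version": sys.version.split()[0],
        "executable": sys.executable,
        # 1 when configured with --enable-shared, 0 otherwise
        # (None on platforms that do not define the variable)
        "enable_shared": sysconfig.get_config_var("Py_ENABLE_SHARED"),
        # The configure arguments recorded at build time, if available
        "config_args": sysconfig.get_config_var("CONFIG_ARGS"),
    }


if __name__ == "__main__":
    for key, value in build_summary().items():
        print(f"{key}: {value}")
```

Run it with the pyenv-managed python (not the system one); if enable_shared is not 1, the configure options probably did not reach the build.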

That gets you all the basics.

We are installing jupyter for the Global Python version. However, this global Python is not the system Python and is not used directly for any data science work, it's only for running jupyterlab.

Create Your Project Folder

I like to encapsulate each of my projects into a separate directory. This way, a series of computational notebooks that share a common theme can be tracked together with version control. Plus, a single environment can be used across notebooks used for a single project. This makes it easy to know which project notebooks belong to.

In this example I'm using polyglot_jupyter_example as the project name:

mkdir polyglot_jupyter_example
cd polyglot_jupyter_example

I then start jupyterlab in the project folder, or if you have a larger structure that encompasses many projects, in that higher-level directory:

jupyter lab

Set-up the Python virtualenv

Make sure you're in the project/github repo directory.
Set a version of python in the current project folder.
- pyenv local 3.12.3
Create a virtualenv, I just name them the same as the project.
- pyenv virtualenv polyglot_jupyter_example
Activate the virtualenv. You don't have to stay in the directory you made, but it keeps things simple.
- pyenv activate polyglot_jupyter_example
Install a minimal subset of packages needed for this example.
- pip install ipykernel pandas seaborn
Add the ipykernel you installed in the virtualenv to your jupyter that's outside the virtualenv.
- python -m ipykernel install --user --name=polyglot_jupyter_example
Exit the virtualenv
- pyenv deactivate
- Don't worry, jupyter will still know about it.
Start jupyter and you'll see that you can now create notebooks inside the virtualenv.
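Jupyter still knows about the virtualenv after you deactivate it because the kernel is registered as a small kernel.json file whose launch command typically points at the virtualenv's own python. The sketch below lists registered kernels and their launchers. list_kernels is a hypothetical helper, and the ~/.local/share/jupyter/kernels path is an assumption about where --user kernels land on Linux; `jupyter kernelspec list` will tell you the real locations.

```python
import json
from pathlib import Path


def list_kernels(kernels_dir: Path) -> dict:
    """Map each kernel name to the first element of its launch command."""
    kernels = {}
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        spec = json.loads(spec_file.read_text())
        # "argv" is the command Jupyter runs to start the kernel
        kernels[spec_file.parent.name] = spec["argv"][0]
    return kernels


if __name__ == "__main__":
    # Assumed default location for `--user` kernels on Linux
    user_kernels = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    if user_kernels.is_dir():
        for name, launcher in list_kernels(user_kernels).items():
            print(f"{name}: {launcher}")
```

If the launcher for polyglot_jupyter_example is a python inside your pyenv virtualenv, notebooks using that kernel will run in the project environment.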

Important Note: If you're used to installing pip packages, etc. within a notebook by ! pip install {package} you'll need to adjust your workflow. The shell that jupyter spawns does not know about your virtualenv. Just keep a terminal open outside the notebook.
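To see the mismatch for yourself, you can ask the kernel which interpreter it is running. This is a rough sketch to paste into a notebook cell. in_virtualenv and pip_install_command are hypothetical helper names, and the prefix comparison assumes a venv-style environment. The second helper only builds a pip command tied to the kernel's interpreter and does not run it; I'm including it as an illustration, not as a replacement for the terminal workflow above.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


def pip_install_command(package: str) -> list:
    """Build (but do not run) a pip command tied to this exact interpreter."""
    return [sys.executable, "-m", "pip", "install", package]


print("kernel interpreter:", sys.executable)
print("inside a virtualenv:", in_virtualenv())
```

Comparing that path with the output of `! which python` in the same notebook shows whether the spawned shell and the kernel agree.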

Set-up the Renv

This is somewhat easier, because R isn't controlling jupyter.

Inside the project directory, start R.
- R
Install renv.
- install.packages("renv")
Restart the R session.
Initialize an renv.
- renv::init(bare = TRUE)
- bare = TRUE keeps renv from parsing all text files in the project. If the project isn't a blank slate (you already have large notebooks and other files in it), that parsing can cause renv to hang.
- This creates a project-specific library, a .Rprofile, and renv.lock amongst other things.
Exit R, and configure Bioconductor if you use it.
- Setup information for the posit package manager mirror of bioconductor: here
Enter R again and install IRkernel and the data science stack.
- install.packages(c("tidyverse", "IRkernel"))
Make sure the R kernel is installed in jupyter outside the renv as well (as per the directions earlier).

As long as you start your jupyter notebooks in the top level of the project folder then R kernels will respect your Renv.

Project Structure

project_repo_dir
├── .python-version
├── .gitignore
├── .renvignore
├── .Rprofile
├── renv.lock
├── requirements.txt
├── {python_notebook_names}_py.ipynb
├── {python_notebook_outdirs}_py
│   ├── {python_notebook_names}_py.py
│   ├── {python_notebook_names}_py.html
│   └── {analysis_outputs}
├── {r_notebook_names}_r.ipynb
├── {r_notebook_outdirs}_r
│   ├── {r_notebook_names}_r.r
│   ├── {r_notebook_names}_r.html
│   └── {analysis_outputs}
└── README.md

There is quite a bit going on here so I will elaborate:

.python-version: Records the local version of python in the project.
.gitignore: Records any files that should be ignored by git.
.renvignore: Records any files that should not be parsed by renv.
.Rprofile: Project-specific configuration for R.
renv.lock: Where renv records the packages in the environment (including versions). Similar in concept to a Python requirements.txt.
requirements.txt: List of Python packages installed in the environment.
{python_notebook_names}_py.ipynb: Jupyter notebooks that use Python get named with "_py.ipynb" to make it clear they use Python.
{python_notebook_outdirs}_py: Output folders for Python notebooks, each Python notebook gets an output folder named the same as the parent notebook.
- Outputs from analysis go in here.
- I also like to include the notebook in plain text .py format so it can be run without Jupyter, and in .html format so it can be viewed by non-coders.
{r_notebook_names}_r.ipynb: Jupyter notebooks that use R get named with "_r.ipynb".
{r_notebook_outdirs}_r: The same scheme as {python_notebook_outdirs}_py.
README.md: Summary of the project.
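Since renv.lock is JSON and requirements.txt is plain text, it's easy to peek at both from one place. The following is a sketch under assumptions: renv_versions and pinned_requirements are hypothetical helpers, the lockfile is assumed to have a top-level "Packages" object with a "Version" per package, the requirements file is assumed to use name==version pins, and the package versions in the example strings are made up for illustration.

```python
import json


def renv_versions(lock_text: str) -> dict:
    """Return {package: version} from the text of an renv.lock file."""
    lock = json.loads(lock_text)
    return {name: rec.get("Version") for name, rec in lock.get("Packages", {}).items()}


def pinned_requirements(req_text: str) -> dict:
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for line in req_text.splitlines():
        # Drop comments and environment markers before looking for a pin
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


# Illustrative content only; these versions are not taken from the example repo
EXAMPLE_LOCK = '{"R": {"Version": "4.4.0"}, "Packages": {"renv": {"Package": "renv", "Version": "1.0.7"}}}'
EXAMPLE_REQS = "pandas==2.2.2\nseaborn==0.13.2  # plotting\n"

print(renv_versions(EXAMPLE_LOCK))
print(pinned_requirements(EXAMPLE_REQS))
```

In a real project you would pass the contents of renv.lock and requirements.txt read from the project folder instead of the example strings.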

The companion git repository is here and has examples of all these files, as well as two example notebooks. I recommend checking them out.

Notes on .gitignore and .renvignore

By default renv will parse all files in your project to determine which packages need to be tracked in the renv.lock. This can be problematic if you have large files, including notebooks in the project folder.

One way around this is to add large files to .gitignore. Renv respects that as a list of files and subdirectories not to parse. You'll notice renv already places a .gitignore in the ./renv/ subdirectory where it creates your project-specific library, so git doesn't track all of that. This probably isn't sufficient to avoid issues with it parsing notebooks, which you likely do want to track with git.

If you create an .renvignore in your project folder then Renv will use that instead of .gitignore. I have been naming notebooks I write python in with _py on the end so I can match them and output folders in .renvignore easily. This is kind of clunky though.

In light of this, I set up my .gitignore with typical settings for Jupyter notebooks like this: .gitignore, which ignores the .ipynb_checkpoints and .virtual_documents folders.

For .renvignore, I create one that ignores all python-based notebooks. You may want to tweak this to your liking: .renvignore. You can get around the behavior of renv trying to parse large files by forcing it to record all packages installed in an environment rather than just ones used and their dependencies. This is demonstrated in the R notebook example with renv::snapshot(type = "all").
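The naming convention is what makes the ignore rules simple: one glob per language. Here is a small sketch of that idea. split_by_language is a hypothetical helper, the two patterns are my assumption about what such rules could look like (they are not copied from the linked .renvignore), and Python's fnmatch is only an approximation of gitignore-style matching.

```python
from fnmatch import fnmatchcase


def split_by_language(filenames, python_pattern="*_py.ipynb", r_pattern="*_r.ipynb"):
    """Sort notebook file names into python / r / other using suffix globs."""
    groups = {"python": [], "r": [], "other": []}
    for name in filenames:
        if fnmatchcase(name, python_pattern):
            groups["python"].append(name)
        elif fnmatchcase(name, r_pattern):
            groups["r"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Hypothetical file names following the _py / _r convention
print(split_by_language(["qc_py.ipynb", "deseq2_r.ipynb", "README.md"]))
```

Anything that lands in the "python" group is what you would want renv to skip.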

Restoring an Environment

If you need to restore the environment at a later date, or on a new machine, you will need to enter the project folder, then set up a new virtualenv for Python, install the packages from the requirements.txt, and install the IPython kernel to your jupyter as you did during the initial set-up:

pyenv virtualenv {polyglot_project_name}  # This should already use the local python version from .python-version
pyenv activate {polyglot_project_name}
pip install -r requirements.txt
python -m ipykernel install --user --name={polyglot_project_name}

You'll then need to make sure you have R and renv installed before opening an R terminal and running:

renv::restore()

Notes on Alternative Setups

There is more than one way to do this. Poetry is one attractive option, but I decided against it as it's another dependency and this was already plenty complicated. I intended this as a minimal example of such a setup depending only on Python, R, virtualenvs (Python, through pyenv), and renv (R). Some will prefer to opt for conda environments instead, since there is some support for R. Anaconda is pretty heavy and I'm not a fan, but miniconda is certainly an option. Mamba is a great, much faster option for managing conda environments.

Then, there's the nuclear option of every environment being a standalone Docker or Podman container. This is attractive when you don't need to interact much with a host system, and therefore is a good fit for working in "the cloud." Of course, there are ways around that by mounting local storage inside your containers. You still need to document your environments to be able to recreate them.

Perhaps you vastly prefer R over Python, and would rather call Python from R using reticulate. Reticulate actually works with this set-up as well, but you do need to force it to use the virtual environment you created for your project with use_python().

If you have a strong preference for Python, then you might already use rpy2. This is similar in concept to reticulate, but I can't speak to how well it works with a setup like this, or if it would even be necessary. This does require you to call R from Python, rather than writing standalone R notebooks though.

There are probably improvements I'll add over time if I get around to it. However, this may give you some ideas of your own when it comes to living the polyglot data science/bioinformatics life.

Nextflow With First Class Metadata: A Minimal Example

2024-05-24T00:00:00-04:00

TLDR - Check out this github repo for the full example: https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example

I recently wrote an article regarding some of my opinions on bioinformatics workflow design. I've written workflows in several languages over the years, but at this point it seems that Nextflow has become something of the de facto industry standard. I thought it might make a nice example to show one of my recommendations in action for this commonly-used workflow language.

This is a deliberately simple example of an RNAseq workflow, and not really intended as an example of production-ready code. However, it will demonstrate one of the points that I wrote about in that article, First Class Metadata.

First Class Metadata

What I'm referring to as first class metadata is the concept that the important information about your data lives elsewhere in a simple format that can be easily parsed. The filenames themselves are not the ground truth for information about your data. Filenames are simply an identifier and a method of linking data to metadata. Take these hypothetical files as an example:

RNASEQ_cell_line_A_treated_20um_replicate_1_R1.fastq.gz
RNASEQ_cell_line_A_treated_20um_replicate_1_R2.fastq.gz
RNASEQ_cell_line_A_treated_20um_replicate_2_R1.fastq.gz
RNASEQ_cell_line_A_treated_20um_replicate_2_R2.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_1_R1.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_1_R2.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_2_R1.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_2_R2.fastq.gz

These are clearly from an RNAseq experiment. It even says so! What else do we know? There's other information, apparently. They're from a cell line, the highly specific "cell_line_A," some have been treated with something and we have a concentration (20 micromolar). We also have a replicate number (I hope in your real experiments you're doing more than one replicate...) and since these are paired end samples there is information on whether they are read 1 or 2 of the pair.

What would first class metadata mean? Here's an example, a simple .csv (or .tsv):

sample_id	experiment	cell_line	treatment	replicate	paired_status	read1	read2
treated_20_1	rnaseq	A	20_micromolar	1	paired_end	RNASEQ_cell_line_A_treated_20um_replicate_1_R1.fastq.gz	RNASEQ_cell_line_A_treated_20um_replicate_1_R2.fastq.gz
treated_20_2	rnaseq	A	20_micromolar	2	paired_end	RNASEQ_cell_line_A_treated_20um_replicate_2_R1.fastq.gz	RNASEQ_cell_line_A_treated_20um_replicate_2_R2.fastq.gz
untreated_1	rnaseq	A	NA	1	paired_end	RNASEQ_cell_line_A_untreated_replicate_1_R1.fastq.gz	RNASEQ_cell_line_A_untreated_replicate_1_R2.fastq.gz
untreated_2	rnaseq	A	NA	2	paired_end	RNASEQ_cell_line_A_untreated_replicate_2_R1.fastq.gz	RNASEQ_cell_line_A_untreated_replicate_2_R2.fastq.gz

What makes it "first class" is that this separate document, whether that's a tabular file or something more complex like a database, is the ultimate source of truth. The sample, rather than the file, is also the fundamental unit of observation in the table. This is apparent because both reads are listed for a single sample, rather than each on a separate line. This means that single-end and paired-end samples can coexist without the need to duplicate a lot of metadata on more lines. Some sequencing protocols also create a separate fastq for UMIs or barcodes; these could be included as an additional metadata field.

These metadata can then be parsed when executing a workflow as long as the files are referenced in the same sample sheet. Using this paradigm it's easier to use sample metadata in the course of your workflow execution. Perhaps you'll assign output file names based on the sample_id field above, or split samples into groups based on treatment. The possibilities are myriad!
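As a quick illustration of what "easily parsed" buys you, here is a sketch that reads a trimmed-down version of the sample sheet above and uses the metadata directly. The helper names (read_samplesheet, group_samples, output_name) are hypothetical, the sheet is assumed to be tab-separated, and the "_quant" suffix is just an example.

```python
import csv
import io
from collections import defaultdict

# A subset of the columns from the example sample sheet, tab-separated
SAMPLESHEET = "\n".join(
    "\t".join(fields)
    for fields in [
        ("sample_id", "treatment", "replicate", "read1", "read2"),
        ("treated_20_1", "20_micromolar", "1",
         "RNASEQ_cell_line_A_treated_20um_replicate_1_R1.fastq.gz",
         "RNASEQ_cell_line_A_treated_20um_replicate_1_R2.fastq.gz"),
        ("treated_20_2", "20_micromolar", "2",
         "RNASEQ_cell_line_A_treated_20um_replicate_2_R1.fastq.gz",
         "RNASEQ_cell_line_A_treated_20um_replicate_2_R2.fastq.gz"),
        ("untreated_1", "NA", "1",
         "RNASEQ_cell_line_A_untreated_replicate_1_R1.fastq.gz",
         "RNASEQ_cell_line_A_untreated_replicate_1_R2.fastq.gz"),
        ("untreated_2", "NA", "2",
         "RNASEQ_cell_line_A_untreated_replicate_2_R1.fastq.gz",
         "RNASEQ_cell_line_A_untreated_replicate_2_R2.fastq.gz"),
    ]
)


def read_samplesheet(handle, delimiter="\t"):
    """Return one dict per sample, keyed by the header row."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def group_samples(rows, field):
    """Group sample_ids by any metadata column, e.g. treatment."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row["sample_id"])
    return dict(groups)


def output_name(row, suffix):
    """Name outputs from sample_id rather than from the fastq file name."""
    return f"{row['sample_id']}{suffix}"


rows = read_samplesheet(io.StringIO(SAMPLESHEET))
print(group_samples(rows, "treatment"))
print(output_name(rows[0], "_quant"))
```

Nothing here ever parses a file name; the fastq names are only carried along as pointers to the data.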

How Does First Class Metadata Help My Nextflow Workflows?

If you've recorded your sample metadata in this fashion you can directly read it during workflow execution. This means you're no longer forced to create brittle code that makes assumptions about your samples based on file names. Some of the nf-core workflows make use of these ideas.

This isn't a full-blown Nextflow tutorial, but I will demonstrate this with a minimal example. Imagine you have a simple RNAseq workflow, just Fastqc and a pseudoaligner like Salmon. All your sample information is contained within a single .csv file. When you define your workflow in the .nf file you can create a channel that takes your metadata sheet as an input like so:

workflow {
    // Read samplesheet and convert to queueable channel of: sample id, paired_status, read1, read2 as a tuple
    reads_channel = channel.fromPath(params.samplesheet)
        | splitCsv(header:true)
        | map{
            // map{} applies a function to each element of a channel
            // In this case, the rows from splitCsv() are converted to a tuple based on the header 
            row -> tuple(row.sample_id, row.paired_status, file(params.input_dir + row.read1), file(params.input_dir + row.read2))
        }

    FASTQC(reads_channel)
    SALMON_QUANT(reads_channel)
}

map{} is an operator that applies a function to each element of a channel. In this case, the iterator returned by splitCsv (named row here for clarity) is converted to a tuple that contains sample information. This tuple is then used as input to FASTQC and SALMON_QUANT, the only two steps (processes in Nextflowese, which would be defined elsewhere in the .nf file) in the workflow.
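If Nextflow's channel syntax is unfamiliar, it may help to see the same row-to-tuple idea in plain Python. This is an analogy only, not Nextflow code: SampleReads and samples_from_sheet are hypothetical names, the sheet is assumed to be comma-separated with the header used above, and the file names in the example are made up.

```python
import csv
import io
from pathlib import Path
from typing import Iterator, NamedTuple


class SampleReads(NamedTuple):
    sample_id: str
    paired_status: str
    read1: Path
    read2: Path


def samples_from_sheet(handle, input_dir) -> Iterator[SampleReads]:
    """Yield one tuple per sample sheet row, like splitCsv followed by map{}."""
    base = Path(input_dir)
    for row in csv.DictReader(handle):
        yield SampleReads(
            row["sample_id"],
            row["paired_status"],
            base / row["read1"],  # Path joining handles the separator for us
            base / row["read2"],
        )


# Hypothetical two-line sample sheet
sheet = io.StringIO(
    "sample_id,paired_status,read1,read2\n"
    "untreated_1,paired_end,a_R1.fastq.gz,a_R2.fastq.gz\n"
)
for sample in samples_from_sheet(sheet, "data"):
    print(sample.sample_id, sample.read1, sample.read2)
```

Each yielded tuple plays the role of one element in reads_channel, which every downstream step receives unchanged.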

The idea is you would use your main metadata database as the sample sheet, just filtered for the samples you want to analyze.

The full example can be found here: https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example. I have included lots of comments to help you get started writing Nextflow, as well as ways to procure a small test dataset. The test dataset and workflow run in just over a minute on my laptop, excluding pulling containers, so lack of a cluster or cloud compute environment shouldn't stand in your way for testing.

Conclusion

This very simple example highlights a strategy I use when writing workflows, and currently that means Nextflow. These ideas are transferable to other workflow languages as well. Personally, I am a fan of Snakemake, as it was the first workflow language I learned, and it's implemented in Python. However, Nextflow has become something of an industry standard, while Snakemake is more common in academia. Seqera Labs also supports Nextflow with paid products like the Seqera Platform (formerly known as Nextflow Tower), and other bioinformatics cloud platforms increasingly support Nextflow workflows. At the end of the day, a workflow language is a tool to help enable reproducible analysis of your data, and we should be careful to use it in a way that maximizes that reproducibility.

If you're interested in more advanced and/or more comprehensive Nextflow material I can recommend the docs at https://www.nextflow.io/docs/latest/index.html and the Nextflow training material at https://training.nextflow.io/. There is also a lot of content you can find trawling around GitHub.

Bridging the Gap With Wet Lab Using R Shiny

2024-05-04T00:00:00-04:00

How do you communicate results of an analysis? What tools do you use? Scientists that work in the wet lab are accustomed to firing up excel or some instrument-specific software and working with their own data. For genomics or other types of experiments in biology that result in large datasets this approach is problematic and bioinformaticians have other tools to deal with our data. Often, this involves working with large data in a programmatic way, and the two languages in common usage are Python (usually including its data science stack pandas et al.) and R (usually including the tidyverse, bioconductor, and friends).

There is obvious friction here. Bioinformatics scientists have a toolkit that works great for us, but is foreign to a lot of the wet lab scientists around us. How can we bridge this gap?

A Common Example

Your wet lab colleagues want to determine which genes are affected by treatment with a compound, or test the effect of a mutation in a cell line/plant/mouse on gene expression, etc. These situations are common applications for RNAseq. Typically, they also involve a simple experimental design; a comparison of two sample groups (treatment or mutant vs control).

The bioinformatician performs their analysis using a workflow resulting in fold changes, adjusted p-values, etc. in a table. They create a visualization to help summarize the results to their colleagues, perhaps a volcano plot. Then ensues a back and forth collaboration, resulting in requests to modify visualizations, look at specific lists of genes, and more. While this process is rewarding, and can result in a fruitful experiment, it can also be very inefficient. It can also result in frustration on the part of both the bioinformatician and wet-lab scientist. The bioinformatician because they want to be the most helpful, but the wet lab scientist isn't experienced in exploring their data with the same tools. The wet lab scientist wishes they could be more independent. This back and forth can take a lot of time.

There are ways around this, graphical platforms that enable low/no-code ways to analyze NGS or other big data. These are typically commercial products with a lot of functionality, and despite being low/no-code there is still a learning curve. Often the wet lab scientists want something simple, a way to explore an already analyzed dataset supplied to them by their collaborating bioinformatics scientist.

The Shiny Framework

Shiny is a framework for creating web applications quickly, with minimal code required for an interactive app. Originally only for R, it now supports Python as well. We'll be using it with R in this example. There are other options (such as Dash) so if you'd rather use those, knock yourself out. I find Shiny's R syntax to be relatively easy to work with, and very quick to learn.

What if we could create, in a matter of a few hours (depending on your experience level), an application that runs in a web browser that enables our wet lab friends to explore their analyzed datasets? It's not that hard. I promise. If you have written a few functions and made some plots in R, Python, or some other programming language then you know most of what you need to get started.

Our Goal

For this example, we will create a Shiny application which generates volcano plots from DESeq2 results. This example app will have the following features:

Load results tables via browsing the local system.
Generate a volcano plot.
Display and allow searching through the results table.
Allow changing of differential expression thresholds.
Update the visualization and table according to differential expression thresholds selected by users.
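The thresholding behind the last two features is simple enough to sketch in a few lines. The version below is in Python purely for illustration (the app itself is written in R); classify_gene is a hypothetical name, and the rules mirror the volcplot function later in this post: a missing padj is treated as 1, a missing log2 fold change as 0, and the fold-change threshold is applied on the log2 scale in both directions.

```python
import math
from collections import Counter


def classify_gene(log2fc, padj, padj_threshold=0.05, fc=1.0):
    """Label one gene as 'up', 'down', or 'ns' (not significant)."""
    # Treat missing values conservatively
    if padj is None or math.isnan(padj):
        padj = 1.0
    if log2fc is None or math.isnan(log2fc):
        log2fc = 0.0
    cutoff = math.log2(fc)
    if padj <= padj_threshold:
        if log2fc >= cutoff:
            return "up"
        if log2fc <= -cutoff:
            return "down"
    return "ns"


# Hypothetical (log2FoldChange, padj) pairs
genes = [(2.5, 0.001), (-2.5, 0.001), (0.5, 0.001), (3.0, None)]
print(Counter(classify_gene(lfc, p, padj_threshold=0.1, fc=2) for lfc, p in genes))
```

When a user changes either threshold in the app, this classification is what gets recomputed before the plot and table are redrawn.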

We'll use the pasilla package from bioconductor for our test data, and the volcano plot code I've used as previous examples as a starting point.

Set-up

You'll need R installed. After that, you'll need some packages to follow along. You can get them as follows. Open the R terminal and:

# If you don't have bioconductor
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.19")

# BiocManager::install() will also install packages from CRAN so they're also in this list
BiocManager::install(c(
    "DESeq2",
    "pasilla",
    "apeglm",  # Needed for lfcShrink(type = "apeglm") in DESeq2
    "tidyverse",
    "shiny",
    "bslib",  # Loaded by app.R
    "DT",
    "ggrepel",  # Loaded by app.R
    "DescTools"
))

# I like to use renv to manage project-specific libraries, this is optional
install.packages("renv")

Note: I won't detail the use of renv throughout this guide, but I like the package and you can and should read up on the documentation. It's kind of like virtual environments for R. Especially useful for a project like this or if you're working with multiple developers.

You'll now want to create a folder to work in, you'll likely want to put it on github or another version control system. In your shell:

mkdir ~/Development/volcano_rnaseq_shiny_example

Then, go into that directory and create a subfolder called "shiny" and a single .R file called "app.R".

cd ~/Development/volcano_rnaseq_shiny_example
mkdir shiny
touch shiny/app.R

Prepare the Differential Expression Data

The test data for this app is supplied in the aforementioned github repo, but for the sake of completeness, this is how it's generated.

First, load and reformat the pasilla data so it can be used for differential expression in DESeq2. It's supplied as a counts matrix and metadata, but they need some reformatting:

# Load pasilla, and I always use the tidyverse
library(pasilla)
library(tidyverse)

# First load the counts table and then the metadata
counts_table <- system.file("extdata/pasilla_gene_counts.tsv", package = "pasilla") %>% read_tsv()

metadata_table <- system.file("extdata", "pasilla_sample_annotation.csv", package = "pasilla") %>% read_csv()

In order for this to work with DESeq2 the column names for the samples (aside from the gene IDs) in the counts_table must match row names (or a column that can be converted to row names) in the metadata_table.

counts_table %>% head()
# A tibble: 6 × 8
  gene_id untreated1 untreated2 untreated3 untreated4 treated1 treated2 treated3
  <chr>        <dbl>      <dbl>      <dbl>      <dbl>    <dbl>    <dbl>    <dbl>
1 FBgn00…          0          0          0          0        0        0        1
2 FBgn00…         92        161         76         70      140       88       70
3 FBgn00…          5          1          0          0        4        0        0
4 FBgn00…          0          2          1          2        1        0        0
5 FBgn00…       4664       8714       3564       3150     6205     3072     3334
6 FBgn00…        583        761        245        310      722      299      308

metadata_table %>% head()
# A tibble: 6 × 6
  file    condition type  `number of lanes` total number of read…¹ `exon counts`
  <chr>   <chr>     <chr>             <dbl> <chr>                          <dbl>
1 treate… treated   sing…                 5 35158667                    15679615
2 treate… treated   pair…                 2 12242535 (x2)               15620018
3 treate… treated   pair…                 2 12443664 (x2)               12733865
4 untrea… untreated sing…                 2 17812866                    14924838
5 untrea… untreated sing…                 6 34284521                    20764558
6 untrea… untreated pair…                 2 10542625 (x2)               10283129
# ℹ abbreviated name: ¹`total number of reads`

metadata_table$file
[1] "treated1fb"   "treated2fb"   "treated3fb"   "untreated1fb" "untreated2fb"
[6] "untreated3fb" "untreated4fb"

metadata_table <- metadata_table %>% select(file, condition) %>%
    mutate(file = str_remove(file, "fb"))

metadata_table
# A tibble: 7 × 2
  file       condition
  <chr>      <chr>    
1 treated1   treated  
2 treated2   treated  
3 treated3   treated  
4 untreated1 untreated
5 untreated2 untreated
6 untreated3 untreated
7 untreated4 untreated

# Order the samples correctly
counts_table <- counts_table %>% select(gene_id, metadata_table$file)
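The requirement being satisfied here (every sample column in the counts table has exactly one matching metadata row, in the same order) is worth checking explicitly. Here is a language-agnostic sketch of that check, written in Python. strip_suffix and reorder_count_columns are hypothetical helpers; the column names and sample IDs are the ones printed above.

```python
def strip_suffix(name, suffix="fb"):
    """Drop a trailing suffix such as the 'fb' on the pasilla file names."""
    return name[: -len(suffix)] if name.endswith(suffix) else name


def reorder_count_columns(header, sample_ids, id_column="gene_id"):
    """Return the counts columns in metadata order, or fail loudly on a mismatch."""
    samples_in_counts = [col for col in header if col != id_column]
    missing = [s for s in sample_ids if s not in samples_in_counts]
    extra = [c for c in samples_in_counts if c not in sample_ids]
    if missing or extra:
        raise ValueError(f"missing from counts: {missing}; missing from metadata: {extra}")
    return [id_column] + list(sample_ids)


counts_header = ["gene_id", "untreated1", "untreated2", "untreated3", "untreated4",
                 "treated1", "treated2", "treated3"]
metadata_files = ["treated1fb", "treated2fb", "treated3fb", "untreated1fb",
                  "untreated2fb", "untreated3fb", "untreated4fb"]

sample_ids = [strip_suffix(name) for name in metadata_files]
print(reorder_count_columns(counts_header, sample_ids))
```

Failing loudly on a mismatch is the point: a silently misordered counts matrix would assign counts to the wrong condition.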

Welcome to the wild and wonderful world of data cleaning. How's it feel to be a computer janitor? I still do this kind of stuff more than any of the fancy analysis methods I've learned. Data are messy.

Now, you can generate some results based on treated vs untreated conditions:

library(DESeq2)

dds <- DESeqDataSetFromMatrix(
    countData = column_to_rownames(counts_table, "gene_id"),
    colData = column_to_rownames(metadata_table, "file"),
    design = ~condition
)

dds <- DESeq(dds)
results <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm")
results <- results %>% as.data.frame() %>% rownames_to_column("gene_id")

results %>% head()
      gene_id     baseMean log2FoldChange     lfcSE    pvalue      padj
1 FBgn0000003    0.1715687    0.006979656 0.2057852 0.7874583        NA
2 FBgn0000008   95.1440790    0.001115354 0.1517065 0.9923316 0.9969282
3 FBgn0000014    1.0565722   -0.004634136 0.2048948 0.8181371        NA
4 FBgn0000015    0.8467233   -0.018148393 0.2061771 0.3714205        NA
5 FBgn0000017 4352.5928988   -0.191126743 0.1201758 0.0568330 0.2823626
6 FBgn0000018  418.6149305   -0.070043056 0.1236900 0.4797142 0.8239063

results %>% write_csv("pasilla_results.csv")

Now you can see how many genes are differentially expressed (at BH-adjusted p < 0.1):

results %>% filter(padj < 0.1) %>% nrow()
[1] 1061

Let's Build the Shiny App

For the ease of following along, I'm putting this shiny app up on my github: volcano_rnaseq_shiny_example. You can find a working example there.

To create it yourself, open the app.R we created earlier in your favorite text/code editor and add the following:

library(shiny)
library(bslib)
library(DT)
library(tidyverse)
library(ggrepel)
library(DescTools)


# Volcano plot code based on https://github.com/groverj3/genomics_visualizations/blob/master/volcano_plotteR.r
volcplot <- function(data, padj_threshold = 0.05, fc = 1, plot_title = 'Volcano Plot', plot_subtitle = NULL) {

  # Set the fold-change thresholds
  neg_log2fc <- -log2(fc)
  pos_log2fc <- log2(fc)

  # Make a dataset for plotting, add the status as a new column
  plot_ready_data <- data %>%
    mutate_at('padj', ~replace(.x, is.na(.x), 1)) %>%
    mutate_at("padj", ~replace(.x, .x == 0, .Machine$double.xmin)) %>%  # When p values are zero, they're actually below the lowest value R can display
    mutate_at('log2FoldChange', ~replace(.x, is.na(.x), 0)) %>%
    mutate(
      log2fc_threshold = ifelse(log2FoldChange >= pos_log2fc & padj <= padj_threshold, 'up',
                         ifelse(log2FoldChange <= neg_log2fc & padj <= padj_threshold, 'down', 'ns')
        )
    )

  # Get the number of up, down, and unchanged genes
  up_genes <- plot_ready_data %>% filter(log2fc_threshold == 'up') %>% nrow()
  down_genes <- plot_ready_data %>% filter(log2fc_threshold == 'down') %>% nrow()
  unchanged_genes <- plot_ready_data %>% filter(log2fc_threshold == 'ns') %>% nrow()

  # Make the labels for the legend
  legend_labels <- c(
      str_c('Up: ', up_genes),
      str_c('NS: ', unchanged_genes),
      str_c('Down: ', down_genes)
  )

  # Set the x axis limits, rounded to the next even number
  x_axis_limits <- DescTools::RoundTo(
    max(abs(plot_ready_data$log2FoldChange)),
    2,
    ceiling
  )

  # Set the plot colors
  plot_colors <- c(
      'up' = 'firebrick1',
      'ns' = 'gray',
      'down' = 'dodgerblue1'
  )


  # Make the plot, these options are a reasonable starting point
  plot <- ggplot(plot_ready_data) +
    geom_point(
      alpha = 0.25,
      size = 1.5
    ) +
    aes(
      x = log2FoldChange,
      y = -log10(padj),
      color = log2fc_threshold,
      label = gene_id
    ) +
    geom_vline(
      xintercept = c(neg_log2fc, pos_log2fc),
      linetype = 'dashed'
    ) +
    geom_hline(
      yintercept = -log10(padj_threshold),
      linetype = 'dashed'
    ) +
    scale_x_continuous(
      'log2(FC)',
      limits = c(-x_axis_limits, x_axis_limits)
    ) +
    scale_color_manual(
      values = plot_colors,
      labels = legend_labels
      ) +
    labs(
      color = str_c(fc, '-fold, padj ≤ ', padj_threshold),
      title = plot_title,
      subtitle = plot_subtitle
    ) +
    theme_bw(base_size = 24) +
    theme(
      aspect.ratio = 1,
      axis.text = element_text(color = 'black'),
      legend.margin = margin(0, 0, 0, 0),
      legend.box.margin = margin(0, 0, 0, 0),  # Reduces dead area around legend
      legend.spacing.x = unit(0.2, 'cm')
    )
  plot
}


# Define UI
ui <- page_sidebar(
  title = "Volcano PlotteR",
  sidebar = sidebar(
    fileInput("deseq2_results", "DESeq2 Results Table"),
    numericInput("foldchange_threshold", "Fold Change Threshold", value = 1),
    numericInput("padj_threshold", "Adjusted p-value Threshold", value = 0.1),
    textInput("plot_title", "Plot Title", value = "Volcano Plot"),
    textInput("plot_subtitle", "Plot Subtitle", value = NULL)
  ),
  card(
    plotOutput("volcano_plot"),
    min_height = 580  # Ensures you don't have to scroll within this card
  ),
  DTOutput("deseq2_table")
)


# Server function
server <- function(input, output) {
  options(shiny.maxRequestSize=30*1024^2)

  deseq2_results <- reactive({
    req(input$deseq2_results)
    read_csv(input$deseq2_results$datapath)
  })

  deseq2_results_filtered <- reactive({
    req(deseq2_results())
    deseq2_results() %>%
      filter(2^abs(log2FoldChange) >= input$foldchange_threshold & padj <= input$padj_threshold)
  })

  output$deseq2_table <- renderDT(
    deseq2_results_filtered()
  )

  output$volcano_plot <- renderPlot({
    req(deseq2_results())
    deseq2_results() %>%
      volcplot(
        padj_threshold = input$padj_threshold,
        fc = input$foldchange_threshold,
        plot_title = input$plot_title,
        plot_subtitle = input$plot_subtitle
      )
    },
    height = 550
  )
}

# Run the application
shinyApp(ui = ui, server = server)

There's a bit to unpack here. But for now, just run it by entering the volcano_rnaseq_shiny_example project directory, starting the R interpreter with R, and running:

shiny::runApp("shiny")

"shiny" within runApp matches the name of the subfolder containing our app.R. If everything works as expected (fingers crossed!) you'll be presented with something like the following in your R terminal:

Listening on http://127.0.0.1:3698

If you navigate to the IP address and port in your favorite browser you should see the fruits of your labors. After clicking the browse button you can load the pasilla_results.csv you created earlier:

That looks great! Now try changing the controls on the left. You'll see that the plot and table react in real time!

Explanation

If you ignore the volcano plot code, which is mostly the same (with some changes and simplifications) as in my explanation in a previous post, you're left with only ~50 lines of code. That's really not much to get an interactive web app.

The main logic of the app is broken up into two parts, the ui (defined by layouts and content) and the server() function. The inputs and outputs in the server function map to the names of the inputs and outputs from the ui. You can have different types of outputs (plots, tables, etc.) in sections of the UI. In this simple example we use library(DT) and the DTOutput() function to display dataframes, with the JavaScript DataTables library as a backend. Likewise our volcano plot code uses ggplot2 and the plotOutput() function displays it. At the end of the code, we run shinyApp() with our ui and server to make it all happen.

There are a few other things of note going on here. We've wrapped loading and filtering of our results table in reactive(), and the plotting in renderPlot(). These do what you'd expect: they make the data table and plot react to changes in their inputs. So when you change the data that was loaded in, or any of the controls that map to filtering criteria, etc. the elements are regenerated. The req() calls in there hold things up until the value they need (here, the uploaded dataset) is actually available, so nothing tries to run before a file has been loaded. Hopefully that makes sense. For a simple example, I think that explanation suffices.
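
If the idea of reactivity is still fuzzy, here is a toy sketch of it in Python. To be clear, this is not how Shiny is implemented (Shiny tracks dependencies per input, this toy just invalidates on any change), and the Inputs and Reactive classes are names I made up for illustration. The filter itself mirrors the one used for the table in the app.

```python
class Inputs:
    """Holds input values and bumps a version counter on every change."""

    def __init__(self, **values):
        self._values = dict(values)
        self.version = 0

    def get(self, name):
        return self._values[name]

    def set(self, name, value):
        if self._values.get(name) != value:
            self._values[name] = value
            self.version += 1


class Reactive:
    """Caches the result of func and recomputes only after an input changes."""

    def __init__(self, inputs, func):
        self._inputs = inputs
        self._func = func
        self._seen_version = None
        self._cached = None
        self.runs = 0  # How many times func has actually been executed

    def __call__(self):
        if self._seen_version != self._inputs.version:
            self._cached = self._func(self._inputs)
            self._seen_version = self._inputs.version
            self.runs += 1
        return self._cached


# Made-up results rows, standing in for the DESeq2 table
rows = [
    {"gene_id": "a", "log2FoldChange": 2.0, "padj": 0.01},
    {"gene_id": "b", "log2FoldChange": -1.5, "padj": 0.04},
    {"gene_id": "c", "log2FoldChange": 0.2, "padj": 0.5},
]


def filter_rows(inputs):
    # Same test as the table filter: 2^|log2FC| >= fold change and padj <= threshold
    fc = inputs.get("foldchange_threshold")
    padj = inputs.get("padj_threshold")
    return [r["gene_id"] for r in rows
            if 2 ** abs(r["log2FoldChange"]) >= fc and r["padj"] <= padj]


inputs = Inputs(foldchange_threshold=1, padj_threshold=0.1)
filtered = Reactive(inputs, filter_rows)

first = filtered()                   # Computed on first use
again = filtered()                   # Nothing changed, served from the cache
inputs.set("padj_threshold", 0.02)   # Like moving a control in the sidebar
updated = filtered()                 # Input changed, so it is recomputed
print(first, updated, filtered.runs)
```

The point is only that downstream elements are regenerated when something they depend on changes, and otherwise reuse what they already have.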

All of these packages and functions have great documentation that goes far beyond what I've written here, so I recommend reading it. You can add a lot more functionality without too much trouble, this is just a simple example.

What Can You DO With This?

Imagine you're in a meeting and you're having that back and forth with the wet lab scientists I talked about earlier. Now, you can pull out your shiny app and use that as a tool to filter data, generate visualizations, and save the output on the fly. Even better, if you get really ambitious you can containerize it, serve it on your LAN, and let anyone use it!

I suspect the bench scientists will be happy because they can filter, visualize, and do whatever else you've built for them. You'll be happy because your meetings can be more productive and your colleagues can generate more insights on their own and bring those to you for in-depth analysis.

The more I think about what the optimal split of duties for a genomics project should be, the more I think we should be developing simple tools like this. Small interactive apps like this allow bioinformatics staff to focus on solving hard problems, making sure data is processed consistently, figuring out how to apply novel methods to large datasets, etc. The other stakeholders who help to generate the data can be empowered to explore that data without the burden of knowing how to process it from raw files, but still get to have an active role in generating insights.

Shiny and similar frameworks have relatively easy syntax to learn when getting started if you're already familiar with R or python. While there are certainly commercial products that have functionality far surpassing these small apps, if you're looking for a simple tool to help bridge the gap between wet and dry lab scientists this may fit the bill at $0, aside from your labor :).

Reference

Huber W, Reyes A (2024). pasilla: Data package with per-exon and per-gene read counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome Research 2011. R package version 1.32.0.

On Bioinformatics Workflow Design

2024-04-26T00:00:00-04:00

Since I was in grad school I've been writing bioinformatics workflows. Usually to process NGS data. The concept of a workflow is simple, and not limited to the domain of bioinformatics. However, a workflow (aka "pipeline") used to analyze data from next generation sequencing (again, will it ever be "current gen?") certainly falls under this banner.

Over the past few years I've become more opinionated on how bioinformatics workflows should be designed. First, we should have a basic understanding of what a bioinformatics workflow is and what they look like.

What Is a Bioinformatics Workflow?

If you work in bioinformatics and/or computational biology (really, most fields of science that utilize computational resources on a medium-large scale) you've probably written a workflow. I would define such a process as a series of programs, usually (but not always) operating sequentially on the output of the previous tool in the series. A typical workflow for the analysis of RNAseq data might look like this:

QC of raw .fastq data (fastqc)
Trimming of reads to remove adapters and low quality basecalls (trim_galore)
Alignment of cleaned reads against a reference genome (STAR)
Sorting and indexing the alignment .bam file (samtools)
Post-alignment QC (picard)
Read per gene counting (featurecounts)
Log file aggregation (multiqc)
Differential expression analysis (DESeq2)

This example is pretty straightforward, and is not by any means the entirety of what one may do in the course of RNAseq analysis (for one, STAR can generate the counts tables directly with no need for featurecounts or other tools). It does demonstrate the concept though. In most cases, each tool runs on the output file of the previous step, with some exceptions (fastqc and trim_galore, for example, both operate on the raw data). More complex workflows may feature many steps which can run in parallel because their outputs are not required as inputs until reaching a later step.
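
One way to picture this is as a dependency graph. Here's a minimal sketch using graphlib from the Python standard library (3.9+). The dependencies are my reading of the example list above, and which logs multiqc collects is an assumption on my part. The helper parallel_batches is a made-up name, not part of any workflow framework.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
# fastqc and trim_galore both read the raw data, so they depend on nothing here.
steps = {
    "fastqc": set(),
    "trim_galore": set(),
    "STAR": {"trim_galore"},
    "samtools": {"STAR"},
    "picard": {"samtools"},
    "featurecounts": {"samtools"},
    "multiqc": {"fastqc", "trim_galore", "STAR", "picard", "featurecounts"},
    "DESeq2": {"featurecounts"},
}


def parallel_batches(graph):
    """Group steps into batches where every step in a batch could run at once."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # Steps whose inputs all exist now
        batches.append(ready)
        sorter.done(*ready)                 # Mark them finished, unlocking later steps
    return batches


batches = parallel_batches(steps)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Any batch with more than one step in it is a place where a workflow executor could run things side by side.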

Common Workflow Frameworks In Bioinformatics

If you've ever written a series of scripts in shell, python, R, or any other language, that uses sequential processes on one or more input files then you already have a workflow. Most people in bioinformatics that process NGS data begin writing BASH scripts as both a way to enable hands-off running of workflow steps, and a way to record what work was performed.

There are many workflow management frameworks in common usage within bioinformatics today. These include (this list is not exhaustive):

Note: I am fully aware that not all of these aspire to also encompass execution, choosing to leave that to separate tools (CWL, WDL). However, for the purpose of this discussion lumping them together with Nextflow and Snakemake makes sense logically as workflow languages. I will probably catch flak for this.

This article is not a deep dive on each framework's pros and cons or my opinions on them.

The "Why" and "When" Of Workflow Automation

Workflow management frameworks do a lot of things for you, but all have a learning curve. They each have unique syntax, and their own common workflow design idioms. However, if you use a framework like these you'll be able to write once, run anywhere (in theory). Run locally, fine. Run on an HPC, great. Run on a public cloud (AWS, GCP, Azure), cool. Run on a kubernetes cluster, probably fine. You get the idea.

These workflow languages and the executors that run them allow efficient resource usage: if you define the resources required for each step, the executor will determine how many of each job can run in parallel. They also excel at scattering jobs across compute nodes, which is especially important in an HPC or cloud compute context. Importantly, they (can) enable greater reproducibility. They integrate with container runtimes, allowing you to use Docker, Podman, Apptainer, et al. for each piece of software in a workflow and guarantee a specific version of each tool is used. This also eases deployment to HPCs or cloud compute.
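
The resource arithmetic is simple at its core. As a rough sketch (real executors are far more sophisticated, and the function name and the numbers below are made up for illustration):

```python
def max_concurrent_jobs(node_cpus, node_mem_gb, job_cpus, job_mem_gb):
    """How many identical jobs fit on one node at the same time.

    Whichever resource runs out first (CPUs or memory) sets the limit.
    """
    return min(node_cpus // job_cpus, node_mem_gb // job_mem_gb)


# Hypothetical node with 64 CPUs and 256 GB RAM, jobs asking for 8 CPUs and 40 GB each
concurrent = max_concurrent_jobs(64, 256, 8, 40)
print(concurrent)
```

Here memory is the bottleneck: eight jobs would fit by CPU count, but only six fit in RAM.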

In my opinion, you should look to workflow management frameworks when you want:

Reproducibility
- You can rerun an analysis and get reliable, and comparable, results.
Automation
- No need to write scripts for each step, no need for complex scripting to handle resource management.
Portability
- Run the same process agnostic of underlying hardware (local, HPC, cloud, etc.).
Harmonization
- You should have no question about whether data from similar experiments, analyzed with the same workflow, are comparable.
Efficiency
- You are going to run a process many times and you want to reduce execution time and/or costs.

I wouldn't bother using workflow management frameworks for:

Prototyping
- Write the workflow when you're done prototyping. Scripts are well-suited for this and it's easier to write workflows when you have well-written scripts to start with.
One-off analyses
- Is it really worth it to spend your time on this instead of directly answering your research question? BASH scripts are still a thing.
Bespoke visualization and statistical analysis
- This is difficult to standardize and requires careful consideration of data distributions. Consider computational notebooks instead.

Some Bioinformatics Workflow Anti-Patterns

Now I'm going to get controversial.

I have seen these in some high-profile implementations.

God Workflows
- A workflow that actually is for the processing of multiple kinds of -omics data.
- A workflow that includes specialized processing of the same kind of data in multiple unrelated ways.
- I have seen these as the buzzword "multi-omics" has become more prevalent.
- Similar in concept to god functions or god objects.
Filename Implicit Metadata
- Who isn't guilty of using filenames to store metadata? I am. sampleID_treatment_replicate.fastq.gz
- It's not a problem to store metadata in filenames, per se. It's a problem to depend on filenames as the ground truth source of file information.
Not Actually Automated
- Workflows which expose far too many options to actually be that useful as automation.
- Do you really need to allow the changing of every option in your steps?
- This is especially problematic when there are options that are not suitable for a given workflow and should never be enabled/disabled.
Forgetting To Make Design Decisions
- In the name of providing options to users, you may be tempted to allow multiple programs in different steps.
- "You can use STAR or HISAT2 by running with --aligner STAR or --aligner HISAT2."
- Now, saying "I ran {workflow_name_here} from the {fancy_project_name} github" is not descriptive enough to actually inform people what was run.
- Yes, I am aware log files still exist, but you shouldn't need to look at a log file to at least know the basic steps that were executed on some data.
Unnecessary Complexity
- "Real programmers separate all their functions and classes into modules, so each step of my 10 step workflow is in a separate file."
- Breaking up workflows can be done for good reasons, but this isn't one of them.

Good Practices To Use Instead

To avoid these problems, in order:

Embrace Modular Workflows
- A workflow processes one kind of data.
- A workflow outputs data for one purpose (not necessarily only one kind of experiment, since the same outputs may be useful for more than one type of experiment).
- If you need to integrate multiple -omics types consider higher order workflows, like higher-order functions or classes (depending on your preference for functional or object oriented programming).
- Higher order, or nested workflows, allow the execution of multiple sub workflows which still may be executed just as well on their own.
First Class Metadata
- The ground truth for sample information, the metadata, lives elsewhere from the filenames.
- Link the metadata to the files. Using the filenames and md5 hashes in a separate database is a good option.
- At a minimum, a low-effort way to solve this just involves a .csv file that has columns for filenames in addition to other sample metadata.
- Use metadata sheets when executing workflows to read sample names and important information related to options that need changing for specific samples (paired end read status, etc.)
Automate, Automate, Automate!
- If there are options that should not be changed do not allow users to change them.
- Every "option" should have a default that you have chosen for good reason.
- You should very rarely need to specify an option at execution to successfully and correctly run your workflow.
- Remember that one of your reasons behind writing a workflow is to automate the thing beyond writing individual scripts.
Own Your Design Decisions
- Every tool for each step of your analysis was chosen for a reason, let your users know that.
- There are exceptions to every rule, but the rule should be one step, one tool.
- If someone thinks their favorite tool is better than the one you picked they can write their own workflow or use a different one.
- It should be clear what happened to the data when you explain "I ran {workflow_name_here} from the {fancy_project_name} github." Yes, log files are still important.
Keep It Simple
- If you have a somewhat small workflow you don't need separate modules for every step, it's actually less readable.
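
For the "First Class Metadata" point, the low-effort .csv version might look something like this sketch. The column names (sample_id, treatment, replicate, paired_end, filename, md5) and the function names are my own assumptions for illustration, not a standard. The demo writes a throwaway file so it can run anywhere.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def md5sum(path):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_samples(sheet_path, data_dir):
    """Read the metadata sheet and check every listed file against its md5."""
    samples = []
    with open(sheet_path, newline="") as handle:
        for row in csv.DictReader(handle):
            fastq = Path(data_dir) / row["filename"]
            if md5sum(fastq) != row["md5"]:
                raise ValueError(f"md5 mismatch for {fastq}")
            # Per-sample options come from the sheet, not from the filename
            row["paired_end"] = row["paired_end"].lower() == "true"
            samples.append(row)
    return samples


with tempfile.TemporaryDirectory() as tmp:
    # The filename carries no metadata at all, the sheet is the ground truth
    fastq = Path(tmp) / "run1.fastq.gz"
    fastq.write_bytes(b"not a real fastq")

    sheet = Path(tmp) / "samples.csv"
    with open(sheet, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_id", "treatment", "replicate",
                         "paired_end", "filename", "md5"])
        writer.writerow(["S1", "knockdown", "1", "false",
                         fastq.name, md5sum(fastq)])

    samples = load_samples(sheet, tmp)

print(samples[0]["sample_id"], samples[0]["treatment"], samples[0]["paired_end"])
```

If someone renames a file, the sheet needs updating, but nothing about the sample's identity is lost. If someone swaps a file's contents, the md5 check catches it.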

In Conclusion

No code today, just thoughts. Maybe you've thought these things but didn't put them into writing. Maybe you're just coming around to the idea of workflow automation. Maybe you think I'm wrong (in this case, please don't email me; I already get too much email and I'll just delete it anyway). However, I think by keeping a few things in mind you can really improve both the readability of your workflows and their usefulness.

Thanks for coming to my TED Talk/giant wall o' text. If you like these thoughts I accept payment in the form of cookies, peanut m&ms, millions of dollars in cash by the briefcase, and mysterious wire transfers in amounts large enough to pay off my wife and I's student debt.

Making Volcano Plots With ggplot2

2024-04-21T00:00:00-04:00

One of the, if not the, most common downstream analysis tasks I'm asked to perform on RNAseq data is to generate the venerable "Volcano Plot." These are kind of the bioinformatics equivalent of saying "Hey! Look how much data I have!" Regardless, they are a pretty good way to quickly summarize an RNAseq experiment. There are now lots of options for generating these visualizations. If you're looking for a plug and play option, there's the excellent Bioconductor package EnhancedVolcano. However, if you are an R tidyverse user you actually already have everything you need to make these plots.

Starting in grad school, I created a library of R and Python snippets that I still reuse. I've continued to update my volcano plot code over time and at this point I actually still reuse that rather than loading in another package. Below, I will share this code and explain the major concepts behind making it. I'm not a software engineer, so it's likely that there are lots of other ways to throw this together.

The Full Function

This function is also available here

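library(dplyr)
library(tidyr)    # For replace_na()
library(stringr)  # For str_c()
library(ggplot2)
library(ggrepel)  # For displaying gene labels, if you don't want them you can omit this library
# DescTools also needs to be installed, it's called below as DescTools::RoundTo()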

volcplot <- function(data, padj_threshold = 0.05, fc = 1, plot_title = 'Volcano Plot', plot_subtitle = NULL, genelist_vector = NULL, genelist_filter = FALSE) {

  # Set the fold-change thresholds
  neg_log2fc <- -log2(fc)
  pos_log2fc <- log2(fc)

  # Make a dataset for plotting, add the status as a new column
  plot_ready_data <- data %>%
    mutate_at('padj', ~replace(.x, is.na(.x), 1)) %>%
    mutate_at('log2FoldChange', ~replace(.x, is.na(.x), 0)) %>%
    mutate(
      log2fc_threshold = ifelse(log2FoldChange >= pos_log2fc & padj <= padj_threshold, 'up',
                         ifelse(log2FoldChange <= neg_log2fc & padj <= padj_threshold, 'down', 'ns')
        )
    ) %>%
    mutate(hgnc_symbol = replace_na(hgnc_symbol, 'none'))

  if (genelist_filter) {
    plot_ready_data <- plot_ready_data %>% filter(hgnc_symbol %in% genelist_vector)
  }

  if(!is.null(genelist_vector)) {
    plot_ready_data <- plot_ready_data %>% mutate(hgnc_symbol = ifelse(hgnc_symbol %in% genelist_vector & padj < padj_threshold & log2fc_threshold != 'ns', hgnc_symbol, ''))
  }

  # Get the number of up, down, and unchanged genes
  up_genes <- plot_ready_data %>% filter(log2fc_threshold == 'up') %>% nrow()
  down_genes <- plot_ready_data %>% filter(log2fc_threshold == 'down') %>% nrow()
  unchanged_genes <- plot_ready_data %>% filter(log2fc_threshold == 'ns') %>% nrow()

  # Make the labels for the legend
  legend_labels <- c(
      str_c('Up: ', up_genes),
      str_c('NS: ', unchanged_genes),
      str_c('Down: ', down_genes)
  )

  # Set the x axis limits, rounded to the next even number
  x_axis_limits <- DescTools::RoundTo(
    max(abs(plot_ready_data$log2FoldChange)),
    2,
    ceiling
  )

  # Set the plot colors
  plot_colors <- c(
      'up' = 'firebrick1',
      'ns' = 'gray',
      'down' = 'dodgerblue1'
  )


  # Make the plot, these options are a reasonable starting point
  plot <- ggplot(plot_ready_data) +
    geom_point(
      alpha = 0.25,
      size = 1.5
    ) +
    aes(
      x = log2FoldChange,
      y = -log10(padj),
      color = log2fc_threshold,
      label = hgnc_symbol
    ) +
    geom_vline(
      xintercept = c(neg_log2fc, pos_log2fc),
      linetype = 'dashed'
    ) +
    geom_hline(
      yintercept = -log10(padj_threshold),
      linetype = 'dashed'
    ) +
    scale_x_continuous(
      'log2(FC)',
      limits = c(-x_axis_limits, x_axis_limits)
    ) +
    scale_color_manual(
      values = plot_colors,
      labels = legend_labels
      ) +
    labs(
      color = str_c(fc, '-fold, padj ≤ ', padj_threshold),
      title = plot_title,
      subtitle = plot_subtitle
    ) +
    theme_bw(base_size = 24) +
    theme(
      aspect.ratio = 1,
      axis.text = element_text(color = 'black'),
      legend.margin = margin(0, 0, 0, 0),
      legend.box.margin = margin(0, 0, 0, 0),  # Reduces dead area around legend
      legend.spacing.x = unit(0.2, 'cm')
    )

    # Add gene labels if needed
    if (!is.null(genelist_vector)) {
        plot <- plot +
        geom_label_repel(
          size = 6,
          force = 0.1,
          max.overlaps = 100000,
          nudge_x = 1,
          segment.color = 'black',
          min.segment.length = 0,
          show.legend = FALSE
        )
    }
    plot
}

Yes, this is rather long, but it's actually fairly straightforward to understand. Hopefully the comments help.

But What Does This Look Like?

Here's an example of a typical volcano plot this generates:

There are lots of places to customize, of course, since it's just a normal ggplot2 object.

How Is The Input Data Formatted?

This function works with DESeq2 output results as a data frame, but requires a bit of reformatting. So, you can get there like this:

deseq_results <- as.data.frame(deseq2_results) %>%
    tibble::rownames_to_column(var = 'ensembl_id') %>%
    left_join({ensembl_id_hgnc_symbol})

I typically work with ensembl gene IDs as a ground truth identifier for genes, and also include gene symbols as a more human readable identifier. Since I'm primarily working with human cell lines at the moment there needs to be a column in your dataset called "hgnc_symbol," according to the design of the volcano plot function. We achieve this by left_join() with an additional dataframe that consists of only two columns, "ensembl_id" and "hgnc_symbol." If you do work in mice, plants, etc. you can change all references to that column to suit your needs both here and in the plotting function.

Brief Explanation

You can think of this function doing things in a few discrete steps:

Set the fold change thresholds for the plot based on what you provide for the variable fc, which defaults to 1 (no threshold).
Set NAs in the padj column to 1 and in the log2FoldChange column to 0. Create a new variable with the gene's differential expression status (up, down, not significant).
Filter the dataset on a list of hgnc symbols you supply (optional).
Remove gene symbol labels if not differentially expressed and not a member of a list supplied when invoking the function (optional).
Get the number of genes which are significantly up and down, and the number which are not significant for the legend.
Create the legend labels based on number of differentially expressed genes.
Set the X axis limits based on rounding to the next multiple of 2 (because log base 2) of the absolute value of the max in the log2FoldChange column.
Set the colors for the plot, defaults can be easily changed but I like them.
Build the ggplot object, simply using geom_point() and some vertical/horizontal lines based on your fold change and padj thresholds.
Add labels to points based on hgnc_symbol (optional).
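
Since this is a polyglot kind of blog, here's the core logic of those steps (NA handling, classification, counting, and the axis limit) sketched in plain Python. It is only an illustration of the rules, not a replacement for the R function, and the gene names and values are made up.

```python
import math
from collections import Counter


def classify(log2fc, padj, padj_threshold=0.05, fc=1.0):
    """Return 'up', 'down', or 'ns' using the same rules as the R function."""
    # Missing values become padj = 1 and log2FoldChange = 0, so the gene is 'ns'
    padj = 1.0 if padj is None else padj
    log2fc = 0.0 if log2fc is None else log2fc
    threshold = math.log2(fc)
    if log2fc >= threshold and padj <= padj_threshold:
        return "up"
    if log2fc <= -threshold and padj <= padj_threshold:
        return "down"
    return "ns"


def x_axis_limit(log2fcs):
    """Round the largest absolute log2 fold change up to the next multiple of 2."""
    largest = max(abs(0.0 if value is None else value) for value in log2fcs)
    return 2 * math.ceil(largest / 2)


# Made-up genes: (log2FoldChange, padj), None standing in for NA
genes = {
    "gene1": (2.5, 0.001),
    "gene2": (-3.2, 0.01),
    "gene3": (0.4, 0.6),
    "gene4": (None, None),
}

status = {name: classify(lfc, padj, padj_threshold=0.05, fc=2)
          for name, (lfc, padj) in genes.items()}
counts = Counter(status.values())
limit = x_axis_limit(lfc for lfc, _ in genes.values())

print(status)
print(dict(counts), limit)
```

Note that gene4, with its missing values, lands in the "ns" bin rather than disappearing, so the three counts always sum to the number of genes you started with.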

Some Gotchas

DESeq2 sets padj and log2FoldChange to NA for many reasons. This may be because of the expression level and filtering out low-expressing genes prior to statistical testing. It may also be due to lack of replicates and too much variability. Regardless, it's something of a philosophical question as to whether you want these genes to show up in the "not significant" category or whether you should simply not include them in the results at all. At this point, I lean toward setting their p values to 1 and log fold changes to 0. This way, such genes end up in the "not significant" category. My reasoning: this heads off questions about why the number of genes in each category may not add up to the number in the annotation set across comparisons. Now those genes which are significantly up, down, and not significant always add up to the same number, assuming that you're using the same annotations.

Why Not Just Use EnhancedVolcano?

Honestly, there isn't really a good reason not to. However, I already had this code on-hand and therefore I find it pretty easy to just run this on the reg. If you're learning ggplot2 and the tidyverse I think this is a good way to learn with a real example.

Managing Software on a Multiuser Linux System - An Update

2024-04-20T00:00:00-04:00

Back in 2019, the halcyon days of yore, near the end of my time in graduate school I wrote a well-intentioned article about software management for multi-user linux systems (here). This original article was written based on my experiences as the de-facto sysadmin of our lab's bioinformatics server. I am not a trained Linux sysadmin, I didn't even major in anything computer-related in college. However, I have been a big nerd as long as I can remember and have been playing with Linux far longer than I was using it for a job. That article was a good starting point. Lots of things have changed in the past few years. While my thoughts are similar as back then, I do have the benefit of additional experience to draw on, as well as some developments on the software and hardware side of things since then.

This is not a guide to setting up an HPC or cluster. It's also not a guide for setting up a cloud compute environment. If you're not at a university or a large company, then you're unlikely to have an HPC. On the flip side, I like cloud compute, but always like to have a local server for my work. It's just faster to develop on. If you have average compute needs for bioinformatics (alignment, variant calling, notebooks, etc.) then a single node local server does a very good job. Especially with modern processors and enough RAM. Plus, if you need more compute you can always get it through your cloud vendor of choice.

As the only full-time bioinformatics scientist at a midsize biotech company I have once again found myself in the situation of managing a server for my own work. As we add more people, we need processes that scale well. My goal is to make this system, augmented with cloud resources, work as a primary compute server until reaching 4-5 users. At that point, it makes more sense to use such a system for testing and prototyping rather than a main compute resource.

Consider these tips an addendum to my previous article on the subject.

0. Make Your Life Easier With Containers

What is the best way to avoid installing a bunch of random software for your small number of users? Just don't. Encourage every user to use containers. But how will they do so without sudo privileges for Docker? Easy, have them use Apptainer or Podman. Neither of these options requires superuser privileges. Apptainer split off from Singularity, which is also an option. Podman is a Red Hat product, but runs just fine on Ubuntu and other distros, plus it is mostly CLI-identical to Docker. Docker is problematic in HPC and multi-user environments because it requires superuser permissions, or adding a user to the Docker group (which then allows control of all system containers, also problematic).

A wrinkle to using containers for scientific computing is that usually you don't want the container to continue running after the job is done. For Apptainer this is fine, it was designed with this use-case in mind. For Podman/Docker, simply include the --rm option in your docker run or podman run. Another thing to remember is that you'll have to mount the directory containing your input data, and your output location, as a volume for Podman/Docker. The same goes for Apptainer if you need to cross filesystem boundaries.
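
If you launch containers from scripts, it can help to build the command in one place so that --rm and the volume mount are never forgotten. A small sketch, where the function name, the image name, and the paths are all made up for illustration:

```python
import shlex


def container_command(runtime, image, data_dir, tool_args):
    """Build a run command that cleans up after itself and mounts the data directory."""
    if runtime not in ("podman", "docker"):
        raise ValueError(f"unsupported runtime: {runtime}")
    return [
        runtime, "run",
        "--rm",                                # Remove the container when the job ends
        "--volume", f"{data_dir}:{data_dir}",  # Same path inside and outside the container
        image,
        *tool_args,
    ]


# Hypothetical image name and paths
cmd = container_command(
    "podman",
    "localhost/samtools:example",
    "/data/project",
    ["samtools", "sort", "-o", "/data/project/sample.sorted.bam",
     "/data/project/sample.bam"],
)
print(shlex.join(cmd))
```

Mounting the directory at the same path inside the container is just a convenience, it means the paths you pass to the tool are the same ones you see on the host.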

Managing containers is not a big hassle either. You might think that making your own, and putting together a registry is too much work. Firstly, most bioinformatics and data science software is already available in Docker containers. If not from the developers, it's likely available through Biocontainers. In terms of storing them, you don't need a "real" container registry, you can simply save them like so:

podman save {container_name} | gzip > {container_name}.tar.gz

If a container is only available for Docker, you can usually convert it to Apptainer/Singularity format without any fuss:

apptainer build {container_name}.sif docker-archive:{container_name}.tar.gz

Just stick them somewhere consistent in your directory structure. I use a folder on a NAS called "container_library."

1. Encourage Use of Virtual Environments For Python

There is a great extension for pyenv called pyenv-virtualenv. When working on a data science project, or developing a standalone program, one should start by creating a virtualenv for said project. This is a great solution, in combination with pyenv, for python's terrible package management and dependency resolution.

Yes, you could use Anaconda to do the same thing but Anaconda is bloated and its conda package manager is slow. Plus, you're then stuck using it for everything. It's pretty easy to install your own local version of python with pyenv, and then simply run:

pyenv virtualenv 3.10.2 my-virtual-env-3.10.2

The best part of this from a sysadmin perspective is that, again, it's the users' own responsibility to manage all this. You just give them the tools.

2. Users Run Their Own Jupyter Notebook Servers

There are options like jupyterhub, or the littlest jupyterhub. However, it's even easier to just have users install jupyter in their own python libraries, assign them a port to serve it on, and have them start it like so:

jupyter lab --no-browser --port {assigned_port}

Then, they can forward that port to their local machine as such:

ssh -L {assigned_port}:localhost:{assigned_port} {remote_user}@{remote_host_ip}
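
If you'd rather not keep the port assignments in your head, something like this sketch can hand them out and print the matching commands. The base port, usernames, and host address are made up, and the function names are hypothetical.

```python
def assign_ports(users, base_port=8800):
    """Give each user their own port, in alphabetical order from base_port."""
    return {user: base_port + offset for offset, user in enumerate(sorted(users))}


def jupyter_commands(user, host, port):
    """Return the server-side and client-side commands for one user."""
    server = f"jupyter lab --no-browser --port {port}"
    tunnel = f"ssh -L {port}:localhost:{port} {user}@{host}"
    return server, tunnel


ports = assign_ports(["bob", "alice"])
for user, port in ports.items():
    server, tunnel = jupyter_commands(user, "192.0.2.10", port)
    print(f"{user}\n  on the server: {server}\n  on their machine: {tunnel}")
```

Keep in mind the assignment changes if the user list changes, so for anything long-lived you'd want to write the mapping down somewhere.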

Easy peasy.

3. Install R With rig

One of the more exciting things to happen recently is the R installation manager, rig. This provides some of the same functionality that you get from pyenv in R. Though, you still can't leave installation of R up to users without having them run R in a container (a decent option, but makes some things awkward unless you start doing everything that way). rig lets you install R across the system, without depending on your system package manager. It also configures R to use the Posit Package Manager by default. This is exciting because you no longer need to compile R packages. Yes, there were ways around this before, but none were very plug and play, and some methods required installing packages for the whole system. Running install.packages('package_name') will default to a binary package and then fall back to compilation if required.

This solution also defaults to user-specific package libraries. So, no need to manage that.

4. Have A Real Storage And Backup Strategy

It's not software, but this is a blog and there are no rules. I do what I want. To make sure you never lose critical data, even if this is just a testing system you need to have a backup strategy. I currently do the following on the server I manage for that purpose:

The / (root) directory is fast NVMe, but not terribly large. Only system files go here.
/home is on the same device as /, but users are encouraged to store data elsewhere.
/ is redundant with a RAID1 configuration storing a mirror of all data to a second identical NVMe drive.
/mnt/bulk_nvme is an array of six 4TB NVMe drives configured as a single volume. Each user gets a directory here symlinked to their /home. This is where most work happens.
/mnt/bulk_nvme is backed up daily to a NAS directly connected to the server via an rsync cron job over 10 gigabit ethernet.
Commonly used genome/transcriptome references, containers, and other files are available on /mnt/bulk_nvme as well as on the NAS.
On the NAS (affectionately named Dagobah, because it's a data "swamp"), there are directories for "data_archive" and "analysis_results" where raw data and analyzed data live, respectively.
The "data_archive," "analysis_results," and "container_library" are backed up to AWS S3 Glacier Instant Retrieval tier.
Computational notebooks (jupyter), scripts (BASH, R, Python), and workflows are versioned on github.

This works for now, and is likely to change in order to scale better. In particular, we are looking at automatically archiving data from a NAS using data lifecycle policies, as well as more complex solutions like WEKA.

5. Know Where You Draw The Line

Cloud computing is more accessible than ever before. I am personally attached to having some local compute for testing, lighter computation, and direct control. However, if you're not a large organization with existing HPC infrastructure and don't plan on buying into that, then cloud computing is the way to go. It becomes nearly impossible to predict your cloud bill, and that's a downside to this. However, once you have more than 2-5 people (and probably by the time you get to 4 or 5), you're going to either need a full-time, real sysadmin or you're going to have to use more cloud resources.

We supplement our on-prem compute with AWS. So, if there is a job that requires more RAM than available, uses more CPUs than we have in one server, or needs powerful GPUs then to the cloud it goes. There are also specific bioinformatics cloud platforms, and I have opinions on these, but that's for another post.

Wrap-up

I'm sure there are tens of people out there like me, who don't mind managing a system like this. Regardless, this may give you some ideas on how to manage your compute environment. I find having a local server more convenient than doing everything in "the cloud" (someone else's computer), but I have specific limits on this and once reached, augmenting with cloud resources is a smart thing to reduce admin overhead.

Now, let's see if I can manage more than one post every 4 years.

Publications, Dissertations, Job Hunts, and a Pandemic

2020-05-30T00:00:00-04:00

I started this github site as a place to expand my professional reach by posting my random musings on bioinformatics, Linux, data science, etc. I made a few reasonably cogent posts, but then life got in the way! It's been a really busy time, a very eventful year. I'm now in a very different position than I was last summer. It's exciting progress, but hanging over all of it is the COVID-19 pandemic. I'm going to write another post specifically dealing with job hunting as a Ph.D. student, but here I thought I'd kick off my blog again with a general update since it's been a while.

Wrapping up Grad School During the COVID-19 Pandemic

My lack of posts began mostly as a consequence of the need to grind harder than I ever had before on my dissertation research. However, I was successful and finished my third publication during my Ph.D. program. It's since been accepted in PNAS, but I'm not sure when the final version will be available. Until then, you can read the bioRxiv version:

Abundant expression of maternal siRNAs is a conserved feature of seed development

As with many things in grad school, progress on this paper wasn't consistent. It happened in a series of spurts, with a large portion of the work really only happening once we got the final sequencing samples we had been waiting on last summer.

Along with that comes the need to write a dissertation. Since I had three papers, my committee was happy for me to do what's called a "staple dissertation." The name here is a bit misleading, since I did have to write more besides putting together my papers. However, this also took a significant amount of time.

While this was all going on, I attempted a short-lived bioinformatics/data science student group. Though, in the end wrangling grad students is a bit like herding cats. Still, I think it probably helped other grad students, which was the point.

Back in January I really started to look at job postings more seriously because I had known from the very beginning that my goal was to go back to industry. Mostly I was looking at bioinformatics scientist-type positions. I really threw myself into this, and I think all-in-all I probably put out 150 applications.

Then, as we all know, the pandemic hit in a big way for all of us. Labs were shut down, jobs were lost, colleges went online. I was in the phase of my Ph.D. where I needed to write my dissertation and apply for jobs, so I wasn't disrupted as much as most. However, it was definitely disheartening to have my job interviews dry up. Luckily, it seems that people with bioinformatics experience are still in high demand, and I was able to land a good job regardless, though not without becoming more and more concerned on a daily basis as the rejection emails started to roll in. Overall though, I think that the prospects for biology and bioinformatics in particular are still strong. We might not be "essential" workers, but it's hard to justify not employing people like us at a time when we could actually discover something vitally important to help with the current situation. Or, at the very least, the pandemic underscores that biology is important, and needs to be taken seriously, funded, and considered more by society in general. The pandemic will be temporary, but perhaps society will continue to think about biologists more frequently.

Just last week, I defended my Ph.D. It was a fantastic moment, even if it had to happen over a Zoom meeting. Following this, and wrapping up everything, I'll be moving to Boston to be closer to my new employer, Seven Bridges Genomics. There, I will be working as a "Genomics Scientist" to aid bioinformatics workflow development and coordination of data and metadata availability for their public programs. It's a great way to move on from grad school.

Moving forward, I'm going to try to keep up with posting things. Lots of exciting things happening! Hopefully I can keep my skills sharp, and I think this blog will be essential for that.

Just Write Your Own Python Parsers for .fastq Files

2019-08-22T00:00:00-04:00

In contrast to the zen of python, there are actually many ways to handle sequence data in Python. There are several packages on PyPI that provide parsers for sequence formats like .fastq and .fasta. I've never bothered with these, including the oft-used Biopython. I vaguely remembered Biopython being slower than any parser I'd written myself early on in learning bioinformatics, and it not actually being simpler to implement. However, I'd never looked at this in detail. Additionally, I'd recently run across a few posts on biostars where users were deriding people for asking "What is the most efficient way to parse a huge .fastq file" or something similar.

First of all, don't discourage people who are trying to learn. Secondly, this is a good question! As scientists, we should know that just because data exists doesn't mean it's good. Likewise, just because software exists doesn't mean it's the best tool for any given job. Plus, writing simple parsers for common formats is a good way to practice file processing for when you eventually need to do something hard and no ready-made parser exists in a package.

Rather than vaguely saying "X package is slow, do this instead" I thought it'd be best to actually benchmark some different .fastq parser options.

The Contenders

There are several packages that include parsers for biological sequence data. These include:

Biopython
HTSeq
scikit-bio

I'm familiar with Biopython from the recommendations that abound in the community for exactly this task, and HTSeq mostly for HTSeq-count. Scikit-bio seems to be newer and under active development, so results from testing it are subject to change. I mention this just in case someone looks at this years after it's written and wonders why I got the performance that I did.

When it comes to dealing with .fastq files I checked through my library of Python scripts and came across two patterns that I'll also test compared to these packages:

Reading line-by-line, using a counter to yield records
Reading line-by-line, using zip_longest() from itertools to yield records

Setting up the Test

I did this in a jupyter notebook, since that's what I use on a day-to-day basis. Most of my interactive "data science" work is done in R, which is mostly a consequence of at one point needing to use some R packages that have no Python equivalents, and just rolling with that. So, actually using Python in jupyter is a bit of a departure from the norm for me.

First, you need the necessary packages. I just use pip with pyenv:

pip install biopython HTSeq scikit-bio

Then, start a new jupyter notebook with jupyterlab (a sweet new UI for jupyter that you should use!). Your first step is always to do your imports.

from Bio import SeqIO
from HTSeq import FastqReader
from itertools import zip_longest
import skbio

I'm only using one function from skbio, but it's just called read() which is too generic a name to just import that single function without causing all sorts of annoyances and gnashing of teeth.

Also, it's important with any parsing problem to understand the file format. The .fastq format is ubiquitous in bioinformatics and looks like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Source

You can understand it as a repeated series of four lines:

Sequence ID, starting with "@"
Sequence (ATCG)
Separator (+)
Quality score for each base call (same length as sequence)

The catch here is that you can't use @ as a record separator. It's a valid character in the score line, too. So, you really do need to group the lines in batches of four, as it's possible @ will exist in position 1 of the score line.
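To make that catch concrete, here is a minimal sketch. It assumes the common Phred+33 quality encoding, where "@" stands for a quality score of 31, and the two records are made up for illustration.

```python
# Two made-up records; the quality line of the second one starts with "@"
fastq_text = (
    "@read1\n"
    "GATTACA\n"
    "+\n"
    "IIIIIII\n"
    "@read2\n"
    "ACGTACG\n"
    "+\n"
    "@IIIIII\n"
)

lines = fastq_text.splitlines()

# Naive approach: treat every line starting with "@" as a record header
naive_headers = [line for line in lines if line.startswith("@")]

# Safer approach: take the lines in batches of four
records = [lines[i:i + 4] for i in range(0, len(lines), 4)]

# Under Phred+33, quality score = ASCII code - 33
at_quality = ord("@") - 33

print(len(naive_headers))  # 3, one more "header" than there are records
print(len(records))        # 2
print(at_quality)          # 31
```

The naive count is off by one because the quality line of the second record is mistaken for a header, which is exactly why the parsers below group lines in fours.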

Define Some Functions to Test

In order to make the benchmarking easier to follow, I figured I'd define the functions I want to benchmark in a consistent way:

# Using Biopython
def parse_biopython(input_fastq):
    for record in SeqIO.parse(input_fastq, 'fastq'):
        yield record

# Using HTSeq
def parse_htseq(input_fastq):
    for record in FastqReader(input_fastq):
        yield record

# HTSeq raw
def parse_htseq_raw(input_fastq):
    for record in FastqReader(input_fastq, raw_iterator=True):
        yield record

# Skbio
def parse_skbio(input_fastq):
    for record in skbio.io.read(input_fastq, format='fastq'):
        yield record

# Line by line with counter
def parse_lbl_counter(input_fastq):
    with open(input_fastq, 'r') as input_handle:
        record = []
        n = 0
        for line in input_handle:
            n += 1
            record.append(line.rstrip())
            if n == 4:
                yield record
                n = 0
                record = []

# Line by line with zip_longest
def parse_zip_longest(input_fastq):
    with open(input_fastq, 'r') as input_handle:
        fastq_iterator = (l.rstrip() for l in input_handle)
        for record in zip_longest(*[fastq_iterator] * 4):
            yield record

Here I intended to use two different methods from HTSeq, one of which just returns bare tuples rather than objects with extra validation based on the definition of the format. However, neither HTSeq method worked. Both gave a StopIteration error on reaching the end of a file, and trying to catch that with a try: except: block didn't seem to work. They did parse until they reached the end of the file though. I think this is a bug, and I may raise it with the HTSeq people. So HTSeq is, regrettably, not included in my benchmarking results. Also, in both custom parsers, str.rstrip() was marginally faster than str.strip() so I went with that instead.
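If the zip_longest(*[iterator] * 4) idiom in the last parser looks cryptic, here is a small sketch of what it does, using a plain list of made-up lines in place of a file handle. The trick is that the same iterator object is passed four times, so each output tuple pulls four consecutive items from it.

```python
from itertools import zip_longest

# Made-up lines: two complete records and one truncated record
lines = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "FFFF", "@r3"]

# One shared iterator, referenced four times
line_iter = iter(lines)
groups = list(zip_longest(*[line_iter] * 4))

for group in groups:
    print(group)
```

Note that a truncated final record does not raise an error. It is padded with None instead, so a parser built this way will not notice a cut-off file unless you check for that yourself.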

Run Some Benchmarks

I decided I would try each of these with 1 million lines from a whole-genome bisulfite experiment. These are the R1 mates from 75bp paired end reads:

%timeit -n 10 -r 10 [record for record in parse_biopython('JWG3_2_2_R1.head.fastq')]
2.86 s ± 56.7 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n 10 -r 10 [record for record in parse_skbio('JWG3_2_2_R1.head.fastq')]
1min 33s ± 13.7 s per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n 10 -r 10 [record for record in parse_lbl_counter('JWG3_2_2_R1.head.fastq')]
295 ms ± 14.7 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

%timeit -n 10 -r 10 [record for record in parse_zip_longest('JWG3_2_2_R1.head.fastq')]
249 ms ± 2.57 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

The %timeit function there is some ipython "line magic." It simplifies timing a single line of code. The %%timeit is the "cell magic" version.
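If you are not working in ipython or jupyter, the standard library timeit module can do roughly the same thing. This is only a sketch: count_to is a toy stand-in for a parser call, and the summary line just imitates the shape of the %timeit output.

```python
import timeit


def count_to(n):
    """Toy workload standing in for a parser call."""
    total = 0
    for i in range(n):
        total += i
    return total


# Roughly the equivalent of `%timeit -n 10 -r 10`: 10 runs of 10 loops each
loops = 10
runs = 10
totals = timeit.repeat(lambda: count_to(10_000), number=loops, repeat=runs)

# Each entry in totals covers all 10 loops, so divide to get time per loop
per_loop = [t / loops for t in totals]
mean = sum(per_loop) / len(per_loop)

print(f"{mean * 1e3:.3f} ms per loop (mean of {runs} runs, {loops} loops each)")
```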

It seems that skbio isn't ready for primetime just yet. The real question then is, would biopython suffice for day-to-day work? Perhaps yes, ~1M lines in < 3s (349650.35 lines per second) is a timescale that people might be willing to work with. Keep in mind this is on my personal laptop, so it's hardly a compute cluster. In contrast, the very simple line counter-based parser that I wrote as a master's student back in 2013 as a python-learning exercise is nearly 10x faster! There is also an improvement in speed for using zip_longest() from itertools (a trick I'm pretty sure I saw in a post from Brent Pedersen on stackoverflow).
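The throughput and speedup figures are simple arithmetic on the timings above. Here is a quick sketch, with the times copied from the benchmark output and converted to seconds (the dictionary keys are just labels I chose for this example):

```python
lines = 1_000_000

# Mean time per loop from the benchmarks above, in seconds
seconds = {
    "biopython": 2.86,
    "skbio": 93.0,         # 1min 33s
    "lbl_counter": 0.295,
    "zip_longest": 0.249,
}

lines_per_second = {method: lines / t for method, t in seconds.items()}
speedup_vs_biopython = {
    method: seconds["biopython"] / t for method, t in seconds.items()
}

for method in seconds:
    print(
        f"{method}: {lines_per_second[method]:,.2f} lines/s, "
        f"{speedup_vs_biopython[method]:.1f}x biopython"
    )
```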

Visualize

I'm usually a ggplot2 useR for visualizations, but I'm already in python here so let's use this as an excuse to try the great python plotting library altair. It's declarative, like ggplot2, and you build your plot by "mapping" your "variables" (columns) to "encodings" (analogous to "aesthetics" in ggplot2). I ran several other benchmarks and turned them into a pandas data frame. First you'll need to do some imports:

import pandas as pd
import numpy as np
import altair as alt

Then make the data frame

# Create a dataframe Pandas style

timing_data = pd.DataFrame({'Method': np.repeat(['biopython', 'skbio', 'lbl', 'zip'], 5),
                            'Reads': (np.tile([100, 1000, 10000, 100000, 1000000], 4) / 4),
                            'Time (s)': [(670 / 1e6), (4.4 / 1000), (40.49 / 1000), (418 / 1000), 2.86,
                                         (14.2 /1000), (132 / 1000), 1.32, 13.9, (60 + 33),
                                         (181 / 1e6), (442 / 1e6), (3.92 / 1000), (40.5 / 1000), (295 / 1000),
                                         (70.2 / 1e6), (352 / 1e6), (3.19 / 1000), (32.5 / 1000), (249 / 1000)]})

Since each record is 4 lines, converting lines to # of reads requires dividing by four. Likewise, the benchmarking results are in various time units, so I've converted all of them to seconds. Not particularly efficiently, but for this simple example it's fine.

Now we can visualize with Altair. It has a very nice syntax inspired by ggplot2's "grammar of graphics." It's based on vega-lite under the hood and allows you to easily save your plot from jupyterlab. Here's the code:

# Plot as a scatterplot

alt.Chart(timing_data).mark_point().encode(
    x='Reads',
    y='Time (s)',
    color='Method'
)

# Plot on log scale

alt.Chart(timing_data).mark_point().encode(
    alt.X('Reads', scale=alt.Scale(type='log', base=10)),
    alt.Y('Time (s)', scale=alt.Scale(type='log', base=10)),
    color='Method'
)

Everything scales linearly, but at massively different rates. Scikit-bio is in another universe in terms of time, such that you can't even visualize it with the others in a meaningful way until you log scale everything. On the log scale, you can essentially see that biopython is an order of magnitude faster than skbio, and either simple parser is an order of magnitude faster again. The difference between the two simple parsers is pretty insignificant.
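One rough way to put a number on "scales linearly" is the slope between the smallest and largest benchmark on log-log axes, where a slope of 1 means time is directly proportional to the number of reads. The sketch below reuses the timings from the data frame above. The loglog_slope helper is something made up for this example, not part of any library.

```python
import math

reads = [25, 250, 2500, 25000, 250000]

# Timings in seconds, copied from the data frame above
times = {
    "biopython": [670e-6, 4.4e-3, 40.49e-3, 0.418, 2.86],
    "skbio": [14.2e-3, 0.132, 1.32, 13.9, 93.0],
    "lbl": [181e-6, 442e-6, 3.92e-3, 40.5e-3, 0.295],
    "zip": [70.2e-6, 352e-6, 3.19e-3, 32.5e-3, 0.249],
}


def loglog_slope(x, y):
    """Slope between the first and last points on log-log axes."""
    return math.log10(y[-1] / y[0]) / math.log10(x[-1] / x[0])


slopes = {method: loglog_slope(reads, t) for method, t in times.items()}

for method, slope in slopes.items():
    print(f"{method}: {slope:.2f}")
```

The slopes come out a little under 1 for every method. A likely explanation is that fixed startup costs make up a bigger share of the time on the tiniest inputs, though that is a guess rather than something these benchmarks measured.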

Note: Altair is great! Not quite as full-featured as ggplot2 in R, but it's definitely promising and something to watch for in the future. They definitely should make it work with jupyterlab's dark theme though. Due to the transparent plot backgrounds it requires a light theme.

To Wrap Things Up

I'm not saying you should never use biopython; I suspect its parser does some extra validation that my simple parsers don't. It also returns objects with some possibly useful methods. However, if you just want to read files quickly then the simple line-by-line parsers aren't actually very complicated to write. Plus, you don't even need to import anything unless you want a minor speed boost from itertools. Additionally, if you didn't need to strip newlines you'd get a boost from not calling str.rstrip() on each line.

If you're ok with living dangerously, and are sure your files are formatted correctly you can easily write something that will outperform standard implementations with little effort when it comes to .fastq parsing.

The Snakemake Tutorial I Wish I Had

2019-08-19T00:00:00-04:00

Over the past few years the use of workflow managers in genomics and bioinformatics has grown greatly. This is a great thing for the field and adds to our ability to perform reproducible analyses, especially for pipelines with many steps. These are common in bioinformatics, but prior to the use of workflow managers they were mostly handled with BASH scripts. While good BASH scripts are perfectly acceptable much of the time, they aren't very portable and don't handle multithreading and concurrent processes without annoying hacks. For a one-off analysis that's all fine, but what about a pipeline you need to run many times? This is where a workflow manager really shines, especially when combined with containers.

I decided to implement two pipelines that we use often here in the Mosher Lab as Snakemake workflows. We work with a lot of small RNA sequencing and recently some whole-genome bisulfite sequencing data. There are already available pipelines for WGBS, but they seem like overkill and writing my own is a good way to learn the ins and outs of Snakemake.

I chose Snakemake over other workflow managers due to its already frequent use in bioinformatics workflows and my familiarity with Python. However, Nextflow seems like a solid option as well. I found much of the official documentation very lacking in useful examples though. I ended up consulting numerous workflows available on Github. The issue there is that a lot of them are trying to do too much IMHO. So, I figured I'd write the tutorial I wish I had been able to find.

Step 0 - Install Snakemake and Your Workflow's Software Dependencies

I'm assuming you're on some kind of Linux system. Though, these directions may also work on macOS. Your first step should be to install Snakemake:

pip install --user Snakemake

I always recommend not using your system python. If you're on a non-rolling release distribution, or on macOS it's probably super outdated. I use pyenv to manage my Python installations, though there are other options. I also never use sudo to install python packages.

The WGBS workflow consists of several steps which can be represented by this DAG (Snakemake can make these! Neat-o!)

You can boil it down to:

Index the reference genome with bwa-meth (needed for bwa-meth alignment)
Index the reference genome with samtools faidx (needed for MethylDackel)
Quality reporting with FastQC
Trimming adapters with Trim Galore
Quality reporting on trimmed reads (FastQC again)
Alignment with bwa-meth
Sorting with samtools sort
Marking PCR duplicates with Picard Tools
Detecting bias and extracting per-cytosine percent methylation with MethylDackel
Determining fold-coverage with mosdepth and a little python script

Not a super complicated workflow, but enough to demonstrate a real-world use of Snakemake. It's also complicated enough that you don't want to run each step separately.

I don't expect anyone to replicate this exact workflow, but it's a useful example.

Step 1 - Learn Some Snakemake Basics

There are some basics to explain before I start throwing code around. Firstly, Snakemake does not work the way you might think: it actually works backwards from a set of target files through a set of rules. You may think this sounds unnecessarily confusing, but there is a good reason for this. When Snakemake begins a workflow, this ensures that (as long as you don't do anything too weird) it will not fail for trivial reasons like files not being generated as inputs to other rules. It creates a directed acyclic graph representing the workflow for each sample that it can match through wildcards to your targets. It will use the rules you define in its main script (the Snakefile) to create a path from targets to inputs (samples). This is backwards from the way we think, and there are workflow managers that push rather than pull. Each approach has its advantages and disadvantages.
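Here is a toy sketch of that "pull" idea in plain Python. It is not how Snakemake is actually implemented, and the rule names, patterns, and the match and plan helpers are all made up for illustration. It only shows the principle: start from a target, find a rule whose output pattern matches it, and recurse on that rule's input until you reach files that already exist.

```python
import re

# Hypothetical rules: (name, output pattern, input pattern)
RULES = [
    ("trim", "trimmed/{sample}.fq", "raw/{sample}.fq"),
    ("align", "aligned/{sample}.bam", "trimmed/{sample}.fq"),
]

# Files assumed to already exist on disk (the raw samples)
EXISTING = {"raw/A.fq"}


def match(pattern, filename):
    """Return the wildcard value if filename fits the pattern, else None."""
    regex = re.escape(pattern).replace(r"\{sample\}", r"(?P<sample>[^/]+)")
    found = re.fullmatch(regex, filename)
    return found.group("sample") if found else None


def plan(target):
    """Work backwards from a target, returning jobs in the order to run them."""
    if target in EXISTING:
        return []
    for name, out_pattern, in_pattern in RULES:
        sample = match(out_pattern, target)
        if sample is not None:
            needed = in_pattern.replace("{sample}", sample)
            return plan(needed) + [f"{name}:{sample}"]
    raise FileNotFoundError(f"No rule can produce {target}")


print(plan("aligned/A.bam"))  # ['trim:A', 'align:A']
```

Asking for aligned/B.bam raises an error before any job would run, because nothing can produce raw/B.fq. That is the "fail early" benefit of planning backwards from the targets.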

Another key concept is that your rules live in a Snakefile, just a Python script with extra syntax. So, you can use Python code in it! Keep this in mind, and you can do some neat things (creating sample tables for differential expression, etc.).

Typically, the first rule in a Snakefile is called "all" and this rule will indicate the targets that you want to generate. This tells Snakemake to start and use the rules to try to make them based on wildcard matching with the inputs.

Additionally, it's good practice to include the options you'd like a user to be able to configure in a .json or .yaml file. You can think of these files as a Python dictionary in static, human-readable file form (loosely like pickling, but as plain text).
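To show the "dictionary in file form" idea, here is a minimal sketch using the json module from the standard library. Snakemake does this parsing for you, so you would not normally write it yourself, and the keys in this example are only illustrative.

```python
import json

# What a small .json config file might contain
config_text = """
{
    "samples": ["Sample1", "Sample2"],
    "bwameth": {"threads": 10}
}
"""

# Parsing gives back ordinary nested Python dictionaries and lists
config = json.loads(config_text)

samples = config["samples"]
threads = config["bwameth"]["threads"]

print(samples)  # ['Sample1', 'Sample2']
print(threads)  # 10
```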

Step 2 - Create a Rule

Create a blank file called Snakefile in the directory you're using for development and fill it with this:

# Run fastqc on the raw .fastq.gz files
rule fastqc_raw:
    input:
        'input_data/{sample}_R{mate}.fastq.gz'
    output:
        '1_fastqc_raw/{sample}_R{mate}_fastqc.html',
        '1_fastqc_raw/{sample}_R{mate}_fastqc.zip'
    params:
        out_dir = '1_fastqc_raw/'
    shell:
        'fastqc -o {params.out_dir} {input}'

This is the first rule in our workflow, running FastQC on the input files. In the {} are wildcards. While they have names, they are only there for readability right now. The only param we're currently passing to it is the output directory, but this is where your options would be. Snakemake will check whether the input for a rule can be made before allowing the workflow to start. Therefore, if your workflow starts, it should finish. However, we need a rule "all" which will tell it to be run.

Add the following to the Snakefile before the fastqc rule:

rule all:
    input:
        expand('1_fastqc_raw/{sample}_R{mate}_fastqc.{ext}',
               sample=SAMPLES, mate=[1, 2], ext=['html', 'zip'])

This tells the workflow which targets to create. Here, expand instructs Snakemake to fill in the wildcards with every combination of the values you give it. So, all files which match this pattern when filled in are the targets. You can use expand in other rules too, when you want multiple files as input that may be created asynchronously in previous rules. We're still missing a very important thing though! The input files! For that, create another file, config.yaml, in the same directory and add this to it:

samples:
    - Sample1
    - Sample2
    - etc...

Your samples are yaml entries without names (you could name them if you want), and should not include read pair numbers (so 1 ID for each pair). The sample IDs should match what is in rule all in place of {sample}. The config.yaml is also where the options for your workflow steps can live, under their own headings. Now, go back to your Snakefile and add this above your rules:

# Get overall workflow parameters from config.yaml
configfile: 'config.yaml'

SAMPLES = config['samples']

This parses the config file into a Python dictionary. Do you see how SAMPLES is filled in for {sample} in rule all, which then creates a target that can be generated by rule fastqc_raw? It's kind of a mind-bender at first, but it all fits together.
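If it helps, here is a rough plain-Python sketch of what the expand call in rule all produces for two samples. The function expand_sketch is a made-up stand-in written for this post. It only mimics the basic behavior of filling the pattern with every combination of values, and is not Snakemake's own implementation.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


SAMPLES = ["Sample1", "Sample2"]

targets = expand_sketch(
    "1_fastqc_raw/{sample}_R{mate}_fastqc.{ext}",
    sample=SAMPLES, mate=[1, 2], ext=["html", "zip"],
)

for target in targets:
    print(target)
```

Two samples, two mates, and two extensions give eight target files, each of which matches one of the output patterns of rule fastqc_raw.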

Save all these files and include some test data in a subdirectory called "input_data".

Step 2 - Running Your First Rules

I copied one sample to my input_data directory and added it to the .yaml file:

samples:
    - JWG3_2_2

Now run the workflow from the directory you're developing in with snakemake --cores #:

[groverj3@x1-carbon snakemake_test]$ snakemake --cores 2
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        2       fastqc_raw
        3

[Mon Aug 19 23:30:46 2019]
rule fastqc_raw:
    input: input_data/JWG3_2_2_R2.fastq.gz
    output: 1_fastqc_raw/JWG3_2_2_R2_fastqc.html, 1_fastqc_raw/JWG3_2_2_R2_fastqc.zip
    jobid: 2
    wildcards: sample=JWG3_2_2, mate=2


[Mon Aug 19 23:30:46 2019]
rule fastqc_raw:
    input: input_data/JWG3_2_2_R1.fastq.gz
    output: 1_fastqc_raw/JWG3_2_2_R1_fastqc.html, 1_fastqc_raw/JWG3_2_2_R1_fastqc.zip
    jobid: 1
    wildcards: sample=JWG3_2_2, mate=1

Started analysis of JWG3_2_2_R1.fastq.gzStarted analysis of JWG3_2_2_R2.fastq.gz

I ran it with --cores 2, but I did not include a threads parameter in the rule in addition to input, output, params, and shell. So, it thinks the rule fastqc_raw requires only one processor. This means it will parallelize samples through that rule up to the maximum you give it at run-time! This is handy. Do you see now how this is better than a bash script? It intelligently runs processes in parallel when it can, but if you specify a number of threads for a rule it will wait until those cores are available.
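As a back-of-the-envelope model of that behavior, here is a small sketch. It is a simplification with made-up helper names: it only considers jobs from a single rule, ignores any other resources, and is not Snakemake's actual scheduler.

```python
def effective_threads(rule_threads, cores):
    """Threads a job actually gets: never more than the cores provided."""
    return min(rule_threads, cores)


def max_concurrent_jobs(rule_threads, cores):
    """How many jobs of one rule fit side by side within the core budget."""
    return cores // effective_threads(rule_threads, cores)


# A rule with no threads directive counts as 1 thread; with --cores 2,
# both mates can go through fastqc_raw at the same time
print(max_concurrent_jobs(rule_threads=1, cores=2))   # 2

# A rule asking for 10 threads with only 4 cores gets scaled down to 4
print(effective_threads(rule_threads=10, cores=4))    # 4
print(max_concurrent_jobs(rule_threads=10, cores=4))  # 1
```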

Let's add a few more rules.

Step 3 - Add Rules

I'm going to add a bunch of rules. Don't freak out. Our Snakefile now looks like this:

# Get overall workflow parameters from config.yaml
configfile: 'config.yaml'

SAMPLES = config['samples']
REFERENCE_GENOME = config['reference_genome']

rule all:
    input:
        expand('3_bwameth_aligned/{sample}.bam',
               sample=SAMPLES),
        expand('2_trim_galore/{sample}_R{mate}_val_{mate}_fastqc.{ext}',
               sample=SAMPLES, mate=[1, 2], ext=['html', 'zip'])


# Index the reference genome
# ancient() will assume the reference is older than output files if they exist
rule bwameth_index:
    input:
        ancient(REFERENCE_GENOME)
    output:
        REFERENCE_GENOME + '.bwameth.c2t',
        REFERENCE_GENOME + '.bwameth.c2t.amb',
        REFERENCE_GENOME + '.bwameth.c2t.ann',
        REFERENCE_GENOME + '.bwameth.c2t.bwt',
        REFERENCE_GENOME + '.bwameth.c2t.pac',
        REFERENCE_GENOME + '.bwameth.c2t.sa'
    params:
        bwameth_path = config['paths']['bwameth_path'],
    shell:
        '{params.bwameth_path} index {input}'


# Run fastqc on the raw .fastq.gz files
rule fastqc_raw:
    input:
        'input_data/{sample}_R{mate}.fastq.gz'
    output:
        '1_fastqc_raw/{sample}_R{mate}_fastqc.html',
        '1_fastqc_raw/{sample}_R{mate}_fastqc.zip'
    params:
        fastqc_path = config['paths']['fastqc_path'],
        out_dir = '1_fastqc_raw/'
    shell:
        '{params.fastqc_path} -o {params.out_dir} {input}'


# Trim the read pairs
rule trim_galore:
    input:
        '1_fastqc_raw/{sample}_R1_fastqc.html',
        '1_fastqc_raw/{sample}_R1_fastqc.zip',
        '1_fastqc_raw/{sample}_R2_fastqc.html',
        '1_fastqc_raw/{sample}_R2_fastqc.zip',
        R1 = 'input_data/{sample}_R1.fastq.gz',
        R2 = 'input_data/{sample}_R2.fastq.gz'
    output:
        '2_trim_galore/{sample}_R1_val_1.fq.gz',
        '2_trim_galore/{sample}_R1.fastq.gz_trimming_report.txt',
        '2_trim_galore/{sample}_R2_val_2.fq.gz',
        '2_trim_galore/{sample}_R2.fastq.gz_trimming_report.txt'
    params:
        adapter_seq = config['trim_galore']['adapter_seq'],
        out_dir = '2_trim_galore',
        trim_galore_path = config['paths']['trim_galore_path']
    shell:
        '''
        {params.trim_galore_path} \
        --adapter {params.adapter_seq} \
        --gzip \
        --trim-n \
        --quality 20 \
        --output_dir {params.out_dir} \
        --paired \
        {input.R1} {input.R2} \
        '''


# Run fastqc on the trimmed reads
rule fastqc_trimmed:
    input:
        '2_trim_galore/{sample}_R{mate}.fastq.gz_trimming_report.txt',
        fq_gz = '2_trim_galore/{sample}_R{mate}_val_{mate}.fq.gz'
    output:
        '2_trim_galore/{sample}_R{mate}_val_{mate}_fastqc.html',
        '2_trim_galore/{sample}_R{mate}_val_{mate}_fastqc.zip'
    params:
        fastqc_path = config['paths']['fastqc_path'],
        out_dir = '2_trim_galore/'
    shell:
        '{params.fastqc_path} -o {params.out_dir} {input.fq_gz}'


# Align to the reference
rule bwameth_align:
    input:
        rules.bwameth_index.output,
        R1 = '2_trim_galore/{sample}_R1_val_1.fq.gz',
        R2 = '2_trim_galore/{sample}_R2_val_2.fq.gz'
    output:
        '3_bwameth_aligned/{sample}.bam'
    log:
        '3_bwameth_aligned/{sample}.bwameth.log'
    threads:
        config['bwameth']['threads']
    params:
        bwameth_path = config['paths']['bwameth_path'],
        genome = REFERENCE_GENOME
    shell:
        '''
        {params.bwameth_path} \
        -t {threads} \
        --reference {params.genome} \
        {input.R1} {input.R2} \
        2> {log} \
        | samtools view -b - \
        > {output}
        '''

And the config.yaml looks like this:

samples:
    - JWG3_2_2
    # Samples should be reported by ID rather than filenames, and exclude the
    # trailing "R1" and "R2", one sample ID per pair. If samples are supplied as
    # separate .fastq.gz files within each pair concatenate them to a single R1 and
    # R2 file prior to running.

reference_genome:
    reference_genome/ref_genome.fasta
    # Path to reference genome here with .fasta extension


# Options for individual workflow steps
# Configure threads for each step as desired, this is a sane starting point

trim_galore:
    adapter_seq : AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    quality : 20

bwameth:
    threads : 10


# Paths to individual tools
# You probably don't need to change this unless programs are not in your $PATH

paths:
  fastqc_path : fastqc
  trim_galore_path : trim_galore
  bwameth_path : bwameth.py

That's a lot to take in, so a few words of explanation are in order. I have moved paths for the individual programs to a section in the config file. This is to help with potential portability problems. On some servers you may have tools installed in directories outside your $PATH. The paths are pre-filled with just the tool name, so they work fine when programs are executable from a command prompt, but this allows configuration (it's also at the bottom of the file because most users shouldn't have to change it). I also now have a section containing options for each tool that's run. You can pull these out of the dictionary made from config.yaml in each step's params with ordinary dictionary indexing, then refer to them in the shell command using . (dot) notation, as in {params.out_dir}. I've allocated 10 threads to the alignment step. This means it will wait rather than run if 10 threads aren't available because other rules are already using them.

There are also multiple targets now. This is because the output from rule fastqc_trimmed is not used as input to another rule. Unless you explicitly tell Snakemake to run that rule to generate its target it will not run and you will get very annoyed.

It's a lot to take in, but this is essentially a usable workflow.

Step 4 - Wrapping Up + A Few Tips

You can now add rules to your heart's content. Keep in mind though, you need to change your targets! Otherwise, it won't run your new rules :(.

Also, you're probably wondering what happens when you don't actually have enough CPUs to run that rule with 10 threads. Just change the --cores argument at run-time to a lower number. It will reduce that rule's threads to the number specified.
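As I understand it, the scaling amounts to taking the smaller of the two numbers. A tiny Python sketch of that rule of thumb (the function name is made up for illustration, this is not Snakemake's own code):

```python
def effective_threads(rule_threads, cores):
    """Threads a rule actually gets: never more than the --cores budget."""
    return min(rule_threads, cores)


print(effective_threads(10, 4))   # → 4, the rule is scaled down
print(effective_threads(10, 16))  # → 10, the rule gets what it asked for
```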

Another thing to consider is that Snakemake has the ability to work with HPC job submission frameworks like SLURM and PBS. Though, it's not really that difficult to include snakemake --cores # in a normal .pbs script. It also plays nice with containers (Docker and Singularity). So, if you package up your software in one you have automation and deployability all-in-one!

If you're curious what the full workflow looks like then check it out!

Suggestions for Reproducible Bioinformatic Analyses

2019-08-09T00:00:00-04:00

Bioinformatic analyses often require lengthy workflows or pipelines, where the output of program A feeds into program B, and so on. These programs may also not output their results in a format which is convenient to use in the subsequent steps, requiring writing a conversion script, or piping their output through yet another program. This means that something as simple as running a differential expression experiment still requires several steps. If you aren't careful this can result in an incredibly messy filesystem. Worse, you may not remember which programs or scripts were run on each file, and with which options. This is a huge issue out there and likely a good reason why it's so hard to reproduce results even when the same underlying data is used. Additionally, you'll inevitably need to spend time doing iterative analysis. This also needs to be documented and reproducible.

In this post I'll be explaining a few methods that we can use to organize this situation before it drives you or your coworkers mad. Which ones you need depends, of course, on the level of automation and reproducibility required of the workflow.

Suggestion 1: Interactive Terminal Sessions Are For Development Only

There is most definitely a time and a place for testing things out in your terminal. When you're learning to use a new program, needing to check the --help or man pages, figuring out how to glue together programs A and B, etc. However, in-depth analyses for publication should not be done in this manner.

This is because after running your analysis you may have absolutely no record of what was run! Of course, some (but, critically, not all!) programs will export a log file. You can quickly end up in a situation where you have no idea which script was run on which file. So, reserve the interactive terminal sessions for those use cases above.
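If you do find yourself driving tools by hand, one low-effort safeguard is to run them through a wrapper that writes down what was run. This is a minimal Python sketch of that idea. The function name run_logged and the log format are my own invention, and the temporary directory is only there to keep the example self-contained.

```python
import datetime
import os
import shlex
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Run a command and append a timestamped record of it to a log file."""
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "a") as log:
        log.write(f"{started}\t{shlex.join(cmd)}\texit={result.returncode}\n")
    return result


# Example: any command works; here we just call the current Python interpreter.
log_file = os.path.join(tempfile.mkdtemp(), "commands.log")
result = run_logged([sys.executable, "-c", "print('hello')"], log_file)
print(result.stdout.strip())
print(open(log_file).read().strip())
```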

Suggestion 2: Interactive Data Manipulation Should Be Performed in R or Jupyter Notebooks

Don't use Excel. I repeat, don't use Excel.

Ok, Excel has its uses. However, if you're doing complex data analysis it's very easy to reach a scale where you'll quickly regret using Excel. Luckily the entire R language (https://www.r-project.org/) was designed for this, and Python with Pandas provides some similar tools. In addition to scale, you also have no real record of what was done in an Excel workbook. When you combine R or Python with computational notebooks you can run code, and see the direct output of that code below it. This tracks everything that you've run and its outputs.

Even though I do most of my interactive analysis and figure-making in R, I still prefer Jupyter Notebooks over R Notebooks. This is because they're more widely used, and Jupyter is extensible to multiple languages. Installing the R Kernel is very simple.

Suggestion 3: Single-run Pipelines Should be Automated With Shell Scripts

When you write a one-off pipeline it should still be automated with a script. This enables reproducibility. In a perfect world you'd list the version of each piece of software in the pipeline as well. This could result in a single shell script file, or separate ones for each step, since you may not know the next step a priori. These shell scripts should clearly indicate the date of creation and the script's purpose. This is a simple example for one step in a single-use pipeline:

#!/usr/bin/env bash

# Author: Jeffrey Grover
# Date: 2019-07-24
# Purpose: Extract reads over small RNA loci groups with bedtools intersect

# Use bedtools intersect and pipe to bam2fq

# Use $HOME rather than ~ here: a tilde inside quotes is not expanded
align_dir="$HOME/large_data/2019-06-28_aligned_reads"

for bed_file in ./srna_groups/*.bed; do

    bed_filename=$(basename "$bed_file")
    out_dir=${bed_filename%.bed}_reads
    mkdir -p "./$out_dir"

    for bamfile in "${align_dir}"/*.bam; do

        bamfile_name=$(basename "$bamfile")

        bedtools intersect \
                -ubam \
                -a "$bamfile" \
                -b "$bed_file" \
            | samtools bam2fq -n - > "./$out_dir/$bamfile_name.fq"
    done

    pigz -p 10 "./$out_dir"/*.fq
done

I work with a lot of small RNA sequencing, and I recently needed to extract reads from several different groups of small RNA loci I'd defined. It's relatively simple to use bedtools intersect with your interesting loci as a .bed file and pipe that output to samtools bam2fq. This isn't a standard analysis for me and it's not very long. Still, writing a quick shell script like this is the way to go to keep it reproducible. The comment lines also carry enough information to tell someone what it does.
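The header comments record the date and purpose, but not the software versions mentioned earlier. Here is a small Python sketch of one way to capture them. It assumes each tool answers to a --version flag (bedtools, samtools, and pigz do, as far as I know), and the function name tool_version is hypothetical.

```python
import shutil
import subprocess


def tool_version(tool, flag="--version"):
    """Return the first line a tool prints for its version flag."""
    if shutil.which(tool) is None:
        return "not found on PATH"
    result = subprocess.run([tool, flag], capture_output=True, text=True)
    # Some tools print their version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


# Print comment lines that can be pasted into the script header.
for tool in ("bedtools", "samtools", "pigz"):
    print(f"# {tool}: {tool_version(tool)}")
```

Pasting the printed lines under the Purpose comment keeps the version record in the same file as the commands.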

Suggestion 4: Long Pipelines Should Have a W i d e Directory Structure

What does this mean? It means this:

[groverj3@x1-carbon wgbs_snakemake]$ ls
1_fastqc_raw                4_methyldackel_mbias    config.yaml  README.md         Snakefile
2_trim_galore               5_methyldackel_extract  input_data   reference_genome  temp_data
3_aligned_sorted_markdupes  6_mosdepth              LICENSE      scripts

and not this:

wgbs_snakemake/1_fastqc_raw/2_trim_galore/3_aligned_sorted_markdupes/4_methyldackel_mbias/5_methyldackel_extract/6_mosdepth

This makes navigating your directory structure much less of a pain. Especially when a pipeline is several steps long.
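Setting this layout up front is cheap. A minimal Python sketch, using the step names from the listing above (the function name make_wide_layout is hypothetical, and the temporary directory just keeps the example from touching a real project):

```python
import tempfile
from pathlib import Path

# Step names taken from the directory listing above, in pipeline order.
STEPS = [
    "fastqc_raw",
    "trim_galore",
    "aligned_sorted_markdupes",
    "methyldackel_mbias",
    "methyldackel_extract",
    "mosdepth",
]


def make_wide_layout(project_dir, steps):
    """Create one numbered directory per step, all at the same level."""
    project_dir = Path(project_dir)
    created = []
    for number, step in enumerate(steps, start=1):
        step_dir = project_dir / f"{number}_{step}"
        step_dir.mkdir(parents=True, exist_ok=True)
        created.append(step_dir)
    return created


project = Path(tempfile.mkdtemp()) / "wgbs_snakemake"
dirs = make_wide_layout(project, STEPS)
print([d.name for d in dirs])
```

The numeric prefixes keep the directories sorted in run order when you ls the project.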

Suggestion 5: Automate Often-run Pipelines With Workflow Managers

If there is a particular pipeline that you run frequently then consider using a workflow manager. Options include Snakemake and Nextflow, among others.

My vote goes to Snakemake, with Nextflow as a close second. These tools require some fiddling to transfer an existing pipeline over to their framework, but what you gain is reproducibility and automation. Additionally, they all utilize threading with parallel steps better than your Bash script does. They also work with HPC job submission frameworks (SLURM, PBS, etc.) and containers.

Writing these workflows is beyond the scope of this article, but definitely worth writing in detail about in a future one!

A word of caution: it's easy to think, "Oh, I'm only going to analyze bisulfite sequencing this one time" only to find yourself running your workflow several times as you acquire more data. There are also some freely available workflows already written that you can check out!

(Shameless plug for mine)

Suggestion 6: Containerize!

Wrap up your workflow and its required software in a container for the ultimate write-once, run-anywhere solution. You can make a Docker container with your entire workflow which can then be used on your server, or in cloud computing. However, in order to run this in an HPC environment you'll need to run it through Singularity instead. That's fine though! Singularity can run Docker containers, and you'll already have one to use for cloud compute if needed.

Wrapping up

Hopefully you've found this informative and helpful. Next time I'll be back with more practical examples.

Efficiently Filtering While Reading Data Into R (With Python?!)

2019-07-17T00:00:00-04:00

Working with large amounts of tabular data is a daily occurrence for both bioinformaticians and data scientists. There's a lot the two groups can learn from each other (great future post material). However, I recently ran into a situation that I was sure had to be relatively common. Apparently it wasn't, and I had very little luck checking for a solution in my usual genomics/bioinformatics circles, as well as the data science material I had on-hand.

The Problem

I recently received output from a large BLAST search. Something on the order of 200,000 queries. Some of those queries had many thousands of hits. This is because the search was completed with minimal filtering. The idea being, it's easy to post-filter it, but you can't get the hits back that are thrown away. Fair enough. It was also split between 100+ files. The files were also output in BLAST's "format 7" (run with -outfmt 7). This means it's tabular (.tsv) with comment lines throughout. This collection of files was actually too big to load into R (where I do most exploratory data analysis) and then filter. So, I figured there had to be a way to combine loading and filtering in a satisfactory way. Of course, you could also pre-filter it with awk or Python line-by-line and write it out to the hard drive, but this problem interested me.

TLDR

If you have to load AND filter you should use the lesser-known readr function read_delim_chunked() (and its derivatives, read_{tsv|csv|table}_chunked()) or write a parser in Python and translate the resulting object (list of lists, dictionary of lists, or Pandas dataframe) to R with reticulate. The reason behind this is that iterating through a file and filtering line-by-line, while a seemingly common thing to do, is horrifically slow in R as far as I can tell. I'm happy to eat my words if other useRs can prove I'm wrong.

Attempt #1: Writing a Line-by-line Parser in R

I read all the warnings. They say that R is slow. But is this really true? I frequently read pretty large files into R with readr or data.table, and they're wicked fast. What I failed to immediately realize is that these packages are fast because they're written in C/C++ and are effectively compiled programs that interact with R through its API.

I decided I would test this on one of the smaller files (~2.6 GB) and on a subset made of its first 10,000 lines. The tsv files have comment lines denoted with '#' and 12 columns which vary between character, float, and integer:

# Example data
input_file <- 'test_blast_fmt7.out'
system('head -n 10000 test_blast_fmt7.out > test_blast_fmt7.head10000.out')
input_file_head <- 'test_blast_fmt7.head10000.out'
col_names <- c('query', 'subject', 'identity', 'align_length',
               'mismatches', 'gap_opens', 'q_start', 'q_end',
               's_start', 's_end', 'evalue', 'bit_score')

I first attempted to solve this problem by writing the following function (don't do this).

read_filter_blast7_lbl_base <- function (input_file, header, min_perc_id, min_al_len, max_evalue) {
  # Initialize a line counter
  i = 1

  # Open a file connection, yield one line at a time
  out_list <- list()
  conn <- file(input_file, open = 'r')
  while (length(n_line <- readLines(conn, n = 1, warn = FALSE)) > 0) {

    # Split the line at the separator to yield a list, turn into a vector
    line <- unlist(strsplit(n_line, '\t'))

    # Skip comment lines
    if (!startsWith(line[1], '#')) {
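      # Include filtering conditions here in this if statement
      # Convert the fields to numeric first, otherwise R compares them as strings
      if (as.numeric(line[3]) >= min_perc_id && as.numeric(line[4]) >= min_al_len && as.numeric(line[11]) <= max_evalue) {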
        out_list[[i]] <- line
      }
    }
    # Count lines
    i = i + 1
  }
  close(conn)

  # Bind the lines as a data frame, don't convert strings to factors
  out_df <- as.data.frame(do.call(rbind, out_list), stringsAsFactors=FALSE)
  colnames(out_df) <- header

  # Set the column classes
  for (i in header[3:12]) {
    out_df[, i] <- as.numeric(out_df[, i])
  }

  # The out_df object will include filtered data, with columns 3-12 converted to numeric
  return(out_df)
}

Wow! What an abomination. I'm not sure if this says more about R's unsuitability to this kind of task or my obvious "Python-think" that's seeping in here. It's a mess. And it's slow. I tried testing on the first 10,000 lines of one of the files:

library(microbenchmark)

# Benchmark 100 iterations (default) over the first 10000 lines
> microbenchmark(read_filter_blast7_lbl_base(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                         expr
 read_filter_blast7_lbl_base(input_file_head, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq    max neval
 338.4403 346.2422 351.0633 349.2319 352.5961 404.62   100

It works, but this doesn't scale up to a full file (it sat for ages until I killed it), and it suffers from R's problems with growing lists in a loop leading to copying rather than appending. There are clearly other issues too because my attempts to pre-allocate a list or data frame of the correct size did not speed it up. This means that I might be doing something wrong. Regardless, this is too much work to do something so simple. I welcome others to find a pure base R implementation that's better. It seems like there should be a way to do it.

However, there are better options.

Attempt #2: Using sqldf to Filter a Temporary sqlite Database

The internet led me to believe that this isn't really something people do in R, and that if you can't load it all into memory then you should use a database and query that. It seems excessive, but the sqldf R package can do this. It even includes a function to create the DB on the fly while reading (read.csv.sql()). I totally understand using SQL or similar to query a DB when you have a reason to query it often and it's stored as an SQL DB. I question the wisdom of this suggestion for this purpose, though. I was able to use it as follows:

read_filter_blast7_sqldf <- function(input_file, header, min_perc_id, min_al_len, max_evalue) {
  temp_df <- read.csv.sql(
    file = input_file,
    filter = "sed -e '/^#/d'",
    sql = paste0('SELECT * FROM file WHERE V3 >= ', min_perc_id,
                 ' AND V4 >= ', min_al_len, ' AND V11 <= ', max_evalue),
    header = FALSE,
    sep = '\t',
    colClasses = c(rep('character', 2), 'numeric', rep('integer', 7), rep('numeric', 2))
  )
  colnames(temp_df) <- header
  return(temp_df)
}

One of the key limitations here is that you need to pipe through a shell command (sed) to remove comment lines. Not the biggest deal, but having to write a sed command does take you out of your flow in an R or Jupyter notebook. Let's see how that performs:

# Benchmark 100 iterations (default) over the first 10000 lines
> microbenchmark(read_filter_blast7_sqldf(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                      expr
 read_filter_blast7_sqldf(input_file_head, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq      max neval
 72.27077 73.30782 93.27967 79.38143 103.4807 199.8507   100

What's going on here? The max is much slower than the min. This is because the first iteration reads the file into a temporary database! This will take even more time for a larger file, and that temporary database will be the size of the full, unfiltered file. Usually you load each file once, and each file needs its own temp sqlite DB. So, the max time is actually the only timing that matters! Plus, it has the problem of filling up your /tmp directory. What happens when I try to load the smallest whole file (2.6 GB)?

# Benchmark the whole file once
> microbenchmark(read_filter_blast7_sqldf(input_file, col_names, 80, 432, 1e-50), times = 1)
Unit: seconds
                                                            expr      min
 read_filter_blast7_sqldf(input_file, col_names, 80, 432, 1e-50) 165.4224
       lq     mean   median       uq      max neval
 165.4224 165.4224 165.4224 165.4224 165.4224     1

That's decent performance, but because it makes temporary files it will fill up your /tmp directory:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
dev             7.8G     0  7.8G   0% /dev
run             7.8G  1.4M  7.8G   1% /run
/dev/nvme0n1p2  423G   34G  368G   9% /
tmpfs           7.8G  170M  7.6G   3% /dev/shm
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
tmpfs           7.8G  6.9G  913M  89% /tmp
/dev/nvme0n1p1  300M  348K  300M   1% /boot/efi
tmpfs           1.6G   32K  1.6G   1% /run/user/1000

And running it on a second file fails:

# Benchmark the whole file once
> microbenchmark(read_filter_blast7_sqldf(input_file, col_names, 80, 432, 1e-50), times = 1)
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : 
  RS_sqlite_import: database or disk is full
In addition: Warning message:
In .Internal(gc(verbose, reset, full)) :
  closing unused connection 4 (test_blast_fmt7.out)
Error in result_create(conn@ptr, statement) : 
  cannot rollback - no transaction is active

I then had to sudo rm -r /tmp/Rtmp* because /tmp was full.

On a system with tons of space it could be fine. I'm running this on my laptop during a flight so that doesn't help. You could also specify where those databases are made. However, the largest file I needed to work with is 30 GB and there were several, and this exact problem also happened on our lab's server. (Note to self: Ask my PI to upgrade the root drive).

Still not a great solution.

Attempt #3: readr read_delim_chunked()

This function isn't well-documented, but is the fastest option I found. It's not quite the line-by-line implementation I thought up, but it's similar. Basically, it will use readr and a function (user-definable) to bind together dataframes which are read in chunks. Getting the best performance would require optimizing the chunk size to the largest you can reasonably handle in memory. I stuck with 10,000 because I was comparing to other options.

library(readr)

read_filter_blast7_readr_chunked <- function (input_file, header, min_perc_id, min_al_len, max_evalue) {

  # Callback applied to each chunk: keep only the rows passing all three filters
  f <- function(x, pos) subset(x, identity >= min_perc_id & align_length >= min_al_len & evalue <= max_evalue)
  out_df <- read_tsv_chunked(input_file, chunk_size = 10000, col_names = header, comment = '#', callback = DataFrameCallback$new(f))
  return(out_df)
}

Benchmarking results in some very very solid performance:

# Benchmark 100 iterations (default) over the first 10000 lines
> microbenchmark(read_filter_blast7_readr_chunked(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                              expr
 read_filter_blast7_readr_chunked(input_file_head, col_names,      80, 432, 1e-50)
      min       lq     mean   median       uq      max neval
 28.90464 29.10356 30.87358 29.92421 31.28273 92.10364   100

That's not too surprising, since it's basically just reading in the whole file at once and readr is fast. So, how's it work on the full file?

# Benchmark the whole file once
> microbenchmark(read_filter_blast7_readr_chunked(input_file, col_names, 80, 432, 1e-50), times = 1)
Unit: seconds
                                                                         expr
 read_filter_blast7_readr_chunked(input_file, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq      max neval
 76.59867 76.59867 76.59867 76.59867 76.59867 76.59867     1

Really good performance. You can tune it further as well by adjusting the chunk size. This is probably your best bet without getting too weird. But let's get weird ;)
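For anyone who wants to see the chunk, filter, and bind pattern spelled out, here is a rough Python sketch of the same idea using only the standard library. It is not what readr does internally, just the general technique. The function name and the second row of the tiny example file are made up; the first row is taken from the results shown later in this post.

```python
import csv
import itertools
import os
import tempfile


def read_filter_chunked(path, min_perc_id, min_al_len, max_evalue, chunk_size=10000):
    """Read a BLAST format 7 file in chunks, keeping rows that pass the filters."""
    kept = []
    with open(path, newline="") as handle:
        # Drop comment lines lazily, then parse the rest as tab-separated rows.
        data_lines = (line for line in handle if not line.startswith("#"))
        reader = csv.reader(data_lines, delimiter="\t")
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            # Filter this chunk, then append the survivors to the running result.
            kept.extend(
                row for row in chunk
                if len(row) >= 11
                and float(row[2]) >= min_perc_id
                and int(row[3]) >= min_al_len
                and float(row[10]) <= max_evalue
            )
    return kept


# Tiny example file: one comment line, one passing row, one made-up failing row.
example = (
    "# comment line\n"
    "query_1\tscaffold88\t100.000\t869\t0\t0\t1\t869\t733052\t733920\t0.0\t1605\n"
    "query_2\tscaffold1\t75.000\t120\t30\t2\t1\t120\t100\t219\t1e-10\t98.7\n"
)
example_path = os.path.join(tempfile.mkdtemp(), "example_fmt7.out")
with open(example_path, "w") as out:
    out.write(example)

rows = read_filter_chunked(example_path, 80, 432, 1e-50, chunk_size=1)
print(rows)
```

Only one chunk is held in memory at a time alongside the rows that survive, which is why this works well when most of the file fails the filter.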

Attempt #4: Parse With Python Translate to R With reticulate

Let's do this in Python! Sort of...

Writing a function to read and filter something in a for loop is a common thing for me in python. I usually don't use Python for exploratory analysis though and am less familiar with Pandas et al. than I am with R's ecosystem. However, I was able to figure out that it will automatically turn a list of lists into a dataframe, which is pretty neat. A solid win over R's functionality:

import pandas as pd

def filter_blast7(input_blast_results, header, min_perc_id, min_al_len,
                  max_evalue):
    df_list = []
    with open(input_blast_results, 'r') as input_handle:
        for line in input_handle:
            if not line.startswith('#'):
                entry = line.strip().split()
                perc_id = float(entry[2])
                al_len = int(entry[3])
                evalue = float(entry[10])
                if (perc_id >= min_perc_id and al_len >= min_al_len
                        and evalue <= max_evalue):
                    df_list.append(entry)
    return pd.DataFrame(df_list, columns=header)

This function, when tested in python returns a data frame as expected:

>>> filter_blast7(input_file_head, col_names, 80, 432, 1e-50).head()
   query     subject    identity align_length mismatches  ... q_end s_start s_end  evalue bit_score
0  query_1  scaffold88  100.000          869          0  ...   869  733052  733920    0.0      1605
1  query_1  scaffold88   95.732          867         34  ...   869  734435  735298    0.0      1393
2  query_1   scaffold1  100.000          869          0  ...   869    4053    4921    0.0      1605
3  query_1   scaffold1   95.732          867         34  ...   869    5436    6299    0.0      1393
4  query_3  scaffold88  100.000          786          0  ...   786  735004  735789    0.0      1452

Nice, but I thought we were using R? Well, we can actually use the R package reticulate to run this python code and translate its output to an R data frame. Pretty neat! You can get it from CRAN with install.packages('reticulate'). With it installed you'll want to make sure it can find your python installation:

> library(reticulate)
> py_discover_config()
python:         /home/groverj3/.pyenv/shims/python
libpython:      /home/groverj3/.pyenv/versions/3.7.2/lib/libpython3.7m.so
pythonhome:     /home/groverj3/.pyenv/versions/3.7.2:/home/groverj3/.pyenv/versions/3.7.2
version:        3.7.2 (default, Mar 17 2019, 02:15:50)  [GCC 8.2.1 20181127]
numpy:          /home/groverj3/.local/lib/python3.7/site-packages/numpy
numpy_version:  1.16.4

python versions found: 
 /home/groverj3/.pyenv/shims/python
 /usr/bin/python
 /usr/bin/python3
> py_available()
[1] FALSE
> py_available(initialize = TRUE)
[1] TRUE

Even though it found my Python installation, py_available() returns FALSE? I actually forgot to initialize it with use_python(), but you can either do that or use py_available(initialize = TRUE). You also must have the Python shared library installed (which it is not by default when using pyenv). Now, you can either source the Python parsing function from a saved .py script, or run it inline as follows, and coerce it to an R function:

read_filter_blast7_lbl_py <- py_run_string(
"import pandas as pd

def filter_blast7(input_blast_results, header, min_perc_id, min_al_len,
                   max_evalue):
    df_list = []
    with open(input_blast_results, 'r') as input_handle:
        for line in input_handle:
            if not line.startswith('#'):
                entry = line.strip().split()
                perc_id = float(entry[2])
                al_len = int(entry[3])
                evalue = float(entry[10])
                if (
                    perc_id >= min_perc_id
                    and al_len >= min_al_len
                    and evalue <= max_evalue
                ):
                    df_list.append(entry)
    return pd.DataFrame(df_list, columns=header)"
)$filter_blast7

Which shows up as a "python.builtin.function."

class(read_filter_blast7_lbl_py)
[1] "python.builtin.function" "python.builtin.object"

Now, let's benchmark that:

> # Benchmark 100 iterations over the first 10000 lines
> microbenchmark(read_filter_blast7_lbl_py(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                       expr
 read_filter_blast7_lbl_py(input_file_head, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq      max neval
 60.73332 62.57396 67.58808 63.75887 69.98544 105.8659   100
> 
> # Benchmark the whole file once
> microbenchmark(read_filter_blast7_lbl_py(input_file, col_names, 80, 432, 1e-50), times = 1)
Unit: seconds
                                                             expr      min
 read_filter_blast7_lbl_py(input_file, col_names, 80, 432, 1e-50) 139.1212
       lq     mean   median       uq      max neval
 139.1212 139.1212 139.1212 139.1212 139.1212     1
>

It also returns a data frame identical to the python one:

head(read_filter_blast7_lbl_py(input_file_head, col_names, 80, 432, 1e-50))
  query    subject    identity align_length mismatches
1 query_1 scaffold88  100.000          869          0
2 query_1 scaffold88   95.732          867         34
3 query_1  scaffold1  100.000          869          0
4 query_1  scaffold1   95.732          867         34
5 query_3 scaffold88  100.000          786          0
6 query_3  scaffold1  100.000          786          0
  gap_opens q_start q_end s_start  s_end evalue bit_score
1         0       1   869  733052 733920    0.0      1605
2         1       3   869  734435 735298    0.0      1393
3         0       1   869    4053   4921    0.0      1605
4         1       3   869    5436   6299    0.0      1393
5         0       1   786  735004 735789    0.0      1452
6         0       1   786    6005   6790    0.0      1452

It's definitely not as fast as reading it chunked with readr. However, that Python function was easy to write and performed far better than the base R way to read and filter line-by-line. In the future, if I have something that I know how to do in Python I may not try to translate it to R. Just run the python code with reticulate!

Wrapping Things Up

You would think that filtering huge datasets while reading would be a commonly done thing. Apparently, it's not common enough in R for good documentation to exist. In the end readr wins again with read_delim_chunked(). This does require some tuning to get the best performance, but if you pick some sane chunk_size then further optimization is probably unnecessary and it will work fine. However, the revelation that communicating between Python and R works so well opens up a lot of future possibilities! Some things are just better suited to one language or another. While Python has pandas dataframes and a large ecosystem around them, none of it is as intuitive as the tidyverse (to me). Something like the two winners here is ideal for situations where you have a huge file to load, but you know that most of the file will not meet your filtering criteria.

Using Python to do general purpose programming and communicating those results to R for statistical testing and visualization is definitely something to consider.

Addendum: What's the Overhead of Filtering While Reading?

Let's check it out, first the readr method:

> microbenchmark(read_tsv(input_file, col_names=col_names, comment='#'), times = 1)
Parsed with column specification:
cols(
  query = col_character(),
  subject = col_character(),
  identity = col_double(),
  align_length = col_double(),
  mismatches = col_double(),
  gap_opens = col_double(),
  q_start = col_double(),
  q_end = col_double(),
  s_start = col_double(),
  s_end = col_double(),
  evalue = col_double(),
  bit_score = col_double()
)
|=================================================================| 100% 2685 MB
Unit: seconds
                                                       expr      min       lq
 read_tsv(input_file, col_names = col_names, comment = "#") 64.35369 64.35369
     mean   median       uq      max neval
 64.35369 64.35369 64.35369 64.35369     1

This is compared with 76.59867 s for filtering while reading, so filtering adds about 12 seconds. That's a fairly small difference. How about reading it in with Python using Pandas read_csv:

> read_csv_py <- py_run_string(
"def read_blast7(input_blast_results, header):
    return pd.read_csv(input_blast_results, sep='\t', comment='#', header=None, names=header)"
)$read_blast7

> # Benchmark the whole file once
> microbenchmark(read_csv_py(input_file, col_names), times = 1)
Unit: seconds
                               expr      min       lq     mean   median
 read_csv_py(input_file, col_names) 97.45004 97.45004 97.45004 97.45004
       uq      max neval
 97.45004 97.45004     1

There's clearly a bit of overhead with both of these methods, but it's fairly minor: roughly 12 seconds for readr and 42 seconds for the Python parser on a 2.6 GB file.
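For the record, here is the arithmetic behind that comparison as a short Python snippet. The timings are copied from the single-run benchmarks above (in seconds), so treat the percentages as rough. Also note the Python pair compares pandas read_csv against the line-by-line parser, which is not a like-for-like comparison.

```python
# Whole-file timings in seconds, copied from the benchmarks in this post.
timings = {
    "readr": {"read_only": 64.35369, "read_and_filter": 76.59867},
    "python": {"read_only": 97.45004, "read_and_filter": 139.1212},
}


def overhead(read_only, read_and_filter):
    """Return the extra seconds and the percent increase over a plain read."""
    extra = read_and_filter - read_only
    return extra, 100 * extra / read_only


for method, t in timings.items():
    extra, percent = overhead(t["read_only"], t["read_and_filter"])
    print(f"{method}: +{extra:.1f} s ({percent:.0f}% over a plain read)")
# readr: +12.2 s (19% over a plain read)
# python: +41.7 s (43% over a plain read)
```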

Variations on RNAseq Workflows for DEG Analysis

2019-07-09T00:00:00-04:00

When analyzing RNAseq you're faced with many possible analysis pipelines. The biggest decision you need to make is what the purpose of your experiment is. I will make the assumption that most of the time people want to determine which genes are differentially expressed between two samples, genotypes, conditions, etc. In DEG analysis you are interested in gene-level expression. This means you are not interested in differential isoforms/transcripts or alternative splicing. The simplest version of this is having control and experimental samples (preferably with >= 3 biological replicates each). However, this isn't as straightforward as firing up your favorite aligner and going to town on the data. There are other considerations.

I Have a High Quality Annotated Reference Genome or Transcriptome

My Reference Genome is High Quality

Align reads to reference genome (STAR, HISAT2)
Count reads per gene (HTSeq-count, summarizeOverlaps, featureCounts)
DEG Analysis (DESeq2, edgeR)

This is the standard workflow that you're probably accustomed to. Note: it is very important to use a modern splicing-aware aligner. Do not use bowtie. Both STAR and HISAT2 are very fast compared to older aligners and are designed for RNAseq. Their default options are generally appropriate for most simple experimental designs. As a bonus, STAR can actually do the read counting (step 2) itself, although the output format is kind of clunky.

This workflow is a good general purpose one in model organisms, and nobody will fault you for using it there. However, there are potentially better options.

My Annotation/transcriptome is High Quality

Pseudoalignment-based abundance estimation (Salmon, Kallisto)
Aggregate abundances per gene from transcripts (tximport)
DEG Analysis (DESeq2, edgeR)

This workflow may actually be better (ref) even if you have a reference genome. I've always assumed that reference-genome alignment is superior when you have a good reference, but apparently this is not necessarily the case for the reasons detailed here.

Pros: very fast, potentially more accurate.

Cons: no .bam file is generated so you can't look at positional information from your reads, no ability to discover new transcripts later from your alignments.

Either of these workflows will work fine in this situation, and the better your genome is the more closely the first will likely approximate the second. Though, I now believe that the second workflow should be the standard if your goal is purely DEG analysis. There are still a lot of good reasons to want a .bam file, and nothing is stopping you from aligning your reads anyway for future use.

My Genome/Transcriptome is Incomplete

In this case you have some decisions to make, yet again.

Genome is Good but Annotations Are Poor

Align to reference genome (STAR, HISAT2)
Assemble transcripts, genome-guided (StringTie)
Aggregate abundances per gene from transcripts (tximport)
DEG Analysis (DESeq2, edgeR)

Another option here is to use a tool like PASA to update the existing annotations if they exist. I've run that pipeline. It's very quirky, a pain to get running, and if you don't need genomic coordinates I'd avoid it. You could also use Salmon/Kallisto with StringTie's transcripts, without using its quantification, but this seems to be an unnecessary step.

Genome and Transcriptome Are Poor

Assemble transcriptome (Trinity)
Pseudoalignment-based abundance estimation (Salmon, Kallisto)
Aggregate abundances per gene from transcripts (tximport)
DEG Analysis (DESeq2, edgeR)

In this case you're going to want to do a thorough de novo transcriptome assembly using something like Trinity. This transcriptome can then be used for pseudoalignment-based abundance estimation and then DEGs can be determined after aggregation of isoform abundances. Trinity can be quite a resource hog, so you're going to want to get more RAM.
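To summarize the decision tree so far, here is a small Python sketch that maps the state of your genome and annotation to the workflow steps listed above. The function name and its two boolean arguments are just my own framing of the sections in this post. When the annotation is good it returns the pseudoalignment workflow, since that's the one I'd now favor for pure DEG analysis.

```python
def suggest_deg_workflow(genome_good, annotation_good):
    """Map the situations described above to their workflow steps."""
    quantify_and_test = [
        "Aggregate abundances per gene from transcripts (tximport)",
        "DEG Analysis (DESeq2, edgeR)",
    ]
    if annotation_good:
        # Good annotation/transcriptome: pseudoalignment workflow.
        return [
            "Pseudoalignment-based abundance estimation (Salmon, Kallisto)",
        ] + quantify_and_test
    if genome_good:
        # Good genome, poor annotations: genome-guided assembly first.
        return [
            "Align to reference genome (STAR, HISAT2)",
            "Assemble transcripts, genome-guided (StringTie)",
        ] + quantify_and_test
    # Neither is good: de novo assembly, then pseudoalignment.
    return [
        "Assemble transcriptome (Trinity)",
        "Pseudoalignment-based abundance estimation (Salmon, Kallisto)",
    ] + quantify_and_test


for step in suggest_deg_workflow(genome_good=True, annotation_good=False):
    print(step)
```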

Why Not Cufflinks/Stringtie For Transcript Assembly In Model Organisms?

First of all, don't use Cufflinks. Stringtie is essentially a more modern Cufflinks that's faster and more accurate. Secondly, if you're working in a well annotated genome chances are that "novel transcripts" you find are more likely noise, or not biologically meaningful (unless you know better for your use-case!).

Concluding Thoughts

The paper detailing that transcript abundances, when aggregated to gene level, improve DEG analysis is particularly interesting. This makes me rethink my usual assumption and I now believe that tools like Salmon or Kallisto should be the go-to tools for DEG analysis when you have a good transcriptome to work with.

However, I still think it's worthwhile to align your reads and generate a .bam file. There are many types of visualizations and comparisons that you simply can't do without one, for example calculating coverage over features of interest. If you must compare expression of genes across multiple samples or from different experiments then you'll probably want to convert your expression values to some normalized measurement. In this case you can use FPKM or TPM, though the consensus seems to be that TPM is the way to go these days.
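For reference, here's a hedged Python sketch of the usual FPKM and TPM formulas. The counts and lengths are invented toy values, and the function names fpkm and tpm are hypothetical helpers rather than functions from any package.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of feature per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate * 1e6 / total_rate for rate in rates]


# Toy example: three genes with made-up counts and lengths (in base pairs).
toy_counts = [100, 200, 300]
toy_lengths = [1000, 2000, 3000]
fpkm_values = fpkm(toy_counts, toy_lengths)
tpm_values = tpm(toy_counts, toy_lengths)
print(sum(fpkm_values), sum(tpm_values))
```

The practical difference is that TPM values sum to one million in every sample, while the FPKM total changes from sample to sample, which is one reason TPM is easier to compare across samples.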

And, at the end of the day you know that an out-of-date collaborator is probably going to ask you for FPKM measurements or something anyway.

Making Better Metaplots With ggplot, Part 2

2019-06-28T00:00:00-04:00

Last time we prepared our data using Deeptools.

Now we're going to do something kind of scandalous. R and python, living together in peace. What is this madness? I like R's ecosystem for manipulating data and plotting with the tidyverse. It still requires some tweaking, but with a bit of a time investment you can have publication-ready vector images in only a few lines of code. It's great for genomics data as well. Some out there may prefer matplotlib in python, and it is powerful, but I find it kind of tedious to use without adding another package on-top like Seaborn.

Genomics and data science belong together just like python and R!

1. Investigate Deeptools' Metaplot Output

Your first task with any new data is just to see what it looks like. In a terminal my initial instinct is always to call:

head ${filename}

But you can also just open the Deeptools metaplot table in your text editor of choice. What you'll find is:

bin labels    -1.0Kb    ...    start    ...    end    ...    1.0Kb
bins        1.0 2.0 3.0 ...
sample_name genes score score score ...
sample_name genes score score score ...

This is a tab-delimited table of bin labels, bin numbers, and scores (the data to plot) for each of those bins. It's a rather odd format because it's horizontal, rather than the long format that would be more convenient. We also have a label "genes" in position 2 of the same row as the score data. The bin labels row only has 4 values in the whole row: the --upstream, --startLabel, --endLabel, and --downstream values from the computeMatrix step. We can work with this but it's a bit unwieldy.

2. Load Data Into R

Before getting started here make sure you have the tidyverse packages installed:

install.packages('tidyverse')

There are no built-in functions to read a "transposed tsv" file like this, but with a little googling this turned out to not be so bad. My original thought was to read it as a standard .tsv file with read_tsv() or base read.csv() and transpose with t() but these didn't like the data. This is because of the need to keep that first row exactly as it appears, despite most of it technically being empty, we'll need the blank labels later. So, from that stackoverflow post I was able to edit a few things:

read_deeptools_table <- function(file) {

  n <- max(count.fields(file, sep = '\t'), na.rm = TRUE)
  x <- readLines(file)

  .splitvar <- function(x, sep, n) {
    var <- unlist(strsplit(x, split = sep))
    length(var) <- n
    return(var)
  }

  x <- do.call(cbind, lapply(x, .splitvar, sep = '\t', n = n))
  x <- apply(x, 1, paste, collapse = '\t')
  plot_table <- na.omit(read.csv(text = x, sep = '\t')[-1,])  # Remove first row with "gene" label

  return(plot_table)
}

Essentially, this function is finding the length of the lines, reading the lines as character vectors, splitting the vectors by the tab character, and creating a new table from the vectors. Reading the data with this gives us a nice dataframe. From here on I will be using the tidyverse packages so feel free to load them with library('tidyverse').

> table_test <- read_deeptools_table('metaplot.tab')
> as_tibble(table_test)
# A tibble: 600 x 4
   bin.labels  bins sample_1            sample_2
   <fct>      <dbl> <fct>               <fct>
 1 -1.0Kb         1 0.7382198952879583  0.008900523560209424
 2 ""             2 0.9565445026178011  0.007329842931937172
 3 ""             3 0.9879581151832458  0.008376963350785341
 4 ""             4 0.8026178010471204  0.005235602094240838
 5 ""             5 0.7968586387434556  0.0031413612565445023
 6 ""             6 0.593717277486911   0.005235602094240838
 7 ""             7 0.36230366492146604 0.004712041884816754
 8 ""             8 0.5392670157068064  0.0
 9 ""             9 0.9617801047120418  0.0020942408376963353
10 ""            10 1.403664921465969   0.01099476439790576
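Since this series is about R and Python getting along, here's a rough Python analogue of the pad-and-transpose idea behind read_deeptools_table(). It's a simplified sketch, not a port: it pads short rows with empty strings instead of NA and doesn't drop incomplete rows. The toy text is made up to mimic the layout described above and is not real Deeptools output; read_transposed_table is a hypothetical name.

```python
from itertools import zip_longest


def read_transposed_table(text):
    """Pad ragged tab-separated rows, transpose them, and drop the 'genes' row."""
    rows = [line.split("\t") for line in text.splitlines() if line]
    # zip_longest pads the shorter rows with "" and transposes in one step.
    columns = [list(col) for col in zip_longest(*rows, fillvalue="")]
    # columns[0] holds the row names (our header); columns[1] holds the
    # "genes" label, which we skip.
    header, records = columns[0], columns[2:]
    return header, records


# Made-up miniature table: the label row is shorter than the others.
toy = (
    "bin labels\t\t-1.0Kb\tstart\tend\n"
    "bins\t\t1.0\t2.0\t3.0\t4.0\n"
    "sample_1\tgenes\t0.7\t0.9\t0.8\t0.5\n"
)
header, records = read_transposed_table(toy)
print(header)
for record in records:
    print(record)
```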

3. Convert the Data to Long Format

A quirk of ggplot is that it really likes long format data. Where, instead of separate columns for the different samples you end up with a column of "scores" and another of "sample_id." This means that your sample ID is actually a variable and can be plotted. This results in a new data frame which concatenates the current sample columns into one, replicates bin.labels and bins as needed, and creates a new column with the sample ID for each row. The easiest way to do this is with the gather() function in tidyr:

long_table <- gather(table_test, 'sample', 'score', -bin.labels, -bins)

You can check out what this looks like as follows:

> head(long_table)
  bin.labels bins   sample              score
1     -1.0Kb    1 sample_1 0.7382198952879583
2               2 sample_1 0.9565445026178011
3               3 sample_1 0.9879581151832458
4               4 sample_1 0.8026178010471204
5               5 sample_1 0.7968586387434556
6               6 sample_1  0.593717277486911
> tail(long_table)
     bin.labels bins     sample                score
1195             595 sample_2 0.005759162303664922
1196             596 sample_2                  0.0
1197             597 sample_2 0.006282722513089005
1198             598 sample_2 0.017277486910994764
1199             599 sample_2 0.020942408376963352
1200      1.0Kb  600 sample_2 0.012565445026178013

This would be annoying to work with by hand, but ggplot2 understands it just fine.
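If the wide-to-long reshaping feels a bit magical, this small Python sketch shows what's happening to the rows. It's an illustration of the concept only, not how tidyr implements gather(); to_long and the toy values are hypothetical.

```python
def to_long(rows, id_columns, variable_name="sample", value_name="score"):
    """Stack every non-ID column into (variable, value) pairs, one row each."""
    long_rows = []
    value_columns = [column for column in rows[0] if column not in id_columns]
    for column in value_columns:  # one block of rows per sample
        for row in rows:
            record = {key: row[key] for key in id_columns}
            record[variable_name] = column
            record[value_name] = row[column]
            long_rows.append(record)
    return long_rows


# Toy wide table with made-up scores.
wide = [
    {"bins": 1, "sample_1": 0.74, "sample_2": 0.009},
    {"bins": 2, "sample_1": 0.96, "sample_2": 0.007},
]
long_rows = to_long(wide, id_columns=["bins"])
for record in long_rows:
    print(record)
```

Two bins and two samples become four rows, with the ID column repeated once per sample, which is the same pattern as the head() and tail() output above.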

4. Build the ggplot Command

Now that our data is in the right format it's time to get plotting! We'll start simple, and make it more complex from there:

plot <- ggplot(long_table, aes(x = bins, y = as.numeric(score), color = sample)) +
  geom_line() +
  scale_x_continuous(breaks = long_table$bins,
                     labels = long_table$bin.labels)

This will get us started with a simple line plot. The key here is that the x axis breaks are the bin numbers, but are labeled as the bounds, start, and end of the features. However, this also creates a major gridline at each break. Not ideal. I have some sensible plot defaults that I use often, which I have saved to a code snippet on my codeberg. I'll use these as a starting point for the theming. Another feature we may want to add is the ability to smooth the line. This can be accomplished by using:

geom_smooth(method = 'loess', se = FALSE)

This will smooth the data with loess regression. The amount of smoothing can be configured with the span = ... parameter in geom_smooth(). You'll also want to control the size of the plot when it's saved, and perhaps stretch or shrink its aspect ratio. This can also be controlled by ggplot2 using ggsave() at the end of our plotting command. We also will want to add the ability to specify the colors rather than just using the defaults. It's best to use a colorblind-friendly palette when possible. Putting this all together, our plot command becomes:

metaplot <- function(long_table, start_label, end_label, y_axis_label, span,
                     out_prefix, format, smooth, line, aspect, width, height,
                     colors) {

  start_bin <- subset(long_table, bin.labels == start_label)$bins
  end_bin <- subset(long_table, bin.labels == end_label)$bins

  plot <- ggplot(long_table, aes(x = bins, y = as.numeric(score), color = sample))
  if (smooth == TRUE) plot <- plot + geom_smooth(method = 'loess',
                                                 span = span,
                                                 se = FALSE)
  if (line == TRUE) plot <- plot + geom_line()
  plot <- plot + scale_color_manual(values = unlist(strsplit(colors, ','))) +
    scale_x_continuous(breaks = long_table$bins,
                       labels = long_table$bin.labels) +
    geom_vline(xintercept = c(start_bin, end_bin), linetype = 'dotted') +
    ylab(y_axis_label) +
    xlab('Position') +
    theme_bw(base_size = 22) +
    theme(legend.title = element_blank(),
          legend.position = 'bottom',
          legend.direction = 'horizontal',
          legend.margin = margin(0,0,0,0),
          legend.box.margin = margin(-10,-10,-10,-10),
          axis.text = element_text(color = 'black'),
          axis.ticks.x = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
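          aspect.ratio = aspect)

  # Call ggsave() on its own and pass the plot explicitly; adding ggsave()
  # to a plot with `+` fails in recent ggplot2 releases.
  ggsave(paste0(out_prefix, '.', format),
         plot = plot,
         width = width,
         height = height)
}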

With a few if statements we can actually plot both the smoothed and line plots on the same coordinate system if we want. Let's test it with some small RNA expression data from my 2018 paper over a set of genomic features:

I think we can agree that this is an improvement. This could still be improved by showing error when replicates are plotted, but it's pretty good for now.
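For anyone curious what the loess smoothing used above is doing, here's a simplified Python sketch of the idea: for each x position, fit a weighted straight line to the nearest fraction (span) of the points, with tricube weights that fall off with distance. Note the assumptions: this is a degree-1 local regression written for illustration, while R's loess() defaults to degree 2 and has more options, so the numbers won't match geom_smooth() exactly. local_linear_smooth is a hypothetical name.

```python
import math


def local_linear_smooth(xs, ys, span=0.75):
    """Smooth ys over xs with tricube-weighted local linear fits."""
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))  # points in each neighbourhood
    fitted = []
    for x0 in xs:
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        max_dist = max(abs(xs[i] - x0) for i in nearest) or 1.0
        sw = swx = swy = swxx = swxy = 0.0
        for i in nearest:
            u = abs(xs[i] - x0) / max_dist
            w = (1 - u ** 3) ** 3  # tricube weight: 1 at x0, 0 at the edge
            sw += w
            swx += w * xs[i]
            swy += w * ys[i]
            swxx += w * xs[i] ** 2
            swxy += w * xs[i] * ys[i]
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # fall back to a weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


# Sanity check on made-up data: a perfectly straight line should pass
# through the smoother essentially unchanged.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
smoothed = local_linear_smooth(xs, ys, span=0.5)
print(smoothed)
```

A larger span pulls in more neighbours for each fit, which is why increasing span = ... in geom_smooth() gives a smoother curve.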

Wrapping up

While this required a little patience, I think the results are worth it. Creating clean visualizations is necessary to get your point across. I've tidied this up a bit and pushed the full code to github. Bonus: it runs as a standalone script and works with any number of input samples! Check it out!

Making Better Metaplots With ggplot, Part 1

2019-06-27T00:00:00-04:00

Commonly, in bioinformatics we're in the business of determining whether something, be it gene expression, or DNA methylation, or splicing, etc. is different between multiple conditions. Typically this would be done by comparing those data and using some kind of statistical test. However, with the continued advances in sequencing technologies generating greater read depth, and these technologies becoming more available to researchers we can also look at genome-scale data in other ways. Testing purely on count or score data does not inform one of the positional information associated with that data.

To look at the positional context associated with genomics data we have several options. One common way is a visualization that's often referred to as a "metaplot" or "metagene plot." These plots are similar to the TSS or "peak" plots commonly used to visualize chip-seq or similar data. In a metaplot the entire length of a feature is scaled such that each feature now is composed of the same number of "bins" of data. This allows one to visualize the data associated with these features across their entire length. There are existing software packages that can make these plots without too much trouble such as Deeptools or the Genomation R library. In particular, I find Deeptools to be a great software package, and it makes some wonderful visualizations that would be a pain to make yourself. Genomation requires one to be very familiar with R since it isn't a standalone program. Deeptools is easier to use but its metaplots leave something to be desired:

I like the control I have over all plot elements and the professional look that ggplot affords. I use it for most of my data visualization needs. So, I figured, why not make something prettier with it?
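To make the "scale every feature to the same number of bins" idea concrete, here's a minimal Python sketch. It is only an illustration of the concept and not how Deeptools computes its matrices; scale_to_bins, metaprofile, and the per-base scores are all hypothetical.

```python
def scale_to_bins(scores, n_bins):
    """Average a feature's per-base scores into n_bins equal-width bins."""
    length = len(scores)
    binned = []
    for b in range(n_bins):
        start = b * length // n_bins
        end = max(start + 1, (b + 1) * length // n_bins)
        window = scores[start:end]
        binned.append(sum(window) / len(window))
    return binned


def metaprofile(features, n_bins):
    """Mean of the binned profiles across features of different lengths."""
    binned = [scale_to_bins(scores, n_bins) for scores in features]
    return [sum(column) / len(column) for column in zip(*binned)]


# Two made-up features of different lengths, scaled to two bins each.
feature_a = [2, 2, 4, 4]
feature_b = [0, 0, 0, 6, 6, 6]
profile = metaprofile([feature_a, feature_b], n_bins=2)
print(profile)
```

Once every feature has the same number of bins, averaging column by column gives one profile across all features, and that profile is what the metaplot draws.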

1. Install Required Packages

This guide will use Deeptools, a Python package with a ton of functionality that you can play around with later, and ggplot2 from the tidyverse. The tidyverse is a collection of R libraries designed by Hadley Wickham that make data science a snap. You can install them as follows in a terminal:

pip install --user deeptools

Launch the R interpreter by typing R and then:

install.packages('tidyverse')

I recommend installing them into a user-specific library by either the --user flag for pip or setting up a .Renviron file with a path to a local library. You can learn how to do that in my previous post. You're also going to need samtools. Feel free to use the package manager of your choice if conda is more your jam.

2. Generate the Data Table With Deeptools

Now that you've got the software installed you'll need to generate per-position "score" information. If this is expression data or similar you can use Deeptools again, but you should be able to use other inputs to the later steps as well. If you're using expression data, you can run Deeptools' bamCoverage tool on your .bam file. First, you need to index the alignment .bam file:

samtools index ${input_bam}

${} is the syntax for using a previously declared variable in BASH and I'll use that kind of representation throughout for places where values should be specified.

Now that you have that out of the way, your first step is to generate a coverage file in bigWig format. This is a binary format but contains similar data to a bedGraph. You can use the bamCoverage tool:

bamCoverage \
    -p ${threads_to_use} \
    --binSize 1 \
    --normalizeUsing ${RPKM|CPM|BPM|RPGC} \
    --outFileFormat bigwig \
    --bam ${input_bam} \
    --outFileName ${input_bam%.bam}.bigWig

A --binSize of 1 will just generate per-base coverage. This may be slow, and you could increase the value if you wish. There are also other ways of generating coverage/depth such as mosdepth (a great tool by Brent Pedersen). This comes with Deeptools though, and is easy to get running. The --normalizeUsing option will let you normalize the coverage by several methods, which is particularly useful for plotting multiple experiments together at the end.

Next, you'll need to generate a score matrix. In other words, a matrix of coverages or other values of interest. This step can be done on any score data in a bedGraph/bigWig file, even if you did not generate it with Deeptools. So, if you're using data from a tool other than bamCoverage this is your starting point.
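One input to the command below is a single length that every feature gets scaled to (--regionBodyLength). If you want to base that on your regions, here's a small Python sketch that pulls feature lengths out of BED-formatted text so you can look at the median and mean. The toy BED lines and the feature_lengths name are hypothetical; BED coordinates are 0-based and half-open, so length is simply end minus start.

```python
from statistics import median


def feature_lengths(bed_text):
    """Return end - start for every feature line in BED-formatted text."""
    lengths = []
    for line in bed_text.splitlines():
        # Skip blank lines and the optional header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start)
    return lengths


# Made-up regions, for illustration only.
toy_bed = (
    "chr1\t100\t1100\tgeneA\n"
    "chr1\t5000\t7000\tgeneB\n"
    "chr2\t300\t800\tgeneC\n"
)
lengths = feature_lengths(toy_bed)
print("median:", median(lengths), "mean:", sum(lengths) / len(lengths))
```

With a real file you'd read its contents first, for example with open(path).read(), and pass that text in.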

computeMatrix scale-regions \
    -p 10 \
    --startLabel start \
    --endLabel end \
    --upstream ${base_pairs} \
    --downstream ${base_pairs} \
    --regionBodyLength ${scale_length} \
    --regionsFileName ${regions_bed} \
    --scoreFileName ${input1_bigWig} ${input2_bigWig} \
    --outFileName ${output_matrix.gz}

The --startLabel and --endLabel values can be changed as desired, but don't forget them! The --upstream and --downstream values can be as desired. The --regionBodyLength is the value to which all features will be scaled. I suggest using either the mean or median length of the features of interest. The regions will be input as a .bed file, and the bigWig files that were generated in the previous step will be used where indicated. Multiple files can be input, space separated. You can specify that the matrix be gzipped by simply adding .gz to the name of your output file. Now, the final step is to generate the plot and also output the raw data:

plotProfile \
    --startLabel start \
    --endLabel end \
    --averageType ${mean|median} \
    --matrixFile ${input_matrix.gz} \
    --outFileName ${metaplot.svg} \
    --outFileNameData ${metaplot.tab}

This will generate a plot, but also output the table of per-bin values that were plotted. I made this with it:

I could play with Deeptools further, but the options for changing its aesthetics are more limited than I'd like. In particular, smoothing the lines requires smoothing the underlying data in the scoreMatrix step, which I am not a huge fan of. Now, let's load that table into R and make something prettier in Part 2.

Managing Software on a Multiuser Linux System

2019-06-25T00:00:00-04:00

When I started my Ph.D. I had a good amount of experience working in a Linux environment on my own computers. Mostly as a hobby. My advisor had bought a small server several years previous for a post-doc's project and I was offered this system to use for my day-to-day work. It doesn't set any speed records, but it is a 24 thread system with 75gb of RAM and 12TB of storage. This makes it perfect for running analyses that I wouldn't want to do on my laptop, but need to be tweaked repeatedly and therefore are awkward to run on the university HPC. I also use this server for jupyter notebooks and it still handles a few users at a time well.

Since this system was starting from a blank slate I decided to implement some simple rules for system management. When I started out I was the only user, but since then we've added several others and this plan has held up. This is going to be heavily biased toward running a small server for computational work that's shared between < 10 users, because that's what I do.

These are ordered, but feel free to ignore that. They're really more like general tips.

0. Run a Well-Supported (Popular) Linux Server Distro

I know, I know, I know. You may have a favorite Linux distribution. It might be Fedora, or Mint, or Manjaro (that's what I've been using). You might use Arch, you might be a masochist, or you may enjoy running something with an innovative package management system like Guix or NixOS.

Maybe you just don't know why everyone uses this *nix stuff and don't know why you can't just bioinformatics in Excel.

You're welcome to use something flashier, but I'd recommend sticking to Ubuntu Server or CentOS. Fedora Server might also be a good choice, especially with Ubuntu potentially not shipping 32bit support in the future. For those with more time or inclination to fiddle around, Debian would also make a good research computing environment. The reason for this is that most software that's already packaged will be either in .deb (Debian and derivatives, including Ubuntu) or .rpm (Redhat, Fedora, SUSE) format. Can you extract these packages and install them on other systems? Sure. Are you going to want to do that every time you update stuff? No.

You also want to make sure that required libraries for software you may need to compile are available without much fussing around straight from the repositories. You'll have to do enough annoying things. Don't make this annoying.

1. Revoke Other Users' sudo Privileges

This may seem obvious but you'd be surprised how many academic labs don't think about this on their private server (if they have one). It's hard to overstate the terrible time you'll have as a sysadmin if another one of your users types the dreaded:

sudo rm -r /

sudo rm -r /*

It's easy to forget that "." before the "/".

Or, less catastrophically, that user may try installing software in a brittle way. Meaning, you, the humble pseudo-sysadmin who's not actually getting paid for sysadmin tasks, will have to spend time fixing it.

All it takes is for you, the SUPER USER, the GOD OF THE SERVER, to run:

sudo deluser {USERNAME} sudo

Replace {USERNAME} with the user to remove.
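Before removing anyone, it helps to see who is in the group to begin with. Here's a small Python sketch using the standard library's grp module (Unix only). It assumes the admin group is called "sudo", as on Ubuntu and Debian, and it only lists users who have that group as a supplementary group. group_members is a hypothetical helper name.

```python
import grp


def group_members(group_name="sudo"):
    """Return the sorted member list of a Unix group, or [] if it doesn't exist."""
    try:
        return sorted(grp.getgrnam(group_name).gr_mem)
    except KeyError:
        # The group isn't defined on this system.
        return []


print(group_members("sudo"))
```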

2. Don't Blindly Install Software From Your Distro's Repos

I did just say to pick a distro with lots of stuff in the repos, right? Yes, but particularly in scientific/research computing you really really really can't assume these repos are anything close to up-to-date. Don't be afraid to download the source code and compile, or even easier, there is likely a prebuilt binary release available on the project's github.

As an example, if you're running the most recent LTS version of Ubuntu (18.04) then the version of samtools available to you is v1.7 which is a year and a half old at the time of writing. If you have control of the system, then at least try to install the most recent stable versions of critical software.

3. Use an Easily Followed Convention for Manual Software Installation

When you need to download software and install it manually put it somewhere easy to remember, and easy to find for others. I put manually installed software in /opt/software_version and symlink the binaries to /usr/local/bin/. This way, you quickly know what you have manually installed, and what version they are just from the directory structure. You also make everything available in the $PATH and runnable with just the program name.

The worst thing that can happen is a broken symlink if you change software versions, and that's an easy fix with a:

sudo ln -sf /path/to/binary /usr/local/bin
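If you'd rather script the convention, here's a hedged Python sketch of the same "versioned directory plus symlink" pattern. It runs entirely inside a throwaway temporary directory, so nothing touches /opt or /usr/local/bin; the tool name, versions, and the link_binary helper are all made up for the demonstration.

```python
import os
import tempfile


def link_binary(binary_path, bin_dir):
    """Point bin_dir/<name> at binary_path, replacing any existing link."""
    link_path = os.path.join(bin_dir, os.path.basename(binary_path))
    if os.path.islink(link_path):
        os.remove(link_path)  # also clears a broken link left by an old version
    os.symlink(binary_path, link_path)
    return link_path


# Demonstration: "install" two versions and relink to the newer one.
with tempfile.TemporaryDirectory() as root:
    bin_dir = os.path.join(root, "bin")
    os.makedirs(bin_dir)
    for version in ("1.0", "1.1"):
        tool_dir = os.path.join(root, "opt", f"sometool_{version}")
        os.makedirs(tool_dir)
        binary = os.path.join(tool_dir, "sometool")
        open(binary, "w").close()  # empty stand-in for a real binary
        link = link_binary(binary, bin_dir)
    final_target = os.readlink(link)
    print(final_target)
```

The helper only removes an existing symlink, never a regular file, so it won't clobber something that was installed at that path by other means.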

4. Encourage Users To Test Software in ~/bin

Create a private bin directory inside each user's home folder. This is often pre-configured in each user's path. If not you'll need to add it to each user's .bashrc or .profile or .bash_profile, depending on which is the preferred method for your distro:

# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi

Let your users test and if multiple people need it, or they're running something all the time, then you can install it system-wide in /opt.

5. Encourage Python Users to Set Up pyenv

Linux systems use Python under the hood a lot. Much of the system depends on python, and your distro's package manager has already likely installed many python packages. However, these versions are likely old and frozen at the version number that shipped with the OS. I dislike running software that is years out of date. Python's package management with pip is kind of a mess and it doesn't know which packages are needed by the system, and which are installed with it. This is improving over time, but it's still not good.

To avoid this, users should install the most recent stable version of Python. Pyenv gives you a relatively easy and very lightweight way to do this. It also allows the system packages to coexist peacefully in the root directory so it's harder to break things. Plus, the users get the latest Python features.

The pyenv github has relatively easy to follow instructions.

6. Use User-specific Language Libraries/Packages

This pops up for us with both python and R. It boils down to never, ever, using:

sudo pip install {PACKAGE_NAME}

sudo R
install.packages('PACKAGE_NAME')

If users can't use sudo there's no danger here anyway, but using user-specific libraries and packages keeps things consistent. It also means that, once again, you don't have to manage something. The following will solve this for Python:

pip install --user {PACKAGE_NAME}

This installs packages to ~/.local/lib/python{VERSION}/site-packages
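If you want to confirm where that is for a particular interpreter, Python's standard site module can tell you. A minimal sketch follows; the output depends on your system and Python version, and inside a virtual environment the user site is typically disabled.

```python
import site
import sys

user_site = site.getusersitepackages()
print("User site-packages:", user_site)
print("Enabled for this interpreter:", site.ENABLE_USER_SITE)
print("Already on sys.path:", user_site in sys.path)
```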

R requires a bit more doing. To create a user-library I recommend creating a .Renviron in each user's home directory and adding the following to it.

# .Renviron is read every time a new R session is started
# Use .Renviron to set environment variables for R

# Use the local R library
R_LIBS_USER="~/.local/lib/R/site-library"

Wrapping up

In summary, administering a small multi-user system doesn't have to be complicated. You do want to minimize the ability for your users to break things though. By no means is this an exhaustive guide, but it might help you out if you're wondering where to start.

Despite the proliferation of HPC systems at Universities, and cloud computing in enterprise environments, a smaller server for your research group is still a good investment in 2019. Submitting jobs to a queue is fine when you're not doing iterative work, but if you want to quickly test things it gets old really quick. Likewise, you can easily get hardware on-par with a remote VM and it's more readily accessed.

Tune in next time for something more bioinformatics-focused!

Setting up a Static Site With Pelican and GitHub Pages

2019-06-15T00:00:00-04:00

In an effort to aid in my future job searching I decided I needed a personal/professional website. It needed to look good, contain links to my relevant social and job-search profiles, host some examples of work from my Ph.D., showcase my skillset, and host my CV. GitHub pages seemed like a natural fit, since I already share most of my work there. GitHub recommends static site generation with Jekyll, which I've seen to be a fine way to do that, and they have integrated tools for working with it. However, I mostly write python day-to-day (and R) and the idea of using a ruby-based framework for this just seemed silly to me. So, stubborn as I am, I decided to embark on a quest to use a python-based alternative. Pelican seemed to be the most actively developed, so I went ahead with that.

The issue I ran into is that many of the guides were unnecessarily complicated, or didn't contain information for my particular use-case. So, I've compiled here the steps I used to generate this site, in the hope that it will help others.

Note Before Starting

Before starting here, I would like to mention that I will not be recommending using a virtual environment. Why? This is overkill for a simple static site/blog and adds unnecessary complication to a process that doesn't need to be hard at all. These sorts of instructions are useful for more advanced deployments, and if you need them then you probably don't need a guide as simplified as this anyway.

Personally, I do use pyenv to manage python installations on my home and work computers, as well as our lab server. It makes my life much easier. But it is not required so I won't be going over it here.

I still recommend installing python packages at the user level though. Mostly a *nix/macOS thing, I'm pretty sure Windows peeps can ignore this. This will be explained where relevant.

This guide will be GNU/Linux centric and I'm not apologizing for it :)

Don't use Python 2.x, it's 2019. I'm writing these with Ubuntu and derivatives in mind, so I will specify python3 throughout, since I believe python still points to 2.7.

Step-by-step Instructions Start Here:

Like I said above, there are existing guides for this. However, most of them recommend installing what amounts to extreme overkill for simple GitHub pages.

Have Python and PIP Installed

If you're on Linux then congrats, you've already got it. If not, consult the docs at python.org

However, you may not have pip, the python package manager. For that check by running:

which pip3

If it doesn't point to an executable then you'll need to run (Ubuntu-based):

sudo apt install python3-pip

For other distros/package managers or macOS with homebrew consult the docs to get the specific commands. These will likely require superuser/administrator privileges.

Install Pelican

On any *nix or macOS machine the following should do the trick in a terminal:

pip3 install --user pelican ghp-import Markdown typogrify

The --user flag installs Pelican to your home directory and doesn't require super-user/administrator privileges. ghp-import will allow you to push directly to github.

Create a New GitHub Repo and Clone

It's perfectly fine to use the web interface to create a new repo, so go to your github homepage and create a new repository with the name:

{username}.github.io

This is important, and will allow you to access your site at {username}.github.io rather than needing extra bits on the end of your github url. Initializing with a README and LICENSE is up to you! May I recommend the MIT license for simplicity and FOSSness?

Go to your desired dev folder on your machine, I keep all github projects in "~/Github/{project_name}", and clone:

cd {project_folder} && git clone {repo.git}

Run Pelican Quickstart in Your Repo Directory

Pelican comes with a handy quickstart script. Though, it's not terribly-well documented. My settings were as follows. Only non-defaults listed (for defaults just push enter):

pelican-quickstart

Do you want to specify a URL prefix? Y (Followed by: {https://{username}.github.io})
What is your time zone? {insert local timezone here}
Do you want to upload your website using GitHub Pages? Y

This will create the skeleton of your page, and allow you to start adding content! Other things can be changed later in your config files.

Create a First Post

By default, things created in your root level directory are turned into blog posts. Don't ask me why this is the default, I don't like it. However, this can be changed/hacked around later. For now create a file called test.md. Add the following to it:

Title: This is a Blog Post!
Date: 2019-06-15
Category: Article

Hello World!

Generate Your Site

There are several ways to do this, this is the simplest:

make devserver

This command starts a dev server, which automatically updates the generated content in real-time. So you can edit and preview simultaneously. Point your web browser of choice to localhost:8000 and take a look!

Add a Static Page

By default, things in the root directory are blog posts (configurable), but you'll probably want some static pages that are always linked to and don't contain blog content. For that, without stopping the devserver, create a new folder inside the "content" subdirectory called "pages":

mkdir ./content/pages

Create a new markdown document in there called "about.md":

touch ./content/pages/about.md

Fill this with the following:

Title: About
Date: 2019-06-14

Hello world! This is a test, using Pelican to create a github pages site.

Preview Your New Pages

Go back to your browser, which should have been running the whole time, and refresh on localhost:8000. You should now see options to go to a new page called "about." That's it! Easy peasy!

Generate Your Content and Push

Kill the devserver with ctrl + c. Run the following in your root directory:

make html

This is probably unnecessary, but in case the devserver wasn't working correctly, then this ensures you will have no issues.

Next, run the following to push to github:

make github

This will ask for your github username and password, then push to your repo.

Now direct your browser to:

https://{username}.github.io

Your site should be visible now!

Push Your Source Code to a New Branch

This method of pushing creates a problem from a dev standpoint. It will write over all content in your repo every time. Plus, it only writes the rendered site there. If you want to work on things off this same machine, you're going to want to push the source code. Fortunately, there's an easy workaround for this.

Go to the GitHub web interface and create a new branch called "source". This will copy all current content to it, which is just the rendered page. Now, back in your development folder, copy all content from your repo's folder elsewhere (non-hidden stuff only). Then, open a terminal and type:

git checkout source

This switches you to the source branch. It also replaces the contents with the rendered content-only.

Delete the contents again and replace with your copy of the source code. Now enter:

git add . && git commit -m "Pushed source" && git push -f origin source

This will force a push to the source branch. Technically you don't need the "origin source" since you've checked it out, but I include it for extra safety since we're already doing something that is frowned on. This will totally overwrite that branch's content with the source code used to generate your site, but only on that branch. Now you can push using the make github command, which defaults to the master branch, when you want to publish, and push with git push origin source when you want to update the source code.

Final Thoughts

You're now done! And you can switch between branches to see the source and the output from Pelican's rendering. I'll write another post later covering more configuration details. Until then, the docs are a wonderful resource.
