<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Jeff Grover. Bioinformatics Scientist.</title><link href="https://groverj3.codeberg.page/" rel="alternate"></link><link href="https://groverj3.codeberg.page/feeds/all.atom.xml" rel="self"></link><id>https://groverj3.codeberg.page/</id><updated>2024-06-02T00:00:00-04:00</updated><subtitle>Senior Scientist - NGS &amp; Bioinformatics @ &lt;a href="https://www.entradatx.com" target="_blank"&gt;Entrada Therapeutics&lt;/a&gt;</subtitle><entry><title>Polyglot R and Python Bioinformatics and Data Science Projects Using Jupyter Notebooks</title><link href="https://groverj3.codeberg.page/articles/2024-06-02_polyglot-r-and-python-bioinformatics-and-data-science-projects-using-jupyter-notebooks.html" rel="alternate"></link><published>2024-06-02T00:00:00-04:00</published><updated>2024-06-02T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2024-06-02:/articles/2024-06-02_polyglot-r-and-python-bioinformatics-and-data-science-projects-using-jupyter-notebooks.html</id><summary type="html">&lt;p&gt;TLDR - Check out this github repo for a (still really wordy) example:
&lt;a href="https://codeberg.org/groverj3/polyglot_jupyter_example"&gt;polyglot_jupyter_example&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you're anything like me, and there are probably tens of you out there, you
enjoy working in multiple programming languages for your bioinformatics/data
science work. Perhaps you love the &lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt; R
ecosystem for data manipulation …&lt;/p&gt;</summary><content type="html">&lt;p&gt;TLDR - Check out this github repo for a (still really wordy) example:
&lt;a href="https://codeberg.org/groverj3/polyglot_jupyter_example"&gt;polyglot_jupyter_example&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you're anything like me, and there are probably tens of you out there, you
enjoy working in multiple programming languages for your bioinformatics/data
science work. Perhaps you love the &lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt; R
ecosystem for data manipulation but prefer packages from a Python library like
&lt;a href="https://scikit-learn.org/stable/index.html"&gt;Scikit-learn&lt;/a&gt;. Or, as is becoming
increasingly common, you're working on single-cell RNAseq analysis and you like
normalization provided by &lt;a href="https://satijalab.org/seurat/"&gt;Seurat&lt;/a&gt;, need to
load data types only supported in the Python &lt;a href="https://github.com/scverse"&gt;scverse&lt;/a&gt;,
and want to use the Bioconductor &lt;a href="https://bioconductor.org/packages/devel/bioc/html/SingleCellExperiment.html"&gt;SingleCellExperiment&lt;/a&gt;
class to store your data.&lt;/p&gt;
&lt;p&gt;If you've tried to use both Python and R in the same project, maybe you've
already realized that &lt;a href="https://jupyter.org/"&gt;Jupyter notebooks and lab&lt;/a&gt; support
many kernel types, including R. By installing the &lt;a href="https://irkernel.github.io/"&gt;R Kernel&lt;/a&gt;
in addition to the default &lt;a href="https://github.com/ipython/ipykernel"&gt;ipython kernel&lt;/a&gt;
you can use Jupyter for both R and Python notebooks, harmonizing your workflow
across both languages. However, there is another consideration: having a
reproducible &lt;em&gt;environment&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Enter virtualenv and renv&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv"&gt;&lt;img src="https://avatars.githubusercontent.com/u/16530698?s=200&amp;v=4" alt="Pyenv" style="width:120px;height:120px;"&gt;&lt;/a&gt;
&lt;a href="https://github.com/rstudio/renv"&gt;&lt;img src="https://rstudio.github.io/renv/logo.svg" alt="Renv" style="width:120px;height:120px;"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Python virtual environments have been around for a very long time. They're a
great way to make sure that your package versions for a given project are
recorded, frozen, and don't conflict with those packages you've installed for
other projects. The standard library module &lt;a href="https://docs.python.org/3/library/venv.html"&gt;venv&lt;/a&gt;
and the installable package &lt;a href="https://virtualenv.pypa.io/en/stable/"&gt;virtualenv&lt;/a&gt;
allow you to manage them. I typically use &lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv&lt;/a&gt;
to install and manage python versions, and it has an extension for managing
virtual environments as well: &lt;a href="https://github.com/pyenv/pyenv-virtualenv"&gt;pyenv-virtualenv&lt;/a&gt;.
Confusingly, pyenv-virtualenv actually uses &lt;code&gt;venv&lt;/code&gt; (mostly).&lt;/p&gt;
&lt;p&gt;On the R side of things, it seems virtual environments, or project environments,
haven't had as much focus. Generally, packages are very backwards compatible in
R land. However, the package &lt;a href="https://rstudio.github.io/renv/index.html"&gt;renv&lt;/a&gt;
is getting more attention. I think it's great for the R community. It's long
been commonplace to run &lt;code&gt;sessionInfo()&lt;/code&gt; at the end of a notebook or script to
make sure you know which packages are in-use and their versions. Instead,
&lt;code&gt;renv&lt;/code&gt; allows creating a project-specific library and tracks versions of all
packages used (though you should still run &lt;code&gt;sessionInfo()&lt;/code&gt; in your
notebooks for completeness, I think).&lt;/p&gt;
&lt;p&gt;There are lots of guides for Python virtual environments, but fewer at this
time for &lt;code&gt;renv&lt;/code&gt;. It's pretty easy to start using, though: install it with
&lt;code&gt;install.packages("renv")&lt;/code&gt;, restart R, initialize it in your project directory with
&lt;code&gt;renv::init()&lt;/code&gt;, install the packages you want, then update the lockfile with
&lt;code&gt;renv::snapshot()&lt;/code&gt;. There are some quirks to it, so I recommend perusing the
docs.&lt;/p&gt;
&lt;h2&gt;The Goal: Use Both virtualenvs and renvs in Jupyter&lt;/h2&gt;
&lt;p&gt;I'm a big fan of Jupyter Lab, and I use it for most of my downstream analysis
tasks in both R and Python. I use R more frequently despite liking Jupyter,
which I guess makes me kind of weird. I'm not using R Studio (which is a great
IDE for R too) because I want to use the same editor for notebooks in both
languages. There are ways to use reproducible environments for both languages
in Jupyter notebooks, so I thought "Why not find a configuration that allows the
use of reproducible R and Python environments in the &lt;strong&gt;same project&lt;/strong&gt;."&lt;/p&gt;
&lt;p&gt;This required a bit of tweaking, but I'm fairly happy with the result. To do
this you need to be a bit adventurous, but I promise it's not that hard.&lt;/p&gt;
&lt;h2&gt;Global Configuration&lt;/h2&gt;
&lt;p&gt;This has worked well for me in Ubuntu 22.04 (bare metal as well as WSL2) and
Manjaro Linux.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install &lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv&lt;/a&gt;.&lt;ul&gt;
&lt;li&gt;Don't use your system python, especially on Linux, where lots of the system can depend on it.&lt;/li&gt;
&lt;li&gt;Using the system python makes managing packages a nightmare.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;a href="https://github.com/pyenv/pyenv-virtualenv"&gt;pyenv-virtualenv&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Install the version of python you wish to use and set it as your global version.&lt;ul&gt;
&lt;li&gt;Make sure to set &lt;code&gt;PYTHON_CONFIGURE_OPTS="--enable-shared"&lt;/code&gt; at a minimum.&lt;/li&gt;
&lt;li&gt;The full command with pyenv that I use at the time of writing this is (you can use another
version if you wish):
&lt;code&gt;env PYTHON_CONFIGURE_OPTS="--enable-shared --enable-optimizations --with-lto" PYTHON_CFLAGS='-march=native -mtune=native' pyenv install 3.12.3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pyenv global 3.12.3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install jupyterlab and any other Python packages you want.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install jupyterlab&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install R using &lt;a href="https://github.com/r-lib/rig"&gt;rig&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;rig is optional but I really like it.&lt;/li&gt;
&lt;li&gt;Installing R from the &lt;a href="https://cran.r-project.org/"&gt;CRAN&lt;/a&gt; repository is also perfectly reasonable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install the R kernel for Jupyter.&lt;ul&gt;
&lt;li&gt;Follow the instructions in the &lt;a href="https://irkernel.github.io/"&gt;R Kernel&lt;/a&gt; docs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That gets you all the basics.&lt;/p&gt;
&lt;p&gt;We are installing jupyter for the global Python version. However, this global
Python is not the system Python and is not used directly for any data science
work; it's only for running jupyterlab.&lt;/p&gt;
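&lt;p&gt;If you want to confirm that the interpreter pyenv built really was configured
with &lt;code&gt;--enable-shared&lt;/code&gt;, a quick check along these lines should work. This is
only a sketch: it assumes a Linux or macOS build, where the standard library
&lt;code&gt;sysconfig&lt;/code&gt; module exposes the &lt;code&gt;Py_ENABLE_SHARED&lt;/code&gt; build variable, and
&lt;code&gt;built_with_shared_lib&lt;/code&gt; is a made-up helper name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import sys
import sysconfig


def built_with_shared_lib():
    """Return True if this interpreter was configured with --enable-shared."""
    # Py_ENABLE_SHARED is 1 for shared builds, 0 otherwise (None if not recorded)
    return sysconfig.get_config_var("Py_ENABLE_SHARED") == 1


print(sys.executable)
print("shared library build:", built_with_shared_lib())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
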
&lt;h2&gt;Create Your Project Folder&lt;/h2&gt;
&lt;p&gt;I like to encapsulate each of my projects into a separate directory. This way,
a series of computational notebooks that share a common theme can be tracked
together with version control. Plus, a single environment can be used across notebooks
used for a single project. This makes it easy to know which project notebooks
belong to.&lt;/p&gt;
&lt;p&gt;In this example I'm using &lt;code&gt;polyglot_jupyter_example&lt;/code&gt; as the project name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;polyglot_jupyter_example
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;polyglot_jupyter_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then start jupyterlab in the project folder, or if you have a larger
structure that encompasses many projects, in that higher-level directory:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;jupyter&lt;span class="w"&gt; &lt;/span&gt;lab
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Set-up the Python virtualenv&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Make sure you're in the project/github repo directory.&lt;/li&gt;
&lt;li&gt;Set a version of python in the current project folder.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyenv local 3.12.3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Create a virtualenv. I just name them the same as the project.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyenv virtualenv polyglot_jupyter_example&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Activate the virtualenv. You don't have to stay in the directory you made, but it keeps things simple.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyenv activate polyglot_jupyter_example&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install a minimal subset of packages needed for this example.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install ipykernel pandas seaborn&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add the ipykernel you installed in the virtualenv to your jupyter that's &lt;strong&gt;outside&lt;/strong&gt; the virtualenv.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;python -m ipykernel install --user --name=polyglot_jupyter_example&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Exit the virtualenv&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyenv deactivate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Don't worry, jupyter will still know about it.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Start jupyter and you'll see that you can now create notebooks inside the virtualenv.&lt;/li&gt;
&lt;/ol&gt;
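&lt;p&gt;The reason jupyter still knows about the environment after you deactivate it
is that the install step writes a small kernel spec (a &lt;code&gt;kernel.json&lt;/code&gt; file)
that points at the virtualenv's interpreter by absolute path. The sketch below
builds a dictionary with roughly the contents I'd expect in that file. The
interpreter path is a made-up placeholder, &lt;code&gt;sketch_kernel_spec&lt;/code&gt; is a
hypothetical helper, and the exact fields may differ between ipykernel versions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import json


def sketch_kernel_spec(python_path, name):
    """Build a dict resembling the kernel.json that ipykernel writes."""
    return {
        # Jupyter launches the kernel with this exact interpreter
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": name,
        "language": "python",
    }


# Hypothetical path; yours depends on where pyenv keeps its versions
spec = sketch_kernel_spec(
    "/home/you/.pyenv/versions/polyglot_jupyter_example/bin/python",
    "polyglot_jupyter_example",
)
print(json.dumps(spec, indent=2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
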
&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; If you're used to installing pip packages, etc. within a
notebook by &lt;code&gt;! pip install {package}&lt;/code&gt; you'll need to adjust your workflow. The
shell that jupyter spawns does not know about your virtualenv. Just keep a terminal
open outside the notebook.&lt;/p&gt;
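&lt;p&gt;To check which environment a notebook is really running in, something like
this sketch can go in the first cell. It relies only on the standard library:
inside a virtual environment &lt;code&gt;sys.prefix&lt;/code&gt; differs from
&lt;code&gt;sys.base_prefix&lt;/code&gt;. The &lt;code&gt;in_virtualenv&lt;/code&gt; helper name is mine, not part
of any library:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv releases point sys.prefix at the environment,
    # while sys.base_prefix keeps pointing at the Python it was created from
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("in a virtualenv:", in_virtualenv())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you do need to install from inside a notebook, calling pip through the
kernel's own interpreter (for example
&lt;code&gt;subprocess.run([sys.executable, "-m", "pip", "install", "some_package"])&lt;/code&gt;)
should target the right environment, but keeping a terminal open is simpler.&lt;/p&gt;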
&lt;p&gt;&lt;img alt="Great Success!" src="https://codeberg.org/groverj3/polyglot_jupyter_example/raw/branch/main/images/python_env_jupyter.png"&gt;&lt;/p&gt;
&lt;h2&gt;Set-up the Renv&lt;/h2&gt;
&lt;p&gt;This is somewhat easier, because R isn't controlling jupyter.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Inside the project directory, start R.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;R&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;renv&lt;/code&gt;.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;install.packages("renv")&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Restart the R session.&lt;/li&gt;
&lt;li&gt;Initialize an renv.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;renv::init(bare = TRUE)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bare = TRUE&lt;/code&gt; keeps renv from parsing all text files in the project. If the project
   isn't a blank slate (you already have large notebooks and other files in it), that parsing can cause renv to hang.&lt;/li&gt;
&lt;li&gt;This creates a project-specific library, a &lt;code&gt;.Rprofile&lt;/code&gt;, and &lt;code&gt;renv.lock&lt;/code&gt; amongst other things.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Exit R, and configure Bioconductor if you use it.&lt;ul&gt;
&lt;li&gt;Setup information for the posit package manager mirror of bioconductor: &lt;a href="https://packagemanager.posit.co/client/#/repos/bioconductor/setup"&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Enter R again and install the ir kernel and data science stack.&lt;ul&gt;
&lt;li&gt;&lt;code&gt;install.packages(c("tidyverse", "IRkernel"))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Make sure the R kernel is installed in jupyter outside the renv as well (as per the directions earlier).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As long as you start your jupyter notebooks in the top level of the project folder then R kernels
will respect your Renv.&lt;/p&gt;
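&lt;p&gt;This works because R reads a &lt;code&gt;.Rprofile&lt;/code&gt; from the directory it starts
in, and the one renv writes activates the project library. A rough pre-flight
check might look like the sketch below. &lt;code&gt;renv_files_present&lt;/code&gt; is a
hypothetical helper, and it only looks for the two files named above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from pathlib import Path


def renv_files_present(project_dir="."):
    """Report whether the files renv relies on exist in project_dir."""
    root = Path(project_dir)
    # .Rprofile activates renv at R startup; renv.lock records package versions
    return {name: (root / name).is_file() for name in (".Rprofile", "renv.lock")}


for name, found in renv_files_present().items():
    print(name, "found" if found else "missing")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
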
&lt;h2&gt;Project Structure&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;project_repo_dir
├── .python-version
├── .gitignore
├── .renvignore
├── .Rprofile
├── renv.lock
├── requirements.txt
├── {python_notebook_names}_py.ipynb
├── {python_notebook_outdirs}_py
    ├── {python_notebook_names}_py.py
    ├── {python_notebook_names}_py.html
    └── {analysis_outputs}
├── {r_notebook_names}_r.ipynb
├── {r_notebook_outdirs}_r
    ├── {r_notebook_names}_r.r
    ├── {r_notebook_names}_r.html
    └── {analysis_outputs}
└── README.md
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There is quite a bit going on here so I will elaborate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.python-version&lt;/code&gt;: Records the local version of python in the project.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.gitignore&lt;/code&gt;: Records any files that should be ignored by git.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.renvignore&lt;/code&gt;: Records any files that should not be parsed by &lt;code&gt;renv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.Rprofile&lt;/code&gt;: Project-specific configuration for R.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;renv.lock&lt;/code&gt;: Where &lt;code&gt;renv&lt;/code&gt; records the packages in the environment (including
  versions). Similar in concept to a Python requirements.txt.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt;: List of Python packages installed in the environment.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{python_notebook_names}_py.ipynb&lt;/code&gt;: Jupyter notebooks that use Python get named
  with "_py.ipynb" to make it clear they use Python.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{python_notebook_outdirs}_py&lt;/code&gt;: Output folders for Python notebooks, each
  Python notebook gets an output folder named the same as the parent notebook.&lt;ul&gt;
&lt;li&gt;Outputs from analysis go in here.&lt;/li&gt;
&lt;li&gt;I also like to include the notebook in plain text .py format so it can
  be run without Jupyter, and in .html format so it can be viewed by
  non-coders.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{r_notebook_names}_r.ipynb&lt;/code&gt;: Jupyter notebooks that use R get named with "_r.ipynb".&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{r_notebook_outdirs}_r&lt;/code&gt;: The same scheme as &lt;code&gt;{python_notebook_outdirs}_py&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt;: Summary of the project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The companion git repository is &lt;a href="https://codeberg.org/groverj3/polyglot_jupyter_example"&gt;here&lt;/a&gt;
and has examples of all these files, as well as two example notebooks. I
recommend checking them out.&lt;/p&gt;
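&lt;p&gt;One common way to produce a &lt;code&gt;requirements.txt&lt;/code&gt; is to run
&lt;code&gt;pip freeze&lt;/code&gt; inside the activated virtualenv. As a rough illustration of
what ends up in that file, this sketch lists installed packages in the same
&lt;code&gt;name==version&lt;/code&gt; form using only the standard library. It's an
approximation rather than pip's own implementation, and &lt;code&gt;freeze_lines&lt;/code&gt; is
a hypothetical helper:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from importlib.metadata import distributions


def freeze_lines():
    """List installed distributions as 'name==version' strings."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


print("\n".join(freeze_lines()))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
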
&lt;h2&gt;Notes on .gitignore and .renvignore&lt;/h2&gt;
&lt;p&gt;By default renv will parse all files in your project to determine which
packages need to be tracked in the &lt;code&gt;renv.lock&lt;/code&gt;. This can be problematic if you
have large files, including notebooks, in the project folder.&lt;/p&gt;
&lt;p&gt;One way around this is to add large files to &lt;code&gt;.gitignore&lt;/code&gt;, which renv respects as a list
of files and subdirectories not to parse. You'll notice that renv already creates
one inside the &lt;code&gt;./renv/&lt;/code&gt; subdirectory, where your project-specific library lives, to
keep git from tracking all of that. On its own this probably isn't sufficient,
because renv will still parse notebooks, which you likely &lt;em&gt;do&lt;/em&gt; want to track with git.&lt;/p&gt;
&lt;p&gt;If you create an &lt;code&gt;.renvignore&lt;/code&gt; in your project folder then Renv will use that
&lt;em&gt;instead&lt;/em&gt; of &lt;code&gt;.gitignore&lt;/code&gt;. I have been naming notebooks I write python in with
&lt;code&gt;_py&lt;/code&gt; on the end so I can match them and output folders in &lt;code&gt;.renvignore&lt;/code&gt;
easily. This is kind of clunky though.&lt;/p&gt;
&lt;p&gt;In light of this, I set up my &lt;code&gt;.gitignore&lt;/code&gt; with typical settings for jupyter
notebooks like this: &lt;a href="https://codeberg.org/groverj3/polyglot_jupyter_example/src/branch/main/.gitignore"&gt;.gitignore&lt;/a&gt;,
which ignores the &lt;code&gt;.ipynb_checkpoints&lt;/code&gt; and &lt;code&gt;.virtual_documents&lt;/code&gt; folders.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;.renvignore&lt;/code&gt;, I create one that ignores all python-based notebooks. You may
want to tweak this to your liking: &lt;a href="https://codeberg.org/groverj3/polyglot_jupyter_example/src/branch/main/.renvignore"&gt;.renvignore&lt;/a&gt;.
You can get around the behavior of renv trying to parse large files by forcing
it to record &lt;strong&gt;all&lt;/strong&gt; packages &lt;em&gt;installed&lt;/em&gt; in an environment rather than just
ones &lt;em&gt;used&lt;/em&gt; and their dependencies. This is demonstrated in the
&lt;a href="https://codeberg.org/groverj3/polyglot_jupyter_example/src/branch/main/polyglot_jupyter_example_r.ipynb"&gt;R notebook example&lt;/a&gt;
with &lt;code&gt;renv::snapshot(type = "all")&lt;/code&gt;.&lt;/p&gt;
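&lt;p&gt;To make the naming convention concrete, here is a rough sketch of how
suffix-based patterns pick out the Python notebooks and their output folders.
The patterns and file names are made-up examples rather than the contents of my
actual &lt;code&gt;.renvignore&lt;/code&gt;, and Python's &lt;code&gt;fnmatch&lt;/code&gt; only approximates
gitignore-style matching:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from fnmatch import fnmatch

# Hypothetical ignore patterns based on the "_py" naming convention
patterns = ["*_py.ipynb", "*_py"]

# Hypothetical top-level entries in a project folder
entries = ["qc_py.ipynb", "qc_py", "deseq2_r.ipynb", "deseq2_r", "README.md"]

ignored = [e for e in entries if any(fnmatch(e, p) for p in patterns)]
parsed = [e for e in entries if e not in ignored]

print("would be ignored:", ignored)
print("would still be parsed:", parsed)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
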
&lt;h2&gt;Restoring an Environment&lt;/h2&gt;
&lt;p&gt;If you need to restore the environment at a later date, or on a new machine,
you will need to enter the project folder; then set up a new virtualenv for
Python, install the packages from the &lt;code&gt;requirements.txt&lt;/code&gt;, and install the
ipython kernel to your jupyter as you did during the initial set-up:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pyenv&lt;span class="w"&gt; &lt;/span&gt;virtualenv&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;polyglot_project_name&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# This should already use the local python version from .python-version&lt;/span&gt;
pyenv&lt;span class="w"&gt; &lt;/span&gt;activate&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;polyglot_project_name&lt;span class="o"&gt;}&lt;/span&gt;
pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;requirements.txt
python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;ipykernel&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--user&lt;span class="w"&gt; &lt;/span&gt;--name&lt;span class="o"&gt;={&lt;/span&gt;polyglot_project_name&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You'll then need to make sure you have R and renv installed before opening an
R terminal and running:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;renv&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;restore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
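
&lt;p&gt;Because &lt;code&gt;renv.lock&lt;/code&gt; is plain JSON, you can peek at what a restore will
install before running it. A minimal sketch, assuming the usual layout with a
top-level &lt;code&gt;Packages&lt;/code&gt; object whose entries carry a &lt;code&gt;Version&lt;/code&gt; field.
The embedded lockfile is a trimmed, made-up example and
&lt;code&gt;locked_versions&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import json

# Trimmed, made-up lockfile contents for illustration
example_lock = """
{
  "R": {"Version": "4.4.0"},
  "Packages": {
    "renv": {"Package": "renv", "Version": "1.0.7", "Source": "Repository"},
    "dplyr": {"Package": "dplyr", "Version": "1.1.4", "Source": "Repository"}
  }
}
"""


def locked_versions(lock_text):
    """Map each package in an renv.lock to its recorded version."""
    lock = json.loads(lock_text)
    return {name: rec.get("Version") for name, rec in lock.get("Packages", {}).items()}


for name, version in sorted(locked_versions(example_lock).items()):
    print(f"{name}=={version}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For a real project you would pass in the text of the project's own
&lt;code&gt;renv.lock&lt;/code&gt; instead of the example string.&lt;/p&gt;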

&lt;h2&gt;Notes on Alternative Setups&lt;/h2&gt;
&lt;p&gt;There is more than one way to do this. &lt;a href="https://python-poetry.org/"&gt;Poetry&lt;/a&gt; is
one attractive option, but I decided against this as it's another dependency and
this was already plenty complicated. I intended this as a minimal example of
such a setup depending only on Python, R, virtualenvs (python, through pyenv),
and renv (R). Some will prefer to opt for &lt;a href="https://conda.io"&gt;conda&lt;/a&gt; environments
instead, since there is some support for R. Anaconda is pretty heavy and I'm
not a fan, but miniconda is certainly an option. &lt;a href="https://github.com/mamba-org/mamba"&gt;Mamba&lt;/a&gt;
is a great, much faster option for managing conda environments.&lt;/p&gt;
&lt;p&gt;Then, there's the nuclear option of every environment being a standalone
&lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; or &lt;a href="https://podman.io/"&gt;Podman&lt;/a&gt; container.
This is attractive when you don't need to interact much with a host system, and
therefore is a good fit for working in "the cloud." Of course, there are ways
around that by mounting local storage inside your containers. You still need to document
your environments to be able to recreate them.&lt;/p&gt;
&lt;p&gt;Perhaps you vastly prefer R over Python, and would rather call Python from R
using &lt;a href="https://rstudio.github.io/reticulate/"&gt;reticulate&lt;/a&gt;. Reticulate actually
works with this set-up as well, though you do need to force it to use the virtual
environment you created for your project with &lt;a href="https://rstudio.github.io/reticulate/reference/use_python.html"&gt;&lt;code&gt;use_python()&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you have a strong preference for Python, then you might already use
&lt;a href="https://rpy2.github.io/"&gt;rpy2&lt;/a&gt;. This is similar in concept to reticulate, but
I can't speak to how well it works with a setup like this, or if it would even
be necessary. This does require you to call R from Python, rather than writing
standalone R notebooks though.&lt;/p&gt;
&lt;p&gt;There are probably improvements I'll add over time if I get around to it.
However, this may give you some ideas of your own when it comes to living the
polyglot data science/bioinformatics life.&lt;/p&gt;</content><category term="how-to"></category><category term="bioinformatics"></category><category term="data-science"></category><category term="jupyter"></category><category term="notebooks"></category><category term="tutorial"></category><category term="r"></category><category term="python"></category><category term="sysadmin"></category></entry><entry><title>Nextflow With First Class Metadata: A Minimal Example</title><link href="https://groverj3.codeberg.page/articles/2024-05-24_nextflow-with-first-class-metadata-a-minimal-example.html" rel="alternate"></link><published>2024-05-24T00:00:00-04:00</published><updated>2024-05-24T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2024-05-24:/articles/2024-05-24_nextflow-with-first-class-metadata-a-minimal-example.html</id><summary type="html">&lt;p&gt;TLDR - Check out this github repo for the full example:
&lt;a href="https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example"&gt;https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently wrote &lt;a href="2024-04-26_on-bioinformatics-workflow-design.html"&gt;an article&lt;/a&gt;
regarding some of my opinions on bioinformatics workflow design. I've written
workflows in several languages over the years, but at this point it seems that
&lt;a href="https://www.nextflow.io/"&gt;Nextflow&lt;/a&gt; has become something of …&lt;/p&gt;</summary><content type="html">&lt;p&gt;TLDR - Check out this github repo for the full example:
&lt;a href="https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example"&gt;https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I recently wrote &lt;a href="2024-04-26_on-bioinformatics-workflow-design.html"&gt;an article&lt;/a&gt;
regarding some of my opinions on bioinformatics workflow design. I've written
workflows in several languages over the years, but at this point it seems that
&lt;a href="https://www.nextflow.io/"&gt;Nextflow&lt;/a&gt; has become something of the de facto
industry standard. I thought it might make a nice example to show one of my
recommendations in action for this commonly-used workflow language.&lt;/p&gt;
&lt;p&gt;This is a deliberately simple example of an RNAseq workflow, and not really
intended as an example of production-ready code. However, it will demonstrate
one of the points that I wrote about in that article, First Class Metadata.&lt;/p&gt;
&lt;h2&gt;First Class Metadata&lt;/h2&gt;
&lt;p&gt;What I'm referring to as first class metadata is the concept that the important
information about your data lives elsewhere in a simple format that can be
easily parsed. The filenames themselves are not the ground truth for
information about your data. Filenames are simply an identifier and a method of
linking data to metadata. Take these hypothetical files as an example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;RNASEQ_cell_line_A_treated_20um_replicate_1_R1.fastq.gz
RNASEQ_cell_line_A_treated_20um_replicate_1_R2.fastq.gz
RNASEQ_cell_line_A_treated_20um_replicate_2_R1.fastq.gz
RNASEQ_cell_line_A_treated_20um_replicate_2_R2.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_1_R1.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_1_R2.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_2_R1.fastq.gz
RNASEQ_cell_line_A_untreated_replicate_2_R2.fastq.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These are clearly from an RNAseq experiment. It even says so! What else do we
know? There's other information, apparently. They're from a cell line, the
highly specific "cell_line_A," some have been treated with &lt;em&gt;something&lt;/em&gt; and we
have a concentration (20 micromolar). We also have a replicate number (I hope
in your real experiments you're doing more than one replicate...) and since
these are paired end samples there is information on whether they are read 1
or 2 of the pair.&lt;/p&gt;
&lt;p&gt;What would first class metadata mean? Here's an example, a simple .csv (or
.tsv):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sample_id&lt;/th&gt;
&lt;th&gt;experiment&lt;/th&gt;
&lt;th&gt;cell_line&lt;/th&gt;
&lt;th&gt;treatment&lt;/th&gt;
&lt;th&gt;replicate&lt;/th&gt;
&lt;th&gt;paired_status&lt;/th&gt;
&lt;th&gt;read1&lt;/th&gt;
&lt;th&gt;read2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;treated_20_1&lt;/td&gt;
&lt;td&gt;rnaseq&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;20_micromolar&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;paired_end&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_treated_20um_replicate_1_R1.fastq.gz&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_treated_20um_replicate_1_R2.fastq.gz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;treated_20_2&lt;/td&gt;
&lt;td&gt;rnaseq&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;20_micromolar&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;paired_end&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_treated_20um_replicate_2_R1.fastq.gz&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_treated_20um_replicate_2_R2.fastq.gz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;untreated_1&lt;/td&gt;
&lt;td&gt;rnaseq&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;NA&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;paired_end&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_untreated_replicate_1_R1.fastq.gz&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_untreated_replicate_1_R2.fastq.gz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;untreated_2&lt;/td&gt;
&lt;td&gt;rnaseq&lt;/td&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;NA&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;paired_end&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_untreated_replicate_2_R1.fastq.gz&lt;/td&gt;
&lt;td&gt;RNASEQ_cell_line_A_untreated_replicate_2_R2.fastq.gz&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What makes it "first class" is that this separate document, whether that's a
tabular file or something more complex like a database, is the ultimate source
of truth. The &lt;strong&gt;sample&lt;/strong&gt; is also the fundamental unit of observations in the
table, rather than the &lt;strong&gt;file&lt;/strong&gt;. This is apparent because both reads are listed
for a single sample, rather than each as a separate line. This means that
single-end and paired-end samples can coexist without the need to duplicate a
lot of metadata on more lines. Some sequencing protocols also create a separate
fastq for UMIs or barcodes, which could also be included as an additional
metadata field.&lt;/p&gt;
&lt;p&gt;These metadata can then be parsed when executing a workflow as long as the
files are referenced in the same sample sheet. Using this paradigm, it's easier
to use sample metadata in the course of your workflow execution. Perhaps you'll
assign output file names based on the sample_id field above, or split samples
into groups based on treatment. The possibilities are myriad!&lt;/p&gt;
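&lt;p&gt;As a quick illustration outside of any workflow manager, here's a sketch of
reading such a sheet with Python's standard &lt;code&gt;csv&lt;/code&gt; module, grouping samples
by treatment, and sanity-checking that no sample lists the same file for both
reads. The shortened file names and the reduced set of columns are made up for
the example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import csv
import io
from collections import defaultdict

# Made-up sample sheet with shortened file names
sheet = io.StringIO(
    "sample_id,treatment,read1,read2\n"
    "treated_20_1,20_micromolar,t1_R1.fastq.gz,t1_R2.fastq.gz\n"
    "untreated_1,NA,u1_R1.fastq.gz,u1_R2.fastq.gz\n"
)

by_treatment = defaultdict(list)
for row in csv.DictReader(sheet):
    # Each row is one sample; a sample must not reuse one file for both reads
    if row["read1"] == row["read2"]:
        raise ValueError(f"{row['sample_id']}: read1 and read2 are the same file")
    by_treatment[row["treatment"]].append(row["sample_id"])

print(dict(by_treatment))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
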
&lt;h2&gt;How Does First Class Metadata Help My Nextflow Workflows?&lt;/h2&gt;
&lt;p&gt;If you've recorded your sample metadata in this fashion you can directly read
it during workflow execution. This means you're no longer forced to create
brittle code that makes assumptions about your samples based on file names.
Some of the &lt;a href="https://nf-co.re/"&gt;nf-core&lt;/a&gt; workflows make use of these ideas.&lt;/p&gt;
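&lt;p&gt;To see how brittle filename parsing gets, consider this sketch that splits
two of the hypothetical file names from earlier on underscores. The untreated
file has no concentration field, so every later field shifts by one position:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;treated = "RNASEQ_cell_line_A_treated_20um_replicate_1_R1.fastq.gz"
untreated = "RNASEQ_cell_line_A_untreated_replicate_1_R1.fastq.gz"

for name in (treated, untreated):
    fields = name.split("_")
    # Position 5 is the concentration in one name but "replicate" in the other
    print(len(fields), fields[5])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first name yields nine fields with "20um" at position 5, while the second
yields eight with "replicate" there, so any code indexing by position silently
reads the wrong value.&lt;/p&gt;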
&lt;p&gt;This isn't a full-blown Nextflow tutorial, but I will demonstrate this with a
minimal example. Imagine you have a simple RNAseq workflow, just FastQC and a
pseudoaligner like Salmon. All your sample information is contained within a
single .csv file. When you define your workflow in the &lt;code&gt;.nf&lt;/code&gt; file you can
create a channel that takes your metadata sheet as an input like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Read samplesheet and convert to queueable channel of: sample id, paired_status, read1, read2 as a tuple&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;reads_channel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;samplesheet&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;splitCsv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;header:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;// map{} applies a function to each element of a channel&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;// In this case, the rows from splitCsv() are converted to a tuple based on the header &lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tuple&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sample_id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;paired_status&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;input_dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read1&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;input_dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read2&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;FASTQC&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reads_channel&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;SALMON_QUANT&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reads_channel&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;map{}&lt;/code&gt; is an operator that applies a function to each element of a channel. In
this case, each row emitted by splitCsv (named &lt;code&gt;row&lt;/code&gt; here for clarity) is
converted to a tuple that contains the sample information. This tuple is then
used as input to FASTQC and SALMON_QUANT, the only two steps in the workflow
(processes in Nextflowese, which would be defined elsewhere in the .nf file).&lt;/p&gt;
&lt;p&gt;The idea is you would use your main metadata database as the sample sheet, just
filtered for the samples you want to analyze.&lt;/p&gt;
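&lt;p&gt;One way to sketch that filtering step, here in base R. The four column names
match what the channel above reads from each row; the &lt;code&gt;project&lt;/code&gt; column, the
sample values, and the output location are all assumptions made up for
illustration, not part of the example repository:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Hypothetical metadata table standing in for a real metadata database.
# Only sample_id, paired_status, read1, and read2 are read by the workflow;
# the project column and every value here are invented.
metadata &amp;lt;- data.frame(
  sample_id     = c(&amp;quot;ctrl_1&amp;quot;, &amp;quot;ctrl_2&amp;quot;, &amp;quot;treated_1&amp;quot;, &amp;quot;other_1&amp;quot;),
  paired_status = c(&amp;quot;paired&amp;quot;, &amp;quot;paired&amp;quot;, &amp;quot;paired&amp;quot;, &amp;quot;paired&amp;quot;),
  read1         = c(&amp;quot;ctrl_1_R1.fastq.gz&amp;quot;, &amp;quot;ctrl_2_R1.fastq.gz&amp;quot;,
                    &amp;quot;treated_1_R1.fastq.gz&amp;quot;, &amp;quot;other_1_R1.fastq.gz&amp;quot;),
  read2         = c(&amp;quot;ctrl_1_R2.fastq.gz&amp;quot;, &amp;quot;ctrl_2_R2.fastq.gz&amp;quot;,
                    &amp;quot;treated_1_R2.fastq.gz&amp;quot;, &amp;quot;other_1_R2.fastq.gz&amp;quot;),
  project       = c(&amp;quot;project_a&amp;quot;, &amp;quot;project_a&amp;quot;, &amp;quot;project_a&amp;quot;, &amp;quot;project_b&amp;quot;),
  stringsAsFactors = FALSE
)

# Fail early if a column the workflow expects is missing or a sample ID repeats
workflow_columns &amp;lt;- c(&amp;quot;sample_id&amp;quot;, &amp;quot;paired_status&amp;quot;, &amp;quot;read1&amp;quot;, &amp;quot;read2&amp;quot;)
stopifnot(all(workflow_columns %in% colnames(metadata)))
stopifnot(anyDuplicated(metadata$sample_id) == 0)

# Keep only the samples for this analysis and the columns the workflow reads
samplesheet &amp;lt;- metadata[metadata$project == &amp;quot;project_a&amp;quot;, workflow_columns]

# Written to a temporary directory here; in practice, save it with the project
samplesheet_path &amp;lt;- file.path(tempdir(), &amp;quot;samplesheet.csv&amp;quot;)
write.csv(samplesheet, samplesheet_path, row.names = FALSE, quote = FALSE)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The resulting file can then be supplied on the command line with
&lt;code&gt;--samplesheet&lt;/code&gt;, which is how &lt;code&gt;params.samplesheet&lt;/code&gt; gets its value.&lt;/p&gt;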
&lt;p&gt;The full example can be found here: &lt;a href="https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example"&gt;https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example&lt;/a&gt;.
I have included lots of comments to help you get started writing Nextflow, as
well as ways to procure a small test dataset. The test dataset and workflow run
in just over a minute on my laptop, excluding pulling containers, so lack of a
cluster or cloud compute environment shouldn't stand in your way for testing.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Podman Example" src="https://codeberg.org/groverj3/minimal_nextflow_samplesheet_example/raw/branch/main/example_execution/podman_run.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This very simple example highlights a strategy I use when writing workflows,
and currently that means Nextflow. These ideas are transferable to other
workflow languages as well. Personally, I am a fan of Snakemake, as it was the
first workflow language I learned, and it's implemented in Python. However,
Nextflow has become something of an industry standard, while Snakemake is more
common in academia. Seqera Labs also supports Nextflow with paid products like
the Seqera Platform (formerly known as Nextflow Tower), and other
bioinformatics cloud platforms increasingly support Nextflow workflows. At the
end of the day, a workflow language is a tool to help enable reproducible
analysis of your data, and we should be careful to use it in a way that
maximizes that reproducibility.&lt;/p&gt;
&lt;p&gt;If you're interested in more advanced and/or more comprehensive Nextflow
material I can recommend the docs at &lt;a href="https://www.nextflow.io/docs/latest/index.html"&gt;https://www.nextflow.io/docs/latest/index.html&lt;/a&gt;
and the Nextflow training material at &lt;a href="https://training.nextflow.io/"&gt;https://training.nextflow.io/&lt;/a&gt;.
There is also a lot of content you can find trawling around GitHub.&lt;/p&gt;</content><category term="how-to"></category><category term="workflows"></category><category term="bioinformatics"></category><category term="RNAseq"></category></entry><entry><title>Bridging the Gap With Wet Lab Using R Shiny</title><link href="https://groverj3.codeberg.page/articles/2024-05-04_bridging-the-gap-with-wet-lab-using-r-shiny.html" rel="alternate"></link><published>2024-05-04T00:00:00-04:00</published><updated>2024-05-04T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2024-05-04:/articles/2024-05-04_bridging-the-gap-with-wet-lab-using-r-shiny.html</id><summary type="html">&lt;p&gt;How do you communicate results of an analysis? What tools do you use?
Scientists who work in the wet lab are accustomed to firing up Excel or some
instrument-specific software and working with their own data. For genomics
or other types of experiments in biology that result in large datasets …&lt;/p&gt;</summary><content type="html">&lt;p&gt;How do you communicate results of an analysis? What tools do you use?
Scientists who work in the wet lab are accustomed to firing up Excel or some
instrument-specific software and working with their own data. For genomics
or other types of experiments in biology that result in large datasets, this
approach is problematic, and bioinformaticians have other tools to deal with our
data. Often, this involves working with large data in a programmatic way, and
the two languages in common usage are &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt; (usually
including its data science stack &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; et al.)
and &lt;a href="https://www.r-project.org/"&gt;R&lt;/a&gt; (usually including the &lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt;,
&lt;a href="https://bioconductor.org/"&gt;Bioconductor&lt;/a&gt;, and friends).&lt;/p&gt;
&lt;p&gt;There is obvious friction here. Bioinformatics scientists have a toolkit that
works great for us, but is foreign to a lot of the wet lab scientists around
us. How can we bridge this gap?&lt;/p&gt;
&lt;h2&gt;A Common Example&lt;/h2&gt;
&lt;p&gt;Your wet lab colleagues want to determine which genes are affected by treatment
with a compound, or test the effect of a mutation in a cell line/plant/mouse on
gene expression, etc. These situations are common applications for RNAseq.
Typically, they also involve a simple experimental design; a comparison of two
sample groups (treatment or mutant vs control).&lt;/p&gt;
&lt;p&gt;The bioinformatician performs their analysis using a &lt;a href="2024-04-26_on-bioinformatics-workflow-design.html"&gt;workflow&lt;/a&gt;
resulting in fold changes, adjusted p-values, etc. in a table. They create a
visualization to help summarize the results to their colleagues, perhaps a
&lt;a href="2024-04-21_making-volcano-plots-with-ggplot2.html"&gt;volcano plot&lt;/a&gt;. A
back-and-forth collaboration then ensues, with requests to modify
visualizations, look at specific lists of genes, and more. While this process
is rewarding and can result in a fruitful experiment, it can also be very
inefficient and take a lot of time. It can also frustrate both parties. The
bioinformatician wants to be as helpful as possible, but the wet lab scientist
isn't experienced in exploring their data with the same tools. The wet lab
scientist, meanwhile, wishes they could be more independent.&lt;/p&gt;
&lt;p&gt;There are ways around this: graphical platforms that enable low/no-code ways to
analyze NGS or other big data. These are typically commercial products with a
lot of functionality, and despite being low/no-code there is still a learning
curve. Often the wet lab scientists want something simple, a way to explore an
already analyzed dataset supplied to them by their collaborating bioinformatics
scientist.&lt;/p&gt;
&lt;h2&gt;The Shiny Framework&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://shiny.posit.co/"&gt;Shiny&lt;/a&gt; is a framework for creating web applications
quickly, with minimal code required for an interactive app. Originally only for
R, it now supports Python as well. We'll be using it with R in this example.
There are other options (such as &lt;a href="https://dash.plotly.com/"&gt;Dash&lt;/a&gt;) so if you'd
rather use those, knock yourself out. I find Shiny's R syntax to be relatively
easy to work with, and very quick to learn.&lt;/p&gt;
&lt;p&gt;What if we could create, in a matter of a few hours (depending on your
experience level), an application that runs in a web browser that enables our
wet lab friends to explore their analyzed datasets? It's not that hard. I
promise. If you have written a few functions and made some plots in R, Python,
or some other programming language then you know most of what you need to get
started.&lt;/p&gt;
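&lt;p&gt;To give a sense of how little code that can be, here is a minimal sketch of a
Shiny app in R. It is unrelated to the volcano plot app; the input and output
names (&lt;code&gt;n_points&lt;/code&gt; and &lt;code&gt;scatter&lt;/code&gt;) and the random data are arbitrary choices
for illustration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(shiny)

# UI: one input (a slider) and one output (a plot)
ui &amp;lt;- fluidPage(
  titlePanel(&amp;quot;Minimal Shiny sketch&amp;quot;),
  sliderInput(&amp;quot;n_points&amp;quot;, &amp;quot;Number of points:&amp;quot;, min = 10, max = 500, value = 100),
  plotOutput(&amp;quot;scatter&amp;quot;)
)

# Server: the plotting code re-runs whenever input$n_points changes
server &amp;lt;- function(input, output) {
  output$scatter &amp;lt;- renderPlot({
    plot(rnorm(input$n_points), rnorm(input$n_points), xlab = &amp;quot;x&amp;quot;, ylab = &amp;quot;y&amp;quot;)
  })
}

# The last expression in app.R should be the app object
shinyApp(ui = ui, server = server)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Saved as app.R, this can be launched from R by pointing &lt;code&gt;shiny::runApp()&lt;/code&gt; at
the folder that contains it.&lt;/p&gt;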
&lt;h2&gt;Our Goal&lt;/h2&gt;
&lt;p&gt;For this example, we will create a Shiny application which generates volcano
plots from &lt;a href="https://bioconductor.org/packages/release/bioc/html/DESeq2.html"&gt;DESeq2&lt;/a&gt;
results. This example app will have the following features:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Load results tables via browsing the local system.&lt;/li&gt;
&lt;li&gt;Generate a volcano plot.&lt;/li&gt;
&lt;li&gt;Display and allow searching through the results table.&lt;/li&gt;
&lt;li&gt;Allow changing of differential expression thresholds.&lt;/li&gt;
&lt;li&gt;Update the visualization and table according to differential expression
thresholds selected by users.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We'll use the &lt;a href="https://bioconductor.org/packages/release/data/experiment/html/pasilla.html"&gt;pasilla&lt;/a&gt;
package from Bioconductor for our test data, and the &lt;a href="https://codeberg.org/groverj3/genomics_visualizations/src/branch/master/volcano_plotteR.r"&gt;volcano plot code&lt;/a&gt;
I've used in previous examples as a starting point.&lt;/p&gt;
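&lt;p&gt;Features 4 and 5 in the list above boil down to re-labelling genes whenever a
threshold changes. Here is a minimal base R sketch of that logic, assuming a
results table with DESeq2's &lt;code&gt;log2FoldChange&lt;/code&gt; and &lt;code&gt;padj&lt;/code&gt; columns. The
&lt;code&gt;classify_genes()&lt;/code&gt; helper, its default cutoffs, and the toy data are
hypothetical and only meant to illustrate the idea:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Hypothetical helper: label each gene as up, down, or not significant
classify_genes &amp;lt;- function(results, padj_cutoff = 0.05, lfc_cutoff = 1) {
  # Genes with a missing padj or fold change are never called significant
  significant &amp;lt;- !is.na(results$padj) &amp;amp; !is.na(results$log2FoldChange) &amp;amp;
    results$padj &amp;lt; padj_cutoff
  status &amp;lt;- rep(&amp;quot;not significant&amp;quot;, nrow(results))
  status[significant &amp;amp; results$log2FoldChange &amp;gt;= lfc_cutoff] &amp;lt;- &amp;quot;up&amp;quot;
  status[significant &amp;amp; results$log2FoldChange &amp;lt;= -lfc_cutoff] &amp;lt;- &amp;quot;down&amp;quot;
  results$status &amp;lt;- status
  results
}

# Toy results table (invented values, not pasilla output)
toy_results &amp;lt;- data.frame(
  gene_id        = c(&amp;quot;gene_a&amp;quot;, &amp;quot;gene_b&amp;quot;, &amp;quot;gene_c&amp;quot;, &amp;quot;gene_d&amp;quot;),
  log2FoldChange = c(2.5, -1.8, 0.2, 3.0),
  padj           = c(0.001, 0.01, 0.5, NA)
)

classified &amp;lt;- classify_genes(toy_results)
classified$status
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In a Shiny app, the two cutoffs would come from user inputs instead of
function defaults, so the plot and table can be redrawn when they change.&lt;/p&gt;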
&lt;h2&gt;Set-up&lt;/h2&gt;
&lt;p&gt;You'll need R installed. After that, you'll need some packages to follow
along. You can get them as follows. Open an R console and run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# If you don&amp;#39;t have bioconductor&lt;/span&gt;
&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;BiocManager&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;quietly&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;BiocManager&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BiocManager&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;3.19&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# BiocManager::install() will also install packages from CRAN so they&amp;#39;re also in this list&lt;/span&gt;
&lt;span class="n"&gt;BiocManager&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DESeq2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pasilla&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;apeglm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Needed for lfcShrink(type = &amp;quot;apeglm&amp;quot;) in DESeq2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tidyverse&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;shiny&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DT&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DescTools&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# I like to use renv to manage project-specific libraries, this is optional&lt;/span&gt;
&lt;span class="nf"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;renv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I won't detail the use of &lt;a href="https://rstudio.github.io/renv/articles/renv.html"&gt;renv&lt;/a&gt;
throughout this guide, but I like the package and you can and should read up on
the documentation. It's kind of like virtual environments for R. Especially
useful for a project like this or if you're working with multiple developers.&lt;/p&gt;
&lt;p&gt;You'll now want to create a folder to work in, and you'll likely want to put it
under version control and host it on GitHub or a similar service. In your shell:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;~/Development/volcano_rnaseq_shiny_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, go into that directory and create a subfolder called "shiny" containing a
single R file called "app.R".&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~/Development/volcano_rnaseq_shiny_example
mkdir&lt;span class="w"&gt; &lt;/span&gt;shiny
touch&lt;span class="w"&gt; &lt;/span&gt;shiny/app.R
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Prepare the Differential Expression Data&lt;/h2&gt;
&lt;p&gt;The test data for this app is supplied in the aforementioned github repo, but
for the sake of completeness, this is how it's generated.&lt;/p&gt;
&lt;p&gt;First, load and reformat the pasilla data so it can be used for differential
expression in DESeq2. It's supplied as a counts matrix and metadata, but they
need some reformatting:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load pasilla, and I always use the tidyverse&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pasilla&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First load the counts table and then the metadata&lt;/span&gt;
&lt;span class="n"&gt;counts_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;system.file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;extdata/pasilla_gene_counts.tsv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pasilla&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_tsv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;system.file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;extdata&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pasilla_sample_annotation.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pasilla&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In order for this to work with DESeq2 the column names for the samples (aside
from the gene IDs) in the &lt;code&gt;counts_table&lt;/code&gt; must match row names (or a column that
can be converted to row names) in the &lt;code&gt;metadata_table&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;counts_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# A tibble: 6 × 8&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;gene_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated3&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn00&lt;/span&gt;…&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn00&lt;/span&gt;…&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;92&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;161&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;76&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;70&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;140&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;88&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;70&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn00&lt;/span&gt;…&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn00&lt;/span&gt;…&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn00&lt;/span&gt;…&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;4664&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;8714&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;3564&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;3150&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;6205&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3072&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3334&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn00&lt;/span&gt;…&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;583&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;761&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;245&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;310&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;722&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;299&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;308&lt;/span&gt;

&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# A tibble: 6 × 6&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;`number of lanes`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;…¹&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;`exon counts`&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treate&lt;/span&gt;…&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;sing&lt;/span&gt;…&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;35158667&lt;/span&gt;&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="m"&gt;15679615&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treate&lt;/span&gt;…&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;…&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;12242535&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;15620018&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treate&lt;/span&gt;…&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;…&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;12443664&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;12733865&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untrea&lt;/span&gt;…&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sing&lt;/span&gt;…&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;17812866&lt;/span&gt;&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="m"&gt;14924838&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untrea&lt;/span&gt;…&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sing&lt;/span&gt;…&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;34284521&lt;/span&gt;&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="m"&gt;20764558&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untrea&lt;/span&gt;…&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;…&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10542625&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;10283129&lt;/span&gt;
&lt;span class="c1"&gt;# ℹ abbreviated name: ¹​`total number of reads`&lt;/span&gt;

&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;treated1fb&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;treated2fb&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;treated3fb&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;untreated1fb&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;untreated2fb&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;untreated3fb&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;untreated4fb&amp;quot;&lt;/span&gt;

&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;str_remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;fb&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;metadata_table&lt;/span&gt;
&lt;span class="c1"&gt;# A tibble: 7 × 2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;treated&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated2&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;treated&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;treated3&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;treated&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;untreated&lt;/span&gt;

&lt;span class="c1"&gt;# Reorder the count columns to match the sample order in the metadata&lt;/span&gt;
&lt;span class="n"&gt;counts_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;counts_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gene_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;all_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Welcome to the wild and wonderful world of data cleaning. How's it feel to be a
computer janitor? I still do this kind of stuff more than any of the fancy analysis
methods I've learned. Data are messy.&lt;/p&gt;
&lt;p&gt;Now, you can generate some results based on treated vs untreated conditions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DESeq2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;DESeqDataSetFromMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;countData&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;column_to_rownames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gene_id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;colData&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;column_to_rownames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;file&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# condition is a character column, so DESeq2 orders its levels alphabetically; set untreated as the reference&lt;/span&gt;
&lt;span class="n"&gt;dds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;relevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;untreated&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;DESeq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;lfcShrink&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;condition_treated_vs_untreated&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;apeglm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.data.frame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rownames_to_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gene_id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;gene_id&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;baseMean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;lfcSE&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;pvalue&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn0000003&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.1715687&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.006979656&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.2057852&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.7874583&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kc"&gt;NA&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn0000008&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;95.1440790&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.001115354&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1517065&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9923316&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9969282&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn0000014&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1.0565722&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;-0.004634136&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.2048948&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.8181371&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kc"&gt;NA&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn0000015&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.8467233&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;-0.018148393&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.2061771&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.3714205&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kc"&gt;NA&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn0000017&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4352.5928988&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;-0.191126743&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1201758&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.0568330&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.2823626&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FBgn0000018&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;418.6149305&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;-0.070043056&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1236900&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.4797142&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.8239063&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;write_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pasilla_results.csv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now you can see how many genes are differentially expressed (at a BH-adjusted p &amp;lt; 0.1):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1061&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
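
&lt;p&gt;If you also want to know which direction those genes moved, a quick tally helps. Treat this as a rough sketch rather than part of the original analysis: it assumes the &lt;code&gt;pasilla_results.csv&lt;/code&gt; file written above is in your working directory, and the &lt;code&gt;direction&lt;/code&gt; and &lt;code&gt;n_no_padj&lt;/code&gt; column names are made up for illustration. The &lt;code&gt;NA&lt;/code&gt; values in &lt;code&gt;padj&lt;/code&gt; are mostly genes that DESeq2's independent filtering set aside for having low counts, so they're counted separately:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(tidyverse)

# Assumption: pasilla_results.csv is the file written by write_csv() above
results &amp;lt;- read_csv(&amp;quot;pasilla_results.csv&amp;quot;)

# Tally the significant genes by the sign of the shrunken log2 fold change.
# count() is namespaced in case another attached package masks dplyr's version.
results %&amp;gt;%
    filter(padj &amp;lt; 0.1) %&amp;gt;%
    mutate(direction = if_else(log2FoldChange &amp;gt; 0, &amp;quot;up&amp;quot;, &amp;quot;down&amp;quot;)) %&amp;gt;%
    dplyr::count(direction)

# Count the genes that have no adjusted p-value at all
results %&amp;gt;% summarize(n_no_padj = sum(is.na(padj)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;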

&lt;h2&gt;Let's Build the Shiny App&lt;/h2&gt;
&lt;p&gt;To make it easier to follow along, I've put this Shiny app up on Codeberg:
&lt;a href="https://codeberg.org/groverj3/volcano_rnaseq_shiny_example"&gt;volcano_rnaseq_shiny_example&lt;/a&gt;.
You can find a working example there.&lt;/p&gt;
&lt;p&gt;To create it yourself, open the &lt;code&gt;app.R&lt;/code&gt; we created earlier in your favorite
text/code editor and add the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shiny&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bslib&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggrepel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DescTools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Volcano plot code based on https://github.com/groverj3/genomics_visualizations/blob/master/volcano_plotteR.r&lt;/span&gt;
&lt;span class="n"&gt;volcplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Volcano Plot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_subtitle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the fold-change thresholds&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;neg_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;pos_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Make a dataset for plotting, add the status as a new column&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;padj&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is.na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;padj&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.Machine&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;double.xmin&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# p values reported as zero are really just too small for R to represent, so swap in the smallest positive double to keep -log10() finite&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;log2FoldChange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is.na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;up&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                         &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neg_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;down&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Get the number of up, down, and unchanged genes&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;up_genes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;up&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;down_genes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;down&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;unchanged_genes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Make the labels for the legend&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;legend_labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Up: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;up_genes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;NS: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;unchanged_genes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Down: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;down_genes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the x axis limits, rounded to the next even number&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;x_axis_limits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DescTools&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;RoundTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;ceiling&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the plot colors&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot_colors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;up&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;firebrick1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;down&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dodgerblue1&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Make the plot, these options are a reasonable starting point&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.5&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gene_id&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_vline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;xintercept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neg_log2fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos_log2fc&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;linetype&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dashed&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_hline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;yintercept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;linetype&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dashed&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;scale_x_continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;log2(FC)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x_axis_limits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x_axis_limits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;scale_color_manual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_colors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;legend_labels&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;-fold, padj ≤&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;subtitle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_subtitle&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;theme_bw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;aspect.ratio&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;axis.text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;black&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;legend.margin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;legend.box.margin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Reduces dead area around legend&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;legend.spacing.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;cm&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# Define UI&lt;/span&gt;
&lt;span class="n"&gt;ui&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;page_sidebar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Volcano PlotteR&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sidebar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sidebar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;fileInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;deseq2_results&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DESeq2 Results Table&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;numericInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;foldchange_threshold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Fold Change Threshold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;numericInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;padj_threshold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Adjusted p-value Threshold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;textInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;plot_title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Plot Title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Volcano Plot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;textInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;plot_subtitle&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Plot Subtitle&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;card&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;plotOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;volcano_plot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;min_height&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;580&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Ensures you don&amp;#39;t have to scroll within this card&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;DTOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;deseq2_table&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Server function&lt;/span&gt;
&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shiny.maxRequestSize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;deseq2_results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;reactive&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;req&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;deseq2_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;deseq2_results&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;datapath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;deseq2_results_filtered&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;reactive&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;req&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;deseq2_results&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;deseq2_results&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;foldchange_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;deseq2_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;renderDT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;deseq2_results_filtered&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;volcano_plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;renderPlot&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;req&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;deseq2_results&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;deseq2_results&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;volcplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;foldchange_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;plot_title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;plot_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;plot_subtitle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;plot_subtitle&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;550&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Run the application&lt;/span&gt;
&lt;span class="nf"&gt;shinyApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ui&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ui&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There's a bit to unpack here. But for now, just run it by entering the
&lt;code&gt;volcano_rnaseq_shiny_example&lt;/code&gt; project directory, starting the R interpreter
with &lt;code&gt;R&lt;/code&gt;, and running:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;shiny&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;runApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;shiny&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;"shiny" within runApp matches the name of the subfolder containing our &lt;code&gt;app.R&lt;/code&gt;.
If everything works as expected (fingers crossed!) you'll be presented with
something like the following in your R terminal:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Listening on http://127.0.0.1:3698
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you navigate to the IP address and port in your favorite browser you should
see the fruits of your labors. After clicking the browse button you can load
the &lt;code&gt;pasilla_results.csv&lt;/code&gt; you created earlier:&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="https://codeberg.org/groverj3/volcano_rnaseq_shiny_example/raw/branch/main/images/volcano_plotteR_1.png" alt="Your App is Served"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;That looks great! Now try changing the controls on the left. You'll see that the
plot and table react in real time!&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="https://codeberg.org/groverj3/volcano_rnaseq_shiny_example/raw/branch/main/images/volcano_plotteR_2.png" alt="It's reactive!"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;h2&gt;Explanation&lt;/h2&gt;
&lt;p&gt;If you ignore the volcano plot code, which is mostly the same (with some
changes and simplifications) as the code explained in a &lt;a href="2024-04-21_making-volcano-plots-with-ggplot2.html"&gt;previous post&lt;/a&gt;,
you're left with only ~50 lines of code. That's really not much to get an
interactive web app.&lt;/p&gt;
&lt;p&gt;The main logic of the app is broken up into two parts, the ui (defined by
layouts and content) and the &lt;code&gt;server()&lt;/code&gt; function. The inputs and outputs in the
server function map to the names of the inputs and outputs from the ui. You can
have different types of outputs (plots, tables, etc.) in sections of the UI. In
this simple example we use &lt;code&gt;library(DT)&lt;/code&gt; and the &lt;code&gt;DTOutput()&lt;/code&gt; function
as a way to display dataframes that uses the JavaScript DataTables library as a
backend. Likewise our volcano plot code uses ggplot2 and the &lt;code&gt;plotOutput()&lt;/code&gt;
function displays it. At the end of the code, we run &lt;code&gt;shinyApp()&lt;/code&gt; with our ui
and server to make it all happen.&lt;/p&gt;
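&lt;p&gt;To make the id mapping concrete, here is a minimal sketch that is separate from
the volcano plot app. The ids &lt;code&gt;n_points&lt;/code&gt; and &lt;code&gt;scatter&lt;/code&gt; are made up for this
illustration, and it assumes plain &lt;code&gt;fluidPage()&lt;/code&gt; from shiny instead of the layout
functions used above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(shiny)

# The ids &amp;quot;n_points&amp;quot; and &amp;quot;scatter&amp;quot; are hypothetical, chosen only for this sketch
ui &amp;lt;- fluidPage(
  numericInput(&amp;quot;n_points&amp;quot;, &amp;quot;Number of points&amp;quot;, value = 50),
  plotOutput(&amp;quot;scatter&amp;quot;)
)

server &amp;lt;- function(input, output) {
  # input$n_points reads the numericInput() with id &amp;quot;n_points&amp;quot;
  # output$scatter fills the plotOutput() with id &amp;quot;scatter&amp;quot;
  output$scatter &amp;lt;- renderPlot({
    plot(rnorm(input$n_points), rnorm(input$n_points))
  })
}

# Assigning the result builds the app object without launching it
app &amp;lt;- shinyApp(ui = ui, server = server)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Printing &lt;code&gt;app&lt;/code&gt; in an interactive session, or passing it to &lt;code&gt;runApp()&lt;/code&gt;, should
launch it. If an id in the ui has no counterpart in the server function (or
vice versa), that element simply stays empty.&lt;/p&gt;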
&lt;p&gt;There are a few other things of note going on here. We've wrapped loading and
filtering of our results table in &lt;code&gt;reactive()&lt;/code&gt;, and the plot in
&lt;code&gt;renderPlot()&lt;/code&gt;. This does exactly what you think: it makes the plot and data
table react to changes in the input data. So when you change the data that was
loaded in, or any of the controls that map to filtering criteria, etc., the
elements are regenerated. The &lt;code&gt;req()&lt;/code&gt; function in there makes sure the input
dataset is available before the rest of the expression runs, so nothing is drawn
until a file has been loaded. Hopefully that makes sense. For a simple example,
I think that explanation suffices.&lt;/p&gt;
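&lt;p&gt;If you want to poke at &lt;code&gt;reactive()&lt;/code&gt; and &lt;code&gt;req()&lt;/code&gt; without building an app,
the following sketch can be run in a plain R session. It is only an illustration:
&lt;code&gt;threshold&lt;/code&gt; is a made-up &lt;code&gt;reactiveVal()&lt;/code&gt; standing in for an input control,
and &lt;code&gt;isolate()&lt;/code&gt; is used to read reactive values outside of an app:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(shiny)

# Hypothetical stand-in for an input that has no value yet
threshold &amp;lt;- reactiveVal(NULL)
values &amp;lt;- c(1, 5, 10)

filtered &amp;lt;- reactive({
  req(threshold())                    # stop quietly until threshold has a value
  values[values &amp;gt;= threshold()]       # recomputed after threshold changes
})

# isolate() lets us read a reactive from the console, outside of an app
before &amp;lt;- tryCatch(isolate(filtered()), error = function(e) &amp;quot;not ready&amp;quot;)

threshold(5)                          # simulate the user setting the control
after &amp;lt;- isolate(filtered())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;before&lt;/code&gt; should hold the fallback string, because &lt;code&gt;req()&lt;/code&gt; halted the
expression, while &lt;code&gt;after&lt;/code&gt; should hold the values at or above the new threshold.
Inside an app, Shiny handles that halt for you and just leaves the output blank.&lt;/p&gt;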
&lt;p&gt;All of these packages and functions have great documentation that goes far
beyond what I've written here, so I recommend reading it. You can add a lot
more functionality without too much trouble; this is just a simple example.&lt;/p&gt;
&lt;h2&gt;What Can You &lt;strong&gt;DO&lt;/strong&gt; With This?&lt;/h2&gt;
&lt;p&gt;Imagine you're in a meeting and you're having that back and forth with the wet
lab scientists I talked about earlier. Now, you can pull out your shiny app
and use that as a tool to filter data, generate visualizations, and save the
output on the fly. Even better, if you get really ambitious you can containerize
it, serve it on your LAN, and let anyone use it!&lt;/p&gt;
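&lt;p&gt;As a rough sketch of the "serve it on your LAN" idea, &lt;code&gt;runApp()&lt;/code&gt; accepts
&lt;code&gt;host&lt;/code&gt; and &lt;code&gt;port&lt;/code&gt; arguments. The helper name &lt;code&gt;serve_on_lan()&lt;/code&gt; and the
port number are assumptions for this example, and whether colleagues can reach
the app will also depend on your firewall and network setup:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Hypothetical helper: serve the app folder to other machines on the network
serve_on_lan &amp;lt;- function(app_dir = &amp;quot;shiny&amp;quot;, port = 3838) {
  shiny::runApp(
    app_dir,
    host = &amp;quot;0.0.0.0&amp;quot;,         # listen on all interfaces, not just localhost
    port = port,              # arbitrary port picked for this example
    launch.browser = FALSE    # no need to open a browser on the serving machine
  )
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;serve_on_lan()&lt;/code&gt; from the project directory blocks the R session
while the app is being served, the same way &lt;code&gt;shiny::runApp(&amp;quot;shiny&amp;quot;)&lt;/code&gt; did earlier.&lt;/p&gt;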
&lt;p&gt;I suspect the bench scientists will be happy because they can filter, visualize,
and do whatever else you've built for them. You'll be happy because your
meetings can be more productive and your colleagues can generate more insights
on their own and bring those to you for in-depth analysis.&lt;/p&gt;
&lt;p&gt;The more I think about what the optimal split of duties for a genomics project
should be, the more I think we should be developing simple tools like this.
Small interactive apps like this allow bioinformatics staff to focus on solving
hard problems, making sure data is processed consistently, figuring out how to
apply novel methods to large datasets, etc. The other stakeholders who help to
generate the data can be empowered to explore that data without the burden of
knowing how to process it from raw files, but still get to have an active role
in generating insights.&lt;/p&gt;
&lt;p&gt;Shiny and similar frameworks have relatively easy syntax to learn when getting
started if you're already familiar with R or python. While there are certainly
commercial products that have functionality far surpassing these small apps, if
you're looking for a simple tool to help bridge the gap between wet and dry lab
scientists this may fit the bill at $0, aside from your labor :).&lt;/p&gt;
&lt;h2&gt;Reference&lt;/h2&gt;
&lt;p&gt;Huber W, Reyes A (2024). pasilla: Data package with per-exon and per-gene read
counts of RNA-seq samples of Pasilla knock-down by Brooks et al., Genome
Research 2011. R package version 1.32.0.&lt;/p&gt;</content><category term="how-to"></category><category term="R"></category><category term="shiny"></category><category term="app"></category><category term="RNAseq"></category><category term="bioinformatics"></category><category term="data-visualization"></category></entry><entry><title>On Bioinformatics Workflow Design</title><link href="https://groverj3.codeberg.page/articles/2024-04-26_on-bioinformatics-workflow-design.html" rel="alternate"></link><published>2024-04-26T00:00:00-04:00</published><updated>2024-04-26T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2024-04-26:/articles/2024-04-26_on-bioinformatics-workflow-design.html</id><summary type="html">&lt;p&gt;Since I was in grad school I've been writing bioinformatics &lt;a href="2019-08-19_the-snakemake-tutorial-i-wish-i-had.html"&gt;workflows&lt;/a&gt;.
Usually to process NGS data. The concept of a workflow is simple, &lt;a href="https://github.com/spotify/luigi"&gt;and not limited to the domain of bioinformatics&lt;/a&gt;.
However, a workflow (aka "pipeline") used to analyze data from next generation sequencing (again, will it ever be
"current …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Since I was in grad school I've been writing bioinformatics &lt;a href="2019-08-19_the-snakemake-tutorial-i-wish-i-had.html"&gt;workflows&lt;/a&gt;.
Usually to process NGS data. The concept of a workflow is simple, &lt;a href="https://github.com/spotify/luigi"&gt;and not limited to the domain of bioinformatics&lt;/a&gt;.
However, a workflow (aka "pipeline") used to analyze data from next generation sequencing (again, will it ever be
"current gen?") certainly falls under this banner.&lt;/p&gt;
&lt;p&gt;Over the past few years I've become more opinionated on how bioinformatics workflows should be designed. First, we should
have a basic understanding of what a bioinformatics workflow is and what one looks like.&lt;/p&gt;
&lt;h3&gt;What Is a Bioinformatics Workflow?&lt;/h3&gt;
&lt;p&gt;If you work in bioinformatics and/or computational biology (really, most fields of science that utilize computational
resources on a medium-large scale) you've probably written a workflow. I would define such a process as a series of
programs, usually (but not always) operating sequentially on the output of the previous tool in the series. A typical
workflow for the analysis of RNAseq data might look like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;QC of raw .fastq data &lt;a href="https://github.com/s-andrews/FastQC"&gt;(fastqc)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Trimming of reads to remove adapters and low quality basecalls &lt;a href="https://github.com/FelixKrueger/TrimGalore"&gt;(trim_galore)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Alignment of cleaned reads against a reference genome &lt;a href="https://github.com/alexdobin/STAR"&gt;(STAR)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Sorting and indexing the alignment .bam file &lt;a href="https://github.com/samtools/samtools"&gt;(samtools)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Post-alignment QC &lt;a href="https://github.com/broadinstitute/picard"&gt;(picard)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read per gene counting &lt;a href="https://subread.sourceforge.net/"&gt;(featurecounts)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Log file aggregation &lt;a href="https://github.com/MultiQC/MultiQC"&gt;(multiqc)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Differential expression analysis &lt;a href="https://bioconductor.org/packages/release/bioc/html/DESeq2.html"&gt;(DESeq2)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This example is pretty straightforward, and is not by any means the entirety of what one may do in the course of RNAseq
analysis (for one, STAR can generate the counts tables directly with no need for featurecounts or other tools). It does
demonstrate the concept though. In most cases, each tool runs on the output file of the previous step, with some
exceptions (fastqc and trim_galore, for example, both operate on the raw data). More complex workflows may feature many
steps which can run in parallel because their outputs are not required as inputs until reaching a later step.&lt;/p&gt;
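&lt;p&gt;One way to picture this is to write down, for each step, which earlier steps it
needs, and then group the steps into "waves" that could run side by side. The
sketch below does that in R. The dependencies are my own assumptions for this
illustration (for example, which logs multiqc collects), the step names are just
labels, and this is not how any particular workflow framework is implemented:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Each step lists the steps whose output it needs (assumed for illustration)
steps &amp;lt;- list(
  fastqc        = character(0),
  trim_galore   = character(0),
  star          = &amp;quot;trim_galore&amp;quot;,
  samtools      = &amp;quot;star&amp;quot;,
  picard        = &amp;quot;samtools&amp;quot;,
  featurecounts = &amp;quot;samtools&amp;quot;,
  multiqc       = c(&amp;quot;fastqc&amp;quot;, &amp;quot;picard&amp;quot;, &amp;quot;featurecounts&amp;quot;),
  deseq2        = &amp;quot;featurecounts&amp;quot;
)

# Group steps into waves: a step joins a wave once everything it needs is done,
# so the steps within one wave do not depend on each other
schedule_waves &amp;lt;- function(steps) {
  done &amp;lt;- character(0)
  waves &amp;lt;- list()
  while (length(done) &amp;lt; length(steps)) {
    pending &amp;lt;- setdiff(names(steps), done)
    is_ready &amp;lt;- vapply(
      steps[pending],
      function(needs) all(needs %in% done),
      logical(1)
    )
    ready &amp;lt;- pending[is_ready]
    if (length(ready) == 0) stop(&amp;quot;Circular dependency between steps&amp;quot;)
    waves[[length(waves) + 1]] &amp;lt;- ready
    done &amp;lt;- c(done, ready)
  }
  waves
}

waves &amp;lt;- schedule_waves(steps)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With these assumed dependencies, fastqc and trim_galore land in the first wave
because both only need the raw data, while picard and featurecounts share a later
wave because both only need the sorted alignments.&lt;/p&gt;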
&lt;h3&gt;Common Workflow Frameworks In Bioinformatics&lt;/h3&gt;
&lt;p&gt;If you've ever written a series of scripts in shell, python, R, or any other language, that uses sequential processes on
one or more input files then you already have a workflow. Most people in bioinformatics that process NGS data begin
writing BASH scripts as both a way to enable hands-off running of workflow steps, and a way to record what work was
performed.&lt;/p&gt;
&lt;p&gt;There are many workflow management frameworks in common usage within bioinformatics today. These include (this list is
not exhaustive):&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nextflow.io/" target="_blank"&gt;&lt;img src="../images/nextflow-logo-bg-light.png" alt="Nextflow" style="width:214px;height:35px;"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://snakemake.github.io/" target="_blank"&gt;&lt;img src="../images/snakemake_logo_dark.png" alt="Snakemake" style="width:208px;height:51px;"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.commonwl.org/" target="_blank"&gt;&lt;img src="../images/CWL-Logo-VGA.png" alt="CWL" style="width:111px;height:70px"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://openwdl.org/" target="_blank"&gt;&lt;img src="../images/wdl-logo.png" alt="WDL" style="width:128px;height:51px"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I am fully aware that not all of these aspire to also encompass execution, choosing to leave that to separate
tools (CWL, WDL). However, for the purpose of this discussion lumping them together with Nextflow and Snakemake makes
sense logically as workflow languages. I will probably catch flak for this.&lt;/p&gt;
&lt;p&gt;This article is not a deep dive on each framework's pros and cons or my opinions on them.&lt;/p&gt;
&lt;h3&gt;The "Why" and "When" Of Workflow Automation&lt;/h3&gt;
&lt;p&gt;Workflow management frameworks do a lot of things for you, but all have a learning curve. They each have unique syntax,
and their own common workflow design idioms. However, if you use a framework like these you'll be able to write once,
run anywhere (in theory). Run locally, fine. Run on an HPC, great. Run on a public cloud (AWS, GCP, Azure), cool. Run on
a Kubernetes cluster, probably fine. You get the idea.&lt;/p&gt;
&lt;p&gt;These workflow languages and the executors that run them allow efficient resource usage. If you define the resources
required for each step, the executor will determine how many of each job can run in parallel. They also excel at
scattering jobs across compute nodes, which is especially important in an HPC or cloud compute context. Importantly, they
(can) enable greater reproducibility. They integrate with container runtimes, allowing you to use Docker, Podman,
Apptainer, et al. for each piece of software in a workflow and guarantee a specific version of each tool is used. This
also eases deployment to HPCs or cloud compute.&lt;/p&gt;
&lt;p&gt;In my opinion, you should look to workflow management frameworks when you want:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reproducibility&lt;ul&gt;
&lt;li&gt;You can rerun an analysis and get reliable, and comparable, results.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Automation&lt;ul&gt;
&lt;li&gt;No need to write scripts for each step, no need for complex scripting to handle resource management.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Portability&lt;ul&gt;
&lt;li&gt;Run the same process agnostic of underlying hardware (local, HPC, cloud, etc.).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Harmonization&lt;ul&gt;
&lt;li&gt;You should have no question about whether data from similar experiments, analyzed with the same workflow, are comparable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Efficiency&lt;ul&gt;
&lt;li&gt;You are going to run a process many times and you want to reduce execution time and/or costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I wouldn't bother using workflow management frameworks for:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prototyping&lt;ul&gt;
&lt;li&gt;Write the workflow when you are &lt;em&gt;done&lt;/em&gt; prototyping. Scripts are well-suited for this and it's easier to write
workflows when you have well-written scripts to start with.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;One-off analyses&lt;ul&gt;
&lt;li&gt;Is it really worth it to spend your time on this instead of directly answering your research question? BASH
scripts are still a thing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Bespoke visualization and statistical analysis&lt;ul&gt;
&lt;li&gt;This is difficult to standardize and requires careful consideration of data distributions. Consider computational
notebooks instead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Some Bioinformatics Workflow Anti-Patterns&lt;/h3&gt;
&lt;p&gt;Now I'm going to get controversial.&lt;/p&gt;
&lt;p&gt;I have seen these in some high-profile implementations.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;God Workflows&lt;ul&gt;
&lt;li&gt;A single workflow that processes multiple kinds of -omics data.&lt;/li&gt;
&lt;li&gt;A workflow that includes specialized processing of the same kind of data in multiple unrelated ways.&lt;/li&gt;
&lt;li&gt;I have seen these as the buzzword "multi-omics" has become more prevalent.&lt;/li&gt;
&lt;li&gt;Similar in concept to god functions or &lt;a href="https://en.wikipedia.org/wiki/God_object"&gt;god objects&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Filename Implicit Metadata&lt;ul&gt;
&lt;li&gt;Who isn't guilty of using filenames to store metadata? I am. &lt;code&gt;sampleID_treatment_replicate.fastq.gz&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;It's not a problem to store metadata in filenames, per se. It's a problem to &lt;strong&gt;depend&lt;/strong&gt; on filenames as the
ground truth source of file information.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Not Actually Automated&lt;ul&gt;
&lt;li&gt;Workflows which expose far too many options to actually be that useful as automation.&lt;/li&gt;
&lt;li&gt;Do you really need to allow the changing of &lt;strong&gt;every&lt;/strong&gt; option in your steps?&lt;/li&gt;
&lt;li&gt;This is especially problematic when there are options that are not suitable for a given workflow and should never
be enabled/disabled.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Forgetting To Make Design &lt;strong&gt;Decisions&lt;/strong&gt;&lt;ul&gt;
&lt;li&gt;In the name of providing options to users, you may be tempted to allow multiple programs in different steps.&lt;/li&gt;
&lt;li&gt;"You can use STAR &lt;em&gt;or&lt;/em&gt; HISAT2 by running with &lt;code&gt;--aligner STAR&lt;/code&gt; or &lt;code&gt;--aligner HISAT2&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;Now, saying "I ran &lt;code&gt;{workflow_name_here}&lt;/code&gt; from the &lt;code&gt;{fancy_project_name}&lt;/code&gt; github" is not descriptive enough to
actually inform people what was run.&lt;/li&gt;
&lt;li&gt;Yes, I am aware log files still exist, but you shouldn't need to look at a log file to at least know the basic
steps that were executed on some data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Unnecessary Complexity&lt;ul&gt;
&lt;li&gt;"Real programmers separate all their functions and classes into modules, so each step of my 10 step workflow is in
a separate file."&lt;/li&gt;
&lt;li&gt;Breaking up workflows can be done for good reasons, but this isn't one of them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Good Practices To Use Instead&lt;/h3&gt;
&lt;p&gt;To avoid these problems, in order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Embrace Modular Workflows&lt;ul&gt;
&lt;li&gt;A workflow processes &lt;strong&gt;one&lt;/strong&gt; kind of data.&lt;/li&gt;
&lt;li&gt;A workflow outputs data for &lt;strong&gt;one&lt;/strong&gt; purpose (not necessarily only one kind of experiment, since the same outputs may
be useful for more than one type of experiment).&lt;/li&gt;
&lt;li&gt;If you need to integrate multiple -omics types consider higher order workflows, like higher-order functions or
classes (depending on your preference for functional or object oriented programming).&lt;/li&gt;
&lt;li&gt;Higher-order, or nested, workflows allow the execution of multiple sub-workflows, each of which can still be executed
just as well on its own.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;First Class Metadata&lt;ul&gt;
&lt;li&gt;The ground truth for sample information, the metadata, lives somewhere other than the filenames.&lt;/li&gt;
&lt;li&gt;Link the metadata to the files. Using the filenames and md5 hashes in a separate database is a good option.&lt;/li&gt;
&lt;li&gt;At a minimum, a low-effort way to solve this just involves a .csv file that has columns for filenames in addition
to other sample metadata.&lt;/li&gt;
&lt;li&gt;Use metadata sheets when executing workflows to read sample names and important information related to options
that need changing for specific samples (paired end read status, etc.).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Automate, Automate, Automate!&lt;ul&gt;
&lt;li&gt;If there are options that should not be changed do not allow users to change them.&lt;/li&gt;
&lt;li&gt;Every "option" should have a default that you have chosen for &lt;strong&gt;good reason&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You should very rarely &lt;em&gt;need&lt;/em&gt; to specify an option at execution to successfully and correctly run your workflow.&lt;/li&gt;
&lt;li&gt;Remember that one of your reasons behind writing a workflow is to automate the thing beyond writing individual
scripts.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Own Your Design Decisions&lt;ul&gt;
&lt;li&gt;Every tool for each step of your analysis was chosen for a reason. Let your users know that.&lt;/li&gt;
&lt;li&gt;There are exceptions to every rule, but the rule should be &lt;strong&gt;one step, one tool&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If someone thinks their favorite tool is better than the one you picked they can write their own workflow or use
a different one.&lt;/li&gt;
&lt;li&gt;It should be clear what happened to the data when you explain "I ran &lt;code&gt;{workflow_name_here}&lt;/code&gt; from the
&lt;code&gt;{fancy_project_name}&lt;/code&gt; github." Yes, log files are still important.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Keep It Simple&lt;ul&gt;
&lt;li&gt;If you have a somewhat small workflow you don't need separate modules for every step. Splitting it up actually makes
it &lt;strong&gt;less&lt;/strong&gt; readable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;In Conclusion&lt;/h3&gt;
&lt;p&gt;No code today, just thoughts. Maybe you've thought these things but didn't put them into writing. Maybe you're just
coming around to the idea of workflow automation. Maybe you think I'm wrong (in which case, please don't email me; I
already get too much email and I'll just delete it anyway). However, I think by keeping a few things in mind you can
really improve both the readability of your workflows and their usefulness.&lt;/p&gt;
&lt;p&gt;Thanks for coming to my TED Talk/giant wall o' text. If you like these thoughts I accept payment in the form of cookies,
peanut m&amp;amp;ms, millions of dollars in cash by the briefcase, and mysterious wire transfers in amounts large enough to pay
off my wife and I's student debt.&lt;/p&gt;</content><category term="commentary"></category><category term="workflows"></category><category term="bioinformatics"></category></entry><entry><title>Making Volcano Plots With ggplot2</title><link href="https://groverj3.codeberg.page/articles/2024-04-21_making-volcano-plots-with-ggplot2.html" rel="alternate"></link><published>2024-04-21T00:00:00-04:00</published><updated>2024-04-21T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2024-04-21:/articles/2024-04-21_making-volcano-plots-with-ggplot2.html</id><summary type="html">&lt;p&gt;One of the, if not &lt;em&gt;the&lt;/em&gt;, most common downstream analysis task I'm asked to perform on RNAseq data is to generate the
venerable "Volcano Plot." These are kind of the bioinformatics equivalent of saying "Hey! Look how much data I have!"
Regardless, they are a pretty good way to quickly …&lt;/p&gt;</summary><content type="html">&lt;p&gt;One of the, if not &lt;em&gt;the&lt;/em&gt;, most common downstream analysis task I'm asked to perform on RNAseq data is to generate the
venerable "Volcano Plot." These are kind of the bioinformatics equivalent of saying "Hey! Look how much data I have!"
Regardless, they are a pretty good way to quickly summarize an RNAseq experiment. There are now lots of options for
generating these visualizations. If you're looking for a plug and play option, there is the excellent Bioconductor package
&lt;a href="https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html"&gt;EnhancedVolcano&lt;/a&gt;. However, if you are an R
tidyverse user you actually already have everything you need to make these plots.&lt;/p&gt;
&lt;p&gt;Starting in grad school, I created a library of R and Python snippets that I still reuse. I've continued to update my
volcano plot code over time and at this point I actually still reuse that rather than loading in another package. Below,
I will share this code and explain the major concepts behind making it. I'm not a software engineer, so it's likely that
there are lots of other ways to throw this together.&lt;/p&gt;
&lt;h3&gt;The Full Function&lt;/h3&gt;
&lt;p&gt;This function is also available &lt;a href="https://codeberg.org/groverj3/genomics_visualizations/src/branch/master/volcano_plotteR.r"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dplyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
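&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggrepel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# For displaying gene labels, if you don&amp;#39;t want them you can omit this library&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stringr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# For str_c(), used to build the legend labels&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# For replace_na(), used on the gene symbols&lt;/span&gt;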

&lt;span class="n"&gt;volcplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Volcano Plot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_subtitle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;genelist_vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;genelist_filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the fold-change thresholds&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;neg_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;pos_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Make a dataset for plotting, add the status as a new column&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;padj&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is.na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;log2FoldChange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;is.na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;up&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                         &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neg_log2fc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;down&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;replace_na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;none&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genelist_filter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%in%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;genelist_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;is.null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genelist_vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%in%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;genelist_vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Get the number of up, down, and unchanged genes&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;up_genes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;up&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;down_genes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;down&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;unchanged_genes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Make the labels for the legend&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;legend_labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Up: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;up_genes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;NS: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;unchanged_genes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Down: &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;down_genes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the x axis limits, rounded to the next even number&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;x_axis_limits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DescTools&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;RoundTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;ceiling&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the plot colors&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot_colors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;up&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;firebrick1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ns&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;gray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;down&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dodgerblue1&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Make the plot, these options are a reasonable starting point&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_ready_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.5&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log2FoldChange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padj&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log2fc_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hgnc_symbol&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_vline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;xintercept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neg_log2fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos_log2fc&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;linetype&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dashed&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_hline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;yintercept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;linetype&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dashed&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;scale_x_continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;log2(FC)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x_axis_limits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x_axis_limits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;scale_color_manual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_colors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;legend_labels&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;str_c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;-fold, padj ≤&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;padj_threshold&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;subtitle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot_subtitle&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;theme_bw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;aspect.ratio&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;axis.text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;black&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;legend.margin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;legend.box.margin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Reduces dead area around legend&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;legend.spacing.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;cm&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# Add gene labels if needed&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;is.null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genelist_vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nf"&gt;geom_label_repel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;force&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;max.overlaps&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;nudge_x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;segment.color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;black&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;min.segment.length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;show.legend&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Yes, this is rather long, but it's actually fairly straightforward to understand. Hopefully the comments help.&lt;/p&gt;
&lt;h3&gt;But What Does This Look Like?&lt;/h3&gt;
&lt;p&gt;Here's an example of a typical volcano plot this generates:&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="https://codeberg.org/groverj3/genomics_visualizations/raw/branch/master/volcano_plotteR.png"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;There are lots of places to customize, of course, since it's just a normal ggplot2 object.&lt;/p&gt;
&lt;h3&gt;How Is The Input Data Formatted?&lt;/h3&gt;
&lt;p&gt;This function works with DESeq2 output results as a data frame, but requires a bit of reformatting. So, you can get there
like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;deseq_results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deseq2_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;rownames_to_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;ensembl_id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;left_join&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;ensembl_id_hgnc_symbol&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I typically work with ensembl gene IDs as a ground truth identifier for genes, and also include gene symbols as a more
human readable identifier. Since I'm primarily working with human cell lines at the moment there needs to be a column in
your dataset called "hgnc_symbol," according to the design of the volcano plot function. We achieve this by &lt;code&gt;left_join()&lt;/code&gt;
with an additional dataframe that consists of only two columns, "ensembl_id" and "hgnc_symbol." If you do work in mice, plants,
etc. you can change all references to that column to suit your needs both here and in the plotting function.&lt;/p&gt;
&lt;h3&gt;Brief Explanation&lt;/h3&gt;
&lt;p&gt;You can think of this function doing things in a few discrete steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Set the fold change thresholds for the plot based on what you provide for the variable &lt;code&gt;fc&lt;/code&gt;, which defaults to 1 (no threshold).&lt;/li&gt;
&lt;li&gt;Set NAs in the &lt;code&gt;padj&lt;/code&gt; column to 1 and in the &lt;code&gt;log2FoldChange&lt;/code&gt; column to 0. Create a new variable with the gene's differential expression status (up, down, not significant).&lt;/li&gt;
&lt;li&gt;Filter the dataset on a list of hgnc symbols you supply (optional).&lt;/li&gt;
&lt;li&gt;Remove gene symbol labels if not differentially expressed and not a member of a list supplied when invoking the function (optional).&lt;/li&gt;
&lt;li&gt;Get the number of genes which are significantly up and down, and the number which are not significant for the legend.&lt;/li&gt;
&lt;li&gt;Create the legend labels based on number of differentially expressed genes.&lt;/li&gt;
&lt;li&gt;Set the X axis limits based on rounding to the next multiple of 2 (because log base 2) of the absolute value of the max in the log2FoldChange column.&lt;/li&gt;
&lt;li&gt;Set the colors for the plot, defaults can be easily changed but I like them.&lt;/li&gt;
&lt;li&gt;Build the ggplot object, simply using &lt;code&gt;geom_point()&lt;/code&gt; and some vertical/horizontal lines based on your fold change and padj thresholds.&lt;/li&gt;
&lt;li&gt;Add labels to points based on hgnc_symbol (optional).&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Some Gotchas&lt;/h3&gt;
&lt;p&gt;DESeq2 sets padj and log2foldchange to NA for many reasons. This may be because of the expression level and filtering out
low-expressing genes prior to statistical testing. It may also be due to lack of replicates and too much variability. Regardless,
it's something of a philosophical question as to whether you want these genes to show up in the "not significant" category
or whether you should simply not include them in the results at all. At this point, I lean toward setting their p values
to 1 and log fold changes to 0. This way, such genes end up in the "not significant" category. My reasoning: this heads
off questions about why the number of genes in each category may not add up to the number in the annotation set across
comparisons. Now those genes which are significantly up, down, and not significant always add up to the same number
assuming that you're using the same annotations.&lt;/p&gt;
&lt;h3&gt;Why Not Just Use EnhancedVolcano?&lt;/h3&gt;
&lt;p&gt;Honestly, there isn't really a good reason not to. However, I already had this code on-hand and therefore I find it
pretty easy to just run this on the reg. If you're learning ggplot2 and the tidyverse I think this is a good way to learn
with a real example.&lt;/p&gt;</content><category term="how-to"></category><category term="bioinformatics"></category><category term="data-visualization"></category><category term="rnaseq"></category></entry><entry><title>Managing Software on a Multiuser Linux System - An Update</title><link href="https://groverj3.codeberg.page/articles/2024-04-20_managing-software-on-a-multiuser-linux-system-an-update.html" rel="alternate"></link><published>2024-04-20T00:00:00-04:00</published><updated>2024-04-20T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2024-04-20:/articles/2024-04-20_managing-software-on-a-multiuser-linux-system-an-update.html</id><summary type="html">&lt;p&gt;Back in 2019, the halcyon days of yore, near the end of my time in graduate school I wrote a well-intentioned article
about software management for multi-user linux systems (&lt;a href="/articles/2019-06-25_managing-software-on-a-multiuser-linux-system.html"&gt;here&lt;/a&gt;).
This original article was written based on my experiences as the de-facto sysadmin of our lab's bioinformatics server.
I am …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Back in 2019, the halcyon days of yore, near the end of my time in graduate school I wrote a well-intentioned article
about software management for multi-user linux systems (&lt;a href="/articles/2019-06-25_managing-software-on-a-multiuser-linux-system.html"&gt;here&lt;/a&gt;).
This original article was written based on my experiences as the de-facto sysadmin of our lab's bioinformatics server.
I am not a trained Linux sysadmin, I didn't even major in anything computer-related in college. However, I have been a
big nerd as long as I can remember and have been playing with Linux far longer than I was using it for a job. That
article was a good starting point. Lots of things have changed in the past few years. While my thoughts are similar as
back then, I do have the benefit of additional experience to draw on, as well as some developments on the software and
hardware side of things since then.&lt;/p&gt;
&lt;p&gt;This is not a guide to setting up an HPC or cluster. It's also not a guide for setting up a cloud compute environment.
If you're not at a university or a large company, then you're unlikely to have an HPC. On the flip side, I like cloud
compute, but always like to have a local server for my work. It's just faster to develop on. If you have average compute
needs for bioinformatics (alignment, variant calling, notebooks, etc.) then a single node local server does a very good
job. Especially with modern processors and enough RAM. Plus, if you need more compute you can always get it through your
cloud vendor of choice.&lt;/p&gt;
&lt;p&gt;As the only full-time bioinformatics scientist at a midsize biotech company I have once again found myself in the situation
of managing a server for my own work. As we add more people, we need processes that scale well. My goal is to make this
system, augmented with cloud resources, work as a primary compute server until reaching 4-5 users. At that point, it
makes more sense to use such a system for testing and prototyping rather than a main compute resource.&lt;/p&gt;
&lt;p&gt;Consider these tips an addendum to my previous article on the subject.&lt;/p&gt;
&lt;h3&gt;0. Make Your Life Easier With Containers&lt;/h3&gt;
&lt;p&gt;What is the best way to avoid installing a bunch of random software for your small number of users? Just don't. Encourage
every user to use containers. But how will they do so without &lt;code&gt;sudo&lt;/code&gt; privileges for Docker? Easy, have them use &lt;a href="https://apptainer.org/"&gt;Apptainer&lt;/a&gt;
or &lt;a href="https://podman.io/"&gt;Podman&lt;/a&gt;. Neither of these options requires superuser privileges. Apptainer split off from &lt;a href="https://sylabs.io/"&gt;Singularity&lt;/a&gt;,
which is also an option. Podman is a Red Hat product, but runs just fine on Ubuntu and other distros, plus it
is &lt;em&gt;mostly&lt;/em&gt; CLI-identical to Docker. Docker is problematic in HPC and multi-user environments because it requires superuser
permissions, or adding a user to the Docker group (which then allows control of all system containers, also problematic).&lt;/p&gt;
&lt;p&gt;A wrinkle to using containers for scientific computing is that usually you don't want the container to continue running
after the job is done. For Apptainer this is fine, it was designed with this use-case in mind. For Podman/Docker, simply
include the &lt;code&gt;--rm&lt;/code&gt; option in your &lt;code&gt;docker run&lt;/code&gt; or &lt;code&gt;podman run&lt;/code&gt;. Another thing to remember is that you're going to have
to mount the directories containing your input data and your output location as volumes for Podman/Docker, or bind them
for Apptainer if you need to cross filesystem boundaries.&lt;/p&gt;
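&lt;p&gt;As a rough sketch of what that looks like in practice, the helper below assembles a one-off &lt;code&gt;run&lt;/code&gt; command with &lt;code&gt;--rm&lt;/code&gt;
and two volume mounts, then prints it instead of running it so you can inspect it first. The function name, the image
name &lt;code&gt;example/tool:1.0&lt;/code&gt;, and the &lt;code&gt;/input&lt;/code&gt; and &lt;code&gt;/output&lt;/code&gt; mount points are placeholders of mine, not something any particular
container expects:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch only: the image name and the directory names are hypothetical.
build_run_cmd() {
    local engine="$1" image="$2" in_dir="$3" out_dir="$4"
    shift 4
    # --rm removes the container as soon as the job exits.
    # Each -v maps a host directory to a path inside the container;
    # the input is mounted read-only (:ro) so a job cannot alter raw data.
    printf '%s run --rm -v %s:/input:ro -v %s:/output %s' \
        "$engine" "$in_dir" "$out_dir" "$image"
    # Append whatever command should run inside the container, if any.
    if [ "$#" -gt 0 ]; then
        printf ' %s' "$@"
    fi
    printf '\n'
}

# Print the command for review; paste it into the shell to actually run it.
build_run_cmd podman example/tool:1.0 "$PWD/raw_data" "$PWD/results" tool --help
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Passing &lt;code&gt;docker&lt;/code&gt; instead of &lt;code&gt;podman&lt;/code&gt; as the first argument produces the same flags. For Apptainer the closest equivalent
of &lt;code&gt;-v&lt;/code&gt; is the &lt;code&gt;--bind&lt;/code&gt; option, and there is no container left behind to remove.&lt;/p&gt;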
&lt;p&gt;Managing containers is not a big hassle either. You might think that making your own, and putting together a registry is
too much work. Firstly, most bioinformatics and data science software is already available in Docker containers. If not
from the developers, it's likely available through &lt;a href="https://biocontainers.pro/"&gt;Biocontainers&lt;/a&gt;. In terms of storing them,
you don't need a "real" container registry, you can simply save them like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;podman&lt;span class="w"&gt; &lt;/span&gt;save&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;container_name&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gzip&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;container_name.tar.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If a container is only available for Docker, you can usually convert it to Apptainer/Singularity format without any
fuss:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;apptainer&lt;span class="w"&gt; &lt;/span&gt;build&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;container_name&lt;span class="o"&gt;}&lt;/span&gt;.sif&lt;span class="w"&gt; &lt;/span&gt;docker-archive:&lt;span class="o"&gt;{&lt;/span&gt;container_name&lt;span class="o"&gt;}&lt;/span&gt;.tar.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Just stick them somewhere consistent in your directory structure. I use a folder on a NAS called "container_library."&lt;/p&gt;
&lt;h3&gt;1. Encourage Use of Virtual Environments For Python&lt;/h3&gt;
&lt;p&gt;There is a great extension for pyenv called &lt;a href="https://github.com/pyenv/pyenv-virtualenv"&gt;pyenv-virtualenv&lt;/a&gt;. When working
on a data science project, or developing a standalone program, one should start by creating a virtualenv for said project.
This is a great solution, in combination with pyenv, for python's terrible package management and dependency resolution.&lt;/p&gt;
&lt;p&gt;Yes, you could use Anaconda to do the same thing but Anaconda is bloated and its conda package manager is slow. Plus,
you're then stuck using it for everything. It's pretty easy to install your own local version of python with pyenv, and
then simply run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pyenv&lt;span class="w"&gt; &lt;/span&gt;virtualenv&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.10.2&lt;span class="w"&gt; &lt;/span&gt;my-virtual-env-3.10.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The best part of this from a sysadmin perspective is that, again, it's the users' own responsibility to manage all this.
You just give them the tools.&lt;/p&gt;
&lt;h3&gt;2. Users Run Their Own Jupyter Notebook Servers&lt;/h3&gt;
&lt;p&gt;There are options like &lt;a href="https://jupyter.org/hub"&gt;jupyterhub&lt;/a&gt;, or &lt;a href="https://tljh.jupyter.org/en/latest/"&gt;the littlest jupyterhub&lt;/a&gt;.
However, it's even easier to just have users install jupyter in their own python libraries, assign them a port to serve
it on, and have them start it like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;jupyter&lt;span class="w"&gt; &lt;/span&gt;lab&lt;span class="w"&gt; &lt;/span&gt;--no-browser&lt;span class="w"&gt; &lt;/span&gt;--port&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;assigned_port&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, they can forward that port to their local machine like so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ssh&lt;span class="w"&gt; &lt;/span&gt;-L&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;assigned_port&lt;span class="o"&gt;}&lt;/span&gt;:localhost:&lt;span class="o"&gt;{&lt;/span&gt;assigned_port&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;remote_user&lt;span class="o"&gt;}&lt;/span&gt;@&lt;span class="o"&gt;{&lt;/span&gt;remote_host_ip&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Easy peasy.&lt;/p&gt;
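&lt;p&gt;If handing out ports by hand sounds tedious, one option is to derive a port from each user's numeric UID. This is only
a sketch: the base port of 8800, the range of 200 ports, and the &lt;code&gt;user_port&lt;/code&gt; function are my own assumptions, and two
UIDs that differ by a multiple of 200 would collide, so check before relying on it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Sketch only: the base port and range are arbitrary choices.
user_port() {
    local uid="$1"
    local base=8800
    # Map the UID into the range 8800-8999.
    echo $(( base + uid % 200 ))
}

port="$(user_port "$(id -u)")"
echo "On the server:  jupyter lab --no-browser --port ${port}"
echo "On your laptop: ssh -L ${port}:localhost:${port} {remote_user}@{remote_host_ip}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because the port is computed the same way every time, each user gets the same one on every login without anyone
keeping a list.&lt;/p&gt;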
&lt;h3&gt;3. Install R With &lt;a href="https://github.com/r-lib/rig"&gt;rig&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One of the more exciting things to happen recently is the R installation manager, rig. This provides &lt;em&gt;some&lt;/em&gt; of the same
functionality that you get from pyenv in R. Though, you still can't leave installation of R up to users without having
them run R in a container (a decent option, but makes some things awkward unless you start doing &lt;em&gt;everything&lt;/em&gt; that way).
Rig lets you install R across the system, without depending on your system package manager. It also configures R to use
the Posit Package Manager by default. This is exciting because you no longer need to compile R packages. Yes, there were
ways around this before, but none were very plug and play, and some methods required installing packages for the whole
system. Running &lt;code&gt;install.packages('package_name')&lt;/code&gt; will default to a binary package, falling back to compilation
if required.&lt;/p&gt;
&lt;p&gt;This solution also defaults to user-specific package libraries. So, no need to manage that.&lt;/p&gt;
&lt;h3&gt;4. Have A Real Storage And Backup Strategy&lt;/h3&gt;
&lt;p&gt;It's not software, but this is a blog and there are no rules. I do what I want. To make sure you never lose critical
data, even if this is just a testing system, you need to have a backup strategy. I currently do the following on the
server I manage for that purpose:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The / (root) directory is fast NVMe, but not terribly large. Only system files go here.&lt;/li&gt;
&lt;li&gt;/home is on the same device as /, but users are encouraged to store data elsewhere.&lt;/li&gt;
&lt;li&gt;/ is redundant with a RAID1 configuration storing a mirror of all data to a second identical NVMe drive.&lt;/li&gt;
&lt;li&gt;/mnt/bulk_nvme is an array of six 4TB drives configured as a single volume. Each user gets a directory here symlinked to their
/home. This is where most work happens.&lt;/li&gt;
&lt;li&gt;/mnt/bulk_nvme is backed up daily to a NAS directly connected to the server via an &lt;code&gt;rsync&lt;/code&gt; cron job over 10 gigabit ethernet.&lt;/li&gt;
&lt;li&gt;Commonly used genome/transcriptome references, containers, and other files are available on /mnt/bulk_nvme as well as on the NAS.&lt;/li&gt;
&lt;li&gt;On the NAS (affectionately named Dagobah, because it's a data "swamp"), there are directories for "data_archive" and
"analysis_results" where raw data and analyzed data live, respectively.&lt;/li&gt;
&lt;li&gt;The "data_archive," "analysis_results," and "container_library" are backed up to AWS S3 Glacier Instant Retrieval tier.&lt;/li&gt;
&lt;li&gt;Computational notebooks (jupyter), scripts (BASH, R, Python), and workflows are versioned on github.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This works for now, and is likely to change in order to scale better. In particular, we are looking at automatically
archiving data from a NAS using data lifecycle policies, as well as more complex solutions like &lt;a href="https://www.weka.io/"&gt;WEKA&lt;/a&gt;.&lt;/p&gt;
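&lt;p&gt;For the daily mirror of /mnt/bulk_nvme to the NAS, a minimal version of the job could look something like the script
below. Treat it as a sketch under assumptions: the script path, the NAS mount point, and the 02:00 schedule are
hypothetical, and the &lt;code&gt;backup&lt;/code&gt; function is just a thin wrapper around &lt;code&gt;rsync&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#!/usr/bin/env bash
# Sketch of a nightly mirror job; paths and schedule are hypothetical.
set -euo pipefail

backup() {
    local src="$1" dest="$2"
    # -a preserves permissions, timestamps and symlinks.
    # --delete removes files from dest that no longer exist in src.
    # The trailing slash on src copies its contents, not the directory itself.
    rsync -a --delete "${src%/}/" "${dest%/}/"
}

# Run the backup only when a source and a destination are given.
if [ "$#" -eq 2 ]; then
    backup "$1" "$2"
fi

# Example crontab entry (02:00 every day), assuming this file is saved as
# /usr/local/bin/nightly_backup.sh and the NAS is mounted at /mnt/nas:
#   0 2 * * * /usr/local/bin/nightly_backup.sh /mnt/bulk_nvme /mnt/nas/bulk_nvme_backup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keep in mind that &lt;code&gt;--delete&lt;/code&gt; makes the destination an exact mirror, so an accidental deletion on the server is copied
over on the next run. That is one reason the archive directories also go to S3.&lt;/p&gt;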
&lt;h3&gt;5. Know Where You Draw The Line&lt;/h3&gt;
&lt;p&gt;Cloud computing is more accessible than ever before. I am personally attached to having &lt;em&gt;some&lt;/em&gt; local compute for testing,
lighter computation, and direct control. However, if you're not a large organization with existing HPC infrastructure and
don't plan on buying into that, then cloud computing is the way to go. The downside is that it becomes nearly impossible
to predict your cloud bill. However, once you grow past 2-5 people, and arguably by the time you get to 4 or 5,
you're going to need either a full-time &lt;em&gt;real&lt;/em&gt; sysadmin or more cloud resources.&lt;/p&gt;
&lt;p&gt;We supplement our on-prem compute with AWS. So, if there is a job that requires more RAM than available, uses more CPUs
than we have in one server, or needs powerful GPUs then to the cloud it goes. There are also specific bioinformatics
cloud platforms, and I have opinions on these, but that's for another post.&lt;/p&gt;
&lt;h3&gt;Wrap-up&lt;/h3&gt;
&lt;p&gt;I'm sure there are tens of people out there like me, who don't mind managing a system like this. Regardless, this may give
you some ideas on how to manage your compute environment. I find having a local server more convenient than doing
everything in "the cloud" (someone else's computer), but I have specific limits on this and, once they're reached, augmenting with
cloud resources is a smart way to reduce admin overhead.&lt;/p&gt;
&lt;p&gt;Now, let's see if I can manage more than one post every 4 years.&lt;/p&gt;</content><category term="how-to"></category><category term="sysadmin"></category></entry><entry><title>Publications, Dissertations, Job Hunts, and a Pandemic</title><link href="https://groverj3.codeberg.page/articles/2020-05-30_publications-dissertations-job-hunts-and-a-pandemic.html" rel="alternate"></link><published>2020-05-30T00:00:00-04:00</published><updated>2020-05-30T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2020-05-30:/articles/2020-05-30_publications-dissertations-job-hunts-and-a-pandemic.html</id><summary type="html">&lt;p&gt;I started this github site as a place to expand my professional reach by posting
my random musings on bioinformatics, Linux, data science, etc. I made a few
reasonably cogent posts, but then life got in the way! It's been a really busy
time, a very eventful year. I'm …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I started this github site as a place to expand my professional reach by posting
my random musings on bioinformatics, Linux, data science, etc. I made a few
reasonably cogent posts, but then life got in the way! It's been a really busy
time, a very eventful year. I'm now in a very different position than I was last
summer. It's exciting progress, but hanging over all of it is the COVID-19
pandemic. I'm going to write another post specifically dealing with job hunting
as a Ph.D. student, but here I thought I'd kick off my blog again with a general
update since it's been a while.&lt;/p&gt;
&lt;h3&gt;Wrapping up Grad School During the COVID-19 Pandemic&lt;/h3&gt;
&lt;p&gt;My lack of posts began mostly as a consequence of the need to grind harder than I
ever had before on my dissertation research. However, I was successful and
finished my third publication during my Ph.D. program. It's since been accepted
in PNAS, but I'm not sure when the final version will be available. Until then,
you can read the bioRxiv version:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://doi.org/10.1101/866806"&gt;Abundant expression of maternal siRNAs is a conserved feature of seed development&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As with many things in grad school, progress on this paper wasn't consistent. It
happened in a series of spurts, with a large portion of the work really only
happening once we got the final sequencing samples we had been waiting on last
summer.&lt;/p&gt;
&lt;p&gt;Along with that comes the need to write a dissertation. Since I had three papers,
my committee was happy for me to do what's called a "staple dissertation." The
name is a bit misleading, since I did have to write more than just putting together
my papers, and that also took a significant amount of time.&lt;/p&gt;
&lt;p&gt;While this was all going on, I attempted a short-lived bioinformatics/data
science student group. Though, in the end wrangling grad students is a bit like
herding cats. Still, I think it probably helped other grad students, which was
the point.&lt;/p&gt;
&lt;p&gt;Back in January I really started to look at job postings more seriously because I
had known from the very beginning that my goal was to go back to industry. Mostly
I was looking at bioinformatics scientist-type positions. I really threw myself
into this, and I think all-in-all I probably put out 150 applications.&lt;/p&gt;
&lt;p&gt;Then, as we all know, the pandemic hit in a big way for all of us. Labs were
shut down, jobs were lost, colleges went online. I was in the phase of my Ph.D.
where I needed to write my dissertation and apply for jobs, so I wasn't disrupted
as much as most. However, it was definitely disheartening to have my job
interviews dry up. Luckily, it seems that people with bioinformatics experience
are still in high demand, and I was able to land a good job regardless. Not
without becoming more and more concerned on a daily basis as the rejection
emails started to roll in, though. Overall, I think that the prospects for biology
and bioinformatics in particular are still strong. We might not be "essential"
workers, but it's hard to justify not employing people like us at a time when we
could actually discover something vitally important to help with the current
situation. Or, at the very least, the pandemic underscores that biology is
important, and needs to be taken seriously, funded, and considered more by
society in general. The pandemic will be temporary, but perhaps society will
continue to think about biologists more frequently.&lt;/p&gt;
&lt;p&gt;Just last week, I defended my Ph.D. It was a fantastic moment, even if it had to
happen over a Zoom meeting. Following this, and wrapping up everything, I'll be
moving to Boston to be closer to my new employer,
&lt;a href="https://www.sevenbridges.com/"&gt;Seven Bridges Genomics&lt;/a&gt;. There, I will be working
as a "Genomics Scientist" to aid bioinformatics workflow development and
coordination of data and metadata availability for their public programs. It's
a great way to move on from grad school.&lt;/p&gt;
&lt;p&gt;Moving forward, I'm going to try to keep up with posting things. Lots of exciting
things happening! Hopefully I can keep my skills sharp, and I think this blog
will be essential for that.&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="/images/defense.jpg" style="width:700px;"&gt;
&lt;/center&gt;&lt;/p&gt;</content><category term="commentary"></category><category term="grad school"></category><category term="jobs"></category></entry><entry><title>Just Write Your Own Python Parsers for .fastq Files</title><link href="https://groverj3.codeberg.page/articles/2019-08-22_just-write-your-own-python-parsers-for-fastq-files.html" rel="alternate"></link><published>2019-08-22T00:00:00-04:00</published><updated>2019-08-22T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-08-22:/articles/2019-08-22_just-write-your-own-python-parsers-for-fastq-files.html</id><summary type="html">&lt;p&gt;In contrast to the &lt;a href="https://en.wikipedia.org/wiki/Zen_of_Python"&gt;zen of python&lt;/a&gt;
there are actually many ways to handle sequence data in Python. There are several
packages on &lt;a href="https://pypi.org"&gt;PyPI&lt;/a&gt; that provide parsers for sequence formats
like .fastq and .fasta. I've never bothered with these, including the oft-used
&lt;a href="https://biopython.org"&gt;Biopython&lt;/a&gt;. I vaguely remembered Biopython being slower
than …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In contrast to the &lt;a href="https://en.wikipedia.org/wiki/Zen_of_Python"&gt;zen of python&lt;/a&gt;
there are actually many ways to handle sequence data in Python. There are several
packages on &lt;a href="https://pypi.org"&gt;PyPI&lt;/a&gt; that provide parsers for sequence formats
like .fastq and .fasta. I've never bothered with these, including the oft-used
&lt;a href="https://biopython.org"&gt;Biopython&lt;/a&gt;. I vaguely remembered Biopython being slower
than any parser I'd written myself early-on in learning bioinformatics, and it
not actually being simpler to implement. However, I'd never looked at this in
detail. Additionally, I'd recently run across a few posts on
&lt;a href="https://www.biostars.org/"&gt;biostars&lt;/a&gt; where users were deriding people for asking
"What is the most efficient way to parse a huge .fastq file?" or something
similar.&lt;/p&gt;
&lt;p&gt;First of all, don't discourage people who are trying to learn. Secondly, this is
a good question! As scientists, we should know that just because data exists
doesn't mean it's good. Likewise, just because software exists doesn't mean it's
the best tool for any given job. Plus, writing simple parsers for common formats
is a good way to practice file processing for when you eventually need to do
something hard and no ready-made parser exists in a package.&lt;/p&gt;
&lt;p&gt;Rather than vaguely saying "X package is slow, do this instead" I thought it'd be
best to actually benchmark some different .fastq parser options.&lt;/p&gt;
&lt;h3&gt;The Contenders&lt;/h3&gt;
&lt;p&gt;There are several packages that include parsers for biological sequence data.
These include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://biopython.org"&gt;Biopython&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://htseq.readthedocs.io"&gt;HTSeq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://scikit-bio.org/"&gt;scikit-bio&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I'm familiar with Biopython from the recommendations that abound in the community
for exactly this task, and HTSeq mostly for &lt;code&gt;HTSeq-count&lt;/code&gt;. Scikit-bio seems to be
newer and under active development, so results from testing it are subject to
change. I mention this just in case someone looks at this years after it's written and wonders
why I got the performance that I did.&lt;/p&gt;
&lt;p&gt;When it comes to dealing with .fastq files I checked through my library of Python
scripts and came across two patterns that I'll also test compared to these
packages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reading line-by-line, using a counter to yield records&lt;/li&gt;
&lt;li&gt;Reading line-by-line, using &lt;code&gt;zip_longest()&lt;/code&gt; from &lt;code&gt;itertools&lt;/code&gt; to yield records&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Setting up the Test&lt;/h3&gt;
&lt;p&gt;I did this in a &lt;a href="https://jupyter.org"&gt;jupyter&lt;/a&gt; notebook, since that's what I use
on a day-to-day basis. Most of my interactive "data science" work is done in R,
which is mostly a consequence of at one point needing to use some R packages that
have no Python equivalents, and just rolling with that. So, actually using Python
in jupyter is a bit of a departure from the norm for me.&lt;/p&gt;
&lt;p&gt;First, you need the necessary packages. I just use pip with
&lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;biopython&lt;span class="w"&gt; &lt;/span&gt;HTSeq&lt;span class="w"&gt; &lt;/span&gt;scikit-bio
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, started a new jupyter notebook with
&lt;a href="https://jupyterlab.readthedocs.io"&gt;jupyterlab&lt;/a&gt; (a sweet new UI for jupyter that
you should use!). Your first step is always to do your imports.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;Bio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SeqIO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;HTSeq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastqReader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zip_longest&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;skbio&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I'm only using one function from skbio, but it's just called &lt;code&gt;read()&lt;/code&gt; which is
too generic a name to just import that single function without causing all sorts
of annoyances and gnashing of teeth.&lt;/p&gt;
&lt;p&gt;Also, it's important with any parsing problem to understand the file format. The
.fastq format is ubiquitous in bioinformatics and looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;@SEQ_ID&lt;/span&gt;
&lt;span class="n"&gt;GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT&lt;/span&gt;
&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;((((&lt;/span&gt;&lt;span class="o"&gt;***+&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;%%%++&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="o"&gt;%%%%&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="mf"&gt;.1&lt;/span&gt;&lt;span class="o"&gt;***-+*&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="n"&gt;CCF&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;CCCCCCC65&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/FASTQ_format"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;You can understand it as a repeated series of four lines:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Sequence ID, starting with "@"&lt;/li&gt;
&lt;li&gt;Sequence (ATCG)&lt;/li&gt;
&lt;li&gt;Separator (+)&lt;/li&gt;
&lt;li&gt;Quality score for each base call (same length as sequence)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The catch here is that you can't use @ as a record separator. It's a valid
character in the score line, too. So, you really do need to group the lines in
batches of four, as it's possible @ will exist in position 1 of the score line.&lt;/p&gt;
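&lt;p&gt;To make the catch concrete, here's a tiny sketch with two made-up records (the read names, sequences, and scores are invented for illustration). The second record's quality line happens to start with "@", so counting header-looking lines overcounts, while taking every fourth line does not.&lt;/p&gt;

```python
# Two invented records; the second record's quality line starts with "@".
fastq_lines = [
    "@read_1", "ACGT", "+", "IIII",
    "@read_2", "TTGA", "+", "@III",
]

# Naive approach: treat every line starting with "@" as a new record header.
naive_headers = [line for line in fastq_lines if line.startswith("@")]

# Positional approach: every fourth line, starting at index 0, is a header.
true_headers = fastq_lines[0::4]

print(len(naive_headers))  # 3, one of which is really a quality line
print(len(true_headers))   # 2
```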
&lt;h3&gt;Define Some Functions to Test&lt;/h3&gt;
&lt;p&gt;In order to make the benchmarking easier to follow, I figured I'd define the
functions I want to benchmark in a consistent way:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using Biopython&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_biopython&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SeqIO&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;fastq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

&lt;span class="c1"&gt;# Using HTSeq&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_htseq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FastqReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

&lt;span class="c1"&gt;# HTSeq raw&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_htseq_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FastqReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_iterator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

&lt;span class="c1"&gt;# Skbio&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_skbio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;skbio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fastq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;

&lt;span class="c1"&gt;# Line by line with counter&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_lbl_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;input_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
                &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
                &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Line by line with zip_longest&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_zip_longest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_fastq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;input_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fastq_iterator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;zip_longest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fastq_iterator&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here I intended to use two different methods from HTSeq, one which just returns
bare tuples rather than objects with other kinds of validation based on the
definition of the format. However, neither HTSeq method worked. Instead, both raised a
&lt;code&gt;StopIteration&lt;/code&gt; error when they reached the end of the file. Trying to catch that
with a &lt;code&gt;try:&lt;/code&gt; &lt;code&gt;except:&lt;/code&gt; block didn't seem to work. They did parse until they reached
the end of the file, though. I think this is a bug, and I may raise it with the
HTSeq people. So HTSeq is, regrettably, not included in my benchmarking results.
Also, in both custom parsers, &lt;code&gt;str.rstrip()&lt;/code&gt; was marginally faster than
&lt;code&gt;str.strip()&lt;/code&gt; so I went with that instead.&lt;/p&gt;
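&lt;p&gt;If the &lt;code&gt;zip_longest(*[iterator] * 4)&lt;/code&gt; idiom looks like magic, this small sketch may help. The list holds four references to the &lt;em&gt;same&lt;/em&gt; iterator, so each output tuple pulls four consecutive lines from it. The helper name and the toy lines are made up for illustration.&lt;/p&gt;

```python
from itertools import zip_longest


def group_in_fours(lines):
    """Group consecutive items into 4-tuples, padding a short final group with None."""
    iterator = iter(lines)
    # Four references to one iterator: each tuple consumes four items in order.
    return list(zip_longest(*[iterator] * 4))


complete = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIII"]
truncated = complete[:6]  # simulate a file cut off mid-record

print(group_in_fours(complete))
print(group_in_fours(truncated)[-1])  # ('@r2', 'TTGA', None, None)
```

&lt;p&gt;The &lt;code&gt;None&lt;/code&gt; padding means a truncated file shows up as an obviously incomplete last record, whereas a counter-based parser that only yields on the fourth line would silently drop it.&lt;/p&gt;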
&lt;h3&gt;Run Some Benchmarks&lt;/h3&gt;
&lt;p&gt;I decided I would try each of these with 1 million lines from a whole-genome
bisulfite experiment. These are the R1 mates from 75bp paired end reads:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parse_biopython&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;JWG3_2_2_R1.head.fastq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="mf"&gt;2.86&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;56.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parse_skbio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;JWG3_2_2_R1.head.fastq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;13.7&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parse_lbl_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;JWG3_2_2_R1.head.fastq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;295&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;14.7&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parse_zip_longest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;JWG3_2_2_R1.head.fastq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;249&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;2.57&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;%timeit&lt;/code&gt; function there is some ipython "line magic." It simplifies timing
a single line of code. The &lt;code&gt;%%timeit&lt;/code&gt; is the "cell magic" version.&lt;/p&gt;
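&lt;p&gt;Outside of IPython, roughly the same measurement can be made with the standard library's &lt;code&gt;timeit&lt;/code&gt; module. This is only a sketch: the workload function is a stand-in for one of the parsers, and the summary statistics are an approximation of what &lt;code&gt;%timeit&lt;/code&gt; prints rather than a guaranteed match.&lt;/p&gt;

```python
import statistics
import timeit

NUMBER = 10  # loops per run, like -n 10
REPEAT = 10  # number of runs, like -r 10


def workload():
    """Stand-in for exhausting a parser; swap in a real call when benchmarking."""
    return sum(1 for _ in range(10_000))


# Each entry is the total time in seconds for NUMBER calls of workload().
totals = timeit.repeat(workload, repeat=REPEAT, number=NUMBER)
per_loop = [total / NUMBER for total in totals]

print(f"{statistics.mean(per_loop):.6f} s +/- {statistics.stdev(per_loop):.6f} s per loop")
```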
&lt;p&gt;It seems that skbio isn't ready for primetime just yet. The real question then
is, would biopython suffice for day-to-day work? Perhaps yes, ~1M lines in &amp;lt; 3s
(349650.35 lines per second) is a timescale that people might be willing to work
with. Keep in mind this is on my personal laptop, so it's hardly a compute
cluster. In contrast, the very simple line counter-based parser that I wrote as a
master's student back in 2013 as a python-learning exercise is nearly 10x faster!
There is a further speed improvement from using &lt;code&gt;zip_longest()&lt;/code&gt; from &lt;code&gt;itertools&lt;/code&gt;
(a trick I'm pretty sure I saw in a post from Brent Pedersen on stackoverflow).&lt;/p&gt;
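&lt;p&gt;For anyone who wants to check the arithmetic, the throughput and speedup figures follow directly from the mean timings above. This sketch treats the input as exactly 1,000,000 lines, which is an approximation, and the dictionary keys are just labels.&lt;/p&gt;

```python
LINES_PARSED = 1_000_000  # approximate size of the test file, in lines

# Mean seconds per loop from the %timeit output (skbio: 1 min 33 s = 93 s).
seconds = {"biopython": 2.86, "skbio": 93.0, "lbl": 0.295, "zip": 0.249}

lines_per_second = {name: LINES_PARSED / t for name, t in seconds.items()}
speedup_vs_biopython = {name: seconds["biopython"] / t for name, t in seconds.items()}

print(round(lines_per_second["biopython"], 2))  # 349650.35
print(round(speedup_vs_biopython["lbl"], 1))    # 9.7
print(round(speedup_vs_biopython["zip"], 1))    # 11.5
```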
&lt;h3&gt;Visualize&lt;/h3&gt;
&lt;p&gt;I'm usually a ggplot2 useR for visualizations, but I'm already in python here so
let's use this as an excuse to try the great python plotting library
&lt;a href="https://altair-viz.github.io/"&gt;altair&lt;/a&gt;. It's declarative, like ggplot2, and you
build your plot by "mapping" your "variables" (columns) to "encodings" (analogous
to "aesthetics" in ggplot2). I ran several other benchmarks and turned them into
a &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; data frame. First you'll need to do some
imports:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;altair&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;alt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then make the data frame:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a dataframe Pandas style&lt;/span&gt;

&lt;span class="n"&gt;timing_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Method&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;biopython&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;skbio&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;lbl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;zip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="s1"&gt;&amp;#39;Reads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="s1"&gt;&amp;#39;Time (s)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;670&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;4.4&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;40.49&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;418&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;2.86&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;14.2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;132&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;1.32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;13.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;181&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;442&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.92&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;40.5&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;295&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                         &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;70.2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;352&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;3.19&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;32.5&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;249&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Since each record is 4 lines, converting lines to # of reads requires dividing by
four. Likewise, the benchmarking results are in various time units, so I've
converted all of them to seconds. Not particularly efficiently, but for this simple
example it's fine.&lt;/p&gt;
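&lt;p&gt;If you wanted something tidier than inline division, a minimal sketch might look like the following. The helper names &lt;code&gt;to_seconds&lt;/code&gt; and &lt;code&gt;lines_to_reads&lt;/code&gt; are hypothetical (they aren't in the original notebook), and the units are inferred from the divisors used in the data frame above:&lt;/p&gt;

```python
# Hypothetical helpers, not from the original notebook: normalize
# benchmark timings to seconds and line counts to read counts.
SECONDS_PER_UNIT = {'us': 1e-6, 'ms': 1e-3, 's': 1.0, 'min': 60.0}


def to_seconds(value, unit):
    """Return `value` (given in `unit`) expressed in seconds."""
    return value * SECONDS_PER_UNIT[unit]


def lines_to_reads(n_lines):
    """A FASTQ record is 4 lines, so reads = lines / 4."""
    return n_lines / 4


# A few of the biopython timings from the data frame above
biopython_times = [
    to_seconds(670, 'us'),
    to_seconds(4.4, 'ms'),
    to_seconds(2.86, 's'),
]
```

&lt;p&gt;Keeping the unit next to each number makes it harder to mix up a millisecond and a microsecond result when typing them in by hand.&lt;/p&gt;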
&lt;p&gt;Now we can visualize with Altair. It has a very nice syntax inspired by ggplot2's
"grammar of graphics." It's based on
&lt;a href="https://vega.github.io/vega-lite/"&gt;vega-lite&lt;/a&gt; under the hood and allows you to
easily save your plot from jupyterlab. Here's the code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot as a scatterplot&lt;/span&gt;

&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timing_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mark_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Reads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Time (s)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Method&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plot on log scale&lt;/span&gt;

&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timing_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mark_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Reads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;log&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Time (s)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;log&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Method&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;center&gt;
&lt;img alt="Scatterplot" src="/figures/2019-08-22_just-write-your-own-python-parsers-for-fastq-files/benchmark.png"&gt;
&lt;img alt="Log scale scatterplot" src="/figures/2019-08-22_just-write-your-own-python-parsers-for-fastq-files/benchmark_log.png"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;Everything scales linearly, but at massively different rates. Scikit-bio is in
another universe in terms of time, such that you can't even visualize it with the
others in a meaningful way until you log scale everything. On the log scale, you
can essentially see that biopython is an order of magnitude faster than skbio,
and both simple parsers are an order of magnitude faster again. The difference
between the two simple parsers is pretty insignificant.&lt;/p&gt;
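&lt;p&gt;To put rough numbers on those gaps, here's a small sketch that computes fold differences for the largest input only. The timings are copied from the data frame above; the function name &lt;code&gt;fold_slower&lt;/code&gt; is just something made up for this illustration:&lt;/p&gt;

```python
# Timings in seconds for the largest input (1,000,000 lines), copied
# from the data frame above.
largest_run = {
    'biopython': 2.86,
    'skbio': 60 + 33,
    'lbl': 295 / 1000,
    'zip': 249 / 1000,
}


def fold_slower(slow, fast, timings=largest_run):
    """How many times longer `slow` took than `fast`."""
    return timings[slow] / timings[fast]


pairs = [('skbio', 'biopython'), ('biopython', 'lbl'), ('lbl', 'zip')]
for slow, fast in pairs:
    print(f'{slow} vs {fast}: {fold_slower(slow, fast):.1f}x')
```

&lt;p&gt;This is a single data point per method, so treat the ratios as ballpark figures rather than a proper benchmark summary.&lt;/p&gt;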
&lt;p&gt;Note: Altair is great! Not quite as full-featured as ggplot2 in R, but it's
definitely promising and something to watch for in the future. They definitely
should make it work with jupyterlab's dark theme though. Due to the transparent
plot backgrounds it requires a light theme.&lt;/p&gt;
&lt;h3&gt;To Wrap Things Up&lt;/h3&gt;
&lt;p&gt;I'm not saying you should never use biopython; I suspect its parser does some
extra validation that my simple parsers don't. It also returns objects with some
possibly useful methods. However, if you just want to read files quickly then the
simple line-by-line parsers aren't actually very complicated to write. Plus, you
don't even need to import anything unless you want a minor speed boost from
&lt;code&gt;itertools&lt;/code&gt;. Additionally, if you didn't need to strip newlines you'd get a boost
from not calling a &lt;code&gt;str.strip()&lt;/code&gt; method on each line.&lt;/p&gt;
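&lt;p&gt;For reference, here is a minimal sketch of what a zip-style parser can look like. It is an illustration of the general idea, not necessarily the exact code that was benchmarked above, and the name &lt;code&gt;read_fastq&lt;/code&gt; and the toy input are assumptions:&lt;/p&gt;

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from an open FASTQ handle.

    Illustrative sketch only: no validation, and it assumes
    well-formed 4-line records.
    """
    # zip() over four references to the same iterator pulls four
    # consecutive lines per record.
    lines = iter(handle)
    for name, seq, _plus, qual in zip(lines, lines, lines, lines):
        yield name.strip(), seq.strip(), qual.strip()


# Toy in-memory input standing in for a real .fastq file
example = io.StringIO('@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n')
records = list(read_fastq(example))
```

&lt;p&gt;Note that a truncated final record is silently dropped here, which is exactly the kind of validation a library parser might do for you.&lt;/p&gt;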
&lt;p&gt;If you're ok with living dangerously, and are sure your files are formatted
correctly you can easily write something that will outperform standard
implementations with little effort when it comes to .fastq parsing.&lt;/p&gt;</content><category term="commentary"></category><category term="bioinformatics"></category><category term="python"></category><category term="workflows"></category></entry><entry><title>The Snakemake Tutorial I Wish I Had</title><link href="https://groverj3.codeberg.page/articles/2019-08-19_the-snakemake-tutorial-i-wish-i-had.html" rel="alternate"></link><published>2019-08-19T00:00:00-04:00</published><updated>2019-08-19T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-08-19:/articles/2019-08-19_the-snakemake-tutorial-i-wish-i-had.html</id><summary type="html">&lt;p&gt;Over the past few years the use of workflow managers in genomics and
bioinformatics has grown greatly. This is a great thing for the field and adds to
our ability to perform reproducible analyses, especially for pipelines with many
steps. These are common in bioinformatics, but prior to the use …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Over the past few years the use of workflow managers in genomics and
bioinformatics has grown greatly. This is a great thing for the field and adds to
our ability to perform reproducible analyses, especially for pipelines with many
steps. These are common in bioinformatics, but prior to the use of workflow
managers they were mostly handled with BASH scripts. While a good BASH script is
perfectly acceptable much of the time they aren't very portable and don't handle
multithreading and concurrent processes without annoying hacks. For a one-off
analysis that's all fine, but what about a pipeline you need to run many times?
This is where a workflow manager really shines, especially when combined with
containers.&lt;/p&gt;
&lt;p&gt;I decided to implement two pipelines that we use often here in the Mosher Lab as
Snakemake workflows. We work with a lot of small RNA sequencing and recently some
whole-genome bisulfite sequencing data. There are already available pipelines for
WGBS, but they seem like overkill and writing my own is a good way to learn the
ins and outs of Snakemake.&lt;/p&gt;
&lt;p&gt;I chose Snakemake over other workflow managers due to its already frequent use in
bioinformatics workflows and my familiarity with Python. However, Nextflow
seems like a solid option as well. I found much of the official documentation
very lacking in useful examples though. I ended up consulting numerous workflows
available on GitHub. The issue there is also that a lot of them are trying to do
too much IMHO. So, I figured I'd write the tutorial I wish I had been able to
find.&lt;/p&gt;
&lt;h3&gt;Step 0 - Install Snakemake and Your Workflow's Software Dependencies&lt;/h3&gt;
&lt;p&gt;I'm assuming you're on some kind of Linux system. Though, these directions may
also work on macOS. Your first step should be to install Snakemake:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--user&lt;span class="w"&gt; &lt;/span&gt;Snakemake
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I always recommend &lt;em&gt;not&lt;/em&gt; using your system python. If you're on a non-rolling
release distribution, or on macOS it's probably super outdated. I use
&lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv&lt;/a&gt; to manage my Python installations, though
there are other options. I also &lt;strong&gt;never&lt;/strong&gt; use &lt;code&gt;sudo&lt;/code&gt; to install python packages.&lt;/p&gt;
&lt;p&gt;The WGBS workflow consists of several steps which can be represented by this DAG
(Snakemake can make these! Neat-o!)&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img alt="Snakemake WGBS Workflow" src="https://codeberg.org/groverj3/wgbs_snakemake/raw/branch/master/dag.png"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;You can boil it down to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Index the reference genome with &lt;a href="https://github.com/brentp/bwa-meth"&gt;bwa-meth&lt;/a&gt; (needed for bwa-meth alignment)&lt;/li&gt;
&lt;li&gt;Index the reference genome with &lt;a href="http://www.htslib.org/doc/faidx.html"&gt;samtools faidx&lt;/a&gt; (needed for MethylDackel)&lt;/li&gt;
&lt;li&gt;Quality reporting with &lt;a href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"&gt;FastQC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Trimming adapters with &lt;a href="https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/"&gt;Trim Galore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Quality reporting on trimmed reads (FastQC again)&lt;/li&gt;
&lt;li&gt;Alignment with bwa-meth&lt;/li&gt;
&lt;li&gt;Sorting with &lt;a href="http://www.htslib.org/doc/samtools.html"&gt;samtools sort&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Marking PCR duplicates with &lt;a href="https://broadinstitute.github.io/picard/"&gt;Picard Tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Detecting bias and extracting per-cytosine percent methylation with &lt;a href="https://github.com/dpryan79/MethylDackel"&gt;MethylDackel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Determining fold-coverage with &lt;a href="https://github.com/brentp/mosdepth"&gt;mosdepth&lt;/a&gt; and a little &lt;a href="https://codeberg.org/groverj3/wgbs_snakemake/src/branch/master/scripts/mosdepth_to_x_coverage.py"&gt;python script&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Not a super complicated workflow, but enough to demonstrate a real-world use of
Snakemake, and complicated enough that you don't want to run each step
separately either.&lt;/p&gt;
&lt;p&gt;I don't expect anyone to replicate this exact workflow, but it's a useful
example.&lt;/p&gt;
&lt;h3&gt;Step 1 - Learn Some Snakemake Basics&lt;/h3&gt;
&lt;p&gt;There are some basics to explain before I start throwing code around. Firstly,
Snakemake does &lt;strong&gt;not&lt;/strong&gt; work the way you might think; it actually works
&lt;strong&gt;backwards&lt;/strong&gt; from a set of &lt;strong&gt;target&lt;/strong&gt; files through a set of &lt;strong&gt;rules&lt;/strong&gt;. You may
think this sounds unnecessarily confusing, but there is a good reason for this.
When Snakemake begins a workflow, this ensures that (as long as you don't do
anything too weird) it will not fail for trivial reasons like files not being
generated as inputs to other rules. It creates a &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;directed acyclic graph&lt;/a&gt;
representing the workflow for &lt;em&gt;each sample&lt;/em&gt; that it can match through wildcards
to your &lt;strong&gt;targets&lt;/strong&gt;. It will use the &lt;strong&gt;rules&lt;/strong&gt; you define in its main script
(the Snakefile) to create a path from &lt;strong&gt;targets&lt;/strong&gt; to &lt;strong&gt;inputs (samples)&lt;/strong&gt;. This
is backwards from the way we think, and there are workflow managers that do
&lt;em&gt;push&lt;/em&gt; rather than &lt;em&gt;pull&lt;/em&gt;. Each approach has its advantages and disadvantages.&lt;/p&gt;
&lt;p&gt;Another key concept is that your &lt;strong&gt;rules&lt;/strong&gt; live in a &lt;code&gt;Snakefile&lt;/code&gt;, just a Python
script with extra syntax. So, you can use Python code in it! Keep this in mind,
and you can do some neat things (creating sample tables for differential
expression, etc.).&lt;/p&gt;
&lt;p&gt;Typically, the first &lt;strong&gt;rule&lt;/strong&gt; in a Snakefile is called "all" and this rule will
indicate the &lt;strong&gt;targets&lt;/strong&gt; that you want to generate. This tells Snakemake to
start and use the rules to try to make them based on wildcard matching with the
inputs.&lt;/p&gt;
&lt;p&gt;Additionally, it's good practice to include the options you'd like a user to be
able to configure in a .json or .yaml file. You can think of these files like
a Python dictionary in static file form (&lt;a href="https://en.wikipedia.org/wiki/Serialization"&gt;serialization&lt;/a&gt;).&lt;/p&gt;
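&lt;p&gt;As a quick illustration of that "dictionary in a file" idea, here's a sketch using the standard library's &lt;code&gt;json&lt;/code&gt; module. The config contents (including the &lt;code&gt;genome.fa&lt;/code&gt; file name) are made up for the example, and Snakemake does the loading for you in a real workflow:&lt;/p&gt;

```python
import json

# Hypothetical config contents; a real workflow would keep this in a
# file such as config.json rather than an inline string.
config_text = '{"samples": ["Sample1", "Sample2"], "reference_genome": "genome.fa"}'

# Parsing gives back an ordinary Python dictionary
config = json.loads(config_text)
SAMPLES = config['samples']
```

&lt;p&gt;Once parsed, the options are plain dictionary lookups, which is all a workflow needs to read user settings.&lt;/p&gt;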
&lt;h3&gt;Step 2 - Create a Rule&lt;/h3&gt;
&lt;p&gt;Create a blank file called &lt;code&gt;Snakefile&lt;/code&gt; in the directory you're using for
development and fill it with this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Run fastqc on the raw .fastq.gz files&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;fastqc_raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;input_data/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;.fastq.gz&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.zip&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;fastqc -o &lt;/span&gt;&lt;span class="si"&gt;{params.out_dir}&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="si"&gt;{input}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is the first rule in our workflow, running FastQC on the input files. In the
&lt;code&gt;{}&lt;/code&gt; are &lt;strong&gt;wildcards&lt;/strong&gt;. While they have names, they are only there for
readability right now. The only &lt;code&gt;param&lt;/code&gt; we're currently passing to it is the
output directory, but this is where your options would be. Snakemake will check
whether the &lt;code&gt;input&lt;/code&gt; for a rule can be made before allowing the workflow to start.
Therefore, if your workflow starts, it &lt;em&gt;should&lt;/em&gt; finish. However, we need a rule
"all" which will tell it to be run.&lt;/p&gt;
&lt;p&gt;Add the following to the Snakefile &lt;strong&gt;before&lt;/strong&gt; the fastqc rule:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.&lt;/span&gt;&lt;span class="si"&gt;{ext}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SAMPLES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;zip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This tells the workflow which &lt;strong&gt;targets&lt;/strong&gt; to create. In this command, &lt;code&gt;expand&lt;/code&gt;
instructs Snakemake to fill in all things which match the wildcards. So, this
indicates that &lt;strong&gt;all&lt;/strong&gt; files which match this pattern when filled in are the
&lt;strong&gt;targets&lt;/strong&gt;. You can use &lt;code&gt;expand&lt;/code&gt; in other rules too, when you want multiple
files as input that may be created asynchronously in previous rules. We're still
missing a very important thing though! The input files! For that, create another
file, &lt;code&gt;config.yaml&lt;/code&gt;, in the same directory and add this to it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Sample1&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Sample2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;etc...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Your &lt;code&gt;samples&lt;/code&gt; are yaml entries without names (you could name them if you want),
and should &lt;strong&gt;not&lt;/strong&gt; include read pair numbers (so 1 ID for each pair). The sample
IDs should match what is in &lt;code&gt;rule all&lt;/code&gt; in place of &lt;code&gt;{sample}&lt;/code&gt;. The &lt;code&gt;config.yaml&lt;/code&gt;
is also where the options for your workflow steps can live, under their own
headings. Now, go back to your Snakefile and add this &lt;em&gt;above&lt;/em&gt; your rules:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get overall workflow parameters from config.yaml&lt;/span&gt;
&lt;span class="n"&gt;configfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;config.yaml&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;SAMPLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;samples&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This parses the config file into a Python dictionary. Do you see how &lt;code&gt;SAMPLES&lt;/code&gt; is
filled in for &lt;code&gt;{sample}&lt;/code&gt; in &lt;code&gt;rule all&lt;/code&gt;, which then creates a &lt;strong&gt;target&lt;/strong&gt; that can
be generated by &lt;code&gt;rule fastqc_raw&lt;/code&gt;? It's kind of a mind-bender at first, but it
all fits together.&lt;/p&gt;
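&lt;p&gt;If it helps to see the filling-in spelled out, here's a plain-Python imitation of the idea. The function &lt;code&gt;expand_sketch&lt;/code&gt; is hypothetical and is not Snakemake's actual implementation; it just shows how one pattern plus lists of values turns into a list of target file names:&lt;/p&gt;

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values.

    A plain-Python imitation of the idea behind expand(), not
    Snakemake's actual implementation.
    """
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[name] for name in names))
    ]


SAMPLES = ['Sample1']
targets = expand_sketch('1_fastqc_raw/{sample}_R{mate}_fastqc.{ext}',
                        sample=SAMPLES, mate=[1, 2], ext=['html', 'zip'])
```

&lt;p&gt;One sample, two mates, and two extensions give four target files, and each of those can be matched against the output pattern of &lt;code&gt;rule fastqc_raw&lt;/code&gt;.&lt;/p&gt;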
&lt;p&gt;Save all these files and include some test data in a subdirectory called
"input_data".&lt;/p&gt;
&lt;h3&gt;Step 2 - Running Your First Rules&lt;/h3&gt;
&lt;p&gt;I copied one sample to my &lt;code&gt;input_data&lt;/code&gt; directory and added it to the .yaml file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;JWG3_2_2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now run the workflow from the directory you're developing in with
&lt;code&gt;snakemake --cores #&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;groverj3@x1-carbon&lt;span class="w"&gt; &lt;/span&gt;snakemake_test&lt;span class="o"&gt;]&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;snakemake&lt;span class="w"&gt; &lt;/span&gt;--cores&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
Building&lt;span class="w"&gt; &lt;/span&gt;DAG&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;jobs...
Using&lt;span class="w"&gt; &lt;/span&gt;shell:&lt;span class="w"&gt; &lt;/span&gt;/usr/bin/bash
Provided&lt;span class="w"&gt; &lt;/span&gt;cores:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
Rules&lt;span class="w"&gt; &lt;/span&gt;claiming&lt;span class="w"&gt; &lt;/span&gt;more&lt;span class="w"&gt; &lt;/span&gt;threads&lt;span class="w"&gt; &lt;/span&gt;will&lt;span class="w"&gt; &lt;/span&gt;be&lt;span class="w"&gt; &lt;/span&gt;scaled&lt;span class="w"&gt; &lt;/span&gt;down.
Job&lt;span class="w"&gt; &lt;/span&gt;counts:
&lt;span class="w"&gt;        &lt;/span&gt;count&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nb"&gt;jobs&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;all
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;fastqc_raw
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="o"&gt;[&lt;/span&gt;Mon&lt;span class="w"&gt; &lt;/span&gt;Aug&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;:30:46&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
rule&lt;span class="w"&gt; &lt;/span&gt;fastqc_raw:
&lt;span class="w"&gt;    &lt;/span&gt;input:&lt;span class="w"&gt; &lt;/span&gt;input_data/JWG3_2_2_R2.fastq.gz
&lt;span class="w"&gt;    &lt;/span&gt;output:&lt;span class="w"&gt; &lt;/span&gt;1_fastqc_raw/JWG3_2_2_R2_fastqc.html,&lt;span class="w"&gt; &lt;/span&gt;1_fastqc_raw/JWG3_2_2_R2_fastqc.zip
&lt;span class="w"&gt;    &lt;/span&gt;jobid:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;wildcards:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JWG3_2_2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;mate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;


&lt;span class="o"&gt;[&lt;/span&gt;Mon&lt;span class="w"&gt; &lt;/span&gt;Aug&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;:30:46&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
rule&lt;span class="w"&gt; &lt;/span&gt;fastqc_raw:
&lt;span class="w"&gt;    &lt;/span&gt;input:&lt;span class="w"&gt; &lt;/span&gt;input_data/JWG3_2_2_R1.fastq.gz
&lt;span class="w"&gt;    &lt;/span&gt;output:&lt;span class="w"&gt; &lt;/span&gt;1_fastqc_raw/JWG3_2_2_R1_fastqc.html,&lt;span class="w"&gt; &lt;/span&gt;1_fastqc_raw/JWG3_2_2_R1_fastqc.zip
&lt;span class="w"&gt;    &lt;/span&gt;jobid:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;wildcards:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JWG3_2_2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;mate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

Started&lt;span class="w"&gt; &lt;/span&gt;analysis&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;JWG3_2_2_R1.fastq.gzStarted&lt;span class="w"&gt; &lt;/span&gt;analysis&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;JWG3_2_2_R2.fastq.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I ran it with &lt;code&gt;--cores 2&lt;/code&gt;, but I did not include a &lt;code&gt;threads&lt;/code&gt; parameter in the
rule in addition to &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;params&lt;/code&gt;, and &lt;code&gt;shell&lt;/code&gt;. So, it thinks
&lt;code&gt;rule fastqc_raw&lt;/code&gt; requires only one processor. This means it will parallelize
samples through that rule up to the maximum you give it at run-time! This is
handy. Do you see now how this is better than a bash script? It intelligently
schedules processes that can be run in parallel, but if you specify a number of
&lt;code&gt;threads&lt;/code&gt; for the rule it will wait until those cores are available.&lt;/p&gt;
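&lt;p&gt;The arithmetic behind that is simple enough to sketch. This is a simplified model for illustration (the real scheduler also weighs other resources and job priorities), and &lt;code&gt;max_concurrent_jobs&lt;/code&gt; is a made-up name:&lt;/p&gt;

```python
def max_concurrent_jobs(cores, threads_per_job=1):
    """Rough estimate of how many jobs of one rule can run at once.

    Simplified model (an assumption for illustration): a rule asking
    for more threads than the available cores is scaled down to the
    core count, and jobs are packed until the cores run out.
    """
    effective_threads = min(threads_per_job, cores)
    return cores // effective_threads


# No threads set on the rule, so both FastQC jobs fit on 2 cores
two_fastqc_jobs = max_concurrent_jobs(cores=2)
# A rule asking for 8 threads gets scaled down and runs alone
one_big_job = max_concurrent_jobs(cores=2, threads_per_job=8)
```

&lt;p&gt;That matches the log above, where both &lt;code&gt;fastqc_raw&lt;/code&gt; jobs started at the same timestamp.&lt;/p&gt;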
&lt;p&gt;Let's add a few more rules.&lt;/p&gt;
&lt;h3&gt;Step 3 - Add Rules&lt;/h3&gt;
&lt;p&gt;I'm going to add a bunch of rules. Don't freak out. Our &lt;code&gt;Snakefile&lt;/code&gt; now looks
like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get overall workflow parameters from config.yaml&lt;/span&gt;
&lt;span class="n"&gt;configfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;config.yaml&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;SAMPLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;samples&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reference_genome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3_bwameth_aligned/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;.bam&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SAMPLES&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_val_&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.&lt;/span&gt;&lt;span class="si"&gt;{ext}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SAMPLES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;zip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="c1"&gt;# Index the reference genome&lt;/span&gt;
&lt;span class="c1"&gt;# ancient() will assume the reference is older than output files if they exist&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;bwameth_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ancient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.bwameth.c2t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.bwameth.c2t.amb&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.bwameth.c2t.ann&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.bwameth.c2t.bwt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.bwameth.c2t.pac&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.bwameth.c2t.sa&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bwameth_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;paths&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bwameth_path&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{params.bwameth_path}&lt;/span&gt;&lt;span class="s1"&gt; index &lt;/span&gt;&lt;span class="si"&gt;{input}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;


&lt;span class="c1"&gt;# Run fastqc on the raw .fastq.gz files&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;fastqc_raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;input_data/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;.fastq.gz&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.zip&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fastqc_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;paths&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fastqc_path&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{params.fastqc_path}&lt;/span&gt;&lt;span class="s1"&gt; -o &lt;/span&gt;&lt;span class="si"&gt;{params.out_dir}&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="si"&gt;{input}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;


&lt;span class="c1"&gt;# Trim the read pairs&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;trim_galore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R1_fastqc.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R1_fastqc.zip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R2_fastqc.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;1_fastqc_raw/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R2_fastqc.zip&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;R1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;input_data/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R1.fastq.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;R2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;input_data/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R2.fastq.gz&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R1_val_1.fq.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R1.fastq.gz_trimming_report.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R2_val_2.fq.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R2.fastq.gz_trimming_report.txt&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;adapter_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;trim_galore&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;adapter_seq&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2_trim_galore&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trim_galore_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;paths&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;trim_galore_path&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;        {params.trim_galore_path} \&lt;/span&gt;
&lt;span class="sd"&gt;        --a {params.adapter_seq} \&lt;/span&gt;
&lt;span class="sd"&gt;        --gzip \&lt;/span&gt;
&lt;span class="sd"&gt;        --trim-n \&lt;/span&gt;
&lt;span class="sd"&gt;        --quality 20 \&lt;/span&gt;
&lt;span class="sd"&gt;        --output_dir {params.out_dir} \&lt;/span&gt;
&lt;span class="sd"&gt;        --paired \&lt;/span&gt;
&lt;span class="sd"&gt;        {input.R1} {input.R2} \&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;


&lt;span class="c1"&gt;# Run fastqc on the trimmed reads&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;fastqc_trimmed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;.fastq.gz_trimming_report.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fq_gz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_val_&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;.fq.gz&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_val_&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_val_&lt;/span&gt;&lt;span class="si"&gt;{mate}&lt;/span&gt;&lt;span class="s1"&gt;_fastqc.zip&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fastqc_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;paths&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fastqc_path&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{params.fastqc_path}&lt;/span&gt;&lt;span class="s1"&gt; -o &lt;/span&gt;&lt;span class="si"&gt;{params.out_dir}&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="si"&gt;{input.fq_gz}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;


&lt;span class="c1"&gt;# Align to the reference&lt;/span&gt;
&lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;bwameth_align&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bwameth_index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;R1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R1_val_1.fq.gz&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;R2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;2_trim_galore/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;_R2_val_2.fq.gz&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;3_bwameth_aligned/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;.bam&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;3_bwameth_aligned/&lt;/span&gt;&lt;span class="si"&gt;{sample}&lt;/span&gt;&lt;span class="s1"&gt;.bwameth.log&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bwameth&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;threads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bwameth_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;paths&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bwameth_path&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;genome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;REFERENCE_GENOME&lt;/span&gt;
    &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;        {params.bwameth_path} \&lt;/span&gt;
&lt;span class="sd"&gt;        -t {threads} \&lt;/span&gt;
&lt;span class="sd"&gt;        --reference {params.genome} \&lt;/span&gt;
&lt;span class="sd"&gt;        {input.R1} {input.R2} \&lt;/span&gt;
&lt;span class="sd"&gt;        2&amp;gt; {log} \&lt;/span&gt;
&lt;span class="sd"&gt;        | samtools view -b - \&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;gt; {output}&lt;/span&gt;
&lt;span class="sd"&gt;        &amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And the &lt;code&gt;config.yaml&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;JWG3_2_2&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# Samples should be reported by ID rather than filenames, and exclude the&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# trailing &amp;quot;R1&amp;quot; and &amp;quot;R2&amp;quot;, one sample ID per pair. If samples are supplied as&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# separate .fastq.gz files within each pair concatenate them to a single R1 and&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# R2 file prior to running.&lt;/span&gt;

&lt;span class="nt"&gt;reference_genome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;reference_genome/ref_genome.fasta&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;# Path to reference genome here with .fasta extension&lt;/span&gt;


&lt;span class="c1"&gt;# Options for individual workflow steps&lt;/span&gt;
&lt;span class="c1"&gt;# Configure threads for each step as desired, this is a sane starting point&lt;/span&gt;

&lt;span class="nt"&gt;trim_galore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;adapter_seq &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;quality &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;20&lt;/span&gt;

&lt;span class="nt"&gt;bwameth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;threads &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;10&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;


&lt;span class="c1"&gt;# Paths to individual tools&lt;/span&gt;
&lt;span class="c1"&gt;# You probably don&amp;#39;t need to change this unless programs are not in your $PATH&lt;/span&gt;

&lt;span class="nt"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;fastqc_path &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;fastqc&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;trim_galore_path &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;trim_galore&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;bwameth_path &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;bwameth.py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's a lot to take in, so a few words of explanation are in order. I have moved
the paths for the individual programs to a section in the config file. This is to
help with potential portability problems. On some servers you may have tools
installed in directories outside your &lt;code&gt;$PATH&lt;/code&gt;. The paths are pre-filled with just the
tool name, so they work fine when the programs are executable from a command prompt,
but this allows configuration (it's also at the bottom of the file because most
users shouldn't have to change it). I also now have a section containing options
for each tool that's run. You can pull these out of the dictionary made from the
&lt;code&gt;config.yaml&lt;/code&gt; in each step's &lt;code&gt;params&lt;/code&gt;, then refer to them in the &lt;code&gt;shell&lt;/code&gt; command
using . (dot) notation. I've allocated 10 threads to the alignment step. This means it
will wait to run until 10 of the cores you gave Snakemake are free, rather than
starting while other jobs are still using them.&lt;/p&gt;
&lt;p&gt;There are also multiple &lt;strong&gt;targets&lt;/strong&gt; now. This is because the output from 
&lt;code&gt;rule fastqc_trimmed&lt;/code&gt; is not used as input to another rule. Unless you explicitly
tell Snakemake to run that rule to generate its &lt;strong&gt;target&lt;/strong&gt; it will not run and
you will get very annoyed.&lt;/p&gt;
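&lt;p&gt;If it's not obvious which files those &lt;code&gt;expand()&lt;/code&gt; calls in &lt;code&gt;rule all&lt;/code&gt; are asking for, here is a rough sketch in plain Python. The &lt;code&gt;expand_targets&lt;/code&gt; function is a hypothetical stand-in I wrote for illustration, not Snakemake's own code. It assumes the default behavior of filling the pattern with every combination of the wildcard values.&lt;/p&gt;

```python
from itertools import product


def expand_targets(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values.

    Hypothetical stand-in for illustration, not Snakemake's implementation.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Sample ID as it appears in the run log earlier in the post
SAMPLES = ['JWG3_2_2']

targets = expand_targets(
    '2_trim_galore/{sample}_R{mate}_val_{mate}_fastqc.{ext}',
    sample=SAMPLES, mate=[1, 2], ext=['html', 'zip'],
)

# One path per sample, mate, and extension combination
for path in targets:
    print(path)
```

&lt;p&gt;With one sample this gives four FastQC files. Those are the extra targets Snakemake has to be told about, since no other rule asks for them.&lt;/p&gt;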
&lt;p&gt;It's a lot to take in, but this is essentially a usable workflow.&lt;/p&gt;
&lt;h3&gt;Step 4 - Wrapping Up + A Few Tips&lt;/h3&gt;
&lt;p&gt;You can now add &lt;strong&gt;rules&lt;/strong&gt; to your heart's content. Keep in mind though, you need
to change your &lt;strong&gt;targets&lt;/strong&gt;! Otherwise, it won't run your new rules :(.&lt;/p&gt;
&lt;p&gt;Also, you're probably wondering what happens when you don't actually have enough
CPUs to run that rule with 10 threads. Just change the &lt;code&gt;--cores&lt;/code&gt; argument at
run-time to a lower number. It will reduce that rule's &lt;code&gt;threads&lt;/code&gt; to the number
specified.&lt;/p&gt;
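&lt;p&gt;A simplified way to think about how &lt;code&gt;--cores&lt;/code&gt; and &lt;code&gt;threads&lt;/code&gt; interact is sketched below. These two helper functions are my own hypothetical illustration, not part of Snakemake. They only model one rule at a time, and they ignore memory and any other resources the scheduler may consider.&lt;/p&gt;

```python
def effective_threads(rule_threads, cores):
    """Threads a job actually gets: never more than the cores given at run-time."""
    return min(rule_threads, cores)


def max_concurrent_jobs(rule_threads, cores):
    """How many jobs of a single rule fit into the core budget at once."""
    return cores // effective_threads(rule_threads, cores)


# A rule that asks for 10 threads, run with different --cores values
for cores in (2, 10, 20):
    print(cores, effective_threads(10, cores), max_concurrent_jobs(10, cores))
```

&lt;p&gt;So the 10-thread alignment rule still runs with &lt;code&gt;--cores 2&lt;/code&gt;, just with 2 threads, while a single-threaded rule like &lt;code&gt;fastqc_raw&lt;/code&gt; can run two samples at once on the same budget.&lt;/p&gt;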
&lt;p&gt;Another thing to consider is that Snakemake has the ability to work with HPC job
submission frameworks like SLURM and PBS. Though, it's not really that difficult
to include &lt;code&gt;snakemake --cores #&lt;/code&gt; in a normal .pbs script. It also plays nice with
containers (Docker and Singularity). So, if you package up your software in one
you have automation and deployability all-in-one!&lt;/p&gt;
&lt;p&gt;If you're curious what the full workflow looks like then &lt;a href="https://codeberg.org/groverj3/wgbs_snakemake"&gt;check it out!&lt;/a&gt;&lt;/p&gt;</content><category term="how-to"></category><category term="bioinformatics"></category><category term="python"></category><category term="workflows"></category><category term="snakemake"></category></entry><entry><title>Suggestions for Reproducible Bioinformatic Analyses</title><link href="https://groverj3.codeberg.page/articles/2019-08-09_suggestions-for-reproducible-bioinformatic-analyses.html" rel="alternate"></link><published>2019-08-09T00:00:00-04:00</published><updated>2019-08-09T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-08-09:/articles/2019-08-09_suggestions-for-reproducible-bioinformatic-analyses.html</id><summary type="html">&lt;p&gt;Bioinformatic analyses often require lengthy workflows or pipelines, where the
output of program A feeds into program B, and so on. These programs may also not
output their results in a format which is convenient to use in the subsequent
steps, requiring writing a conversion script, or piping its output …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Bioinformatic analyses often require lengthy workflows or pipelines, where the
output of program A feeds into program B, and so on. These programs may also not
output their results in a format which is convenient to use in the subsequent
steps, requiring writing a conversion script, or piping its output through yet
another program. This means that something as simple as running a differential
expression experiment still requires several steps. If you aren't careful this
can result in an incredibly messy filesystem. Worse, you may not remember which
programs or scripts were run on each file, and with which options. This is a huge
issue out there and likely a good reason why it's so hard to reproduce results
even when the same underlying data is used. Additionally, you'll inevitably need
to spend time doing iterative analysis. This also needs to be documented and
reproducible.&lt;/p&gt;
&lt;p&gt;In this post I'll be explaining a few methods that we can use to organize this
situation before it drives you or your coworkers mad. Which ones you need depends, of course, on
the level of automation and reproducibility required of the workflow.&lt;/p&gt;
&lt;h3&gt;Suggestion 1: Interactive Terminal Sessions Are For Development &lt;strong&gt;Only&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;There is most definitely a time and a place for testing things out in your
terminal. When you're learning to use a new program, needing to check the
&lt;code&gt;--help&lt;/code&gt; or &lt;code&gt;man&lt;/code&gt; pages, figuring out how to glue together programs A and B, etc.
However, in-depth analyses for publication should not be done in this manner.&lt;/p&gt;
&lt;p&gt;This is because after running your analysis you may have absolutely no record of
what was run! Of course, some (but, critically, not all!) programs will export a
log file. You can quickly end up in a situation where you
have no idea which script was run on which file. So, reserve the interactive
terminal sessions for those use cases above.&lt;/p&gt;
&lt;h3&gt;Suggestion 2: Interactive Data Manipulation Should Be Performed in R or Jupyter Notebooks&lt;/h3&gt;
&lt;p&gt;Don't use Excel. I repeat, don't use Excel.&lt;/p&gt;
&lt;p&gt;Ok, Excel has its uses. However, if you're doing complex data analysis
it's very easy to reach a scale where you'll quickly regret using Excel.
Luckily the &lt;em&gt;entire&lt;/em&gt; &lt;a href="https://www.r-project.org/"&gt;R programming language&lt;/a&gt; was
designed for this, and &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt; with
&lt;a href="https://pandas.pydata.org/index.html"&gt;Pandas&lt;/a&gt; provides some similar tools. In
addition to the problem of scale, you also have no real record of what was done in an Excel
workbook. When you combine R or Python with computational notebooks you can run
code, and see the direct output of that code below it. This tracks everything
that you've run and its outputs.&lt;/p&gt;
&lt;p&gt;Even though I do most of my interactive analysis and figure-making in R, I still
prefer Jupyter Notebooks over R Notebooks. This is because they're more widely
used, and Jupyter is extensible to multiple languages. Installing the
&lt;a href="https://irkernel.github.io/"&gt;R Kernel&lt;/a&gt; is very simple.&lt;/p&gt;
&lt;h3&gt;Suggestion 3: Single-run Pipelines Should be Automated With Shell Scripts&lt;/h3&gt;
&lt;p&gt;When you write a one-off pipeline it should still be automated with a script.
This enables reproducibility. In a perfect world you'd list the version of each
piece of software in the pipeline as well. This could result in a single shell
script file, or separate ones for each step, since you may not know the next step
a priori. These shell scripts should clearly indicate the date of creation and
the script's purpose. This is a simple example for one step in a single-use
pipeline:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class="c1"&gt;# Author: Jeffrey Grover&lt;/span&gt;
&lt;span class="c1"&gt;# Date: 2019-07-24&lt;/span&gt;
&lt;span class="c1"&gt;# Purpose: Extract reads over small RNA loci groups with bedtools intersect&lt;/span&gt;

&lt;span class="c1"&gt;# Use bedtools intersect and pipe to bam2fq&lt;/span&gt;

&lt;span class="nv"&gt;align_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/large_data/2019-06-28_aligned_reads&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;bed_file&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./srna_groups/*.bed&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;bed_filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;basename&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$bed_file&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;out_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;bed_filename&lt;/span&gt;&lt;span class="p"&gt;%.bed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;_reads
&lt;span class="w"&gt;    &lt;/span&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./&lt;/span&gt;&lt;span class="nv"&gt;$out_dir&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;bamfile&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;align_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;/*.bam&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nv"&gt;bamfile_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;basename&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$bamfile&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;bedtools&lt;span class="w"&gt; &lt;/span&gt;intersect&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;-ubam&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;-a&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$bamfile&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;-b&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$bed_file&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;samtools&lt;span class="w"&gt; &lt;/span&gt;bam2fq&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./&lt;/span&gt;&lt;span class="nv"&gt;$out_dir&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$bamfile_name&lt;/span&gt;&lt;span class="s2"&gt;.fq&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;pigz&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./&lt;span class="nv"&gt;$out_dir&lt;/span&gt;/*.fq
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I work with a lot of small RNA sequencing data, and I recently needed to extract reads
from several different groups of small RNA loci I'd defined. It's relatively
simple to use &lt;code&gt;bedtools intersect&lt;/code&gt; with your loci of interest as a .bed file and
pipe that output to &lt;code&gt;samtools bam2fq&lt;/code&gt;. This isn't a standard analysis that I
run regularly, and it's not very long. Therefore, to keep it reproducible,
writing a quick shell script like this is the way to go. The
comment lines also carry enough information to tell someone what it does.&lt;/p&gt;
&lt;h3&gt;Suggestion 4: Long Pipelines Should Have a &lt;strong&gt;W i d e&lt;/strong&gt; Directory Structure&lt;/h3&gt;
&lt;p&gt;What does this mean? It means this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;groverj3@x1-carbon&lt;span class="w"&gt; &lt;/span&gt;wgbs_snakemake&lt;span class="o"&gt;]&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;ls
1_fastqc_raw&lt;span class="w"&gt;                &lt;/span&gt;4_methyldackel_mbias&lt;span class="w"&gt;    &lt;/span&gt;config.yaml&lt;span class="w"&gt;  &lt;/span&gt;README.md&lt;span class="w"&gt;         &lt;/span&gt;Snakefile
2_trim_galore&lt;span class="w"&gt;               &lt;/span&gt;5_methyldackel_extract&lt;span class="w"&gt;  &lt;/span&gt;input_data&lt;span class="w"&gt;   &lt;/span&gt;reference_genome&lt;span class="w"&gt;  &lt;/span&gt;temp_data
3_aligned_sorted_markdupes&lt;span class="w"&gt;  &lt;/span&gt;6_mosdepth&lt;span class="w"&gt;              &lt;/span&gt;LICENSE&lt;span class="w"&gt;      &lt;/span&gt;scripts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and not this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;wgbs_snakemake/1_fastqc_raw/2_trim_galore/3_aligned_sorted_markdupes/4_methyldackel_mbias/5_methyldackel_extract/6_mosdepth
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This makes navigating your directory structure much less of a pain, especially
when a pipeline is several steps long.&lt;/p&gt;
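&lt;p&gt;As a rough sketch of how such a layout could be set up before the first run, the
hypothetical Python helper below creates one numbered, top-level directory per step.
The function name &lt;code&gt;make_wide_layout&lt;/code&gt; and its arguments are assumptions made for
illustration; the step names would come from your own pipeline.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from pathlib import Path


def make_wide_layout(project_root, steps):
    """Create one numbered, top-level directory per pipeline step."""
    root = Path(project_root)
    created = []
    for number, step in enumerate(steps, start=1):
        # Every step directory sits directly under the project root (wide, not deep)
        step_dir = root / f"{number}_{step}"
        step_dir.mkdir(parents=True, exist_ok=True)
        created.append(step_dir)
    return created
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;make_wide_layout("wgbs_snakemake", ["fastqc_raw", "trim_galore"])&lt;/code&gt; would
create &lt;code&gt;wgbs_snakemake/1_fastqc_raw&lt;/code&gt; and &lt;code&gt;wgbs_snakemake/2_trim_galore&lt;/code&gt; side by
side, rather than nesting one inside the other.&lt;/p&gt;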
&lt;h3&gt;Suggestion 5: Automate Often-run Pipelines With Workflow Managers&lt;/h3&gt;
&lt;p&gt;If there is a particular pipeline that you run frequently then consider using a
workflow manager. Options include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://snakemake.readthedocs.io/en/stable/"&gt;Snakemake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nextflow.io/"&gt;Nextflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://scipipe.org/"&gt;SciPipe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pharmbio/sciluigi"&gt;SciLuigi&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;My vote goes to Snakemake with Nextflow as a close second. These tools require
some fiddling to port an existing pipeline to their framework, but
what you gain is reproducibility and automation. Additionally, they can run
independent steps in parallel and manage threads per step, which your Bash script
probably doesn't do. They also work with HPC job submission frameworks (SLURM,
PBS, etc.) and containers.&lt;/p&gt;
&lt;p&gt;Writing these workflows is beyond the scope of this article, but it's definitely
worth writing about in detail in a future one!&lt;/p&gt;
&lt;p&gt;A word of caution: it's easy to think, "Oh, I'm only going to analyze bisulfite
sequencing this one time" only to find yourself running your workflow several
times as you acquire more data. There are also some freely available workflows
already written that you can check out!&lt;/p&gt;
&lt;p&gt;(Shameless plug for &lt;a href="https://codeberg.org/groverj3/wgbs_snakemake"&gt;mine&lt;/a&gt;)&lt;/p&gt;
&lt;h3&gt;Suggestion 6: Containerize!&lt;/h3&gt;
&lt;p&gt;Wrap up your workflow and its required software in a container for the ultimate
write-once, run-anywhere solution. You can make a 
&lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; container with your entire workflow, which can
then be used on your server or in the cloud. However, in order to run this in
an HPC environment you'll usually need to run it through 
&lt;a href="https://sylabs.io/"&gt;Singularity&lt;/a&gt; instead. That's fine though! Singularity can
run Docker containers, and you'll already have one to use for cloud compute if
needed.&lt;/p&gt;
&lt;h3&gt;Wrapping up&lt;/h3&gt;
&lt;p&gt;Hopefully you've found this informative and helpful. Next time I'll be back with
more practical examples.&lt;/p&gt;</content><category term="commentary"></category><category term="bioinformatics"></category><category term="thoughts"></category><category term="workflows"></category></entry><entry><title>Efficiently Filtering While Reading Data Into R (With Python?!)</title><link href="https://groverj3.codeberg.page/articles/2019-07-17_efficiently-filtering-while-reading-data-into-r-with-python.html" rel="alternate"></link><published>2019-07-17T00:00:00-04:00</published><updated>2019-07-17T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-07-17:/articles/2019-07-17_efficiently-filtering-while-reading-data-into-r-with-python.html</id><summary type="html">&lt;p&gt;Working with large amounts of tabular data is a daily occurrence for both
bioinformaticians and data scientists. There's a lot the two groups can learn
from each other (great future post material). However, I recently ran into a
situation that I was sure had to be relatively common. Apparently it …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Working with large amounts of tabular data is a daily occurrence for both
bioinformaticians and data scientists. There's a lot the two groups can learn
from each other (great future post material). However, I recently ran into a
situation that I was sure had to be relatively common. Apparently it wasn't, and I
had very little luck finding a solution in my usual genomics/bioinformatics
circles or in the data science material I had on hand.&lt;/p&gt;
&lt;h3&gt;The Problem&lt;/h3&gt;
&lt;p&gt;I recently received output from a large BLAST search, something on the order of
200,000 queries. Some of those queries had many thousands of hits because
the search was run with minimal filtering. The idea being, it's easy to
post-filter it, but you can't get back the hits that are thrown away. Fair
enough. The output was also split across 100+ files in
BLAST's "format 7" (run with -outfmt 7), which means it's tabular (.tsv) with
comment lines throughout. This collection of files was too big to load
into R (where I do most exploratory data analysis) and then filter. So,
I figured there had to be a satisfactory way to combine loading and filtering.
Of course, you could also pre-filter it with awk or Python
line-by-line and write it out to the hard drive, but this problem interested me.&lt;/p&gt;
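&lt;p&gt;For reference, the pre-filtering route mentioned above might look something like
the sketch below. It assumes the default 12-column tabular layout, in which percent
identity, alignment length, and e-value are columns 3, 4, and 11. The function name
&lt;code&gt;prefilter_blast7&lt;/code&gt; and its parameters are hypothetical, chosen only for this
illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def prefilter_blast7(in_path, out_path, min_perc_id, min_al_len, max_evalue):
    """Stream a BLAST format 7 file and write only the hits that pass the filters."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for raw in src:
            # Comment lines start with '#'; blank lines carry no hit
            if raw.startswith("#") or not raw.strip():
                continue
            fields = raw.rstrip("\n").split("\t")
            # Assumed columns (1-based): 3 = identity, 4 = alignment length, 11 = e-value
            if (float(fields[2]) >= min_perc_id
                    and int(fields[3]) >= min_al_len
                    and max_evalue >= float(fields[10])):
                dst.write("\t".join(fields) + "\n")
                kept += 1
    # Report how many hits survived the filters
    return kept
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because only one line is held in memory at a time, this scales to files of any
size, at the cost of writing a second copy of the surviving hits to disk.&lt;/p&gt;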
&lt;h3&gt;&lt;strong&gt;TLDR&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;If you have to load &lt;strong&gt;AND&lt;/strong&gt; filter you should use the lesser-known
&lt;a href="http://readr.tidyverse.org"&gt;readr&lt;/a&gt; function &lt;code&gt;read_delim_chunked()&lt;/code&gt; (and its
derivatives, &lt;code&gt;read_{tsv|csv|csv2}_chunked()&lt;/code&gt;) or write a parser in Python and
translate the resulting object (list of lists, dictionary of lists, or pandas
DataFrame) to R with 
&lt;a href="https://rstudio.github.io/reticulate/index.html"&gt;reticulate&lt;/a&gt;. The reason
is that iterating through a file and filtering line-by-line, while a
seemingly common thing to do, is horrifically slow in R as far as I can tell. I'm
happy to eat my words if other useRs can prove me wrong.&lt;/p&gt;
&lt;h3&gt;Attempt #1: Writing a Line-by-line Parser in R&lt;/h3&gt;
&lt;p&gt;I read all the warnings. They say that R is slow. But is this really true? I
frequently read pretty large files into R with &lt;a href="https://readr.tidyverse.org/"&gt;readr&lt;/a&gt;
or &lt;a href="https://github.com/Rdatatable/data.table/wiki"&gt;data.table&lt;/a&gt;, and they're
&lt;em&gt;wicked fast&lt;/em&gt;. What I failed to immediately realize is that these packages are
fast because they're written in C/C++ and are effectively compiled programs that
interact with R through its API.&lt;/p&gt;
&lt;p&gt;I decided to test this on two subsets of the data: one of the smaller
files (~2.6 GB) and its first 10,000 lines. The .tsv files have comment lines
denoted with '#' and 12 columns holding a mix of character, float, and
integer values:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example data&lt;/span&gt;
&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;test_blast_fmt7.out&amp;#39;&lt;/span&gt;
&lt;span class="nf"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;head -n 10000 test_blast_fmt7.out &amp;gt; test_blast_fmt7.head10000.out&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;test_blast_fmt7.head10000.out&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;query&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;subject&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;identity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;align_length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;mismatches&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;gap_opens&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;q_start&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;q_end&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;s_start&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;s_end&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;evalue&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;bit_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I first attempted to solve this problem by writing the following function (don't
do this).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;read_filter_blast7_lbl_base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Initialize a line counter&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Open a file connection, yield one line at a time&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;out_list&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;readLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# Split the line at the separator to yield a list, turn into a vector&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;strsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# Skip comment lines&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;# Filtering conditions go here; fields are strings, so compare them as numbers&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;out_list&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;# Count lines&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Bind the lines as a data frame, don&amp;#39;t convert strings to factors&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do.call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rbind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out_list&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;colnames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Set the column classes&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# The out_df object holds the filtered data, with columns 3-12 converted to numeric&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Wow! What an abomination. I'm not sure if this says more about R's unsuitability
for this kind of task or my obvious "Python-think" that's seeping in here. It's a
mess. And it's slow. I tried testing on the first 10,000 lines of one of the
files:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Benchmark 100 iterations (default) over the first 10000 lines&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;milliseconds&lt;/span&gt;
&lt;span class="w"&gt;                                                                         &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;338.4403&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;346.2422&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;351.0633&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;349.2319&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;352.5961&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;404.62&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It works, but this doesn't scale up to a full file (it sat for ages until I
killed it), and it suffers from R's tendency to copy, rather than append to, a
list that grows inside a loop. There are clearly other issues too, because my
attempts to pre-allocate a list or data frame of the correct size did not speed
it up. This means that I might be doing something wrong. Regardless, this is too
much work to do something so simple. I welcome others to find a pure base R
implementation that's better. It seems like there &lt;em&gt;should&lt;/em&gt; be a way to do it.&lt;/p&gt;
&lt;p&gt;However, there are better options.&lt;/p&gt;
&lt;h3&gt;Attempt #2: Using &lt;a href="https://www.rdocumentation.org/packages/sqldf/versions/0.4-11"&gt;&lt;strong&gt;sqldf&lt;/strong&gt;&lt;/a&gt; to Filter a Temporary sqlite Database&lt;/h3&gt;
&lt;p&gt;The internet led me to believe that this isn't really something people do in R.
And if you can't load it all into memory then you should use a database and query
that. It seems excessive, but the &lt;a href="https://github.com/ggrothendieck/sqldf"&gt;sqldf&lt;/a&gt;
R package can do this. It even includes a function to create the DB on the fly
while reading (&lt;code&gt;read.csv.sql()&lt;/code&gt;). I totally understand using SQL or similar to query
a DB when you have a reason to query it often and it's already stored as an SQL DB. I
question the wisdom of this suggestion for this purpose, though. I was able to use
it as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;read_filter_blast7_sqldf&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;temp_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read.csv.sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sed -e &amp;#39;/^#/d&amp;#39;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;paste0&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;SELECT * FROM file WHERE V3 &amp;gt;= &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39; AND V4 &amp;gt;= &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39; AND V11 &amp;lt;= &amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;colClasses&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;character&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;numeric&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;integer&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;numeric&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;colnames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;One of the key limitations here is that you need to pipe through a shell command
(sed) to remove comment lines. Not the biggest deal, but having to write a sed
command does take you out of your flow in an R or jupyter notebook. Let's see
how that performs:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Benchmark 100 iterations (default) over the first 10000 lines&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_sqldf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;milliseconds&lt;/span&gt;
&lt;span class="w"&gt;                                                                      &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_sqldf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;72.27077&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;73.30782&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;93.27967&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;79.38143&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;103.4807&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;199.8507&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;What's going on here? The max is much slower than the min. This is because the
first iteration loads the input into a temporary SQLite database &lt;strong&gt;file&lt;/strong&gt;!
This will take even more time for a larger file, and that temporary database will
be the size of the full, unfiltered file. Usually you load each file once, and each
file needs its own temp SQLite DB. So, the max time is actually the only timing that
matters! Plus, it has the problem of filling up your /tmp directory. What happens
when I try to load the smallest whole file (2.6 GB)?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Benchmark the whole file once&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_sqldf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="w"&gt;                                                            &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_sqldf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;165.4224&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;165.4224&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;165.4224&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;165.4224&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;165.4224&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;165.4224&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's decent performance, but because it makes temporary files it will fill up
your /tmp directory:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="w"&gt; &lt;/span&gt;-h
Filesystem&lt;span class="w"&gt;      &lt;/span&gt;Size&lt;span class="w"&gt;  &lt;/span&gt;Used&lt;span class="w"&gt; &lt;/span&gt;Avail&lt;span class="w"&gt; &lt;/span&gt;Use%&lt;span class="w"&gt; &lt;/span&gt;Mounted&lt;span class="w"&gt; &lt;/span&gt;on
dev&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/dev
run&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.4M&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/run
/dev/nvme0n1p2&lt;span class="w"&gt;  &lt;/span&gt;423G&lt;span class="w"&gt;   &lt;/span&gt;34G&lt;span class="w"&gt;  &lt;/span&gt;368G&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/
tmpfs&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;  &lt;/span&gt;170M&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.6G&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/dev/shm
tmpfs&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/sys/fs/cgroup
tmpfs&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.8G&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;.9G&lt;span class="w"&gt;  &lt;/span&gt;913M&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;89&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/tmp
/dev/nvme0n1p1&lt;span class="w"&gt;  &lt;/span&gt;300M&lt;span class="w"&gt;  &lt;/span&gt;348K&lt;span class="w"&gt;  &lt;/span&gt;300M&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/boot/efi
tmpfs&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.6G&lt;span class="w"&gt;   &lt;/span&gt;32K&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.6G&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;/run/user/1000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And running it on a second file fails:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Benchmark the whole file once&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_sqldf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;connection_import_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;eol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;RS_sqlite_import&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;addition&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;.Internal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;closing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;unused&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_blast_fmt7.out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;result_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;statement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;cannot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rollback&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then had to &lt;code&gt;sudo rm -r /tmp/Rtmp*&lt;/code&gt; because &lt;code&gt;/tmp&lt;/code&gt; was full.&lt;/p&gt;
&lt;p&gt;On a system with tons of space it could be fine. I'm running this on my laptop
during a flight so that doesn't help. You could also specify where those
databases are made. However, the largest file I needed to work with is 30 GB, and
there were several of them. This exact problem also happened on our lab's
server. (Note to self: Ask my PI to upgrade the root drive).&lt;/p&gt;
&lt;p&gt;Still not a great solution.&lt;/p&gt;
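&lt;p&gt;If you do go this route, it may help to check for space before loading. Here is a minimal sketch in Python, using only the standard library. The function name &lt;code&gt;tmp_headroom_bytes&lt;/code&gt; and its parameters are hypothetical, and it assumes (per the observation above) that the temporary database ends up roughly the size of the input file:&lt;/p&gt;

```python
import os
import shutil
import tempfile


def tmp_headroom_bytes(input_file, tmp_dir=None, safety_factor=1.0):
    """Free bytes left in tmp_dir after a temp DB about the size of input_file.

    A negative result means the load would be expected to run out of space.
    """
    # Default to the temp directory Python reports for this system
    tmp_dir = tmp_dir or tempfile.gettempdir()
    # Assumption: the temp database is roughly as large as the raw input
    needed = os.path.getsize(input_file) * safety_factor
    free = shutil.disk_usage(tmp_dir).free
    return free - needed
```

&lt;p&gt;Calling &lt;code&gt;tmp_headroom_bytes('test_blast_fmt7.out')&lt;/code&gt; before a load gives a rough go/no-go number. R may use a different temp directory than Python does, so passing &lt;code&gt;tmp_dir&lt;/code&gt; explicitly is probably the safer choice.&lt;/p&gt;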
&lt;h3&gt;Attempt #3: &lt;a href="https://readr.tidyverse.org/"&gt;&lt;strong&gt;readr&lt;/strong&gt;&lt;/a&gt; &lt;code&gt;read_delim_chunked()&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;This function isn't well-documented, but is the fastest option I found. It's not
quite the line-by-line implementation I thought up, but it's similar. Basically,
it reads the file in chunks, applies a user-defined callback function to each
chunk, and binds the resulting dataframes together. I use the
&lt;code&gt;read_tsv_chunked()&lt;/code&gt; variant below. Getting the best performance would
require optimizing the chunk size to the largest you can reasonably handle in
memory. I stuck with 10,000 because I was comparing to other options.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;read_filter_blast7_readr_chunked&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;align_length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;evalue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_tsv_chunked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DataFrameCallback&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
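&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;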
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
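&lt;p&gt;As an aside, the chunk-and-callback pattern that the R function above relies on can be sketched in plain Python. This is only an illustration of the idea, not how readr implements it. The names &lt;code&gt;passes&lt;/code&gt; and &lt;code&gt;read_filter_chunked&lt;/code&gt; are hypothetical, and the column positions (2, 3 and 10, zero-based) assume the standard 12-column BLAST format 7 layout used throughout this post:&lt;/p&gt;

```python
import csv
from itertools import islice
from operator import ge, le


def passes(row, min_perc_id, min_al_len, max_evalue):
    """Callback: True if one BLAST hit meets all three thresholds."""
    return all((
        ge(float(row[2]), min_perc_id),   # percent identity
        ge(int(row[3]), min_al_len),      # alignment length
        le(float(row[10]), max_evalue),   # e-value
    ))


def read_filter_chunked(path, min_perc_id, min_al_len, max_evalue,
                        chunk_size=10000):
    """Read a tab-separated file chunk by chunk, keeping only passing rows."""
    kept = []
    with open(path, newline='') as handle:
        # Skip '#' comment lines lazily, so they never count toward a chunk
        data_lines = (line for line in handle if not line.startswith('#'))
        reader = csv.reader(data_lines, delimiter='\t', quoting=csv.QUOTE_NONE)
        while True:
            # Only chunk_size parsed rows are held at once before filtering
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            kept.extend(row for row in chunk
                        if passes(row, min_perc_id, min_al_len, max_evalue))
    return kept
```

&lt;p&gt;Memory use is bounded by the chunk size plus whatever survives the filter, which is the same trade-off the &lt;code&gt;chunk_size&lt;/code&gt; argument controls in the R version.&lt;/p&gt;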

&lt;p&gt;Benchmarking &lt;code&gt;read_filter_blast7_readr_chunked()&lt;/code&gt; shows very solid performance:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Benchmark 100 iterations (default) over the first 10000 lines&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_readr_chunked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;milliseconds&lt;/span&gt;
&lt;span class="w"&gt;                                                                              &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_readr_chunked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;28.90464&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;29.10356&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30.87358&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;29.92421&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;31.28273&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;92.10364&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's not too surprising: the test file is only 10,000 lines, which fits in a
single chunk, so it's basically just reading in the whole file at once, and readr
is fast. So, how does it work on the full file?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Benchmark the whole file once&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_readr_chunked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="w"&gt;                                                                         &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_readr_chunked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;76.59867&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;76.59867&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;76.59867&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;76.59867&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;76.59867&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;76.59867&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Really good performance. You can tune it further as well by adjusting the chunk
size. This is probably your best bet without getting too weird. But let's get
weird ;)&lt;/p&gt;
&lt;h3&gt;Attempt #4: Parse With Python, Translate to R With &lt;a href="https://rstudio.github.io/reticulate/index.html"&gt;&lt;strong&gt;reticulate&lt;/strong&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let's do this in Python! Sort of...&lt;/p&gt;
&lt;p&gt;Writing a function to read and filter something in a for loop is a common thing
for me in Python. I usually don't use Python for exploratory analysis though and
am less familiar with Pandas &lt;em&gt;et al.&lt;/em&gt; than I am with R's ecosystem. However, I
was able to figure out that &lt;code&gt;pd.DataFrame()&lt;/code&gt; will automatically turn a
list of lists into a dataframe, which is pretty neat. A solid win over R's
functionality:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_blast7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_blast_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_perc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_al_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_blast_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;input_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;perc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="n"&gt;al_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="n"&gt;evalue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;perc_id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_perc_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;al_len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_al_len&lt;/span&gt;
                        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;evalue&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_evalue&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;df_list&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This function, when tested in Python, returns a data frame as expected:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filter_blast7&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="n"&gt;query&lt;/span&gt;     &lt;span class="n"&gt;subject&lt;/span&gt;    &lt;span class="n"&gt;identity&lt;/span&gt; &lt;span class="n"&gt;align_length&lt;/span&gt; &lt;span class="n"&gt;mismatches&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;q_end&lt;/span&gt; &lt;span class="n"&gt;s_start&lt;/span&gt; &lt;span class="n"&gt;s_end&lt;/span&gt;  &lt;span class="n"&gt;evalue&lt;/span&gt; &lt;span class="n"&gt;bit_score&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;query_1&lt;/span&gt;  &lt;span class="n"&gt;scaffold88&lt;/span&gt;  &lt;span class="mf"&gt;100.000&lt;/span&gt;          &lt;span class="mi"&gt;869&lt;/span&gt;          &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="mi"&gt;869&lt;/span&gt;  &lt;span class="mi"&gt;733052&lt;/span&gt;  &lt;span class="mi"&gt;733920&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="mi"&gt;1605&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;query_1&lt;/span&gt;  &lt;span class="n"&gt;scaffold88&lt;/span&gt;   &lt;span class="mf"&gt;95.732&lt;/span&gt;          &lt;span class="mi"&gt;867&lt;/span&gt;         &lt;span class="mi"&gt;34&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="mi"&gt;869&lt;/span&gt;  &lt;span class="mi"&gt;734435&lt;/span&gt;  &lt;span class="mi"&gt;735298&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="mi"&gt;1393&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;query_1&lt;/span&gt;   &lt;span class="n"&gt;scaffold1&lt;/span&gt;  &lt;span class="mf"&gt;100.000&lt;/span&gt;          &lt;span class="mi"&gt;869&lt;/span&gt;          &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="mi"&gt;869&lt;/span&gt;    &lt;span class="mi"&gt;4053&lt;/span&gt;    &lt;span class="mi"&gt;4921&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="mi"&gt;1605&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="n"&gt;query_1&lt;/span&gt;   &lt;span class="n"&gt;scaffold1&lt;/span&gt;   &lt;span class="mf"&gt;95.732&lt;/span&gt;          &lt;span class="mi"&gt;867&lt;/span&gt;         &lt;span class="mi"&gt;34&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="mi"&gt;869&lt;/span&gt;    &lt;span class="mi"&gt;5436&lt;/span&gt;    &lt;span class="mi"&gt;6299&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="mi"&gt;1393&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="n"&gt;query_3&lt;/span&gt;  &lt;span class="n"&gt;scaffold88&lt;/span&gt;  &lt;span class="mf"&gt;100.000&lt;/span&gt;          &lt;span class="mi"&gt;786&lt;/span&gt;          &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="mi"&gt;786&lt;/span&gt;  &lt;span class="mi"&gt;735004&lt;/span&gt;  &lt;span class="mi"&gt;735789&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;      &lt;span class="mi"&gt;1452&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Nice, but I thought we were using R? Well, we can actually use the R package
&lt;a href="https://rstudio.github.io/reticulate/index.html"&gt;reticulate&lt;/a&gt; to run this Python
code and translate its output to an R data frame. Pretty neat! You can get it
from CRAN with &lt;code&gt;install.packages('reticulate')&lt;/code&gt;. With it installed, you'll want to
make sure it can find your Python installation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reticulate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;py_discover_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groverj3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;.pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;shims&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;libpython&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groverj3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;.pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;3.7&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;libpython3.7m.so&lt;/span&gt;
&lt;span class="n"&gt;pythonhome&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groverj3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;.pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;3.7&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groverj3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;.pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;3.7&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="m"&gt;3.7&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Mar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;17&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;02&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;15&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;GCC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8.2&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;20181127&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groverj3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;.local&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3.7&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;
&lt;span class="n"&gt;numpy_version&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.16&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;

&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;versions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;groverj3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;.pyenv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;shims&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;py_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;py_available&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Even though it found my Python installation, &lt;code&gt;py_available()&lt;/code&gt; returns &lt;code&gt;FALSE&lt;/code&gt;? I
actually forgot to initialize it with &lt;code&gt;use_python()&lt;/code&gt;, but you can either do that
or use &lt;code&gt;py_available(initialize = TRUE)&lt;/code&gt;. You also must have the Python shared
library installed (which it is not by default when using pyenv). Now, you can
either source the Python parsing function from a saved .py script, or run it
inline as follows, and coerce it to an R function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;py_run_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="s"&gt;&amp;quot;import pandas as pd&lt;/span&gt;

&lt;span class="s"&gt;def filter_blast7(input_blast_results, header, min_perc_id, min_al_len,&lt;/span&gt;
&lt;span class="s"&gt;                   max_evalue):&lt;/span&gt;
&lt;span class="s"&gt;    df_list = []&lt;/span&gt;
&lt;span class="s"&gt;    with open(input_blast_results, &amp;#39;r&amp;#39;) as input_handle:&lt;/span&gt;
&lt;span class="s"&gt;        for line in input_handle:&lt;/span&gt;
&lt;span class="s"&gt;            if not line.startswith(&amp;#39;#&amp;#39;):&lt;/span&gt;
&lt;span class="s"&gt;                entry = line.strip().split()&lt;/span&gt;
&lt;span class="s"&gt;                perc_id = float(entry[2])&lt;/span&gt;
&lt;span class="s"&gt;                al_len = int(entry[3])&lt;/span&gt;
&lt;span class="s"&gt;                evalue = float(entry[10])&lt;/span&gt;
&lt;span class="s"&gt;                if (&lt;/span&gt;
&lt;span class="s"&gt;                    perc_id &amp;gt;= min_perc_id&lt;/span&gt;
&lt;span class="s"&gt;                    and al_len &amp;gt;= min_al_len&lt;/span&gt;
&lt;span class="s"&gt;                    and evalue &amp;lt;= max_evalue&lt;/span&gt;
&lt;span class="s"&gt;                ):&lt;/span&gt;
&lt;span class="s"&gt;                    df_list.append(entry)&lt;/span&gt;
&lt;span class="s"&gt;    return pd.DataFrame(df_list, columns=header)&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;filter_blast7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In R, the result shows up as a &lt;code&gt;python.builtin.function&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;python.builtin.function&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;python.builtin.object&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
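&lt;p&gt;On the shared-library requirement: one quick way to see whether a given interpreter was built with a shared library is to ask Python itself. The sketch below uses only the standard-library &lt;code&gt;sysconfig&lt;/code&gt; module. The helper name &lt;code&gt;has_shared_libpython&lt;/code&gt; is made up for illustration and is not part of reticulate.&lt;/p&gt;

```python
import sysconfig


def has_shared_libpython() -> bool:
    """Report whether this interpreter was built with a shared libpython."""
    # Py_ENABLE_SHARED is 1 for --enable-shared builds and 0 otherwise;
    # it may be None on platforms that do not define it (e.g. Windows).
    return bool(sysconfig.get_config_var('Py_ENABLE_SHARED'))


print('shared libpython:', has_shared_libpython())
# LDLIBRARY names the Python library file for this build (may be None)
print('library name:', sysconfig.get_config_var('LDLIBRARY'))
```

&lt;p&gt;If this prints &lt;code&gt;False&lt;/code&gt; for a pyenv-built interpreter, rebuilding with &lt;code&gt;PYTHON_CONFIGURE_OPTS="--enable-shared"&lt;/code&gt; set during &lt;code&gt;pyenv install&lt;/code&gt; is the usual remedy, though it is worth checking the pyenv documentation for your version.&lt;/p&gt;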

&lt;p&gt;Now, let's benchmark that:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Benchmark 100 iterations over the first 10000 lines&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;milliseconds&lt;/span&gt;
&lt;span class="w"&gt;                                                                       &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;60.73332&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;62.57396&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;67.58808&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;63.75887&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;69.98544&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;105.8659&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Benchmark the whole file once&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="w"&gt;                                                             &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;139.1212&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;139.1212&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;139.1212&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;139.1212&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;139.1212&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;139.1212&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It also returns a data frame identical to the Python one:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_filter_blast7_lbl_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1e-50&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;align_length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mismatches&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;query_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scaffold88&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;100.000&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;869&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;query_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scaffold88&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;95.732&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;867&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;34&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;query_1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;scaffold1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;100.000&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;869&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;query_1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;scaffold1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;95.732&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;867&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;34&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;query_3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scaffold88&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;100.000&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;786&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;query_3&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;scaffold1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;100.000&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;786&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;gap_opens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q_start&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q_end&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s_start&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;s_end&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;evalue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bit_score&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;869&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;733052&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;733920&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1605&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;869&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;734435&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;735298&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1393&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;869&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;4053&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;4921&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1605&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;869&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5436&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;6299&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1393&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;786&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;735004&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;735789&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1452&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;786&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;6005&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;6790&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1452&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's definitely not as fast as reading it chunked with readr. However, that
Python function was easy to write and performed far better than the base R way to
read and filter line-by-line. In the future, if I have something that I know how
to do in Python I may not try to translate it to R. Just run the Python code with
reticulate!&lt;/p&gt;
&lt;h3&gt;Wrapping Things Up&lt;/h3&gt;
&lt;p&gt;You would think that filtering huge datasets while reading them is a common
task. Apparently, it's not common enough in R for good documentation to exist.
In the end readr wins again with &lt;code&gt;read_delim_chunked()&lt;/code&gt;. This does require some
tuning to get the best performance, but if you pick a sane &lt;code&gt;chunk_size&lt;/code&gt; then
further optimization is probably unnecessary and it will work fine. However, the
revelation that communicating between Python and R works so well opens up a lot
of future possibilities! Some things are just better suited to one language or
another. While Python has pandas dataframes and a large ecosystem around them,
none of it is as intuitive as the &lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt; (to
me). Either of the two winners here is ideal for situations where you have
a huge file to load, but you know that most of the file will not meet your
filtering criteria.&lt;/p&gt;
&lt;p&gt;Using Python to do general purpose programming and communicating those results to
R for statistical testing and visualization is definitely something to consider.&lt;/p&gt;
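&lt;p&gt;For a sense of what filter-while-reading looks like on the Python side, here is a
minimal sketch using only the standard library. The column names come from the
readr column specification shown in this post; the &lt;code&gt;read_filtered()&lt;/code&gt; name, the
identity cutoff, and the two example rows are assumptions made up for
illustration, not the function that was benchmarked.&lt;/p&gt;

```python
import csv
import io

# Column names for the tabular BLAST output, as in the readr column spec.
COLUMNS = [
    "query", "subject", "identity", "align_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
]


def read_filtered(handle, min_identity=100.0):
    """Yield rows as dicts, keeping only those that meet the identity cutoff."""
    # Drop comment lines lazily, so the whole file is never held in memory.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        # Hypothetical filter: keep hits at or above the identity cutoff.
        if float(row["identity"]) >= min_identity:
            yield row


# Tiny in-memory stand-in for a multi-gigabyte results file (made-up rows).
example = io.StringIO(
    "# BLASTN comment line\n"
    "q1\ts1\t100.0\t50\t0\t0\t1\t50\t1\t50\t0.0\t93\n"
    "q2\ts2\t87.5\t48\t6\t0\t1\t48\t10\t57\t1e-10\t60\n"
)
kept = list(read_filtered(example))
print(len(kept), kept[0]["query"])
```

&lt;p&gt;Because the generator discards rows as it goes, memory use tracks the number of
rows kept rather than the size of the file, which is the whole point when most of
the file fails the filter.&lt;/p&gt;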
&lt;h3&gt;Addendum: What's the Overhead of Filtering While Reading?&lt;/h3&gt;
&lt;p&gt;Let's check it out, first the readr method:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_tsv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Parsed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;specification&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_character&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_character&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;identity&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;align_length&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;mismatches&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;gap_opens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;q_start&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;q_end&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;s_start&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;s_end&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;evalue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;bit_score&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;col_double&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;|=================================================================|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;%&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2685&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MB&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="w"&gt;                                                       &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_tsv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;#&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64.35369&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64.35369&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64.35369&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64.35369&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64.35369&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64.35369&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is compared with 76.59867 s for filtering. That's basically no difference.
How about reading it in with Python using pandas &lt;code&gt;read_csv()&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;read_csv_py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;py_run_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="s"&gt;&amp;quot;def read_blast7(input_blast_results, header):&lt;/span&gt;
&lt;span class="s"&gt;    return pd.read_csv(input_blast_results, sep=&amp;#39;\t&amp;#39;, comment=&amp;#39;#&amp;#39;, names=header)&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;read_blast7&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Benchmark the whole file once&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;microbenchmark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_csv_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="w"&gt;                               &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;lq&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv_py&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;97.45004&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;97.45004&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;97.45004&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;97.45004&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;uq&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;neval&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;97.45004&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;97.45004&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There's clearly a bit of overhead with both of these methods, but it's pretty
minor, on the order of 10-30 s for a 2.6 GB file.&lt;/p&gt;</content><category term="how-to"></category><category term="bioinformatics"></category><category term="data-science"></category><category term="r"></category><category term="python"></category><category term="big-data"></category></entry><entry><title>Variations on RNAseq Workflows for DEG Analysis</title><link href="https://groverj3.codeberg.page/articles/2019-07-09_variations-on-rnaseq-workflows-for-deg-analysis.html" rel="alternate"></link><published>2019-07-09T00:00:00-04:00</published><updated>2019-07-09T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-07-09:/articles/2019-07-09_variations-on-rnaseq-workflows-for-deg-analysis.html</id><summary type="html">&lt;p&gt;When analyzing RNAseq you're faced with many possible analysis pipelines. The
biggest decision you need to make is what the purpose of your experiment is. I
will make the assumption that &lt;em&gt;most&lt;/em&gt; of the time people want to determine which
genes are differentially expressed between two samples, genotypes, conditions,
etc …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When analyzing RNAseq you're faced with many possible analysis pipelines. The
biggest decision you need to make is what the purpose of your experiment is. I
will make the assumption that &lt;em&gt;most&lt;/em&gt; of the time people want to determine which
genes are differentially expressed between two samples, genotypes, conditions,
etc. In DEG analysis you are interested in gene-level expression. This means you
are &lt;strong&gt;not&lt;/strong&gt; interested in differential isoforms/transcripts or alternative
splicing. The simplest version of this is having control and
experimental samples (preferably with &amp;gt;= 3 biological replicates each). However,
this isn't as straightforward as firing up your favorite aligner and going to
town on the data. There are other considerations.&lt;/p&gt;
&lt;h3&gt;I Have a High Quality Annotated Reference Genome or Transcriptome&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;My Reference Genome is High Quality&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Align reads to reference genome (STAR, HISAT2)&lt;/li&gt;
&lt;li&gt;Count reads per gene (HTSeq-count, summarizeOverlaps, featureCounts)&lt;/li&gt;
&lt;li&gt;DEG Analysis (DESeq2, edgeR)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the standard workflow that you're probably accustomed to. Note: it is
very important to use a &lt;em&gt;modern&lt;/em&gt; splicing-aware aligner. Do not use bowtie. Both
STAR and HISAT2 are very fast compared to older aligners and are designed for
RNAseq. Their default options are generally appropriate for most simple
experimental designs. As a bonus, STAR can actually do step 2 itself, although
the output format is kind of clunky.&lt;/p&gt;
&lt;p&gt;This workflow is a good general purpose one in model organisms, and nobody will
fault you for using it there. However, there are potentially better options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;My Annotation/Transcriptome is High Quality&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pseudoalignment-based abundance estimation (Salmon, Kallisto)&lt;/li&gt;
&lt;li&gt;Aggregate abundances per gene from transcripts (tximport)&lt;/li&gt;
&lt;li&gt;DEG Analysis (DESeq2, edgeR)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This workflow may actually be better
(&lt;a href="https://f1000research.com/articles/4-1521/v2"&gt;ref&lt;/a&gt;) even if you have a
reference genome. I've always assumed that reference-genome alignment is superior
when you have a good reference, but apparently this is not necessarily the case,
for the reasons detailed in that paper.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; very fast, potentially more accurate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; no .bam file is generated so you can't look at positional information
from your reads, no ability to discover new transcripts later from your
alignments.&lt;/p&gt;
&lt;p&gt;Either of these workflows will work fine in this situation, and the better your
genome is, the more closely the first will likely approximate the second. Still, I
now believe that the second workflow should be the standard if your goal is purely
DEG analysis. There are still a lot of good reasons to want a .bam file, and
nothing is stopping you from aligning your reads anyway for future use.&lt;/p&gt;
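&lt;p&gt;The "aggregate abundances per gene" step in the second workflow is conceptually
simple, and a rough sketch in Python may help. This only sums estimated counts
over a transcript-to-gene mapping; tximport itself offers more than that, so treat
this as the core idea rather than what the package does. The function name,
identifiers, and numbers are all hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def aggregate_to_genes(transcript_counts, tx2gene):
    """Sum per-transcript estimated counts into per-gene totals."""
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        # Transcripts missing from the mapping are skipped in this sketch.
        gene_id = tx2gene.get(transcript_id)
        if gene_id is not None:
            gene_counts[gene_id] += count
    return dict(gene_counts)


# Hypothetical identifiers and values, for illustration only.
tx2gene = {"tx1.1": "geneA", "tx1.2": "geneA", "tx2.1": "geneB"}
transcript_counts = {"tx1.1": 120.5, "tx1.2": 30.0, "tx2.1": 75.25, "txX": 4.0}
gene_level = aggregate_to_genes(transcript_counts, tx2gene)
print(gene_level)
```

&lt;p&gt;Note that pseudoaligners report fractional estimated counts, which is why the
totals here are floats rather than integers.&lt;/p&gt;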
&lt;h3&gt;My Genome/Transcriptome is Incomplete&lt;/h3&gt;
&lt;p&gt;In this case you have some decisions to make, yet again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Genome is Good but Annotations Are Poor&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Align to reference genome (STAR, HISAT2)&lt;/li&gt;
&lt;li&gt;Assemble transcripts, genome-guided (StringTie)&lt;/li&gt;
&lt;li&gt;Aggregate abundances per gene from transcripts (tximport)&lt;/li&gt;
&lt;li&gt;DEG Analysis (DESeq2, edgeR)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Another option here is to use a tool like
&lt;a href="https://github.com/PASApipeline/PASApipeline/wik"&gt;PASA&lt;/a&gt; to update the
existing annotations if they exist. I've run that pipeline. It's very quirky, a
pain to get running, and if you don't need genomic coordinates I'd avoid it. You
could also use Salmon/Kallisto with StringTie's transcripts, without using its
quantification, but this seems to be an unnecessary step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Genome and Transcriptome Are Poor&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Assemble transcriptome (Trinity)&lt;/li&gt;
&lt;li&gt;Pseudoalignment-based abundance estimation (Salmon, Kallisto)&lt;/li&gt;
&lt;li&gt;Aggregate abundances per gene from transcripts (tximport)&lt;/li&gt;
&lt;li&gt;DEG Analysis (DESeq2, edgeR)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In this case you're going to want to do a thorough &lt;em&gt;de novo&lt;/em&gt; transcriptome
assembly using something like
&lt;a href="https://github.com/trinityrnaseq/trinityrnaseq/wiki"&gt;Trinity&lt;/a&gt;. This
transcriptome can then be used for pseudoalignment-based abundance estimation and
then DEGs can be determined after aggregation of isoform abundances. Trinity can
be quite a resource hog, so you're going to want to
&lt;a href="https://downloadmoreram.com/"&gt;get more ram&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Why Not Cufflinks/StringTie For Transcript Assembly In Model Organisms?&lt;/h3&gt;
&lt;p&gt;First of all, don't use Cufflinks. StringTie is essentially a more modern
Cufflinks that's
&lt;a href="https://ccb.jhu.edu/software/stringtie/index.shtml?t=faq#comp"&gt;faster and more accurate&lt;/a&gt;.
Secondly, if you're working in a well annotated genome chances are that "novel
transcripts" you find are more likely noise, or not biologically meaningful
(unless you know better for your use-case!).&lt;/p&gt;
&lt;h3&gt;Concluding Thoughts&lt;/h3&gt;
&lt;p&gt;The paper detailing that transcript abundances, when aggregated to gene level,
improve DEG analysis is particularly interesting. This makes me rethink my usual
assumption and I now believe that tools like Salmon or Kallisto should be the
go-to tools for DEG analysis when you have a good transcriptome to work with.&lt;/p&gt;
&lt;p&gt;However, I still think it's worthwhile to align your reads and generate a .bam
file. There are many types of visualizations and comparisons that you simply
can't do without them. For example, calculating coverage over features of
interest. If you must compare expression of genes across multiple samples or from
different experiments then you'll probably want to convert your expression values
to some normalized measurement. In this case you can use FPKM or TPM, though the
consensus seems to be that TPM is the way to go these days.&lt;/p&gt;
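&lt;p&gt;To make the FPKM versus TPM distinction concrete, here is a small sketch of the
two standard formulas in Python. The function names, counts, and gene lengths are
made up for illustration, and real tools typically use effective lengths rather
than the plain lengths assumed here.&lt;/p&gt;

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(counts) / 1e6
    return [
        count / (length / 1e3) / total_millions
        for count, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale."""
    rates = [count / (length / 1e3) for count, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [rate / scale for rate in rates]


# Hypothetical counts and gene lengths (in base pairs) for three genes.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values, sum(tpm_values))
```

&lt;p&gt;TPM values sum to one million in every sample by construction, while the sum of
FPKM values varies from sample to sample. That is the usual argument for TPM when
comparing across samples.&lt;/p&gt;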
&lt;p&gt;And, at the end of the day you know that an out-of-date collaborator is probably
going to ask you for FPKM measurements or something anyway.&lt;/p&gt;</content><category term="commentary"></category><category term="bioinformatics"></category><category term="thoughts"></category><category term="rnaseq"></category><category term="workflows"></category></entry><entry><title>Making Better Metaplots With ggplot, Part 2</title><link href="https://groverj3.codeberg.page/articles/2019-06-28_making-better-metaplots-with-ggplot-part-2.html" rel="alternate"></link><published>2019-06-28T00:00:00-04:00</published><updated>2019-06-28T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-06-28:/articles/2019-06-28_making-better-metaplots-with-ggplot-part-2.html</id><summary type="html">&lt;p&gt;&lt;a href="/articles/2019-06-27_making-better-metaplots-with-ggplot-part-1.html"&gt;Last time&lt;/a&gt; we
prepared our data using Deeptools.&lt;/p&gt;
&lt;p&gt;Now we're going to do something kind of scandalous. R and Python, living together
in peace. What is this madness? I like R's ecosystem for manipulating data and
plotting with the &lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt;. It still requires some
tweaking, but with a bit of …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="/articles/2019-06-27_making-better-metaplots-with-ggplot-part-1.html"&gt;Last time&lt;/a&gt; we
prepared our data using Deeptools.&lt;/p&gt;
&lt;p&gt;Now we're going to do something kind of scandalous. R and Python, living together
in peace. What is this madness? I like R's ecosystem for manipulating data and
plotting with the &lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt;. It still requires some
tweaking, but with a bit of a time investment you can have publication-ready
vector images in only a few lines of code. It's great for genomics data as well.
Some out there may prefer &lt;a href="https://matplotlib.org/"&gt;matplotlib&lt;/a&gt; in Python, and
it is powerful, but I find it kind of tedious to use without adding another
package on top like &lt;a href="https://seaborn.pydata.org/"&gt;Seaborn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Genomics and data science belong together just like Python and R!&lt;/p&gt;
&lt;h3&gt;1. Investigate Deeptools' Metaplot Output&lt;/h3&gt;
&lt;p&gt;Your first task with any new data is just to see what it looks like. In a
terminal my initial instinct is always to call:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;head&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But you can also just open the Deeptools metaplot table in your text editor of
choice. What you'll find is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;bin labels    -1.0Kb    ...    start    ...    end    ...    1.0Kb
bins        1.0 2.0 3.0 ...
sample_name genes score score score ...
sample_name genes score score score ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A tab-delimited table of bin labels, bin numbers, and scores (data to plot) for
each of those bins. This is a rather odd format because it's horizontal, rather
than the long format that would be more convenient. We also have a label "genes"
in position 2 of the same row as the score data. The bin labels row only has 4
values in it: the &lt;code&gt;--upstream&lt;/code&gt;, &lt;code&gt;--startLabel&lt;/code&gt;, &lt;code&gt;--endLabel&lt;/code&gt;, and
&lt;code&gt;--downstream&lt;/code&gt; values from the &lt;code&gt;computeMatrix&lt;/code&gt; step. We can work with this but
it's a bit unwieldy.&lt;/p&gt;
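&lt;p&gt;To make the horizontal layout concrete, here is a rough Python sketch that
transposes a miniature table of this shape into long format. The miniature table,
its scores, and the &lt;code&gt;to_long()&lt;/code&gt; name are hypothetical; the real file has
hundreds of bins and may differ in its details.&lt;/p&gt;

```python
import csv
import io
from itertools import zip_longest

# Hypothetical miniature of the layout: a sparse label row, a bin-number row,
# then one row per sample with a "genes" group label in the second position.
raw = (
    "bin labels\t\t-1.0Kb\tstart\t\tend\t1.0Kb\n"
    "bins\t\t1.0\t2.0\t3.0\t4.0\t5.0\n"
    "sample_1\tgenes\t0.5\t0.7\t0.9\t0.8\t0.4\n"
    "sample_2\tgenes\t0.2\t0.3\t0.6\t0.5\t0.1\n"
)


def to_long(text):
    """Return (sample, bin, label, score) tuples from a horizontal table."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # zip_longest turns rows into columns, padding any short row with "".
    columns = list(zip_longest(*rows, fillvalue=""))
    # First column holds the row names; second holds the "genes" group label.
    header, _group_row, *bin_columns = columns
    samples = header[2:]
    long_rows = []
    for label, bin_number, *scores in bin_columns:
        for sample, score in zip(samples, scores):
            long_rows.append((sample, float(bin_number), label, float(score)))
    return long_rows


long_table = to_long(raw)
for row in long_table[:3]:
    print(row)
```

&lt;p&gt;Each bin becomes one row per sample, and the mostly empty label row survives as
a column of mostly empty strings, which is what we want for axis labels later.&lt;/p&gt;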
&lt;h3&gt;2. Load Data Into R&lt;/h3&gt;
&lt;p&gt;Before getting started here make sure you have the tidyverse packages installed:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;tidyverse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There are no built-in functions to read a "transposed tsv" file like this, but
with a little &lt;a href="https://stackoverflow.com/questions/17288197/reading-a-csv-file-organized-horizontally"&gt;googling&lt;/a&gt;
this turned out to not be so bad. My original thought was to read it as a
standard .tsv file with &lt;code&gt;read_tsv()&lt;/code&gt; or base &lt;code&gt;read.csv()&lt;/code&gt; and transpose with
&lt;code&gt;t()&lt;/code&gt;, but these didn't like the data. This is because we need to keep that
first row exactly as it appears, even though most of it is technically empty:
we'll need the blank labels later. So, starting from that Stack Overflow post, I
edited a few things:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;read_deeptools_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;count.fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;na.rm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;readLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;.splitvar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;strsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;do.call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cbind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;lapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.splitvar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;collapse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;na.omit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read.csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;\t&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="m"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;,])&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# Remove first row with &amp;quot;gene&amp;quot; label&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Essentially, this function finds the length of the lines, reads the lines
as character vectors, splits the vectors on the tab character, and creates a
new table from the pieces. Reading the data this way gives us a nice data frame.
From here on I will be using the tidyverse packages, so feel free to load them
with &lt;code&gt;library('tidyverse')&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;table_test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_deeptools_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;metaplot.tab&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as_tibble&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# A tibble: 600 x 4&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;fct&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;fct&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;fct&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-1.0&lt;/span&gt;&lt;span class="n"&gt;Kb&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.7382198952879583&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.008900523560209424&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9565445026178011&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.007329842931937172&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9879581151832458&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.008376963350785341&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.8026178010471204&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.005235602094240838&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.7968586387434556&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.0031413612565445023&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.593717277486911&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0.005235602094240838&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.36230366492146604&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.004712041884816754&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5392670157068064&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9617801047120418&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.0020942408376963353&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.403664921465969&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0.01099476439790576&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;3. Convert the Data to Long Format&lt;/h3&gt;
&lt;p&gt;A quirk of ggplot is that it really likes long format data. Instead of
separate columns for the different samples, you end up with one column of
"scores" and another of "sample_id." This means that your sample ID is actually
a variable and can be mapped to an aesthetic such as color. The conversion
produces a new data frame which concatenates the current sample columns into
one, replicates bin.labels and bins as needed, and creates a new column with the
sample ID for each row. The easiest way to do this is with the
&lt;code&gt;gather()&lt;/code&gt; function in tidyr:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;sample&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can check out what this looks like as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;-1.0&lt;/span&gt;&lt;span class="n"&gt;Kb&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.7382198952879583&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9565445026178011&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.9879581151832458&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.8026178010471204&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.7968586387434556&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0.593717277486911&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="m"&gt;1195&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;595&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.005759162303664922&lt;/span&gt;
&lt;span class="m"&gt;1196&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;596&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;
&lt;span class="m"&gt;1197&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;597&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.006282722513089005&lt;/span&gt;
&lt;span class="m"&gt;1198&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;598&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.017277486910994764&lt;/span&gt;
&lt;span class="m"&gt;1199&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="m"&gt;599&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.020942408376963352&lt;/span&gt;
&lt;span class="m"&gt;1200&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="n"&gt;Kb&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.012565445026178013&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This would be annoying to work with by hand, but ggplot2 understands it just
fine.&lt;/p&gt;
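&lt;p&gt;As an aside, newer tidyr releases mark &lt;code&gt;gather()&lt;/code&gt; as superseded in
favor of &lt;code&gt;pivot_longer()&lt;/code&gt;. Below is a minimal sketch of the same
reshaping step, assuming a reasonably recent tidyr. The
&lt;code&gt;wide_example&lt;/code&gt; and &lt;code&gt;long_example&lt;/code&gt; names and the values in
them are made up for illustration and are not part of the original script:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;library(tidyr)

# Toy stand-in for the wide table (hypothetical values, stored as text)
wide_example &amp;lt;- data.frame(
  bin.labels = c('-1.0Kb', '', '1.0Kb'),
  bins = 1:3,
  sample_1 = c('0.74', '0.96', '0.99'),
  sample_2 = c('0.009', '0.007', '0.008'),
  stringsAsFactors = FALSE
)

# Stack every column except bin.labels and bins into sample/score pairs
long_example &amp;lt;- pivot_longer(
  wide_example,
  cols = -c(bin.labels, bins),
  names_to = 'sample',
  values_to = 'score'
)

# The scores are text in this toy table, so convert them before plotting
long_example$score &amp;lt;- as.numeric(long_example$score)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Unlike &lt;code&gt;gather()&lt;/code&gt;, &lt;code&gt;pivot_longer()&lt;/code&gt; keeps the rows for
each bin together, so the samples alternate instead of being stacked one after
the other. That makes no difference to ggplot2. If your score columns are
factors, convert them with &lt;code&gt;as.numeric(as.character(score))&lt;/code&gt;, since
calling &lt;code&gt;as.numeric()&lt;/code&gt; directly on a factor returns its level codes
rather than the values.&lt;/p&gt;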
&lt;h3&gt;4. Build the ggplot Command&lt;/h3&gt;
&lt;p&gt;Now that our data is in the right format it's time to get plotting! We'll start
simple, and make it more complex from there:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;scale_x_continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;breaks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will get us started with a simple line plot. The key here is that the
x-axis breaks are the bin numbers, but they are labeled with the bounds, start,
and end of the features. However, this also creates a major gridline at each
break. Not ideal. I have some sensible plot defaults that I use often, which I
have saved to a code snippet on my &lt;a href="https://codeberg.org/groverj3/genomics_visualizations/src/branch/master/ggplot2_pub_settings.r"&gt;Codeberg&lt;/a&gt;.
I'll use these as a starting point for the theming. Another feature we may want
to add is the ability to smooth the line. This can be accomplished by using:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;geom_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;loess&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will smooth the data with
&lt;a href="https://en.wikipedia.org/wiki/Local_regression"&gt;loess&lt;/a&gt; regression. The amount of
smoothing can be configured with the &lt;code&gt;span = ...&lt;/code&gt; parameter in &lt;code&gt;geom_smooth()&lt;/code&gt;;
larger values give a smoother curve.
You'll also want to control the size of the plot when it's saved, and perhaps
stretch or shrink its aspect ratio. ggplot2 handles the saved size through
&lt;code&gt;ggsave()&lt;/code&gt;, called once the plot is built, and the aspect ratio
through the theme. We also want the ability to specify the colors rather than
just using the defaults. It's best to use a colorblind-friendly palette when
possible. Putting this all together, our plot command becomes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;metaplot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;start_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;end_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y_axis_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="n"&gt;out_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;start_bin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;start_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;end_bin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;end_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;geom_smooth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;loess&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                                 &lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                                 &lt;/span&gt;&lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scale_color_manual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;strsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;scale_x_continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;breaks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                       &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;long_table&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;bin.labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;geom_vline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xintercept&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_bin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;end_bin&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;linetype&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;dotted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;ylab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_axis_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;xlab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Position&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;theme_bw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;22&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;legend.title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_blank&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;legend.position&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;bottom&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;legend.direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;horizontal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;legend.margin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;legend.box.margin&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;-10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;-10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;-10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;axis.text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;black&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;axis.ticks.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_blank&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;panel.grid.major.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_blank&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;panel.grid.minor.x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;element_blank&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;aspect.ratio&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;ggsave&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;paste0&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With a few if statements we actually can plot both the smoothed and line plots on
the same coordinate system if we want. Let's test it with some small RNA
expression data from my
&lt;a href="https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.13910"&gt;2018 paper&lt;/a&gt; over a
set of genomic features:&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="https://codeberg.org/groverj3/genomics_visualizations/raw/branch/master/metaplotteR.png" style="width:600px;height:429px;"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;I think we can agree that this is an improvement. This could still be improved by
showing error when replicates are plotted, but it's pretty good for now.&lt;/p&gt;
&lt;h3&gt;Wrapping up&lt;/h3&gt;
&lt;p&gt;While this required a little patience, I think the results are worth it. Creating
clean visualizations is necessary to get your point across. I've tidied this up
a bit and pushed the full code to github. Bonus: it runs as a standalone script
and works with any number of input samples!
&lt;a href="https://codeberg.org/groverj3/genomics_visualizations/src/branch/master/metaplotteR.r"&gt;Check it out!&lt;/a&gt;&lt;/p&gt;</content><category term="how-to"></category><category term="bioinformatics"></category><category term="data-visualization"></category></entry><entry><title>Making Better Metaplots With ggplot, Part 1</title><link href="https://groverj3.codeberg.page/articles/2019-06-27_making-better-metaplots-with-ggplot-part-1.html" rel="alternate"></link><published>2019-06-27T00:00:00-04:00</published><updated>2019-06-27T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-06-27:/articles/2019-06-27_making-better-metaplots-with-ggplot-part-1.html</id><summary type="html">&lt;p&gt;Commonly, in bioinformatics we're in the business of determining whether
something, be it gene expression, or DNA methylation, or splicing, etc. is
different between multiple conditions. Typically this would be done by comparing
those data and using some kind of statistical test. However, with the continued
advances in sequencing technologies …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Commonly, in bioinformatics we're in the business of determining whether
something, be it gene expression, or DNA methylation, or splicing, etc. is
different between multiple conditions. Typically this would be done by comparing
those data and using some kind of statistical test. However, with the continued
advances in sequencing technologies generating greater read depth, and these
technologies becoming more available to researchers, we can also look at
genome-scale data in other ways. Testing purely on count or score data does not
inform one of the positional information associated with that data.&lt;/p&gt;
&lt;p&gt;To look at the positional context associated with genomics data we have several
options. One common way is a visualization that's often referred to as a
"metaplot" or "metagene plot." These plots are similar to the TSS or "peak" plots
commonly used to visualize ChIP-seq or similar data. In a metaplot the entire
length of a feature is scaled such that each feature now is composed of the same
number of "bins" of data. This allows one to visualize the data associated with
these features across their entire length. There are existing software packages
that can make these plots without too much trouble such as 
&lt;a href="https://deeptools.readthedocs.io/en/stable/"&gt;Deeptools&lt;/a&gt; or the
&lt;a href="https://bioconductor.org/packages/release/bioc/html/genomation.html"&gt;Genomation&lt;/a&gt;
R library. In particular, I find Deeptools to be a great software package, and it
makes some wonderful visualizations that would be a pain to make yourself.
Genomation requires one to be very familiar with R since it isn't a standalone
program. Deeptools is easier to use but its metaplots leave something to be
desired:&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="/figures/2019-06-27_making-better-metaplots-with-ggplot-part-1/deeptools_example_meta.png" style="width:460px;height:275px;"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;I like the control I have over all plot elements and the professional look that
&lt;a href="https://ggplot2.tidyverse.org/"&gt;ggplot&lt;/a&gt; affords. I use it for most of my data
visualization needs. So, I figured, why not make something prettier with it?&lt;/p&gt;
&lt;h3&gt;1. Install Required Packages&lt;/h3&gt;
&lt;p&gt;This guide will use Deeptools, a Python package with a ton of functionality that
you can play around with later, and ggplot2 from the
&lt;a href="https://www.tidyverse.org/"&gt;tidyverse&lt;/a&gt;. The tidyverse is a collection of R
libraries designed by Hadley Wickham that make data science a snap. You can
install them as follows in a terminal:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--user&lt;span class="w"&gt; &lt;/span&gt;deeptools
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Launch the R interpreter by typing &lt;code&gt;R&lt;/code&gt; and then:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;tidyverse&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I recommend installing them into a user-specific library by either the &lt;code&gt;--user&lt;/code&gt;
flag for pip or setting up a .Renviron file with a path to a local library. You
can learn how to do that in my
&lt;a href="/articles/2019-06-25/managing-software-on-a-multiuser-linux-system.html"&gt;previous post&lt;/a&gt;.
You're also going to need &lt;a href="https://samtools.github.io/"&gt;samtools&lt;/a&gt;. Feel free to
use the package manager of your choice if conda is more your jam.&lt;/p&gt;
&lt;h3&gt;2. Generate the Data Table With Deeptools&lt;/h3&gt;
&lt;p&gt;Now that you've got the software installed you'll need to generate per-position
"score" information. If this is expression data or similar you can use Deeptools
again, but you should be able to use other inputs to the later steps as well. For
expression data, you can generate the scores from your bam file with Deeptools'
bamCoverage tool. First, you need to index the alignment .bam file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;samtools&lt;span class="w"&gt; &lt;/span&gt;index&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;input_bam&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;${}&lt;/code&gt; is the syntax for using a previously declared variable in BASH and I'll use
that kind of representation throughout for places where values should be
specified.&lt;/p&gt;
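&lt;p&gt;As a quick illustration of that syntax, here is a minimal sketch. The file name &lt;code&gt;sample1.bam&lt;/code&gt; is made up for the example. It shows a variable being declared and used, plus the &lt;code&gt;%&lt;/code&gt; form that strips a suffix, which is handy for deriving output names from an input file:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical input name, used only for illustration
input_bam="sample1.bam"

# ${var} expands to the value of a previously declared variable
echo "Input file: ${input_bam}"

# ${var%pattern} removes the shortest match of pattern from the end of the value
prefix="${input_bam%.bam}"
bigwig_name="${prefix}.bigWig"
echo "Output file: ${bigwig_name}"
```

&lt;p&gt;Running this prints the input name followed by &lt;code&gt;sample1.bigWig&lt;/code&gt;, the same naming pattern used for the output files in the commands that follow.&lt;/p&gt;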
&lt;p&gt;Now that you have that out of the way, your next step is to generate a coverage
file in bigWig format. This is a binary format but contains similar data to a
&lt;a href="https://genome.ucsc.edu/goldenPath/help/bedgraph.html"&gt;bedGraph&lt;/a&gt;. You can use
the bamCoverage tool:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;bamCoverage&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;threads_to_use&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--binSize&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--normalizeUsing&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RPKM&lt;/span&gt;&lt;span class="p"&gt;|CPM|BPM|RPGC&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--outFileFormat&lt;span class="w"&gt; &lt;/span&gt;bigwig&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--bam&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;input_bam&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--outFileName&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;input_bam&lt;/span&gt;&lt;span class="p"&gt;%.bam&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;.bigWig
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A &lt;code&gt;--binSize&lt;/code&gt; of 1 will just generate per-base coverage. This may be slow, and
you could increase the value if you wish. There are also other ways of generating
coverage/depth such as &lt;a href="https://github.com/brentp/mosdepth"&gt;mosdepth&lt;/a&gt; (a great
tool by Brent Pedersen). bamCoverage comes with Deeptools though, and is easy to get
running. The &lt;code&gt;--normalizeUsing&lt;/code&gt; option will let you normalize the coverage by
several methods, which is particularly useful for plotting multiple experiments
together at the end.&lt;/p&gt;
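&lt;p&gt;To see why normalization matters when plotting experiments together, here is a rough sketch of the arithmetic behind a counts-per-million style normalization. The read counts and library sizes are invented, and the &lt;code&gt;cpm&lt;/code&gt; helper is a hypothetical function for illustration, not Deeptools' implementation:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Made-up numbers for two libraries sequenced to different depths

cpm() {
    # $1 = reads overlapping a bin, $2 = total mapped reads in the library
    awk -v reads="$1" -v total="$2" 'BEGIN { printf "%.2f\n", reads * 1000000 / total }'
}

# Sample B has twice the raw reads in the bin, but also twice the library size
sample_a=$(cpm 50 20000000)
sample_b=$(cpm 100 40000000)

echo "Sample A: ${sample_a} CPM"
echo "Sample B: ${sample_b} CPM"
```

&lt;p&gt;Both samples come out at the same normalized value even though the raw counts differ twofold, which is what lets differently sequenced libraries share one axis.&lt;/p&gt;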
&lt;p&gt;Next, you'll need to generate a score matrix. In other words, a matrix of
coverages or other values of interest. This step can be done on any score data in
a bedGraph/bigWig file, even if you did not generate it with Deeptools. So, if
you're using data from a tool other than bamCoverage this is your starting point.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;computeMatrix&lt;span class="w"&gt; &lt;/span&gt;scale-regions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--startLabel&lt;span class="w"&gt; &lt;/span&gt;start&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--endLabel&lt;span class="w"&gt; &lt;/span&gt;end&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--upstream&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;base_pairs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--downstream&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;base_pairs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--regionBodyLength&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;scale_length&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--regionsFileName&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;regions_bed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--scoreFileName&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;input1_bigWig&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;input2_bigWig&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--outFileName&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;output_matrix&lt;/span&gt;&lt;span class="p"&gt;.gz&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
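&lt;p&gt;It can help to know how wide the resulting matrix will be before running this. The sketch below works through the arithmetic with hypothetical values standing in for the placeholders above. The bin size of 10 is an assumption for the example, so check &lt;code&gt;computeMatrix scale-regions --help&lt;/code&gt; for the value your version uses:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical values standing in for the placeholders in the command
upstream=1000
downstream=1000
region_body_length=2000
bin_size=10   # assumed for this example; confirm with computeMatrix --help

# Flanking regions and the scaled body are each divided into bins of bin_size
flank_bins=$(( (upstream + downstream) / bin_size ))
body_bins=$(( region_body_length / bin_size ))
total_bins=$(( flank_bins + body_bins ))

echo "Bins per feature, per sample: ${total_bins}"
```

&lt;p&gt;With two bigWig files as input, each feature would then have that many columns for each sample.&lt;/p&gt;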

&lt;p&gt;The &lt;code&gt;--startLabel&lt;/code&gt; and &lt;code&gt;--endLabel&lt;/code&gt; values can be changed as desired, but don't
forget them! The &lt;code&gt;--upstream&lt;/code&gt; and &lt;code&gt;--downstream&lt;/code&gt; values set how many base pairs of flanking sequence are included on either side of each feature. The
&lt;code&gt;--regionBodyLength&lt;/code&gt; is the value to which all features will be scaled. I suggest
using either the mean or median length of the features of interest. The regions
will be input as a .bed file, and the bigWig files that were generated in the
previous step will be used where indicated. Multiple files can be input,
space separated. You can specify that the matrix be gzipped by simply adding .gz
to the name of your output file. Now, the final step is to generate the plot and
also output the raw data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;plotProfile&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--startLabel&lt;span class="w"&gt; &lt;/span&gt;start&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--endLabel&lt;span class="w"&gt; &lt;/span&gt;end&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--averageType&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;|median&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--matrixFile&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;input_matrix&lt;/span&gt;&lt;span class="p"&gt;.gz&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--outFileName&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;metaplot&lt;/span&gt;&lt;span class="p"&gt;.svg&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--outFileNameData&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;metaplot&lt;/span&gt;&lt;span class="p"&gt;.tab&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This will generate a plot, but also output the table of per-bin values that were
plotted. I made this with it:&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="/figures/2019-06-27_making-better-metaplots-with-ggplot-part-1/deeptools_example_meta2.png" style="width:460px;height:275px;"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;I could play with Deeptools further, but the options for changing its aesthetics
are more limited than I'd like. In particular, smoothing the lines requires
smoothing the underlying data in the scoreMatrix step, which I am not a huge fan
of. Now, let's load that table into R and make something prettier in
&lt;a href="/articles/2019-06-28_making-better-metaplots-with-ggplot-part-2.html"&gt;Part 2&lt;/a&gt;.&lt;/p&gt;</content><category term="how-to"></category><category term="bioinformatics"></category><category term="data-visualization"></category></entry><entry><title>Managing Software on a Multiuser Linux System</title><link href="https://groverj3.codeberg.page/articles/2019-06-25_managing-software-on-a-multiuser-linux-system.html" rel="alternate"></link><published>2019-06-25T00:00:00-04:00</published><updated>2019-06-25T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-06-25:/articles/2019-06-25_managing-software-on-a-multiuser-linux-system.html</id><summary type="html">&lt;p&gt;When I started my Ph.D. I had a good amount of experience working in a Linux
environment on my own computers. Mostly as a hobby. My advisor had bought a small
server several years previous for a post-doc's project and I was offered this
system to use for my …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I started my Ph.D. I had a good amount of experience working in a Linux
environment on my own computers. Mostly as a hobby. My advisor had bought a small
server several years previous for a post-doc's project and I was offered this
system to use for my day-to-day work. It doesn't set any speed records, but it
&lt;em&gt;is&lt;/em&gt; a 24-thread system with 75GB of RAM and 12TB of storage. This makes it
perfect for running analyses that I wouldn't want to do on my laptop, but need to
be tweaked repeatedly and therefore are awkward to run on the university HPC. I
also use this server for jupyter notebooks and it still handles a few users at
a time well.&lt;/p&gt;
&lt;p&gt;Since this system was starting from a blank slate I decided to implement some
simple rules for system management. When I started out I was the only user, but
since then we've added several others and this plan has held up. This is going to
be heavily biased toward running a small server for computational work that's
shared between &amp;lt; 10 users, because that's what I do.&lt;/p&gt;
&lt;p&gt;These are ordered, but feel free to ignore that. They're really more like general
tips.&lt;/p&gt;
&lt;h3&gt;0. Run a Well-Supported (Popular) Linux Server Distro&lt;/h3&gt;
&lt;p&gt;I know, I know, I know. You may have a favorite Linux distribution. It might be
&lt;a href="https://getfedora.org/"&gt;Fedora&lt;/a&gt;, or &lt;a href="https://linuxmint.com/"&gt;Mint&lt;/a&gt;, or
&lt;a href="https://manjaro.org/"&gt;Manjaro&lt;/a&gt; (that's what I've been using). You might use
&lt;a href="https://www.archlinux.org/"&gt;Arch&lt;/a&gt;, you might be a &lt;a href="https://www.gentoo.org/"&gt;masochist&lt;/a&gt;,
or you may enjoy running something with an innovative package management system
like &lt;a href="https://www.gnu.org/software/guix/"&gt;Guix&lt;/a&gt; or &lt;a href="https://nixos.org/"&gt;NixOS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Maybe you just don't know why everyone uses this *nix stuff and don't know why
you can't just bioinformatics in Excel.&lt;/p&gt;
&lt;p&gt;&lt;center&gt;
&lt;img src="/images/clippy_bioinfo.png"&gt;
&lt;/center&gt;&lt;/p&gt;
&lt;p&gt;You're welcome to use something flashier, but I'd recommend sticking to Ubuntu
Server or CentOS. Fedora Server might also be a good choice, especially with
Ubuntu potentially not shipping 32-bit support in the future. For those with more
time or inclination to fiddle around, Debian would also make a good research
computing environment. The reason for this is that most software that's already
packaged will be either in .deb (Debian and derivatives, including Ubuntu) or .rpm
(Red Hat, Fedora, SUSE) format. Can you extract these packages and install them on
other systems? Sure. Are you going to want to do that every time you update
stuff? No.&lt;/p&gt;
&lt;p&gt;You also want to make sure that required libraries for software you may need to
compile are available without much fussing around straight from the repositories.
You'll have to do enough annoying things. Don't make this annoying.&lt;/p&gt;
&lt;h3&gt;1. Revoke Other Users' &lt;strong&gt;sudo&lt;/strong&gt; Privileges&lt;/h3&gt;
&lt;p&gt;This may seem obvious but you'd be surprised how many academic labs don't think
about this on their private server (if they have one). It's hard to overstate the
terrible time you'll have as a sysadmin if another one of your users types the
dreaded:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;rm&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;rm&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;/*
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's easy to forget that "." before the "/". &lt;/p&gt;
&lt;p&gt;Or, less catastrophically, that user may try installing software in a brittle way.
Meaning, you, the humble pseudo-sysadmin who's not actually getting paid for
sysadmin tasks, will have to spend time fixing it.&lt;/p&gt;
&lt;p&gt;All it takes is for you, the SUPER USER, the GOD OF THE SERVER, to run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;deluser&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;USERNAME&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sudo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Replace {USERNAME} with the user whose sudo access you want to revoke.&lt;/p&gt;
&lt;h3&gt;2. Don't Blindly Install Software From Your Distro's Repos&lt;/h3&gt;
&lt;p&gt;I did just say to pick a distro with lots of stuff in the repos, right? Yes, but
particularly in scientific/research computing you really really really can't
assume these repos are anything close to up-to-date. Don't be afraid to download
the source code and compile, or even easier, there is likely a prebuilt
binary release available on the project's github.&lt;/p&gt;
&lt;p&gt;As an example, if you're running the most recent LTS version of Ubuntu (18.04)
then the version of samtools available to you is v1.7 which is a year and a half
old at the time of writing. If you have control of the system, then at least try
to install the most recent stable versions of critical software.&lt;/p&gt;
&lt;h3&gt;3. Use an Easily Followed Convention for Manual Software Installation&lt;/h3&gt;
&lt;p&gt;When you need to download software and install it manually put it somewhere easy
to remember, and easy to find for others. I put manually installed software in
/opt/software_version and symlink the binaries to /usr/local/bin/. This way,
you quickly know what you have manually installed, and what version they are just
from the directory structure. You also make everything available in the $PATH and
runnable with just the program name.&lt;/p&gt;
&lt;p&gt;The worst thing that can happen is a broken symlink if you change software
versions, and that's an easy fix with a:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;ln&lt;span class="w"&gt; &lt;/span&gt;-s&lt;span class="w"&gt; &lt;/span&gt;/path/to/binary&lt;span class="w"&gt; &lt;/span&gt;/usr/local/bin
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
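
&lt;p&gt;The whole convention can be tried out safely in a throwaway sandbox before touching the real /opt. This is only a sketch: &lt;code&gt;mytool&lt;/code&gt; and its version numbers are made up, and the temporary directories stand in for /opt and /usr/local/bin so no sudo is needed. Note the &lt;code&gt;-f&lt;/code&gt; flag, which lets &lt;code&gt;ln&lt;/code&gt; replace an existing link when you switch versions:&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -eu

# Sandbox directories standing in for /opt and /usr/local/bin
sandbox=$(mktemp -d)
opt_dir="${sandbox}/opt"
bin_dir="${sandbox}/usr/local/bin"
mkdir -p "${opt_dir}/mytool_1.0" "${opt_dir}/mytool_1.1" "${bin_dir}"

# Empty placeholder "binaries" for two versions of a made-up tool
touch "${opt_dir}/mytool_1.0/mytool" "${opt_dir}/mytool_1.1/mytool"
chmod +x "${opt_dir}/mytool_1.0/mytool" "${opt_dir}/mytool_1.1/mytool"

# Initial install: link version 1.0 into the bin directory
ln -s "${opt_dir}/mytool_1.0/mytool" "${bin_dir}/mytool"

# Upgrade: -f replaces the existing link instead of failing
ln -sf "${opt_dir}/mytool_1.1/mytool" "${bin_dir}/mytool"

current=$(readlink "${bin_dir}/mytool")
echo "mytool now points to: ${current}"

# Listing the opt directory shows every installed version at a glance
ls "${opt_dir}"

# Clean up the sandbox
rm -rf "${sandbox}"
```

&lt;p&gt;The directory listing is the point of the convention: the names alone tell you what is installed and at which version.&lt;/p&gt;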

&lt;h3&gt;4. Encourage Users To Test Software in ~/bin&lt;/h3&gt;
&lt;p&gt;Create a private bin directory inside each user's home folder. This is often
pre-configured in each user's path. If not you'll need to add it to each user's
.bashrc or .profile or .bash_profile, depending on which is the preferred method
for your distro:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# set PATH so it includes user&amp;#39;s private bin if it exists&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/bin&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;then&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/bin:&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let your users test and if multiple people need it, or they're running something
all the time, then you can install it system-wide in /opt.&lt;/p&gt;
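&lt;p&gt;If you're not sure whether a user's private bin directory is already picked up, a small check like the following can help. It's a sketch, and &lt;code&gt;on_path&lt;/code&gt; is a hypothetical helper name rather than a standard command:&lt;/p&gt;

```shell
#!/usr/bin/env bash

# Report whether a directory is already one of the PATH entries
on_path() {
    # Wrapping PATH in colons lets the pattern match whole entries only
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if on_path "${HOME}/bin"; then
    echo "${HOME}/bin is already on the PATH"
else
    echo "${HOME}/bin is not on the PATH yet"
fi
```

&lt;p&gt;Run it in a fresh login shell for the user in question, since the PATH is set when the shell starts.&lt;/p&gt;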
&lt;h3&gt;5. Encourage Python Users to Set Up &lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Linux systems use Python under the hood a lot. Much of the system depends on
python, and your distro's package manager has already likely installed many
python packages. However, these versions are likely old and frozen at the version
number that shipped with the OS. I dislike running software that is &lt;em&gt;years&lt;/em&gt; out
of date. Python's package management with pip is kind of a mess and it doesn't
know which packages are needed by the system, and which are installed with it.
This is improving over time, but it's still not good.&lt;/p&gt;
&lt;p&gt;To avoid this, users should install the most recent stable version of Python.
Pyenv gives you a relatively easy and very lightweight way to do this. It also
allows the system packages to coexist peacefully in the root directory so it's
harder to break things. Plus, the users get the latest Python features.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv github&lt;/a&gt; has relatively easy to follow
instructions.&lt;/p&gt;
&lt;h3&gt;6. Use User-specific Language Libraries/Packages&lt;/h3&gt;
&lt;p&gt;This pops up for us with both python and R. It boils down to never, ever, using:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;PACKAGE_NAME&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sudo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;
&lt;span class="nf"&gt;install.packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;PACKAGE_NAME&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If users can't use sudo there's no danger here anyway, but using user-specific
libraries and packages keeps things consistent. It also means that, once again,
you don't have to manage something. The following will solve this for Python:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--user&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;PACKAGE_NAME&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This installs packages to ~/.local/lib/python{VERSION}/site-packages &lt;/p&gt;
&lt;p&gt;R requires a bit more doing. To create a user-library I recommend creating a
.Renviron in each user's home directory and adding the following to it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# .Renviron is read every time a new R session is started&lt;/span&gt;
&lt;span class="c1"&gt;# Use .Renviron to set environment variables for R&lt;/span&gt;

&lt;span class="c1"&gt;# Use the local R library&lt;/span&gt;
&lt;span class="n"&gt;R_LIBS_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;~/.local/lib/R/site-library&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
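
&lt;p&gt;One detail worth handling up front: as far as I know, R only adds the user library to its search path if the directory already exists, so it's worth creating it alongside the .Renviron. The sketch below does both in a temporary directory standing in for a user's home. The &lt;code&gt;home_dir&lt;/code&gt; variable is made up for the example, so swap in the real home directory when doing this for an actual user:&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -eu

# Temporary directory standing in for a user's home directory
home_dir=$(mktemp -d)

# Create the library directory so R can use it on the next session
mkdir -p "${home_dir}/.local/lib/R/site-library"

# Append the setting to .Renviron (tee -a also echoes the line that was written)
echo 'R_LIBS_USER="~/.local/lib/R/site-library"' | tee -a "${home_dir}/.Renviron"
```

&lt;p&gt;Afterwards, running &lt;code&gt;.libPaths()&lt;/code&gt; in a new R session is a quick way to confirm the user library is listed.&lt;/p&gt;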

&lt;h3&gt;Wrapping up&lt;/h3&gt;
&lt;p&gt;In summary, administering a small multi-user system doesn't have to be
complicated. You do want to minimize the ability for your users to break things
though. By no means is this an exhaustive guide, but it might help you out if
you're wondering where to start.&lt;/p&gt;
&lt;p&gt;Despite the proliferation of HPC systems at universities, and cloud computing in
enterprise environments, a smaller server for your research group is still a
good investment in 2019. Submitting jobs to a queue is fine when you're not doing
iterative work, but if you want to quickly test things it gets old really quick.
Likewise, you can easily get hardware on par with a remote VM and it's more
readily accessed.&lt;/p&gt;
&lt;p&gt;Tune in next time for something more bioinformatics-focused!&lt;/p&gt;</content><category term="how-to"></category><category term="sysadmin"></category></entry><entry><title>Setting up a Static Site With Pelican and GitHub Pages</title><link href="https://groverj3.codeberg.page/articles/2019-06-15_setting-up-a-static-site-with-pelican-and-github-pages.html" rel="alternate"></link><published>2019-06-15T00:00:00-04:00</published><updated>2019-06-15T00:00:00-04:00</updated><author><name>Jeffrey Grover</name></author><id>tag:groverj3.codeberg.page,2019-06-15:/articles/2019-06-15_setting-up-a-static-site-with-pelican-and-github-pages.html</id><summary type="html">&lt;p&gt;In an effort to aid in my future job searching I decided I needed a
personal/professional website. It needed to look good, contain links to my
relevant social and job-search profiles, host some examples of work from my Ph.D.
, showcase my skillset, and host my CV. GitHub pages …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In an effort to aid in my future job searching I decided I needed a
personal/professional website. It needed to look good, contain links to my
relevant social and job-search profiles, host some examples of work from my Ph.D.,
showcase my skillset, and host my CV. GitHub pages seemed like a natural fit,
since I already share most of my work there. GitHub recommends static site
generation with Jekyll, which I've seen to be a fine way to do that, and they
have integrated tools for working with it. However, I mostly write python
day-to-day (and R) and the idea of using a ruby-based framework for this just
seemed silly to me. So, stubborn as I am, I decided to embark on a quest to use a
python-based alternative. Pelican seemed to be the most actively developed, so I
went ahead with that.&lt;/p&gt;
&lt;p&gt;The issue I ran into is that many of the guides were unnecessarily complicated, 
or didn't contain information for my particular use-case. So, I've compiled here
the steps I used to generate this site, in the hope that it will help others.&lt;/p&gt;
&lt;h3&gt;Note Before Starting&lt;/h3&gt;
&lt;p&gt;Before starting here, I would like to mention that I will not be recommending
using a virtual environment. Why? This is overkill for a simple static site/blog
and adds unnecessary complication to a process that doesn't need to be hard at
all. These sorts of instructions are useful for more advanced deployments, and if
you need them then you probably don't need a guide as simplified as this anyway.&lt;/p&gt;
&lt;p&gt;Personally, I &lt;strong&gt;do&lt;/strong&gt; use &lt;a href="https://github.com/pyenv/pyenv"&gt;pyenv&lt;/a&gt; to manage python
installations on my home and work computers, as well as our lab server. It makes
my life much easier. But it is not &lt;strong&gt;required&lt;/strong&gt;, so I won't be going over it here.&lt;/p&gt;
&lt;p&gt;I still recommend installing python packages at the user level though. Mostly a
*nix/macOS thing, I'm pretty sure Windows peeps can ignore this. This will be
explained where relevant.&lt;/p&gt;
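&lt;p&gt;If you're curious where a user-level install actually ends up, Python can tell
you. This is only a small sketch using the standard library's &lt;code&gt;site&lt;/code&gt; module; the
exact paths it prints depend on your OS and Python build, so treat the output as
informational rather than something this guide relies on.&lt;/p&gt;

```python
import site
import sys

# Base directory for per-user installs (the target of `pip3 install --user`).
user_base = site.getuserbase()

# The site-packages directory for per-user installs, where libraries land.
user_site = site.getusersitepackages()

print("Python version:", sys.version.split()[0])
print("User base:", user_base)
print("User site-packages:", user_site)
```

&lt;p&gt;On most Linux setups, command-line tools installed this way typically go in a
&lt;code&gt;bin&lt;/code&gt; folder under the user base, so that folder needs to be on your PATH.&lt;/p&gt;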
&lt;p&gt;This guide will be GNU/Linux centric and I'm not apologizing for it :)&lt;/p&gt;
&lt;p&gt;Don't use Python 2.x, it's 2019. I'm writing this with Ubuntu and derivatives in
mind, so I will specify python3 throughout, since I believe &lt;code&gt;python&lt;/code&gt; still points
to 2.7 there.&lt;/p&gt;
&lt;h3&gt;Step-by-step Instructions Start Here:&lt;/h3&gt;
&lt;p&gt;Like I said above, there are existing guides for this. However, most of them
recommend installing what amounts to extreme overkill for simple GitHub pages.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Have Python and PIP Installed&lt;/p&gt;
&lt;p&gt;If you're on Linux then congrats, you've already got it. If not, consult the
docs at &lt;a href="https://www.python.org/"&gt;python.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;However, you may not have pip, the python package manager. To check, run:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;which pip3&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;If it doesn't point to an executable then you'll need to run (Ubuntu-based):&lt;/p&gt;
&lt;p&gt;&lt;code&gt;sudo apt install python3-pip&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;For other distros/package managers or macOS with homebrew consult the docs to
get the specific commands. These will likely require superuser/administrator
privileges.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install Pelican&lt;/p&gt;
&lt;p&gt;On any *nix or macOS machine the following should do the trick in a
terminal:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip3 install --user pelican ghp-import Markdown typogrify&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--user&lt;/code&gt; flag installs Pelican to your home directory, so it doesn't require
super-user/administrator privileges. &lt;code&gt;ghp-import&lt;/code&gt; will allow you to push
directly to github.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a New GitHub Repo and Clone&lt;/p&gt;
&lt;p&gt;It's perfectly fine to use the web interface to create a new repo, so go to
your github homepage and create a new repository with the name:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{username}.github.io&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This is important, and will allow you to access your site at
{username}.github.io rather than needing extra bits on the end of your github
url. Initializing with a README and LICENSE is up to you! May I recommend the
MIT license for simplicity and FOSSness?&lt;/p&gt;
&lt;p&gt;Go to your desired dev folder on your machine, I keep all github projects in
"~/Github/{project_name}", and clone:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cd {project_folder} &amp;amp;&amp;amp; git clone {repo.git}&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run Pelican Quickstart in Your Repo Directory&lt;/p&gt;
&lt;p&gt;Pelican comes with a handy quickstart script, though it's not terribly well
documented. My settings were as follows. Only non-defaults are listed (for
defaults just push enter):&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pelican-quickstart&lt;/code&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Do you want to specify a URL prefix? Y (Followed by: 
{https://{username}.github.io})&lt;br&gt; 
What is your time zone? {insert local timezone here}&lt;br&gt;
Do you want to upload your website using GitHub Pages? Y&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This will create the skeleton of your page, and allow you to start adding
content! Other things can be changed later in your config files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a First Post&lt;/p&gt;
&lt;p&gt;By default, things created in the top level of your "content" directory are
turned into blog posts. Don't ask me why this is the default, I don't like it.
However, this can be changed/hacked around later. For now create a file called
&lt;code&gt;test.md&lt;/code&gt; in "content". Add the following to it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Title: This is a Blog Post!&lt;br&gt;
Date: 2019-06-15&lt;br&gt;
Category: Article&lt;/p&gt;
&lt;p&gt;Hello World!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generate Your Site&lt;/p&gt;
&lt;p&gt;There are several ways to do this, this is the simplest:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make devserver&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This command starts a dev server, which automatically updates the generated
content in real-time. So you can edit and preview simultaneously. Point your
web browser of choice to &lt;code&gt;localhost:8000&lt;/code&gt; and take a look!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a Static Page&lt;/p&gt;
&lt;p&gt;By default, things in the top level of "content" are blog posts (configurable), but
you'll probably want some static pages that are always linked to and don't
contain blog content. For that, without stopping the devserver, create a
new folder inside the "content" subdirectory called "pages":&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mkdir ./content/pages&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Create a new markdown document in there called "about.md":&lt;/p&gt;
&lt;p&gt;&lt;code&gt;touch ./content/pages/about.md&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Fill this with the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Title: About&lt;br&gt;
Date: 2019-06-14&lt;/p&gt;
&lt;p&gt;Hello world! This is a test, using Pelican to create a github pages site.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Preview Your New Pages&lt;/p&gt;
&lt;p&gt;Go back to your browser, which should have been running the whole time, and
refresh on &lt;code&gt;localhost:8000&lt;/code&gt;. You should now see options to go to a new page
called "about." That's it! Easy peasy!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generate Your Content and Push&lt;/p&gt;
&lt;p&gt;Kill the devserver with ctrl + c. Run the following in your root directory:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make html&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This is probably unnecessary, but in case the devserver wasn't working
correctly, then this ensures you will have no issues.&lt;/p&gt;
&lt;p&gt;Next, run the following to push to github:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;make github&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This will ask for your github username and password, then push to your
repo.&lt;/p&gt;
&lt;p&gt;Now direct your browser to:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;https://{username}.github.io&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Your site should be visible now!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Push Your Source Code to a New Branch&lt;/p&gt;
&lt;p&gt;This method of pushing creates a problem from a dev standpoint. It will write
over all content in your repo every time. Plus, it only writes the rendered
site there. If you want to work on things off this same machine, you're going
to want to push the source code. Fortunately, there's an easy workaround for
this.&lt;/p&gt;
&lt;p&gt;Go to the GitHub web interface and create a new branch called "source". This
will copy all current content to it, which is just the rendered page. Now,
back in your development folder, copy all content from your repo's folder
elsewhere (non-hidden stuff only). Then, open a terminal and type:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git checkout source&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This switches you to the source branch. It also replaces the contents of your
folder with only the rendered content.&lt;/p&gt;
&lt;p&gt;Delete the contents again and replace with your copy of the source code.
Now enter:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git add . &amp;amp;&amp;amp; git commit -m "Pushed source" &amp;amp;&amp;amp; git push -f origin source&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This will force a push to the source branch. Technically you don't need the
"origin source" since you've checked it out, but I include it for extra safety since
we're already doing something that is frowned on. This will totally overwrite your
site's content with the source code used to generate it. But only on that
branch. Now you can push using the &lt;code&gt;make github&lt;/code&gt; command, which defaults to
the master branch, when you want to publish, and push with &lt;code&gt;git push origin
source&lt;/code&gt; when you want to update the source code.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
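&lt;p&gt;Steps 5 and 7 mention that where posts and pages live is configurable. As a rough
sketch, these are the settings in &lt;code&gt;pelicanconf.py&lt;/code&gt; that control it. The author,
site name, and time zone below are hypothetical placeholders, and your generated
file may differ depending on your Pelican version and quickstart answers.&lt;/p&gt;

```python
# pelicanconf.py (sketch only; placeholder values are hypothetical)
AUTHOR = "Your Name"
SITENAME = "My Site"

# Usually left empty here so local previews use relative links; the real URL
# for publishing is expected to live in publishconf.py.
SITEURL = ""

# Folder that Pelican reads source files from.
PATH = "content"

# Paths below are relative to PATH. These match Pelican's documented defaults:
# anything in the top level of "content" is a post, "content/pages" holds pages.
ARTICLE_PATHS = [""]
PAGE_PATHS = ["pages"]

# To keep posts in their own subfolder instead, one option would be:
# ARTICLE_PATHS = ["articles"]

TIMEZONE = "America/New_York"
DEFAULT_LANG = "en"
```

&lt;p&gt;If you change these while the devserver is running, restarting it is the safest way
to make sure the new settings are picked up.&lt;/p&gt;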
&lt;h3&gt;Final Thoughts&lt;/h3&gt;
&lt;p&gt;You're now done! And you can switch between branches to see the source and output
from Pelican's rendering. I'll make another post later to cover some more
configuration details. Until then, the &lt;a href="https://docs.getpelican.com"&gt;docs&lt;/a&gt; are a
wonderful resource.&lt;/p&gt;</content><category term="how-to"></category><category term="tutorial"></category><category term="pelican"></category></entry></feed>