Explore the Human Genome with the SciPy stack

Bioinformatics is an interdisciplinary field that develops methods and software tools for analyzing and understanding biological data.

More simply stated, you can simply think of it as data science for biology.

Among the many types of biological data, genomics data is one of the most widely analyzed. Especially with the rapid advancement of next-generation DNA sequencing (NGS) technologies, the volume of genomics data has been growing exponentially. According to Stephens, Zachary D et al., genomics data acquisition is on the exabyte-per-year scale.

n this post, I demo an example of analyzing a GFF3 file for the human genome with the SciPy Stack. Generic Feature Format Version 3 (GFF3) is the current standard text file format for storing genomic features. In particular, in this post you will learn how to use the SciPy stack to answer the following questions about the human genome:

  1. How much of the genome is incomplete?
  2. How many genes are there in the genome?
  3. How long is a typical gene?
  4. What does gene distribution among chromosomes look like?

The latest GFF3 file for the human genome can be downloaded from here. The README file that comes in this directory provides a brief description of this data format, and a more thorough specification is found here.

We will use Pandas, a major component of the SciPy stack providing fast, flexible, and expressive data structures, to manipulate and understand the GFF3 file.


First things first, let’s setup a virtual environment with the SciPy stack installed. This process can be time-consuming if built from the source manually, as the stack involves many packages – some of which depends on external FORTRAN or C code. Here, I recommend using Miniconda, which makes the setup very easy.

The -b flag on the bash line tells it to execute in batch mode. After the commands above are used to successfully install Miniconda, start a new virtual environment for genomics, and then install the SciPy stack.

Note that we have only specified the 3 packages we are going to use in this post. If you want all the packages listed in the SciPy stack, simply append them to the end of the conda create command.

If you are unsure of the exact name of a package, try conda search. Let’s activate the virtual environment and start IPython.

IPython is a significantly more powerful replacement to the default Python interpreter interface, so whatever you used to do in the default python interpreter can also be done in IPython. I highly recommend every Python programmer, who hasn’t been using IPython yet, to give it a try.

Download the Annotation File

With our setup now completed, let’s download the human genome annotation file in GFF3 format.

It is about 37 MB, a very small file compared to the information content of a human genome, which is about 3 GB in plain text. That’s because the GFF3 file only contains the annotation of the sequences, while the sequence data is usually stored in another file format called FASTA. If you are interested, you can download FASTA here, but we won’t use the sequence data in this tutorial.

The prefixed ! tells IPython that this is a shell command instead of a Python command. However, IPython can also process some frequently used shell commands like lspwdrmmkdirrmdir even without a prefixed !.

Taking a look at the head of the GFF file, you will see many metadata/pragmas/directives lines starting with ## or #!.

According to the README,  ## means the metadata is stable, while #! means it’s experimental.

Later on you will also see ###, which is another directive with yet more subtle meaning based on the specification.

Human readable comments are supposed to be after a single #. For simplicity, we will treat all lines starting with # as comments, and simply ignore them during our analysis.

The first line indicates that the version of GFF format used in this file is 3.

Following that are summaries of all sequence regions. As we will see later, such information can also be found in the body part of the file.

The lines starting with #! shows information about the particular build of the genome, GRCh38.p7, that this annotation file applies to.

Genome Reference Consortium (GCR) is an international consortium, which oversees updates and improvements to several reference genome assemblies, including those for human, mouse, zebrafish, and chicken.

Scanning through this file, here are the first few annotation lines.

The columns are seqidsourcetypestartendscorestrandphaseattributes. Some of them are very easy to understand. Take the first line as an example:

This is the annotation of the first chromosome with a seqid of 1, which starts from the first base to the 24,895,622nd base.

In other words, the first chromosome is about 25 million bases long.

Our analysis won’t need information from the three columns with a value of . (i.e. score, strand, and phase), so we can simply ignore them for now.

The last attributes column says Chromosome 1 also has three alias names, namely CM000663.2, chr1, and NC_000001.11. That’s basically what a GFF3 file looks like, but we won’t inspect them line by line, so it’s time to load the whole file into Pandas.

Pandas is good fit for dealing with GFF3 format because it is a tab-delimited file, and Pandas has very good support for reading CSV-like files.

Note one exception to the tab-delimited format is when the GFF3 contains ##FASTA .

According to the specification##FASTA indicates the end of an annotation portion, which will be followed with one or more sequences in FASTA (a non tab-delimited) format. But this is not the case for the GFF3 file we’re going to analyse.

The last line above loads the entire GFF3 file with pandas.read_csv method.

Since it is not a standard CSV file, we need to customize the call a bit.

First, we inform Pandas of the unavailability of header information in the GFF3 with header=None, and then we specify the exact name for each column with names=col_names.

If the names argument is not specified, Pandas will use incremental numbers as names for each column.

sep='\t' tells Pandas the columns are tab-separated instead of comma-separated. The value to sep= can actually be a regular expression (regex). This becomes handy if the file at hand uses different separators for each column (oh yeah, that happens). comment='#' means lines starting with # are considered comments and will be ignored.

compression='gzip' tells Pandas that the input file is a gzip-compressed.

In addition, pandas.read_csv has a rich set of parameters that allows different kinds of CSV-like file formats to be read.

The type of the returned value is a DataFrame, which is the most important data structure in Pandas, used for representing 2D data.

Pandas also has a Series and Panel data structure for 1D and 3D data, respectively. Please refer to the documentation for an introduction to Pandas’ data structures.

Let’s take a look at the first few entries with .head method.

The output is nicely formatted in a tabular format with longer strings in the attributes column partially replaced with ....

You can set Pandas to not omit long strings with pd.set_option('display.max_colwidth', -1). In addition, Pandas has many optionsthat can be customized.

Next, let’s get some basic information about this dataframe with the .info method.

This shows that the GFF3 has 2,601,848 annotated lines, and each line has nine columns.

For each column, it also shows their data types.

That start and end are of int64 type, integers representing positions in the genome.

The other columns are all of type object, which probably means their values consist of a mixture of integers, floats, and strings.

The size of all the information is about 178.7+ MB stored in memory. This turns out to be more compact than the uncompressed file, which will be about 402 MB. A quick verification is shown below.

From a high-level view, we have loaded the entire GFF3 file into a DataFrame object in Python, and all of our following analysis will be based on this single object.

Now, let’s see what the first column seqid is all about.

df.seqid is one way to access column data from a dataframe. Another way is df['seqid'], which is more general syntax, because if the column name is a Python reserved keyword (e.g. class) or contains a . or space character, the first way (df.seqid) won’t work.

The output shows that there are 194 unique seqids, which include Chromosomes 1 to 22, X, Y, and mitochondrion (MT) DNA as well as 169 others seqids.

The seqids starting with KI and GL are DNA sequences – or scaffolds – in the genome that have not been successfully assembled into the chromosomes.

For those who are unfamiliar with genomics, this is important.

Although the first human genome draft came out more than 15 years ago, the current human genome is still incomplete. The difficulty in assembling these sequences is largely due to complex repetitive regions in the genome.

Next, let’s take a look at the source column.

The README says that the source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature.

This is an example of the use of the value_counts method, which is extremely useful for a quick count of categorical variables.

From the result, we can see that there are seven possible values for this column, and the majority of entries in the GFF3 file come from havana, ensembl and ensembl_havana.

You can learn more about what these sources mean and the relationships between them in this post.

To keep things simple, we will focus on entries from sources GRCh38, havana, ensembl, and ensembl_havan.a.

How Much of the Genome Is Incomplete?

The information about each entire chromosome is in the entries from source GRCh38, so let’s first filter out the rest, and assign the filtered result to a new variable gdf.

Filtering is easy in Pandas.

If you inspect the value evaluated from the expression df.source == 'GRCh38', it’s a series of True and False values for each entry with the same index as df. Passing it to df[] will only return those entries where their corresponding values are True.

There are 194 keys in df[] for which df.source == 'GRCh38'.

As we’ve seen previously, there are also 194 unique values in the seqid column, meaning each entry in gdf corresponds to a particular seqid.

Then we randomly select 10 entries with the sample method to take closer look.

You can see that the unassembled sequences are of type supercontig while the others are of chromosome. To compute the fraction of genome that’s incomplete, we first need to know the length of the entire genome, which is the sum of the lengths of all seqids.

In the above snippet first, we made a copy of gdf with .copy(). Otherwise, the original gdf is just a slice of df, and modifying it directly would result in SettingWithCopyWarning (see here for more details).

We then calculate the length of each entry and add it back to gdf as a new column named “length”. The total length turns out to be about 3.1 billion, and the fraction of unassembled sequences is about 0.37%.

Here is how the slicing works in the last two commands.

First, we create a list of strings that covers all seqids of well assembled sequences, which are all chromosomes and mitochondria. We then use the isin method to filter all entries whose seqid are in the chrs list.

A minus sign (-) is added to the beginning of the index to reverse the selection, because we actually want everything that is not in the list (i.e. we want the unassembled ones starting with KI and GL)…

Note: Since the assembled and unassembled sequences are distinguished by the type column, the last line can alternatively be rewritten as follows to obtain the same results.

How Many Genes Are There?

Here we focus on the entries from source ensembl, havana and ensembl_havana since they’re where the majority of the annotation entries belong.

The isin method is used again for filtering.

Then, a quick value count shows that the majority of the entries are exon, coding sequence (CDS), and untranslated region (UTR).

These are sub-gene elements, but we are mainly looking for the gene count. As shown, there are 42,470, but we want to know more.

Specifically, what are their names, and what do they do? To answer these questions, we need to look closely at the information in the attributes column.

They are formatted as semicolon-separated list of tag-value pairs. The information we are most interested in is gene name, gene ID and description, and we will extract them with regular expression (regex).

First, we extract the gene names.

In the regex Name=(?P<gene_name>.+?); , +? is used instead of + because we want it to be non-greedy and let the search stop at the first semicolon; otherwise, the result will match up to the last semicolon.

Also, the regex is first compiled with re.compile instead of being used directly as in re.search for better performance because we will apply it to thousands of attribute strings.

extract_gene_name serves as a helper function to be used in pd.apply, which is the method to use when a function needs to be applied on every entry of a dataframe or series.

In this particular case, we want to extract the gene name for every entry in ndf.attributes, and add the names back to ndf in a new column called gene_name.

Gene IDs and description are extracted in a similar way.

The regex for RE_GENE_ID is a bit more specific since we know that every gene_idmust start with ENSG, where ENS means ensembl and G means gene.

For entries that don’t have any description, we will return an empty string. After everything is extracted, we won’t use the attributes column anymore, so let’s drop it to keep things nice and clean with the method .drop:

In the above call, attributes indicates the specific column we want to drop.

axis=1 means we are dropping a column instead of a row (axis=0 by default).

inplace=True means that the drop is operated on the DataFrame itself instead of returning a new copy with specified column dropped.

A quick .head look shows that the attributes column is indeed gone, and three new columns: gene_namegene_id, and desc have been added.

Out of curiosity, let’s see if all gene_id and gene_name are unique:

Surprisingly, the number of gene names is smaller than that of gene IDs, indicating that some gene_name must correspond to multiple gene IDs. Let’s find out what they are.

We group all entries by the value of gene_name, then count the number of items in each group with .count().

If you inspect the output from ndf.groupby('gene_name').count(), all columns are counted for each group, but most of them have the same values.

Note that NA values won’t be considered when counting, so only take the count of the first column, seqid ( we use .ix[:, 0] to ensure that there are no NA values).

Then sort the count values with .sort_values and reverse the order with .ix[::-1].

In the result, a gene name can be shared with up to seven gene IDs.

A closer look at all the SCARNA20 genes shows that they’re indeed all different.

While they share the same name, they are located at different positions of the genome.

Their descriptions, however, don’t seem very helpful in distinguishing them.

The point here is to know that gene names are not unique for all gene IDs, and about 0.15% of the names that are shared by multiple genes.

How Long Is a Typical Gene?

Similar to what we did when we were trying to understand the incompleteness of the genome, we can easily add a length column to ndf:

.describe() calculates some simple statistics based on the length values:

  • Mean length of a gene is about 36,000 bases
  • Median length of a gene is about 5,200 bases long
  • Minimum and maximum gene lengths are about eight and 2.3 million bases long, respectively.

Because the mean is much larger than the median, it implies that length distribution is skewed to the right. To have a more concrete look, let’s plot the distribution.

Pandas provides a simple interface to matplotlib to make plotting very handy with DataFrames or series.

In this case, it says that we want a histogram plot (kind='hist') with 50 bins, and let the y axis be on a log scale (logy=True).

From the histogram, we can see that the majority of genes are within the first bin. However, some gene lengths can be more than two million bases. Let’s find out what they are:

As you can see, the longest gene is named CNTNAP2, which is short for contactin associated protein-like 2. According to its wikipedia page,

This gene encompasses almost 1.6% of chromosome 7 and is one of the largest genes in the human genome.

Indeed! We just verified that ourselves. In contrast, what about the smallest genes? It turns out that they can be as short as eight bases.

The lengths of the two extreme cases are five orders of magnitude apart (2.3 million vs. 8), which is enormous and which can be an indication of the level of diversity of life.

A single gene can be translated to many different proteins via a process called alternative splicing, something we haven’t explored. Such information is also inside the GFF3 file, but outside the scope of this post.

Gene Distribution Among Chromosomes

The last thing I’d like to discuss is gene distribution among chromosomes, which also serves as an example for introducing the .merge method for combining two DataFrames. Intuitively, longer chromosomes likely host more genes. Let’s see if that is true.

We borrowed the chrs variable from the previous section, and used it to filter out the unassembled sequences. Based on the output, the largest Chromosome 1 indeed has the most genes. While Chromosome Y has the smallest number of genes, it is not the smallest chromosome.

Note that there seem to be no genes in the mitochondrion (MT), which is not true.

A bit more filtering on the first DataFrame df returned by pd.read_csv shows that all MT genes are from source insdc (which were filtered out before when generating edf where we only considered sources of havana, ensembl, or ensembl_havana).

This example also shows how to combine two conditions during filtering with &; the logical operator for “or” would be |.

Note that the parentheses around each condition are required, and this part of the syntax in Pandas is different from Python, which would have been literal and and or.

Next, let’s borrow the gdf DataFrame from the previous section as a source for the length of each chromosome:

The columns that are not relevant to the analysis are dropped for clarity.

Yes, .drop can also take a list of columns and drop them altogether in one operation.

Note that the row with a seqid of MT is still there; we will get back to it later. The next operation we will perform is merge the two datasets based on the values of seqid.

Since chr_gene_counts is still a Series object, which doesn’t support a merge operation, we need to convert it to a DataFrame object first with .to_frame.

.reset_index() converts the original index (i.e. seqid) into a new column and resets current index as 0-based incremental numbers.

The output from cdf.head(2) shows what it looks like. Next, we used the .mergemethod to combine the two DataFrame on the seqid column (on='seqid').

After merging gdf and cdf, the MT entry is still missing. This is because, by default, .merge operates an inner join, while left join, right join, or outer join are available by tuning the how parameter.

Please refer to the documentation for more details.

Later, you may find that there is also a related .join method. .merge and .joinare similar yet have different APIs.

According to the official documentation says

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

Basically, .merge is more general-purpose and used by .join.

Finally, we are ready to calculate the correlation between chromosome lengthand gene_count.

By default .corr calculates the Pearson correlation between all pairs of columns in a dataframe.

But we have only a single pair of columns in this case, and the correlation turns out to be positive – 0.73.

In other words, the larger the chromosome, the more likely it is to have more genes. Let’s also plot the two columns after sorting the value pairs by length:

As seen in image above, even though it is a positive correlation overall, it does not hold for all chromosomes. In particular, for Chromosome 17, 16, 15, 14, 13, the correlation is actually negative, meaning the number of genes on the chromosome decreases as the chromosome size increases.

Findings and Future Research

That ends our tutorial on the manipulation of an annotation file for human genome in GFF3 format with the SciPy stack. The tools we’ve mainly used include IPython, Pandas, and matplotlib. During the tutorial, not only have we learned some of the most common and useful operations in Pandas, we also answered some very interesting questions about our genome. In summary:

  1. About 0.37% of the human genome is still incomplete even though the first draft came out over 15 years ago.
  2. There are about 42,000 genes in the human genome based on this particular GFF3 file we used.
  3. The length of a gene can range from a few dozen to over two million bases.
  4. Genes are not evenly distributed among the chromosomes. Overall, the larger the chromosome, the more genes it hosts, but for a subset of the chromosomes, the correlation can be negative.

The GFF3 file is very rich in annotation information, and we have just scratched the surface. If you are interested in further exploration, here are a few questions you can play with:

  1. How many transcripts does a gene typically have? What percentage of genes have more than 1 transcript?
  2. How many isoforms does a transcript typically have?
  3. How many exons, CDS, and UTRs does a transcript typically have? What sizes are they?
  4. Is it possible to categorize the genes based on their function as described in the description column?

This article was originally published in: https://www.toptal.com/python/comprehensive-introduction-your-genome-scipy