An Introduction to Accessing COVID-19 Virus Genomic Data using Biopython for the Curious

Emiko Sano
5 min readApr 2, 2021
Illustration of a DNA structure
DNA Molecule from WikiMedia

Biopython is a python package used for many bioinformatics applications. What is bioinformatics, do you ask? If I were to summarize it in very simple terms, it’s the process of analyzing biological sequencing data. The data are usually DNA or protein sequencing data. With the advancement of sequencing technologies, there is an abundance of biological data available. NIH (National Institute of Health) maintains a database for biological data (called NCBI or National Center for Biotechnology Information), and it’s probably the largest publicly accessible database. With the large abundance, there is also a need for statistical analysis and software. Biopython has many of those necessary tools available for parsing and analyzing the sequences. Sequences are long strings of four letters (A, T, G, or C). The order of these is unique to each organism, and we can learn a lot from looking at and comparing those sequences. The first step is to collect the sequencing data. Using biopython, you can access sequencing data from NCBI.

The setup for biopython is easy. All you need is a quick pip install using:

pip install biopython

Now that you have it installed, let’s access some genomic data. There are couple of different types of formats that you can download the data in: FASTA or GenBank. The simple difference is that the first one has a simple format that looks like this (using Mycobacterium tuberculosis as an example; yes, the causative agent of tuberculosis):

>NC_000962.3 Mycobacterium tuberculosis H37Rv, complete genome
TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTTAACGGCGACCCTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTGACCCCTCAGCAAAGGGCTTGGCTCAATCTCGTCCAGCCATTGACCATCGTCGAGGGGTTTGCTCTGTTATCCGTGCCGAGCAG…

The first part following > will vary depending on the entry, but it’s a descriptor the organism. Of course the TTGACC… part is the actual DNA sequence.

GenBank format on the other hand has a lot more information upfront followed by the DNA sequences. The same genomic sequence of Mycobacterium tuberculosis from above in GenBank format looks like this:

LOCUS       NC_000962            4411532 bp    DNA     linear   CON 14-DEC-2017
DEFINITION Mycobacterium tuberculosis H37Rv, complete genome.
ACCESSION NC_000962
VERSION NC_000962.3
DBLINK BioProject: PRJNA57777 Assembly: GCF_000195955.2
KEYWORDS RefSeq; complete genome.
SOURCE Mycobacterium tuberculosis H37Rv
ORGANISM Mycobacterium tuberculosis H37Rv Bacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex.

This is only the very start, and it has a lot more information including the names of the genes and where those genes start and stop in the genome. If you are curious, you can access the whole thing here.

Here, I want to get sequences for COVID-19 virus (officially it’s called, Severe acute respiratory syndrome coronavirus 2 or SARS-CoV-2), and let’s say that I want it in GenBank format.

Biopython has many classes, and class Entrez will enable interaction with the server, and the complete terms can be accessed here. This is similar to searching on the search bar at the GUI on the NCBI website.

from Bio import EntrezEntrez.email = “somename@something.com” # enter your email herehandle = Entrez.efetch(db=”nuccore”, id=”NC_045512", rettype=”gb”, retmode=”text”)

Here, db is the database, and the choices can be found here. Because we want to use the Nucleotide database, we specify ‘nuccore’ here. For id, this is the identifying number(s) for the organism(s). By specifying rettype and retmode, (genbank in text format), we get the GenBank in plain text format.

We can print out the result by:

print(handle.read())

That gets something that start like this:

LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,complete genome.
ACCESSION NC_045512
VERSION NC_045512.2
DBLINK BioProject: PRJNA485481
KEYWORDS RefSeq.
SOURCE Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
ORGANISM Severe acute respiratory syndrome coronavirus 2 Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus.
REFERENCE 1 (bases 1 to 29903)
AUTHORS Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y., Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H., Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C. and Zhang,Y.Z.
TITLE A new coronavirus associated with human respiratory disease in China
JOURNAL Nature 579 (7798), 265-269 (2020)
PUBMED 32015508
REMARK Erratum:[Nature. 2020 Apr;580(7803):E7. PMID: 32296181]
REFERENCE 2 (bases 13476 to 13503)
AUTHORS Baranov,P.V., Henderson,C.M., Anderson,C.B., Gesteland,R.F., Atkins,J.F. and Howard,M.T

You are probably wondering that the above code is not helpful because 1. You need to know the id number of the organism (which I looked up). Alternative to this, we can search and get the nucleotide sequences using a search term with egquery.

First, we can check how many entries there are:

handle2 = Entrez.egquery(term=”SARS-CoV-2")record = Entrez.read(handle) # this is a dictionary

What I called record above is dictionary with term and 1eGQueryResult as keys. 1eGQueryResult contains another dictionary as value. This secondary dictionary has these two keys DbName and Count as keys (among others). The DbName has the database name as value, and where are interested in nucleotide records, so we have to find the records in the ‘nuccore’ database. The Count would give us the number of results within that particular database. This code would then give us the number of records that we want:

for item in record[“eGQueryResult”]:if item[“DbName”]==”nuccore”: # pulling only records that are on the nuccore databaseprint(item[“Count”])

Using these terms as of March 26, 2021, we get 124,030 entries. This is a lot of entries. Not all of these are going to be complete genomes. Sequences with the complete genome of SARS-Cov-2, which is 29903 base-pair (bp) long. To limit my search, I will be limiting to only records that are available in the GenBank format.

I need to list of IDs to be able to download the records. By searching using ‘SARS-CoV-2’ as a term, and looking at the record, I can obtain the list.

handle = Entrez.esearch(db=”nuccore”, term=”SARS-CoV-2")record = Entrez.read(handle)gi_list = record[“IdList”]

I can apply this list to obtain the sequences in GenBank format.

handle = Entrez.efetch(db=”nuccore”, id=gi_list, rettype=”gb”, retmode=”text”)

We know have our sequences downloaded to memory. We can save these to file and start analyzing the sequences.

Because of the strong interest in COVID-19, NCBI has a dedicated page for all SARS-Cov-2 sequences. NCBI also has an API and other comman line tools to access, but for this blog, my goal was to only focus on using Biopython.

--

--

Emiko Sano

Microbiologist, Data Scientist In-Training, and STEM education enthusiast