ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES

Searching, retrieving, and parsing data from NCBI databases through the Unix command line.

INTRODUCTION

Entrez Direct (EDirect) provides access to Entrez, the NCBI's suite of interconnected databases, from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to construct multi-step queries. Selected records can then be retrieved in a variety of formats.

PROGRAMMATIC ACCESS

EDirect connects to Entrez through the Entrez Programming Utilities interface. It supports searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports.

Navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.
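For illustration, the structured message produced by a navigation step has roughly the following shape (the WebEnv and Count values here are placeholders, and the exact contents vary by query and EDirect version):

```xml
<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>MCID_...</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2539</Count>
</ENTREZ_DIRECT>
```

Because the message records the current database (Db), subsequent programs in the pipe can omit the -db argument.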

Accessory programs (nquire, transmute, and xtract) can help eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect programs and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

NAVIGATION FUNCTIONS

Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query for the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:

  esearch -db pubmed -query "selective serotonin reuptake inhibitor"

Search terms can also be qualified with a bracketed field name to match within the specified index:

  esearch -db nuccore -query "insulin [PROT] AND rodents [ORGN]"

Elink looks up precomputed neighbors within a database, finds associated records in other databases, or uses the NIH Open Citation Collection service (PMID 31600197) to follow reference lists:

  elink -related

  elink -target gene

  elink -cited

  elink -cites

Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:

  efilter -molecule genomic -location chloroplast -country sweden -mindate 1985

Efetch downloads selected records or reports in a style designated by -format:

  efetch -format abstract

Individual query commands are connected by a Unix vertical bar pipe symbol:

  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format apa

The vertical bar also allows query steps to be placed on separate lines:

  esearch -db pubmed -query "raynaud disease AND fish oil" |
  efetch -format medline

Each program offers a -help flag that prints detailed information about its available arguments.

EDirect programs are designed to work on large sets of data. There is no need to use a script to loop over records in small groups, or write code to retry a query after a transient network or server failure, or add a time delay between requests. All of those features are already built into the system.

ACCESSORY PROGRAMS

Nquire retrieves data from remote servers with URLs constructed from command line arguments:

  nquire -get https://icite.od.nih.gov api/pubs -pmids 2539356 |

Transmute converts a concatenated stream of JSON objects or other structured formats into XML:

  transmute -j2x |

Xtract uses waypoints to navigate complex XML hierarchies, and obtains data values by field name:

  xtract -pattern data -element cited_by |

The resulting output can be post-processed by Unix utilities or scripts:

  fmt -w 1 | sort -V | uniq
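The final post-processing stage can be tried on its own with arbitrary stand-in numbers, to see what each Unix utility contributes:

```shell
# fmt -w 1 places each whitespace-separated value on its own line,
# sort -V applies a numeric-aware (version) ordering, and uniq then
# removes the adjacent duplicates.
printf '12 101 12\n9\n' | fmt -w 1 | sort -V | uniq
# prints 9, 12, and 101, one value per line
```

In the citation query above, the same stage turns rows of cited_by identifiers into a sorted, de-duplicated column of PMIDs.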

INSTALLATION

EDirect consists of a set of scripts and programs that are downloaded to the user's computer. To install the software, open a terminal window and execute one of the following two commands:

  sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

  sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"

Once installation is complete, run the following to set the PATH for the current terminal session:

  export PATH=${HOME}/edirect:${PATH}

For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile and .zshrc configuration files:

  export NCBI_API_KEY=unique_api_key

DISCOVERY BY NAVIGATION

PubMed related articles are identified by a statistical text retrieval algorithm using the title, abstract, and medical subject headings (MeSH terms). The connections between papers can be used for making discoveries. An example of this is finding the last enzymatic step in the vitamin A biosynthetic pathway.

Lycopene cyclase in plants converts lycopene into beta-carotene, the immediate biochemical precursor of vitamin A. Beta-carotene is an essential nutrient, required in the diet of herbivores. This indicates that lycopene cyclase is not present in animals (with a few exceptions caused by horizontal gene transfer), and that the enzyme responsible for converting beta-carotene into vitamin A is not present in plants.

An initial search on the lycopene cyclase enzyme finds 306 articles. Looking up precomputed neighbors returns 19,146 papers, some of which might be expected to discuss other enzymes in the pathway:

  esearch -db pubmed -query "lycopene cyclase" | elink -related |

We cannot reliably limit the results to animals in PubMed, but we can for sequence records, which are indexed by the NCBI taxonomy. Linking the publication neighbors to their associated protein records finds 604,878 sequences. Restricting those to mice excludes plants, fungi, and bacteria, which eliminates the earlier enzymes:

  elink -target protein | efilter -organism mouse -source refseq |

This matches only 32 sequences, which is small enough to examine by retrieving the individual records:

  efetch -format fasta

As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:

  ...
  >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
  MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
  SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
  ...

A better example used Entrez protein neighbors to instantly rediscover the similarity between a human colon cancer gene and microbial DNA repair genes. Unfortunately, precomputed BLAST links were discontinued due to the exponential growth of the sequence databases.

XML DATA EXTRACTION

The ability to obtain Entrez records in structured format, and to easily extract the underlying data, allows the user to ask novel questions that are not addressed by existing analysis software.

The xtract program uses command-line arguments to direct the conversion of data in eXtensible Markup Language format. It allows record detection, path exploration, element selection, conditional processing, and report formatting to be controlled independently.

The -pattern command partitions an XML stream by object name into individual records that are processed separately. Within each record, the -element command does an exhaustive, depth-first search to find data content by field name.

Neither explicit object paths nor complicated path formulas are needed for element identification.

FORMAT CUSTOMIZATION

By default, the -pattern argument divides the results into rows, while placement of data into columns is controlled by -element, to create a tab-delimited table.

Formatting commands allow extensive customization of the output. The line break between -pattern rows is changed with -ret, while the tab character between -element columns is modified by -tab.

Multiple instances of the same element are distinguished using -sep, which controls their separation independently of the -tab command. The following query:

  efetch -db pubmed -id 6271474,6092233,16589597 -format docsum |
  xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name

returns a tab-delimited table with individual author names separated by vertical bars:

  6271474     1981            Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
  6092233     1984 Jul-Aug    Calderon IL|Contopoulou CR|Mortimer RK
  16589597    1954 Dec        Garber ED

The -sep value also applies to distinct -element arguments that are grouped with commas. This can be used to keep data from multiple related fields in the same column:

  -sep " " -element Initials,LastName

The -def command sets a default placeholder to be printed when none of the comma-separated fields in an -element clause are present:

  -def "-" -sep " " -element Year,Month,MedlineDate

Repackaging commands (-wrp, -enc, and -pkg) wrap extracted data values with bracketed XML tags given only the object name. For example, "-wrp Word" issues the following formatting instructions:

  -pfx "<Word>" -sep "</Word><Word>" -sfx "</Word>"

It also sets an internal flag to ensure that data values containing encoded ampersands, angle brackets, apostrophes, and quotation marks remain properly encoded inside the new XML.

ELEMENT VARIANTS

Derivatives of -element were created to avoid the inconvenience of having to write post-processing scripts to perform trivial modifications or calculations on extracted data. Other variants were added for content normalization, report formatting, or index generation. They are subdivided into several categories, and can be substituted for -element as needed. A representative selection is shown below:

  Positional:    -first,  -last,  -even,  -odd,  -backward
  Numeric:       -num,  -len,  -inc,  -dec,  -mod,  -bin,  -hex,  -bit,  -sqt,  -lge,  -lg2,  -log
  Statistics:    -sum,  -acc,  -min,  -max,  -dev,  -med,  -avg,  -geo,  -hrm,  -rms
  Character:     -upper,  -lower,  -title,  -mirror,  -alpha,  -alnum
  Text:          -terms,  -words,  -pairs,  -letters,  -split,  -order,  -reverse,  -prose
  Sequence:      -revcomp,  -fasta,  -ncbi2na,  -cds2prot,  -molwt,  -pept
  Citation:      -year,  -month,  -date,  -auth,  -initials,  -page,  -author,  -journal
  Other:         -doi,  -wct,  -trim,  -pad,  -accession,  -numeric

The original -element prefix shortcuts, "#" and "%", are redirected to -num and -len, respectively.

VALUE SUBSTITUTION

External values can be inserted by reading a two-column, precomputed file or ad hoc conversion table with -transform, and then requesting a replacement by applying -translate to an element:

  xtract -transform accn-to-uid.txt  ...  -translate Accession

  xtract -transform <( echo -e "Genomic\t1\nCoding\t2\nProtein\t3\n" )  ...

PARSING FIELDS

The -with and -split commands can parse multiple clauses that are packed into a single field:

  -wrp Item -with ";" -split Attributes

SUBSTRING LIMITS

A subrange is selected with start and stop positions inside square brackets and separated by a colon. Endpoints for removal of specific prefix and suffix strings are indicated by a vertical bar inside brackets:

  -author Initials[1:1],LastName -prose "Title[phospholipase | rattlesnake]"

  -wrp Tag -element "Item[|=]" -wrp Val -element "Item[=|]"

LOCAL CONTEXT

An -element argument can use the parent / child construct to limit selection when items can only be disambiguated by position, not by name. In this case it prevents the display of additional PMIDs that might be present in CommentsCorrections objects deeper in the MedlineCitation container:

  xtract -pattern PubmedArticle -element MedlineCitation/PMID

EXPLORATION CONTROL

Exploration commands allow precise control over the order in which XML record contents are examined, by separately presenting each instance of the chosen subregion. This limits what subsequent commands "see" at any one time, allowing related fields in an object to be kept together.

In contrast to the simpler DocumentSummary format, records retrieved as PubmedArticle XML:

  efetch -db pubmed -id 1413997 -format xml |

have authors with separate fields for last name and initials:

  <Author>
    <LastName>Mortimer</LastName>
    <Initials>RK</Initials>
  </Author>

Without being given any guidance about context, an -element command on initials and last names:

  xtract -pattern PubmedArticle -element Initials LastName

will explore the current record for each argument in turn, printing all initials followed by all last names:

  RK    CR    JS    Mortimer    Contopoulou    King

Inserting a -block command adds another exploration layer between -pattern and -element, and redirects data exploration to present the authors one at a time:

  xtract -pattern PubmedArticle -block Author -element Initials LastName

Each time through the loop, the -element command only sees the current author's values. This restores the correct association of initials and last names in the output:

  RK    Mortimer    CR    Contopoulou    JS    King

Grouping the two author subfields with a comma, and adjusting the -sep and -tab values:

  xtract -pattern PubmedArticle -block Author \
    -sep " " -tab ", " -element Initials,LastName

produces a more traditional formatting of author names:

  RK Mortimer, CR Contopoulou, JS King

NESTED EXPLORATION

Exploration command names (-group, -block, and -subset) are assigned to a precedence hierarchy:

  -pattern > -group > -block > -subset > -element

and are combined in ranked order to control object iteration at progressively deeper levels in the XML data structure. Each command argument acts as a "nested for-loop" control variable, retaining information about the context, or state of exploration, at its level.

A nucleotide or protein sequence record can have multiple features. Each feature can have multiple qualifiers. And every qualifier has separate name and value nodes. Exploring this natural data hierarchy, with -pattern for the sequence, -group for the feature, and -block for the qualifier:

  efetch -db nuccore -id NM_021486.4 -format gbc |
  xtract -pattern INSDSeq -element INSDSeq_accession-version \
    -group INSDFeature -deq "\n\t" -element INSDFeature_key \
      -block INSDQualifier -deq "\n\t\t" \
        -element INSDQualifier_name INSDQualifier_value

keeps qualifiers, such as gene and product, associated with their parent features, and keeps qualifier names and values together on the same line:

  NM_021486.4
      source
                organism       Mus musculus
                mol_type       mRNA
      gene
                gene           Bco1
      CDS
                gene           Bco1
                product        beta,beta-carotene 15,15'-dioxygenase isoform 1
                protein_id     NP_067461.2
                translation    MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGM ...
                ...

SAVING DATA IN VARIABLES

A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -KEY). Saved values are retrieved by placing an ampersand before the variable name (e.g., "&KEY") in an -element statement:

  efetch -db nuccore -id NM_021486.4 -format gbc |
  xtract -pattern INSDSeq -element INSDSeq_accession-version \
    -group INSDFeature -KEY INSDFeature_key \
      -block INSDQualifier -deq "\n\t" \
        -element "&KEY" INSDQualifier_name INSDQualifier_value

This prints the feature key on each line before the qualifier name and value, even though the feature key is now outside of the visibility scope (which is the current qualifier):

  NM_021486.4
      source    organism       Mus musculus
      source    mol_type       mRNA
      gene      gene           Bco1
      CDS       gene           Bco1
      CDS       product        beta,beta-carotene 15,15'-dioxygenase isoform 1
      CDS       protein_id     NP_067461.2
      CDS       translation    MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGM ...
      ...

Variables can be (re)initialized with an explicit literal value inside parentheses. A variable can also save the modified data resulting from an -element variant operation. This can allow multiple sequential transitions within a single xtract command:

  -COM "(, )" -END -sum "Start,Length" -MID -avg "Start,&END"

CONDITIONAL EXECUTION

Conditional processing commands (-if and -unless) restrict object exploration by data content. They check to see if the named field is within the scope, and may be used in conjunction with string, numeric, or object constraints to require an additional match by value. Use -and and -or to build compound tests, and -select to remove records that do not satisfy the condition. For example:

  esearch -db pubmed -query "Havran W [AUTH]" | efetch -format xml |
  xtract -pattern PubmedArticle -select Language -equals eng |
  xtract -pattern PubmedArticle \
    -block Author -if LastName -is-not Havran \
      -sep ", " -tab "\n" -author LastName,Initials[1:1] |
  sort-uniq-count-rank

limits the results to papers written in English and prints a table of the most frequent collaborators, using a range to keep only the first initial so that variants like "Berg, C" and "Berg, CM" are combined:

  35    Witherden, D
  15    Boismenu, R
  12    Jameson, J
  10    Allison, J
  10    Fitch, F
  ...

Numeric constraints can compare the integer values of two fields. This can be used to find genes that are encoded on the minus strand of a particular chromosome:

  -if ChrLoc -equals X -and ChrStart -gt ChrStop

Object constraints will compare the string values of two named fields, and can look for internal inconsistencies between fields whose contents should (in most cases) be identical:

  -if Chromosome -differs-from ChrLoc

The -position command restricts presentation of objects by relative location or index number:

  -block Author -position last -sep ", " -element LastName,Initials

The -else command can run an alternative -element or -lbl instruction if the condition is not satisfied:

  -if ChrStart -gt ChrStop -lbl "minus strand" -else -lbl "plus strand"

GENERATING ATTRIBUTES

Additional commands (-tag, -att, -atr, -cls, -slf, and -end) allow generation of XML tags with attributes. The following will produce regular and self-closing XML objects, respectively:

  -tag Item -att type journal -cls -element Source -end Item

  <Item type="journal">J Bacteriol</Item>

  -tag Item -att type journal -atr name Source -slf

  <Item type="journal" name="J Bacteriol" />

XML NAMESPACES

Namespace prefixes are followed by a colon, while a leading colon matches any prefix:

  nquire -url http://webservice.wikipathways.org getPathway -pwId WP455 |
  xtract -pattern "ns1:getPathwayResponse" -decode ":gpml" |

The embedded Graphical Pathway Markup Language object can then be processed:

  xtract -pattern Pathway -block Xref \
    -if @Database -equals "Entrez Gene" \
      -tab "\n" -element @ID

AUTOMATIC FORMAT CONVERSION

Xtract can now detect and convert input data in JSON, text ASN.1, and GenBank/GenPept flatfile formats. Explicit transmute or shortcut commands are only needed for inspecting the intermediate XML or for overriding the default conversion settings.

MULTI-STEP TRANSFORMATIONS

Although xtract provides -element derivatives to do simple data manipulation, more complex tasks may be broken up into a series of simpler transformations, also known as "processing chains".

BioSample document summaries:

  efetch -db biosample -id SAMN38051082 -format docsum |

store preferred qualifier names in a "harmonized_name" XML attribute:

  <Attribute harmonized_name="strain">BALB/c</Attribute>
  <Attribute harmonized_name="isolate">Mtb infected Spleen MZB-2</Attribute>
  <Attribute harmonized_name="geo_loc_name">Singapore</Attribute>

Piping the data to the first xtract command, and using the "@" sign to select the attribute:

  xtract -rec BioSampleInfo -pattern DocumentSummary \
    -wrp Accession -element Accession \
    -group Attribute -if @harmonized_name \
      -TAG -lower @harmonized_name -wrp "&TAG" -element Attribute |

generates an intermediate form, with XML tag names taken from the original XML attributes:

  <BioSampleInfo>
    <Accession>SAMN38051082</Accession>
    <strain>BALB/c</strain>
    <isolate>Mtb infected Spleen MZB-2</isolate>
    <geo_loc_name>Singapore</geo_loc_name>
    ...

Desired fields can then be selected by name in the second xtract command:

  xtract -pattern BioSampleInfo -def "-" -first Accession \
    geo_loc_name strain isolate

BIOLOGICAL DATA IN ENTREZ

EDirect provides additional functions, scripts, and exploration constructs to simplify the extraction of complex data obtained from the interconnected Entrez biological databases.

SEQUENCE QUALIFIERS

The NCBI data model for sequence records (PMID 11449725) is based on the central dogma of molecular biology. Sequences, including genomic DNA, messenger RNAs, and protein products, are "instantiated" with the actual sequence letters, and are assigned accession numbers for reference.

Features contain information about the biology of a region, including the transformations involved in gene expression. Qualifiers store specific details about a feature (e.g., name of the gene, genetic code used for protein translation, accession of the product sequence, cross-references to external databases).

A gene feature indicates the location of a heritable region of nucleic acid that confers a measurable phenotype. An mRNA feature on genomic DNA represents the exonic and untranslated regions that remain after message transcription and intron splicing. A coding region (CDS) feature has a product reference to the translated protein sequence record.

As a convenience for exploring sequence records, the xtract -insd helper function generates the appropriate nested extraction commands from feature and qualifier names on the command line. (Two computed qualifiers, feat_location and sub_sequence, are also supported.)

SNAIL VENOM PEPTIDE SEQUENCES

A search for cone snail venom mature peptides:

  esearch -db protein -query "conotoxin" -feature mat_peptide |
  efetch -format gpc |
  xtract -insd complete mat_peptide "%peptide" product mol_wt peptide |
  grep -i conotoxin | sort-table -u -k 2,2n

uses the -insd function to print the accession number, mature peptide length, product name, calculated molecular weight, and amino acid sequence for a sample of neurotoxic peptides:

  ADB43131.1    15    conotoxin Cal 1b      1708    LCCKRHHGCHPCGRT
  ADB43128.1    16    conotoxin Cal 5.1     1829    DPAPCCQHPIETCCRR
  AIC77105.1    17    conotoxin Lt1.4       1705    GCCSHPACDVNNPDICG
  ADB43129.1    18    conotoxin Cal 5.2     2008    MIQRSQCCAVKKNCCHVG
  ADD97803.1    20    conotoxin Cal 1.2     2206    AGCCPTIMYKTGACRTNRCR
  AIC77085.1    21    conotoxin Bt14.8      2574    NECDNCMRSFCSMIYEKCRLK
  ADB43125.1    22    conotoxin Cal 14.2    2157    GCPADCPNTCDSSNKCSPGFPG
  ...

SNP-MODIFIED PRODUCT PAIRS

Single nucleotide polymorphisms can represent different substitutions at the same position, but variation records do not explicitly match a specific CDS modification to its altered protein product:

  efetch -db snp -id 11549407 -format docsum |

The hgvs2spdi script converts 1-based HGVS data ("NM_000518.5:c.118C>T") into 0-based SPDI format ("NM_000518.5:167:C:T"). For SNPs on cDNA transcripts the position is CDS-relative, and the script retrieves the GenBank record in order to calculate the absolute sequence offset:

  snp2hgvs | hgvs2spdi | spdi2tbl |

The normalized results are saved in a tab-delimited data table:

  rs11549407    NC_000011.10    5226773    G    A    Genomic    Substitution    HBB
  rs11549407    NM_000518.5     167        C    T    Coding     Substitution    HBB
  rs11549407    NP_000509.1     39         Q    *    Protein    Termination     HBB
  rs11549407    NP_000509.1     39         Q    E    Protein    Missense        HBB
  ...
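The position arithmetic behind the HGVS-to-SPDI step can be sketched in shell. (Assumption for this sketch: the coding region of NM_000518.5 begins at 1-based transcript position 51, after a 50-nucleotide 5' UTR; the real hgvs2spdi script obtains that offset from the GenBank record rather than hard-coding it.)

```shell
# Convert a 1-based, CDS-relative HGVS position (c.118) into a
# 0-based absolute transcript position, assuming the CDS starts
# at 1-based position 51 (hard-coded here only for illustration).
cds_start=51
hgvs_pos=118
spdi_pos=$(( (cds_start - 1) + (hgvs_pos - 1) ))
echo "$spdi_pos"
# prints 167, matching the position in NM_000518.5:167:C:T
```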

A final step then translates the coding regions (after nucleotide substitution), and sorts them together with the corresponding protein sequences (after residue replacement):

  tbl2prod

to produce adjacent matching CDS/protein pairs:

  rs11549407    NM_000518.5:167:C:T    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT*R ...
  rs11549407    NP_000509.1:39:Q:*     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT*R ...
  rs11549407    NM_000518.5:167:C:G    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTER ...
  rs11549407    NP_000509.1:39:Q:E     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTER ...
  rs11549407    NM_000518.5:167:C:A    MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTKR ...
  rs11549407    NP_000509.1:39:Q:K     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTKR ...
  ...

GENES IN A REGION

Records for protein-coding genes on the human X chromosome are retrieved by running:

  esearch -db gene -query "Homo sapiens [ORGN] AND X [CHR]" |
  efilter -status alive -type coding | efetch -format docsum |

Gene names and chromosomal positions are extracted by piping the records to:

  xtract -pattern DocumentSummary -NAME Name -DESC Description \
    -block GenomicInfoType -if ChrLoc -equals X \
      -min ChrStart,ChrStop -element "&NAME" "&DESC" |

The -if statement eliminates coordinates from pseudoautosomal gene copies present on the Y chromosome telomeres. Results can now be sorted by position, and then filtered and partitioned:

  sort-table -k 1,1n | cut -f 2- |
  grep -v pseudogene | grep -v uncharacterized | grep -v hypothetical |
  between-two-genes AMER1 FAAH2

to produce an ordered table of known genes located between two markers flanking the centromere:

  FAAH2      fatty acid amide hydrolase 2
  SPIN2A     spindlin family member 2A
  ZXDB       zinc finger X-linked duplicated B
  NLRP2B     NLR family pyrin domain containing 2B
  ZXDA       zinc finger X-linked duplicated A
  SPIN4      spindlin family member 4
  ARHGEF9    Cdc42 guanine nucleotide exchange factor 9
  AMER1      APC membrane recruitment protein 1

TAXONOMIC LINEAGE

To accommodate recursively defined data, entry to an internal object is blocked when its name matches the current exploration container. The double star / child construct removes the search constraint to recursively visit every object regardless of depth, and can flatten a complex structure into a linear set of elements in a single step:

  efetch -db taxonomy -id 9606 -format xml |
  xtract -pattern Taxon \
    -first TaxId -tab "\n" -element ScientificName \
    -block "**/Taxon" -if Rank -is-not "no rank" -and Rank -excludes "root" \
      -tab "\n" -element Rank,ScientificName

This prints all of the individual internal lineage nodes:

  9606         Homo sapiens
  domain       Eukaryota
  clade        Opisthokonta
  kingdom      Metazoa
  clade        Eumetazoa
  clade        Bilateria
  clade        Deuterostomia
  phylum       Chordata
  subphylum    Craniata
  clade        Vertebrata
  ...

SEQUENCE ANALYSIS

EDirect sequence processing functions are provided by the transmute program. They can handle huge sequences as strings, without requiring any special coding techniques or custom data structures.

For example, the nucleotide sequence in a GenBank record can be extracted, reverse-complemented, and saved in FASTA format with:

  efetch -db nuccore -id U00096 -format gb |
  gbf2fsa | transmute -revcomp | transmute -fasta -width 50

PATTERN SEARCHING

The pBR322 cloning vector is a circular plasmid with unique restriction sites in two antibiotic resistance genes. The transmute -search function takes a list of sequence patterns with optional labels (such as restriction enzyme names), and uses a finite-state algorithm to simultaneously search for all patterns:

  efetch -db nuccore -id J01749 -format fasta |
  transmute -search -circular GGATCC:BamHI GAATTC:EcoRI CTGCAG:PstI

The starting positions and labels for each match are printed in a two-column table:

  374     BamHI
  3606    PstI
  4358    EcoRI

FEATURE LOCATIONS

A table of coding region locations and gene names can be saved directly as XML with xtract -insdx:

  efetch -db nuccore -id NC_000011 -format gb -style master |
  xtract -insdx CDS gene feat_location > cds_loc.xml

The human beta-globin coding region location is retrieved and stored in a Unix shell variable with:

  loc=$( xtract -input cds_loc.xml -pattern Rec \
           -if gene -equals HBB -element feat_location )

Location intervals are shown in biological order, where start is greater than stop on the minus strand:

  5227021..5226930,5226799..5226577,5225726..5225598
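The minus-strand convention can be verified with ordinary shell string handling; this is an illustrative sketch, not part of EDirect:

```shell
# Take the first interval of the saved location and compare its
# endpoints; start > stop indicates the minus strand.
loc="5227021..5226930,5226799..5226577,5225726..5225598"
first=${loc%%,*}      # first interval: 5227021..5226930
start=${first%%..*}   # 5227021
stop=${first##*..}    # 5226930
if [ "$start" -gt "$stop" ]; then echo "minus strand"; else echo "plus strand"; fi
# prints "minus strand"
```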

SEQUENCE TRANSFORMATIONS

The repercussions of a genomic SNP can be followed with transmute functions: -replace applies the substitution, -extract uses the location to isolate the altered coding sequence, and -cds2prot translates the modified CDS into protein with the designated genetic code:

  efetch -db nuccore -id NC_000011 -format fasta |
  transmute -replace -offset 5226773 -delete G -insert A |
  transmute -extract -1-based "$loc" |
  transmute -cds2prot -gcode 1 -frame 0 -every -trim

EXTERNAL DATA INTEGRATION

The nquire program uses command-line arguments to obtain data from external RESTful, CGI, or FTP servers. (Xtract can now read JSON, ASN.1, and GenBank formats directly, so the explicit conversion commands retained in the examples below are only needed for inspecting the intermediate XML or for overriding the default conversion settings.)

JSON ARRAYS

Human beta-globin information from a Scripps Research data integration project (PMID 23175613):

  nquire -get http://mygene.info/v3 gene 3043 | transmute -j2x |

contains a multi-dimensional JavaScript Object Notation array of exon coordinates:

  "position": [
    [ 5225463, 5225726 ],
    [ 5226576, 5226799 ],
    [ 5226929, 5227071 ]
  ],
  "strand": -1,

Conversion to XML assigns distinct tag names to each level with the "-nest element" default:

  <position>
    <position_E>5225463</position_E>
    <position_E>5225726</position_E>
  </position>
  ...

HETEROGENEOUS DATA

A query for the human green-sensitive opsin gene:

  nquire -get http://mygene.info/v3/gene/2652 | transmute -j2x |

returns data containing a heterogeneous mixture of objects in the pathway section:

  <pathway>
    <reactome>
      <id>R-HSA-162582</id>
      <name>Signal Transduction</name>
    </reactome>
    ...
    <wikipathways>
      <id>WP455</id>
      <name>GPCRs, Class A Rhodopsin-like</name>
    </wikipathways>
  </pathway>

The parent / star construct is used to visit the individual components of a parent object without needing to explicitly specify their names. For printing, the name of a child object is indicated by a question mark:

  xtract -pattern opt -group "pathway/*" \
    -pfc "\n" -element "?,name,id"

This displays a table of pathway database references:

  reactome        Signal Transduction                R-HSA-162582
  reactome        Disease                            R-HSA-1643685
  ...
  reactome        Diseases of the neuronal system    R-HSA-9675143
  wikipathways    GPCRs, Class A Rhodopsin-like      WP455

TABLES TO XML

Tab-delimited files are easily converted to XML with transmute -t2x (or tbl2xml):

  nquire -ftp ftp.ncbi.nlm.nih.gov gene/DATA gene_info.gz |
  gunzip -c | grep -v NEWENTRY | cut -f 2,3 |
  transmute -t2x -set Set -rec Rec -skip 1 Code Name

This takes a series of command-line arguments with tag names for wrapping the individual columns, and skips the first line of input, which contains header information, to generate a new XML file:

  <Rec>
    <Code>1246500</Code>
    <Name>repA1</Name>
  </Rec>
  <Rec>
    <Code>1246501</Code>
    <Name>repA2</Name>
  </Rec>
  ...

The tbl2xml -header argument will instead obtain tag names from the first line of the input data.

Similarly, transmute -c2x (or csv2xml) will convert comma-separated values (CSV) files to XML.

GENBANK DOWNLOAD

The most recent GenBank virus release file can also be downloaded from NCBI servers:

  nquire -lst ftp.ncbi.nlm.nih.gov genbank |
  grep "^gbvrl" | grep ".seq.gz" | sort -V |
  tail -n 1 | skip-if-file-exists |
  nquire -dwn ftp.ncbi.nlm.nih.gov genbank

GenBank flatfile records can be selected by organism name or taxon identifier, or by presence or absence of an arbitrary text string, with transmute -gbf (or filter-genbank).

While this can be read directly by xtract, explicit conversion to INSDSeq XML with transmute -g2x (or gbf2xml) may be up to three times faster for large sets of records:

  gunzip -c *.seq.gz | filter-genbank -taxid 11292 | gbf2xml |

The XML can then be piped to xtract -insd to obtain feature location intervals and underlying sequences of individual coding regions:

  xtract -insd CDS gene product feat_location sub_sequence

LOCAL PUBMED ARCHIVE

Fetching data from Entrez works well when a few thousand records are needed, but it does not scale for much larger sets of data, where the time it takes to download becomes a limiting factor.

LOCAL RECORD CACHE

EDirect can now preload over 38 million live PubMed records onto an inexpensive external 1 TB solid-state drive as individual files for rapid retrieval. For example, PMID 2539356 would be stored at:

  /pubmed/Archive/02/53/93/2539356.xml.gz

using a hierarchy of folders to organize the data for random access to any record.
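The folder layout can be sketched as a small shell function. This is a hypothetical helper, not part of EDirect: it zero-pads the PMID to 8 digits and uses the first three digit pairs as nested folder names, matching the path shown above:

```shell
# Hypothetical helper (not part of EDirect) illustrating the archive layout.
pmid_to_path() {
  local padded
  # Zero-pad the PMID to 8 digits, e.g. 2539356 -> 02539356.
  padded=$(printf "%08d" "$1")
  # The first three digit pairs become nested folder names.
  echo "${padded:0:2}/${padded:2:2}/${padded:4:2}/$1.xml.gz"
}

pmid_to_path 2539356
# -> 02/53/93/2539356.xml.gz
```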

The local archive is a completely self-contained turnkey product, with no need for the user to download, configure, and maintain complicated third-party database software.

Set an environment variable in your configuration file(s) to reference a section of your external drive:

  export EDIRECT_LOCAL_ARCHIVE=/Volumes/external_drive_name/

Then run archive-pubmed to download the PubMed release files and distribute each record on the drive. The initial download may take around 6 hours, depending on your network connection speed, with initial archiving taking an additional 2 hours. Subsequent updates are incremental, and should finish in minutes.

Retrieving over 135,000 compressed PubMed records from the local archive with xfetch:

  esearch -db pubmed -query "PNAS [JOUR]" -pub abstract | xfetch |

takes about 70 seconds. Retrieving those records from NCBI's network service, with efetch -format xml, would take around 40 minutes.

Even modest sets of PubMed query results can benefit from using the local cache. A reverse citation lookup on 191 papers:

  esearch -db pubmed -query "Cozzarelli NR [AUTH]" | elink -cited |

requires 5 seconds to match 9784 subsequent articles. Retrieving them with separate decompression:

  xfetch -stream | gunzip -c |

takes about one second. Printing the names of all authors in those records:

  xtract -pattern PubmedArticle -block Author \
    -sep " " -tab "\n" -author LastName,Initials[1:1] |

allows creation of a frequency table that lists the authors who most often cited the original papers:

  sort-uniq-count-rank

Fetching from the network service would extend the 6-second running time to over 2 minutes.

LOCAL SEARCH INDEX

A similar strategy was used to create a local information retrieval system suitable for large data mining queries. Run archive-pubmed -index to populate retrieval index files from records stored in the local archive. The initial indexing will also take a few hours. Since PubMed updates are released once per day, it may be convenient to schedule reindexing to start in the late evening and run during the night.

For PubMed titles and primary abstracts, the indexing process deletes hyphens after specific prefixes, removes accents and diacritical marks, splits words at punctuation characters, corrects encoding artifacts, and spells out Greek letters for easier searching on scientific terms. It then prepares inverted indices with term positions, and uses them to build distributed term lists and postings files.

For example, the term list that includes "cancer" in the title or abstract would be located at:

  /pubmed/Postings/TIAB/c/a/n/c/canc.TIAB.trm

A query on cancer thus only needs to load a very small subset of the total index. The software supports expression evaluation, wildcard truncation, phrase queries, proximity searches, and partial matches.
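The same idea can be sketched for the postings layout. The helper below is hypothetical, not part of EDirect, and assumes a term of at least four letters; it spells out the first four letters of the term as single-letter folders, matching the path shown above:

```shell
# Hypothetical helper (not part of EDirect) illustrating the postings trie.
term_to_trie() {
  # The first four letters of the term form the file stem,
  # and each of those letters becomes a nested folder name.
  local stem="${1:0:4}"
  echo "${stem:0:1}/${stem:1:1}/${stem:2:1}/${stem:3:1}/${stem}.$2.trm"
}

term_to_trie cancer TIAB
# -> c/a/n/c/canc.TIAB.trm
```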

The xinfo, xsearch, xlink, and xfilter scripts provide access to the local search system.

Names of indexed fields, all terms for a given field, and terms plus record counts are shown by:

  xinfo -fields

  xinfo -terms SUBH

  xinfo -totals PROP

Terms are truncated with a trailing asterisk, and can be expanded to show individual postings counts:

  xinfo -count "catabolite repress*"

  xinfo -counts "catabolite repress*"

Query evaluation includes Boolean operations and parenthetical expressions:

  xsearch -query "(literacy AND numeracy) NOT (adolescent OR child)"

Adjacent words in title or abstract fields are treated as a contiguous phrase:

  xsearch -query "selective serotonin reuptake inhibitor [TITL]"

Each plus sign will replace a single word inside a phrase, and runs of tildes indicate the maximum distance between sequential phrases:

  xsearch -query "vitamin c + + common cold"

  xsearch -query "vitamin c ~ ~ common cold"

Ranked partial term matching is available in any field with -match:

  xsearch -match "tn3 transposition immunity [PAIR]" | just-top-hits 1

An exact substring match, without special processing of Boolean operators or indexed field names, can be obtained with -title (on the article title) or -exact (on the title or abstract):

  xsearch -title "Genetic Control of Biochemical Reactions in Neurospora."

MeSH identifier code, MeSH hierarchy key, and year of publication are also indexed, and MESH field queries are supported by internally mapping to the appropriate CODE or TREE entries:

  xsearch -query "C14.907.617.812* [TREE] AND 2015:2019 [YEAR]"

PMIDs processed through an external source can be reintroduced to a local query pipeline with xfilter:

  ... | xfilter -db pubmed -query "Raynaud Disease [MESH]" |

DATA ANALYSIS AND VISUALIZATION

All query commands return a structured message containing the database name and a list of UIDs, which can be piped directly to xfetch to retrieve the uncompressed records. For example:

  xsearch -query "selective serotonin ~ ~ ~ reuptake inhibit*" |
  xfetch |
  xtract -pattern PubmedArticle -num AuthorList/Author |
  sort-uniq-count -n | reorder-columns 2 1 |
  head -n 25 | align-columns -g 4 -a lr

performs a proximity search with dynamic wildcard expansion (matching phrases like "selective serotonin and norepinephrine reuptake inhibitors") and fetches 14,978 PubMed records from the local archive. It then counts the authors of each paper (a consortium is treated as a single author) and prints a frequency table showing how many papers have each number of authors.

The cumulative size of PubMed can be calculated with a running sum of the annual record counts. Exponential growth over time will appear as a roughly linear curve on a semi-logarithmic graph:

  xinfo -db pubmed -totals YEAR | print-columns '$2, $1, total += $1' |
  print-columns '$1, log($2)/log(10), log($3)/log(10)' |
  filter-columns '$1 >= 1800 && $1 < YR' | xy-plot annual-and-cumulative.png

The sharp jump after World War II was caused by several factors, including the release of declassified papers, a policy of expanding biomedical research in postwar America, and the introduction of computers that could keep up with the indexing of articles from a broader range of subjects.

NATURAL LANGUAGE PROCESSING

NLM's Biomedical Text Mining Group performs computational analysis to extract chemical, disease, and gene references from article contents (PMID 31114887). NLM indexing of PubMed records assigns Gene Reference into Function (GeneRIF) mappings (PMID 14728215).

Running archive-nlmnlp -index periodically (monthly) will automatically refresh any out-of-date support files and then index the connections in CHEM, DISZ, GENE, GRIF, GSYN, and PREF fields:

  xinfo -terms DISZ | grep -i Raynaud

  xinfo -counts "Raynaud* [DISZ]"

  xsearch -query "Raynaud Disease [DISZ]"

FOLLOWING CITATION LINKS

Running archive-nihocc -index will download the latest NIH Open Citation Collection monthly release and build CITED and CITES indices, the local equivalent of elink -cited and -cites commands.

Citation links are retrieved by piping one or more PMIDs to xlink -target:

  xsearch -db pubmed -query "Havran W* [AUTH]" |
  xlink -target CITED |

This returns PMIDs for 6670 articles that cite the original 97 papers. The results are then restricted to a range of recent years, and those records are fetched. The xtract -histogram shortcut builds a journal frequency table from the subsequent articles:

  xfilter -query "2020:2025 [YEAR]" | xfetch |
  xtract -pattern PubmedArticle -histogram Journal/ISOAbbreviation |
  sort-table -nr | head -n 10

The archive-pid -index command reads the PubMed local archive's incremental inverted index files, and builds a PMCID index that allows xlink to return PubMed Central identifiers from PMIDs.

ADDITIONAL EXPERIMENTAL ARCHIVES

New database domains are also easily added, with records obtained from public data resources.

Running archive-pmc -index downloads PMC release files, and collects primary author names, citation details, section titles, and full-text paragraphs. It then converts them to a more tractable form, and builds an archive from those derived records.

Similarly, archive-taxonomy -index archives novel records assembled from NCBI taxonomy data tables retrieved from the FTP site.

A new database is ready for use once its population script has finished all downloading, conversion, validation, caching, indexing, inversion, and postings steps.

USER-SPECIFIED TERM INDEX

Running custom-index with a PubMed indexer script and the names of the fields it populates:

  custom-index $( which idx-grant ) GRNT

integrates user-specified indices into the local search system. The idx-grant script:

  xtract -set IdxDocumentSet -rec IdxDocument -pattern PubmedArticle \
    -wrp IdxUid -element MedlineCitation/PMID -clr -rst -tab "" \
    -group PubmedArticle -pkg IdxSearchFields \
      -block PubmedArticle -wrp GRNT -element Grant/GrantID

has reusable boilerplate in its first three lines, and indexes PubMed records by Grant Identifier:

  ...
  <IdxDocument>
    <IdxUid>2539356</IdxUid>
    <IdxSearchFields>
      <GRNT>AI 00468</GRNT>
      <GRNT>GM 07197</GRNT>
      <GRNT>GM 29067</GRNT>
    </IdxSearchFields>
  </IdxDocument>
  ...

SOLID-STATE DRIVE PREPARATION

To initialize a solid-state drive for hosting the local archive on a Mac, log into an admin account, run Disk Utility, choose View -> Show All Devices, select the top-level external drive, and press the Erase icon. Set the Scheme popup to GUID Partition Map, and APFS will appear as a format choice. Set the Format popup to APFS, enter the desired name for the volume, and click the Erase button.

To finish the drive configuration, disable Spotlight indexing on the drive with:

  sudo mdutil -i off "${EDIRECT_LOCAL_ARCHIVE}"
  sudo mdutil -E "${EDIRECT_LOCAL_ARCHIVE}"

and turn off FSEvents logging with:

  sudo touch "${EDIRECT_LOCAL_ARCHIVE}/.fseventsd/no_log"

Also exclude the drive from being backed up by Time Machine or scanned by a virus checker.

Finally, in Apple -> System Settings -> Privacy & Security -> Full Disk Access, turn on the Terminal slide switch.

PYTHON INTEGRATION

Controlling EDirect from Python scripts is easily done with assistance from the edirect.py library file, which is included in the EDirect archive.

At the beginning of your program, import the edirect module with the following commands:

  #!/usr/bin/env python3

  import sys
  import os
  import shutil

  sys.path.insert(1, os.path.dirname(shutil.which('xtract')))
  import edirect

The first argument to edirect.execute is the Unix command you wish to run. It can be a string:

  ("efetch -db nuccore -id NM_000518.5 -format fasta")

or a sequence of strings, which allows a variable's value to be substituted for a specific parameter:

  accession = "NM_000518.5"
  (('efetch', '-db', 'nuccore', '-id', accession, '-format', 'fasta'))

An optional second argument accepts data to be passed to the Unix command through stdin. Multiple steps are chained together by using the result of the previous command as the data argument in the next command:

  seq = edirect.execute("efetch -db nuccore -id NM_000518.5 -format fasta")
  sub = edirect.execute("transmute -extract -1-based -loc 51..494", seq)
  prt = edirect.execute(('transmute', '-cds2prot', '-every', '-trim'), sub)

Data piped to the script itself is relayed by using "sys.stdin.read()" as the second argument.
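The calling convention can be mimicked with a minimal stand-in built on Python's subprocess module. This is only a sketch of the pattern, not the actual edirect.py implementation, and it chains ordinary Unix commands in place of EDirect programs:

```python
import shlex
import subprocess

def execute(cmd, data=""):
    # Sketch of the edirect.execute calling convention: accept a command
    # as a string or a sequence of strings, pass optional data through
    # stdin, and return the command's stdout as text.
    if isinstance(cmd, str):
        cmd = shlex.split(cmd)
    result = subprocess.run(list(cmd), input=data,
                            capture_output=True, text=True)
    return result.stdout.rstrip("\n")

# Chain two steps by feeding the first result to the second command.
upper = execute(("tr", "a-z", "A-Z"), "insulin")
back = execute("tr 'A-Z' 'a-z'", upper)
print(back)
# -> insulin
```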

Alternatively, the edirect.pipeline function can execute a string containing several piped commands:

  edirect.pipeline('''efetch -db nuccore -id NM_000518.5 -format gb |
                      xtract -insd CDS gene product feat_location''')

or can accept a sequence of individual command strings to be piped together for execution:

  edirect.pipeline(('efetch -db protein -id NP_000509.1 -format gp',
                    'xtract -insd Protein mol_wt sub_sequence'))

An edirect.efetch shortcut that uses named arguments is also available:

  edirect.efetch(db="nuccore", id="NM_000518.5", format="fasta")

To run a custom shell script, make sure the execute permission bit is set, supply the full execution path, and follow it with any command-line arguments:

  db = "pubmed"
  res = edirect.execute(("./datefields.sh", db), "")

PROGRAMMING

A program written in a compiled language is translated into a computer's native machine instructions, and will typically run much faster than an interpreted script. Piping FASTA data to the basecount binary executable (compiled from the basecount.go source code file, below):

  efetch -db nuccore -id J01749,U54469 -format fasta | basecount

will return rows containing an accession number followed by counts for each base:

  J01749.1    A 983    C 1210    G 1134    T 1034
  U54469.1    A 849    C 699     G 585     T 748

Programs in Google's Go language ("golang") start with package main and then import additional software libraries (many included with Go, others residing in public repositories like github.com):

  package main

  import (
      "cmp"
      "eutils"
      "fmt"
      "maps"
      "os"
      "slices"
  )

Each compiled Go binary has a single main function, which is where program execution begins:

  func main() {

The fsta variable is assigned to a data channel that streams individual FASTA records one at a time:

      fsta := eutils.FASTAConverter(os.Stdin, false)

The countLetters subroutine will be called with the identifier and sequence of each FASTA record:

      countLetters := func(id, seq string) {

An empty counts map is created for each sequence; its memory becomes eligible for garbage collection after the subroutine exits:

          counts := make(map[rune]int)

A for loop on the range of the sequence string visits each sequence letter. The map keeps a running count for each base or residue, with "++" incrementing the current value of the letter's map entry:

          for _, base := range seq {
              counts[base]++
          }

A sorted keys slice is produced by calling slices.SortedFunc. The alphabetical sort order is determined by the second argument, which is an anonymous function literal:

          keys := slices.SortedFunc(maps.Keys(counts),
              func(i, j rune) int { return cmp.Compare(i, j) })

The sequence identifier is printed in the first column:

          fmt.Fprintf(os.Stdout, "%s", id)

Iterating over the slice prints letters and base counts in alphabetical order, with tabs between columns:

          for _, base := range keys {
              num := counts[base]
              fmt.Fprintf(os.Stdout, "\t%c %d", base, num)
          }

A newline is printed at the end of the row, and then the subroutine exits, releasing the map and slice to the garbage collector:

          fmt.Fprintf(os.Stdout, "\n")
      }

The remainder of the main function uses a loop to drain the fsta channel, passing the identifier and sequence string of each successive FASTA record to the countLetters function. The main function then ends with a final closing brace:

      for fsa := range fsta {
          countLetters(fsa.SeqID, fsa.Sequence)
      }
  }

Save the following script to a file named build.sh, in the same directory as the basecount.go file. Adjust optional GOOS and GOARCH environment variables to cross-compile for a different platform:

  #!/bin/bash

  if [ ! -f "go.mod" ]
  then
    go mod init "$( basename "$PWD" )"
    echo "replace eutils => $HOME/edirect/eutils" >> go.mod
    go get eutils
  fi
  if [ ! -f "go.sum" ]
  then
    go mod tidy
  fi

  for fl in *.go
  do
    env GOOS=darwin GOARCH=arm64 go build -o "${fl%.go}" "$fl"
  done

The build script creates module files used to track dependencies and retrieve imported packages. It also computes the path for finding the local eutils helper library included with EDirect. Set the Unix execution permission bit for the build script and compile the program(s) by running:

  chmod +x build.sh
  ./build.sh

DOCUMENTATION

Documentation for EDirect is on the web at:

  http://www.ncbi.nlm.nih.gov/books/NBK179288

Instructions for obtaining an API Key are given in this NCBI blog post:

  https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities

Introductions to shell scripting for non-programmers, and to the Go programming language, are at:

  https://missing.csail.mit.edu/2020/shell-tools/
  https://cacm.acm.org/research/the-go-programming-language-and-environment/

Instructions for downloading and installing the Go compiler are at:

  https://golang.org/doc/install#download

To download the free Aspera Connect file transfer client, open the IBM Aspera Connect subsection at:

  https://www.ibm.com/products/aspera/downloads#cds

Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.

This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH.
