benben-miao
Articles71
Tags8
Categories0
NCBI Parser

NCBI Parser

约1.6k字 预计需要10分钟

1. NCBI-Parser introduction

NCBI-Parser: 最近根据数据获取需要,基于Python开发了新的终端程序NCBI-Parser(ncbiparser),方便科研工作者从NCBI子数据库快速下载匹配的搜索结果条目的GenBank文件及提取所有关键信息导出到表格文件中。本文介绍程序使用及开发技巧。

NCBI-Parser: Search any term from all NCBI databases to obtain matched entries, and download all NCBI GenBank format files and sequences corresponding to accessory number in genbank directory, finally export all Species Latin names, Accessory Number, Sequence Length, Division, Collection Country, Collection Date, Collected Worker, Identified Worker, and Reference information, etc. of the search records as tabular files.

Developer: benben-miao

Github: https://github.com/benben-miao

Source Code: https://github.com/benben-miao/NCBI-Parser

[PyPI: https://pypi.org/project/ncbiparser/]https://pypi.org/project/ncbiparser/)

2. ncbiparser Usage

# Install the package
pip install ncbiparser

# Run the command internally
ncbiparser --help
Usage: ncbi-parser.exe [OPTIONS]

  Description: Search any term from all NCBI databases to obtain matched
  entries, and download all NCBI GenBank format files and sequences
  corresponding to accessory number in genbank directory, finally export all
  Species Latin names, Accessory Number, Sequence Length, Division, Collection
  Country, Collection Date, Collected Worker, Identified Worker, and Reference
  information, etc. of the search records as tabular files.

  1. Get options and parameters help:

  ncbiparser --help

  2. Example (simple information). Users only need to input the search term,
  and the default search is in NCBI nucleotide database:

  ncbiparser --term "Acanthopagrus 16S" --output results.xls

  3. Example (complete information). Specify the NCBI database type, input the
  search term, specify the number of records to download and extract, and
  suggest setting a larger parameter max_record:

  ncbiparser --db_type nucleotide --term "Acanthopagrus 16S" --max_record 500
  --res_type gb --output results.xls

Options:
  --db_type TEXT     Please input NCBI database type, default: "nucleotide",
                     including: [pubmed, protein, nuccore, ipg, nucleotide,
                     structure, genome, annotinfo, assembly, bioproject,
                     biosample, blastdbinfo, books, cdd, clinvar, gap,
                     gapplus, grasp, dbvar, gene, gds, geoprofiles,
                     homologene, medgen, mesh, ncbisearch, nlmcatalog, omim,
                     orgtrack, pmc, popset, proteinclusters, pcassay, protfam,
                     pccompound, pcsubstance, seqannot, snp, sra, taxonomy,
                     biocollections, gtr]  [required]
  --term TEXT        Please input search term content, default: "Acanthopagrus
                     16S"  [required]
  --max_record TEXT  Please input max record number, default: "100", up to:
                     10000  [required]
  --res_type TEXT    Please input result type, default: "gb", including:
                     ["gb", "fasta", "gbwithparts", "gbcoll"]
  --output TEXT      Please input output full name (path + name + extension).
                     default="results.xls"
  --help             Show this message and exit.

3. NCBI GenBank format

LOCUS       LC649152                1081 bp    DNA     linear   VRT 03-SEP-2021
DEFINITION  Acanthopagrus bifasciatus Kuroshio Biological Research Foundation
            KBF-I 361 mitochondrial genes for 12S rRNA, tRNA-Val, 16S rRNA,
            partial and complete sequence.
ACCESSION   LC649152
VERSION     LC649152.1
KEYWORDS    .
SOURCE      mitochondrion Acanthopagrus bifasciatus (twobar seabream)
  ORGANISM  Acanthopagrus bifasciatus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Neoteleostei;
            Acanthomorphata; Eupercaria; Spariformes; Sparidae; Acanthopagrus.
REFERENCE   1
  AUTHORS   Sado,T., Fukuchi,T. and Miya,M.
  TITLE     Reference data for MiFish metabarcoding analysis
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1081)
  AUTHORS   Sado,T., Fukuchi,T. and Miya,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-AUG-2021) Contact:Tetsuya Sado Natural History Museum
            & Institute, Chiba; 955-2 Aoba-cho, Chuo-ku, Chiba, Chiba 260-8682,
            Japan URL :http://www2.chiba-muse.or.jp/NATURAL/
FEATURES             Location/Qualifiers
     source          1..1081
                     /organism="Acanthopagrus bifasciatus"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /specimen_voucher="Kuroshio Biological Research Foundation
                     KBF-I 361"
                     /db_xref="taxon:767411"
                     /PCR_primers="fwd_name: L-708-12S, fwd_seq:
                     ttayacatgcaagtatccgc, rev_name: H-1784-16SG, rev_seq:
                     ttcagctttcccttgcggtac"
     rRNA            <1..894
                     /product="12S ribosomal RNA"
     tRNA            895..967
                     /product="tRNA-Val"
     rRNA            968..>1081
                     /product="16S ribosomal RNA"
ORIGIN      
        1 acccccgtga aaatgcccta cagttccccg cccggaaaca aggagccggt atcaggcaca
       61 ttcaatttag cccacgacac cttgctcagc cacaccctca agggtactca gcagtgataa
      121 accttgacac ataagtgaaa acttgaatca gttaaagcta agtagggccg gtaaaactcg
      181 tgccagccac cgcggttata cgagaggccc aagttgttag aaatcggcgt aaagggtggt
      241 taagaataag attaaaatta aagccgaaca tctttagtag ctgttatacg ctttcaaaga
      301 caagaagccc aactgcgaaa gtagctttat attttctgaa cccacgaaag ctaaggtaca
      361 aactgggatt agatacccca ctatgcttag ccgtaaacat cgacagttta ttacattttc
      421 tgtccgcctg ggtactacaa gcattagctc aaaacccaaa ggacttggcg gtgctttaga
      481 cccacctaga ggagcctgtt ctagaaccga tattccccgt tcaacctcac ctctccttgc
      541 ctctcagcct atataccgcc gtcgttcagc ttaccctgtg aagggcaaaa agtaagcaaa
      601 attggcactg cccagtacgt caggtcgagg tgtagtcaat ggagtgggaa gaaatgggct
      661 acattccctt gtcttcaggg aactacgaat ggtgcactga aaatgtgtgc ctgaaggagg
      721 atttagcagt aagtagtaat ttagaatatt ctactgaagc cggctcttaa gcgcgcacac
      781 accgcccgtc actctccccg agactttaaa ttcacattaa ctaaaatatt aaatatcata
      841 gaggggaggc aagtcgtaac atggtaagtg taccggaagg tgtacttgga aaaccagcgc
      901 atagctaaac tagataaagc acctccctta cactgagaag atattcgtgc aaatcgaatt
      961 gccctgagcc tatcagctag ccctctaaca aaaaacaaca cacccccatc aattaacccc
     1021 caatgcactt acattaaatt aaacaaatca tttttccacc caagtatggg cgacagaaaa
     1081 g

4. Click usage

import click

@click.command()
@click.option('--db_type', type=str, default="nucleotide", required=True, nargs=1, 
            help='Please input NCBI database type, default: "nucleotide", including: [pubmed, protein, nuccore, ipg, nucleotide, structure, genome, annotinfo, assembly, bioproject, biosample, blastdbinfo, books, cdd, clinvar, gap, gapplus, grasp, dbvar, gene, gds, geoprofiles, homologene, medgen, mesh, ncbisearch, nlmcatalog, omim, orgtrack, pmc, popset, proteinclusters, pcassay, protfam, pccompound, pcsubstance, seqannot, snp, sra, taxonomy, biocollections, gtr]')
@click.option('--term', type=str, default="", required=True, nargs=1, 
            help='Please input search term content, default: "Acanthopagrus 16S"')
@click.option('--max_record', type=str, default="100", required=True, nargs=1, 
            help='Please input max record number, default: "100", up to: 10000')
@click.option('--res_type', type=str, default="gb", required=False, nargs=1, 
            help='Please input result type, default: "gb", including: ["gb", "fasta", "gbwithparts", "gbcoll"]')
@click.option('--output', type=str, default="results.xls", nargs=1, 
            help='Please input output full name (path + name + extension). default="results.xls"')

def run(db_type, term, max_record, res_type, output):

  """
  Description:
  Search any term from all NCBI databases to obtain matched entries, and download all NCBI GenBank format files and sequences corresponding to accessory number in genbank directory, finally export all Species Latin names, Accessory Number, Sequence Length, Division, Collection Country, Collection Date, Collected Worker, Identified Worker, and Reference information, etc. of the search records as tabular files. \n
  1. Get options and parameters help: \n
  ncbi-parser --help \n
  2. Example (simple information). Users only need to input the search term, and the default search is in NCBI nucleotide database: \n
  ncbi-parser --term "Acanthopagrus 16S" --output results.xls \n
  3. Example (complete information). Specify the NCBI database type, input the search term, specify the number of records to download and extract, and suggest setting a larger parameter max_record: \n
  ncbi-parser --db_type nucleotide --term "Acanthopagrus 16S" --max_record 500 --res_type gb --output results.xls \n
  """

  results_file = output

5. Pyinstaller usage

mamba create -n ncbiparser python=3.8
conda activate ncbiparser

which python
pip install biopython click
python ncbiparser.py --help

pip install pyinstaller
pyinstall -F ncbiparser.py -i favicon.ico

6. Package and publish ncbiparser to PyPI

6.1 Create setup.py file

#!/usr/bin/python
# _*_ coding:utf-8 _*_

from setuptools import setup, find_packages

### 1. Read the README.md contents
from os import path
this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
readme = 'ncbiparser/README.md'

### 2. Setup the package configure
setup(
    # 2.1 Package information
    name="ncbiparser",
    version="1.0.0",
    url="https://github.com/benben-miao/NCBI-Parser",
    keywords=["CLI", "Biology", "NCBI", "Parser"],
    description="Search and fetch the Species Latin names, Accessory Number, Sequence Length, Division, Collection Country, Collection Date, Collected Worker, Identified Worker, and Reference information, etc. of the search records as tabular files.",
    long_description=long_description,
    long_description_content_type='text/markdown',
    download_url="https://pypi.org/project/ncbi-parser",

    # 2.2 Developer information
    author="benben-miao",
    author_email="benben.miao@outlook.com",
    maintainer="benben-miao",
    maintainer_email="benben.miao@outlook.com",
    license="MIT License",

    # 2.3 Package configure
    # include_package_data=True,
    platforms="any",
    install_requires=['biopython', 'click'],
    # packages=find_packages(),
    packages=['ncbiparser'],
    entry_points={
        'console_scripts': ['ncbiparser=ncbiparser.ncbiparser:run']
    },
    classifiers=[
        "Programming Language :: Python :: 3",
        "Operating System :: OS Independent",
    ],
    data_files=[readme],
    # package_data={
    #     '':['examples/*.xlsx']
    # },
    include_package_data=True
    # exclude_package_data={
    # }
)

### 3. Package build, test and publish
# 3.1 Install setuptools with wheel and twine
# pip install --upgrade setuptools wheel twine

# 3.2 Check the setup.py content
# python setup.py check

# 3.2 Build source code tarball
# python setup.py sdist

# 3.3 Build wheel package and test
# python setup.py bdist_wheel
# pip install dist\*.whl

# 3.4 Test package and uninstall test version
# ncbiparser --help
# pip uninstall ncbiparser

# 3.5 Publish the package to PyPI
# twine upload dist/*

6.2 Create the MANIFEST.in file

include *.py *.md
recursive-include examples *.sh *.py *.ipynb *.txt *.xlsx *.gbk *.fasta *.seq

6.3 Commands

### 3. Package build, test and publish
# 3.1 Install setuptools with wheel and twine
pip install --upgrade setuptools wheel twine

# 3.2 Check the setup.py content
python setup.py check

# 3.2 Build source code tarball
python setup.py sdist

# 3.3 Build wheel package and test
python setup.py bdist_wheel
pip install dist\*.whl

# 3.4 Test package and uninstall test version
ncbiparser --help
pip uninstall ncbiparser

# 3.5 Publish the package to PyPI
twine upload dist/*
×