The PMC File Transfer Protocol (FTP) Service supports usage of the PMC Article Datasets with the following services:
Bulk Download
- Available for: PMC Open Access Subset, Author Manuscript Dataset, and Historical OCR Dataset
- Packages include: XML or plain text files packaged in compressed baseline and daily incremental packages, with each baseline containing hundreds of thousands of articles (Note: The Historical OCR Dataset is only available in plain text format.)
Individual Article Download
- Available for: PMC Open Access Subset only
- Packages include: XML, PDF (if present), media files, and supplementary materials for a single article
PDF Download
- Available for: PMC Open Access Subset only
- Individual PDFs of articles: only available for non-commercial use licensed articles
PMC ID Cross-referencing
- Cross reference any PMC article ID with identifiers such as PubMed IDs, DOIs, and Author Manuscript IDs
- File: PMC-ids.csv.gz, a file in the top-level FTP directory
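As a rough illustration of cross-referencing, the sketch below streams a gzipped CSV and returns the row that matches a given PMCID. The column names (`PMCID`, `PMID`, `DOI`) are assumptions about the header row and should be checked against the first line of the real PMC-ids.csv.gz; the helper name `find_ids` is hypothetical, and the sample data is made up.

```python
import csv
import gzip
import io


def find_ids(source, pmcid, key_column="PMCID"):
    """Return the first row (as a dict) whose key_column equals pmcid, or None.

    source may be a path to a .csv.gz file or a binary file object.
    key_column is an assumption about the header row; verify it first.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            if row.get(key_column) == pmcid:
                return row
    return None


# Tiny in-memory stand-in for the real file (made-up values).
sample = gzip.compress(b"PMCID,PMID,DOI\nPMC0000001,111,10.0000/example\n")
match = find_ids(io.BytesIO(sample), "PMC0000001")
missing = find_ids(io.BytesIO(sample), "PMC9999999")
```

Streaming row by row avoids loading the whole file into memory, which matters if the real file is large.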
Base FTP URL: https://ftp.ncbi.nlm.nih.gov/pub/pmc
*Tip* NCBI recommends setting the TCP buffer size to 32Mb for best performance. NCBI supports secure FTP via SFTP. For more information, please see https://ftp.ncbi.nlm.nih.gov/README.ftp.
If you have questions or comments about the PMC FTP Service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.
Bulk Download
If you are interested only in the metadata and text of an article or author manuscript, bulk download is likely the best option. Bulk packages group together hundreds of thousands of articles in XML or plain text formats in compressed packages (Note: The Historical OCR Dataset is only available in plain text format). If you are also interested in media files, supplementary materials, or PDFs, please see the sections on Individual Article Download and PDF Download.
Bulk Download Updates (April 2022)
In September 2021, PMC released new bulk download directory structures and packages on the FTP Service for two datasets: the PMC Open Access (OA) Subset and the Author Manuscript Dataset. The old bulk download structure remained in place until December 2021. During the week of December 5-11, the old bulk files were moved to sub-directories named "deprecated" under oa_bulk and manuscript, respectively. These "deprecated" directories were deleted on April 1, 2022.
Baseline Packages Update Schedule
New baseline packages will be created at least two times per year. Previous baseline and incremental packages and the accompanying file lists will be deleted whenever a new baseline is created.
New baselines will be created:
- as needed*
*PMC is sometimes required to suppress an article from public view for legal reasons if the case involves a legal injunction or a breach of patient privacy. In such cases, a new set of baseline packages will be created for the impacted dataset. This is not a frequent occurrence.
Directories Organized by Dataset, License Terms, and File Content Type
Bulk downloads are available on the FTP Service by dataset:
- PMC Open Access Subset - Bulk
- Author Manuscript Dataset - Bulk
- Historical OCR Dataset - Bulk
We have further divided the PMC Open Access Subset bulk packages into three groups based on available license terms:
- Commercial Use Allowed - CC0, CC BY, CC BY-SA, and CC BY-ND licenses
- Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND
- Other - no machine-readable license, no license, or a custom license
- PMC OA Subset - Commercial Use
- PMC OA Subset - Non-Commercial Use Only
- PMC OA Subset - Other
To access the complete PMC OA Subset you will need to retrieve ALL of the OA Subset packages. These groups are complementary rather than duplicative.
Each of these datasets or groupings is divided into separate directories by file content type: XML (xml/) and plain text (txt/). The baseline packages for each of these OA Subset groups and for the Author Manuscript Dataset are divided by PMCID range (e.g., PMC004XXXXXX) in order to keep package sizes reasonable.
The result is the following directory structure:
|_ manuscript/
|___ txt/
|___ xml/
|_ oa_bulk/
|___ oa_comm/
|_____ txt/
|_____ xml/
|___ oa_noncomm/
|_____ txt/
|_____ xml/
|___ oa_other/
|_____ txt/
|_____ xml/
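The directory layout can be turned into URLs with a small helper. This is a minimal sketch that assumes the tree sits directly under the base FTP URL given earlier; the names `DATASET_DIRS` and `bulk_dir_url` are hypothetical, and the Historical OCR Dataset is left out because its directory is not shown in the tree.

```python
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"  # base URL given in this document

# Directory for each dataset or license group, following the tree.
# Assumption: these paths are relative to BASE_URL.
DATASET_DIRS = {
    "manuscript": "manuscript",
    "oa_comm": "oa_bulk/oa_comm",
    "oa_noncomm": "oa_bulk/oa_noncomm",
    "oa_other": "oa_bulk/oa_other",
}


def bulk_dir_url(group, content_type):
    """Build the URL of the bulk directory for a group and content type."""
    if content_type not in ("xml", "txt"):
        raise ValueError("content_type must be 'xml' or 'txt'")
    if group not in DATASET_DIRS:
        raise ValueError(f"unknown group: {group!r}")
    return f"{BASE_URL}/{DATASET_DIRS[group]}/{content_type}/"


# To cover the complete OA Subset, all three license groups are needed.
oa_xml_dirs = [bulk_dir_url(g, "xml") for g in ("oa_comm", "oa_noncomm", "oa_other")]
```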
CSV and TXT formatted file lists are available for each package. The file lists have been updated to:
- include a flag indicating if an article has been retracted (yes/no, where yes = retracted and no = not retracted).
- bring the CSV and TXT file lists into sync (the CSV files had been updated with extra fields, but the TXT files of the current production file lists had not).
Note: Author manuscripts have different metadata information available than PMC OA Subset articles, so do not assume the same structure for the file lists for these two different datasets.
Sample Bulk File Names
- Baseline file list: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.filelist.csv
- Baseline: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz
- Incremental file list: oa_comm_xml.incr.2021-09-17.filelist.csv
- Incremental update: oa_comm_xml.incr.2021-09-17.tar.gz
In each of the sample file names above you can substitute various parts to get to the files you want, e.g.
- Replace oa_comm with oa_noncomm to get PMC OA Subset non-commercial use articles, or with oa_other to get PMC OA Subset articles without explicitly tagged Creative Commons licenses. Replace it with author_manuscript to get author manuscripts.
- Replace _xml with _txt to get plain text files instead of XML files
- Replace baseline with incr to switch from a baseline file to one of the daily incremental files; be sure to update the date and remove the PMC00#XXXXXX from the file name
- Replace PMC003XXXXXX with, for example, PMC008XXXXXX in baseline file names to get the articles in the specified grouping with PMCIDs in the range from PMC8000000 to PMC8999999; to get all articles you must retrieve all the PMCID ranges
- Replace the date (e.g. 2021-09-16) with the new baseline date if the baseline has been updated since this documentation was written; replace the date for incremental files with the date you want to retrieve
- Replace .csv with .txt as the file extension for the file list to get a tab separated plain text version of the file list
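The naming pattern in the sample file names can be sketched as a few helpers. The function names are hypothetical, and the three-digit zero padding in the range label is inferred from the samples (PMC003XXXXXX, PMC008XXXXXX) rather than stated as a rule, so treat it as an assumption.

```python
def range_label(pmcid):
    """Map a PMCID such as 'PMC8123456' to its range label, e.g. 'PMC008XXXXXX'.

    Assumption: each range covers one million IDs and the leading part is
    zero-padded to three digits, as the sample file names suggest.
    """
    if not pmcid.startswith("PMC"):
        raise ValueError("a PMCID must start with 'PMC'")
    millions = int(pmcid[3:]) // 1_000_000
    return f"PMC{millions:03d}XXXXXX"


def baseline_name(prefix, content_type, pmcid_range, date, suffix="tar.gz"):
    """Build a baseline file name; use suffix='filelist.csv' for the file list."""
    return f"{prefix}_{content_type}.{pmcid_range}.baseline.{date}.{suffix}"


def incremental_name(prefix, content_type, date, suffix="tar.gz"):
    """Build a daily incremental file name (no PMCID range in the name)."""
    return f"{prefix}_{content_type}.incr.{date}.{suffix}"


package = baseline_name("oa_comm", "xml", "PMC003XXXXXX", "2021-09-16")
file_list = baseline_name("oa_comm", "xml", "PMC003XXXXXX", "2021-09-16", "filelist.csv")
update = incremental_name("oa_comm", "xml", "2021-09-17")
```

The dates must match packages that actually exist on the server, since earlier baselines and incrementals are deleted when a new baseline is created.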
Individual Article Download (PMC Open Access Subset Only)
PMC Open Access Subset Individual Article Packages
If you only want to download some of the PMC OA Subset based on search criteria or if you want to download complete packages for articles that include XML, PDF, media, and supplementary materials, you will need to use the individual article download packages. To keep directories from getting too large, the packages have been randomly distributed into a two-level-deep directory structure. You can use the file lists in CSV or txt format to search for the location of specific files or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article metadata.
- Filenames: PMCXXXXXXX.tar.gz where the X's represent a specific PMCID
- File lists: oa_file_list.csv or oa_file_list.txt (located one level up, in the top-level PMC FTP directory)
The first line of each file list is the timestamp the file was written. Subsequent rows contain metadata for each article.
Each row is divided into 6 metadata fields for CSV (5 for TXT), delimited by comma characters in CSV files and tab characters in TXT files. For example:
oa_package/66/8b/PMC555938.tar.gz BMC Bioinformatics. 2005 Mar 7; 6:44 PMC555938 PMID15748298 CC BY
The fields in the files are:
- The fully qualified name of the .tar.gz file for an article
- The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
- PMC accession number (PMCID)
- Last updated timestamp (YYYY-MM-DD HH:MM:SS) (NOT INCLUDED in TXT files)
- PubMed ID (PMID)
- License type*
*The field value for “license type” can be any of the standard Creative Commons license variants (e.g., CC BY; CC BY-NC; CC BY-NC-ND) or “NO-CC CODE”. “NO-CC CODE” appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
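A minimal sketch of reading these file lists follows, based only on the layout described above: a timestamp on the first line, then six comma-separated fields (CSV) or five tab-separated fields (TXT). The dictionary keys and the function name `read_file_list` are hypothetical labels, and the timestamp in the sample is made up; real files may differ in details such as extra header rows.

```python
import csv
import io

# Field order follows the list above; TXT omits the last-updated timestamp.
CSV_FIELDS = ["file", "citation", "pmcid", "last_updated", "pmid", "license"]
TXT_FIELDS = ["file", "citation", "pmcid", "pmid", "license"]


def read_file_list(text, fmt):
    """Yield one dict per article row, skipping the leading timestamp line."""
    if fmt not in ("csv", "txt"):
        raise ValueError("fmt must be 'csv' or 'txt'")
    lines = io.StringIO(text)
    next(lines, None)  # first line is the timestamp the file was written
    if fmt == "csv":
        fields = CSV_FIELDS
        rows = csv.reader(lines)
    else:
        fields = TXT_FIELDS
        rows = (line.rstrip("\r\n").split("\t") for line in lines)
    for row in rows:
        # Rows with an unexpected number of fields are skipped, not guessed at.
        if len(row) == len(fields):
            yield dict(zip(fields, row))


sample_txt = (
    "2022-04-01 00:00:00\n"  # made-up timestamp line
    "oa_package/66/8b/PMC555938.tar.gz\tBMC Bioinformatics. 2005 Mar 7; 6:44"
    "\tPMC555938\tPMID15748298\tCC BY\n"
)
rows = list(read_file_list(sample_txt, "txt"))
```

Each resulting dict gives the package location under its `file` key, which is the practical way to find a package given the randomly distributed directory structure.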
PDF Download (PMC Open Access Subset Only)
PMC Open Access Subset PDF Files
Individual article PDF downloads are only available for non-commercial use licensed articles. To keep directories from getting too large, the article PDFs have been randomly distributed into a two-level-deep directory structure. You can use the oa_non_comm_use_pdf file lists in CSV or txt format to search for the location of specific files, or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article citation and license information, as well as the date the article was last updated in PMC.
- Filenames: filename.PMCXXXXXXX.pdf where filename is the original name of the source file and the X's represent a specific PMCID
- File lists: oa_non_comm_use_pdf.csv or oa_non_comm_use_pdf.txt (Located in the top level PMC FTP directory)
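The PDF naming pattern can be split back into its parts with a regular expression. This is only a sketch: the pattern is inferred from the filename description above, the function name `split_pdf_name` is hypothetical, and the example file names are made up.

```python
import re

# <original source name>.<PMCID>.pdf; the source name may itself contain dots.
PDF_NAME = re.compile(r"^(?P<source>.+)\.(?P<pmcid>PMC\d+)\.pdf$")


def split_pdf_name(name):
    """Return (source_name, pmcid) for a PDF file name, or None if it does not match."""
    match = PDF_NAME.match(name)
    if match is None:
        return None
    return match.group("source"), match.group("pmcid")


parts = split_pdf_name("article-file.PMC555938.pdf")  # made-up file name
```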
Articles in these datasets are made available consistent with either the terms of applicable article-level license statements or the funder’s policy. See PMC Copyright for more information.
How to Cite
See the individual dataset pages on how to cite the PMC Open Access Subset and PMC Author Manuscript Dataset.