FTP Service

The PMC File Transfer Protocol (FTP) Service supports usage of the PMC Article Datasets with the following services:

Bulk download

Available for: PMC Open Access Subset, Author Manuscript Dataset, and Historical OCR Dataset
Packages include: XML or plain text files in compressed baseline and daily incremental packages, with each baseline containing hundreds of thousands of articles (Note: The Historical OCR Dataset is only available in plain text format.)

Individual article download

Available for: PMC Open Access Subset only
Packages include: XML, PDF (if present), media files, and supplementary materials for a single article

PDF download

Available for: PMC Open Access Subset only
Individual PDFs of articles: only available for non-commercial use licensed articles

PMC ID Cross-referencing

Cross-reference any PMC article ID with identifiers such as PubMed IDs, DOIs, and Author Manuscript IDs
File: PMC-ids.csv.gz, a file in the top-level FTP directory
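
As a rough illustration of cross-referencing, the sketch below indexes a gzip-compressed CSV by one of its columns. The column names PMCID, PMID, and DOI and the sample values are assumptions made for this example; check the header row of the real PMC-ids.csv.gz before relying on them.

```python
import csv
import gzip
import io


def load_id_map(gz_bytes, key="PMCID"):
    """Index the rows of a gzip-compressed CSV by the value in the `key` column.

    `key` must match a name in the file's header row; "PMCID" is an assumption.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as fh:
        reader = csv.DictReader(fh)
        return {row[key]: row for row in reader if row.get(key)}


# Tiny synthetic stand-in for PMC-ids.csv.gz (made-up IDs, assumed header names).
sample = gzip.compress(
    b"PMCID,PMID,DOI\n"
    b"PMC0000001,11111111,10.1000/example.1\n"
    b"PMC0000002,,10.1000/example.2\n"
)

ids = load_id_map(sample)
print(ids["PMC0000001"]["PMID"])
```

The real file covers every PMC article, so holding the whole index in memory may be costly; streaming the rows and keeping only the IDs of interest is a reasonable alternative.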

Base FTP URL: https://ftp.ncbi.nlm.nih.gov/pub/pmc

Tip: If you are having difficulties with FTP, please consider trying the HTTPS protocol instead, e.g. https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/oa_comm_xml.incr.2021-09-17.filelist.csv. NCBI also supports secure FTP via SFTP.
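
A minimal sketch of fetching a file over HTTPS with the Python standard library follows. The helper names pmc_url and download are hypothetical, and the download function is defined but not called here because it needs network access.

```python
import shutil
import urllib.request

# Base URL as given on this page.
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"


def pmc_url(relative_path):
    """Join a path relative to the PMC FTP root onto the HTTPS base URL."""
    return BASE_URL + "/" + relative_path.lstrip("/")


def download(relative_path, dest):
    """Stream one file to disk in chunks instead of loading it into memory."""
    with urllib.request.urlopen(pmc_url(relative_path)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


print(pmc_url("PMC-ids.csv.gz"))
# prints https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
```

Streaming matters for the bulk packages, which can be large; a call such as download("PMC-ids.csv.gz", "PMC-ids.csv.gz") would write the file to the current directory.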

If you have questions or comments about the PMC FTP Service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.

Bulk Download

If you are interested only in the metadata and text of an article or author manuscript, bulk download is likely the right choice. Bulk packages group hundreds of thousands of articles in XML or plain text format into compressed packages (Note: The Historical OCR Dataset is only available in plain text format). If you are also interested in media files, supplementary materials, or PDFs, please see the sections on Individual Article Download and PDF Download.

Bulk Download Updates (April 2022)

In September 2021, PMC released new bulk download directory structures and packages on the FTP Service for two datasets: the PMC Open Access (OA) Subset and the Author Manuscript Dataset. The old bulk download structure remained in place until December 2021. During the week of December 5-11, the old bulk files were moved to subdirectories named "deprecated" under oa_bulk and manuscript, respectively. These "deprecated" directories were deleted on April 1, 2022.

Learn more about the updated FTP structures

Baseline Packages Update Schedule

New baseline packages will be created at least two times per year. Previous baseline and incremental packages and the accompanying file lists will be deleted whenever a new baseline is created.

New baselines will be created:

mid-June
mid-December
as needed*

*PMC is sometimes required to suppress an article from public view for legal reasons if the case involves a legal injunction or a breach of patient privacy. In such cases, a new set of baseline packages will be created for the impacted dataset. This is not a frequent occurrence.
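
Because a new baseline replaces all earlier packages, a local mirror has to notice a baseline change and start over. The sketch below shows one way to plan that, assuming the file name patterns shown under Sample Bulk File Names on this page. The function name plan_sync and its arguments are hypothetical, and the 2021-09-18 file name is made up for illustration.

```python
import re

# Patterns modeled on the sample bulk file names on this page.
BASELINE_RE = re.compile(
    r"^\w+\.PMC\d{3}X{6}\.baseline\.(?P<date>\d{4}-\d{2}-\d{2})\.tar\.gz$"
)
INCR_RE = re.compile(r"^\w+\.incr\.(?P<date>\d{4}-\d{2}-\d{2})\.tar\.gz$")


def plan_sync(listing, known_baseline=None, last_incremental=None):
    """Return (current_baseline_date, reset_needed, packages_to_fetch).

    `listing` is the list of file names in one bulk directory. Dates are
    ISO strings, so plain string comparison orders them correctly.
    """
    baselines, incrementals = [], []
    for name in listing:
        m = BASELINE_RE.match(name)
        if m:
            baselines.append((m["date"], name))
            continue
        m = INCR_RE.match(name)
        if m:
            incrementals.append((m["date"], name))
    if not baselines:
        raise ValueError("no baseline package found in listing")

    current = max(date for date, _ in baselines)
    if current != known_baseline:
        # New baseline: discard the local copy and fetch everything listed.
        wanted = [name for date, name in sorted(baselines) if date == current]
        wanted += [name for _, name in sorted(incrementals)]
        return current, True, wanted

    # Same baseline: fetch only the incrementals newer than the last one applied.
    wanted = [
        name
        for date, name in sorted(incrementals)
        if last_incremental is None or date > last_incremental
    ]
    return current, False, wanted


listing = [
    "oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz",
    "oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.filelist.csv",
    "oa_comm_xml.incr.2021-09-17.tar.gz",
    "oa_comm_xml.incr.2021-09-17.filelist.csv",
    "oa_comm_xml.incr.2021-09-18.tar.gz",  # made-up name for illustration
]
print(plan_sync(listing, "2021-09-16", "2021-09-17"))
```

File lists are ignored here because the patterns only accept names ending in .tar.gz; a real client would also need to obtain the directory listing itself, which this sketch leaves out.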

Directories Organized by Dataset, License Terms, and File Content Type

Bulk downloads are available on the FTP Service by dataset:

PMC Open Access Subset - Bulk
Author Manuscript Dataset - Bulk
Historical OCR Dataset - Bulk

We have further divided the PMC Open Access Subset bulk packages into three groups based on available license terms:

Commercial Use Allowed - CC0, CC BY, CC BY-SA, and CC BY-ND licenses
Non-Commercial Use Only - CC BY-NC, CC BY-NC-SA, CC BY-NC-ND
Other - no machine-readable license, no license, or a custom license

PMC OA Subset - Commercial Use
PMC OA Subset - Non-Commercial Use Only
PMC OA Subset - Other

To access the complete PMC OA Subset you will need to retrieve ALL of the OA Subset packages. These groups are complementary rather than duplicative.
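
The three license groups above could be mapped to their bulk directory names (oa_comm, oa_noncomm, oa_other, as used in the directory structure on this page) with a small lookup. This is only a sketch: the function name license_group is hypothetical, and the whitespace and case normalization is an assumption about how license strings might vary.

```python
# License lists taken from the three groups described above.
COMMERCIAL = {"CC0", "CC BY", "CC BY-SA", "CC BY-ND"}
NON_COMMERCIAL = {"CC BY-NC", "CC BY-NC-SA", "CC BY-NC-ND"}


def license_group(license_code):
    """Return the OA Subset bulk directory name for a license string."""
    # Collapse repeated whitespace and ignore case (an assumption for robustness).
    code = " ".join(license_code.upper().split()) if license_code else ""
    if code in COMMERCIAL:
        return "oa_comm"
    if code in NON_COMMERCIAL:
        return "oa_noncomm"
    # Missing, custom, or non-machine-readable licenses fall through to "other".
    return "oa_other"


print(license_group("CC BY"))        # prints oa_comm
print(license_group("CC BY-NC-ND"))  # prints oa_noncomm
print(license_group("NO-CC CODE"))   # prints oa_other
```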

Each of these datasets or groupings is divided into separate directories by file content type: XML (xml/) and plain text (txt/). The baseline packages for each of these OA Subset groups and for the Author Manuscript Dataset are divided by PMCID range (e.g., PMC004XXXXXX) in order to keep package sizes reasonable.

The result is the following directory structure:

|_ manuscript/
|___ txt/
|___ xml/
|_ oa_bulk/
|___ oa_comm/
|_____ txt/
|_____ xml/
|___ oa_noncomm/
|_____ txt/
|_____ xml/
|___ oa_other/
|_____ txt/
|_____ xml/

File Lists

There are csv and txt formatted file lists available for each package. The file lists have been updated to:

include a flag indicating if an article has been retracted (yes/no, where yes = retracted and no = not retracted).
bring the csv and txt file lists into sync (we found that the csv files in the current production file lists had been updated with extra fields, but the txt files had not).

Note: Author manuscripts have different metadata information available than PMC OA Subset articles, so do not assume the same structure for the file lists for these two different datasets.

Sample Bulk File Names

Baseline file list: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.filelist.csv
Baseline: oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz
Incremental file list: oa_comm_xml.incr.2021-09-17.filelist.csv
Incremental update: oa_comm_xml.incr.2021-09-17.tar.gz

In each of the sample file names above you can substitute various parts to get to the files you want, e.g.

Replace oa_comm with oa_noncomm to get PMC OA Subset non-commercial use articles, or with oa_other to get PMC OA Subset articles without explicitly tagged Creative Commons licenses. Replace it with author_manuscript to get author manuscripts.
Replace _xml with _txt to get plain text files vs. XML files
Replace baseline with incr to switch from a baseline file to one of the daily incremental files; be sure to update the date and remove the PMC00#XXXXXX from the file name
Replace PMC003XXXXXX with PMC008XXXXXX in baseline file names to get the articles in the specified grouping with PMCIDs in the range from PMC8000000 to PMC8999999; to get all articles you must retrieve all the PMCID ranges
Replace the date (e.g. 2021-09-16) with the new baseline date if the baseline has been updated since this documentation was written; replace the date for incremental files with the date you want to retrieve
Replace .csv with .txt as the file extension for the file list to get a tab separated plain text version of the file list
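
The substitutions above can be sketched as a small name builder. The function names range_label and bulk_name are hypothetical. The range calculation assumes the three-digit label is the PMCID number divided by one million, which matches the PMC008XXXXXX example; that IDs below one million map to PMC000XXXXXX is an extrapolation.

```python
def range_label(pmcid):
    """Map a PMCID such as 'PMC8123456' to a baseline range label."""
    digits = pmcid[3:] if pmcid.upper().startswith("PMC") else pmcid
    return f"PMC{int(digits) // 1_000_000:03d}XXXXXX"


def bulk_name(group, fmt, date, pmcid_range=None, filelist_ext=None):
    """Build a bulk package or file list name.

    group:        'oa_comm', 'oa_noncomm', 'oa_other', or 'author_manuscript'
    fmt:          'xml' or 'txt'
    date:         'YYYY-MM-DD' of the baseline or the incremental
    pmcid_range:  e.g. 'PMC003XXXXXX' for a baseline; None for an incremental
    filelist_ext: 'csv' or 'txt' for a file list; None for the .tar.gz package
    """
    kind = f"{pmcid_range}.baseline" if pmcid_range else "incr"
    suffix = f"filelist.{filelist_ext}" if filelist_ext else "tar.gz"
    return f"{group}_{fmt}.{kind}.{date}.{suffix}"


print(bulk_name("oa_comm", "xml", "2021-09-16", "PMC003XXXXXX"))
# prints oa_comm_xml.PMC003XXXXXX.baseline.2021-09-16.tar.gz
print(bulk_name("oa_comm", "xml", "2021-09-17", filelist_ext="csv"))
# prints oa_comm_xml.incr.2021-09-17.filelist.csv
print(range_label("PMC8123456"))
# prints PMC008XXXXXX
```

The builder does not check that a given file exists on the server; dates and ranges still have to be taken from the current directory contents.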

Individual Article Download (PMC Open Access Subset Only)

PMC Open Access Subset Individual Article Packages

If you only want to download some of the PMC OA Subset based on search criteria or if you want to download complete packages for articles that include XML, PDF, media, and supplementary materials, you will need to use the individual article download packages. To keep directories from getting too large, the packages have been randomly distributed into a two-level-deep directory structure. You can use the file lists in CSV or txt format to search for the location of specific files or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article metadata.

Filenames: PMCXXXXXXX.tar.gz where the X's represent a specific PMCID
File lists: oa_file_list.csv or oa_file_list.txt (located one level up, in the top-level PMC FTP directory)

The first line of each file list is the timestamp the file was written. Subsequent rows contain metadata for each article.

Each row is divided into 6 metadata fields in the CSV file (5 in the TXT file), delimited by commas in the CSV file and by tabs in the TXT file. For example, a row from the TXT file:

oa_package/66/8b/PMC555938.tar.gz BMC Bioinformatics. 2005 Mar 7; 6:44 PMC555938 PMID15748298 CC BY

The fields in the files are:

The fully qualified name of the .tar.gz file for an article
The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
PMC accession number (PMCID)
Last updated timestamp (YYYY-MM-DD HH:MM:SS) (NOT INCLUDED in TXT files)
PubMed ID (PMID)
License type*

*The field value for “license type” can be any of the standard Creative Commons license variants (e.g., CC BY; CC BY-NC; CC BY-NC-ND) or “NO-CC CODE”. “NO-CC CODE” appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
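
The field layout described above can be read with the standard csv module. In this sketch the short field names (file, citation, and so on) and the function name read_file_list are labels chosen for the example, not names from the files, and the timestamps in the sample data are made up. The article row reuses the example row shown earlier.

```python
import csv
import io

# Field order as described above; the TXT lists omit the last-updated timestamp.
CSV_FIELDS = ["file", "citation", "pmcid", "last_updated", "pmid", "license"]
TXT_FIELDS = [name for name in CSV_FIELDS if name != "last_updated"]


def read_file_list(text, fmt):
    """Yield one dict per article row, skipping the leading timestamp line."""
    if fmt == "csv":
        fields, delimiter, quoting = CSV_FIELDS, ",", csv.QUOTE_MINIMAL
    elif fmt == "txt":
        fields, delimiter, quoting = TXT_FIELDS, "\t", csv.QUOTE_NONE
    else:
        raise ValueError("fmt must be 'csv' or 'txt'")

    lines = io.StringIO(text, newline="")
    next(lines, None)  # first line records when the file was written
    for row in csv.reader(lines, delimiter=delimiter, quoting=quoting):
        if not row:
            continue  # tolerate blank lines
        if len(row) != len(fields):
            raise ValueError(f"expected {len(fields)} fields, got {len(row)}")
        yield dict(zip(fields, row))


# Sample data: the timestamps are made up; the article row is the example above.
txt_sample = (
    "2021-09-17 12:00:00\n"
    "oa_package/66/8b/PMC555938.tar.gz\tBMC Bioinformatics. 2005 Mar 7; 6:44"
    "\tPMC555938\tPMID15748298\tCC BY\n"
)
csv_sample = (
    "2021-09-17 12:00:00\n"
    "oa_package/66/8b/PMC555938.tar.gz,BMC Bioinformatics. 2005 Mar 7; 6:44,"
    "PMC555938,2021-01-01 00:00:00,PMID15748298,CC BY\n"
)

for row in read_file_list(txt_sample, "txt"):
    print(row["pmcid"], row["file"], row["license"])
```

Raising on an unexpected field count, rather than skipping the row, makes a format change visible instead of silently dropping articles.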

PDF Download (PMC Open Access Subset Only)

PMC Open Access Subset PDF Files

Individual article PDF downloads are only available for non-commercial use licensed articles. To keep directories from getting too large, the article PDFs have been randomly distributed into a two-level-deep directory structure. You can use the oa_non_comm_use_pdf file lists in CSV or txt format to search for the location of specific files, or you can use the OA Web Service API. The file lists and OA Web Service API also provide basic article citation and license information, as well as the date the article was last updated in PMC.

Filenames: filename.PMCXXXXXXX.pdf where filename is the original name of the source file and the X's represent a specific PMCID
File lists: oa_non_comm_use_pdf.csv or oa_non_comm_use_pdf.txt (Located in the top level PMC FTP directory)
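
Given the filename pattern above, the PMCID can be recovered from a PDF path with a regular expression. The function name split_pdf_name and the sample path are hypothetical and chosen only to illustrate the pattern.

```python
import re

# filename.PMCXXXXXXX.pdf: original source name, then the PMCID, then the extension.
PDF_NAME_RE = re.compile(r"^(?P<source>.+)\.(?P<pmcid>PMC\d+)\.pdf$")


def split_pdf_name(path):
    """Return (original_source_name, pmcid) for a PDF path or file name."""
    name = path.rsplit("/", 1)[-1]  # drop any directory components
    m = PDF_NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a PMC PDF file name: {name!r}")
    return m["source"], m["pmcid"]


# Hypothetical path; real locations come from the oa_non_comm_use_pdf file lists.
print(split_pdf_name("aa/bb/example-article.PMC0000001.pdf"))
# prints ('example-article', 'PMC0000001')
```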

License

Articles in these datasets are made available consistent with either the terms of applicable article-level license statements or the funder’s policy. See PMC Copyright for more information.

Contact

pubmedcentral@ncbi.nlm.nih.gov

How to Cite

See the individual dataset pages on how to cite the PMC Open Access Subset and PMC Author Manuscript Dataset.