Data repositories
This list is part of the Open Access Directory.
- This is a list of repositories and databases for open data.
- Please annotate the entries to indicate the hosting organization, scope, licensing, and usage restrictions (if any). If a repository is open in some respects but not others, please include it with an annotation rather than exclude it.
- If you're not sure whether a given dataset or data collection is open, post your query to Is It Open Data?
- Related lists in OAD: Disciplinary repositories (primarily for texts, not data).
- For news about data repositories, including some newly launched repositories not yet listed here, follow the oa.repositories.data tag of the Open Access Tracking Project.
- See also:
- Re3data.org. The re3data.org project intends to create a global registry of research data repositories.
- Recommended Data Repositories from Nature.
- FAIRsharing. The FAIRsharing project compares data repositories for their compliance with the FAIR principles and journal data-sharing policies.
Archaeology
- Also see Social sciences.
- Fasti Online. Subdivided into Excavation, Restoration, and Survey.
- Open Context. From the Alexandria Archive Institute.
Astronomy
- Also see Physics.
- Astrophysics Data System. From the Smithsonian Astrophysical Observatory (SAO) and National Aeronautics and Space Administration (NASA).
- National Space Science Data Center. From the US National Aeronautics and Space Administration (NASA).
Biology
- Also see BCO-DMO, Marine Biology data, listed with Marine Sciences repositories.
- Also see DataONE, Entrez databases, KNB, and PANGAEA, listed under Multidisciplinary repositories.
- The Arabidopsis Information Resource (TAIR). Maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana.
- BOND (Biomolecular Object Network Databank). From Unleashed Informatics.
- The Cell: An Image Library. A freely accessible, easy-to-search public repository of reviewed and annotated images, videos, and animations of cells from a variety of organisms, showcasing cell architecture, intracellular functionalities, and both normal and abnormal processes. The project relies on the cell biology community to populate the library. Its purpose is to advance research, education, and training, with the ultimate goal of improving human health.
- Databases at EBI. From the European Bioinformatics Institute (EBI). This is a web directory of the EBI databases. Also see the FTP interface.
- DataBasin. OA data in conservation. From the Conservation Biology Institute in partnership with Rhiza Labs.
- Dryad. An international repository of data underlying scientific and medical publications, particularly data for which no specialized repository exists. All material in Dryad is associated with a scholarly publication. Most data in the repository are associated with peer-reviewed articles, although data associated with non-peer-reviewed publications from reputable academic sources, such as dissertations, are also accepted. Dryad is a non-profit organization.
- Gene Expression Omnibus. High-throughput functional genomic data, including all array-based applications and some high-throughput sequencing data.
- Global Biodiversity Information Facility (GBIF). "Free and open access to biodiversity data." Data portal launched in 2007 by institutions in 17 countries under a non-binding inter-governmental agreement.
- Molecular Biology Databases. From Shirley Fung. A list of 34 databases with annotations to show their openness under six criteria. Also see her list of 7 databases which comply with the Science Commons Open Access Data Protocol.
- MorphoBank. "Homology of phenotypes over the web." Hosted by the State University of New York at Stony Brook.
- National Biological Information Infrastructure (NBII). A broad, collaborative program to provide increased access to data and information on the nation's biological resources. The NBII links diverse, high-quality biological databases, information products, and analytical tools maintained by NBII partners and other contributors in government agencies, academic institutions, non-government organizations, and private industry. (Note: The repository was terminated in the President's budget for Fiscal Year 2012.)
- PaleoBiology Database. "We are bringing together taxonomic and distributional information about the entire fossil record of plants and animals." From a large number of researchers at a large number of institutions.
- Planet. A network of European plant databases.
- Peptidome. For "tandem mass spectrometry peptide and protein identification data." From the US National Center for Biotechnology Information.
- RCSB Protein Data Bank. From the Research Collaboratory for Structural Bioinformatics (RCSB).
- TreeBASE. "A Database of Phylogenetic Knowledge." Released in March 2010 based on a prototype launched in 1994. Hosted by the Phyloinformatics Research Foundation.
- The Universal Protein Resource (UniProt). A comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.
Chemistry
- Also see BCO-DMO, Marine Biology data, listed with Marine Sciences repositories.
- Also see Entrez databases, listed under Multidisciplinary repositories.
- Cambridge Structural Database. From the Cambridge Crystallographic Data Centre (CCDC), a non-profit, charitable institution whose objectives are the general advancement and promotion of the science of chemistry and crystallography for the public benefit.
- ChemSpider. Hosted by the Royal Society of Chemistry.
- ChemStar. Maintained by India's National Chemical Laboratory and sponsored by India's Department for Scientific & Industrial Research.
- ChemSynthesis. A database of chemicals and their physical properties.
- ChemXSeer. Hosted by Pennsylvania State University.
- CrystalEye. From the Unilever Cambridge Centre for Molecular Informatics at the University of Cambridge.
- Crystallography Open Database. A joint project of the Mineralogical Society of America, Mineralogical Association of Canada, European Journal of Mineralogy, International Union of Crystallography, and the US National Science Foundation. Data are in the public domain.
- eCrystals. From the Southampton Chemical Crystallography Group and the EPSRC UK National Crystallography Service.
- NMRShiftDB. For "organic structures and their nuclear magnetic resonance (nmr) spectra." Distributed nodes from the EBI, the University of Mainz, and the Max Planck Institute for Chemical Ecology. Data are licensed under the GNU FDL.
- Open Notebook Science Solubility Challenge. Maintained by Jean-Claude Bradley, Rajarshi Guha, Andrew Lang and Cameron Neylon. A database of non-aqueous solubility measurements with links to lab notebook pages where experiments were recorded. The database can be searched via Web Query or alternate means.
- PubChem. From the U.S. National Center for Biotechnology Information of the National Institutes of Health (NIH).
- WorldWideMolecularMatrix. "An Open collection of information on small molecules." From the University of Cambridge.
- ZINC. "A free database of commercially-available compounds for virtual screening." From the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco.
Computer Science
- CiteSeerX. Provides its databases of nearly 2 million documents, with the associated texts and PDFs, for research.
- Cooperative Association for Internet Data Analysis (CAIDA). Archive of data for scientific analysis of network functions.
- GitHub keeps your public and private code available, secure, and backed up.
- Google Code Project Hosting. Project Hosting on Google Code provides a free collaborative development environment for open source projects. Each project comes with its own member controls, Subversion/Mercurial repository, issue tracker, wiki pages, and downloads section. Our project hosting service is simple, fast, reliable, and scalable, so that you can focus on your own open source development.
- Launchpad can host your project’s source code using the Bazaar version control system. We also import over 2000 CVS, SVN, Git and Mercurial projects, so you can use Bazaar with those too.
- SourceForge. 2.7 million developers create powerful software in over 260,000 projects. Our popular directory connects more than 46 million consumers with these open source projects and serves more than 2,000,000 downloads a day. SourceForge is where open source happens.
- SNAP Stanford Large Network Dataset Collection. The SNAP library has been actively developed since 2004 and is growing organically as a result of the project's research in the analysis of large social and information networks. The largest network analyzed so far using the library was the Microsoft Instant Messenger network from 2006, with 240 million nodes and 1.3 billion edges.
- KONECT (the Koblenz Network Collection) is a project to collect large network datasets of all types in order to perform research in network science and related fields, collected by the Institute of Web Science and Technologies at the University of Koblenz–Landau.
- Pajek's network data sources.
Energy
- DOE Data Explorer. From the US Department of Energy (DOE). Data generated by DOE-sponsored research.
- OpenEI: Open Energy Information. Freely available energy data, tools, models, and other resources.
Environmental sciences
- Also see BCO-DMO, Marine Biology data, listed with Marine Sciences repositories.
- Also see DataONE, KNB, and PANGAEA, listed under Multidisciplinary repositories.
- Also see Dryad, listed with Biology repositories.
- British Atmospheric Data Centre (BADC). From the Natural Environment Research Council (NERC). Many datasets are openly accessible but some are restricted.
- Climate Change Data Portal. From the Environment Department of the World Bank.
- Climate Data. A section within the Comprehensive Knowledge Archive Network of the Open Knowledge Foundation (OKF). A joint project of the OKF, Climate Code and Real Climate.
- California Water CyberInfrastructure. Hydrology data on California's watersheds. From the Berkeley Water Center.
- Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) Hydrologic Information System (HIS). An internet-based system to support the sharing of hydrologic data. It consists of databases connected over the internet through web services, as well as software for data discovery, access, and publication.
- The Marine Geoscience Data System (MGDS). Provides access to data portals for the NSF-supported Ridge 2000 and MARGINS programs, the Antarctic and Southern Ocean Data Synthesis, the Global Multi-Resolution Topography Synthesis, and the Seismic Reflection Field Data Portal.
- National Snow and Ice Data Center (NSIDC). Cryospheric datasets from ground field research and satellites.
- National Ecological Observatory Network (NEON). A joint project of 50+ US universities and laboratories.
- Polar Data Catalogue. A primarily Canadian archive of free RADARSAT imagery as well as Arctic, Antarctic, and other cryospheric data sets covering a range of disciplines, from natural sciences and policy to health and social sciences.
Geology
- Also see PANGAEA, listed under Multidisciplinary repositories.
- GSA Data Repository. From the Geological Society of America.
- IRIS (Incorporated Research Institutions for Seismology). From 100+ US universities and the National Science Foundation.
Geosciences and geospatial data
- Also see DataONE and PANGAEA, listed under Multidisciplinary repositories.
- Commons of Geographic Data. "This site is intended for any data in any format that can be referenced to location on the earth." From the University of Maine.
- GeoCommons. From FortiusOne.
- GeoNames. A database of placenames, under a CC-BY license. Founded by Marc Wick.
- The Geosciences Network (GEON) project is a collaboration among a dozen PI institutions and a number of other partner projects, institutions, and agencies to develop cyberinfrastructure in support of an environment for integrative geoscience research. GEON is funded by the NSF Information Technology Research (ITR) program.
- National Geographic Data Center. Archive of national and international marine environmental and ecosystem datasets.
- The National Space Science Data Center serves as the permanent archive for NASA space science mission data. "Space science" means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As permanent archive, NSSDC teams with NASA's discipline-specific space science "active archives" which provide access to data to researchers and, in some cases, to the general public.
- Also see Polar Data Catalogue, listed with Environmental sciences repositories.
- ShareGeo. Integrating the older GRADE (Geospatial Repository for Academic Deposit and Extraction) repository. From EDINA.
Linguistics
- See the 40+ members of the Open Language Archives Community (OLAC).
- TROLLing. Hosted by UiT. TROLLing "is designed as an archive of linguistic data and statistical code. The archive is open access, which means that all information is available to everyone. All postings are accompanied by searchable metadata that identify the researchers, the languages and linguistic phenomena involved, the statistical methods applied, and scholarly publications based on the data (where relevant). Linguists worldwide are invited to post datasets and statistical models used in linguistic research."
Marine sciences
- Also see DataONE and PANGAEA, listed under Multidisciplinary repositories.
- BCO-DMO. The Biological and Chemical Oceanography Data Management Office provides access to data sets contributed by investigators funded by the Biological and Chemical Oceanography sections of the US National Science Foundation (NSF).
- Naval Oceanography Portal Data Services. From the United States Naval Observatory (USNO).
- SeaDataNet. Funded by the EU and coordinated by Institut Français de Recherche pour l'Exploitation de la Mer (IFREMER).
Medicine
- Also see Entrez databases, listed under Multidisciplinary repositories.
- Also see Dryad, listed with Biology repositories.
- GenBank. From the U.S. National Center for Biotechnology Information of the National Institutes of Health.
- Gene Expression Omnibus. From the U.S. National Center for Biotechnology Information of the National Institutes of Health.
- OpenTrials. OpenTrials is a repository of clinical trial data hosted by Open Knowledge International.
- The Health and Medical Care Archive (HMCA) is the data archive of the Robert Wood Johnson Foundation (RWJF), the largest philanthropy devoted exclusively to health and health care in the United States. Operated by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan, HMCA preserves and disseminates data collected by selected research projects funded by the Foundation and facilitates secondary analyses of the data. The data collections in HMCA include surveys of health care professionals and organizations, investigations of access to medical care, surveys on substance abuse, and evaluations of innovative programs for the delivery of health care. Our goal is to increase understanding of health and health care in the United States through secondary analysis of RWJF-supported data collections.
- MIRAGE (Middlesex medical Image Repository with a CBIR ArchivinG Environment). From JISC and Middlesex University.
- Melanoma Molecular Map Project. On melanoma biology and treatment.
- National Center for Biotechnology Information (NCBI). Advances science and health by providing access to biomedical and genomic information.
- NeuroMorpho. Neuronal morphology data. From the Krasnow Institute for Advanced Study at George Mason University.
- Project Data Sphere, LLC, is a repository to broadly share, integrate and analyze historical, de-identified, patient-level data from academic and industry cancer Phase II-III clinical trials. Access to the Project Data Sphere platform is available to researchers affiliated with life science companies, hospitals and institutions, as well as independent researchers, at no cost and without requiring a research proposal.
Multidisciplinary repositories
- Also see Social sciences.
- Also see BCO-DMO, Marine Biology data, listed with Marine Sciences repositories.
- 3TU.Datacentre. A consortial data repository for Delft University of Technology, Eindhoven University of Technology and the University of Twente.
- Data Archiving and Networked Services. Dutch research data in the humanities and social sciences. From the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Netherlands Organisation for Scientific Research (NWO).
- DataONE. An international federation of data repositories containing earth observations data, including data from fields such as ecology, biology, evolution, and environmental sciences such as hydrology, oceanography, and atmospheric science. Hundreds of field stations, universities, and government agencies participate through the DataONE Member Nodes.
- The Dataverse Network. From Harvard's Institute for Quantitative Social Science (IQSS).
- Also see Dryad, listed with Biology repositories.
- Edinburgh DataShare. Hosted by Edinburgh University Data Library. A repository for data produced by research at the University of Edinburgh.
- Entrez databases. A directory of chemical, biochemical, biomedical, and medical databases from the U.S. National Center for Biotechnology Information of the National Institutes of Health.
- FigShare. Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures.
- KNB. The Knowledge Network for Biocomplexity (KNB) is an international data repository containing ecology, biology, and environmental science data with a global distribution. The KNB is a grass-roots partnership of collaborating field stations, laboratories, and research networks that openly publish and share data. Founding partners include the National Center for Ecological Analysis and Synthesis (NCEAS) and the Long-term Ecological Research Network (LTER). The KNB is a Member Node within the DataONE data federation.
- KPBC. Regional academic repository for data in all fields, based in Poland.
- Open Commons Consortium (OCC). The OCC is a not-for-profit organization that manages and operates cloud computing and data commons infrastructure to support scientific, medical, health care, and environmental research. OCC members span the globe and include over 30 universities, companies, government agencies, and national laboratories.
- Open Science Data Cloud (OSDC). The OSDC is a data science ecosystem in which researchers can house and share their own scientific data, access complementary public datasets, build and share customized virtual machines with whatever tools necessary to analyze their data, and perform the analysis to answer their research questions. It is a one-stop shop for making scientific research faster and easier.
- Open Science Framework (OSF). Serves as a scholarly commons for documentation, files, collaboration, and connecting to services for research outputs.
- PANGAEA. "PANGAEA" stands for "Publishing Network for Geoscientific & Environmental Data". Hosted by the Alfred Wegener Institute for Polar and Marine Research and the University of Bremen's Center for Marine Environmental Sciences. Open to deposits from any scientist. Most datasets are open; some are restricted.
- Public Data Sets on AWS. From Amazon Web Services. The site already hosts OA datasets in biology, chemistry, and economics, and is willing to host them in any field.
- Scholars Portal Dataverse. A data repository hosted by Scholars Portal, a consortial service of the Ontario Council of University Libraries in Canada. Open to deposits from any user across all fields of research.
- Science 3.0 Open Data. A repository for RDF datasets in the public domain, in any field. From Science 3.0.
- USU Repository. From the University of Sumatera Utara, Medan, Indonesia.
- UPSpace. The University of Pretoria Research Repository, South Africa.
Physics
- Also see Astronomy.
- Blue Obelisk Data Repository. Repository of isotope masses, under MIT license. From the Blue Obelisk. Described in 10.1021/ci050400b.
- HEP Data. The data comprise total and differential cross sections, structure functions, fragmentation functions, distributions of jet measures, polarisations, etc., from a wide range of interactions.
- CERN Scientific Information. Online particle physics data and information.
- NIST Atomic Spectra Database (ASD). Contains data for radiative transitions and energy levels in atoms and atomic ions. Data are included for observed transitions of 99 elements and energy levels of 56 elements.
Social sciences
- Also see Multidisciplinary repositories.
- Association of Religion Data Archives. Coverage includes international surveys, U.S. church membership data, and U.S. surveys.
- Australian Social Science Data Archive. From the Australian Demographic and Social Research Institute at the Australian National University.
- CESSDA Data Portal. From the Council of European Social Science Data Archives (CESSDA).
- Databrary. A repository for sharing and reusing research video data and related metadata in the developmental and learning sciences. Hosted at New York University with support from The Pennsylvania State University.
- Digital Repositories E-Science Network (DReSNeT). From the UK Engineering & Physical Sciences Research Council (EPSRC). A network of social science repositories for texts and data.
- Economic and Social Science Data Service. From the UK Data Archive (UKDA) and Institute for Social and Economic Research (ISER), University of Essex; Manchester Information and Associated Services (MIMAS), and the Cathie Marsh Centre for Census and Survey Research (CCSR), University of Manchester. Access to data requires registration.
- ICPSR (Inter-University Consortium for Political and Social Research). At the University of Michigan.
- National Archive of Criminal Justice Data. Holds over 700 data collections relating to criminal justice.
- Roper Center for Public Opinion Research. Data from surveys of public opinion from the 1930s to the present.