Charting a FAIR Direction for the US Government Information Ecosystem
Continuing our exploration of research data management and its overlaps with the wider government information sphere, we now turn to the FAIR Principles as a framework for exploring how well government information meets public needs for current and future use.
I. FAIR Principles
II. Federal Information Dissemination Practices
III. How FAIR is Federal Information?
IV. More FAIR Federal Information
V. Concluding Thoughts and Next Steps
FAIR Principles
The FAIR Principles for research data management and stewardship were first formally articulated in a 2016 article in Scientific Data, a journal in the Nature portfolio, as guiding principles that provide a starting framework. These principles recognize a key challenge: research datasets are often shared in a form that is unusable, unpreservable, or undocumented, and are not curated in a way that enables validation and/or reproduction of research results. FAIR stands for findable, accessible, interoperable, and reusable. These principles serve as guides for best practices in the curation, transfer, sharing, and preservation of research data.
Findability ensures that the data has persistent identifiers, rich metadata, and can be located in a searchable resource, such as a search engine or a registry.
Accessibility is focused on the ability of a user to access the data and/or its associated metadata. In most cases this means that the user can access at minimum the metadata through a standard internet protocol rather than proprietary tools. Furthermore, metadata should persist even if the data is of limited availability or entirely unavailable.
Interoperability is a key area. At minimum, the (meta)data needs to be machine readable without the need for specialized algorithms or translators. For instance, metadata should use an interoperable schema like Dublin Core. In addition, variables in datasets or metadata fields should use controlled vocabulary as much as possible. Finally, connections between datasets need to be clear, including citations to predecessor data.
Reusability focuses on the ability of a person to decide whether a dataset is useful to them. This requires not just descriptive metadata but also contextual metadata, including how the dataset was created, under what conditions, etc. Additionally, licensing should be clear.
For a helpful in-depth overview, see “Interpreting FAIR” from the GO FAIR Foundation.
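As a minimal sketch of what a machine-readable, interoperable record might look like, the Python snippet below serializes a handful of fields as simple Dublin Core XML. The dataset, agency name, and example.org identifiers are hypothetical placeholders, and the bare "record" wrapper element is an assumption of this sketch rather than part of any particular repository's schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Serialize (element, value) pairs as a simple Dublin Core XML string."""
    root = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, chosen only to touch each FAIR principle.
EXAMPLE_FIELDS = [
    ("title", "Example streamflow measurements"),
    ("creator", "Example Agency"),
    # Findable: a persistent, globally unique identifier.
    ("identifier", "https://example.org/id/dataset-0002"),
    # Interoperable: a subject term taken from a controlled vocabulary.
    ("subject", "Streamflow"),
    # Interoperable: an explicit link to predecessor data.
    ("relation", "https://example.org/id/dataset-0001"),
    # Reusable: a clear rights statement.
    ("rights", "Public domain"),
]

if __name__ == "__main__":
    print(build_dc_record(EXAMPLE_FIELDS))
```

Each field maps to a point made above: the identifier supports findability, the controlled subject term and the relation to predecessor data support interoperability, and the rights statement supports reusability.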
Federal Information Dissemination Practices
We explore federal information dissemination practices because the U.S. government, as one of the largest publishers of information and research data in the world, could have a significant influence on the application of FAIR principles. FAIR Principles are increasingly part of the practices that federal agencies take into consideration for improving public access to federally supported and federally produced research data. Some of these practices have been developed following the guidance of the Office of Science and Technology Policy (OSTP); however, as we’ve noted before, government information extends far beyond research data.
Per the Open Government Data Act of 2018, federal data produced for public access and use must be published using standardized, machine-readable formats, and registered in Data.gov, the U.S. government’s open data registry. However, full compliance with these requirements still awaits implementation guidance from the Office of Management and Budget (OMB), as noted in a May 2023 update to OMB priority open recommendations from the Government Accountability Office (GAO). While other guidelines such as the “Desirable Characteristics of Data Repositories for Federally Funded Research” apply to data repositories, the majority of government information resources are published directly to agency websites and only later may be curated for long-term access.
Dissemination of all types of public information, including data, is nevertheless subject to a variety of requirements and guidelines; for a few examples, see OMB Circular A-130 and Section508.gov. Once information is published online, its long-term stewardship falls to a myriad of responsible agencies, including the National Archives and Records Administration (NARA), the U.S. Government Publishing Office (GPO), and more. There are also non-governmental collaborative projects like the End of Term Crawl Project that seek to add assurances to the preservation of the federal government web domain.
How FAIR is Federal Information?
The FAIR Principles guide better practices with research data, as shown in a myriad of practical use cases from FAIR-IMPACT. The intent and impact of FAIR also encompass the tools and services used to access and use data; for an example, see FAIR Assessment Checklist for Data Repositories.
Because all born-digital (and digitally reformatted) government information is itself data, we want to explore the potential value of FAIR for understanding the current – messy! – arena for federal information. First, though, we need to narrow our view to data that has some form of publicly accessible metadata. This immediately excludes the majority of federally produced information published directly to the web, including HTML, content distributed in PDF, Word, or PowerPoint formats, audiovisual and multimedia content, social media posts, and much more. Unless the content has been subsequently captured for a catalog or database, it cannot meet FAIR criteria in anything more than an accidental manner.
The vast majority of federal government information produced for public access is not published into a repository, but there are different finding tools. For example, GPO catalogs federal publications for its Catalog of Government Publications (CGP). While these publications represent only a portion of the digital National Collection of U.S. Government Public Information, materials that are included in the CGP can be evaluated with respect to FAIR Principles.
Let’s look at “Artemis Plan : NASA’s Lunar Exploration Program overview” – CGP system number 001171263. This is the metadata for NASA publication NP-2020-05-2853-HQ, which is currently available from NASA’s Artemis Mission site.
FINDABLE: The metadata is fairly rich, including provenance information and relevant controlled subject headings. The CGP record includes a system number, but, as with most library online public access catalogs, the record link is system-dependent. The publication itself is linked with a PURL, which can be updated when the location of the document changes, but does not meet all criteria for a persistent identifier. (For more on this, see Report of the Depository Library Council’s Working Group on Exploring the Durability of PURLs and Their Alternatives.) The PURL is included in the metadata but only as an access point, not as an identifier.
ACCESSIBLE: Authorized libraries can retrieve CGP metadata using the Z39.50 protocol, but there is no open and publicly available endpoint. Instead, GPO maintains a GitHub repository that is occasionally updated to match the database.
INTEROPERABLE: GPO uses current library industry standards for cataloging. Its metadata is expressly meant to be used in other library catalogs: GPO distributes its metadata to FDLP libraries to be ingested into their local catalogs. The extent to which metadata created for a library catalog meets FAIR Principles is a deeper concern, however, and depends in part on the machine readability of the RDA metadata. See for example Koster, Lukas, and Saskia Woutersen-Windhouwer. “FAIR Principles for Library, Archive and Museum Collections: A Proposal for Standards for Reusable Collections.” The Code4Lib Journal, no. 40 (May 4, 2018). https://journal.code4lib.org/articles/13427.
REUSABLE: The metadata primarily describes the content, though there is an indication of the digital object format in its source (“Description based on online resource, PDF version”). The copyright status of the work is not indicated in the record; even the work itself does not clearly indicate that it is a work of the U.S. government.
So, is the CGP FAIR? Publications that are cataloged by GPO come closer to the standards articulated in the FAIR Principles than uncataloged web content does; however, there could be significant improvements in each area that would in turn improve public access to these resources for the long-term future.
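One way to keep a walk-through like this comparable across records is to write the findings down as structured data. The sketch below transcribes the CGP observations above into a small Python checklist and tallies them. The criterion wording and the four-level scale ("met", "partial", "unmet", "unclear") are assumptions of this sketch, not an official FAIR metric or a GPO assessment.

```python
from collections import Counter

# Findings transcribed from the walk-through of the Artemis Plan record.
# Status labels are a simplification introduced for this sketch.
CGP_FINDINGS = {
    "findable": [
        ("rich metadata with controlled subject headings", "met"),
        ("system-independent link to the record", "unmet"),
        ("persistent identifier for the publication", "partial"),  # PURL
    ],
    "accessible": [
        # Z39.50 retrieval is limited to authorized libraries.
        ("metadata retrievable over a standard protocol", "partial"),
        ("open, publicly available endpoint", "unmet"),
    ],
    "interoperable": [
        ("standard library cataloging practices", "met"),
        # The source leaves the machine readability of RDA metadata open.
        ("machine-readable beyond library catalogs", "unclear"),
    ],
    "reusable": [
        ("technical details about the digital object", "partial"),
        ("copyright status stated in the record", "unmet"),
    ],
}


def summarize(findings):
    """Count statuses per principle."""
    return {
        principle: Counter(status for _, status in checks)
        for principle, checks in findings.items()
    }


def gaps(findings):
    """List every criterion that is not fully met, grouped by principle."""
    return {
        principle: [name for name, status in checks if status != "met"]
        for principle, checks in findings.items()
    }


if __name__ == "__main__":
    for principle, names in gaps(CGP_FINDINGS).items():
        print(f"{principle}: {', '.join(names)}")
```

Recording gaps rather than a single score keeps the focus on where improvements are possible, which matches the spirit of the assessment above.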
More FAIR Federal Information
The CGP is just one example from the expansive federal information landscape. Let’s turn now to examples for each FAIR criterion that can better illustrate the current variety of information practices across federal agencies. With so much variety, it is challenging to characterize current federal information dissemination practices with respect to whether they are FAIR enabling; however, these examples illustrate some of the potential for exploration.
Findable
Some government repositories or databases may meet some criteria for findability but lack others – for example, there may be rich descriptive metadata, but no persistent identifier, or there may be a persistent identifier but the contents are not searchable outside of the database, etc.
For example, the USGS Publications Warehouse provides rich metadata in multiple formats, including RIS and Dublin Core XML. Each publication is associated with a DOI that resolves to the metadata record, which indicates the formats available for full text downloads (or provides a second DOI link to the publisher’s website, if not published by USGS). Author information includes ORCID iDs when available. Search results can be downloaded in standard formats or accessed via RSS, and the database can be queried using a public API.
Accessible
All publicly available federal databases can at a minimum be accessed by users over standard internet protocols (HTTPS, FTP, SFTP, etc.); however, this access may not persist over a long period of time unless the agency hosting it has committed to provide persistent access. When data are deleted, practices vary widely as to whether the metadata remain available.
One notable example of providing metadata regardless of data availability is ResearchDataGov.org, a product created by ICPSR under contract with the National Science Foundation and in partnership with federal statistical agencies. The purpose of this database is to make microdata and other research data discoverable, regardless of the access mechanism. Many of the datasets listed here can only be accessed through rigorous security protocols at specific research data centers.
Interoperable
Government information available in repositories is typically described with standardized metadata, which may include controlled subject headings and name authorities. However, with an unknown number of federal agencies and offices operating their own repositories and databases, including some that can best be described as archaic legacy systems, interoperability is the exception rather than the rule. Still, there are systems that are designed to improve interoperability of available information resources.
For example, FRED economic data from the Federal Reserve Bank of St. Louis provides 823,000 US and international time series from 114 sources that have been fully regularized and made available through its web interface.
Reusable
Documentation can be thin. Although the majority of informational works are produced in standard formats (PDF, PPTX, etc.), they typically lack metadata attributes that provide the technical provenance for the files. Most works of the U.S. government are not protected by copyright within the U.S., but there are often intermingled works with varying copyright status, and some data may require a cost recovery fee to access. Still, the bulk of publications and data are available for reuse from a copyright perspective, and federal government information corpora are often open to reuse.
NASA Earthdata has Data Pathfinders that are designed to guide users to groups of databases that can support their inquiries. Each pathfinder includes data archives and active sensors related to the topic, and more specific topics along with deep examples of how the data are used for real-world problem solving.
Concluding Thoughts and Next Steps
The FAIR principles have become extremely important for assuring that research data funded by the US government is curated and made available. However, the federal government information landscape is expansive, and research data is only one segment of this space.
The wider adoption of FAIR principles we see now is a result of over a decade of policy initiatives designed to address access to federally funded research. Further application outside of the immediate research data context would likewise require policy motivation. By laying groundwork for a proof-of-concept exploration of the potential applicability of FAIR-enabling processes to a broader swath of government-produced information, we hope to better understand whether a similar policy direction for a broader scope of federal information may lead to better long-term access for born-digital public information. That is only FAIR 🙂