As the volume of digital information grows, accessing and analyzing large collections of government documents has become increasingly difficult. The End of Term Web Archive (EOTWA), initiated in 2008 during George W. Bush’s second term and now covering materials up to 2024, preserves the digital footprints of United States presidential administrations. The archive hosts millions of PDF files containing a wide range of content, including images, textual documents, graphs, and redacted pages. While invaluable for historians, journalists, and the public, the sheer volume and diversity of this material make efficient information retrieval difficult.
Recognizing this obstacle, a research team led by the University of Washington has developed GovScape, an advanced multimodal search system that revolutionizes searching within this vast collection of government PDFs. GovScape harnesses cutting-edge artificial intelligence to scan and index tens of millions of pages, providing users with powerful tools to conduct searches not merely based on simple keyword matches but also on semantic and visual content. This enables the retrieval of relevant documents even when the user’s search terms do not appear explicitly within the documents, a feature especially significant for navigating complex and heterogeneous government data.
Technically, GovScape operates by segmenting each PDF into individual pages, rendering each page as an image, and extracting its textual content. This step matters because government documents often blend text, images, charts, and other visual elements that conventional search technologies handle poorly. The system then uses efficient AI models to generate embeddings, numerical representations that encode both the visual and textual content of each page. Because similar pages receive nearby embeddings, pages can be grouped by meaning, much as a library classification scheme arranges books by subject.
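The page-level processing described above can be sketched roughly as follows. This is an illustrative sketch only, not GovScape’s code: `PageRecord`, `toy_embed`, and `index_document` are hypothetical names, the input is assumed to be text that has already been extracted per page, and the hashed bag-of-words vector is a stand-in for the AI embedding models the system actually uses (which also encode the rendered page image).

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class PageRecord:
    """One indexed PDF page (hypothetical structure for illustration)."""
    doc_id: str
    page_number: int          # 1-based position within the PDF
    text: str                 # text extracted from the page
    embedding: list[float]    # fixed-length vector representing the page


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Placeholder embedding: hashed bag-of-words, L2-normalized.

    ASSUMPTION: a real system would call a learned text/image model here.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0   # bucket each token deterministically
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def index_document(doc_id: str, page_texts: list[str]) -> list[PageRecord]:
    """Turn one document's per-page text into one record per page."""
    return [
        PageRecord(doc_id, number, text, toy_embed(text))
        for number, text in enumerate(page_texts, start=1)
    ]


records = index_document(
    "example.pdf",
    ["Federal student aid overview", "Budget chart for the fiscal year"],
)
for record in records:
    print(record.doc_id, record.page_number, len(record.embedding))
```

Treating the page rather than the whole PDF as the unit of indexing means a search can point to the exact page that matches, which is useful when a single document runs to hundreds of pages.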
The core innovation in GovScape resides in its multimodal indexing and search capabilities. Keyword searches function through text-based indices similar to a traditional book index, effectively identifying pages containing specific terms like “FAFSA.” In contrast, semantic and image-based searches transform user queries into embeddings and compare them against the precomputed embeddings from the document pages. By calculating vector similarities in a high-dimensional embedding space, the system returns documents most semantically aligned to the user’s query, even in the absence of explicit keyword matches. This blending of textual and visual semantics marks a significant breakthrough in navigating complex government archives.
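The contrast between the two kinds of lookup can be illustrated with a small sketch. Everything here is assumed for illustration: the page texts, the three-dimensional vectors, and the function names are invented, and real embeddings come from a model and have many more dimensions. The keyword path uses an inverted index, and the semantic path ranks pages by cosine similarity to a query vector.

```python
import math
from collections import defaultdict

# Hypothetical page texts, keyed by a made-up page id.
pages = {
    "p1": "How to complete the FAFSA form.",
    "p2": "Federal student aid eligibility and grants.",
    "p3": "Aerial photographs of coastal erosion.",
}


def build_keyword_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: lowercase token -> ids of pages containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for raw in text.lower().split():
            token = raw.strip(".,;:!?\"'()")
            if token:
                index[token].add(page_id)
    return index


def keyword_search(index: dict[str, set[str]], term: str) -> list[str]:
    """Exact-term lookup, like consulting a book index."""
    return sorted(index.get(term.lower(), set()))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(query_vec, page_vecs, top_k=2) -> list[str]:
    """Rank pages by similarity of their embedding to the query embedding."""
    scored = sorted(
        ((cosine(query_vec, vec), page_id) for page_id, vec in page_vecs.items()),
        reverse=True,
    )
    return [page_id for _, page_id in scored[:top_k]]


# ASSUMPTION: hand-written vectors standing in for precomputed page embeddings.
page_vecs = {"p1": [0.9, 0.1, 0.0], "p2": [0.8, 0.3, 0.1], "p3": [0.0, 0.1, 0.9]}
# ASSUMPTION: pretend this is the embedding of the query "college financial aid".
query_vec = [1.0, 0.2, 0.0]

keyword_index = build_keyword_index(pages)
print(keyword_search(keyword_index, "FAFSA"))   # only the page containing the term
print(semantic_search(query_vec, page_vecs))    # nearest pages in embedding space
```

In this toy data, page `p2` never mentions “FAFSA” and so is invisible to the keyword lookup, yet it ranks second in the semantic search because its vector lies close to the query vector.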
One of the remarkable aspects of GovScape is its cost efficiency. Processing the 10 million PDF pages from Donald Trump’s first presidential term reportedly cost under $1,500, a notable result given the computational demands of running AI models at scale. For comparison, commercial solutions such as Google’s Document AI charge approximately $1 per 100 pages, which points to GovScape’s optimized processing pipeline and its use of efficient AI embedding models. This efficiency will be pivotal as the team works to extend GovScape to index and search the entire archive of roughly 70 million PDFs spanning 2008 to 2024.
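The cost comparison can be worked through with simple arithmetic. The sketch below takes the figures quoted above at face value (10 million pages, a $1,500 upper bound, and $1 per 100 pages for the commercial service); the variable names are illustrative, and actual commercial pricing varies by product tier and volume.

```python
# Figures as reported in the article (treated here as given).
PAGES_PROCESSED = 10_000_000
GOVSCAPE_TOTAL_USD = 1_500            # reported upper bound
COMMERCIAL_USD_PER_100_PAGES = 1.0    # quoted approximate commercial rate

# Per-page cost implied by the reported GovScape total.
govscape_usd_per_page = GOVSCAPE_TOTAL_USD / PAGES_PROCESSED

# What the same page count would cost at the quoted commercial rate.
commercial_total_usd = PAGES_PROCESSED / 100 * COMMERCIAL_USD_PER_100_PAGES

# How many times more expensive the commercial rate would be.
cost_ratio = commercial_total_usd / GOVSCAPE_TOTAL_USD

print(f"GovScape: about ${govscape_usd_per_page:.5f} per page")
print(f"Commercial rate: about ${commercial_total_usd:,.0f} for the same pages")
print(f"Ratio: roughly {cost_ratio:.0f}x")
```

On these figures the commercial rate works out to about $100,000 for the same pages, roughly 67 times the reported GovScape cost.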
The research team also aims to move beyond PDFs and integrate other file types common in government archives, such as spreadsheets, HTML pages, and image files. This matters given the diversity of document formats held in government data repositories. Extending multimodal search to these formats would improve accessibility and usability, helping both casual users and professional researchers extract insights from an increasingly complex digital landscape.
Presented on July 5 at the Annual Meeting of the Association for Computational Linguistics, the research highlights not only technical innovations but also the broader societal importance of accessible government information. Benjamin Charles Germain Lee, the project’s principal investigator and an assistant professor at the University of Washington’s Information School, emphasizes that the massive scale of modern digital archives like the Internet Archive, with its trillionth-page milestone, requires new kinds of search systems to turn raw data into actionable knowledge. This democratization of access is crucial for transparency, accountability, and the informed functioning of a democratic society.
Moreover, GovScape’s design underscores a sophisticated integration of contemporary AI methodologies including natural language processing and computer vision. By leveraging embeddings that jointly capture textual and visual semantics, it surpasses traditional search engines that typically rely on text only. This is particularly pertinent for government documents that frequently embed critical information within charts, graphs, or redacted images—elements conventionally challenging for standard keyword search paradigms.
The research collaboration behind GovScape reflects a multidisciplinary effort. Contributors span multiple institutions including Boston University, Harvard University, the Massachusetts Institute of Technology, the University of North Texas, and the American Institute of Physics. The involvement of doctoral and master’s students alongside established researchers points to a vibrant academic ecosystem facilitating innovation at the intersection of information science, machine learning, and information retrieval.
By employing multimodal embeddings, GovScape introduces new dimensions to document similarity and relevance metrics. Unlike keyword-based searches that use exact text matching, embedding-based approaches capture latent semantic content, enabling more intuitive and contextually relevant results. This avenue is transformative for users seeking nuanced government data that might be referenced in varied terminologies or embedded within complex visual contexts.
An additional factor contributing to GovScape’s usability is its user-friendly interface which permits three distinct search modalities: keyword, semantic, and visual. The visual search option, a novel feature, enables inquiries based on document characteristics such as “redacted documents,” “aerial photographs,” or even specific data visualizations like “pie charts,” exploiting the comprehensive visual embeddings. This capability transforms how users interact with dense digital repositories, moving beyond text-centric queries and accommodating the richness of government archive content.
Looking ahead, as the GovScape project scales and potentially integrates additional document types and archives, it sets the stage for a new paradigm in digital archive interaction. The synthesis of advanced AI techniques with vast archival data not only improves information retrieval but also reflects a broader commitment to open access and civic engagement. By making government documents easier to discover, GovScape serves as a critical tool in the pursuit of transparency and democratic accountability in the digital age.
The research team invites further collaborations and inquiries as they refine and expand the system. Those interested in learning more or engaging with the project can contact Benjamin Charles Germain Lee at [email protected]. This ongoing work promises to inspire both technological innovation and policy discourse surrounding the future of digital archives and government information accessibility.
Subject of Research:
Multimodal AI Search Systems for Large-Scale Government PDF Archives
Article Title:
GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
News Publication Date:
5-Jul-2026
Web References:
End of Term Web Archive: https://eotarchive.org/
GovScape: https://govscape.net/
Google Document AI Pricing: https://cloud.google.com/document-ai/pricing
Research Paper: https://arxiv.org/abs/2511.11010
Internet Archive: https://archive.org/
Keywords:
Search engines, Semantic search, Multimodal AI, Government archives, Document embeddings, Information retrieval, Digital data, Big data
Tags: advanced search tools for public records, AI-powered government document retrieval, digital preservation of presidential administrations, End of Term Web Archive 2008-2024, government document archives, historical government data access, large-scale PDF document analysis, multimodal search technology, overcoming information overload in archives, semantic search in government PDFs, University of Washington research on document search, visual content indexing in archives