Digital data is growing at an unprecedented rate, driven largely by artificial intelligence (AI) development, big data analytics, and the proliferation of smart devices, and the need for sustainable, efficient data storage has never been more urgent. Traditional storage platforms, such as hard drives and cloud infrastructure, are approaching their practical limits: they are costly, energy-inefficient, capped in capacity, and prone to degradation over time. To address these challenges, an interdisciplinary team from The Hong Kong Polytechnic University (PolyU) has unveiled a method that uses engineered proteins as carriers for digital information. The approach increases storage capacity and stability and also supports encryption and random access to the stored data.
The work was led by Professor Zhongping YAO, Associate Head of the Department of Applied Biology and Chemical Technology at PolyU. His team includes Dr. Cheuk-chi NG and Professor Chung-Ming Francis LAU and combines expertise in protein engineering, synthetic biology, biochemistry, analytical chemistry, and computer science. Their findings, published in Nature Communications, describe a full storage cycle in de novo designed unnatural proteins: encoding digital data, expressing the proteins in living cells, and retrieving the data by mass spectrometric analysis. The result points to proteins as a sustainable and scalable storage medium with long lifetimes and functional versatility.
Digital information is inherently binary and is conventionally stored as sequences of 0s and 1s in electronic or magnetic media. Storing such data in molecules means encoding binary strings as sequences of monomer units in a macromolecule. DNA has long served as the prototype molecule for this purpose because it stores information in nature. It has intrinsic limitations, however: it offers only four nucleotide monomer types, which limits storage density, and it is susceptible to chemical and enzymatic degradation. Previous work by Prof. Yao’s group explored peptides, which are polymers of amino acids, as alternative carriers. Peptides draw on 20 natural amino acid monomers and many non-natural analogs, allowing higher information density and greater molecular stability. Nevertheless, their relatively short sequences and costly chemical synthesis limited their use for large-scale data storage.
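The density argument can be made concrete with a short sketch. An alphabet of N monomer types can carry at most log2(N) bits per monomer, so four nucleotides give 2 bits and 20 amino acids give about 4.32 bits. The byte-to-residue mapping below is purely illustrative: the 16-letter `ALPHABET` and the `encode`/`decode` helpers are assumptions chosen so that each residue carries exactly 4 bits, and they are not the encoding scheme used by the PolyU team.

```python
import math

# Hypothetical alphabet: 16 of the 20 standard amino acids (one-letter codes),
# chosen so each residue carries exactly 4 bits. Not the paper's actual mapping.
ALPHABET = "ACDEFGHIKLMNPQRS"


def bits_per_monomer(alphabet_size: int) -> float:
    """Upper bound on information per monomer for a given alphabet size."""
    return math.log2(alphabet_size)


def encode(data: bytes) -> str:
    """Map each byte to two residues: high nibble first, then low nibble."""
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


def decode(sequence: str) -> bytes:
    """Invert encode(): read residues in pairs and rebuild each byte."""
    values = [ALPHABET.index(ch) for ch in sequence]
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))


print(bits_per_monomer(4))    # DNA: 2.0 bits per nucleotide
print(bits_per_monomer(20))   # protein: about 4.32 bits per amino acid
print(encode(b"Hi"))          # FKHL
print(decode("FKHL"))         # b'Hi'
```

A real scheme would also have to avoid sequences that fold or express poorly, which is one of the challenges the article goes on to describe.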
Building on this foundation, the team turned to full-length proteins as the storage medium. Proteins have far longer amino acid sequences than peptides, which greatly increases potential storage density and efficiency. They can also be biosynthesized in living cells, such as genetically engineered Escherichia coli strains, which avoids expensive chemical synthesis and enables production at scale. Proteins are also stable in storage, whether as dry powder or in aqueous solution, and in some cases survive conditions that rapidly degrade DNA. Together, these properties make for a sustainable, cost-effective platform suited to future data-intensive applications.
Protein-based data storage nonetheless poses formidable challenges. First, amino acid sequences that encode data are typically highly irregular and unlike anything produced by natural evolution, which often impairs protein folding, solubility, and stability and complicates both design and expression in host organisms. Second, current protein sequencing techniques mainly identify proteins by matching partial sequences against known databases rather than reconstructing entire sequences, yet full-length sequence retrieval is essential for decoding the embedded digital information accurately. The PolyU team addressed both problems.
Drawing inspiration from collagen, a natural protein known for its stability and longevity, the researchers engineered a collagen-like protein scaffold to serve as a robust backbone. The scaffold enhances structural integrity and resists chemical degradation, providing a framework into which data-encoding segments can be incorporated. Using genetic engineering, the team inserted these custom sequences into the collagen template and successfully expressed the hybrid protein constructs in E. coli. This bio-fabrication approach brings synthetic biology and information technology together.
Subsequent data retrieval involved enzymatic digestion of the expressed proteins into smaller peptide fragments, followed by comprehensive liquid chromatography–tandem mass spectrometry (LC-MS/MS) analysis. This analytical regimen allowed discrete peptide sequences to be identified with high resolution. The mass spectrometric data were then processed using specially developed algorithmic software capable of assembling full-length protein sequences from overlapping peptide fragments. This sophisticated bioinformatics pipeline also incorporated error-correcting codes to rectify minor sequence ambiguities, collectively ensuring that the original digital bit strings could be reconstructed with remarkable accuracy, thereby validating the feasibility of the entire storage-retrieval cycle.
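The assembly step can be illustrated with a minimal sketch. The team's software is not described in detail here, so the code below swaps in a simple greedy overlap-merge heuristic as a stand-in: it repeatedly joins the two fragments that share the longest suffix-to-prefix overlap. The function names, the `min_len` threshold, and the example peptides are all assumptions, and the error-correction stage is not modeled.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments: list[str], min_len: int = 3) -> str:
    """Greedily merge overlapping peptide fragments into one sequence."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = [f for f in dict.fromkeys(fragments)
             if not any(f != g and f in g for g in fragments)]
    if not frags:
        raise ValueError("no fragments to assemble")
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            raise ValueError("fragments do not overlap enough to assemble")
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0]


# Hypothetical peptide reads, listed out of order.
peptides = ["QISFVKSHF", "MKTAYIAKQ", "KSHFSRQ", "IAKQRQISF"]
print(assemble(peptides))  # MKTAYIAKQRQISFVKSHFSRQ
```

Greedy merging can go wrong when a sequence contains repeated motifs, which is one reason a practical pipeline would combine assembly with error-correcting codes, as the paragraph above notes.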
The team also compared protein-based storage with its earlier peptide-based system, which had already shown notable stability under conditions relevant to space exploration missions in China’s next-generation manned spacecraft. Prof. Yao emphasized that “the protein samples in our research achieved 30 times the storage density at only 10% of the cost of the peptide-based method.” In addition, unlike DNA, which degrades rapidly in acidic environments or aqueous solutions, the protein samples remained intact and readable after prolonged periods, underscoring their chemical resilience.
Beyond basic data encoding, the team “functionalized” the proteins to support random data access and cryptographic protection. Traditional molecular storage systems require decoding the entire data set to extract a specific segment, which is inefficient and inflexible. By grafting specific affinity tags onto data-bearing proteins, the researchers could selectively bind and isolate targeted sequences with the corresponding antibodies, giving random access to discrete portions of the data without decoding the full dataset. They also embedded encrypted messages in these proteins and recovered them selectively using only predefined affinity compounds, demonstrating a form of molecular-level data encryption and the potential for secure information storage at the biochemical scale.
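As a loose software analogy, tag-based random access behaves like filtering a mixed pool by key before any decoding takes place. The sketch below is a toy model only: the `StoredProtein` class, the `pull_down` function, and the tag labels are hypothetical, and the matching that antibodies perform chemically is reduced here to a string comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredProtein:
    tag: str      # hypothetical affinity-tag label, e.g. "TAG_A"
    payload: str  # data-bearing amino acid sequence (one-letter codes)


def pull_down(pool: list[StoredProtein], tag: str) -> list[StoredProtein]:
    """Return only the proteins carrying the requested tag.

    This mimics antibody capture: everything else in the pool is left
    behind and never needs to be sequenced or decoded.
    """
    return [protein for protein in pool if protein.tag == tag]


pool = [
    StoredProtein("TAG_A", "FKHL"),
    StoredProtein("TAG_B", "GPPGAP"),
    StoredProtein("TAG_A", "MKTAY"),
]

selected = pull_down(pool, "TAG_A")
print([protein.payload for protein in selected])  # ['FKHL', 'MKTAY']
```

In the same spirit, the encryption idea corresponds to a pool whose contents can only be pulled out by someone who knows which affinity compound to use.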
Professor Yao highlighted the broader implications of protein-based data storage: “Their inherent biocompatibility implies the intriguing prospect of embedding digital archives within living organisms, opening new frontiers in biological data integration.” Possible applications span bioinformatics, synthetic biology, and personalized medicine, where biological systems could retain digital information as well as process it. The research team plans to work next on scaling up storage capacity, speeding up data writing and reading, reducing biosynthesis costs, and diversifying protein scaffolds to add functionality and improve performance.
The work spans protein engineering, synthetic biology, analytical chemistry, bioinformatics, and computer science, and it addresses a pressing societal need created by the worldwide deluge of AI-generated data. The research was supported by the Hong Kong Research Grants Council through the Collaborative Research Fund and the Research Impact Fund.
Subject of Research: Protein-based molecular data storage and retrieval using engineered collagen-like proteins expressed via E. coli.
Article Title: Data storage and retrieval with unnatural proteins expressed via E. coli
News Publication Date: 28-Feb-2026
Web References:
https://www.nature.com/articles/s41467-026-70061-7
http://dx.doi.org/10.1038/s41467-026-70061-7
Image Credits: PolyU
Keywords: Proteins, Data storage, Collagen, Artificial intelligence, Sequence analysis, Biochemistry
Tags: AI-driven big data storage solutions, biochemical data storage innovations, engineered protein information carriers, high-capacity biomolecular storage, Hong Kong Polytechnic University research, interdisciplinary synthetic biology research, next-generation data storage breakthroughs, overcoming traditional storage limitations, protein-based data storage technology, random access protein storage systems, stable protein data encryption methods, sustainable digital data preservation


