Development of standardized file format to maximize data shareability across disciplines (Faculty/Junior Researcher Collaboration Opportunity)

PI: Jean-Paul Armache (Biochemistry and Molecular Biology)

Team Members

• Mike Carnegie, Systems Administrator, The Huck Institutes of the Life Sciences. Co-supervises the project, installs and configures open-source dependencies, and manages containers.

• Dr. Wen Jiang, Faculty Director, Cryo-Electron Microscopy Core Facility, Huck Institutes of the Life Sciences; Professor, Department of Biochemistry and Molecular Biology. Provides expertise in data shareability in the cryo-EM field and communicates with external development groups and standards bodies.

Departments and Units

• Eberly College of Science

• Huck Institutes of the Life Sciences

Level of Effort

• Two semesters at 8 hours a week are appropriate for completing the initial planning and requirements gathering and for designing the file format and metadata structure. There may also be time to develop a few plugins for common software packages.

Plan for Funding Tuition or Remainder of Salary

• Tuition support will be provided through departmental RA funding.

• Additional salary coverage will be sought through pending instrumentation and infrastructure grants.

Project Description

Data reproducibility and data sharing are critical for the evaluation, validation, and confirmation of scientific results.

In this proposal, we intend to establish a standardized file format designed for sharing data in reviews or as project summaries. It is not meant for sharing large raw datasets, for which specialized upload servers already exist, but rather for gathering selected, representative information, of the kind often needed for reviews, in one place. Instead of sharing multiple files on servers as Zip archives that lack descriptions, definitions, or any indication of how the contents should be represented, we opt for a single cohesive file. This file would incorporate data such as representative images for a dataset, the locations of selected elements on these images, 3D atomic coordinates, regions of these coordinates with their names and colors, 3D density maps, crosslinking data, or computer code.

The standard will be established through in-depth discussions with the most influential software-development laboratories in the cryo-EM field, super-users, standards bodies, and journals. We will identify the important data types, formats, and metadata to be incorporated, and structure the data format with shareability and expandability in mind. Instead of “reinventing the wheel”, we will build on existing open-source projects (HDF5, JPEG XL, etc.) to future-proof our file format.
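To make the idea of a described, extensible container concrete, the following is a minimal sketch of what a manifest for such a file might look like. It is an illustration only, not the proposed standard: the class names (`Entry`, `Manifest`), every field name, the `kind` values, the internal paths, and the `0.1-draft` version string are all hypothetical, and the actual structure would come out of the requirements gathering described above. In an HDF5-based container, entries like these could plausibly map onto groups and datasets with attached attributes, but that mapping is an assumption.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Entry:
    """One element stored in the container (all field names are hypothetical)."""
    name: str              # human-readable label
    kind: str              # e.g. "image", "picks", "atomic_model", "density_map"
    path: str              # location of the payload inside the container
    description: str = ""  # free-text explanation for reviewers
    attributes: dict = field(default_factory=dict)  # extensible metadata


@dataclass
class Manifest:
    """Top-level description of one project summary file."""
    title: str
    format_version: str = "0.1-draft"  # placeholder, not a released version
    entries: list = field(default_factory=list)

    def add(self, entry: Entry) -> None:
        self.entries.append(entry)

    def to_json(self) -> str:
        # asdict() recurses into the nested Entry dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Manifest":
        raw = json.loads(text)
        entries = [Entry(**item) for item in raw.pop("entries")]
        return cls(entries=entries, **raw)


# Example: a micrograph plus the particle locations selected on it.
manifest = Manifest(title="Example cryo-EM project summary")
manifest.add(Entry(
    name="representative micrograph",
    kind="image",
    path="/images/micrograph_0001",
    attributes={"pixel_size_angstrom": 1.05},
))
manifest.add(Entry(
    name="selected particles",
    kind="picks",
    path="/picks/micrograph_0001",
    description="Locations of selected elements on the micrograph.",
    attributes={
        "refers_to": "/images/micrograph_0001",
        "coordinates_xy": [[120, 340], [512, 88]],
    },
))

# Round-trip through JSON to show the description is self-contained.
restored = Manifest.from_json(manifest.to_json())
```

Keeping the free-form `attributes` dictionary separate from the fixed fields is one way to leave room for expansion: new data types from other disciplines could add their own keys without changing the core schema.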

To make the format more accessible to the field, we will develop readers/writers for the most popular software packages and viewers (Napari, UCSF ChimeraX, PyMOL, ImageJ/Fiji, Coot), adapting reusable, Python-based code for the operations needed to interpret the files. In time, we will also provide macOS Finder and Microsoft Windows Explorer plugins to simplify the file creation process.

We envision a file format that would at first focus on maximizing shareability in the cryo-EM field. However, because other disciplines work with similar kinds of information (images, coordinates, sessions, layers), we will also extend it so that other communities can adopt it, thereby broadening its use as a standard for sharing.

We envision this approach becoming the de facto standard used by labs (to share and represent their projects and data) and by journals (for submissions). The project will involve designing the file format (content and metadata) and the accompanying readers/writers, and ensuring that the application complies with institutional data governance standards.

Our proposal aligns very well with the National Science Foundation opportunity “Reproducibility and Replicability in Science”, as it addresses “Advancing the science of reproducibility and replicability” and “Educational efforts to build a scientific culture that supports reproducibility and replicability”.

Given the rapid advances in AI, it is becoming increasingly easy for bad-faith actors to fabricate or manipulate data in ways that make flawed findings appear publishable, and therefore eligible for funding. To address this growing concern, we aim to develop a streamlined, transparent method for sharing data with reviewers and fellow researchers. This approach will emphasize clearly annotated and well-defined data elements, structured so that they can be interpreted and validated by software tools. The format could also contain embedded links to the public servers that hold the deposited raw data.
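One simple ingredient of software-side validation could be a per-element checksum stored next to a link to the deposited raw data. The sketch below uses SHA-256 from the Python standard library. The function names (`seal`, `verify`), the record layout, and the `https://example.org/...` URL are hypothetical placeholders rather than part of the proposal. A checksum of this kind only shows whether an element changed after the file was sealed; it does not by itself establish that the data are genuine.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def seal(payloads: dict, raw_data_url: str) -> dict:
    """Record a checksum for every element plus a link to the raw data."""
    return {
        "raw_data_url": raw_data_url,
        "checksums": {name: sha256_hex(data) for name, data in payloads.items()},
    }


def verify(payloads: dict, record: dict) -> list:
    """Return the names of elements that are missing or no longer match."""
    problems = []
    for name, expected in record["checksums"].items():
        data = payloads.get(name)
        if data is None or sha256_hex(data) != expected:
            problems.append(name)
    return problems


# Example with stand-in byte strings instead of real images or models.
payloads = {
    "micrograph": b"stand-in image bytes",
    "atomic_model": b"stand-in coordinate bytes",
}
record = seal(payloads, "https://example.org/raw-dataset")  # placeholder URL

unchanged = verify(payloads, record)

# Simulate an element that was edited after the file was sealed.
tampered = dict(payloads, atomic_model=b"stand-in coordinate bytes (edited)")
changed = verify(tampered, record)
```

Here `unchanged` is an empty list, while `changed` names the `atomic_model` element whose contents no longer match the stored digest.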

Desired Skills or Expertise

• Python development – scripting, data handling, and/or plugin development

• Scientific metadata standards – familiarity with frameworks like REMBI or related ontologies

• Imaging workflows or file formats – knowledge of imaging acquisition, processing, or visualization pipelines

• UI/UX design – especially for scientific or research-oriented applications

• Version control using GitHub – collaborative development and issue tracking

• Passion for open-source development – contributing to community-driven tools and standards

Additional Requirements and Expectations

• Preferred background: Graduate students in computer science, bioinformatics, data science, engineering, or a related field, particularly those with experience in scientific computing or research software

• Time commitment: Availability for regular project meetings

• Collaboration mindset: Willingness to explore, test, and iterate on different technical solutions and community standards

Immediate Objectives for this Phase

• Develop a working prototype of a flexible data storage file format tailored for microscopy workflows

• Implement reader/writer plugins for integration with visualization tools such as Napari, ImageJ/Fiji, UCSF ChimeraX, and others

• Document and prepare preliminary results for inclusion in upcoming grant applications

• Draft and submit a conference abstract or workshop presentation outlining project outcomes

Medium to Long-Term Goals

• Contribute to a formal NSF data infrastructure proposal incorporating the developed tools

• Author a methods paper or software note describing the platform and its applications

• Integrate the developed tools with institutional data repositories and workflows, particularly within cryo-EM and advanced imaging facilities

Connection to ICDS Mission

This project advances ICDS’s mission by leveraging data science and software development to support interdisciplinary research, enhance reproducibility, and facilitate access to structured scientific data.

Engagement with ICDS

Mike Carnegie has participated in previous ICDS events and workshops and intends to engage further by mentoring Junior Researchers and integrating ICDS tools and best practices into the project. The team is committed to regular engagement through seminars and interdisciplinary collaborations facilitated by ICDS.