
About the ETAF Data Commons

The scientific case for shared infrastructure — why family-based cohorts need a common resource, what the ETAF Data Commons aims to build, and what becomes scientifically possible at scale.

The scientific case for shared infrastructure

Twin, adoption, and extended-family cohorts have delivered decades of insight into genetic and environmental influences on human behavior, health, development, and social outcomes. Yet these datasets remain scattered across institutions, governed independently, difficult for outside researchers to access, and often measured using different instruments.

Many of the most important questions in behavioral and biomedical science require scale, family structure, longitudinal measurement, and genomic data together — conditions that individual cohorts and conventional population biobanks typically cannot meet alone. Shared infrastructure can substantially increase the scientific return on decades of cohort investment.

Lowers barriers to entry

Shared infrastructure makes it easier for researchers to access genetically informative data without navigating separate access processes, data agreements, and analysis environments for every contributing cohort.

Harmonizes phenotypes

Phenotype harmonization will be pursued where scientifically appropriate, and cohort-specific measures will be preserved when harmonization would obscure meaningful differences or reduce scientific value.

Increases power and supports replication

Combining independently collected cohorts increases statistical power and enables systematic replication — two conditions essential for producing findings that are credible, robust, and cumulative.
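As a rough illustration of the power argument, the sketch below pools per-cohort effect estimates with a standard fixed-effect inverse-variance meta-analysis. This is a generic textbook technique, not a description of how the ETAF Data Commons will combine data. The function name `pooled_estimate` and all numbers are hypothetical.

```python
import math


def pooled_estimate(betas, ses):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


# Hypothetical effect estimates from three cohorts (made-up numbers).
betas = [0.10, 0.14, 0.12]
ses = [0.05, 0.05, 0.05]

beta, se = pooled_estimate(betas, ses)
print(f"pooled beta = {beta:.3f}, pooled SE = {se:.4f}")
```

With three equally precise cohorts, the pooled standard error is the single-cohort standard error divided by the square root of three, which is the sense in which combining cohorts increases power.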

Supports new family-based genomic methods

Modern methods — within-family GWAS, indirect genetic effects models, extended pedigree analyses — require large samples of families with genomic data that no single cohort can currently provide.

A secure, integrated research resource

The ETAF Data Commons is intended to provide shared computational and governance infrastructure for integrating family-based cohorts in a secure, controlled-access environment. All elements described below are in development; nothing has been finalized or deployed.

Contributing Cohorts

Twin, adoption, extended family, and related genetically informative studies worldwide

Intake & Metadata

Study documentation, family-structure metadata, and consent constraint mapping

Phenotype Harmonization

Cross-cohort harmonization where appropriate; cohort-specific measures preserved

Genomic Processing

QC, imputation, and integration of genomic data where available; DNA pathways explored

Secure Cloud Platform

Controlled-access analysis environment; no broad individual-level data download

Planned

Approved Researchers & Discovery

Qualified investigators with approved projects conduct analyses and generate new knowledge

All stages are under development. Future access will be project-based, governed by data-use agreements, and subject to cohort-specific consent constraints. Individual-level data will not be broadly downloadable.

What becomes possible at scale

Combining family-based cohorts with harmonized phenotypes and genomic data enables scientific approaches that are not feasible in isolated datasets. The following are examples of anticipated use cases.

Within-family genomics

Sibling and twin-pair analyses to estimate direct genetic effects, control for passive gene-environment correlation and other between-family confounds, and test causal genetic hypotheses.
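The within-pair logic can be sketched with a sibling-difference estimator: regressing sibling differences in an outcome on sibling differences in a predictor (for example, a polygenic score) cancels anything the two siblings share. This is a minimal sketch on toy data, not a within-family GWAS pipeline. The function name and the data are hypothetical.

```python
def sibling_difference_slope(pairs):
    """Slope of outcome differences on predictor differences across pairs.

    Each pair is ((x1, y1), (x2, y2)): predictor and outcome per sibling.
    The regression is through the origin because sibling order is arbitrary.
    """
    num = 0.0
    den = 0.0
    for (x1, y1), (x2, y2) in pairs:
        dx, dy = x1 - x2, y1 - y2
        num += dx * dy
        den += dx * dx
    if den == 0:
        raise ValueError("no within-pair variation in the predictor")
    return num / den


# Toy data (hypothetical): outcome = 0.5 * predictor + shared family effect.
families = [(10.0, 0.0, 1.0), (-3.0, 2.0, 0.0), (4.0, 1.0, 2.0)]
pairs = [
    ((x1, 0.5 * x1 + fam), (x2, 0.5 * x2 + fam))
    for fam, x1, x2 in families
]

slope = sibling_difference_slope(pairs)
print(f"within-family slope = {slope:.2f}")
```

Because the shared family effect drops out of every difference, the estimator recovers the within-family slope even though the family effects differ widely between families.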

Adoption & rearing environment

Adoption designs offer strong leverage on causal effects of rearing environments, independent of genetic transmission from biological parents.

Twin & sibling comparisons

Classic and extended twin designs to decompose genetic and environmental variance, test GxE interactions, and evaluate biometric model assumptions.
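As a simple illustration of variance decomposition in the classic twin design, the sketch below applies Falconer's formulas to identical (MZ) and fraternal (DZ) twin correlations. These formulas assume equal environments across zygosity, purely additive genetic effects, and no assortative mating. Contemporary analyses usually fit structural equation models instead. The correlations used here are hypothetical.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer estimates of additive genetic (A), shared environmental (C),
    and non-shared environmental (E) variance proportions."""
    a2 = 2.0 * (r_mz - r_dz)  # MZ pairs share twice the additive genetic variance of DZ pairs
    c2 = 2.0 * r_dz - r_mz    # twin similarity not explained by A
    e2 = 1.0 - r_mz           # whatever makes MZ twins differ, including error
    return a2, c2, e2


# Hypothetical twin correlations for a single trait.
a2, c2, e2 = falconer_ace(r_mz=0.8, r_dz=0.5)
print(f"A = {a2:.2f}, C = {c2:.2f}, E = {e2:.2f}")
```

The three components sum to one by construction. Estimates outside the range 0 to 1 signal that the assumptions above do not hold for the data at hand.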

Extended pedigree & intergenerational

Multi-generational data to study intergenerational transmission, indirect genetic effects, and the developmental origins of health and behavior.

Gene-environment correlation & interplay

Family-based designs make it possible to distinguish active, reactive (evocative), and passive gene-environment correlations and to identify GxE interactions.

Assortative mating

Spousal data and family pedigrees to study mate selection, phenotypic and genetic resemblance between partners, and implications for population genetic structure.

Developmental & life-course research

Longitudinal phenotyping across the life span to study how genetic and environmental influences on traits change from childhood through adulthood.

Replication & method benchmarking

Cross-cohort replication of key findings, evaluation of statistical methods under different design assumptions, and construction of multi-cohort reference datasets.