About the ETAF Data Commons

The scientific case for shared infrastructure — why family-based cohorts need a common resource, what the ETAF Data Commons aims to build, and what becomes scientifically possible at scale.

The scientific case for shared infrastructure

Twin, adoption, and extended-family cohorts have delivered decades of insight into genetic and environmental influences on human behavior, health, development, and social outcomes. Yet these datasets remain scattered across institutions, governed independently, difficult for outside researchers to access, and often measured using different instruments.

Many of the most important questions in behavioral and biomedical science require scale, family structure, longitudinal measurement, and genomic data together — conditions that individual cohorts and conventional population biobanks typically cannot meet alone. Shared infrastructure can substantially increase the scientific return on decades of cohort investment.

Genome-wide studies of unrelated individuals have transformed genetic discovery, but they cannot fully distinguish effects operating within individuals from effects that arise through families and broader social contexts. People are not independent units: their genetics, environments, and outcomes are intertwined across generations. Twin, adoption, sibling, spouse, parent-offspring, and extended-family designs — especially when combined with measured genomic data — make it possible to separate direct genetic effects from family-mediated genetic effects, shared environmental transmission, assortative mating, and gene-environment correlation. The goal of the ETAF Data Commons is not simply to aggregate more data, but to build the infrastructure that makes cumulative, rigorous, family-based genetically informed research possible at global scale.

Lowers barriers to entry

Shared infrastructure makes it easier for researchers to access genetically informative data without navigating separate access processes, data agreements, and analysis environments for every contributing cohort.

Harmonizes phenotypes

Phenotype harmonization will be pursued where scientifically appropriate, and cohort-specific measures will be preserved when harmonization would obscure meaningful differences or reduce scientific value.

Increases power and supports replication

Combining independently collected cohorts increases statistical power and enables systematic replication — two conditions essential for producing findings that are credible, robust, and cumulative.
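As a rough illustration of why pooling helps, the sketch below combines per-cohort effect estimates with fixed-effect inverse-variance weighting, a standard meta-analytic technique. It is not a description of how the ETAF Data Commons will combine data; the function name and the example numbers are hypothetical.

```python
import math
from typing import List, Tuple


def inverse_variance_pool(estimates: List[float],
                          std_errors: List[float]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance pooling of per-cohort estimates.

    Returns (pooled_estimate, pooled_standard_error).
    Assumes cohorts are independent and estimate the same quantity.
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical example: two cohorts with equal precision.
pooled, pooled_se = inverse_variance_pool([0.10, 0.20], [0.10, 0.10])
print(pooled, pooled_se)
```

With two equally precise cohorts, the pooled standard error shrinks by a factor of the square root of two relative to either cohort alone, which is the sense in which combined cohorts gain power.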

Supports new family-based genomic methods

Modern methods — within-family GWAS, interpersonal genetic effects models, extended pedigree analyses — require large samples of families with genomic data that no single cohort can currently provide.

A secure, integrated research resource

The ETAF Data Commons is intended to provide shared computational and governance infrastructure for integrating family-based cohorts in a secure, controlled-access environment. All elements described below are in development; nothing has been finalized or deployed.

Contributing Cohorts

Twin, adoption, extended family, and related genetically informative studies worldwide

Intake & Metadata

Study documentation, family-structure metadata, and consent constraint mapping

Phenotype Harmonization

Cross-cohort harmonization where appropriate; cohort-specific measures preserved

Genomic Processing

QC, imputation, and integration of genomic data where available; DNA pathways explored

Secure Cloud Platform

Controlled-access analysis environment; no broad individual-level data download

Planned

Approved Researchers & Discovery

Qualified investigators with approved projects conduct analyses and generate new knowledge

All stages are under development. Future access will be project-based, governed by data-use agreements, and subject to cohort-specific consent constraints. Individual-level data will not be broadly downloadable.

What becomes possible at scale

Combining family-based cohorts with harmonized phenotypes and genomic data enables scientific approaches that are not feasible in isolated datasets. The following represent a sample of anticipated use cases.

Within-family genomics

Sibling and twin-pair analyses to estimate direct genetic effects, control for passive gene-environment correlation, and test causal hypotheses.
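The logic of a within-family comparison can be sketched in a few lines. The example below is a minimal illustration, not an ETAF method: it regresses sibling differences in a phenotype on sibling differences in a genetic score, so anything shared by both siblings cancels out. The data layout, function names, and toy numbers are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: each sibling is (genetic_score, phenotype).
Sibling = Tuple[float, float]
SibPair = Tuple[Sibling, Sibling]


def sibling_difference_slope(pairs: List[SibPair]) -> float:
    """Slope of phenotype differences on score differences (no intercept)."""
    num = 0.0
    den = 0.0
    for (g1, y1), (g2, y2) in pairs:
        dg = g1 - g2
        dy = y1 - y2
        num += dg * dy
        den += dg * dg
    if den == 0.0:
        raise ValueError("no within-pair variation in the genetic score")
    return num / den


def pooled_slope(pairs: List[SibPair]) -> float:
    """Ordinary least-squares slope that ignores family structure."""
    xs = [g for pair in pairs for g, _ in pair]
    ys = [y for pair in pairs for _, y in pair]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def make_toy_pairs(direct_effect: float = 0.5) -> List[SibPair]:
    """Toy data: a shared family effect that tracks the family mean score."""
    scores = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (1.0, 3.0)]
    pairs = []
    for g1, g2 in scores:
        family_effect = (g1 + g2) / 2  # identical for both siblings
        pairs.append(((g1, direct_effect * g1 + family_effect),
                      (g2, direct_effect * g2 + family_effect)))
    return pairs


pairs = make_toy_pairs()
print("within-family slope:", sibling_difference_slope(pairs))
print("pooled slope:", pooled_slope(pairs))
```

In this toy setup the within-family slope recovers the simulated direct effect, while the pooled slope is inflated because the family-level effect is correlated with the score. Real analyses involve measurement error, more than two siblings per family, and covariates that this sketch leaves out.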

Adoption & rearing environment

Adoption designs offer strong leverage on causal effects of rearing environments, independent of genetic transmission from biological parents.

Twin & sibling comparisons

Classic and extended twin designs to decompose genetic and environmental variance, test GxE interactions, and evaluate biometric model assumptions.
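For readers new to twin designs, the classical variance decomposition can be approximated from two correlations using Falconer's formulas. The sketch below is an illustration only: it assumes purely additive genetic effects, equal environments for identical (MZ) and fraternal (DZ) twins, and no assortative mating, which are the kinds of biometric assumptions mentioned above. Contemporary twin research typically fits structural equation models rather than using these shortcuts, and the function name and input values here are hypothetical.

```python
from typing import Dict


def falconer_ace(r_mz: float, r_dz: float) -> Dict[str, float]:
    """Approximate ACE variance components from twin correlations.

    a2: additive genetic variance share      = 2 * (r_mz - r_dz)
    c2: shared environmental variance share  = 2 * r_dz - r_mz
    e2: non-shared environment and error     = 1 - r_mz
    """
    for name, r in (("r_mz", r_mz), ("r_dz", r_dz)):
        if not -1.0 <= r <= 1.0:
            raise ValueError(f"{name} must be a correlation in [-1, 1]")
    return {
        "a2": 2.0 * (r_mz - r_dz),
        "c2": 2.0 * r_dz - r_mz,
        "e2": 1.0 - r_mz,
    }


# Hypothetical correlations, chosen only to show the arithmetic.
components = falconer_ace(r_mz=0.8, r_dz=0.5)
print(components)
```

The three components sum to one by construction. An estimate outside the range from 0 to 1, such as a negative c2, is a sign that the simple ACE assumptions do not hold for that trait.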

Extended pedigree & intergenerational

Multi-generational data to study intergenerational transmission, interpersonal genetic effects, and the developmental origins of health and behavior.

Gene-environment correlation & interplay

Family-based designs offer powerful approaches to distinguish active, reactive, and passive gene-environment correlations and identify GxE interactions.

Assortative mating

Spousal data and family pedigrees to study mate selection, phenotypic and genetic resemblance between partners, and implications for population genetic structure.

Developmental & life-course research

Longitudinal phenotyping across the life span to study how genetic and environmental influences on traits change from childhood through adulthood.

Replication & method benchmarking

Cross-cohort replication of key findings, evaluation of statistical methods under different design assumptions, and construction of multi-cohort reference datasets.

Community momentum

A community survey of twin and family registers suggests that the phenotypic and genotypic resources for global expansion already exist.

73 registers surveyed
32 countries represented
16 first-wave responders
20 letters of support

Important note regarding letters of support: Letters of support indicate enthusiasm, interest, or advisory willingness. They are not participation agreements and do not imply that data will be transferred, that consent or governance review has been completed, or that a cohort has agreed to participate in the ETAF Data Commons.