KPMP Data Extraction
Related to Figure 6.
Kidney Precision Medicine Project (KPMP) atlas (Lake et al., Nature 2023), accessed via the CellxGene Census open API (no data use agreement required). Collection ID: bcb61471-2a44-4d00-a0af-ff085512674c. Cells queried: tissue_general == 'kidney' AND (disease == 'chronic kidney disease' OR disease == 'normal'), restricted to the datasets in this collection. All signature scoring is deferred to R; this script only builds donor-level pseudobulk tables, log-CPM normalizes them, and saves them as CSVs.
Gene signatures
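Expression values for all four signature gene lists are fetched in a single Census query. QUERY_GENES is the union of the lists with duplicates removed, since several genes (for example LDHA and COL4A2) appear in more than one signature. Scoring happens entirely in R.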
CAAH_10_GENES = ["CLU", "LRP2", "LAMP2", "COL4A2", "SPINK1",
                 "WFDC2", "PAX8", "LYZ", "S100A9", "CDH13"]
HYPOXIA_GENES = ["HIF1A", "EPAS1", "EPO", "EPOR", "VEGFA",
                 "ALDOA", "LDHA", "PGK1", "SLC2A1", "ENO1",
                 "PGAM1", "TPI1", "ANGPTL4", "HILPDA", "PLAUR",
                 "SLC16A1", "BNIP3", "CA9", "P4HA1"]
GLYCOLYSIS_GENES = ["HK2", "PFKM", "PKM", "LDHA", "HK1",
                    "ALDOA", "GAPDH", "PGK1", "PGAM1", "ENO1",
                    "TPI1", "GPI", "PFKL", "PFKP", "LDHB"]
ECM_GENES = ["COL1A1", "COL1A2", "COL3A1", "COL4A1", "COL4A2",
             "COL12A1", "FN1", "TIMP1", "TIMP2", "MMP2",
             "ACTA2", "VIM", "TGFB1", "CCN2", "POSTN", "THBS1"]
# Union of all signatures; dict.fromkeys drops repeats while keeping first-seen order.
QUERY_GENES = list(dict.fromkeys(
    CAAH_10_GENES + HYPOXIA_GENES + GLYCOLYSIS_GENES + ECM_GENES
))
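The de-duplication idiom used for QUERY_GENES can be checked on a small example. The sketch below uses hypothetical toy lists (sig_a, sig_b) rather than the real signatures, and the helper name union_in_order is introduced here purely for illustration.

```python
def union_in_order(*gene_lists):
    """Concatenate gene lists and drop repeats, keeping first-seen order."""
    combined = [gene for genes in gene_lists for gene in genes]
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(combined))


# Hypothetical toy signatures that share two genes.
sig_a = ["LDHA", "PGK1", "VEGFA"]
sig_b = ["HK2", "LDHA", "PGK1"]

merged = union_in_order(sig_a, sig_b)
print(merged)  # ['LDHA', 'PGK1', 'VEGFA', 'HK2']
```

Unlike list(set(...)), this keeps the gene order stable between runs, which keeps the column order of the saved CSV reproducible.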
Census query
get_anndata() returns raw, unnormalized counts. The var_value_filter restricts to signature genes only so the download is small; obs_column_names pulls just the metadata fields needed for pseudobulking.
import cellxgene_census

KPMP_COLLECTION_ID = "bcb61471-2a44-4d00-a0af-ff085512674c"

with cellxgene_census.open_soma() as census:
    # Look up every dataset that belongs to the KPMP collection.
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    kpmp_ids = datasets[datasets["collection_id"] == KPMP_COLLECTION_ID]["dataset_id"].tolist()

    # Restrict cells to KPMP datasets, kidney tissue, and CKD or normal donors.
    id_filter = " or ".join([f"dataset_id == '{d}'" for d in kpmp_ids])
    value_filter = (f"({id_filter}) and tissue_general == 'kidney'"
                    f" and (disease == 'chronic kidney disease' or disease == 'normal')")

    # Fetch raw counts for the signature genes only, with minimal cell metadata.
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter=value_filter,
        var_value_filter=f"feature_name in {QUERY_GENES}",
        obs_column_names=["soma_joinid", "donor_id", "disease", "dataset_id"],
    )
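The string passed as obs_value_filter is ordinary text, so it can be assembled and inspected without opening the Census. The following is a minimal sketch, assuming a hypothetical helper name build_obs_filter and made-up dataset IDs. It also guards against an empty ID list, which would otherwise yield an empty "()" clause.

```python
def build_obs_filter(dataset_ids, diseases, tissue="kidney"):
    """Build an obs filter string restricting dataset, tissue, and disease."""
    if not dataset_ids:
        raise ValueError("no dataset IDs found for the collection")
    id_clause = " or ".join(f"dataset_id == '{d}'" for d in dataset_ids)
    disease_clause = " or ".join(f"disease == '{d}'" for d in diseases)
    return (f"({id_clause}) and tissue_general == '{tissue}'"
            f" and ({disease_clause})")


# Made-up IDs for illustration only; real IDs come from the Census datasets table.
example_filter = build_obs_filter(
    ["id-1", "id-2"],
    ["chronic kidney disease", "normal"],
)
print(example_filter)
```

Printing the filter before running the query is a cheap way to confirm that the parentheses group the dataset and disease clauses as intended.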
Pseudobulk construction
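Counts are summed per donor across all of that donor's cells. A donor may appear in multiple KPMP datasets (source and integrated versions), so cells from every matching dataset contribute to the sum. The subsequent log-CPM normalization rescales each donor by its own total, which removes the resulting differences in overall count depth between donors. The n_cells column written later likewise counts cell rows across all of those datasets.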
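import numpy as np
import pandas as pd
import scipy.sparse as sp

# Densify the cells x genes count matrix; it is small because only signature genes were fetched.
X = adata.X.toarray() if sp.issparse(adata.X) else np.array(adata.X)
gene_names = adata.var["feature_name"].values
obs = adata.obs.reset_index(drop=True)

expr_df = pd.DataFrame(X, columns=gene_names)
expr_df["donor_id"] = obs["donor_id"].astype(str).values
expr_df["disease"] = obs["disease"].astype(str).values

# Sum raw counts over all cells of each donor (one row per donor and disease label).
pb = (expr_df
      .groupby(["donor_id", "disease"], observed=True)
      .sum()
      .reset_index())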
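The pseudobulk step can be illustrated on a toy table. This sketch uses four made-up cells from two hypothetical donors (d1, d2) and two genes; the counts are invented for illustration and are not KPMP data.

```python
import pandas as pd

# Toy cell-level counts: one row per cell.
toy_cells = pd.DataFrame({
    "donor_id": ["d1", "d1", "d2", "d2"],
    "disease": ["normal", "normal",
                "chronic kidney disease", "chronic kidney disease"],
    "CLU": [1, 3, 0, 2],
    "LRP2": [0, 5, 4, 4],
})

# Sum counts over all cells that share a donor and disease label.
toy_pb = (toy_cells
          .groupby(["donor_id", "disease"], observed=True)
          .sum()
          .reset_index())
print(toy_pb)
```

Each donor collapses to a single row whose gene columns hold the summed counts, which is the shape the normalization step expects.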
Log-CPM normalization
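pb_expr_raw = pb.drop(columns=["donor_id", "disease"])
# Denominator: each donor's total over the queried signature genes only (no other genes were fetched).
row_sums = pb_expr_raw.sum(axis=1)
# Natural log of (1 + counts per million).
pb_norm = np.log1p(pb_expr_raw.div(row_sums, axis=0) * 1e6)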
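As a worked example of the normalization, the sketch below applies the same formula, log1p(count / row total × 10^6), to a made-up two-donor table. The second donor has exactly ten times the depth of the first, so both rows should normalize to the same values. Note that the row total here, as in the script, is taken over whichever gene columns are present in the table.

```python
import numpy as np
import pandas as pd

# Made-up pseudobulk counts; donor d2 is donor d1 at ten times the depth.
toy_counts = pd.DataFrame(
    {"GENE_A": [1, 10], "GENE_B": [3, 30]},
    index=["d1", "d2"],
)

# Scale each row to counts per million of its own total, then take log(1 + x).
toy_totals = toy_counts.sum(axis=1)
toy_cpm = toy_counts.div(toy_totals, axis=0) * 1e6
toy_norm = np.log1p(toy_cpm)
print(toy_norm)
```

Because every row is divided by its own total, a uniform change in depth cancels out; the values depend only on the relative proportions among the included genes.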
Save outputs
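from pathlib import Path

# Output directory, assumed from the paths listed under Outputs.
OUT_DIR = Path("results/tables/kpmp")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Expression table: donor metadata columns followed by log-CPM values.
pb_expr_out = pd.concat(
    [pb[["donor_id", "disease"]].reset_index(drop=True),
     pb_norm.reset_index(drop=True)],
    axis=1
)
pb_expr_out.to_csv(OUT_DIR / "kpmp_pseudobulk_expr.csv", index=False)

# Metadata table: cells per donor. Both groupbys sort by the same keys, so rows align by position.
meta_out = pb[["donor_id", "disease"]].copy()
meta_out["n_cells"] = (expr_df
                       .groupby(["donor_id", "disease"], observed=True)
                       .size()
                       .reset_index(name="n_cells")["n_cells"]
                       .values)
meta_out.to_csv(OUT_DIR / "kpmp_pseudobulk_meta.csv", index=False)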
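The n_cells column above is attached by position, which works because both groupby calls sort on the same keys. A possible alternative, sketched here on made-up data with hypothetical variable names, is to join the counts on the grouping keys so that alignment does not depend on row order.

```python
import pandas as pd

# Made-up cell table: two cells for donor d1, one for donor d2.
meta_cells = pd.DataFrame({
    "donor_id": ["d1", "d1", "d2"],
    "disease": ["normal", "normal", "chronic kidney disease"],
    "CLU": [1, 3, 2],
})
keys = ["donor_id", "disease"]

meta_pb = meta_cells.groupby(keys, observed=True).sum().reset_index()
cell_counts = (meta_cells
               .groupby(keys, observed=True)
               .size()
               .reset_index(name="n_cells"))

# Join on the keys; validate raises if either side has duplicate key rows.
toy_meta = meta_pb[keys].merge(cell_counts, on=keys, how="left",
                               validate="one_to_one")
print(toy_meta)
```

The result is the same as the positional version whenever the two groupings agree, so this is a robustness choice rather than a change in output.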
Outputs
| File | Description |
|---|---|
| results/tables/kpmp/kpmp_pseudobulk_expr.csv | log-CPM per donor × gene (donors as rows) |
| results/tables/kpmp/kpmp_pseudobulk_meta.csv | donor_id, disease, n_cells |