KPMP Data Extraction

Related to Figure 6.

Kidney Precision Medicine Project (KPMP) atlas — Lake et al., Nature 2023. Accessed via the CellxGene Census open API (no DUA required). Collection ID: bcb61471-2a44-4d00-a0af-ff085512674c. Cells queried: tissue_general == 'kidney' AND (disease == 'chronic kidney disease' OR disease == 'normal'). All scoring is deferred to R; this script saves pseudobulk CSVs only.

Gene signatures

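The four signature gene lists below are merged into one de-duplicated list, QUERY_GENES, so that a single Census query fetches every gene. Several genes belong to more than one signature (for example, LDHA appears in both the hypoxia and glycolysis lists, and COL4A2 in both the CAAH and ECM lists); dict.fromkeys removes the repeats while preserving order. Scoring happens entirely in R.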

CAAH_10_GENES    = ["CLU", "LRP2", "LAMP2", "COL4A2", "SPINK1",
                    "WFDC2", "PAX8", "LYZ", "S100A9", "CDH13"]
HYPOXIA_GENES    = ["HIF1A", "EPAS1", "EPO", "EPOR", "VEGFA",
                    "ALDOA", "LDHA", "PGK1", "SLC2A1", "ENO1",
                    "PGAM1", "TPI1", "ANGPTL4", "HILPDA", "PLAUR",
                    "SLC16A1", "BNIP3", "CA9", "P4HA1"]
GLYCOLYSIS_GENES = ["HK2", "PFKM", "PKM", "LDHA", "HK1",
                    "ALDOA", "GAPDH", "PGK1", "PGAM1", "ENO1",
                    "TPI1", "GPI", "PFKL", "PFKP", "LDHB"]
ECM_GENES        = ["COL1A1", "COL1A2", "COL3A1", "COL4A1", "COL4A2",
                    "COL12A1", "FN1", "TIMP1", "TIMP2", "MMP2",
                    "ACTA2", "VIM", "TGFB1", "CCN2", "POSTN", "THBS1"]

QUERY_GENES = list(dict.fromkeys(
    CAAH_10_GENES + HYPOXIA_GENES + GLYCOLYSIS_GENES + ECM_GENES
))
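
The de-duplication step can be checked on a toy example. The sketch below uses two shortened, illustrative lists (hypoxia_subset and glycolysis_subset are hypothetical names, not the full signatures) to show that dict.fromkeys keeps the first occurrence of each gene and preserves order, which a plain set() would not guarantee.

```python
# Shortened, illustrative lists; the real signatures are defined above.
hypoxia_subset = ["HIF1A", "LDHA", "PGK1"]
glycolysis_subset = ["HK2", "LDHA", "PGK1", "GAPDH"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving first-seen order.
query_subset = list(dict.fromkeys(hypoxia_subset + glycolysis_subset))

print(query_subset)
```

A stable gene order keeps the column order of the output CSV reproducible from run to run.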

Census query

get_anndata() returns raw, unnormalized counts. The var_value_filter restricts to signature genes only so the download is small; obs_column_names pulls just the metadata fields needed for pseudobulking.

import cellxgene_census

KPMP_COLLECTION_ID = "bcb61471-2a44-4d00-a0af-ff085512674c"

with cellxgene_census.open_soma() as census:
    datasets   = census["census_info"]["datasets"].read().concat().to_pandas()
    kpmp_ids   = datasets[datasets["collection_id"] == KPMP_COLLECTION_ID]["dataset_id"].tolist()

    id_filter    = " or ".join([f"dataset_id == '{d}'" for d in kpmp_ids])
    value_filter = (f"({id_filter}) and tissue_general == 'kidney'"
                    f" and (disease == 'chronic kidney disease' or disease == 'normal')")

    adata = cellxgene_census.get_anndata(
        census,
        organism         = "Homo sapiens",
        obs_value_filter = value_filter,
        var_value_filter = f"feature_name in {QUERY_GENES}",
        obs_column_names = ["soma_joinid", "donor_id", "disease", "dataset_id"],
    )
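
Two things can go wrong quietly in a query like this: the collection lookup may return no dataset IDs, which leaves an empty "()" clause in the filter, and a gene symbol that Census does not list under that feature_name is simply missing from the result. The sketch below shows one way to guard against both. build_obs_filter and missing_genes are hypothetical helpers, not part of the script or of cellxgene_census, and they only manipulate strings and lists, so they run without a network connection.

```python
def build_obs_filter(dataset_ids, diseases=("chronic kidney disease", "normal")):
    """Build the obs value filter string (hypothetical helper).

    Raises instead of emitting an empty '()' clause when no IDs were found.
    """
    if not dataset_ids:
        raise ValueError("no dataset IDs found for the collection")
    id_clause = " or ".join(f"dataset_id == '{d}'" for d in dataset_ids)
    disease_clause = " or ".join(f"disease == '{d}'" for d in diseases)
    return f"({id_clause}) and tissue_general == 'kidney' and ({disease_clause})"


def missing_genes(requested, returned):
    """Return requested gene symbols absent from the query result, in order."""
    found = set(returned)
    return [g for g in requested if g not in found]


# Example with placeholder dataset IDs "a" and "b".
example_filter = build_obs_filter(["a", "b"])
print(example_filter)

# Example: one requested symbol did not come back from the query.
print(missing_genes(["CLU", "LRP2", "CCN2"], ["CLU", "LRP2"]))
```

In the script's terms, missing_genes(QUERY_GENES, adata.var["feature_name"]) would list any signature genes that the R scoring step will not receive.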

Pseudobulk construction

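Counts are summed per donor across all of that donor's cells. A donor may appear in more than one KPMP dataset (source and integrated versions), in which case the same cells are summed more than once. When every cell of a donor is duplicated equally, this only rescales the donor's totals, and log-CPM normalization removes that scale factor along with ordinary library-size differences between donors. The n_cells value saved in the metadata does include the duplicate copies. If exact cell counts matter, Census provides an is_primary_data obs column that can be added to the filter to keep one copy of each cell; this script does not apply it.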

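import numpy as np
import pandas as pd
import scipy.sparse as sp

# Densify the counts matrix (cells × signature genes only).
X = adata.X.toarray() if sp.issparse(adata.X) else np.array(adata.X)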
gene_names = adata.var["feature_name"].values

obs = adata.obs.reset_index(drop=True)
expr_df             = pd.DataFrame(X, columns=gene_names)
expr_df["donor_id"] = obs["donor_id"].astype(str).values
expr_df["disease"]  = obs["disease"].astype(str).values

pb = (expr_df
      .groupby(["donor_id", "disease"], observed=True)
      .sum()
      .reset_index())
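
The aggregation above can be illustrated on a few made-up cells. The sketch below uses invented counts for two genes and two donors (the values are illustrative, not KPMP data) and applies the same groupby-and-sum pattern, giving one row per donor.

```python
import pandas as pd

# Toy per-cell counts: three cells from donor A, two from donor B.
cells = pd.DataFrame({
    "donor_id": ["A", "A", "A", "B", "B"],
    "disease":  ["normal", "normal", "normal",
                 "chronic kidney disease", "chronic kidney disease"],
    "CLU":      [1, 0, 2, 5, 5],
    "LRP2":     [0, 3, 1, 0, 2],
})

# Sum counts over all cells that share a donor and disease label.
toy_pb = (cells
          .groupby(["donor_id", "disease"], observed=True)
          .sum()
          .reset_index())

print(toy_pb)
```

Donor A ends up with CLU = 3 and LRP2 = 4, and donor B with CLU = 10 and LRP2 = 2.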

Log-CPM normalization

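pb_expr_raw = pb.drop(columns=["donor_id", "disease"])
# The library size here is the sum over the queried signature genes only,
# not the donor's full transcriptome, because var_value_filter dropped the rest.
row_sums    = pb_expr_raw.sum(axis=1)
pb_norm     = np.log1p(pb_expr_raw.div(row_sums, axis=0) * 1e6)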
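
Because the matrix holds only the signature genes, each donor's CPM values sum to one million across that gene panel, so they are best read as shares within the panel. The sketch below works through the arithmetic on invented counts. It also adds one assumption that the script itself does not make: a donor whose counts are all zero is turned into an explicit NaN row by masking the zero total.

```python
import numpy as np
import pandas as pd

# Invented pseudobulk counts for two genes; donor_empty has no counts at all.
toy_counts = pd.DataFrame(
    {"CLU": [3, 10, 0], "LRP2": [1, 30, 0]},
    index=["donor_A", "donor_B", "donor_empty"],
)

totals = toy_counts.sum(axis=1)

# Mask zero totals (they become NaN) so empty donors yield NaN rows
# rather than relying on 0/0 behaviour.
safe_totals = totals.where(totals > 0)

cpm = toy_counts.div(safe_totals, axis=0) * 1e6
log_cpm = np.log1p(cpm)

print(cpm)
print(log_cpm)
```

For donor_A the CPM values are 750,000 and 250,000 (3/4 and 1/4 of one million), which makes the panel-relative nature of the numbers easy to see.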

Save outputs

# OUT_DIR is defined elsewhere in the script; the Outputs table implies results/tables/kpmp.
pb_expr_out = pd.concat(
    [pb[["donor_id", "disease"]].reset_index(drop=True),
     pb_norm.reset_index(drop=True)],
    axis=1
)
pb_expr_out.to_csv(OUT_DIR / "kpmp_pseudobulk_expr.csv", index=False)

n_cells = (expr_df
           .groupby(["donor_id", "disease"], observed=True)
           .size()
           .reset_index(name="n_cells"))
# Merge on the grouping keys instead of relying on matching row order.
meta_out = pb[["donor_id", "disease"]].merge(n_cells, on=["donor_id", "disease"], how="left")
meta_out.to_csv(OUT_DIR / "kpmp_pseudobulk_meta.csv", index=False)

Outputs

File                                            Description
results/tables/kpmp/kpmp_pseudobulk_expr.csv    log-CPM per donor × gene (donors as rows)
results/tables/kpmp/kpmp_pseudobulk_meta.csv    donor_id, disease, n_cells