SISKA Atlas Extraction

Related to Figure 6.

SISKA mouse kidney single-nucleus atlas (Kloetzer et al., Nature Genetics 2025). The m_humphreys_DKD sub-cohort contains three groups: healthy controls, untreated CKD, and CKD + ACEi. Download Mouse.h5ad from Zenodo (record 15007208) and place it at data/siska/Mouse.h5ad. All scoring is deferred to R; this script only saves the per-cell metadata, pseudobulk, and cell-type CSVs.

Gene signatures (mouse)

CAAH_MOUSE = ["Clu", "Lrp2", "Lamp2", "Col4a2", "Spink1",
              "Wfdc2", "Pax8", "Lyz2", "S100a9", "Cdh13"]

HYPOXIA_MOUSE    = ["Hif1a", "Epas1", "Epo", "Epor", "Vegfa",
                    "Aldoa", "Ldha", "Pgk1", "Slc2a1", "Eno1",
                    "Angptl4", "Hilpda", "Bnip3"]
GLYCOLYSIS_MOUSE = ["Hk2", "Pfkm", "Pkm", "Ldha", "Hk1",
                    "Aldoa", "Gapdh", "Pgk1", "Pgam1", "Eno1",
                    "Tpi1", "Gpi", "Ldhb"]
ECM_MOUSE        = ["Col1a1", "Col1a2", "Col3a1", "Col4a1", "Col4a2",
                    "Col12a1", "Fn1", "Timp1", "Timp2", "Mmp2",
                    "Acta2", "Vim", "Tgfb1", "Postn", "Thbs1"]

# sorted() keeps the gene order (and therefore the CSV column order) reproducible across runs
ALL_GENES = sorted(set(CAAH_MOUSE + HYPOXIA_MOUSE + GLYCOLYSIS_MOUSE + ECM_MOUSE))
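
Several genes sit in more than one signature (for example Ldha in both the hypoxia and glycolysis lists, and Col4a2 in both the CAAH and ECM lists), which is why ALL_GENES is deduplicated. A minimal sketch for listing such shared genes follows; `shared_genes` is a hypothetical helper that is not part of the original script, and the demo lists are abbreviated for illustration.

```python
from collections import defaultdict

def shared_genes(signatures):
    """Return {gene: sorted signature names} for genes in more than one signature."""
    membership = defaultdict(set)
    for name, genes in signatures.items():
        for gene in genes:
            membership[gene].add(name)
    return {g: sorted(names) for g, names in membership.items() if len(names) > 1}

# Abbreviated lists for illustration only; pass the full signatures in practice.
demo = {
    "CAAH": ["Clu", "Col4a2"],
    "HYPOXIA": ["Hif1a", "Ldha", "Pgk1"],
    "GLYCOLYSIS": ["Hk2", "Ldha", "Pgk1"],
    "ECM": ["Col4a2", "Fn1"],
}
overlap = shared_genes(demo)
```

Knowing the overlap matters downstream, because a shared gene contributes to every signature score it belongs to.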

Load the atlas and join treatment metadata

The h5ad obs table contains orig_ident (sample ID) and proj (sub-cohort), but the treatment labels live in a separate metadata CSV. The two are joined on orig_ident.

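import numpy as np
import pandas as pd
import scanpy as sc
import scipy.sparse as sp

# H5AD (data/siska/Mouse.h5ad), META_CSV and OUT_DIR are paths defined elsewhere in the script
adata = sc.read_h5ad(H5AD)

meta = pd.read_csv(META_CSV)
meta_by_sample = meta.set_index("orig_ident")  # build the lookup once, outside the loop
for col in ["treated", "condition_harmonized", "disease"]:
    adata.obs[col] = adata.obs["orig_ident"].map(meta_by_sample[col])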
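
Because `Series.map` returns NaN for any orig_ident that is absent from the metadata CSV, a failed join would go unnoticed until the subsetting step drops those cells. The following sketch lists the unmatched samples; `unmapped_samples` is a hypothetical helper and the data frames are toy data invented for illustration.

```python
import pandas as pd

def unmapped_samples(obs, meta, key="orig_ident", col="treated"):
    """Return the sorted sample IDs in obs that receive no value for `col` from meta."""
    lookup = meta.drop_duplicates(key).set_index(key)[col]
    mapped = obs[key].astype(str).map(lookup)
    return sorted(obs.loc[mapped.isna(), key].astype(str).unique())

# Toy data, invented for illustration.
obs = pd.DataFrame({"orig_ident": ["s1", "s1", "s2", "s3"]})
meta = pd.DataFrame({
    "orig_ident": ["s1", "s2"],
    "treated": ["ACEi", "Control_healthy"],
})
missing = unmapped_samples(obs, meta)
```

In the real script the same check could be run on `adata.obs` and `meta` right after the join.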

Subset to the three-group comparison

keep = (
    (adata.obs["proj"] == "m_humphreys_DKD") &
    (adata.obs["treated"].isin(["Control_healthy", "Control_diseased", "ACEi"]))
)
adata_sub = adata[keep].copy()

GROUP_MAP = {
    "Control_healthy":  "Healthy",
    "Control_diseased": "CKD",
    "ACEi":             "CKD + ACEi",
}
adata_sub.obs["Group"] = adata_sub.obs["treated"].map(GROUP_MAP)
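
Before exporting anything, it can help to confirm that every sample belongs to exactly one group and to see how many samples each group has. This is a sketch under the assumption that obs carries the orig_ident and Group columns created above; `samples_per_group` is a hypothetical helper and the table is toy data.

```python
import pandas as pd

def samples_per_group(obs, sample_col="orig_ident", group_col="Group"):
    """Count distinct samples per group; fail if a sample spans several groups."""
    pairs = obs[[sample_col, group_col]].drop_duplicates()
    clashing = pairs.loc[pairs[sample_col].duplicated(keep=False), sample_col]
    if not clashing.empty:
        raise ValueError(f"samples in more than one group: {sorted(set(clashing))}")
    return pairs[group_col].value_counts().to_dict()

# Toy per-cell table, invented for illustration (one row per cell).
obs = pd.DataFrame({
    "orig_ident": ["s1", "s1", "s2", "s3", "s4"],
    "Group": ["Healthy", "Healthy", "CKD", "CKD + ACEi", "CKD"],
})
counts = samples_per_group(obs)
```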

Save per-cell metadata

The cell-level metadata CSV is used in R for cell-type breakdown figures. Cell type labels come from annotation_final_level1.

cell_meta_cols = ["orig_ident", "Group", "treated",
                  "annotation_final_level1", "annotation_final_level1B",
                  "nCount_RNA", "nFeature_RNA"]
cell_meta_cols = [c for c in cell_meta_cols if c in adata_sub.obs.columns]

adata_sub.obs[cell_meta_cols].reset_index(drop=True).to_csv(
    OUT_DIR / "siska_mouse_cell_meta.csv", index=False
)

Pseudobulk construction

Counts are summed per sample (orig_ident), restricted to signature genes only to keep the CSV manageable.

X = adata_sub.X.toarray() if sp.issparse(adata_sub.X) else np.array(adata_sub.X)
gene_names = list(adata_sub.var_names)

found   = [g for g in ALL_GENES if g in gene_names]
sig_idx = [gene_names.index(g) for g in found]
X_sig   = X[:, sig_idx]

obs = adata_sub.obs.reset_index(drop=True)
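expr_df               = pd.DataFrame(X_sig, columns=found)
expr_df["orig_ident"] = obs["orig_ident"].astype(str).values
expr_df["Group"]      = obs["Group"].astype(str).values
# Per-cell library size over ALL genes (assumes adata_sub.X holds raw counts).
# CPM must be scaled by this total, not by the signature-gene total.
expr_df["_lib_size"]  = X.sum(axis=1)

pb = expr_df.groupby(["orig_ident", "Group"], observed=True).sum().reset_index()
pb_meta = pb[["orig_ident", "Group"]].copy()
pb_expr = pb.drop(columns=["orig_ident", "Group", "_lib_size"])

# log1p(CPM): counts per million of each sample's total counts
pb_norm = np.log1p(pb_expr.div(pb["_lib_size"], axis=0) * 1e6)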

pd.concat([pb_meta.reset_index(drop=True),
           pb_norm.reset_index(drop=True)], axis=1).to_csv(
    OUT_DIR / "siska_mouse_pseudobulk_expr.csv", index=False
)
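
Converting the whole count matrix to a dense array can be memory-hungry for a large atlas. One alternative, sketched here under the assumption that the matrix holds raw counts in a SciPy sparse format, sums cells per sample by multiplying with a sparse sample × cell indicator matrix. `pseudobulk_sum` is a hypothetical helper, not the script's own method, and the counts are toy values.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

def pseudobulk_sum(X, sample_ids):
    """Sum rows of X (cells x genes) within each sample without densifying X."""
    codes, samples = pd.factorize(np.asarray(sample_ids, dtype=str), sort=True)
    n_cells = X.shape[0]
    # indicator[i, j] = 1 when cell j belongs to sample i
    indicator = sp.csr_matrix(
        (np.ones(n_cells), (codes, np.arange(n_cells))),
        shape=(len(samples), n_cells),
    )
    sums = (indicator @ sp.csr_matrix(X)).toarray()
    return samples, sums

# Toy counts: 4 cells x 3 genes, two samples.
X = sp.csr_matrix(np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 0, 5]]))
samples, sums = pseudobulk_sum(X, ["a", "b", "a", "b"])

# log1p(CPM), scaling by each sample's total over all genes in X
lib_size = sums.sum(axis=1, keepdims=True)
log_cpm = np.log1p(sums / lib_size * 1e6)
```

Only the small sample × gene result is dense, so the signature columns can be selected afterwards at negligible cost.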

Cell-type × sample mean expression

Used in R for the cell-type heatmap. Mean raw expression (not log-CPM) is computed per cell type per sample; the R script applies its own normalization for the heatmap.

expr_df_ct = pd.DataFrame(X_sig, columns=found)
expr_df_ct["orig_ident"] = obs["orig_ident"].astype(str).values
expr_df_ct["Group"]      = obs["Group"].astype(str).values
expr_df_ct["cell_type"]  = obs["annotation_final_level1"].astype(str).values

ct_pb = expr_df_ct.groupby(
    ["orig_ident", "Group", "cell_type"], observed=True
).mean().reset_index()
ct_pb.to_csv(OUT_DIR / "siska_mouse_celltype_expr.csv", index=False)

expr_df_ct.groupby(
    ["orig_ident", "Group", "cell_type"], observed=True
).size().reset_index(name="n_cells").to_csv(
    OUT_DIR / "siska_mouse_celltype_counts.csv", index=False
)
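
Means computed from only a handful of cells are noisy, which is one reason the cell counts are exported alongside the means. The sketch below shows how the two tables could be combined to blank out sparsely populated cell type × sample combinations. `mask_sparse_groups` is a hypothetical helper, the `min_cells=10` threshold is an arbitrary assumption, and the actual filtering (if any) happens in the R script.

```python
import numpy as np
import pandas as pd

KEYS = ["orig_ident", "Group", "cell_type"]

def mask_sparse_groups(ct_expr, ct_counts, min_cells=10):
    """Set mean expression to NaN where a group has fewer than min_cells cells."""
    merged = ct_expr.merge(ct_counts, on=KEYS, how="left", validate="one_to_one")
    gene_cols = [c for c in ct_expr.columns if c not in KEYS]
    low = merged["n_cells"] < min_cells
    merged.loc[low, gene_cols] = np.nan
    return merged

# Toy tables, invented for illustration.
ct_expr = pd.DataFrame({
    "orig_ident": ["s1", "s1"],
    "Group": ["CKD", "CKD"],
    "cell_type": ["PT", "Podo"],
    "Lrp2": [2.5, 0.1],
})
ct_counts = pd.DataFrame({
    "orig_ident": ["s1", "s1"],
    "Group": ["CKD", "CKD"],
    "cell_type": ["PT", "Podo"],
    "n_cells": [120, 3],
})
masked = mask_sparse_groups(ct_expr, ct_counts, min_cells=10)
```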

Outputs

File	Description
`siska_mouse_pseudobulk_expr.csv`	log-CPM per sample × gene
`siska_mouse_cell_meta.csv`	per-cell metadata (cell type, group, QC)
`siska_mouse_celltype_expr.csv`	mean expression per cell type × sample
`siska_mouse_celltype_counts.csv`	cell counts per cell type × sample
