GSAM User Guide

The following examples demonstrate the use of the GSAM package for common analysis scenarios.

The data files used for these examples can be retrieved from the data/ directory on the associated documentation repository. If you wish you run these examples yourself, first download the files in that directory to your local machine and ensure that they are on your current MATLAB path.

1. Single GSA, Liver hepatocellular carcinoma

1.1. Load the data and gene set collection file

Load the differential expression (DE) results for hepatocellular carcinoma (LIHC) (paired-normal tissue vs. primary tumor tissue).

DEdata = readtable('DEresults_LIHC.txt');

Take a look at the table contents.

head(DEdata)

% ans =
% 
%   8×4 table
% 
%          gene_IDs           gene_names       logFC         PValue  
%     ___________________    ____________    __________    __________
% 
%     {'ENSG00000000003'}    {'TSPAN6'  }      -0.26345      0.050906
%     {'ENSG00000000419'}    {'DPM1'    }    -0.0030857       0.97084
%     {'ENSG00000000457'}    {'SCYL3'   }      0.044663       0.63024
%     {'ENSG00000000460'}    {'C1orf112'}        0.7297    0.00021919
%     {'ENSG00000000938'}    {'FGR'     }       -1.2409    3.6644e-15
%     {'ENSG00000000971'}    {'CFH'     }      -0.72639    0.00069828
%     {'ENSG00000001036'}    {'FUCA2'   }      0.050749       0.66557
%     {'ENSG00000001084'}    {'GCLC'    }      -0.60349    1.1461e-06

Extract the data into individual variables

geneIDs   = DEdata.gene_IDs;     % gene Ensembl IDs
geneNames = DEdata.gene_names;   % gene abbreviations
log2FC    = DEdata.logFC;        % log2(fold-changes) (normal vs. tumor)
pvals     = DEdata.PValue;       % fold-change significance (p-values)

Load the KEGG_GSC.gmt gene set collection (GSC) file using the importGSC function, and look at the first few lines to see how it's formatted.

gsc = importGSC('KEGG_GSC.gmt');

% Gene set collection contains 146 gene sets and 4966 unique genes.

gsc(1:5,:)

% ans =
%
%   5×2 cell array
%
%     {'KEGG-GLYCOLYSIS-GLUCONEOGENESIS'}    {'ACSS2'}
%     {'KEGG-GLYCOLYSIS-GLUCONEOGENESIS'}    {'GCK'  }
%     {'KEGG-GLYCOLYSIS-GLUCONEOGENESIS'}    {'PGK2' }
%     {'KEGG-GLYCOLYSIS-GLUCONEOGENESIS'}    {'PGK1' }
%     {'KEGG-GLYCOLYSIS-GLUCONEOGENESIS'}    {'PDHB' }

Note

The KEGG_custom_GSC.gmt file used in this example is a modified version of the KEGG gene set collection retrieved from MSigDB, where all disease-related gene sets were removed.

1.2. Run the gene set analysis

Run the gene set analysis using the geneSetAnalysis function.

GSAres = geneSetAnalysis(geneNames, pvals, log2FC, gsc, ...
    'method', 'Wilcoxon', 'nPerm', 50000, 'gsSizeLim', [20, Inf]);

% Gene set collection contains 146 gene sets and 4966 unique genes.
% Checking for empty gene sets... Removed 0 empty sets.
% Checking for duplicated rows in GSC... Removed 0 duplicated rows.
% Checking gene set sizes... Removed 27 gene sets not satisfying size limits.
% Final number of gene sets remaining: 119
% Subsetting data for mixed-directional calculations... Done.
% Calculating test statistic... Done.
% Calculating significance via gene shuffling... Done.

Here, we chose the Wilcoxon rank-sum test as the method for combining gene-level statistics, set the number of permutations for significance calculation to 50000, and excluded gene sets that contain less than 20 genes.

Important

We used geneNames (instead of e.g. geneIDs) as input to the geneSetAnalysis function because gene names are used in the gsc that was loaded. It is important that the gene names or IDs in the gene list are of the same type as those used in the GSC file.

1.3. View and/or export the results

The results are returned as the GSAres table, which contains information about the size (number of genes) and significance (p-values, raw and adjusted) for each of the evaluated gene sets. The table can be viewed by double-clicking on GSAres in the Workspace, or using the open function.

open('GSAres');
GS_name                                                                             GS_size stat_nondir p_nondir padj_nondir stat_mixup p_mixup padj_mixup stat_mixdn p_mixdn padj_mixdn stat_distup p_distup padj_distup stat_distdn p_distdn padj_distdn
KEGG-GLYCOLYSIS-GLUCONEOGENESIS 51 519848 0.0102 0.0371 62259 0.5766 0.8788 196844 0.0018 0.0092 302878 0.9998 1 571415 0.00018 0.00067
KEGG-CITRATE-CYCLE-TCA-CYCLE 30 294717 0.0832 0.1799 44790 0.3987 0.8041 101736 0.0759 0.1966 200384 0.9817 1 313906 0.0183 0.0496
KEGG-PENTOSE-PHOSPHATE-PATHWAY 21 198224 0.2115 0.3872 47870 0.2579 0.6929 51476 0.3019 0.5239 176522 0.5622 1 183481 0.4379 0.6596
KEGG-PENTOSE-AND-GLUCURONATE-INTERCONVERSIONS 21 189170 0.3415 0.5080 8507 0.8360 0.9871 84792 0.2356 0.4344 103850 0.9998 1 256153 0.0002 0.0007
KEGG-FRUCTOSE-AND-MANNOSE-METABOLISM 32 278495 0.4365 0.6076 74220 0.2803 0.6948 65609 0.6202 0.8317 283091 0.3800 1 265485 0.6201 0.8580
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

The tabular format of the GSA results can often be difficult to interpret. A more intuitive way to visualize the results is to use a heatmap. Generate a heatmap to visualize GSAres using the GSAheatmap function.

GSAheatmap(GSAres, 'filterMethod', 'pval', 'filterThreshold', 0.01);

LIHC GSA heatmap

By default, the GSAheatmap function will use the adjusted p-values. Here, we chose to filter the gene sets to only show those with an adjusted p-value ≤ 0.01 for at least one of the five directionality types.

Note

For more information on the meaning and interpretation of the different p-value classes (distinct directional down and up, mixed-directinal down and up, and non-directional), see the publication where they were introduced.

Alternatively, the results can be written to a file.

writetable(GSAres,'GSAresults.txt','Delimiter','\t');  % export to tab-delimited .txt file
writetable(GSAres,'GSAresults.csv','Delimiter',',');  % export to comma-delimited .csv file

2. Using gene set collections derived from GEMs

Genome-scale metabolic models (GEMs) are convenient collections of all known reactions comprising the metabolism of an organism or cell type. Since most GEMs contain gene information, we can extract gene set collections from the associations of these genes with other model components, such as metabolites or subsystems (pathways).

2.1. Extract and use a metabolite-based GSC

For this example, we will use a human genome-scale metabolic model, Human-GEM. A .mat version of the model can be retrieved from the Human-GEM GitHub repository.

Once the HumanGEM.mat file is downloaded and added to the current working folder (or somewhere else on the MATLAB path), load it into MATLAB.

load('HumanGEM.mat');  % the model is loaded as a structure named 'ihuman'

Extract a metabolite-based GSC from the model using the extractMetaboliteGSC function.

gsc_met = extractMetaboliteGSC(ihuman);

% Gene set collection contains 3715 gene sets and 3625 unique genes.

Just as other GSCs (such as the KEGG collection described above) associate biological functions or processes to sets of genes, the metabolite GSC associates metabolites to sets of genes. This provides a convenient link between gene-level information and metabolism.

Use the same LIHC DE data as was used in Example 1 above:

DEdata = readtable('DEresults_LIHC.txt');

geneIDs   = DEdata.gene_IDs;
geneNames = DEdata.gene_names;
log2FC    = DEdata.logFC;
pvals     = DEdata.PValue;

Run a GSA with the LIHC DE data, now using the metabolite GSC gsc_met.

GSAres = geneSetAnalysis(geneIDs, pvals, log2FC, gsc_met, ...
    'method', 'Reporter', 'nPerm', 50000, 'gsSizeLim', [20, 200]);

% Checking for empty gene sets... Removed 71 empty sets.
% Checking for duplicated rows in GSC... Removed 0 duplicated rows.
% Checking gene set sizes... Removed 3236 gene sets not satisfying size limits.
% Final number of gene sets remaining: 408
% Subsetting data for mixed-directional calculations... Done.
% Calculating test statistic... Done.
% Calculating significance via gene shuffling... Done.

Note

This approach of using metabolite-based gene sets with the "reporter" method for combining gene-level statistics is termed reporter metabolite analysis.

Important

Notice that we used the geneIDs here instead of geneNames. This is because the Human-GEM model by default uses gene Ensembl IDs, which means the gsc_met will also contain Ensembl IDs. Try running the GSA using geneNames instead - you should see an error.

The results can again be visualized using the GSAheatmap function. Since a very large number of gene sets in this example exhibit a low p-value (e.g., < 0.001) we use the top each filter method to only include the top 10 most significant gene sets for each column (p-value type) in the heatmap.

GSAheatmap(GSAres, 'filterMethod', 'top each', 'filterThreshold', 10, 'colorMax', 5);

Note that we set the colorMax input to 5 to prevent the color range from autoscaling to the maximum -log10(p-value), which would result in very faint colors for most gene sets.

2.2. Extract and use a subsystem-based GSC

In the same way that a metabolite-based GSC can be extracted from a genome-scale model, we can generate a subsystem-based GSC:

gsc_subsys = extractSubsystemGSC(ihuman);

Subsystems are groups of reactions forming a pathway or sharing a common metabolic function. A few examples can be seen from gsc_subsys:

subsystem_list = unique(gsc_subsys);
subsystem_list(1:5)

% ans =
% 
%   5×1 cell array
% 
%     {'Acyl-CoA hydrolysis'                        }
%     {'Acylglycerides metabolism'                  }
%     {'Alanine, aspartate and glutamate metabolism'}
%     {'Amino sugar and nucleotide sugar metabolism'}
%     {'Aminoacyl-tRNA biosynthesis'                }

Run a GSA on the LIHC DE results with the same settings as above, but now using gsc_subsys.

GSAres = geneSetAnalysis(geneIDs, pvals, log2FC, gsc_subsys, ...
    'method', 'Reporter', 'nPerm', 50000, 'gsSizeLim', [20, 200]);

Visualize the results with a heatmap, filtering based on (adjusted) p-values.

GSAheatmap(GSAres, 'filterMethod', 'pval', 'filterThreshold', 0.05, 'colorMax', 5);

3. Combining the results of multiple GSAs

In some cases we have multiple sets of p-values and fold-changes from multiple DE analyses, and may want to compare the GSA results among these data sets. Both the geneSetAnalysis and GSAheatmap functions allow input of multiple datasets or results.

We will demonstrate two example approaches for running and comparing multiple GSAs which yield the same result. The examples will use the metabolite GSC from the Human-GEM model, and the "Reporter" method for combining statistics. Both approaches begin with the same steps of loading a GSC and specifying the list of dataset files:

% extract a metabolite GSC (see Example 2 for more detail)
load('HumanGEM.mat');
gsc_met = extractMetaboliteGSC(ihuman);
% specify list of DE data files to analyze
cancer_types = {'COAD'; 'KIRC'; 'LIHC'; 'STAD'};
fnames = strcat('DEresults_', cancer_types, '.txt');

% fnames =
% 
%   4×1 cell array
% 
%     {'DEresults_COAD.txt'}
%     {'DEresults_KIRC.txt'}
%     {'DEresults_LIHC.txt'}
%     {'DEresults_STAD.txt'}

3.1. Run multiple GSAs separately and combine the results

One approach is to run each GSA separately (e.g., in a for-loop), and then combine the results before calling the GSAheatmap function.

% load and analyze each dataset individually
for i = 1:numel(fnames)
    DEdata = readtable(fnames{i});
    GSAres{i} = geneSetAnalysis(DEdata.gene_IDs, DEdata.PValue, DEdata.logFC, ...
        gsc_met, 'method', 'Reporter', 'nPerm', 10000, 'gsSizeLim', [20, 200]);
    GSAres{i}.Properties.Description = cancer_types{i};  % label the table as the cancer type
end

3.2. Run multiple GSAs together in one function call

Another approach is to first combine the datasets, and then supply the combined data to the geneSetAnalysis function.

% load and combine datasets into cell arrays
for i = 1:numel(fnames)
    DEdata = readtable(fnames{i});
    logFC{i} = DEdata.logFC;
    PValue{i} = DEdata.PValue;
    genes{i} = DEdata.gene_IDs;
end

% run the GSA
GSAres = geneSetAnalysis(genes, PValue, logFC, gsc_met, 'method', 'Reporter', ...
    'nPerm', 10000, 'gsSizeLim', [20, 200], 'dataNames', cancer_types);


% ##### Running GSA on COAD (run 1 of 4) #####
% 
% Checking for empty gene sets... Removed 106 empty sets.
% Checking for duplicated rows in GSC... Removed 0 duplicated rows.
% Checking gene set sizes... Removed 3250 gene sets not satisfying size limits.
% Final number of gene sets remaining: 359
% Subsetting data for mixed-directional calculations... Done.
% Calculating test statistic... Done.
% Calculating significance via gene shuffling... Done.
%
%
% ##### Running GSA on KIRC (run 2 of 4) #####
% 
% Checking for empty gene sets... Removed 63 empty sets.
% Checking for duplicated rows in GSC... Removed 0 duplicated rows.
% Checking gene set sizes... Removed 3288 gene sets not satisfying size limits.
% Final number of gene sets remaining: 364
% Subsetting data for mixed-directional calculations... Done.
% Calculating test statistic... Done.
% Calculating significance via gene shuffling... Done.
% 
% ...

Note that in this second approach, we supplied the dataset names cancer_types as input to the geneSetAnalysis function.

3.3. Generate multi-GSA heatmaps

Both the above approaches should yield essentially the same GSAres result (though not identical because of stochasticity). To visualize the results of the multiple GSAs in the same plot, the GSAheatmap function can be used.

Here we choose to show the top 10 most significant gene sets (metabolites) from each dataset, and to scale the colorbar to a maximum -log10(p-value) of 5.

GSAheatmap(GSAres, 'filterMethod', 'top each', 'filterThreshold', 10, 'colorMax', 5);

% Warning! The following gene sets are not present in all GSAres tables and will therefore be ignored:
%   1,2-diacylglycerol-LD-SM pool
%   1-(1-alkenyl)-sn-glycero-3-phosphoethanolamine
%   1-acylglycerol-3P pool
%   10-HETE
%   11,12,15-THETA
%   ...and 70 more.

multi-GSA heatmap

Two heatmaps are produced: one showing the non-directional p-values (left), and one showing the distinct-direcitonal p-values (right). The distinct-directional p-values have been signed (+/-) after log-transformation, based on which direction (increase or decrease) was more significant.

Important

From the warning message, we see that some gene sets (metabolites) cannot be shown because they are not present in all GSA results across the different runs. This occurs because the different datasets each have a different gene list (usually due to filtering of low-count genes). When genes that are absent from the dataset are removed from the GSC in the pre-processing step, the gene set sizes may change and some gene sets may no longer satisfy the 20-200 range that was specified in the input.