1 Introduction

Somatic mutations play an important role in cancer development and disease progression, and mutational profiling is used far more commonly than other ‘omics’ analyses in clinical practice. With the development of PD-1 immunotherapy technology, many experiments show that patients with high TMB have been shown more clinical benefits of immunotherapy (PD-1/PD-L1). However, to date, WES remains confined to research settings, due to the high cost of the large genomic space sequenced. In the clinical setting, instead, targeted enrichment panels (gene panels) of various genomic sizes are emerging as the routine technology for TMB assessment. This package attempts to develop a new pathway-based gene panel for TMB assessment (pathway-based tumor mutational burden, PTMB), using somatic mutations files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format. Besides, we develop a multiple machine learning method using sample PTMB profiles to identify cancer-specific dysfunction pathways, which can be a biomarker of prognostic and predictive for cancer immunotherapy.

2 MAF field requirements

MAF files contain many fields ranging from chromosome names to cosmic annotations. However, most of the analysis in PTMB uses the following fields.

  • Mandatory fields:

Complete specification of MAF files can be found on NCI GDC documentation page.

3 Installation


4 Overview of the package

The pathwayTMB package is a Bioinformatics tool to identify Cancer-specific pathways. And pathwayTMB functions can be categorized into mainly Visualization and Analysis modules. Each of these functions and a short description is summarized as shown below:

1.Obtain survival benefit mutations frequency matrix.
2.Calculate the length of encoded genes.
3.Infer pathway-based tumor mutational burden profiles (PTMB).
4.Identify Cancer-specific dysfunction pathways.
5.Visualization results:
5.1 Plot Patients’ Kaplan-Meier Survival Curves.
5.2 Plot patient-specific dysfunction pathways and user-interested geneset mutually exclusive and co-occurrence plots.
5.3 Plot patient-specific dysfunction pathways’ waterfall plots.
5.4 Plot the ROC curve.

4.2 Calculate coding genes’ length using gene transfer format (GTF) file.

The function get_gene_length can extract coding genes’ length from gene transfer format (GTF) file. The GTF file can download from GENCODE.

#calculate coding genes' length, filepath--the path of the GTF file

4.3 Infer pathway-based tumor mutational burden profiles–PTMB.

To identify cancer-specific pathways, PTMB requires sample mutation data with survival information. For each pathway, PTMB is defined as pathway tumor mutational burden corrected by genes’ length and number, the calculation formula is as follows:

where \(mut_{ij}\) is the \(i\) th sample’s number of mutations of the \(k\) th gene in the \(j\) th pathway; \(length_{jk}\) is the length of \(k\) th gene in the \(j\) th pathway; \(N\) is the total number of genes in pathway \(j\).The \(PTMB_{ij}\) denotes the pathway \(j\) tumor mutational burden for sample \(i\), which was calculated in the context of survival-related genes’ mutation frequency matrix.

#perform the function `get_PTMB`
#show the first six lines of PTMB matrix
#>                                           Pat03   Pat06     Pat07 Pat08
#> Hematopoietic cell lineage            0.0000000 0.00000 0.0000000     0
#> Complement and coagulation cascades   0.0000000 0.00000 4.1867281     0
#> Platelet activation                   0.0000000 0.00000 2.1018828     0
#> Toll-like receptor signaling pathway  0.0000000 0.00000 0.8930421     0
#> NOD-like receptor signaling pathway   0.4252838 2.25137 2.8148339     0
#> RIG-I-like receptor signaling pathway 0.0000000 0.00000 2.6852846     0
#>                                         Pat100   Pat101
#> Hematopoietic cell lineage            0.000000 0.000000
#> Complement and coagulation cascades   3.145643 0.000000
#> Platelet activation                   3.938654 0.000000
#> Toll-like receptor signaling pathway  0.000000 0.000000
#> NOD-like receptor signaling pathway   1.962122 0.000000
#> RIG-I-like receptor signaling pathway 0.000000 3.281809

4.4 Identify cancer-specific pathways

In order to identify cancer-specific dysfunction pathways, we develop a multiple machine learning method using pathway-based tumor mutational burden profiles. The multiple machine learning method consists of three sequential steps, as follows:

  1. Wilcoxon test is used to tentatively identify differentially PTMB pathways between death and alive sample groups.
  2. Then random forest regression model is performed for further dysfunction pathways screening.
  3. Lasso+cox regression model is also performed for feature selection.

The function get_final_signature in the pathwayTMB package can implement the above process.

# filter the survival-related pathways
#view the final_character
#>                                          a1 
#> "Natural killer cell mediated cytotoxicity" 
#>                                          a4 
#>       "Antigen processing and presentation"

4.5 Visualization results

  1. The function ‘plotKMcurves’ uses to draw Kaplan-Meier survival curves based on PTMB-related risk score. The risk score is generated by the signature’s PTMB and the coefficient of ‘Univariate’ or ‘Multivariate’ cox regression. The command lines are as follows:

#Drawing Kaplan Meier Survival Curves using the final survival-related PTMB.
plotKMcurves(t(PTMB_matrix[final_character,]),sur=sur,returnAll = FALSE,risk.table = TRUE)

  1. The function ‘plotMutInteract’ uses Fisher Exact tests to detect mutually exclusive, co-occurring, and altered genesets or pathways, then visualize the results.

#a mutually exclusive co-occurrence chart showing the top 20 genes for mutation rates
genes<-names(sort(gene_fre,decreasing = TRUE))[1:20]
plotMutInteract(freq_matrix=mut_matrix, genes=genes,returnAll = FALSE)