Anticancer peptides (ACPs) are promising therapeutic agents

Abstract

Anticancer peptides (ACPs) are promising therapeutic agents to target and kill cancer cells. The accurate prediction of ACPs from given peptide sequences remains an open problem in the immunoinformatics field. Recently, machine learning algorithms have emerged as a promising tool to help experimental scientists predict ACPs. However, the performance of existing methods still needs to be improved. In this study, we present a novel approach for the accurate prediction of ACPs involving two steps: (i) We applied a two-step feature selection protocol on 7 feature encodings that cover various aspects of sequence information (composition-based, physicochemical properties, and profiles) and obtained their corresponding optimal feature-based models. The resultant predicted probabilities of ACPs were further utilized as feature vectors. (ii) The predicted probability feature vectors were then used as an input to a support vector machine to develop the final prediction model, called mACPpred. Cross-validation analysis showed that the proposed predictor performs significantly better than the individual feature encodings. Furthermore, mACPpred considerably outperformed the existing methods compared in this study when evaluated on an independent dataset.

Introduction

The complex process by which normal cells are transformed into abnormal cancer cells is known as carcinogenesis or tumorigenesis [1]. Such processes may be attributed to several factors such as heredity [2], environment [3], or a change in the physiological microenvironment of the affected cells [4]. Regardless of the driving factors, most cancers are distinguished by the continuing accumulation of genetic modifications in the founder cells [5]. In general, the division and differentiation of normal cells are strictly regulated by numerous signaling pathways. However, normal cells sometimes escape these signals, leading to uncontrolled growth and proliferation and later to cancer [1]. According to the World Health Organization (WHO), the most common types of cancer are found in the lung, liver, colorectum, stomach, prostate, skin, and breast [https://www.who.int/news-room/fact-sheets/detail/cancer]. Every year, cancer claims millions of lives in both developing and developed countries. For 2018, about 18 million new cancer cases and over 9 million cancer deaths were anticipated [6]. The number of deaths could reach over 13 million by 2030 [7]. In the United States (US) alone, approximately 1.7 million new cancer cases and over 600,000 cancer-related deaths are estimated for 2019 [8].

Traditional Methods for Treatment

Traditional methods for the treatment of cancer include surgery, radiation therapy, and chemotherapy. Treatment may also depend on the location, stage of the disease, and patient condition [9]. Despite continuous advancements in the field, these methods are rather expensive and often exhibit damaging effects on normal cells. Additionally, there is a growing concern that cancer cells may develop resistance to chemotherapy and molecularly targeted therapies [10]. Moreover, cancer cells are known to develop multidrug resistance through a broad range of mechanisms, making these cells resistant to the drug in use for treatment and to several other compounds [11]. As soon as the molecular mechanism behind cancer (or, as a matter of fact, any disease) is understood, the next logical step is to discover a desirable remedy [12]. Therefore, there is an urgent need to discover and design novel anti-cancer drugs to combat this noxious disease.

The Role of Peptides

During the last few decades, the role of peptides as anti-cancer therapeutic agents has been promising. In fact, their effective utilization is apparent from the several strategies available to address the progression of tumor growth and the spread of the disease [13]. These anti-cancer peptides (ACPs) have displayed the potential to inactivate various types of cancer cells [11]. ACPs are short peptides (typically 10–50 amino acids in length) that exhibit high specificity, high tumor penetration, and ease of synthesis and modification, in addition to a low cost of production [14, 15]. Generally, most ACPs adopt either an α-helical or a β-sheet conformation. However, in some cases, extended structures have also been identified [16]. ACPs can be classified into two major groups: (i) peptides that are toxic to both cancerous and normal cells (exhibiting little evidence of selectivity), and (ii) peptides that are toxic to cancer cells but not to normal mammalian cells and erythrocytes [11]. The mechanisms by which ACPs affect cancer cells are not yet completely understood. However, both membranolytic and non-membranolytic mechanisms have been implicated [11]. Furthermore, mechanisms involving the inhibition of certain biological processes such as angiogenesis, protein–protein interactions, signal transduction pathways, and gene expression (including the inhibition of enzymes or proteins) have also been highlighted [13].

ACPs Derived from Protein Sequences

Since most ACPs are derived from protein sequences [17], the discovery of novel ACPs for cancer treatment will be a focus of future research. It is expected that the number of ACPs will increase with the rapid growth of protein sequences in public databases as a consequence of high-throughput sequencing projects [15]. Identifying and developing novel ACPs by experimental methods is costly and time-consuming. Therefore, it is essential to develop sequence-based computational methods to promptly identify potential ACP candidates from sequencing data prior to their synthesis. In this study, we constructed a lowest-redundancy benchmark dataset and used it to develop a prediction model. To develop the prediction model, we explored 7 feature encodings: amino acid composition (AAC), dipeptide composition (DPC), composition-transition-distribution (CTD), quasi-sequence-order (QSO), amino acid index (AAIF), binary profile (NC5), and conjoint triad (CTF). To exclude irrelevant features from each feature encoding, we first applied a two-step feature selection protocol and identified the corresponding optimal feature-based models. Finally, the predicted probabilities obtained from the 7 feature encoding models were used as an input to a support vector machine (SVM) to construct the final model, called mACPpred. Our proposed method, mACPpred, achieved consistent performance on both the benchmark and independent datasets.
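To make the two composition-based encodings named above concrete, the following is a minimal sketch of AAC and DPC, assuming the common definitions (residue frequencies normalized by sequence length, and adjacent-pair frequencies normalized by the number of pairs). The function names `aac` and `dpc` and the example peptide are illustrative assumptions and are not taken from this study; the exact normalization used for mACPpred may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: fraction of each adjacent residue pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    # A list of overlapping pairs, so "VVV" contributes two "VV" pairs.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


peptide = "GLFDIVKKVV"  # hypothetical peptide, for illustration only
aac_vec = aac(peptide)
dpc_vec = dpc(peptide)
```

The jump from 20 to 400 dimensions between the two encodings is one reason a feature selection step becomes attractive for DPC and similar high-dimensional encodings.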

Results

Performance of Various Feature Encodings

First, we examined the capability of each feature encoding to distinguish ACPs from non-ACPs. The optimal ML parameters for each feature encoding were obtained by conducting 10 independent 10-fold cross-validations. The best performance achieved by each feature encoding is shown in Figure 1. The results show that AAIF achieved the best performance, with an accuracy of 88.72%, while the AAC-, QSO-, DPC-, CTD-, CTF-, and NC5-based models ranked second to seventh, respectively. Overall, the 7 feature encodings achieved reasonable performance, with accuracies ranging between 81.0% and 88.7%. Furthermore, we observed that low-ranked feature encodings achieved the highest sensitivity and specificity. For instance, CTF achieved the highest sensitivity of 90.0% (1.5–20% higher than the other encodings). Similarly, NC5 achieved the highest specificity of 91.35% (1.06–18.0% higher than the other encodings). Although each feature encoding covers a different aspect of sequence information, each contributes towards better prediction. Therefore, it is worthwhile to integrate these seven feature encoding-based models into a single model to overcome the limitation(s) of each model and achieve a more balanced and stable performance.
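The repeated cross-validation scheme mentioned above can be sketched as an index generator. This is only an illustration of "10 independent 10-fold cross-validations" under the assumption that each repetition reshuffles the samples; the function name `repeated_kfold`, the seed, and the sample count are hypothetical, and the sketch does not stratify by class, which the actual study may have done.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, train_idx, test_idx) for `repeats` independent k-fold splits."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh shuffle makes each repetition independent
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            # Training indices are all folds except the held-out one.
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield r, train, test


# 50 is an arbitrary sample count chosen for the example.
splits = list(repeated_kfold(50, k=10, repeats=10))
```

A parameter setting would then be scored by averaging the held-out accuracy over all 100 train/test pairs.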

Comparison of SVM and Other Classifiers

To evaluate the effectiveness of SVM classifiers, we compared their performance against three other commonly used ML classifiers, namely Random Forest (RF), K-Nearest Neighbors (KNN), and Logistic Regression (LR), on the 7 feature encodings [18]. The performance of the three other methods under 10-fold cross-validation is shown in Table S1 (Supplementary Materials) and Figure 2. The results revealed that SVM performed consistently better than the three other classifiers on six out of seven feature encodings. Specifically, the average accuracy achieved by SVM was ~1.1% higher than RF, ~2% higher than LR, and ~6% higher than KNN, indicating that SVM has a slight advantage over the other methods in distinguishing ACPs from non-ACPs. Hence, we decided to use only SVM classifiers for further analysis.

Selection of the Optimal Features for Each Encoding

Since DPC, CTF, and other encodings have a large dimension, some of their features may be redundant or not equally important. Therefore, it is necessary to apply a feature selection protocol to remove redundant and irrelevant features. Various feature selection techniques are available in the literature [19–23]. Inspired by recent studies [24–26], we applied a two-step feature selection procedure to check whether it could reduce the feature dimension and improve the overall performance. In particular, the F-score algorithm was employed to rank the features present in each feature encoding, followed by a sequential forward search to find the optimal feature set (Figure 3). Table 1 shows that the number of features was significantly reduced for DPC (66.25%), CTF (58.82%), and NC5 (73%). On the other hand, only a slight reduction was observed for AAIF (4.76%), QSO (1.0%), and CTD (10.62%). No reduction was observed for AAC.

Next, we inspected the performance of each feature encoding based on its optimal features and compared it with the respective control (using all features). Figure 4 shows a significant improvement in performance for three feature encodings, namely NC5, DPC, and CTF, by 4.18%, 3.08%, and 2.63%, respectively, as compared to their controls. The improvement for CTD and QSO is marginal (<1%), while no improvement is seen for AAC and AAIF. Although AAIF showed no improvement, its feature dimension is slightly reduced. In the case of AAC, all the features are equally important for obtaining the best performance.
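The two-step protocol described above (F-score ranking followed by a sequential forward search) can be sketched as follows. The F-score formula is the standard two-class definition: the between-class separation of the feature means divided by the sum of the within-class sample variances. Everything else is an assumption made for illustration: the search simply evaluates growing prefixes of the ranked list, and the scorer `centroid_accuracy` is a nearest-centroid stand-in for the cross-validated SVM accuracy used in the study. The toy matrix `X` and labels `y` are invented.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in the positive and negative class."""
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    mean_all = (sum(pos) + sum(neg)) / (len(pos) + len(neg))
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = var_pos + var_neg
    return between / within if within > 0 else 0.0


def rank_features(X, y):
    """Step 1: feature indices sorted by decreasing F-score."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def sequential_forward_search(ranked, evaluate):
    """Step 2: add ranked features one at a time; keep the best-scoring prefix."""
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        score = evaluate(ranked[:k])
        if score > best_score:  # strict, so ties favor the smaller subset
            best_subset, best_score = ranked[:k], score
    return best_subset, best_score


def centroid_accuracy(X, y, subset):
    """Stand-in scorer (assumption): resubstitution accuracy of a nearest-centroid rule."""
    def centroid(label):
        rows = [[row[j] for j in subset] for row, l in zip(X, y) if l == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c1, c0 = centroid(1), centroid(0)
    correct = 0
    for row, label in zip(X, y):
        v = [row[j] for j in subset]
        d1 = sum((a - b) ** 2 for a, b in zip(v, c1))
        d0 = sum((a - b) ** 2 for a, b in zip(v, c0))
        correct += int((1 if d1 < d0 else 0) == label)
    return correct / len(y)


# Invented toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.9], [0.9, 6.0, 0.1],
     [3.0, 5.5, 0.8], [3.2, 4.5, 0.3], [2.9, 5.0, 0.7]]
y = [1, 1, 1, 0, 0, 0]

ranked = rank_features(X, y)
best_subset, best_score = sequential_forward_search(
    ranked, lambda subset: centroid_accuracy(X, y, subset))
```

In this toy case the search keeps only the top-ranked feature, mirroring how a large share of DPC, CTF, and NC5 features could be dropped without losing accuracy.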

To examine whether the optimal features are better than the excluded features for each feature encoding, we developed excluded feature-based prediction models using the procedure described in Section 2.1. We also compared their performance with the control (using all features) and with the optimal features. Notably, only four feature encodings (CTD, CTF, DPC, and NC5) were used for this analysis, while the remaining three feature encodings (AAC, AAI, and QSO) were excluded because their optimal feature sets are nearly the same size as their controls. Figure S1 shows that the optimal feature-based models are consistently better than both the control and the excluded feature-based models. Specifically, the average accuracy achieved by the optimal feature-based models is 16.8% higher than that of the excluded feature-based models and 2.7% higher than that of the control. This indicates that the two-step feature selection protocol selected the more important features, thereby contributing to improved performance. The optimal features for each feature encoding are provided in Table S2 (Supplementary Materials).

Construction of the Final Predictor

The optimal feature-based model obtained for each feature encoding was utilized in the development of the final prediction model. Some previous methods used hybrid features (a linear combination of various feature encodings) as an input to an ML classifier to develop the prediction model without any feature selection technique [27]. In contrast, we used only the predicted probabilities of ACPs (values in the range of 0.0 to 1.0) from the 7 individual optimal models as input features to an SVM, and thereby developed the final prediction model, called mACPpred. Our proposed predictor achieves a Matthews correlation coefficient (MCC), accuracy, sensitivity, specificity, and AUC of 0.836, 0.917, 0.891, 0.944, and 0.968, respectively. To show the effectiveness of mACPpred, we compared its performance with the seven feature encoding predictors (Figure 5A). Specifically, the MCC and accuracy of the proposed predictor were 4.6–13.8% and 3.5–7.3% higher than those of the individual predictors, indicating that integrating the various feature encodings is effective and contributes to improved performance.
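For readers less familiar with the threshold-based metrics reported above, the sketch below computes sensitivity, specificity, accuracy, and MCC from confusion-matrix counts using their standard definitions. The function name `classification_metrics` and the counts in the example are invented for illustration and are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true ACPs recovered
    specificity = tn / (tn + fp)  # fraction of non-ACPs correctly rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Invented counts: 10 positives (8 found) and 10 negatives (9 rejected).
m = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which is why it is often preferred as a single summary when sensitivity and specificity differ.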

It is possible that methods employing hybrid features (a combination of different feature encodings) perform better than the current approach, as they draw on multiple encodings and cover the feature space more completely. To investigate this possibility, we developed six hybrid-feature-based models using the following procedure. The seven feature encodings were ranked according to the accuracy obtained from the baseline models (Figure 1) and combined with AAI one by one (H1: AAI+AAC; H2: AAI+AAC+QSO; H3: AAI+AAC+QSO+DPC; H4: AAI+AAC+QSO+DPC+CTD; H5: AAI+AAC+QSO+DPC+CTD+CTF; H6: AAI+AAC+QSO+DPC+CTD+CTF+NC5). Each hybrid feature set was used as an input to an SVM, and the corresponding models were developed using the same procedure as described in Section 2.1. Figure 5B compares the performance of mACPpred with the hybrid-feature-based models: mACPpred performed better, with MCC and accuracy values 3.46–9.5% and 1.7–4.7% higher than the hybrid models, respectively. This demonstrates that our approach helped in achieving the best performance.

Performance Comparison on the Independent Dataset

There are several examples of prediction models that performed excellently during cross-validation but whose performance did not transfer to an independent dataset. Hence, an independent evaluation is needed to validate the robustness of the proposed method. Most importantly, the independent dataset constructed in this study did not share greater than 90% sequence identity with our training dataset or with the training datasets of other existing methods. Therefore, we compared the performance of mACPpred with previous methods such as MLACP and iACP. It should be noted that MLACP contains two prediction models, based on RF (RFACP) and SVM (SVMACP), and both models were considered for comparison.

Table 2 shows that mACPpred achieves an MCC, accuracy, sensitivity, specificity, and AUC of 0.829, 0.914, 0.885, 0.943, and 0.967, respectively. More specifically, the MCC and accuracy of mACPpred are 23.7–49.1% and 14.6–33.4% higher, respectively, than those of the other methods compared in this study, demonstrating that the proposed method is capable of achieving an encouraging performance. It must be noted that it is difficult to obtain a statistical estimate from the above-mentioned threshold-based comparison. Hence, we utilized a rank-based comparison using ROC [28], in which the AUC values of two different methods were assessed by a two-tailed test from which the p value for the observed difference was obtained [29]. Table 2 and Figure 6 show that mACPpred significantly outperformed the existing predictors on the independent dataset.
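The rank-based nature of the AUC can be illustrated with a minimal sketch that computes it as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half (the Mann-Whitney formulation). The function name `auc` and the scores are invented for illustration. The two-tailed significance test applied to pairs of AUC values is not reproduced here, since its details are given only by reference [29].

```python
def auc(scores_pos, scores_neg):
    """AUC as P(positive score > negative score), counting ties as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented predicted probabilities for three ACPs and three non-ACPs.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.4]
area = auc(pos_scores, neg_scores)
```

Because this quantity depends only on the ordering of the scores, it does not change with the decision threshold, which is what makes it suitable for comparing methods that were tuned at different cut-offs.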

Webserver Implementation

The mACPpred webserver is freely accessible at the following link: www.thegleelab.org/mACPpred. Users can upload or paste query peptide sequences in FASTA format. After the peptide sequences are submitted, the results are displayed in a separate interface. All datasets used in this study can be downloaded from the following link: http://thegleelab.org/mACPpred/ACPData.html to check the reproducibility of our findings.
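Since both the webserver input and the downloadable datasets rely on FASTA-formatted peptide sequences, a small parser may help when preparing queries or inspecting the data locally. This is a generic sketch of the FASTA convention (a header line starting with ">" followed by one or more sequence lines); the function name `parse_fasta` and the example records are hypothetical, and the sketch says nothing about how the server itself validates input.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical input with one wrapped sequence and one lowercase sequence.
example = ">pep1 hypothetical\nGLFDIV\nKKVV\n\n>pep2\nflpiia\n"
records = parse_fasta(example)
```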

Discussion

In this study, we developed a novel predictor called mACPpred to predict ACPs from a given peptide sequence. To develop the predictor, a two-step feature selection protocol was applied on seven feature encodings (AAC, DPC, CTD, CTF, AAI, QSO, and NC5) to obtain optimal feature-based prediction models, whose predicted probabilities of ACPs were further used as a feature vector. Finally, the probabilistic feature vector was used as an input to an SVM to develop the final prediction model. The benchmark and independent validations demonstrated that mACPpred clearly outperformed the existing ACP predictors compared in this study. The novelty of our method is as follows: (i) the benchmark (training) dataset has the lowest redundancy among the datasets reported in the literature; (ii) among the various feature encodings employed in this study, this is the first instance where CTF and QSO are employed in ACP prediction; and (iii) most of the existing predictors utilize either a single feature encoding or a combination of multiple feature encodings, so their feature dimension is very high, whereas we have used only seven probabilistic features that cover a wide range of information (position-specific, physicochemical, and compositional). In essence, our approach transforms a complex high-dimensional feature space into a low-dimensional one, facilitating better discrimination between ACPs and non-ACPs.

Moreover, our approach can be applied to other sequence-based prediction problems, including post-translational modifications, peptide function prediction, and DNA/RNA function prediction. Although the proposed predictor has shown excellent performance as compared to other methods, there is still room for improvement. This may include exploring other ML algorithms, such as decision tree-based [31,32] and neural network-based algorithms [33–35], on the same dataset; incorporating novel features and computational approaches as implemented in References [36–39]; and increasing the size of the training dataset as future experimental data become available. Furthermore, we have implemented our proposed algorithm in the form of a user-friendly webserver for the wider research community. It is expected that mACPpred will be helpful in the identification of novel potential ACPs.
