QCQuan

v0.8

Quick Start

This step-by-step tutorial shows how to let QCQuan analyze your labeled LC-MS/MS proteomics differential expression experiment and provide you with output files at both the non-redundant peptide level and the protein level, including a quality control and differential expression report in PDF format.

  1. You'll need the peptide-spectrum matches in .txt, .csv/tsv or .xlsx format. These should be in separate files: one per LC-MS/MS run. Please verify that all required data 'observables' or 'columns' are present in your PSM files and have valid contents, especially if you're not using files generated by Proteome Discoverer (PD).
    The required columns should have these exact names and {possible values}:
    • Sequence: {string} the peptide sequence.
    • Master Protein Accessions: {string;string;...} accession numbers of the proteins associated with a PSM, separated by semi-colons.
    • First Scan: {integer} identifier for the MS scan in which the peptide was detected. If you don't have such a column, just create one with all values set to a unique integer. PSMs detected multiple times by different PSM algorithms should have the same value.
    • (???): {string} names of the reporter labels or channel names you provided in the DoE file.
    If your column names differ from these, simply create a text file (wrapper) that contains on each line YOUR_NAME for the observable followed by OUR_NAME, separated by a TAB character.
    If you have any additional columns or observables you would like QCQuan to use, please refer to the complete set of compatible column names.
  2. You'll also need a 'design of experiment' (DoE) file. Create one by going to the "DoE Creator" and following the instructions. You'll have to give names to your different 'LC-MS/MS runs', specify which biological/chemical conditions were used in which MS run, and indicate for each MS run which 'channels' or 'reporter labels' (as they appear in the headers of your PSM files) belong to which of those conditions. At the end of the procedure, you can download the DoE file you need. This file can easily be modified with a text editor if you want to correct a mistake, or quickly generate your own new DoE file.
  3. Got your PSM files and DoE file? Great, in the "New Job" form, choose a job name. Then, upload your DoE file.
  4. Fill out the Job Settings form. Please note:
    • If you are not using Proteome Discoverer for generating PSM files, you will be warned to provide the wrapper file(s) you (should have) created in the PSM file verification step by uploading one under the "Advanced" button for each PSM data file.
    • In case you used multiple search engines to generate PSMs, the information from the Master engine will be preferred over that of the Slave(s).
    • QCQuan will refrain from using 'shared peptides' (peptides that may correspond to multiple proteins) unless you check the box under "Advanced settings". In that case a second analysis will be done, assuming that all proteins make (full) use of every peptide they share with another protein.
  5. Press "Submit job" and make sure you land on the "Job Info" page which states your job is running. If you didn't provide an e-mail address, be sure to copy and save the job ID. You can use your job ID to retrieve your job information from the "Job Lookup" page in case you close your Job Info page.
  6. As soon as the job is done, the Job Info page will provide you with download links for the report file as well as a .zip file containing:
    • normalized data on the non-redundant, modified peptide level (1 file per LC-MS/MS run)
    • differential expression analysis on the protein level (in 1 comprehensive file as well as split out in 1 file per condition)
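The column check and wrapper file from step 1 can be sketched in Python with pandas. This is a minimal illustration and not QCQuan's own code: the helper names `read_wrapper` and `missing_required_columns`, as well as the example headers `Peptide`, `Proteins` and `Scan`, are hypothetical.

```python
import io
import pandas as pd

# Required column names as listed in step 1 (reporter channels come from the DoE file).
REQUIRED_COLUMNS = ["Sequence", "Master Protein Accessions", "First Scan"]

def read_wrapper(text):
    """Parse wrapper lines of the form YOUR_NAME<TAB>OUR_NAME into a rename mapping."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        your_name, our_name = line.split("\t")
        mapping[your_name.strip()] = our_name.strip()
    return mapping

def missing_required_columns(df, channels):
    """Return the required columns (plus reporter channels) absent from the data frame."""
    return [c for c in REQUIRED_COLUMNS + list(channels) if c not in df.columns]

# Hypothetical PSM table from a non-PD platform with its own header names.
psm_text = "Peptide\tProteins\tScan\t126\t127\nAAK\tP1;P2\t1\t10.0\t12.0\n"
wrapper_text = "Peptide\tSequence\nProteins\tMaster Protein Accessions\nScan\tFirst Scan\n"

psms = pd.read_csv(io.StringIO(psm_text), sep="\t")
psms = psms.rename(columns=read_wrapper(wrapper_text))
missing = missing_required_columns(psms, channels=["126", "127"])
print(missing)  # an empty list means all required columns are present
```

Running such a check before uploading makes it easier to spot a misspelled header than waiting for the job to fail.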


Documentation

Here you can find the scientific description as well as prerequisites and technical specifications of the QCQuan workflow, which runs on a server located at VITO and is accessible through a web interface.

The scientific workflow consists of three distinct steps:

  1. Processing: cleans the input data while gathering some QC information, and aggregates the PSMs to the modified, non-redundant peptide level, then performs the CONSTANd normalization.
  2. Analysis: while gathering additional QC information, transforms the information from the peptide to the protein level and performs a differential expression analysis (DEA). Also performs an exploratory analysis for quality control (QC) purposes, including a principal component analysis (PCA) and hierarchical clustering (HC) on the peptide-level data matrix.
  3. Report: produces a PCA plot (first 2 components), an HC dendrogram and, for each non-reference condition, a volcano plot with a list of the top differential proteins. From the gathered QC info, produces some statistics as well as MS1 calibration and intensity plots if possible. Lastly, summarizes all visualizations, statistics and other relevant information (including meta-data) into a PDF report.

The output for the user contains the created report, as well as some tab-separated (.tsv) data files:

  • normalized data on the non-redundant, modified peptide level (1 file per LC-MS/MS run)
  • differential expression analysis on the protein level (in 1 comprehensive file as well as split out in 1 file per condition)

If you wish to acquire any intermediate output files, please contact us.

Most of the parameters which control the behaviour of the workflow are fixed when using the web interface. If you wish to run a job with a custom set of parameters, please contact us.

QCQuan workflow overview. Red: main input/output and workflow backbone. Blue: optional input/output and workflow steps. Green: input and workflow steps directly influenced by user-specifiable parameters. Grey: currently disabled.

Prerequisites

The following requirements are to be met in order to use QCQuan:

  • This workflow was built for labeled shotgun proteomics with TMT-labeled N-plex peptide samples eluted through a liquid chromatography column, then ionised, then selected for m/z value by an MS1 and then further fragmented to be quantified by an MS2. However it also works for experiments based on equivalent techniques, f.i. iTRAQ labeling.
  • The MS2 spectra of the mass spectrometer should have already been analyzed and converted into peptide-spectrum matches (PSMs), using one or multiple database search algorithms such as SEQUEST [DOI: 10.1016/1044-0305(94)80016-2] or MASCOT [DOI: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2].
  • In case your experiment consists of more than one multiplexed LC-MS/MS run, make sure that the number of pooled samples is the same for each LC-MS/MS run. Do not, f.i., try to combine a TMT 6-plex with a TMT 10-plex.
  • In case multiple samples belonging to the same condition are spread across multiple LC-MS/MS runs, make sure that the samples belonging to the same condition are always evenly spread across all the LC-MS/MS runs they appear in. Do not, f.i., put 5 samples of condition A and 1 of condition B in the first LC-MS/MS run, and then 1 sample of condition A and 5 of condition B in the second. Rather, you should respect intelligent experimental design and spread the samples so that each run contains 3 samples of each condition.
  • QCQuan makes use of the CONSTANd normalization algorithm. Conditions for applicability (DOI: 10.1074/mcp.M115.056911) can be verified on an MA-plot of your peptide- or protein-level data [IMG1: MA-plot]:
    1. The majority of proteins/peptides are not differentially expressed, to avoid a bias in the estimate of the mean value. It is assumed that up-down shifts are not due to biological causes. The reference set used in the normalization step is the set of all peptides identified in the experiment.
      MA-plot: most dots lie around M=0 and form a central cloud (red zone).
    2. The number of up-regulated proteins/peptides is roughly equal to the number of down-regulated ones. If the data were skewed, this would lead to a bias in the normalization result.
      MA-plot: the number of dots significantly above the central cloud is roughly equal to the number of dots below.
    3. The systematic bias between the samples is proportional to the quantification signal intensity. It is an additive effect with a relative magnitude, so that it is possible to find one normalization factor for each quantification sample.
      MA-plot: the shape of the central cloud with increasing A looks rectangular: the top and bottom sides are horizontal, not inclined.

Processing

The input data files are PSM files, preferably exported from Proteome Discoverer (PD) 2.1, although it is perfectly possible to use different platforms as long as the data complies with the specifications. They can be represented as tables (one for each LC-MS/MS run) where each row represents a unique PSM corresponding to a peptide observation, containing the observed values of the variables/quantities described by the column headers.

The following steps process the data into a human-readable format at the peptide level that is ready for interpretation by the data analysis part of the workflow. Relevant parameters from the processing configuration file (automatically generated by the web interface) are in teletype font. Column header or variable names are italicized.

  1. Read the PSM data file into a Pandas data frame, apply the wrapper and load the processing configuration parameters.
  2. Remove obsolete (non-wanted) columns/variables. wantedColumns.
  3. Information about regular Protein Accessions and their corresponding information in the Protein Descriptions strings are removed; only the information about master proteins is kept.
  4. Remove PSMs that have missing values in noMissingValuesColumns, or that have missing values in all of the quanColumns.
  5. Remove PSMs with Confidence levels worse than removeBadConfidence_minimum. removeBadConfidence_bool.
  6. Remove PSMs with Isolation Interference [%] levels higher than removeIsolationInterference_threshold. removeIsolationInterference_bool.
  7. Clean PTM information by keeping only the PTM identities, not their locations, and removing all 'TMT' label modifications. The remaining PTMs are then sorted alphabetically.
  8. Remove redundancy due to the use of multiple PSMEnginePriority PSM detection algorithms, which are specifiable by the user. removePSMEngineRedundancy_bool. If a peptide observation yielded more than one PSM (one for each algorithm) then we only keep the PSM of the PSM engine with the highest priority, and the others are removed.
  9. (Currently unavailable) Correct isotopic impurities, see Isotope impurity correction.
  10. Aggregate on retention time RT [min] (RT), see Aggregating duplicate observations. aggregate_method.
  11. Aggregate on Charge if possible, see Aggregating duplicate observations. aggregateCharge_bool, aggregate_method.
  12. Aggregate on post-translational Modifications (PTMs) (Disabled by default) if possible, see Aggregating duplicate observations. aggregatePTM_bool, aggregate_method.
  13. Perform CONSTANd normalization, see CONSTANd.
  14. Save all results to disk and pass them on to the analysis step.
For all removed data, at least the columns specified by removalColumnsToSave are saved – sometimes together with more information relevant at a certain removal step – into either one file for all removal steps or into separate files per removal step, as specified by removedDataInOneFile.
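The removal steps 4 to 6 above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the function name `filter_psms`, the ordering of the confidence labels (`Low` < `Medium` < `High`) and the parameter values in the example are hypothetical; only the parameter and column names come from the list above.

```python
import numpy as np
import pandas as pd

# Assumed ordering of confidence labels, from worst to best.
CONFIDENCE_RANK = {"Low": 0, "Medium": 1, "High": 2}

def filter_psms(df, params):
    """Apply the missing-value, confidence and isolation-interference filters."""
    quan = params["quanColumns"]
    # Step 4: drop rows missing a required value, or missing all quantification values.
    keep = df[params["noMissingValuesColumns"]].notna().all(axis=1)
    keep &= df[quan].notna().any(axis=1)
    # Step 5: drop PSMs whose confidence is worse than the configured minimum.
    if params["removeBadConfidence_bool"]:
        minimum = CONFIDENCE_RANK[params["removeBadConfidence_minimum"]]
        keep &= df["Confidence"].map(CONFIDENCE_RANK) >= minimum
    # Step 6: drop PSMs with an isolation interference above the threshold.
    if params["removeIsolationInterference_bool"]:
        threshold = params["removeIsolationInterference_threshold"]
        keep &= df["Isolation Interference [%]"] <= threshold
    # Return the kept rows and the removed rows (the latter can be saved separately).
    return df[keep], df[~keep]

psms = pd.DataFrame({
    "Sequence": ["AAK", "CCR", None, "DDK", "EEK"],
    "Confidence": ["High", "Low", "High", "High", "Medium"],
    "Isolation Interference [%]": [5.0, 10.0, 0.0, 80.0, 20.0],
    "126": [1.0, 2.0, 3.0, 4.0, np.nan],
    "127": [1.5, 2.5, 3.5, 4.5, np.nan],
})
params = {
    "noMissingValuesColumns": ["Sequence"],
    "quanColumns": ["126", "127"],
    "removeBadConfidence_bool": True,
    "removeBadConfidence_minimum": "Medium",
    "removeIsolationInterference_bool": True,
    "removeIsolationInterference_threshold": 50.0,
}
kept, removed = filter_psms(psms, params)
print(kept["Sequence"].tolist())
```

In this example only the first PSM survives: the others fail on confidence, a missing sequence, isolation interference and missing quantification values, respectively.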

Aggregating duplicate observations

Removing redundancy in the list of PSMs is what we call aggregating. This is necessary to avoid artificial enrichment by counting the same peptide multiple times in the DEA. We consider three possible variables that cause redundancy, which we may aggregate on:

  1. RT: depending on the data-dependent acquisition (DDA) settings of the MS, it forgets the peptides previously detected by the MS1 after a fixed time interval tDDA (i.e. after each MS1 sweep) and possibly detects the same peptide multiple times at different RT values, if it takes longer than tDDA to elute from the LC column.
  2. Charge (aggregateCharge_bool): the electrospray ionization is a stochastic process which may place zero or more than one positive charge on each peptide. Since the MS cannot distinguish peptides but only m/z values, the same peptide can be detected multiple times (with different charge values) even within the same MS1 sweep.
  3. PTM (aggregatePTM_bool): just as charge influences the m/z value, so does the mass of the peptide being detected by the MS1. As a peptide may carry modifications – either multiple TMT labels or biologically relevant modifications like phosphorylation – it may thus appear with different mass values and be detected multiple times even within the same MS1 sweep. Please note that on the peptide level we use the PTM information as-is. Since PD includes PTM location information, this means that the same ligand in a different location is considered a different PTM. On the protein level (later in the workflow), the N-terminal TMT label modifications as well as any PTM location information are filtered out.

The aggregations on charge and/or PTM are optional in theory, but are specified by their respective boolean parameters. By default Charge aggregation is enabled (unless the corresponding column is missing from the data frame) and PTM aggregation is disabled. The aggregation on RT is mandatory, as we assume its variability is never of biological importance. It is always executed, even if the 'RT [min]' column does not exist as we assume RT to be the only other source of redundancy. Therefore, there should be no additional properties causing peptide sequence redundancy in your data set, except for the use of multiple PSMEnginePriority.

For each aggregation a representative PSM is chosen. This representative is added to the data frame and the representees are removed. It contains all the information of the representee that had the best PSM score, except for the quantification values. The best PSM score is the best score calculated by the Master PSM algorithm, if available, or otherwise the best score calculated by (one of) the Slave algorithm(s), if available; if no score is available, the program just takes the information of the first representee encountered. The quantification values of the representative can be determined in one of four ways, specified by aggregate_method:

  1. {bestMatch}: the quantification values of the PSM with the best PSM score (see above). In case all PSMs have missing scores, the most intense is chosen instead (see below).
  2. {mostIntense}: the quantification values of the PSM with the highest sum of its MS2 quantification values.
  3. {mean}: the value for each quantification sample is the mean of the quantification values of all representees for that same quantification sample.
  4. {geometricMedian}: the components of the quantification vector that is the geometric median of the quantification vectors of the representees, where the quantification vectors are vectors in Euclidean N-space with their ℓ1-norm set to 1 and with the (re-scaled) quantification values as their components. To calculate the geometric median, we use the algorithm developed by Vardi et al. (DOI: 10.1073/pnas.97.4.1423).

Note that the only information carried by a representative that is (currently) also used in the downstream analysis, is the information identical for the corresponding representees (like the peptide sequence) – except for the quantification values about which we have just elaborated.
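A minimal sketch of the bestMatch, mostIntense and mean methods is given below, using pandas. It is not QCQuan's implementation: the function name `aggregate`, the `Score` column name and the assumption that a higher score is better are hypothetical, and the geometricMedian method is left out.

```python
import pandas as pd

def aggregate(df, group_cols, quan_cols, method, score_col="Score"):
    """Collapse duplicate observations into one representative row per group."""
    rows = []
    for _, group in df.groupby(group_cols, sort=False):
        # The representative inherits its information from the best-scoring representee
        # (assumption: a higher score is a better score).
        representative = group.loc[group[score_col].idxmax()].copy()
        if method == "bestMatch":
            pass  # keep the quantification values of the best-scoring PSM
        elif method == "mostIntense":
            most_intense = group[quan_cols].sum(axis=1).idxmax()
            representative.loc[quan_cols] = group.loc[most_intense, quan_cols].to_numpy()
        elif method == "mean":
            representative.loc[quan_cols] = group[quan_cols].mean().to_numpy()
        else:
            raise ValueError(f"unsupported aggregate_method: {method}")
        rows.append(representative)
    return pd.DataFrame(rows).reset_index(drop=True).infer_objects()

psms = pd.DataFrame({
    "Sequence": ["AAK", "AAK", "CCR"],
    "Score": [10.0, 30.0, 5.0],
    "126": [4.0, 1.0, 2.0],
    "127": [6.0, 3.0, 2.0],
})
by_mean = aggregate(psms, ["Sequence"], ["126", "127"], "mean")
by_intensity = aggregate(psms, ["Sequence"], ["126", "127"], "mostIntense")
by_score = aggregate(psms, ["Sequence"], ["126", "127"], "bestMatch")
print(by_mean)
```

Grouping on `["Sequence", "Charge"]` instead of `["Sequence"]` would keep different charge states apart, which corresponds to disabling the charge aggregation.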

Any data that has been removed at any point can be saved to a 'removed data collection' which may be used by the QCQuan team only anonymously and only for meta-analysis purposes.

(disabled) Isotope impurity correction

This functionality is currently disabled. Provided with the isotope impurity matrix (not the TMT-style format), one can easily calculate the true quantification values from the observed ones by just solving the linear system of equations as explained by Shadforth et al. (DOI: 10.1186/1471-2164-6-145). Of course this requires one to know the observed quantification values for all of the PSM peaks that receive a contribution from reporter isotopes, and those may not always appear at the same m/z-values as the other labels and hence will not be available in the PD2.1 output (if PD2.1 did not apply the correction itself).

CONSTANd

CONSTANd (CONstrained STANdardization) is a data-driven normalization algorithm recently developed by Maes et al. (DOI: 10.1074/mcp.M115.056911), whose prerequisites we have already discussed. In a nutshell, this algorithm corrects biases which are systematic for all peptides across an entire sample or individual peptides across an entire experiment. It employs two assumptions - the relative nature of the quantification values and the equimolarity of the samples - to impose two corresponding constraints on the PSM quantification matrix. Here, we get into the practical details of its operation.

If P is the number of peptides and N the number of samples, we want to impose two constraints to transform the unnormalized P × N matrix A that contains the quantification values into a normalized matrix K that does obey these constraints. The first constraint is that the values of each row must sum to 1 (relative quantification); the second is that the values of each column must sum to P/N (equimolarity). As elaborated in the original paper by Maes et al., we do not need to explicitly define and minimize a loss function to obtain such a matrix K; we can simply use the RAS-procedure to write the solution as K = RAS. The diagonal matrices R and S may be obtained simultaneously using the following iterative scheme:

  1. Start at i = 0 with K0 = A;
  2. Calculate Ri as the inverse of N times the P × P diagonal matrix with the mean quantification value of the rows of Ki as the respective diagonal elements;
  3. Define Ki' = Ri Ki (this step ensures that each row sums to 1, i.e. that the row mean is equal to 1/N);
  4. Calculate Si as the inverse of N times the N × N diagonal matrix with the mean quantification value of the columns of Ki' as the respective diagonal elements;
  5. Define Ki+1 = Ki' Si (this step ensures that each column sums to P/N, i.e. that the column mean is equal to 1/N);
  6. Repeat from step 2.

In this fashion, one finds after an infinite number of iterations: R = product of all Ri and S = product of all Si. Of course we can only perform a finite number of iterations and therefore one can never guarantee K to be obtained exactly since we are trying to impose P + N constraints on a max(P,N)-dimensional vector space, but the RAS-procedure has been shown to have a monotonically converging entropy, ℓ1-measure and likelihood. Therefore, in practice CONSTANd will just end the RAS-procedure as soon as the desired precision has been reached. We define the precision after iteration i as N/2 × |Ri−1 − IP |; the number of columns divided by two, times the total deviation of the matrix with row means from the P × P identity matrix divided by N. Note that after each iteration, the deviation of the column means is of course exactly zero. The convergence of this algorithm – and thus also the precision improvement – is exponential in the number of iterations, and in fact the convergence is asymptotically identical if one were to swap steps 2 and 3 with steps 4 and 5.

If there happen to be missing values in the quantification matrix A, we just calculate the numpy.nanmean instead of the mean. This function from the Python package numpy ignores any missing values in the calculation of the mean, so a row with one missing value now has a sum of (N − 1)/N . However, the mean value for the normalized quantification values in that row will still be 1/N and the algorithm needs no further adaptations.
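The iterative scheme can be sketched with numpy as follows. This is a simplified illustration, not the reference implementation: the function name `constand`, the default `precision` and `max_iterations` values, and the stopping criterion (largest deviation of N times the row mean from 1) are assumptions made for this sketch and differ from the precision measure defined above.

```python
import numpy as np

def constand(A, precision=1e-5, max_iterations=50):
    """Scale a P x N matrix so that row means and column means both equal 1/N."""
    K = np.array(A, dtype=float)
    N = K.shape[1]
    for _ in range(max_iterations):
        # Row step: divide each row by N times its mean, so each row has mean 1/N.
        row_means = np.nanmean(K, axis=1)
        K = K / (N * row_means)[:, np.newaxis]
        # Column step: divide each column by N times its mean, so each column has mean 1/N.
        col_means = np.nanmean(K, axis=0)
        K = K / (N * col_means)[np.newaxis, :]
        # Simplified stopping rule (an assumption): how far the rows drifted from mean 1/N.
        deviation = np.max(np.abs(N * np.nanmean(K, axis=1) - 1.0))
        if deviation < precision:
            break
    return K

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
K = constand(A)
print(K.sum(axis=1))  # each row sums to approximately 1
print(K.sum(axis=0))  # each column sums to P/N = 1.5
```

Because `numpy.nanmean` is used throughout, the same function accepts matrices with missing values without further changes.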

Analysis

After the processing steps have been performed for each MS run individually, the quantification values are normalized and the data is ready for analysis.

During the analysis steps some metadata is also extracted, some of which will be shown in the report file and some of which is kept by the QCQuan team but may only be used anonymously and only for meta-analysis purposes.

The necessary ingredients to make visualizations of the DEA and Quality Control and a report in general are calculated and extracted in the following steps in the two sections below. Relevant parameters from the processing configuration file (automatically generated by the web interface) are in teletype font. Column header or variable names are italicized.

Differential Expression

  1. Combine the data frames of all MS runs into one data frame by performing an 'outer join': simply a vertical concatenation but with the quantification columns of each MS run kept separate, still retaining their distinct names. In this way, peptides with the same Sequence can appear multiple times (will be considered a repeated measurement), but no peptide of one MS run has any values in the quantification columns corresponding to other MS runs (those values are 'missing').
  2. Use the latter combined data frame to compile a mapping – across all MS runs – between the proteins found and their respective peptides. This can be done in two ways because some peptides may correspond to more than one (master) protein, even though the proteins have already been grouped:
    • Each protein gets contributions only from non-shared peptides: they are uniquely associated to that protein. minExpression_bool
    • Each protein gets the full contribution from all peptides that are associated with it, even if they are in reality shared with other proteins. fullExpression_bool
    We refer to the former as the ‘minimal expression’ case, and to the latter as the ‘full expression’ case. The minimal expression cannot be disabled by the user, but the user may enable the full expression which is by default not performed.
  3. Execute the peptide-protein mappings according to the previously compiled mapping, to obtain a data frame on the protein level. For each protein entry the data frame contains the protein description, the list of (representative) peptides and the list of corresponding quantification values, split into one column for each condition.
  4. Perform a DE test for each protein using a t-test with Benjamini-Hochberg correction. The p-values and Benjamini-Hochberg adjusted p-values are added to the protein-level data frame.
  5. Calculate for each protein and each condition the log2 fold change w.r.t. the reference condition referenceCondition specified by the user. Parameter (pept2protCombinationMethod) specifies whether the mean or the median of the quantification values is used to calculate the fold change between the two lists of peptides. End-users cannot change the default which is to use the mean.
  6. Calculate the DEA significance indicator for each protein:
    • yes (both the p-value and fold change are significant)
    • p (only the p-value is significant)
    • fc (only the fold change is significant)
    • no (neither the p-value nor the fold change are significant)
    where the p-value and fold change significance thresholds are specified by alpha (significant means p < alpha) and FCThreshold (significant means log2(condition/reference) > FCThreshold), respectively.
  7. Save all results to disk and pass them on to the Report & visualization procedure.
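Steps 5 and 6 can be illustrated with a short Python sketch. The function names and the threshold values used in the example are hypothetical. The fold change comparison follows the text above literally (log2 fold change greater than FCThreshold); whether QCQuan compares the absolute value instead is not stated here, so treat that detail as an assumption.

```python
import numpy as np

def log2_fold_change(condition_values, reference_values):
    """log2 of the ratio of mean quantification values, ignoring missing values."""
    ratio = np.nanmean(condition_values) / np.nanmean(reference_values)
    return float(np.log2(ratio))

def significance_indicator(p_value, log2_fc, alpha, fc_threshold):
    """Return 'yes', 'p', 'fc' or 'no' as described in step 6."""
    p_significant = p_value < alpha
    fc_significant = log2_fc > fc_threshold
    if p_significant and fc_significant:
        return "yes"
    if p_significant:
        return "p"
    if fc_significant:
        return "fc"
    return "no"

# Hypothetical values: alpha = 0.05 and FCThreshold = 1.0.
fc = log2_fold_change([4.0, 4.0, np.nan], [1.0, 1.0])
print(fc, significance_indicator(0.01, fc, alpha=0.05, fc_threshold=1.0))
```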

DE test: t-test

We would like to perform a t-test for each protein between the quantification values of the reference condition and each other condition to see whether their means differ. In order to do this, for each protein we first average its associated peptide quantifications within the same sample because they are repeated measurements, and then for each condition pool the values of all associated samples, across all MS runs. This does mean we are not using all information available and thus underestimate our statistical power, but it is justified since the idea behind QCQuan is to keep it fast, simple, transparent, qualitative and also conservative. We explicitly do not want to employ linear mixed models, which is a cumbersome and time-consuming procedure for a few thousand proteins.

To control the false discovery rate – and not the family-wise error rate, as the Bonferroni correction does – we employ the Benjamini-Hochberg correction and obtain adjusted p-values.

When calculating the mean quantification value per condition for a certain protein over all its measurements, we have to take the nanmean to ignore missing values. In the case that there is just one value available for both conditions under consideration, the result of the t-test is nonsensical and this protein is removed from the data frame after all. In the future we will evaluate the ability of the non-parametric Mann-Whitney U test to cope with small sample sizes and deviations from the distributional assumptions of the parametric t-test.
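For readers unfamiliar with the Benjamini-Hochberg procedure, the adjustment of p-values can be sketched with numpy as shown below. The function name `benjamini_hochberg` is hypothetical and the input p-values are made up; they stand in for the per-protein t-test results.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Proteins whose adjusted p-value falls below alpha are then reported as significant while keeping the expected proportion of false discoveries under control.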

Quality Control

  1. Combine only the quantification and Sequence columns of the data frames of all MS runs into one data frame, but this time as an 'inner join' on the peptide sequence. This way, only peptides observed in at least one sample of each MS run are present. They occur only once, and they only have missing values in quantification columns (which are all still distinguishable) if they were not detected in the corresponding sample. The result is a data frame with each unique peptide sequence that was found at least once in each condition across all MS runs.
  2. Perform a Principal Component Analysis (PCA) and Hierarchical Clustering (HC) on the transpose of the quantification matrix associated with the data frame from the previous step. Missing values will be imputed to be 0, because PCA and HC cannot handle them and zero-imputation does make sense in this situation. After all, the missing values that now remain are not due to a peptide not being present in a certain MS run or condition (because such entries have been removed during the construction of this data frame). That means they are missing quite probably because they are actually zero, or just very small (so that they did not meet the detection threshold). Therefore we deem this zero-imputation justified.
  3. Save all results to disk and pass them on to the Report & visualization procedure.
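The inner join and zero-imputation of steps 1 and 2 can be sketched with pandas. The function name `common_peptide_matrix`, the example data and the column names such as `run1_126` are hypothetical, and the sketch assumes each peptide sequence occurs only once per MS run.

```python
import pandas as pd

def common_peptide_matrix(runs):
    """Inner-join the runs on Sequence and impute remaining missing values with 0."""
    merged = runs[0]
    for run in runs[1:]:
        merged = merged.merge(run, on="Sequence", how="inner")
    return merged.set_index("Sequence").fillna(0.0)

run1 = pd.DataFrame({
    "Sequence": ["AAK", "CCR", "DDK"],
    "run1_126": [0.4, 0.5, None],
    "run1_127": [0.6, 0.5, 1.0],
})
run2 = pd.DataFrame({
    "Sequence": ["AAK", "DDK", "EEK"],
    "run2_126": [0.3, 0.5, 0.2],
    "run2_127": [0.7, 0.5, 0.8],
})
qc_matrix = common_peptide_matrix([run1, run2])
samples_by_peptides = qc_matrix.T  # one row per sample, as used for PCA and HC
print(qc_matrix)
```

Only AAK and DDK survive the join because CCR and EEK were each seen in a single run; the one remaining missing value (DDK in `run1_126`) becomes 0.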

Principal component analysis

Performing a principal component analysis involves calculating the singular value decomposition (SVD) of the transposed quantification matrix across all the MS runs, which is of size Pc × sum of all Ni, where Pc is the number of peptides found at least once in all MS runs and conditions, and Ni is the number of samples in MS run i. The standard matplotlib function is rather slow for a matrix of this size, and therefore we will use the custom SVD method described by Halko et al. (DOI: 10.1137/090771806) and borrow an open source implementation from GitHub. Only the first two PCs are extracted.
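As an illustration of how PCA scores follow from an SVD, here is a small numpy sketch. It uses the plain `numpy.linalg.svd` routine rather than the randomized method of Halko et al. that QCQuan relies on, and the function name `pca_scores`, the column centering and the random example data are assumptions of this sketch.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X (samples) onto the first principal components."""
    centered = X - X.mean(axis=0)  # center every peptide column (assumption)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Scores are the left singular vectors scaled by the singular values.
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
X = rng.random((6, 20))  # 6 samples (rows) by 20 peptides (columns)
scores = pca_scores(X)
print(scores.shape)
```

Each row of `scores` gives the coordinates of one sample in a 2D PCA plot.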

Hierarchical clustering

A hierarchical clustering is performed on the transposed quantification matrix across all MS runs, so that each sample represents an observation and each peptide quantification is an observable (just like with the PCA). As a linkage criterion UPGMA (average linkage) was used, together with a Euclidean metric.

Report

The Report & visualization step takes the protein-level data frames with differential expression results, the PCA and HC results as well as other QC information and plots, the meta-data and application log file as input for the following steps.

  1. For each DE test, calculate a list of significantly (according to adjusted p-value) differentially expressed proteins, sorted according to adjusted p-value. For each protein, also provide the description, log2 fold change, adjusted p-value and total amount (before averaging repeated measurements) of supporting peptides observed. If both minimally and fully expressed protein data frames are given, create two lists of mutually exclusive proteins and add it to the meta-data for a consecutive sensitivity analysis.
  2. For each DE test, make a volcano plot of the minimal and/or full data frame, with coloring according to significance level (see last step of DE), and with the protein IDs indicated for all data points of the significance levels specified by labelVolcanoPlotAreas. End-users cannot change this behavior and by default only data points with a 'yes' significance level are labeled.
  3. Make a PCA plot, which is a 2D scatterplot of the first (horizontal axis) and second (vertical axis) principal component scores, where data points are colored according to condition and with different marker symbols for each MS run. The units on the axes bear no meaning other than dimensionless similarity measures.
  4. Make an HC dendrogram plot, which is a dendrogram displaying the results of the hierarchical clustering, where singletons are colored according to condition. The length of the dendrogram lines indicates the Euclidean distance between two leaves, although it bears no meaning other than a dimensionless similarity measure.
  5. Make an MS1 calibration plot, which is a scatterplot of the PSMs' relative (w.r.t. the maximum) PSM Engine Scores versus their "DeltaM [ppm]" value. Information from each tandem-MS run is plotted in a different color.
  6. Make an MS1 intensity histogram, which shows the number of detected PSMs as a function of their MS1 intensity. The histogram includes information from only those PSMs actually used by QCQuan (i.e. not discarded for some reason).
  7. Generate an HTML report using Flask, which contains the volcano plot(s), PCA plot, HC plot, list of top n differential proteins (sorted on adjusted p-value and with n specified by numDifferentials) and meta-data, possibly with additional info from the log file.
  8. Generate a PDF report from the HTML file and create a .zip-file with the DEA data frames in .tsv-format for each non-reference condition.
  9. Return all output files from the previous step to the web interface, which will display them on the front-end as well as e-mail them.

The PCA plot and HC dendrogram serve as quality control tools, as they give the user an immediate impression of the influence of a sample belonging to a certain condition as opposed to belonging to a certain MS run. If the normalization was successful and there are sufficient differences between the conditions, one ought to see samples grouped together per condition (and not per MS run or some other criterion).
Also, when the samples corresponding to a certain condition are spread across multiple MS runs, one could be worried that one particular sample is – for whatever reason, biological or accidental or statistical – not like all the other samples that belong to the same condition, which would distort the true statistics. However if such an anomaly were to occur, it would be visible from the PCA and/or HC plot.

Technical Specifications

Input file formats

The workflow requires the following input files. The ones to be uploaded by the user are marked with (U), the others are automatically generated when using the web interface.

  • (U) a Design of Experiments (DoE) file (can also be created using the web interface);
  • (U) one PSM quantification data file per MS run (PD produces the right format);
  • (U) (optional) a wrapper file when using custom data headers.
  • one processing configuration file per MS run with the QCQuan processing settings;
  • a job configuration file, auto-generated by the web interface based a.o. on some user-specified settings;
  • (currently not available; optional) a file with the isotope impurities in matrix or TMT table format.

Design of Experiments file

This file is supposed to be uploaded by the user, who may create one from scratch or by using the DoE creator. The schema of the experimental setup is a tab-separated text file (.tsv or tabular .txt) with a layout like the example shown in the table below. For each separate tandem-MS run a name is specified which will function as a prefix if no quanColumn aliases are provided (see further).

MS run name Condition:quanColumns:aliases Condition:quanColumns:aliases ...
TMT8_tuesday mouse:126,127:m1,m2 human:128,129:h1,h2 ...
iTRAQ8_2 human:114,115:h3,h4 mouse:116,117: ...
... ... ... ...

For each MS run one can specify an arbitrary number of conditions, which do not have to be in the same column of the DoE file in case the same condition appears in multiple MS runs. Each condition entry (which corresponds to one field in the file / table) is formatted as follows: first the name of the condition, followed by a colon, followed by a comma-separated list of all data frame quantification column names, optionally followed by a colon and a comma-separated list of aliases for those quantification column names (in the same order). These quantification column names and corresponding aliases will be automatically appended to the wrapper file, so there is no need to add them manually.
If no aliases are specified, the aliases are automatically generated to be the original names prefixed by the MS run's name and an underscore symbol. For instance, for the second condition of the iTRAQ8_2 MS run in the example table, the generated aliases would be 'iTRAQ8_2_mouse_116' and 'iTRAQ8_2_mouse_117'.
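
The entry format described above can be sketched in Python as follows. This is only an illustration, not QCQuan's own parser: the function name parse_doe is hypothetical, the file is assumed to contain no header row, and aliases are simply left empty when none are given (the automatic alias generation is not reproduced here).

```python
import csv
import io


def parse_doe(text):
    """Parse DoE text into {ms_run: {condition: {"channels": [...], "aliases": [...]}}}.

    Hypothetical helper: assumes one MS run per line, tab-separated fields,
    and entries shaped like 'condition:col1,col2:alias1,alias2'.
    """
    doe = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip empty lines
        run_name = row[0].strip()
        doe[run_name] = {}
        for entry in row[1:]:
            if not entry.strip():
                continue  # skip empty fields
            parts = entry.strip().split(":")
            if len(parts) < 2:
                raise ValueError("malformed condition entry: %r" % entry)
            condition = parts[0]
            channels = [c for c in parts[1].split(",") if c]
            # The aliases part is optional and may be left empty.
            aliases = [a for a in parts[2].split(",") if a] if len(parts) > 2 else []
            doe[run_name][condition] = {"channels": channels, "aliases": aliases}
    return doe


example = (
    "TMT8_tuesday\tmouse:126,127:m1,m2\thuman:128,129:h1,h2\n"
    "iTRAQ8_2\thuman:114,115:h3,h4\tmouse:116,117:\n"
)
doe = parse_doe(example)
print(doe["iTRAQ8_2"]["mouse"])
```

Running such a check on a hand-edited DoE file before uploading may help catch a misplaced colon or comma early.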

PSM files (quantification data)

The PSM files (maximum combined size: 75MB, can be changed on request) may be formatted as .xlsx (Microsoft Office), .csv (comma-separated), .tsv (tab-separated) or .txt (custom delimiter) files. By default, Proteome Discoverer (PD) produces tab-separated .txt files, and its PSM output files, whose names are suffixed with “PSMs”, should meet all requirements specified above. If you are using a different platform, read this section carefully and also take a look at the quick fixes in case you are unsure about how to proceed.

The following columns/quantities are absolutely required in each data file and should not contain missing values (the corresponding detections/rows will be discarded).

Column name Type Details
Sequence string The raw peptide sequence as a sequence of 1-letter amino-acid representations, without modifications or any other annotations.
Master Protein Accessions string; string; ... UNIPROT identifiers of the proteins associated with a PSM, separated by '; ' (semi-colon and space). The master protein is the representative protein of choice for a whole protein group to which a peptide may correspond, as provided by PD2.1.
First Scan integer Identifier for the MS scan in which the peptide was detected. If you don't have such a column, just create one with all values set to a unique integer.
(???) string Names of the reporter labels or channel names you provided in the DoE file.

The above list also includes the names of the reporter quantification samples present in your data, although only one of them is required to contain an actual non-missing value in order not to be discarded. One can use either PD's intensities or S/N values as quantification values, since QCQuan can handle both and we find they produce practically indistinguishable results.
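
Before uploading, the requirements above can be checked locally with a few lines of Python. The sketch below is not part of QCQuan: the function name check_psm_table is hypothetical, it handles delimited text files only, and it treats only empty fields as missing values, which is a simplifying assumption.

```python
import csv
import io

REQUIRED = ["Sequence", "Master Protein Accessions", "First Scan"]


def check_psm_table(handle, quan_columns, delimiter="\t"):
    """Return (missing_columns, bad_rows) for a delimited PSM file.

    bad_rows holds the line numbers (header = line 1) of rows that lack a
    required value or have no quantification value at all.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing_columns = [c for c in REQUIRED + list(quan_columns) if c not in header]
    if missing_columns:
        return missing_columns, []
    bad_rows = []
    for number, row in enumerate(reader, start=2):
        has_required = all((row[c] or "").strip() for c in REQUIRED)
        # Only one quantification column needs a non-missing value.
        has_quan = any((row[c] or "").strip() for c in quan_columns)
        if not (has_required and has_quan):
            bad_rows.append(number)
    return missing_columns, bad_rows


sample = (
    "Sequence\tMaster Protein Accessions\tFirst Scan\t126\t127\n"
    "PEPTIDEK\tP12345\t1\t10.5\t\n"
    "\tP12345\t2\t3.0\t4.0\n"
    "ELVISK\tQ99999\t3\t\t\n"
)
result = check_psm_table(io.StringIO(sample), ["126", "127"])
print(result)
```

For a real file, pass an open file handle instead of the io.StringIO object used in this example.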

The following columns/quantities are only required under the circumstances mentioned and should in that case not contain missing values (the corresponding detections/rows will be discarded):

Column name Type Details
Identifying Node Type Mascot; Sequest HT; ... (algorithm used to generate the PSM) if you used multiple PSM scoring algorithms.
Charge integer (charge of the precursor ion) if you wish to avoid aggregation on Charge (see Processing aggregation step) or do not care about missing values.
Modifications string: mod1; mod2; ... (Chemical modifications present on the peptide sequence) if you wish to avoid aggregation on Modifications (see Processing aggregation step) or do not care about missing values. The format should be a single string, using ';' as a separator between multiple modifications. Information outside parentheses () is excluded automatically (e.g. PTM location information in the PD format).
Protein Accessions string (UniProt identifiers of all matching proteins according to the ‘Identifying Node’) if you wish to have a 'descriptions' column in your DEA output with Master Protein Descriptions;
Protein Descriptions string (UniProt description of each matching protein) if you wish to have a 'descriptions' column in your DEA output with Master Protein Descriptions;
XCorr/Ions Score/... float (SEQUEST/MASCOT/... PSM score) if using ‘bestMatch’ as the aggregate_method value (see Processing aggregation step) and used SEQUEST/MASCOT/... to calculate PSMs, or if you wish to get the MS1 calibration QC plot;
Confidence string: Low, Medium, High (confidence level of the PSM identification) if you wish (removeBadConfidence_bool) to filter out PSMs with low confidence values (removeBadConfidence_minimum);
Isolation Interference [%] float (measure for the relative amount of co-isolation) if you wish (removeIsolationInterference_bool) to filter out PSMs with high isolation interference values (removeIsolationInterference_threshold).
DeltaM [ppm] float (MS1 mass error) if you wish to get QC information w.r.t. the error on the precursor ion mass.
Ion Inject Time [ms] float (necessary MS2 ion accumulation time) if you wish to get QC information w.r.t. the necessary MS2 ion accumulation time.

Other platforms

At this moment QCQuan is not really vendor-neutral, but we aspire to make it so. In the meantime, the following non-exhaustive list of suggestions may provide users with quick fixes so they can use the workflow even with different platforms:

  • use a wrapper file to automatically transform column headers and increase your ease-of-use;
  • if your PSMs do not have information about the first scan number, fake this column and fill it with unique integers;
  • if your PSMs do not have information about charge or modifications, fake these columns and fill them with zeroes or empty strings, respectively; Caution: do not fill them with "None" or "N/A" strings or equivalent, as they will then be interpreted as missing values.
  • if your PSM algorithm does not provide PSM scores, fake a score column filled with zeroes;
  • if your PSM algorithm provides positive scores where ‘lower’ means ‘better’, negate all scores.
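
A minimal Python sketch of these quick fixes is given below, operating on PSM rows represented as dictionaries (for instance as produced by csv.DictReader). The function name apply_quick_fixes and the fallback score column name "Score" are hypothetical; adapt them to your own data and wrapper file.

```python
def apply_quick_fixes(rows, score_column=None, lower_is_better=False):
    """Return copies of the PSM rows with the placeholder columns filled in.

    rows: list of dicts mapping column name to string value.
    score_column: name of an existing score column, or None if there is none.
    lower_is_better: negate the scores when a lower score means a better match.
    """
    fixed = []
    for index, row in enumerate(rows, start=1):
        row = dict(row)  # do not modify the caller's data
        row.setdefault("First Scan", str(index))  # unique integer per PSM
        row.setdefault("Charge", "0")             # zero, not "None" or "N/A"
        row.setdefault("Modifications", "")       # empty string, not "None"
        if score_column is None:
            row.setdefault("Score", "0")          # hypothetical column name
        elif lower_is_better:
            row[score_column] = str(-float(row[score_column]))
        fixed.append(row)
    return fixed


psms = [
    {"Sequence": "PEPTIDEK", "Expect": "0.01"},
    {"Sequence": "ELVISK", "Expect": "0.5"},
]
fixed = apply_quick_fixes(psms, score_column="Expect", lower_is_better=True)
print(fixed[0])
```

Existing columns are left untouched, so the same function can be applied safely to files that already contain some of these columns.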

Wrapper file

The optional wrapper file is simply a .tsv- or tab-separated .txt-file with two columns that map original variable names (first column) onto new ones (second column). Each line thus contains both a variable name from the user's custom data file and a new name – which ought to be one from the list specified by the wantedColumns parameter (see next section) – separated by a tab character.
The quantification column names and corresponding aliases from the DoE file will be automatically extracted and appended to the wrapper, so there is no need to add them manually.
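
As an illustration, the sketch below reads a wrapper file's contents and applies the mapping to a list of column headers. The custom header names (Peptide, Proteins, ScanNr) and the function names are hypothetical; this is not the code QCQuan itself uses.

```python
def read_wrapper(text):
    """Build {original_name: new_name} from tab-separated wrapper text."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        original, new = line.split("\t")
        mapping[original.strip()] = new.strip()
    return mapping


def rename_header(header, mapping):
    """Rename the columns found in the mapping; leave all others unchanged."""
    return [mapping.get(name, name) for name in header]


wrapper_text = (
    "Peptide\tSequence\n"
    "Proteins\tMaster Protein Accessions\n"
    "ScanNr\tFirst Scan\n"
)
mapping = read_wrapper(wrapper_text)
new_header = rename_header(["Peptide", "Proteins", "ScanNr", "126"], mapping)
print(new_header)
```

The quantification column "126" is not listed in this wrapper and therefore keeps its name.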

Processing Configuration file

Each processing configuration file contains the settings to be used in the corresponding Processing step and is generated automatically when using the web interface. Each is an .ini file which contains only a [DEFAULT] section and which is read using a ConfigParser from the configparser module, with options allow_no_value=True, comment_prefixes=';' and inline_comment_prefixes='@'. All possible parameters are listed in the table below, although none of these settings can be explicitly changed by the end-user.

Parameter name default; other values Details
data Path on disk to the quantification data file. Auto-generated by the web interface from the DoE file.
delim_in When uploading .txt files, it is advisable to explicitly specify the delimiter used, although the workflow does a pretty good job at auto-detecting it.
wrapper Path on disk to the wrapper file. Auto-generated by the web interface from the DoE file.
quanColumns Columns headers of the quantification values. Auto-generated by the web interface from the experimental setup schema.
wantedColumns ["Confidence", "Identifying Node Type", "Sequence", "Modifications", "Master Protein Accessions", "Protein Accessions", "# Protein Groups", "Protein Descriptions", "m/z [Da]", "Charge", "Deltam/z [Da]", "Isolation Interference [%%]", "RT [min]", "First Scan"]; [custom] List of columns that are extracted from the PSM data files. All other columns are removed. This step happens after the wrapper has been applied. The quanColumns are automatically appended by the web interface.
requiredColumns ["First Scan", "Sequence", "Master Protein Accessions"]; [custom] List of columns which are absolutely required for the workflow to complete successfully. The quanColumns are automatically appended by the web interface.
noMissingValuesColumns ["First Scan", "Sequence", "Identifying Node Type", "Master Protein Accessions", "Charge", "Modifications"]; [custom] List of columns in which no missing values may appear. If a missing value is found, the corresponding row is removed. The quanColumns are automatically appended by the web interface.
removalColumnsToSave ["First Scan", "Sequence", "Identifying Node Type", "Master Protein Accessions"]; [custom] List of columns to save to disk whenever data is removed from the data frame for any reason (except during Aggregation). The quanColumns are automatically appended by the web interface. This information will only ever be used (anonymously) for meta-analysis purposes.
aggregateColumnsToSave ["First Scan", "Sequence", "Identifying Node Type", "RT [min]", "Charge", "Modifications", "Master Protein Accessions", "Degeneracy"]; [custom] List of columns to save to disk whenever data is removed from the data frame during Aggregation. The quanColumns are automatically appended by the web interface. This information will only ever be used (anonymously) for meta-analysis purposes.
removeBadConfidence_bool true; false Remove PSMs with ‘Confidence’ lower than removeBadConfidence_minimum.
removeBadConfidence_minimum Medium; High; Low PSMs with ‘Confidence’ lower than this value are removed if so specified by removeBadConfidence_bool
removeIsolationInterference_bool true Remove PSMs with ‘Isolation interference [%]’ higher than removeIsolationInterference_threshold.
removeIsolationInterference_threshold 30; float between 0 and 100 PSMs with ‘Isolation interference [%]’ higher than this value are removed if so specified by removeIsolationInterference_bool
PSMEnginePriority String-encoded Python dictionary of PSM detection algorithms and their score column names used while removing redundancy due to the use of multiple PSM algorithms/engines. They are ordered according to priority (lower list index means higher priority).
PSMEnginePriority: engineNames [unspecified]; [Mascot, Sequest HT], [custom, custom, ...], ... Names of PSM algorithms/engines, ordered according to priority (lower list index means higher priority). If set to "unspecified", the workflow assumes there is no PSM Algorithm information available and removePSMEngineRedundancy_bool has no effect.
PSMEnginePriority: scoreNames [unspecified]; [Ions Score, XCorr], [custom, custom, ...], ... Score columns names corresponding to the list of PSM algorithms/engines, in the same order.
removePSMEngineRedundancy_bool true; false Remove redundancy due to the use of multiple PSM algorithms. Behavior of the workflow with value 'false' has not been tested.
removePSMEngineRedundancy_exclusive_bool false; true Controls whether only PSMs detected by the master algorithm should be kept.
(Currently disabled) isotopicCorrection_bool false Perform isotopic correction using isotopicCorrection_matrix or not.
(Currently disabled) isotopicCorrection_matrix Path to the isotope impurity matrix. Auto-generated by the web interface after uploading an isotope impurity file.
aggregate_method bestMatch; mostIntense, mean, geometricMedian The method specifying how to combine quantification values of multiple representee PSMs.
aggregateCharge_bool true; false Remove redundancy due to charge.
aggregatePTM_bool false; true Remove redundancy due to post-translational modifications.
precision 10^-5; float Desirable precision of the CONSTANd algorithm.
maxIterations 50; integer > 0 Maximum number of iterations of the CONSTANd algorithm.
removedDataInOneFile_bool false; true If enabled, saves all data frames with information about removed data in separate files.
path_out Path on disk to save output files to. Auto-generated by the web interface from the experimental setup schema.
filename_out result Auto-generated by the web interface from the experimental setup schema.
delim_out \t Text delimiter used when writing Processing output to disk.
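
To illustrate how such a file can be read, the sketch below parses a small, made-up configuration with the ConfigParser options mentioned above. The example contents are hypothetical, and decoding the list-valued parameters with json.loads is an assumption about their encoding, not a description of QCQuan's own code.

```python
import configparser
import json

# Hypothetical excerpt of a processing configuration file.
EXAMPLE_INI = """\
[DEFAULT]
; full-line comments start with a semicolon
requiredColumns = ["First Scan", "Sequence", "Master Protein Accessions"]
wantedColumns = ["Sequence", "Isolation Interference [%%]"]
removeBadConfidence_bool = true
removeBadConfidence_minimum = Medium @ inline comments start with an at-sign
maxIterations = 50
precision = 1e-5
"""

parser = configparser.ConfigParser(
    allow_no_value=True,
    comment_prefixes=";",
    inline_comment_prefixes="@",
)
parser.read_string(EXAMPLE_INI)
settings = parser["DEFAULT"]

required = json.loads(settings["requiredColumns"])
wanted = json.loads(settings["wantedColumns"])  # '%%' is read back as '%'
remove_bad = settings.getboolean("removeBadConfidence_bool")
minimum = settings["removeBadConfidence_minimum"]
max_iterations = settings.getint("maxIterations")
precision = settings.getfloat("precision")
print(wanted, minimum, max_iterations, precision)
```

Note that a literal percent sign has to be written as '%%' because ConfigParser applies value interpolation by default, which is why the wantedColumns default lists "Isolation Interference [%%]".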

Job configuration file

The job configuration file contains the settings to be used in the corresponding Analysis and Report steps and is generated automatically when using the web interface, in which case only the parameters marked with (U) can be adjusted by the user. It is an .ini file which contains only a [DEFAULT] section and which is read using a ConfigParser from the configparser module, with options allow_no_value=True, comment_prefixes=';' and inline_comment_prefixes='@'. All possible parameters are listed below, together with their {default values; other possible values} and descriptions.

Parameter name default; other values Details
date now Full Python datetime (%Y-%m-%d %H:%M:%S.%f) at upload of DoE file, auto-generated by the web interface.
schema String-encoded Python dictionary containing the DoE information as well as some additional meta-information. Auto-generated by the web interface from the experimental setup schema and uploaded files.
getRTIsolationInfo_bool false; true If enabled, gathers RT isolation statistics during the RT Aggregation step and stores it in metadata. This information will only ever be used (anonymously) for meta-analysis purposes.
pept2protCombinationMethod mean; median Central measure to generate one quantification value for a protein from a list of peptide quantification values when calculating fold changes for the DEA.
minExpression_bool true; false Do a DEA for the minimally expressed set of proteins.
(U) fullExpression_bool false; true Do a DEA for the fully expressed set of proteins.
(U) referenceCondition Name of the condition to be used as the reference in the DEA.
alpha 0.05; float between 0 and 1 Significance level for the differential expression t-test.
FCThreshold 1; float > 0 Threshold above which a fold change value is considered significant in the DEA.
labelVolcanoPlotAreas [true, false, false, false]; ... Significance levels whose proteins to label with their protein IDs. Order: [‘yes’, ‘p’, ‘fc’, ‘no’].
PCA_components 2; integer ∈ {2..number of conditions} Number of principal components to be extracted in the PCA.
(U) numDifferentials 10; integer > 0 Number of top differentially expressed proteins to show in the report.
(U) jobname Name of the job, specified by the user. Will be appended to the timestamp to form the job ID.
path_out Path on disk to where output files ought to be saved.
(U) delim_out \t; any character Text delimiter used when writing Analysis and Report output to disk.
jobID date_jobname Concatenation of job date and job name with underscore separator. Auto-generated by web interface.
(U) mailRecipient E-mail address to which output and job status updates are sent.

Schema structure

The schema in the table above is a string-encoded, nested Python dictionary with the key-value structure shown below, where 'channel' is synonymous with 'quantification column'.

Key Value / Nested Key ... ...
allMSRuns [MS runs]
allConditions [conditions]
MSRun allMSRunConditions [conditions]
allMSRunChannelNames [channelNames]
allMSRunChannelAliases [channelAliases]
config path
data path
wrapper path
isotopicCorrection_matrix (disabled) path
{conditions} channelNames [names]
channelAliases [aliases]
{...} {...} [...]
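
As an illustration of this structure, the sketch below builds a small schema for a single MS run, encodes it as a string and decodes it again. The exact nesting is inferred from the table above, all paths are placeholders, and the use of ast.literal_eval for decoding is an assumption rather than a statement about QCQuan's implementation.

```python
import ast

# Hypothetical schema for one MS run with two conditions.
example_schema = {
    "allMSRuns": ["TMT8_tuesday"],
    "allConditions": ["mouse", "human"],
    "TMT8_tuesday": {
        "allMSRunConditions": ["mouse", "human"],
        "allMSRunChannelNames": ["126", "127", "128", "129"],
        "allMSRunChannelAliases": ["m1", "m2", "h1", "h2"],
        "config": "path/to/config.ini",   # placeholder path
        "data": "path/to/data.txt",       # placeholder path
        "wrapper": "path/to/wrapper.tsv", # placeholder path
        "mouse": {"channelNames": ["126", "127"], "channelAliases": ["m1", "m2"]},
        "human": {"channelNames": ["128", "129"], "channelAliases": ["h1", "h2"]},
    },
}

schema_string = repr(example_schema)       # string-encoded dictionary
schema = ast.literal_eval(schema_string)   # safe decoding of Python literals


def alias_to_condition(schema):
    """Map every channel alias to the condition it belongs to."""
    lookup = {}
    for run in schema["allMSRuns"]:
        run_info = schema[run]
        for condition in run_info["allMSRunConditions"]:
            for alias in run_info[condition]["channelAliases"]:
                lookup[alias] = condition
    return lookup


lookup = alias_to_condition(schema)
print(lookup)
```

ast.literal_eval only evaluates Python literals, which makes it a safer choice than eval for decoding such a string.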

(disabled) Isotope impurities file

For each data file one may optionally specify the isotope impurities using another .tsv file. This file may either contain a matrix (used directly by the workflow) or a TMT-style formatted table (IDT) that is automatically transformed into the correct matrix by the web interface. In the former case this is a tab-separated isotope impurity matrix (IIM). In the latter case the format is also tabular, resembling the format used in Thermo Fisher's COA documents and with the string “IDT” as the column header for the sample names.

Server & software:

Virtual server

The QCQuan workflow is deployed on a virtual server located at VITO, controlled through VMWare vSphere. Its operating system at the time of deployment (October 2017) is Ubuntu Server 16.04 LTS. Below is a list of custom packages required, available through the default repositories.

  • imagemagick (8:6.8.9.9-7ubuntu5.9)
  • iptables-persistent (1.0.4)
  • nginx (1.10.3-0ubuntu0.16.04.2)
  • python3-pip (8.1.1-2ubuntu0.4)
  • postfix (3.1.0-3)
  • python3-tk (3.5.1-1)
  • sqlite3 (3.11.0-1ubuntu1)
  • (Python weasyprint extra requirements:) python3-dev, python3-pip, python3-lxml, python3-cffi, libcairo2, libpango1.0-0, libgdk-pixbuf2.0-0, shared-mime-info

Workflow

The QCQuan workflow is written in Python 3, version 3.5.2. The application depends on the following modules, which are not included in the standard library but can be installed using the python3-pip package mentioned in the previous section:

  • flask (0.12.1)
  • matplotlib (2.0.0)
  • numpy (1.12.1)
  • pandas (0.20.3)
  • requests (2.9.1)
  • scipy (0.19.0)
  • sklearn / scikit-learn (0.18.1)
  • statsmodels (0.8.0)
  • uwsgi (2.0.15)
  • weasyprint (0.36)
  • werkzeug (0.12.1)
  • wtforms (2.1)

Web interface

The web interface is also written in Python 3.5.2, more specifically using the Flask framework, version 0.12.1. QCQuan runs as a WSGI application, communicating over the uWSGI protocol with the WSGI application server named 'uwsgi'. The latter then communicates with an NGINX web server.

Via the web site, users can read documentation, start workflow jobs, check the status of workflow jobs, retrieve data from completed jobs, and create DoE files.

Found a bug? Send us the details and job ID at info@qcquan.net.