Stanford MicroArray Database


Help : KNNimpute Help


Contents


  • Background

    KNNimpute is a method developed by Olga Troyanskaya and colleagues (Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. (2001) Missing value estimation methods for DNA microarrays, Bioinformatics, 17(6), 520-525) to impute (estimate) missing data from microarray experiments. It is based on the K Nearest Neighbors algorithm. This method is useful when you want to apply an analysis method, such as Principal Components Analysis or Singular Value Decomposition, that requires a dataset with no missing values.

    Troyanskaya and her coauthors studied several methods for imputing missing values and concluded that KNNimpute performs better than the row average method, SVDimpute (another method, based on Singular Value Decomposition), or simply filling in missing values with zeroes. This is particularly true for genes whose expression values form small clusters. The conclusion was reached by taking actual complete datasets, removing 1-20% of the data at random, estimating the now-missing values with each method, and then comparing the imputed values to the real data.
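    The evaluation procedure described above (hide a fraction of a complete dataset, impute, compare to the truth) can be sketched in Python with NumPy. This is an illustrative sketch only, not the code used in the paper or in SMD: the function names are hypothetical, the row average method stands in as the method under test, and root-mean-square error is an assumed choice of comparison metric.

```python
import numpy as np

def mask_at_random(complete, fraction, rng):
    """Return a copy of `complete` with about `fraction` of entries set to NaN,
    plus a boolean array marking which entries were hidden."""
    masked = np.array(complete, dtype=float)
    hidden = rng.random(masked.shape) < fraction
    masked[hidden] = np.nan
    return masked, hidden

def row_average_impute(data):
    """Baseline: replace each missing value with the mean of its row's observed values."""
    filled = data.copy()
    counts = (~np.isnan(data)).sum(axis=1)
    sums = np.nansum(data, axis=1)
    # Rows with no observed values fall back to 0 (an arbitrary choice for this sketch)
    row_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    rows, cols = np.where(np.isnan(data))
    filled[rows, cols] = row_means[rows]
    return filled

def rms_error(imputed, complete, hidden):
    """Root-mean-square difference between imputed and true values at the hidden positions."""
    diff = imputed[hidden] - np.asarray(complete, dtype=float)[hidden]
    return float(np.sqrt(np.mean(diff ** 2)))

# Example: hide 10% of a synthetic complete matrix and score the baseline
rng = np.random.default_rng(0)
complete = rng.normal(size=(50, 8))
masked, hidden = mask_at_random(complete, 0.10, rng)
print(rms_error(row_average_impute(masked), complete, hidden))
```

    Any other imputation function that accepts a matrix containing NaN values and returns a filled matrix can be swapped in for row_average_impute to compare methods on the same hidden entries.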

    In this method, a row of microarray data with missing values is compared to the other rows. (The rows usually report on genes, so this document refers to them as genes.) The 'gene' rows that are most similar in terms of expression data (the 'nearest neighbors') are used as the basis for estimating the missing data points. Each missing value is replaced by a weighted average, in which the contribution of each neighboring gene is weighted by its overall similarity to the gene whose missing data are being imputed. In SMD, a Euclidean distance metric is used to assess similarity.
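    The idea above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not SMD's implementation: the function name knn_impute is hypothetical, the weights are assumed to be inverse Euclidean distances (with a small constant to avoid division by zero), and candidate neighbors are restricted to genes that have values in every column where the target gene is observed.

```python
import numpy as np

def knn_impute(data, k=10):
    """Fill NaNs in a genes x experiments matrix using a distance-weighted
    average over the k most similar genes (Euclidean distance)."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    missing = np.isnan(data)
    for g in np.where(missing.any(axis=1))[0]:
        observed = ~missing[g]  # columns where the target gene has data
        for col in np.where(missing[g])[0]:
            # Candidate neighbors: other genes with a value in this column
            # and in every column where the target gene is observed
            candidates = [
                i for i in range(data.shape[0])
                if i != g and not missing[i, col] and not missing[i, observed].any()
            ]
            if not candidates:
                continue  # nothing to base an estimate on; leave as NaN
            diffs = data[candidates][:, observed] - data[g, observed]
            dists = np.sqrt((diffs ** 2).sum(axis=1))
            order = np.argsort(dists)[:k]
            nearest = np.array(candidates)[order]
            # Assumed weighting: closer genes contribute more (inverse distance)
            weights = 1.0 / (dists[order] + 1e-6)
            filled[g, col] = np.average(data[nearest, col], weights=weights)
    return filled

# Example: the second gene is missing its third value
example = [[1.0, 2.0, 3.0],
           [1.0, 2.0, np.nan],
           [10.0, 20.0, 30.0]]
print(knn_impute(example, k=1))
```

    With k=1 the estimate is taken entirely from the single closest gene; larger values of k blend in more distant genes with smaller weights.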

    File format:
    The file used to run KNNimpute must be a tab-delimited text file in the form of a matrix, where each row has data for one gene and each column has data for one experiment. Make sure each column has a header (the experiment name). The data are expression values. A PCL file is in the proper format (the eweight and gweight columns are ignored), and the icon for KNNimpute appears next to such files in the SMD repositories. Matrices with as few as six columns (experiments) have been used with accurate results. This method is not recommended for matrices with fewer than four columns.
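    For readers who want to inspect such a file outside SMD, the following Python sketch reads a tab-delimited expression table into memory. It is an illustration under stated assumptions, not SMD's parser: the function name is hypothetical, and the layout is assumed to be PCL-like, with a gene identifier column, a name column, an optional GWEIGHT column, and an optional EWEIGHT row, both of which are skipped. Empty cells are treated as missing values.

```python
import csv
import io
import math

def read_expression_matrix(handle):
    """Parse a tab-delimited expression table into (experiments, gene_ids, rows).

    Empty or non-numeric cells become NaN (missing values).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumed layout: ID, NAME, optional GWEIGHT, then one column per experiment
    first_data_col = 3 if len(header) > 2 and header[2].upper() == "GWEIGHT" else 2
    experiments = header[first_data_col:]
    gene_ids, rows = [], []
    for record in reader:
        if not record or record[0].upper() == "EWEIGHT":
            continue  # skip blank lines and the experiment-weight row
        cells = record[first_data_col:]
        cells += [""] * (len(experiments) - len(cells))  # pad short rows
        values = []
        for cell in cells[:len(experiments)]:
            try:
                values.append(float(cell))
            except ValueError:
                values.append(math.nan)
        gene_ids.append(record[0])
        rows.append(values)
    return experiments, gene_ids, rows

# Example with two genes and two experiments; one value is missing
sample = (
    "YORF\tNAME\tGWEIGHT\texp1\texp2\n"
    "EWEIGHT\t\t\t1\t1\n"
    "YAL001C\tTFC3\t1\t0.5\t\n"
    "YAL002W\tVPS8\t1\t-1.2\t0.3\n"
)
experiments, gene_ids, rows = read_expression_matrix(io.StringIO(sample))
print(experiments, gene_ids, rows)
```

    Counting the entries of experiments after loading gives a quick check against the column guidance above before submitting a file.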

  • Running KNNimpute in SMD

    From your repository, click on the double-ended arrow icon next to a file. This will take you to the KNNimpute page, where you have some parameter choices to make:

    Number of nearest neighbors:
    K is the number of neighbors used for the estimation. If you select too few neighbors, the estimate can be biased by a few dominant profiles. If you select too many, the set of neighbors may expand so much that the calculation includes irrelevant profiles. Also, with a larger amount of data, noise may overwhelm signal. The optimal K probably depends on the average cluster size in your dataset.

    For the datasets tested in Troyanskaya et al. (2001) and in further work, values of K from 10 to 17 gave the best performance, and there was little variation in results within that range.

    Output file name:
    Enter the name you want to assign to your output file.

    Output description:
    This is a place for notes to yourself about the derivation of your result file.

    To start the run, click on "Submit Impute". A new screen will appear with the message:

    "This process will run off line. You will be notified via email when processing is complete.
    Output name: (the file name you entered)
    N neighbors: (from your input)"

    While your process is running, the job should appear in the list of queued jobs. It may take a fair amount of time to run, depending on the file size and the number of other jobs in the queue. When it is finished, you will receive an email message, and the new file (which includes the imputed results) will appear in your repository. To examine the imputed values, you can download the pre- and post-imputation files from your repository (double-click on the "Data" icon) and note where missing values were filled in. You should be mindful of which data were imputed as you continue in your analysis.