Phage Informatics Group.
PIG.

This is what I did:

  1. Start with the BLAST results. Work out all pairwise similarities using the script pairwise_sims.pl.
  2. Use the script get_score to get the scores from the phylip files. This generates the output file counted.txt.
  3. There is a problem with counted.txt that I cannot explain: some of the lines are totally screwed up. I spent hours trying to fix this and could not, so I decided to just re-run the screwed-up ones.
  4. counted.txt has the following columns (separated by tabs): protein1, protein2, average E-value from the BLAST, protdist score. The E-value is an average because p1 and p2 may be compared in several different BLAST searches. In most cases I only ran protdist once, so there is a single score. In some cases I screwed up and ran it twice, so there are two scores. These should be identical (?) but I am not sure. Use the script combine.pl to combine the pairwise sims with counted.txt. The result is combined.txt.
  5. Use the script get_extra_seqs to get the missing proteins. These are in extra_proteins.tgz.
  6. This is about another 90,000 proteins - oh my god!! I am running these through clustal/phylip on dolphin.
  7. Run results.pl on the results to generate the list of missing_data. This is just p1 p2 score, not quite there yet.
  8. Finally, run combine2.pl to get the final data set, all_data.txt. This data set has 916,898 protein pairs and should have everything you need.
  9. Bugger
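
The sanity checks described in steps 3 and 4 could be sketched roughly as below. This is a minimal Python sketch, not the script used here. The names check_counted_line and triage are made up for illustration, and it assumes that a second protdist score, when there is one, shows up as a fifth tab-separated field on the same line (the notes above do not say how it is stored).

```python
import math


def check_counted_line(line, tol=1e-6):
    """Classify one line of a counted.txt-style file.

    Assumed layout: protein1, protein2, average E-value, protdist score,
    separated by tabs, with an optional fifth field holding a second
    protdist score (the fifth field is an assumption).
    Returns 'ok', 'malformed' or 'mismatch'.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (4, 5):
        return "malformed"
    if fields[0] == "" or fields[1] == "":
        return "malformed"
    try:
        numbers = [float(f) for f in fields[2:]]
    except ValueError:
        return "malformed"
    if not all(math.isfinite(n) for n in numbers):
        return "malformed"
    scores = numbers[1:]  # one or two protdist scores
    if len(scores) == 2 and abs(scores[0] - scores[1]) > tol:
        return "mismatch"
    return "ok"


def triage(lines):
    """Group 1-based line numbers by the class of each line."""
    report = {"ok": [], "malformed": [], "mismatch": []}
    for number, line in enumerate(lines, start=1):
        report[check_counted_line(line)].append(number)
    return report
```

Run over counted.txt, the "malformed" list would be the pairs to re-run, and the "mismatch" list would answer whether two protdist runs on the same pair really are identical.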

    More analysis:

  10. Use pairwise_sims2.pl to generate a list of just the BLAST E-values.
  11. Use results.pl to generate a list of just the protdist scores.
  12. Use combine3 to combine these two outputs into a new data file, recombined_data.txt.
  13. Use the script min_max to generate the minimum and maximum for each column as a test of integrity. This also checks that everything that should have a data point does!
  14.          Prot 1    Prot 2    E value    Protdist
    Minimum    14000.000000
    Maximum    26807268251038.620803
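
The integrity test in step 13 could look something like the following. This is a hedged Python sketch, not the min_max script itself. The function name column_ranges is invented, and it assumes recombined_data.txt has four tab-separated numeric columns (Prot 1, Prot 2, E value, Protdist), which is inferred from the table headings rather than stated anywhere.

```python
def column_ranges(rows, n_columns=4):
    """Return (minima, maxima, incomplete) for tab-separated numeric rows.

    minima and maxima hold one value per column. incomplete lists the
    1-based numbers of rows that have a missing or non-numeric field,
    so those rows do not contribute to the ranges.
    """
    minima = [None] * n_columns
    maxima = [None] * n_columns
    incomplete = []
    for number, row in enumerate(rows, start=1):
        fields = row.rstrip("\n").split("\t")
        if len(fields) != n_columns or any(f.strip() == "" for f in fields):
            incomplete.append(number)
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            incomplete.append(number)
            continue
        for i, value in enumerate(values):
            if minima[i] is None or value < minima[i]:
                minima[i] = value
            if maxima[i] is None or value > maxima[i]:
                maxima[i] = value
    return minima, maxima, incomplete
```

An empty incomplete list means every pair has both an E value and a protdist score. A negative minimum in either of the last two columns would point to a parsing problem upstream.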

Free graphs

  1. All data
  2. All data x-axis max=1; y-axis max=5
  3. All data x-axis max=5; y-axis max=5
  4. All data x-axis max=0.01; y-axis max=5
  5. All data x-axis on a log scale
  6. This is the new data you want: recombined_data.txt
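
For anyone wanting to redraw the zoomed and log-scale graphs from recombined_data.txt, the point selection could be sketched as below. This is an illustrative Python sketch only. It assumes the x-axis is the E value and the y-axis is the protdist score, which the graph titles do not actually say, and the names clip_to_axes and log_x are made up. An E value of exactly 0 has no logarithm, so the sketch substitutes an arbitrary small floor. Whatever the original graphs did with zeros is not recorded here.

```python
import math


def clip_to_axes(points, x_max=None, y_max=None):
    """Keep the (x, y) points that fall inside the given axis maxima."""
    return [
        (x, y)
        for x, y in points
        if (x_max is None or x <= x_max) and (y_max is None or y <= y_max)
    ]


def log_x(points, floor=1e-200):
    """log10-transform x, replacing values below floor (including 0) by floor.

    The floor value is an arbitrary choice for this sketch.
    """
    return [(math.log10(max(x, floor)), y) for x, y in points]
```

For example, clip_to_axes(points, x_max=0.01, y_max=5) would select the points for a graph with x-axis max=0.01 and y-axis max=5.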