Phage Informatics Group.
PIG.
This is what I did:
- Start with the BLAST results. Work out all pairwise similarities using the script pairwise_sims.pl.
- Use the script get_score to get the scores from the phylip files. This generates the output counted.txt.
- There is a problem with counted.txt that I cannot explain: some of the lines are totally garbled. I spent hours trying to fix this and could not, so I decided to just re-run the pairs on those garbled lines.
- counted.txt has the following tab-separated columns: protein1, protein2, average E-value from the BLAST, protdist score. The E-value is an average because p1 and p2 may be compared in several different BLAST searches. In most cases I only ran protdist once, so there is a single score. In some cases I ran it twice by mistake, so there are two scores. These should be identical (?) but I am not sure.
- Use the script combine.pl to combine the pairwise sims with counted.txt. The result is combined.txt.
- Use the script get_extra_seqs to get the missing proteins. These are in extra_proteins.tgz.
- This is about another 90,000 proteins - oh my god!! I am running these through clustal/phylip on dolphin.
- Run results.pl on the results to generate the list of missing_data. This is just p1 p2 score, not quite there yet.
- Finally, run combine2.pl to get the final data set, all_data.txt. This data set has 916,898 protein pairs and should have everything you need.
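
The two open questions about counted.txt (which lines are garbled, and whether the double protdist runs really agree) could be checked with something like the sketch below. This is not one of the scripts listed above. It is a Python illustration that assumes a second protdist score shows up as an extra tab-separated field on the same line, which I have not confirmed. The function names and the tolerance are made up for the example.

```python
import math


def parse_counted_line(line):
    """Parse one counted.txt line.

    Assumed layout (tab-separated): protein1, protein2, average E-value,
    then one or more protdist scores. Returns None for a malformed line.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return None
    p1, p2 = fields[0].strip(), fields[1].strip()
    if not p1 or not p2:
        return None
    try:
        e_value = float(fields[2])
        scores = [float(f) for f in fields[3:]]
    except ValueError:
        return None
    # E-values and protdist distances should be finite and non-negative.
    if any(not math.isfinite(v) or v < 0 for v in [e_value] + scores):
        return None
    return p1, p2, e_value, scores


def check_counted(lines, tolerance=1e-6):
    """Split lines into usable rows, malformed line numbers, and pairs
    whose repeated protdist scores differ by more than `tolerance`."""
    good, malformed, disagreeing = [], [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        parsed = parse_counted_line(line)
        if parsed is None:
            malformed.append(number)
            continue
        p1, p2, e_value, scores = parsed
        if max(scores) - min(scores) > tolerance:
            disagreeing.append((p1, p2, scores))
            continue
        good.append((p1, p2, e_value, scores[0]))
    return good, malformed, disagreeing
```

Calling `check_counted(open("counted.txt"))` would give the line numbers to re-run in `malformed`, and any pair listed in `disagreeing` would mean the repeated protdist runs are not identical after all.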
Bugger
More analysis:
- Use pairwise_sims2.pl to generate a list of just the BLAST E value scores.
- Use results.pl to generate a list of just the protdist scores.
- Use combine3 to combine these two outputs into a new data file recombined_data.txt.
- Use the script min_max to generate the minimum and maximum for each column as a test of integrity. This also checks that everything that should have a datapoint has one!
| | Prot 1 | Prot 2 | E value | Protdist |
| Minimum | 1 | 40 | 0 | 0.000000 |
| Maximum | 26807 | 26825 | 10 | 38.620803 |
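
For reference, the kind of check min_max does can be sketched as follows. This is a Python illustration, not the min_max script itself. It assumes recombined_data.txt is tab-separated with four numeric columns (Prot 1, Prot 2, E value, Protdist), as the table above suggests. The function names are hypothetical.

```python
import math


def read_rows(path):
    """Read a tab-separated file into a list of lists of strings."""
    with open(path) as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


def column_min_max(rows, n_columns=4):
    """Return ([(min, max), ...] per column, [numbers of incomplete rows]).

    A row counts as incomplete if it does not have exactly `n_columns`
    fields or if any field is not a number (including empty fields).
    Row numbers are 1-based.
    """
    minima = [math.inf] * n_columns
    maxima = [-math.inf] * n_columns
    incomplete = []
    for number, row in enumerate(rows, start=1):
        if len(row) != n_columns:
            incomplete.append(number)
            continue
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            incomplete.append(number)
            continue
        for i, value in enumerate(values):
            minima[i] = min(minima[i], value)
            maxima[i] = max(maxima[i], value)
    return list(zip(minima, maxima)), incomplete
```

An empty `incomplete` list from `column_min_max(read_rows("recombined_data.txt"))` would mean every pair has both an E value and a protdist score. Note that an E value minimum of 0 matters for the log-scale graph, since zero cannot be placed on a log axis.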
Free graphs
- All data
- All data x-axis max=1; y-axis max=5
- All data x-axis max=5; y-axis max=5
- All data x-axis max=0.01; y-axis max=5
- All data x-axis on a log scale