split - Loading Big files into Hashes in Perl (BLAST tables) -
i'm perl beginner, please me out query... i'm trying extract information blast table (a snippet of looks below): it's standard blast table input... want extract information on list of reads (look @ second script below , idea of want do).... anyhow precisely i've done in second script:
inputs:
1) blast table:
38.1 0.53 59544 gh8nflv01a02ed gh8nflv01a02ed rank=0113471 x=305.0 y=211.5 length=345 1 yp_003242370 dynamin family protein [paenibacillus sp. y412mc10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590 34.3 7.5 123828 gh8nflv01a03qj gh8nflv01a03qj rank=0239249 x=305.0 y=1945.5 length=452 1 xp_002639994 hypothetical protein cbg10824 [caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124 37.7 0.70 62716 gh8nflv01a09b8 gh8nflv01a09b8 rank=0119267 x=307.0 y=1014.0 length=512 1 xp_002756773 predicted: probable g-protein coupled receptor 123-like, partial [callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306 37.7 0.98 33114 gh8nflv01a0h5c gh8nflv01a0h5c rank=0066011 x=298.0 y=2638.5 length=573 1 xp_002756773 predicted: probable g-protein coupled receptor 123-like, partial [callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306 103 1e-020 65742 gh8nflv01a0mxi gh8nflv01a0mxi rank=0124865 x=300.5 y=644.0 length=475 1 abz08973 hypothetical protein aloha_hf4000apkg6b14ctg1g18 [uncultured marine crenarchaeote hf4000_apkg6b14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68 41.6 0.053 36083 gh8nflv01a0qkx gh8nflv01a0qkx rank=0071366 x=301.0 y=1279.0 length=526 1 xp_766153 hypothetical protein [theileria parva strain muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60 45.4 0.003 78246 gh8nflv01a0z29 gh8nflv01a0z29 rank=0148293 x=304.0 y=1315.0 length=432 1 zp_04111769 hypothetical protein bthur0007_56280 [bacillus thuringiensis serovar monterrey bgsc 4aj1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150 71.6 4e-011 97250 gh8nflv01a14mr gh8nflv01a14mr rank=0184885 x=317.5 y=609.5 length=314 1 zp_03823721 dna replication protein [acinetobacter sp. atcc 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52 58.2 5e-007 154555 gh8nflv01a1kch gh8nflv01a1kch rank=0309994 x=310.0 y=2991.0 length=267 1 zp_03823721 dna replication protein [acinetobacter sp. atcc 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) reads list:
gh8nflv01a09b8 gh8nflv01a02ed etc etc
3) output want:
37.7 0.70 62716 gh8nflv01a09b8 gh8nflv01a09b8 rank=0119267 x=307.0 y=1014.0 length=512 1 xp_002756773 predicted: probable g-protein coupled receptor 123-like, partial [callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306 38.1 0.53 59544 gh8nflv01a02ed gh8nflv01a02ed rank=0113471 x=305.0 y=211.5 length=345 1 yp_003242370 dynamin family protein [paenibacillus sp. y412mc10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
i want subset of information in first list, given list of read names want extract (that found in 4th column) instead of hashing reads list (only?) want hash blast table itself, , use information in column 4 (of blast table)as keys extract values of each key, when key may have more 1 value(i.e: each read name might have more 1 hit , or associated blast result in table), keeping in mind, value includes whole row key(readname) in it.
my greplist.pl script this, very slow, think , ( , correct me if i'm wrong) loading whole table in hash, should speed things tremendously ...
thank help.
my scripts: broken 1 (mambo5.pl)
#!/usr/bin/perl -w # purpose: extract blastx data list of readnames use strict; open (data,$argv[0]) or die ("usage: ./mambo5.pl blastxtable readslist"); open (list,$argv[1]) or die ("usage: ./mambo5.pl blastxtable readslist"); %hash = <data>; close (data); $filename=$argv[0]; open(out, "> $filename.bololom"); $readname; while ( <list> ) { #########; if(/^(.*?)$/)# { $readname=$1;# chomp $readname; if (exists $hash{$readname}) { print "bingo!"; $output =$hash{$readname}; print out "$output\n"; } else { print "it aint workin\n"; #print %hash; } } } close (list);
the slow , quick cheat (that works) , slow (my blast tables can 400mb 2gb large, i'm sure can see why it's slow)
#!/usr/bin/perl -w ## # script finds list of names in blast table , outputs result in new file # name must exist , list must correctly formatted # not output using "normal" blast file, must table blast # if have standard blast output use blast2table script use strict; $filein=$argv[0] or die ("usage: ./listgrep.pl readslist blast_table\n"); $db=$argv[1] or die ("usage: ./listgrep.pl readslist blast_table\n"); #open reads want grep $read; $line; open(readslist,$filein); while($line=<readslist>) { if ($line=~/^(.*)$/) { $read = $1; print "$read\n"; system("grep \"$read\" $db >$read\_.out\n"); } #system("grep $read $db >$read\_.out\n"); } system("cat *\_.out >$filein\_greps.txt\n"); system("rm *.out\n");
i don't know how define 4th column key : maybe use split function, i've tried find way table of more 2 columns no avail... please help! if there easy way out of please let me know
thanks !
i'd opposite i.e read readslist file hash walk thru big blast file , print desired lines.
#!/usr/bin/perl use strict; use warnings; use 5.010; # read readslist file hash open $fh, '<', 'readslist' or die "can't open 'readslist' reading:$!"; %readslist = map { chomp; $_ => 1 }<$fh>; close $fh; open $fh_blast, '<', 'blastfile' or die "can't open 'blastfile' reading:$!"; # loop on blastfile lines while (<$fh_blast>) { chomp; # retrieve key (4th column) ($key) = (split/\s+/)[3]; # print line if key exists in hash $_ if exists $readslist{$key}; } close $fh_blast;
Comments
Post a Comment