split - Loading Big files into Hashes in Perl (BLAST tables) -


i'm perl beginner, please me out query... i'm trying extract information blast table (a snippet of looks below): it's standard blast table input... want extract information on list of reads (look @ second script below , idea of want do).... anyhow precisely i've done in second script:

inputs:

1) blast table:

38.1    0.53    59544   gh8nflv01a02ed  gh8nflv01a02ed rank=0113471 x=305.0 y=211.5 length=345  1   yp_003242370    dynamin family protein [paenibacillus sp. y412mc10] -1  0   48.936170212766 40.4255319148936    47  345 1213    13.6231884057971    3.87469084913438    31  171 544 590 34.3    7.5 123828  gh8nflv01a03qj  gh8nflv01a03qj rank=0239249 x=305.0 y=1945.5 length=452 1   xp_002639994    hypothetical protein cbg10824 [caenorhabditis briggsae] 3   0   52.1739130434783    32.6086956521739    46  452 367 10.1769911504425    12.5340599455041    111 248 79  124 37.7    0.70    62716   gh8nflv01a09b8  gh8nflv01a09b8 rank=0119267 x=307.0 y=1014.0 length=512 1   xp_002756773    predicted: probable g-protein coupled receptor 123-like, partial [callithrix jacchus]   1   0   73.5294117647059    52.9411764705882    34  512 703 6.640625    4.83641536273115    43  144 273 306 37.7    0.98    33114   gh8nflv01a0h5c  gh8nflv01a0h5c rank=0066011 x=298.0 y=2638.5 length=573 1   xp_002756773    predicted: probable g-protein coupled receptor 123-like, partial [callithrix jacchus]   -3  0   73.5294117647059    52.9411764705882    34  573 703 5.93368237347295    4.83641536273115    131 232 273 306 103 1e-020  65742   gh8nflv01a0mxi  gh8nflv01a0mxi rank=0124865 x=300.5 y=644.0 length=475  1   abz08973    hypothetical protein aloha_hf4000apkg6b14ctg1g18 [uncultured marine crenarchaeote hf4000_apkg6b14]  2   0   77.9411764705882    77.9411764705882    68  475 151 14.3157894736842    45.0331125827815    2   205 1   68 41.6    0.053   36083   gh8nflv01a0qkx  gh8nflv01a0qkx rank=0071366 x=301.0 y=1279.0 length=526 1   xp_766153   hypothetical protein [theileria parva strain muguga]    -1  0   66.6666666666667    56.6666666666667    30  526 304 5.70342205323194    9.86842105263158    392 481 31  60 45.4    0.003   78246   gh8nflv01a0z29  gh8nflv01a0z29 rank=0148293 x=304.0 y=1315.0 length=432 1   zp_04111769 hypothetical protein bthur0007_56280 [bacillus thuringiensis serovar monterrey bgsc 4aj1]   3   0   51.8518518518518    38.8888888888889    54  432 193 12.5    27.979274611399 48  209 97  150 71.6    4e-011  97250   gh8nflv01a14mr  gh8nflv01a14mr rank=0184885 x=317.5 y=609.5 length=314  1   zp_03823721 dna replication protein [acinetobacter sp. atcc 27244]  1   0   92.5    92.5    40  314 311 12.7388535031847    12.8617363344051    193 312 13  52 58.2    5e-007  154555  gh8nflv01a1kch  gh8nflv01a1kch rank=0309994 x=310.0 y=2991.0 length=267 1   zp_03823721 dna replication protein [acinetobacter sp. atcc 27244]  1   0   82.051282051282 82.051282051282 39  267 311 14.6067415730337    12.540192926045 142 258 1   39 

2) reads list:

gh8nflv01a09b8 gh8nflv01a02ed etc etc 

3) output want:

37.7    0.70    62716   gh8nflv01a09b8  gh8nflv01a09b8 rank=0119267 x=307.0 y=1014.0 length=512 1   xp_002756773    predicted: probable g-protein coupled receptor 123-like, partial [callithrix jacchus]   1   0   73.5294117647059    52.9411764705882    34  512 703 6.640625    4.83641536273115    43  144 273 306 38.1    0.53    59544   gh8nflv01a02ed  gh8nflv01a02ed rank=0113471 x=305.0 y=211.5 length=345  1   yp_003242370    dynamin family protein [paenibacillus sp. y412mc10] -1  0   48.936170212766 40.4255319148936    47  345 1213    13.6231884057971    3.87469084913438    31  171 544 590 

i want subset of information in first list, given list of read names want extract (that found in 4th column) instead of hashing reads list (only?) want hash blast table itself, , use information in column 4 (of blast table)as keys extract values of each key, when key may have more 1 value(i.e: each read name might have more 1 hit , or associated blast result in table), keeping in mind, value includes whole row key(readname) in it.

my greplist.pl script this, very slow, think , ( , correct me if i'm wrong) loading whole table in hash, should speed things tremendously ...

thank help.

my scripts: broken 1 (mambo5.pl)

#!/usr/bin/perl -w # purpose:  extract blastx data list of readnames use strict; open (data,$argv[0]) or die ("usage: ./mambo5.pl blastxtable readslist"); open (list,$argv[1]) or die ("usage: ./mambo5.pl blastxtable readslist"); %hash = <data>; close (data); $filename=$argv[0]; open(out, "> $filename.bololom");  $readname;  while ( <list> ) {     #########;     if(/^(.*?)$/)#     {         $readname=$1;#         chomp $readname;         if (exists $hash{$readname})         {             print "bingo!";             $output =$hash{$readname};             print out "$output\n";         }         else          {             print "it aint workin\n";             #print %hash;         }                } } close (list); 

the slow , quick cheat (that works) , slow (my blast tables can 400mb 2gb large, i'm sure can see why it's slow)

#!/usr/bin/perl -w ##  # script finds list of names in blast table , outputs result in new file # name must exist , list must correctly formatted # not output using "normal" blast file, must table blast # if have standard blast output use blast2table script  use strict; $filein=$argv[0] or die ("usage: ./listgrep.pl readslist blast_table\n"); $db=$argv[1] or die ("usage: ./listgrep.pl readslist blast_table\n"); #open reads want grep $read; $line; open(readslist,$filein); while($line=<readslist>) {     if ($line=~/^(.*)$/)      {         $read = $1;         print "$read\n";         system("grep \"$read\" $db >$read\_.out\n");     }       #system("grep $read $db >$read\_.out\n"); } system("cat *\_.out >$filein\_greps.txt\n"); system("rm *.out\n"); 

i don't know how define 4th column key : maybe use split function, i've tried find way table of more 2 columns no avail... please help! if there easy way out of please let me know

thanks !

i'd opposite i.e read readslist file hash walk thru big blast file , print desired lines.

#!/usr/bin/perl  use strict; use warnings; use 5.010;  # read readslist file hash open $fh, '<', 'readslist' or die "can't open 'readslist' reading:$!"; %readslist = map { chomp; $_ => 1 }<$fh>; close $fh;  open $fh_blast, '<', 'blastfile' or die "can't open 'blastfile' reading:$!"; # loop on blastfile lines while (<$fh_blast>) {     chomp;     # retrieve key (4th column)     ($key) = (split/\s+/)[3];     # print line if key exists in hash     $_ if exists $readslist{$key}; } close $fh_blast; 

Comments

Popular posts from this blog

python - ('The SQL contains 0 parameter markers, but 50 parameters were supplied', 'HY000') or TypeError: 'tuple' object is not callable -

objective c - Language Translation API for iPhone -

jasper reports - Fixed header in Excel using JasperReports -