Hadoop or PostgreSQL for effective processing
I am a student trying to use machine learning algorithms on a large data set. We have 140 million records in our training set (currently in PostgreSQL tables), and there are 5 tables with 6 million records that exhibit primary key - foreign key relationships.
We have 2 machines with the following configurations: 1) 6 GB RAM, 2nd generation i5 processor; 2) 8 GB RAM, 2nd generation i7 processor.
Right now we are planning to split the records into logical groupings before running our statistical analysis, since the turnaround time is quite high.
Should we:
1) split them into separate tables in PostgreSQL and then use MATLAB or R, or
2) use Hadoop with HBase by porting the database, or
3) combine both, i.e. decompose the data into logical groups, dump them into the PostgreSQL database, and also set up Hadoop + HBase for analysis, using whichever suits the algorithms required?
Thanks
It is hard to believe that Hadoop will be effective on such a small cluster. If you can parallelize your task without it, that will be more effective for sure.
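As a rough illustration of parallelizing without Hadoop, here is a minimal Python sketch that farms the logical groups out to worker processes on a single machine. The group layout (a dict of group id to a list of numbers) and the per-group mean are assumptions standing in for your real data and analysis; the function names are made up for this example.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import mean


def summarize_group(item):
    """Compute one statistic for one logical group (placeholder for the real analysis)."""
    group_id, values = item
    return group_id, mean(values)


def summarize_all(groups, workers=4):
    """Run summarize_group on every group, spread across worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order and returns (group_id, result) pairs.
        return dict(pool.map(summarize_group, groups.items()))
```

You would call `summarize_all(groups, workers=4)` from a script guarded by `if __name__ == "__main__":`, with one worker per core. Each group has to fit in the memory of one worker, so this only helps once the data has been split into groups.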
Another consideration to take into account is the iteration time in the learning process. If an iteration takes dozens of seconds, the Hadoop job overhead (which is about 30 seconds) will be too much.
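To put numbers on that, a small sketch of the arithmetic: given a fixed job overhead of roughly 30 seconds, it computes what share of each iteration's wall-clock time is pure overhead. The iteration times in the usage note are illustrative, not measurements.

```python
def overhead_fraction(iteration_seconds, job_overhead_seconds=30.0):
    """Share of total time per iteration that is spent on job startup overhead."""
    total = iteration_seconds + job_overhead_seconds
    return job_overhead_seconds / total
```

With a 30 second iteration, `overhead_fraction(30.0)` is 0.5, so half the time is wasted on overhead. With a 570 second iteration the share drops to 0.05, which is where a per-job cost of that size starts to be tolerable.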
What you can get from Hadoop is an effective external parallel sort, which is what the shuffle stage is. If you need that, consider using Hadoop.
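For reference, this is the general idea of an external sort in a single-process Python sketch: sort chunks that fit in memory, write each as a sorted run on disk, then do a k-way merge of the runs. It is not Hadoop's implementation, only the same technique in miniature; the function names and the one-integer-per-line file format are assumptions made for the example.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(sorted_chunk):
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{value}\n" for value in sorted_chunk)
    return path


def _read_run(path):
    """Lazily yield the integers stored in one run file."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values, chunk_size):
    """Sort integers via sorted on-disk runs of at most chunk_size items each."""
    iterator = iter(values)
    run_paths = []
    try:
        while True:
            chunk = sorted(islice(iterator, chunk_size))
            if not chunk:
                break
            run_paths.append(_write_run(chunk))
        # heapq.merge streams the runs, holding one pending item per run.
        # The result is collected into a list only to keep the demo simple.
        return list(heapq.merge(*(_read_run(p) for p in run_paths)))
    finally:
        for path in run_paths:
            os.remove(path)
```

Hadoop does this across machines and partitions the keys between reducers, which is where the "parallel" part comes from. If your analysis does not need a large sort or group-by of this kind, that benefit does not apply.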
Please note that in general it is not easy to port a relational schema to HBase, since joins are not supported.
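One common way around the missing joins is to denormalize before loading, so each stored row already carries the parent fields it needs. The Python sketch below does that with plain dicts and makes no HBase calls; the `customers`/`orders` tables, the column names, and the `customer_id#order_id` row key are all hypothetical and only stand in for your primary key - foreign key tables.

```python
def denormalize(customers, orders):
    """Pre-join child rows with their parent so no join is needed at read time."""
    customers_by_id = {c["customer_id"]: c for c in customers}
    rows = {}
    for order in orders:
        parent = customers_by_id[order["customer_id"]]
        # Composite row key: the parent key first, so one parent's rows sort together.
        row_key = f'{order["customer_id"]}#{order["order_id"]}'
        rows[row_key] = {
            "order:amount": order["amount"],
            "customer:name": parent["name"],
        }
    return rows
```

The cost is duplicated parent data and extra work whenever a parent row changes, which is a large part of why the port is not easy.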