ruby - Fast way to check if partial value is present in large array of hashes -


i have large array (a) containing 50,000 values (hashes). have 2 other arrays containing blacklist (b) , whitelist (w) of values (also strings) checked against a. know array.include? function have check if value b or w partially in key "text" of each hash in a.

so example:

a = [ { :text => 'cat', :some_other_key => 'bla' },        { :text => 'dog', :some_other_key => 'blub' },        { :text => 'wolve', :some_other_key => 'example' },        { :text => 'bird', :some_other_key => 'test' },        { :text => 'white whale', :some_other_key => 'test' } ] w = [ 'whale', 'cow' ] b = [ 'pig', 'chicken' ] 

i want match 'white whale' against 'whale' w. far know, include? match if values same.

what did far is:

a.each |value|      if w.any? { |w| value[:text] =~ /#{w}/ }         #do stuff         next     elsif b.any? { |w| value[:text] =~ /#{w}/ }         #do other stuff         next     end end 

this works me, it's slow if a, b or w getting big. leads me question: know better , faster way of achieving this?

building regexes cost can move out of inner loop:

w_rx = w.map { |w| /#{w}/ }  b_rx = b.map { |w| /#{w}/ }  a.each |value|     if w_rx.any? { |rx| value[:text] =~ rx }         #do stuff         next     elsif b_rx.any? { |rx| value[:text] =~ rx }         #do other stuff         next     end  end 

also please @ steenslag's answer, replacing w_rx = w.map { |w| /#{w}/ } big_w_rx = regepx.union(w) , w_rx.any? { |rx| value[:text] =~ rx } value[:text] =~ big_w_rx can lot faster. tested 5 times faster, depend bit on data.

the other way improve on loop find way skip entries in a, w or b. in essence indexing does, have problem here each entry in new you, cost of indexing scheme higher cost of scanning whole list.

if have knowledge structure in each [:text] use advantage. instance if lengths of [:text] can routinely less items in w , b, there no need check matches against longer strings. sort w , b length, create hash of points in sorted lists each length, , test minimum slice of w , b in inner loops - that's specific how data behaves, i'm not going write out in full, may not you.


a benchmark, regexp.union vs arr_of_regex.any?

require 'benchmark'  w = [ 'whale', 'cow', 'flat', 'bundle', 'bubbles',        'clarinet', 'cash', 'fish', 'donkey' ]  # dummy data (lots of misses w, might not representative) = array.new( 100000 ) { |i|    (1..(5+rand(20))).map { |j|      "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz".chars.sample    }.join  }  big_rx = regexp.union(w)  w_rx = w.map { |w| /#{w}/ }  benchmark.bm |bm|   bm.report(:big_rx) { a.select { |s| s =~ big_rx } }   bm.report(:w_rx) { a.select { |s| w_rx.any? { |rx| s =~ rx } } } end 

output:

       user     system      total        real big_rx  0.110000   0.000000   0.110000 (  0.112621) w_rx  0.650000   0.000000   0.650000 (  0.660112) 

if modify w 1000 records, built a in above example, difference closes more factor of 2 (still, worth having):

       user     system      total        real big_rx 21.380000   0.010000  21.390000 ( 21.398051) w_rx 39.720000   0.030000  39.750000 ( 39.759734) 

Comments

Popular posts from this blog

javascript - Count length of each class -

What design pattern is this code in Javascript? -

hadoop - Restrict secondarynamenode to be installed and run on any other node in the cluster -