ruby - Fast way to check if partial value is present in large array of hashes -
i have large array (a) containing 50,000 values (hashes). have 2 other arrays containing blacklist (b) , whitelist (w) of values (also strings) checked against a. know array.include? function have check if value b or w partially in key "text" of each hash in a.
so example:
a = [ { :text => 'cat', :some_other_key => 'bla' }, { :text => 'dog', :some_other_key => 'blub' }, { :text => 'wolve', :some_other_key => 'example' }, { :text => 'bird', :some_other_key => 'test' }, { :text => 'white whale', :some_other_key => 'test' } ] w = [ 'whale', 'cow' ] b = [ 'pig', 'chicken' ]
i want match 'white whale' against 'whale' w. far know, include? match if values same.
what did far is:
a.each |value| if w.any? { |w| value[:text] =~ /#{w}/ } #do stuff next elsif b.any? { |w| value[:text] =~ /#{w}/ } #do other stuff next end end
this works me, it's slow if a, b or w getting big. leads me question: know better , faster way of achieving this?
building regexes cost can move out of inner loop:
w_rx = w.map { |w| /#{w}/ } b_rx = b.map { |w| /#{w}/ } a.each |value| if w_rx.any? { |rx| value[:text] =~ rx } #do stuff next elsif b_rx.any? { |rx| value[:text] =~ rx } #do other stuff next end end
also please @ steenslag's answer, replacing w_rx = w.map { |w| /#{w}/ }
big_w_rx = regepx.union(w)
, w_rx.any? { |rx| value[:text] =~ rx }
value[:text] =~ big_w_rx
can lot faster. tested 5 times faster, depend bit on data.
the other way improve on loop find way skip entries in a, w or b. in essence indexing does, have problem here each entry in new you, cost of indexing scheme higher cost of scanning whole list.
if have knowledge structure in each [:text]
use advantage. instance if lengths of [:text]
can routinely less items in w
, b
, there no need check matches against longer strings. sort w
, b
length, create hash of points in sorted lists each length, , test minimum slice of w , b in inner loops - that's specific how data behaves, i'm not going write out in full, may not you.
a benchmark, regexp.union
vs arr_of_regex.any?
require 'benchmark' w = [ 'whale', 'cow', 'flat', 'bundle', 'bubbles', 'clarinet', 'cash', 'fish', 'donkey' ] # dummy data (lots of misses w, might not representative) = array.new( 100000 ) { |i| (1..(5+rand(20))).map { |j| "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz".chars.sample }.join } big_rx = regexp.union(w) w_rx = w.map { |w| /#{w}/ } benchmark.bm |bm| bm.report(:big_rx) { a.select { |s| s =~ big_rx } } bm.report(:w_rx) { a.select { |s| w_rx.any? { |rx| s =~ rx } } } end
output:
user system total real big_rx 0.110000 0.000000 0.110000 ( 0.112621) w_rx 0.650000 0.000000 0.650000 ( 0.660112)
if modify w 1000 records, built a
in above example, difference closes more factor of 2 (still, worth having):
user system total real big_rx 21.380000 0.010000 21.390000 ( 21.398051) w_rx 39.720000 0.030000 39.750000 ( 39.759734)
Comments
Post a Comment