Spark has sliding window operations:

    windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
Search for "window operations" on http://spark.apache.org/docs/latest/streaming-programming-gu.... Unless you meant something different?

You can replay streams against Spark, too. streamingContext.textFileStream will stream data from files dumped in a directory - to replay them, just dump them there again.
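Something like this (the directory path is illustrative):

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext(appName="ReplayFromFiles")
  ssc = StreamingContext(sc, 10)

  # textFileStream watches the directory and ingests any file that newly
  # appears in it; it ignores files it has already processed, so to replay
  # old data, copy the files back in under fresh names.
  lines = ssc.textFileStream("/tmp/stream-in")
  lines.count().pprint()

  ssc.start()
  ssc.awaitTermination()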




Yeah, that's exactly what I used in the Spark version I threw together yesterday.

reduceByKeyAndWindow is what my simple non-Spark proof of concept is missing; otherwise the code is basically the same.

My non-Spark code is essentially:

  from pprint import pprint

  while True:
      data = {}
      for line in input:               # parse, aggregate, is_bad are my helpers
          rec = parse(line)
          data = aggregate(data, rec)
      data = dict(filter(is_bad, data.items()))  # keep only the offending entries
      pprint(data)
The Spark version is 99% the same code:

  # add/sub here would be e.g. operator.add and operator.sub
  (lines.map(parse)
        .reduceByKeyAndWindow(add, sub, 3600, 60)
        .filter(is_bad)
        .pprint())
Figuring out how to do stateful processing is a little tricky, but updateStateByKey seems to do what I need: I need to dedup the output per key for time period t. Though some recommendations are just to use something like Redis or Memcached, which would also work.
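Not my actual code, but roughly what I have in mind with updateStateByKey (here "bad" stands for the filtered DStream above; the per-time-period expiry, or the Redis/Memcached route, is left out):

  # Flag each key the first time it appears; later batches see it as
  # "old" and it gets filtered out. Needs ssc.checkpoint(...) enabled.
  def dedup(values, state):
      if state is None and values:
          return "new"    # first occurrence of this key
      if state is not None:
          return "old"    # already reported once; keep it suppressed
      return None         # no data and no state: nothing to track

  fresh = bad.updateStateByKey(dedup).filter(lambda kv: kv[1] == "new")
  fresh.pprint()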



