Werner Daehn
Oct 30, 2022

Why not parallel-process?

You would have one Lambda that reads the file size, e.g. 10 GB, splits the file into 1,000 parts of 10 MB each, and puts these 1,000 range requests into SQS.
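A minimal sketch of that splitter step. In a real deployment the size would come from `s3.head_object` and each range tuple would be sent as an SQS message via `sqs.send_message`; the function and names here are illustrative, and only the pure range arithmetic is shown.

```python
def plan_ranges(total_size, chunk_size):
    """Split a file of total_size bytes into (start, end) byte ranges,
    inclusive on both ends, as used by the S3 Range header
    ("bytes=start-end"). The last range may be shorter."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        ranges.append((start, end))
        start += chunk_size
    return ranges

# e.g. plan_ranges(10_000_000_000, 10_000_000) yields 1,000 ranges,
# one SQS message each.
```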

Each SQS consumer starts reading at its own offset, e.g. the first sets the S3 Range header to begin at byte 0, the second at 10 MB. The consumer then finds the true start of its first line: byte 0 for offset 0, the byte after the first \n for all others. It reads 11 MB, i.e. 1 MB of overlap. When it passes the 10 MB mark, it completes the current line and then stops, so the line straddling the boundary is processed exactly once.
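The ownership rule above can be sketched as follows: a worker owns every line that *starts* inside its chunk, and uses the overlap bytes to finish the line that crosses the boundary. This is a simplified sketch operating on an in-memory buffer standing in for the S3 Range GET response; it assumes the overlap is large enough to contain the boundary-crossing line.

```python
def lines_for_chunk(data, offset, chunk_size):
    """Return the complete lines this worker owns.

    data: bytes fetched starting at `offset`, covering chunk_size
    plus some overlap (e.g. 1 MB) so the line straddling the
    chunk boundary can be completed."""
    pos = 0
    if offset > 0:
        # Skip the partial line at the start; it belongs to the
        # previous chunk, which completes it via its overlap.
        nl = data.find(b"\n")
        if nl == -1 or nl + 1 >= chunk_size:
            return []  # no line starts inside this chunk
        pos = nl + 1
    lines = []
    # Own every line that starts before the chunk_size mark.
    while pos < min(chunk_size, len(data)):
        nl = data.find(b"\n", pos)
        if nl == -1:
            # last line of the file without a trailing newline
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines
```

Running all chunks and concatenating their results reproduces the file's lines exactly once each, with no coordination between workers.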

Or better: if possible, ask the file producer to create multiple files of a reasonable size instead of one large one.


