Werner Daehn
Oct 30, 2022

Why not parallel-process?

You would have one Lambda that reads the file size, e.g. 10 GB, splits the file into 1,000 parts of 10 MB each, and puts these 1,000 range requests into SQS.
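A minimal sketch of that splitter step. In a real deployment the size would come from `s3.head_object` and each range tuple would be sent as an SQS message via `sqs.send_message`; the function and names here are illustrative, and only the pure range arithmetic is shown.

```python
def plan_ranges(total_size, chunk_size):
    """Split a file of total_size bytes into (start, end) byte ranges,
    inclusive on both ends, as used by the S3 Range header
    ("bytes=start-end"). The last range may be shorter."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        ranges.append((start, end))
        start += chunk_size
    return ranges

# e.g. plan_ranges(10_000_000_000, 10_000_000) yields 1,000 ranges,
# one SQS message each.
```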

Each SQS consumer starts reading at its own offset, e.g. the first sets the S3 Range header to begin at byte 0, the second at 10 MB. The consumer then finds the true start of its first line: byte 0 for offset 0, the byte after the first \n for all others. It reads 11 MB, i.e. 1 MB of overlap. When it passes the 10 MB mark, it completes the current line and then stops, so the line straddling the boundary is processed exactly once.
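The ownership rule above can be sketched as follows: a worker owns every line that *starts* inside its chunk, and uses the overlap bytes to finish the line that crosses the boundary. This is a simplified sketch operating on an in-memory buffer standing in for the S3 Range GET response; it assumes the overlap is large enough to contain the boundary-crossing line.

```python
def lines_for_chunk(data, offset, chunk_size):
    """Return the complete lines this worker owns.

    data: bytes fetched starting at `offset`, covering chunk_size
    plus some overlap (e.g. 1 MB) so the line straddling the
    chunk boundary can be completed."""
    pos = 0
    if offset > 0:
        # Skip the partial line at the start; it belongs to the
        # previous chunk, which completes it via its overlap.
        nl = data.find(b"\n")
        if nl == -1 or nl + 1 >= chunk_size:
            return []  # no line starts inside this chunk
        pos = nl + 1
    lines = []
    # Own every line that starts before the chunk_size mark.
    while pos < min(chunk_size, len(data)):
        nl = data.find(b"\n", pos)
        if nl == -1:
            # last line of the file without a trailing newline
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines
```

Running all chunks and concatenating their results reproduces the file's lines exactly once each, with no coordination between workers.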

Or better: if possible, ask the file producer to create multiple files of a reasonable size instead of one large one.


