Centralised logging for AWS Lambda, REVISED (2018) – Hacker Noon

First of all, I would like to thank all of you for following and reading my content. My post on centralised logging for AWS Lambda has been viewed more than 20K times by now, so it is clearly a challenge that many of you have run into.

In the post, I outlined an approach of using a Lambda function to ship all your Lambda logs from CloudWatch Logs to a log aggregation service such as Logz.io.

In the demo project, I also included functions to:

  • auto-subscribe new log groups to the log-shipping function
  • auto-update the retention policy of new log groups to X number of days (the default is Never Expire, which has a long-term cost impact)
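
To make those two housekeeping tasks a bit more concrete, here is a minimal sketch in Python. It assumes a boto3 CloudWatch Logs client is passed in; the function name `configure_log_group`, the filter name and the 7-day default are my own illustrative choices, not code from the demo project.

```python
RETENTION_DAYS = 7  # assumed value; pick whatever suits your needs


def configure_log_group(logs_client, log_group_name, destination_arn,
                        retention_days=RETENTION_DAYS):
    """Set a retention policy on one log group and subscribe it to a destination."""
    # Replace the default "Never Expire" retention with a fixed number of days.
    logs_client.put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=retention_days,
    )
    # Forward every log event in the group to the log-shipping destination.
    logs_client.put_subscription_filter(
        logGroupName=log_group_name,
        filterName="ship-logs",   # hypothetical filter name
        filterPattern="",         # an empty pattern matches every log event
        destinationArn=destination_arn,
    )
```

With boto3 installed, you would call it as `configure_log_group(boto3.client("logs"), "/aws/lambda/my-function", function_arn)`. When the destination is a Lambda function, CloudWatch Logs also needs permission to invoke it, which this sketch does not set up.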

This approach works well when you start out. However, you can run into some serious problems at scale.

Mind the concurrency

When processing CloudWatch Logs with a Lambda function, you need to be mindful of the number of concurrent executions it creates, because CloudWatch Logs is an asynchronous event source for Lambda.

When you have 100 functions running concurrently, they will each push logs to CloudWatch Logs. This, in turn, can trigger 100 concurrent executions of the log shipping function, which can potentially double the number of functions that are concurrently running in your region. Remember, there is a soft, regional limit of 1000 concurrent executions for all functions!

This means your log shipping function can cause cascading failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs. Not a good way to go down 😉

You can set the Reserved Concurrency for the log shipping function to limit its maximum number of concurrent executions. However, you risk losing logs when the log shipping function is throttled.
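
As a rough illustration, Reserved Concurrency can be set through the Lambda API. The sketch below assumes a boto3 Lambda client is passed in; the helper name `cap_log_shipper_concurrency` and the lower bound of 1 are my own assumptions.

```python
def cap_log_shipper_concurrency(lambda_client, function_name, max_concurrent):
    """Reserve, and thereby cap, concurrency for the log shipping function."""
    # A value of 0 would throttle every invocation, so this sketch rejects it.
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrent,
    )

# Usage (needs boto3 and AWS credentials; the function name is hypothetical):
#   import boto3
#   cap_log_shipper_concurrency(boto3.client("lambda"), "ship-logs", 10)
```

Keep in mind that reserved executions are carved out of the regional pool, so whatever you reserve here is no longer available to your other functions.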

You can also request an increase to the regional limit and make it so high that you don’t have to worry about throttling.

A better approach at scale is to use Kinesis

However, I would suggest that a better approach is to stream the logs from CloudWatch Logs to a Kinesis stream first. From there, a Lambda function can process the logs and forward them on to a log aggregation service.

With this approach, you can control the concurrency of the log shipping function. As the number of log events increases, you can increase the number of shards in the Kinesis stream. This would also increase the number of concurrent executions of the log shipping function, since by default Lambda runs one concurrent execution per shard.
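
To give a feel for the sizing, here is a small sketch in Python. The per-shard write limits are assumptions based on the commonly quoted figures (about 1 MB per second and 1,000 records per second), so check the current Kinesis quotas before relying on them. Both helper names are hypothetical, and a boto3 Kinesis client is assumed to be passed in.

```python
import math

# Assumed per-shard write limits; verify against the current Kinesis quotas.
SHARD_BYTES_PER_SECOND = 1_000_000
SHARD_RECORDS_PER_SECOND = 1_000


def shards_for_throughput(bytes_per_second, records_per_second):
    """Estimate how many shards are needed to absorb a given write rate."""
    # Records here are Kinesis records (batches of log events), not single log lines.
    by_bytes = math.ceil(bytes_per_second / SHARD_BYTES_PER_SECOND)
    by_records = math.ceil(records_per_second / SHARD_RECORDS_PER_SECOND)
    return max(by_bytes, by_records, 1)  # a stream always has at least one shard


def scale_log_stream(kinesis_client, stream_name, target_shards):
    """Resize the stream, which also changes how many shippers can run at once."""
    return kinesis_client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target_shards,
        ScalingType="UNIFORM_SCALING",
    )
```

For example, a write rate of 2.5 MB per second would call for three shards under these assumptions, and therefore up to three concurrent executions of the log shipping function.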

Take a look at this repo to see how it works. It has a nearly identical setup to the demo project for the previous post:

  • a set-retention function that automatically updates the retention policy for new log groups to 7 days
  • a subscribe function that automatically subscribes new log groups to a Kinesis stream
  • a ship-logs-to-logzio function that processes the log events from the above Kinesis stream and ships them to Logz.io
  • a process_all script to subscribe all existing log groups to the same Kinesis stream
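
To show what the log shipping function has to deal with, here is a minimal Python sketch of unpacking the Kinesis records that CloudWatch Logs produces. It is not the code from the repo; the function name `extract_log_events` is my own, and forwarding the events to Logz.io is left out.

```python
import base64
import gzip
import json


def extract_log_events(kinesis_event):
    """Yield (log_group, log_stream, log_event) for each log event in a Kinesis event."""
    for record in kinesis_event["Records"]:
        # Lambda delivers Kinesis record data base64-encoded; CloudWatch Logs
        # gzip-compresses the JSON payload before writing it to the stream.
        compressed = base64.b64decode(record["kinesis"]["data"])
        payload = json.loads(gzip.decompress(compressed))

        # CONTROL_MESSAGE records are sent by CloudWatch Logs itself and carry no logs.
        if payload.get("messageType") != "DATA_MESSAGE":
            continue

        for log_event in payload["logEvents"]:
            yield payload["logGroup"], payload["logStream"], log_event
```

Each yielded `log_event` is a dictionary with `id`, `timestamp` and `message` keys, which a handler could then batch up and send to the log aggregation service.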

You should also check out this post to see how you can autoscale Kinesis streams using CloudWatch and Lambda.
