Keeping your connections alive — A Clojure – Redis story

Puns definitely intended: keeping some connections alive is essential for a healthy life, both in reality and in production.

Redis pubsub is pretty awesome, and so is Carmine, the all-Clojure Redis client (Clojure because we at Swym like our functions lispy :)). The standard Redis documentation says that pubsub connections are never-dying, meaning they can survive long idle times.

Sounds like the perfect story. And like every perfect story, it is not perfect :/ (a paradox). Redis client connections have been the topic of numerous discussions across tech stacks (issue links at the bottom of the post).

  1. How does one ensure long-lived, undying connections for their pubsub clients?
  2. How do connections get recycled without holding on to dead beats, in the client-connection sense?
  3. What is the best configuration to handle this? Surely we are not the only ones on the face of this planet to encounter it.

At Swym, a few of our production workflows use Redis pubsub to push out events, mostly to purge cached data. The cache is distributed across load-balanced services, so the events need to reach the far depths of every service to really purge all relevant data. That’s the crux: it is critical for this channel to stay healthy. Thanks to some notorious issues with our Redis service, our pubsub connections would die out for “no” reason. Ideally, we would like to be alerted when they go down so something can be done, or better yet, to keep them alive as long as possible.

Are you alive even?

According to the client, the connection is alive; according to the Redis server, the connection is kinda alive, but not really. Here is an example.

[Screenshot: client connection state, a PUBLISH, and the subscriber receiving the message]

… A few minutes later …

[Screenshot: the same connection state and a second PUBLISH, but the subscriber (nope, can’t hear you) receives nothing]

Another case is when Redis clustering kicks in and a secondary shard takes over, rendering the previous connection invalid, or when the Redis instance decides to reboot.

So we had two main issues:

  1. Connection failures are not surfaced as an “error”. In most cases these are system failures and should be reported to devops/tooling so something can be done about them.
  2. Connections should not fail because of idleness. They should survive periods of radio silence, reducing the number of stale connections (see 1).

One potential way to manage TCP timeouts is to set tcp-keepalive on the server to some reasonable value instead of the default (300 in my case). But as you can see in the screenshots, the connection “existed” at 322, so maybe my client’s TCP stack needs to be configured as well. This seems like a configuration issue. Or is it?
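For reference, this is the server-side knob in redis.conf; 300 is the Redis default (since 3.2.1), quoted here for illustration rather than as a recommendation:

```
# redis.conf: have the kernel send TCP keepalive probes to idle
# client connections every N seconds, dropping peers that never answer
tcp-keepalive 300
```

It can also be changed at runtime with `CONFIG SET tcp-keepalive 300` from redis-cli, without restarting the server.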

Using the source

Carmine is a great library with all the things one can think of, and shoutout to Peter Taoussanis for putting such awesome pieces of Clojure out there. I started with a reading of the source, as Master Yoda would counsel (may his memory last longer than these Redis connections :/). I was happy to see a few outstanding issues and code TODOs in the source.

The issue has been open for years, so naturally this presented an opportunity to roll up my sleeves and figure out a way. Redis (v3.2+) has supported “PING” on pubsub connections for a while now, which means connection idle time can be managed with a series of heartbeat pings.

The way Carmine’s with-new-pubsub-listener works is:

  1. Start a long-running thread via a future-call that polls on the connection’s socket stream.
  2. When a reply is received, pass it back to the message handlers.
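The shape of those two steps can be sketched in a few lines. This is not Carmine’s actual source: as a stand-in for the Redis socket I use a LinkedBlockingQueue, where a take either yields a “reply” or throws if a Throwable was queued, simulating a socket read error.

```clojure
(import '[java.util.concurrent LinkedBlockingQueue])

;; Stand-in for the socket read: blocks until a "reply" arrives,
;; throwing if the queued value is a Throwable (a simulated read error).
(defn read-reply! [^LinkedBlockingQueue conn]
  (let [v (.take conn)]
    (if (instance? Throwable v) (throw v) v)))

;; The listener loop: a long-running future that polls the "socket"
;; and hands each reply to the message handler.
(defn start-listener! [conn handler]
  (future-call
    (fn []
      (loop []
        (handler (read-reply! conn))
        (recur)))))

;; Usage: one good message, then a simulated socket failure.
(def conn (LinkedBlockingQueue.))
(def seen (atom []))
(def f (start-listener! conn #(swap! seen conj %)))
(.put conn "hello")
(Thread/sleep 100)
(.put conn (java.io.IOException. "socket closed"))
(Thread/sleep 100)
@seen          ;; => ["hello"]
(realized? f)  ;; => true — the loop died, and nobody was told
```

Note how the exception ends the future quietly; that silence is exactly the problem described next.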

There is one big issue with this: a Clojure future catches any error within its thread of execution and does not notify anyone unless another thread tries to realize the future. In this case, the future never finishes evaluating unless there is a socket read/timeout error. Detecting that requires a timely poll of the future’s evaluation state, like so:

(require '[taoensso.carmine :as car])
;(def conn-spec {:host "localhost" :port 6379}) ;; replace with your spec
(def l (car/with-new-pubsub-listener conn-spec
         {"ps-foo" #(println %)}
         (car/subscribe "ps-foo")))
(realized? (:future l))
;; => false
(def l-status
  (future (while (not (realized? (:future l)))
            (println "realized" (:future l)))))
;; realized #object[clojure.core$future_call$reify__6962 0x154a8d5 {:status :pending, :val nil}]
;; ...... A lot of logs later ......
;; realized #object[clojure.core$future_call$reify__6962 0x154a8d5 {:status :failed, :val #error {
;;   :cause nil
;;   :via
;;   [{:type java.util.concurrent.ExecutionException
;;     :message
;;     :at [java.util.concurrent.FutureTask report 122]}
;;    {:type
;;     :message nil
;;     :at [ readByte 267]}]
;;   :trace
;;   [... <> ...]}}]

That doesn’t sound like the best use of CPU cycles. So, first thing: add a way to invoke an error-handler callback from within the future-call. This lets end clients be notified of connection failures and retry based on the exception.
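A minimal sketch of that error-callback idea, assuming nothing about the library’s internals (future-with-error-handler and on-error are hypothetical names): wrap the listener body in try/catch and hand any throwable to a user-supplied callback instead of letting the future swallow it.

```clojure
;; Hypothetical wrapper: runs body-fn in a future, but routes any
;; throwable to on-error instead of silently failing the future.
(defn future-with-error-handler
  [on-error body-fn]
  (future-call
    (fn []
      (try
        (body-fn)
        (catch Throwable t
          (on-error t))))))

;; Usage: the body throws to simulate a dead socket; the callback
;; records the failure so another component can alert or reconnect.
(def last-error (atom nil))
@(future-with-error-handler
   #(reset! last-error (.getMessage ^Throwable %))
   #(throw (java.io.IOException. "connection reset")))
@last-error
;; => "connection reset"
```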

Next up: add a future-call that sends a PING if no messages are received within a specified timeout (ping-ms in the conn-spec, defaulting to 30 seconds). That provides timely heartbeats whenever the connection goes idle.
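The heartbeat can be sketched as a watchdog loop. Again, this is a toy illustration rather than the library’s actual code: a real implementation would send PING over the pubsub connection, while here ping-fn is an arbitrary callback and activity is tracked via an atom of the last-message timestamp.

```clojure
;; Toy watchdog: if nothing has arrived within ping-ms milliseconds,
;; fire ping-fn, then count the ping itself as activity so idle
;; periods produce one ping per interval rather than a flood.
(defn start-heartbeat!
  [last-msg-at ping-ms ping-fn running?]
  (future-call
    (fn []
      (while @running?
        (Thread/sleep 20)
        (when (> (- (System/currentTimeMillis) @last-msg-at) ping-ms)
          (ping-fn)
          (reset! last-msg-at (System/currentTimeMillis)))))))

;; Usage with a very short timeout and a counter instead of a real PING:
(def last-msg-at (atom (System/currentTimeMillis)))
(def pings (atom 0))
(def running? (atom true))
(start-heartbeat! last-msg-at 100 #(swap! pings inc) running?)
(Thread/sleep 400)
(reset! running? false)
@pings ;; idle the whole time, so a few heartbeats fired
```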

With those two in place (using my new library :)), here is how it looks.

(require '[taoensso.carmine :as car])
(require '[redis-pubsub.core :as pubsub])
(def keepalive-l
  (pubsub/with-new-keepalive-pubsub-listener conn-spec
    {"ps-foo"               #(println %)
     "pubsub:ping"          #(println "ping" %)
     "pubsub:listener:fail" #(println "listener failed - add try again" %)}
    (car/subscribe "ps-foo")))
;; prints ping [pong pubsub:ping] after 30 seconds of idleness
;; When the listener fails, a callback fires immediately:
;; listener failed - add try again [pubsub:error pubsub:listener:fail #error {
;;   :cause nil
;;   :via
;;   [{:type
;;     :message nil
;;     :at [ readByte 267]}]
;;   :trace
;;   [... <> ...]}]

After running the test cases successfully, it was time to share the goodness. There is a pull request open on the main library project, but it could take a while to land, since I could have broken other, deeper test cases (I hope not). So here it is as an add-on library (my first on Clojars, yay!) that can be included alongside Carmine to give your connections the heartbeat they need.

Thanks for reading! Do try it out, and let me know how you have handled this in your Clojure services.




You have come this far :), so maybe you’d like my other Clojure posts too. Just putting it out there –
