The Problem
When you run multiple microservices in a production banking environment, you quickly discover that every team solves the same cross-cutting problems independently. At HDFC Bank, the pattern was clear: unstandardized retry logic, inconsistently configured circuit breakers, and no common approach to cold-start latency. The divergence caused real instability — bugs fixed in one service would reappear in others, and services would fail in unpredictable ways under load.
The answer was a Resilience & Utility Library — a Spring Boot starter that every service pulls in as a single dependency.
Design Goals
Before writing a line of code, we agreed on three constraints:
- Zero forced transitive conflicts. Services manage their own dependency versions. The library must not impose them.
- Opt-in everything. Each feature is behind a @ConditionalOnProperty. If you don't configure it, it doesn't load.
- Observable by default. Every primitive automatically exposes Prometheus metrics via Micrometer.
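For illustration, here is what opting in looks like from a consuming service's application.yml. Only resilience.retry.enabled appears verbatim later in this post; the circuit-breaker flag is a hypothetical example of the same pattern:

```yaml
resilience:
  retry:
    enabled: true        # retry interceptor beans load into the context
  circuit-breaker:
    enabled: false       # hypothetical flag: breaker beans stay out entirely
```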
Configurable Retry with Exponential Backoff
The retry engine is the most-used primitive. Services configure it declaratively:
resilience:
  retry:
    enabled: true
    max-attempts: 3
    initial-interval-ms: 200
    multiplier: 2.0
    max-interval-ms: 5000
    retryable-exceptions:
      - java.net.ConnectException
      - org.springframework.dao.TransientDataAccessException
The AOP interceptor wraps any @Retryable-annotated method with this policy. The key design decision: exponential backoff with jitter. Without jitter, all retrying services hit the downstream at the same intervals and create synchronized retry storms.
long delay = Math.min(
    (long) (initialInterval * Math.pow(multiplier, attempt)),
    maxInterval
);
// Add ±20% jitter to desynchronize retries
long jitter = (long) (delay * 0.2 * (2 * Math.random() - 1));
Thread.sleep(delay + jitter);
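Packaged as a standalone helper, the same computation looks like this (a plain-Java sketch; BackoffPolicy and nextDelayMs are illustrative names, not the library's API):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the backoff-with-jitter computation; names are illustrative.
class BackoffPolicy {
    private final long initialIntervalMs;
    private final double multiplier;
    private final long maxIntervalMs;

    BackoffPolicy(long initialIntervalMs, double multiplier, long maxIntervalMs) {
        this.initialIntervalMs = initialIntervalMs;
        this.multiplier = multiplier;
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Delay before retry attempt {@code attempt} (0-based), with ±20% jitter. */
    long nextDelayMs(int attempt) {
        long base = Math.min(
                (long) (initialIntervalMs * Math.pow(multiplier, attempt)),
                maxIntervalMs);
        // Uniform jitter in [-20%, +20%] of the capped base delay.
        double jitter = 0.2 * (2 * ThreadLocalRandom.current().nextDouble() - 1);
        return base + (long) (base * jitter);
    }
}
```

With the YAML values above (200 ms initial, 2.0 multiplier, 5000 ms cap), successive base delays are 200, 400, 800, 1600, 3200, then 5000 ms, each perturbed by up to 20%.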
Circuit Breakers
We wrapped Resilience4j's CircuitBreaker with a minimum-calls guard that prevents false trips on low-traffic paths:
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.slowCallRateThreshold(70)
.slowCallDurationThreshold(Duration.ofSeconds(2))
.minimumNumberOfCalls(20) // don't open on 1-of-2 failures at 2 AM
.slidingWindowType(SlidingWindowType.TIME_BASED)
.slidingWindowSize(60)
.build();
The minimumNumberOfCalls(20) is the critical guard. Low-traffic services were tripping breakers on statistical noise — two failures out of four calls is not a real outage.
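To see why the guard matters, here is the decision rule as a toy failure-rate check in plain Java (illustrative only, not Resilience4j's internals):

```java
// Toy illustration of the minimum-calls guard described above.
// Not Resilience4j's implementation, just the decision rule.
class BreakerGuard {
    static boolean shouldOpen(int calls, int failures,
                              int minimumCalls, double failureRateThreshold) {
        if (calls < minimumCalls) {
            return false; // too few samples to distinguish noise from an outage
        }
        double failureRatePercent = 100.0 * failures / calls;
        return failureRatePercent >= failureRateThreshold;
    }
}
```

Two failures out of four calls is a 50% failure rate, yet the breaker stays closed because the sample is below the minimum; the same rate over a real sample opens it.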
Automated Database Warmup
Cold-start latency was a recurring complaint. When a pod restarts, the first few requests see high latency while HikariCP lazily establishes connections to PostgreSQL and YugabyteDB. We solved this with a SmartLifecycle bean that runs at startup:
@Component
public class DatabaseWarmup implements SmartLifecycle {

    private static final Logger log = LoggerFactory.getLogger(DatabaseWarmup.class);
    private final DataSource dataSource;
    private final HikariConfig poolConfig; // supplies the pool's minimum-idle size
    private volatile boolean running;

    public DatabaseWarmup(DataSource dataSource, HikariConfig poolConfig) {
        this.dataSource = dataSource;
        this.poolConfig = poolConfig;
    }

    @Override
    public void start() {
        log.info("Warming up database connection pool...");
        IntStream.range(0, poolConfig.getMinimumIdle())
            .parallel()
            .forEach(i -> {
                try (Connection c = dataSource.getConnection()) {
                    c.createStatement().execute("SELECT 1");
                } catch (SQLException e) {
                    log.warn("Warmup connection {} failed: {}", i, e.getMessage());
                }
            });
        log.info("Database warmup complete — pool is hot.");
        running = true;
    }

    @Override
    public void stop() { running = false; }

    @Override
    public boolean isRunning() { return running; }
}
This runs before Kubernetes marks the pod as Ready, so the readiness probe only passes once the pool is warm. After rollout, first-request latency on freshly restarted pods matched steady-state latency in production.
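Because SmartLifecycle.start() blocks context startup, no custom probe logic is needed; a standard Spring Boot readiness probe suffices. A hedged sketch (port, path, and timings are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # Spring Boot's readiness health group
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```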
Standardized Observability Hooks
Every feature auto-registers Micrometer metrics. No service needs to instrument itself:
- resilience.retry.attempts (counter, tagged by service + method)
- resilience.circuit-breaker.state (gauge: 0=closed, 1=open, 2=half-open)
- db.warmup.duration (timer)
These feed directly into the Prometheus/Grafana dashboards we configured, giving the platform team visibility across all services without any per-service instrumentation effort.
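As a concrete example of the gauge encoding, the state-to-number mapping is trivial to express (the State enum here is a local stand-in, not Resilience4j's type):

```java
// Encodes circuit-breaker state as the gauge values listed above.
class StateGauge {
    enum State { CLOSED, OPEN, HALF_OPEN }

    static int value(State state) {
        switch (state) {
            case CLOSED:    return 0;
            case OPEN:      return 1;
            case HALF_OPEN: return 2;
            default: throw new IllegalArgumentException("unknown state: " + state);
        }
    }
}
```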
Results
After rollout:
- Cold-start latency eliminated — the database warmup bean removed the first-request latency spike after pod restarts.
- Service stability improved — standardized retry + backoff removed thundering-herd retry storms between dependent services.
- Operational overhead reduced — one library to update instead of patching retry logic in N services.
The library is under 6,000 lines of code with high test coverage. The biggest lesson: shared infrastructure code must be more conservative than application code. When it breaks, everything breaks.