How to size Java memory in Kubernetes (MaxRAMPercentage and OOMKill)

Tue, 09 Jun 2026 19:30:00 GMT

This post came out of a real problem. A fleet of Java applications running on Kubernetes had standardized JVMs on -XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75 and sized memory requests and limits by looking at container.memory.usage and working_set in observability. On paper it looked right: the JVM takes 75% of the limit, 25% is left for everything else, and you track RSS to tune.

In practice, two symptoms showed up:

RSS for every application sat pinned at ~75% of the limit — including dev environments that barely saw traffic. There was no way to tell who was actually wasting memory.
When we tried to cut limits based on that RSS, several applications started getting OOMKilled (exit 137).

This is one of those cases where the metric you’re looking at is lying: not because it’s wrong, but because you’re interpreting it wrong. I’ll document what I found while investigating data from real environments, and at the end I’ll walk through a reproducible PoC (Docker + Java 21) that I built to back each claim with measured numbers. All PoC code lives in the public repo LucasBG0/poc-jvm-memory-containers and runs with ./run.sh.

The “production” numbers throughout the text come from a real Java app fleet on K8s, anonymized. The “PoC” numbers were measured on my machine and you can reproduce them.

The basics everyone gets wrong

Before anything else, you need to separate four sets of flags that look like they do the same thing and don’t.

`-Xms` and `-Xmx` (absolute)

These are the initial (-Xms) and maximum (-Xmx) heap sizes, in absolute values (-Xmx512m). The classic container problem: for a long time, the JVM did not see the cgroup and calculated these values from the host’s full RAM. You set a 512 MiB pod limit and the JVM thought it had 64 GiB to play with → OOMKill on the first load. That was fixed (JDK 8u191+ and 10+ are container-aware), but fixed -Xms/-Xmx is still manually coupled: if someone changes the container limit and forgets to change -Xmx, the two drift apart.

`-XX:MaxRAMPercentage` (the ceiling, cgroup-aware)

Sets the maximum heap as a percentage of available memory (which, in a container, is the cgroup limit). This is the modern, recommended approach: you change the pod limit and the heap follows. MaxRAMPercentage=75 in a 768 MiB container → 576 MiB max heap.
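A quick way to check what the flag actually produced is to ask the running JVM. The sketch below is illustrative only: the class name MaxHeapCheck and its helper methods are hypothetical and not part of the PoC. It computes the expected ceiling from a limit and a percentage, then prints what Runtime.maxMemory() reports. The JVM aligns heap sizes internally, so the reported value can differ slightly from the plain arithmetic.

```java
final class MaxHeapCheck {
    private static final long MIB = 1024L * 1024L;

    /** Expected heap ceiling in MiB for a container limit and a MaxRAMPercentage value. */
    static long expectedMaxHeapMiB(long limitMiB, double maxRamPercentage) {
        return (long) (limitMiB * maxRamPercentage / 100.0);
    }

    /** Heap ceiling the running JVM actually settled on, in MiB. */
    static long effectiveMaxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    public static void main(String[] args) {
        // 768 MiB and 75% are the values used in the text above.
        System.out.println("expected (768 MiB, 75%): " + expectedMaxHeapMiB(768, 75.0) + " MiB");
        System.out.println("effective max heap     : " + effectiveMaxHeapMiB() + " MiB");
    }
}
```

Run it with -XX:MaxRAMPercentage=75 inside a container started with docker run --memory=768m and compare the two lines.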

`-XX:InitialRAMPercentage` (the initial size, cgroup-aware)

Same idea, but for the initial heap size (the percentage equivalent of -Xms). This is where much of the confusion in this post lives: setting a high InitialRAMPercentage does not mean your app needs that much — it means the JVM will commit that much at boot.

`-XX:MinRAMPercentage` (the gotcha)

This is the JDK’s most misleading name. MinRAMPercentage does not define a heap floor, despite the name. It only kicks in when available memory is small (below ~256 MiB by default) and, in that case, sets the max heap to that percentage. For any container with “normal” memory, MinRAMPercentage is simply ignored and MaxRAMPercentage wins. I’ll prove that with the PoC later.

The PoC: what it measures

The PoC is a single Java program (MemReport.java) running in Java 21 “source mode”, inside an eclipse-temurin:21-jdk image, with a fixed container limit via docker run --memory. It:

retains a controlled live set (30 MiB of byte[] that survive GC) — representing the memory the app actually needs;
generates churn (short-lived garbage) to fill eden;
takes two snapshots: BOOT (right after startup, before allocating) and POST (after retaining the live set + churn);
forces a System.gc() at the end and measures the post-GC live set (old + survivor);
reads /sys/fs/cgroup/memory.current and memory.max from inside the container to report real RSS and limit.
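For readers who want the gist without opening the repo, here is a minimal sketch of how such a snapshot could be taken. It is not the PoC’s MemReport.java: the class MemSnapshot and its method names are hypothetical, and it assumes cgroup v2 paths. When the cgroup files are missing (for example outside a container) it prints n/a instead of failing.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.nio.file.Files;
import java.nio.file.Path;

final class MemSnapshot {
    private static final long MIB = 1024L * 1024L;

    /** Parses a cgroup v2 memory value; "max" (no limit) or unparsable input maps to -1. */
    static long parseCgroupBytes(String raw) {
        String value = raw.trim();
        if (value.equals("max")) {
            return -1L;
        }
        try {
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    /** Reads a file under /sys/fs/cgroup; returns -1 when it is missing or unreadable. */
    static long readCgroupBytes(String fileName) {
        try {
            return parseCgroupBytes(Files.readString(Path.of("/sys/fs/cgroup", fileName)));
        } catch (IOException | SecurityException e) {
            return -1L;
        }
    }

    static String toMiB(long bytes) {
        return bytes < 0 ? "n/a" : (bytes / MIB) + " MiB";
    }

    /** Formats one labelled snapshot of cgroup and heap numbers. */
    static String snapshot(String label) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        String nl = System.lineSeparator();
        return "### " + label + nl
            + "  container limit (cgroup) : " + toMiB(readCgroupBytes("memory.max")) + nl
            + "  container RSS   (cgroup) : " + toMiB(readCgroupBytes("memory.current")) + nl
            + "  heap max (effective Xmx) : " + toMiB(heap.getMax()) + nl
            + "  heap used                : " + toMiB(heap.getUsed()) + nl
            + "  heap committed           : " + toMiB(heap.getCommitted());
    }

    public static void main(String[] args) {
        System.out.println(snapshot("BOOT (before allocating)"));
    }
}
```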

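I ran eight scenarios: seven that run to completion with the same 30 MiB of retained arrays, varying only the flags and the container limit, plus one that is deliberately OOMKilled. Here is the full table (values in MiB, except the threads column, which is a count):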

scenario	limit	heap max (Xmx)	committed BOOT	RSS BOOT	committed POST	RSS POST	heap used POST	non-heap committed	threads	live set (post-GC)
`default`	768	192	27	85	192	175	82	21	6	64
`init-max-75`	768	576	576	104	576	239	212	21	6	64
`init-max-75-pretouch`	768	576	576	645	576	654	212	21	6	64
`low-init-max-75`	768	576	118	108	576	392	187	21	6	64
`xmx-xms`	768	512	512	114	512	231	160	21	6	64
`minram-small`	200	100	25	67	100	122	70	21	6	64
`minram-large`	768	576	21	92	576	379	188	21	6	64
`oom-xmx-acima-do-limit`	600	—	—	—	—	exit 137	—	—	—	—

I’ll unpack each lesson using this table.

Pitfall #1: `InitialRAMPercentage` commits, but that’s not “usage”

Look at the committed BOOT column (heap committed at boot, before any allocation):

default (default Initial ~1.5%): 27 MiB
low-init-max-75 (Initial=15%): 118 MiB
init-max-75 (Initial=75%): 576 MiB
xmx-xms (-Xms512m): 512 MiB

In other words: InitialRAMPercentage/-Xms controls how much heap the JVM reserves and commits at startup, regardless of what the app needs. With Initial=75, the JVM commits 576 MiB of heap before the app does anything useful.

But — and this “but” is the heart of the problem — committing is not touching. Look at the RSS BOOT column: even after committing 576 MiB of heap, init-max-75 boots with RSS of only 104 MiB, practically the same as default (85 MiB). The kernel only counts in RSS the pages that were actually accessed (page fault). Committed but untouched heap is reserved address space, not physical memory.

Here is the boot snapshot from the init-max-75 scenario, straight from the PoC log:

### BOOT (before allocating)
  container limit (cgroup) : 768 MiB
  container RSS   (cgroup) : 120 MiB
  heap max (effective Xmx) : 576 MiB
  heap used                : 10 MiB
  heap committed           : 576 MiB   <-- committed 75% of limit

So why did production RSS stay pinned at 75%?

Two reasons, and the PoC shows both.

(a) AlwaysPreTouch. If the JVM starts with -XX:+AlwaysPreTouch (common in setups that pin the heap for predictable latency), it touches every committed page at boot. See the init-max-75-pretouch scenario: RSS BOOT jumps from 104 to 645 MiB. Now RSS reflects committed, not demand.

(b) Fixed heap + real load. With Initial = Max, the heap never shrinks, and as the app runs (allocation, GC evacuation), pages get touched until RSS hits committed and stays there. In production, with continuous traffic for days, that’s exactly what happens: RSS saturates at ~75% and stays.

The outcome is the same either way: working_set/container.memory.usage stop reflecting real demand and start marking ~75% of the limit for everyone. That’s why dev, with no load, showed the same RSS as production. Sizing request from that number means sizing from your own MaxRAMPercentage, not from what the app needs.

Pitfall #2: “real usage” is the post-GC live set, and it’s invariant

If RSS lies, what number doesn’t? The live set: what remains on the heap after a GC — the objects the app actually holds.

The JVM splits the heap into generations. The identity is exact:

heap_used = eden + survivor + old

eden: where new objects are born. It’s churn — short-lived garbage the GC sweeps. It grows and shrinks with available heap.
survivor + old: what survived GC. That’s the live set — memory the app actually retains.
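The split above can be read directly from the JVM’s memory pools. The following is a minimal sketch, assuming a generational collector whose eden pool name contains "Eden" (true for G1’s "G1 Eden Space"); with single-pool collectors the result degrades to total heap used. The class LiveSet is hypothetical and is not the PoC’s code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.Locale;

final class LiveSet {
    private static final long MIB = 1024L * 1024L;

    /** Heap pools other than eden hold objects that survived at least one collection. */
    static boolean isLiveSetPool(String poolName) {
        return !poolName.toLowerCase(Locale.ROOT).contains("eden");
    }

    /** Sums used bytes in the non-eden heap pools; call right after a GC to get the trough. */
    static long liveSetBytes() {
        long total = 0L;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && isLiveSetPool(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    total += usage.getUsed();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Retain 30 arrays of 1 MiB, mirroring the retained set described for the PoC.
        byte[][] retained = new byte[30][];
        for (int i = 0; i < retained.length; i++) {
            retained[i] = new byte[1024 * 1024];
        }
        System.gc();
        System.out.println("post-GC live set: " + liveSetBytes() / MIB + " MiB");
        // Referencing the array here keeps it reachable across the GC above.
        System.out.println("retained arrays : " + retained.length);
    }
}
```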

The proof is in the PoC. In all seven scenarios that ran to completion, the post-GC live set was exactly 64 MiB (last column), because it’s always the same app retaining the same blocks. (I retained 30 arrays of 1 MiB, but each one spills into a G1 region and rounds to two → ~60 MiB of humongous objects + retained classes ≈ 64 MiB; the detail doesn’t matter, what matters is that it’s constant.) Heap configuration doesn’t change what the app needs — only how much empty space surrounds it.

Now look at the perverse side effect in the heap used POST column (peak heap used during churn):

default (192 MiB heap): peak of 82 MiB
init-max-75 (576 MiB heap): peak of 212 MiB

Same app, same 64 MiB live set, but the heap_used peak is about 2.6× larger (212 vs 82 MiB) just because the heap is bigger. Why? Bigger heap → GC runs less often → more floating garbage (eden + dead objects not yet collected) accumulates between collections. That’s another reason looking at the peak of heap_used (or RSS, which follows touched heap) overestimates real need. The honest number is the post-GC trough: old + survivor.

The metrics that survive the distortion are:

jvm.gc.old_gen_size + jvm.gc.survivor_size → live set (retained heap);
jvm.non_heap_memory (Metaspace, Code Cache, Compressed Class) → off-heap, grows on demand;
jvm.buffer_pool.direct.used (DirectByteBuffer) and jvm.thread_count (≈ 1 MiB per thread stack) → native memory, off-heap, but counted in RSS.

And the ones you should stop using for sizing while Initial is high: container.memory.usage, working_set, and jvm.heap_memory_committed — all inflated.
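The metric names above come from the observability agent. Inside the JVM, the same kind of data is exposed through the standard java.lang.management MXBeans. The sketch below shows where each off-heap number can be read; the class OffHeapReport is hypothetical, and whether a given agent derives its metrics from exactly these beans is an assumption.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

final class OffHeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Committed non-heap memory (Metaspace, Code Cache, Compressed Class Space), in bytes. */
    static long nonHeapCommittedBytes() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getCommitted();
    }

    /** Memory used by direct ByteBuffers, in bytes; 0 if the "direct" pool is not found. */
    static long directBufferUsedBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return 0L;
    }

    /** Live thread count; each thread carries a native stack (about 1 MiB by default). */
    static int threadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("non-heap committed : " + nonHeapCommittedBytes() / MIB + " MiB");
        System.out.println("direct buffers used: " + directBufferUsedBytes() / MIB + " MiB");
        System.out.println("threads            : " + threadCount());
    }
}
```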

Pitfall #3: the one that hurts — OOMKill 137

This is the part that cost us. With “real usage” in hand, the first attempt was the intuitive formula:

request = limit = real_live_usage × 1.2

Take old + survivor + non_heap_used + direct + threads, multiply by 1.2 for headroom, and cut the limit. Looked great on the dashboard: ~50% savings.

Result in dev/qa: a wave of applications getting OOMKilled (exit 137).

The root cause had two parts, both ignored by the naive formula:

non_heap_committed, not non_heap_used. Metaspace and Code Cache reserve (commit) blocks slightly above what they use and almost never give them back. It’s committed that counts in RSS and triggers OOM, not used. The gap between the two is small, but it’s worth using non_heap_committed out of conservatism.
Invisible native overhead. GC and JIT internal structures, page cache, and especially APM/monitoring agents (Datadog Agent, New Relic, AppDynamics, Elastic APM…): none of this shows up in jvm.* metrics, but all of it counts against the container’s memory. Reconciling against real RSS in that fleet, this native residue was 66–254 MiB (average ~130 MiB). A constant of 150 MiB was the value that worked for that set of services; the right number for your fleet depends on what runs inside the container. How to calibrate it is described in the formula section below.

Add the two together and you can see why limits cut with the “live × 1.2” formula landed below the JVM’s physical floor and died.

The PoC reproduces exit 137 deterministically in the oom-xmx-acima-do-limit scenario: a 600 MiB container with -Xms700m -Xmx700m -XX:+AlwaysPreTouch. The configured heap (700 MiB) doesn’t fit in the container (600 MiB), and AlwaysPreTouch tries to touch everything at boot:

>> scenario: oom-xmx-acima-do-limit  (--memory=600m)  flags: -Xms700m -Xmx700m -XX:+AlwaysPreTouch
   [!] container exited with code 137 (137 = OOMKill)

Same mechanism, more explicit: once the memory the process actually touches (heap up to heap_max, plus non_heap_committed, plus native) exceeds the limit, the kernel kills the process. In production it happened quietly because nobody was adding non-heap and native to the math.

The formula that survived

After the 137s, the sizing formula became this (with a memory request = limit policy, which is what Guaranteed QoS requires; the JVM tends to grow to the ceiling, so request < limit doesn’t help):

limit = live_heap_peak / occ
      + non_heap_committed
      + direct.used
      + thread_count × 1 MiB
      + N MiB   (native residue: calibrated per fleet, see below)

The terms, mapped to metrics that survive the distortion:

term	metric	role
`live_heap_peak`	`max(old_gen_size + survivor_size)`	heap the app retains
`occ`	—	target heap occupancy (live as fraction of heap)
`non_heap_committed`	`max(non_heap_memory_committed)`	Metaspace + Code Cache (reserved)
`direct.used`	`max(buffer_pool.direct.used)`	DirectByteBuffer (native)
`thread_count × 1 MiB`	`max(thread_count)`	thread stacks (native)
`N MiB`	calibrated constant	native overhead with no direct metric
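As a worked example of the formula, here is a small sketch that adds the terms up. The class LimitFormula and its parameter names are hypothetical, and the input values in main are illustrative, not measurements from the fleet or the PoC. Occupancy is passed as an integer percentage so the arithmetic stays exact, and the heap term is rounded up.

```java
final class LimitFormula {
    /** Integer division rounded up, for positive operands. */
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    /**
     * Applies: limit = live heap peak / occ + non-heap committed + direct used
     *                  + thread count × 1 MiB + native residue N.
     * All sizes are in MiB; occPercent is the target occupancy as a percentage (e.g. 60).
     */
    static long recommendedLimitMiB(long liveHeapPeakMiB, int occPercent,
                                    long nonHeapCommittedMiB, long directUsedMiB,
                                    int threadCount, long nativeResidueMiB) {
        if (occPercent <= 0 || occPercent > 100) {
            throw new IllegalArgumentException("occPercent must be in 1..100");
        }
        long heapMiB = ceilDiv(liveHeapPeakMiB * 100L, occPercent);
        long threadStacksMiB = threadCount; // 1 MiB per thread stack
        return heapMiB + nonHeapCommittedMiB + directUsedMiB + threadStacksMiB + nativeResidueMiB;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 120 MiB live heap, occ = 60%, 150 MiB non-heap,
        // 10 MiB direct buffers, 40 threads, N = 150 MiB.
        long limit = recommendedLimitMiB(120, 60, 150, 10, 40, 150);
        System.out.println("recommended request = limit: " + limit + " MiB");
    }
}
```

With those inputs the heap term is 200 MiB and the total comes to 550 MiB.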

How to calibrate native residue (`N`)

N has no dedicated JVM metric because it lives outside managed heap and non-heap. In practice, it’s the gap between measured RSS and everything you can sum directly:

N  ≈  RSS_stable  −  heap_committed  −  non_heap_committed  −  direct.used  −  (threads × 1 MiB)

Stable RSS is container.memory.usage (or cgroup memory.current) read when the app is warmed up and under representative load, but without high InitialRAMPercentage or AlwaysPreTouch. With those settings, heap_committed and RSS stop describing the same pages (committed heap may still be untouched, or touched without being needed), and the calculated N is skewed. Use a low InitialRAMPercentage config (e.g. 25%) for this measurement.
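The subtraction is simple enough to script. Below is a minimal sketch, with a hypothetical class NativeResidue and illustrative input values that are not measurements from the fleet. A negative result is treated as a sign that the measurement conditions were not met, since committed heap that was never touched does not appear in RSS.

```java
final class NativeResidue {
    /**
     * N ≈ RSS_stable − heap_committed − non_heap_committed − direct.used − threads × 1 MiB.
     * All sizes are in MiB.
     */
    static long estimateMiB(long rssStableMiB, long heapCommittedMiB,
                            long nonHeapCommittedMiB, long directUsedMiB, int threadCount) {
        long threadStacksMiB = threadCount; // 1 MiB per thread stack
        return rssStableMiB - heapCommittedMiB - nonHeapCommittedMiB - directUsedMiB - threadStacksMiB;
    }

    /** A negative estimate means RSS was lower than the sum of the tracked terms. */
    static boolean looksValid(long estimateMiB) {
        return estimateMiB >= 0;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 700 MiB RSS, 400 MiB heap committed,
        // 120 MiB non-heap, 10 MiB direct buffers, 40 threads.
        long n = estimateMiB(700, 400, 120, 10, 40);
        System.out.println("native residue N ~ " + n + " MiB, valid: " + looksValid(n));
    }
}
```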

The main components of this residue:

APM agent (Datadog Agent, New Relic, AppDynamics, Elastic APM…): the Java agent attaches as a -javaagent and allocates its own native memory — 30 to 100+ MiB depending on the agent and instrumentation level.
Metrics exporter (Prometheus JMX Exporter, Micrometer…): smaller impact, but not zero.
Native thread stacks beyond the nominal 1 MiB: each thread’s real stack (default -Xss1m on Linux) plus associated kernel structures.
GC internal overhead: G1GC maintains card tables, remembered sets, and marking bitmaps that scale with heap size (typically 1–5% of max heap).
Kernel page cache and I/O buffers: mmap’d files, network buffers. The cgroup charges them to the container, so they count toward memory.current even though they are not JVM allocations.

In the fleet that originated this post, N = 150 MiB covered most services well (actual range: 66–254 MiB). If you run a heavy APM agent or have many threads, measure and adjust; 150 MiB is a starting point, not a universal constant. The worst case is underestimating: you’ll see OOMKill. The second worst is overestimating a lot: you waste memory but the app survives.

With JVM metrics collected, you can build a dashboard that applies this formula automatically and shows the recommended request/limit per service.

Why divide by `occ`?

The heap must be larger than the live set to fit eden allocation between GCs, G1 evacuation working space, and spikes. Rule of thumb: live set shouldn’t exceed ~70% of heap, or GC enters thrashing (chained full GCs → high CPU → OOM from GC overhead limit).

occ = 0.70 (aggressive): heap = live × 1.43. Saves more memory, more frequent GC.
occ = 0.60 (recommended): heap = live × 1.67. More headroom, less GC, a bit more memory.

The central trade-off is memory ↔ CPU/safety. We settled on 0.60 as the default.
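To make the two options concrete, this sketch computes the multiplier and the resulting heap for the 64 MiB live set measured in the PoC. The class Occupancy is hypothetical; rounding the heap up to the next MiB is a choice made here so the occupancy target is never exceeded.

```java
import java.util.Locale;

final class Occupancy {
    /** Heap size as a multiple of the live set, for a target occupancy in (0, 1]. */
    static double heapMultiplier(double occ) {
        if (occ <= 0.0 || occ > 1.0) {
            throw new IllegalArgumentException("occ must be in (0, 1]");
        }
        return 1.0 / occ;
    }

    /** Target heap in MiB, rounded up. */
    static long targetHeapMiB(long liveSetMiB, double occ) {
        return (long) Math.ceil(liveSetMiB * heapMultiplier(occ));
    }

    public static void main(String[] args) {
        long liveSetMiB = 64; // post-GC live set from the PoC table
        for (double occ : new double[] {0.70, 0.60}) {
            System.out.println(String.format(Locale.ROOT,
                "occ=%.2f  multiplier=%.2f  heap=%d MiB",
                occ, heapMultiplier(occ), targetHeapMiB(liveSetMiB, occ)));
        }
    }
}
```

For that live set, the targets come out at 92 MiB (occ = 0.70) and 107 MiB (occ = 0.60).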

Two equivalent ways to apply it

The target heap (live/occ) is the same; only how you express it changes:

(A) Percentage:

-XX:InitialRAMPercentage=<P> -XX:MaxRAMPercentage=<P>
where P = (live/occ) / limit × 100

The heap follows the limit and never exceeds it — MaxRAMPercentage gives you that guardrail for free.
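The percentage P can be derived mechanically from the live set, the occupancy target, and the limit. The sketch below uses a hypothetical class RamPercentage and illustrative inputs; setting both flags to the same value is an assumption made for the example, and the result is formatted to one decimal place.

```java
import java.util.Locale;

final class RamPercentage {
    /** P = (live / occ) / limit × 100, with live and limit in MiB and occ in (0, 1]. */
    static double percentage(double liveSetMiB, double occ, double limitMiB) {
        if (occ <= 0.0 || occ > 1.0 || limitMiB <= 0.0) {
            throw new IllegalArgumentException("occ must be in (0, 1] and limit must be positive");
        }
        return (liveSetMiB / occ) / limitMiB * 100.0;
    }

    /** Renders the percentage form of the flags, one decimal place. */
    static String flags(double p) {
        return String.format(Locale.ROOT,
            "-XX:InitialRAMPercentage=%.1f -XX:MaxRAMPercentage=%.1f", p, p);
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 200 MiB live set, occ = 0.60, 1024 MiB limit.
        double p = percentage(200, 0.60, 1024);
        System.out.println(flags(p));
    }
}
```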

(B) Explicit:

-Xms<heap> -Xmx<heap>
where heap = live/occ

Direct, but decoupled from the limit. Requires a guardrail in Helm/CI ensuring -Xmx + non_heap + native ≤ limit, or you hit the PoC’s exit 137 scenario.
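A guardrail of that kind can be as small as one inequality. This is a minimal sketch with a hypothetical class HeapGuardrail; how it would be wired into Helm or CI is left open. The main method checks the PoC’s OOM scenario, where a 700 MiB heap cannot fit in a 600 MiB container even before any off-heap memory is counted.

```java
final class HeapGuardrail {
    /** True when heap ceiling plus off-heap terms fits under the container limit (all MiB). */
    static boolean fits(long xmxMiB, long nonHeapCommittedMiB, long nativeMiB, long limitMiB) {
        return xmxMiB + nonHeapCommittedMiB + nativeMiB <= limitMiB;
    }

    public static void main(String[] args) {
        // PoC OOM scenario: -Xmx700m in a container with --memory=600m.
        boolean ok = fits(700, 0, 0, 600);
        System.out.println("700 MiB heap in a 600 MiB container fits: " + ok);
        if (!ok) {
            System.exit(1); // a non-zero exit lets a pipeline step fail
        }
    }
}
```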

In real microservice apps, that P landed in the 26–45% range — far from the standardized 75%. That was the waste.

Bonus pitfall: high `MaxRAMPercentage` embeds OOM risk

There’s a dangerous structural detail. Because MaxRAMPercentage ties heap_max to the limit, in small apps the JVM’s floor can already exceed the limit:

heap_max (75% of 768) = 576 MiB
+ non_heap_committed   = 227 MiB   (real case)
= 803 MiB  >  limit of 768 MiB

And that’s before counting native overhead. With MaxRAMPercentage=75, a non-heap-heavy app is born with a theoretical ceiling above the limit. It works while the heap doesn’t fill — but it’s a time bomb. The immediate mitigation in production was lowering Initial/Max from 75 → 65 on the tightest apps, then applying the formula per service.

The `MinRAMPercentage` gotcha, proven

Back to the misleading name. Compare the two PoC scenarios, both with -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=75:

minram-small (200 MiB container): heap max = 100 MiB = 50% of 200 → MinRAMPercentage governed.
minram-large (768 MiB container): heap max = 576 MiB = 75% of 768 → MinRAMPercentage was ignored, MaxRAMPercentage governed.

The rule: below ~256 MiB of available memory, MinRAMPercentage sets the ceiling; above that, it does nothing. In practice, for the vast majority of app containers, setting MinRAMPercentage has no effect, and it’s a recurring source of confusion. If you want to control heap, the lever is MaxRAMPercentage.
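The rule can be written down as a simplified model. The sketch below is not HotSpot’s actual ergonomics code, which takes more inputs into account; the class MinRamRule is hypothetical and the 256 MiB threshold is the approximate figure quoted here, not an exact JVM constant. It reproduces the heap ceilings of the two PoC scenarios.

```java
final class MinRamRule {
    // Approximate threshold from the text; treated as an assumption, not a JVM constant.
    static final long SMALL_MEMORY_THRESHOLD_MIB = 256;

    /** Simplified model: small memory uses MinRAMPercentage, otherwise MaxRAMPercentage. */
    static long effectiveMaxHeapMiB(long availableMiB, double minRamPercentage,
                                    double maxRamPercentage) {
        double pct = availableMiB < SMALL_MEMORY_THRESHOLD_MIB
            ? minRamPercentage
            : maxRamPercentage;
        return (long) (availableMiB * pct / 100.0);
    }

    public static void main(String[] args) {
        // Both PoC scenarios use MinRAMPercentage=50 and MaxRAMPercentage=75.
        System.out.println("minram-small (200 MiB): " + effectiveMaxHeapMiB(200, 50.0, 75.0) + " MiB");
        System.out.println("minram-large (768 MiB): " + effectiveMaxHeapMiB(768, 50.0, 75.0) + " MiB");
    }
}
```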

Actionable best-practices checklist

What we learned, summarized:

Don’t size JVM memory from container.memory.usage/working_set if you use high InitialRAMPercentage or AlwaysPreTouch. Those numbers mark ~MaxRAMPercentage of the limit, not demand.
Measure real usage from the post-GC live set: old_gen_size + survivor_size. Eden is churn, don’t add it.
Don’t forget non-heap and native. non_heap_committed + thread stacks + direct buffers + native residue (N). Ignoring them caused the OOMKill 137s. N depends on what runs in the container — calibrate by measuring RSS_stable − heap_committed − non_heap_committed − direct − threads×1MiB. 150 MiB worked for that fleet; yours may differ.
request = limit (Guaranteed) for JVM — it grows to the ceiling.
MaxRAMPercentage vs fixed -Xmx: MaxRAMPercentage follows the limit and gives you a guardrail. If you use -Xmx, add an explicit guardrail in CI/Helm.
MinRAMPercentage is almost never what you want. It only acts in tiny containers (<256 MiB).
Watch high MaxRAMPercentage on small apps: heap_max + non_heap_committed may already exceed the limit.

Appendix: running the PoC

Everything is in poc. Prerequisite: Docker with cgroup v2 — runs on the eclipse-temurin:21-jdk image.

git clone git@github.com:LucasBG0/poc-jvm-memory-containers.git
cd poc
./run.sh

The script builds the image, runs the eight scenarios (each under a fixed container limit), and generates results.md (the table in this post) plus a .log file under logs/ for each run, with two memory snapshots per run. Scenarios cover:

default, init-max-75, init-max-75-pretouch, low-init-max-75 — effect of Initial/Max on committed, RSS, and heap peak;
xmx-xms — explicit form, equivalent to percentage;
minram-small / minram-large — the MinRAMPercentage gotcha;
oom-xmx-acima-do-limit — deterministic exit 137.

Numbers vary slightly between runs (RSS is instantaneous and oscillates with GC), but the deterministic signals — heap committed at boot, heap ceiling, RSS with AlwaysPreTouch, post-GC live set — are stable and tell the whole story.

LucasBG0.com
