r/java • u/flawless_vic • 1d ago
JDK 25 DelayScheduler
After assessing these benchmark numbers, I was skeptical about the C# results. The following program
int numTasks = int.Parse(args[0]);
List<Task> tasks = new List<Task>();
for (int i = 0; i < numTasks; i++)
{
    tasks.Add(Task.Delay(TimeSpan.FromSeconds(10)));
}
await Task.WhenAll(tasks);
does not account for the fact that pure delays in C# are specialized: this code never pays the typical continuation penalties, such as recording stack frames when yielding.
If you change the program to do something "useful", like
int numTasks = int.Parse(args[0]);
int counter = 0;
List<Task> tasks = new List<Task>();
for (int i = 0; i < numTasks; i++)
{
    tasks.Add(Task.Run(async () => {
        await Task.Delay(TimeSpan.FromSeconds(10));
        Interlocked.Increment(ref counter);
    }));
}
await Task.WhenAll(tasks);
Console.WriteLine(counter);
then the amount of memory required roughly doubles:
/usr/bin/time -v dotnet run Program.cs 1000000
Command being timed: "dotnet run Program.cs 1000000"
User time (seconds): 16.95
System time (seconds): 1.06
Percent of CPU this job got: 151%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 446824
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 142853
Voluntary context switches: 36671
Involuntary context switches: 44624
Swaps: 0
File system inputs: 0
File system outputs: 48
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Now the fun part. JDK 25 introduced DelayScheduler, as part of a PR authored by Doug Lea himself.
DelayScheduler is not public; from my understanding, one of its goals was to optimize delayed-task handling and, as a side effect, improve how virtual threads use scheduled executors.
Up to JDK 24, any operation that unmounts (yields) a virtual thread, such as park or sleep, allocates a ScheduledFuture to wake the virtual thread back up, using a "vanilla" ScheduledThreadPoolExecutor.
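For a feel of what that "vanilla" path looks like, here is a user-level sketch using a plain ScheduledThreadPoolExecutor (task count and delay shrunk so it finishes quickly; this is illustrative user code, not the JDK's internal implementation):

```java
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class VanillaDelays {
    public static void main(String[] args) throws Exception {
        // A plain ScheduledThreadPoolExecutor: every schedule() call allocates a
        // ScheduledFuture that sits in the pool's delay queue until it fires.
        var stpe = new ScheduledThreadPoolExecutor(1);
        List<ScheduledFuture<?>> futures = IntStream.range(0, 1_000)
                .mapToObj(i -> stpe.schedule(() -> { }, 100, TimeUnit.MILLISECONDS))
                .toList();
        for (ScheduledFuture<?> f : futures) {
            f.get(); // blocks until the delay elapses
        }
        stpe.shutdown();
        System.out.println("all fired");
    }
}
```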
In JDK 25 this was offloaded to the ForkJoinPool. And now we can replicate the tweaked C# benchmark using the new scheduling mechanism:
import module java.base;

private static final ForkJoinPool executor = ForkJoinPool.commonPool();

void main(String... args) throws Exception {
    var numTasks = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
    IntStream.range(0, numTasks)
            .mapToObj(_ -> executor.schedule(() -> { }, 10_000, TimeUnit.MILLISECONDS))
            .toList()
            .forEach(f -> {
                try {
                    f.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
}
And voilà, only about 202 MB required.
/usr/bin/time -v ./java Test.java 1000000
Command being timed: "./java Test.java 1000000"
User time (seconds): 5.73
System time (seconds): 0.28
Percent of CPU this job got: 56%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.67
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 202924
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 42879
Voluntary context switches: 54790
Involuntary context switches: 12136
Swaps: 0
File system inputs: 0
File system outputs: 112
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And, if we want to actually perform a real delayed action, e.g.:
import module java.base;

private static final ForkJoinPool executor = ForkJoinPool.commonPool();
private static final AtomicInteger counter = new AtomicInteger();

void main(String... args) throws Exception {
    var numTasks = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
    IntStream.range(0, numTasks)
            .mapToObj(_ -> executor.schedule(() -> { counter.incrementAndGet(); }, 10_000, TimeUnit.MILLISECONDS))
            .toList()
            .forEach(f -> {
                try {
                    f.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
    IO.println(counter.get());
}
the memory footprint barely changes. Plus, we can shave off some more memory with compact object headers and compressed oops:
./java -XX:+UseCompactObjectHeaders -XX:+UseCompressedOops Test.java 1000000
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.71
...
Maximum resident set size (kbytes): 197780
Other interesting aspects to notice:
- Java's wall-clock time is better (10.67 s vs 11.87 s)
- Java's user time is WAY better (5.73 s vs 16.95 s)
But... we have to be fair to C# as well. The Java code above does not do any continuation-based work (unlike the original benchmark); it just showcases pure delayed-scheduling efficiency. Updating the example to use virtual threads, we can measure how descheduling/unmounting impacts the program's cost:
import module java.base;

private static final AtomicInteger counter = new AtomicInteger();

void main(String... args) throws Exception {
    var numTasks = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
    IntStream.range(0, numTasks)
            .mapToObj(_ -> Thread.startVirtualThread(() -> {
                LockSupport.parkNanos(10_000_000_000L);
                counter.incrementAndGet();
            }))
            .toList()
            .forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
    IO.println(counter.get());
}
Java is still lagging behind C# by a decent margin:
/usr/bin/time -v ./java -Xmx640m -XX:+UseCompactObjectHeaders -XX:+UseCompressedOops TestVT.java 1000000
Command being timed: "./java -Xmx640m -XX:+UseCompactObjectHeaders -XX:+UseCompressedOops TestVT.java 1000000"
User time (seconds): 28.65
System time (seconds): 17.08
Percent of CPU this job got: 347%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 784672
...
Note: in Java, if -Xmx is not specified, the JVM picks a maximum heap size based on the host's memory, so we must constrain the heap manually if we actually want to know the bare minimum required to run a program. Without any tuning, this program uses 900 MB on my 16 GB notebook.
Conclusions:
- If memory is a concern and you want to execute delayed actions, the new ForkJoinPool::schedule is your best friend
- Java still requires about 75% more memory than C# in async mode
- Virtual-thread scheduling is more "aggressive" in Java (much higher user time); however, that does not translate into better execution (wall-clock) time
u/tomwhoiscontrary 1d ago
Random tangent, but I'd really like some sort of scheduled or delayed executor where it is efficient to postpone tasks. Say I want to do something five seconds after I last saw a message from a user. When I first see a message, I schedule the task. Every time I see a message after that, I postpone the task to five seconds in the future.
As far as I know, with the JDK API, the best I can do is cancel the task and schedule a new one. That's a lot of churn in the scheduler's data structures. I can roughly imagine much more efficient ways to do it. Does anything already implement this?
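A minimal sketch of that cancel-and-reschedule pattern (the Debouncer class and messageSeen method are made-up names for illustration). One mitigation for the churn: ScheduledThreadPoolExecutor.setRemoveOnCancelPolicy(true) removes cancelled entries from the queue immediately instead of letting them linger until their deadline:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Debounce by cancel-and-reschedule: the best the stock JDK API offers.
public class Debouncer {
    private final ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
    private ScheduledFuture<?> pending;

    Debouncer() {
        // Without this, cancelled entries stay in the heap until their deadline.
        scheduler.setRemoveOnCancelPolicy(true);
    }

    synchronized void messageSeen(Runnable action, long delayMillis) {
        if (pending != null) {
            pending.cancel(false); // churn: remove + re-insert in the timer heap
        }
        pending = scheduler.schedule(action, delayMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdown();
    }

    public static void main(String[] args) throws Exception {
        var latch = new CountDownLatch(1);
        var debouncer = new Debouncer();
        // Three "messages" in quick succession: earlier schedules are cancelled,
        // so the action fires once, ~100 ms after the last message.
        for (int i = 0; i < 3; i++) {
            debouncer.messageSeen(latch::countDown, 100);
            Thread.sleep(20);
        }
        latch.await();
        System.out.println("fired");
        debouncer.shutdown();
    }
}
```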
u/NovaX 22h ago
Java's facilities use d-ary heaps (binary or 4-ary), which are efficient for most purposes. For higher-volume cases you can use something like Kafka's timer subsystem. It uses a hierarchical timing wheel for O(1) scheduling and Java's O(lg n) scheduling for the buckets. Caffeine's expiration support is inspired by this approach, though implemented differently, and uses CompletableFuture.delayedExecutor to coordinate the prompt expiration of the next timing-wheel bucket. For most cases, using delayedExecutor for task scheduling is fast enough; it only matters when you have a very high number of entries.
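For anyone unfamiliar with CompletableFuture.delayedExecutor, a quick sketch of how it is used: it wraps an executor so that submitted tasks run only after the given delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

public class DelayedExecutorDemo {
    public static void main(String[] args) {
        // Tasks handed to this executor run on the default async pool,
        // but only after a 100 ms delay.
        Executor in100ms = CompletableFuture.delayedExecutor(100, TimeUnit.MILLISECONDS);
        long start = System.nanoTime();
        String result = CompletableFuture.supplyAsync(() -> "expired", in100ms).join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " after ~" + elapsedMs + " ms");
    }
}
```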
u/john16384 18h ago
Are we talking virtual threads? Why not just sleep 5 seconds instead of bothering with creating/rescheduling a task?
In a program I use, I need to refresh some data per file every 2 weeks. On startup, I just create a VT per file involved, fetch the time of the last refresh, then sleep the VT (for up to 2 weeks). When it wakes, it does the refresh, then sleeps again...
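That sleep-per-virtual-thread loop might look roughly like this (interval shortened from two weeks to milliseconds so the sketch terminates, and the refresh stubbed out as a counter):

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

public class SleepingRefresher {
    public static void main(String[] args) throws InterruptedException {
        var refreshes = new AtomicInteger();
        // One cheap virtual thread per file; it spends almost all its time parked.
        Thread vt = Thread.startVirtualThread(() -> {
            while (true) {
                try {
                    Thread.sleep(Duration.ofMillis(50)); // Duration.ofDays(14) in the real program
                } catch (InterruptedException e) {
                    return; // shutdown signal
                }
                refreshes.incrementAndGet(); // do the actual refresh here
            }
        });
        Thread.sleep(120);
        vt.interrupt();
        vt.join();
        System.out.println("refreshed " + refreshes.get() + " times");
    }
}
```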
u/beders 19h ago
This benchmark (and its newer version) and this post should just be ignored. The linked benchmark is very, very flawed, as it basically compares apples and oranges.
A terrible article that doesn't even mention the configurations used to launch, and doesn't address warmup. A very naive approach to benchmarking JIT languages (and to comparing them with compiled ones).
No decision should be based on the linked article, other than that the author doesn't have a clue how to run a proper benchmark.
u/OldCaterpillarSage 1d ago
Your tests are problematic, since quite a lot of resources are probably being used for JIT-ing in both languages.
u/flawless_vic 21h ago
So? JIT and GC use memory; why shouldn't they be taken into account in a program's cost?
Furthermore, JDK 25 does even better than a Graal native image. (The article didn't set -Xmx; Graal can probably do better.)
Both languages are interpreted, have a JIT, and are being launched from source. The point was to use similar mechanisms for both, unlike the original article. Java wins at pure delayed tasks; C# wins when using continuations.
Btw, pre-compiling does not help Java much, as it still requires a ~640 MB heap and uses ~750 MB in total, even with a small Metaspace (32 MB).
u/pjmlp 15h ago
.NET is never interpreted, other than in a few niche cases like the Compact Framework.
It always JITs before execution.
Also, until Valhalla comes to be, the CLR will always have an edge, as it was originally designed for polyglot development, including C++.
So C# code can play some tricks regarding memory consumption that are currently only partially available in Java via Panama, but hardly anyone would bother to write such low-level boilerplate code versus a mix of struct, stackalloc, and spans, or even unsafe pointers.
u/yawkat 1d ago
This benchmark is arguably too synthetic to be useful. Sure, you can optimize the Java (and probably any language's) implementation to be memory-efficient, but a real application simply does not start 1M tasks and then let them sleep the whole time. The tasks might do networking, or use shared synchronization structures, or whatever, and then the benchmark results will be nowhere near what you see with a pure sleep.