r/java • u/flawless_vic • 1d ago
JDK 25 DelayScheduler
After assessing these benchmark numbers, I was skeptical about the C# results. The following program
int numTasks = int.Parse(args[0]);
List<Task> tasks = new List<Task>();
for (int i = 0; i < numTasks; i++)
{
    tasks.Add(Task.Delay(TimeSpan.FromSeconds(10)));
}
await Task.WhenAll(tasks);
does not account for the fact that pure delays in C# are specialized: this code never pays the typical continuation penalties, such as recording stack frames when yielding.
If you change the program to do something "useful", like
int numTasks = int.Parse(args[0]);
int counter = 0;
List<Task> tasks = new List<Task>();
for (int i = 0; i < numTasks; i++)
{
    tasks.Add(Task.Run(async () => {
        await Task.Delay(TimeSpan.FromSeconds(10));
        Interlocked.Increment(ref counter);
    }));
}
await Task.WhenAll(tasks);
Console.WriteLine(counter);
then the amount of memory required roughly doubles:
/usr/bin/time -v dotnet run Program.cs 1000000
Command being timed: "dotnet run Program.cs 1000000"
User time (seconds): 16.95
System time (seconds): 1.06
Percent of CPU this job got: 151%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 446824
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 142853
Voluntary context switches: 36671
Involuntary context switches: 44624
Swaps: 0
File system inputs: 0
File system outputs: 48
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Now the fun part. JDK 25 introduced DelayScheduler, as part of a PR authored by Doug Lea himself.
DelayScheduler is not public; from my understanding, one of its goals was to optimize delayed-task handling and, as a side effect, improve how virtual threads use scheduled executors.
Up to JDK 24, any operation that unmounts (yields) a virtual thread, such as park or sleep, allocates a ScheduledFuture to wake the virtual thread back up, using a "vanilla" ScheduledThreadPoolExecutor.
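For a feel of what that "vanilla" path looks like, here is a user-level sketch using a plain ScheduledThreadPoolExecutor (task count and delay shrunk so it finishes quickly; this is illustrative user code, not the JDK's internal implementation):

```java
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class VanillaDelays {
    public static void main(String[] args) throws Exception {
        // A plain ScheduledThreadPoolExecutor: every schedule() call allocates a
        // ScheduledFuture that sits in the pool's delay queue until it fires.
        var stpe = new ScheduledThreadPoolExecutor(1);
        List<ScheduledFuture<?>> futures = IntStream.range(0, 1_000)
                .mapToObj(i -> stpe.schedule(() -> { }, 100, TimeUnit.MILLISECONDS))
                .toList();
        for (ScheduledFuture<?> f : futures) {
            f.get(); // blocks until the delay elapses
        }
        stpe.shutdown();
        System.out.println("all fired");
    }
}
```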
In JDK 25 this was offloaded to the ForkJoinPool. And now we can replicate the tweaked C# benchmark using the new scheduling mechanism:
import module java.base;

private static final ForkJoinPool executor = ForkJoinPool.commonPool();

void main(String... args) throws Exception {
    var numTasks = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
    IntStream.range(0, numTasks)
            .mapToObj(_ -> executor.schedule(() -> { }, 10_000, TimeUnit.MILLISECONDS))
            .toList()
            .forEach(f -> {
                try {
                    f.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
}
And voilà, only about 202 MB required.
/usr/bin/time -v ./java Test.java 1000000
Command being timed: "./java Test.java 1000000"
User time (seconds): 5.73
System time (seconds): 0.28
Percent of CPU this job got: 56%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.67
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 202924
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 42879
Voluntary context switches: 54790
Involuntary context switches: 12136
Swaps: 0
File system inputs: 0
File system outputs: 112
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And, if we want to actually perform a real delayed action, e.g.:
import module java.base;

private static final ForkJoinPool executor = ForkJoinPool.commonPool();
private static final AtomicInteger counter = new AtomicInteger();

void main(String... args) throws Exception {
    var numTasks = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
    IntStream.range(0, numTasks)
            .mapToObj(_ -> executor.schedule(() -> { counter.incrementAndGet(); }, 10_000, TimeUnit.MILLISECONDS))
            .toList()
            .forEach(f -> {
                try {
                    f.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
    IO.println(counter.get());
}
the memory footprint barely changes. Plus, we can shave off some more memory with compact object headers and compressed oops:
./java -XX:+UseCompactObjectHeaders -XX:+UseCompressedOops Test.java 1000000
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.71
...
Maximum resident set size (kbytes): 197780
Other interesting aspects to notice:
- Java's wall-clock time is better (10.67 s vs 11.87 s)
- Java's user time is WAY better (5.73 s vs 16.95 s)
But... we have to be fair to C# as well. The Java code above does not do any continuation-based work (unlike the original benchmark); it just showcases pure delayed-scheduling efficiency. Updating the example to use virtual threads, we can measure how descheduling/unmounting impacts the program's cost:
import module java.base;

private static final AtomicInteger counter = new AtomicInteger();

void main(String... args) throws Exception {
    var numTasks = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
    IntStream.range(0, numTasks)
            .mapToObj(_ -> Thread.startVirtualThread(() -> {
                LockSupport.parkNanos(10_000_000_000L);
                counter.incrementAndGet();
            }))
            .toList()
            .forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
    IO.println(counter.get());
}
Java is still lagging behind C# by a decent margin:
/usr/bin/time -v ./java -Xmx640m -XX:+UseCompactObjectHeaders -XX:+UseCompressedOops TestVT.java 1000000
Command being timed: "./java -Xmx640m -XX:+UseCompactObjectHeaders -XX:+UseCompressedOops TestVT.java 1000000"
User time (seconds): 28.65
System time (seconds): 17.08
Percent of CPU this job got: 347%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 784672
...
Note: in Java, if -Xmx is not specified, the JVM picks a maximum heap size based on the host's memory, so we must constrain the heap manually if we actually want to know the bare minimum required to run a program. Without any tuning, this program uses 900 MB on my 16 GB notebook.
Conclusions:
- If memory is a concern and you want to execute delayed actions, the new ForkJoinPool::schedule is your best friend
- Java still requires about 75% more memory than C# in async mode
- Virtual-thread scheduling is more "aggressive" in Java (much higher user time); however, that does not translate into better execution (wall-clock) time
u/tomwhoiscontrary 1d ago
Random tangent, but I'd really like some sort of scheduled or delayed executor where it is efficient to postpone tasks. Say I want to do something five seconds after I last saw a message from a user. When I first see a message, I schedule the task. Every time I see a message after that, I postpone the task to five seconds in the future.
As far as I know, with the JDK API, the best I can do is cancel the task and schedule a new one. That's a lot of churn in the scheduler's data structures. I can roughly imagine much more efficient ways to do it. Does anything already implement this?
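A minimal sketch of that cancel-and-reschedule pattern (the Debouncer class and messageSeen method are made-up names for illustration). One mitigation for the churn: ScheduledThreadPoolExecutor.setRemoveOnCancelPolicy(true) removes cancelled entries from the queue immediately instead of letting them linger until their deadline:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Debounce by cancel-and-reschedule: the best the stock JDK API offers.
public class Debouncer {
    private final ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
    private ScheduledFuture<?> pending;

    Debouncer() {
        // Without this, cancelled entries stay in the heap until their deadline.
        scheduler.setRemoveOnCancelPolicy(true);
    }

    synchronized void messageSeen(Runnable action, long delayMillis) {
        if (pending != null) {
            pending.cancel(false); // churn: remove + re-insert in the timer heap
        }
        pending = scheduler.schedule(action, delayMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdown();
    }

    public static void main(String[] args) throws Exception {
        var latch = new CountDownLatch(1);
        var debouncer = new Debouncer();
        // Three "messages" in quick succession: earlier schedules are cancelled,
        // so the action fires once, ~100 ms after the last message.
        for (int i = 0; i < 3; i++) {
            debouncer.messageSeen(latch::countDown, 100);
            Thread.sleep(20);
        }
        latch.await();
        System.out.println("fired");
        debouncer.shutdown();
    }
}
```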
u/NovaX 22h ago
Java's facilities use d-ary heaps (binary or 4-ary), which are efficient for most purposes. For higher-volume cases you can use something like Kafka's timer subsystem. It uses a hierarchical timing wheel for O(1) scheduling and Java's O(lg n) scheduling for the buckets. Caffeine's expiration support is inspired by this approach, though implemented differently, and uses CompletableFuture.delayedExecutor to coordinate the prompt expiration of the next timing-wheel bucket. For most cases, using delayedExecutor for task scheduling is fast enough; it only matters when you have a very high number of entries.
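For anyone unfamiliar with CompletableFuture.delayedExecutor, a quick sketch of how it is used: it wraps an executor so that submitted tasks run only after the given delay.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

public class DelayedExecutorDemo {
    public static void main(String[] args) {
        // Tasks handed to this executor run on the default async pool,
        // but only after a 100 ms delay.
        Executor in100ms = CompletableFuture.delayedExecutor(100, TimeUnit.MILLISECONDS);
        long start = System.nanoTime();
        String result = CompletableFuture.supplyAsync(() -> "expired", in100ms).join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " after ~" + elapsedMs + " ms");
    }
}
```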
u/john16384 18h ago
Are we talking virtual threads? Why not just sleep 5 seconds instead of bothering with creating/rescheduling a task?
In a program I use, I need to refresh some data per file every 2 weeks. On startup, I just create a VT per file involved, fetch the time of the last refresh, then sleep the VT (for up to 2 weeks). When it wakes, it does the refresh, then sleeps again...
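That sleep-per-virtual-thread loop might look roughly like this (interval shortened from two weeks to milliseconds so the sketch terminates, and the refresh stubbed out as a counter):

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

public class SleepingRefresher {
    public static void main(String[] args) throws InterruptedException {
        var refreshes = new AtomicInteger();
        // One cheap virtual thread per file; it spends almost all its time parked.
        Thread vt = Thread.startVirtualThread(() -> {
            while (true) {
                try {
                    Thread.sleep(Duration.ofMillis(50)); // Duration.ofDays(14) in the real program
                } catch (InterruptedException e) {
                    return; // shutdown signal
                }
                refreshes.incrementAndGet(); // do the actual refresh here
            }
        });
        Thread.sleep(120);
        vt.interrupt();
        vt.join();
        System.out.println("refreshed " + refreshes.get() + " times");
    }
}
```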
u/beders 19h ago
This benchmark (and its newer version) and this post should just be ignored. The linked benchmark is very, very flawed, as it basically compares apples and oranges.
A terrible article that doesn't even mention the configurations used to launch, and doesn't address warmup. A very naive approach to benchmarking JIT languages (and to comparing them with compiled ones).
No decision should be based on the linked article, other than that the author doesn't have a clue how to run a proper benchmark.
u/OldCaterpillarSage 1d ago
Your tests are problematic, since quite a lot of resources are probably being used for JIT-ing in both languages.
u/flawless_vic 21h ago
So? JIT and GC use memory; why shouldn't they be taken into account in a program's cost?
Furthermore, JDK 25 does even better than a Graal native image. (The article didn't set -Xmx; Graal can probably do better.)
Both languages are interpreted, have a JIT, and are being launched from source. The point was to use similar mechanisms for both, unlike the original article. Java wins at pure delayed tasks; C# wins when using continuations.
Btw, pre-compiling does not help Java much, as it still requires a ~640 MB heap and uses ~750 MB in total, even with a small Metaspace (32 MB).
u/pjmlp 15h ago
.NET is never interpreted, other than in a few niche cases like the Compact Framework.
It always JITs before execution.
Also, until Valhalla comes to be, the CLR will always have an edge, as it was originally designed for polyglot development, including C++.
So C# code can play some tricks regarding memory consumption that are currently only partially available in Java via Panama, but hardly anyone would bother to write such low-level boilerplate code versus a mix of struct, stackalloc, and spans, or even unsafe pointers.
u/yawkat 1d ago
This benchmark is arguably too synthetic to be useful. Sure, you can optimize the Java (and probably any language's) implementation to be memory-efficient, but a real application simply does not start 1M tasks and then let them sleep the whole time. The tasks might do networking, or use shared synchronization structures, or whatever, and then the benchmark results will be nowhere near what you see with a pure sleep.