Using Dataloader
If you are using graphql, you are likely to be making queries on a graph of data (no surprises there). However, it's easy to implement inefficient code with naive loading of a graph of data.
Using java-dataloader will help you make this process more efficient by both caching and batching requests for that graph of data items. If dataloader has seen a data item before, it will have cached the value and will return it without having to ask for it again.
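For example, here is a minimal sketch of that caching and batching behaviour on its own, outside of any graphql execution. The loadFromBackend(keys) method is a hypothetical call that fetches a list of values for a list of keys:

//
// a minimal sketch, assuming a hypothetical loadFromBackend(keys) method
// that returns a List<Object> of values for the given keys
//
BatchLoader<String, Object> batchLoader = keys ->
        CompletableFuture.completedFuture(loadFromBackend(keys));

DataLoader<String, Object> loader = DataLoaderFactory.newDataLoader(batchLoader);

CompletableFuture<Object> first = loader.load("2001");
CompletableFuture<Object> second = loader.load("2001"); // same key - the cached promise is returned

loader.dispatch(); // fires ONE batch call to loadFromBackend with ["2001"]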
Imagine we have the StarWars query outlined below. It asks us to find a hero, their friends' names, and their friends' friends' names. It is likely that many of these people will be friends in common.
{
  hero {
    name
    friends {
      name
      friends {
        name
      }
    }
  }
}
The result of this query is displayed below. You can see that Han, Leia, Luke and R2-D2 are a tight-knit bunch of friends and share many friends in common.
{
  "hero": {
    "name": "R2-D2",
    "friends": [
      {
        "name": "Luke Skywalker",
        "friends": [
          {"name": "Han Solo"},
          {"name": "Leia Organa"},
          {"name": "C-3PO"},
          {"name": "R2-D2"}
        ]
      },
      {
        "name": "Han Solo",
        "friends": [
          {"name": "Luke Skywalker"},
          {"name": "Leia Organa"},
          {"name": "R2-D2"}
        ]
      },
      {
        "name": "Leia Organa",
        "friends": [
          {"name": "Luke Skywalker"},
          {"name": "Han Solo"},
          {"name": "C-3PO"},
          {"name": "R2-D2"}
        ]
      }
    ]
  }
}
A naive implementation would call a DataFetcher to retrieve a person object every time it was invoked. In this case it would be 15 calls over the network, even though the group of people have a lot of common friends.
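To make the inefficiency concrete, a naive friends fetcher might look something like the sketch below, where fetchCharacterViaHTTP is a hypothetical per-person remote call:

//
// a hypothetical naive implementation - every friend id results in its own
// remote call, even if the same person has been fetched before
//
DataFetcher<?> naiveFriendsDataFetcher = new DataFetcher<Object>() {
    @Override
    public Object get(DataFetchingEnvironment environment) {
        StarWarsCharacter starWarsCharacter = environment.getSource();
        return starWarsCharacter.getFriendIds().stream()
                .map(id -> fetchCharacterViaHTTP(id)) // one network call per id
                .collect(Collectors.toList());
    }
};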
With dataloader you can make the graphql query much more efficient. As graphql descends each level of the query (e.g., as it processes hero and then friends and then for each of their friends), the data loader is called to "promise" to deliver a person object. At each level dataloader.dispatch() will be called to fire off the batch requests for that part of the query. With caching turned on (the default), any previously returned person will be returned as-is for no cost.

In the above example there are only 5 unique people mentioned, but with caching and batching retrieval in place there will be only 3 calls to the batch loader function. 3 calls over the network or to a database is much better than 15 calls, you will agree.
If you use capabilities like java.util.concurrent.CompletableFuture.supplyAsync() then you can make it even more efficient by making the remote calls asynchronous to the rest of the query. This makes the query even more timely since multiple calls can happen at once if need be.
Here is how you might put this in place:
//
// a batch loader function that will be called with N or more keys for batch loading
// This can be a singleton object since it's stateless
//
BatchLoader<String, Object> characterBatchLoader = new BatchLoader<String, Object>() {
    @Override
    public CompletionStage<List<Object>> load(List<String> keys) {
        //
        // we use supplyAsync() of values here for maximum parallelisation
        //
        return CompletableFuture.supplyAsync(() -> getCharacterDataViaBatchHTTPApi(keys));
    }
};
//
// use this data loader in the data fetchers associated with characters and put them into
// the graphql schema (not shown)
//
DataFetcher<?> heroDataFetcher = new DataFetcher<Object>() {
    @Override
    public Object get(DataFetchingEnvironment environment) {
        DataLoader<String, Object> dataLoader = environment.getDataLoader("character");
        return dataLoader.load("2001"); // R2D2
    }
};

DataFetcher<?> friendsDataFetcher = new DataFetcher<Object>() {
    @Override
    public Object get(DataFetchingEnvironment environment) {
        StarWarsCharacter starWarsCharacter = environment.getSource();
        List<String> friendIds = starWarsCharacter.getFriendIds();
        DataLoader<String, Object> dataLoader = environment.getDataLoader("character");
        return dataLoader.loadMany(friendIds);
    }
};
//
// this instrumentation implementation will dispatch all the data loaders
// as each level of the graphql query is executed and hence make batched objects
// available to the query and the associated DataFetchers
//
// In this case we use options to make it keep statistics on the batching efficiency
//
DataLoaderDispatcherInstrumentationOptions options = DataLoaderDispatcherInstrumentationOptions
        .newOptions().includeStatistics(true);

DataLoaderDispatcherInstrumentation dispatcherInstrumentation
        = new DataLoaderDispatcherInstrumentation(options);

//
// now build your graphql object and execute queries on it.
// the data loader will be invoked via the data fetchers on the
// schema fields
//
GraphQL graphQL = GraphQL.newGraphQL(buildSchema())
        .instrumentation(dispatcherInstrumentation)
        .build();

//
// a data loader for characters that points to the character batch loader
//
// Since data loaders are stateful, they are created per execution request.
//
DataLoader<String, Object> characterDataLoader = DataLoaderFactory.newDataLoader(characterBatchLoader);

//
// DataLoaderRegistry is a place to register all the data loaders that need to be dispatched together.
// In this case there is 1 but you can have many.
//
// Also note that the data loaders are created per execution request
//
DataLoaderRegistry registry = new DataLoaderRegistry();
registry.register("character", characterDataLoader);

ExecutionInput executionInput = newExecutionInput()
        .query(getQuery())
        .dataLoaderRegistry(registry)
        .build();

ExecutionResult executionResult = graphQL.execute(executionInput);
In this example we explicitly added the DataLoaderDispatcherInstrumentation because we wanted to tweak its options. However, it will be automatically added for you if you don't add it manually.
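So if you don't need to tweak its options, a sketch of the simpler setup is just:

//
// no instrumentation is added here - graphql-java will add the data loader
// dispatching instrumentation for you automatically
//
GraphQL graphQL = GraphQL.newGraphQL(buildSchema())
        .build();

ExecutionInput executionInput = newExecutionInput()
        .query(getQuery())
        .dataLoaderRegistry(registry)
        .build();

ExecutionResult executionResult = graphQL.execute(executionInput);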
You can read a lot more about the java-dataloader API over at https://github.com/graphql-java/java-dataloader.
Data Loader only works with AsyncExecutionStrategy
The only execution strategy that works with DataLoader is graphql.execution.AsyncExecutionStrategy. This is because this execution strategy knows when the most optimal time to dispatch() your load calls is. It does this by deeply tracking how many fields are outstanding and whether they are list values and so on.

Other execution strategies such as ExecutorServiceExecutionStrategy can't do this, and hence if the data loader code detects you are not using AsyncExecutionStrategy then it will simply dispatch the data loader as each field is encountered. You may get caching of values, but you will not get batching of them.
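AsyncExecutionStrategy is the default query execution strategy in graphql-java, but if you want to be explicit it might look like this:

//
// AsyncExecutionStrategy is already the default for queries, so this is
// only needed if you want to set it explicitly
//
GraphQL graphQL = GraphQL.newGraphQL(buildSchema())
        .queryExecutionStrategy(new AsyncExecutionStrategy())
        .build();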
Per Request Data Loaders
If you are serving web requests then the data can be specific to the user requesting it. If you have user specific data then you will not want to cache data meant for user A and later give it to user B in a subsequent request.
The scope of your DataLoader instances is important. You will want to create them per web request to ensure data is only cached within that web request and no more. It also ensures that a dispatch call only affects that graphql execution and no other.
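For example, a sketch of per request wiring might look like the following, where buildRegistryPerRequest is a hypothetical helper you would call once per incoming web request:

//
// a hypothetical helper called once per web request - each request gets its own
// data loaders and hence its own cache and dispatch scope
//
private DataLoaderRegistry buildRegistryPerRequest() {
    DataLoader<String, Object> characterDataLoader = DataLoaderFactory.newDataLoader(characterBatchLoader);
    DataLoaderRegistry registry = new DataLoaderRegistry();
    registry.register("character", characterDataLoader);
    return registry;
}

// .... in your per request handling code

ExecutionInput executionInput = newExecutionInput()
        .query(getQuery())
        .dataLoaderRegistry(buildRegistryPerRequest())
        .build();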
DataLoaders by default act as caches. If they have seen a value before for a key then they will automatically return it in order to be efficient. They cache promises to a value and optionally the value itself.
If your data can be shared across web requests then you might want to change the ValueCache implementation of your data loaders so they share data via caching systems like memcached or redis. You still create data loaders per request; however, the caching layer will allow data sharing (if that's suitable).
ValueCache<String, Object> crossRequestValueCache = new ValueCache<String, Object>() {
    @Override
    public CompletableFuture<Object> get(String key) {
        return redisIntegration.getValue(key);
    }

    @Override
    public CompletableFuture<Object> set(String key, Object value) {
        return redisIntegration.setValue(key, value);
    }

    @Override
    public CompletableFuture<Void> delete(String key) {
        return redisIntegration.clearKey(key);
    }

    @Override
    public CompletableFuture<Void> clear() {
        return redisIntegration.clearAll();
    }
};

DataLoaderOptions options = DataLoaderOptions.newOptions().setValueCache(crossRequestValueCache);

DataLoader<String, Object> dataLoader = DataLoaderFactory.newDataLoader(batchLoader, options);
Async Calls On Your Batch Loader Function Only
The data loader code pattern works by combining all the outstanding data loader calls into more efficient batch loading calls. graphql-java tracks what outstanding data loader calls have been made, and it is its responsibility to call dispatch in the background at the most optimal time, which is when all graphql fields have been examined and dispatched.

However, there is a code pattern that will cause your data loader calls to never complete, and it MUST be avoided. This bad pattern consists of making an asynchronous off-thread call to a DataLoader in your data fetcher.

The following will not work (it will never complete).
BatchLoader<String, Object> batchLoader = new BatchLoader<String, Object>() {
    @Override
    public CompletionStage<List<Object>> load(List<String> keys) {
        return CompletableFuture.completedFuture(getTheseCharacters(keys));
    }
};

DataLoader<String, Object> characterDataLoader = DataLoaderFactory.newDataLoader(batchLoader);

// .... later in your data fetcher

DataFetcher<?> dataFetcherThatCallsTheDataLoader = new DataFetcher<Object>() {
    @Override
    public Object get(DataFetchingEnvironment environment) {
        //
        // Don't DO THIS!
        //
        return CompletableFuture.supplyAsync(() -> {
            String argId = environment.getArgument("id");
            DataLoader<String, Object> characterLoader = environment.getDataLoader("characterLoader");
            return characterLoader.load(argId);
        });
    }
};
In the example above, the call to characterLoader.load(argId) can happen some time in the future on another thread. The graphql-java engine has no way of knowing when it's a good time to dispatch outstanding DataLoader calls, and hence the data loader call might never complete as expected and no results will be returned.

Remember a data loader call is just a promise to actually get a value later when it's an optimal time for all outstanding calls to be batched together. The most optimal time is when the graphql field tree has been examined and all field values are currently dispatched.

The following is how you can still have asynchronous code, by placing it into the BatchLoader itself.
BatchLoader<String, Object> batchLoader = new BatchLoader<String, Object>() {
    @Override
    public CompletionStage<List<Object>> load(List<String> keys) {
        return CompletableFuture.supplyAsync(() -> getTheseCharacters(keys));
    }
};

DataLoader<String, Object> characterDataLoader = DataLoaderFactory.newDataLoader(batchLoader);

// .... later in your data fetcher

DataFetcher<?> dataFetcherThatCallsTheDataLoader = new DataFetcher<Object>() {
    @Override
    public Object get(DataFetchingEnvironment environment) {
        //
        // This is OK
        //
        String argId = environment.getArgument("id");
        DataLoader<String, Object> characterLoader = environment.getDataLoader("characterLoader");
        return characterLoader.load(argId);
    }
};
Notice above that characterLoader.load(argId) returns immediately. This will enqueue the call for data until a later time when all the graphql fields are dispatched.

Then later when the DataLoader is dispatched, its BatchLoader function is called. This code can be asynchronous so that if you have multiple batch loader functions they can all run at once. In the code above CompletableFuture.supplyAsync(() -> getTheseCharacters(keys)); will run the getTheseCharacters method in another thread.
Passing context to your data loader
The data loader library supports two types of context being passed to the batch loader. The first is an overall context object per dataloader, and the second is a map of per loaded key context objects.

This allows you to pass in the extra details you may need to make downstream calls. The dataloader key is used in the caching of results, but the context objects can be made available to help with the call.

So in the example below we have an overall security context object that gives out a call token, and we also pass the graphql source object to each dataLoader.load() call.
BatchLoaderWithContext<String, Object> batchLoaderWithCtx = new BatchLoaderWithContext<String, Object>() {
    @Override
    public CompletionStage<List<Object>> load(List<String> keys, BatchLoaderEnvironment loaderContext) {
        //
        // we can have an overall context object
        SecurityContext securityCtx = loaderContext.getContext();
        //
        // and we can have a per key set of context objects
        Map<Object, Object> keysToSourceObjects = loaderContext.getKeyContexts();

        return CompletableFuture.supplyAsync(() -> getTheseCharacters(securityCtx.getToken(), keys, keysToSourceObjects));
    }
};

// ....

SecurityContext securityCtx = SecurityContext.newSecurityContext();

BatchLoaderContextProvider contextProvider = new BatchLoaderContextProvider() {
    @Override
    public Object getContext() {
        return securityCtx;
    }
};

//
// this creates an overall context for the dataloader
//
DataLoaderOptions loaderOptions = DataLoaderOptions.newOptions().setBatchLoaderContextProvider(contextProvider);
DataLoader<String, Object> characterDataLoader = DataLoaderFactory.newDataLoader(batchLoaderWithCtx, loaderOptions);

// .... later in your data fetcher

DataFetcher<?> dataFetcherThatCallsTheDataLoader = new DataFetcher<Object>() {
    @Override
    public Object get(DataFetchingEnvironment environment) {
        String argId = environment.getArgument("id");
        Object source = environment.getSource();
        //
        // you can pass per load call contexts
        //
        return characterDataLoader.load(argId, source);
    }
};