Configuration¶
When the worker is running over elements, be it locally or on Arkindex, the first step before actually doing anything is configuration. This process is implemented in the configure method.
This method can also be overridden if the worker needs additional configuration steps.
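As an illustration of adding a configuration step, here is a minimal sketch. The BaseWorker class below is only a stand-in so that the example runs on its own, and the threshold parameter is hypothetical; neither reflects the real base class internals.

```python
class BaseWorker:
    """Stand-in for the real base worker, only here to keep the sketch runnable."""

    def __init__(self):
        self.config = {}

    def configure(self):
        # The real method parses CLI arguments and loads the configuration;
        # this stand-in only fills a dictionary with a sample value.
        self.config = {"threshold": "0.7"}


class ThresholdWorker(BaseWorker):
    def configure(self):
        # Run the standard configuration first so that self.config is filled.
        super().configure()
        # Hypothetical extra step: derive a typed attribute from the configuration.
        self.threshold = float(self.config.get("threshold", 0.5))


worker = ThresholdWorker()
worker.configure()
```

Calling the parent method first keeps the standard configuration intact, so the extra step only has to read values that are already loaded.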
The developer mode was designed to help worker developers reproduce and test how their worker would behave on Arkindex. This is why the configuration process in this mode mirrors the operations done on Arkindex, while replacing configuration API calls with CLI arguments.
The developer mode (or read-only mode) is enabled when at least one of the following is true:
- the --dev CLI argument is used,
- the ARKINDEX_WORKER_RUN_ID variable was not set in the environment.
Neither of these happens when running on Arkindex.
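The rule above can be sketched as a small function. This is an illustration under assumptions, not the actual property: the function name and signature are hypothetical, and it treats "not set" as "absent from the environment".

```python
import os
from argparse import Namespace
from typing import Mapping


def is_read_only(args: Namespace, environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when the worker should run in developer (read-only) mode."""
    # First condition: the --dev flag was passed on the command line.
    dev_flag = bool(getattr(args, "dev", False))
    # Second condition: no worker run ID is available in the environment.
    run_id_missing = "ARKINDEX_WORKER_RUN_ID" not in environ
    return dev_flag or run_id_missing
```

Either condition alone is enough, which is why running a worker locally without any special flag already ends up in developer mode.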
Parallel between both modes¶
flowchart TB
subgraph configure[Configuration step]
argument_parsing[CLI argument parsing]
end
argument_parsing --> is_read_only{IsReadOnly?}
is_read_only -- Yes --> devMode
is_read_only -- No --> arkindexMode
subgraph arkindexMode[Arkindex mode]
direction TB
subgraph workerConfiguration[Worker configuration]
direction TB
retrieveWorkerRun["API call to RetrieveWorkerRun"] --> userconfig_defaults[Initialize user configuration with default values]
userconfig_defaults --> load_secrets_API["Load Secrets using API calls to RetrieveSecret"]
load_secrets_API --> load_user_config[Override user configuration by values set by user]
load_user_config --> load_model_config["Load model configuration"]
end
workerConfiguration --> cacheConfiguration
subgraph cacheConfiguration[Base worker cache setup]
direction TB
get_paths_from_parent_tasks["Retrieve paths of parent tasks' cache databases"] --> initialize_db[Create cache database and its tables]
initialize_db --> merge_parent_databases[Merge parents databases]
end
end
subgraph devMode[Developer mode]
direction TB
subgraph devWorkerConfiguration[Worker configuration]
direction TB
configuration_parsing[CLI config argument parsing] --> corpus_id[Read Corpus ID from environment]
corpus_id --> load_secrets[Load secret in local developer storage]
end
end
classDef pyMeth font-style:italic
Arkindex mode¶
The details of a worker execution (what is called a WorkerRun) on Arkindex are stored in the backend. The first step of the configuration is to retrieve this information using the Arkindex API. The RetrieveWorkerRun endpoint gives information about:
- the running process,
- the configuration parameters that the user may have added from the frontend,
- the worker used,
- the version of this worker,
- the configuration stored in this version,
- the model version used in this worker execution,
- the configuration stored in this model version.
This step shows that there are a lot of sources for the actual configuration that the worker can use. In the end, any parameter set by the user must be applied over other known configurations.
Warning
The convention is to always give the final word to the user. This means that when the user configuration is filled, its values must be the last to override the worker’s config attribute. If a model configuration was set, its values must override this attribute before the user configuration’s.
The worker configuration may specify default values for some parameters (see this section for more details about worker configuration). These default values are stored in the user_configuration dictionary attribute.
This is also when the secrets (see this section to learn more about secrets) are actually downloaded. They are stored in the secrets dictionary attribute.
An Arkindex-mode exclusive step is done after all that: the cache setup. Some workers benefit a lot, performance-wise, from having a SQLite cache artifact from previous workers. This is mostly used in processes with multiple workers with dependencies, where the second worker needs the results of the first one to work. The database is initialized, the tables created and its version checked as it must match the one supported by the Arkindex instances. The database is then merged with any other database generated by previous worker runs.
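To make the merge step more concrete, here is a rough sketch using only the standard sqlite3 module. It is not the library's implementation: the elements table and its columns are hypothetical, the function name is made up for the example, and the version check is left out.

```python
import sqlite3
from pathlib import Path
from typing import Iterable


def merge_parent_databases(target: Path, parents: Iterable[Path]) -> None:
    """Create the target cache database and copy rows from each parent database."""
    conn = sqlite3.connect(str(target))
    try:
        # Hypothetical schema: a single table of cached elements.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS elements (id TEXT PRIMARY KEY, type TEXT NOT NULL)"
        )
        for parent in parents:
            # Attach the parent database so both can be queried in one statement.
            conn.execute("ATTACH DATABASE ? AS parent_db", (str(parent),))
            # Rows whose ID is already present are skipped rather than duplicated.
            conn.execute(
                "INSERT OR IGNORE INTO elements SELECT id, type FROM parent_db.elements"
            )
            # Commit before detaching: a database cannot be detached mid-transaction.
            conn.commit()
            conn.execute("DETACH DATABASE parent_db")
    finally:
        conn.close()
```

Skipping rows that already exist is one possible design choice here; it keeps the merge idempotent when several parents share the same elements.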
Once all information is retrieved and stored in the worker, the configuration is overridden by the model configuration and by the user configuration, if any, in this specific order.
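The override order can be illustrated with plain dictionaries. This is a minimal sketch assuming flat configurations; the function name and the sample keys are hypothetical.

```python
from typing import Any, Dict, Optional


def apply_overrides(
    config: Dict[str, Any],
    model_configuration: Optional[Dict[str, Any]] = None,
    user_configuration: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the configuration after applying the overrides in order."""
    merged = dict(config)
    # The model configuration overrides the base configuration first.
    if model_configuration:
        merged.update(model_configuration)
    # The user configuration is applied last, so the user has the final word.
    if user_configuration:
        merged.update(user_configuration)
    return merged


final = apply_overrides(
    {"batch_size": 8, "language": "en"},
    model_configuration={"batch_size": 16},
    user_configuration={"language": "fr"},
)
```

With these sample values, batch_size comes from the model configuration and language from the user configuration, because each later source only replaces the keys it defines.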
Developer mode¶
In the developer mode, the worker execution is not linked to anything on Arkindex. Therefore, the only configuration the worker can use is provided via the --config CLI argument. It accepts a YAML-formatted file, which should be similar to the configuration section of the worker configuration file, without the user_configuration details. More details about how to create the local worker configuration are available in this section.
The multiple configuration sources from the Arkindex mode are merged into a single one here. The configuration parameters are parsed, as well as the list of required secrets. The secrets are loaded using a local Arkindex client. Again, see the section about local execution for more details.
One piece of information cannot be retrieved directly from the configuration file and is required in some cases: the ID of the Arkindex corpus to which the processed elements belong. This is retrieved via the ARKINDEX_CORPUS_ID environment variable.
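The inputs used in developer mode can be gathered as in the sketch below. It only reuses the --dev and --config arguments and the ARKINDEX_CORPUS_ID variable named in this page; the function name is hypothetical, and parsing the YAML file itself is left out.

```python
import argparse
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence, Tuple


def parse_dev_inputs(
    argv: Sequence[str], environ: Mapping[str, str] = os.environ
) -> Tuple[argparse.Namespace, Optional[str]]:
    """Parse the developer-mode CLI arguments and read the corpus ID."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dev", action="store_true")
    # Path to the local YAML configuration file.
    parser.add_argument("--config", type=Path, default=None)
    args = parser.parse_args(argv)
    # The corpus ID is not part of the configuration file, so it comes
    # from the environment; it stays None when the variable is absent.
    corpus_id = environ.get("ARKINDEX_CORPUS_ID")
    return args, corpus_id
```

Returning None for a missing corpus ID leaves it to the caller to decide whether the value is actually required for the task at hand.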
Setting Debug logging level¶
There are three ways to activate the debug mode:
- the --verbose CLI argument,
- setting the ARKINDEX_DEBUG environment variable to True,
- setting "debug": True in the worker’s configuration via any configuration source.
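These three triggers can be combined as in the following sketch. The function names are hypothetical, the case-insensitive comparison of the environment value is an assumption, and so is the choice of INFO as the level used otherwise.

```python
import logging
import os
from typing import Any, Mapping


def debug_requested(
    verbose: bool, config: Mapping[str, Any], environ: Mapping[str, str] = os.environ
) -> bool:
    """Return True if any of the three debug triggers is active."""
    # Assumption: the variable is compared case-insensitively to "true".
    env_flag = environ.get("ARKINDEX_DEBUG", "").strip().lower() == "true"
    return bool(verbose) or env_flag or bool(config.get("debug"))


def setup_logging(
    verbose: bool, config: Mapping[str, Any], environ: Mapping[str, str] = os.environ
) -> int:
    """Set the root logger level and return it."""
    # Assumption: INFO is used when debug mode is not requested.
    level = logging.DEBUG if debug_requested(verbose, config, environ) else logging.INFO
    logging.getLogger().setLevel(level)
    return level
```

Any single trigger is sufficient, so a debug flag set in the configuration still applies when the CLI argument is absent.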
Important class attributes¶
Many attributes are set on the worker during the configuration stage. Here is a non-exhaustive list with some details about their source and their usage.
api_client
- The Arkindex API client used by the worker to make requests. One should not rely on this attribute to make API calls but use the many helpers available. The exception is for endpoints where no helper is available.
args
- The arguments passed via the CLI. This is used to trigger the developer mode via --dev, to specify the configuration file via --config and to list elements to process via --element.
config
- A dictionary with the worker’s configuration. This is filled by the worker run’s configuration, the worker version’s and the model version’s if there is any. Once loaded, it is overridden, in this order, by the model_configuration and the user_configuration.
corpus_id
- The ID of the corpus linked to the current process. This is mostly needed when publishing objects linked to a corpus like Entities. You may set it in developer mode via the ARKINDEX_CORPUS_ID environment variable.
is_read_only
- This is the computed property that determines which mode should be used. The developer mode prevents any actual publication on Arkindex, hence the name read_only.
model_configuration
- The parsed configuration as stored in the ModelVersion object on Arkindex.
model_version_id
- The ID of the model version linked to the current WorkerRun object on Arkindex. You may set it in developer mode via the ARKINDEX_MODEL_VERSION_ID environment variable.
model_details
- The details of the model for the model version linked to the current WorkerRun object on Arkindex. You may populate it in developer mode via the ARKINDEX_MODEL_ID environment variable.
process_information
- The details about the process parent to this worker execution. Only set in Arkindex mode.
secrets
- A dictionary mapping each secret’s name to its parsed content.
use_cache
- Whether the cache optimization is available or not.
user_configuration
- The parsed configuration as the user entered it via the Arkindex frontend. Any parameter not specified will be filled with its default value if there is one.
worker_details
- The details of the worker used in this execution.
worker_run_id
- The ID of the corresponding WorkerRun object on the Arkindex instance. In Arkindex mode, this is used in the RetrieveWorkerRun API call to retrieve the configuration and other necessary information. In developer mode, this is neither set nor used.