Configuration

Whenever a worker runs over elements, whether locally or on Arkindex, its first step is configuration. This process is implemented in the configure method, which can be overridden if the worker needs additional configuration steps.

The developer mode was designed to help worker developers reproduce and test how their worker would behave on Arkindex. This is why the configuration process in this mode mirrors the operations done on Arkindex, while replacing configuration API calls with CLI arguments.

The developer mode (or read-only mode) is enabled when at least one of the following is true:

  • the --dev CLI argument is used,
  • the ARKINDEX_WORKER_RUN_ID variable was not set in the environment.

Neither of these happens when running on Arkindex.
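The mode selection described above can be sketched as a small function. This is only an illustration of the rule, not the library's actual code: the function name and its parameters are hypothetical, and only the --dev flag and the ARKINDEX_WORKER_RUN_ID variable come from the text.

```python
import os
from typing import Mapping, Optional


def is_read_only(dev_flag: bool, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the worker should run in developer (read-only) mode.

    `dev_flag` stands for the parsed --dev CLI argument (hypothetical parameter).
    """
    env = os.environ if environ is None else environ
    # Either condition alone is enough to switch to developer mode.
    return dev_flag or "ARKINDEX_WORKER_RUN_ID" not in env
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.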

Parallel between both modes

flowchart TB
    subgraph configure[Configuration step]
        argument_parsing[CLI argument parsing]
    end
    argument_parsing --> is_read_only{IsReadOnly?}
    is_read_only -- Yes --> devMode
    is_read_only -- No --> arkindexMode
    subgraph arkindexMode[Arkindex mode]
        direction TB
        subgraph workerConfiguration[Worker configuration]
            direction TB
            retrieveWorkerRun["API call to RetrieveWorkerRun"] --> userconfig_defaults[Initialize user configuration with default values]
            userconfig_defaults --> load_secrets_API["Load Secrets using API calls to RetrieveSecret"]
            load_secrets_API --> load_user_config[Override user configuration by values set by user]
            load_user_config --> load_model_config["Load model configuration"]
        end
        workerConfiguration --> cacheConfiguration
        subgraph cacheConfiguration[Base worker cache setup]
            direction TB
            get_paths_from_parent_tasks["Retrieve paths of parent tasks' cache databases"] --> initialize_db[Create cache database and its tables]
            initialize_db --> merge_parent_databases[Merge parents databases]
        end
    end

    subgraph devMode[Developer mode]
        direction TB
        subgraph devWorkerConfiguration[Worker configuration]
            direction TB
            configuration_parsing[CLI config argument parsing] --> corpus_id[Read Corpus ID from environment]
            corpus_id --> load_secrets[Load secret in local developer storage]
        end
    end
    classDef pyMeth font-style:italic

Arkindex mode

The details of a worker execution (what is called a WorkerRun) on Arkindex are stored in the backend. The first step of the configuration is to retrieve this information using the Arkindex API. The RetrieveWorkerRun endpoint gives information about:

  • the running process,
  • the configuration parameters that the user may have added from the frontend,
  • the worker used,
  • the version of this worker,
  • the configuration stored in this version,
  • the model version used in this worker execution,
  • the configuration stored in this model version.

This step shows that there are a lot of sources for the actual configuration that the worker can use. In the end, any parameter set by the user must be applied over other known configurations.

Warning

The convention is to always give the final word to the user. This means that when the user configuration is filled, its values must be the last to override the worker’s config attribute. If a model configuration was set, its values must override this attribute before the user configuration’s.

The worker configuration may specify default values for some parameters (see this section for more details about worker configuration). These default values are stored in the user_configuration dictionary attribute.
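As a rough illustration of how defaults could seed the user_configuration dictionary, the sketch below assumes that each user-configurable field is declared as a mapping with an optional "default" key. That declaration format, the function name, and the field names are assumptions made for the example.

```python
from typing import Any, Dict


def default_user_configuration(fields: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Keep only the fields that declare a default value (assumed field format)."""
    return {name: spec["default"] for name, spec in fields.items() if "default" in spec}


# Hypothetical field declarations, for illustration only.
fields = {
    "threshold": {"type": "float", "default": 0.5},
    "language": {"type": "string"},  # no default: left out until the user sets it
}
user_configuration = default_user_configuration(fields)
```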

This is also when the secrets (see this section to learn more about secrets) are actually downloaded. They are stored in the secrets dictionary attribute.

An Arkindex-mode exclusive step is done after all that: the cache setup. Some workers benefit a lot, performance-wise, from having a SQLite cache artifact from previous workers. This is mostly used in processes with multiple workers with dependencies, where the second worker needs the results of the first one to work. The database is initialized, the tables created and its version checked as it must match the one supported by the Arkindex instances. The database is then merged with any other database generated by previous worker runs.
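The create-then-merge sequence can be sketched with the standard sqlite3 module. The single-table schema, the function names, and the file names below are hypothetical and far simpler than a real cache; the version check mentioned above is left out.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Iterable

# Hypothetical single-table schema, for illustration only.
SCHEMA = "CREATE TABLE IF NOT EXISTS elements (id TEXT PRIMARY KEY, type TEXT NOT NULL)"


def init_cache(path: Path) -> None:
    """Create the cache database and its tables."""
    db = sqlite3.connect(str(path))
    try:
        db.execute(SCHEMA)
        db.commit()
    finally:
        db.close()


def merge_parents(path: Path, parent_paths: Iterable[Path]) -> None:
    """Copy rows from each parent database into the current cache."""
    db = sqlite3.connect(str(path))
    try:
        for parent in parent_paths:
            db.execute("ATTACH DATABASE ? AS parent", (str(parent),))
            # Rows whose primary key already exists are skipped.
            db.execute("INSERT OR IGNORE INTO elements SELECT * FROM parent.elements")
            # Commit before detaching: SQLite refuses DETACH inside a transaction.
            db.commit()
            db.execute("DETACH DATABASE parent")
    finally:
        db.close()


# Usage: one parent database holding two rows, merged into a fresh cache.
with tempfile.TemporaryDirectory() as tmp:
    parent_db = Path(tmp) / "parent.sqlite"
    cache_db = Path(tmp) / "db.sqlite"
    init_cache(parent_db)
    conn = sqlite3.connect(str(parent_db))
    conn.executemany("INSERT INTO elements VALUES (?, ?)", [("a", "page"), ("b", "line")])
    conn.commit()
    conn.close()

    init_cache(cache_db)
    merge_parents(cache_db, [parent_db])

    conn = sqlite3.connect(str(cache_db))
    merged_ids = [row[0] for row in conn.execute("SELECT id FROM elements ORDER BY id")]
    conn.close()
```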

Once all information is retrieved and stored in the worker, the configuration is overridden by the model configuration and by the user configuration, if any, in this specific order.
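The override order can be sketched with plain dictionaries. This is a minimal illustration that assumes a shallow merge, where later sources replace whole top-level keys. The function name and the parameter values are hypothetical.

```python
from typing import Any, Dict, Optional


def build_config(
    worker_config: Dict[str, Any],
    model_configuration: Optional[Dict[str, Any]] = None,
    user_configuration: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Apply the model configuration, then the user configuration, over the base one."""
    config = dict(worker_config)  # copy so the base configuration is left untouched
    if model_configuration:
        config.update(model_configuration)
    if user_configuration:
        # Applied last: the user always has the final word.
        config.update(user_configuration)
    return config


config = build_config(
    {"batch_size": 8, "threshold": 0.5},
    model_configuration={"threshold": 0.7, "labels": ["a", "b"]},
    user_configuration={"threshold": 0.9},
)
```

Here threshold is defined by all three sources, and the value from the user configuration is the one that remains.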

Developer mode

In the developer mode, the worker execution is not linked to anything on Arkindex. Therefore, the only configuration the worker can use is provided via the --config CLI argument. It accepts a YAML-formatted file, which should be similar to the configuration section of the worker configuration file, without the user_configuration details. More details about how to create the local worker configuration are available in this section.

The multiple configuration sources of the Arkindex mode are merged into a single one here. The configuration parameters are parsed, as well as the list of required secrets. The secrets are loaded using a local Arkindex client. Again, see the section about local execution for more details.

One piece of information cannot be retrieved directly from the configuration file and is required in some cases: the ID of the Arkindex corpus to which the processed elements belong. This is retrieved via the ARKINDEX_CORPUS_ID environment variable.
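The developer-mode inputs can be sketched with argparse and an environment lookup. The parser below is a stand-in written for this example, not the worker's real parser, and the file name and corpus ID are made up. Only the --dev and --config flags and the ARKINDEX_CORPUS_ID variable come from the text. Parsing the YAML file itself is left out.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def parse_dev_args(argv: Sequence[str]) -> argparse.Namespace:
    """Parse the two CLI arguments discussed above (stand-in parser)."""
    parser = argparse.ArgumentParser(description="Hypothetical worker CLI")
    parser.add_argument("--dev", action="store_true", help="Run in developer mode")
    parser.add_argument("--config", default=None, help="Path to a YAML configuration file")
    return parser.parse_args(argv)


def read_corpus_id(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the corpus ID from the environment, or None when it is not set."""
    env = os.environ if environ is None else environ
    return env.get("ARKINDEX_CORPUS_ID")


# Usage with made-up values.
args = parse_dev_args(["--dev", "--config", "worker.yml"])
corpus_id = read_corpus_id({"ARKINDEX_CORPUS_ID": "1234"})
```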

Setting Debug logging level

There are three ways to activate the debug mode:

  • the --verbose CLI argument,
  • setting the ARKINDEX_DEBUG environment variable to True,
  • setting "debug": True in the worker’s configuration via any configuration source.
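The three triggers can be combined as in the sketch below. The function names are hypothetical, and two details are assumptions: the environment variable is compared case-insensitively to the string "true", and INFO is used as the fallback level.

```python
import logging
import os
from typing import Any, Mapping, Optional


def debug_requested(
    verbose: bool,
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True when any of the three debug triggers is active."""
    env = os.environ if environ is None else environ
    # Assumption: the variable's value is matched case-insensitively against "true".
    env_flag = env.get("ARKINDEX_DEBUG", "").strip().lower() == "true"
    return bool(verbose) or env_flag or bool(config.get("debug"))


def apply_log_level(logger: logging.Logger, debug: bool) -> None:
    """Switch the logger to DEBUG when requested (INFO otherwise, as an assumption)."""
    logger.setLevel(logging.DEBUG if debug else logging.INFO)


# Usage with a throwaway logger.
example_logger = logging.getLogger("worker_example")
apply_log_level(example_logger, debug_requested(False, {"debug": True}, {}))
```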

Important class attributes

Many attributes are set on the worker during the configuration stage. Here is a non-exhaustive list with some details about their source and their usage.

api_client
The Arkindex API client used by the worker to make requests. Prefer the many available helpers over calling this attribute directly; the exception is for endpoints where no helper is available.
args
The arguments passed via the CLI. This is used to trigger the Developer mode via --dev, to specify the configuration file via --config and to list elements to process via --element.
config
A dictionary with the worker’s configuration. This is filled by the worker run’s configuration, the worker version’s and the model version’s, if there is one. Once loaded, it is overridden, in this order, by the model_configuration and the user_configuration.
corpus_id
The ID of the corpus linked to the current process. This is mostly needed when publishing objects linked to a corpus like Entities. You may set it in developer mode via the ARKINDEX_CORPUS_ID environment variable.
is_read_only
This is the computed property that determines which mode should be used. The Developer mode prevents any actual publication on Arkindex, hence the name read_only.
model_configuration
The parsed configuration as stored in the ModelVersion object on Arkindex.
model_version_id
The ID of the model version linked to the current WorkerRun object on Arkindex. You may set it in developer mode via the ARKINDEX_MODEL_VERSION_ID environment variable.
model_details
The details of the model for the model version linked to the current WorkerRun object on Arkindex. You may populate it in developer mode via the ARKINDEX_MODEL_ID environment variable.
process_information
The details of the parent process of this worker execution. Only set in Arkindex mode.
secrets
A dictionary mapping each secret name to its parsed content.
use_cache
Whether the cache optimization is available or not.
user_configuration
The parsed configuration as the user entered it via the Arkindex frontend. Any parameter not specified will be filled with its default value if there is one.
worker_details
The details of the worker used in this execution.
worker_run_id
The ID of the corresponding WorkerRun object on the Arkindex instance. In Arkindex mode, this is used in the RetrieveWorkerRun API call to retrieve the configuration and other necessary information. In developer mode, this is neither set nor used.