
Releases

0.4.0

Released on 11 December 2024 • View on Gitlab

Breaking changes

# Old way
from apistar.exceptions import ErrorResponse

# New way
from arkindex.exceptions import ErrorResponse
  • The method BaseWorker.request has been removed. Developers should rely on BaseWorker.api_client.request instead.

Project architecture

Arkindex API

  • The create_iiif_url helper has been added to create an image from an existing IIIF image by URL, using the CreateIIIFURL endpoint.
  • The list_elements helper has been added to list elements in the current project, using the ListElements endpoint.
  • The create_element_children helper has been added to link multiple elements to a parent element at once, using the CreateElementChildren endpoint.
  • The list_corpus_types helper has been added to list element types in the current project and store them as an instance attribute.
  • The download_export helper has been added to download a project SQLite export, using the DownloadExport endpoint.
  • The download_latest_export helper has been added to download the latest SQLite export of a project.
  • The list_process_elements helper has been added to list the elements of the current process, using the ListProcessElements endpoint.

Processing

  • ElementsWorker supports processing dataset sets.
  • ElementsWorker now supports Export processes, introduced in the latest Arkindex release.
  • Most bulk endpoints now publish their results in batches, to avoid sending overly large queries to Arkindex at once. The default batch size is 50, but a larger value can be set through the batch_size argument of the helper.
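
As a rough illustration of the batching described above, the sketch below splits a list of payloads into chunks before each one would be sent. The make_batches name and the payload shape are hypothetical and not part of arkindex-base-worker; only the default of 50 comes from the release note.

```python
from collections.abc import Iterator, Sequence


def make_batches(items: Sequence, batch_size: int = 50) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# 120 fake payloads are split into three batches.
payloads = [{"name": f"element-{i}"} for i in range(120)]
batch_sizes = [len(batch) for batch in make_batches(payloads)]
```

Each batch would then be published with its own API call, so a failure only affects the items of that batch.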

Worker template

  • Workers now rely on the value set in the mandatory field docker.command in the YAML configuration to know each worker’s command. The CMD statement in the Dockerfile is no longer needed and should be removed.

Documentation

Misc

  • Workers and arkindex-base-worker now support Python 3.12.
  • A new pre-commit hook to report test files with too many lines is now added by default in new workers.
  • Pillow has an image size limit to avoid “decompression bombs”. To still be able to process very large images, this limit can be increased through the ARKINDEX_MAX_IMAGE_PIXELS environment variable.
  • Some tools limit the size of an image on disk rather than its dimensions. When an image is too large, the new resized_images function can generate downsized versions of it, to be used until the image is small enough on disk.
  • A new helper is available to automatically pluralize some words. This is mostly helpful in the log messages a worker might send. The default behaviour consists of adding an ‘s’ at the end, but some exceptions are supported, like “entity” and “child”.

    # Old way
    logger.info(f"Published {transcriptions_count} transcription{'s' if transcriptions_count > 1 else ''}")
    
    # New way
    from arkindex.utils import pluralize
    logger.info(f"Published {transcriptions_count} {pluralize('transcription', transcriptions_count)}")
    
  • The Teklia CA certificate is no longer needed in the Docker images of the worker. The Dockerfile can be updated accordingly.

    Dockerfile
    WORKDIR /src
    
    - # Install curl
    - ENV DEBIAN_FRONTEND=non-interactive
    - RUN apt-get update -q -y && apt-get install -q -y --no-install-recommends curl
    
    # Install worker as a package
    ...
    
    - # Add archi local CA
    - RUN curl https://assets.teklia.com/teklia_dev_ca.pem > /usr/local/share/ca-certificates/arkindex-dev.crt && update-ca-certificates
    - ENV REQUESTS_CA_BUNDLE /etc/ssl/certs/ca-certificates.crt
    
  • The CLI arguments --element and --elements-list used to convert element IDs to different types (uuid.UUID versus str). Both now convert to str.
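
The pluralization rule mentioned in this list (append an ‘s’, with exceptions such as “entity” and “child”) can be sketched as follows. This is a simplified stand-in named pluralize_sketch, not the library's implementation; the exception table and the choice to treat only a count of 1 as singular are assumptions.

```python
# Hypothetical exception table; the release note only names "entity" and "child".
IRREGULAR_PLURALS = {"entity": "entities", "child": "children"}


def pluralize_sketch(singular: str, count: int) -> str:
    """Return the singular form for a count of 1, a plural form otherwise."""
    if count == 1:
        return singular
    return IRREGULAR_PLURALS.get(singular, f"{singular}s")


message = f"Published 3 {pluralize_sketch('entity', 3)}"
```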

0.3.7post1

Released on 23 May 2024 • View on Gitlab

Breaking changes

  • The dependency on peewee has been loosened to support any patch release of the 3.17 cycle. This has been done knowing that this library does not introduce massive breaking changes in its patch releases.

Hotfix

  • The bump to teklia-toolbox==0.1.4 broke support for
    • offline (no access to Internet) workers,
    • Arkindex instances that do not have valid SSL certificates (impacts Arkindex developers).

This release fixes both issues.

0.3.7

Released on 16 April 2024 • View on Gitlab

Breaking changes

  • This release updates the internal behavior of DatasetWorker, meant to process dataset sets, to accommodate the changes introduced by Arkindex 1.6.0.
  • The create_metadatas helper has been renamed to create_metadata_bulk. Make sure to update existing imports.
  • The model version configuration and the user configuration are now updated at the very end of ElementsWorker.configure and DatasetWorker.configure. This means that there is no need to do it in workers.
# worker.py

class MyWorker(ElementsWorker):
    def configure(self):
-       # Retrieve the model configuration
-       if self.model_configuration:
-           self.config.update(self.model_configuration)
-
-       # Retrieve the user configuration
-       if self.user_configuration:
-           self.config.update(self.user_configuration)

        # Rest of configuration
        ...
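
To make the resulting precedence concrete, here is a minimal sketch of the merge that the diff above removes from worker code. The merge_configuration name and the sample keys are hypothetical; the order (model configuration first, then user configuration) simply mirrors the removed lines, so user values win on conflicts.

```python
def merge_configuration(config, model_configuration=None, user_configuration=None):
    """Return a copy of config updated by the model, then the user configuration."""
    merged = dict(config)
    if model_configuration:
        merged.update(model_configuration)
    if user_configuration:
        merged.update(user_configuration)
    return merged


merged = merge_configuration(
    {"threshold": 0.5, "language": "en"},
    model_configuration={"threshold": 0.7},
    user_configuration={"language": "fr"},
)
```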

Project architecture

  • The migration started in 0.3.6 is now finished and all project dependencies are now stored in pyproject.toml for both arkindex-base-worker and new workers, through the template.

Arkindex API

  • The create_classifications helper has been updated to use the right parameter of the CreateClassifications endpoint. Missing ML classes are now created automatically, as in create_classification.
  • The DatasetMixin has been updated following changes to Arkindex’s dataset processes.
  • The details of the loaded model are now always stored in the model_details attribute.
  • The TrainingMixin exposes a new property, is_finetuning, to know if the worker has a model version set. This is helpful for training workers, to know if they are fine-tuning an existing model.
  • Arkindex has deprecated the usage of worker_version in many endpoints. This change has been reflected in affected endpoints. Support for the equivalent worker_run argument has been added where it was missing.
  • The load_parents parameter is now exposed on the list_element_metadata helper.
  • There is an issue with the ValidateModelVersion endpoint in the latest Arkindex releases. This endpoint may return HTTP errors (codes 403 or 500) even though the model version has been successfully updated. To avoid raising false errors, a warning is logged when that happens and the worker’s processing will no longer stop at that exception.

Worker template

  • The worker template has been updated:
    • default values for author and email,
    • worker Docker images have been renamed to make registry cleanup policies easier to write
      • tags are now named after the commit SHA: commit-$CI_COMMIT_SHORT_SHA (see GitLab’s documentation to learn about this variable),
      • and corresponding cleanup policy regex is commit-.*.
    • the type key in YAML configurations has been removed.
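
The tag naming scheme and its cleanup pattern can be checked with a few lines of Python. The sample tags and the is_cleanup_candidate helper are made up for illustration, and how the registry anchors the regex is an assumption (a full match is used here).

```python
import re

# Pattern quoted in the release note for the registry cleanup policy.
CLEANUP_PATTERN = re.compile(r"commit-.*")


def is_cleanup_candidate(tag: str) -> bool:
    """Tell whether a Docker image tag is matched in full by the cleanup pattern."""
    return CLEANUP_PATTERN.fullmatch(tag) is not None


tags = ["commit-1a2b3c4d", "0.3.7", "latest"]
candidates = [tag for tag in tags if is_cleanup_candidate(tag)]
```

Release tags such as 0.3.7 are left out of the match, so a policy built on this pattern would only target per-commit images.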

Documentation

  • A new section explaining how to publish a worker to an Arkindex instance has been added.

Misc

  • A summary message is now logged at the end of the run method, even if no error was encountered during processing.
  • A new helper was added to parse source arguments, mostly used for worker_version and worker_run arguments. To filter manual sources, the Arkindex API expects the False value. This helper maps "manual" to this value.
  • A new helper to upload a Pillow image has been added.
  • SSL verification is now skipped for Arkindex local development hosts. This only affects instances whose URL matches the pattern *ark.localhost.
  • A warning is now logged when calling a helper that doesn’t support cache.
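
The mapping performed by the source-parsing helper mentioned in this list can be pictured as below. parse_source_sketch is a hypothetical name, and returning other values unchanged is an assumption; only the conversion of "manual" to False is stated in the note.

```python
def parse_source_sketch(source):
    """Map the "manual" keyword to False, leave any other value untouched."""
    if source == "manual":
        return False
    return source


parsed = [parse_source_sketch(value) for value in ("manual", "some-worker-run-id")]
```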

0.3.6

Released on 22 Dec 2023 • View on Gitlab

Breaking changes

  • The arkindex_worker.git module was removed. It was not used locally by any worker and only exposed some workflows from python-gitlab. Please refer to the python-gitlab documentation if your worker needs to communicate with a Git instance.
  • Following Arkindex’s 1.5.3 release, the model_usage configuration parameter has been updated to a tri-enum. To migrate your workers:

    • model_usage: false becomes model_usage: disabled
    • model_usage: true becomes model_usage: required

    The supported value means that the model is supported by a worker but not required to make it work.
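
A small migration sketch for the values above, assuming the configuration has already been loaded into Python values. The migrate_model_usage function is hypothetical and only encodes the two conversions listed in this note.

```python
# Mapping stated in the release note; "supported" has no boolean equivalent.
MODEL_USAGE_MIGRATION = {False: "disabled", True: "required"}


def migrate_model_usage(value):
    """Convert a legacy boolean model_usage into its new string value."""
    if isinstance(value, bool):
        return MODEL_USAGE_MIGRATION[value]
    if value in ("disabled", "supported", "required"):
        # Already migrated, keep as-is.
        return value
    raise ValueError(f"Unsupported model_usage value: {value!r}")


migrated = [migrate_model_usage(value) for value in (False, True, "supported")]
```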

Project architecture

  • PEP 621 encourages users to store most of the package’s metadata in pyproject.toml. We followed this recommendation both for the arkindex-worker package and the worker template.

Arkindex API

  • The details of the model available to the worker are now stored under the model_details attribute.
  • The list_corpus_entities API helper now stores the entities in the entities attribute instead of returning them.
  • A reminder was added to prevent making changes to the Arkindex Cache schema without bumping the Version of said cache.
  • Each dataset’s archive is now properly deleted after processing.
  • The path to a Dataset’s archive is now stored under the filepath property.
  • The new create_element_parent API helper allows creating a link between two elements.
  • The create_sub_element helper was updated to support creating child elements without zones and under a parent without a zone.
  • A new user configuration type was introduced to be able to select Arkindex Models. Learn more about it in the documentation.

Worker template

  • When the provided slug had more than one word, it was invalid for either:

    • the package name, because the user used _ as word delimiter,
    • the module directory’s name, because the user used - as word delimiter.

    The package name and the module directory’s name are now both computed from the slug, making sure that:

    • the package name uses - as word delimiter,
    • the module directory’s name uses _ as word delimiter.
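
The derivation of both names from a single slug can be sketched as below. The names_from_slug function is hypothetical and the template may apply further rules (prefixes, case changes) that are not covered here; only the choice of delimiters comes from the note.

```python
import re


def names_from_slug(slug: str) -> tuple[str, str]:
    """Return (package name, module directory name) derived from a slug."""
    # Split on any mix of dashes, underscores and spaces.
    words = [word for word in re.split(r"[-_\s]+", slug.strip()) if word]
    return "-".join(words), "_".join(words)


package_name, module_name = names_from_slug("my_demo-worker")
```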

Documentation

  • A link to the documentation was added:

    • in the README,
    • as a GitLab badge on the repo.
  • Some sections in the documentation were renamed to improve readability.

Misc

0.3.5

Released on 8 Nov 2023 • View on Gitlab

Breaking changes

  • The arkindex_worker.reporting module has been removed as the JSON report file was no longer needed.
  • The --model-dir CLI argument was renamed to --extras-dir as it was more suited to its use. This folder now stores dataset archives, hence the more generic name.

Arkindex API

  • Following the Arkindex 1.5.2 release,
    • new helpers for Task-related endpoints were introduced,
    • a new worker class is available to support Dataset processes,
    • new helpers for Dataset-related endpoints were introduced.
  • Added a unicity check on the input of the create_transcription_entities helper.
  • The partial_update_element helper was updated to better match the endpoint.

Documentation

  • Some modules were poorly displayed in the documentation. Class methods are now only listed under their class’s section.

Release Management

  • A Makefile was added to the worker template to deploy new releases more easily. It expects the default branch to be master; make sure to change it to main depending on your settings.
  • The base image used in the worker’s Docker image was changed from python:3.11 to python:3.11-slim, in an effort to reduce its size.

Misc

  • During the configuration stage, a summary of the worker is now logged instead of the revision’s hash. This was changed to support workers not linked to any revision on Arkindex.
  • A retry mechanism on HTTP 50x errors was added. Additionally, when the requested size exceeds the maximum size allowed by the IIIF server, the request is retried with max instead of full as the size parameter. More information about these parameters is available in the IIIF documentation.
  • When running the worker locally without the ARKINDEX_CORPUS_ID variable set in the environment, an explicit exception will be raised when trying to access the corpus_id attribute.
  • This release adds support for Python 3.12.
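
The fallback from full to max described in this list can be sketched as follows. The ImageTooLarge exception, the fake_fetch function and the example URL are all stand-ins invented for the illustration; the URL layout follows the IIIF Image API ({region}/{size}/{rotation}/{quality}.{format}), and this is not the worker's actual download code.

```python
class ImageTooLarge(Exception):
    """Stand-in for the error raised when the IIIF server rejects the size."""


def iiif_url(image_base_url: str, size: str) -> str:
    """Build a IIIF Image API URL for the whole region at the given size."""
    return f"{image_base_url.rstrip('/')}/full/{size}/0/default.jpg"


def download_with_fallback(image_base_url, fetch):
    """Request the `full` size first, then `max` if the server refuses it."""
    try:
        return fetch(iiif_url(image_base_url, "full"))
    except ImageTooLarge:
        return fetch(iiif_url(image_base_url, "max"))


def fake_fetch(url):
    # Simulated server that only accepts the `max` size keyword.
    if "/full/full/" in url:
        raise ImageTooLarge(url)
    return url


result = download_with_fallback("https://iiif.example.com/image-1", fake_fetch)
```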

0.3.4

Released on 14 Sept 2023 • View on Gitlab

  • The worker template was updated to correctly install Git submodules if it depends on any.
  • Base-worker now uses ruff for linting. This tool replaces isort and flake8.
  • New Arkindex API helper to update an element, calling PartialUpdateElement.
  • New Arkindex API helper to list an element’s parents, calling ListElementParents.
  • Worker Activity API is now disabled when the worker runs in read-only mode instead of relying on the --dev CLI argument. The update_activity API helper was updated following Arkindex 1.5.1 changes.
  • Workers can now resize the image of an element when opening it. This uses the IIIF resizing API.

0.3.3

Released on 26 May 2023 • View on Gitlab

  • The Timer class previously defined in arkindex_worker.utils was removed as it was already defined in Teklia’s Python toolbox.
# Old usage
from arkindex_worker.utils import Timer
# New usage
from teklia_toolbox.time import Timer
  • The create_element_transcriptions API helper now accepts an element_confidence float field in the dictionaries provided through the transcriptions field. This confidence will be set on the created element.
  • More query filters are available on the list_element_children API helper. More details about their usage are available in the documentation:
    • transcription_worker_version
    • transcription_worker_run
    • with_metadata
    • worker_run
  • Arkindex Base-Worker now fully uses pathlib to handle filesystem paths as suggested by PEP 428.
  • Many helpers were added to handle ZSTD and TAR archives as well as delete files cleanly. More details about that in the documentation of the arkindex_worker.utils module.
  • A bug affecting the parsing of the configuration of workers that use a Machine learning model stored on an Arkindex instance was fixed.

0.3.2

Released on 8 March 2023 • View on Gitlab

  • A helper to use the new API endpoint to create transcription entities more efficiently was implemented.
  • Training workers may now publish a model configuration when creating a new model version on Arkindex. This will make the execution of a generic worker much smoother.
  • The model version API endpoints were updated in the latest Arkindex release and a new helper was introduced subsequently. However, there are no breaking changes and the main helper, publish_model_version, still has the same signature and behaviour.
  • The latest Arkindex release changed the way NER entities are stored and published.
    • The EntityType enum was removed as type slugs are no longer restricted to a small set of options,
    • create_entity now expects a type slug as a String,
    • a new helper list_corpus_entity_types was added to load the Entity types in the corpus,
    • a new helper check_required_entity_types to make sure that needed entity types are available in the corpus was added. Missing ones are created by default (this can be disabled).
  • The create_classifications helper now expects the UUID of each MLClass instead of their name.
  • In developer mode, the only way to set the corpus_id attribute is to use the ARKINDEX_CORPUS_ID environment variable. When it’s not set, all API requests using the corpus_id as a path parameter will fail with a 500 status code. A warning log was added to help developers troubleshoot this error by advising them to set this variable.
  • The create_transcriptions helper no longer makes the API call in developer mode. This behaviour aligns with all other publication helpers.
  • Fixes hash computation when publishing a model using publish_model_version.
  • If a process is linked to a model version, its id will be available to the worker through its model_version_id attribute.
  • The URLs of the API endpoint related to Ponos were changed in the latest Arkindex release. Some changes were needed in the test suite.
  • The classes attribute now directly contains the classes of the corpus of the processed element.
# Old usage
self.classes = {
    "corpus_id": {
        "ml_class_1": "class_uuid",
        ...
    }
}

# New usage
self.classes = {
    "ml_class_1": "class_uuid",
    ...
}
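
For code that still holds data in the old nested shape, the conversion to the new flat shape is a single lookup. The flatten_classes function and the sample identifiers below are hypothetical and only illustrate the change in structure.

```python
def flatten_classes(old_classes: dict, corpus_id: str) -> dict:
    """Extract the name-to-UUID mapping of one corpus from the old structure."""
    return dict(old_classes.get(corpus_id, {}))


old_classes = {"corpus-1": {"ml_class_1": "uuid-1", "ml_class_2": "uuid-2"}}
new_classes = flatten_classes(old_classes, "corpus-1")

# Old lookup: old_classes["corpus-1"]["ml_class_1"]
# New lookup: new_classes["ml_class_1"]
class_uuid = new_classes["ml_class_1"]
```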

0.3.1

Released on 8 November 2022 • View on Gitlab

  • A breaking change, affecting mostly the API, was introduced in Arkindex’s 1.3.4 release:
    • Workers were mostly unaffected but the REST schema was updated.
  • Workers will progressively not be able to publish results with a worker_version_id anymore on Arkindex. They will have to use a related but more general field, worker_run_id:
    • Most publication API endpoint helpers have been updated accordingly,
    • A new version of the cache was released with the updated Django models.
  • Improvements to our Machine Learning training API to allow workers to use models published on Arkindex.
  • Support workers that have no configuration.
  • Allow publishing metadata with falsy but non-null values.
  • Add .polygon attribute shortcut on Element.
  • Add a major test speedup on our worker template.
  • Support cache usage on our metadata API endpoint helpers.
  • Drop support for Python 3.6 and add support for Python 3.11.
  • Update arkindex-client to version 1.0.11.
  • Update shapely to version 1.8.5-post1

0.3.0

Released on 12 September 2022 • View on Gitlab

  • A large refactoring effort was made on the worker initialization, to streamline most of the workflow:
    • developer setup is now set in a dedicated method configure_for_developers
    • cache setup is now set in a dedicated method configure_cache
    • deprecated useless attribute features
    • add a simpler debug mode for developers
    • depend only on Arkindex RetrieveWorkerRun API to get all the information needed, instead of relying on multiple API calls.
    • remove ARKINDEX_CORPUS_ID environment variable usage, replaced by corpus information from API, except for developers
    • do not erase defaults when reading configuration
  • Support new Machine Learning training APIs on Arkindex to allow workers to create model versions and publish them as zstandard archives on a remote S3-compatible bucket.
  • Add API helpers
    • list_corpus_entities
    • create_metadatas
    • list_metadata
    • list_transcription_entities
    • create_required_types
    • publish_model_version
    • create_model_version
    • upload_to_s3
  • Create missing element types when checking if they are available on the Arkindex instance (disabled by default).
  • Update arkindex-client to version 1.0.9.
  • Update automated rotation code (revert_orientation) to support reverse application

0.2.4

Released on 6 July 2022 • View on Gitlab

  • Document source code using Sphinx and docstrings with parameters. Documentation is available here.
  • Update workers inner config with default values from user_configuration
  • Support confidence in API helpers create_sub_element and create_elements as they are now available in Arkindex
  • Port rotation code from tesseract worker
  • Add helper to trim polygons so that they fit inside their image
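
One simple way to picture polygon trimming is to clamp every point to the image bounds, as in the sketch below. trim_polygon_sketch is a hypothetical function; the real helper may validate its input or handle bounds differently (using the width and height themselves as upper bounds is an assumption).

```python
def trim_polygon_sketch(polygon, image_width, image_height):
    """Clamp every (x, y) point of a polygon to the image bounds."""
    return [
        [min(max(x, 0), image_width), min(max(y, 0), image_height)]
        for x, y in polygon
    ]


# A rectangle overflowing a 100x80 image on the left, right and bottom sides.
trimmed = trim_polygon_sketch([[-5, 10], [120, 10], [120, 90], [-5, 90]], 100, 80)
```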

0.2.3

Released on 28 March 2022 • View on Gitlab

  • Update arkindex-client to version 1.0.8.
  • Replace all transcription scores with confidences (also renamed on Arkindex)
  • Support cache versioning and detect compatibility in workers
  • Support confidence in create_transcription_entity API helper
  • Support Text orientation for transcriptions
  • Return the response payload in all creation helpers so that workers can use them
  • Support new metadata type URL

0.2.2

Released on 17 September 2021 • View on Gitlab

  • Update arkindex-client to version 1.0.7.
  • Detect already processed elements using worker activity, and skip them
  • Support rotation, mirroring and fix image crop in open_image method used by a lot of workers
  • Change default value for user_configuration from None to {} which simplifies usage code in workers
  • Support new metadata type Numeric
  • Add API helper create_classifications
  • Set worker version in transcription entities API helpers

0.2.1

Released on 30 June 2021 • View on Gitlab

  • Add API helper check_required_types
  • Add a developer mode via --dev argument to simplify boot process for local development
  • Send process_id when updating worker activities
  • Remove nb_best from ML classes list as it’s not supported anymore by Arkindex

0.2.0

Released on 6 May 2021 • View on Gitlab

This is a larger release which brings a new caching system to share data across workers (avoiding a lot of API calls in some workflows), and splits the codebase into multiple files for helpers & unit tests (one file per topic).

  • Add cache system using a local SQLite database, shared between workers. Currently supports Arkindex models:
    • elements and their hierarchy,
    • transcriptions,
    • images,
    • classifications,
    • entities.
  • Add API helpers:
    • create_elements
    • create_transcriptions
    • create_transcription_entity
  • Split ElementsWorker API helpers and unit tests in sub files
  • Drop TranscriptionType & DataSource as they are not used anymore in Arkindex
  • Retry all managed API calls that result in a 50x
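
The retry behaviour on 50x responses can be illustrated with the standard library only. FakeHTTPError, call_with_retry and the attempt count are assumptions made for this sketch; it does not reproduce the mechanism actually used by the base worker.

```python
import time


class FakeHTTPError(Exception):
    """Minimal stand-in for an API error carrying an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_retry(request, attempts=3, delay=0.0):
    """Call request(), retrying only when it fails with a 50x status code."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except FakeHTTPError as error:
            is_server_error = 500 <= error.status_code < 600
            # Client errors and the last failed attempt are re-raised.
            if not is_server_error or attempt == attempts:
                raise
            time.sleep(delay)


calls = []


def flaky_request():
    # Fails twice with a 502, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise FakeHTTPError(502)
    return "ok"


outcome = call_with_retry(flaky_request)
```

Client errors such as 400 or 404 are deliberately not retried, since repeating the same request would not change the result.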

0.1.14

Released on 8 April 2021 • View on Gitlab

  • Support weak SSL DH key when downloading images (needed for some outdated IIIF servers with old SSL certs).

0.1.13

Released on 2 March 2021 • View on Gitlab

  • Support new Arkindex feature Worker Activity, to track process progress.
  • Add new API helpers:
    • list_element_children
    • list_transcriptions
    • create_metadata
  • Extend git support with merge & rebase operations
  • Allow any worker type in cookiecutter template

0.1.12

Released on 8 December 2020 • View on Gitlab

  • Bugfix to avoid loading remote images from local file system
  • Deprecate TranscriptionType.

0.1.11

Released on 26 November 2020 • View on Gitlab

0.1.10

Released on 23 November 2020 • View on Gitlab

  • Support git base operations to allow workers to clone and checkout repositories
  • Setup automated CI task to update Python dependencies
  • Update arkindex-client to version 1.0.5.

0.1.9

Released on 19 October 2020 • View on Gitlab

  • Update arkindex-client to version 1.0.4.
  • Add API helpers:
    • get_worker_version
    • get_worker_version_slug
    • get_ml_result_slug

0.1.8

Released on 30 September 2020 • View on Gitlab

0.1.7

Released on 30 September 2020 • View on Gitlab

  • Support Arkindex secrets for workers, using API but also local storage for developers. More information on Arkindex documentation.
  • Do not crash when a worker tries to create a classification that already exists.

0.1.6

Released on 23 September 2020 • View on Gitlab

  • Automatically create missing Arkindex ML classes when using get_ml_class_id and creating classifications through API helpers.
  • Update arkindex-client to version 1.0.2.

0.1.5

Released on 22 September 2020 • View on Gitlab

  • Update arkindex-client to version 1.0.1.
  • Bugfix on score & confidence type checks in API helpers

0.1.4

Released on 2 September 2020 • View on Gitlab

  • Load worker configuration from Arkindex API, or local file (for developers)
  • Add API helpers:
    • load_corpus_classes
    • get_ml_class_id

0.1.3

Released on 25 August 2020 • View on Gitlab

  • Add API helper create_element_transcriptions
  • Return created instance ID in API helpers
  • Add cookiecutter variables to be able to easily rebuild

0.1.2

Released on 19 August 2020 • View on Gitlab

  • Use WORKER_VERSION_ID environment var in helper methods to identify the worker automatically
  • Add API helpers:
    • create_transcription
    • create_classification
    • create_entity
  • Extend cookiecutter template to generate clean Python packages
  • Add the Timer helper class in tools submodule

0.1.1

Released on 7 August 2020 • View on Gitlab

  • Add API helper create_sub_element
  • Add unit tests in cookiecutter template & base project.
  • Change cookiecutter base to use ElementsWorker

0.1.0

Released on 21 July 2020 • View on Gitlab

Initial version of the base worker, with cookiecutter support to easily create workers using this project.