Releases¶
0.4.0¶
Released on 11 December 2024 • View on Gitlab
Breaking changes¶
- The Arkindex API Client library no longer depends on apistar. Some imports should be updated, most notably:
# Old way
from apistar.exceptions import ErrorResponse
# New way
from arkindex.exceptions import ErrorResponse
- The method `BaseWorker.request` has been removed. Developers should rely on `BaseWorker.api_client.request` instead.
Project architecture¶
Arkindex API¶
- The `create_iiif_url` helper has been added to create an image from an existing IIIF image by URL, using the `CreateIIIFURL` endpoint.
- The `list_elements` helper has been added to list elements in the current project, using the `ListElements` endpoint.
- The `create_element_children` helper has been added to link multiple elements to a parent element at once, using the `CreateElementChildren` endpoint.
- The `list_corpus_types` helper has been added to list element types in the current project and store them as an instance attribute.
- The `download_export` helper has been added to download a project SQLite export, using the `DownloadExport` endpoint.
- The `download_latest_export` helper has been added to download the latest SQLite export of a project.
- The `list_process_elements` helper has been added to list the elements of the current process, using the `ListProcessElements` endpoint.
Processing¶
- `ElementsWorker` supports processing dataset sets.
- `ElementsWorker` now supports `Export` processes, introduced in the latest Arkindex release.
- Most bulk endpoints now publish their results in batches, to avoid sending overly large queries to Arkindex at once. The default batch size is `50`, but a larger value can be set through the `batch_size` argument of the helper.
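The batching behaviour described above can be pictured with a minimal sketch. It is not the library's implementation: `make_batches` and `publish_in_batches` are hypothetical names, and `send` stands in for whatever call publishes one batch. Only the default of 50 and the `batch_size` argument name come from these notes.

```python
from collections.abc import Callable, Iterator, Sequence

DEFAULT_BATCH_SIZE = 50  # default value stated in the release notes


def make_batches(items: Sequence, batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def publish_in_batches(
    items: Sequence,
    send: Callable[[Sequence], list],
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> list:
    """Call `send` once per batch and gather the results in order.

    `send` is a hypothetical stand-in for a single bulk API call.
    """
    results = []
    for batch in make_batches(items, batch_size):
        results.extend(send(batch))
    return results
```

With 120 items and the default size, `send` would be called three times, with 50, 50 and 20 items.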
Worker template¶
- Workers now rely on the value set in the mandatory field `docker.command` in the YAML configuration to know each worker’s command. The `CMD` statement in the `Dockerfile` is no longer needed and should be removed.
Documentation¶
- The section Run your worker locally was updated.
Misc¶
- Workers and `arkindex-base-worker` now support Python 3.12.
- A new pre-commit hook to report test files with too many lines is now added by default in new workers.
- Pillow has an image size limit to avoid “decompression bombs”. To still be able to process very large images, this limit can be increased through the `ARKINDEX_MAX_IMAGE_PIXELS` environment variable.
- Some tools have an image disk size limit instead of a dimension limit. When the image is too large, a new function `resized_images` can generate downsized versions of the image, to be used until one is small enough in terms of disk size.
- A new helper is available to automatically `pluralize` some words. This is mostly helpful in the log messages a worker might send. The default behaviour is to add an ‘s’ at the end, but some exceptions are supported, like “entity” and “child”.

# Old way
logger.info(f"Published {transcriptions_count} transcription{'s' if transcriptions_count > 1 else ''}")
# New way
from arkindex.utils import pluralize
logger.info(f"Published {transcriptions_count} {pluralize('transcription', transcriptions_count)}")
- The Teklia CA certificate is no longer needed in the Docker images of the worker. The `Dockerfile` can be updated accordingly.

WORKDIR /src

- # Install curl
- ENV DEBIAN_FRONTEND=non-interactive
- RUN apt-get update -q -y && apt-get install -q -y --no-install-recommends curl

# Install worker as a package
...

- # Add archi local CA
- RUN curl https://assets.teklia.com/teklia_dev_ca.pem > /usr/local/share/ca-certificates/arkindex-dev.crt && update-ca-certificates
- ENV REQUESTS_CA_BUNDLE /etc/ssl/certs/ca-certificates.crt
- The CLI arguments `--element` and `--elements-list` were converting the element IDs to different types: `uuid.UUID` versus `str`. They now both convert to `str`.
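The behaviour of the `pluralize` helper mentioned in this list can be approximated with a short sketch. This is not the library's code: `pluralize_sketch` and `IRREGULAR_PLURALS` are hypothetical names, only the two exceptions named in these notes are included, and treating a count of 0 as plural is an assumption.

```python
# Irregular plurals named in the release notes; the real helper may know more.
IRREGULAR_PLURALS = {"entity": "entities", "child": "children"}


def pluralize_sketch(singular: str, count: int) -> str:
    """Return `singular` unchanged for a count of 1, a plural form otherwise.

    Assumption: any count other than 1 (including 0) uses the plural form.
    """
    if count == 1:
        return singular
    # Fall back to the default rule: append an "s"
    return IRREGULAR_PLURALS.get(singular, singular + "s")


for count in (1, 3):
    print(f"Published {count} {pluralize_sketch('transcription', count)}")
```

Keeping the irregular forms in a dictionary means a new exception is a one-line change, and the default rule never needs to be touched.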
0.3.7post1¶
Released on 23 May 2024 • View on Gitlab
Breaking changes¶
- The dependency on `peewee` has been loosened to support any patch release of the `3.17` cycle. This was done knowing that this library does not introduce massive breaking changes in its patch releases.
Hotfix¶
- The bump to `teklia-toolbox==0.1.4` broke support for:
  - offline (no access to Internet) workers,
  - Arkindex instances that do not have valid SSL certificates (impacts Arkindex developers).

  This release fixes both issues.
0.3.7¶
Released on 16 April 2024 • View on Gitlab
Breaking changes¶
- This release updates the internal behavior of `DatasetWorker`, meant to process dataset sets, to accommodate the changes introduced by Arkindex 1.6.0.
- The `create_metadatas` helper has been renamed to `create_metadata_bulk`. Make sure to update existing imports.
- The model version configuration and the user configuration are now updated at the very end of `ElementsWorker.configure` and `DatasetWorker.configure`. This means that there is no need to do it in workers.
# worker.py
class MyWorker(ElementsWorker):
    def configure(self):
-       # Retrieve the model configuration
-       if self.model_configuration:
-           self.config.update(self.model_configuration)
-
-       # Retrieve the user configuration
-       if self.user_configuration:
-           self.config.update(self.user_configuration)
        # Rest of configuration
        ...
Project architecture¶
- The migration started in 0.3.6 is now finished and all project dependencies are now stored in `pyproject.toml` for both `arkindex-base-worker` and new workers, through the template.
Arkindex API¶
- The `create_classifications` helper has been updated to use the right parameter of the `CreateClassifications` endpoint. Missing ML classes are now created automatically, as in `create_classification`.
- The `DatasetMixin` has been updated following changes to Arkindex’s dataset processes.
- The details of the loaded model are now always stored in the `model_details` attribute.
- The `TrainingMixin` exposes a new property, `is_finetuning`, to know if the worker has a model version set. This is helpful for training workers, to know if they are fine-tuning an existing model.
- Arkindex has deprecated the usage of `worker_version` in many endpoints. This change has been reflected in affected endpoints. Support for the equivalent `worker_run` argument has been added where it was missing.
- The `load_parents` parameter is now exposed on the `list_element_metadata` helper.
- There is an issue with the `ValidateModelVersion` endpoint in the latest Arkindex releases. This endpoint may return HTTP errors (codes 403 or 500) even though the model version has been successfully updated. To avoid raising false errors, a warning is logged when that happens and the worker’s processing no longer stops at that exception.
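The idea behind the `is_finetuning` property described in this section can be illustrated with a small sketch. `TrainingWorkerSketch` is a hypothetical stand-in, not the library's `TrainingMixin`, and it assumes the linked model version is exposed through a `model_version_id` attribute that is `None` when no model version is set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingWorkerSketch:
    """Hypothetical stand-in for a training worker (not the real TrainingMixin)."""

    # Assumed to hold the ID of the model version linked to the process, if any
    model_version_id: Optional[str] = None

    @property
    def is_finetuning(self) -> bool:
        """True when a model version is set, i.e. training starts from an existing model."""
        return self.model_version_id is not None


worker = TrainingWorkerSketch(model_version_id="placeholder-model-version-id")
if worker.is_finetuning:
    print("Fine-tuning an existing model")
else:
    print("Training from scratch")
```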
Worker template¶
- The worker template has been updated:
  - default values for `author` and `email`,
  - workers docker images have been renamed to make registry cleanup policies easier to write:
    - tags are now named after the commit SHA: `commit-$CI_COMMIT_SHORT_SHA` (see Gitlab’s documentation to learn about this variable),
    - and the corresponding cleanup policy regex is `commit-.*`.
  - the `type` key in YAML configurations has been removed.
Documentation¶
- A new section explaining how to publish a worker to an Arkindex instance has been added.
Misc¶
- A summary message is now logged at the end of the `run` method, even if no error was encountered during processing.
- A new helper was added to parse source arguments, mostly used for `worker_version` and `worker_run` arguments. To filter manual sources, the Arkindex API expects the `False` value. This helper maps `"manual"` to this value.
- A new helper to upload a Pillow image has been added.
- SSL verification is now skipped for Arkindex local development hosts. This only affects instances whose URL matches the pattern `*ark.localhost`.
- A warning is now logged when calling a helper that doesn’t support cache.
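The source-parsing helper mentioned in this list can be pictured as follows. This is a minimal sketch with hypothetical names (`parse_source_sketch`, `MANUAL_SOURCE`). Only the mapping of `"manual"` to `False` comes from these notes; passing every other value through unchanged is an assumption.

```python
from typing import Optional, Union

MANUAL_SOURCE = "manual"


def parse_source_sketch(value: Optional[str]) -> Union[bool, str, None]:
    """Map the "manual" keyword to False, the value the API expects for manual sources.

    Assumption: any other value (an ID or None) is passed through unchanged.
    """
    if value == MANUAL_SOURCE:
        return False
    return value


print(parse_source_sketch("manual"))
print(parse_source_sketch("placeholder-worker-run-id"))
```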
0.3.6¶
Released on 22 Dec 2023 • View on Gitlab
Breaking changes¶
- The `arkindex_worker.git` module was removed. No worker used it locally; it only exposed some workflows from python-gitlab. Please refer to their documentation if your worker needs to communicate with a Git instance.
- Following Arkindex’s 1.5.3 release, the `model_usage` configuration parameter has been updated to a tri-enum. To migrate your workers:
  - `model_usage: false` becomes `model_usage: disabled`
  - `model_usage: true` becomes `model_usage: required`

  The `supported` value means that the model is supported by a worker but not required to make it work.
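The `model_usage` migration rule above is small enough to express as a function. The sketch below is an illustration only: `migrate_model_usage` is a hypothetical helper, not part of the library, and it works on a value already loaded from a worker's YAML configuration.

```python
# Mapping taken from the migration notes above
LEGACY_TO_TRI_ENUM = {False: "disabled", True: "required"}
VALID_VALUES = {"disabled", "supported", "required"}


def migrate_model_usage(value):
    """Convert a legacy boolean `model_usage` into its tri-enum equivalent."""
    if isinstance(value, bool):
        return LEGACY_TO_TRI_ENUM[value]
    if value in VALID_VALUES:
        # Already migrated, nothing to do
        return value
    raise ValueError(f"Unsupported model_usage value: {value!r}")


config = {"model_usage": True}
config["model_usage"] = migrate_model_usage(config["model_usage"])
print(config)
```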
Project architecture¶
- PEP 621 encourages users to store most of the package’s metadata in the `pyproject.toml`. We followed this proposition both for the `arkindex-worker` package and the worker template.
Arkindex API¶
- The details of the model available to the worker are now stored under the `model_details` attribute.
- The list_corpus_entities API helper now stores the entities in the `entities` attribute instead of returning them.
- A reminder was added to prevent making changes to the Arkindex Cache schema without bumping the version of said cache.
- Each dataset’s archive is now properly deleted after processing.
- The path to a Dataset’s archive is now stored under the `filepath` property.
- The new create_element_parent API helper creates a link between two elements.
- The create_sub_element helper was updated to support creating child elements without zones and under a parent without a zone.
- A new user configuration type was introduced to be able to select Arkindex `Models`. Learn more about it in the documentation.
Worker template¶
- When the provided `slug` had more than one word, it was invalid for either:
  - the package name, because the user used `_` as word delimiter,
  - the module directory’s name, because the user used `-` as word delimiter.

  The package name and the module directory’s name are now both computed from the slug, making sure that:
  - the package name uses `-` as word delimiter,
  - the module directory’s name uses `_` as word delimiter.
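The naming rule above can be sketched in a few lines. This is not the template's actual code: `names_from_slug` is a hypothetical function, and splitting the slug on both `-` and `_` is an assumption about how multi-word slugs are handled.

```python
import re


def names_from_slug(slug: str) -> tuple[str, str]:
    """Return (package_name, module_directory_name) derived from a worker slug."""
    # Split on either delimiter so "my_worker" and "my-worker" give the same result
    words = [word for word in re.split(r"[-_]+", slug) if word]
    if not words:
        raise ValueError("slug must contain at least one word")
    # Package names use "-", module directory names use "_"
    return "-".join(words), "_".join(words)


package_name, module_name = names_from_slug("my_demo-worker")
print(package_name, module_name)
```

Deriving both names from one input removes the possibility of the two getting out of sync.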
Documentation¶
- A link to the documentation was added:
  - in the README,
  - as a GitLab badge on the repo.
- Some sections in the documentation were renamed to improve readability.
Misc¶
- While we removed the `black` formatter from our CI workflow, we replaced it with Ruff’s, which respects most of its rules.
- Many linting rules supported by the Ruff formatter were added to improve the style of the codebase.
- This project is now licensed under the MIT license.
0.3.5¶
Released on 8 Nov 2023 • View on Gitlab
Breaking changes¶
- The `arkindex_worker.reporting` module has been removed as the JSON report file was no longer needed.
- The `--model-dir` CLI argument was renamed to `--extras-dir` as it was more suited to its use. This folder now stores dataset archives, hence the more generic name.
Arkindex API¶
- Following the Arkindex 1.5.2 release:
  - new helpers for Task-related endpoints were introduced,
  - a new worker class is available, to support `Dataset` processes,
  - new helpers for Dataset-related endpoints were introduced.
- Added a uniqueness check on the input of the create_transcription_entities helper.
- The partial_update_element helper was updated to better match the endpoint.
Documentation¶
- Some modules were poorly displayed in the documentation. Class methods are now only listed under their class’s section.
Release Management¶
- A Makefile was added to the worker template to deploy new releases more easily. The default branch is expected to be master; make sure to change it to `main` depending on your settings.
- The base image used in the worker’s docker image was changed from `python:3.11` to `python:3.11-slim`, in an effort to reduce its size.
Misc¶
- During the configuration stage, a summary of the worker is now logged instead of the revision’s hash. This was changed to support workers not linked to any revision on Arkindex.
- A retry mechanism on HTTP 50x errors was added. Additionally, when the requested size exceeds the maximum size allowed by the IIIF server, a new attempt is made with `max` instead of `full` as size parameter. More information about these parameters is available in the IIIF documentation.
- When running the worker locally without the `ARKINDEX_CORPUS_ID` variable set in the environment, an explicit exception will be raised when trying to access the `corpus_id` attribute.
- This release adds support for Python 3.12.
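The `full` to `max` fallback described in this list can be sketched as below. This is an illustration under stated assumptions, not the library's code: `iiif_url`, `download_with_size_fallback` and `SizeTooLargeError` are hypothetical names, and `fetch` is an injected callable standing in for the real HTTP download. The URL layout follows the IIIF Image API pattern `{region}/{size}/{rotation}/{quality}.{format}`.

```python
from typing import Callable


class SizeTooLargeError(Exception):
    """Hypothetical error raised by `fetch` when the server rejects the requested size."""


def iiif_url(image_base_url: str, size: str) -> str:
    """Build an IIIF Image API URL for the full region at the given size."""
    return f"{image_base_url.rstrip('/')}/full/{size}/0/default.jpg"


def download_with_size_fallback(image_base_url: str, fetch: Callable[[str], bytes]) -> bytes:
    """Request the `full` size first, then retry once with `max` if it is rejected."""
    try:
        return fetch(iiif_url(image_base_url, "full"))
    except SizeTooLargeError:
        # `max` asks for the largest size the server is willing to deliver
        return fetch(iiif_url(image_base_url, "max"))
```

Injecting `fetch` keeps the sketch independent of any HTTP library and makes the fallback easy to exercise with a fake.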
0.3.4¶
Released on 14 Sept 2023 • View on Gitlab
- The worker template was updated to correctly install Git submodules if it depends on any.
- Base-worker now uses ruff for linting. This tool replaces `isort` and `flake8`.
- New Arkindex API helper to update an element, calling PartialUpdateElement.
- New Arkindex API helper to list an element’s parents, calling ListElementParents.
- Worker Activity API is now disabled when the worker runs in `read-only` mode instead of relying on the `--dev` CLI argument. The update_activity API helper was updated following Arkindex 1.5.1 changes.
- Workers can now resize the image of an element when opening it. This uses the IIIF resizing API.
0.3.3¶
Released on 26 May 2023 • View on Gitlab
- The `Timer` class previously defined in `arkindex_worker.utils` was removed as it was already defined in Teklia’s python toolbox.
# Old usage
from arkindex_worker.utils import Timer
# New usage
from teklia_toolbox.time import Timer
- The create_element_transcriptions API helper now accepts an `element_confidence` float field in the dictionaries provided through the `transcriptions` field. This confidence will be set on the created element.
- More query filters are available on the list_element_children API helper. More details about their usage are available in the documentation:
  - `transcription_worker_version`
  - `transcription_worker_run`
  - `with_metadata`
  - `worker_run`
- `Arkindex Base-Worker` now fully uses pathlib to handle filesystem paths as suggested by PEP 428.
- Many helpers were added to handle ZSTD and TAR archives as well as delete files cleanly. More details are available in the documentation of the arkindex_worker.utils module.
- A bug affecting the parsing of the configuration of workers that use a Machine learning model stored on an Arkindex instance was fixed.
0.3.2¶
Released on 8 March 2023 • View on Gitlab
- A helper to use the new API endpoint to create transcription entities more efficiently was implemented.
- Training workers may now publish a model configuration when creating a new model version on Arkindex. This will make the execution of a generic worker much smoother.
- The model version API endpoints were updated in the latest Arkindex release and a new helper was introduced subsequently. However, there are no breaking changes and the main helper, `publish_model_version`, still has the same signature and behaviour.
- The latest Arkindex release changed the way NER entities are stored and published.
  - The `EntityType` enum was removed as type slugs are no longer restricted to a small set of options,
  - create_entity now expects a type slug as a String,
  - a new helper list_corpus_entity_types was added to load the Entity types in the corpus,
  - a new helper check_required_entity_types was added to make sure that needed entity types are available in the corpus. Missing ones are created by default (this can be disabled).
- The create_classifications helper now expects the UUID of each MLClass instead of their name.
- In developer mode, the only way to set the `corpus_id` attribute is to use the `ARKINDEX_CORPUS_ID` environment variable. When it’s not set, all API requests using the `corpus_id` as path parameter will fail with a `500` status code. A warning log was added to help developers troubleshoot this error by advising them to set this variable.
- The create_transcriptions helper no longer makes the API call in developer mode. This behaviour aligns with all other publication helpers.
- Fixes hash computation when publishing a model using publish_model_version.
- If a process is linked to a model version, its id will be available to the worker through its `model_version_id` attribute.
- The URLs of the API endpoint related to Ponos were changed in the latest Arkindex release. Some changes were needed in the test suite.
- The `classes` attribute now directly contains the classes of the corpus of the processed element.
# Old usage
self.classes = {
    "corpus_id": {
        "ml_class_1": "class_uuid",
        ...
    }
}
# New usage
self.classes = {
    "ml_class_1": "class_uuid",
    ...
}
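For code that still builds or reads the old nested layout, the change amounts to dropping one dictionary level. The sketch below shows this with a hypothetical `flatten_classes` helper; it is an illustration using the placeholder keys from the example above, not a function provided by the library.

```python
def flatten_classes(legacy_classes: dict, corpus_id: str) -> dict:
    """Return the {class name: class UUID} mapping of one corpus from the old nested layout."""
    # Copy so that the caller can modify the result without touching the input
    return dict(legacy_classes.get(corpus_id, {}))


# Placeholder keys and values, for illustration only
legacy = {"corpus_id": {"ml_class_1": "class_uuid"}}
classes = flatten_classes(legacy, "corpus_id")

# Old lookup: legacy["corpus_id"]["ml_class_1"]
# New lookup: classes["ml_class_1"]
print(classes["ml_class_1"])
```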
0.3.1¶
Released on 8 November 2022 • View on Gitlab
- A breaking change, affecting mostly the API, was introduced in Arkindex’s 1.3.4 release:
  - Workers were mostly unaffected but the REST schema was updated.
  - Workers will progressively not be able to publish results with a `worker_version_id` anymore on Arkindex. They will have to use a related but more general field, `worker_run_id`:
    - Most publication API endpoint helpers have been updated accordingly,
    - A new version of the cache was released with the updated Django models.
- Improvements to our Machine Learning training API to allow workers to use models published on Arkindex.
- Support workers that have no configuration.
- Allow publishing metadata with falsy but non-null values.
- Add `.polygon` attribute shortcut on `Element`.
- Add a major test speedup on our worker template.
- Support cache usage on our metadata API endpoint helpers.
- Drop support for Python 3.6 and add support for Python 3.11.
- Update arkindex-client to version 1.0.11.
- Update shapely to version 1.8.5-post1
0.3.0¶
Released on 12 September 2022 • View on Gitlab
- A large refactoring effort was made on the worker initialization, to streamline most of the workflow:
  - developer setup is now set in a dedicated method `configure_for_developers`
  - cache setup is now set in a dedicated method `configure_cache`
  - deprecated useless attribute `features`
  - add a simpler debug mode for developers
  - depend only on Arkindex `RetrieveWorkerRun` API to get all the information needed, instead of relying on multiple API calls.
  - remove `ARKINDEX_CORPUS_ID` environment variable usage, replaced by corpus information from API, except for developers
  - do not erase defaults when reading configuration
- Support new Machine Learning training APIs on Arkindex to allow workers to create model versions and publish them as zstandard archives on a remote S3-compatible bucket.
- Add API helpers:
  - `list_corpus_entities`
  - `create_metadatas`
  - `list_metadata`
  - `list_transcription_entities`
  - `create_required_types`
  - `publish_model_version`
  - `create_model_version`
  - `upload_to_s3`
- Create missing element types when checking if they are available on the Arkindex instance (disabled by default).
- Update arkindex-client to version 1.0.9.
- Update automated rotation code (`revert_orientation`) to support reverse application
0.2.4¶
Released on 6 July 2022 • View on Gitlab
- Document source code using Sphinx and docstrings with parameters. Documentation is available here.
- Update workers inner `config` with default values from `user_configuration`
- Support confidence in API helpers `create_sub_element` and `create_elements` as they are now available in Arkindex
- Port rotation code from tesseract worker
- Add helper to trim polygons so that they fit inside their image
0.2.3¶
Released on 28 March 2022 • View on Gitlab
- Update arkindex-client to version 1.0.8.
- Replace all transcription scores with confidences (also renamed on Arkindex)
- Support cache versioning and detect compatibility in workers
- Support confidence in `create_transcription_entity` API helper
- Support Text orientation for transcriptions
- Return the response payload in all creation helpers so that workers can use them
- Support new metadata type `URL`
0.2.2¶
Released on 17 September 2021 • View on Gitlab
- Update arkindex-client to version 1.0.7.
- Detect already processed elements using worker activity, and skip them
- Support rotation, mirroring and fix image crop in `open_image` method used by a lot of workers
- Change default value for `user_configuration` from `None` to `{}` which simplifies usage code in workers
- Support new metadata type `Numeric`
- Add API helper `create_classifications`
- Set worker version in transcription entities API helpers
0.2.1¶
Released on 30 June 2021 • View on Gitlab
- Add API helper `check_required_types`
- Add a developer mode via `--dev` argument to simplify boot process for local development
- Send `process_id` when updating worker activities
- Remove `nb_best` from ML classes list as it’s not supported anymore by Arkindex
0.2.0¶
Released on 6 May 2021 • View on Gitlab
This is a larger release which brings a new caching system to share data across workers (avoiding a lot of API calls in some workflows), and splits the codebase into multiple files for helpers & unit tests (one file per topic).
- Add cache system using a local SQLite database, shared from workers to workers. Currently supports Arkindex models:
  - elements and their hierarchy,
  - transcriptions,
  - images,
  - classifications,
  - entities.
- Add API helpers:
  - `create_elements`
  - `create_transcriptions`
  - `create_transcription_entity`
- Split ElementsWorker API helpers and unit tests in sub files
- Drop `TranscriptionType` & `DataSource` as they are not used anymore in Arkindex
- Retry all managed API calls that result in a 50x
0.1.14¶
Released on 8 April 2021 • View on Gitlab
- Support weak SSL DH key when downloading images (needed for some outdated IIIF servers with old SSL certs).
0.1.13¶
Released on 2 March 2021 • View on Gitlab
- Support new Arkindex feature Worker Activity, to track process progress.
- Add new API helpers:
  - `list_element_children`
  - `list_transcriptions`
  - `create_metadata`
- Extend git support with merge & rebase operations
- Allow any worker type in cookiecutter template
0.1.12¶
Released on 8 December 2020 • View on Gitlab
- Bugfix to avoid loading remote images from local file system
- Deprecate `TranscriptionType`.
0.1.11¶
Released on 26 November 2020 • View on Gitlab
- Update arkindex-client to version 1.0.6.
0.1.10¶
Released on 23 November 2020 • View on Gitlab
- Support git base operations to allow workers to clone and checkout repositories
- Setup automated CI task to update Python dependencies
- Update arkindex-client to version 1.0.5.
0.1.9¶
Released on 19 October 2020 • View on Gitlab
- Update arkindex-client to version 1.0.4.
- Add API helpers:
  - `get_worker_version`
  - `get_worker_version_slug`
  - `get_ml_result_slug`
0.1.8¶
Released on 30 September 2020 • View on Gitlab
- Update arkindex-client to version 1.0.3.
0.1.7¶
Released on 30 September 2020 • View on Gitlab
- Support Arkindex secrets for workers, using API but also local storage for developers. More information on Arkindex documentation.
- Do not crash when a worker tries to create a classification that already exists.
0.1.6¶
Released on 23 September 2020 • View on Gitlab
- Automatically create missing Arkindex ML classes when using `get_ml_class_id` and creating classifications through API helpers.
- Update arkindex-client to version 1.0.2.
0.1.5¶
Released on 22 September 2020 • View on Gitlab
- Update arkindex-client to version 1.0.1.
- Bugfix on score & confidence type checks in api helpers
0.1.4¶
Released on 2 September 2020 • View on Gitlab
- Load worker configuration from Arkindex API, or local file (for developers)
- Add API helpers:
  - `load_corpus_classes`
  - `get_ml_class_id`
0.1.3¶
Released on 25 August 2020 • View on Gitlab
- Add API helper `create_element_transcriptions`
- Return created instance ID in API helpers
- Add cookiecutter variables to be able to easily rebuild
0.1.2¶
Released on 19 August 2020 • View on Gitlab
- Use `WORKER_VERSION_ID` environment var in helper methods to identify the worker automatically
- Add API helpers:
  - `create_transcription`
  - `create_classification`
  - `create_entity`
- Extend cookiecutter template to generate clean Python packages
- Add the `Timer` helper class in tools submodule
0.1.1¶
Released on 7 August 2020 • View on Gitlab
- Add API helper `create_sub_element`
- Add unit tests in cookiecutter template & base project.
- Change cookiecutter base to use ElementsWorker
0.1.0¶
Released on 21 July 2020 • View on Gitlab
Initial version of the base worker, with cookiecutter support to easily create workers using this project.