Dataset

arkindex_worker.worker.dataset

BaseWorker methods for datasets.

Classes

DatasetState

Bases: Enum

State of a dataset.

Attributes
Open class-attribute instance-attribute
Open = 'open'

The dataset is open.

Building class-attribute instance-attribute
Building = 'building'

The dataset is being built.

Complete class-attribute instance-attribute
Complete = 'complete'

The dataset is complete.

Error class-attribute instance-attribute
Error = 'error'

The dataset is in error.
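
As a minimal sketch of how these members behave, the block below defines a local stand-in enum with the same names and values (it deliberately does not import `arkindex_worker`) and shows lookup by value, for instance to map a state string back to a member.

```python
from enum import Enum


class DatasetState(Enum):
    """Local stand-in mirroring the members documented above."""

    Open = "open"
    Building = "building"
    Complete = "complete"
    Error = "error"


# Look up a member from its string value.
state = DatasetState("building")
print(state is DatasetState.Building)

# The .value attribute gives back the raw string.
print(DatasetState.Complete.value)

# Strings that match no member raise ValueError.
try:
    DatasetState("archived")
except ValueError as exc:
    print(f"rejected: {exc}")
```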

MissingDatasetArchive

Bases: Exception

Exception raised when the compressed archive associated with a dataset is not found in its task artifacts.

DatasetMixin

Functions
add_arguments
add_arguments() -> None

Define specific argparse arguments for the worker using this mixin.

Source code in arkindex_worker/worker/dataset.py
def add_arguments(self) -> None:
    """Define specific ``argparse`` arguments for the worker using this mixin"""
    self.parser.add_argument(
        "--set",
        type=check_dataset_set,
        nargs="+",
        help="""
            One or more Arkindex dataset sets, format is <dataset_uuid>:<set_name>
            (e.g.: "12341234-1234-1234-1234-123412341234:train")
        """,
        default=[],
    )
    super().add_arguments()
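
To illustrate how `nargs="+"` combines with a `type` callable, here is a minimal sketch using a standalone `argparse.ArgumentParser` rather than the worker's own parser. The validator is a simplified copy of `check_dataset_set` from this page, and the UUID is the placeholder from the help text.

```python
import argparse
import uuid


def check_dataset_set(value: str) -> tuple[uuid.UUID, str]:
    # Simplified copy of the validator documented on this page.
    values = value.split(":")
    if len(values) != 2:
        raise argparse.ArgumentTypeError(f"'{value}' is not <dataset_id>:<set_name>")
    dataset_id, set_name = values
    try:
        return uuid.UUID(dataset_id), set_name
    except (TypeError, ValueError) as e:
        raise argparse.ArgumentTypeError(f"'{dataset_id}' should be a valid UUID") from e


# Standalone parser for illustration only; the worker builds its own.
parser = argparse.ArgumentParser()
parser.add_argument("--set", type=check_dataset_set, nargs="+", default=[])

# argparse applies the type callable to each value separately.
args = parser.parse_args(
    [
        "--set",
        "12341234-1234-1234-1234-123412341234:train",
        "12341234-1234-1234-1234-123412341234:dev",
    ]
)
for dataset_id, set_name in args.set:
    print(dataset_id, set_name)

# Without --set, the default empty list is used.
no_sets = parser.parse_args([])
print(no_sets.set)
```

Each entry of `args.set` is a `(UUID, set_name)` tuple, which is the shape `list_sets` unpacks when iterating over `self.args.set`.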
list_process_sets
list_process_sets() -> Iterator[Set]

List dataset sets associated with the worker’s process. This helper is not available in developer (read-only) mode.

Returns:

Type Description
Iterator[Set]

An iterator of Set objects built from the ListProcessSets API endpoint.

Source code in arkindex_worker/worker/dataset.py
def list_process_sets(self) -> Iterator[Set]:
    """
    List dataset sets associated to the worker's process. This helper is not available in developer mode.

    :returns: An iterator of ``Set`` objects built from the ``ListProcessSets`` API endpoint.
    """
    assert not self.is_read_only, "This helper is not available in read-only mode."

    results = self.api_client.paginate(
        "ListProcessSets", id=self.process_information["id"]
    )

    return map(
        lambda result: Set(
            name=result["set_name"], dataset=Dataset(**result["dataset"])
        ),
        results,
    )
list_set_elements
list_set_elements(dataset_set: Set) -> Iterator[Element]

List elements in a dataset set.

Parameters:

Name Type Description Default
dataset_set Set

Set to find elements in.

required

Returns:

Type Description
Iterator[Element]

An iterator of Element built from the ListDatasetElements API endpoint.

Source code in arkindex_worker/worker/dataset.py
def list_set_elements(self, dataset_set: Set) -> Iterator[Element]:
    """
    List elements in a dataset set.

    :param dataset_set: Set to find elements in.
    :returns: An iterator of Element built from the ``ListDatasetElements`` API endpoint.
    """
    assert dataset_set and isinstance(
        dataset_set, Set
    ), "dataset_set shouldn't be null and should be a Set"

    results = self.api_client.paginate(
        "ListDatasetElements", id=dataset_set.dataset.id, set=dataset_set.name
    )

    return map(lambda result: Element(**result["element"]), results)
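
Because the method returns a `map` object, the result is a one-shot iterator. The sketch below uses a hard-coded list of hypothetical payloads in place of the paginated API results to show that a second pass over the same iterator yields nothing, so callers who need the elements more than once may want to materialise them with `list()`.

```python
# Hypothetical payloads standing in for paginated API results.
results = [
    {"element": {"id": "a", "name": "page 1"}},
    {"element": {"id": "b", "name": "page 2"}},
]

# Same shape as the return statement above: a lazy map over the results.
elements = map(lambda result: result["element"]["name"], results)

first_pass = list(elements)   # consumes the iterator
second_pass = list(elements)  # nothing left to iterate over

print(first_pass)
print(second_pass)
```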
list_sets
list_sets() -> Iterator[Set]

List the sets to be processed, either from the CLI arguments or using the list_process_sets method.

Returns:

Type Description
Iterator[Set]

An iterator of Set objects.

Source code in arkindex_worker/worker/dataset.py
def list_sets(self) -> Iterator[Set]:
    """
    List the sets to be processed, either from the CLI arguments or using the
    [list_process_sets][arkindex_worker.worker.dataset.DatasetMixin.list_process_sets] method.

    :returns: An iterator of ``Set`` objects.
    """
    if not self.is_read_only:
        yield from self.list_process_sets()

    datasets: dict[uuid.UUID, Dataset] = {}
    for dataset_id, set_name in self.args.set:
        # Retrieving dataset information if not already cached
        if dataset_id not in datasets:
            datasets[dataset_id] = Dataset(
                **self.api_client.request("RetrieveDataset", id=dataset_id)
            )

        yield Set(name=set_name, dataset=datasets[dataset_id])
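
The per-dataset cache in the loop above can be sketched in isolation. In this minimal example, `FakeAPIClient` and `iter_cli_sets` are hypothetical stand-ins (not part of `arkindex_worker`), and plain dictionaries replace the `Dataset` and `Set` objects. It shows that two sets of the same dataset trigger a single retrieval request.

```python
import uuid
from collections.abc import Iterator


class FakeAPIClient:
    """Hypothetical stand-in that counts how many requests are made."""

    def __init__(self) -> None:
        self.calls = 0

    def request(self, operation_id: str, id: uuid.UUID) -> dict:
        self.calls += 1
        return {"id": str(id), "name": "demo dataset"}


def iter_cli_sets(api_client, cli_sets) -> Iterator[tuple[str, dict]]:
    # Cache keyed by dataset ID, as in the loop above.
    datasets: dict[uuid.UUID, dict] = {}
    for dataset_id, set_name in cli_sets:
        if dataset_id not in datasets:
            datasets[dataset_id] = api_client.request("RetrieveDataset", id=dataset_id)
        yield set_name, datasets[dataset_id]


client = FakeAPIClient()
dataset_id = uuid.UUID("12341234-1234-1234-1234-123412341234")
pairs = list(iter_cli_sets(client, [(dataset_id, "train"), (dataset_id, "dev")]))

print(client.calls)  # 1: the second set reuses the cached dataset
```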
update_dataset_state
update_dataset_state(
    dataset: Dataset, state: DatasetState
) -> Dataset

Partially updates a dataset state through the API. In read-only mode, the method logs a warning and returns None without calling the API.

Parameters:

Name Type Description Default
dataset Dataset

The dataset to update.

required
state DatasetState

State of the dataset.

required

Returns:

Type Description
Dataset

The updated Dataset object from the PartialUpdateDataset API endpoint.

Source code in arkindex_worker/worker/dataset.py
@unsupported_cache
def update_dataset_state(self, dataset: Dataset, state: DatasetState) -> Dataset:
    """
    Partially updates a dataset state through the API.

    :param dataset: The dataset to update.
    :param state: State of the dataset.
    :returns: The updated ``Dataset`` object from the ``PartialUpdateDataset`` API endpoint.
    """
    assert dataset and isinstance(
        dataset, Dataset
    ), "dataset shouldn't be null and should be a Dataset"
    assert state and isinstance(
        state, DatasetState
    ), "state shouldn't be null and should be a str from DatasetState"

    if self.is_read_only:
        logger.warning("Cannot update dataset as this worker is in read-only mode")
        return

    updated_dataset = self.api_client.request(
        "PartialUpdateDataset",
        id=dataset.id,
        body={"state": state.value},
    )
    dataset.update(updated_dataset)

    return dataset
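
One way a worker might use this method is to bracket a build step with state updates. The sketch below is an assumption about usage, not something this page prescribes: `StubWorker`, `build_with_state_tracking`, and the dictionary-based dataset are hypothetical, and the enum is a local copy of `DatasetState`.

```python
from enum import Enum


class DatasetState(Enum):
    """Local copy of the enum documented on this page."""

    Open = "open"
    Building = "building"
    Complete = "complete"
    Error = "error"


class StubWorker:
    """Hypothetical worker that records state updates instead of calling an API."""

    def __init__(self) -> None:
        self.history: list[DatasetState] = []

    def update_dataset_state(self, dataset: dict, state: DatasetState) -> dict:
        self.history.append(state)
        dataset["state"] = state.value
        return dataset


def build_with_state_tracking(worker, dataset: dict, build) -> None:
    # Mark the dataset as building, then as complete or in error.
    worker.update_dataset_state(dataset, DatasetState.Building)
    try:
        build(dataset)
    except Exception:
        worker.update_dataset_state(dataset, DatasetState.Error)
        raise
    worker.update_dataset_state(dataset, DatasetState.Complete)


worker = StubWorker()
dataset = {"id": "demo", "state": "open"}
build_with_state_tracking(worker, dataset, build=lambda d: None)
print(dataset["state"])
print([state.name for state in worker.history])
```

With the real method, callers running in read-only mode would also need to handle the `None` return value shown in the source above.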

Functions

check_dataset_set

check_dataset_set(value: str) -> tuple[uuid.UUID, str]

The --set argument should have the following format: <dataset_id>:<set_name>

Args: value (str): Provided argument.

Raises: ArgumentTypeError: When the value is invalid.

Returns: tuple[uuid.UUID, str]: The ID of the dataset parsed as UUID and the name of the set.

Source code in arkindex_worker/worker/dataset.py
def check_dataset_set(value: str) -> tuple[uuid.UUID, str]:
    """The `--set` argument should have the following format:
    <dataset_id>:<set_name>

    Args:
        value (str): Provided argument.

    Raises:
        ArgumentTypeError: When the value is invalid.

    Returns:
        tuple[uuid.UUID, str]: The ID of the dataset parsed as UUID and the name of the set.
    """
    values = value.split(":")
    if len(values) != 2:
        raise ArgumentTypeError(
            f"'{value}' is not in the correct format `<dataset_id>:<set_name>`"
        )

    dataset_id, set_name = values
    try:
        dataset_id = uuid.UUID(dataset_id)
        return (dataset_id, set_name)
    except (TypeError, ValueError) as e:
        raise ArgumentTypeError(f"'{dataset_id}' should be a valid UUID") from e
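
The following sketch exercises the accepted and rejected inputs. It uses a copy of the function above so that it runs without `arkindex_worker` installed; `try_parse` is a hypothetical helper that turns the exception into its message for display.

```python
import uuid
from argparse import ArgumentTypeError


def check_dataset_set(value: str) -> tuple[uuid.UUID, str]:
    # Copy of the logic shown above.
    values = value.split(":")
    if len(values) != 2:
        raise ArgumentTypeError(
            f"'{value}' is not in the correct format `<dataset_id>:<set_name>`"
        )
    dataset_id, set_name = values
    try:
        return (uuid.UUID(dataset_id), set_name)
    except (TypeError, ValueError) as e:
        raise ArgumentTypeError(f"'{dataset_id}' should be a valid UUID") from e


def try_parse(value: str):
    # Hypothetical helper: return the parsed tuple or the error message.
    try:
        return check_dataset_set(value)
    except ArgumentTypeError as exc:
        return str(exc)


valid = try_parse("12341234-1234-1234-1234-123412341234:train")
missing_name = try_parse("12341234-1234-1234-1234-123412341234")
extra_colon = try_parse("12341234-1234-1234-1234-123412341234:train:v2")
bad_uuid = try_parse("not-a-uuid:train")
empty_name = try_parse("12341234-1234-1234-1234-123412341234:")

for outcome in (valid, missing_name, extra_colon, bad_uuid, empty_name):
    print(outcome)
```

Two consequences of splitting on `:` are visible here: a set name that itself contains a colon is rejected, while an empty set name after the colon passes the format check.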