Corpus

arkindex_worker.worker.corpus

BaseWorker methods for corpora.

Classes

CorpusExportState

Bases: Enum

State of a corpus export.

Attributes
Created
Created = 'created'

The corpus export has been created and is awaiting processing.

Running
Running = 'running'

The corpus export is being built.

Failed
Failed = 'failed'

The corpus export failed.

Done
Done = 'done'

The corpus export completed successfully.
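
The state values above are plain strings, so a payload's `state` field can be converted back into the enum and compared by identity. The following is a minimal sketch, not part of `arkindex_worker`: the enum is re-declared locally so the snippet runs on its own, and both the `is_downloadable` helper and the sample payloads are hypothetical.

```python
from enum import Enum


class CorpusExportState(Enum):
    """Local mirror of the enum documented above, so this sketch runs standalone."""

    Created = "created"
    Running = "running"
    Failed = "failed"
    Done = "done"


def is_downloadable(export: dict) -> bool:
    """Hypothetical helper: True when the export's state is `done`."""
    return CorpusExportState(export["state"]) is CorpusExportState.Done


# Hypothetical payloads, shaped only as far as this sketch needs.
exports = [
    {"id": "a", "state": "running"},
    {"id": "b", "state": "done"},
]
downloadable_ids = [e["id"] for e in exports if is_downloadable(e)]
```

Converting through `CorpusExportState(...)` raises a `ValueError` on an unknown state string, which surfaces unexpected values early instead of silently treating them as "not done".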

CorpusMixin

Functions
download_export
download_export(export_id: str) -> _TemporaryFileWrapper

Download an export.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| export_id | str | UUID of the export to download | required |

Returns:

| Type | Description |
| --- | --- |
| _TemporaryFileWrapper | The downloaded export stored in a temporary file. |

Source code in arkindex_worker/worker/corpus.py
def download_export(self, export_id: str) -> _TemporaryFileWrapper:
    """
    Download an export.

    :param export_id: UUID of the export to download
    :returns: The downloaded export stored in a temporary file.
    """
    try:
        UUID(export_id)
    except ValueError as e:
        raise ValueError("export_id is not a valid uuid.") from e

    logger.info(f"Downloading export ({export_id})...")
    export: _TemporaryFileWrapper = self.api_client.request(
        "DownloadExport", id=export_id
    )
    logger.info(f"Downloaded export ({export_id}) @ `{export.name}`")
    return export
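
The validate-then-request flow above can be tried without a running Arkindex instance. This is a rough sketch under stated assumptions: `FakeApiClient` is a hypothetical stand-in that only imitates the `request` call, the module-level `download_export` function is a simplified copy of the method (no logging), and the placeholder bytes are invented for illustration.

```python
import tempfile
from uuid import UUID


class FakeApiClient:
    """Hypothetical stand-in for the worker's API client; it only imitates `request`."""

    def request(self, operation_id: str, **kwargs):
        # Return a named temporary file, as the real call is documented to do.
        tmp = tempfile.NamedTemporaryFile(suffix=".export")
        tmp.write(b"placeholder bytes")
        tmp.flush()
        tmp.seek(0)
        return tmp


def download_export(api_client, export_id: str):
    """Same validate-then-request flow as the method above, without logging."""
    try:
        UUID(export_id)
    except ValueError as e:
        raise ValueError("export_id is not a valid uuid.") from e
    return api_client.request("DownloadExport", id=export_id)


client = FakeApiClient()

# A malformed identifier is rejected before any request is made.
try:
    download_export(client, "not-a-uuid")
    rejected = False
except ValueError:
    rejected = True

# A well-formed UUID yields a file object; leaving the `with` block closes it,
# which also deletes the temporary file.
with download_export(client, "11111111-1111-1111-1111-111111111111") as export:
    content = export.read()
```

Because the returned object is a temporary file, any data needed later should be read or copied before the file is closed.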
download_latest_export
download_latest_export() -> _TemporaryFileWrapper

Download the most recently updated export of the current corpus that is in the done state.

Returns:

| Type | Description |
| --- | --- |
| _TemporaryFileWrapper | The downloaded export stored in a temporary file. |

Source code in arkindex_worker/worker/corpus.py
def download_latest_export(self) -> _TemporaryFileWrapper:
    """
    Download the latest export in `done` state of the current corpus.

    :returns: The downloaded export stored in a temporary file.
    """
    # List all exports on the corpus
    exports = self.api_client.paginate("ListExports", id=self.corpus_id)

    # Find the latest that is in "done" state
    exports: list[dict] = sorted(
        list(
            filter(
                lambda export: export["state"] == CorpusExportState.Done.value,
                exports,
            )
        ),
        key=itemgetter("updated"),
        reverse=True,
    )
    assert (
        len(exports) > 0
    ), f'No available exports found for the corpus ({self.corpus_id}) with state "{CorpusExportState.Done.value.capitalize()}".'

    # Download latest export
    export_id: str = exports[0]["id"]

    return self.download_export(export_id)
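
The selection step above (keep exports in `done` state, then take the most recently updated one) can be illustrated on its own. The sketch below is not the library's implementation: `latest_done_export_id` is a hypothetical helper, the payloads are invented, and it assumes `updated` holds ISO 8601 timestamps in a single consistent format so that string comparison matches chronological order.

```python
from operator import itemgetter

# Hypothetical export payloads; only the keys read by the method are included.
exports = [
    {"id": "e1", "state": "done", "updated": "2024-01-10T08:00:00Z"},
    {"id": "e2", "state": "failed", "updated": "2024-03-01T08:00:00Z"},
    {"id": "e3", "state": "done", "updated": "2024-02-20T08:00:00Z"},
]


def latest_done_export_id(exports: list[dict]) -> str:
    """Return the id of the most recently updated export in `done` state."""
    done = [export for export in exports if export["state"] == "done"]
    if not done:
        raise ValueError('No export in "done" state was found.')
    # max() avoids sorting the whole list when only the newest entry is needed.
    return max(done, key=itemgetter("updated"))["id"]


latest_id = latest_done_export_id(exports)
```

One design difference is deliberate: the sketch raises `ValueError` where the method uses `assert`, since assertions are skipped when Python runs with the `-O` flag.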