Dataset

class labelbox.schema.dataset.Dataset(client, field_values)[source]

Bases: DbObject, Updateable, Deletable

A Dataset is a collection of DataRows.
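
For orientation, a minimal sketch of obtaining a Dataset instance through the client; the API key and dataset ID placeholders are illustrative assumptions, not values from this page.

>>> import labelbox as lb
>>> client = lb.Client(api_key="<API_KEY>")
>>> dataset = client.create_dataset(name="my-dataset")  # create a new dataset
>>> dataset = client.get_dataset("<DATASET_ID>")        # or fetch an existing one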

name
Type:

str

description
Type:

str

updated_at
Type:

datetime

created_at
Type:

datetime

row_count

The number of rows in the dataset. This value is cached; fetch the dataset again to refresh it.

Type:

int

created_by

ToOne relationship to User

Type:

Relationship

organization

ToOne relationship to Organization

Type:

Relationship

create_data_row(items=None, **kwargs) → DataRow[source]

Creates a single DataRow belonging to this dataset.

>>> dataset.create_data_row(row_data="http://my_site.com/photos/img_01.jpg")

Parameters:
  • items – Dictionary containing new DataRow data. At a minimum, must contain row_data or DataRow.row_data.

  • **kwargs – Key-value arguments containing new DataRow data. At a minimum, must contain row_data.

Raises:
  • InvalidQueryError – If both a dictionary and kwargs are provided as inputs.

  • InvalidQueryError – If the DataRow.row_data field value is not provided in items or kwargs.

create_data_rows(items) → Task[source]

Asynchronously bulk upload data rows.

Use this instead of Dataset.create_data_rows_sync for batches that contain more than 1000 data rows.
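
A minimal sketch of an asynchronous bulk upload, reusing the image URL pattern from the create_data_row example above; the external IDs and the errors check are illustrative assumptions.

>>> task = dataset.create_data_rows([
>>>     {"row_data": "http://my_site.com/photos/img_01.jpg", "external_id": "img_01"},
>>>     {"row_data": "http://my_site.com/photos/img_02.jpg", "external_id": "img_02"},
>>> ])
>>> task.wait_till_done()
>>> task.errors  # expected to be None if the import succeeded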

Parameters:

items (iterable of (dict or str)) – See the docstring for Dataset._create_descriptor_file for more information.

Returns:

Task representing the data import on the server side. The Task can be used for inspecting task progress and waiting until it’s done.

Raises:
  • InvalidQueryError – If the items parameter does not conform to the specification above or if the server did not accept the DataRow creation request (unknown reason).

  • ResourceNotFoundError – If unable to retrieve the Task for the import process. This could imply that the import failed.

  • InvalidAttributeError – If there are fields in items not valid for a DataRow.

  • ValueError – When the upload parameters are invalid.

create_data_rows_sync(items) → None[source]

Synchronously bulk upload data rows.

Use this instead of Dataset.create_data_rows for smaller batches of data rows that need to be uploaded quickly. This method cannot be used for uploads containing more than 1000 data rows, and each data row is limited to 5 attachments.
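
A minimal sketch of a synchronous upload for a small batch; the URLs and external IDs are illustrative assumptions, and the call returns None on success.

>>> dataset.create_data_rows_sync([
>>>     {"row_data": "http://my_site.com/photos/img_01.jpg", "external_id": "img_01"},
>>>     {"row_data": "http://my_site.com/photos/img_02.jpg", "external_id": "img_02"},
>>> ])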

Parameters:

items (iterable of (dict or str)) – See the docstring for Dataset._create_descriptor_file for more information.

Returns:

None. If the function doesn’t raise an exception then the import was successful.

Raises:
  • InvalidQueryError – If the items parameter does not conform to the specification in Dataset._create_descriptor_file or if the server did not accept the DataRow creation request (unknown reason).

  • InvalidAttributeError – If there are fields in items not valid for a DataRow.

  • ValueError – When the upload parameters are invalid.

data_row_for_external_id(external_id) → DataRow[source]

Convenience method for getting a single DataRow belonging to this Dataset that has the given external_id.
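
For example, assuming a data row was uploaded with the external ID shown:

>>> data_row = dataset.data_row_for_external_id("img_01")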

Parameters:

external_id (str) – External ID of the sought DataRow.

Returns:

A single DataRow with the given external ID.

Raises:

labelbox.exceptions.ResourceNotFoundError – If there is no DataRow in this Dataset with the given external ID, or if there are multiple DataRows for it.

data_rows(from_cursor: str | None = None, where: Comparison | None = None) → PaginatedCollection[source]

Custom method to paginate data_rows via cursor.
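
A sketch of iterating the returned PaginatedCollection, with a where filter written in the field-comparison form implied by the Comparison type in the signature; the external ID value is an illustrative assumption.

>>> from labelbox import DataRow
>>> for data_row in dataset.data_rows():  # newest data rows first
>>>     print(data_row.uid)
>>> filtered = dataset.data_rows(where=DataRow.external_id == "my_external_id")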

Parameters:
  • from_cursor (str) – Cursor (data row id) to start from; if None, retrieval starts from the beginning.

  • where (dict(str,str)) – Filter to apply to data rows, where the key is a data row column name and the value is the value to filter on.

  • example – {'external_id': 'my_external_id'} to get a data row with external_id = 'my_external_id'

Note

Data rows are retrieved newest first. Deleted and failed data rows are not retrieved; data rows still in progress may be retrieved.

data_rows_for_external_id(external_id, limit=10) → List[DataRow][source]

Convenience method for getting multiple DataRows belonging to this Dataset that have the given external_id.
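
For example, assuming several data rows share the same external ID; the limit value is illustrative:

>>> matching_rows = dataset.data_rows_for_external_id("img_01", limit=5)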

Parameters:
  • external_id (str) – External ID of the sought DataRow.

  • limit (int) – The maximum number of data rows to return for the given external_id.

Returns:

A list of DataRows with the given external ID.

Raises:

labelbox.exceptions.ResourceNotFoundError – If there is no DataRow in this Dataset with the given external ID.

export(task_name: str | None = None, filters: DatasetExportFilters | None = None, params: CatalogExportParams | None = None) → ExportTask[source]

Creates a dataset export task with the given params and returns the task.

>>>     dataset = client.get_dataset(DATASET_ID)
>>>     task = dataset.export(
>>>         filters={
>>>             "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
>>>             "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
>>>             "data_row_ids": [DATA_ROW_ID_1, DATA_ROW_ID_2, ...] # or global_keys: [DATA_ROW_GLOBAL_KEY_1, DATA_ROW_GLOBAL_KEY_2, ...]
>>>         },
>>>         params={
>>>             "performance_details": False,
>>>             "label_details": True
>>>         })
>>>     task.wait_till_done()
>>>     task.result

export_data_rows(timeout_seconds=120, include_metadata: bool = False) → Generator[source]

Returns a generator that produces all data rows that are currently attached to this dataset.

Note: For efficiency, the data are cached for 30 minutes. Newly created data rows will not appear until the end of the cache period.
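
A minimal sketch of consuming the generator; the timeout value is an illustrative assumption.

>>> for data_row in dataset.export_data_rows(timeout_seconds=300, include_metadata=True):
>>>     print(data_row.uid)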

Parameters:
  • timeout_seconds (float) – Max waiting time, in seconds.

  • include_metadata (bool) – True to return related DataRow metadata.

Returns:

Generator that yields DataRow objects belonging to this dataset.

Raises:

LabelboxError – If the export fails or is unable to download within the specified time.

export_v2(task_name: str | None = None, filters: DatasetExportFilters | None = None, params: CatalogExportParams | None = None) → Task | ExportTask[source]

Creates a dataset export task with the given params and returns the task.

>>>     dataset = client.get_dataset(DATASET_ID)
>>>     task = dataset.export_v2(
>>>         filters={
>>>             "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
>>>             "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
>>>             "data_row_ids": [DATA_ROW_ID_1, DATA_ROW_ID_2, ...] # or global_keys: [DATA_ROW_GLOBAL_KEY_1, DATA_ROW_GLOBAL_KEY_2, ...]
>>>         },
>>>         params={
>>>             "performance_details": False,
>>>             "label_details": True
>>>         })
>>>     task.wait_till_done()
>>>     task.result

upsert_data_rows(items, file_upload_thread_count=20) → Task[source]

Upserts data rows in this dataset. When “key” is provided and it references an existing data row, an update is performed. When “key” is not provided, a new data row is created.

>>>     task = dataset.upsert_data_rows([
>>>         # create new data row
>>>         {
>>>             "row_data": "http://my_site.com/photos/img_01.jpg",
>>>             "global_key": "global_key1",
>>>             "external_id": "ex_id1",
>>>             "attachments": [
>>>                 {"type": AttachmentType.RAW_TEXT, "name": "att1", "value": "test1"}
>>>             ],
>>>             "metadata": [
>>>                 {"name": "tag", "value": "tag value"},
>>>             ]
>>>         },
>>>         # update global key of data row by existing global key
>>>         {
>>>             "key": GlobalKey("global_key1"),
>>>             "global_key": "global_key1_updated"
>>>         },
>>>         # update data row by ID
>>>         {
>>>             "key": UniqueId(dr.uid),
>>>             "external_id": "ex_id1_updated"
>>>         },
>>>     ])
>>>     task.wait_till_done()