Dataset

class labelbox.schema.dataset.Dataset(client, field_values)

Bases: DbObject, Updateable, Deletable

A Dataset is a collection of DataRows.

name
Type:

str

description
Type:

str

updated_at
Type:

datetime

created_at
Type:

datetime

row_count

The number of rows in the dataset. This value is cached on the object; fetch the dataset again to get an updated count.

Type:

int
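Because row_count is cached, one way to read a current value is to load the dataset again before reading the attribute. The following is a minimal sketch, not the library's own method: the helper name fresh_row_count is hypothetical, and it assumes that client.get_dataset (as used in the export example in this reference) returns a newly loaded Dataset and that dataset.uid holds the dataset ID.

```python
def fresh_row_count(client, dataset):
    """Re-fetch the dataset and return its row_count.

    Hypothetical helper: it avoids reading the cached value on the
    existing object by loading the dataset again.
    """
    # Assumption: client.get_dataset(id) returns a newly loaded Dataset
    # and dataset.uid is the ID of the dataset.
    refreshed = client.get_dataset(dataset.uid)
    return refreshed.row_count
```

Usage would look like `count = fresh_row_count(client, dataset)` after an upload has finished.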

created_by

ToOne relationship to User

Type:

Relationship

organization

ToOne relationship to Organization

Type:

Relationship

add_iam_integration(iam_integration: str | IAMIntegration) -> IAMIntegration

Sets the IAM integration for the dataset. IAM integration is used to sign URLs for data row assets.

Parameters:

iam_integration – IAM integration object or IAM integration id.

create_data_row(items=None, **kwargs) -> DataRow

Creates a single DataRow belonging to this dataset.

>>> dataset.create_data_row(row_data="http://my_site.com/photos/img_01.jpg")

Parameters:
  • items – Dictionary containing new DataRow data. At a minimum, must contain row_data or DataRow.row_data.

  • **kwargs – Key-value arguments containing new DataRow data. At a minimum, must contain row_data.

Raises:

create_data_rows(items, file_upload_thread_count=20) -> DataUpsertTask

Asynchronously bulk upload data rows.

Use this instead of Dataset.create_data_rows_sync for batches that contain more than 1000 data rows.

Parameters:

items (iterable of (dict or str)) –

Returns:

Task representing the data import on the server side. The Task can be used for inspecting task progress and waiting until it’s done.

Raises:
  • InvalidQueryError – If the items parameter does not conform to the specification above or if the server did not accept the DataRow creation request (unknown reason).

  • ResourceNotFoundError – If unable to retrieve the Task for the import process. This could imply that the import failed.

  • InvalidAttributeError – If there are fields in items not valid for a DataRow.

  • ValueError – When the upload parameters are invalid.

NOTE: dict and str items cannot be mixed in the same call. It is the responsibility of the caller to ensure that all items are of the same type.
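Since the caller is responsible for keeping item types consistent, a pre-flight check can catch mixed input before anything is sent. This is a sketch under stated assumptions: ensure_same_item_type and upload_and_wait are hypothetical helper names, the ValueError raised here comes from this check rather than from the library, and wait_till_done is assumed to be available on the returned task as in the upsert_data_rows example.

```python
def ensure_same_item_type(items):
    """Return items as a list after checking they are all dicts or all strings."""
    items = list(items)
    all_dicts = all(isinstance(item, dict) for item in items)
    all_strs = all(isinstance(item, str) for item in items)
    if not (all_dicts or all_strs):
        # Raised by this helper, not by the library.
        raise ValueError("items must be all dicts or all strings, not a mix")
    return items


def upload_and_wait(dataset, items):
    """Start an asynchronous bulk upload and block until the task finishes."""
    task = dataset.create_data_rows(ensure_same_item_type(items))
    # Assumption: the returned task exposes wait_till_done().
    task.wait_till_done()
    return task
```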

create_data_rows_sync(items) -> None

Synchronously bulk upload data rows.

Use this instead of Dataset.create_data_rows for smaller batches of data rows that need to be uploaded quickly. Cannot use this for uploads containing more than 1000 data rows. Each data row is also limited to 5 attachments.

Parameters:

items (iterable of (dict or str)) – See the docstring for Dataset._create_descriptor_file for more information.

Returns:

None. If the function doesn’t raise an exception then the import was successful.

Raises:
  • InvalidQueryError – If the items parameter does not conform to the specification in Dataset._create_descriptor_file or if the server did not accept the DataRow creation request (unknown reason).

  • InvalidAttributeError – If there are fields in items not valid for a DataRow.

  • ValueError – When the upload parameters are invalid.
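The 1000-row limit suggests a simple dispatch rule: small batches go through the synchronous call, larger ones through the asynchronous one. The sketch below illustrates that rule only; upload_data_rows and SYNC_ROW_LIMIT are hypothetical names, and the per-row limit of 5 attachments is not checked here.

```python
SYNC_ROW_LIMIT = 1000  # limit stated above for create_data_rows_sync


def upload_data_rows(dataset, items):
    """Pick the sync or async upload method based on batch size.

    Returns None after a synchronous upload, or the task object
    returned by the asynchronous upload.
    """
    items = list(items)
    if len(items) <= SYNC_ROW_LIMIT:
        # Small batch: success is signalled by the absence of an exception.
        dataset.create_data_rows_sync(items)
        return None
    # Large batch: the caller can wait on or inspect the returned task.
    return dataset.create_data_rows(items)
```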

data_row_for_external_id(external_id) -> DataRow

Convenience method for getting a single DataRow belonging to this Dataset that has the given external_id.

Parameters:

external_id (str) – External ID of the sought DataRow.

Returns:

A single DataRow with the given external ID.

Raises:

labelbox.exceptions.ResourceNotFoundError – If there is no DataRow in this Dataset with the given external ID, or if there are multiple DataRows for it.

data_rows(from_cursor: str | None = None, where: Comparison | None = None) -> PaginatedCollection

Custom method to paginate data_rows via cursor.

Parameters:
  • from_cursor (str) – Cursor (data row ID) to start from. If None, retrieval starts from the beginning.

  • where (dict(str,str)) – Filter to apply to data rows. Each key is a data row column name and each value is the value to filter on. Example: {'external_id': 'my_external_id'} retrieves the data row with external_id = 'my_external_id'.

Note

Order of retrieval is newest data row first. Deleted data rows are not retrieved. Failed data rows are not retrieved. Data rows in progress may be retrieved.
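Because retrieval is newest first, taking only the first few items of the collection yields the most recently created data rows. The following sketch assumes the returned PaginatedCollection can be iterated like any Python iterable; newest_data_rows is a hypothetical helper name.

```python
from itertools import islice


def newest_data_rows(dataset, n):
    """Return up to n data rows, relying on the newest-first retrieval order."""
    # islice stops after n items, so the remainder of the collection
    # is never consumed by this helper.
    return list(islice(dataset.data_rows(), n))
```

Calling `newest_data_rows(dataset, 10)` would then return at most ten data rows.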

data_rows_for_external_id(external_id, limit=10) -> List[DataRow]

Convenience method for getting multiple DataRows belonging to this Dataset that have the given external_id.

Parameters:
  • external_id (str) – External ID of the sought DataRows.

  • limit (int) – The maximum number of data rows to return for the given external_id

Returns:

A list of DataRows with the given external ID.

Raises:

labelbox.exceptions.ResourceNotFoundError – If there is no DataRow in this Dataset with the given external ID, or if there are multiple DataRows for it.

export(task_name: str | None = None, filters: DatasetExportFilters | None = None, params: CatalogExportParams | None = None) -> ExportTask

Creates a dataset export task with the given params and returns the task.

>>> dataset = client.get_dataset(DATASET_ID)
>>> task = dataset.export(
...     filters={
...         "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
...         "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
...         "data_row_ids": [DATA_ROW_ID_1, DATA_ROW_ID_2, ...]  # or global_keys: [DATA_ROW_GLOBAL_KEY_1, DATA_ROW_GLOBAL_KEY_2, ...]
...     },
...     params={
...         "performance_details": False,
...         "label_details": True
...     })
>>> task.wait_till_done()
>>> task.result

export_data_rows(timeout_seconds=120, include_metadata: bool = False) -> Generator

Returns a generator that produces all data rows that are currently attached to this dataset.

Note: For efficiency, the data are cached for 30 minutes. Newly created data rows will not appear until the end of the cache period.

Parameters:
  • timeout_seconds (float) – Max waiting time, in seconds.

  • include_metadata (bool) – True to return related DataRow metadata

Returns:

Generator that yields DataRow objects belonging to this dataset.

Raises:

LabelboxError – If the export fails or cannot be downloaded within the specified time.

export_v2(task_name: str | None = None, filters: DatasetExportFilters | None = None, params: CatalogExportParams | None = None) -> Task | ExportTask

Creates a dataset export task with the given params and returns the task.

>>> dataset = client.get_dataset(DATASET_ID)
>>> task = dataset.export_v2(
...     filters={
...         "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
...         "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
...         "data_row_ids": [DATA_ROW_ID_1, DATA_ROW_ID_2, ...]  # or global_keys: [DATA_ROW_GLOBAL_KEY_1, DATA_ROW_GLOBAL_KEY_2, ...]
...     },
...     params={
...         "performance_details": False,
...         "label_details": True
...     })
>>> task.wait_till_done()
>>> task.result

remove_iam_integration() -> None

Unsets the IAM integration for the dataset.

Parameters:

None

Returns:

None

Raises:

LabelboxError – If the IAM integration can’t be unset.

Examples

>>> dataset.remove_iam_integration()

upsert_data_rows(items, file_upload_thread_count=20) -> DataUpsertTask

Upserts data rows in this dataset. When “key” is provided and references an existing data row, that data row is updated. When “key” is not provided, a new data row is created.

>>> task = dataset.upsert_data_rows([
...     # create new data row
...     {
...         "row_data": "http://my_site.com/photos/img_01.jpg",
...         "global_key": "global_key1",
...         "external_id": "ex_id1",
...         "attachments": [
...             {"type": AttachmentType.RAW_TEXT, "name": "att1", "value": "test1"}
...         ],
...         "metadata": [
...             {"name": "tag", "value": "tag value"},
...         ]
...     },
...     # update global key of data row by existing global key
...     {
...         "key": GlobalKey("global_key1"),
...         "global_key": "global_key1_updated"
...     },
...     # update data row by ID
...     {
...         "key": UniqueId(dr.uid),
...         "external_id": "ex_id1_updated"
...     },
... ])
>>> task.wait_till_done()