CLI Documentation
datachain
DataChain: Wrangle unstructured AI data at scale
Usage:
Options:
-V/--version: show program's version number and exit
Arguments:
command: Use datachain command --help for command-specific help.
datachain clear-cache
Clear the local file cache
Usage:
datachain clear-cache [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
datachain clone
Copy data files from the cloud
Usage:
datachain clone [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
[-u] [-v] [-q] [--debug-sql] [--pdb] [-f] [-r]
[--no-glob] [--no-cp] [--edatachain]
[--edatachain-file EDATACHAIN_FILE]
sources [sources ...] output
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-f/--force: Force creating outputs
-r/-R/--recursive: Copy directories recursively
--no-glob: Do not expand globs (such as * or ?)
--no-cp: Do not copy files, just create a dataset
--edatachain: Create a .edatachain file
--edatachain-file: Use a different filename for the resulting .edatachain file
Arguments:
sources: Data sources - paths to cloud storage dirs
output: Output
datachain completion
Output shell completion script
Usage:
datachain completion [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[-s {bash,zsh,tcsh}]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-s/--shell: Shell syntax for completions. (default: "bash")
datachain cp
Copy data files from the cloud
Usage:
datachain cp [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb] [-f] [-r] [--no-glob]
sources [sources ...] output
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-f/--force: Force creating outputs
-r/-R/--recursive: Copy directories recursively
--no-glob: Do not expand globs (such as * or ?)
Arguments:
sources: Data sources - paths to cloud storage dirs
output: Output
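To illustrate what glob expansion means for the sources argument, and why --no-glob exists, here is a minimal Python sketch. It uses the standard library's fnmatch module as a stand-in matcher; the expand helper and the file names are hypothetical, and DataChain's own matching rules may differ.

```python
from fnmatch import fnmatchcase


def expand(pattern, names, no_glob=False):
    """Return the names selected by `pattern` (hypothetical helper)."""
    if no_glob:
        # With globbing disabled, * and ? are ordinary characters,
        # so only an exact name match is selected.
        return [name for name in names if name == pattern]
    # With globbing enabled, * and ? act as wildcards.
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["cat.jpg", "dog.jpg", "notes.txt", "*.jpg"]
print(expand("*.jpg", names))                # ['cat.jpg', 'dog.jpg', '*.jpg']
print(expand("*.jpg", names, no_glob=True))  # ['*.jpg']
```

Disabling expansion is mainly useful when an object name literally contains * or ?.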
datachain dataset-stats
Shows basic dataset stats
Usage:
datachain dataset-stats [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[--version VERSION] [-b] [--si]
name
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--version: Dataset version
-b/--bytes: Display size in bytes instead of human-readable size
--si: Display size using powers of 1000 not 1024
Arguments:
name: Dataset name
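The difference between the default output, -b/--bytes, and --si can be sketched in a few lines of Python. The format_size function, its unit letters, and the one-decimal rounding are assumptions made for illustration; the exact strings that DataChain prints may differ.

```python
def format_size(num_bytes, si=False, raw_bytes=False):
    """Render a byte count (hypothetical helper, not DataChain's formatter)."""
    if raw_bytes:
        # Equivalent in spirit to -b/--bytes: no scaling at all.
        return str(num_bytes)
    # --si scales by powers of 1000; the default scales by powers of 1024.
    base = 1000 if si else 1024
    size = float(num_bytes)
    unit = ""
    for unit in ["", "K", "M", "G", "T", "P"]:
        if size < base or unit == "P":
            break
        size /= base
    return str(num_bytes) if unit == "" else f"{size:.1f}{unit}"


print(format_size(1_000_000))                  # 976.6K
print(format_size(1_000_000, si=True))         # 1.0M
print(format_size(1_000_000, raw_bytes=True))  # 1000000
```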
datachain datasets
List datasets
Usage:
datachain datasets [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
[-u] [-v] [-q] [--debug-sql] [--pdb] [--studio] [-L]
[-a] [--team TEAM]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--studio: List the files in the Studio
-L/--local: List local files only
-a/--all: List all files including hidden files (default: true)
--team: The team to list datasets for. By default, it will use team from config.
datachain du
Display space usage
Usage:
datachain du [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb] [-b] [-d N] [--si]
sources [sources ...]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-b/--bytes: Display sizes in bytes instead of human-readable sizes
-d/--depth/--max-depth: Display sizes for N directory depths below the given directory, the default is 0 (summarize provided directory only).
--si: Display sizes using powers of 1000 not 1024
Arguments:
sources: Data sources - paths to cloud storage dirs
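The effect of -d/--depth can be pictured as aggregating file sizes per directory, down to N levels below the source. The following Python sketch works on an in-memory mapping of relative paths to sizes; du_by_depth and the sample files are hypothetical and only illustrate the depth semantics described above, not how DataChain computes its totals.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def du_by_depth(files, depth=0):
    """Sum sizes per directory, at most `depth` levels below the root."""
    totals = defaultdict(int)
    for path, size in files.items():
        # Directory components only; the file name itself is dropped.
        parts = PurePosixPath(path).parts[:-1]
        # Credit the file to the root (".") and to each ancestor
        # directory that lies within the requested depth.
        for level in range(min(depth, len(parts)) + 1):
            key = "/".join(parts[:level]) or "."
            totals[key] += size
    return dict(totals)


files = {"a/x.bin": 100, "a/b/y.bin": 50, "c/z.bin": 25}
print(du_by_depth(files, depth=0))  # {'.': 175}
print(du_by_depth(files, depth=1))  # {'.': 175, 'a': 150, 'c': 25}
```

With the default depth of 0, only the total for the given directory is reported.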
datachain edit-dataset
Edit dataset metadata
Usage:
datachain edit-dataset [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[--new-name NEW_NAME]
[--description DESCRIPTION]
[--labels LABELS [LABELS ...]]
name
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--new-name: Dataset new name
--description: Dataset description
--labels: Dataset labels
Arguments:
name: Dataset name
datachain find
Search in a directory hierarchy
Usage:
datachain find [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb] [--name NAME]
[--iname INAME] [--path PATH] [--ipath IPATH]
[--size SIZE] [--type TYPE] [-c COLUMNS]
sources [sources ...]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--name: Filename pattern to match.
--iname: Like --name but case insensitive.
--path: Path pattern to match.
--ipath: Like --path but case insensitive.
--size: Filter by size (+ is greater or equal, - is less or equal). Specified size is in bytes, or use a suffix like K, M, G for kilobytes, megabytes, gigabytes, etc.
--type: File type: "f" - regular, "d" - directory
-c/--columns: A comma-separated list of columns to print for each result. Options are: du,name,path,size,type (Default: path)
Arguments:
sources: Data sources - paths to cloud storage dirs
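The --size syntax (a sign, a number, and an optional suffix) can be read as a comparison against a byte threshold. The Python sketch below shows one way to interpret it. The parse_size_filter and matches helpers are hypothetical, and treating K, M, and G as powers of 1024 is an assumption, since the help text does not state the base. Values without a leading + or - are rejected here because their meaning is not documented above.

```python
import operator

# Exponent applied to the base for each suffix (only K, M, G are handled here).
SUFFIXES = {"K": 1, "M": 2, "G": 3}


def parse_size_filter(spec, base=1024):
    """Turn a spec such as '+1M' into (comparison, threshold_in_bytes)."""
    if not spec or spec[0] not in "+-":
        raise ValueError("expected a leading + or -")
    # + means "greater or equal", - means "less or equal".
    compare = operator.ge if spec[0] == "+" else operator.le
    number = spec[1:].upper()
    exponent = SUFFIXES.get(number[-1:], 0)
    if exponent:
        number = number[:-1]
    return compare, int(number) * base ** exponent


def matches(size, spec):
    compare, threshold = parse_size_filter(spec)
    return compare(size, threshold)


print(parse_size_filter("+1M")[1])  # 1048576
print(matches(2_000_000, "+1M"))    # True
print(matches(2048, "-1K"))         # False
```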
datachain gc
Garbage collect temporary tables
Usage:
datachain gc [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
datachain index
Index storage location
Usage:
datachain index [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
[-u] [-v] [-q] [--debug-sql] [--pdb]
sources [sources ...]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
Arguments:
sources: Data sources - paths to cloud storage dirs
datachain internal-run-udf
Usage:
datachain internal-run-udf [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql]
[--pdb]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
datachain internal-run-udf-worker
Usage:
datachain internal-run-udf-worker [-h]
[--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql]
[--pdb]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
datachain ls
List storage contents
Usage:
datachain ls [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb] [-l] [--studio] [-L] [-a]
[--team TEAM]
[sources ...]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-l/--long: List files in the long format
--studio: List the files in the Studio
-L/--local: List local files only
-a/--all: List all files including hidden files (default: true)
--team: The team to list datasets for. By default, it will use team from config.
Arguments:
sources: Data sources - paths to cloud storage dirs
datachain pull
Pull specific dataset version from SaaS
Usage:
datachain pull [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb] [-o OUTPUT] [-f] [-r]
[--no-cp] [--edatachain]
[--edatachain-file EDATACHAIN_FILE]
dataset
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-o/--output: Output
-f/--force: Force creating outputs
-r/-R/--recursive: Copy directories recursively
--no-cp: Do not copy files, just pull a remote dataset into local DB
--edatachain: Create a .edatachain file
--edatachain-file: Use a different filename for the resulting .edatachain file
Arguments:
dataset: Name and version of remote dataset created in SaaS
datachain query
Create a new dataset with a query script
Usage:
datachain query [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
[-u] [-v] [-q] [--debug-sql] [--pdb] [--parallel [N]]
[-p param=value]
<script.py>
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--parallel: Use multiprocessing to run any query script UDFs with N worker processes. N defaults to the CPU count.
-p/--param: Query parameters
Arguments:
<script.py>: Filepath for script
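When driving the query command from another Python program, it can help to assemble the argument list explicitly. The sketch below only builds and prints the command line using the options documented above; build_query_command, the script.py path, and the limit parameter are hypothetical placeholders.

```python
import shlex


def build_query_command(script, parallel=None, param=None):
    """Assemble an argv list for `datachain query` (hypothetical helper)."""
    cmd = ["datachain", "query"]
    if parallel is not None:
        # N is passed explicitly so the script path that follows
        # cannot be mistaken for the optional worker count.
        cmd += ["--parallel", str(parallel)]
    if param is not None:
        key, value = param
        # -p expects a single param=value token.
        cmd += ["-p", f"{key}={value}"]
    cmd.append(script)
    return cmd


cmd = build_query_command("script.py", parallel=4, param=("limit", 100))
print(shlex.join(cmd))  # datachain query --parallel 4 -p limit=100 script.py
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where the datachain executable is installed.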
datachain rm-dataset
Removes dataset
Usage:
datachain rm-dataset [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[--version VERSION] [--force | --no-force]
name
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--version: Dataset version
--force/--no-force: Force delete registered dataset with all of its versions (default: false)
Arguments:
name: Dataset name
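The paired --force | --no-force notation in the usage line is the shape Python's argparse produces for a boolean flag with an explicit negative form. A minimal sketch of that pattern, using a stand-alone demo parser rather than DataChain's actual parser (whether DataChain uses this exact action is an assumption), looks like this:

```python
import argparse

# Demo parser only; it mirrors the documented flags but is not DataChain code.
parser = argparse.ArgumentParser(prog="rm-dataset-demo")
parser.add_argument("name", help="Dataset name")
parser.add_argument(
    "--force",
    action=argparse.BooleanOptionalAction,  # also registers --no-force (Python 3.9+)
    default=False,
)

print(parser.parse_args(["my-dataset"]).force)                # False
print(parser.parse_args(["my-dataset", "--force"]).force)     # True
print(parser.parse_args(["my-dataset", "--no-force"]).force)  # False
```

The explicit --no-force form is handy in scripts that want to state the default rather than rely on it.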
datachain show
Show the rows or schema of a dataset
Usage:
datachain show [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon] [-u]
[-v] [-q] [--debug-sql] [--pdb] [--version VERSION]
[--schema] [--limit LIMIT] [--offset OFFSET]
[--columns COLUMNS] [--no-collapse]
name
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--version: Dataset version
--schema: Show schema
--limit: Number of rows to show (default: 10)
--offset: Number of rows to offset
--columns: Columns to show
--no-collapse: Do not collapse the columns
Arguments:
name: Dataset name
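The --limit and --offset options together select a window of rows. As a rough mental model, assuming rows are numbered from zero and that the offset defaults to 0 (the help text does not state a default), the selection behaves like a Python slice. The page helper below is hypothetical.

```python
def page(rows, limit=10, offset=0):
    """Return `limit` rows starting `offset` rows in (hypothetical helper)."""
    return rows[offset:offset + limit]


rows = list(range(25))
print(page(rows))                      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(page(rows, limit=5, offset=20))  # [20, 21, 22, 23, 24]
```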
datachain studio
Authenticate DataChain with Studio and set the token. Once the token is configured, DataChain uses it to share datasets and to access Studio features from the CLI.
Usage:
datachain studio [-h] [--aws-endpoint-url AWS_ENDPOINT_URL] [--anon]
[-u] [-v] [-q] [--debug-sql] [--pdb]
{login,logout,team,token,datasets} ...
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
Arguments:
cmd: Use datachain studio CMD --help to display command-specific help.
datachain studio datasets
This command lists all the datasets available in Studio. It will show the dataset name and the number of versions available.
Usage:
datachain studio datasets [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[--team TEAM]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--team: The team to list datasets for. By default, it will use team from config.
datachain studio login
By default, this command authenticates DataChain with Studio using the default scopes and assigns a random name to the token.
Usage:
datachain studio login [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[-H HOSTNAME] [-s SCOPES] [-n NAME] [--no-open]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
-H/--hostname: The hostname of the Studio instance to authenticate with.
-s/--scopes: The scopes for the authentication token.
-n/--name: The name of the authentication token. It is used to identify the token shown in your Studio profile.
--no-open: Use an authentication flow based on a user code. You will be presented with a user code to enter in the browser. DataChain also uses this flow if it cannot launch a browser on your behalf.
datachain studio logout
This removes the studio token from your global config.
Usage:
datachain studio logout [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
datachain studio team
Set the default team for DataChain to use when interacting with Studio.
Usage:
datachain studio team [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
[--global]
team_name
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception
--global: Set the team globally for all DataChain projects.
Arguments:
team_name: The name of the team to set as the default.
datachain studio token
View the token datachain uses to contact Studio
Usage:
datachain studio token [-h] [--aws-endpoint-url AWS_ENDPOINT_URL]
[--anon] [-u] [-v] [-q] [--debug-sql] [--pdb]
Options:
--aws-endpoint-url: AWS endpoint URL
--anon: AWS anon (aka awscli's --no-sign-request)
-u/--update: Update cache
-v/--verbose: Verbose
-q/--quiet: Be quiet
--debug-sql: Show All SQL Queries (very verbose output, for debugging only)
--pdb: Drop into the pdb debugger on fatal exception