llmcompressor.transformers.finetune.data.data_helpers
get_custom_datasets_from_path(path, ext='json')
Get a dictionary of custom datasets from a directory path. Supports HF's load_dataset for local folder datasets: https://huggingface.co/docs/datasets/loading
This function scans the specified directory for files with a given extension (default 'json'). It builds a dictionary whose keys are either subdirectory names or dataset file names (depending on the directory structure) and whose values are either single file paths (if only one file exists with that name) or lists of file paths (if multiple files exist).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the directory containing the dataset files. | required |
ext | str | The file extension to filter files by. Default is 'json'. | 'json' |
Returns:
Type | Description |
---|---|
Dict[str, str] | A dictionary mapping dataset names to their file paths or lists of file paths. |

Example:

dataset = get_custom_datasets_from_path("/path/to/dataset/directory", "json")

Note: If datasets are organized in subdirectories, the function constructs the dictionary with lists of file paths. If datasets are found directly in the main directory, they are included with their respective names.

Accepts either a flat layout:

- path
  - train.json
  - test.json
  - val.json

or a nested layout:

- path
  - train
    - data1.json
    - data2.json
    - ...
  - test
    - ...
  - val
    - ...
Source code in llmcompressor/transformers/finetune/data/data_helpers.py
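A minimal usage sketch (the "/data/my_dataset" path is hypothetical; the returned shapes follow the docstring above):

```python
from llmcompressor.transformers.finetune.data.data_helpers import (
    get_custom_datasets_from_path,
)

# Flat layout: /data/my_dataset/{train,test,val}.json
# -> {"train": ".../train.json", "test": ".../test.json", "val": ".../val.json"}
data_files = get_custom_datasets_from_path("/data/my_dataset", "json")

# Nested layout: /data/my_dataset/{train,test,val}/*.json
# -> {"train": [".../train/data1.json", ".../train/data2.json"], ...}
```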
get_raw_dataset(dataset_args, cache_dir=None, streaming=False, **kwargs)
Load the raw dataset from Hugging Face, using a cached copy if available.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cache_dir | Optional[str] | disk location to search for cached dataset | None |
streaming | Optional[bool] | True to stream data from Hugging Face, otherwise download | False |
Returns:
Type | Description |
---|---|
Dataset | the requested dataset |
Source code in llmcompressor/transformers/finetune/data/data_helpers.py
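For orientation, a rough sketch of the direct `datasets` call this helper wraps (the dataset name, config, and the way they map onto dataset_args are assumptions for illustration, not part of the documented signature):

```python
from datasets import load_dataset

# Hypothetical example values; in practice these come from dataset_args.
raw_dataset = load_dataset(
    "wikitext",           # e.g. the dataset name carried by dataset_args
    "wikitext-2-raw-v1",  # e.g. the dataset config carried by dataset_args
    cache_dir=None,       # disk location to search for a cached dataset
    streaming=False,      # True to stream from Hugging Face instead of downloading
)
```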
transform_dataset_keys(data_files)
Transform dict keys to 'train', 'val', or 'test' for the given input dict if matches exist with the existing keys. Note that there can only be one matching file name.

Ex. Folder(train_foo.json) -> Folder(train.json)
Folder(train1.json, train2.json) -> Same
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_files | Dict[str, Any] | The dict where keys will be transformed | required |
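The renaming rule can be illustrated with a small sketch (an illustration of the documented behavior, not the library's actual implementation):

```python
from typing import Any, Dict

def transform_keys_sketch(data_files: Dict[str, Any]) -> Dict[str, Any]:
    """Illustrative only: rename a key to 'train', 'val', or 'test' when
    exactly one existing key matches that split name."""
    result = dict(data_files)
    for split in ("train", "val", "test"):
        matches = [key for key in result if split in key]
        if len(matches) == 1 and matches[0] != split:
            result[split] = result.pop(matches[0])
    return result

print(transform_keys_sketch({"train_foo": "train_foo.json"}))
# {'train': 'train_foo.json'}               (single match -> renamed)
print(transform_keys_sketch({"train1": "a.json", "train2": "b.json"}))
# {'train1': 'a.json', 'train2': 'b.json'}  (two matches -> unchanged)
```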