datasets 🧠

Use BaseDataSet for creating a project-specific dataset class.

Authors: Simon M. Hofmann | Hannah S. Heinrichs
Years: 2021-2023

BaseDataSet 🧠

BaseDataSet(
    study_table_or_path: DataFrame | str | Path,
    project_id: str,
    mri_sequence: str,
    register_to_mni: int | None = None,
    cache_dir: str | Path = CACHE_DIR,
    load_mode: str = "full_array",
    **kwargs
)

Bases: ABC

Base class for an MRI dataset.

In the case of a research project with multiple MRI sequences, each sequence must have its own dataset class that inherits from the BaseDataSet class.

Initialize BaseDataSet.

Usage

# Create a study-specific dataset class
class MyStudyData(BaseDataSet):
    def __init__(self):
        super().__init__(
            study_table_or_path="PATH/TO/STUDY_TABLE.csv",  # one column must be 'sid' (subject ID)
            project_id="MyProjectID",
            mri_sequence="t1w",  # this is of descriptive nature, for projects with multiple MRI sequences
            load_mode="full_array",  # or 'file_paths' for very large datasets
        )

    # Define mri_path_constructor (a static method, as declared in BaseDataSet)
    @staticmethod
    def mri_path_constructor(sid: str) -> str | Path:
        return f"/path/to/mri/{sid}.nii.gz"


# Instantiate the dataset class
my_study_data = MyStudyData()
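
For a project with more than one MRI sequence, the one-class-per-sequence pattern might look like the sketch below. `SketchBaseDataSet` is a stripped-down stand-in for `BaseDataSet` (it only stores two attributes), so the snippet runs without `xai4mri` installed. The class names, the project ID, and the directory layout are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class SketchBaseDataSet(ABC):
    """Stand-in for BaseDataSet (assumption: reduced to what this sketch needs)."""

    def __init__(self, project_id: str, mri_sequence: str) -> None:
        self.project_id = project_id
        self.mri_sequence = mri_sequence

    @staticmethod
    @abstractmethod
    def mri_path_constructor(sid: str) -> str | Path:
        """Return the path to the original MRI file of subject `sid`."""


class MyStudyT1w(SketchBaseDataSet):
    """Dataset class for the T1-weighted images of the project."""

    def __init__(self) -> None:
        super().__init__(project_id="MyProjectID", mri_sequence="t1w")

    @staticmethod
    def mri_path_constructor(sid: str) -> str | Path:
        return Path("/path/to/mri", "t1w", f"{sid}.nii.gz")  # hypothetical layout


class MyStudyDwi(SketchBaseDataSet):
    """Dataset class for the diffusion-weighted images of the same project."""

    def __init__(self) -> None:
        super().__init__(project_id="MyProjectID", mri_sequence="dwi")

    @staticmethod
    def mri_path_constructor(sid: str) -> str | Path:
        return Path("/path/to/mri", "dwi", f"{sid}.nii.gz")  # hypothetical layout


t1w_data = MyStudyT1w()
dwi_data = MyStudyDwi()
print(t1w_data.mri_sequence, dwi_data.mri_path_constructor(sid="sub-01"))
```

Because `mri_path_constructor` is abstract, the base class itself cannot be instantiated; only subclasses that define it can.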

Parameters:

Name	Type	Description	Default
`study_table_or_path`	`DataFrame \| str \| Path`	The study table, OR the absolute or relative path to the table [`.csv` \| `.tsv`]. The table must have 'sid' as the index column, containing subject IDs.	required
`project_id`	`str`	The project ID.	required
`mri_sequence`	`str`	MRI sequence ('t1_mni_1mm', 't2', 'dwi', or similar). This is of a descriptive nature for projects with multiple MRI sequences, hence multiple subclasses of `BaseDataSet`.	required
`register_to_mni`	`int \| None`	Register MRIs to MNI space at 1 mm or 2 mm resolution using `ANTs`, or skip registration (`None`).	`None`
`cache_dir`	`str \| Path`	Path to the cache directory, where intermediate and processed data is stored.	`CACHE_DIR`
`load_mode`	`str`	Load mode for the dataset: 'file_paths': Load the MRI data from file paths (recommended for very large datasets). 'full_array': Load the MRI data as a full array (default).	`'full_array'`
`kwargs`		Additional keyword arguments for MRI processing. See the docs of `xai4mri.dataloader.mri_dataloader._load_data_as_full_array()` or `_load_data_as_file_paths()` for details.	`{}`
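
The way `__init__` consumes `kwargs` can be sketched as follows. The keyword names and defaults are taken from the source listing below; the function `split_processing_kwargs` itself is hypothetical and only illustrates the pop-with-default pattern (the real `__init__` additionally pops `ignore_checks`).

```python
from __future__ import annotations

from typing import Any


def split_processing_kwargs(
    load_mode: str = "full_array", **kwargs: Any
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (recognised settings, leftover kwargs), using the defaults from the source."""
    settings = {
        "norm": kwargs.pop("norm", True),
        "path_brain_mask": kwargs.pop("path_brain_mask", None),
        "compress": kwargs.pop("compress", True),
        "prune_mode": kwargs.pop("prune_mode", "max"),  # None, "max", or "cube"
        "path_to_dataset": kwargs.pop("path_to_dataset", None),
        "cache_files": kwargs.pop("cache_files", True),
        # Processed data is saved by default, except in 'file_paths' mode
        "save_after_processing": kwargs.pop("save_after_processing", load_mode != "file_paths"),
    }
    # Whatever remains was not recognised; BaseDataSet prints a warning and ignores it
    return settings, kwargs


# 'nrom' is a deliberate typo to show how unknown keywords surface
settings, unknown = split_processing_kwargs(load_mode="file_paths", prune_mode="cube", nrom=False)
print(settings["prune_mode"], settings["save_after_processing"], unknown)
```

A misspelled keyword therefore does not raise an error; it ends up in the "unknown kwargs" warning while the default value stays in effect.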

Source code in src/xai4mri/dataloader/datasets.py

def __init__(
    self,
    study_table_or_path: pd.DataFrame | str | Path,
    project_id: str,
    mri_sequence: str,
    register_to_mni: int | None = None,
    cache_dir: str | Path = CACHE_DIR,
    load_mode: str = "full_array",
    **kwargs,
):
    """
    Initialize BaseDataSet.

    !!! example "Usage"
        ```python
        # Create a study-specific dataset class
        class MyStudyData(BaseDataSet):
            def __init__(self):
                super().__init__(
                    study_table_or_path="PATH/TO/STUDY_TABLE.csv",  # one column must be 'sid' (subject ID)
                    project_id="MyProjectID",
                    mri_sequence="t1w",  # this is of descriptive nature, for projects with multiple MRI sequences
                    load_mode="full_array",  # or 'file_paths' for very large datasets
                )

            # Define mri_path_constructor (a static method, as declared in BaseDataSet)
            @staticmethod
            def mri_path_constructor(sid: str) -> str | Path:
                return f"/path/to/mri/{sid}.nii.gz"


        # Instantiate the dataset class
        my_study_data = MyStudyData()
        ```

    :param study_table_or_path: The study table, OR the absolute or relative path to the table [`*.csv` | `*.tsv`].
                                The table must have 'sid' as the index column, containing subject IDs.
    :param project_id: The project ID.
    :param mri_sequence: MRI sequence ('t1_mni_1mm', 't2', 'dwi', or similar).
                         This is of a descriptive nature for projects with multiple MRI sequences,
                         hence multiple subclasses of `BaseDataSet`.
    :param register_to_mni: Register MRIs to MNI space at 1 mm or 2 mm resolution
                            using [`ANTs`](https://antspyx.readthedocs.io/en/latest/),
                            or skip registration (`None`).
    :param cache_dir: Path to the cache directory, where intermediate and processed data is stored.
    :param load_mode: Load mode for the dataset:
                      'file_paths': Load the MRI data from file paths (recommended for very large datasets).
                      'full_array': Load the MRI data as a full array (default).
    :param kwargs: Additional keyword arguments for MRI processing.
                   See the docs of
                   `xai4mri.dataloader.mri_dataloader._load_data_as_full_array()`
                   or `_load_data_as_file_paths()` for details.
    """
    self._study_table = None  # init
    self._study_table_path: str | Path | None = None  # init

    self.study_table = study_table_or_path
    self.project_id = project_id
    self.mri_sequence = mri_sequence
    self.cache_dir = cache_dir
    self.load_mode = load_mode

    self._sid_list = None  # init
    self._split_dict: dict[str, np.ndarray] | None = None  # init

    # Following variables should ideally remain untouched after dataset class is instantiated.
    # They get passed via self.get_data to .mri_dataloader._load_data_as_full_array() (find kwargs details there)
    self._regis_mni: int | None = register_to_mni
    self._norm: bool = kwargs.pop("norm", True)
    self._path_brain_mask: str | None = kwargs.pop("path_brain_mask", None)
    self._compress: bool = kwargs.pop("compress", True)
    self._prune_mode: str | None = kwargs.pop("prune_mode", "max")  # None: no pruning, OR "max" or "cube"
    self._path_to_dataset: str | None = kwargs.pop("path_to_dataset", None)  # if set not in cache dir
    self._cache_files: bool = kwargs.pop("cache_files", True)
    self._save_after_processing: bool = kwargs.pop("save_after_processing", self.load_mode != "file_paths")
    # **save_kwargs # as_npy, as_zip in get_mri_set_path() &

    # Run check
    if not kwargs.pop("ignore_checks", False):
        self._check_mri_path_constructor()

    # Print unknown kwargs
    if kwargs:
        cprint(
            f"At init of the DataSet class, unknown kwargs were passed: {kwargs}, which will be ignored!", col="y"
        )

current_split_dict `property` 🧠

current_split_dict: dict[str, ndarray[str]] | None

Return the current split dictionary.

The split dictionary is created by calling create_data_split(), and has the following structure:

{'train': np.ndarray[str], 'validation': np.ndarray[str], 'test': np.ndarray[str]}

Returns:

Type	Description
`dict[str, ndarray[str]] \| None`	split dictionary
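
A dictionary of this shape can also be handed back to `create_data_split()` via `split_dict`. The consistency checks applied there can be sketched as below; `check_split_dict` is a hypothetical helper, and plain lists stand in for the NumPy arrays so the snippet has no dependencies.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping

EXPECTED_KEYS = {"train", "validation", "test"}


def check_split_dict(split_dict: Mapping[str, Iterable[str]], dataset_sids: Iterable[str]) -> None:
    """Raise ValueError if the split dictionary is inconsistent with the dataset."""
    if set(split_dict.keys()) != EXPECTED_KEYS:
        raise ValueError("split_dict must have keys 'train', 'validation', 'test'!")
    # Flatten the subject IDs of all three subsets
    all_sids = [sid for sids in split_dict.values() for sid in sids]
    if not set(all_sids).issubset(set(dataset_sids)):
        raise ValueError("All SIDs in split_dict must be part of the dataset!")
    if len(set(all_sids)) != len(all_sids):
        raise ValueError("SIDs must only appear once in the split!")


dataset_sids = ["sub-01", "sub-02", "sub-03", "sub-04"]
good_split = {"train": ["sub-01", "sub-02"], "validation": ["sub-03"], "test": ["sub-04"]}
check_split_dict(good_split, dataset_sids)  # passes silently
```

Note that the split does not have to cover every subject in the dataset; it only must not contain unknown or repeated subject IDs.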

load_mode `property` `writable` 🧠

load_mode: str

Return the load mode for the dataset.

The load mode can be either 'file_paths' or 'full_array'.

'file_paths': Load the MRI data from file paths. That is, individual files are stored separately.
'full_array': Load the MRI data as a full array. Data is saved as a single large file.

sid_list `property` `writable` 🧠

sid_list: ndarray[str]

Return the list of subject IDs.

Returns:

Type	Description
`np.ndarray[str]`	list of subject IDs

study_table `property` `writable` 🧠

study_table: DataFrame

Get the study table.

Ideally, each BaseDataSet has its own study table, unless all MRI sequences are available for all participants of a research project. In this case, the study table can be the same for all MRI sequences and derivatives.

Returns:

Type	Description
`pd.DataFrame`	study table

study_table_path `property` `writable` 🧠

study_table_path: str | None

Return the path to the study table if it has been provided.

Can be provided as study_table_or_path at initialization, or it can be set later.

Returns:

Type	Description
`str \| None`	path to the study table

create_data_split 🧠

create_data_split(
    target: str,
    batch_size: int = 1,
    split_ratio: tuple[float, float, float] | None = (
        0.8,
        0.1,
        0.1,
    ),
    split_dict: dict[str, str] | None = None,
    **get_data_kwargs
) -> tuple[
    dict[str, ndarray],
    _DataSetGenerator,
    _DataSetGenerator,
    _DataSetGenerator,
]

Create data split with a training, validation, and test set.

The data subsets are provided as generator objects, and can be used for model training and evaluation.

Usage

# Create a data split for model training and evaluation
split_dict, train_gen, val_gen, test_gen = mydata.create_data_split(target="age")

# Train a model
model.fit(train_gen, validation_data=val_gen, ...)

Parameters:

Name	Type	Description	Default
`target`	`str`	Prediction target. `target` must match a column in the study table.	required
`batch_size`	`int`	Batch size in the returned data generators per data split. MRIs are large files; hence, it is recommended to keep batches small.	`1`
`split_ratio`	`tuple[float, float, float] \| None`	Ratio of the data split (train, validation, test). Must add up to 1.	`(0.8, 0.1, 0.1)`
`split_dict`	`dict[str, str] \| None`	Dictionary with 'train', 'validation', & 'test' as keys, and subject IDs as values. If a `split_dict` is provided, it overrules `split_ratio`. Providing `split_dict` is useful when specific subject data shall be used in a split.	`None`

Returns:

Type	Description
`tuple[dict[str, ndarray], _DataSetGenerator, _DataSetGenerator, _DataSetGenerator]`	split_dict, and the data generators for the training, validation, and test set
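
When no `split_dict` is given, the subject IDs are shuffled and cut according to `split_ratio`. A minimal sketch of that step, assuming plain lists instead of NumPy arrays: `ratio_split` and its `seed` argument are hypothetical additions for reproducibility, whereas the source shuffles with the global `random` state.

```python
from __future__ import annotations

import random


def ratio_split(
    sids: list[str],
    split_ratio: tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int | None = None,
) -> dict[str, list[str]]:
    """Shuffle the subject IDs and split them into train, validation, and test."""
    if round(sum(split_ratio), 3) != 1:
        raise ValueError("split_ratio must sum to 1.")
    indices = list(range(len(sids)))
    random.Random(seed).shuffle(indices)  # local generator, leaves global state untouched
    n_train = round(len(sids) * split_ratio[0])
    n_val = round(len(sids) * split_ratio[1])
    return {
        "train": [sids[i] for i in indices[:n_train]],
        "validation": [sids[i] for i in indices[n_train : n_train + n_val]],
        "test": [sids[i] for i in indices[n_train + n_val :]],  # the remainder
    }


sids = [f"sub-{i:02d}" for i in range(20)]
split = ratio_split(sids, seed=0)
print({name: len(subset) for name, subset in split.items()})
```

Taking the test set as the remainder after the training and validation subsets keeps the three subsets disjoint even when rounding leaves no subject for the test set.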

Source code in src/xai4mri/dataloader/datasets.py

def create_data_split(
    self,
    target: str,
    batch_size: int = 1,
    split_ratio: tuple[float, float, float] | None = (0.8, 0.1, 0.1),
    split_dict: dict[str, str] | None = None,
    **get_data_kwargs,
) -> tuple[dict[str, np.ndarray], _DataSetGenerator, _DataSetGenerator, _DataSetGenerator]:
    """
    Create data split with a training, validation, and test set.

    The data subsets are provided as generator objects, and can be used for model training and evaluation.

    !!! example "Usage"
        ```python
        # Create a data split for model training and evaluation
        split_dict, train_gen, val_gen, test_gen = mydata.create_data_split(target="age")

        # Train a model
        model.fit(train_gen, validation_data=val_gen, ...)
        ```

    :param target: Prediction target.
                   `target` must match a column in the study table.
    :param batch_size: Batch size in the returned data generators per data split.
                       MRIs are large files; hence, it is recommended to keep batches small.
    :param split_ratio: Ratio of the data split (train, validation, test).
                        Must add up to 1.
    :param split_dict: Dictionary with 'train', 'validation', & 'test' as keys, and subject IDs as values.
                       If a `split_dict` is provided, it overrules `split_ratio`.
                       Providing `split_dict` is useful when specific subject data shall be used in a split.
    :return: split_dict, and the data generators for the training, validation, and test set
    """
    # Load subject data
    volume_data, sid_list = self.get_data(mmap_mode="r", **get_data_kwargs)  # 'r' does not load data to RAM
    sid_list = np.array(sid_list)

    if split_dict is None:
        if not (split_ratio is not None and round(sum(split_ratio), 3) == 1):
            msg = "Either split_dict or split_ratio must be provided. split_ratio must sum to 1."
            raise ValueError(msg)

        # Create split indices
        indices = list(range(len(sid_list)))
        random.shuffle(indices)
        n_train = round(len(sid_list) * split_ratio[0])
        train_indices = indices[:n_train]
        n_val = round(len(sid_list) * split_ratio[1])
        val_indices = indices[n_train : n_train + n_val]
        n_test = len(sid_list) - n_train - n_val
        test_indices = indices[n_train + n_val :]  # remainder; indices[-0:] would select all subjects

        split_dict = {
            "train": sid_list[train_indices],
            "validation": sid_list[val_indices],
            "test": sid_list[test_indices],
        }

    else:
        all_sids = [item for sublist in split_dict.values() for item in sublist]
        if not set(all_sids).issubset(set(sid_list)):
            msg = "All SID's in split_dict must be part of the dataset!"
            raise ValueError(msg)
        if len(set(all_sids)) != len(all_sids):
            msg = "SID's must only appear once in the split!"
            raise ValueError(msg)
        if split_dict.keys() != {"train", "validation", "test"}:
            msg = "split_dict must have keys 'train', 'validation', 'test'!"
            raise ValueError(msg)

        # Create split indices
        train_indices = [list(sid_list).index(sid) for sid in split_dict["train"]]
        val_indices = [list(sid_list).index(sid) for sid in split_dict["validation"]]
        test_indices = [list(sid_list).index(sid) for sid in split_dict["test"]]

    # Prepare target (y) data
    if target not in self.study_table.columns:
        msg = "target variable must be in study table!"
        raise ValueError(msg)
    ydata = self.study_table[target]

    # Save split dict
    self._split_dict = split_dict

    # Define preprocessor
    if self.load_mode == "full_array":
        # Data is preprocessed already in the full array mode
        preprocessor = None
    else:
        # Set arguments for compress_and_norm() function
        metadata_table = self.get_metadata_table()

        clip_min, clip_max, global_norm_min, global_norm_max, self._norm = _extract_values_from_metadata(
            table=metadata_table,
            compress=self._compress,
            norm=self._norm,
        )

        # Set preprocessor
        preprocessor = partial(
            compress_and_norm,
            clip_min=clip_min,
            clip_max=clip_max,
            norm=self._norm,
            global_norm_min=global_norm_min,
            global_norm_max=global_norm_max,
        )

    return (
        split_dict,
        _DataSetGeneratorFactory.create_generator(
            name="train",  # training set
            x_data=volume_data[train_indices],
            y_data=ydata.loc[sid_list[train_indices]].to_numpy(),
            batch_size=batch_size,
            data_indices=train_indices,
            preprocess=preprocessor,
        ),
        _DataSetGeneratorFactory.create_generator(
            name="validation",  # validation set
            x_data=volume_data[val_indices],
            y_data=ydata.loc[sid_list[val_indices]].to_numpy(),
            batch_size=batch_size,
            data_indices=val_indices,
            preprocess=preprocessor,
        ),
        _DataSetGeneratorFactory.create_generator(
            name="test",  # test set
            x_data=volume_data[test_indices],
            y_data=ydata.loc[sid_list[test_indices]].to_numpy(),
            batch_size=1,
            data_indices=test_indices,
            preprocess=preprocessor,
        ),
    )

get_data 🧠

get_data(**kwargs) -> tuple[ndarray, ndarray[str]]

Load dataset into workspace.

Parameters:

Name	Type	Description	Default
`kwargs`		Additional keyword arguments for MRI processing. See the docs of `xai4mri.dataloader.mri_dataloader._load_data_as_full_array()` or `_load_data_as_file_paths()` for details. Pass these `kwargs` only to deviate deliberately from the values set at `__init__`; they overwrite the stored settings.	`{}`

Returns:

Type	Description
`tuple[ndarray, ndarray[str]]`	Processed MRI data, which is either a 5D array of shape `[n_subjects, x, y, z, channel=1]` for `self.load_mode='full_array'` or a 1D array of paths to the processed MRIs for `self.load_mode='file_paths'`, followed by the ordered array of corresponding subject IDs
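
As the source listing below shows, keyword arguments passed to `get_data()` are written back to the instance, so they also affect later calls. The following sketch reproduces only that mechanism with two of the settings; `SettingsSketch` is a hypothetical stand-in and does not load any data.

```python
from __future__ import annotations

from typing import Any


class SettingsSketch:
    """Stand-in that mimics how BaseDataSet stores and updates processing settings."""

    def __init__(self, **kwargs: Any) -> None:
        self._norm = kwargs.pop("norm", True)
        self._prune_mode = kwargs.pop("prune_mode", "max")

    def get_data(self, **kwargs: Any) -> dict[str, Any]:
        # Passed values replace the stored ones; omitted values keep the stored ones
        self._norm = kwargs.pop("norm", self._norm)
        self._prune_mode = kwargs.pop("prune_mode", self._prune_mode)
        return {"norm": self._norm, "prune_mode": self._prune_mode}


dataset = SettingsSketch()
first = dataset.get_data(prune_mode="cube")  # deviates from the __init__ setting
second = dataset.get_data()  # no kwargs, yet 'cube' is still in effect
print(first, second)
```

This persistence is the reason for the advice to pass such `kwargs` only when a deviation from the initial settings is really wanted.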

Source code in src/xai4mri/dataloader/datasets.py

def get_data(self, **kwargs) -> tuple[np.ndarray, np.ndarray[str]]:
    """
    Load dataset into workspace.

    :param kwargs: Additional keyword arguments for MRI processing.
                   See the docs of
                   `xai4mri.dataloader.mri_dataloader._load_data_as_full_array()`
                   or `_load_data_as_file_paths()` for details.
                   Pass these `kwargs` only to deviate deliberately
                   from the values set at `__init__`.
    :return: Processed MRI data: either a 5D data array of shape `[n_subjects, x, y, z, channel=1]`
             for `self.load_mode='full_array'`,
             or a 1D array (paths to processed MRIs) for `self.load_mode='file_paths'`,
             and the ordered list of corresponding subject IDs
    """
    # Unpack kwargs and update class attributes accordingly
    self._norm = kwargs.pop("norm", self._norm)
    self._path_brain_mask = kwargs.pop("path_brain_mask", self._path_brain_mask)
    self._compress = kwargs.pop("compress", self._compress)
    self._regis_mni = kwargs.pop("regis_mni", self._regis_mni)
    self._prune_mode = kwargs.pop("prune_mode", self._prune_mode)
    self._path_to_dataset = kwargs.pop("path_to_dataset", self._path_to_dataset)
    self._cache_files = kwargs.pop("cache_files", self._cache_files)
    self._save_after_processing = kwargs.pop("save_after_processing", self._save_after_processing)

    # Load data
    load_volume_data = _load_data_as_file_paths if self.load_mode == "file_paths" else _load_data_as_full_array

    volume_data, sid_list = load_volume_data(
        sid_list=self.sid_list,
        project_id=self.project_id,
        mri_path_constructor=self.mri_path_constructor,
        mri_seq=self.mri_sequence,
        cache_dir=self.cache_dir,
        norm=self._norm,
        path_brain_mask=self._path_brain_mask,
        compress=self._compress,
        regis_mni=self._regis_mni,
        prune_mode=self._prune_mode,
        path_to_dataset=self._path_to_dataset,
        cache_files=self._cache_files,
        save_after_processing=self._save_after_processing,
        **kwargs,  # == **save_kwargs remain, see _load_data_as_full_array() for details
    )

    if self.load_mode == "file_paths" and self._save_after_processing:
        # Reset to default for load_mode == "file_paths"
        self._save_after_processing = False

    sids_without_data = set(self.sid_list) - set(sid_list)
    if sids_without_data:
        msg = (
            f"No MRI data found for following subjects (ID's): {sids_without_data}.\n"
            f"Please locate data or remove SIDs from study table to proceed!"
        )
        raise ValueError(msg)

    return volume_data, sid_list

get_metadata_table 🧠

get_metadata_table()

Get the metadata table for the MRIs of the project dataset.

Source code in src/xai4mri/dataloader/datasets.py

def get_metadata_table(self):
    """Get the metadata table for the MRIs of the project dataset."""
    path_to_metadata = get_metadata_path(
        project_id=self.project_id,
        mri_seq=self.mri_sequence,
        regis_mni=self._regis_mni,
        path_brain_mask=self._path_brain_mask,
        norm=self._norm,
        prune_mode=self._prune_mode,
        path_to_dataset=self._path_to_dataset
        if isinstance(self._path_to_dataset, (str, Path))
        else self.cache_dir,
    )

    if Path(path_to_metadata).is_file():
        return pd.read_csv(path_to_metadata, index_col="sid", dtype={"sid": str})
    msg = f"Metadata table not found at '{path_to_metadata}'.\nRun data processing to create metadata table."
    raise FileNotFoundError(msg)

get_size_of_prospective_mri_set 🧠

get_size_of_prospective_mri_set(
    estimate_with_n: int = 3,
    estimate_processing_time: bool = True,
    **process_mri_kwargs
) -> None

Estimate the prospective storage size needed to save the pre-processed project data.

Additionally, estimate the time needed to process the entire dataset.

Parameters:

Name	Type	Description	Default
`estimate_with_n`	`int`	Number of samples used to approximate the size of the whole processed dataset. If `estimate_with_n` is at least the number of subjects, the entire dataset is used. Values above 10 are reduced to 10.	`3`
`estimate_processing_time`	`bool`	Estimate the time needed to process the entire dataset.	`True`
`process_mri_kwargs`		`kwargs` for `process_single_mri()`, e.g., `prune_mode`. These should overlap with or be equal to the `kwargs` for `self.get_data()`. Pass them only to deviate deliberately from the values set at `__init__`.	`{}`
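
Both estimation routes in the source boil down to simple arithmetic: either the array size is computed directly from shape and data type, or the size and time measured on a few samples are scaled up to all subjects. The sketch below illustrates this with hypothetical helper functions and made-up numbers (100 subjects, an image shape of 160 x 192 x 160).

```python
from __future__ import annotations

import math


def array_size_in_bytes(n_subjects: int, mri_shape: tuple[int, int, int], itemsize: int) -> int:
    """Size of one array holding all subjects: number of voxels times bytes per voxel."""
    return n_subjects * math.prod(mri_shape) * itemsize


def extrapolate(sample_total: float, n_sampled: int, n_total: int) -> float:
    """Scale a quantity measured on `n_sampled` subjects up to `n_total` subjects."""
    return sample_total * n_total / n_sampled


# Compressed and normalised data is stored as uint8 (1 byte per voxel)
size_uint8 = array_size_in_bytes(n_subjects=100, mri_shape=(160, 192, 160), itemsize=1)
# The same data as float64 (8 bytes per voxel) for comparison
size_float64 = array_size_in_bytes(n_subjects=100, mri_shape=(160, 192, 160), itemsize=8)
# Processing 3 samples took 45 seconds: projected time for 100 subjects
projected_seconds = extrapolate(sample_total=45.0, n_sampled=3, n_total=100)
print(size_uint8, size_float64, projected_seconds)
```

With only a handful of samples, the extrapolated values are rough guides, since individual images may differ in size and processing time.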

Source code in src/xai4mri/dataloader/datasets.py

def get_size_of_prospective_mri_set(
    self,
    estimate_with_n: int = 3,
    estimate_processing_time: bool = True,
    **process_mri_kwargs,
) -> None:
    """
    Estimate the prospective storage size needed to save the pre-processed project data.

    Additionally, estimate the time needed to process the entire dataset.

    :param estimate_with_n: Number of samples used to approximate the size of the whole processed dataset.
                            If `estimate_with_n` is at least the number of subjects, the entire dataset is used.
                            Values above 10 are reduced to 10.
    :param estimate_processing_time: Estimate the time needed to process the entire dataset.
    :param process_mri_kwargs: `kwargs` for `process_single_mri()`, e.g., `prune_mode`.
                               These should overlap with or be equal to the `kwargs` for `self.get_data()`.
                               Pass them only to deviate deliberately
                               from the values set at `__init__`.
    """
    max_n = 10
    if estimate_with_n > max_n:
        cprint(
            string=f"estimate_with_n is set down to {max_n}. "
            f"Too many samples make the calculation unnecessarily slow.",
            col="y",
        )
    estimate_with_n = np.clip(estimate_with_n, a_min=1, a_max=max_n)

    # Prep temporary sid_list
    full_sid_list = self.study_table.index.to_list()
    np.random.shuffle(full_sid_list)  # noqa: NPY002

    def print_estimated_size(size_in_bytes: int, cached_single_files: bool) -> None:
        """Print the estimated size of the pre-processed data."""
        cprint(
            string=f"\nEstimated size of all pre-processed {self.project_id.upper()} "
            f"data{' (mri-set + cached single files)' if cached_single_files else ''}: "
            f"{bytes_to_rep_string(size_in_bytes)}",
            col="b",
        )

    # Extract kwargs
    cache_files = process_mri_kwargs.pop("cache_files", self._cache_files)  # bool
    if not (cache_files or estimate_processing_time or process_mri_kwargs.get("register_to_mni", self._regis_mni)):
        # This is the fastest way to estimate the size of the processed data.
        # However, it only works if single files are not cached and no processing time is to be estimated
        from .prune_image import PruneConfig

        # Get shape
        temp_mri = get_nifti(self.mri_path_constructor(sid=full_sid_list[0]), reorient=True)
        resolution = np.round(temp_mri.header["pixdim"][1:4], decimals=3)  # image resolution per axis
        if process_mri_kwargs.get("prune_mode", self._prune_mode) is None:
            mri_shape = temp_mri.shape
        else:
            # If self._prune_mode == "max" or "cube"
            mri_shape = np.round(PruneConfig.largest_brain_max_axes // resolution).astype(int)  # "max"
            if process_mri_kwargs.get("prune_mode", self._prune_mode) == "cube":
                mri_shape = (int(mri_shape.max()),) * 3

        # Get dtype
        dtype = (
            np.uint8
            if process_mri_kwargs.get("compress", self._compress) and process_mri_kwargs.get("norm", self._norm)
            else temp_mri.get_fdata().dtype
        )
        # Compute size
        total_bytes = compute_array_size(shape=(len(full_sid_list), *mri_shape), dtype=dtype, verbose=False)
        print_estimated_size(size_in_bytes=total_bytes, cached_single_files=cache_files)

    else:
        # Prepare the path to the temporary cache directory
        cache_dir_parent = Path(process_mri_kwargs.pop("cache_dir", self.cache_dir))
        cache_dir_existed = cache_dir_parent.exists()  # init
        if not cache_dir_existed:
            cache_dir_parent.mkdir(parents=True, exist_ok=True)
        with TemporaryDirectory(dir=cache_dir_parent) as temp_cache_dir:
            try:
                start_time = perf_counter()
                # Prepare and load temp data
                load_volume_data = (
                    _load_data_as_file_paths if self.load_mode == "file_paths" else _load_data_as_full_array
                )
                _, _ = load_volume_data(  # _, _ == volume_data, sid_list
                    sid_list=full_sid_list[:estimate_with_n],
                    project_id=self.project_id,
                    mri_path_constructor=self.mri_path_constructor,
                    mri_seq=self.mri_sequence,
                    cache_dir=temp_cache_dir,
                    norm=process_mri_kwargs.pop("norm", self._norm),
                    path_brain_mask=process_mri_kwargs.pop("path_brain_mask", self._path_brain_mask),
                    compress=process_mri_kwargs.pop("compress", self._compress),
                    regis_mni=process_mri_kwargs.pop("regis_mni", self._regis_mni),
                    prune_mode=process_mri_kwargs.pop("prune_mode", self._prune_mode),
                    path_to_dataset=None,  # must be temp cache dir
                    cache_files=cache_files,
                    save_after_processing=process_mri_kwargs.pop(
                        "save_after_processing", self._save_after_processing
                    ),
                    force=True,  # force processing (ignores ask_true_false in _load_data_as_full_array())
                    **process_mri_kwargs,  # <- dtype, verbose, path_cached_mni
                )

                # Take average time (in seconds) per sample and multiply with total population size
                total_time = perf_counter() - start_time  # in seconds
                time_per_sample = total_time / estimate_with_n
                total_time = time_per_sample * len(full_sid_list)  # time for all
                total_time = timedelta(seconds=total_time)  # convert seconds into timedelta object

                # Take the average size per sample and multiply with the total population size
                total_bytes = sum(f.stat().st_size for f in Path(temp_cache_dir).glob("**/*") if f.is_file())
                total_bytes *= len(full_sid_list) / estimate_with_n

                data_suffix = ""
                if self.load_mode == "full_array" and cache_files:
                    data_suffix = " (mri-set + cached single files)"
                elif self.load_mode == "file_paths" and cache_files:
                    data_suffix = " (cached single files)"
                cprint(
                    string=f"\nEstimated size of all pre-processed {self.project_id.upper()} "
                    f"data{data_suffix}: {bytes_to_rep_string(total_bytes)}",
                    col="b",
                )
                cprint(
                    string=f"Estimated time to process all data: {chop_microseconds(total_time)} [hh:mm:ss]\n",
                    col="b",
                )
            except Exception as e:  # noqa: BLE001
                # Catch any Exception here, such that temp data will be deleted afterward.
                cprint(str(e), col="r")

            if not cache_dir_existed:
                sleep(0.5)
                shutil.rmtree(cache_dir_parent)

get_unpruned_mri 🧠

get_unpruned_mri(sid: str) -> Nifti1Image

Get the processed MRI data for a given subject ID, in the state before the image is pruned.

Parameters:

Name	Type	Description	Default
`sid`	`str`	subject ID	required

Returns:

Type	Description
`Nifti1Image`	Non-pruned MRI data as a NIfTI image

Source code in src/xai4mri/dataloader/datasets.py

def get_unpruned_mri(self, sid: str) -> nib.Nifti1Image:
    """
    Get the processed MRI data for a given subject ID, in the state before the image is pruned.

    :param sid: subject ID
    :return: Non-pruned MRI data as a NIfTI image
    """
    if isinstance(self._regis_mni, int) and not self._cache_files:
        msg = "MNI NIfTI files were not cached during processing!"
        raise ValueError(msg)

    return _get_unpruned_mri(
        sid=sid,
        project_id=self.project_id,
        mri_path_constructor=self.mri_path_constructor,
        regis_mni=self._regis_mni,
        cache_dir=self.cache_dir,
    )

load_split_dict `staticmethod` 🧠

load_split_dict(
    split_dict_path: str | Path,
) -> dict[str, ndarray[str]]

Load the split dictionary from the given file path.

Parameters:

Name	Type	Description	Default
`split_dict_path`	`str \| Path`	path to split dictionary file	required

Returns:

Type	Description
`dict[str, ndarray[str]]`	split dictionary
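
The source listing below normalises the given path with `Path.with_suffix(".npy")`. This adds the extension when none is present, but it also replaces whatever follows the last dot in the file name. The sketch shows the effect with `PurePosixPath`, so the printed paths look the same on every platform; `normalise_split_path` and the file names are hypothetical.

```python
from pathlib import PurePosixPath


def normalise_split_path(split_dict_path: str) -> PurePosixPath:
    """Mimic the path handling of load_split_dict()."""
    return PurePosixPath(split_dict_path).with_suffix(".npy")


without_extension = normalise_split_path("splits/my_split")
with_extension = normalise_split_path("splits/my_split.npy")
dotted_name = normalise_split_path("splits/split_v1.2")  # '.2' counts as the suffix

print(without_extension)  # splits/my_split.npy
print(with_extension)  # splits/my_split.npy
print(dotted_name)  # splits/split_v1.npy
```

File names that contain additional dots are therefore best passed with the `.npy` extension included.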

Source code in src/xai4mri/dataloader/datasets.py

@staticmethod
def load_split_dict(split_dict_path: str | Path) -> dict[str, np.ndarray[str]]:
    """
    Load the split dictionary from the given file path.

    :param split_dict_path: path to split dictionary file
    :return: split dictionary
    """
    # Ensure the .npy extension (note: with_suffix replaces any existing suffix)
    split_dict_path = Path(split_dict_path).with_suffix(".npy")
    split_dict = np.load(split_dict_path, allow_pickle=True).item()
    if not isinstance(split_dict, dict):
        msg = f"split_dict should be a dict, but is of type {type(split_dict)}"
        raise TypeError(msg)
    return split_dict

mri_path_constructor `abstractmethod` `staticmethod` 🧠

mri_path_constructor(sid: str) -> str | Path

Construct the path to the original MRI file of a subject given its ID (sid).

Define this function in the dataset class which inherits from BaseDataSet.

Parameters:

Name	Type	Description	Default
`sid`	`str`	subject ID	required

Returns:

Type	Description
`str \| Path`	path to the original MRI file of the subject
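
Before processing a dataset, it can be useful to confirm that the constructor points to existing files for all subjects. The check below is a sketch that a user might run; it is not the check that `BaseDataSet` performs internally, and `find_sids_without_file` as well as the demo constructor are hypothetical.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable


def find_sids_without_file(path_constructor: Callable[[str], str | Path], sids: list[str]) -> list[str]:
    """Return the subject IDs for which the constructed path is not an existing file."""
    return [sid for sid in sids if not Path(path_constructor(sid)).is_file()]


# Demo with a temporary directory that contains an (empty) file for one subject only
with tempfile.TemporaryDirectory() as tmp_dir:

    def demo_path_constructor(sid: str) -> Path:
        return Path(tmp_dir, f"{sid}.nii.gz")

    demo_path_constructor("sub-01").touch()
    missing = find_sids_without_file(demo_path_constructor, ["sub-01", "sub-02"])

print(missing)  # ['sub-02']
```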

Source code in src/xai4mri/dataloader/datasets.py

@staticmethod
@abstractmethod
def mri_path_constructor(sid: str) -> str | Path:
    """
    Construct the path to the original MRI file of a subject given its ID (sid).

    Define this function in the dataset class which inherits from BaseDataSet.

    :param sid: subject ID
    :return: path to the original MRI file of the subject
    """
    pass

save_split_dict 🧠

save_split_dict(
    split_dict: dict[str, ndarray[str]] | None = None,
    save_path: str | Path | None = None,
) -> str

Save a split dictionary to a file.

If no split dictionary is given, the self.current_split_dict is saved. self.current_split_dict is set after calling self.create_data_split().

Parameters:

Name	Type	Description	Default
`split_dict`	`dict[str, ndarray[str]] \| None`	data split dictionary: {'train': ['sub-42', ...], 'validation': [...], 'test': [...]}	`None`
`save_path`	`str \| Path \| None`	path to file	`None`

Returns:

Type	Description
`str`	the path to the file
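
If no `save_path` is given, the file name is composed of a timestamp, the project ID, and the MRI sequence, and the file is placed in a `data_splits` folder inside the cache directory. The sketch below rebuilds that default path following the source listing; `default_split_dict_path` and its `now` argument are hypothetical and exist only to make the result reproducible.

```python
from __future__ import annotations

from datetime import datetime
from pathlib import PurePosixPath


def default_split_dict_path(
    cache_dir: str, project_id: str, mri_sequence: str, now: datetime | None = None
) -> PurePosixPath:
    """Compose the default file path for a split dictionary."""
    now = datetime.now() if now is None else now
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return PurePosixPath(
        cache_dir,
        "data_splits",
        f"{timestamp}_{project_id}_{mri_sequence}_split_dict.npy",
    )


example_path = default_split_dict_path(
    cache_dir="cache",
    project_id="MyProjectID",
    mri_sequence="t1w",
    now=datetime(2023, 5, 17, 9, 30, 0),
)
print(example_path)  # cache/data_splits/2023-05-17_09-30-00_MyProjectID_t1w_split_dict.npy
```

The timestamp has a resolution of one second, so splits saved in quick succession under the default path could end up with the same file name.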

Source code in src/xai4mri/dataloader/datasets.py

def save_split_dict(
    self, split_dict: dict[str, np.ndarray[str]] | None = None, save_path: str | Path | None = None
) -> str:
    """
    Save a split dictionary to a file.

    If no split dictionary is given, the `self.current_split_dict` is saved.
    `self.current_split_dict` is set after calling `self.create_data_split()`.

    :param split_dict: data split dictionary: {'train': ['sub-42', ...], 'validation': [...], 'test': [...]}
    :param save_path: path to file
    :return: the path to the file
    """
    # Use the default path if no path is given
    if save_path is None:
        save_path = Path(
            self.cache_dir,
            "data_splits",
            f"{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}_{self.project_id}_{self.mri_sequence}_"
            f"split_dict.npy",
        )
    else:
        save_path = Path(save_path)

    # Use current data split dict if no split dict is given
    split_dict = self._split_dict if split_dict is None else split_dict

    if split_dict is None:
        raise ValueError("No split dictionary provided!")

    # Create directory if it does not exist
    save_path.parent.mkdir(exist_ok=True, parents=True)

    # Save split dictionary
    np.save(save_path, split_dict, allow_pickle=True)
    print(f"Saved split dictionary to {save_path}.")

    return str(save_path)
