A data split script written with ChatGPT (image classification)

jinmc 2023. 10. 5. 15:11

I asked ChatGPT to write a Python script that splits an image classification dataset into train, val, and test sets.

Here is the prompt I gave ChatGPT.

write me a script in python that splits images into train/val/test. The source will have several directories with images in it. Target will be empty at first. the Target will output three directories, which will maintain the directory structure as the source directory. We should be able to select source and target on the command line as inputs when asked, and also the ratio of it as integers (e.g. 3 1 1 ). split the images by random order, using the ratio of the inputs.

And here is the script ChatGPT wrote.

import os
import shutil
import random
import argparse

def split_dataset(source, target, ratios):
    # Ensure the source directory exists
    if not os.path.isdir(source):
        print(f"Source directory {source} does not exist or is not a directory.")
        return

    # Create train, val, test directories inside target
    for split_name in ["train", "val", "test"]:
        os.makedirs(os.path.join(target, split_name), exist_ok=True)

    # Process each sub-directory in source
    for subdir in os.listdir(source):
        subdir_path = os.path.join(source, subdir)

        if not os.path.isdir(subdir_path):
            continue

        # Create the same sub-directory structure inside train, val, test
        for split_name in ["train", "val", "test"]:
            os.makedirs(os.path.join(target, split_name, subdir), exist_ok=True)

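        # Get list of all images in the current sub-directory
        # (matched by extension, case-insensitive, so .JPG and .png are included too)
        images = [f for f in os.listdir(subdir_path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]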
        random.shuffle(images)

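        total = len(images)
        # Floor division rounds train and val down; test receives the remainder
        train_count = (total * ratios[0]) // sum(ratios)
        val_count = (total * ratios[1]) // sum(ratios)
        # test_count is informational only: the loop below sends every remaining image to test
        test_count = total - train_count - val_count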

        for i, image in enumerate(images):
            if i < train_count:
                dest = os.path.join(target, "train", subdir, image)
            elif i < train_count + val_count:
                dest = os.path.join(target, "val", subdir, image)
            else:
                dest = os.path.join(target, "test", subdir, image)

            shutil.copy2(os.path.join(subdir_path, image), dest)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Split dataset into train, val, test.")
    parser.add_argument("source", type=str, help="Path to the source directory.")
    parser.add_argument("target", type=str, help="Path to the target directory.")
    parser.add_argument("ratios", type=int, nargs=3, help="Ratios for train, val, test split. Example: 3 1 1")

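    args = parser.parse_args()
    # Reject ratios that would cause a division by zero or negative counts
    if min(args.ratios) < 0 or sum(args.ratios) == 0:
        parser.error("ratios must be non-negative integers with a positive sum")
    split_dataset(args.source, args.target, args.ratios)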
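
To see what the ratio arithmetic in the script does, here is a minimal sketch that pulls the count calculation into its own function. The name split_counts is a hypothetical helper of mine, not part of ChatGPT's script; the formulas are copied from the script above.

```python
def split_counts(total, ratios):
    """Return (train, val, test) counts for `total` images and three integer ratios.

    Hypothetical helper: mirrors the arithmetic inside split_dataset above.
    """
    ratio_sum = sum(ratios)
    train_count = (total * ratios[0]) // ratio_sum  # rounded down
    val_count = (total * ratios[1]) // ratio_sum    # rounded down
    test_count = total - train_count - val_count    # whatever is left over
    return train_count, val_count, test_count


if __name__ == "__main__":
    for total in (10, 7, 3):
        print(total, split_counts(total, [3, 1, 1]))
    # 10 (6, 2, 2)
    # 7 (4, 1, 2)
    # 3 (1, 0, 2)
```

Because train and val are rounded down, the remainder always lands in test. With very small classes (3 images at a ratio of 3 1 1, for example) val can end up empty while test gets more than its share, so it may be worth checking the per-class counts after a run.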

I also asked it to split into just train and test.

can you write me another script that only splits to train and test? rest will be same, so the ratio will only accept two integers (e.g. 3 1)

Here is the code.

import os
import shutil
import random
import argparse

def split_dataset(source, target, ratios):
    # Ensure the source directory exists
    if not os.path.isdir(source):
        print(f"Source directory {source} does not exist or is not a directory.")
        return

    # Create train and test directories inside target
    for split_name in ["train", "test"]:
        os.makedirs(os.path.join(target, split_name), exist_ok=True)

    # Process each sub-directory in source
    for subdir in os.listdir(source):
        subdir_path = os.path.join(source, subdir)

        if not os.path.isdir(subdir_path):
            continue

        # Create the same sub-directory structure inside train and test
        for split_name in ["train", "test"]:
            os.makedirs(os.path.join(target, split_name, subdir), exist_ok=True)

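        # Get list of all images in the current sub-directory
        # (matched by extension, case-insensitive, so .JPG and .png are included too)
        images = [f for f in os.listdir(subdir_path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]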
        random.shuffle(images)

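        total = len(images)
        # Floor division rounds train down; test receives the remainder
        train_count = (total * ratios[0]) // sum(ratios)
        # test_count is informational only: the loop below sends every remaining image to test
        test_count = total - train_count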

        for i, image in enumerate(images):
            if i < train_count:
                dest = os.path.join(target, "train", subdir, image)
            else:
                dest = os.path.join(target, "test", subdir, image)

            shutil.copy2(os.path.join(subdir_path, image), dest)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Split dataset into train and test.")
    parser.add_argument("source", type=str, help="Path to the source directory.")
    parser.add_argument("target", type=str, help="Path to the target directory.")
    parser.add_argument("ratios", type=int, nargs=2, help="Ratios for train and test split. Example: 3 1")

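    args = parser.parse_args()
    # Reject ratios that would cause a division by zero or negative counts
    if min(args.ratios) < 0 or sum(args.ratios) == 0:
        parser.error("ratios must be non-negative integers with a positive sum")
    split_dataset(args.source, args.target, args.ratios)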
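
The two scripts differ only in the list of split names and the number of ratios, so one function could probably cover both cases. Below is a rough sketch of that idea, not something ChatGPT produced: assign_splits and its seed parameter are names I made up for illustration. It only decides which file names go to which split; copying the files would still be done with shutil.copy2 as in the scripts above. Passing a seed makes the shuffle repeatable, which the original scripts do not do.

```python
import random


def assign_splits(items, names, ratios, seed=None):
    """Shuffle items and divide them among the named splits by integer ratio.

    Hypothetical helper: generalizes the train/val/test and train/test logic.
    """
    if len(names) != len(ratios):
        raise ValueError("names and ratios must have the same length")
    if any(r < 0 for r in ratios) or sum(ratios) == 0:
        raise ValueError("ratios must be non-negative and not all zero")

    shuffled = list(items)
    # A private Random instance leaves the global random state untouched;
    # the same seed always produces the same order.
    random.Random(seed).shuffle(shuffled)

    total = len(shuffled)
    ratio_sum = sum(ratios)
    result = {}
    start = 0
    for i, (name, ratio) in enumerate(zip(names, ratios)):
        if i == len(names) - 1:
            end = total  # the last split takes whatever is left
        else:
            end = start + (total * ratio) // ratio_sum  # rounded down
        result[name] = shuffled[start:end]
        start = end
    return result


if __name__ == "__main__":
    files = [f"img_{i}.jpg" for i in range(10)]
    three_way = assign_splits(files, ["train", "val", "test"], [3, 1, 1], seed=0)
    two_way = assign_splits(files, ["train", "test"], [3, 1], seed=0)
    print({name: len(group) for name, group in three_way.items()})
    print({name: len(group) for name, group in two_way.items()})
```

As in the original scripts, every split except the last is rounded down and the last one receives the remainder, so the counts match what the two scripts above would produce for the same ratios.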