Data Cleanup Tool

Import or paste raw CSV / TSV / delimited text, configure the cleanup operations, and run — mirrors the logic of the Python script running on the backend.

File

Processed in-browser only — your file is never uploaded or stored.

Text Operations
Casing
Find & Replace
Input Delimiter
Output Delimiter
Keep Columns
Rearrange Columns
Split Column
Input
Rows: 0 Cols: Chars: 0
Output
Rows: 0 Removed: 0 Dupes: 0

Documentation

Technical reference for the Python data cleanup script that powers this demo.

Overview

This tool processes CSV, TSV, pipe-delimited, or plain line-separated text through a configurable pipeline of transformations. Each operation is applied in order, and when a delimiter is detected the pipeline works at the cell level — not the raw line — so delimiter characters are never corrupted by casing, find-replace, or punctuation operations.

The browser demo mirrors the logic of a standalone Python script that accepts the same options as command-line flags. The script reads from a file or stdin, applies the pipeline, and writes to a file or stdout — making it composable with standard Unix tools.

Dependencies

All core functionality uses the Python standard library — no third-party packages required to run the script. pandas is listed as an optional dependency only for the large-file chunked-processing pattern described in the Performance Notes.

  • csv — RFC 4180-compliant row parsing and writing (handles quoted fields containing the delimiter)
  • re — compiled regex for punctuation stripping
  • io — in-memory StringIO stream passed to csv.reader / csv.writer
  • argparse — CLI argument parsing and --help generation
  • pandas (optional) — chunked reading for files that exceed available memory

Cleanup Pipeline

Steps are applied in the order listed. Cell-level steps (1–5) run before the header is removed (step 6) and before structural steps (8–10), so that trimming and casing operate on original column boundaries and the header is never subjected to data-level transforms.

1 — Detect & parse delimiter

Auto-detection checks the first non-empty line for \t, then ,, |, ; in priority order. A forced value from --in-delimiter skips detection entirely. Parsing is handled by csv.reader, which correctly handles quoted fields containing the delimiter.

import csv, io

def detect_delimiter(text: str) -> str | None:
    first = next((l for l in text.splitlines() if l.strip()), '')
    for d in ('\t', ',', '|', ';'):
        if d in first:
            return d
    return None

def parse_rows(text: str, delimiter: str) -> list[list[str]]:
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

2 — Trim whitespace

Strips leading and trailing whitespace from every cell. Applied before punctuation and casing so those operations work on clean values.

def trim_cells(rows: list[list[str]]) -> list[list[str]]:
    return [[cell.strip() for cell in row] for row in rows]

3 — Strip punctuation

Removes characters that are not word characters, whitespace, or common data characters. The preserved set (@ . - ; plus all delimiter characters) is intentional — it keeps email addresses, URLs, and numeric ranges intact.

import re

_PUNCT_RE = re.compile(r'[^\w\s,\t|.@;-]')

def strip_punctuation(rows: list[list[str]]) -> list[list[str]]:
    return [[_PUNCT_RE.sub('', cell) for cell in row] for row in rows]

4 — Apply casing

def apply_casing(rows: list[list[str]], mode: str) -> list[list[str]]:
    def transform(cell: str) -> str:
        if mode == 'upper':    return cell.upper()
        if mode == 'lower':    return cell.lower()
        if mode == 'title':    return cell.title()
        if mode == 'sentence': return cell.capitalize()
        return cell
    return [[transform(c) for c in row] for row in rows]

5 — Find & replace

Exact string match, all occurrences, applied per cell. For regex-based replacement, swap str.replace for re.sub.

def find_replace(rows: list[list[str]], find: str, replace: str) -> list[list[str]]:
    return [[cell.replace(find, replace) for cell in row] for row in rows]

6 — Separate header row

When --headers is set, the first row is removed from the data before any transformations run. It passes through trim (step 2) and the structural operations (steps 8–10) so that column names stay aligned with the data beneath them. Casing, find-replace, punctuation, deduplication, and empty-row removal never touch it. At the end of the pipeline the header is prepended back to the output.

def separate_header(
    rows: list[list[str]], do_trim: bool
) -> tuple[list[str] | None, list[list[str]]]:
    if not rows:
        return None, rows
    header = [c.strip() for c in rows[0]] if do_trim else list(rows[0])
    return header, rows[1:]

7 — Remove empty rows

A row is empty when every cell is blank after stripping. Runs after trim and cell-level ops so that rows that were only whitespace are correctly caught. The header row is never subject to this check.

def remove_empty_rows(rows: list[list[str]]) -> list[list[str]]:
    return [row for row in rows if any(c.strip() for c in row)]

8 — Keep columns

Accepts a list of 0-based column indices. Columns not in the list are dropped. Out-of-range indices silently produce an empty string rather than raising IndexError.

def keep_columns(rows: list[list[str]], indices: list[int]) -> list[list[str]]:
    return [[row[i] if i < len(row) else '' for i in indices] for row in rows]

9 — Rearrange columns

Accepts a new index order. Unlike keep_columns, the length of order defines the output width — indices not mentioned are omitted from output.

def rearrange_columns(rows: list[list[str]], order: list[int]) -> list[list[str]]:
    return [[row[i] if i < len(row) else '' for i in order] for row in rows]

10 — Split column

Splits the value at col_index on a sub-delimiter, replacing that one cell with all resulting parts in-place. Every row widens by len(parts) - 1 columns. Because the structural step is also applied to the header (step 6), the header row expands by the same amount, keeping column names aligned.

def split_column(
    rows: list[list[str]], col_index: int, delimiter: str
) -> list[list[str]]:
    result = []
    for row in rows:
        cell  = row[col_index] if col_index < len(row) else ''
        parts = cell.split(delimiter)
        result.append(row[:col_index] + parts + row[col_index + 1:])
    return result

11 — Deduplicate

Comparison is case-insensitive and whitespace-normalised so that rows that differ only in casing or surrounding spaces are treated as duplicates. The first occurrence is always kept.

def deduplicate(rows: list[list[str]]) -> list[list[str]]:
    seen: set[tuple[str, ...]] = set()
    out  = []
    for row in rows:
        key = tuple(c.strip().lower() for c in row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

12 — Write output

Serialises rows back to delimited text using csv.writer, which correctly re-quotes fields that contain the output delimiter.

def write_output(rows: list[list[str]], delimiter: str = ',') -> str:
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter).writerows(rows)
    return buf.getvalue()

Full CLI Script

The complete script wires the pipeline functions to argparse flags and handles file I/O. Every flag maps directly to a control in the browser demo.

#!/usr/bin/env python3
"""data_cleanup.py — configurable CSV / delimited text cleanup tool."""
import argparse, csv, io, re, sys

# --- helpers (all functions from the pipeline section above) ---

def run_pipeline(text: str, args: argparse.Namespace) -> str:
    delim_in  = args.in_delimiter or detect_delimiter(text)
    delim_out = args.out_delimiter or delim_in or ','

    if delim_in:
        rows = parse_rows(text, delim_in)

        # Separate header before data transformations
        header_cells = None
        if args.headers and rows:
            raw_hdr      = rows.pop(0)
            header_cells = [c.strip() for c in raw_hdr] if args.trim else raw_hdr

        if args.trim:    rows = trim_cells(rows)
        if args.punct:   rows = strip_punctuation(rows)
        if args.case:    rows = apply_casing(rows, args.case)
        if args.find:    rows = find_replace(rows, args.find, args.replace)
        if args.empty:   rows = remove_empty_rows(rows)
        if args.dedupe:  rows = deduplicate(rows)

        # Structural ops applied identically to header and data rows for alignment
        def apply_structural(cells: list[str]) -> list[str]:
            if args.keep:
                idx   = [int(i) for i in args.keep.split(',')]
                cells = [cells[i] if i < len(cells) else '' for i in idx]
            if args.reorder:
                idx   = [int(i) for i in args.reorder.split(',')]
                cells = [cells[i] if i < len(cells) else '' for i in idx]
            if args.split_col is not None and args.split_delim:
                ci    = args.split_col
                parts = (cells[ci] if ci < len(cells) else '').split(args.split_delim)
                cells = cells[:ci] + parts + cells[ci + 1:]
            return cells

        rows = [apply_structural(r) for r in rows]
        if header_cells is not None:
            rows = [apply_structural(header_cells)] + rows

        return '\n'.join(delim_out.join(row) for row in rows)

    else:
        lines     = text.splitlines()
        header_ln = None
        if args.headers and lines:
            header_ln = lines.pop(0).strip() if args.trim else lines.pop(0)
        if args.trim:   lines = [l.strip() for l in lines]
        if args.empty:  lines = [l for l in lines if l.strip()]
        if args.dedupe:
            seen: set[str] = set()
            deduped = []
            for l in lines:
                k = l.strip().lower()
                if k not in seen:
                    seen.add(k); deduped.append(l)
            lines = deduped
        if header_ln is not None:
            lines = [header_ln] + lines
        return '\n'.join(lines)


def main() -> None:
    p = argparse.ArgumentParser(description='Cleanup CSV/delimited text files')
    p.add_argument('input',           help='Input file path, or - for stdin')
    p.add_argument('-o', '--output',  help='Output file path (default: stdout)')
    p.add_argument('--in-delimiter',  dest='in_delimiter',  default=None)
    p.add_argument('--out-delimiter', dest='out_delimiter', default=None)
    p.add_argument('--headers', action='store_true',
                   help='Treat first row as a header — preserves it through all data transformations')
    p.add_argument('--trim',   action='store_true', help='Trim cell whitespace')
    p.add_argument('--dedupe', action='store_true', help='Remove duplicate rows')
    p.add_argument('--empty',  action='store_true', help='Remove empty rows')
    p.add_argument('--punct',  action='store_true', help='Strip punctuation')
    p.add_argument('--case',   choices=['upper', 'lower', 'title', 'sentence'])
    p.add_argument('--find',   default='')
    p.add_argument('--replace', default='')
    p.add_argument('--keep',   help='Comma-separated column indices to keep, e.g. 0,2,4')
    p.add_argument('--reorder', help='New column order, e.g. 2,0,1')
    p.add_argument('--split-col',   dest='split_col',   type=int, default=None)
    p.add_argument('--split-delim', dest='split_delim', default=None)
    args = p.parse_args()

    src  = sys.stdin if args.input == '-' else open(args.input, encoding='utf-8')
    text = src.read()

    result = run_pipeline(text, args)

    if args.output:
        open(args.output, 'w', encoding='utf-8').write(result)
    else:
        print(result)


if __name__ == '__main__':
    main()

Usage

All flags are optional and composable. When no --in-delimiter is given, the script auto-detects from the first non-empty line.

# Trim whitespace and remove empty rows (no header)
python data_cleanup.py input.csv --trim --empty

# Same file, but the first row is a header — preserve it unchanged
python data_cleanup.py input.csv --trim --empty --headers

# Deduplicate a TSV file and write back to TSV
python data_cleanup.py input.tsv --trim --dedupe --in-delimiter $'\t' --out-delimiter $'\t' -o output.tsv

# Convert CSV to pipe-delimited while applying Title Case
python data_cleanup.py input.csv --trim --case title --out-delimiter '|' -o output.txt

# Keep only columns 0, 2, and 3, then reorder them as 2,0,3
python data_cleanup.py input.csv --trim --keep 0,2,3 --reorder 2,0,3 -o output.csv

# Split a "full_name" column (index 1) on a space into first/last columns
python data_cleanup.py input.csv --trim --split-col 1 --split-delim ' ' -o output.csv

# Read from stdin and pipe cleaned output to another command
cat raw.csv | python data_cleanup.py - --trim --empty --dedupe | sort
Column indices are always 0-based and refer to the columns as they exist at that point in the pipeline. If you use --keep before --reorder, the indices in --reorder refer to the already-filtered set of columns, not the original file. When --headers is set, structural operations (--keep, --reorder, --split-col) are applied to the header row as well, so column names always stay aligned with the data beneath them.

Performance Notes

The script loads the entire file into memory as a list[list[str]]. Peak memory usage is roughly 2–3× the raw file size. For most CSVs (under ~200 MB) this is not a concern.

For files that exceed available memory, use pandas with chunksize to stream processing in batches:

import pandas as pd

chunk_iter = pd.read_csv('large.csv', chunksize=100_000, dtype=str)

with open('output.csv', 'w') as out:
    for i, chunk in enumerate(chunk_iter):
        chunk = (chunk
                 .map(str.strip)          # trim all cells
                 .dropna(how='all')       # remove empty rows
                 .drop_duplicates())      # deduplicate within chunk
        chunk.to_csv(out, index=False, header=(i == 0))
Cross-chunk deduplication is not handled by the chunked approach above — a row that appears in chunk 1 and chunk 5 will not be removed. For true global deduplication on large files, sort the file first (sort -u input.csv) then run the script, or load only the key columns into a set during a first pass.

One important distinction: csv.reader handles RFC 4180 quoting automatically — a field like "Smith, Jr." is treated as a single cell even though it contains the comma delimiter. Plain str.split(',') would incorrectly split it into two cells.