How to detect and remove indentation of a piped text

I'm looking for a way to remove the indentation of a piped text. Below is a solution using cut -c 9- which assumes the indentation is 8 characters wide.

I'm looking for a solution which can detect the number of spaces to remove. This implies going through the whole (piped) file to know the minimum number of spaces (tabs?) used to indent it, then remove them on each line.

run.sh

help() {
    awk '
    /esac/{b=0}
    b
    /case "\$arg" in/{b=1}' \
    "$me" \
    | cut -c 9-
}

while [[ $# -ge 1 ]]
do
    arg="$1"
    shift
    case "$arg" in
        help|h|?|--help|-h|'-?')
            # Show this help
            help;;
    esac
done

$ ./run.sh --help

help|h|?|--help|-h|'-?')
    # Show this help
    help;;

Note: echo $' 4\n 2\n 3' | python3 -c 'import sys; import textwrap as tw; print(tw.dedent(sys.stdin.read()), end="")' works, but I expect there is a better way (I mean, one which depends only on software more common than Python). Maybe awk? I wouldn't mind seeing a Perl solution either.

Note2: echo $' 4\n 2\n 3' | python -c 'import sys; import textwrap as tw; print tw.dedent(sys.stdin.read()),' also works (Python 2.7.15rc1).



Solution 1:[1]

The following is pure bash, with no external tools or command substitutions:

#!/usr/bin/env bash
all_lines=( )
min_spaces=9999 # start with something arbitrarily high
while IFS= read -r line; do
  all_lines+=( "$line" )
  if [[ ${line:0:$min_spaces} =~ ^[[:space:]]*$ ]]; then
    continue  # this line has at least as much whitespace as those preceding it
  fi
  # this line has *less* whitespace than those preceding it; we need to know how much.
  [[ $line =~ ^([[:space:]]*) ]]
  line_whitespace=${BASH_REMATCH[1]}
  min_spaces=${#line_whitespace}
done

for line in "${all_lines[@]}"; do
  printf '%s\n' "${line:$min_spaces}"
done

Its output is:

  4
2
 3

Solution 2:[2]

Suppose you have:

$ echo $'    4\n  2\n   3\n\ttab'
    4
  2
   3
    tab

You can use the Unix expand utility to expand the tabs to spaces. Then run the result through awk to find the minimum number of leading spaces on any line and strip that many from every line:

$ echo $'    4\n  2\n   3\n\ttab' | 
expand | 
awk 'BEGIN{min_indent=9999999}
     {lines[++cnt]=$0
      match($0, /^[ ]*/)
      if(RLENGTH<min_indent) min_indent=RLENGTH
     }
     END{for (i=1;i<=cnt;i++) 
               print substr(lines[i], min_indent+1)}'
  4
2
 3
      tab
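
The pipeline above can be wrapped in a small function so that the tab width becomes adjustable. This is only a sketch: the name dedent_expand and its optional argument are made up for illustration, while expand -t is the standard option for setting tab stops. Instead of a large starting value it uses -1 to mean that no line has been seen yet.

```shell
# dedent_expand [TABWIDTH]: hypothetical wrapper around the expand + awk pipeline.
# TABWIDTH defaults to 8, the default tab stop of expand.
dedent_expand() {
    expand -t "${1:-8}" |
    awk 'BEGIN { min_indent = -1 }   # -1 means no line has been seen yet
         {
             lines[++cnt] = $0
             match($0, /^ */)        # RLENGTH = number of leading spaces
             if (min_indent < 0 || RLENGTH < min_indent) min_indent = RLENGTH
         }
         END {
             for (i = 1; i <= cnt; i++)
                 print substr(lines[i], min_indent + 1)
         }'
}

# With 4-column tab stops the tab becomes 4 spaces, of which 2 are removed.
printf '    4\n  2\n   3\n\ttab\n' | dedent_expand 4
```

With a tab width of 4 the last line comes out as two spaces followed by tab, rather than the six spaces shown above for the default width of 8.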

Solution 3:[3]

Here's the (semi-) obvious temp file solution.

#!/bin/sh

t=$(mktemp -t dedent.XXXXXXXXXX) || exit
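trap 'rm -f "$t"' EXIT
# match() returns the 1-based column of the first non-space character (0 if the
# line has none), which is already the column cut should start from.
awk '{ n = match($0, /[^ ]/); if (NR == 1 || n<min) min = n }1
    END { exit (min > 0 ? min : 1) }' >"$t"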
cut -c $?- "$t"

This obviously fails if all lines have more than 255 leading whitespace characters because then the result won't fit into the exit code from Awk.

This has the advantage that we are not restricting ourselves to the available memory. Instead, we are restricting ourselves to the available disk space. The drawback is that disk might be slower, but the advantage of not reading big files into memory will IMHO trump that.
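
The 255 limit comes only from passing the result through the exit status. A variant that prints the column and reads the temp file twice avoids it. The following is a sketch under a few assumptions: the function name dedent_tmp and the variable names are made up, mktemp is called without a template, and, unlike the script above, lines without any non-space character are ignored when looking for the minimum.

```shell
# dedent_tmp: hypothetical function; reads stdin, writes dedented text to stdout.
# POSIX sh has no local variables, so tmp, col and status are global here.
dedent_tmp() {
    tmp=$(mktemp) || return
    cat >"$tmp"                      # spool stdin to disk, not to memory
    # First pass: find the column of the leftmost non-space character.
    # Lines with no such character are ignored; fall back to column 1.
    col=$(awk '{ n = match($0, /[^ ]/) }
               n > 0 && (min == 0 || n < min) { min = n }
               END { print (min > 0 ? min : 1) }' "$tmp")
    # Second pass: print each line starting from that column.
    cut -c "$col"- "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}

printf '    4\n  2\n   3\n' | dedent_tmp
```

The price is a second read of the temp file, but the indentation width is no longer bounded by what fits in an exit code.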

Solution 4:[4]

echo $'    4\n  2\n   3\n  \n   more spaces in  the    line\n  ...' | \
(text="$(cat)"; echo "$text" \
| cut -c "$(echo "$text" | sed 's/[^ ].*$//' | awk 'NR == 1 {a = length} length < a {a = length} END {print a + 1}')-"\
)

With explanations:

echo $'    4\n  2\n   3\n  \n   more spaces in  the    line\n  ...' | \
(
    text="$(cat)" # Obtain the input in a variable
    echo "$text" | cut -c "$(
        # `cut` removes the n-1 first characters of each line of the input, where n is:
            echo "$text" | \
            sed 's/[^ ].*$//' | \
            awk 'NR == 1 || length < a {a = length} END {print a + 1}'
            # sed: keep only the initial spaces, remove the rest
            # awk:
            # At the first line `NR == 1`, get the length of the line `a = length`.
            # For any shorter line `length < a`, update the length `a = length`.
            # At the end of the piped input, print the shortest length + 1.
            # ... we add 1 because in `cut`, characters of the line are indexed at 1.
        )-"
)

Update:

It is possible to avoid spawning sed. As per tripleee's comment, sed's s/// can be replaced with awk's sub(). Here is an even shorter option, using n = match() as in tripleee's answer.

echo $'    4\n  2\n   3\n  \n   more spaces in  the    line\n  ...' | \
(
    text="$(cat)" # Obtain the input in a variable
    echo "$text" | cut -c "$(
        # `cut` removes the a-1 first characters of each line of the input, where a is:
            echo "$text" | \
            awk '
                {n = match($0, /[^ ]/)}
                n > 0 && (a == 0 || n < a) {a = n}
                END {print (a > 0 ? a : 1)}'
            # awk:
            # At every line, get the position `n` of the first non-space character
            # (`match` returns 0 when the line has none; such lines are skipped).
            # For any line whose first non-space character comes earlier than `a` (`n < a`), update `a` (`a = n`).
            # At the end of the piped input, print `a` (or 1 if no line had a non-space character).
            # `a` is then the minimum number of common leading spaces, plus 1.
            # ... nothing more is added because `match`, like `cut`, indexes characters from 1.
            #
            # Note that `END` cannot be combined with other patterns (e.g. `END || a == 0`), so there is no early exit here.

        )-"
)

Apparently, this is also possible in Perl 6 (now Raku) with the function my &f = *.indent(*);.

Solution 5:[5]

Another solution with awk, based on dawg’s answer. Major differences include:

  • No need to set an arbitrary large number for indentation, which feels hacky.
  • Works on text with empty lines, by not considering them when gathering the lowest indented line.
awk '
  {
    lines[++count] = $0
    if (NF == 0) next
    match($0, /[^ ]/)
    if (length(min) == 0 || RSTART < min) min = RSTART
  }
  END {
    for (i = 1; i <= count; i++) print substr(lines[i], min)
  }
' <<< $'    4\n  2\n   3'

Or all on the same line

awk '{ lines[++count] = $0; if (NF == 0) next; match($0, /[^ ]/); if (length(min) == 0 || RSTART < min) min = RSTART; } END { for (i = 1; i <= count; i++) print substr(lines[i], min) }' <<< $'    4\n  2\n   3'

Explanation:

Add current line to an array, and increment count variable

{
  lines[++count] = $0

If line is empty, skip to next iteration

  if (NF == 0) next

Set RSTART to the start index of the first non-space character.

  match($0, /[^ ]/)

If min isn’t set or is higher than RSTART, set the former to the latter.

  if (length(min) == 0 || RSTART < min) min = RSTART
}

Run after all input is read.

END {

Loop over the array, and for each line print only a substring going from the index set in min to the end of the line.

  for (i = 1; i <= count; i++) print substr(lines[i], min)
}
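
To see the empty-line handling in action, the same program can be wrapped in a function and fed input that contains a blank line. This is a minimal sketch; the function name dedent is made up for illustration, and printf is used instead of a here-string so that it also runs in a plain POSIX shell.

```shell
# dedent: hypothetical wrapper name around the awk program above.
dedent() {
    awk '
      {
        lines[++count] = $0
        if (NF == 0) next                 # blank lines do not affect the minimum
        match($0, /[^ ]/)
        if (length(min) == 0 || RSTART < min) min = RSTART
      }
      END {
        for (i = 1; i <= count; i++) print substr(lines[i], min)
      }
    '
}

# The empty second line is printed unchanged and does not pull the minimum to zero.
printf '    4\n\n  2\n   3\n' | dedent
```

The three non-empty lines lose two leading spaces each, and the empty line stays empty.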

Solution 6:[6]

solution using bash

#!/usr/bin/env bash
cb=$(xclip -selection clipboard -o)
firstchar=${cb::1}
if [ "$firstchar" == $'\t' ];then
  tocut=$(echo "$cb" | awk -F$'\t' '{print NF-1;}' | sort -n | head -n1)
else
  tocut=$(echo "$cb" | awk -F '[^ ].*' '{print length($1)}' | sort -n | head -n1)
fi

echo "$cb" | cut -c$((tocut+1))- | xclip -selection clipboard

Note: assumes first line has the left-most indent

Works for both spaces and tabs

Copy some text (Ctrl+C), run that bash script, and the dedented text is now saved to your clipboard, ready to paste.

solution using python

detab.py

import sys
import textwrap

# Read all of stdin, then write the dedented text without adding a trailing newline.
data = sys.stdin.read()
sys.stdout.write(textwrap.dedent(data))

use with pipes

xclip -selection clipboard -o | python detab.py | xclip -selection clipboard

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Charles Duffy
Solution 2
Solution 3
Solution 4
Solution 5 user137369
Solution 6 Seth Foster