Text processing is one of the most practical skills a Linux administrator uses daily. Log analysis, configuration file editing, data extraction, report generation, and countless automation tasks all depend on your ability to transform text streams quickly and accurately from the command line.

This guide covers every text filtering tool listed under LPIC-1 Exam 101, Topic 103.2: Process text streams using filters. Each section includes real command examples, common option flags, and practical use cases drawn from production sysadmin work. Whether you are preparing for the LPIC-1 certification or strengthening your day-to-day Linux skills, this is your definitive reference.

All examples work on current distributions including Debian 12/13, Ubuntu 24.04/24.10, RHEL 9, Fedora 41, and Arch Linux. The GNU coreutils and gawk packages provide these tools on every standard installation.

1. cat: Concatenate and Display Files

The cat command reads one or more files sequentially and writes their contents to standard output. Despite being one of the first commands most people learn, it has several options that are extremely useful for debugging and inspecting files.

Display a file with line numbers:

cat -n /etc/passwd

The -n flag numbers every line, including blank lines. If you want to number only non-blank lines, use -b instead:

cat -b /etc/hosts

Show non-printing characters:

cat -A /etc/fstab

The -A flag is equivalent to -vET. It marks the end of each line with a $ symbol, displays tab characters as ^I, and shows other non-printing characters using caret notation. This is invaluable when you suspect trailing whitespace, mixed tabs/spaces, or Windows-style line endings (^M$) in a configuration file.

Suppress repeated blank lines:

cat -s /var/log/syslog

The -s (squeeze) flag collapses multiple consecutive blank lines into a single blank line. This cleans up output from verbose log files.

Concatenate multiple files:

cat header.txt body.txt footer.txt > report.txt

This is the original purpose of cat: joining files end to end. The shell redirection > writes the combined output to a new file.

2. head and tail: View the Beginning or End of Files

The head command prints the first part of a file, while tail prints the last part. By default, both show 10 lines.

Show the first 20 lines of a file:

head -n 20 /var/log/auth.log

Show the last 50 lines:

tail -n 50 /var/log/syslog

You can also use the shorthand tail -50 for the same result.

Follow a log file in real time:

tail -f /var/log/nginx/access.log

The -f (follow) flag keeps the output open and prints new lines as they are appended. This is the standard way to monitor active log files. Press Ctrl+C to stop following.

For files that get rotated (renamed and recreated), use tail -F (capital F). It will detect the rotation and reopen the new file automatically:

tail -F /var/log/messages

Print everything except the first 5 lines:

tail -n +6 data.csv

The +6 syntax means “start from line 6,” which effectively skips the first 5 lines. This is useful for stripping CSV headers or file preambles.
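You can see the offset behavior on a generated stream, where each line's content is its own line number:

```shell
# seq prints 1..8, one per line; tail -n +6 starts output at line 6
seq 8 | tail -n +6
# 6
# 7
# 8
```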

Extract a range of lines (lines 10 through 20):

head -n 20 /etc/services | tail -n 11

Combining head and tail through a pipe is a clean way to extract any line range from a file.
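The arithmetic is easy to verify with generated input: lines 10 through 20 of a 1..30 stream are exactly 11 lines.

```shell
# head keeps lines 1-20, then tail keeps the last 11 of those: lines 10-20
seq 30 | head -n 20 | tail -n 11
# prints 10 through 20, one number per line
```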

3. sort: Order Lines of Text

The sort command arranges lines of input in a specified order. It supports alphabetical, numeric, and field-based sorting, and it handles very large files efficiently using temporary disk storage when needed.

Basic alphabetical sort:

sort /etc/passwd

Numeric sort:

du -sh /var/log/* | sort -h

The -h flag sorts human-readable sizes (K, M, G) correctly. For plain numeric sorting, use -n:

sort -n scores.txt

Sort by a specific field (column):

sort -t: -k3 -n /etc/passwd

This sorts /etc/passwd numerically by the third field (UID), using the colon as the field delimiter (-t:). The -k3 flag specifies the sort key.

Reverse sort:

sort -t: -k3 -n -r /etc/passwd

Add -r to reverse the order. Combined with numeric sort, this puts the highest UIDs first.

Sort and remove duplicates:

sort -u /var/log/auth.log

The -u flag eliminates duplicate lines from the sorted output, combining the behavior of sort and uniq in a single pass.

Sort with a secondary key:

sort -t, -k2,2 -k3,3n employees.csv

This sorts first by the second field alphabetically, then by the third field numerically as a tiebreaker. The -k2,2 notation means “start and end at field 2” to prevent the sort key from extending to the end of the line.
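Here is the effect on a small inline dataset (hypothetical name,team,score rows). Note that without the trailing n on the third key, 10 would sort before 9 lexically:

```shell
# Sort by team (field 2) alphabetically, then by score (field 3) numerically
printf '%s\n' 'alice,blue,10' 'bob,red,2' 'carol,blue,9' \
  | sort -t, -k2,2 -k3,3n
# carol,blue,9
# alice,blue,10
# bob,red,2
```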

4. uniq: Report or Filter Repeated Lines

The uniq command filters out adjacent duplicate lines. The critical point to remember: uniq only compares consecutive lines, so you must sort the input first for it to work correctly on unsorted data.

Remove duplicate lines (sort first):

sort access_ips.txt | uniq

Count occurrences of each line:

sort access_ips.txt | uniq -c | sort -rn | head -20

This is a classic sysadmin pattern. It counts how many times each IP address appears, sorts the counts in descending numeric order, and shows the top 20. The -c flag prefixes each line with the number of consecutive occurrences.
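The same pattern works on any stream. Here it is against a small inline list of made-up IPs:

```shell
# 10.0.0.1 appears three times, 10.0.0.2 twice, 10.0.0.3 once
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.1 10.0.0.2 10.0.0.3 \
  | sort | uniq -c | sort -rn
# the busiest address, 10.0.0.1, lands on top with a count of 3
```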

Show only duplicate lines:

sort usernames.txt | uniq -d

The -d flag prints only lines that appear more than once. Use this to find duplicate entries in lists.

Show only unique lines (lines that appear exactly once):

sort usernames.txt | uniq -u

The -u flag is the opposite of -d. It prints only lines that are not repeated.

Ignore case when comparing:

sort -f names.txt | uniq -i

The -i flag tells uniq to ignore differences in case. Note that you should also pass -f (fold case) to sort so the grouping is consistent.

5. wc: Count Lines, Words, and Characters

The wc (word count) command counts lines, words, and bytes in files. It is small but used constantly in scripts and pipelines.

Full count (lines, words, bytes):

wc /etc/passwd

The output shows three numbers: line count, word count, and byte count, followed by the filename.

Count only lines:

wc -l /etc/passwd

This tells you how many user accounts exist on the system. The -l flag is the most commonly used wc option.

Count words:

wc -w report.txt

Count characters (multibyte-aware) vs bytes:

wc -m document.txt
wc -c document.txt

The -m flag counts characters (respecting your locale and multibyte encodings like UTF-8), while -c counts raw bytes. For ASCII text, these numbers are identical. For files containing multibyte characters, -c will be higher than -m.
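A single accented character makes the difference visible. é is one character but two bytes in UTF-8 (this sketch assumes a UTF-8 locale for wc -m):

```shell
printf 'né' | wc -c   # 3 bytes: n is 1 byte, é is 2
printf 'né' | wc -m   # 2 characters in a UTF-8 locale
```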

Count lines from a pipeline:

grep "Failed password" /var/log/auth.log | wc -l

Combining grep with wc -l gives you a quick count of matching lines. This answers questions like “how many failed SSH login attempts happened today?”

6. cut: Extract Columns and Fields

The cut command removes sections from each line of input. It works in three modes: by delimiter-separated fields, by character positions, or by byte positions.

Cut by delimiter and field:

cut -d: -f1,3 /etc/passwd

This extracts the username (field 1) and UID (field 3) from each line of /etc/passwd, using the colon as the delimiter. The -d flag sets the delimiter and -f selects the fields.

Extract a range of fields:

cut -d: -f1-4 /etc/passwd

The range 1-4 selects fields 1 through 4 inclusive.

Cut by character position:

cut -c1-8 /etc/passwd

This extracts characters 1 through 8 from each line. Character-based cutting is useful for fixed-width data formats.

Extract fields from CSV data:

cut -d, -f2,5 sales_data.csv

Use cut with a pipe to process command output:

df -h | tail -n +2 | tr -s ' ' | cut -d' ' -f5

This pipeline extracts the “Use%” column from the df output. The tr -s ' ' step squeezes multiple spaces into one, creating a clean single-space delimiter for cut.

7. paste: Merge Files Side by Side

The paste command joins files horizontally, placing corresponding lines from each file side by side, separated by a tab character (by default).

Merge two files line by line:

paste names.txt scores.txt

If names.txt contains names and scores.txt contains corresponding scores, paste combines them into two-column output.

Use a custom delimiter:

paste -d, names.txt emails.txt phones.txt

The -d, flag uses a comma as the separator, which is handy for building CSV files from separate data sources.

Convert a single column to multiple columns:

paste -d' ' - - - < wordlist.txt

Each - represents one column drawn from standard input. Three dashes produce three-column output from a single-column file. This is a practical way to reformat a list into a table.

Serial mode (join all lines into one):

paste -s -d, hostnames.txt

The -s flag pastes lines in serial rather than parallel, converting a vertical list into a single comma-separated line. Useful when you need to build a comma-delimited string from a list of hostnames or IP addresses.
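For example, turning a short (made-up) hostname list into a single comma-separated line:

```shell
# The trailing - tells paste to read from standard input
printf '%s\n' web1 web2 db1 | paste -sd, -
# web1,web2,db1
```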

8. tr: Translate, Delete, and Squeeze Characters

The tr command translates, deletes, or squeezes characters. Unlike most text tools, tr does not accept filenames as arguments. It reads only from standard input, so you always use it in a pipeline or with redirection.

Convert lowercase to uppercase:

cat hostname.txt | tr 'a-z' 'A-Z'

You can also use character classes for portability across locales:

tr '[:lower:]' '[:upper:]' < hostname.txt

Delete specific characters:

echo "Phone: (555) 123-4567" | tr -d '()- '

The -d flag removes all occurrences of the listed characters. This strips the parentheses, hyphens, and spaces from the phone number, producing Phone:5551234567.

Squeeze repeated characters:

echo "too    many    spaces" | tr -s ' '

The -s flag replaces repeated occurrences of a character with a single instance. This is commonly used to normalize whitespace in command output before piping to cut or awk.

Convert Windows line endings to Unix:

tr -d '\r' < windows_file.txt > unix_file.txt

This removes carriage return characters (\r), converting \r\n line endings to Unix-style \n.
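Counting bytes before and after confirms the carriage returns are gone:

```shell
printf 'one\r\ntwo\r\n' | wc -c                # 10 bytes with CRLF endings
printf 'one\r\ntwo\r\n' | tr -d '\r' | wc -c   # 8 bytes: both \r characters removed
```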

Replace characters (translate):

echo "2026-03-19" | tr '-' '/'

Output: 2026/03/19. Each character in the first set is replaced by the corresponding character in the second set.

9. sed: Stream Editor for Text Transformation

The sed (stream editor) command performs text transformations on input streams. It processes text line by line, applying one or more editing commands to each line. For LPIC-1, you need solid familiarity with substitution, deletion, and in-place editing.

Basic substitution (replace first occurrence per line):

sed 's/old/new/' filename.txt

Global substitution (replace all occurrences per line):

sed 's/http/https/g' urls.txt

The g flag at the end of the substitution command tells sed to replace every match on the line, not just the first one.

Case-insensitive substitution:

sed 's/error/WARNING/gI' application.log

The I flag (GNU sed extension) makes the pattern match case-insensitive, so it matches "error", "Error", "ERROR", and any other combination.

Delete lines matching a pattern:

sed '/^#/d' /etc/ssh/sshd_config

The d command deletes lines matching the pattern. Here, /^#/ matches lines starting with a hash, effectively stripping comments from the output.

Delete blank lines:

sed '/^$/d' config.conf

Combine comment and blank line removal:

sed '/^#/d; /^$/d' /etc/ssh/sshd_config

Multiple sed commands are separated by semicolons. This shows only the active configuration directives, which is much easier to review than the full file with all its comments.

In-place editing:

sed -i 's/PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config

The -i flag modifies the file directly instead of writing to stdout. This is powerful but dangerous because there is no undo. Create a backup by passing a suffix to the -i flag:

sed -i.bak 's/PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config

This saves the original file as sshd_config.bak before applying the changes.

Print only matching lines (like grep):

sed -n '/Port/p' /etc/ssh/sshd_config

The -n flag suppresses default output, and p prints only lines matching the pattern.

Substitute using regex with capture groups:

echo "2026-03-19" | sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\3\/\2\/\1/'

Output: 19/03/2026. This uses basic regex with escaped parentheses for capture groups and backreferences \1, \2, \3 to rearrange the date format.
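GNU sed also supports extended regex with -E, which drops the backslash escaping; using | as the substitution delimiter avoids escaping the slashes in the replacement:

```shell
# Same date rearrangement in extended regex, with | as the s command delimiter
echo "2026-03-19" | sed -E 's|([0-9]{4})-([0-9]{2})-([0-9]{2})|\3/\2/\1|'
# 19/03/2026
```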

Edit a specific line number:

sed '5s/foo/bar/' filename.txt

Prefixing the command with a line number applies it to only that line. You can also use ranges: sed '10,20s/foo/bar/' applies to lines 10 through 20.

10. awk: Pattern Scanning and Field Processing

The awk command is a full programming language for text processing. On most Linux systems, the awk command points to GNU awk (gawk). For the LPIC exam, you need to understand field processing, pattern matching, and basic awk programs.

Awk automatically splits each input line into fields. By default, whitespace is the field separator. Fields are referenced as $1, $2, $3, and so on. $0 represents the entire line.

Print specific fields:

awk '{print $1, $3}' /etc/passwd

This prints the first and third whitespace-delimited fields. However, since /etc/passwd uses colons, you need to set the field separator:

awk -F: '{print $1, $3}' /etc/passwd

The -F: flag sets the input field separator to a colon. Now $1 is the username and $3 is the UID.

Pattern matching (print lines where UID is greater than 999):

awk -F: '$3 > 999 {print $1, $3}' /etc/passwd

The pattern $3 > 999 acts as a condition. Only lines where the third field is numerically greater than 999 are processed by the action block. This lists all regular (non-system) user accounts.

Use BEGIN and END blocks:

awk -F: 'BEGIN {print "Username\tUID\tShell"}
         $3 >= 1000 {print $1"\t"$3"\t"$7}
         END {print "---\nTotal users:", NR}' /etc/passwd

The BEGIN block executes before any input is read. The END block executes after all input has been processed. The built-in variable NR holds the total number of records (lines) read.

Formatted output with printf:

awk -F: '{printf "%-20s %5d %s\n", $1, $3, $7}' /etc/passwd

The printf function gives you full control over output formatting. %-20s prints a left-aligned string in a 20-character wide field, %5d prints an integer right-aligned in 5 characters.

Sum a numeric column:

df | awk 'NR>1 {sum += $3} END {print "Total used:", sum, "KB"}'

This skips the header line (NR>1), accumulates the values from the third column, and prints the total at the end. Plain df (which reports 1 KB blocks) is used rather than df -h: human-readable values like 1.2G and 500M carry different units, so adding them numerically would be meaningless.

String matching with regex:

awk -F: '$7 ~ /bash$/ {print $1}' /etc/passwd

The ~ operator tests whether a field matches a regular expression. This prints usernames whose login shell ends with "bash".

Multiple field separators:

awk -F'[:/]' '{print $1, $NF}' /etc/passwd

Placing multiple characters inside brackets creates a character class that splits on any of those characters. $NF is a built-in variable representing the last field on each line.
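Applied to a typical passwd line, this prints the username and the shell's basename:

```shell
# Splitting on : and / makes the final field the last path component of the shell
printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F'[:/]' '{print $1, $NF}'
# root bash
```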

11. grep and egrep: Search Text with Patterns

The grep command searches files for lines matching a pattern and prints those lines. It is probably the single most-used text processing tool in daily Linux administration.

Basic pattern search:

grep "root" /etc/passwd

Case-insensitive search:

grep -i "error" /var/log/syslog

Recursive search through a directory:

grep -r "AllowOverride" /etc/apache2/

The -r flag searches all files under the specified directory recursively. Add -l to show only filenames (not the matching lines):

grep -rl "Listen 80" /etc/apache2/

Invert match (show lines that do NOT match):

grep -v "^#" /etc/fstab

The -v flag inverts the match. Combined with the ^# regex, this filters out comment lines.

Count matches:

grep -c "Failed password" /var/log/auth.log

The -c flag outputs only the count of matching lines, not the lines themselves.

Show context around matches:

grep -B 2 -A 5 "kernel panic" /var/log/kern.log

-B 2 prints 2 lines before each match, and -A 5 prints 5 lines after. Use -C 3 for 3 lines of context both before and after.

Extended regular expressions with egrep (or grep -E):

grep -E "^(root|admin|www-data):" /etc/passwd

The -E flag enables extended regex, which supports +, ?, | (alternation), and () (grouping) without backslash escaping. The egrep command is equivalent to grep -E and is still widely used, though it is technically deprecated in favor of the -E flag.

Match whole words only:

grep -w "port" /etc/ssh/sshd_config

The -w flag matches the pattern only when it appears as a complete word, preventing partial matches like "transport" or "support".

Show line numbers:

grep -n "PermitRootLogin" /etc/ssh/sshd_config

The -n flag prefixes each match with its line number, which is helpful when you plan to edit the file at a specific location.

Match multiple patterns:

grep -e "error" -e "warning" -e "critical" /var/log/syslog

The -e flag allows multiple patterns. A line matching any one of the patterns will be printed.
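A quick inline demonstration with three sample log lines:

```shell
printf '%s\n' 'disk error on sda' 'all services ok' 'temp warning on cpu0' \
  | grep -e error -e warning
# disk error on sda
# temp warning on cpu0
```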

12. tee: Write to Both File and Standard Output

The tee command reads from standard input and writes to both standard output and one or more files simultaneously. It is the branching point in a pipeline when you need to save intermediate output while continuing to process it.

Save and display output at the same time:

df -h | tee disk_usage.txt

The output of df -h appears on screen and is also written to disk_usage.txt.

Append instead of overwrite:

echo "Backup completed at $(date)" | tee -a /var/log/backup.log

The -a flag appends to the file instead of overwriting it. Without -a, tee truncates the file before writing.

Write to multiple files:

cat /etc/passwd | tee copy1.txt copy2.txt > /dev/null

You can list multiple filenames after tee. Redirecting to /dev/null suppresses the screen output if you only want the files.

Use tee with sudo for writing to protected files:

echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

This is a well-known pattern. You cannot use sudo echo "text" > /protected/file because the shell (running as your user, not root) handles the redirection. Piping through sudo tee solves this because tee runs with root privileges and performs the file write.

13. fmt, pr, nl, expand, and unexpand

These utilities are less commonly used in daily work but appear on the LPIC-1 exam and serve specific formatting purposes.

fmt: Reformat text to a specified width

fmt -w 72 longlines.txt

The fmt command reformats paragraphs to fit within a specified line width (72 characters here). It joins short lines and breaks long lines while trying to keep paragraphs looking reasonable. It respects paragraph boundaries (blank lines).

pr: Paginate text for printing

pr -h "Server Inventory" -l 60 servers.txt

The pr command formats files for printing by adding headers, footers, and page breaks. The -h flag sets a custom header, and -l 60 sets the page length to 60 lines. You can also create multi-column output:

pr -2 -t hostnames.txt

The -2 flag creates two-column output, and -t suppresses headers and trailers.

nl: Number lines

nl /etc/hosts

The nl command numbers lines, similar to cat -n, but with more control over the numbering format. By default, nl prints blank lines without numbering them:

nl -ba /etc/hosts

The -ba flag numbers all lines including blank ones. You can also control the number format:

nl -nrz -w4 script.sh

This produces right-justified, zero-padded line numbers with a width of 4 digits (0001, 0002, etc.).
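The format options are easy to check on inline input; the number and the text are separated by a tab:

```shell
printf '%s\n' alpha beta | nl -nrz -w4
# the lines come out numbered 0001 and 0002
```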

expand: Convert tabs to spaces

expand -t 4 Makefile > Makefile_spaces.txt

The expand command replaces tab characters with the appropriate number of spaces. The -t 4 flag sets the tab stop interval to 4 spaces (default is 8).

unexpand: Convert spaces back to tabs

unexpand -t 4 --first-only source.py

The unexpand command is the reverse of expand. The --first-only flag converts only leading whitespace (indentation), leaving spaces within the line untouched.
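A round trip on a single indented line shows both conversions:

```shell
printf '\tx\n' | expand -t 4        # the tab becomes four spaces
printf '    x\n' | unexpand -t 4    # four leading spaces become one tab
```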

14. Piping and Combining Commands

The real power of Linux text processing comes from combining these tools through pipes. A pipe (|) sends the stdout of one command directly to the stdin of the next. Here are several practical examples that demonstrate how these tools work together.

Find the top 10 largest directories under /var:

du -sh /var/*/ 2>/dev/null | sort -rh | head -10

This pipeline calculates directory sizes, sorts them in reverse human-readable order, and shows the top 10. The 2>/dev/null discards permission-denied errors.

Extract and count unique IP addresses from an Apache access log:

awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

Awk pulls the first field (IP address), sort groups identical IPs together, uniq -c counts each group, a second sort orders by count descending, and head shows the top 20 visitors.

List all users with bash as their shell, sorted by UID:

grep "bash$" /etc/passwd | sort -t: -k3 -n | cut -d: -f1,3,7

Grep filters for lines ending in "bash," sort orders by UID numerically, and cut extracts the relevant fields.

Monitor a log file for errors and save matches:

tail -f /var/log/syslog | grep --line-buffered "error" | tee -a errors.log

The --line-buffered flag on grep forces it to flush output after each matching line, which is necessary for real-time monitoring through a pipe. Without it, grep may buffer output and you will not see matches immediately.

Generate a comma-separated list of usernames:

cut -d: -f1 /etc/passwd | sort | paste -sd,

Cut extracts usernames, sort alphabetizes them, and paste in serial mode joins them with commas.

Find processes using the most memory:

ps aux | awk 'NR>1 {printf "%-8s %6.1f%% %s\n", $1, $4, $11}' | sort -k2 -rn | head -15

This takes the output of ps aux, formats it with awk to show user, memory percentage, and command, then sorts numerically on the second column to put the biggest consumers first. GNU sort -n reads the leading digits of a value like 12.3% and ignores the trailing % sign.

Count HTTP response codes from a log file:

awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn

Field 9 in the common Apache log format is the HTTP status code. This pipeline gives you a breakdown of all response codes and their frequency.

Replace a string across multiple configuration files:

grep -rl "old.server.com" /etc/nginx/ | while read f; do sed -i 's/old.server.com/new.server.com/g' "$f"; done

Grep finds all files containing the old hostname, then sed performs the in-place replacement in each one. The while read loop handles filenames safely. Note that the unescaped dots match any character in the regex; escape them (old\.server\.com) if you need an exact literal match.

Clean and reformat a data file:

cat raw_data.txt | tr -d '\r' | tr -s ' ' | sed '/^$/d' | sort -u > clean_data.txt

This pipeline strips Windows line endings, squeezes multiple spaces, removes blank lines, deduplicates, and saves the result. Four simple tools combined to do something that would require a script in most other environments.
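The same pipeline run on inline sample data (hypothetical messy input with CRLFs, extra spaces, a blank line, and a duplicate):

```shell
printf 'beta\r\n\r\nalpha   x\r\nbeta\r\n' \
  | tr -d '\r' | tr -s ' ' | sed '/^$/d' | sort -u
# alpha x
# beta
```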

15. LPIC-1 Exam 103.2 Objective Mapping

The LPIC-1 exam objective 103.2: Process text streams using filters carries a weight of 2 (out of a possible 5 per objective). The exam expects candidates to send text files and output streams through text utility filters to modify the output using standard UNIX commands found in the GNU textutils and coreutils packages.

Here is the full list of tools named in the official objective, mapped to the sections in this guide:

Files and commands covered by 103.2:

  • bzcat, xzcat, zcat: Display compressed file contents without decompressing. These are equivalent to bzip2 -dc, xz -dc, and gzip -dc respectively. Use them to read compressed log files: zcat /var/log/syslog.2.gz | grep "error"
  • cat: Concatenate files, number lines, show non-printing characters (Section 1)
  • cut: Extract fields and character ranges (Section 6)
  • head: Display the beginning of files (Section 2)
  • less: Page through file contents interactively. Supports forward and backward navigation, search with /pattern, and is the default pager for man pages
  • md5sum: Compute and verify MD5 checksums. While not a text filter in the traditional sense, it processes text streams: echo -n "test" | md5sum
  • nl: Number lines with configurable formatting (Section 13)
  • od: Dump files in octal and other formats. Useful for examining binary data: od -c /bin/ls | head or viewing hex: od -A x -t x1z file.bin
  • paste: Merge files horizontally (Section 7)
  • sed: Stream editing with substitution, deletion, regex (Section 9)
  • sha256sum, sha512sum: Compute and verify SHA checksums. Same usage pattern as md5sum but with stronger hash algorithms
  • sort: Sort lines alphabetically, numerically, by field (Section 3)
  • split: Split a file into pieces. By line count: split -l 1000 bigfile.log chunk_. By size: split -b 100M backup.tar.gz part_
  • tail: Display the end of files, follow mode (Section 2)
  • tr: Translate, delete, and squeeze characters (Section 8)
  • uniq: Remove or report adjacent duplicate lines (Section 4)
  • wc: Count lines, words, characters, bytes (Section 5)

The following tools are not listed in the official 103.2 objective but are covered in this guide because they are closely related and frequently tested in adjacent objectives (103.1 and 103.3):

  • awk/gawk: Covered under 103.2 in practice tests and real exams, though officially part of broader text processing knowledge (Section 10)
  • grep/egrep: Formally covered under objective 103.3 (basic file searching), but inseparable from text stream processing in practice (Section 11)
  • tee: Part of pipeline construction knowledge (Section 12)
  • fmt, pr, expand, unexpand: Text formatting utilities (Section 13)

Exam Preparation Tips for 103.2

  • Practice with real files. Use /etc/passwd, /etc/services, and log files under /var/log/ as your training data. The exam expects you to process realistic system data.
  • Know the common flags. You will not be asked obscure options, but you must know the frequently used flags for each tool: sort -n -r -k -t -u, grep -i -v -r -c -n -E, cut -d -f -c, wc -l -w -c, sed s///g -i -n, uniq -c -d -u, tr -d -s.
  • Understand piping deeply. Many questions involve multi-command pipelines. Practice chaining 3 or 4 commands together and predicting the output at each stage.
  • Remember that uniq requires sorted input. This is a classic exam question. Running uniq on unsorted data will not remove all duplicates because it only compares adjacent lines.
  • Know the difference between basic and extended regex. In basic regex (used by grep and sed by default), you must escape (, ), {, }, +, and ?. In extended regex (grep -E, egrep, sed -E, awk), these metacharacters work without escaping.
  • Remember the compressed file viewers. zcat, bzcat, and xzcat are explicitly in the objective. Know which compression format each one handles: gzip, bzip2, and xz respectively.
  • Practice checksum commands. Know that md5sum, sha256sum, and sha512sum produce checksums and can verify them with -c flag against a checksum file.
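The checksum create-and-verify cycle takes two commands (run in a scratch directory; the file names here are made up):

```shell
cd "$(mktemp -d)"                 # work in a throwaway directory
echo "payload" > data.txt         # hypothetical file to protect
sha256sum data.txt > SHA256SUMS   # record the checksum
sha256sum -c SHA256SUMS           # verify: prints "data.txt: OK"
```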

Summary

Text processing on Linux is built around the Unix philosophy: small, focused tools that do one thing well, connected through pipes to handle tasks of any complexity. The commands covered in this guide form the foundation of that approach.

For LPIC-1 exam objective 103.2, make sure you can use each of these tools independently and in combination. Set up a practice environment and work through the examples in this guide against real system files. The patterns you build here will serve you well beyond the exam, in every scripting and troubleshooting task you encounter as a Linux administrator.
