How to Install Apache Tika on Ubuntu 22.04|20.04|18.04

How can I install Apache Tika on Ubuntu 22.04|20.04|18.04? Apache Tika is an open source toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Tika is useful for search engine indexing, content analysis, translation, and similar tasks.

What is new in Apache Tika 2.2.x

  • Add support for OneNote files downloaded from O365
  • Improve extraction of embedded files from MSOffice files created by non-Microsoft tools
  • Added back ability to ignore load errors in TikaConfig
  • Fix logic bug in PipesServer that prevented concatenation of content from attachments
  • Fix default logging in tika-app in batch mode
  • Fix race condition when starting multiple forked servers on multiple ports
  • Add metadata item for whether or not a PDF has a collection/is a Portfolio PDF
  • Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types
  • Add optional fetch ranges to FetchEmitTuple to allow range fetching from, e.g., HTTP or S3

In this post, we will discuss the installation of Apache Tika on Ubuntu 22.04|20.04|18.04 LTS.

Apache Tika dependencies

To build and install Apache Tika on Ubuntu 22.04|20.04|18.04 LTS, you need:

  • Java Development Kit (JDK)
  • Apache Maven

We will install these dependencies first, then download and build Tika from source.

Step 1: Install required dependencies

Start by refreshing the package index on your Ubuntu Desktop / Server and installing the basic tools used in this guide.

sudo apt update
sudo apt -y install wget curl vim unzip

Step 2: Install Java on Ubuntu 22.04|20.04|18.04

Since Tika 1.19, building with Java 11 is supported. You can install Java on Ubuntu using the following command:

sudo apt install -y default-jdk

Confirm the installed version of Java:

$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
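
If you want to check the Java version from a script rather than by eye, the following is a minimal sketch. The helper name java_major is hypothetical (it is not part of Java or Tika), and the sketch assumes the version appears as the first quoted string in the output, as in the sample above.

```shell
# Hypothetical helper: print the major version from a Java version string.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;  # legacy scheme, e.g. 1.8.0_312 -> 8
    *)   echo "$1" | cut -d. -f1 ;;  # current scheme, e.g. 11.0.13 -> 11
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, so redirect it before parsing.
  JAVA_VER=$(java -version 2>&1 | awk -F'"' '/version/ && !seen {print $2; seen=1}')
  echo "Java major version: $(java_major "$JAVA_VER")"
else
  echo "Java not found on PATH"
fi
```

The script only reports the major version. Compare it against the requirement of the Tika release you plan to build.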

Step 3: Install Apache Maven

Install Apache Maven, which is used to build Tika from source.
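
One possible route, sketched here as an assumption rather than as the method this guide originally linked to, is the maven package from Ubuntu's standard repositories. The packaged version can be older than the latest upstream release, so check it against what your Tika version requires. The MVN_STATUS variable is only an illustration.

```shell
# Check whether Maven is already available before installing anything.
if command -v mvn >/dev/null 2>&1; then
  MVN_STATUS="present"
  mvn -version
else
  MVN_STATUS="missing"
  # Assumption: the distribution package is recent enough for your Tika release.
  echo "Maven not found. Install it with: sudo apt install -y maven"
fi
```

After installing, run the check again to confirm that mvn is on your PATH.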

Step 4: Download and Install Apache Tika

Download the Apache Tika source release from the Downloads page. This guide uses version 2.2.1; set VER to the release you want.

export VER="2.2.1"
wget https://archive.apache.org/dist/tika/${VER}/tika-${VER}-src.zip
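
It is good practice to verify the archive before building it. The sketch below is a hedged example: verify_sha512 is a hypothetical helper, and it assumes a checksum file named tika-${VER}-src.zip.sha512 is published next to the archive and stores the digest as one contiguous 128-character hex string. Confirm both assumptions against the Downloads page.

```shell
# Hypothetical helper: compare a file's SHA-512 digest with the one in a checksum file.
verify_sha512() {
  file="$1"
  sumfile="$2"
  # Take the first 128-character hex string found in the checksum file.
  expected=$(grep -Eio '[0-9a-f]{128}' "$sumfile" | head -n 1 | tr 'A-F' 'a-f')
  actual=$(sha512sum "$file" | cut -d' ' -f1)
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

VER="2.2.1"
# Assumed checksum location; fetch it first, for example:
#   wget "https://archive.apache.org/dist/tika/${VER}/tika-${VER}-src.zip.sha512"
if [ -f "tika-${VER}-src.zip" ] && [ -f "tika-${VER}-src.zip.sha512" ]; then
  if verify_sha512 "tika-${VER}-src.zip" "tika-${VER}-src.zip.sha512"; then
    echo "Checksum OK"
  else
    echo "Checksum MISMATCH"
  fi
fi
```

If the check reports a mismatch, download the archive again before unpacking it.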

Unzip the downloaded file.

unzip tika-${VER}-src.zip

Change to the new folder and run mvn install.

cd tika-${VER}
mvn install


Wait for the installation to finish, then test Tika from within its base directory.
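
A quick smoke test could look like the sketch below. It assumes the build placed the runnable jar at tika-app/target/tika-app-${VER}.jar, and it uses pom.xml from the source tree as an arbitrary sample file; any document you have will do.

```shell
VER="2.2.1"
# Assumed location of the runnable jar produced by the build.
TIKA_JAR="tika-app/target/tika-app-${VER}.jar"

if [ -f "$TIKA_JAR" ]; then
  java -jar "$TIKA_JAR" --version           # print the Tika version
  java -jar "$TIKA_JAR" --detect pom.xml    # print the detected media type
  java -jar "$TIKA_JAR" --metadata pom.xml  # print extracted metadata
else
  echo "Not found: $TIKA_JAR (run this from the tika-${VER} directory after the build)"
fi
```

Use --text in place of --metadata to extract the plain text content of a file.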

Reference:

http://tika.apache.org/2.2.1/gettingstarted.html
