(Last Updated On: December 30, 2018)

How can I install Apache Tika 1.20 on Ubuntu 18.04 / Ubuntu 16.04? Apache Tika is an open source toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Tika is very useful for search engine indexing, content analysis, translation, etc.

What is new in Apache Tika 1.20

  • Upgrade to POI 4.0.1
  • Upgrade to PDFBox 2.0.13
  • Integrate/parameterize new angles handling in PDFBox
  • Prevent content within <style/> and <script/> elements from being written in the ToTextContentHandler
  • Switch child to parent communication to a shared memory-mapped file in tika-server's -spawnChild mode
  • Bulk upgrade of dependencies
  • Upgrade jaxb-runtime and javax.activation
  • Improve language id efficiency in tika-eval
  • Remove duplication of notes in PPT slides
  • Upgrade sqlite “provided” dependency to 3.25.2

In this post, we will discuss the installation of Apache Tika on Ubuntu 18.04 / Ubuntu 16.04 LTS.

Apache Tika dependencies

What you need to build and install Apache Tika on Ubuntu 18.04 / Ubuntu 16.04 LTS are:

  • Java Development Kit (JDK), since building from source needs the Java compiler
  • Apache Maven

We will install these dependencies before downloading and building Tika on Ubuntu 18.04 / Ubuntu 16.04.

Step 1: Update your Ubuntu system

Start by ensuring you’re running an updated Ubuntu Desktop / Server.

sudo apt update
sudo apt -y upgrade
sudo apt -y install wget curl vim unzip

Step 2: Install Java on Ubuntu 18.04 / Ubuntu 16.04

As of Tika 1.19, building with Java 11 is supported. You can install Java 11 on Ubuntu 18.04 / Ubuntu 16.04 LTS using our previous guide below.

How to Install Java 11 on Ubuntu 18.04 /16.04 / Debian 9

For Java 8, install it using the commands below:

sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-set-default

Confirm the installed version of Java. The --version flag shown here works on Java 9 and later; on Java 8, use java -version instead.

$ java --version
java 11.0.1 2018-10-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.1+13-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.1+13-LTS, mixed mode)
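
Because the version flag differs between Java 8 and Java 11, a small check that works on both can be handy. The sketch below is only an illustration: the helper name java_major_from is made up for this post, and it assumes the version string is the quoted value on the line that java -version prints to stderr.

```shell
# Map a Java version string to its major version number.
# "1.8.0_191" -> 8 (legacy 1.x scheme), "11.0.1" -> 11 (current scheme).
java_major_from() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

if command -v java >/dev/null 2>&1; then
    # "java -version" writes to stderr and is accepted by both Java 8 and Java 11.
    ver=$(java -version 2>&1 | grep ' version "' | head -n 1 | cut -d'"' -f2)
    echo "Detected Java major version: $(java_major_from "$ver")"
else
    echo "java not found on PATH" >&2
fi
```

A result of 8 or 11 matches the Java versions covered in this step.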

Step 3: Install Apache Maven

Install Apache Maven by following our guide:

Install Latest Apache Maven on Ubuntu 18.04 /16.04 / Debian 9
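
Before moving on, it may help to confirm that the tools used in the remaining steps are actually on your PATH. This is a minimal sketch, not part of Tika or Maven: missing_tools is a hypothetical helper defined here, and the list of tools is simply what the commands in this post call.

```shell
# Print each required tool that is missing from PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}

# Check the tools this guide relies on: java, mvn, wget and unzip.
if missing=$(missing_tools java mvn wget unzip); then
    echo "All build prerequisites found."
else
    # $missing is left unquoted so the names print on one line.
    echo "Missing tools:" $missing >&2
fi
```

If mvn is reported as missing, revisit the Maven guide above; running mvn -version afterwards shows which Maven and Java installation will be used for the build.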

Step 4: Download and Install Apache Tika

Download the Apache Tika source release from the Downloads page. The commands below fetch version 1.20.

export VER="1.20"
wget https://archive.apache.org/dist/tika/tika-${VER}-src.zip

Unzip the downloaded file.

unzip tika-${VER}-src.zip
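
If you expect to repeat these two steps (for example on several machines), they can be wrapped in a function that is safe to re-run. This is only a sketch under the same assumptions as the commands above: the function names tika_src_url and fetch_tika_src are invented for this post, and the URL pattern is the one used in this guide.

```shell
VER="${VER:-1.20}"

# Build the source archive URL used in this guide for a given version.
tika_src_url() {
    echo "https://archive.apache.org/dist/tika/tika-${1}-src.zip"
}

# Download the archive if it is not already present, then unpack it
# if the source folder does not exist yet.
fetch_tika_src() {
    ver="$1"
    zip="tika-${ver}-src.zip"
    if [ ! -f "$zip" ]; then
        # Download to a temporary name so an interrupted transfer
        # is never mistaken for a complete archive.
        wget -O "${zip}.part" "$(tika_src_url "$ver")" || return 1
        mv "${zip}.part" "$zip"
    fi
    [ -d "tika-${ver}" ] || unzip -q "$zip"
}
```

Calling fetch_tika_src "$VER" then leaves the tika-${VER} folder in the current directory, ready for the build.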

Change to the new folder and run mvn install:

cd tika-${VER}
mvn install

Wait for the build to finish, then test Tika from within its base directory.
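
One way to run that test is through the runnable tika-app jar that the build places under tika-app/target. The sketch below assumes you are still inside the tika-${VER} folder and that mvn install completed; tika_app_jar and tika_run are hypothetical helper names used only in this post.

```shell
VER="${VER:-1.20}"

# Path of the runnable jar produced by the tika-app module,
# relative to the Tika source folder.
tika_app_jar() {
    echo "tika-app/target/tika-app-${1}.jar"
}

# Run tika-app with the given arguments, or report that the jar is missing.
tika_run() {
    jar=$(tika_app_jar "$VER")
    if [ ! -f "$jar" ]; then
        echo "tika-app jar not found at $jar (did the build finish?)" >&2
        return 1
    fi
    java -jar "$jar" "$@"
}
```

For example, tika_run --detect pom.xml should print the media type Tika detects for the project's own pom.xml, while tika_run --text and tika_run --metadata followed by a file path print the extracted text and metadata. tika_run --help lists the remaining options.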

Reference:

http://tika.apache.org/1.20/gettingstarted.html