
How to Install Apache Tika on Ubuntu 22.04|20.04|18.04

How can I install Apache Tika on Ubuntu 22.04|20.04|18.04? Apache Tika is an open-source toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Tika is useful for search engine indexing, content analysis, translation, and more.


What is new in Apache Tika 2.2.x

  • Add support for OneNote files downloaded from O365
  • Improve extraction of embedded files from MSOffice files created by non-Microsoft tools
  • Add back the ability to ignore load errors in TikaConfig
  • Fix logic bug in PipesServer that prevented concatenation of content from attachments
  • Fix default logging in tika-app in batch mode
  • Fix race condition when starting multiple forked servers on multiple ports
  • Add metadata item for whether or not a PDF has a collection/is a Portfolio PDF
  • Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types
  • Add optional fetch ranges to FetchEmitTuple to allow range fetching from, e.g., HTTP or S3

In this post, we will discuss the installation of Apache Tika on Ubuntu 22.04|20.04|18.04 LTS.

Apache Tika dependencies

To build and install Apache Tika on Ubuntu 22.04|20.04|18.04 LTS, you need:

  • Java Development Kit (JDK)
  • Apache Maven

We will install these dependencies before downloading and building Tika.

Step 1: Install required dependencies

Start by refreshing the package index and installing a few basic tools on your Ubuntu Desktop or Server:

sudo apt update
sudo apt -y install wget curl vim unzip

Step 2: Install Java on Ubuntu 22.04|20.04|18.04

As of Tika 1.19, building with Java 11 is supported. You can install Java on Ubuntu with the following command:

sudo apt install -y default-jdk

Confirm the installed Java version:

$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
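
If you want to check the Java version from a script rather than by eye, the sketch below extracts the major version number from the string that java -version prints. The helper name java_major is hypothetical, and the parsing assumes the usual "11.0.13" or legacy "1.8.0_312" formats.

```shell
# Hypothetical helper: extract the major version from a Java version string.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_312 is Java 8
    *)   echo "$1" | cut -d. -f1 ;;   # modern scheme: 11.0.13 is Java 11
  esac
}

# Only query java if it is actually on PATH.
if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted field.
  ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  echo "Java major version: $(java_major "$ver")"
fi
```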

Step 3: Install Apache Maven

Install Apache Maven, for example from Ubuntu's default repositories (the maven package), then confirm it works with mvn -version.
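
Once Maven is installed, a quick check like the one below can confirm that every tool used in the remaining steps is on your PATH. This is a minimal sketch: the helper name have is hypothetical, and the tool list simply mirrors the commands used in this guide.

```shell
# Hypothetical helper: succeed if the named command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Report the status of each tool the remaining steps rely on.
for tool in java mvn wget unzip; do
  if have "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```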

Step 4: Download and Install Apache Tika

Download the Apache Tika source release from the Downloads page. This guide uses version 2.2.1 from the Apache archive.

export VER="2.2.1"
wget https://archive.apache.org/dist/tika/${VER}/tika-${VER}-src.zip
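
Before unpacking, it is worth checking the archive's integrity. The sketch below compares a file's SHA-512 digest with an expected value. The helper name verify_sha512 is hypothetical, and it assumes you copy the expected digest from the checksum published alongside the release (check the Downloads page for where that lives).

```shell
# Hypothetical helper: compare a file's SHA-512 digest with an expected hex string.
# Usage: verify_sha512 <file> <expected-hex-digest>
verify_sha512() {
  # Digest of the local file (first field of sha512sum output).
  actual=$(sha512sum "$1" | awk '{print $1}')
  # Normalise the expected value: lowercase, no whitespace.
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f' | tr -d '[:space:]')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1" >&2
    return 1
  fi
}
```

For example, run verify_sha512 tika-${VER}-src.zip followed by the published digest as the second argument, and continue only if it prints OK.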

Unzip the downloaded file.

unzip tika-${VER}-src.zip

Change to the new folder and run mvn install:

cd tika-${VER}
mvn install


Wait for the build to finish, then test Tika from within its base directory.
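
As a rough smoke test, you can run the tika-app jar produced by the build. The jar location below (tika-app/target/) is an assumption based on the usual Maven layout, so adjust it if your build output differs. The pom.xml in the base directory is used only as a convenient sample input.

```shell
# Fall back to the version used in this guide if VER is not set.
VER="${VER:-2.2.1}"

# Assumed location of the runnable application jar after `mvn install`.
TIKA_APP_JAR="tika-app/target/tika-app-${VER}.jar"

if [ -f "$TIKA_APP_JAR" ]; then
  java -jar "$TIKA_APP_JAR" --version            # print the Tika version
  java -jar "$TIKA_APP_JAR" --detect pom.xml     # detect the file's media type
  java -jar "$TIKA_APP_JAR" --metadata pom.xml   # print extracted metadata
else
  echo "tika-app jar not found at $TIKA_APP_JAR; check the build output"
fi
```

Replace pom.xml with any document of your own, and use --text instead of --metadata to print the extracted plain text.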

Reference:

http://tika.apache.org/2.2.1/gettingstarted.html
