Tesseract OCR installation on SageMaker notebooks


Hi,

One of the Python packages I use to convert PDF files to text depends on tesseract-ocr. I tried to install it from the terminal with the command below:

sudo yum install tesseract-ocr

However, I get the following error:

Loaded plugins: dkms-build-requires, extras_suggestions, kernel-livepatch, langpacks, priorities, update-motd, versionlock
https://download.docker.com/linux/centos/2/x86_64/stable/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
Trying other mirror.
63 packages excluded due to repository priority protections
No package tesseract-ocr available.

I have also tried to install it according to this link, but I am still facing the same issue. Any help would be highly appreciated.

Thanks

asked 8 months ago, 928 views
3 Answers
Accepted Answer

Hi,

The 404 in your output comes from a Docker repository entry on the SageMaker Notebook instance that points to a CentOS path which does not exist. The instance runs Amazon Linux, not CentOS, so that URL cannot be resolved. Separately, the last line (No package tesseract-ocr available) means that none of the enabled repositories provides a package with that name.

1. Remove the Docker repository that is causing the 404 error:

$ sudo rm -rf /etc/yum.repos.d/docker.repo

2. Clean the package manager cache:

$ sudo yum clean all

3. Update the installed packages:

$ sudo yum update -y

After completing these steps, the repository error should be gone and you can retry the Tesseract OCR installation.
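
Before deleting anything, it can help to confirm which repository file actually points at the Docker URL from the error message. The following is a minimal sketch, assuming the repository definitions are stored as *.repo files under /etc/yum.repos.d; the function name find_repo_files is made up for illustration and is not part of yum.

```shell
# List the .repo files in a directory that mention a given host.
# Usage: find_repo_files <dir> <host>
# (find_repo_files is a hypothetical helper, not a yum command.)
find_repo_files() {
    dir=$1
    host=$2
    # -l prints only the names of matching files, -F treats the host as a
    # literal string; 2>/dev/null hides the error when no *.repo file exists.
    grep -lF -- "$host" "$dir"/*.repo 2>/dev/null
}

# Example: find_repo_files /etc/yum.repos.d download.docker.com
```

If the file it prints has a different name than docker.repo, that is the one to remove or disable in step 1.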

If you still encounter issues, you can try installing Tesseract OCR using the Amazon Linux Extra packages repository. Follow these steps:

1. Enable the Amazon Linux Extra packages repository (EPEL):

$ sudo amazon-linux-extras install epel

2. Install Tesseract OCR. Note that the package is named tesseract here, not tesseract-ocr:

$ sudo yum install tesseract

This should install Tesseract OCR and its dependencies on your SageMaker Notebook instance.
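
To confirm that the binary really ended up on the PATH after the install, a small check like the one below may be useful. It is only a sketch; check_cmd is a hypothetical helper name, and it relies on nothing beyond the standard `command -v` shell builtin.

```shell
# Report whether a command is available on PATH.
# Usage: check_cmd <name>
# (check_cmd is a hypothetical helper introduced for this example.)
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
        return 1
    fi
}

# Example: check_cmd tesseract && tesseract --version
```

Because the function returns a non-zero status when the command is missing, it can also guard later steps in a notebook setup script.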

Links:

Amazon Linux 2 Extras Library (for installing Tesseract OCR): https://aws.amazon.com/premiumsupport/knowledge-center/ec2-install-extras-repository/

answered 8 months ago
  • Thank you! The solution is working.

  • You're welcome.

  • One more issue! With the above commands I am able to install Tesseract 3.04. How can I upgrade it to version 3.05 or later, or to 4.x? I tried sudo yum update tesseract, but it did not work.


To upgrade Tesseract OCR to version 3.05 or higher on Amazon Linux 2, you'll need to compile it from source, as the default repositories don't provide newer versions. Here are the detailed steps:

  1. First, remove the existing Tesseract installation:
sudo yum remove tesseract
  2. Install the necessary dependencies:
sudo yum install -y autoconf automake libtool libpng-devel libjpeg-devel libtiff-devel zlib-devel libwebp-devel gcc-c++ make
  3. Install Leptonica, which is required for Tesseract (replace x.xx.x with the release you want to build). Set PKG_CONFIG_PATH before querying pkg-config so that the copy installed under /usr/local is found:
wget http://www.leptonica.org/source/leptonica-x.xx.x.tar.gz
tar -xzvf leptonica-x.xx.x.tar.gz
cd leptonica-x.xx.x
./configure
make
sudo make install
sudo ldconfig
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --modversion lept
  4. Download and compile Tesseract from source:
wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1.tar.gz
tar xzvf 4.1.1.tar.gz
cd tesseract-4.1.1
./autogen.sh
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
./configure
make
sudo make install
sudo ldconfig
  5. Verify the installation:
tesseract --version

This should show the newly installed version of Tesseract.
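
To check the result against a minimum version in a script rather than by eye, something like the following could work. It is a sketch under two assumptions: that GNU `sort -V` is available, and that the first line of `tesseract --version` looks like "tesseract 4.1.1". Both function names are hypothetical helpers, not part of Tesseract.

```shell
# Succeed when version $1 is at least version $2 (relies on GNU sort -V).
# (version_ge is a hypothetical helper introduced for this example.)
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read `tesseract --version` output on stdin and print the version number.
# Assumes the first line has the form "tesseract <version>".
parse_tesseract_version() {
    head -n 1 | awk '{print $2}'
}

# Example (2>&1 because some releases print the version to stderr):
# ver=$(tesseract --version 2>&1 | parse_tesseract_version)
# if version_ge "$ver" 3.05; then echo "OK: $ver"; else echo "Too old: $ver"; fi
```

The comparison sorts the two version strings and checks that the required minimum comes first, which avoids comparing dotted versions as plain numbers.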

answered 8 months ago
  • Hi, I installed Leptonica 1.82.0 and got version errors when installing Tesseract 4.1.1. I also installed Leptonica 1.83.0 and got installation errors when installing Tesseract 5.0.0. I cannot paste the full error details here due to the size limit. Could you please verify that the commands work, and help me install Tesseract 5 or above?


Hi,

You need to add EPEL to the repos known to yum so that it can find the Tesseract package:

sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum -y update

Then, yum install should succeed.

To make sure that the repo above was added, you can also run yum repolist to list the repos known to yum after running the commands.
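
The repolist check can also be scripted. The sketch below assumes that each repository appears in the `yum repolist` output at the start of a line, optionally prefixed by "!" or "*" and followed by "/" or whitespace; that layout is an assumption and may differ on your instance. has_repo is a hypothetical helper name.

```shell
# Read `yum repolist` output on stdin and succeed if the given repo id
# appears at the start of a line (optionally prefixed by "!" or "*").
# (has_repo is a hypothetical helper; the line layout is an assumption.)
has_repo() {
    grep -Eq "^[!*]?$1(/|[[:space:]]|\$)"
}

# Example: yum repolist | has_repo epel && echo "EPEL is enabled"
```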

Best,

Didier

answered 8 months ago
