Amazon Kendra Web Crawler — content search for internet and intranet

Diptiman Raichaudhuri
6 min read · Sep 21, 2023

Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

Earlier this month, Amazon Kendra released a feature that lets AWS customers use the Amazon Kendra Web Crawler to index and search pages, including dynamic pages, from websites they have authorization to crawl.

I couldn’t wait to get my hands on the good old “inverted index” based search of Kendra.

As explained in the documentation, “Kendra indexes your documents directly or from your third-party document repository and intelligently serves relevant information to your users. You can use Amazon Kendra to create an updatable index of documents of a variety of types.

Amazon Kendra has the following components:

An index that holds your documents and makes them searchable.

A data source that stores your documents and Amazon Kendra connects to. You can automatically synchronize a data source with an Amazon Kendra index so that your index stays updated with your source repository.

A document addition API that adds documents directly to an index.”
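As a sketch of that last component, the document addition API, the payload below is shaped the way boto3’s `kendra.batch_put_document` expects it. The index ID, document ID, and contents are placeholders, not values from my setup:

```python
# Hypothetical sketch: adding one document directly to a Kendra index via the
# BatchPutDocument API. All identifiers and text below are placeholders.
def build_put_document_request(index_id, doc_id, title, text):
    """Build the request payload for kendra.batch_put_document."""
    return {
        "IndexId": index_id,
        "Documents": [
            {
                "Id": doc_id,
                "Title": title,
                "Blob": text.encode("utf-8"),  # raw document bytes
                "ContentType": "PLAIN_TEXT",
            }
        ],
    }

request = build_put_document_request(
    "12345678-aaaa-bbbb-cccc-1234567890ab",  # placeholder index ID
    "doc-001",
    "Hello Kendra",
    "A first document added through the API.",
)
# A real call would then be:
#   boto3.client("kendra").batch_put_document(**request)
```

In this article, though, I let the Web Crawler populate the index instead of pushing documents myself.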

I chose https://diptimanrc.medium.com as my data source; this is my Medium site, where I publish tech articles. I have published 5 articles so far, and I wanted to crawl all 5 of them.

While Kendra can index and search many document types, such as PDF, plain text, etc., in this case I wanted to crawl my 5 articles (HTML) and create an index on which I could issue search queries.

So, I needed to create a Kendra index and add the URLs of my articles to crawl and populate the index.

First, I logged into my AWS console and created an index:

AWS Console — Kendra Welcome Page

After clicking “Create an Index”, I filled in the index name and created a new IAM role:

Kendra — Create Index

I kept the default settings on the Configure access control page, where token-based authentication can be enabled:

Kendra — Configure Access Control

I selected the Developer edition and created the index:

Kendra — Provisioning

It took around 2–3 minutes for the index to be created:

Kendra — Index Created
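The same index creation can be done through the API instead of the console. The sketch below builds the parameters for `kendra.create_index`; the name and role ARN are placeholders:

```python
# Hypothetical sketch: mirroring the console choices above as
# kendra.create_index parameters. Name and role ARN are placeholders.
def build_create_index_request(name, role_arn):
    """Parameters for kendra.create_index matching the console walkthrough."""
    return {
        "Name": name,
        "RoleArn": role_arn,
        "Edition": "DEVELOPER_EDITION",  # the edition selected above
    }

create_req = build_create_index_request(
    "my-blog-index",                                    # placeholder name
    "arn:aws:iam::111122223333:role/kendra-index-role", # placeholder role ARN
)
# A real call: boto3.client("kendra").create_index(**create_req)
```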

Now I was ready to add my articles as the data source. Once you click the “Add data sources” hyperlink, you see a lot of supported data sources, such as S3, RDS, Box, Confluence, GitHub, Jira, etc. I scrolled down, selected “Web Crawler V2.0”, and clicked “Add Connector”.

Web Crawler V2.0

I entered a data source name and kept the default selection of English as the language:

Web Crawler V2.0 Data Source Name

I kept Source URLs as my data source:

Source URLs

And added all 5 of my articles to crawl:

Articles to Crawl

For authentication I chose “No”, and I did not specify a web proxy, which is useful for connecting to internal (intranet) websites:

No Authentication
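Programmatically, these choices roughly correspond to Kendra’s `WebCrawlerConfiguration` structure. The sketch below is illustrative only (the console’s v2 connector wraps similar settings in a JSON template, so the exact schema differs, and the seed URLs are placeholders, not my real article URLs):

```python
# Illustrative sketch of the crawler choices above, shaped like the
# WebCrawlerConfiguration API structure. Seed URLs are placeholders.
def build_web_crawler_config(seed_urls, crawl_depth=2):
    """Seed-URL crawl of a public site: no authentication, no web proxy."""
    return {
        "Urls": {
            "SeedUrlConfiguration": {
                "SeedUrls": seed_urls,
                # HOST_ONLY restricts the crawl to the seed URLs' host
                "WebCrawlerMode": "HOST_ONLY",
            }
        },
        "CrawlDepth": crawl_depth,
        # No AuthenticationConfiguration / ProxyConfiguration entries:
        # the site is public and no proxy is needed.
    }

crawler_config = build_web_crawler_config([
    "https://diptimanrc.medium.com/article-1",  # placeholder URLs
    "https://diptimanrc.medium.com/article-2",
])
```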

For crawling, I also created a new IAM role, since the IAM role created for the index cannot be reused for data sources.

I kept the default Sync settings, which specify the crawl depth, maximum file size, maximum links per page, maximum throttling, etc.

Sync Settings
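For reference, those knobs map onto limit fields like the ones below in the `WebCrawlerConfiguration` API. The field names are per the Kendra API; the values are the console defaults as I observed them, so treat them as illustrative rather than authoritative:

```python
# The default sync settings, expressed as WebCrawlerConfiguration limit
# fields. Values are the console defaults as I observed them (illustrative).
DEFAULT_SYNC_LIMITS = {
    "CrawlDepth": 2,                           # how many links deep to follow
    "MaxContentSizePerPageInMegaBytes": 50.0,  # skip pages larger than this
    "MaxLinksPerPage": 100,                    # follow at most this many links per page
    "MaxUrlsPerMinuteCrawlRate": 300,          # throttle: URLs crawled per minute
}
```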

I chose a “Daily” Sync run schedule and clicked “Next”.

On the Set field mappings page, I added “title”, along with “category” and “sourceUrl”, as the fields/terms within my search index:

Field Mappings
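In API terms, each of these is a `DataSourceToIndexFieldMapping` entry. A quick sketch; the reserved index field names on the right-hand side are my assumption about where these crawler fields would land:

```python
# Sketch: the three field mappings above, in the shape create_data_source
# expects. The index field names are my assumption (Kendra reserved fields).
def build_field_mappings(pairs):
    """pairs: (data_source_field, index_field) tuples."""
    return [
        {"DataSourceFieldName": src, "IndexFieldName": idx}
        for src, idx in pairs
    ]

mappings = build_field_mappings([
    ("title", "_document_title"),  # assumed index field names
    ("category", "_category"),
    ("sourceUrl", "_source_uri"),
])
```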

I reviewed everything and clicked “Create”.

My data source got created very quickly, and then I ran “Sync Now”:

Sync Data Source

Syncing is an expensive process and, as per the AWS documentation, “can take from a few minutes to a few hours. Syncing is a two-step process. First documents are crawled to determine the ones to index. Then the selected documents are indexed. Sync speeds are limited by factors such as remote repository throughput and throttling, network bandwidth, and the size of documents.”

For me, it took ~12 minutes for the sync to complete:

Data Source Synced
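If you script the sync instead of clicking “Sync Now”, you start a job and then poll until it reaches a terminal state. The helper below is pure logic over a status string so it runs without AWS credentials; the surrounding boto3 calls are an assumed usage sketch:

```python
# Hypothetical polling sketch for a data source sync job. Only the pure
# status check below is executed here; the boto3 calls are assumed usage.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "INCOMPLETE"}

def sync_finished(status):
    """True once a sync job status is terminal (no more polling needed)."""
    return status in TERMINAL_STATUSES

# Assumed usage in a real script:
#   kendra = boto3.client("kendra")
#   kendra.start_data_source_sync_job(Id=ds_id, IndexId=index_id)
#   ...then poll kendra.list_data_source_sync_jobs(Id=ds_id, IndexId=index_id)
#   until sync_finished(job["Status"]) for the latest job.
```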

A word of caution about crawling, straight from the AWS documentation:

“You can only crawl public facing websites or internal company websites that use the secure communication protocol Hypertext Transfer Protocol Secure (HTTPS). If you receive an error when crawling a website, it could be that the website is blocked from crawling. To crawl internal websites, you can set up a web proxy. The web proxy must be public facing. You can also use authentication to access and crawl websites.

Amazon Kendra Web Crawler v2.0 uses the Selenium web crawler package and a Chromium driver. Amazon Kendra automatically updates the version of Selenium and the Chromium driver using Continuous Integration (CI).

When selecting websites to index, you must adhere to the Amazon Acceptable Use Policy and all other Amazon terms. Remember that you must only use Amazon Kendra Web Crawler to index your own web pages, or web pages that you have authorization to index. To learn how to stop Amazon Kendra Web Crawler from indexing your website(s), please see Configuring the robots.txt file for Amazon Kendra Web Crawler.”

This is very important: crawling or scraping a website can often raise contentious legal issues. Please only target your own sites/pages for crawling.
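As a sketch of the robots.txt option the documentation mentions, a site owner could block the crawler with a rule like the one below. The user-agent token here is my assumption; check the AWS page referenced above for the exact token to use:

```
# Hypothetical robots.txt rule blocking the Kendra crawler site-wide
# (user-agent token is an assumption; verify against the AWS docs).
User-agent: amazon-kendra
Disallow: /
```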

Now that the index is synced with my data source, let’s run a couple of search queries. Click on “Search indexed content”:

Query 1

I queried with the search term “import time”, one of the code snippets I used in my first article, and Kendra correctly picked that article up as the first document in the list of results. Clicking the result hyperlink opened the Medium article in my browser.

Also notice that, in the left nav menu, 5 documents are shown as added to the index:

Index documents added
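The same search can be issued programmatically with `kendra.query`. The parser below works on a response-shaped dict, so it runs without AWS access; the sample response values are made up for illustration:

```python
# Hypothetical sketch: pull the top hit out of a kendra.query response.
# parse_top_result is pure logic over a response-shaped dict.
def parse_top_result(response):
    """Return (title, uri) of the first result item, or None if empty."""
    items = response.get("ResultItems", [])
    if not items:
        return None
    top = items[0]
    return (top["DocumentTitle"]["Text"], top["DocumentURI"])

# Shape mirrors a kendra.query response; the values here are made up:
sample = {
    "ResultItems": [
        {
            "Type": "DOCUMENT",
            "DocumentTitle": {"Text": "My first article"},
            "DocumentURI": "https://diptimanrc.medium.com/...",
        }
    ]
}
# A real call would be:
#   boto3.client("kendra").query(IndexId=index_id, QueryText="import time")
```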

Ending with another search:

Query 2

While this was a quick example, for enterprise deployments Kendra also supports many access control and security features, such as:

“Field mappings
Web proxy
Inclusion/exclusion filters
Virtual private cloud (VPC)
Sync all documents / sync only new, modified, deleted documents
Basic, NTLM/Kerberos, SAML, and form authentication for your websites”

Please go through the documentation to learn and do more.

So long!
