This document gives some guidance and notes about how to configure the search service. Using search is a good way to find documents based on various criteria. The search service will return results for documents the searcher is eligible to access. Read the description of the search service for more details.
You can use basic search functionality without any configuration as it is preconfigured when using the default binary or container/orchestration deployment.
With basic search, only metadata is indexed. Content can only be searched after Tika has been configured as the content extraction engine.
Metadata: all data that describes the file like
Content: all data that relates to content of the file like
Depending on the configuration, space requirements can differ.
Indexing is a non-blocking operation. It is triggered by various events (see State Changes which Trigger Indexing) and does not prevent file access by users. While indexing metadata is more or less instantaneous, extracting and indexing content can take some time depending on the setup and size of the document.
There is no definitive answer as to how much space needs to be provided for storing the index or how it can be calculated. The only valid answer to that question is: it depends, and you need to monitor it. Note that monitoring is not part of this document.
Here are some notes to give some guidance:
Extracting and indexing metadata consumes little space compared to a content index, though it can be significant in an environment with a lot of files and limited filesystem space.
When extracting and indexing content, the index can, as a rule of thumb, consume between 50% and 200% of the size of the saved documents containing text. Taking a conservative value of 150%, 3 TB of documents from which text can be extracted would require up to 4.5 TB for the content index alone.
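The rule of thumb above can be turned into a quick calculation. The following is a minimal sketch, not an official sizing tool: the function name estimate_index_gb is made up for illustration, sizes are whole gigabytes (1 TB taken as 1000 GB), and the 50%, 150% and 200% factors are simply the values quoted above.

```shell
#!/bin/sh
# Rough content-index size range based on the 50-200% rule of thumb.
# Argument: size of the documents containing extractable text, in GB (integer).
# estimate_index_gb is a hypothetical helper name used only in this sketch.
estimate_index_gb() {
    docs_gb=$1
    low=$((docs_gb * 50 / 100))      # optimistic end of the range
    mid=$((docs_gb * 150 / 100))     # conservative planning value
    high=$((docs_gb * 200 / 100))    # upper end of the range
    echo "low=${low}GB conservative=${mid}GB high=${high}GB"
}

# 3 TB of text-based documents, as in the example above
estimate_index_gb 3000
# prints: low=1500GB conservative=4500GB high=6000GB
```

The result is only a starting point for planning. Actual consumption depends on the documents and still needs to be monitored.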
If the index is kept in the default location and is not monitored, it can silently fill up the filesystem and make Infinite Scale unresponsive. As the OS, Infinite Scale and its data share the same filesystem, recovery can involve considerable downtime.
The location of the search index can be customized and should be on a fast backend.
Consider separate hardware for the search service if response time is critical for your environment, as scaling is currently not possible for the search service.
Content extraction can consume considerable CPU and memory resources and naturally competes with all other services if running on the same hardware. Every document has to be extracted and indexed before it is available for searching.
The search index can be manually relocated and search reconfigured to use the new path.
If it becomes necessary to relocate the index, you need to:
Shut down the Infinite Scale instance.
This is necessary to avoid changes that miss triggering an index update.
Move the contents referenced via SEARCH_ENGINE_BLEVE_DATA_PATH to a new location.
Define the new location in
Restart the Infinite Scale instance.
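The move step of the procedure above can be sketched as a small shell function. This is an illustration under stated assumptions, not a supported tool: relocate_index is a made-up helper name, both paths in the usage comment are hypothetical placeholders, and the sketch assumes that the new location is afterwards configured through the same SEARCH_ENGINE_BLEVE_DATA_PATH variable. How Infinite Scale is stopped and started depends on the deployment and is therefore left out.

```shell
#!/bin/sh
# Copy the index contents to a new location (hypothetical helper for this sketch).
# Run this only while the Infinite Scale instance is shut down.
relocate_index() {
    old_path=$1
    new_path=$2
    if [ ! -d "$old_path" ]; then
        echo "no index directory at $old_path" >&2
        return 1
    fi
    mkdir -p "$new_path" || return 1
    # Copy all contents, including hidden files, and preserve attributes.
    cp -Rp "$old_path"/. "$new_path"/ || return 1
    # Print the setting to apply before restarting (assumption: same variable).
    echo "SEARCH_ENGINE_BLEVE_DATA_PATH=$new_path"
}

# Example with hypothetical paths:
# relocate_index /old/index/location /new/index/location
```

The sketch copies rather than moves, so the old directory stays in place as a fallback and can be removed once search works with the new path.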
It can happen that an index needs to be recreated. Currently, this can only be done on a per-space basis. Use the following command for this task:
ocis search index --space $SPACE_ID --user $USER_ID
Note that IDs are required, not names, and that the specified user ID needs access to the space to be indexed.
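Because reindexing works per space, recreating the index for several spaces means repeating the command. The following sketch wraps the command shown above in a loop. The function name reindex_spaces and the OCIS_BIN override are assumptions made for this example. The space and user IDs in the usage comment are placeholders.

```shell
#!/bin/sh
# Recreate the search index for several spaces, one command call per space.
# reindex_spaces and OCIS_BIN are hypothetical names used only in this sketch.
# The first argument is the user ID, all following arguments are space IDs.
reindex_spaces() {
    user_id=$1
    shift
    for space_id in "$@"; do
        # Same command as documented above, with the IDs filled in.
        "${OCIS_BIN:-ocis}" search index --space "$space_id" --user "$user_id" \
            || echo "indexing failed for space $space_id" >&2
    done
}

# Example with placeholder IDs:
# reindex_spaces "$USER_ID" "$SPACE_ID_1" "$SPACE_ID_2"
```

The user ID passed in needs access to every listed space. A failure for one space is reported and the loop continues with the next.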
To search for content, a content extraction engine needs to be installed and configured. Infinite Scale currently supports Apache Tika, a content analysis toolkit, for extracting content.
Though you can compile Tika manually on your system by following the Getting Started with Apache Tika guide (newer Tika versions may be available) or download a precompiled Tika server, you can also run Tika using a Tika container. Note that at the time of writing, containers are only available for the amd64 platform. The Docker Compose Examples (ocis_wopi) are based on the container as it is ready to use.
The following describes how to make Tika available for your environment.
To see if the Tika container runs on your architecture, type:
docker run -d --name=tika --restart=always apache/tika
If you do not get a startup error message and accessing the container via
you can use the container. If you plan to use a container-based setup, you can keep the image, but remove the test container with docker stop <ID> and docker rm <ID>, where ID is the container ID of Tika.
If using the container does not work in your environment, you need to use the server installation of Tika, which requires Java version 8 or later. Check with java -version and install Java if required. After downloading the Tika server .jar file, you can start the server with:
java -jar tika-server-standard-2.7.0.jar
It is then accessible via http://your-server:9998. Make sure that the Tika server is started automatically, for example by using systemd. This is not covered here, though you can use Setup the systemd Service from the Small-Scale Deployment with systemd as a reference.
As a prerequisite, Tika needs to be accessible via http://your-server:9998, using either the manual installation or Docker. Tika can run on the same server as the search service or on a separate one. The following configuration assumes that all Infinite Scale services, including the search service, and Tika run on the same hardware.
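Before configuring the search service, it can help to confirm that Tika really answers at that address. The following is a minimal sketch that assumes curl is installed and queries the /version endpoint of the Tika server. The function name check_tika and the default URL http://localhost:9998 are assumptions made for this example.

```shell
#!/bin/sh
# Return success if a Tika server answers at the given base URL.
# check_tika is a hypothetical helper name; the URL defaults to localhost.
check_tika() {
    base_url=${1:-http://localhost:9998}
    # --fail turns HTTP errors into a non-zero exit code,
    # --max-time keeps the check from hanging on an unreachable host.
    curl --fail --silent --show-error --max-time 5 "$base_url/version"
}

# Example:
# check_tika http://your-server:9998 && echo "Tika is reachable"
```

If the call fails, check that the Tika port is reachable from the host where the search service runs before continuing with the configuration.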
These configuration parameters need to be set for the use of Tika:
The parameters can be set either via environment variables or as part of a YAML configuration file. Also see the Docker Compose Examples (ocis_wopi) for an example using container orchestration which also downloads the necessary Tika image.
Though not necessary in the majority of cases, components of Tika can be configured if required by providing an XML file with the necessary data. For more information, see Configuring Tika on the Apache Tika web page.