Search Service Configuration
Introduction
The Infinite Scale Search service is responsible for metadata and content extraction; it stores that data as an index and makes it searchable. The following clarifies the extraction terms metadata and content:
- Metadata: all data that describes the file like Name, Size, MimeType, Tags and Mtime.
- Content: all data that relates to the content of the file like words, geo data, exif data etc.
General Considerations
- To use the search service, an event system like NATS, which is shipped and preconfigured, must be configured for all services.
- The search service consumes events and does not block other tasks.
- For content extraction, Apache Tika, a content analysis toolkit, can be used but needs to be installed separately.
- Scaling Tika, if configured, is not part of this documentation.
- Although indexing metadata is essentially instantaneous, extracting and indexing content can take some time, depending on the setup and the size of the document.
- Indexing is a non-blocking operation. It is triggered by various events.
- Consider using dedicated hardware for this service in case more resources are needed.
Both metadata and content extractions are stored as indexes via the search service. Keep in mind that indexing requires adequate storage capacity, and this requirement will grow over time. To prevent the index from filling up the file system and rendering Infinite Scale unusable, it should reside on its own file system.
In case the file system gets close to full and you need to relocate search data, you can change the path to where search maintains its index data.
The search service runs with the default "basic" configuration shipped out of the box. No additional configuration is necessary unless scaling or content extraction is being used.
Space Requirements
There is no definitive answer as to how much space is needed for storing the index, nor is there a way to calculate it. The only valid answer is that it depends, and you need to monitor it. Note that monitoring is not part of this document.
Here are some notes to provide guidance:
-
Although extracting and indexing metadata uses little space compared to a content index, it can be significant in environments with many files and limited file system space.
-
As a rule of thumb, when extracting and indexing content, the range for the consumed index can be between 50% and 200% of the saved documents containing text. Taking a conservative approach and using a value of 150%, 3 TB of documents from which text-based data can be extracted would require up to 4.5 TB of space — and that’s just for the content index.
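The rule-of-thumb calculation above can be sketched as a quick sizing check. The document volume and ratio below are the example values from the text; substitute your own figures:

```shell
# Estimate index space from the volume of text-extractable documents
# and an assumed index-to-content ratio (here: the conservative 150%).
docs_tb=3     # TB of documents with extractable text
ratio=150     # assumed index size as percent of document size

awk -v d="$docs_tb" -v r="$ratio" 'BEGIN { printf "%.1f TB\n", d * r / 100 }'
# prints: 4.5 TB
```

Rerun the estimate periodically with measured values, since the actual ratio depends heavily on the document mix.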
If the index is left unmonitored in its default location, it can silently fill up the file system and make Infinite Scale unresponsive. Since Infinite Scale and its data share the same file system with the OS, recovery can require considerable downtime.
Scaling
The search service can be scaled by running multiple instances. Some rules apply:
- With SEARCH_ENGINE_BLEVE_SCALE=false, which is the default, the search service has exclusive write access to the index. Once the first search process is started, any subsequent search processes attempting to access the index are locked out.
- With SEARCH_ENGINE_BLEVE_SCALE=true, a search service will no longer have exclusive write access to the index. This setting must be enabled for all instances of the search service.
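For a scaled deployment, the flag has to be set identically in the environment of every search instance. A minimal sketch:

```shell
# Must be set to the same value on ALL search service instances;
# mixing values risks lockouts or concurrent writes to the index.
export SEARCH_ENGINE_BLEVE_SCALE=true
```

How the variable reaches each instance depends on your deployment (systemd unit, compose file, Helm values, etc.).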
Search Engines
By default, the search service is shipped with bleve as its primary search engine.
Extraction Engines
The search service provides the following extraction engines; their results are used as the index for searching:
- The embedded basic configuration provides metadata extraction, which is always on. This includes all data that describes the file like Name, Size, MimeType, Tags and Mtime.
- The tika configuration, which additionally provides content extraction if installed and configured. This includes all data that relates to the content of the file like words, geo data, exif data etc.
Content Extraction
The search service can manage and retrieve many types of information. To this end, the following content extractors are included. Extraction is triggered by events, see State Changes which Trigger Indexing for more details.
Basic Extractor
This extractor is the simplest one and just uses the resource information provided by Infinite Scale. It needs no configuration and does not do any further analysis.
Tika Extractor
This extractor is more advanced compared to the Basic extractor. The main difference is that this extractor is able to provide file contents for the index. Though you can compile Tika manually on your system by following the Getting Started with Apache Tika guide (newer Tika versions may be available) or download a precompiled Tika server, you can also run Tika using a Tika container. See the Tika container usage document for a quickstart.
As soon as Tika is installed and accessible, the search service must be configured for use with Tika. The following settings must be set:
- SEARCH_EXTRACTOR_TYPE=tika
- SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://YOUR-TIKA.URL
- FRONTEND_FULL_TEXT_SEARCH_ENABLED=true
When using the Tika extractor, make sure to also set this environment variable in the frontend service. This tells the web client that full-text search has been enabled.
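Taken together, the settings could be exported like this. The Tika hostname below is a hypothetical placeholder; replace it with your actual Tika URL:

```shell
# Enable the Tika extractor; the URL is a placeholder for your Tika server.
export SEARCH_EXTRACTOR_TYPE=tika
export SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://tika.example.com:9998

# Must also be set in the environment of the frontend service so the
# web client offers full-text search:
export FRONTEND_FULL_TEXT_SEARCH_ENABLED=true
```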
When the search service can reach Tika, it begins to extract content on demand. Note that files must be downloaded by Tika during the extraction process, which can lead to delays with larger documents.
When extracting content, you can specify whether [stop words](https://en.wikipedia.org/wiki/Stop_word) like I, you, the are ignored or not. Normally, these stop words are removed automatically. To keep them, the environment variable SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS must be set to false.
Content extraction and handling the extracted content can be very resource intensive. Content extraction is therefore limited to files up to a certain size. The default limit is 20 MB and can be configured using the SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT variable.
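The limit is given in bytes, so the default of 20 MB corresponds to 20 * 1024 * 1024. A sketch of checking and, as a hypothetical example, raising it to 50 MB:

```shell
# Default limit: 20 MiB expressed in bytes.
echo $((20 * 1024 * 1024))
# prints: 20971520

# Example: raise the limit to 50 MiB (adjust to your resources).
export SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT=$((50 * 1024 * 1024))
```

Raising the limit increases memory and CPU pressure on both Tika and the search service, so scale it with the available hardware.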
When using the Tika container and docker-compose, consider the following:
- See the Local Production Setup deployment example, in particular the downloaded tika.yml file, for more details.
- Containers for the linked service are reachable at a hostname identical to the alias, or the service name if no alias was specified.
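As a sketch of the hostname rule above, assuming a compose service named tika (a hypothetical name; check the downloaded tika.yml for the actual one), the search service would reach Tika at that service name:

```shell
# With docker-compose, a linked service named "tika" is reachable at
# the hostname "tika"; a standalone container could be started with e.g.:
#   docker run -d --name tika -p 9998:9998 apache/tika
export SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://tika:9998
echo "$SEARCH_EXTRACTOR_TIKA_TIKA_URL"
```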
Search Functionality
The search service consists of two main parts which are file indexing and file searching.
Indexing
Every time a resource changes its state, a corresponding event is triggered. Based on the event, the search service processes the file and adds the result to its index. There are a few more steps between accepting the file and updating the index.
State Changes which Trigger Indexing
The following state changes in the life cycle of a file can trigger the creation of an index or an update:
Resource Trashed
The search service checks its index to see if the file has been processed. If an index entry exists, it will be marked as deleted. In consequence, the file won’t appear in search requests anymore. The index entry stays intact and could be restored via Resource Restored.
Resource Deleted
The search service checks its index to see if the file has been processed. If an index entry exists, it will be deleted permanently. In consequence, the file won’t appear in search requests anymore.
Resource Restored
This step is the counterpart of Resource Trashed. When a file is deleted, it isn’t removed from the index; instead, the search service just marks it as deleted. This mark is removed when the file has been restored, and it shows up in search results again.
Resource Moved
This comes into play whenever a file or folder is renamed or moved. The search index then updates the resource location path for all items affected, or starts indexing them if no index has been created so far. See Notes for an example.
Folder Created
The creation of a folder always triggers indexing. The search service extracts all necessary information and stores it in the search index.
File Created
This case is similar to Folder Created, with the difference that a file can contain far more valuable information. This gets interesting but time-consuming when the file content needs to be analyzed and indexed. Content extraction is part of the search service if configured.
File Version Restored
Since Infinite Scale is capable of storing multiple versions of the same file, the search service also needs to take care of those versions. When a file version is restored, the search service starts to extract all needed information, creates the index and makes the file discoverable.
Resource Tag Added
Whenever a resource gets a new tag, the search service takes care of it and makes that resource discoverable by the tag.
Resource Tag Removed
This is the counterpart of Resource tag added. It takes care that a tag gets unassigned from the referenced resource.
File Uploaded - Synchronous
This case only triggers indexing if async post processing is disabled. If so, the service starts to extract all needed file information, stores it in the index and makes it discoverable.
File Uploaded - Asynchronous
This is exactly the same as File uploaded - synchronous with the only difference that it is used for asynchronous uploads.
Index Management
Index Location and Scaling
- The location of the search index can be customized and should be on a fast backend.
- Consider separate hardware for the search service if response time is critical for your environment.
- Content extraction can consume considerable CPU and memory resources and naturally competes with all other services if running on the same hardware. It has to extract every document and index it before it is available for searching.
- The search index can be manually relocated and search reconfigured to use the new path.
Relocating the Index
If it becomes necessary to relocate the search index, you need to:
- Shut down the Infinite Scale instance. This is necessary to avoid changes that miss triggering an index update.
- Move the contents referenced via SEARCH_ENGINE_BLEVE_DATA_PATH to a new location.
- Define the new location in SEARCH_ENGINE_BLEVE_DATA_PATH.
- Restart the Infinite Scale instance.
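The move-and-reconfigure steps can be sketched as follows. The paths and the placeholder file below are purely illustrative stand-ins (the demo uses temporary directories); your real index lives wherever SEARCH_ENGINE_BLEVE_DATA_PATH points, /var/lib/ocis/search by default, and the stop/start commands depend on your deployment:

```shell
# Demo setup: fake old index location with a placeholder file
# standing in for the real index data.
OLD_PATH="$(mktemp -d)/search"
NEW_PATH="$(mktemp -d)/search"
mkdir -p "$OLD_PATH"
touch "$OLD_PATH/index_meta.json"   # placeholder, not real index data

# 1. Stop the instance first, e.g.: systemctl stop ocis
# 2. Move the index data to the new file system:
mv "$OLD_PATH" "$NEW_PATH"
# 3. Point the service at the new location:
export SEARCH_ENGINE_BLEVE_DATA_PATH="$NEW_PATH"
# 4. Restart the instance, e.g.: systemctl start ocis
```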
Index Maintenance
It can happen that an index needs to be recreated. Currently, this can only be done on a per-space or all-spaces basis.
Manually Trigger Re-Indexing Spaces
The service includes a command-line interface to trigger re-indexing spaces:
ocis search index --space $SPACE_ID || --all-spaces
- IDs, not names, must be provided as the parameter.
- See Listing Space IDs for how to retrieve a space ID.
- The arguments --space and --all-spaces are mutually exclusive, but one of them must be provided.
Notes
The indexing process tries to be self-healing in some situations.
In the following example, let’s assume a file tree foo/bar/baz exists. If the folder bar gets renamed to new-bar, the path to baz is no longer foo/bar/baz but foo/new-bar/baz. The search service checks the change and either just updates the path in the index or creates a new index for all items affected if none was present.
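The rename in the example above can be reproduced on any file system; the search service reacts to the resulting Resource Moved event by rewriting the indexed paths of every descendant:

```shell
# Recreate the example tree in a scratch directory.
cd "$(mktemp -d)"
mkdir -p foo/bar/baz

# Rename bar; the path of every item below it changes with it.
mv foo/bar foo/new-bar

ls foo/new-bar
# prints: baz
```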
Event Bus Configuration
The Infinite Scale event bus can be configured by a set of environment variables.
Note that for each global environment variable, a service-based one might be available additionally. For precedence, see Environment Variable Notes. Check the configuration section below.
Without claiming completeness, the following list shows environment variables used to configure the event bus:
| Envvar | Description |
| --- | --- |
| | The address of the event system. |
| | The clusterID of the event system. Mandatory when using NATS as event system. |
| | Enable TLS for the connection to the events broker. |
| | Whether to verify the server TLS certificates. |
| | The username to authenticate with the events broker. |
| | The password to authenticate with the events broker. |
Configuration
Environment Variables
The search service is configured via the following environment variables. Read the Environment Variable Types documentation for important details. Column IV shows the release with which the environment variable was introduced.
| Name | IV | Type | Default Value | Description |
| --- | --- | --- | --- | --- |
| | pre5.0 | bool | false | Activates tracing. |
| | pre5.0 | string | | The type of tracing. Defaults to '', which is the same as 'jaeger'. Allowed tracing types are 'jaeger' and '' as of now. |
| | pre5.0 | string | | The endpoint of the tracing agent. |
| | pre5.0 | string | | The HTTP endpoint for sending spans directly to a collector, i.e. http://jaeger-collector:14268/api/traces. Only used if the tracing endpoint is unset. |
| | pre5.0 | string | | The log level. Valid values are: 'panic', 'fatal', 'error', 'warn', 'info', 'debug', 'trace'. |
| | pre5.0 | bool | false | Activates pretty log output. |
| | pre5.0 | bool | false | Activates colorized log output. |
| | pre5.0 | string | | The path to the log file. Activates logging to this file if set. |
| | pre5.0 | string | 127.0.0.1:9224 | Bind address of the debug server, where metrics, health, config and debug endpoints will be exposed. |
| | pre5.0 | string | | Token to secure the metrics endpoint. |
| | pre5.0 | bool | false | Enables pprof, which can be used for profiling. |
| | pre5.0 | bool | false | Enables zpages, which can be used for collecting and viewing in-memory traces. |
| | pre5.0 | string | 127.0.0.1:9220 | The bind address of the GRPC service. |
| | pre5.0 | string | | The secret to mint and validate jwt tokens. |
| | pre5.0 | string | com.owncloud.api.gateway | The CS3 gateway endpoint. |
| | pre5.0 | string | | TLS mode for grpc connection to the go-micro based grpc services. Possible values are 'off', 'insecure' and 'on'. 'off': disables transport security for the clients. 'insecure' allows using transport security, but disables certificate verification (to be used with the autogenerated self-signed certificates). 'on' enables transport security, including server certificate verification. |
| | pre5.0 | string | | Path/File name for the root CA certificate (in PEM format) used to validate TLS server certificates of the go-micro based grpc services. |
| | pre5.0 | string | 127.0.0.1:9233 | The address of the event system. The event system is the message queuing service. It is used as message broker for the microservice architecture. |
| | pre5.0 | string | ocis-cluster | The clusterID of the event system. The event system is the message queuing service. It is used as message broker for the microservice architecture. Mandatory when using NATS as event system. |
| | pre5.0 | bool | true | Enable asynchronous file uploads. |
| | pre5.0 | int | 0 | The amount of concurrent event consumers to start. Event consumers are used for searching files. Multiple consumers increase parallelisation, but will also increase CPU and memory demands. The default value is 0. |
| | pre5.0 | int | 1000 | The duration in milliseconds the reindex debouncer waits before triggering a reindex of a space that was modified. |
| | pre5.0 | bool | false | Whether to verify the server TLS certificates. |
| | pre5.0 | string | | The root CA certificate used to validate the server’s TLS certificate. If provided, SEARCH_EVENTS_TLS_INSECURE will be seen as false. |
| | pre5.0 | bool | false | Enable TLS for the connection to the events broker. The events broker is the ocis service which receives and delivers events between the services. |
| | 5.0 | string | | The username to authenticate with the events broker. The events broker is the ocis service which receives and delivers events between the services. |
| | 5.0 | string | | The password to authenticate with the events broker. The events broker is the ocis service which receives and delivers events between the services. |
| | pre5.0 | string | bleve | Defines which search engine to use. Defaults to 'bleve'. Supported values are: 'bleve'. |
| | pre5.0 | string | /var/lib/ocis/search | The directory where the filesystem will store search data. If not defined, the root directory derives from $OCIS_BASE_DATA_PATH/search. |
| | 7.2.0 | bool | false | Enable scaling of the search index (bleve). If set to 'true', the instance of the search service will no longer have exclusive write access to the index. Note that when scaling search, all instances of the search service must be set to 'true'! With 'false', which is the default, the running search service has exclusive access to the index as long as it is running. This locks out other search processes trying to access the index. |
| | pre5.0 | string | basic | Defines the content extraction engine. Defaults to 'basic'. Supported values are: 'basic' and 'tika'. |
| | pre5.0 | bool | false | Ignore untrusted SSL certificates when connecting to the CS3 source. |
| | pre5.0 | string | http://127.0.0.1:9998 | URL of the tika server. |
| | 5.0 | bool | true | Defines if stop words should be cleaned or not. See the documentation for more details. |
| | pre5.0 | uint64 | 20971520 | Maximum file size in bytes that is allowed for content extraction. |
| | 5.0 | string | | The ID of the service account the service should use. See the 'auth-service' service description for more details. |
| | 5.0 | string | | The service account secret. |
YAML Example
- Note that the file shown below must be renamed and placed in the correct folder according to the Configuration File Naming conventions to be effective.
- See the Notes for Environment Variables if you want to use environment variables in the yaml file.
# Autogenerated
# Filename: search-config-example.yaml
tracing:
enabled: false
type: ""
endpoint: ""
collector: ""
log:
level: ""
pretty: false
color: false
file: ""
debug:
addr: 127.0.0.1:9224
token: ""
pprof: false
zpages: false
grpc:
addr: 127.0.0.1:9220
tls: null
token_manager:
jwt_secret: ""
reva:
address: com.owncloud.api.gateway
tls:
mode: ""
cacert: ""
grpc_client_tls: null
events:
endpoint: 127.0.0.1:9233
cluster: ocis-cluster
async_uploads: true
num_consumers: 0
debounce_duration: 1000
tls_insecure: false
tls_root_ca_certificate: ""
enable_tls: false
username: ""
password: ""
engine:
type: bleve
bleve:
data_path: /var/lib/ocis/search
scale: false
extractor:
type: basic
cs3_allow_insecure: false
tika:
tika_url: http://127.0.0.1:9998
clean_stop_words: true
content_extraction_size_limit: 20971520
service_account:
service_account_id: ""
service_account_secret: ""