Harvesting Configuration
This page describes how harvesting is triggered and configured in CRISalid, and how responsibilities are split between the IKG and the Harvester microservices.
Overview
Harvesting is triggered by messages that the IKG (Institutional Knowledge Graph) service publishes on the RabbitMQ publications exchange. The Harvester UI is primarily an observability tool: it can also trigger harvesting requests manually, but only for testing purposes.
Each message represents one harvesting job.
Requests can be triggered:
- Manually, via IKG CLI tools
- Automatically, via Ofelia (Docker Compose) or Kubernetes Jobs / CronJobs (which in turn call IKG CLI tools)
Configuration & Execution
Harvesting is configured at two levels:
- IKG: decides when harvesting runs and which harvesters are requested
- Harvester: defines which harvesters exist and whether they can run
IKG: Requested Harvesters
The HARVESTERS environment variable defines which harvesters are included in harvesting requests published on the publications exchange.
- Can be set via:
  - the runtime environment
  - Docker Compose (e.g. docker/ikg/ikg.yaml)
  - Kubernetes ConfigMaps
- Overrides the default when defined
Default configuration:
[
  "idref",
  "scanr",
  "hal",
  "openalex",
  "scopus"
]
Defined in the IKG settings (link to GitHub settings file)
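The override behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the IKG's actual settings code: the function name `requested_harvesters` is hypothetical, and the sketch assumes that HARVESTERS holds a JSON array of strings, mirroring the default shown on this page.

```python
import json
import os

# Default list as given on this page.
DEFAULT_HARVESTERS = ["idref", "scanr", "hal", "openalex", "scopus"]


def requested_harvesters(environ=None):
    """Return the harvesters to request, preferring HARVESTERS when defined.

    Assumption: the variable holds a JSON array of strings.
    """
    env = os.environ if environ is None else environ
    raw = env.get("HARVESTERS")
    if raw is None:
        # Variable not defined: fall back to the default list.
        return list(DEFAULT_HARVESTERS)
    value = json.loads(raw)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("HARVESTERS must be a JSON array of strings")
    return value
```

Passing a plain dictionary instead of the real environment, for example `requested_harvesters({"HARVESTERS": '["hal", "idref"]'})`, makes the override easy to try without touching the process environment.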
Harvester: Available Harvesters
The Harvester service declares available harvesters in harvesters.yaml.
This file binds harvester identifiers to their implementations.
- name: idref
  module: app.harvesters.idref.idref_harvester_factory
  class: IdrefHarvesterFactory
- name: scanr
  module: app.harvesters.scanr.scanr_harvester_factory
  class: ScanrHarvesterFactory
- name: hal
  module: app.harvesters.hal.hal_harvester_factory
  class: HalHarvesterFactory
- name: openalex
  module: app.harvesters.open_alex.open_alex_harvester_factory
  class: OpenAlexHarvesterFactory
- name: scopus
  module: app.harvesters.scopus.scopus_harvester_factory
  class: ScopusHarvesterFactory
Only harvesters declared in harvesters.yaml can be executed.
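One way to picture how a name/module/class entry can be bound to an implementation is dynamic import. The sketch below is an assumption about the general technique, not the Harvester's actual loader: `load_factories` is a hypothetical helper, it takes entries that have already been parsed from YAML into dictionaries, and the demo entries point at standard-library classes so the example runs outside the Harvester code base.

```python
import importlib


def load_factories(entries):
    """Map each harvester name to the class named by its module/class pair."""
    factories = {}
    for entry in entries:
        module = importlib.import_module(entry["module"])
        factories[entry["name"]] = getattr(module, entry["class"])
    return factories


# Stand-in entries using standard-library classes; real entries would come
# from harvesters.yaml and name the factory classes listed above.
demo_entries = [
    {"name": "ordered", "module": "collections", "class": "OrderedDict"},
    {"name": "fraction", "module": "fractions", "class": "Fraction"},
]
demo_factories = load_factories(demo_entries)
```

A name that is absent from the resulting dictionary has no implementation to run, which matches the rule that only declared harvesters can be executed.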
Execution Rules
The IKG publishes a harvesting request on the publications exchange. The message contains:
- the type of entity (currently: person only)
- a flag indicating whether a reply is expected
- a flag for "identifiers safe mode" (if true, the alignment of identifiers is not recorded in the harvesters database)
- a list of event types to be reported (e.g. created, updated, deleted, unchanged)
- a list of requested harvesters (e.g. idref, scanr, hal, openalex, scopus)
- the fields of the entity to be harvested, including a list of identifiers with their types and values:
  - name: displayed in the Harvester UI but not used for harvesting
  - identifiers: used to determine which harvesters can run (e.g. openalex requires an ORCID identifier) and to perform the actual harvesting
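The message fields listed above can be assembled programmatically. The following Python sketch is illustrative only: `build_harvesting_request` is a hypothetical helper, not part of the IKG, and it stops at serialization because the publishing code is not described on this page. The key names follow the example message given in this section.

```python
import json


def build_harvesting_request(name, identifiers, harvesters, events,
                             reply=True, identifiers_safe_mode=False):
    """Assemble one harvesting request for a person entity as a dict."""
    return {
        "type": "person",  # the only entity type currently supported
        "reply": reply,
        "identifiers_safe_mode": identifiers_safe_mode,
        "events": list(events),
        "harvesters": list(harvesters),
        "fields": {
            "name": name,
            "identifiers": [
                {"type": id_type, "value": value}
                for id_type, value in identifiers
            ],
        },
    }


request = build_harvesting_request(
    name="John Doe",
    identifiers=[("orcid", "0000-0002-1825-0097"), ("idref", "123456789")],
    harvesters=["idref", "scanr", "hal", "openalex", "scopus"],
    events=["created", "updated", "deleted", "unchanged"],
)
body = json.dumps(request)  # serialized form that would be published
```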
{
  "type": "person",
  "reply": true,
  "identifiers_safe_mode": false,
  "events": [
    "created",
    "updated",
    "deleted",
    "unchanged"
  ],
  "harvesters": [
    "idref",
    "scanr",
    "hal",
    "openalex",
    "scopus"
  ],
  "fields": {
    "name": "John Doe",
    "identifiers": [
      {
        "type": "orcid",
        "value": "0000-0002-1825-0097"
      },
      {
        "type": "idref",
        "value": "123456789"
      }
    ]
  }
}
For each requested harvester, the Harvester service checks:
- if the harvester is declared in harvesters.yaml
- if a compatible identifier is present
A harvester is executed only if a suitable identifier is available (e.g. openalex requires an ORCID and will not run with a HAL identifier).
All eligible harvesters are executed in parallel.
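The two checks and the parallel execution can be sketched as follows. This is a simplified illustration under stated assumptions, not the Harvester's implementation: the `ACCEPTED_IDENTIFIERS` table is hypothetical apart from the openalex/ORCID pairing mentioned on this page, `run_harvester` is a placeholder that does no real harvesting, and `asyncio` stands in for whatever concurrency mechanism the service actually uses.

```python
import asyncio

# Identifier types each harvester can work from. Only the openalex/ORCID
# pairing comes from this page; the idref row is an illustrative assumption.
ACCEPTED_IDENTIFIERS = {
    "openalex": {"orcid"},
    "idref": {"idref"},  # hypothetical
}


def eligible_harvesters(requested, declared, identifiers):
    """Keep requested harvesters that are declared and have a usable identifier."""
    present = {identifier["type"] for identifier in identifiers}
    return [
        name for name in requested
        if name in declared and ACCEPTED_IDENTIFIERS.get(name, set()) & present
    ]


async def run_harvester(name):
    """Placeholder for a real harvest: yields control, then returns the name."""
    await asyncio.sleep(0)
    return name


async def run_all(names):
    """Run every eligible harvester concurrently and collect the results."""
    return await asyncio.gather(*(run_harvester(name) for name in names))


identifiers = [{"type": "orcid", "value": "0000-0002-1825-0097"}]
selected = eligible_harvesters(
    requested=["idref", "openalex", "unknown"],
    declared={"idref", "openalex"},
    identifiers=identifiers,
)
results = asyncio.run(run_all(selected))
```

In this run, idref is skipped because no idref identifier is present, and the third name is skipped because it is not declared, so only openalex is executed.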