How to optimize a pipeline?

General Guidelines

Use the following guidelines when creating a pipeline to find the best balance between disk space, memory, and run time.

[+] net positive effect
[-] net negative effect

 

| Component | Guideline | Disk Space | Memory | Time |
|---|---|---|---|---|
| Import | Only import properties that will be used. | ++ | | + |
| Import | Split big files into batches within one importer. This reduces run time and the chance of errors. | | ++ | + |
| Merge Classes | If multiple classes must be merged, merge them sequentially into the class with the highest number of instances. | + | | + |
| Merge Classes vs. Infer | The Merge Classes component takes precedence over Inferring components. | | - | + |
| Create Relationship (by identifier) vs. Create Relationship (by label) | Choose Create Relationship "by identifier" over "by label" whenever possible. | | | ++ |
| Remove Resources | Remove obsolete resources from a class as early as possible in the pipeline to increase the efficiency of subsequent components. | + | + | + |
| Create Compact Class | When many resources/predicates become obsolete, create a compact class as early as possible in the pipeline. This increases the efficiency of subsequent components. | - | + | + |
| Publish in DISQOVER | Toggle on "Automatically drop predicates" in the Publish in DISQOVER component Indexer. | + | | + |
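
The two Import guidelines (import only the properties you need, and split big files into batches) can also be applied to a source file before it reaches the pipeline. The following is a minimal sketch in plain Python, assuming the source is a CSV file. It does not use any DISQOVER API, and the names `iter_batches` and `split_csv` are hypothetical helpers introduced for illustration only.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def split_csv(csv_text: str, batch_size: int, keep_columns: List[str]) -> List[str]:
    """Split CSV text into smaller CSV texts that contain only keep_columns.

    Hypothetical pre-processing helper, not part of DISQOVER.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for batch in iter_batches(reader, batch_size):
        out = io.StringIO()
        writer = csv.DictWriter(
            out,
            fieldnames=keep_columns,
            extrasaction="ignore",  # silently drop columns that are not needed
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(batch)
        chunks.append(out.getvalue())
    return chunks


if __name__ == "__main__":
    source = "id,name,unused\n1,a,x\n2,b,y\n3,c,z\n"
    for number, chunk in enumerate(split_csv(source, 2, ["id", "name"]), start=1):
        print(f"batch {number}:")
        print(chunk)
```

Each returned chunk is a complete CSV document with its own header row, so every batch can be imported on its own. For very large files, the same idea would read from and write to files instead of strings.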

Importer Overview

The table below describes the different importer types. It lists the pros and cons of each type so that you can choose the format that best fits the data and use case.

| Importer Type | Advantages | Disadvantages |
|---|---|---|
| Import RDF | Multiple classes created with one import; relationships are automatically created; TTL, N-Triples; simple schema; automatically sets the resource URI | Complex schemas (multi-leveled, many blank nodes); OWL, RDF/XML |
| Import XML | Import SubObject; filter on property content using XPath | Heavy on memory for big files |
| Import JSON | Flat data model | Need to specify the correct format of the JSON/JL file |
| Import Excel | Import Sheet; flat data model | Slow |
| Import CSV | Flat data model | Columns with the same name overwrite each other |
| Import Identifier Block | | File needs to be sorted |
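
Two of the disadvantages above can be checked before a file is imported: duplicate column names in a CSV file, and an unsorted file for the Import Identifier Block. The sketch below uses plain Python and no DISQOVER API. The function names are hypothetical, and the sort check assumes plain string comparison, because the sort order the importer expects is not stated here.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List


def duplicate_columns(csv_text: str) -> List[str]:
    """Return header names that occur more than once, in first-seen order."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    counts = Counter(header)
    duplicates: List[str] = []
    for name in header:
        if counts[name] > 1 and name not in duplicates:
            duplicates.append(name)
    return duplicates


def is_sorted(identifiers: Iterable[str]) -> bool:
    """Return True if the identifiers are in non-decreasing order.

    Assumption: plain string comparison; adjust if the importer
    expects a different ordering.
    """
    previous = None
    for current in identifiers:
        if previous is not None and current < previous:
            return False
        previous = current
    return True


if __name__ == "__main__":
    print(duplicate_columns("id,name,name\n1,a,b\n"))
    print(is_sorted(["A1", "A2", "B1"]))
```

Renaming any column reported by `duplicate_columns` before the import avoids one column silently overwriting another.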