Data management with Nuvla.io Part 3

Posted by Konstantin Skaburskas on 11 August 2020

This is the third blog post on Data Management with Nuvla.io.  Nuvla.io is an edge management platform as a service, designed to make it easy for you to manage, monitor, deploy and update your containerised apps and edge devices.

In the previous two blog posts, we described the Data Management capabilities of Nuvla.io and their connection to application provisioning, and then introduced the data-object, a Nuvla resource that lets you handle S3 objects without an S3 client and without sharing S3 credentials with the users of your data. Check back on those posts if you missed them.

Now it's time to dive into another Nuvla.io resource, data-record, which turns Nuvla.io into a Metadata Catalogue for your diverse data sets.

Nuvla.io as Metadata Catalogue

When you collect, store, and register your data, the sole intent is to process it later and make sense of it, i.e. to derive information.

The data processing step requires discovering the data and provisioning apps to where the data is located. Wouldn't it be cool if data discovery and app provisioning could be done in a single tool? In most cases the answer is obviously yes. That is why, along with Container Application Management, Nuvla.io lets you store and query metadata describing your data objects, and gives you full freedom to define the schema for that metadata.

This feature turns Nuvla.io into a flexible Metadata Catalogue.

data-record resource

The data-record resources provide rich metadata for data objects, either created through data-object resources (see blog post number 2) in Nuvla.io or externally.

As you've already seen, a data-object can be assigned tags that hold some information about the object. While tags help provide extended information, it is the data-record that is intended to be the object's real metadata ledger. A data-record can hold the object's metadata in an arbitrary schema that you define. This can include a reference to the data-object that you just created. When a data-record describes a data-object, this reference builds the link between the metadata and the actual location of the data.

Below is an example of creating three data-records. Each one describes and references its corresponding example data-object (the data_obj variable and the "data-object" key) as well as the S3 infrastructure service where the actual data objects are stored. (You can find more details on data-object and the S3 infrastructure-service in blog #2.)

from nuvla.api import Api as Nuvla
from nuvla.api.resources.data import DataRecord

nuvla = Nuvla()
nuvla.login('<login params>')

infra_s3 = 'infrastructure-service/1-2-3-4-5'

elephants = [('Lora', '2005-01-01', 'data-object/1-2-3-4-5-1'),
             ('Bambi', '2010-01-01', 'data-object/1-2-3-4-5-2'),
             ('Dodi', '2015-01-01', 'data-object/1-2-3-4-5-3')]

for name, bd, data_obj in elephants:
    data_record = {
        "infrastructure-service": infra_s3, # S3 infrastructure service ID.
        "data-object": data_obj,              # Reference to the actual data object.
        "description": "Elephant in Tierpark Berlin",
        "name": f"Elephant {name}",
        "object": f"elephant-{name}.png",
        "bucket": "zoo",
        "content-type": "animals/zoo",    # Used to find apps that can process the object.
        "bytes": 12499950,
        "platform": "S3",
        "animal:type": "elephant",
        "animal:name": name,
        "animal:birth-date": bd,
        "zoo:name": "Tierpark Berlin",
        "zoo:location": [52.50189498, 13.53193555, 40] # lat, lon, elev
    }
    DataRecord(nuvla).add(data_record)

Here are the three animal:type='elephant' data records just registered in Nuvla.io.

[Screenshot: data-records-created]

To find the elephants in the Tierpark that were born between 2005 and 2015, use the query below. The screenshot shows the executed query with the matching result.

animal:type='elephant' and (animal:birth-date>'2005-01-01' and 
animal:birth-date<'2015-01-01') and zoo:name^='Tierpark'

[Screenshot: data-record-search]
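The same query could also be issued from Python. The snippet below is a minimal sketch, not the definitive client code: the helper names `birth_date_filter` and `find_records` are made up for this illustration, and it assumes that a logged-in `nuvla.api.Api` instance offers a `search()` method accepting a `filter` keyword and returning a collection with a `resources` list whose items expose `.data`. The client is passed in as a parameter, so the filter-building part runs without a Nuvla.io account.

```python
def birth_date_filter(animal_type, born_after, born_before, zoo_prefix):
    """Build the filter string shown in the query above (hypothetical helper)."""
    return (
        f"animal:type='{animal_type}' and "
        f"(animal:birth-date>'{born_after}' and "
        f"animal:birth-date<'{born_before}') and "
        f"zoo:name^='{zoo_prefix}'"
    )


def find_records(nuvla, flt):
    """Return the matching data-records as plain dicts.

    Assumption: `nuvla` is a logged-in nuvla.api.Api instance whose
    search() accepts a `filter` keyword and returns a collection with
    a `resources` list, each item exposing the document as `.data`.
    """
    response = nuvla.search('data-record', filter=flt)
    return [resource.data for resource in response.resources]


flt = birth_date_filter('elephant', '2005-01-01', '2015-01-01', 'Tierpark')
print(flt)
```

With a logged-in client, `find_records(nuvla, flt)` would return the matching records. Note that `>` and `<` are strict comparisons, so, assuming the filter language follows the usual semantics, a record whose birth date equals one of the bounds would not match.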

After finding the required data records, users can either download the object directly from S3 (by referring to the data-object ID) or start an application on the infrastructure where the data object is located. The second option is possible thanks to the references between resources and the ability of Nuvla.io to derive the (cloud or edge) infrastructure service where the data is located and how it can be accessed.

More Information

More information can be found in our online documentation:

Data management model and Example on data management. 

Usage Scenarios

Because data records are free-schema documents, with infrastructure-service being the only mandatory attribute, they are suitable for describing any file-like objects stored on different storage media. They are therefore not limited to describing the S3 data-objects defined in Nuvla.io (as presented in the previous section).

One can imagine defining a schema that describes files on Persistent Volumes attached to a Docker Swarm or Kubernetes compute cluster registered in Nuvla.io and identified by an infrastructure service ID. For example:

{
  "infrastructure-service": "infrastructure-service/<k8s-cluster-id>",
  ...
  "platform": "k8s.pv",
  "k8s:namespace": "zoo",
  "k8s:pvc": "elephants",
  ...
}
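As a rough illustration, such records could be assembled with a small helper that enforces the one mandatory attribute and leaves the rest of the schema open. This is only a sketch: the function `make_pv_record` and the extra `k8s:path` attribute are hypothetical and not part of Nuvla.io; only the `infrastructure-service`, `platform`, `k8s:namespace` and `k8s:pvc` keys come from the example above.

```python
def make_pv_record(infra_service_id, namespace, pvc, **extra):
    """Build a data-record dict describing a file on a Persistent Volume.

    Hypothetical helper: it only checks the mandatory
    infrastructure-service reference; every other attribute is free-form.
    """
    if not infra_service_id.startswith("infrastructure-service/"):
        raise ValueError("infrastructure-service ID is mandatory")
    return {
        **extra,  # user-defined attributes; fixed keys below take precedence
        "infrastructure-service": infra_service_id,
        "platform": "k8s.pv",
        "k8s:namespace": namespace,
        "k8s:pvc": pvc,
    }


record = make_pv_record(
    "infrastructure-service/<k8s-cluster-id>",     # placeholder ID, as in the text
    "zoo",
    "elephants",
    **{"k8s:path": "/data/elephant-Lora.png"},     # hypothetical attribute
)
print(record)
```

The resulting dict could then be registered the same way as the elephant records earlier in this post.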

There are use cases where pairing an S3 data-object with a data-record to store and register an S3 object is not practical. Creating a data-object and obtaining the S3 pre-signed upload URL takes time, usually around 250 ms. For high-throughput stream processing applications where every millisecond counts, it is advisable to store the object on S3 directly from the application and then provide all the information required for object discovery and access in the corresponding data-record. The data-record registration can also happen on background application thread(s), allowing the main execution thread to continue. This is the case in the GNSS Big Data project, which we mentioned in the previous post and will describe in detail at the end of the Data Management blog series.
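The background registration idea can be sketched with a queue and a worker thread from the Python standard library. This is an illustrative pattern under stated assumptions, not how the GNSS Big Data project implements it: `start_registrar` is a hypothetical helper, and `register` stands for any callable that registers one record (in a real application it could be `DataRecord(nuvla).add` from the earlier example). Here a plain list stands in for Nuvla.io so the snippet runs on its own.

```python
import queue
import threading


def start_registrar(register, records):
    """Start a daemon thread that registers queued data-records.

    `register` is any callable taking one record dict; `records` is a
    queue.Queue. Putting None on the queue stops the worker.
    """
    def worker():
        while True:
            record = records.get()
            if record is None:          # sentinel: stop the worker
                records.task_done()
                break
            try:
                register(record)
            except Exception as exc:    # keep the worker alive on failures
                print(f"data-record registration failed: {exc}")
            finally:
                records.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


registered = []                          # stand-in for Nuvla.io in this sketch
pending = queue.Queue()
worker = start_registrar(registered.append, pending)

# The main thread only enqueues the metadata and carries on immediately.
pending.put({"infrastructure-service": "infrastructure-service/1-2-3-4-5",
             "bucket": "zoo", "object": "elephant-Lora.png"})

pending.put(None)                        # ask the worker to finish
pending.join()                           # wait until the queue is drained
worker.join()
```

The main thread pays only the cost of a queue insertion per object, while the slower registration calls are absorbed by the worker.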

Next steps

This blog post has shown you how Nuvla.io can act as a Metadata Catalogue via data-record resources, with details of how this can be accomplished in practice.

In the next blog posts, we will combine the knowledge from the previous blogs and describe how to:

  • create data-sets of data-objects and data-records (which expands Metadata Catalogue capabilities),
  • search and filter them, and
  • start applications with the data attached using Nuvla's "Open With ..." feature.

Follow us on LinkedIn or Twitter to make sure you don't miss the next blog post when it is published!

Ready to start your journey to the edge with our edge-to-cloud open platform?

Go to Nuvla.io
