Promote user data to become DestinE data
The DestinE Data Lake maintains a DestinE Data Portfolio with more than 170 official entries at the time of writing. However, DestinE users can also propose their own Datasets to the DestinE community, and in this article we describe the entire process.
In essence, the Dataset must be packaged as a STAC Collection with its associated STAC Items. A Review Board will determine whether your Dataset can be integrated into the DestinE Data Lake.
Once reviewed and validated, your Dataset will be accessible through the exposed HDA component STAC API, in the same way as the rest of the DestinE Data Lake portfolio. In addition, it will be discoverable through the DestinE Data Portfolio - Web Portal.
Step 1: Contact the DestinE Help Desk
The first step to bring your data to the DestinE community is to get in touch with the DestinE Help Desk and to express your wish to share your dataset.
In your DestinE Help Desk ticket, describe briefly the dataset and the benefit it will bring to the DestinE community. Important initial information to provide should include the following:
General description of the dataset
The temporal and spatial extent of the data
The format of the data
The total volume of the data
The input data that was used to generate this new dataset
Licensing preferences to be attached to the dataset
The Data Lake HDA (Harmonised Data Access component) can restrict datasets to specific users. You should therefore note whether the dataset is to be made available to specific users or to all users. Once agreement is reached, a dedicated IAM role or roles may be created to enforce the desired visibility.
Note
A review board will examine your request and respond to you directly in the ticket, giving follow-up steps.
Step 2: Proposal Accepted by the Data Lake - Review Board
Once the Data Lake - Review Board has accepted your proposal, you will receive an agreed Collection Id (a.k.a. Dataset Id) which will, at the end of the process, be used to access your data through the Data Lake HDA.
It is now your responsibility to provide data and metadata (STAC Collection + Items) compatible with the DEDL HDA.
The final delivery of this data and metadata is done via an Object Storage Bucket hosted on the DestinE Data Lake - Central Islet Storage, following the naming convention usergenerated-proposal-[collection_id].
To help you with this data preparation, we have provided a Demonstration Project found in the DestinE-DataLake-Lab repository on GitHub.
Step 3: Integrating with DEDL HDA - a STAC API
The Data Lake HDA component exposes data using STAC API, which means that your data and metadata will need to be provided according to the STAC specifications.
You will see from the Demonstration Project that Python and PySTAC are used to manipulate a STAC Collection and STAC Items which reference your data.
- STAC Collection
A STAC Collection provides additional information about a spatio-temporal collection of data. It extends Catalog directly, layering on additional fields to enable description of things like the spatial and temporal extent of the data, the license, keywords, providers, etc. It in turn can easily be extended for additional collection level metadata. It is used standalone by parts of the STAC community, as a lightweight way to describe data holdings.
- STAC Item
Fundamental to any STAC, a STAC Item represents an atomic collection of inseparable data and metadata. A STAC Item is a GeoJSON feature and can be easily read by any modern GIS or geospatial library. The STAC Item JSON specification includes additional fields for:
the time the asset represents
a thumbnail for quick browsing
asset links, links to the described data
relationship links, allowing users to traverse other related STAC Items
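These bullet points map directly onto members of the Item JSON. The following stdlib-only sketch shows that anatomy (all identifiers, coordinates and asset paths are placeholders; the Demonstration Project builds real Items with PySTAC):

```python
import json

# A minimal STAC Item: a GeoJSON Feature carrying STAC-specific members.
# All identifiers, coordinates and asset paths below are placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [],
    "id": "EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [10.0, 35.0], [10.0, 60.0], [-10.0, 60.0],
            [-10.0, 35.0], [10.0, 35.0],
        ]],
    },
    "bbox": [-10.0, 35.0, 10.0, 60.0],
    "properties": {
        # the time the assets represent
        "datetime": "2024-11-15T00:00:00Z",
        "start_datetime": "2024-11-15T00:00:00Z",
        "end_datetime": "2024-11-15T23:59:59Z",
    },
    # relationship links (collection, self, ...) are added on integration
    "links": [],
    # asset links: each asset has an href, a media type and a role
    "assets": {
        "thumbnail.jpg": {  # a thumbnail for quick browsing
            "href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/thumbnail.jpg",
            "type": "image/jpeg",
            "roles": ["thumbnail"],
        },
    },
    "collection": "EO.XXX.YYY.ZZZ",
}

print(json.dumps(item)[:20])  # serialises as plain GeoJSON
```

Because a STAC Item is valid GeoJSON, any GeoJSON-aware GIS or library can read it; the STAC-specific members simply travel alongside the Feature.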
Step 4: Data Preparation
As you will see in the Demonstration Project, you will need to provide your data and metadata in a predetermined structure. Your source data will likely need to be manipulated in order to conform to our requirements.
- Using your own infrastructure to prepare data
You will likely have your own infrastructure (server / VM) on which to manipulate your source data. Follow these steps to prepare your data:
On your VM, copy/clone the Demonstration Project
Create a Python file called data_preparation.py at the root of the folder; it should handle:
downloading your source data
moving files to the expected folder structure (See the structure in the block below)
extraction of preliminary item metadata (into a file called item_config.json) ready for the next steps
When the data is structured correctly, you will be able to use the file generate_item_metadata.py to automatically generate the STAC Item files.
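As a rough illustration of what data_preparation.py might do (this sketch is hypothetical; the function name and placeholder values are ours, not the Demonstration Project's), the following lays one item's files out in the expected data/YYYY/MM/DD/[ITEM_ID]/ structure and writes a preliminary item_config.json:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of data_preparation.py. Names and values are
# placeholders; adapt the download/move logic to your own source data.
COLLECTION_ID = "EO.XXX.YYY.ZZZ"

def prepare_item(root, day, start, end, source_files):
    """day is 'YYYY/MM/DD'; start/end are compact timestamps, e.g. 20241115T000000."""
    item_id = f"{COLLECTION_ID}_{start}_{end}"
    item_dir = root / COLLECTION_ID / "data" / day / item_id
    item_dir.mkdir(parents=True, exist_ok=True)
    for src in source_files:  # move files into the expected folder structure
        (item_dir / src.name).write_bytes(src.read_bytes())
    # preliminary item metadata, ready for the generation step
    config = {"bbox": [-10.0, 35.0, 10.0, 60.0]}
    (item_dir / "item_config.json").write_text(json.dumps(config, indent=2))
    return item_dir

# Example with one dummy file standing in for downloaded source data:
workdir = Path(tempfile.mkdtemp())
src = workdir / "20241115.png"
src.write_bytes(b"placeholder")
item_dir = prepare_item(workdir, "2024/11/15",
                        "20241115T000000", "20241115T235959", [src])
print(item_dir.name)  # EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959
```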
- Using DestinE Data Lake infrastructure to prepare data
If you already have access to the DestinE Data Lake Islet Service, you could use this for any data preparation.
└── [MY_COLLECTION_ID]                    # e.g. EO.XXX.YYY.ZZZ (HIGH_LEVEL_DATA_TYPE.DATA_PROVIDER.DATATYPE.DATASET_NAME)
    ├── metadata
    │   ├── collection_config.json        # Global configuration that can be used when generating items. Can be overloaded at the item level in item_config.json (see below), e.g. "thumbnail_regex" (to identify thumbnails)
    │   ├── collection.json               # A STAC file of type 'Collection' in JSON format giving an overview of the collection
    │   └── items
    │       ├── ITEM_1_ID.json            # NOTE: these files are generated (using our Demonstration Project) on the condition that the prerequisite structure is adhered to
    │       └── ITEM_2_ID.json            # STAC Item metadata files of type 'Feature' in JSON format, each describing an individual Item
    └── data                              # The data associated with Items is stored in folders YYYY/MM/DD/...
        └── 2024                          # YYYY, e.g. 2024
            └── 11                        # MM, e.g. 11 = November
                └── 15                    # DD, e.g. 15 = 15th (of November)
                    └── ITEM_1_ID         # A folder containing 1-n files/folders; naming convention [MY_COLLECTION_ID]_[start_datetime]_[end_datetime] or [MY_COLLECTION_ID]_[datetime]
                        ├── item_config.json  # Item-level configuration. Overrides collection-level config if any, e.g. "bbox"
                        ├── datafile1     # Each file in this folder/subfolders is an 'Asset' of the Item (some files can be ignored via configuration). Each Asset should have a role: "data", "metadata", "thumbnail" or "overview"
                        ├── datafile2
                        └── ...
As you can see from the example above, and considering a Collection Id EO.XXX.YYY.ZZZ:
A file representing the Collection (a.k.a. Dataset) is found in the path EO.XXX.YYY.ZZZ/metadata/collection.json
This contains a high level description and metadata that describes all the Items you want to expose.
Data representing the Items of your Collection has been placed in a folder representing a given day, e.g. in the folder EO.XXX.YYY.ZZZ/data/2024/11/15/ (data/YYYY/MM/DD)
However, depending on the nature of the Items in your Collection, it is possible to place data files at the month level ../data/YYYY/MM/ or year level ../data/YYYY/
In the targeted folder you will have files that become Assets of individual Items in the Collection. These Assets will have different roles, e.g. data, metadata, thumbnail, overview.
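Pulling the configuration keys mentioned in this guide together, a minimal collection_config.json might look like the following (the regex value is purely illustrative; consult the Demonstration Project for the full set of supported keys):

```json
{
    "item_folder_level": "DD",
    "thumbnail_regex": ".*thumbnail.*"
}
```

Here "item_folder_level": "DD" declares that item folders sit at the day level (data/YYYY/MM/DD/), and "thumbnail_regex" tells the generation step which files to treat as thumbnails.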
Step 5: Generating Item-Level Metadata
Assuming you have prepared the data, and that it is in the structure seen above, you can execute the file Usergenerated/generate_item_metadata.py in the usergenerated project.
This will:
Open your ../metadata/collection.json STAC Collection file and validate it using PySTAC (you should address any STAC-conformance errors that are reported)
Open your ../metadata/collection_config.json file to load any collection-level configuration (e.g. here you could configure the expected regex for thumbnails in your data).
In this configuration file you can also identify where to expect the Items in the data folder (using the key “item_folder_level” with possible values “YYYY”, “MM” or “DD”)
Identify a list of Items (i.e. it will go to the configured level and extract a list of ITEM folders in the data folder)
Initialise a STAC Item object (using PySTAC) for each of the found ITEM folders.
The ITEM folders contain an item_config.json file with additional information for the generation of the STAC Item object.
Generate the STAC Item metadata in the path ../metadata/items/ITEM_1_ID.json, for example
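For illustration, an Item's temporal properties can be derived from the item folder naming convention described in Step 4. A stdlib-only sketch of that parsing step (this helper is hypothetical, not code from the Demonstration Project):

```python
from datetime import datetime, timezone

def parse_item_folder(name: str, collection_id: str) -> dict:
    """Derive STAC temporal properties from an item folder named
    [COLLECTION_ID]_[start]_[end] or [COLLECTION_ID]_[datetime],
    with timestamps in compact form, e.g. 20241115T000000."""
    stamps = name[len(collection_id) + 1:].split("_")

    def to_iso(stamp: str) -> str:
        return datetime.strptime(stamp, "%Y%m%dT%H%M%S").replace(
            tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    props = {"datetime": to_iso(stamps[0])}
    if len(stamps) == 2:  # the [start]_[end] form of the convention
        props["start_datetime"] = to_iso(stamps[0])
        props["end_datetime"] = to_iso(stamps[1])
    return props

props = parse_item_folder(
    "EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959", "EO.XXX.YYY.ZZZ")
print(props["end_datetime"])  # 2024-11-15T23:59:59Z
```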
Reference : Collection Example
In the code block below, we see an example STAC file of type ‘Collection’ (here based on the DestinE Data Lake collection EO.CLMS.DAT.CORINE).
- The Collection is in JSON format and represents the high-level overview of your data.
The Collection, when deployed to the HDA component, will be exposed online in HTML form as part of the DestinE Data Lake - Data Portfolio, e.g. at https://hda.central.data.destination-earth.eu/ui/dataset/EO.CLMS.DAT.CORINE
The raw Collection STAC file, when deployed to the HDA component, will be found, for example, at https://hda.data.destination-earth.eu/stac/collections/EO.CLMS.DAT.CORINE
Note: the comments would have to be removed for the following JSON to be valid.
{
"type": "Collection",
"id": "EO.XXX.YYY.ZZZ", // Replace this e.g. EO.CLMS.DAT.CORINE
"stac_version": "1.0.0",
"description": "Text describing the dataset. This text will be shown in the Overview section of hda catalogue and can be a paragraph of text.",
"links": [
{
"rel": "license",
"href": "LICENCE_URL_LINK", // Replace this e.g. https://land.copernicus.eu/en/data-policy
"title": "LICENCE_NAME" // Replace this e.g. Copernicus Land Data Policy
},
{
"rel": "cite-as",
"href": "DOI_URL", // Replace this e.g. https://doi.org/10.2909/17ab2088-6907-470f-90b6-8c1364865803
"title": "DOI_DATASET_TITLE" // Replace this e.g. CORINE Land Cover Change 1990-2000 (vector), Europe, 6-yearly - version 2020_20u1, May 2020
},
{
"rel": "describedby",
"href": "OTHER_DATASET_URL", // Replace this e.g. https://land.copernicus.eu/en/products/corine-land-cover
"title": "DATASET_TITLE" // Replace this e.g. CORINE Land Cover
}
// NOTE: Other links will be added dynamically when integrating the collection into DEDL HDA (e.g. Parent, Self etc...)
],
"stac_extensions": [
"https://stac-extensions.github.io/scientific/v1.0.0/schema.json" // e.g. Optional extension e.g. to expose Digital Object Identifiers
],
"sci:publications": [
{
"sci:doi": "DOI_CODE", // Replace this e.g. 10.2909/c62bb056-5ac3-4512-b642-7f484175d951
"sci:citation": "DOI_DATASET_TITLE" // Replace this e.g. European Union's Copernicus Land Monitoring Service information (CORINE Land Cover Change 1990-2000 (raster 100 m), Europe, 6-yearly)
}
],
"title": "DATASET_TITLE", // Replace this (A short description of the collection) e.g. CORINE Land Cover
"extent": {
"spatial": {
"bbox": [ // Replace these values with your spatial bbox coordinates
[
-31.561261,
27.405827,
44.820775,
71.409109
]
]
},
"temporal": {
"interval": [ // Replace these values with your temporal extent (covering the temporal extent of the planned Items you are exposing). If there is no end date, the second date can be replaced with null (no quotation marks)
[
"2024-11-01T00:00:00Z",
"2024-11-30T23:59:59Z"
]
]
}
},
"license": "proprietary",
"keywords": [ // Replace these keywords - refer to https://hda.central.data.destination-earth.eu/ui/catalog to see existing. Use these where possible
"Satellite Image Interpretation",
"Land Cover Change",
"geospatial data",
"landscape alteration",
"Land cover",
"European",
"Copernicus"
],
"providers": [
{
"name": "PROVIDER_NAME", // Replace this e.g. European Environment Agency
"roles": [ // Replace these values as appropriate
"producer",
"processor",
"licensor"
],
"url": "PROVIDER_URL" // Replace this e.g. https://www.eea.europa.eu/
}
],
"assets": {
"thumbnail": {
"href": "URL_TO_DATASET_IMAGE", // Replace this e.g. https://land.copernicus.eu/en/products/corine-land-cover/@@images/image-400-7d8e8dfc63d50c9bf89ff5a7475dcd46.png
"type": "image/png", // Replace this with the correct Mime type
"title": "overview",
"roles": [
"thumbnail"
]
}
}
}
Reference : Item Example
And here is an example of a STAC Item (Feature) associated with that Collection:
{
"type": "Feature",
"stac_version": "1.0.0",
"stac_extensions": [],
"id": "EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959",
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
10.0,
35.0
],
[
10.0,
60.0
],
[
-10.0,
60.0
],
[
-10.0,
35.0
],
[
10.0,
35.0
]
]
]
},
"bbox": [
-10.0,
35.0,
10.0,
60.0
],
"properties": {
"start_datetime": "2024-11-15T00:00:00Z",
"end_datetime": "2024-11-15T23:59:59Z",
"datetime": "2024-11-15T00:00:00Z"
},
"links": [
{
"rel": "collection",
"href": "https://hda.data.destination-earth.eu/stac/collections/EO.XXX.YYY.ZZZ",
"type": "application/json",
"title": "DATASET_TITLE"
},
{
"rel": "self",
"href": "https://hda.data.destination-earth.eu/stac/collections/EO.XXX.YYY.ZZZ/items/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959",
"type": "application/json"
}
],
"assets": {
"metadata1.json": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/metadata1.json",
"type": "application/json",
"roles": [
"metadata"
]
},
"thumbnail.jpg": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/thumbnail.jpg",
"type": "image/jpeg",
"roles": [
"thumbnail"
]
},
"overview.jpg": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/overview.jpg",
"type": "image/jpeg",
"roles": [
"overview"
]
},
"20241115.png": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/20241115.png",
"type": "image/png",
"roles": [
"data"
]
}
},
"collection": "EO.XXX.YYY.ZZZ"
}
Step 6: Transferring the Collection to Data Lake - Islet Storage / Bucket
When the data and metadata have been prepared, you will need to transfer them to a ‘private’ dedicated Object Storage Container (a.k.a. Bucket) in the Central Site - Islet (Storage) Service.
Note
Proposals should be delivered to a bucket following the naming convention usergenerated-proposal-[collection_id] e.g. usergenerated-proposal-EO.XXX.YYY.ZZZ
- Creating a Private Bucket for your Collection Proposal
Through your Support Ticket you can ask for advice on getting access/credentials for a ‘private’ Bucket. The options are as follows:
Through the Islet Storage ONLY service (most cases)
As the name states, this gives you Object Storage only, and is a way of delivering your Collection for review.
Through the Islet Compute service (for users who have already made a successful application for Edge Services - Big Data Processing Services)
With Islet Compute, there is a built-in object storage service.
EC2 credentials are necessary to interact with a private bucket programmatically (see How to generate and manage EC2 credentials).
Now that you have an empty bucket and credentials to write to it:
When you execute generate_item_metadata.py in the usergenerated demonstration project, the variable IS_UPLOAD_S3 is set to False by default. Change this value to True in config.py.
See the README.MD of the usergenerated project with respect to setting EC2 credentials in a .env file at the root of the usergenerated project.
Now re-execute generate_item_metadata.py and your Collection will be uploaded to your bucket ‘usergenerated-proposal-EO.XXX.YYY.ZZZ’, ready for review.
Note
This method will upload all of your data to the target bucket in one process, which may not be appropriate for large volumes. An alternative approach is to push one item at a time during the data preparation stage. If you need any advice, don’t hesitate to contact the DestinE Help Desk through your ticket.
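A sketch of that per-item alternative, assuming the bucket layout mirrors the local folder structure from Step 4 (the helper below is ours, not part of the Demonstration Project; the boto3 call is commented out because it needs your EC2 credentials and endpoint):

```python
import tempfile
from pathlib import Path

def item_object_keys(collection_root: Path, item_dir: Path):
    """Yield (local_file, object_key) pairs for one item folder.
    Object keys mirror the local layout relative to the parent of the
    collection root, e.g. EO.XXX.YYY.ZZZ/data/2024/11/15/ITEM_ID/datafile1."""
    base = collection_root.parent
    for path in sorted(item_dir.rglob("*")):
        if path.is_file():
            yield path, path.relative_to(base).as_posix()

# Demonstration on a throwaway structure (names are placeholders):
root = Path(tempfile.mkdtemp()) / "EO.XXX.YYY.ZZZ"
item_dir = root / "data" / "2024" / "11" / "15" / "ITEM_1_ID"
item_dir.mkdir(parents=True)
(item_dir / "datafile1").write_text("x")
keys = [key for _, key in item_object_keys(root, item_dir)]
print(keys)  # ['EO.XXX.YYY.ZZZ/data/2024/11/15/ITEM_1_ID/datafile1']

# With EC2 credentials configured (see the .env handling above), each pair
# could then be pushed with boto3, one item at a time:
# s3 = boto3.client("s3", endpoint_url=..., aws_access_key_id=..., aws_secret_access_key=...)
# for local, key in item_object_keys(root, item_dir):
#     s3.upload_file(str(local), "usergenerated-proposal-EO.XXX.YYY.ZZZ", key)
```

Uploading item by item keeps each transfer small and lets you resume after a failure without re-sending the whole Collection.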
Step 7: Follow the ticket exchanges
Once the data is ready, the DEDL team will proceed to upload your dataset to the Data Lake. The team will keep you informed of the current status of the process and will come back to you if there is any issue regarding the data or its format.
Once the data is fully uploaded and available, you will be notified on the ticket, and you will be able to access the newly added dataset directly from the DEDL HDA API and the DEDL Web Portal.