Promote user data to become DestinE data
The DestinE Data Lake maintains a DestinE Data Portfolio with more than 170 official entries at the time of writing. However, DestinE users can also propose their own Datasets to the DestinE community, and in this article we describe the entire process.
In essence, the Dataset must be packaged as a STAC Collection with its associated STAC Items. A Review Board will determine whether your Dataset can be integrated into the DestinE Data Lake.
Once reviewed and validated, your Dataset will be accessible through the exposed HDA component STAC API, in the same way as the rest of the DestinE Data Lake portfolio. In addition, it will be discoverable through the DestinE Data Portfolio - Web Portal.
Step 1: Contact the DestinE Help Desk
The first step to bring your data to the DestinE community is to get in touch with the DestinE Help Desk and to express your wish to share your dataset.
In your DestinE Help Desk ticket, describe briefly the dataset and the benefit it will bring to the DestinE community. Important initial information to provide should include the following:
General description of the dataset
The temporal and spatial extent of the data
The format of the data
The total volume of the data
The input data that was used to generate this new dataset
Licensing preferences to be attached to the dataset
The Data Lake HDA (Harmonised Data Access component) can restrict datasets to specific users. You should therefore note whether the dataset is to be made available to specific users or to all users. Once agreement is reached, a dedicated IAM role or roles may be created to enforce the desired visibility.
Note
A review board will examine your request and respond to you directly in the ticket, giving follow-up steps.
Step 2: Proposal Accepted by the Data Lake - Review Board
Once the Data Lake - Review Board has accepted your proposal, you will receive an agreed Collection Id (a.k.a. Dataset Id) which will, at the end of the process, be used to access your data through the Data Lake HDA.
It is now your responsibility to provide data and metadata (STAC Collection + Items) compatible with the DEDL HDA.
The final delivery of this data and metadata is done via an Object Storage Bucket hosted on the DestinE Data Lake - Central Islet Storage, following the naming convention usergenerated-proposal-[collection_id].
To help you with this data preparation, we have provided a Demonstration Project found in the DestinE-DataLake-Lab repository on GitHub.
Step 3: Integrating with DEDL HDA - a STAC API
The Data Lake HDA component exposes data using STAC API, which means that your data and metadata will need to be provided according to the STAC specifications.
You will see from the Demonstration Project that Python and PySTAC are used to manipulate a STAC Collection and STAC Items which reference your data.
- STAC Collection
A STAC Collection provides additional information about a spatio-temporal collection of data. It extends Catalog directly, layering on additional fields to enable description of things like the spatial and temporal extent of the data, the license, keywords, providers, etc. It in turn can easily be extended for additional collection level metadata. It is used standalone by parts of the STAC community, as a lightweight way to describe data holdings.
- STAC Item
Fundamental to any STAC, a STAC Item represents an atomic collection of inseparable data and metadata. A STAC Item is a GeoJSON feature and can be easily read by any modern GIS or geospatial library. The STAC Item JSON specification includes additional fields for:
the time the asset represents
a thumbnail for quick browsing
asset links, links to the described data
relationship links, allowing users to traverse other related STAC Items
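These bullet points map directly onto members of the Item JSON. The following stdlib-only sketch shows that anatomy (all identifiers, coordinates and asset paths are placeholders; the Demonstration Project builds real Items with PySTAC):

```python
import json

# A minimal STAC Item: a GeoJSON Feature carrying STAC-specific members.
# All identifiers, coordinates and asset paths below are placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [],
    "id": "EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [10.0, 35.0], [10.0, 60.0], [-10.0, 60.0],
            [-10.0, 35.0], [10.0, 35.0],
        ]],
    },
    "bbox": [-10.0, 35.0, 10.0, 60.0],
    "properties": {
        # the time the assets represent
        "datetime": "2024-11-15T00:00:00Z",
        "start_datetime": "2024-11-15T00:00:00Z",
        "end_datetime": "2024-11-15T23:59:59Z",
    },
    # relationship links (collection, self, ...) are added on integration
    "links": [],
    # asset links: each asset has an href, a media type and a role
    "assets": {
        "thumbnail.jpg": {  # a thumbnail for quick browsing
            "href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/thumbnail.jpg",
            "type": "image/jpeg",
            "roles": ["thumbnail"],
        },
    },
    "collection": "EO.XXX.YYY.ZZZ",
}

print(json.dumps(item)[:20])  # serialises as plain GeoJSON
```

Because a STAC Item is valid GeoJSON, any GeoJSON-aware GIS or library can read it; the STAC-specific members simply travel alongside the Feature.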
Step 4: Data Preparation
As you will see in the Demonstration Project, you will need to provide your data and metadata in a predetermined structure. Your source data will likely need to be manipulated in order to conform to our requirements.
- Using your own infrastructure to prepare data
You will likely have your own infrastructure (server / VM) on which to manipulate your source data. Follow these steps to prepare your data:
On your VM, copy/clone the Demonstration Project
Create a Python file called data_preparation.py at the root of the folder; it should handle:
downloading your source data
moving files to the expected folder structure (See the structure in the block below)
extraction of preliminary item metadata (into a file called item_config.json) ready for the next steps
When the data is structured correctly, you will be able to use the file generate_item_metadata.py to automatically generate the STAC Item files.
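As a rough illustration of what data_preparation.py might do (this sketch is hypothetical; the function name and placeholder values are ours, not the Demonstration Project's), the following lays one item's files out in the expected data/YYYY/MM/DD/[ITEM_ID]/ structure and writes a preliminary item_config.json:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of data_preparation.py. Names and values are
# placeholders; adapt the download/move logic to your own source data.
COLLECTION_ID = "EO.XXX.YYY.ZZZ"

def prepare_item(root, day, start, end, source_files):
    """day is 'YYYY/MM/DD'; start/end are compact timestamps, e.g. 20241115T000000."""
    item_id = f"{COLLECTION_ID}_{start}_{end}"
    item_dir = root / COLLECTION_ID / "data" / day / item_id
    item_dir.mkdir(parents=True, exist_ok=True)
    for src in source_files:  # move files into the expected folder structure
        (item_dir / src.name).write_bytes(src.read_bytes())
    # preliminary item metadata, ready for the generation step
    config = {"bbox": [-10.0, 35.0, 10.0, 60.0]}
    (item_dir / "item_config.json").write_text(json.dumps(config, indent=2))
    return item_dir

# Example with one dummy file standing in for downloaded source data:
workdir = Path(tempfile.mkdtemp())
src = workdir / "20241115.png"
src.write_bytes(b"placeholder")
item_dir = prepare_item(workdir, "2024/11/15",
                        "20241115T000000", "20241115T235959", [src])
print(item_dir.name)  # EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959
```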
- Using DestinE Data Lake infrastructure to prepare data
If you already have access to the DestinE Data Lake Islet Service, you could use this for any data preparation.
└── [MY_COLLECTION_ID]                    # e.g. EO.XXX.YYY.ZZZ (HIGH_LEVEL_DATA_TYPE.DATA_PROVIDER.DATATYPE.DATASET_NAME)
    ├── metadata
    │   ├── collection_config.json        # Global configuration that can be used when generating items. Can be overloaded at the item level in item_config.json (see below), e.g. "thumbnail_regex" (to identify thumbnails)
    │   ├── collection.json               # A STAC file of type 'Collection' in JSON format giving an overview of the collection
    │   └── items
    │       ├── ITEM_1_ID.json            # NOTE: these files are generated (using our Demonstration Project) on the condition that the prerequisite structure is adhered to
    │       └── ITEM_2_ID.json            # STAC Item metadata files of type 'Feature' in JSON format, each describing an individual Item
    └── data                              # The data associated with Items is stored in folders YYYY/MM/DD/...
        └── 2024                          # YYYY, e.g. 2024
            └── 11                        # MM, e.g. 11 = November
                └── 15                    # DD, e.g. 15 = 15th (of November)
                    └── ITEM_1_ID         # A folder containing 1-n files/folders; naming convention [MY_COLLECTION_ID]_[start_datetime]_[end_datetime] or [MY_COLLECTION_ID]_[datetime]
                        ├── item_config.json  # Item-level configuration. Overrides collection-level config if any, e.g. "bbox"
                        ├── datafile1     # Each file in this folder/subfolders is an 'Asset' of the Item (some files can be ignored via configuration). Each Asset should have a role: "data", "metadata", "thumbnail" or "overview"
                        ├── datafile2
                        └── ...
As you can see from the example above, and considering a Collection Id EO.XXX.YYY.ZZZ:
A file representing the Collection (a.k.a. Dataset) is found in the path EO.XXX.YYY.ZZZ/metadata/collection.json
This contains a high level description and metadata that describes all the Items you want to expose.
Data representing the Items of your Collection has been placed in a folder representing a given day, e.g. in the folder EO.XXX.YYY.ZZZ/data/2024/11/15/ (data/YYYY/MM/DD)
However, depending on the nature of the Items in your Collection, it is possible to place data files at the month level ../data/YYYY/MM/ or year level ../data/YYYY/
In the targeted folder you will have files that become Assets of individual Items in the Collection. These Assets will have different roles, e.g. data, metadata, thumbnail, overview.
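Pulling the configuration keys mentioned in this guide together, a minimal collection_config.json might look like the following (the regex value is purely illustrative; consult the Demonstration Project for the full set of supported keys):

```json
{
    "item_folder_level": "DD",
    "thumbnail_regex": ".*thumbnail.*"
}
```

Here "item_folder_level": "DD" declares that item folders sit at the day level (data/YYYY/MM/DD/), and "thumbnail_regex" tells the generation step which files to treat as thumbnails.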
Step 5: Generating Item-Level Metadata
Assuming you have prepared the data, and that it is in the structure seen above, you can execute the file Usergenerated/generate_item_metadata.py in the usergenerated project.
This will:
Open your ../metadata/collection.json STAC Collection file and validate it using PySTAC (you should address any STAC-conformance errors that are reported)
Open your ../metadata/collection_config.json file to load any collection-level configuration (e.g. here you could configure the expected regex for thumbnails in your data).
In this configuration file you can also identify where to expect the Items in the data folder (using the key “item_folder_level” with possible values “YYYY”, “MM” or “DD”)
Identify a list of Items (i.e. it will go to the configured level and extract a list of ITEM folders in the data folder)
Initialise a STAC Item object (using PySTAC) for each of the found ITEM folders.
The ITEM folders contain an item_config.json file with additional information for the generation of the STAC Item object.
Generate the STAC Item metadata in the path ../metadata/items/ITEM_1_ID.json, for example
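For illustration, an Item's temporal properties can be derived from the item folder naming convention described in Step 4. A stdlib-only sketch of that parsing step (this helper is hypothetical, not code from the Demonstration Project):

```python
from datetime import datetime, timezone

def parse_item_folder(name: str, collection_id: str) -> dict:
    """Derive STAC temporal properties from an item folder named
    [COLLECTION_ID]_[start]_[end] or [COLLECTION_ID]_[datetime],
    with timestamps in compact form, e.g. 20241115T000000."""
    stamps = name[len(collection_id) + 1:].split("_")

    def to_iso(stamp: str) -> str:
        return datetime.strptime(stamp, "%Y%m%dT%H%M%S").replace(
            tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    props = {"datetime": to_iso(stamps[0])}
    if len(stamps) == 2:  # the [start]_[end] form of the convention
        props["start_datetime"] = to_iso(stamps[0])
        props["end_datetime"] = to_iso(stamps[1])
    return props

props = parse_item_folder(
    "EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959", "EO.XXX.YYY.ZZZ")
print(props["end_datetime"])  # 2024-11-15T23:59:59Z
```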
Reference : Collection Example
In the code block below, we see an example STAC file of type ‘Collection’ (here based on the DestinE Data Lake collection EO.CLMS.DAT.CORINE).
- The Collection is in JSON format and represents the high-level overview of your data.
The Collection, when deployed to the HDA component, will be exposed online in HTML form as part of the DestinE Data Lake - Data Portfolio, e.g. at https://hda.central.data.destination-earth.eu/ui/dataset/EO.CLMS.DAT.CORINE
The raw Collection STAC file, when deployed to the HDA component, will be found, for example, at https://hda.data.destination-earth.eu/stac/collections/EO.CLMS.DAT.CORINE
Note: the comments would have to be removed for the following JSON to be valid.
{
"type": "Collection",
"id": "EO.XXX.YYY.ZZZ", // Replace this e.g. EO.CLMS.DAT.CORINE
"stac_version": "1.0.0",
"description": "Text describing the dataset. This text will be shown in the Overview section of hda catalogue and can be a paragraph of text.",
"links": [
{
"rel": "license",
"href": "LICENCE_URL_LINK", // Replace this e.g. https://land.copernicus.eu/en/data-policy
"title": "LICENCE_NAME" // Replace this e.g. Copernicus Land Data Policy
},
{
"rel": "cite-as",
"href": "DOI_URL", // Replace this e.g. https://doi.org/10.2909/17ab2088-6907-470f-90b6-8c1364865803
"title": "DOI_DATASET_TITLE" // Replace this e.g. CORINE Land Cover Change 1990-2000 (vector), Europe, 6-yearly - version 2020_20u1, May 2020
},
{
"rel": "describedby",
"href": "OTHER_DATASET_URL", // Replace this e.g. https://land.copernicus.eu/en/products/corine-land-cover
"title": "DATASET_TITLE" // Replace this e.g. CORINE Land Cover
}
// NOTE: Other links will be added dynamically when integrating the collection into DEDL HDA (e.g. Parent, Self etc...)
],
"stac_extensions": [
"https://stac-extensions.github.io/scientific/v1.0.0/schema.json" // e.g. Optional extension e.g. to expose Digital Object Identifiers
],
"sci:publications": [
{
"sci:doi": "DOI_CODE", // Replace this e.g. 10.2909/c62bb056-5ac3-4512-b642-7f484175d951
"sci:citation": "DOI_DATASET_TITLE" // Replace this e.g. European Union's Copernicus Land Monitoring Service information (CORINE Land Cover Change 1990-2000 (raster 100 m), Europe, 6-yearly)
}
],
"title": "DATASET_TITLE", // Replace this (A short description of the collection) e.g. CORINE Land Cover
"extent": {
"spatial": {
"bbox": [ // Replace these values with your spatial bbox coordinates
[
-31.561261,
27.405827,
44.820775,
71.409109
]
]
},
"temporal": {
"interval": [ // Replace these values with your temporal extent (covering the temporal extent of the planned Items you are exposing). If there is no end date, the second date can be replaced with null (no quotation marks)
[
"2024-11-01T00:00:00Z",
"2024-11-30T23:59:59Z"
]
]
}
},
"license": "proprietary",
"keywords": [ // Replace these keywords - refer to https://hda.central.data.destination-earth.eu/ui/catalog to see existing. Use these where possible
"Satellite Image Interpretation",
"Land Cover Change",
"geospatial data",
"landscape alteration",
"Land cover",
"European",
"Copernicus"
],
"providers": [
{
"name": "PROVIDER_NAME", // Replace this e.g. European Environment Agency
"roles": [ // Replace these values as appropriate
"producer",
"processor",
"licensor"
],
"url": "PROVIDER_URL" // Replace this e.g. https://www.eea.europa.eu/
}
],
"assets": {
"thumbnail": {
"href": "URL_TO_DATASET_IMAGE", // Replace this e.g. https://land.copernicus.eu/en/products/corine-land-cover/@@images/image-400-7d8e8dfc63d50c9bf89ff5a7475dcd46.png
"type": "image/png", // Replace this with the correct Mime type
"title": "overview",
"roles": [
"thumbnail"
]
}
}
}
Reference : Item Example
And here is an example of a STAC Item (Feature) associated with that Collection:
{
"type": "Feature",
"stac_version": "1.0.0",
"stac_extensions": [],
"id": "EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959",
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
10.0,
35.0
],
[
10.0,
60.0
],
[
-10.0,
60.0
],
[
-10.0,
35.0
],
[
10.0,
35.0
]
]
]
},
"bbox": [
-10.0,
35.0,
10.0,
60.0
],
"properties": {
"start_datetime": "2024-11-15T00:00:00Z",
"end_datetime": "2024-11-15T23:59:59Z",
"datetime": "2024-11-15T00:00:00Z"
},
"links": [
{
"rel": "collection",
"href": "https://hda.data.destination-earth.eu/stac/collections/EO.XXX.YYY.ZZZ",
"type": "application/json",
"title": "DATASET_TITLE"
},
{
"rel": "self",
"href": "https://hda.data.destination-earth.eu/stac/collections/EO.XXX.YYY.ZZZ/items/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959",
"type": "application/json"
}
],
"assets": {
"metadata1.json": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/metadata1.json",
"type": "application/json",
"roles": [
"metadata"
]
},
"thumbnail.jpg": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/thumbnail.jpg",
"type": "image/jpeg",
"roles": [
"thumbnail"
]
},
"overview.jpg": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/overview.jpg",
"type": "image/jpeg",
"roles": [
"overview"
]
},
"20241115.png": {
"href": "data/2024/11/15/EO.XXX.YYY.ZZZ_20241115T000000_20241115T235959/20241115.png",
"type": "image/png",
"roles": [
"data"
]
}
},
"collection": "EO.XXX.YYY.ZZZ"
}
Step 6: Transferring the Collection to Data Lake - Islet Storage / Bucket
When the data and metadata have been prepared, you will need to transfer them to a ‘private’ dedicated Object Storage Container (a.k.a. Bucket) in the Central Site - Islet (Storage) Service.
Note
Proposals should be delivered to a bucket following the naming convention usergenerated-proposal-[collection_id] e.g. usergenerated-proposal-EO.XXX.YYY.ZZZ
- Creating a Private Bucket for your Collection Proposal
Through your Support Ticket you can ask for advice on getting access/credentials for a ‘private’ Bucket. The options are as follows:
Through the Islet Storage ONLY service (most cases)
As the name states, this gives you Object Storage only, and is a way of delivering your Collection for review.
Through the Islet Compute service (for users who have already made a successful application for Edge Services - Big Data Processing Services)
With Islet Compute, there is a built-in object storage service.
EC2 credentials are necessary to interact with a private bucket programmatically (see How to generate and manage EC2 credentials).
Now that you have an empty bucket and credentials to write to it:
When you execute generate_item_metadata.py in the usergenerated demonstration project, the variable IS_UPLOAD_S3 is set to False by default. Change this value to True in config.py.
See the README.MD of the usergenerated project with respect to setting EC2 credentials in a .env file at the root of the usergenerated project.
Now re-execute generate_item_metadata.py and your Collection will be uploaded to your bucket ‘usergenerated-proposal-EO.XXX.YYY.ZZZ’, ready for review.
Note
This method will upload all of your data to the target bucket in one process, which may not be appropriate for large volumes. An alternative approach is to push one item at a time during the data preparation stage. If you need any advice, don’t hesitate to contact the DestinE Help Desk through your ticket.
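A sketch of that per-item alternative, assuming the bucket layout mirrors the local folder structure from Step 4 (the helper below is ours, not part of the Demonstration Project; the boto3 call is commented out because it needs your EC2 credentials and endpoint):

```python
import tempfile
from pathlib import Path

def item_object_keys(collection_root: Path, item_dir: Path):
    """Yield (local_file, object_key) pairs for one item folder.
    Object keys mirror the local layout relative to the parent of the
    collection root, e.g. EO.XXX.YYY.ZZZ/data/2024/11/15/ITEM_ID/datafile1."""
    base = collection_root.parent
    for path in sorted(item_dir.rglob("*")):
        if path.is_file():
            yield path, path.relative_to(base).as_posix()

# Demonstration on a throwaway structure (names are placeholders):
root = Path(tempfile.mkdtemp()) / "EO.XXX.YYY.ZZZ"
item_dir = root / "data" / "2024" / "11" / "15" / "ITEM_1_ID"
item_dir.mkdir(parents=True)
(item_dir / "datafile1").write_text("x")
keys = [key for _, key in item_object_keys(root, item_dir)]
print(keys)  # ['EO.XXX.YYY.ZZZ/data/2024/11/15/ITEM_1_ID/datafile1']

# With EC2 credentials configured (see the .env handling above), each pair
# could then be pushed with boto3, one item at a time:
# s3 = boto3.client("s3", endpoint_url=..., aws_access_key_id=..., aws_secret_access_key=...)
# for local, key in item_object_keys(root, item_dir):
#     s3.upload_file(str(local), "usergenerated-proposal-EO.XXX.YYY.ZZZ", key)
```

Uploading item by item keeps each transfer small and lets you resume after a failure without re-sending the whole Collection.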
Step 7: Follow the ticket exchanges
Once the data is ready, the DEDL team will proceed to upload your dataset to the Data Lake. The team will keep you informed of the current status of the process and will come back to you if there is any issue regarding the data or its format.
Once the data is fully uploaded and available, you will be notified on the ticket, and you will be able to access the newly added dataset directly from the DEDL HDA API and the DEDL Web Portal.