Data Models
The Rockfish platform focuses on solving data bottlenecks for enterprises. Our platform supports two operational data models: time-series and tabular.
Time-series Data
Time-series data refers to data that is collected over time. In time-series data, each entity is called a session. Each session is a collection of events ordered by a timestamp column.
Columns in a session can be of two types: metadata and measurement. Within a session, a metadata column has the same value for every event; these columns usually describe the session as a whole. Measurement columns, on the other hand, have values that vary over time within a session. In some datasets, each session can also be identified by an explicit session key column.
Time-series Column Type | Description |
---|---|
Metadata | Values that describe the session, providing contextual information about the events. |
Measurement | Values that change over time within a session, usually quantitative in nature. |
Timestamp | Timestamps at which events were collected. |
Session Key | Unique identifier for each session (optional). |
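One practical way to tell metadata and measurement columns apart is to check whether a column's value stays constant within each session. The sketch below does this with pandas; it is only an illustration of the data model, and the function name and arguments are assumptions for this example, not part of any Rockfish API.

```python
import pandas as pd

def infer_column_roles(df: pd.DataFrame, session_key: str, timestamp: str) -> dict:
    """Classify each column as metadata or measurement.

    A column is treated as metadata if it has exactly one distinct
    value within every session, and as a measurement otherwise.
    """
    roles = {session_key: "session key", timestamp: "timestamp"}
    grouped = df.groupby(session_key)
    for col in df.columns:
        if col in roles:
            continue
        constant_within_sessions = (grouped[col].nunique(dropna=False) <= 1).all()
        roles[col] = "metadata" if constant_within_sessions else "measurement"
    return roles
```

Applied to the financial transactions example below, this would report `age` and `gender` as metadata and `category`, `amount`, and `fraud` as measurements.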
Some examples of time-series datasets include:
- Web analytics data: User interactions with a website, including page views, clicks, and time spent.
- IoT data: Sensor readings from smart devices, such as temperature or humidity levels.
- Financial data: Stock prices or trading volumes over time.
Let's understand the time-series data model using concrete examples.
Example: Financial Transactions Dataset
Suppose we have a dataset that collects information about customer financial transactions:
customer | age | gender | category | amount | fraud | timestamp |
---|---|---|---|---|---|---|
C2222 | 25 | F | food | 70.84 | 1 | 2023-08-01 09:10:07 |
C1111 | 40 | M | transportation | 35.13 | 0 | 2023-08-01 09:12:51 |
C2222 | 25 | F | transportation | 28.26 | 0 | 2023-08-01 09:27:30 |
C1111 | 40 | M | food | 64.99 | 0 | 2023-08-01 09:39:17 |
C1111 | 40 | M | health | 35.88 | 0 | 2023-08-01 09:52:30 |
Here, each transaction is an event. Each session is a set of transactions that a particular customer performs over time.
This dataset has two sessions, one for customer `C1111`:
customer | age | gender | category | amount | fraud | timestamp |
---|---|---|---|---|---|---|
C1111 | 40 | M | transportation | 35.13 | 0 | 2023-08-01 09:12:51 |
C1111 | 40 | M | food | 64.99 | 0 | 2023-08-01 09:39:17 |
C1111 | 40 | M | health | 35.88 | 0 | 2023-08-01 09:52:30 |
And the other for customer `C2222`:
customer | age | gender | category | amount | fraud | timestamp |
---|---|---|---|---|---|---|
C2222 | 25 | F | food | 70.84 | 1 | 2023-08-01 09:10:07 |
C2222 | 25 | F | transportation | 28.26 | 0 | 2023-08-01 09:27:30 |
The following table describes how each column in this time-series dataset is classified:
Time-series Column Type | Columns | Rationale |
---|---|---|
Metadata | `age`, `gender` | These columns describe the customer performing the transactions. |
Measurement | `category`, `amount`, `fraud` | These columns describe each transaction made by the customer, i.e., how much money was spent on what and whether the transaction was normal or fraudulent. |
Timestamp | `timestamp` | This column stores when each transaction was performed. |
Session Key | `customer` | This column stores each customer's unique identifier. |
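As a concrete illustration (not how the Rockfish platform itself ingests data), the following pandas sketch rebuilds the transactions table above and splits it into the two per-customer sessions:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer": ["C2222", "C1111", "C2222", "C1111", "C1111"],
    "age":      [25, 40, 25, 40, 40],
    "gender":   ["F", "M", "F", "M", "M"],
    "category": ["food", "transportation", "transportation", "food", "health"],
    "amount":   [70.84, 35.13, 28.26, 64.99, 35.88],
    "fraud":    [1, 0, 0, 0, 0],
    "timestamp": pd.to_datetime([
        "2023-08-01 09:10:07", "2023-08-01 09:12:51", "2023-08-01 09:27:30",
        "2023-08-01 09:39:17", "2023-08-01 09:52:30",
    ]),
})

# The session key is `customer`: each session holds one customer's
# transactions (events), ordered by `timestamp`.
for customer, session in transactions.sort_values("timestamp").groupby("customer"):
    print(customer, "->", len(session), "events")
# C1111 -> 3 events
# C2222 -> 2 events
```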
Tabular Data
Tabular data is organized into rows and columns. In tabular data, each entity is called a record. Each row represents an individual record, and each column represents a feature or attribute of the data. All columns in a tabular dataset are metadata columns.
Some examples of tabular datasets include:
- Customer data: Information about customers, such as name, age, address, and purchase history.
- Inventory data: Details about products, including product ID, name, category, quantity in stock, and price.
- Sales data: Records of sales transactions, including transaction ID, date, customer ID, and items sold.
Let's understand the tabular data model using concrete examples.
Example: Fall Detection Dataset
Suppose we have a dataset that collects information about patients who went through medical incidents (falls):
Age range of patient | Sex | Reason for incident | Body Temperature | Heart Rate | Respiratory Rate | SBP | DBP | Hypertension |
---|---|---|---|---|---|---|---|---|
60<70 | M | Slip | 97 | 80 | 15 | 140 | 90 | Yes |
30<40 | F | Loss of balance | 96 | 78 | 14 | 145 | 95 | Yes |
60<70 | M | Mental confusion | 98 | 81 | 13 | 143 | 93 | No |
Here, each patient incident is a record and all columns are metadata columns. This dataset has 3 records.
For example, the first record in this dataset is:
Age range of patient | Sex | Reason for incident | Body Temperature | Heart Rate | Respiratory Rate | SBP | DBP | Hypertension |
---|---|---|---|---|---|---|---|---|
60<70 | M | Slip | 97 | 80 | 15 | 140 | 90 | Yes |
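There is no grouping or ordering step in the tabular model; each row is simply an independent record. A minimal pandas sketch of the same table, for illustration only:

```python
import pandas as pd

falls = pd.DataFrame({
    "Age range of patient": ["60<70", "30<40", "60<70"],
    "Sex":                  ["M", "F", "M"],
    "Reason for incident":  ["Slip", "Loss of balance", "Mental confusion"],
    "Body Temperature":     [97, 96, 98],
    "Heart Rate":           [80, 78, 81],
    "Respiratory Rate":     [15, 14, 13],
    "SBP":                  [140, 145, 143],
    "DBP":                  [90, 95, 93],
    "Hypertension":         ["Yes", "Yes", "No"],
})

# Every row is an independent record; there is no session structure.
print(len(falls))     # 3 records
print(falls.iloc[0])  # the first record shown above
```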
Sample Tabular Datasets
Download the fall detection dataset
Choosing Between Time-series and Tabular Data Models
There can be multiple ways of interpreting the same dataset, based on how you plan to use the data in downstream tasks.
In the examples below, we model a Netflow dataset as both time-series and tabular. You can choose an appropriate data model based on what you will use the synthetic Netflow data for:
- If your downstream task is an ML model that looks at each flow and predicts its `type`, using the tabular data model might be preferable.
- On the other hand, if your downstream task is analyzing patterns over time, using the time-series data model might be a better choice.
Note: If a dataset contains only one session (i.e., a single unique combination of metadata values), it will be treated as tabular data rather than time-series data, as there are no distinct sessions to learn from.
Example: Modelling the Netflow Dataset as Time-series
Suppose we have a dataset that collects information about network flow data on IoT sensors:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |
Here, each flow is an event. Each session is a set of flows that went through a particular connection over time.
This dataset has two sessions, one for the `UDP` connection between `192.168.1.79:45927` and `239.255.255.250:15600`:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
And the other for the `TCP` connection between `192.168.1.32:55822` and `18.194.169.124:80`:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |
The following table describes how each column in this time-series dataset is classified:
Time-series Column Type | Columns | Rationale |
---|---|---|
Metadata | `srcip`, `dstip`, `srcport`, `dstport`, `proto` | These columns describe the connection through which packets of data are being transferred. |
Measurement | `pkt`, `byt`, `type`, `td` | These columns describe each flow, i.e., how many packets and bytes were transferred, whether the flow was normal or part of an attack, and how long the flow lasted. |
Timestamp | `ts` | This column stores when each flow was recorded. |
Session Key | None | There are no unique identifiers for a connection in this dataset, so we do not have a session key column. |
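Since there is no session key column, the sessions have to be derived from the metadata columns: flows that share the same connection five-tuple belong to the same session. The pandas sketch below shows that grouping for the rows above; it is an illustration of the data model only, not a Rockfish API.

```python
import pandas as pd

flows = pd.DataFrame({
    "srcip":   ["192.168.1.79", "192.168.1.79", "192.168.1.79",
                "192.168.1.32", "192.168.1.79", "192.168.1.32"],
    "dstip":   ["239.255.255.250", "239.255.255.250", "239.255.255.250",
                "18.194.169.124", "239.255.255.250", "18.194.169.124"],
    "srcport": [45927, 45927, 45927, 55822, 45927, 55822],
    "dstport": [15600, 15600, 15600, 80, 15600, 80],
    "proto":   ["UDP", "UDP", "UDP", "TCP", "UDP", "TCP"],
    "ts": pd.to_datetime([
        "2020-01-19 11:18:50", "2020-01-19 12:10:15", "2020-01-19 13:15:30",
        "2020-01-19 13:23:00", "2020-01-19 13:29:00", "2020-01-19 15:54:00",
    ]),
    "td":   [0.1, 0.2, 0.1, 12.709624, 0.3, 0.1],
    "pkt":  [1, 1, 1, 8, 1, 1],
    "byt":  [63, 63, 63, 11487, 63, 60],
    "type": ["normal", "normal", "normal", "xss", "normal", "xss"],
})

# With no explicit session key, the connection five-tuple defines a session.
connection = ["srcip", "dstip", "srcport", "dstport", "proto"]
for key, session in flows.sort_values("ts").groupby(connection):
    print(key, "->", len(session), "flows")
# One UDP session with 4 flows, one TCP session with 2 flows.
```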
Example: Modelling the Netflow Dataset as Tabular
Like before, we have the same network flow data on IoT sensors:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |
This time, however, each flow is a record and all columns are metadata columns. This dataset has 6 records.
For example, the first record in this dataset is:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
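Under the tabular model, the same rows are used directly as independent records, for example as inputs to the per-flow `type` prediction task mentioned earlier. The sketch below assumes the `flows` DataFrame built in the time-series example above; the feature/label split is an illustrative assumption, not a Rockfish recommendation.

```python
# `flows` is the DataFrame constructed in the time-series sketch above.
# Each flow is one record; no grouping or ordering is required.
features = flows.drop(columns=["type"])  # per-flow attributes
labels = flows["type"]                   # per-flow label to predict

print(len(features), "records")  # 6 records, one per row
print(features.iloc[0])          # the first record shown above
```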