Data Models
The Rockfish platform focuses on solving data bottlenecks for enterprises. Our platform supports two operational data models: time-series and tabular.
Time-series Data
Time-series data refers to data that is collected over time. In time-series data, each entity is called a session. Each session is a collection of events ordered by a timestamp column.
Columns in a session can be of two types: metadata and measurement. Within a session, a metadata column has the same value for every event; these columns usually describe the session as a whole. Measurement columns, on the other hand, have values that vary over time within a session. In some datasets, each session can also be identified by an explicit session key column.
Time-series Column Type | Description |
---|---|
Metadata | Values that describe the session, providing contextual information about the events. |
Measurement | Values that change over time within a session, usually quantitative in nature. |
Timestamp | Timestamps at which events were collected. |
Session Key | Unique identifier for each session (optional). |
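One practical way to tell metadata and measurement columns apart is to check whether a column's value stays constant within each session. The sketch below does this with pandas; it is only an illustration of the data model, and the function name and arguments are assumptions for this example, not part of any Rockfish API.

```python
import pandas as pd

def infer_column_roles(df: pd.DataFrame, session_key: str, timestamp: str) -> dict:
    """Classify each column as metadata or measurement.

    A column is treated as metadata if it has exactly one distinct
    value within every session, and as a measurement otherwise.
    """
    roles = {session_key: "session key", timestamp: "timestamp"}
    grouped = df.groupby(session_key)
    for col in df.columns:
        if col in roles:
            continue
        constant_within_sessions = (grouped[col].nunique(dropna=False) <= 1).all()
        roles[col] = "metadata" if constant_within_sessions else "measurement"
    return roles
```

Applied to the financial transactions example below, this would report `age` and `gender` as metadata and `category`, `amount`, and `fraud` as measurements.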
Some examples of time-series datasets include:
- Web analytics data: User interactions with a website, including page views, clicks, and time spent.
- IoT data: Sensor readings from smart devices, such as temperature or humidity levels.
- Financial data: Stock prices or trading volumes over time.
Let's understand the time-series data model using concrete examples.
Example: Financial Transactions Dataset
Suppose we have a dataset that collects information about customer financial transactions:
customer | age | gender | category | amount | fraud | timestamp |
---|---|---|---|---|---|---|
C2222 | 25 | F | food | 70.84 | 1 | 2023-08-01 09:10:07 |
C1111 | 40 | M | transportation | 35.13 | 0 | 2023-08-01 09:12:51 |
C2222 | 25 | F | transportation | 28.26 | 0 | 2023-08-01 09:27:30 |
C1111 | 40 | M | food | 64.99 | 0 | 2023-08-01 09:39:17 |
C1111 | 40 | M | health | 35.88 | 0 | 2023-08-01 09:52:30 |
Here, each transaction is an event. Each session is a set of transactions that a particular customer performs over time.
This dataset has two sessions, one for customer `C1111`:
customer | age | gender | category | amount | fraud | timestamp |
---|---|---|---|---|---|---|
C1111 | 40 | M | transportation | 35.13 | 0 | 2023-08-01 09:12:51 |
C1111 | 40 | M | food | 64.99 | 0 | 2023-08-01 09:39:17 |
C1111 | 40 | M | health | 35.88 | 0 | 2023-08-01 09:52:30 |
And the other for customer `C2222`:
customer | age | gender | category | amount | fraud | timestamp |
---|---|---|---|---|---|---|
C2222 | 25 | F | food | 70.84 | 1 | 2023-08-01 09:10:07 |
C2222 | 25 | F | transportation | 28.26 | 0 | 2023-08-01 09:27:30 |
The following table describes how each column in this time-series dataset is classified:
Time-series Column Type | Columns | Rationale |
---|---|---|
Metadata | `age`, `gender` | These columns describe the customer performing the transactions. |
Measurement | `category`, `amount`, `fraud` | These columns describe each transaction made by the customer, i.e., how much money was spent on what and whether the transaction was normal or fraudulent. |
Timestamp | `timestamp` | This column stores when each transaction was performed. |
Session Key | `customer` | This column stores each customer's unique identifier. |
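As a concrete illustration (not how the Rockfish platform itself ingests data), the following pandas sketch rebuilds the transactions table above and splits it into the two per-customer sessions:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer": ["C2222", "C1111", "C2222", "C1111", "C1111"],
    "age":      [25, 40, 25, 40, 40],
    "gender":   ["F", "M", "F", "M", "M"],
    "category": ["food", "transportation", "transportation", "food", "health"],
    "amount":   [70.84, 35.13, 28.26, 64.99, 35.88],
    "fraud":    [1, 0, 0, 0, 0],
    "timestamp": pd.to_datetime([
        "2023-08-01 09:10:07", "2023-08-01 09:12:51", "2023-08-01 09:27:30",
        "2023-08-01 09:39:17", "2023-08-01 09:52:30",
    ]),
})

# The session key is `customer`: each session holds one customer's
# transactions (events), ordered by `timestamp`.
for customer, session in transactions.sort_values("timestamp").groupby("customer"):
    print(customer, "->", len(session), "events")
# C1111 -> 3 events
# C2222 -> 2 events
```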
Tabular Data
Tabular data is organized into rows and columns. In tabular data, each entity is called a record. Each row represents an individual record, and each column represents a feature or attribute of the data. All columns in a tabular dataset are metadata columns.
Some examples of tabular datasets include:
- Customer data: Information about customers, such as name, age, address, and purchase history.
- Inventory data: Details about products, including product ID, name, category, quantity in stock, and price.
- Sales data: Records of sales transactions, including transaction ID, date, customer ID, and items sold.
Let's understand the tabular data model using concrete examples.
Example: Fall Detection Dataset
Suppose we have a dataset that collects information about patients who went through medical incidents (falls):
Age range of patient | Sex | Reason for incident | Body Temperature | Heart Rate | Respiratory Rate | SBP | DBP | Hypertension |
---|---|---|---|---|---|---|---|---|
60<70 | M | Slip | 97 | 80 | 15 | 140 | 90 | Yes |
30<40 | F | Loss of balance | 96 | 78 | 14 | 145 | 95 | Yes |
60<70 | M | Mental confusion | 98 | 81 | 13 | 143 | 93 | No |
Here, each patient incident is a record and all columns are metadata columns. This dataset has 3 records.
For example, the first record in this dataset is:
Age range of patient | Sex | Reason for incident | Body Temperature | Heart Rate | Respiratory Rate | SBP | DBP | Hypertension |
---|---|---|---|---|---|---|---|---|
60<70 | M | Slip | 97 | 80 | 15 | 140 | 90 | Yes |
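There is no grouping or ordering step in the tabular model; each row is simply an independent record. A minimal pandas sketch of the same table, for illustration only:

```python
import pandas as pd

falls = pd.DataFrame({
    "Age range of patient": ["60<70", "30<40", "60<70"],
    "Sex":                  ["M", "F", "M"],
    "Reason for incident":  ["Slip", "Loss of balance", "Mental confusion"],
    "Body Temperature":     [97, 96, 98],
    "Heart Rate":           [80, 78, 81],
    "Respiratory Rate":     [15, 14, 13],
    "SBP":                  [140, 145, 143],
    "DBP":                  [90, 95, 93],
    "Hypertension":         ["Yes", "Yes", "No"],
})

# Every row is an independent record; there is no session structure.
print(len(falls))     # 3 records
print(falls.iloc[0])  # the first record shown above
```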
Sample Tabular Datasets
Download the fall detection dataset
Choosing Between Time-series and Tabular Data Models
There can be multiple ways of interpreting the same dataset, based on how you plan to use the data in downstream tasks.
In the examples below, we model a Netflow dataset as both time-series and tabular. You can choose an appropriate data model based on what you will use the synthetic Netflow data for:
- If your downstream task is an ML model that looks at each flow and predicts its `type`, using the tabular data model might be preferable.
- On the other hand, if your downstream task is analyzing patterns over time, using the time-series data model might be a better choice.
Note: If a dataset contains only one session (i.e., a single unique combination of metadata values), it will be treated as tabular data rather than time-series data, as there are no distinct sessions to learn from.
Example: Modelling the Netflow Dataset as Time-series
Suppose we have a dataset that collects information about network flow data on IoT sensors:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |
Here, each flow is an event. Each session is a set of flows that went through a particular connection over time.
This dataset has two sessions, one for the `UDP` connection between `192.168.1.79:45927` and `239.255.255.250:15600`:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
And the other for the `TCP` connection between `192.168.1.32:55822` and `18.194.169.124:80`:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |
The following table describes how each column in this time-series dataset is classified:
Time-series Column Type | Columns | Rationale |
---|---|---|
Metadata | `srcip`, `dstip`, `srcport`, `dstport`, `proto` | These columns describe the connection through which packets of data are being transferred. |
Measurement | `pkt`, `byt`, `type`, `td` | These columns describe each flow, i.e., how many packets and bytes were transferred, whether the flow was normal or part of an attack, and how long the flow lasted. |
Timestamp | `ts` | This column stores when each flow was recorded. |
Session Key | None | There are no unique identifiers for a connection in this dataset, so we do not have a session key column. |
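Since there is no session key column, the sessions have to be derived from the metadata columns: flows that share the same connection five-tuple belong to the same session. The pandas sketch below shows that grouping for the rows above; it is an illustration of the data model only, not a Rockfish API.

```python
import pandas as pd

flows = pd.DataFrame({
    "srcip":   ["192.168.1.79", "192.168.1.79", "192.168.1.79",
                "192.168.1.32", "192.168.1.79", "192.168.1.32"],
    "dstip":   ["239.255.255.250", "239.255.255.250", "239.255.255.250",
                "18.194.169.124", "239.255.255.250", "18.194.169.124"],
    "srcport": [45927, 45927, 45927, 55822, 45927, 55822],
    "dstport": [15600, 15600, 15600, 80, 15600, 80],
    "proto":   ["UDP", "UDP", "UDP", "TCP", "UDP", "TCP"],
    "ts": pd.to_datetime([
        "2020-01-19 11:18:50", "2020-01-19 12:10:15", "2020-01-19 13:15:30",
        "2020-01-19 13:23:00", "2020-01-19 13:29:00", "2020-01-19 15:54:00",
    ]),
    "td":   [0.1, 0.2, 0.1, 12.709624, 0.3, 0.1],
    "pkt":  [1, 1, 1, 8, 1, 1],
    "byt":  [63, 63, 63, 11487, 63, 60],
    "type": ["normal", "normal", "normal", "xss", "normal", "xss"],
})

# With no explicit session key, the connection five-tuple defines a session.
connection = ["srcip", "dstip", "srcport", "dstport", "proto"]
for key, session in flows.sort_values("ts").groupby(connection):
    print(key, "->", len(session), "flows")
# One UDP session with 4 flows, one TCP session with 2 flows.
```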
Example: Modelling the Netflow Dataset as Tabular
Like before, we have the same network flow data on IoT sensors:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |
This time, however, each flow is a record and all columns are metadata columns. This dataset has 6 records.
For example, the first record in this dataset is:
srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
---|---|---|---|---|---|---|---|---|---|
192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
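Under the tabular model, the same rows are used directly as independent records, for example as inputs to the per-flow `type` prediction task mentioned earlier. The sketch below assumes the `flows` DataFrame built in the time-series example above; the feature/label split is an illustrative assumption, not a Rockfish recommendation.

```python
# `flows` is the DataFrame constructed in the time-series sketch above.
# Each flow is one record; no grouping or ordering is required.
features = flows.drop(columns=["type"])  # per-flow attributes
labels = flows["type"]                   # per-flow label to predict

print(len(features), "records")  # 6 records, one per row
print(features.iloc[0])          # the first record shown above
```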