Data Models

The Rockfish platform focuses on solving data bottlenecks for enterprises. Our platform supports two types of operational data models: time-series and tabular.

Time-series Data

Time-series data refers to data that is collected over time. In time-series data, each entity is called a session. Each session is a collection of events ordered by a timestamp column.

Columns in a session can be of two types: metadata and measurement. Metadata columns have the same value for every event in a session; they usually describe the session as a whole. Measurement columns, on the other hand, have values that vary over time within a session. In some datasets, each session can also be identified by an explicit session key column.

| Time-series Column Type | Description |
| --- | --- |
| Metadata | Values that describe the session, providing contextual information about the events. |
| Measurement | Values that change over time within a session, usually quantitative in nature. |
| Timestamp | Timestamps at which events were collected. |
| Session Key | Unique identifiers for each session (optional). |

Some examples of time-series datasets include:

  • Web analytics data: User interactions with a website, including page views, clicks, and time spent.
  • IoT data: Sensor readings from smart devices, such as temperature or humidity levels.
  • Financial data: Stock prices or trading volumes over time.

Let's understand the time-series data model using concrete examples.

Example: Financial Transactions Dataset

Suppose we have a dataset that collects information about customer financial transactions:

| customer | age | gender | category | amount | fraud | timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| C2222 | 25 | F | food | 70.84 | 1 | 2023-08-01 09:10:07 |
| C1111 | 40 | M | transportation | 35.13 | 0 | 2023-08-01 09:12:51 |
| C2222 | 25 | F | transportation | 28.26 | 0 | 2023-08-01 09:27:30 |
| C1111 | 40 | M | food | 64.99 | 0 | 2023-08-01 09:39:17 |
| C1111 | 40 | M | health | 35.88 | 0 | 2023-08-01 09:52:30 |

Here, each transaction is an event. Each session is a set of transactions that a particular customer performs over time.

This dataset has two sessions, one for customer C1111:

| customer | age | gender | category | amount | fraud | timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| C1111 | 40 | M | transportation | 35.13 | 0 | 2023-08-01 09:12:51 |
| C1111 | 40 | M | food | 64.99 | 0 | 2023-08-01 09:39:17 |
| C1111 | 40 | M | health | 35.88 | 0 | 2023-08-01 09:52:30 |

And the other for customer C2222:

| customer | age | gender | category | amount | fraud | timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| C2222 | 25 | F | food | 70.84 | 1 | 2023-08-01 09:10:07 |
| C2222 | 25 | F | transportation | 28.26 | 0 | 2023-08-01 09:27:30 |

The following table describes how each column in this time-series dataset is classified:

| Time-series Column Type | Columns | Rationale |
| --- | --- | --- |
| Metadata | customer, age, gender | These columns describe the customer performing the transactions. |
| Measurement | category, amount, fraud | These columns describe each transaction: how much money was spent, on what, and whether the transaction was normal or fraudulent. |
| Timestamp | timestamp | This column stores when each transaction was performed. |
| Session Key | customer | This column stores each customer's unique identifier. |
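As a minimal sketch, the session split above can be reproduced in pandas (an assumption for illustration; the Rockfish platform does not require pandas). Grouping by the session key and sorting by the timestamp column yields the two per-customer sessions shown above:

```python
import pandas as pd

# Hypothetical reconstruction of the financial transactions dataset above.
df = pd.DataFrame({
    "customer": ["C2222", "C1111", "C2222", "C1111", "C1111"],
    "age": [25, 40, 25, 40, 40],
    "gender": ["F", "M", "F", "M", "M"],
    "category": ["food", "transportation", "transportation", "food", "health"],
    "amount": [70.84, 35.13, 28.26, 64.99, 35.88],
    "fraud": [1, 0, 0, 0, 0],
    "timestamp": pd.to_datetime([
        "2023-08-01 09:10:07", "2023-08-01 09:12:51",
        "2023-08-01 09:27:30", "2023-08-01 09:39:17",
        "2023-08-01 09:52:30",
    ]),
})

# Split the dataset into sessions using the session key, ordering the
# events within each session by the timestamp column.
sessions = {
    key: group.sort_values("timestamp")
    for key, group in df.groupby("customer")
}

print(len(sessions))           # 2 sessions: C1111 and C2222
print(len(sessions["C1111"]))  # 3 events
print(len(sessions["C2222"]))  # 2 events
```

Note how the metadata columns (age, gender) are constant within each session, while the measurement columns vary from event to event.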

Sample Time-series Datasets

Download the finance dataset

Download the pcap dataset

Tabular Data

Tabular data is organized into rows and columns. In tabular data, each entity is called a record. Each row represents an individual record, and each column represents a feature or attribute of the data. All columns in a tabular dataset are metadata columns.

Some examples of tabular datasets include:

  • Customer data: Information about customers, such as name, age, address, and purchase history.
  • Inventory data: Details about products, including product ID, name, category, quantity in stock, and price.
  • Sales data: Records of sales transactions, including transaction ID, date, customer ID, and items sold.

Let's understand the tabular data model using concrete examples.

Example: Fall Detection Dataset

Suppose we have a dataset that collects information about patients who went through medical incidents (falls):

| Age range of patient | Sex | Reason for incident | Body Temperature | Heart Rate | Respiratory Rate | SBP | DBP | Hypertension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 60<70 | M | Slip | 97 | 80 | 15 | 140 | 90 | Yes |
| 30<40 | F | Loss of balance | 96 | 78 | 14 | 145 | 95 | Yes |
| 60<70 | M | Mental confusion | 98 | 81 | 13 | 143 | 93 | No |

Here, each patient incident is a record and all columns are metadata columns. This dataset has 3 records.

For example, the first record in this dataset is:

| Age range of patient | Sex | Reason for incident | Body Temperature | Heart Rate | Respiratory Rate | SBP | DBP | Hypertension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 60<70 | M | Slip | 97 | 80 | 15 | 140 | 90 | Yes |
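As a sketch, the tabular model can be illustrated in pandas (an assumption for illustration; the column names below are shortened from the table above). Each row is an independent record, with no ordering or grouping between rows:

```python
import pandas as pd

# Hypothetical reconstruction of the fall detection dataset above,
# with shortened column names.
df = pd.DataFrame([
    {"age_range": "60<70", "sex": "M", "reason": "Slip",
     "body_temp": 97, "heart_rate": 80, "resp_rate": 15,
     "sbp": 140, "dbp": 90, "hypertension": "Yes"},
    {"age_range": "30<40", "sex": "F", "reason": "Loss of balance",
     "body_temp": 96, "heart_rate": 78, "resp_rate": 14,
     "sbp": 145, "dbp": 95, "hypertension": "Yes"},
    {"age_range": "60<70", "sex": "M", "reason": "Mental confusion",
     "body_temp": 98, "heart_rate": 81, "resp_rate": 13,
     "sbp": 143, "dbp": 93, "hypertension": "No"},
])

# Every row is an independent record; there is no timestamp or session key.
first_record = df.iloc[0]
print(len(df))                 # 3 records
print(first_record["reason"])  # Slip
```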

Sample Tabular Datasets

Download the fall detection dataset

Download the spotify dataset

Choosing Between Time-series and Tabular Data Models

The same dataset can often be interpreted in multiple ways, depending on how you plan to use the data in downstream tasks.

In the examples below, we model a Netflow dataset as both time-series and tabular. You can choose an appropriate data model based on what you will use the synthetic Netflow data for:

  1. If your downstream task is an ML model that looks at each flow and predicts its type, using the tabular data model might be preferable.
  2. On the other hand, if your downstream task is analyzing patterns over time, using the time-series data model might be a better choice.

Note: If a dataset contains only one session (a single group of metadata fields), it will be treated as tabular data rather than time-series data, as there are no distinct sessions to learn from.
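The note above can be checked with a small sketch (a hypothetical helper, assuming pandas): count the distinct groups formed by the metadata columns, since each distinct group corresponds to one session.

```python
import pandas as pd

# Hypothetical helper: count distinct sessions by grouping on the
# metadata columns, mirroring the note above.
def count_sessions(df, metadata_columns):
    return df.groupby(metadata_columns).ngroups

flows = pd.DataFrame({
    "srcip": ["192.168.1.79", "192.168.1.79", "192.168.1.32"],
    "proto": ["UDP", "UDP", "TCP"],
    "byt": [63, 63, 11487],
})

# Two distinct (srcip, proto) groups -> two sessions, so the
# time-series model applies. With exactly one group, the dataset
# would be treated as tabular instead.
print(count_sessions(flows, ["srcip", "proto"]))  # 2
```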

Example: Modelling the Netflow Dataset as Time-series

Suppose we have a dataset that collects information about network flow data on IoT sensors:

| srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
| 192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
| 192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |

Here, each flow is an event. Each session is a set of flows that went through a particular connection over time.

This dataset has two sessions, one for the UDP connection between 192.168.1.79:45927 and 239.255.255.250:15600:

| srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |

And the other for the TCP connection between 192.168.1.32:55822 and 18.194.169.124:80:

| srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
| 192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |

The following table describes how each column in this time-series dataset is classified:

| Time-series Column Type | Columns | Rationale |
| --- | --- | --- |
| Metadata | srcip, dstip, srcport, dstport, proto | These columns describe the connection through which packets of data are being transferred. |
| Measurement | pkt, byt, type, td | These columns describe each flow: how many packets and bytes were transferred, whether the flow was normal or part of an attack, and how long the flow lasted. |
| Timestamp | ts | This column stores when each flow was recorded. |
| Session Key | None | There are no unique identifiers for a connection in this dataset, so we do not have a session key column. |
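Since this dataset has no explicit session key, the sessions are identified by the metadata columns themselves: the five-tuple describing each connection. A minimal sketch (assuming pandas, which the platform does not require) shows grouping the flows on those columns to recover the two sessions above:

```python
import pandas as pd

# Hypothetical reconstruction of the Netflow dataset above.
flows = pd.DataFrame({
    "srcip": ["192.168.1.79", "192.168.1.79", "192.168.1.79",
              "192.168.1.32", "192.168.1.79", "192.168.1.32"],
    "dstip": ["239.255.255.250", "239.255.255.250", "239.255.255.250",
              "18.194.169.124", "239.255.255.250", "18.194.169.124"],
    "srcport": [45927, 45927, 45927, 55822, 45927, 55822],
    "dstport": [15600, 15600, 15600, 80, 15600, 80],
    "proto": ["UDP", "UDP", "UDP", "TCP", "UDP", "TCP"],
    "ts": pd.to_datetime([
        "2020-01-19 11:18:50", "2020-01-19 12:10:15",
        "2020-01-19 13:15:30", "2020-01-19 13:23:00",
        "2020-01-19 13:29:00", "2020-01-19 15:54:00",
    ]),
    "td": [0.1, 0.2, 0.1, 12.709624, 0.3, 0.1],
    "pkt": [1, 1, 1, 8, 1, 1],
    "byt": [63, 63, 63, 11487, 63, 60],
    "type": ["normal", "normal", "normal", "xss", "normal", "xss"],
})

# With no session key, group on the metadata columns (the 5-tuple)
# and order each session's events by the timestamp column.
metadata = ["srcip", "dstip", "srcport", "dstport", "proto"]
sessions = {key: g.sort_values("ts") for key, g in flows.groupby(metadata)}

print(len(sessions))  # 2 sessions: one UDP connection, one TCP connection
```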

Example: Modelling the Netflow Dataset as Tabular

As before, we have the same network flow data on IoT sensors:

| srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 12:10:15 | 0.2 | 1 | 63 | normal |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:15:30 | 0.1 | 1 | 63 | normal |
| 192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 13:23:00 | 12.709624 | 8 | 11487 | xss |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 13:29:00 | 0.3 | 1 | 63 | normal |
| 192.168.1.32 | 18.194.169.124 | 55822 | 80 | TCP | 2020-01-19 15:54:00 | 0.1 | 1 | 60 | xss |

This time, however, each flow is a record and all columns are metadata columns. This dataset has 6 records.

For example, the first record in this dataset is:

| srcip | dstip | srcport | dstport | proto | ts | td | pkt | byt | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.1.79 | 239.255.255.250 | 45927 | 15600 | UDP | 2020-01-19 11:18:50 | 0.1 | 1 | 63 | normal |
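Under the tabular interpretation, the same flows become independent records with no grouping or ordering, which suits the per-flow classification task mentioned earlier. A brief sketch (again assuming pandas, with only a few columns shown for compactness):

```python
import pandas as pd

# Hypothetical sketch: the same six flows treated as tabular data,
# where every row is an independent record (columns abbreviated).
flows = pd.DataFrame({
    "proto": ["UDP", "UDP", "UDP", "TCP", "UDP", "TCP"],
    "byt": [63, 63, 63, 11487, 63, 60],
    "type": ["normal", "normal", "normal", "xss", "normal", "xss"],
})

# No session split: each row stands alone, e.g. as one input to a
# per-flow classifier that predicts the "type" column.
print(len(flows))             # 6 records
print(flows.iloc[0]["type"])  # normal
```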