Using ChatGPT To Generate Datasets 🤖
Cris Mihalache
Posted on February 21, 2024
Recently I used ChatGPT to generate mock data in JSON format.
This is useful because you can describe what the real data looks like in your prompt, and ChatGPT will fill fields such as descriptions, titles, and tags with realistic content.
This is a sample of the data ChatGPT generated:
{
"id": 3,
"description": "Implementing OAuth2 for user authentication",
"tags": [
"@feature",
"@security"
],
"start": "2024-02-22T03:11:03",
"end": "2024-02-22T06:36:18"
}
Notice how the description, the tags, and the duration (about 3 hours 25 minutes) are all plausible for a real task.
Example
To build the documentation website for track-time-cli, I needed a database with mock data so that the command output shown in the documentation would look realistic.
To get ChatGPT to generate the entries for me, I used the following prompt:
Generate 5 time sheet entries with realistic software development tasks as descriptions. Use the following schema:
{
"id": 1,
"description": "writing documentation",
"tags": [
"@feature",
"@docs"
],
"start": "2024-02-21T18:26:45",
"end": "2024-02-21T19:33:08"
}
The entries should be sequential in time, of varying durations (between 5 minutes and 5 hours), and have realistic descriptions that fit the software development time sheet theme.
The output should be in JSON, an array of time sheet entries conforming to the provided schema example.
Note that I specified the number of entries (5) and the type of activity (realistic software development tasks) that should be mentioned in the description.
I defined the output schema by providing an example of a correctly formatted time sheet entry. I also specified that the entries should be sequential in time and that their durations should vary within a range that fits this type of activity (between 5 minutes and 5 hours).
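If you want to rerun the prompt with a different entry count or theme, the text can be assembled from the example entry in code. The following is a minimal Python sketch and not part of the workflow described here; `build_prompt` and its parameters are hypothetical names, and the wording simply mirrors the prompt above.

```python
import json

# The example entry from the prompt above; it doubles as the output schema.
EXAMPLE_ENTRY = {
    "id": 1,
    "description": "writing documentation",
    "tags": ["@feature", "@docs"],
    "start": "2024-02-21T18:26:45",
    "end": "2024-02-21T19:33:08",
}


def build_prompt(count: int, theme: str, example: dict) -> str:
    """Assemble the prompt text (hypothetical helper, not part of any library)."""
    schema = json.dumps(example, indent=2)
    return (
        f"Generate {count} time sheet entries with realistic {theme} tasks "
        f"as descriptions. Use the following schema:\n{schema}\n"
        "The entries should be sequential in time, of varying durations "
        "(between 5 minutes and 5 hours), and have realistic descriptions "
        f"that fit the {theme} time sheet theme.\n"
        "The output should be in JSON, an array of time sheet entries "
        "conforming to the provided schema example."
    )


if __name__ == "__main__":
    # Print the prompt so it can be pasted into a chat session.
    print(build_prompt(5, "software development", EXAMPLE_ENTRY))
```

Keeping the example entry as data means the schema only has to be updated in one place when it changes.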
Result
This was the output:
[
{
"id": 1,
"description": "Refactoring user interface code",
"tags": [
"@feature",
"@frontend"
],
"start": "2024-02-21T18:26:45",
"end": "2024-02-21T22:39:32"
},
{
"id": 2,
"description": "Optimizing database queries for report generation",
"tags": [
"@performance",
"@database"
],
"start": "2024-02-21T22:39:32",
"end": "2024-02-22T03:11:03"
},
{
"id": 3,
"description": "Implementing OAuth2 for user authentication",
"tags": [
"@feature",
"@security"
],
"start": "2024-02-22T03:11:03",
"end": "2024-02-22T06:36:18"
},
{
"id": 4,
"description": "Weekly team sync-up meeting",
"tags": [
"@meeting",
"@communication"
],
"start": "2024-02-22T06:36:18",
"end": "2024-02-22T10:16:24"
},
{
"id": 5,
"description": "Adding localization support for new languages",
"tags": [
"@feature",
"@i18n"
],
"start": "2024-02-22T10:16:24",
"end": "2024-02-22T11:50:10"
}
]
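Because the prompt states explicit constraints, the response can be checked mechanically before it goes into a database. The sketch below is one way to do that in Python; `validate_entries` is a hypothetical helper, the duration limits are the ones from the prompt, and the two sample entries are copied from the output above.

```python
from datetime import datetime, timedelta

# Limits taken from the prompt: between 5 minutes and 5 hours.
MIN_DURATION = timedelta(minutes=5)
MAX_DURATION = timedelta(hours=5)

# First two entries of the generated output, used as sample input.
SAMPLE = [
    {
        "id": 1,
        "description": "Refactoring user interface code",
        "tags": ["@feature", "@frontend"],
        "start": "2024-02-21T18:26:45",
        "end": "2024-02-21T22:39:32",
    },
    {
        "id": 2,
        "description": "Optimizing database queries for report generation",
        "tags": ["@performance", "@database"],
        "start": "2024-02-21T22:39:32",
        "end": "2024-02-22T03:11:03",
    },
]


def validate_entries(entries):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    previous_end = None
    for entry in entries:
        start = datetime.fromisoformat(entry["start"])
        end = datetime.fromisoformat(entry["end"])
        duration = end - start
        # Check the duration against the range requested in the prompt.
        if not MIN_DURATION <= duration <= MAX_DURATION:
            problems.append(f"entry {entry['id']}: duration {duration} out of range")
        # Check that the entry does not start before the previous one ended.
        if previous_end is not None and start < previous_end:
            problems.append(f"entry {entry['id']}: starts before previous entry ends")
        previous_end = end
    return problems


if __name__ == "__main__":
    print(validate_entries(SAMPLE))  # prints []
```

A non-empty result is a signal to rerun the prompt or fix the offending entries by hand.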
After running this prompt a few times to generate 16 entries, I placed them all in an array within a mock JSON database. The output of running track-time-cli's list command was the following:
$ tt l
* Showing sheets since 2/21/2024, 12:00:00 AM [yesterday]
- Sheet main (16 entries) [1:10:19:16]
(1) [9:28:26] yesterday -> 14 hours ago Writing API documentation @feature @docs
(2) [16:51] 14 hours ago -> 14 hours ago Implementing user authentication module @feature @backend
(3) [50:01] 14 hours ago -> 13 hours ago Designing database schema for new features @design @database
(4) [2:04:11] 13 hours ago -> 11 hours ago Improving responsive design for mobile devices @feature @frontend
(5) [2:29:48] 11 hours ago -> 9 hours ago Conducting user experience research @research @ux
(6) [1:42:30] 9 hours ago -> 7 hours ago Implementing caching for improved performance @performance @backend
(7) [9:08] 7 hours ago -> 7 hours ago Weekly team retrospective meeting @meeting @team
(8) [39:52] 7 hours ago -> 6 hours ago Refining search algorithm @algorithm @backend
(9) [2:13:23] 6 hours ago -> 4 hours ago Reviewing and merging pull requests @review @version-control
(10) [1:08:12] 4 hours ago -> 3 hours ago Developing new feature for customer analytics @feature @analytics
(11) [2:27:32] 3 hours ago -> 21 minutes ago Solving compatibility issues in cross-platform app @bugfix @mobile
(12) [3:14:14] 21 minutes ago -> in 3 hours Organizing backlog and prioritizing tasks @planning @management
(13) [3:07:36] in 3 hours -> in 6 hours Upgrading server infrastructure @maintenance @devops
(14) [1:41:17] in 6 hours -> in 8 hours Creating automated tests for new API endpoints @testing @automation
(15) [2:02:29] in 8 hours -> in 10 hours Troubleshooting server downtime issues @maintenance @devops
(16) [43:46] in 10 hours -> in 10 hours Designing user interface for the upcoming module @design @frontend
* 0 Sheets not shown. use --all to show
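Combining the results of several prompt runs into a single array can also be scripted. Each run is likely to restart its ids at 1, so the hypothetical `merge_batches` helper below renumbers them. This is only a sketch: the layout of the track-time-cli database is not shown in this post, so the code produces a flat array of entries and leaves timestamps untouched.

```python
import json


def merge_batches(batches):
    """Concatenate batches of entries and renumber ids from 1 (hypothetical helper)."""
    merged = []
    for batch in batches:
        for entry in batch:
            # Copy each entry so the input batches are left unmodified.
            merged.append({**entry, "id": len(merged) + 1})
    return merged


if __name__ == "__main__":
    # Two tiny stand-in batches; real ones would be parsed ChatGPT responses.
    first = json.loads('[{"id": 1, "description": "Refactoring user interface code"}]')
    second = json.loads('[{"id": 1, "description": "Writing API documentation"}]')
    print(json.dumps(merge_batches([first, second]), indent=2))
```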
Conclusions
ChatGPT is a practical tool for generating realistic mock datasets: a prompt with one example entry and a few explicit constraints was enough to produce usable data.
Try it out the next time you need some mock data with realistic textual content! 🚀