First things first: Don’t Do This.
I needed to load our main proxy servers’ IIS logs into a database so I could calculate the total bytes sent and total bytes received over time. The point of the analysis was to estimate the expected cost of running a similar setup in AWS behind an Application Load Balancer.
To load the data, I copied a PowerShell script from this blog (Not so many…), which was a modification of the original script from the same blog. The script is below. It is meant to be run as a scheduled task, logging the details of each run with PowerShellLogging. The script is currently set up to import only a single day of data, but it can be altered to load many days without much effort.
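In outline, the approach looks like the sketch below: parse the #Fields header of each W3C-format log into a DataTable, then bulk-copy the rows into SQL Server with SqlBulkCopy. The server, database, table, and path names here are placeholders, not the original script’s:

    # Minimal sketch of the loader: the #Fields header names the columns,
    # the remaining lines become rows, and SqlBulkCopy pushes each file in.
    # Server, database, table, and path names are placeholders.
    $connectionString = "Server=localhost;Database=IisLogs;Integrated Security=True"
    $logRoot = "D:\IISLogs"   # one subfolder per site
    $day = (Get-Date).AddDays(-1).ToString("yyMMdd")

    foreach ($file in Get-ChildItem $logRoot -Recurse -Filter "u_ex$day.log") {
        $fields = $null
        $table = New-Object System.Data.DataTable

        foreach ($line in [System.IO.File]::ReadLines($file.FullName)) {
            if ($line.StartsWith("#Fields:")) {
                # The header names the columns for the rows that follow.
                $fields = $line.Substring(8).Trim() -split " "
                if ($table.Columns.Count -eq 0) {
                    $fields | ForEach-Object { [void]$table.Columns.Add($_) }
                }
                continue
            }
            if ($line.StartsWith("#") -or -not $fields) { continue }
            $values = $line -split " "
            # Skip malformed lines rather than failing the whole file.
            if ($values.Count -eq $fields.Count) { [void]$table.Rows.Add($values) }
        }

        if ($table.Rows.Count -gt 0) {
            $bulk = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
            $bulk.DestinationTableName = "dbo.IisLog"
            $bulk.BatchSize = 10000
            # Map by name so the column order in the target table doesn't matter.
            $fields | ForEach-Object { [void]$bulk.ColumnMappings.Add($_, $_) }
            $bulk.WriteToServer($table)
            $bulk.Close()
        }
    }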
But I want to focus on the size of the database and the time it takes to load.
IIS Log Information
- 90 Sites
- 38 Days of Logs
- 58.8 GB of IIS Logs
Database Configuration Information
- My Personal Work Machine (not an isolated server)
- MS SQL Server 2016 SP1
- 1 TB SSD Drive
- Limited to 5 GB Memory
- Core i7-6700
- Windows 10 1803
First Attempt – An Unstructured Database
This was “not thinking ahead” in a nutshell. I completely ignored the fact that I would need to query the data afterwards and simply loaded all of it into a table with no Primary Key or Indexes.
The good news was it loaded “relatively” quickly.
Stats
- 151 Million Records
- 161 GB of Disk Space (about 2.7× the size of the raw logs)
- 7h 30m Running Time
The data was useless, as I couldn’t look up anything without a full table scan. I realized this before running my first query, so I have no data on how long one would have taken, but I figure it would have been a long time.
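Everything I needed boiled down to roll-ups along the lines of the sketch below; the [sc-bytes] and [cs-bytes] column names are assumed from the standard W3C fields, and against a 151-million-row heap every run of it is a full table scan:

    # Sketch of the daily roll-up I was after. Requires the SqlServer module
    # for Invoke-Sqlcmd; column names are assumed from the W3C fields
    # (sc-bytes = bytes sent to the client, cs-bytes = bytes received).
    Invoke-Sqlcmd -ServerInstance "localhost" -Database "IisLogs" -Query "
        SELECT [date],
               SUM(CAST([sc-bytes] AS bigint)) AS TotalBytesSent,
               SUM(CAST([cs-bytes] AS bigint)) AS TotalBytesReceived
        FROM dbo.IisLog
        GROUP BY [date]
        ORDER BY [date];"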
First Attempt Part 2 – Adding Indexes (Bad Idea)
Foolishly, I thought I could add the indexes to the existing table. So, I switched the database to the simple recovery model and tried to add a Primary Key.
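The statement was essentially the sketch below (key columns are my assumption). The catch is that adding a clustered Primary Key to a loaded heap forces SQL Server to sort and rewrite the entire table, so the files ballooned even under the simple recovery model:

    # The kind of statement that triggered the blow-up (key columns assumed;
    # the real key would need a combination that is actually unique).
    Invoke-Sqlcmd -ServerInstance "localhost" -Database "IisLogs" -Query "
        ALTER TABLE dbo.IisLog
        ADD CONSTRAINT PK_IisLog
        PRIMARY KEY CLUSTERED ([date], [time], [s-sitename], [cs-uri-stem]);"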
Within 1h 30m the database had grown to over 700 GB and error messages started popping up. I had to forcibly stop SQL Server and delete the .mdf/.ldf files by hand.
So, that was a really bad idea.
Second Attempt – Table with Indexes
This time I created the table with 9 indexes (1 clustered Primary Key, 8 nonclustered) before loading the data. Script below.
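A minimal sketch of that table definition, with column names assumed from the standard W3C fields and only a few of the eight nonclustered indexes spelled out:

    # Sketch of the indexed table. Column list and names are assumptions;
    # the remaining nonclustered indexes follow the same pattern. Note the
    # clustered key is deliberately not in load order, which is part of why
    # the load slowed down.
    Invoke-Sqlcmd -ServerInstance "localhost" -Database "IisLogs" -Query "
        CREATE TABLE dbo.IisLog (
            LogId         bigint IDENTITY(1,1) NOT NULL,
            [date]        date         NOT NULL,
            [time]        time(0)      NOT NULL,
            [s-sitename]  varchar(64)  NOT NULL,
            [cs-method]   varchar(16)  NOT NULL,
            [cs-uri-stem] varchar(512) NOT NULL,
            [sc-status]   int          NOT NULL,
            [sc-bytes]    bigint       NOT NULL,
            [cs-bytes]    bigint       NOT NULL,
            CONSTRAINT PK_IisLog
                PRIMARY KEY CLUSTERED ([s-sitename], [date], [time], LogId)
        );
        CREATE NONCLUSTERED INDEX IX_IisLog_Date
            ON dbo.IisLog ([date]) INCLUDE ([sc-bytes], [cs-bytes]);
        CREATE NONCLUSTERED INDEX IX_IisLog_Status
            ON dbo.IisLog ([sc-status], [date]);
        CREATE NONCLUSTERED INDEX IX_IisLog_UriStem
            ON dbo.IisLog ([cs-uri-stem], [date]);"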
With the additional indexes, and a Primary Key whose sort order didn’t match the order the data was loaded in, the load took significantly longer.
Stats
- 151 Million Records
- 362 GB of Disk Space (about 6.2× the size of the raw logs)
  - 77 GB Data
  - 288 GB Indexes
- 25h Running Time
I was really surprised to see the indexes taking up that much space. It was a lot of indexes, but I wanted to be covered for a lot of different querying scenarios.
Daily Imports
Stats
- 3.8 Million Records
- 6 GB of Disk Space
- 30m Running Time
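For completeness, wiring the loader up as the daily scheduled task mentioned at the top can be as simple as this sketch (script path and run time are placeholders):

    # Sketch: register the import as a daily scheduled task.
    $action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -ExecutionPolicy Bypass -File C:\Scripts\Import-IisLogs.ps1"
    $trigger = New-ScheduledTaskTrigger -Daily -At 2am
    Register-ScheduledTask -TaskName "Import IIS Logs" -Action $action -Trigger $trigger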
Initial Thoughts …
Don’t Do This.
There are a lot of great log-management companies out there: Splunk, New Relic, DataDog, etc. I have no idea how much they cost, but the amount of space and time it takes to make this data queryable absolutely justifies their existence.