Scaling the Database: Data Types

It was 2 AM on February 4, 2008 and I was on the largest conference call in my life. There were dozens of people listening in, from developers to executives. The Super Bowl had concluded hours before and our Vice President of Marketing was asking me if the query was done yet so that everyone could call it a night. The problem was, the query had been running for hours and I had no idea when it would be done.

The query in question was running against a legacy SQL Server database that had grown to 2.5 terabytes and regularly reached 5,000 transactions per second. The database had been hastily developed the weekend before Super Bowl 2006 to facilitate reporting. Putting together a new, complex system just days before a major event is a pretty impressive feat. Unfortunately, best practices are rarely on the radar when developers are in “git ‘er done” mode.  Two years and much added functionality later, it was time for a major redesign.

Add to that, this database was on the receiving end of some pretty heavy Super Bowl activity. Go Daddy’s 2008 Super Bowl commercial employed a marketing technique that was revolutionary at the time: drive visits to our website by airing a teaser and enticing the viewer to come to the website to see the commercial’s too-hot-for-TV conclusion. And it worked! We experienced one of the largest advertising spikes in the history of Super Bowl ads. This was great news for the company, but being downhill of that kind of activity wreaked havoc on the database server. While customer-facing systems handled the spike admirably, it took 3.5 hours before all pending inserts were written to the analytical database.

Let’s go back and examine what the query was trying to do. It was joining three large tables, the smallest of which had 200 million rows and the largest of which had 4.5 billion rows. The tables did not have partitioning or covering indexes, and all were clustered on a non-sequential GUID. This meant that we were essentially trying to scan and join every row in three very large tables to see if the results matched the aggregate condition. Later, I also discovered that the tables were 99.99% fragmented. No wonder everything was taking so long!

Fast forward to Super Bowl 2009. What a game! Santonio Holmes scored a touchdown in the last 35 seconds to win it for the Steelers. What’s more, the game set a record for viewership. With a U.S. audience of 98.7 million viewers, it was the most watched Super Bowl in history (at the time). That played no small part in Go Daddy setting its own records, including new customers and orders. This time, however, the database performed a little better:

2008 2009
Time to Catch Up on INSERTs 3.5 hours 20 minutes
Peak Transactions per Second 15,000 27,000
Time To Complete Complex Query 2.5 hours < 1 minute
When I Was Able to Leave the Office 2:30 AM (next day) 9:00 PM (same day)

As you can see in the chart above, the database hit a sustained peak of 27,000 transactions per second. According to Microsoft SQLCAT, this was among the highest transactional rates in the world for a SQL Server instance at that time.

So what changed? Well, it certainly wasn’t the hardware. The database was still running on the same commodity Dell PowerEdge 6850 server and NetApp FAS3040 array from the previous year. It also wasn’t the application tier. They added a significant amount of new functionality but left the old code largely alone. The fact is, we were able to achieve an extraordinary boost in performance just by redesigning the database using best practices.

This post is the first in a series exploring database scalability and very large database (VLDB) performance tuning. In this series, we’ll discuss performance tuning strategies that we have successfully implemented at Go Daddy and, when possible, discuss some of the results I’ve seen firsthand.

A word of warning: I’m going to assume this isn’t your first database rodeo. Also, I really, really like code, so you can be sure that we’ll be getting down and dirty in T-SQL whenever possible. ;)

Data Types

Although few realize it, one of the biggest decisions a database developer makes is the data types he or she chooses. A well-intentioned but ill-conceived decision to store a critical date or numeric column as a character column can cause years of pain for developers and waste money for a business. One Fortune 500 company I worked with experienced this firsthand. For reasons known only to the original developer(s), a critical column, ProductNumber, was defined as a CHAR(4). Finally, after 20+ years of custom integrations and creative workarounds, the company had no option left but to upgrade the data type. What followed was a very long and expensive modification to nearly every piece of software in the company. Had the developer(s) decided to use a standard integer-based data type, the company would have been able to save a lot of money in its software integrations and avoid a costly data type conversion process.

Back to Go Daddy in February of 2006.  In the absence of clear requirements and not having the time to spend on researching it, the developers chose to use the most common data types: all integer columns were stored as a 4-byte INT, all date columns were an 8-byte DATETIME, and all string columns were stored as an NVARCHAR column. 16-byte UNIQUEIDENTIFIER columns were also liberally used to store GUIDs. Web developers generally like GUIDs, or globally unique identifiers. They’re a great way to generate a unique value when a central issuing authority is not available. These are useful for all sorts of things, from uniquely identifying a person to maintaining a unique web session.

Let’s look at one of the largest tables in the database. For security reasons, I’ll replace actual column names with generic names.

CREATE TABLE dbo.BadTable
(
myGuid1   UNIQUEIDENTIFIER
, myGuid2   UNIQUEIDENTIFIER
, myDate1   DATETIME
, myDate2   DATETIME
, myID1     INT
, myID2     INT
, myID3     INT
, myID4     INT
, myID5     INT
, myID6     INT
, myText1   NVARCHAR(2)
, myText2   NVARCHAR(10)
, myText3   NVARCHAR(12)
, myText4   NVARCHAR(10)
, myText5   NVARCHAR(15)
, myText6   NVARCHAR(15)
, myText7   NVARCHAR(100)
, myText8   NVARCHAR(255)
, myText9   NVARCHAR(255)

CONSTRAINT PK_BadTable
PRIMARY KEY (myGuid1)
);

The developers chose to use NVARCHAR because they knew that Go Daddy does business internationally and would need to support international characters in some of its columns. Not having the time to identify which columns would need to support Unicode, the developers made the understandable decision to make everything Unicode. The problem with this is that character columns are usually some of the widest in a table, and Unicode consumes twice as much space as non-Unicode.

While we’re on the topic of space, let’s take a look at the storage requirements of the data types in the dbo.BadTable example above.

Data Type Values Supported Storage Requirements
varchar(n) A variable-length non-Unicode alpha-numeric string actual length + 2 bytes
nvarchar(n) A variable-length Unicode alpha-numeric string ((actual length) * 2)
+ 2 bytes
char(n) A fixed-length non-Unicode alpha-numeric string n defined length
tinyint 0 to 255 1 byte
smallint up to -/+ 32,767 2 bytes
int up to -/+ 2,147,483,647 4 bytes
bigint up to -/+ 9,223,372,036,854,775,807 8 bytes
smalldatetime 1900-01-01 to 2079-06-06, minute precision 4 bytes
datetime 1753-01-01 to 9999-12-13, sub-second precision 8 bytes
uniqueidentifier 36-character string in the form of xxxxxxxx- xxxx- xxxx- xxxx- xxxxxxxxxxxx, i.e.  6F9619FF-8B86-D011-B42D-00C04FC964FF 16 bytes

You might be saying, “Why should I care about 2 bytes versus 4 bytes? Storage is cheap! I can buy a 1 TB drive for $100.” Well, all storage is not created equal, and the biggest bottleneck in most databases is IO. This leads me to the biggest performance tuning tip I can offer: The more data you can fit on a data page, the better your database will perform. It’s really that simple. If you can maximize the storage of data on a page, you can get great database performance.

Using the table of data types above, let’s do some math to figure out how much space dbo.BadTable will consume. To do this, we’ll need to know the average length of each variable-length column:

Column Data Type Average Length Bytes
myGuid1 UNIQUEIDENTIFIER 16
myGuid2 UNIQUEIDENTIFIER 16
myDate1 DATETIME 8
myDate2 DATETIME 8
myID1 INT 4
myID2 INT 4
myID3 INT 4
myID4 INT 4
myID5 INT 4
myID6 INT 4
myText1 NVARCHAR(2) 2 6
myText2 NVARCHAR(10) 8 18
myText3 NVARCHAR(12) 6 14
myText4 NVARCHAR(10) 8 18
myText5 NVARCHAR(15) 6 14
myText6 NVARCHAR(15) 9 20
myText7 NVARCHAR(100) 7 16
myText8 NVARCHAR(255) 32 66
myText9 NVARCHAR(255) 45 92
Total 336

Tip: You can easily identify the average length of a column by selecting AVG(LEN(columnName)) on a table. Because it will scan an entire table, try to avoid running this type of query in peak hours.

Based on the data types specified, dbo.BadTable consumes 336 bytes of storage per row. While I won’t go into the particulars, you also need to be aware that there are an additional 11 bytes of overhead for each row in this particular table. Thus, in actuality, each row inserted into this table consumes 347 bytes of space.

If you are using SQL Server 2005 or later, you can see this for yourself using the following code. This code will insert a row that simulates the typical data footprint of the table and will output the data page of dbo.BadTable using the undocumented DBCC PAGE command.

INSERT INTO dbo.BadTable
(
myGuid1
, myGuid2
, myDate1
, myDate2
, myID1
, myID2
, myID3
, myID4
, myID5
, myID6
, myText1
, myText2
, myText3
, myText4
, myText5
, myText6
, myText7
, myText8
, myText9
)
SELECT NEWID()
, NEWID()
, GETDATE()
, GETDATE()
, 2500
, 1150
, 185
, 11000
, 75
, 175
, REPLICATE('X', 2)
, REPLICATE('X', 8)
, REPLICATE('X', 6)
, REPLICATE('X', 8)
, REPLICATE('X', 6)
, REPLICATE('X', 9)
, REPLICATE('X', 7)
, REPLICATE('X', 32)
, REPLICATE('X', 45);

DECLARE @dbName SYSNAME = DB_NAME()
, @fileID INT
, @pageID INT;

SELECT @fileID = CONVERT (VARCHAR(6), CONVERT (INT,
SUBSTRING (au.first_page, 6, 1) +
SUBSTRING (au.first_page, 5, 1)))
, @pageID = CONVERT (VARCHAR(20), CONVERT (INT,
SUBSTRING (au.first_page, 4, 1) +
SUBSTRING (au.first_page, 3, 1) +
SUBSTRING (au.first_page, 2, 1) +
SUBSTRING (au.first_page, 1, 1)))
FROM sys.indexes AS i
JOIN sys.partitions AS p
ON i.[object_id] = p.[object_id]
AND i.index_id = p.index_id
JOIN sys.system_internals_allocation_units AS au
ON p.hobt_id = au.container_id
WHERE p.[object_id] = OBJECT_ID('dbo.BadTable')
AND au.first_page > 0x000000000000;

DBCC TraceOn (3604);
DBCC PAGE (@dbName, @fileID, @pageID, 3);
DBCC TraceOff (3604);

(1 row(s) affected)
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

PAGE: (1:283)

BUFFER:

BUF @0x0000000278751D40

bpage = 0x000000026C7FE000 bhash = 0×0000000000000000 bpageno = (1:283)
bdbid = 5 breferences = 1 bcputicks = 0
bsampleCount = 0 bUse1 = 39082 bstat = 0xb
blog = 0x15ab215a bnext = 0×0000000000000000

PAGE HEADER:

Page @0x000000026C7FE000

m_pageId = (1:283) m_headerVersion = 1 m_type = 1
m_typeFlagBits = 0×0 m_level = 0 m_flagBits = 0×8000
m_objId (AllocUnitId.idObj) = 88 m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594043695104
Metadata: PartitionId = 72057594039238656 Metadata: IndexId = 1
Metadata: ObjectId = 629577281 m_prevPage = (0:0) m_nextPage = (0:0)
pminlen = 76 m_slotCnt = 2 m_freeCnt = 7398
m_freeData = 790 m_reservedCnt = 0 m_lsn = (31:366:2)
m_xactReserved = 0 m_xdesId = (0:0) m_ghostRecCnt = 0
m_tornBits = -940717766 DB Frag ID = 1

Allocation Status

GAM (1:2) = ALLOCATED SGAM (1:3) = ALLOCATED
PFS (1:1) = 0×60 MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

Slot 0 Offset 0×60 Length 347

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS
Record Size = 347
Memory Dump @0x000000001165A060

The actual output is considerably longer; however, we only need this first section for the purposes of this example. Look at “Record Size.” This tells you how many bytes a row actually consumes. If you have variable-length columns, you will see the value change for each row. This value does include row overhead, so you will notice that the Record Size is 347, which is what I expected to see.

A data page can hold 8192 bytes. 96 bytes are reserved for the header, which leaves 8096 bytes available for data.  That means a data page can only hold 23 rows in this table (8096 bytes / 347 bytes), assuming there’s no fragmentation. At the time of Super Bowl 2008, this table had 200 million rows. If you divide 200 million by 23 rows, it took 8,695,652 pages to hold our data. Multiply that by 8192 bytes, and you find that this table consumed 66 GB of storage space.

Now let’s look at this exact same table and select the appropriate data types based on our actual storage needs:

Column Values Actually Stored
myGuid1 GUID
myGuid2 GUID
myDate1 Precision needed to the minute
myDate2 Precision needed to the second (not subsecond)
myID1 0 to 2,500
myID2 0 to 1,150
myID3 0 to 185
myID4 0 to 11,000
myID5 0 to 75
myID6 0 to 175
myText1 Always 2 non-Unicode characters
myText2 Average 8 non-Unicode characters
myText3 Average 6 non-Unicode characters
myText4 Average 8 non-Unicode characters
myText5 Average 6 non-Unicode characters
myText6 Average 9 non-Unicode characters
myText7 Average 7 Unicode characters
myText8 Average 32 Unicode characters
myText9 Average 45 non-Unicode characters
CREATE TABLE dbo.GoodTable
(
myNewID   INT IDENTITY(1,1)
, myGuid1   UNIQUEIDENTIFIER
, myGuid2   UNIQUEIDENTIFIER
, myDate1   SMALLDATETIME
, myDate2   SMALLDATETIME
, myID1     SMALLINT
, myID2     SMALLINT
, myID3     SMALLINT
, myID4     SMALLINT
, myID5     TINYINT
, myID6     TINYINT
, myText1   CHAR(2)
, myText2   VARCHAR(10)
, myText3   VARCHAR(12)
, myText4   VARCHAR(10)
, myText5   VARCHAR(15)
, myText6   VARCHAR(15)
, myText7   NVARCHAR(100)
, myText8   NVARCHAR(255)
, myText9   VARCHAR(255)

CONSTRAINT PK_GoodTable
PRIMARY KEY (myNewID)
);

Let’s take a minute to discuss some of the decisions made. First, notice that I added a new column, myNewID, to the table. This adds 4 bytes of space but serves a very valuable tuning purpose that I’ll discuss in a later blog post. UNIQUEIDENTIFIER is the correct data type to store a GUID, so I left those two columns as is. I then determined that only two of the string columns actually need to support Unicode, so I left those two columns alone as well. All of the other data types were changed. Why? The date columns were changed from DATETIME to SMALLDATETIME because I only needed minute-level precision. Also, 3 of my ID columns needed to store a value over 255 (the cutoff for a TINYINT) and under 2.1 billion (the cutoff for an INT), making SMALLINT an easy decision. But why did I choose to make myID3, which only needs to store values between 0 and 185, a SMALLINT instead of a TINYINT? When choosing the right data type, you need to choose the smallest data type that makes sense for the business. This ends up being equal parts art and science. My rationale is this: if I was already at 185 after only a year of data collection, there’s a good chance I would exceed the value limit of 255 sometime in the next year. Everyone has a different guideline, but mine is to allow for 3-5 years’ worth of reasonable growth in the business and application. Lastly, I was able to cut the storage requirements for all non-Unicode columns in half by switching from NVARCHAR to VARCHAR.

Column Data Type Needed Average Length Bytes
myNewID INT IDENTITY(1,1) 4
myGuid1 UNIQUEIDENTIFIER 16
myGuid2 UNIQUEIDENTIFIER 16
myDate1 SMALLDATETIME 4
myDate2 SMALLDATETIME 4
myID1 SMALLINT 2
myID2 SMALLINT 2
myID3 SMALLINT 2
myID4 SMALLINT 2
myID5 TINYINT 1
myID6 TINYINT 1
myText1 CHAR(2) 2 2
myText2 VARCHAR(10) 8 10
myText3 VARCHAR(12) 6 8
myText4 VARCHAR(10) 8 10
myText5 VARCHAR(15) 6 8
myText6 VARCHAR(15) 9 11
myText7 NVARCHAR(100) 7 16
myText8 NVARCHAR(100) 32 66
myText9 VARCHAR(255) 45 47
Total 232

The new table only requires 232 bytes to store the exact same data. Note that there is still an overhead of 11 bytes per row in the new table, too. You can see this for yourself, if you like, by re-running the DBCC PAGE code from earlier and replacing dbo.BadTable with dbo.GoodTable.

At the end of the day, here’s how my changes impacted the performance of the table:

  Before After % Change
Defined Bytes Per Row 336 232 -31%
Row Overhead 11 11 0%
Total Bytes Per Row 347 243 -30%
Rows Per Page 23 33 45%
Pages to hold 200m rows 8,695,652 6,060,606 -30%
Table Storage Needs (GB) 66.34 46.24 -30%

By using the right data types, I was able to reduce the size of my table by 30%, which saves space both on my disks and in my backups. Even better, I was able to improve IO by 45%! Not too shabby.

One last thing I’d like to mention. Don’t wait until you have a large table before paying attention to data types. Many of those small tables, such as reference or dimension tables, tend to become foreign keys in a much larger table. Therefore, selecting the right data types for those tables is equally important.

Database Lesson #1: Pay attention to your data types. They can have a huge impact on database performance.

Michelle is a SQL Server database developer, Microsoft SQL Server MVP, published author, accomplished presenter, and self-proclaimed scripting junkie. She has been with Go Daddy since July 2007 and specializes in performance tuning and high-volume VLDB (very large database) development. Her experience also includes database automation, operational predictive analytics, and all stages of the data lifecycle–from OLTP to data warehousing. Connect with Michelle on Google+

20 Comments on "Scaling the Database: Data Types"

  1. Dean Willson says:

    Great tips on data types Michelle. Changes to things like data types after they are in wide usage in the enterprise are much more risky than taking the time to make the right choices while still in the design and modeling phase.

  2. A+ post. I cannot recall how many times developers have just been like ‘oh, hey let’s just make everything nvarchar(255) because it works!’ – a total head-in-hands moment. Choosing proper data types early in a project is guaranteed to pay dividends later on.

  3. Great Post Michelle. I had a great time reading your example.
    Can you please tell me where does the overhead of 11 bytes come from for each row? (I’m novice, sorry :P )

    And i’m waiting to know why you entered the identity column.

    Thanks for your time to share this with us.

  4. Really nice post Michelle, Thank You. our env tends to have some huge tables but are 3rd party products dont know how to proceed

  5. Scott Caldwell says:

    Michelle,

    Thanks for taking the time to write up an often overlooked topic. Keep up the “inside” blogging!

  6. Shaun Stuart says:

    Fantastic! I’m assuming you eventually changed the clustered index from a GUID to your new myNewID column.. Would be very interested to hear how long SQL took to make that change..

  7. As a developer I learned early on that paying attention to correct data types was critical to the performance of an application. I just wanted to make the point that not all developers want to make everything nvarchar(255), but I have met my share who have.

    I am curious on the column order that you chose if there was a specific reason you put the 16 byte unique identifier first? I have always worked from smallest to largest, so in that case I would put TINYINT first, then SMALLINT, SMALLDATETIME, INT, etc with the NVARCHAR(255) last.

  8. Todd Everett says:

    Great post Michelle and thanks for pointing us here on your blog! You really put a lot of time into this and I liked how you showed a great real world example of the impact of design decisions on performance when data volumes are huge – both in the time it takes to write and in the time it takes to do range scans. Where I work most SQL Server database never approach the kind of size you experience and so attempting to maximize rows per page at design time is often overlooked. As you show design DOES matter and I’ll be able to point to this post as to why. And given that everybody I work with will know who GoDaddy is makes it even better! Keep up the blogging!

    -Todd

  9. @Tiago Don’t apologize! It’s a good question. There is a minimum of 7 bytes of overhead per row that is used by SQL Server. The amount of overhead can increase based upon variable columns, NULLs, etc. You can see this in action using this code example:

    – TEST 1

    CREATE TABLE dbo.Overhead1
    (
    myID INT NOT NULL — 4 bytes of storage
    , myColumn CHAR(10) NOT NULL — 10 bytes of storage

    CONSTRAINT PK_Overhead1
    PRIMARY KEY CLUSTERED(myID)
    );

    INSERT INTO dbo.Overhead1
    SELECT 1, REPLICATE(‘X’, 10);

    DECLARE @dbName SYSNAME = DB_NAME()
    , @fileID INT
    , @pageID INT;

    SELECT @fileID = CONVERT (VARCHAR(6), CONVERT (INT,
    SUBSTRING (au.first_page, 6, 1) +
    SUBSTRING (au.first_page, 5, 1)))
    , @pageID = CONVERT (VARCHAR(20), CONVERT (INT,
    SUBSTRING (au.first_page, 4, 1) +
    SUBSTRING (au.first_page, 3, 1) +
    SUBSTRING (au.first_page, 2, 1) +
    SUBSTRING (au.first_page, 1, 1)))
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
    ON i.[object_id] = p.[object_id]
    AND i.index_id = p.index_id
    JOIN sys.system_internals_allocation_units AS au
    ON p.hobt_id = au.container_id
    WHERE p.[object_id] = OBJECT_ID(‘dbo.Overhead1′)
    AND au.first_page > 0×000000000000;

    DBCC TraceOn (3604);
    DBCC PAGE (@dbName, @fileID, @pageID, 3);
    DBCC TraceOff (3604);

    DROP TABLE dbo.OverHead1;

    — TEST 2

    CREATE TABLE dbo.Overhead2
    (
    myID INT NOT NULL — 4 bytes of storage
    , myColumn VARCHAR(10) NOT NULL — 10 bytes of storage

    CONSTRAINT PK_Overhead2
    PRIMARY KEY CLUSTERED(myID)
    );

    INSERT INTO dbo.Overhead2
    SELECT 1, REPLICATE(‘X’, 10);

    DECLARE @dbName SYSNAME = DB_NAME()
    , @fileID INT
    , @pageID INT;

    SELECT @fileID = CONVERT (VARCHAR(6), CONVERT (INT,
    SUBSTRING (au.first_page, 6, 1) +
    SUBSTRING (au.first_page, 5, 1)))
    , @pageID = CONVERT (VARCHAR(20), CONVERT (INT,
    SUBSTRING (au.first_page, 4, 1) +
    SUBSTRING (au.first_page, 3, 1) +
    SUBSTRING (au.first_page, 2, 1) +
    SUBSTRING (au.first_page, 1, 1)))
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
    ON i.[object_id] = p.[object_id]
    AND i.index_id = p.index_id
    JOIN sys.system_internals_allocation_units AS au
    ON p.hobt_id = au.container_id
    WHERE p.[object_id] = OBJECT_ID(‘dbo.Overhead2′)
    AND au.first_page > 0×000000000000;

    DBCC TraceOn (3604);
    DBCC PAGE (@dbName, @fileID, @pageID, 3);
    DBCC TraceOff (3604);

    DROP TABLE dbo.OverHead2;

    See the difference? The 2nd example has 4 more bytes per row: 2 bytes for the variable count, and 2 bytes for the variable offset. There’s a great book that called SQL Server 2008 Internals that explores row overhead in much greater detail. HTH! :)

  10. @Shaun well, I don’t want to give away the exciting conclusion, but… yes. ;) Unfortunately, I don’t have a time to give you, as I didn’t modify existing tables. I’m hoping to blog about the approach I took to updating the database, when I get time. :)

  11. @Bryan F. Good point. I don’t mean to suggest that all developers do that. Just that, it’s easier to go big and worry about it later, so many people do so. :) As for column order, I tend to order my tables as follows:

    1. Uniquely identifying columns (i.e. PK)
    2. CIX, if separate from the PK
    3. Date columns, i.e. date inserted or last modified
    4. Descending order based on a guess of importance or access

    That’s my personal preference, but I know developers who have different approaches. Fortunately, it doesn’t matter how columns are ordered within a data page. Within an index, that’s a separate story. :)

  12. @Dean, @Ken, @Scott, @Todd – thank you all for the kind and encouraging comments! You guys are awesome. If you have any future topic suggestions, don’t hesitate to let me know. :)

    @Kiran – That’s a difficult situation to be in. I assume you’re not allowed to modify the schema? That’s been my personal experience when dealing with 3rd party products. However, I’ve had some success with maintaining the application layer but changing the database layer by using views and indexed views, which obfuscate the changes you make while allowing the application to retrieve data in the format it expects. HTH! Good luck.

  13. Awesome article Michelle, thanks for sharing!

    Out of interest, and in terms of a real-world design approach that you would take, would you recommend starting with smaller data-types and refactoring columns to larger ones if/when you encounter a column truncation issue, or starting with a larger datatypes and going through an optimization iteration after more data is collected, given that you are unable to accurately determine the maximum column size for the data at hand?

    Thanks again! :)
    />L

  14. Was any consideration given to how the db files were distributed across the netapps and if so would you expand on that piece.

  15. fyi,
    For your MyNewID column is may be appropriate to use a BIGINT datatype instead of INT for your largest tables since INT is capped at a little over 2.1 billion. We’ve run into issues in my shop with several tables running out of identities either b/c of row count > 2.1 billion or b/c of data churn, ie a queue table. Since identities are gone forever (kind of) once they’re deleted, we’ve had tables run out of identity values with only a few thousand records in the table. BIGINTs are twice as wide at 8 bytes, but for us it’s a necessary evil.

  16. One thing I always tell people is organize variable length columns according to nullability. In other words put non-nullable first, then based on the percentage of expected nulls from least to greatest. The reason for this is that SQL will not include the 2 for a null column as long as no other variable length column with data exists after it. So if I have the following scenario layout

    Col1 varchar(40) not null, Col2 varchar(10) null, Col3 varchar(20) null

    and I have data of lengths in column order 34, 0, 5 my total cost is 45 on the disk for the columns plus overhead with other data and rows header. But if I change to

    Col1 varchar(40) not null, Col3 varchar(20) null, Col2 varchar(10) null

    and these are my only variable length columns then my cost is 43 for the data. On the table all fixed length data is stored at the beginning of the row physically and the variable length data has its length marker written for each column even if null, except any nulled column where there is no column containing data after it. You still want to calculate like it is there for storage concerns but it will give you better density per page on average and it can be especially helpfully on very large tables in reducing expected costs.

    Also, I sometimes see people do things like use a bitmask in an int type field versus individual bit fields. Most don’t realize that bit fields often get organzied under the hood into a bitmasked field thus they are not really using 1 byte per column, as I recall they just need to be organized together for it to happen, although it may happen anyway I just don’t have a sample to recheck, Plus if you use anything other than a tinyint datatype you have the overheade of the sign bit (+/- number) and they are fixed length meaning they will always use the number of bytes listed even if you needed less.

    Another cost savings thing I suggest to people often is really evaluate your CHAR datatypes. If the data is very often null then consider using varchar with a constraint of length of the data to ensure length is never shorter unless it doesn’t really matter. The reason is you can use my first comment to organize the data and save space. I only suggest using this on fields of at least 6 characters and based on the percentage null expected. The reason for that is becuase of course the field incures the 2 bytes added for a variable length column, but if you have a column of 10 which is 80% null at worse yu will cost 2 bytes per row versus 10 bytes for every row regardless of null or not.

    Which brings me to my last comment, if a column is typically close to 100% full and nearly to always not null, and it is a varchar then consider using a char instead to save on the 2 bytes. For example I have seen people use abbreviation fields for items that were not null and they were always 3 to 4 characters long but they chose varchar(4) instead of char(4) becuase they viewed it as variable. The issue here is that it filled to 3 cost 5 bytes and filled to 4 cost 6 bytes, where a char(4) always cost 4 bytes. Yes you will have a padded field but you can manage that at access time, but like you pointed out software cost to update a datatype, just weight that in, often makes sense to pass in varchar to a stored procedure and have the proc change to char, thus one management point to fix. You can also cast the data out to vachar and deliver from sql server using a view but I would say it is more costly to do that than it is to allow the software to manage, and long term you will expect to maintain software, just make sure you properly document all interactions and put them somewhere were those relationships to sql are known before sql makes changes.

  17. I’ve gotten a lot of “disk is cheap” push-back from upstairs lately; this is a terrific real-world example clearly demonstrating why that’s not the point.

  18. Sean Pearce says:

    Hey mst, when upstairs give me the “disk is cheap” line, I reply with “that’s where we keep your data overnight”. When the users need it we first need to allocate physical memory and then we have to fetch it from the disk and that is where the real cost lies.

    Great article. I am also suffering with a legacy system which is much bigger than it could or should be.

  19. Karen Lopez (@datachick) says:

    Great, thorough explanation, Michelle.

    I will have to say that I have almost always regretted going with TINYINT in my designs. The world tends to outgrow it for the silliest of reasons, even for things with very low volume like StatusCodes and GenderTypeCodes.

    Something real world comes along and makes someone need to keep more history, change the meaning slightly, or address more international needs.

    I love, though, that you talked about making a decision to go with SMALLINT even though your current data seems to indicate it can stay at TINYINT. I use volumetric features of my data modeling tools to forecast how much the data will grow over the next 12 months and 5 years to help me decide which datatype to go with.

  20. Shaiju CK says:

    @Michelle, Good explanation and thanks for sharing the code to find the overhead.

    @James, Thanks for your tip also. I had even tried to create columns in a particlar
    order. Be frank, those were purely cosmetic purpose at that time :D . But found it is required from your post.

    I created columns in a particular order on tables at the initial stage of the creation. But I could not create columns on an existing table through scripts. Is there anyway to create columns in a particular order on an existing table by alter scripts?

Got something to say? Go for it!

 
Traffic Log Image