Debunking a DataSet Myth

Many people think that DataSets are stored internally as XML. What most people need to know is that DataSets are serialized as XML (even when binary serialization is used), but that doesn't mean they are stored as XML internally. Although there is no easy way of knowing for certain, it's easy to compare the memory footprint of a DataSet with that of an XmlDocument holding the same data.

If DataSets were stored as XML, then in theory a DataSet should be larger than the equivalent XmlDocument, since the existence of BeginLoadData/EndLoadData implies that internal indexes are maintained along with the data.

It's not easy to get the size of an object in memory, but here is my attempt.

// Measure the managed heap before and after loading the typed DataSet.
long bytecount = System.GC.GetTotalMemory(true);
DataSet1 ds = new DataSet1();
ds.EnforceConstraints = false;
// Suspend index maintenance while loading (EndLoadData is not called in this version).
ds.Order_Details.BeginLoadData();
ds.Orders.BeginLoadData();
ds.ReadXml("c:\\test.xml");
bytecount = System.GC.GetTotalMemory(true) - bytecount;
MessageBox.Show("Loaded - Waiting. Total K = " + (bytecount/1024).ToString());

// Same measurement, loading the same file into an XmlDocument instead.
long bytecount = System.GC.GetTotalMemory(true);
System.Xml.XmlDocument xmlDoc = new System.Xml.XmlDocument();
xmlDoc.Load("c:\\test.xml");
bytecount = System.GC.GetTotalMemory(true) - bytecount;
MessageBox.Show("Loaded - Waiting. Total K = " + (bytecount/1024).ToString());

I tried these examples with two different XML files, both storing Orders and Order Details out of the Northwind database. The first file held the entire result set of both tables. The DataSet took approximately 607K in memory. The XmlDocument took 1894K, over 3 times larger. For the second test, I used only 1 record in each of the Orders and Order Details tables. The DataSet in this case took 24K and the XmlDocument took 26K, a small difference.

You will notice that in my DataSet example I turned off index maintenance by calling BeginLoadData. Taking this code out resulted in a DataSet of 669K, an increase of approximately 10%. Interestingly, if you call both BeginLoadData and EndLoadData, the net size of the DataSet is only 661K. This would imply that leaving index maintenance on during loads is inefficient in memory usage.
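
The three DataSet variants and the XmlDocument can be put side by side with a small harness. This is only a sketch under stated assumptions: `FootprintSketch`, `MeasureBytes`, `CreateEmptyDataSet`, and `BuildSampleXml` are hypothetical names of mine, an untyped DataSet with a single Orders table stands in for the typed DataSet1, the XML is generated in memory rather than read from c:\test.xml, and `Func<object>` assumes .NET 3.5 or later. The numbers it prints will not match the ones quoted here.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class FootprintSketch
{
    // Approximate growth of the managed heap caused by running build().
    // GetTotalMemory(true) forces a collection first, so this is a rough
    // estimate of the retained size, not an exact object size.
    public static long MeasureBytes(Func<object> build)
    {
        long before = GC.GetTotalMemory(true);
        object instance = build();
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(instance); // keep the object reachable for the second reading
        return after - before;
    }

    // Hypothetical stand-in for the typed DataSet1: one Orders table.
    public static DataSet CreateEmptyDataSet()
    {
        DataSet ds = new DataSet("Sample");
        DataTable orders = ds.Tables.Add("Orders");
        orders.Columns.Add("OrderID", typeof(int));
        orders.Columns.Add("CustomerID", typeof(string));
        return ds;
    }

    // Generates the test document in memory (assumption: replaces c:\test.xml).
    public static string BuildSampleXml(int rows)
    {
        DataSet source = CreateEmptyDataSet();
        for (int i = 0; i < rows; i++)
            source.Tables["Orders"].Rows.Add(i, "CUST" + (i % 50));
        StringWriter writer = new StringWriter();
        source.WriteXml(writer, XmlWriteMode.IgnoreSchema);
        return writer.ToString();
    }

    // Loads the XML, optionally calling BeginLoadData and EndLoadData.
    public static DataSet LoadDataSet(string xml, bool beginLoad, bool endLoad)
    {
        DataSet ds = CreateEmptyDataSet();
        ds.EnforceConstraints = false;
        DataTable orders = ds.Tables["Orders"];
        if (beginLoad) orders.BeginLoadData();
        ds.ReadXml(new StringReader(xml), XmlReadMode.IgnoreSchema);
        if (beginLoad && endLoad) orders.EndLoadData();
        return ds;
    }

    public static XmlDocument LoadXmlDocument(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc;
    }

    public static void Main()
    {
        string xml = BuildSampleXml(5000);
        long beginOnly = MeasureBytes(() => LoadDataSet(xml, true, false));
        long noBegin = MeasureBytes(() => LoadDataSet(xml, false, false));
        long beginEnd = MeasureBytes(() => LoadDataSet(xml, true, true));
        long dom = MeasureBytes(() => LoadXmlDocument(xml));
        Console.WriteLine("DataSet, BeginLoadData only: {0} K", beginOnly / 1024);
        Console.WriteLine("DataSet, no BeginLoadData:   {0} K", noBegin / 1024);
        Console.WriteLine("DataSet, Begin + End:        {0} K", beginEnd / 1024);
        Console.WriteLine("XmlDocument:                 {0} K", dom / 1024);
    }
}
```

Passing the loader in as a delegate keeps the measurement code identical for every variant, so any difference in the printed figures comes from the load itself rather than from the harness.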

The speed of loading from XML is a different story. Because the XmlDocument (I'm assuming) delays some of its parsing, loading the full data from the XML file into an XmlDocument takes about one third of the time it takes to load it into the DataSet. I wouldn't be too concerned about this. Loading a DataSet from a relational source through a DataAdapter involves no XML parsing and is much faster.
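
A rough way to time the two loads is sketched below. It rests on several assumptions: `LoadTimingSketch`, `BuildSampleXml`, and `TimeMilliseconds` are hypothetical names, the document is built in memory instead of being read from c:\test.xml, and an untyped DataSet is used, so `ReadXml` also infers a schema, which a typed DataSet would not need to do. A single timed run is noisy, so treat the output as indicative at best.

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Xml;

public static class LoadTimingSketch
{
    // Builds a small Orders document in memory (assumption: replaces c:\test.xml).
    public static string BuildSampleXml(int rows)
    {
        StringBuilder sb = new StringBuilder("<Sample>");
        for (int i = 0; i < rows; i++)
        {
            sb.Append("<Orders><OrderID>").Append(i).Append("</OrderID>");
            sb.Append("<CustomerID>CUST").Append(i % 50).Append("</CustomerID></Orders>");
        }
        return sb.Append("</Sample>").ToString();
    }

    // Times one run of the given load action, in milliseconds.
    public static long TimeMilliseconds(Action load)
    {
        Stopwatch watch = Stopwatch.StartNew();
        load();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static DataSet LoadDataSet(string xml)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml)); // untyped: the schema is inferred here
        return ds;
    }

    public static XmlDocument LoadXmlDocument(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc;
    }

    public static void Main()
    {
        string xml = BuildSampleXml(20000);

        // Warm-up pass so JIT compilation is not counted in the timings.
        LoadDataSet(xml);
        LoadXmlDocument(xml);

        long dataSetMs = TimeMilliseconds(() => LoadDataSet(xml));
        long domMs = TimeMilliseconds(() => LoadXmlDocument(xml));
        Console.WriteLine("DataSet.ReadXml:     {0} ms", dataSetMs);
        Console.WriteLine("XmlDocument.LoadXml: {0} ms", domMs);
    }
}
```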

If you load up Anakrino and take a look at how the DataSet stores its data, each DataTable has a collection of columns, and each column is in fact a strongly typed storage array. Each type of storage array has an appropriate private member array of the underlying value type (integer, string, etc.). The storage array also maintains a bit array that is used to keep track of which rows in that array are null. The bit array is always checked first, before going to the typed storage array, and the lookup returns either null or the default value. That's pretty tight.
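
To make the idea concrete, here is a minimal sketch of a typed column store with a null bitmap. It is not the actual System.Data code: `Int32ColumnSketch` and its members are hypothetical, and I am assuming a null row is reported as `DBNull.Value` while its slot in the typed array simply holds the default value.

```csharp
using System;
using System.Collections;

// Hypothetical illustration of a typed storage array; not the real internals.
public sealed class Int32ColumnSketch
{
    private readonly int[] values;      // one unboxed slot per row
    private readonly BitArray nullBits; // true means the row is null

    public Int32ColumnSketch(int rowCount)
    {
        values = new int[rowCount];
        nullBits = new BitArray(rowCount, true); // every row starts out null
    }

    public void Set(int row, object value)
    {
        if (value == null || value == DBNull.Value)
        {
            nullBits[row] = true;
            values[row] = 0; // slot falls back to the default value
        }
        else
        {
            values[row] = (int)value;
            nullBits[row] = false;
        }
    }

    public object Get(int row)
    {
        // The null bit is consulted before the typed array is read.
        if (nullBits[row])
            return DBNull.Value;
        return values[row];
    }

    public static void Main()
    {
        Int32ColumnSketch column = new Int32ColumnSketch(3);
        column.Set(1, 42);
        for (int row = 0; row < 3; row++)
            Console.WriteLine("row {0}: {1}", row, column.Get(row) == DBNull.Value ? "(null)" : column.Get(row));
    }
}
```

With this layout each integer costs four bytes plus one bit for the null flag, with no per-value node objects, which fits the smaller footprint measured for the DataSet.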