DataSet Serialization: Smaller & Faster

DataSets serialize naturally to XML quite well, and you have lots of control over that. Typed DataSets have properties in their XSD that control it (and of course you can do it programmatically too). But one of the common problems with remoting DataSets is that the default binary serialization is actually just the XML serialization written out as text. Crude. Some people have even used this fact to extrapolate that the DataSet is internally XML - which isn't true.
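
You can see this for yourself. Here's a minimal sketch (assuming using directives for System, System.Data, System.IO, and System.Runtime.Serialization.Formatters.Binary) - serialize any DataSet with the BinaryFormatter and inspect the stream, and you'll find the schema and data written as XML text:

private void ShowSerializedSize(DataSet ds)
{
   BinaryFormatter bf = new BinaryFormatter();
   using (MemoryStream ms = new MemoryStream())
   {
      // Even the "binary" formatter writes the DataSet out as XML text
      bf.Serialize(ms, ds);
      Console.WriteLine("Serialized size: {0} bytes", ms.Length);
   }
}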

This is improved in Whidbey. But until then, what's a Serializer to do?

Lots of people have done some great work on customized binary serialization of DataSets. To mention a few:

  • Microsoft Knowledge Base Article 829740 by Ravinder Vuppula, where he demonstrates his DataSetSurrogate class, which wraps a DataSet and converts the internal items into ArrayLists that the framework can then binary-serialize by default. It also contains a ConvertToDataSet() so that you can reverse the process and turn a de-serialized surrogate back into a DataSet. (See the sketch after this list.)
  • Dino Esposito demonstrates a GhostSerializer in his MSDN article, which does a similar ArrayList conversion for a DataTable.
  • Richard Lowe's Fast Binary Serialization
  • Dominic Cooney's Pickle
  • Angelo Scotto's CompactFormatter goes a step further with a serialization technique that doesn't rely on the BinaryFormatter at all, so it works on the Compact Framework - and it's even more compact than the BinaryFormatter.
  • Peter Bromberg builds on top of the CompactFormatter to support compression, using Mike Krueger's ICSharpCode SharpZipLib in-memory zip libraries.
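
To give a feel for the surrogate approach, here's a hedged sketch - DataSetSurrogate and ConvertToDataSet() come from the KB article above, so check the article for the exact shipping signatures:

private DataSet RoundTripWithSurrogate(DataSet ds)
{
   BinaryFormatter bf = new BinaryFormatter();
   using (MemoryStream ms = new MemoryStream())
   {
      // Wrap the DataSet so the framework can truly binary-serialize it
      bf.Serialize(ms, new DataSetSurrogate(ds));
      ms.Position = 0;

      // De-serialize the surrogate and convert it back into a DataSet
      DataSetSurrogate surrogate = (DataSetSurrogate)bf.Deserialize(ms);
      return surrogate.ConvertToDataSet();
   }
}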

DataSet Performance Tips with Indexes

Each of these tips relates to speeding up the searching of data in a DataSet. Although indexes are created inside a DataSet, they are not exposed. Knowing how to make ADO.NET generate them gives you the power to enhance the performance of your database application.

Tip #1: ADO.NET generates indexes on the primary key(s).

If you are searching on primary key fields, use DataTable.Rows.Find() instead of DataTable.Select(). The Find method will use the index to find the row. Knowing this may also make you think differently when defining primary keys.
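
For example - a minimal sketch, assuming a hypothetical Customers table whose primary key is CustomerID:

DataTable customers = ds.Tables["Customers"];

// Select() parses the filter expression and scans the rows
DataRow[] slow = customers.Select("CustomerID = 'ALFKI'");

// Rows.Find() uses the index ADO.NET maintains on the primary key
DataRow fast = customers.Rows.Find("ALFKI");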

Tip #2: Use a DataView.

DataViews will generate an index for a sorted column. If you find you are searching for data using a particular column, it would be a good idea to create a DataView sorted by the column(s) you need to search, then use the DataView.FindRows() method to get at your data. The DataView will have created an index for the sort column, and FindRows() will use it.
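
A minimal sketch, again assuming a hypothetical Suppliers table, searching by CompanyName:

private void FindByCompanyName(DataSet ds)
{
   // Sorting by CompanyName makes the DataView build an index on it
   DataView dv = new DataView(ds.Tables["Suppliers"],
      "", "CompanyName", DataViewRowState.CurrentRows);

   // FindRows() uses that index rather than scanning the table
   DataRowView[] matches = dv.FindRows("Tokyo Traders");
   foreach (DataRowView drv in matches)
   {
      Console.WriteLine(drv["CompanyName"]);
   }
}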

Tip #3: Be leery of Tip #2.

If you are trying to take advantage of DataView indexes, it's important to know when the indexes are created. ADO.NET generates an index each time you set the RowFilter, Sort, or RowStateFilter property of the DataView. So if you create a DataView and set these properties individually, the index will be rebuilt several times. To avoid this, be sure to use the DataView constructor that takes the table, filter, sort, and row state as arguments. (See the example below.) This way the index is created only once.

private void MakeDataView(DataSet ds)
{
   // Passing filter, sort, and row state to the constructor means the
   // index is built once, instead of once per property assignment
   DataView dv = new DataView(ds.Tables["Suppliers"],
      "Country = 'CA'", "CompanyName", DataViewRowState.CurrentRows);
   dv.AllowEdit = true;
   dv.AllowNew = true;
   dv.AllowDelete = true;
}

How can I improve the loading of my datasets?

DataSets maintain some internal indexes to improve performance for things like finds and selects.

When you are loading more than one row into a DataTable - with DataAdapter.Fill() or another technique - you can turn this index maintenance off by calling

MyTable.BeginLoadData()

Where MyTable is the reference to your DataTable, which could be

MyDataSet.Tables[0].BeginLoadData()

When you are done loading the data, don't forget to turn the index maintenance back on with:

MyDataSet.Tables[0].EndLoadData()

When loading multiple tables, turn index maintenance off and back on for one table at a time.
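
Putting that together - a minimal sketch, assuming a DataAdapter named adapter that fills a hypothetical Customers table:

DataTable table = MyDataSet.Tables["Customers"];
table.BeginLoadData();   // suspend index maintenance during the load
try
{
   adapter.Fill(table);
}
finally
{
   table.EndLoadData();  // rebuild the indexes once, after the load
}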

Another thing that DataSets do while rows are being added to their tables is validate them. Things like primary key uniqueness, foreign key referential integrity, and nulls in columns with AllowDBNull = false are some examples of things that must be checked. Again, you can save some cycles by turning this off while loading a DataSet and turning it back on afterward. This can be done with:

MyDataSet.EnforceConstraints = false

And of course when you are done loading, you can perform a:

MyDataSet.EnforceConstraints = true

Of course you may get a “ConstraintException” on this last line if there are problems with your data. With constraints enforced, you'd have gotten that exception as soon as the offending row was loaded. On a related note, you can check DataSet.HasErrors and each DataTable.HasErrors for errors. For each table you can call DataTable.GetErrors() to get the list of rows with errors. Each row has a RowError property that contains any error text related to the entire row, and a GetColumnError() method that you can use to test each column for a column-specific error message.
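
Here's a minimal sketch of the whole pattern - LoadAllTheData() is a hypothetical stand-in for your own bulk-load routine:

MyDataSet.EnforceConstraints = false;
LoadAllTheData(MyDataSet);   // hypothetical bulk-load routine

try
{
   MyDataSet.EnforceConstraints = true;
}
catch (ConstraintException)
{
   // Walk the tables and rows to find out exactly what went wrong
   foreach (DataTable table in MyDataSet.Tables)
   {
      if (!table.HasErrors) continue;
      foreach (DataRow row in table.GetErrors())
      {
         Console.WriteLine(row.RowError);
         foreach (DataColumn col in row.GetColumnsInError())
         {
            Console.WriteLine("{0}: {1}", col.ColumnName, row.GetColumnError(col));
         }
      }
   }
}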