Developing Streaming Pipeline Components - Part 1

Hello thrillseekers.  Well, here it is - Part 1 of ??? on developing streaming pipeline components for BizTalk Server.  My apologies for the delay in getting this post out after causing mass hysteria with the earlier announcement of a series of posts on Developing Streaming Pipeline Components.  This photo was taken outside my apartment shortly after the announcement was made.

[Photo: Mass Blogging Hysteria]

In this post I will cover the basics of what a "streaming approach" actually means, and why we ideally need to take a streaming approach to pipeline component development.  I will compare an in-memory approach to a streaming approach from a number of perspectives.  Then in the follow-up posts I will dive into the details of how this is achieved.

...

So, when documentation etc. refers to taking a streaming approach to pipeline component development, it basically means that the pipeline components do not load the entire message into memory - they work on the message "chunk by chunk" as it moves through the pipeline.  This has the obvious benefit of consuming less memory, which becomes critical in a BizTalk system that is processing either large messages or a high volume of messages.
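To make that a bit more concrete, here is a rough sketch of what the two approaches typically look like inside a pipeline component's Execute method.  This is an illustration only, not production code - the streaming version assumes a hypothetical decorator stream (MyProcessingStream) of the kind I will cover later in this series, and error handling and the rest of the component plumbing are omitted.

```csharp
using System.IO;
using System.Xml;
using Microsoft.BizTalk.Component.Interop;
using Microsoft.BizTalk.Message.Interop;

// In-memory version: the whole message body is loaded into an XmlDocument
// before Execute returns, so memory use grows with the size of the message.
public class InMemorySketch
{
    public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(pInMsg.BodyPart.GetOriginalDataStream());

        // ... process the DOM here ...

        MemoryStream buffer = new MemoryStream();
        doc.Save(buffer);
        buffer.Position = 0;
        pInMsg.BodyPart.Data = buffer;
        return pInMsg;
    }
}

// Streaming version: Execute only "wires up" a wrapping stream around the
// original body stream.  No data is read here - the processing happens chunk
// by chunk as later components (and ultimately the messaging engine) pull
// the data through the wrapper.
public class StreamingSketch
{
    public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)
    {
        Stream original = pInMsg.BodyPart.GetOriginalDataStream();
        Stream wrapped = new MyProcessingStream(original);   // hypothetical decorator stream
        pInMsg.BodyPart.Data = wrapped;
        pContext.ResourceTracker.AddResource(wrapped);        // let BizTalk dispose it when done
        return pInMsg;
    }
}
```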

Below is a comparison of an "in memory" approach and a streaming approach from various perspectives.  I will get into the details in subsequent posts, which will explain more about these areas.  There may be areas I have missed, but these are the ones that come to mind:

Memory usage per message
- Streaming: Low, regardless of message size
- In memory: High (varies depending on message size)

Common classes used to process XML data
- Streaming: built-in and custom derivations of XmlTranslatorStream, XmlReader and XmlWriter
- In memory: XmlDocument, XPathDocument, MemoryStream, VirtualStream

Documentation
- Streaming: Poor – many unsupported and undocumented BizTalk classes
- In memory: Very good – framework classes

Location of “processing logic” code
- Streaming: the Execute method only “wires up” the readers and streams; the actual execution occurs in the readers and streams as the data is read through them
- In memory: directly in the Execute method of the pipeline component

Data
- Streaming: re-created at each wrapping layer as the data is read through it (see the sketch below)
- In memory: read in, modified and written out at each component before the next component is called
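To illustrate the last two rows - the processing logic living in the stream itself, and the data being re-created at each wrapping layer - here is a bare-bones decorator stream.  It is a made-up example (a trivial ASCII upper-casing transform stands in for real processing) rather than one of the BizTalk classes, but it shows the shape those classes follow: nothing happens until somebody calls Read.

```csharp
using System;
using System.IO;

// A minimal forward-only decorator stream.  All of the "processing logic"
// sits in Read(): no work is done until a downstream component or the
// messaging engine pulls the next chunk of data through it.
public class UpperCasingStream : Stream
{
    private readonly Stream _inner;

    public UpperCasingStream(Stream inner)
    {
        _inner = inner;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Pull the next chunk from the layer below...
        int bytesRead = _inner.Read(buffer, offset, count);

        // ...and re-create it on the way through.  A trivial upper-casing of
        // ASCII letters stands in for real processing here.
        for (int i = offset; i < offset + bytesRead; i++)
        {
            if (buffer[i] >= (byte)'a' && buffer[i] <= (byte)'z')
            {
                buffer[i] = (byte)(buffer[i] - 32);
            }
        }
        return bytesRead;
    }

    // The wrapper is read-only and forward-only, like the message body stream.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```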

 

Now let's look at the advantages of both the streaming and in-memory approaches:

Advantages of streaming:

- Low memory use
- By utilising the built-in BizTalk classes, some of the functionality exposed in the out-of-the-box pipeline components can be embedded in your custom components
- Easy to add/re-use functionality by utilising the decorator pattern with the Stream and XmlReader classes (see the sketch below)

Advantages of in memory:

- Fast when message sizes are small (i.e. when server memory consumption is not stretched)
- Developers generally have experience using these classes
- Classes are generally fully featured, e.g. XPath and XSLT fully support the standards
- Often quicker/easier to code
- Developers are generally more familiar with this practice
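That decorator-pattern advantage is worth a quick illustration.  Because each piece of behaviour lives in its own wrapping stream (or XmlReader), separate concerns can be composed in Execute without any of them knowing about the others.  The two class names below are hypothetical stand-ins for whatever re-usable decorators you build up in your own library.

```csharp
using System.IO;
using Microsoft.BizTalk.Component.Interop;
using Microsoft.BizTalk.Message.Interop;

public class ComposedStreamingComponentSketch
{
    public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)
    {
        Stream original = pInMsg.BodyPart.GetOriginalDataStream();

        // Stack the decorators: each layer adds one piece of behaviour and
        // processes the data as it flows through, without buffering the
        // whole message.  (Both class names are hypothetical.)
        Stream stage1 = new NamespaceRewritingStream(original);
        Stream stage2 = new PropertyHarvestingStream(stage1, pInMsg.Context);

        pInMsg.BodyPart.Data = stage2;
        pContext.ResourceTracker.AddResource(stage2);
        return pInMsg;
    }
}
```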

 

And now for the limitations of each approach:

Limitations of streaming:

- Scenarios that require caching large amounts of data are generally not supported, or defeat the purpose (i.e. the cache takes up memory anyway) – e.g. certain XPath expressions (see the sketch below)
- Poor documentation/unfamiliar development patterns for many developers
- The built-in BizTalk pipeline components were not designed with extensibility in mind; often a re-write is required to add a small amount of functionality

Limitations of in memory:

- High memory use – this can cripple a server’s throughput when processing large messages or a large number of messages; because of this, a streaming approach is recommended
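As an example of the XPath limitation: BizTalk ships an (unsupported but widely used) XPathReader class in the Microsoft.BizTalk.XPath namespace that can evaluate a limited subset of XPath in a forward-only fashion over an XmlReader.  Something like the sketch below works because the expression can be matched as the nodes stream past; expressions that need to look backwards or know how many nodes are still to come (e.g. last(), count(), preceding-sibling axes) cannot be evaluated without buffering, which defeats the streaming approach.  Treat the exact API shape here as approximate - it is from memory, and the class is undocumented.

```csharp
using System.IO;
using System.Xml;
using Microsoft.BizTalk.XPath;   // Microsoft.BizTalk.XPathReader.dll

public static class XPathReaderSketch
{
    public static string ReadOrderTotal(Stream messageBody)
    {
        // A forward-only friendly expression: it can be matched as the
        // reader moves through the document, node by node.
        XPathCollection paths = new XPathCollection();
        paths.Add("/*[local-name()='Order']/*[local-name()='Total']");

        XPathReader reader = new XPathReader(new XmlTextReader(messageBody), paths);

        while (reader.ReadUntilMatch())
        {
            if (reader.Match(0))
            {
                // Only the matched node's value is materialised - the rest
                // of the message never has to be held in memory.
                return reader.ReadString();
            }
        }
        return null;
    }
}
```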

  

As stated earlier, the details in the tables are just what came to mind - I may have omitted some.  But basically, when you look at the advantages and limitations of each approach (note that I have included development factors here as well as runtime performance), it looks like the in-memory approach takes the points!?

However, you will notice that a number of these relate to documentation, developer experience and so on.  So although for runtime performance we want to use a streaming approach, for development factors (development time and complexity) it appears more advantageous to take the in-memory approach.  I would say that this is generally true, and it is the main reason that this is the route most often taken by developers.

That said, once you become experienced at developing with a streaming "mindset", and once you build up an internal library of re-usable classes to assist you, most of the extra development time is removed and you get the runtime benefits that the streaming approach buys you.

 

So, I have outlined the pros and cons of each approach, and compared them from both a technical-workings perspective and a development perspective.  Stay tuned for the next post, where I will start looking at how to technically achieve a streaming approach when developing your own custom pipeline components.