SlimDX Performance

Whoo, I burnt through a lot of posts last week. I really ought to pace myself a bit more conservatively.

I’m not equipped to undertake a full treatment of SlimDX’s performance right now, as that’s a rather touchy set of problems. However, I did want to provide a general overview of what kinds of things affect SlimDX performance and what you can expect from your own programs. Strictly from the library’s point of view, performance is actually surprisingly good. It is not as good as unmanaged code — and can never be — but it is much better than you might think.

There are a few key sources of inefficiency in SlimDX, which are varied and complex. Some are avoidable, and some are not. Some are inherent to the process as a whole, and some are results of architectural decisions in building the library. The name “Slim” officially means that there’s a very minimal barrier between you and DirectX. While this is basically true, there are still a lot of details to be aware of.

The very process of exposing a native API to managed code is complex. Apart from converting the parameters themselves — a process which SlimDX is exceedingly efficient at — there is a lot of bookkeeping to do in making sure the call stacks stay correct, that permissions are handled properly, that exceptions are trapped safely, and so on. C++/CLI handles all of this for us behind the scenes, but as a result there is a substantial cost for making ANY native call. We have studied techniques for addressing those costs by allowing you to batch certain kinds of calls (e.g. SetRenderState), but nothing has yet shown itself to be clearly advantageous.

The most obvious candidate for performance issues is the math library. Floating point is well known for being performance critical in games (and certain other software), and that’s why we have processor extensions like SSE to make it as fast as possible. The D3DX routines that Microsoft provides are heavily optimized, and take full advantage of processor extensions. Unfortunately, using these routines from managed code is very expensive, so we instead provide completely managed implementations. XNA takes the same approach, but MDX calls into D3DX. These managed functions will JIT into scalar SSE which, while not optimal, is still very fast. Rough benchmarks have shown that completely native D3DX code has about a 10% advantage over our implementation. In other words, if you spend 20% of your time just doing math, a native code version could be 2% faster in that respect. In order to get the best possible performance out of SlimDX — or XNA, for that matter — make sure to use the overloads that take ref/out parameters. It won’t get you Havok levels of math performance, but it’s still extremely fast.
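The reason the ref/out overloads help is that in C#, a large value type like a 4x4 matrix is copied at every call boundary when passed by value, and the by-value overloads also hand back a freshly built result each call. The ref/out forms pass addresses and reuse caller-owned storage instead. A rough Python analogy of the two call shapes (the names here are illustrative, not the SlimDX API):

```python
def multiply(a, b):
    # By-value style: builds and returns a brand-new 4x4 result on every call.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def multiply_out(a, b, out):
    # ref/out style: writes into storage the caller already owns,
    # so a tight loop pays for no per-call result allocation.
    for i in range(4):
        for j in range(4):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(4))

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
m = [[float(i * 4 + j) for j in range(4)] for i in range(4)]

result = [[0.0] * 4 for _ in range(4)]
multiply_out(identity, m, result)
assert result == multiply(identity, m) == m
```

In a render loop that multiplies thousands of matrices per frame, reusing one destination this way is the difference the ref/out advice is pointing at.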

The vast majority of calls into the DirectX API itself return an HRESULT error code. In C++, you’re free to check or ignore this as you please. MDX took the fairly aggressive step of checking every single return value and converting failures to exceptions; SlimDX and XNA both follow essentially that same model. SlimDX is particularly powerful, because it allows you to trap specific failed results, to control what actually causes an exception, or to disable exceptions outright. There’s also a LastError facility that stores a thread-local value for the last result code recorded by SlimDX. It’s a lot of truly useful flexibility, but it comes with a cost. For the March 2009 release, we leaned out the error checking significantly, making it much, much cheaper to make short DirectX calls. For a successful result, you basically pay for a method call, a comparison, and a small write to thread-local storage. A failed result triggers a much slower path. In the end, we are occasionally slower than MDX, but usually not, thanks to improvements in .NET 2.0 and our improved architecture overall. We’re also nearly always faster than XNA on the CPU; XNA spends a lot of time doing thorough parameter validation.
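The shape of that result-checking path can be sketched in a few lines: the success case costs one comparison and one thread-local write, while a failed HRESULT branches to a slower path. This mirrors the idea described above, not SlimDX’s actual implementation:

```python
import threading

_tls = threading.local()  # per-thread storage for the last result code

def record_result(hresult):
    _tls.last_error = hresult          # the cheap write every call pays for
    if hresult < 0:                    # failed HRESULTs are negative
        # Slow path: only failures pay for exception construction.
        raise RuntimeError("call failed: 0x%08X" % (hresult & 0xFFFFFFFF))
    return hresult

def last_error():
    return getattr(_tls, "last_error", 0)

record_result(0)                       # S_OK: fast path
assert last_error() == 0

E_FAIL = -2147467259                   # 0x80004005 as a signed 32-bit value
try:
    record_result(E_FAIL)
except RuntimeError:
    pass
assert last_error() == E_FAIL          # LastError still reflects the failure
```

Because the storage is thread-local, two threads making DirectX calls never stomp on each other’s LastError value.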

Be wary of properties, especially ones that return complex value types. Many properties will cause a DirectX call on every invocation. If the return type is large, we’re probably reading an entire struct back, and then copying it to your variable. The Description property on many objects, for example, causes a GetDesc call. If you check obj.Description.Width and obj.Description.Height, that is two calls to GetDesc and two full copy operations. Cache the return value if you’re invoking DirectX a lot in performance-sensitive areas of your code!
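Counting the calls makes the cost visible. This is a toy class standing in for the pattern, not the SlimDX API: each property access maps to one native GetDesc call plus a full struct copy, so caching the value turns N calls into one:

```python
class Texture:
    get_desc_calls = 0

    @property
    def description(self):
        Texture.get_desc_calls += 1              # stands in for a native GetDesc
        return {"width": 256, "height": 128}     # entire struct copied back

tex = Texture()

# Naive: two property accesses mean two "GetDesc" calls and two copies.
width = tex.description["width"]
height = tex.description["height"]
assert Texture.get_desc_calls == 2

# Cached: one access, one call, no matter how many fields you read.
desc = tex.description
width, height = desc["width"], desc["height"]
assert Texture.get_desc_calls == 3
```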

Anything in SlimDX that returns a ComObject derivative is typically translating from a native pointer to a managed object. In SlimDX, this invokes some internal machinery that will take a lock, do a table lookup, possibly insert into the table, and then construct an object if necessary. In the vast majority of cases this should never be an issue, but if you’re getting lock contention on our object table there is potential for problems. We don’t really have good data on this one either way, so it’s probably safe to assume it’s not a concern. The lock is only held for very short amounts of time.

Lastly, cache all of your effect handles when working with D3DX effects! Because these are string pointers internally, and D3DX doesn’t support Unicode for them, we have to allocate every time a handle is created. Passing a string instead of a cached effect handle will cause an allocation on every call, every frame. The net effect on performance of passing strings directly is quite extreme, since it follows slow paths in both SlimDX and D3DX.
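The effect-handle pitfall in miniature: passing a string where a handle is expected forces an allocation on every call, while a handle fetched once up front is allocated once. This is a toy model of the pattern, not the D3DX API:

```python
class Effect:
    def __init__(self):
        self.handle_allocations = 0
        self._handles = {}

    def get_parameter(self, name):
        # Stands in for GetParameterByName: creating a handle allocates,
        # because the native side needs an ANSI copy of the string.
        self.handle_allocations += 1
        return self._handles.setdefault(name, ("handle", name))

    def set_value(self, handle_or_name, value):
        if isinstance(handle_or_name, str):
            # Slow path: the string has to be converted to a handle first.
            handle_or_name = self.get_parameter(handle_or_name)
        # ...write the value through the handle...

fx = Effect()

# Passing the string every call: one allocation per call, per frame.
for _ in range(100):
    fx.set_value("worldViewProj", None)
assert fx.handle_allocations == 100

# Caching the handle once: a single allocation, then none at all.
h = fx.get_parameter("worldViewProj")
for _ in range(100):
    fx.set_value(h, None)
assert fx.handle_allocations == 101
```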

That’s a somewhat high-level overview, and I’m planning to add documentation that discusses SlimDX’s performance characteristics in more detail. However, we’ve been completely silent on performance and I decided to at least rectify that in brief. If I had to summarize, I’d say that in the general case:

  • XNA and SlimDX are both faster than MDX, and sometimes much faster.
  • SlimDX is faster than XNA at specific tasks, but it’s unlikely to make a noticeable difference for most people.
  • SlimDX is still slower than native code, probably on the order of 5%-10% with respect to DirectX calls. So if 50% of your time is spent calling SlimDX/DirectX (VERY high), you could be running as much as 5% faster in native code in that respect, but 3% is a more realistic estimate. For sane applications, even less.

Binaural Audio Follow-up

When I last posted, the demo video was the Barbershop recording. Although it’s an excellent video, it’s not ours, and the only media I had to show that was actually by us was an eight-second clip that was not terribly impressive. That is no longer the case!

Grab headphones and listen!

The game is previewed, with the real in game sound, near the end. But listen to the whole thing, and remember that it works in real time on pretty much any current platform. With any luck, we’ll have an SDK ready for Q1 2010, maybe even earlier.

Oh, and that’s still part 1. If you’re interested in animation and if you think that what NaturalMotion did was pretty cool, you’ll be in for a big treat later on.

P.S. I answer all comments, generally speaking, although it might be a bit delayed at times. So don’t think I’m not noticing.

Michael Jackson

There’s not really much to be said beyond what’s already been said, but at some level the man was already dead to me. It was clear from the early 90s or so that he was very far gone, which is very sad. At least it seems like most people will remember him as the King of Pop, and not the wide array of messes that came after his late-eighties heyday.

SlimDX Supports Direct3D 10.1, too

SlimDX has had Direct3D 10 support for a long time, and it’s almost certainly the single best way to get 10 support from C# or any other .NET language. While it’s true that the very original prototype didn’t plan on it, that was over two years ago and it’s very much a first class citizen. Although we have Direct3D 11 support now (see this post), 10 is still a priority and we’re not leaving it behind.

As part of that commitment, SlimDX now has full Direct3D 10.1 support. It will be part of our next release, which looks like it will be August 09 at this point. Although nobody specifically asked for 10.1, it’s a very useful API in its own right. Sure it adds some features, but just like DirectX 11, it has feature levels. In other words, even though DirectX 10 requires DX10-class hardware, 10.1 works on DX9 or DX10 hardware! The caveat is that Vista SP1 or later is required, but if that’s an acceptable limitation, 10.1 is great to have for that reason alone.

And as always, if you find missing features or bugs or whatever, let us know! D3D 10 is in use by several of our customers in production environments, and we’re determined to continue being the single best option for using it from C#, VB.NET, or anything else in the .NET ecosystem.

The SlimDX Architecture, Part 3

I asked Josh to write up a follow-up to the first two parts, covering the ancillary object support in our ObjectTable, which he graciously agreed to. Here it is.

I want to switch gears a little bit and talk about a class you’ve almost certainly seen if you’re doing any SlimDX work: DataStream. I’d describe ObjectTable and ComObject as the heart of SlimDX, and DataStream as the soul. It’s been exceedingly difficult to get right, with 41 revisions of the header and 50 revisions of the source, even though the feature set has only changed a little bit over time. In fact, we just changed it again. It’s critical to most people because it does the one thing everybody needs to do — transfer data from the application to the API, and occasionally vice versa.

The original class in MDX was called GraphicsStream, and SlimDX used the same name for a while. We decided in r138 to rename the class to DataStream, since it was fairly obvious that it wasn’t graphics-specific in any way. The name was never quite ideal, but MemoryStream is taken and we couldn’t come up with anything better. Besides, it’s kinda catchy. Although the goal of the class was kind of vague at first, it fundamentally represents what a pointer has become in managed code. It’s not quite as elegant in many ways, but unlike a pointer it is a hell of a lot safer.

The DataStream typically replaces a pointer in the underlying API, but a pointer in the unmanaged world is a relatively simple beast. In the managed world, we have a whole host of problems. There are a few different places that a user could want to copy data to or from:

  • An array on the managed heap
  • A managed Stream
  • A buffer on the native heap (owned by anybody)
  • Memory in an ID3DXBuffer

In XNA, the approach to handling this complexity is to dispense with it entirely, and limit data transfers to managed arrays only. I criticized this decision heavily when the beta was released, but it’s not necessarily an unreasonable one. I just didn’t think it was the right move. It does fit in with the overall design strategy of XNA, after all. However, it doesn’t fit the SlimDX design philosophy, so we have a somewhat different approach.

DataStream can do all of these things. That’s why the name is so vague; trying to quantify the class at all is difficult. It exists to mediate data transfer between an application and DirectX, in either direction; it provides the standard Stream interface. Every DataStream has a backing store that ultimately boils down to a pointer, but sometimes there are auxiliary members. (A pin handle or an ID3DXBuffer are currently supported.) You can even allocate memory off the native heap for use via one of the constructors, in case that’s useful.

DataStream’s code is very simple and easy to understand, but it’s fraught with subtleties. The bounds checks have to be exactly right, which has been surprisingly difficult to do. It’s also important to use the correct pointer; sometimes this is the beginning of the stream and sometimes it’s the stream position. Despite those pitfalls, it’s not a class that looks threatening…but it does power nearly all of the data transfer in SlimDX.
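The two subtleties called out above can be shown in miniature: the bounds checks must be exact, and writes advance the current position while the underlying data pointer stays at the base of the stream. A toy stream, not SlimDX’s code:

```python
class DataStream:
    def __init__(self, size):
        self._buffer = bytearray(size)
        self.position = 0

    def write(self, data):
        end = self.position + len(data)
        if end > len(self._buffer):          # exact bounds check: off-by-one
            raise EOFError("write would run past the end of the stream")
        self._buffer[self.position:end] = data
        self.position = end                  # advance the *position*...

    def data(self):
        return bytes(self._buffer)           # ...but expose the *base* of the stream

ds = DataStream(4)
ds.write(b"\x01\x02")
ds.write(b"\x03\x04")                        # exactly fills the buffer: legal
assert ds.data() == b"\x01\x02\x03\x04"

try:
    ds.write(b"\x05")                        # one byte past the end: rejected
    assert False
except EOFError:
    pass
```

Using `ds.position` where the base was meant (or vice versa) is precisely the kind of bug that made this simple class hard to get right.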

Revolution, Part 1: Audio

A friend of mine has been working very hard on a project which involves two revolutionary technical innovations. I am not usually the kind to care for revolutions, but I honestly believe that both of these are really, really significant. I’m actually hoping to join up and help make these things into a proper product, in fact. These are things you have not seen in games, and he has it working in prototype form on a mobile platform. That’s unusual in this industry.

Here, listen to this YouTube video while reading. You need headphones for this, which is why the tech is launching on the iPhone first. The video is not ours, but it is a very good demonstration. Just listen and read. Remember, it doesn’t work without headphones — if you don’t have any, it might actually be worth reading this only after you’ve found some.

The underlying principle is binaural recording. (This is not the revolutionary bit, and has been around for quite a long time.) Games have had 3D audio for ages, so that is in itself nothing to get excited about. Current 3D audio basically works by modifying channel volumes for playback of a mono sound in order to simulate a 3D space. It works alright if you have a 5.1 setup, but it’s not terribly effective in stereo and in general the effect is a bit weak. Binaural recording, however, is a method of recording sounds with a pair of microphones and an actual head model, which attempts to produce a stereo sound that simulates what our ears hear. You need headphones because of the recording methodology, and if you’re listening to the video I linked, you’re probably spazzing out right now.

There is a catch to all this, which is that nobody can synthesize it. The sound is recorded by physically placing it relative to the head, so you can’t go back later and place it at an arbitrary location. (Some people have pointed out that there are processors and algorithms that try, but they are expensive and don’t really work well.) That’s essentially why it’s never shown up in games, although headphones-only isn’t a thrilling restriction, either. Still enjoying the barber?

Here is his binaural recording. (It’s 8 seconds, just pause the barber.) There’s one key difference, though. That’s not a binaural recording of a sound being moved in front of a recording head. It is done in real time. (This is the revolutionary bit.) This friend of mine has figured out how to do it. The original implementation worked very well but required a lot of memory and processing power. But the current system is efficient enough to fit on the iPhone. I’ve seen and heard the demo, working in real time off an iPod Touch. It works well enough to make your skin crawl, like those scissors are probably doing right now if you’re still listening to the barber.

I can’t really say too much about how he’s pulled it off, because we think it’s kind of a big deal. You’ll see iPhone game releases with the technology later this year, and hopefully by early to mid next year we’ll be licensing an actual SDK for whatever platform you might care to use. We’re fairly confident this is technology people will want, and hey, it wouldn’t hurt to forward this post around the office.

Finalizers and DirectX 11

Long story short: I’m not so sure about them.

The reason SlimDX does not, by and large, include finalizers is because of the threading issues involved. You can enable multithreaded devices in D3D 9 and 10, but it’s not really a good idea in most cases and we don’t see any reason to handle it differently. Direct3D 11, however, is making a big push for multithreaded rendering, and that means we could credibly enable finalizers just for it. There is a single-threaded flag, but we can detect that and avoid finalizing those objects. But there’s more to it than that.

Let’s recap. Finalizers generally only make sense as part of the IDisposable pattern. The idea is that you expose Dispose to allow deterministic destruction of unmanaged objects, but in the case that you don’t call Dispose — deliberately or accidentally — the finalizer comes around and does it for you. This is mostly designed for Windows handles and the like, which I think is why it fails to really take threading into account. Because most DirectX objects are not free-threaded, that goes to hell in SlimDX. We chose to eschew the finalization part of the pattern completely, and nobody really seems to mind.
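The pattern just described can be sketched with Python’s rough equivalents: an explicit dispose method for deterministic cleanup, with `__del__` playing the finalizer’s role as the nondeterministic safety net. (In C# these are Dispose() and a finalizer; this is an analogy, not SlimDX code.)

```python
class NativeResource:
    released = []                       # records cleanup for demonstration

    def __init__(self, handle):
        self.handle = handle

    def dispose(self):
        # Deterministic path: the caller releases the unmanaged resource.
        if self.handle is not None:
            NativeResource.released.append(self.handle)
            self.handle = None

    def __del__(self):
        # Fallback path: if dispose() was never called, clean up "eventually",
        # on whatever thread the runtime chooses. That thread-affinity problem
        # is exactly what makes this unsafe for most DirectX objects.
        self.dispose()

r = NativeResource(42)
r.dispose()
assert NativeResource.released == [42]
r.dispose()                             # double-dispose must stay a no-op
assert NativeResource.released == [42]
```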

From a strictly technical standpoint, enabling finalizers for 11 would be easy. It would take less time than writing this, actually. Josh pointed out that it’s a break in consistency, but I believe we can at least make an argument for it being part of the multithreaded support. The Windows API Code Pack provides finalizers, which tends to support the idea that we should do it. However, there are a number of issues I’m concerned about, and for June we almost certainly won’t have them. I’m not sure we ever will.

At some level, finalizers exist to patch up programmer error. While this is not necessarily a bad thing, I have some misgivings about it. SlimDX doesn’t clean up after you, but it does provide some very powerful facilities for finding out what didn’t get cleaned up, and we even have some options for expanding that functionality. Finalizers will tend to interfere with that tracking in unpredictable (but innocuous) ways. If you have repeating leaks, maybe some will get finalized and maybe some won’t. The ObjectTable won’t be able to provide you with reliable results. We could provide a configuration option for it, like we do for exceptions — but that’s not a good idea. Basically if you’re exposing a choice to the user, it means you didn’t have the balls to actually make the choice and you’re passing the buck. I think Raymond Chen said something similar at one point, in not quite the same language.

There’s also a question of interop objects. We can track the single threaded flag in SlimDX code, but not in externally created objects. Sure it might be possible to get the device for an object and check its caps (I haven’t checked), but even if that works sometimes, it’s extra overhead and I’m guessing that the number of special cases involved is prohibitive. We could assume all external objects are multithreaded, and then someone will come complaining about a decision that we can no longer reverse, and which is impacting their performance negatively. Supporting interop smoothly is a critical use case for us, not least because it’s a major differentiating point against XNA. It has worked very well so far and I’m not thrilled about potentially breaking it now.

I guess I was lying at the beginning, actually. I am fairly sure about them. In fact, I’m fairly sure we won’t have them. They are looking to cause more problems than they solve, and nobody seems to care about the problem they’re supposedly solving in the first place.

The SlimDX Architecture, Part 2

Now we get into the real substance of things. In Part 1, I discussed some of the historical elements that went into the design of ComObject and ObjectTable, but stopped short of explaining the current system. In this segment, I’ll cover the essentials of how things work today. You should also read COM and SlimDX, Part I, as it does a better job than me of covering some of the details. (Remember, these are test drafts for new documentation.)

As SlimDX and its userbase grew, a number of cracks began to show up in the design. At the heart of the problem was that a lot of new objects were created every time you tried to do something, which meant a lot of things to dispose. Sure we had no problems with events chaining endlessly, but even with the leak tracker it was a real pain in the ass to make sure everything was handled properly. As a result, we also had to be very, very clear about where objects were created. With an object such as Texture that allows you to get its parent Device, a Device property was out of the question. It had to be GetDevice(), or the amount of stealthy object creations could get out of hand very quickly — and for no apparent reason.

Consider what a function like GetDevice() has to do. The native API gives us an IDirect3DDevice9*, which we have to convert into a managed reference to Device. The obvious way to do this — the way MDX worked — is to create a new Device object, passing it the pointer to be used internally. The device has already had the COM AddRef() function called to adjust the reference count, and the reference count is restored when Dispose() is called. It’s a simple and efficient strategy to implement, but it creates a huge number of allocations. We referred to this as the GetDevice() != GetDevice() problem at the time, because we didn’t implement equality operators, so the new objects didn’t even compare equal.

Josh had the initial idea to address this. I never did research this in depth, but when you ask Visual Studio to add a reference to a COM component, it creates something called a runtime callable wrapper. Apparently the wrapper system uses some kind of table of objects; I don’t really know the details. The basic idea, though, was to construct a table of mappings from native pointers to managed instances. By switching over to factory construction for most objects internally, we were able to make the table work.

ObjectTracker became ObjectTable, and maintains a Dictionary of IntPtr -> ComObject. Internally, when we call a method that always creates a new object (e.g. D3DXCreateTextureFromFile), the new object is added to the table. For methods that might return an object that is already in the table, we check the table to see if it contains the pointer we just got. If it does, the table calls Release on the object and then returns the original managed instance. When an object is disposed, we call Release and then remove it from the table. It’s set up this way in order to maintain a very important invariant:
SlimDX is always responsible for at most one reference on a COM object.

By maintaining this invariant, lifetime management is much easier to control. Instead of having to match up a large list of AddRef and Release calls to make sure everything comes together cleanly, all you have to do is Dispose the object once, usually the first reference to it that you saw. Stuff like GetDevice() became properties, which return the original instance without any caveats and without creating memory management headaches. Actually, that’s not quite true — there is one caveat if you’re doing interop. COM and SlimDX, Part II has a proper explanation.
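The table flow described above can be sketched as a toy: a lock-protected map from native pointer to managed wrapper, where the extra AddRef is dropped whenever a pointer we already track comes back. The refcounts here are simulated, not real COM:

```python
import threading

native_refs = {}                          # pointer -> simulated COM refcount

def addref(ptr):
    native_refs[ptr] = native_refs.get(ptr, 0) + 1

def release(ptr):
    native_refs[ptr] -= 1

class ComObject:
    def __init__(self, ptr):
        self.ptr = ptr

class ObjectTable:
    def __init__(self):
        self._table = {}
        self._lock = threading.Lock()

    def get_or_create(self, ptr):
        # Every native call that hands back an interface has already AddRef'd it.
        with self._lock:
            obj = self._table.get(ptr)
            if obj is not None:
                release(ptr)              # drop the extra reference...
                return obj                # ...and return the original instance
            obj = ComObject(ptr)          # first sighting: wrap and track it
            self._table[ptr] = obj
            return obj

    def dispose(self, obj):
        with self._lock:
            del self._table[obj.ptr]
        release(obj.ptr)                  # the one reference we were holding

table = ObjectTable()
addref(0x1000)                            # e.g. D3DXCreateTextureFromFile
tex = table.get_or_create(0x1000)

addref(0x1000)                            # e.g. the Device property, same pointer
assert table.get_or_create(0x1000) is tex # GetDevice() == GetDevice() now holds
assert native_refs[0x1000] == 1           # at most one reference, ever

table.dispose(tex)
assert native_refs[0x1000] == 0
```

A single Dispose balances the books no matter how many times the same pointer was looked up, which is exactly what the invariant buys.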

There is one more issue outstanding with the ObjectTable, which Josh will write up for me soon. For part 3 of my series, I’ll step away from this stuff and take a look at DataStream.

The SlimDX Architecture, Part 1

As I mentioned earlier, there’s very little documentation on how the internals of SlimDX work, and why they work that way. This is important documentation, both for the team and for anyone looking to make changes or add things. I do intend to add full documentation; however, how to lay it out properly isn’t quite clear to me. I’m writing this series of blog posts as a dry run to get a sense for what topics are significant and what ordering would make the most sense. I’ll try a chronological approach first, and discuss the history of how things used to work.

The original SlimDX code was not intended to be the all-encompassing library it is now. It was just a simplified copy of Managed DirectX, and so it used the same design. The only real difference was that I took out the event stuff. I wasn’t using them anyway, due to Tom Miller’s post about how they were basically kind of dangerous. I wrote SlimDX by tearing MDX out of a game I’d written the previous semester, and retrofitting it with SlimDX until it worked. That game was simple and used a simple resource management model, so there simply was no need to examine the behavior further. Everything got disposed in its proper way, and that was that.

Needless to say, this isn’t a very nice approach to things in the managed world. There was an early attempt to re-enable finalizers, but of course most of DX isn’t thread safe and this approach was abandoned quickly. I came up with a stopgap, which was to incorporate registration of every object via the common base class into an ObjectTracker.

The common base class was one of the design cues taken directly from MDX, although the idea is relatively straightforward. The original class was even named the same as in MDX: DirectXObject. It’s not in the code quite from the beginning, but at r6 it’s damned close. It eventually gets renamed to BaseObject and then finally ComObject in r320. It has been a consistently useful class to have, incorporating a lot of the basic boilerplate of handling COM objects in a way that makes sense in the managed world. It plays a relatively small role for a long time, but later in the SlimDX lifecycle this class becomes one of the most complex and subtle ones we have. The current version is truly terrifying. More on that later.

ObjectTracker was kind of a cool tool, because it meant that at any point in time you could see exactly what objects were outstanding and where they had come from. We even hooked the process exit to report the entire list to the debug spew, so you’d get a nice list of what you leaked. It had a public toggle, so you could turn it off and on as desired. It was added in July of 2007 (r110) and the functionality is still there today.

However, although the functionality is still included in SlimDX, the ObjectTracker is long gone. It was replaced in late February 2008 (r389) with the ObjectTable. ObjectTable is nearly identical to the old ObjectTracker, but with one fundamental behavior that changed everything. The current SlimDX design is essentially defined by the interactions between the ObjectTable and ComObject classes, which took a very long time to get right. Every little detail is critical, and if you don’t understand or pay attention to those details — and we’ve noticed at least a few people don’t — you’re liable to break the library.

I’ll go more into depth on those two and why they work the way they do in the next entry.