SlimTune 0.2.1 Released!

I’ve been working towards this milestone for some time now, and it’s finally ready for public consumption. SlimTune 0.2.1 is out!

SlimTune Profiler on Google Code

This version sports a newly retooled interface, along with numerous stability improvements and several new features. To recap, here are some of the cool things SlimTune can do:

  • Live Profiling – Why should you have to wait until your program has ended to see results? SlimTune reports results almost immediately, while your code is still running. See your bottlenecks in real-time, not after the fact.
  • Remote Profiling – Other tools must be run on the same machine as the application being profiled, which can be inconvenient and worse, can interfere with the results. Remote profiling is an integral part of SlimTune.
  • On-Demand Profiling – Just because your code’s running doesn’t mean you want the profiler interfering. SlimTune lets you profile exactly when and where you need it, so you can focus on the results you need instead of filtering uninteresting data.
  • SQL Database Storage – Instead of developing a custom, opaque file format, we use well known SQL database formats for our results files. That means you don’t have to rely on SlimTune to be able to read your files.
  • Multiple Visualizations – Most performance tools offer a single preset view of your data. Don’t like it, or want it sliced differently? Tough. With SlimTune, multiple visualizers ship out of the box to show you what you want to see, the way you want to see it.
  • Plugin Support – We’re doing our best to produce the most useful visualizations, but that doesn’t mean your needs are the same as everyone else’s. A few dozen lines of standard SQL and C# code are all it takes to drop in your own view of the performance data, focused on what YOU want to see.

And yes, I know the writing is a bit cheesy, but product pages usually are.

SlimTune was started because the .NET profilers out there suck, or are expensive. I thought there should be a good, open source product out there to support all the developers who are doing real work, but can’t necessarily spend hundreds of dollars per seat for licensing. The more people using SlimTune, the better the feedback will be and the better the thing will get. Please help spread the word, and if you’re using SlimTune for something cool, let me know!

SQLite Support in SlimTune

I’ve mentioned before that most of SlimTune’s core functionality is pluggable. This actually includes the underlying data storage system. The app works through a fairly simple interface, and even SQL is only used by the visualizers, not the core program. To date, the engine in use has been Microsoft’s SQL Server Compact Edition (SQLCE). With the next release, I’m introducing support for SQLite as well.

Let’s recap. Every other profiler I’m aware of works in more or less the same way. While the application is running, data is written to a file. Once the run is complete, the frontend steps in to parse the data and visualize it one way or another. SlimTune on the other hand allows (encourages, in fact) live visualization while the application is running. It also supports very different visualizations that slice the data in entirely different ways. The enabling technology for these features is the use of an embedded database. I’m not sure why no one else has taken this approach, but my theory at the time was that it was a simple matter of performance. Databases have to be manipulated with SQL, have to store everything in tables, etc. I suspected that updating a database so often was a problem. My goal was to write a blindingly fast database backend that would be capable of handling large amounts of data efficiently.

I was very, very successful. There are a number of application-side tricks to batch writes together, and I modify the database tables directly instead of issuing queries. The code is quite complex and annoying to maintain, but the results are nothing short of fabulous. With the standard sampling rate, the database update takes 3-10 ms every second or two, and the frontend process accumulates about one second of CPU time for every thirty on a target single-threaded process eating 100% CPU. Live queries don’t really make a dent at all, since they’re so infrequent. Overall, I’ve been thrilled with the performance I’ve gotten out of SQLCE.
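The application-side batching idea can be sketched in a few lines. This is an illustrative Python sketch using the standard sqlite3 module, not SlimTune’s actual C#/SQLCE code; the `BatchedWriter` class and `samples` table are made-up names:

```python
import sqlite3

class BatchedWriter:
    """Accumulate profiler samples in memory, then flush them in one batch."""
    def __init__(self, conn, batch_size=256):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []

    def record(self, function_id, hits):
        self.pending.append((function_id, hits))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One executemany inside one transaction instead of many tiny writes.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO samples(function_id, hits) VALUES (?, ?)",
                self.pending)
        self.pending.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples(function_id INTEGER, hits INTEGER)")
writer = BatchedWriter(conn, batch_size=4)
for i in range(10):
    writer.record(i % 3, 1)
writer.flush()
total = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(total)  # 10
```

The point is simply that the hot path appends to an in-memory list, and the database only sees occasional bulk transactions.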

I decided to add SQLite support for a few reasons. First of all, it’s cross platform and I’m looking to enable Mac OSX support (and potentially Linux) in this release series. Second, it doesn’t require installation and so distribution could potentially be simplified somewhat. Third, SQLite supports in-memory databases, which MS SQL does not. Some people have complained about the need to create a file every time they run a profile, and that will no longer be necessary.

There was one more reason, though: I was honestly curious how SQLite’s performance compares. I started by deciding I didn’t like any of the existing C# wrappers, so I wrote my own. (I wasn’t interested in ADO.NET support.) It’s a simple PInvoke deal; it took me about an hour to build the support I needed. The SQLite implementation is also much, much simpler than my SQLCE code. As I said before, I work directly with the tables in CE, which is fairly annoying to code. There’s no support for that sort of thing in SQLite, so I simply issue prepared statements. All the application-side caching tricks are still there, but writes use normal SQL, one entry at a time. (No batched inserts!)
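For a feel of what a thin wrapper over the SQLite C API looks like, here’s the Python analogue of a hand-rolled PInvoke layer, using ctypes. This is a sketch, not SlimTune’s wrapper; it assumes libsqlite3 is installed and discoverable, and it mirrors the prepare-once, bind/step/reset-per-row pattern:

```python
import ctypes
import ctypes.util

SQLITE_ROW = 100  # constant from sqlite3.h

# Load the SQLite shared library directly, the way a PInvoke wrapper would.
lib = ctypes.CDLL(ctypes.util.find_library("sqlite3"))
db, stmt = ctypes.c_void_p(), ctypes.c_void_p()

lib.sqlite3_open(b":memory:", ctypes.byref(db))
lib.sqlite3_exec(db, b"CREATE TABLE samples(hits INTEGER)", None, None, None)

# Prepare once, then bind/step/reset per row: one entry at a time, no batching.
lib.sqlite3_prepare_v2(db, b"INSERT INTO samples VALUES (?)", -1,
                       ctypes.byref(stmt), None)
for i in range(5):
    lib.sqlite3_bind_int(stmt, 1, i)
    lib.sqlite3_step(stmt)
    lib.sqlite3_reset(stmt)
lib.sqlite3_finalize(stmt)

lib.sqlite3_prepare_v2(db, b"SELECT COUNT(*) FROM samples", -1,
                       ctypes.byref(stmt), None)
assert lib.sqlite3_step(stmt) == SQLITE_ROW
count = lib.sqlite3_column_int(stmt, 0)
lib.sqlite3_finalize(stmt)
lib.sqlite3_close(db)
print(count)  # 5
```

A real wrapper would also check every return code, but the shape of the API (open, prepare, bind, step, finalize) is exactly what the C# version has to expose.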

And how is performance? Equivalent to SQLCE, actually, with far less code and effort. Figures, right? It took a little legwork to get there, but nothing compared to what I spent on making the SQLCE implementation fast. When I started, the amount of time spent in the database was catastrophically long, and I thought maybe I’d wasted the effort. SQLite has a few options which are important to look at in order to get the best possible performance out of it. These options are called pragmas, and they turned extremely poor initial performance into an implementation that is good enough that I’ve now marked the SQLCE code obsolete.

I changed two pragmas in order to get the performance I wanted. Remember that the way I’ve written the code, every single data point (several thousand a second) is a separate transaction. I tried to combine them into one transaction but that failed miserably. I ended up specifying two pragmas:

m_database.Execute("PRAGMA synchronous=OFF");
m_database.Execute("PRAGMA journal_mode=MEMORY");

The first setting had a particularly dramatic effect, about 2000x in fact. It turns out that SQLite’s default behavior is to force a filesystem flush to disk of the database after every transaction ends, which is hideously slow. (I’m told that on some systems, it forces a flush of the ENTIRE filesystem’s pending writes.) Setting synchronous to off disables filesystem flushes, and relies on the OS to get things to disk safely. The second setting moves the transaction journal to memory instead of using a file. Again, with thousands of transactions this is dramatically faster than creating file traffic. Unfortunately it does mean a high likelihood of file corruption if your app crashes, but .NET’s ability to fail fairly gracefully and still run finalizers offers a lot of protection against that.
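The same two pragmas can be exercised from Python’s sqlite3 module, which is how I’d suggest experimenting with them before committing to a configuration. This is an illustrative sketch (the `samples` table is made up; SlimTune’s frontend is C#):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "profile.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit: one transaction per statement
conn.execute("PRAGMA synchronous=OFF")      # don't force an fsync after every transaction
conn.execute("PRAGMA journal_mode=MEMORY")  # keep the rollback journal in RAM
conn.execute("CREATE TABLE samples(function_id INTEGER, hits INTEGER)")

# Each insert is its own implicit transaction, as in the profiler backend.
for i in range(1000):
    conn.execute("INSERT INTO samples VALUES (?, ?)", (i, 1))

count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(count)  # 1000
```

With the defaults, the same loop pays for a journal file and a flush per insert; with these two pragmas it runs at memory speed, at the cost of durability if the process dies mid-write.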

I’m still planning to make a few more database engines available. I had kicked around the idea last year of being able to profile to a remote database instead of being restricted to a local filesystem, and I’m eager to see how a full blown MySQL or SQL Server instance handles the data. I’m worried about the amount of data moving through the TCP/IP stack on a single system though. I guess we’ll see what happens. I’m considering enabling plugins for that too, but right now I’m still fighting with how to expose data engine selection in the UI. I haven’t figured out a way I like yet.

SlimDX Performance Tips and Tricks

Previously, I discussed some of the inherent performance costs that SlimDX suffers. Although that’s somewhat educational if you’re evaluating SlimDX, it’s pretty useless if you’re already using it and would like to get the most out of your code. This time, I’ll go over what you can actually do to make sure you’re running optimally.

There are two big problems with managed vector/matrix math, and this applies just as well to XNA. First, all of the types are value types, and passed by value to operators. That means when you multiply two matrices via operator*, two matrices (32 floats, 128 bytes) are copied onto the stack, and then another one is copied back into your result. This can get quite expensive, and the solution is to pass by reference, not by value. Unfortunately that means operators are a problem for performance sensitive code; you’ll have to use functions like Add and Multiply instead.

There’s also the problem that, generally speaking, vector operations are not candidates for inlining. They’re too big for the JIT’s metrics to pick them up as inlining candidates (the 3.5 SP1 revision may have changed this). For small vector operations, this can again become a substantial cost. Unfortunately this is a messy one to deal with, as you can’t ask the compiler or JIT to inline things for you. The most effective approach I’ve seen is to replace vector operations in stable code with hand-inlined code. Farseer Physics uses this method, and wraps the inlined blocks in #region to clarify what’s going on. Yes, it’s incredibly tedious, but if that’s what you have to do, then there it is.

Don’t use strings as effect handles if you can help it. We have to convert from Unicode to ANSI internally, and create a temporary handle. This gets slow and can cause other bugs as well. In future releases, this problem will actually be alleviated somewhat, but it’s best to avoid it completely.

Also make sure that SlimDX itself is configured correctly; these settings live in the Configuration class. For example, object tracking is an incredibly useful debugging feature that tells you what objects you’re leaking and where they came from. But because it records call stacks, it’s also quite expensive. It’s active by default; turn it off for production builds. Also consider disabling exceptions for return codes you don’t care about (device lost and device not reset are common ones), instead of catching and ignoring them.

Be careful with get properties and functions. An object’s Description property will always call GetDesc() on the underlying object, and then return a whole struct. This can get expensive quickly, especially if you casually access the property multiple times. We’ve chosen not to cache much of anything in SlimDX for the time being due to some nasty bugs early on. Querying data is expensive as a result.

Anything involving callbacks and callback interfaces is bad news, and it’s best to avoid them in performance-critical code. Every crossing of the boundary from managed to unmanaged code (or back again) carries overhead, and for callbacks we end up bouncing across it multiple times, all while doing various kinds of fixup and data marshaling. Texture.Fill in particular is incredibly slow.

If you’re working with large amounts of raw data that will be sent to SlimDX, consider using DataStream, especially as a replacement for (Unmanaged)MemoryStream. When you give SlimDX a generic Stream to work with, it has to allocate a buffer large enough to hold the data, read all the data into the buffer, and then copy that into the target native DirectX buffer. This is quite inefficient for certain types of data that are already in memory. If you hand us a DataStream, we can skip the allocation and read, doing a fast memory copy only.

Hopefully that’s helpful. I’ll update this post as I remember more tips.

SlimTune Profiler 0.1.5 Released!

Let’s recap. For about two months now, I’ve been working on a brand new profiling tool for .NET, C#, the CLR, and all that jazz. It’s open source, completely free, and supports frameworks 2.0 and later (no 1.x, sorry). Some of the notable features include remote profiling, real-time results analysis, and multiple visualizers. Today, the first public release, version 0.1.5, is available.

Project Homepage
Direct Link to Installer

Although this is still an early version, it is already quite capable. It supports sampling mode profiling for both x86 and x64 applications, and provides views that will be familiar to users of NProf or dotTrace. Speaking of NProf, it’s my belief that this completely replaces it for .NET 2.0+, with a better UI and more features too. (And a far more lenient source code license as well.) There is still a lot to come, of course, but with this release I finally feel that this is ready for the general public.

I’m looking forward to getting lots of feedback, both positive and negative, and I hope that this is a useful tool for everyone.

(P.S. If you want to build from source, you’ll need to do it with a non-express version of VC++ 2008 SP1 and VC# 08, with a full boost installation. Also install the SQL Server Compact redist, which is in the repository under trunk\install\ExtraFiles.)

SlimTune’s Hybrid Mode

I decided to try out the dotTrace Profiler, which runs $200 for a personal license and $500 per developer for organizations. IOW, it’s expensive. That $200 license makes it the cheapest of the commercial options, and I ran the trial on one of my games. They have some nice UI touches I like. The data is valuable as well — I would not have guessed that my MainGame.Update function takes five billion percent of total program time.


Generally speaking, profilers operate in one of two modes: sampling (statistical), and tracing (instrumenting). Sampling operates by suspending the process at high frequency and examining the program state. It converges to a decent overview of where your code is spending time, but it doesn’t produce meaningful timing information. The frequency simply isn’t high enough. Tracing injects calls to the profiler every time a function is entered or exited, allowing it to monitor the complete progression of your code. You get accurate results with fairly reliable timings, but it’s incredibly slow. (Oh, and it crashes dotTrace. People pay for this?)
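The tracing half of that split is easy to demonstrate in miniature. This is a toy Python sketch using `sys.setprofile`, not how SlimTune’s native JIT-injected hooks actually work; the function names are made up:

```python
import sys
from collections import Counter

def trace_calls(func):
    """Minimal tracing profiler: hook every Python function entry."""
    counts = Counter()
    def hook(frame, event, arg):
        if event == "call":  # fires on every function entry
            counts[frame.f_code.co_name] += 1
    sys.setprofile(hook)
    try:
        func()
    finally:
        sys.setprofile(None)
    return counts

def leaf():
    pass

def worker():
    for _ in range(5):
        leaf()

counts = trace_calls(worker)
print(counts["leaf"])  # 5
```

Notice that the hook runs on every single call, which is exactly why full tracing is so slow: the overhead scales with call frequency, whereas a sampler’s overhead is fixed by the sampling rate.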

I’ve been looking at doing a hybrid mode since the beginning of the project; I finally came up with a concrete approach when a friend gave me a rough overview of SN Tuner, the profiler for the PS3. The idea is fairly simple: you don’t generally want tracing-level accuracy for the vast majority of the code. All you really need is an overview, which sampling does a good job of, and then tracing when you’re focusing on one specific piece. You also don’t really need detailed profiling of framework code (everything in System, for example). Although I’m still working on the front-end, SlimTune is now able to do this type of selective instrumentation at runtime.

Using it is pretty simple. When you start up the target, select Hybrid in the SlimTune UI. This will cause the program to run in sampling mode, and you’ll get your overview results. Then, you can select a function from the overview and ask it to be traced, and then results will flow in from that function and its children only. You can also turn it off again, and you can ask for entire namespaces to be skipped in either tracing or hybrid mode. Hybrid mode is a little slower than sampling overall, but it allows you to get very detailed results without the huge performance hit that normally accompanies that level of detail.

Internally, it’s a little tricky to pull this off but it’s not too bad. I discovered early on that taking a lock on the function hooks is hideously expensive, even at zero contention. I use a few lockfree tricks to get the necessary data much faster. It’s also very important not to let the sampling profiler attempt to sample inside the hooks, as this leads to some nasty deadlocks; again, lockfree code is used to lay out some unsafe zones that the sampler can detect and avoid. SuspendThread is one messy son of a bitch.

So there are at least three features I’m giving you for free which a five hundred dollar competitor doesn’t have. Sure they have a much cleaner interface, VS integration, memory profiling, and so forth…but I’ve only been doing this for a month. Kinda makes me wonder. Oh, and guess what I spent the last day or so doing…

Cleaned up version of dotTrace style visualizer

SlimTune Profiler for .NET

I basically took last week off from blogging. Time to try and get some new entries out! Things have been very SlimDX focused, but what did you expect? It’s what I do. Maybe today’s will inspire a bit more general interest.


As a creature of GameDev.Net, I get to see lots and lots of discussions questioning whether or not C# and .NET are “fast enough” for games. What I don’t see much of is people actually analyzing and tuning the performance of their .NET code to find out what’s going on. I’m not sure why this is, but I have a theory that it’s partly because of the sorry state of available performance tools. The only version of VS that has a profiler is Team Edition, which damned near nobody has. Other commercial offerings are also seriously expensive. There are only two free profiling tools that are really available for use: CLR Profiler and NProf. (I’ve seen a few others, but it’s clear that they’re fringe tools that aren’t well supported.)

CLR Profiler is written by Microsoft, and it’s a pretty good tool. They’ve even released the source code, although the licensing is vague. It has a few drawbacks though. First of all, it only does memory profile analysis. It does a very good job of tracking allocations and garbage collections, and the visualizations are very well done too. But that’s all you get — no timings of any kind, let alone a breakdown of where time is being spent. Also, it hasn’t been updated since late 2005.

Then there’s NProf. Oh dear. The good news is it works, barely. The bad news is that’s the only favorable comment I have about it. It does sampling-based profiling only, and will show you a simple tree-based breakdown of time spent. It’s not that NProf is useless; I’ve done lots of good performance tuning with it. But this is literally all it can do, and there’s a lot more you want from a profiler. The last release was December 2006, and while there’s been some scattered SVN traffic since then, it’s basically dead. Support for x64 is apparently doable if you compile from source. I looked at the source, which is also poorly written. I decided immediately that I could do better than this toy, and now I’m putting my money where my mouth is.

I’m working on a new open source profiler tool right now called the SlimTune Profiler. It will probably release in early September, and the initial feature set is taking direct aim at NProf. The initial version will support sampling and instrumentation profiling for .NET 2.0 and above on local and remote machines. A little later on, you’ll be able to profile-enable a long running process at zero performance cost, and then profile it in real time for short periods. Imagine running a production server, and actually connecting with the profiler while it’s serving real requests to see what’s happening.

On the front-end, data will be collected from the profiling backend and dropped into an embedded relational database. There will be some preset views of the data, but the idea here is that you should be able to apply your own queries to the data and get results that are useful to you. Reporting is not expected for the initial version, but it will be supported eventually as well. I imagine you’ll be able to create various tables, graphs, etc and export them, although I’m not sure exactly what format that’ll be in. PNG and Excel seem reasonable. I’m hoping that you’ll be able to combine results from multiple runs, which would allow you to make all kinds of snappy graphs to show off to your boss.
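To make the “apply your own queries” idea concrete, here’s the kind of thing an embedded relational store allows. The schema here is entirely hypothetical (the real SlimTune tables may be named and shaped differently); the sketch just shows a custom percentage-of-total view built from plain SQL:

```python
import sqlite3

# Hypothetical schema standing in for a profiler results file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Functions(Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Samples(FunctionId INTEGER, HitCount INTEGER);
    INSERT INTO Functions VALUES (1, 'MainGame.Update'), (2, 'MainGame.Draw');
    INSERT INTO Samples VALUES (1, 70), (2, 30);
""")

# A custom view: each function's share of total samples.
rows = conn.execute("""
    SELECT f.Name,
           100.0 * s.HitCount / (SELECT SUM(HitCount) FROM Samples)
    FROM Samples s JOIN Functions f ON f.Id = s.FunctionId
    ORDER BY s.HitCount DESC
""").fetchall()
for name, pct in rows:
    print(f"{name}: {pct:.0f}%")
```

Because the results file is just a database, any query you can express in SQL becomes a "visualizer", with no parser for a custom binary format in the way.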

It’s been my plan for some time now to expand beyond SlimDX, and create a suite of Slim software. We’ve got a good reputation and lots of respect for our work, and I’m looking to build on that. SlimTune is the first step. It probably won’t be able to compete with the commercial offerings — but RedGate ANTS runs $400 or more per license. SlimTune will blow NProf out of the water in a scant two months, and it won’t cost you a dime. The feature set is pretty well specified, and the profiler already works in prototype form. The work over the coming weeks is in building a product instead of a project.

And yes, I know I’m a tease. It’ll be worth the wait.

SlimDX Performance

Whoo, I burnt through a lot of posts last week. I really ought to pace myself a bit more conservatively.


I’m not equipped to undertake a full treatment of SlimDX’s performance right now, as that’s a rather touchy set of problems. However, I did want to provide a general overview of what kinds of things affect SlimDX performance and what you can expect from your own programs. Strictly from the library’s point of view, performance is actually surprisingly good. It is not as good as unmanaged code (and can never be), but it is much better than you might think.

There are a few key sources of inefficiency in SlimDX, and they are varied and complex. Some are avoidable, and some are not. Some are inherent to the process as a whole, and some are the result of architectural decisions in building the library. The name “Slim” officially means that there’s a very minimal barrier between you and DirectX. While this is basically true, there are still a lot of details to be aware of.

The very process of exposing a native API to managed code is complex. Apart from converting the parameters themselves — a process which SlimDX is exceedingly efficient at — there is a lot of bookkeeping to do in making sure the call stacks stay correct, that permissions are handled properly, that exceptions are trapped safely, and so on. C++/CLI handles all of this for us behind the scenes, but there is a substantial cost for making ANY native call as a result. We have studied techniques for addressing those costs by allowing you to batch certain kinds of calls (eg SetRenderState), but nothing has yet shown itself to be clearly advantageous.

The most obvious candidate for performance issues is the math library. Floating point is well known for being performance critical in games (and certain other software), and that’s why we have processor extensions like SSE to make it as fast as possible. The D3DX routines that Microsoft provides are heavily optimized, and take full advantage of processor extensions. Unfortunately, using these routines from managed code is very expensive, so we instead provide completely managed implementations. XNA takes the same approach, but MDX calls into D3DX. These managed functions will JIT into scalar SSE which, while not optimal, is still very fast. Rough benchmarks have shown that completely native D3DX code has about a 10% advantage over our implementation. In other words, if you spend 20% of your time just doing math, a native code version could be 2% faster in that respect. In order to get the best possible performance out of SlimDX (or XNA, for that matter), make sure to use the overloads that take ref/out parameters. It won’t get you Havok levels of math performance, but it’s still extremely fast.

The vast majority of calls into the DirectX API itself return an HRESULT error code. In C++, you’re free to check or ignore this as you please. MDX took the fairly aggressive step of checking every single return value and converting to an exception; SlimDX and XNA both follow essentially that same model. SlimDX is particularly powerful, because it allows you to trap specific failed results, to control what actually causes an exception, or to disable exceptions outright. There’s also a LastError facility that stores a thread-local value for the last result code recorded by SlimDX. It’s a lot of truly useful flexibility, but it comes with a cost. For the March 2009 release, we leaned out the error checking significantly, making it much, much cheaper to make short DirectX calls. For a successful result, you basically pay for a method call, a comparison, and a small write to thread-local storage. A failed result triggers a much slower path. In the end, we are occasionally slower than MDX, but usually not thanks to improvements in .NET 2.0 and our improved architecture overall. We’re also nearly always faster than XNA on the CPU; XNA spends a lot of time doing thorough parameter validation.

Be wary of properties, especially ones that return complex value types. Many properties will cause a DirectX call on every invocation. If the return type is large, we’re probably reading an entire struct back, and then copying it to your variable. The Description property on many objects, for example, causes a GetDesc call. If you check obj.Description.Width and obj.Description.Height, that is two calls to GetDesc and two full copy operations. Cache the return value if you’re invoking DirectX a lot in performance sensitive code areas!

Anything in SlimDX that returns a ComObject derivative is typically translating from a native pointer to a managed object. In SlimDX, this invokes some internal machinery that will take a lock, do a table lookup, possibly insert to the table, and then construct an object if necessary. In the vast majority of cases this should never be an issue, but if you’re getting lock contention on our object table there is potential for problems. We don’t really have good data on this one either way, so it’s probably safe to assume it’s not a concern. The lock is only held for very short amounts of time.

Lastly, cache all of your effect handles when working with D3DX effects! Because these are string pointers internally, and D3DX doesn’t support unicode for them, we have to allocate every time a handle is created. Passing a string instead of a cached effect handle will cause allocation, every call and every frame. The net effect on performance of passing strings directly is quite extreme, since it follows slow paths in both SlimDX and D3DX.

That’s a somewhat high level overview, and I’m planning to add documentation that discusses SlimDX’s performance characteristics in more detail. However, we’ve been completely silent on performance and I decided to at least rectify that in brief. If I had to summarize, I’d say that in the general case:

  • XNA and SlimDX are both faster than MDX, and sometimes much faster.
  • SlimDX is faster than XNA at specific tasks, but it’s unlikely to make a noticeable difference for most people.
  • SlimDX is still slower than native code, probably on the order of 5%-10% with respect to DirectX calls. So if 50% of your time is spent calling SlimDX/DirectX (VERY high), you could be running as much as 5% faster in native code in that respect, but 3% is a more realistic estimate. For sane applications, even less.