I am, sort of. I used getmicroseconds to profile some functions and look for easy optimizations. However, it's not easy to reduce the coupling, and one cannot cache QueryInterface pointers because they keep changing. It could be nice to implement workers so that each component can run in its own thread. Running the same profiling with SM45 would also be interesting. One might also try the tracelogger to find potential bottlenecks.
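
For anyone who wants to repeat this kind of measurement, here is a rough sketch of the approach: wrap a function, read a microsecond clock before and after each call, and accumulate totals per name. Everything in it is an assumption for illustration. `Profiler`, `wrap` and `report` are made-up names, and the default clock is built on `Date.now()` as a stand-in, since I am not showing the actual getmicroseconds signature here; you would pass the real timer in instead.

```typescript
// A clock returning a timestamp in microseconds.
// Hypothetical stand-in: swap in the engine's real timer when available.
type Clock = () => number;
const fallbackClock: Clock = () => Date.now() * 1000; // millisecond resolution only

interface Stat {
  name: string;
  calls: number;
  totalMicros: number;
}

class Profiler {
  private clock: Clock;
  private stats: Record<string, Stat> = {};

  constructor(clock: Clock = fallbackClock) {
    this.clock = clock;
  }

  // Returns a function that behaves like fn but records how long each call took.
  wrap<A extends unknown[], R>(name: string, fn: (...args: A) => R): (...args: A) => R {
    return (...args: A): R => {
      const start = this.clock();
      try {
        return fn(...args);
      } finally {
        // Runs even if fn throws, so failed calls are still counted.
        const elapsed = this.clock() - start;
        const stat = this.stats[name] ?? { name, calls: 0, totalMicros: 0 };
        stat.calls += 1;
        stat.totalMicros += elapsed;
        this.stats[name] = stat;
      }
    };
  }

  // Entries sorted by total time spent, slowest first.
  report(): Stat[] {
    return Object.keys(this.stats)
      .map((key) => this.stats[key])
      .sort((a, b) => b.totalMicros - a.totalMicros);
  }
}
```

Wrapping a handful of suspect functions this way and sorting the report by total time is usually enough to spot the easy wins. It only measures wall-clock time around the call, so time spent in nested wrapped functions is counted in the outer one as well.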