Monday, 23 November 2015

C++ vs C# performance [deleted]

The following answer to a question about C++ vs C# performance on Stack Overflow has sadly been deleted despite having 305 upvotes:


I often heard that people prefer C++ to C#, mainly in performance-critical code, because the GC might turn up on the critical path, causing a performance penalty.

I have heard that in some circles but never respectable circles.

For example, I consulted for a company in London who were selling stock exchange software that had been written in 1,000,000 lines of C++. Over 40 developers had been working on it for almost 15 years and they were convinced that C++ was the correct solution for such software because latency and throughput performance were both critical. They were achieving latencies as low as 50ms (with a single trader connected!) and throughput as high as 10k trades per second (tps). However, they were struggling to support more than 2,000 traders because they had several threads per trader (no async) and, in fact, traders were reporting latencies as high as six seconds because the latency of their C++ code increased exponentially with the number of traders. I rewrote their code in 3 months using F# on .NET and achieved latencies as low as 0.1ms and throughputs over 200ktps using just 6,000 lines of F#. My solution was fully asynchronous (supported over 10,000 simultaneous trader connections) and fault tolerant.

Now, I'm not saying that C++ could not have been used to achieve even better performance than mine. On the contrary, I'm sure it could have achieved better performance but I also believe it would have taken man-decades of work by real experts and cost millions of pounds. After all, there's a reason why the London Stock Exchange paid £18m for MillenniumIT and their low-latency C++ solution. However, I do believe that the vast majority of the people who prematurely optimize away garbage collection don't know what they are talking about and would not be capable of building a good solution in any language. Such people usually only know C++ and have no knowledge of garbage collection algorithms, which is scary because C++ programmers reinvent GC algorithms every day. A good test is to ask them how garbage collection works. If they describe naive mark-sweep circa 1960 then they haven't done their homework.

On the other hand, some people write excellent low-latency and high-throughput code in garbage collected languages. For example, see the LMAX Disruptor (Java) and the Rapid Addition FIX engine (C#). So people have written low-latency software in Java and C# and, therefore, it clearly is possible. In particular, the use of arrays of value types is a known but under-appreciated solution for low-latency programming on .NET.
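To illustrate the technique (a minimal F# sketch of my own, not code from the original answer or from those products): value types stored in an array are laid out inline in one contiguous block, so the collector sees a single object instead of millions of heap nodes to trace or move.

// Sketch: an array of value types for low-latency .NET code. The data
// lives inline in one contiguous allocation, so the GC never has to
// trace or relocate the individual elements.
[<Struct>]
type Order =
  val Price: float
  val Quantity: int
  new (price, quantity) = { Price = price; Quantity = quantity }

// One allocation up front; no further GC pressure while trading.
let orders : Order[] = Array.zeroCreate 1000000

// Writing an element mutates the array in place: no boxing, no allocation.
orders.[0] <- Order(101.25, 500)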

However, when I read about C++, I realized that C++ offers smart pointer features with which the programmer does not need to worry about memory management. For example, shared_ptr with reference counting will manage the memory for us. Hence, we do not really need to care about the lifetime of an object or when it gets deleted. Wouldn't that be similar to the C# GC, with the destructor of the object being called in performance-critical code?

Yes. C++ programmers often complain about tracing garbage collectors being non-deterministic and causing pauses. Thread-safe shared_ptr in C++ is non-deterministic because threads race to decrement the count to zero and the winner of the race condition is burdened with calling the destructor. And shared_ptr causes pauses when decrements avalanche, e.g. when a thread releases the last reference to a tree the thread is paused for an unbounded length of time while every destructor in the tree is called. Reference counting can be made incremental by queuing destructors but that reintroduces the non-determinism of tracing garbage collection. Finally, reference counting with shared_ptr is several times slower than tracing garbage collection because incrementing and decrementing counts is cache unfriendly.

On a related note, C++ programmers often mistakenly claim that shared_ptr collects garbage at the earliest possible point in the program and, therefore, collects more "promptly" than a tracing garbage collector can. In fact, scope-based reference counting like shared_ptr keeps floating garbage around until it falls out of scope, which increases register pressure and can even increase memory consumption compared to tracing garbage collection.

So shared_ptr is indeed nothing more than a poor man's garbage collector. After all, old JVMs and CLRs both used reference counting at some point in history and both dropped it in favor of better forms of garbage collection. Reference counting is only popular in C++ because there is no easy way to walk the stack and redirect pointers so accurate tracing collection is prohibitively difficult.

Also, another question: if we don't use smart pointers in C++ and just resort to raw pointers, we still need to call delete to free the heap memory. So, from my understanding, every object created in C++ or C# still gets destroyed; the difference is only that in C++ we manage the memory ourselves whereas in C# we let the GC manage it. So what is the net effect when comparing C++ and C#, since both objects still need to be deleted?

In its simplest form, allocation in C++ boils down to calling a general-purpose shared (global) memory allocator like malloc and in C# it boils down to pointer bump allocating into a thread-local nursery generation (gen0). Consequently, ordinary allocation in C# is much faster than ordinary allocation in C++. However, that misrepresents real software. In practice, C++ programmers avoid calling the general purpose global allocator in favor of using thread-local pool allocators whenever possible. On the other hand, C# developers rely on the general purpose memory management solution provided by .NET because it greatly simplifies APIs (memory ownership has been abstracted away) and is more than fast enough in the vast majority of cases. In the few cases where the simple solution is not adequate, the C# developer drops to lower level C# code and writes a pool allocator using an array of value types.
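Such a pool might look roughly like this (a hedged F# sketch, not from the original answer; the type and member names are mine): a preallocated array of value types plus a free list of indices, so "allocating" and "releasing" never involve the garbage collector.

// Sketch: a pool allocator over an array of value types. Claiming and
// returning slots is just free-list manipulation; after the two initial
// allocations the GC has nothing to do.
[<Struct>]
type Slot =
  val X: float
  val Y: float
  new (x, y) = { X = x; Y = y }

type Pool(capacity: int) =
  let slots : Slot[] = Array.zeroCreate capacity
  let free = System.Collections.Generic.Stack<int>(seq { 0 .. capacity - 1 })
  member this.Alloc() = free.Pop()            // claim a slot index, O(1)
  member this.Release i = free.Push i         // return it to the free list
  member this.Get i = slots.[i]
  member this.Set(i, slot: Slot) = slots.[i] <- slot

let pool = Pool 1024
let h = pool.Alloc()
pool.Set(h, Slot(1.0, 2.0))
pool.Release h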

So I'd probably just make two observations:

· Accurate tracing garbage collection is extremely useful in general, and it comes bundled with C# but is prohibitively difficult in C++.

· Memory management bit tricks (e.g. smuggling bits in pointers) are sometimes possible in C++ but prohibited in C#.

So there is no easy way to compare C++ and C# fairly in this context.

Moreover, memory management is arguably not the biggest performance concern anyway. Many other issues can have a significant effect such as the quality of generated code on obscure architectures (where C compilers are usually much more mature) vs JIT compiling for a specific CPU, vectorization like SIMD (.NET does little), JIT-compiled run-time-generated code (like regular expressions in .NET) vs an interpreter in C++ and compilation to GPUs or FPGAs.
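The run-time code generation point is easy to see from F# (a tiny illustration of my own, not from the original answer; the pattern is arbitrary): .NET can emit and JIT-compile code for a regular expression at run time, where a typical C++ library interprets a state machine.

open System.Text.RegularExpressions

// Interpreted: the engine walks a data structure describing the pattern.
let interpreted = Regex "ab+c"

// Compiled: .NET emits IL for this specific pattern at run time and the
// JIT turns it into native code, trading startup cost for matching speed.
let compiled = Regex("ab+c", RegexOptions.Compiled)

printfn "%b %b" (interpreted.IsMatch "abbc") (compiled.IsMatch "abbc")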

I think the only piece of good advice I can give you here is: do your own research and don't listen to the unwashed masses.



Monday, 24 August 2015

Bjarne Stroustrup is catching up

Bjarne Stroustrup, creator of the C++ programming language, once famously said "There are only two kinds of languages: the ones people complain about and the ones nobody uses". Interestingly, Bjarne has gone on the defensive in his recent lectures, completely changed his tune and is catching up with the conclusions that most former-C++ developers have arrived at.

In a recent lecture Bjarne made many eyebrow-raising assertions. He is happy that people are no longer talking about C++ because that means it has succeeded. In reality, C++ demand in the job market has been in freefall for years and few new software projects are choosing it. He attacked computer scientists for copying data and said that "even babies don't do that", a very strange statement to make in a technical presentation. He also implied that other languages deep copy 10,000x10,000 matrices and claimed that a shared_ptr is "like an old fashioned garbage collector except it gets resource release correct". Perhaps most interestingly, the topic of his presentation was OOP without inheritance.

So C++ is moving from templates to the kind of parametric polymorphism that ML offered before C++ was invented. Is inheritance, the backbone of OOP, being deprecated? And C++'s new features edge ever closer to first-class functions and garbage collection.

Looking at "modern" C++ makes me angry. I wasted so much time learning all of this incidental complexity that just gets in the way of software development. And I am angry that so many people are still being deceived by this nonsense. Thankfully fewer and fewer people each year, but where did we go wrong? How did we let this happen? I think it reflects a serious disconnect between academia and industry.

"Premature optimization is the root of all evil" considered harmful

Computer science is coming full circle on performance. For decades, people worried intensely about performance and squeezed every ounce of speed they could from their code. But today the story is changing. The growing popularity of functional programming in the mainstream is encouraging people to think at a higher-level of abstraction. These people are often found reciting Knuth's famous quote "premature optimization is the root of all evil". Unfortunately the extremists among them are taking this too far and architecting systems with no regard for performance. The only fix is then tantamount to completely redesigning and reimplementing the entire system.

We can only conclude that this extremist form of "premature optimization is the root of all evil" must be considered harmful. Joe Duffy of Microsoft already expressed a similar opinion.

Friday, 27 March 2015

Functional programming and databases

Eric Lippert made some interesting statements about the disadvantages of functional programming on Stack Overflow:


“When Jane Smith in accounting gets married and changes her name to Jane Jones, the database backing the business process that prints her paycheque had better be all about handling that sort of mutation. When you fire the machine gun at the alien, most people do not mentally model that as the construction of a new alien with fewer hit points; they model that as a mutation of an existing alien's properties… My point is that if you have an object that represents an employee, it makes sense to think of the operation "change the name of the employee" to be a mutation of the object representing the employee that does not change object identity. When Jane Smith changes her name you don't create a different employee called Jane Jones that is otherwise the same. There aren't two employees with two different names. It is natural to model this process as a mutation of an object, not as the construction of a new object.”


I don’t know about aliens but I can tell you many horror stories I have witnessed in industry caused by the approach to databases that Eric is advocating.


Eric’s interpretation of the immutable approach as creating a “different employee called Jane Jones that is otherwise the same” is a strawman argument. The immutable approach is actually about representing a function that maps dates or versions onto data. What was Jane’s full name in 2012 and what is it now? As the data changes at discrete points in time this function can be represented by a database table or Map from date or version to data. This is the essence of the “functional” approach.


In many circles it is useful or even essential to retain all historical data. In the financial and insurance industries this is usually a legal requirement. Everywhere else it can greatly simplify testing and validation. For example, when we wrote stock exchange software it was theoretically possible to compress repeated trades into a single larger trade by overwriting the data in the database but this was completely prohibited because the regulatory authorities required the ability to see individual trades. When we wrote pension fund calculation software for the insurance industry, the system was legally required to be able to rerun old calculations and obtain the exact same answer that it had given 10 years before. In both circles we found people building databases using the imperative approach that Eric Lippert is advocating and then struggling to recover old data. Their solution was often to mutate the database equivalent of an undo buffer in order to retain the ability to regenerate old data. Suffice to say, this approach is very error prone.


For most of the people most of the time a more “functional” approach to database updates is preferable. Key any data that you are tempted to mutate by global version number. When searching, search for the latest version. When updating the database, read the latest version and write with the next version number. You can maintain a separate table mapping dates to version numbers. If you are running a concurrent system, obtain the current date using techniques like Lamport’s clock or vector clocks. Then when you want to examine historical data you can fetch the appropriate version of the data from the database, getting the version from the date if necessary. Performance will be slightly worse due to the extra data but any operations on historical data are much easier with this approach.
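Here is a minimal sketch of that scheme in F# (the type and function names are mine, purely for illustration): every update appends a fresh (key, version) row, and reads select the newest version at or before the point of interest.

// Sketch: an append-only, versioned store. Updates never overwrite; they
// add a row keyed by (key, version). Historical reads pick the newest
// version at or before the requested one.
type Store<'k, 'v when 'k : comparison> =
  { mutable Latest: int
    mutable Rows: Map<'k * int, 'v> }

let create () = { Latest = 0; Rows = Map.empty }

// Write the next version of a record; old versions are retained forever.
let update store key value =
  store.Latest <- store.Latest + 1
  store.Rows <- Map.add (key, store.Latest) value store.Rows

// Read the newest version of a record at or before a given version.
// Linear scan for clarity; a real schema would index by key.
let readAt store key version =
  store.Rows
  |> Map.toSeq
  |> Seq.filter (fun ((k, v), _) -> k = key && v <= version)
  |> Seq.sortBy (fun ((_, v), _) -> -v)
  |> Seq.tryHead
  |> Option.map snd

let employees = create ()
update employees "emp42" "Jane Smith"                      // version 1
update employees "emp42" "Jane Jones"                      // version 2
let nameNow  = readAt employees "emp42" employees.Latest   // Some "Jane Jones"
let nameThen = readAt employees "emp42" 1                  // Some "Jane Smith"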


I am currently struggling to do some analytics for a market research company. They have a comprehensive relational database of people, companies, products and services. Each person works for a company. If a person changes jobs their old job is overwritten with their new job. If a person’s address changes, their old address is overwritten by their new address. So our computers once knew where people used to work but that information is not readily available to me precisely because the data was overwritten. So I cannot calculate the social networks people are in or estimate how their work might be cross-pollinated between different companies. I cannot even tell someone where the product they ordered 6 months ago was sent because their address has changed since. So even in a situation where historical data is not legally required it would still have been very useful!


Saturday, 28 February 2015

Memory management myths in Apple circles

Nothing gets my hackles up more than people perpetuating memory management myths. Apparently there is a new trend in town thanks to Apple, who are deprecating their garbage collector on OS X in favor of Automatic Reference Counting (ARC).
News website Cult of Mac say:
“iOS is twice as memory-efficient as Android. Here’s why... According to Glyn Williams over on Quora, iOS devices run better than Android devices with twice the RAM because Android apps use Java, and need all the extra RAM to do something called garbage collection.”
Another news website, Redmond Pie, say:
“That was basically the same question put to Quora, the social website that gives people a way to ask questions and then have them answered by people who are experts in their respective field. The upvoting system adds a spot of authority tracking to the answers that are provided, and we have a clear winner as far as the question around why Android phones have so much more memory than iPhones.

Enter Glyn Williams.

The response, upvoted by over 2,600 people, included a handy graph and an explanation that involves garbage collection and Java. Basically, Android needs more memory because of the way it handles things.

You can head on over to the Quora question and check out Glyn’s explanation yourself, but what it boils down to is this: Android apps use Java, and as a result Android does something called garbage collection which involves memory being recycled once applications are finished with it. That’s all well and good, and actually performs really well when given plenty of memory to work with. The problems arise when the system is starved of memory.”
They are both referring to the same answer on Quora by a guy called Glyn Williams. His answer is as follows:
“Servicing RAM uses power. So more memory = more power consumption.
Android apps using Java, recycle released memory using garbage collection.

What this diagram shows is that garbage collectors are really awesomely fast if you have a relative memory footprint of 4 or 8.

In other words, you need four or eight times more memory, than you are actually using to be super efficient. But when the memory becomes constrained, that performance goes way down.

This is why Android devices have all that RAM.

iOS does not use this style of garbage collection and does not slow down in constrained memory environments.

So 1GB for iOS results in more performance than 3GB for Android.”

Some problems are immediately obvious with this. Firstly, RAM typically accounts for only a fraction of a percent of total power consumption in a mobile phone, so power is no excuse for skimping on RAM. Secondly, the graph is one of seven graphs from a ten-year-old research paper that compared various toy garbage collectors with an alternative scheme that used trial runs to deallocate memory aggressively. There are many problems with this: the other six graphs in the paper do not substantiate Glyn’s claims, i.e. he cherry-picked his graph, and most of the garbage collectors (e.g. Cheney semi-space, stop-and-copy, non-generational mark-sweep) are not representative of anything used on Android. The most realistic garbage collector on the graph is the generational mark-sweep collector, which outperformed all of the others, but even it is not as sophisticated as the concurrent garbage collector employed by the latest Android run-time, ART. Thirdly, Glyn asserts that GCs must “have a relative memory footprint of 4 or 8” to be really awesomely fast when this graph clearly shows that a relative footprint of 2.5 for the only realistic GC provides the best possible performance. Fourthly, Glyn implicitly assumed that ARC has optimal performance and memory overhead when, in fact, reference counting can be 10x slower than tracing GC and reference counts take up a lot of room; ten times slower is literally off the chart here. Fifthly, Glyn asserts that garbage collection is the reason why Android devices have all that RAM but offers no evidence to support this. Finally, Glyn asserts that garbage collection is the reason why a 3GB Android device performs like a 1GB iOS device when there is clearly no evidence to support that conclusion, which is, in fact, pure speculation.
I took the opportunity to ask Glyn himself what had given him the impression that Android needs more RAM to attain the same performance as iOS and why exactly he thought that was due to garbage collection. The only concrete evidence Glyn offered was this video showing two devices being switched by hand between a variety of different applications. Most of the applications are written in C++ and the Android device actually won the benchmark.
So this news-worthy gem turned out to be pure speculation.

Sunday, 26 January 2014

On the performance of boxed tuples

Stephen Dolan, a PhD student at the University of Cambridge, recently published an interesting article, A “sane” design for multicore OCaml, and the ensuing discussion on Reddit proved quite informative. In particular, more than one person asserted that ints, floats and tuples are fast in OCaml. In this blog post I’m going to take a look at tuple performance.


As Stephen points out, one might reasonably expect unboxed tuples to be faster for passing arguments to functions and returning multiple values from functions because the elements can stay in registers but slower for storing in the heap because they require multi-word reads and writes instead of a single word (a pointer to the existing tuple). However, HLVM has shown that unboxed tuples can be extremely fast so why the discrepancy?


The performance characteristics of different heap topologies are not quite so simple in a garbage collected environment. Two aspects of garbage collection affect the results: the write barrier and survivors. The write barrier is a relatively slow piece of code injected whenever a program writes a reference into the heap in order to keep the garbage collector apprised of the constantly-changing heap topology. Therefore, writing an unboxed pair of ints into the heap requires two int writes whereas writing a boxed pair of ints into the heap requires one pointer write and a write barrier. Now, in order to record information for the garbage collector, the write barrier always performs at least one write in addition to other housekeeping work. Therefore, writing a pair of ints will always be slower if the pair is boxed. In fact, the F# Journal article Pathological garbage collector behaviour found that a write that incurs the write barrier is 2.4x slower than a write that does not. Moreover, .NET is heavily optimized for mutable code so it has a very efficient write barrier whereas OCaml is heavily optimized for purely functional code and has a notoriously slow write barrier.


The next issue that complicates the performance of boxed vs unboxed tuples in the heap is survivors. Both OCaml and .NET use generational garbage collectors. New objects like boxed tuples are allocated in a nursery generation. When the nursery is full, surviving objects are identified and physically copied to the next generation. If a program violates the generational hypothesis (that most objects die young) by allocating many objects that survive then it incurs a performance overhead for marking the survivors, copying them into the next generation and fixing up all pointers to those objects to point at their new locations. If tuples are unboxed then none of these overheads exist.
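You can watch this promotion machinery from F# (a tiny hedged probe of my own; the generation numbers are specific to .NET's collector and a given build may behave differently):

open System

// A freshly-allocated boxed tuple starts life in the nursery (gen0).
let pair = (1, 2)
printfn "generation %d" (GC.GetGeneration(box pair))   // typically 0

// After a collection, survivors are physically copied to the next
// generation and all references to them are fixed up.
GC.Collect()
printfn "generation %d" (GC.GetGeneration(box pair))   // typically 1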


So it is instructive to measure the performance of writing newly-allocated tuples into progressively longer slices of an array. We have done this using boxed tuples in OCaml and F# as well as unboxed structs in F#. The following graph visualizes the performance as a function of the size of the array slice:

For array slices containing up to 1,000 elements none of the tuples survive the first generation and the performance of the boxed tuples is only slightly worse than for the unboxed representation (probably due to the write barrier). For more than 1,000 elements the performance of boxed tuples in both F# and OCaml rapidly worsens until they are 10x slower than unboxed tuples for 1,000,000 elements.
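The F# side of that measurement might look roughly like this (a hedged reconstruction, not the original benchmark code): fill an array slice with freshly-allocated pairs, boxed and unboxed, and compare the timings as the slice grows.

open System.Diagnostics

[<Struct>]
type Pair =
  val First: int
  val Second: int
  new (a, b) = { First = a; Second = b }

let time f =
  let sw = Stopwatch.StartNew()
  f ()
  sw.Elapsed.TotalSeconds

// Write freshly-allocated pairs into the first `slice` elements.
let benchmark slice iterations =
  let boxed : (int * int)[] = Array.zeroCreate slice   // elements are heap references
  let unboxed : Pair[] = Array.zeroCreate slice        // elements are inline structs
  let tBoxed =
    time (fun () ->
      for _ in 1 .. iterations do
        for i in 0 .. slice - 1 do
          boxed.[i] <- (i, i))        // allocate a tuple: pointer write + write barrier
  let tUnboxed =
    time (fun () ->
      for _ in 1 .. iterations do
        for i in 0 .. slice - 1 do
          unboxed.[i] <- Pair(i, i))  // plain value write, no allocation
  tBoxed / tUnboxed                   // slowdown factor of the boxed representation

// e.g. compare benchmark 1000 100000 with benchmark 1000000 100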


So tuples can indeed be fast in OCaml but only if they are short-lived temporaries. If tuples survive the nursery generation (256k words by default in OCaml) then performance is very bad.


The poor performance of long-lived tuples in OCaml has actually been worked around on several occasions. The Map implementation is almost identical to the Set implementation except the key-value pairs have been manually unboxed into the variant type representing the AVL tree, resulting in a large amount of unnecessary code duplication. The Hashtbl implementation uses a custom bucket type that is a list where the key-value pairs in each cons cell have been manually unboxed. Variant types are most elegantly represented as a tag and argument. For multiple arguments, the argument can just be a tuple. This simple and efficient representation is used in HLVM and it works very well. In OCaml, the compiler unboxes multiple arguments as a special case in order to combat the performance of boxed tuples. This results in a language wart where brackets around the arguments to a variant type constructor alter the meaning from multiple arguments to a single argument that is a tuple.


Locality is another important aspect of the performance of boxed vs unboxed tuples. Consider sorting an array of pairs and then enumerating the sorted array. With an unboxed representation the elements of the array are physically moved into place within a contiguous block of memory and enumeration is cache friendly. With a boxed representation, sorting scrambles the pointers and enumeration then has worst-case cache behaviour.
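In .NET terms (another hedged F# sketch of my own): sorting an array of ordinary boxed tuples permutes pointers, so the subsequent scan chases references in a cache-hostile order, whereas sorting an array of structs physically moves the data into a contiguous, scan-friendly layout.

[<Struct>]
type SPair =
  val Key: int
  val Value: int
  new (k, v) = { Key = k; Value = v }

let rand = System.Random 42
let n = 1000000

// Boxed: an array of ordinary tuples is an array of pointers; sorting
// scrambles them, so enumeration afterwards has worst-case cache behaviour.
let boxedPairs = Array.init n (fun _ -> (rand.Next(), rand.Next()))
Array.sortInPlaceBy fst boxedPairs

// Unboxed: the structs live inline in the array; sorting physically moves
// them into place and enumeration walks contiguous memory in order.
let structPairs = Array.init n (fun _ -> SPair(rand.Next(), rand.Next()))
Array.sortInPlaceBy (fun (p: SPair) -> p.Key) structPairs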


Suffice to say, the performance of boxed tuples is not as clear-cut as one might imagine.


Sunday, 13 October 2013

Memory management myths: promptness

People often assert that scope-based reference counting such as shared_ptr in C++ collects garbage “promptly” and some people define this as “collects at the earliest possible point”. For example, at the time of writing the Wikipedia page about garbage collection says:

“Compared to tracing garbage collection, reference counting guarantees that objects are destroyed as soon as they become unreachable” – Wikipedia

Similar claims can even be seen in published research such as the paper “Down for the Count? Getting Reference Counting Back in the Ring”:

“Of the two fundamental algorithms on which the garbage collection literature is built, reference counting has lived in the shadow of tracing. It has a niche among language developers for whom either performance or completeness is not essential, and is unused by mature high performance systems, despite a number of intrinsic advantages such as promptness of recovery and dependence on local rather than global state.” – Blackburn et al.

On the other hand you can see experts like Richard Jones, co-author of the excellent Garbage Collection Handbook, make statements like:

“More importantly, note also that even an immediate (i.e. non deferred) reference counter cannot reclaim objects as soon as they are no longer referenced as finalisation must be asynchronous (see Hans Boehm's POPL03 paper "Destructors, finalizers and synchronization").” – a post on the gc-list by Richard Jones.

Let’s have a closer look at the thinking behind this belief and test it with a simple program. The mental model that underpins this belief is that any function’s local variables are stored in separate slots in the function’s stack frame for the entire duration of a function’s body and, therefore, will be reachable from the point of view of the garbage collector for the duration of the call to the function. This mental model underpins exam and interview questions such as Is object eligible for garbage collection after “obj = null”? and When Is The Object Eligible For Garbage Collection?.

In reality, this mental model is simple, obvious and wrong. Why? Firstly, the garbage collector sees the run-time representation of a program after it has been subjected to transforms such as inlining, instruction reordering and code block reordering by the compiler that can mutilate the structure of a program beyond recognition and, consequently, concepts like scope that exist only in the source code and not in the compiled form are not visible to the garbage collector. Secondly, the register allocator does everything possible to keep references in registers and avoid spilling them to the stack and when they must be spilled it uses the results of liveness analysis to overwrite any dead references in the stack frame whenever possible. In fact, some compilers don’t even use stack frames, such as our own x86 JIT in F# and the HLVM project, and other compilers like SML/NJ convert every call into continuation-passing style and put stack frames on the heap, splitting every segment of code between a pair of function calls in the source into its own separate function in the compiled form.

Enough theory, let’s take a look at some working code. Here is a simple example using tracing garbage collection in OCaml/F# where an argument tmp to a function dies in the middle of the function body and, in particular, before a recursive call:

let rec loop tmp i =
  if i <= 0 then tmp else
    // tmp's last use is the Array.copy below, so a liveness-aware compiler
    // and GC need not keep it reachable across the recursive call.
    let tmp2 = loop (Array.copy tmp) (i-1)
    tmp2.[0] <- tmp2.[0] + 1
    tmp2

When run using loop (Array.init m id) n, this code clearly uses less than mn space and keeps on running indefinitely. This can only be because the argument tmp is no longer reachable via the stack when the recursive call is made and, consequently, gets garbage collected.

Here is the equivalent using scope-based reference counting in C++:

#include <memory>
#include <vector>
using namespace std;

shared_ptr<vector<double> > loop(shared_ptr<vector<double> > tmp, int i) {
  if (i <= 0) {
    return tmp;
  } else {
    shared_ptr<vector<double> > tmp1(new vector<double>(*tmp));
    shared_ptr<vector<double> > tmp2 = loop(tmp1, i - 1);
    (*tmp2)[0] += 1;  // mirror the increment in the F# version
    return tmp2;
  }
}

In contrast, this code clearly requires at least mn space when run, goes to swap and (on Windows) dies from out of memory. Unlike the OCaml/F# code, the scope-based reference counting using shared_ptr in C++ keeps the tmp array allocated for longer than necessary, right until the end of the function call.

This observation also destroys another popular memory management myth: that tracing garbage collection always requires more memory than reference counting.

If there is any advantage to the C++, it is the presence of guarantees. The semantics of C++ guarantee that after the end of scope the object has been deleted. However, it is worth noting that this guarantee of determinism does not apply to objects shared between threads because, in that situation, the threads race to decrement the reference counter to zero and the winner of the race condition is burdened with executing the destructor.