Sunday, October 24, 2010

Hyper-Threading

I've re-run my real-time ray tracing and Go ACO TSP programs on a 6-core processor with hyper-threading. The Intel i7-980X supports two hardware threads per core, so the OS sees it as a 12-core processor.



Here hyper-threading has improved performance for both programs in all configurations. At one core both programs were sped up 28%. At 6 cores ray tracing was sped up 9% versus non-HT, while ACO TSP was sped up 18% versus non-HT. The higher speedup for ACO TSP is expected, as it matches that program's generally better scaling as cores are added.

Both programs were run with 24 threads in all test configurations. Having more program threads than cores × hardware threads per core is key; otherwise the OS may schedule multiple threads on the same core while leaving other cores idle. Linux is hyper-threading aware and attempts to avoid this scheduling problem, but doesn't always get it right. In testing with 6 threads and 6 cores, I saw a 28% decrease in performance for both programs when hyper-threading was enabled.
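
As a minimal modern-Go sketch of this oversubscription approach (worker and nWorkers are hypothetical names, not the actual benchmark code), the idea is simply to launch more goroutines than hardware threads and let the scheduler keep every core busy:

package main

import (
    "fmt"
    "runtime"
)

// worker performs a fixed chunk of CPU-bound work, standing in for one
// slice of the ray-tracing or ACO workload, then reports completion.
func worker(done chan<- int) {
    sum := 0
    for i := 0; i < 100000000; i++ {
        sum += i
    }
    done <- sum
}

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU()) // one OS thread per hardware thread
    const nWorkers = 24                  // more workers than hardware threads
    done := make(chan int, nWorkers)
    for i := 0; i < nWorkers; i++ {
        go worker(done)
    }
    for i := 0; i < nWorkers; i++ {
        <-done // wait for all workers to finish
    }
    fmt.Println("all", nWorkers, "workers finished")
}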

The Intel i7-980X also supports "Turbo Boost", where one core is automatically overclocked when other cores are idle. Turbo Boost was disabled for the tests above. With one core and HT disabled, Turbo Boost provides a speedup of 8% for both programs. Surprisingly, with six cores and HT enabled (so all cores should be fully utilized) Turbo Boost still provides a speedup of 5% for both programs.

The performance of Go has also improved relative to the single-threaded C version of ACO TSP. For 24 ants the C version finishes in 60 seconds, so on one core Go (at roughly 100 seconds) is only 1.7x slower than C, and at 2 cores Go exceeds the performance of the single-threaded C version. Previously Go was 2.8x slower than C on one core; I don't know whether this improvement is due to an improved Go compiler, the move to a 64-bit compiler and OS, or architectural differences between the Intel Core 2 and i7-980X.

Sunday, January 3, 2010

Multi-Core Ant Colony Optimization for TSP in Go

Go is a new statically typed, garbage-collected, concurrent programming language. It is natively compiled and offers near-C performance. Like Erlang, it uses a CSP model of lightweight processes ("goroutines") communicating via message passing; messages can be synchronous (unbuffered) or asynchronous (buffered). The runtime can schedule goroutines across multiple cores.
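
For example (a toy sketch, not from the benchmark program), an unbuffered channel gives synchronous message passing, while a buffered channel lets sends complete asynchronously:

package main

import "fmt"

func main() {
    sync := make(chan string)     // unbuffered: a send blocks until received
    async := make(chan string, 2) // buffered: sends complete while room remains

    go func() { sync <- "ping" }() // needs a concurrent receiver to proceed
    fmt.Println(<-sync)            // prints "ping"

    async <- "a" // returns immediately, no receiver needed yet
    async <- "b"
    fmt.Println(<-async, <-async) // prints "a b"
}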

On my usual ACO TSP test case Go is the best-performing language so far. On a single core it was 2.8x slower than C, beating the previous best, Shedskin Python. Go's speedup from 1 to 2 cores was 1.9x, matching the previous best, Erlang. With 4 cores Go exceeds C's single-core performance -- the first tested language to achieve this goal.



Go's lightweight processes are scheduled onto OS threads by the Go runtime system, allowing it to support far more concurrent processes than the OS could handle as threads. I was able to run ACO TSP with 100,000 ant processes. The runtime scheduler appears to treat these processes somewhat like futures, lazily scheduling them when another process blocks on receipt of a message from a shared channel. This is inefficient for the type of parallel processing used in ACO (many seconds of parallel computation ending with a single message send), as only a single process (and thus a single core) is active at a time. This is a known issue, and the simple solution suggested on the golang-nuts mailing list is to add runtime.Gosched() calls to yield the processor. In my case this was insufficient; adding additional heartbeat messages to each process helped force maximal multi-core usage.
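
A minimal sketch of that workaround (hypothetical names, greatly simplified from the real program): each ant goroutine periodically sends a heartbeat and yields, giving the scheduler a chance to keep all goroutines, and thus all cores, active:

package main

import (
    "fmt"
    "runtime"
)

// ant stands in for one ant's tour construction. The periodic heartbeat
// send and Gosched call create scheduling points mid-computation.
func ant(id, steps int, result chan<- int, heartbeat chan<- int) {
    work := 0
    for s := 0; s < steps; s++ {
        work += s // stand-in for the real tour computation
        if s%50 == 0 {
            heartbeat <- id   // heartbeat message forces a scheduling point
            runtime.Gosched() // explicitly yield the processor
        }
    }
    result <- work
}

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    const ants = 8
    result := make(chan int, ants)
    heartbeat := make(chan int)

    go func() { // consume heartbeats so the sends can complete
        for range heartbeat {
        }
    }()
    for i := 0; i < ants; i++ {
        go ant(i, 200, result, heartbeat)
    }
    for i := 0; i < ants; i++ {
        <-result
    }
    fmt.Println("all ants done")
}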

See here for code and more details.

Sunday, September 6, 2009

More Shedskin Python

Previously I discussed Shedskin Python, a Python compiler I used on my ACO TSP test program. My ant.py test program is now included as one of the example programs distributed with Shedskin.

The version of Shedskin discussed earlier generated intermediate templated C++ code to support the parametric polymorphism (generic functions) inherent in dynamically typed Python programs. In the most recent versions this templating has been dropped, as it was found to be complex to maintain and rarely useful, so my palindrome examples no longer compile.

In testing, ant.py could also be sped up 1.5x with the following shedskin flags (applied as shown below):

-b --nobounds   Disable bounds checking
-r --random     Use fast random number generator
-w --nowrap     Disable wrap-around checking
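
The flags are passed at translation time, before the generated C++ is compiled. If I recall the Shedskin workflow correctly, it emits C++ plus a Makefile, so the build looks something like:

$ shedskin -b -r -w ant.py
$ make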

Friday, July 18, 2008

ACO for TSP with SML/NJ

I was asked about the performance of SML/NJ on my usual ACO for TSP test. I used Standard ML of New Jersey v110.67 from Debian apt-get. I ran it in the interactive shell; no Compilation Manager. I didn't experiment with Concurrent ML (I don't think it supports multi-core). Results:

            Time      Multiple
            in secs   of C
C             0.4        1
MLton SML     1.6        4
SML/NJ        4.9       12
Alice SML   169.0      423

Results scaled up proportionally when I increased iterations by a factor of 10.

Tuesday, July 15, 2008

Shedskin Python

Shedskin is another experimental Python compiler. Previously I tested Psyco, a Just-in-Time Python compiler which operates while the program is running. Shedskin is a more traditional compiler, run as a separate step ahead of time: it generates templated C++ code, which is then compiled with GCC.

Since Python is dynamically typed while C++ is statically typed, Shedskin performs type inference over your program. It generates C++ templates to support parametrically polymorphic functions. A simple example, previously used in my article Types and Programming Languages:

def palindrome(x, y):
    return (x, y, x)

print palindrome(1, 2)
print palindrome("a", "b")

Results:

(1, 2, 1)
('a', 'b', 'a')

With shedskin, the following C++ template is generated:

template <class A> tuple2<A, A> *palindrome(A x, A y) {
    return (new tuple2<A, A>(3, x, y, x));
}

and used:

str *const_0, *const_1;
const_0 = new str("a");
const_1 = new str("b");
print("%s\n", palindrome(1, 2));
print("%s\n", palindrome(const_0, const_1));

While support for many Python constructs and library functions is currently limited, Shedskin was able to compile, unmodified, the same ant colony optimization Python code I had used previously with Psyco. The result is significantly faster, coming within 3x of the baseline C version's performance:

                  Time      Multiple
                  in secs   of C
C                   0.4        1
Shedskin Python     1.2        3
Psyco Python        5.5       14
Python             40.0      100

Monday, July 14, 2008

C baseline for ACO TSP

In my previous post on the Multi-Core Problem I discussed comparing the performance of multi-core-friendly languages against the baseline performance of single-threaded C code. The test case I have been using is Ant Colony Optimization for TSP. In a recent post, Leonardo Maffi rewrote my ACO TSP code in C, which provides a baseline for comparisons:

          Time      Multiple   Speedup
          in secs   of C       (1->2)
C           0.4        1       n/a
Scala      10.0       25       1.7
Haskell    22.0       55       1.4
Erlang     93.0      233       1.9

gcc (GCC) 4.2.4 (Debian 4.2.4-1).
Compiled with: gcc -Wall -O3 -s ant_c.c -o ant_c
Still 1000 iterations, 200 cities.
Times for other languages are from here.
Initial speedups are from here (Scala interpolated), and do not model the decay at higher numbers of cores.

Wednesday, July 9, 2008

Ant Colony Optimization for TSP in Psyco Python

My last post about Parametric Polymorphism and runtime efficiency got me curious about Psyco Python, so I've implemented Ant Colony Optimization for TSP in Psyco Python.

Psyco is a compiler for Python. It is a Just-in-Time compiler, and uses run-time type information to determine which functions should be native-compiled. It is a specializing compiler -- it will create multiple instantiations of the same function for different type parameters. It is very easy to use; just add two lines to the top of your program:

import psyco
psyco.full()

It is an experimental project. The documentation warns that map and lambda perform poorly, and that list comprehensions (often a reasonable substitute) should be used instead. The original developer has moved on to work on the PyPy project; I should investigate that later.

Compared to other tested languages the performance of uncompiled Python is average; with Psyco compilation it is surprisingly good.



        Erlang   Haskell   Scala   MLton SML   Python   Psyco
Time      93       22        10      1.5         40      5.5

All times in seconds.

Update 7/12
Leonardo Maffi has pointed out numerous potential style and performance improvements for the Python code. It turns out Python already has a zip function (hard to find when searching www.python.org, as the results are mixed in with the zip file functions :-), so I have removed my own implementation from the code. There was no performance change with my zip removed.

Update 7/15
See new post on Shedskin Python.

Friday, June 20, 2008

Multi-Core Ant Colony Optimization for TSP in Scala

I've added a new page Multi-Core Ant Colony Optimization for TSP in Scala.

Scala is a hybrid functional and object-oriented programming language. It runs on the Java JVM, so it gets Just-in-Time (JIT) native compilation and native thread support, and it has access to the wide variety of class libraries available for Java.

Like Erlang, Scala supports an Actor-style message-passing concurrency model; more traditional thread-based concurrency with locks and shared memory is also available. Unlike Erlang and Haskell, Scala does not require values to be immutable. Like SML and Haskell, typing is static with type inference.



Erlang and Alice SML are omitted from the graph. Note the change from previous tests: 1000 iterations instead of 100. Still 200 cities.

Cores   Erlang   Haskell   Scala   OCaml   MLton SML   Alice SML
1        372       86      41.6     6.0                  676
2        192       63
3        141       58
4         98       54      14.6             1.6

All times in seconds.

If anyone knows how to reduce the number of cores used by the JVM, I'll fill in the rest of the table for Scala.

Tuesday, June 3, 2008

Functional Languages on Quad-Core (revised)

Yesterday's post generated lots of suggestions. I've revised the Erlang code to use dict mapping and updating functions. I didn't use the new Erlang arrays, since they are one-dimensional and I need two-dimensional arrays. Here are the new results:



Cores   Erlang   Haskell   SML
1        38.4      8.6     0.76
2        19.6      6.1
3        14.5      5.5
4        10.3      5.1

Here is the revised Erlang source code, as well as the Haskell quad-core source code used in tests. The MLton Standard ML implementation is still single-core, as MLton does not have SMP support (see comments below).

Update: Added new post with Scala implementation.

Sunday, June 1, 2008

Functional Languages on Quad-Core

In an earlier series of articles ending with Multi-Core Ant Colony Optimization for TSP in Erlang, I evaluated the performance of the Standard ML, Haskell, and Erlang functional programming languages on a test problem: solving the traveling salesman problem with ant colony optimization. At the time I only had a dual-core machine available. I have re-run the tests using an Intel quad-core machine and the latest language versions available for Debian Linux.



The results match my previous summary: Erlang is much slower than the others. Erlang runs on a bytecode-interpreting virtual machine, while the others are native-code compilers. Erlang is also dynamically typed, while the others are statically typed. Like Haskell, Erlang has immutable (single-assignment) variables, which force lots of copying. Unlike Standard ML and Haskell, no array type is provided, so I had to implement two-dimensional arrays inefficiently using dictionaries. In Erlang each process (thread) does have a separate heap, so there is little interference between processes due to memory allocation or garbage collection, and near-perfect speedup is achieved.



The CPU utilization graph (Erlang, then Haskell, then Standard ML) does show the lack of interference between Erlang threads which makes its speedup possible. Haskell achieves little speedup at higher numbers of cores.

Update: thanks, responses to comments:

As discussed at the end of an earlier posting, based on email feedback I tried HiPE compilation with Erlang 5.5 on AMD. It provided no significant improvement, and was not compatible with multi-core operation. This time I again tried HiPE compilation, with Erlang 5.6.2, and got the following with erl:

$erl
Erlang (BEAM) emulator version 5.6.2 [source] [smp:4]
[async-threads:0] [kernel-poll:false]

Eshell V5.6.2 (abort with ^G)
1> c(ant,[native]).
./ant.erl:none: Warning: this system is not configured for
native-code compilation.
{ok,ant}
2>

Same for erlc:

$erlc +native ant.erl
./ant.erl:none: Warning: this system is not configured for
native-code compilation.

So it appears the latest i386 Erlang available from Debian Linux apt-get doesn't support native compilation, probably because they chose to enable SMP support instead.

Erlang arrays are new in the latest release. Thanks for the info, I'll try them out.

Yes, MLton SML with one core was the fastest. Older (pre-quad-core) versions of the code are available linked from the various ACO TSP pages and here: Standard ML, Haskell, Erlang.

Actually, the next language I will probably try is Scala. Based on the Java JVM, it may inherit JIT native compilation and good SMP thread support, and it supports an Actor concurrency model like Erlang's. Apparently the Caml dialects don't support SMP. Of course, for many problems (like ACO TSP :-) you are better off simply starting N fast single-threaded instances of the program and collecting the results in Ruby or some-such. I am really looking for instances where using an SMP-capable language is a clear win over single-threaded C++. I haven't written a single-threaded STL C++ version of ACO TSP; I expect it could easily crush MLton SML once you finally got the bugs out...

Here are the raw numbers, in seconds:

Cores   Erlang   Haskell   SML
1        87.0      8.6     0.76
2        46.3      6.1
3        34.6      5.5
4        26.2      5.1

When I increase the iterations in Haskell by 5x the runtime goes up by 5x (43 secs with 1 core, 26 secs with 4 cores), still showing the same pattern of decreasing improvement with increased cores.

Update 2: see new post for revised Erlang code and results.

Update 3:
It turns out Erlang HiPE is available in Debian as a separate package (erlang-base-hipe), and is now SMP compatible. But the improvement is only 5%, or 11% if you also natively compile dict.erl.