Discussion:
[asio-users] Poor Asio performance at Linux
Marat Abrarov
2012-07-25 17:15:03 UTC
Hello Asio users and Asio author.

Some time ago I received unexpected feedback about my project asio_performance_test_client
(http://sourceforge.net/projects/asio-samples):
http://asio-samples.blogspot.com/2012/06/blog-post.html?showComment=1342554363809#c9137155316299593396 (sorry, but it's
in Russian; the text is fairly simple, so I hope any online translator can help).

The most interesting part is http://i038.radikal.ru/1207/50/a317e304fcd2.png.

I ran some tests on Windows (http://asio-samples.blogspot.com/2012/07/asio-performance-test.html), and it seems
that, at least on Windows, Asio's performance is good enough (most of the time is spent in system functions related to I/O
and event demultiplexing).

I think the reason for the poor Asio performance on Linux (in the case of
http://asio-samples.blogspot.com/2012/06/blog-post.html?showComment=1342554363809#c9137155316299593396) may be related
to the implementation of asio::detail::mutex on Linux - the asio::detail::posix_mutex class.

On Windows, asio::detail::mutex uses a Windows critical section, which internally uses a spinlock plus an OS mutex. Maybe
this spinlock causes the difference between my test (https://docs.google.com/open?id=0B4NhcxJYpyXGMUgxQ2VUN0VYcDg) and
Nikki's one.
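
If someone wants to test this hypothesis on Linux, here is a minimal sketch (not Asio's actual code, just my
illustration) of a mutex that imitates the spin-then-block behaviour of a Windows critical section, using glibc's
PTHREAD_MUTEX_ADAPTIVE_NP extension:

    #include <pthread.h>

    // Spin-then-block mutex: PTHREAD_MUTEX_ADAPTIVE_NP spins briefly in
    // user space before falling back to a kernel wait - roughly what a
    // Windows CRITICAL_SECTION with a spin count does. This is a glibc
    // extension (g++ defines _GNU_SOURCE by default, so it is visible).
    class adaptive_mutex
    {
    public:
      adaptive_mutex()
      {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&mutex_, &attr);
        pthread_mutexattr_destroy(&attr);
      }

      ~adaptive_mutex() { pthread_mutex_destroy(&mutex_); }

      void lock() { pthread_mutex_lock(&mutex_); }
      void unlock() { pthread_mutex_unlock(&mutex_); }

    private:
      pthread_mutex_t mutex_;
    };

Swapping something like this in for the plain pthread mutex and re-running the test would show whether the spinning
alone accounts for the difference.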

There was a thread named "boost asio overhead/scalability" on this mailing list. I think its author could have had the
same problem with asio::detail::mutex. Any suggestions?

Regards,
Marat Abrarov.
Gruenke, Matt
2012-07-26 17:22:37 UTC
Thanks for sharing this interesting and useful data, Marat.

The more I think about it, the more I feel it comes down to the classic latency/throughput tradeoff. It seems that ASIO is designed to scale to a large number of connections and is adept at parallelizing potentially expensive handlers. For this, you need support for multiple threads and all the attendant synchronization overhead.

However, the sort of performance-related cases that have received the most attention are those where a small number of lightweight handlers are invoked at extremely high frequencies. In these cases, the overhead of locks, context switches, and cache misses (there being no concept of processor affinity) significantly hampers performance relative to a naïve, single-threaded implementation.

Perhaps what's needed is an alternate io_service that's low-overhead and can be driven only by a single thread. Where possible, it should maintain interface compatibility with the existing io_service, though some of the thread safety policies might differ and I'm not sure that io_service::dispatch() can stay. The issue of processor-affinity is tricky, and possibly best left up to the user - this way, we'd at least give no pretense of doing anything to address it.

What, then, is the benefit of using a general framework like ASIO, only to handicap it so severely? For one thing, generic code can be written (implementing common protocols, for example), of which most can be used with either type of io_service, as needed or desired by the user. Another benefit I see is that ASIO provides powerful abstractions that make it much easier to correctly implement asynchronous and low-latency I/O code.
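
To make the single-threaded direction concrete: io_service already accepts a concurrency_hint constructor argument, a suggestion to the implementation about how many threads will run it. A minimal sketch (the hint is advisory and doesn't remove the internal locking, but it shows the interface-compatible starting point):

    #include <boost/asio.hpp>
    #include <iostream>

    void on_timer(const boost::system::error_code&)
    {
      std::cout << "timer fired" << std::endl;
    }

    int main()
    {
      // Hint that exactly one thread will drive this io_service.
      boost::asio::io_service io_service(1);

      boost::asio::deadline_timer timer(io_service,
          boost::posix_time::seconds(1));
      timer.async_wait(&on_timer);

      io_service.run(); // one thread invokes all handlers
      return 0;
    }

A dedicated low-overhead io_service could keep exactly this usage pattern while dropping the machinery that only multi-threaded callers need.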


Matt


-----Original Message-----
From: Marat Abrarov
Sent: Wednesday, July 25, 2012 1:15 PM
To: asio-***@lists.sourceforge.net
Subject: [asio-users] Poor Asio performance at Linux

[snip]
Nir Tzachar
2012-07-27 06:35:25 UTC
Hello.

If you think your performance issues arise from locks etc., and you are using a
single thread both to drive an io_service and to submit jobs to the same
io_service, you can try compiling with -DBOOST_ASIO_DISABLE_THREADS.

If your performance problems persist, there is a good chance your problems
come from somewhere else.
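
For example (a sketch; if I remember the implementation correctly, defining BOOST_ASIO_DISABLE_THREADS turns
asio::detail::mutex into a no-op, so the internal locking disappears - but every call into the io_service must then
come from the one thread that drives it):

    // Build e.g.: g++ -O2 -DBOOST_ASIO_DISABLE_THREADS test.cpp -lboost_system
    #include <boost/asio.hpp>

    int main()
    {
      boost::asio::io_service io_service;
      // ... register sockets/timers here, all from this one thread ...
      io_service.run(); // the only thread that may touch io_service
      return 0;
    }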

Cheers.
Post by Gruenke, Matt
Thanks for sharing this interesting and useful data, Marat.
[snip]
Marat Abrarov
2012-07-27 15:57:51 UTC
Hello Nir and others.
Post by Nir Tzachar
If you think your performance issues arise from locks etc.,
and you are using a single thread both to drive an io_service and to submit
jobs to the same io_service, you can try compiling with -DBOOST_ASIO_DISABLE_THREADS.
The only reason I haven't done this yet is the absence of a Linux installation on real hardware with at least two CPU cores
(one for the client and one for the server, to reproduce the conditions of my Windows-based test). Maybe someone can do it
themselves (they would have to use Intel VTune, as Nikki did, or some other well-known profiler)?

It would also be interesting to see my Windows-based test in a more concurrent environment (to increase the influence of
locks), supplying echo_server/asio_performance_test_client with more hardware CPU cores (more than my 4 + 4, for
example 8 + 8 or 16 + 16).
I have a network with some 4-core machines (2 physical cores + Hyper-Threading) in it, but that network's bandwidth seems
too small to fully load even a 2-core machine (i.e. to reach 90-100% of an Intel Core 2 Duo E7200 CPU).

Regards,
Marat Abrarov.
Christof Meerwald
2012-07-28 15:57:19 UTC
Post by Marat Abrarov
I think the reason for the poor Asio performance on Linux (in the case of
http://asio-samples.blogspot.com/2012/06/blog-post.html?showComment=1342554363809#c9137155316299593396) may be related
to the implementation of asio::detail::mutex on Linux - the asio::detail::posix_mutex class.
On Windows, asio::detail::mutex uses a Windows critical section, which internally uses a spinlock plus an OS mutex. Maybe
this spinlock causes the difference between my test (https://docs.google.com/open?id=0B4NhcxJYpyXGMUgxQ2VUN0VYcDg) and
Nikki's one.
There was a thread named "boost asio overhead/scalability" on this mailing list. I think its author could have had the
same problem with asio::detail::mutex. Any suggestions?
I don't think the different mutex implementations are the reason for
the difference you are seeing. My view is that the win_iocp_io_service
that's being used on Windows is implemented with less locking than the
task_io_service that's being used on Linux (because I/O completion
ports on Windows are easier to work with in a multithreaded
environment).

On Linux, Boost.Asio uses locking to allow only one thread to call
epoll_wait at any one time - see my blog post at
http://cmeerw.org/blog/751.html#751, which compares Boost.Asio with an
"optimal" use (in terms of locking for this use case) of epoll_wait in
a multithreaded program.
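
For reference, a no-lock multithreaded epoll loop looks roughly like this (a sketch only, with error handling and the
actual I/O elided; the EPOLLONESHOT re-arm is one common way to keep two threads off the same socket - the essential
point is just that no user-space lock surrounds epoll_wait):

    #include <sys/epoll.h>

    // Each worker thread blocks directly in epoll_wait() on a shared epoll
    // fd. Concurrent epoll_wait() calls on one epoll instance are safe -
    // the kernel does its own synchronization - so no user-space mutex is
    // needed.
    void* worker(void* arg)
    {
      int epfd = *static_cast<int*>(arg);
      epoll_event events[64];
      for (;;)
      {
        int n = epoll_wait(epfd, events, 64, -1);
        if (n < 0)
          continue; // e.g. EINTR
        for (int i = 0; i < n; ++i)
        {
          int fd = events[i].data.fd;
          // ... read from / write to fd here ...

          // EPOLLONESHOT disarms the fd after delivery, so only one thread
          // handles it at a time; re-arm it once the handler is done.
          epoll_event ev = epoll_event();
          ev.events = EPOLLIN | EPOLLONESHOT;
          ev.data.fd = fd;
          epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
        }
      }
      return 0;
    }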


Christof
--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org
Gruenke, Matt
2012-07-28 19:45:45 UTC
I agree that the lock around epoll_wait() is problematic, but not so much because of lost parallelism or mutex overhead. Rather, my primary concern is the cost of the forced context switch and the consequent cache misses. Furthermore, today's NUMA server architectures add even more weight to the argument for processor affinity-friendly designs.
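
To illustrate the affinity point: ASIO offers no affinity API, but on Linux the user can already pin each worker thread with glibc's pthread_setaffinity_np. A sketch (glibc-specific; g++ defines _GNU_SOURCE by default, which makes the extension visible):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to a single CPU so its cached state stays warm
    // and the scheduler cannot migrate it across cores or NUMA nodes.
    void pin_current_thread_to_cpu(int cpu)
    {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Called once at the top of each thread function that drives an io_service, before any handlers run.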


Matt


________________________________

From: Christof Meerwald
Sent: Sat 7/28/2012 11:57 AM
To: asio-***@lists.sourceforge.net
Subject: Re: [asio-users] Poor Asio performance at Linux
[snip]