Discussion:
[BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s
Arnaud Ebalard
2013-11-10 13:53:46 UTC
Hi,

I decided to upgrade the kernel on one of my ReadyNAS 102 from 3.11.1 to
3.11.7. The device is based on a Marvell Armada 370 SoC and uses the mvneta
driver. Mine runs Debian armel unstable, but I can confirm the issue also
happens on a Debian armhf unstable.

Doing some scp transfers of files located on the NAS (1000baseT-FD on
both sides), I noticed the transfer rate is ridiculously slow (280KB/s).
I did the same test with a 3.12 kernel and got the same results,
i.e. AFAICT the bug also exists upstream.

So, I decided to bite the bullet and start digging a bit: I ran a 'git
bisect' session on the stable tree from 3.11.1 (known good) to 3.11.7
(known bad). The results are given below.

I first rebooted on my old 3.11.1 kernel and did 20 transfers of a 1GB
file located on the NAS to my laptop via scp. I waited the 20+ minutes
and let them all finish: each transfer took between 1min5s and 1min7s
(around 16MB/s, the limitation in that case being the crypto part).

I rebooted again and did the exact same thing on 3.11.7: after the
completion of the first file transfer in 1m6s (16MB/s), the second
one gave me that:

***@small:~$ scp RN102:/tmp/random /dev/null
random 0% 1664KB 278.9KB/s 1:05:37 ETA^C

And it continued that way for the remaining transfers (I did ^C after
a few seconds to restart the transfer when the rate was low):

$ for k in $(seq 1 20) ; do scp RN102:random /dev/null ; done
random 100% 1024MB 15.6MB/s 01:06 ETA^C
random 0% 9856KB 282.2KB/s 1:01:20 ETA^C
random 16% 168MB 563.9KB/s 25:54 ETA^C
random 0% 2816KB 273.5KB/s 1:03:43 ETA^C
random 100% 1024MB 15.5MB/s 01:06
random 1% 17MB 282.3KB/s 1:00:54 ETA^C
random 0% 544KB 259.2KB/s 1:07:23 ETA^C
random 0% 4224KB 277.3KB/s 1:02:45 ETA^C
random 0% 832KB 262.1KB/s 1:06:37 ETA^C
random 0% 3360KB 273.4KB/s 1:03:43 ETA^C
random 0% 3072KB 271.8KB/s 1:04:07 ETA^C
random 0% 832KB 262.1KB/s 1:06:37 ETA^C
random 0% 1408KB 267.0KB/s 1:05:21 ETA^C
random 0% 1120KB 264.7KB/s 1:05:57 ETA
...

To be sure, I did 2 additional reboots, one on each kernel, and the
results are consistent: perfect on 3.11.1 and a slow rate most of the
time on 3.11.7 (both kernels were compiled from a fresh make clean,
using the same config file).

Then, knowing that, I started a git bisect session on the stable tree to
end up with the following suspects. I failed to narrow it down to a single
commit, due to crashes, but I could recompile a kernel w/ debug info and
report what I get if needed.

commit dc0791aee672 tcp: do not forget FIN in tcp_shifted_skb() [bad]
commit 18ddf5127c9f tcp: must unclone packets before mangling them
commit 80bd5d8968d8 tcp: TSQ can use a dynamic limit
commit dbeb18b22197 tcp: TSO packets automatic sizing
commit 50704410d014 Linux 3.11.6 [good]

Eric, David, if it has already been reported and fixed, just tell
me. Otherwise, if you have any ideas, I'll be happy to test this
evening.

Cheers,

a+


Just in case it may be useful, here is what ethtool reports on RN102:

# ethtool -i eth0
driver: mvneta
version: 1.0
firmware-version:
bus-info: eth0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

# ethtool -k eth0
Features for eth0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
Cong Wang
2013-11-12 06:48:59 UTC
Post by Arnaud Ebalard
Hi,
I decided to upgrade the kernel on one of my ReadyNAS 102 from 3.11.1 to
3.11.7. The device is based on a Marvell Armada 370 SoC and uses the mvneta
driver. Mine runs Debian armel unstable, but I can confirm the issue also
happens on a Debian armhf unstable.
[...]
Post by Arnaud Ebalard
Then, knowing that, I started a git bisect session on the stable tree to
end up with the following suspects. I failed to narrow it down to a single
commit, due to crashes, but I could recompile a kernel w/ debug info and
report what I get if needed.
commit dc0791aee672 tcp: do not forget FIN in tcp_shifted_skb() [bad]
commit 18ddf5127c9f tcp: must unclone packets before mangling them
commit 80bd5d8968d8 tcp: TSQ can use a dynamic limit
commit dbeb18b22197 tcp: TSO packets automatic sizing
commit 50704410d014 Linux 3.11.6 [good]
This regression was probably introduced by the last TSQ commit; Eric has
a patch for mvneta in the other thread:

http://article.gmane.org/gmane.linux.network/290359
Arnaud Ebalard
2013-11-12 07:56:25 UTC
Hi,

Thanks for the pointer. See below.
Post by Cong Wang
Post by Arnaud Ebalard
Hi,
I decided to upgrade the kernel on one of my ReadyNAS 102 from 3.11.1 to
3.11.7. The device is based on a Marvell Armada 370 SoC and uses the mvneta
driver. Mine runs Debian armel unstable, but I can confirm the issue also
happens on a Debian armhf unstable.
[...]
Post by Arnaud Ebalard
Then, knowing that, I started a git bisect session on the stable tree to
end up with the following suspects. I failed to narrow it down to a single
commit, due to crashes, but I could recompile a kernel w/ debug info and
report what I get if needed.
commit dc0791aee672 tcp: do not forget FIN in tcp_shifted_skb() [bad]
commit 18ddf5127c9f tcp: must unclone packets before mangling them
commit 80bd5d8968d8 tcp: TSQ can use a dynamic limit
commit dbeb18b22197 tcp: TSO packets automatic sizing
commit 50704410d014 Linux 3.11.6 [good]
This regression was probably introduced by the last TSQ commit; Eric has a patch
http://article.gmane.org/gmane.linux.network/290359
I had some offline (*) discussions w/ Eric and did some tests w/ the
patches he sent. They do not fix the regression I see. It would be nice if
someone w/ the hardware and more knowledge of the mvneta driver could
reproduce the issue and spend some time on it.

That being said, even if the driver is most probably not the only one to
blame here (considering the result of the bisect and the current thread on
netdev), I never managed to get the performance I have on my ReadyNAS
Duo v2 (i.e. 108MB/s for a file served by Apache) with a mvneta-based
platform (RN102, RN104 or RN2120). Understanding why is on an already
long todo list.

Cheers,

a+

(*): for some reason, my messages to netdev and stable are not published
even though I can interact w/ {majordomo,autoanswer}@vger.kernel.org. I
poked postmaster@ but got no reply yet.
Willy Tarreau
2013-11-12 08:36:33 UTC
Hi Arnaud,
Post by Arnaud Ebalard
I had some offline (*) discussions w/ Eric and did some tests w/ the
patches he sent. They do not fix the regression I see. It would be nice if
someone w/ the hardware and more knowledge of the mvneta driver could
reproduce the issue and spend some time on it.
I could give it a try but am falling very short of time at the moment.
Post by Arnaud Ebalard
That being said, even if the driver is most probably not the only one to
blame here (considering the result of the bisect and the current thread on
netdev), I never managed to get the performance I have on my ReadyNAS
Duo v2 (i.e. 108MB/s for a file served by Apache) with a mvneta-based
platform (RN102, RN104 or RN2120). Understanding why is on an already
long todo list.
Yes, I found that your original numbers were already quite low, so it is
also possible that you have a different problem (eg: a faulty switch or an
auto-negotiation problem where the switch goes to half duplex because the
neta does not advertise nway or whatever) that is emphasized by the
latest changes.
Post by Arnaud Ebalard
Cheers,
a+
(*): for some reason, my messages to netdev and stable are not published
I can confirm that I got this message from you on netdev so it should be OK
now.

Willy
Arnaud Ebalard
2013-11-12 09:14:34 UTC
Hi,
Post by Willy Tarreau
Post by Arnaud Ebalard
That being said, even if the driver is most probably not the only one to
blame here (considering the result of the bisect and the current thread on
netdev), I never managed to get the performance I have on my ReadyNAS
Duo v2 (i.e. 108MB/s for a file served by Apache) with a mvneta-based
platform (RN102, RN104 or RN2120). Understanding why is on an already
long todo list.
Yes, I found that your original numbers were already quite low,
Tests for the regression were done w/ scp, and were hence limited by the
crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
wget for a file served by Apache *before* the regression and I never got
more than 60MB/s from what I recall. Can you beat that?
Post by Willy Tarreau
so it is also possible that you have a different problem (eg: a faulty
switch or an auto-negotiation problem where the switch goes to half
duplex because the neta does not advertise nway or whatever) that is
emphasized by the latest changes
Tested w/ back-to-back connections to the NAS from various hosts and
through different switches. Never saturated the link.
Post by Willy Tarreau
I can confirm that I got this message from you on netdev so it should be OK
now.
Good. Thanks for the info.

a+
Willy Tarreau
2013-11-12 10:01:26 UTC
Hi Arnaud,
Post by Arnaud Ebalard
Tests for the regression were done w/ scp, and were hence limited by the
crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
wget for a file served by Apache *before* the regression and I never got
more than 60MB/s from what I recall. Can you beat that?
Yes, I finally picked my mirabox out of my bag for a quick test. It boots
off 3.10.0-rc7 and I totally saturate one port (stable 988 Mbps) with even
a single TCP stream.

With two systems, one directly connected (dockstar) and the other one via
a switch, I get 2*650 Mbps (a single TCP stream is enough on each).

I'll have to re-run some tests using a more up to date kernel, but that
will probably not be today though.

Regards,
Willy
Arnaud Ebalard
2013-11-12 15:34:24 UTC
Hi,
Post by Willy Tarreau
Post by Arnaud Ebalard
Tests for the regression were done w/ scp, and were hence limited by the
crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
wget for a file served by Apache *before* the regression and I never got
more than 60MB/s from what I recall. Can you beat that?
Yes, I finally picked my mirabox out of my bag for a quick test. It boots
off 3.10.0-rc7 and I totally saturate one port (stable 988 Mbps) with even
a single TCP stream.
Thanks for the feedback. That's interesting. What are you using for your tests
(wget, ...)?
Post by Willy Tarreau
With two systems, one directly connected (dockstar) and the other one via
a switch, I get 2*650 Mbps (a single TCP stream is enough on each).
I'll have to re-run some tests using a more up to date kernel, but that
will probably not be today though.
Can you give a pre-3.11.7 kernel a try if you find the time? I started
working on the RN102 during the 3.10-rc cycle but do not remember if I did
the first performance tests on 3.10 or 3.11. And if you find more time,
3.11.7 would be nice too ;-)

Cheers,

a+
Willy Tarreau
2013-11-13 07:22:57 UTC
Post by Arnaud Ebalard
Hi,
Post by Willy Tarreau
Post by Arnaud Ebalard
Tests for the regression were done w/ scp, and were hence limited by the
crypto (16MB/s using arcfour128). But I also did some tests w/ a simple
wget for a file served by Apache *before* the regression and I never got
more than 60MB/s from what I recall. Can you beat that?
Yes, I finally picked my mirabox out of my bag for a quick test. It boots
off 3.10.0-rc7 and I totally saturate one port (stable 988 Mbps) with even
a single TCP stream.
Thanks for the feedback. That's interesting. What are you using for your tests
(wget, ...)?
No, inject (for the client) + httpterm (for the server), but it also works with
a simple netcat < /dev/zero, except that netcat uses 8kB buffers and is quickly
CPU-bound. The tools I'm talking about are available here :

http://1wt.eu/tools/inject/?C=M;O=D
http://1wt.eu/tools/httpterm/httpterm-1.7.2.tar.gz

Httpterm is a dummy web server. You can send requests like
"GET /?s=1m HTTP/1.0" and it returns 1 MB of data in the response,
which is quite convenient! I'm sorry for the limited documentation
(don't even try to write a config file, it's a fork of an old haproxy
version). Simply start it as :

httpterm -D -L ip:port (where 'ip' is optional)
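
For example, to serve on port 8000 and fetch a 1 MB object with a plain
netcat, something like this (address and port are just for illustration) :

    $ httpterm -D -L 192.168.1.1:8000
    $ printf 'GET /?s=1m HTTP/1.0\r\n\r\n' | nc 192.168.1.1 8000 > /dev/null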

Inject is an HTTP client initially designed to test applications but still
doing well enough for component testing (though it does not scale well with
large numbers of connections). I remember that Pablo Neira rewrote a simpler
equivalent here : http://1984.lsi.us.es/git/http-client-benchmark, but I'm
used to my old version.

There's an old doc in PDF in the download directory. Unfortunately it is
in French, which is not always very convenient. But what I like there
is that you get one line of stats per second so you can easily follow how
the test goes, as opposed to some tools like "ab" which only give you a
summary at the end. That's one of the key points that Pablo has reimplemented
in his tool BTW.
Post by Arnaud Ebalard
Post by Willy Tarreau
With two systems, one directly connected (dockstar) and the other one via
a switch, I get 2*650 Mbps (a single TCP stream is enough on each).
I'll have to re-run some tests using a more up to date kernel, but that
will probably not be today though.
Can you give a pre-3.11.7 kernel a try if you find the time? I started
working on the RN102 during the 3.10-rc cycle but do not remember if I did
the first performance tests on 3.10 or 3.11. And if you find more time,
3.11.7 would be nice too ;-)
Still have not found time for this but I observed something intriguing
which might possibly match your experience : if I use large enough send
buffers on the mirabox and receive buffers on the client, then the
traffic drops for objects larger than 1 MB. I have quickly checked what's
happening and it's just that there are pauses of up to 8 ms between some
packets when the TCP send window grows larger than about 200 kB. And
since there are no drops, there is no reason for the window to shrink.
I suspect it's exactly related to the issue explained by Eric about the
timer used to recycle the Tx descriptors. However last time I checked,
these ones were also processed in the Rx path, which means that the
ACKs that flow back should have had the same effect as a Tx IRQ (unless
I'd use asymmetric routing, which was not the case). So there might be
another issue. Ah, and it only happens with GSO.

I really need some time to perform more tests, I'm sorry Arnaud, but I
can't do them right now. What you can do is to try to reduce your send
window to 1 MB or less to see if the issue persists :

$ cat /proc/sys/net/ipv4/tcp_wmem
$ echo 4096 16384 1048576 > /proc/sys/net/ipv4/tcp_wmem

You also need to monitor your CPU usage to ensure that you're not limited
by some processing inside apache. At 1 Gbps, you should use only something
like 40-50% of the CPU.

Cheers,
Willy
Willy Tarreau
2013-11-17 14:19:40 UTC
Hi Arnaud,
Post by Willy Tarreau
Post by Arnaud Ebalard
Can you give a pre-3.11.7 kernel a try if you find the time? I started
working on the RN102 during the 3.10-rc cycle but do not remember if I did
the first performance tests on 3.10 or 3.11. And if you find more time,
3.11.7 would be nice too ;-)
Still have not found time for this but I observed something intriguing
which might possibly match your experience : if I use large enough send
buffers on the mirabox and receive buffers on the client, then the
traffic drops for objects larger than 1 MB. I have quickly checked what's
happening and it's just that there are pauses of up to 8 ms between some
packets when the TCP send window grows larger than about 200 kB. And
since there are no drops, there is no reason for the window to shrink.
I suspect it's exactly related to the issue explained by Eric about the
timer used to recycle the Tx descriptors. However last time I checked,
these ones were also processed in the Rx path, which means that the
ACKs that flow back should have had the same effect as a Tx IRQ (unless
I'd use asymmetric routing, which was not the case). So there might be
another issue. Ah, and it only happens with GSO.
I just had a quick look at the driver and I can confirm that Eric is right
about the fact that we use up to two descriptors per GSO segment. Thus, we
can saturate the Tx queue at 532/2 = 266 Tx segments = 388360 bytes (for
1460 MSS). I thought I had seen a tx flush from the rx poll function but I
can't find it so it seems I was wrong, or that I possibly misunderstood
mvneta_poll() the first time I read it. Thus the observed behaviour is
perfectly normal.

With GSO enabled, as soon as the window grows large enough, we can fill
all the Tx descriptors with few segments, then need to wait for 10ms (12
if running at 250 Hz as I am) to flush them, which explains the low speed
I was observing with large windows. When disabling GSO, as much as twice
the number of descriptors can be used, which is enough to fill the wire
in the same time frame. Additionally it's likely that more descriptors
get the time to be sent during that period and that each call to mvneta_tx()
causes a call to mvneta_txq_done(), which releases some of the previously
sent descriptors, allowing wire rate to be sustained.
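
A quick back-of-the-envelope check (assuming the ring is only recycled
when the timer fires) gives the ceiling one would expect from this :

    266 segments * 1460 B = 388360 B queued per flush
    388360 B / 10 ms      = ~38.8 MB/s (~310 Mbps)
    388360 B / 12 ms      = ~32.4 MB/s at 250 Hz

i.e. well below gigabit wire rate.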

I wonder if we can call mvneta_txq_done() from the IRQ handler, which would
cause some recycling of the Tx descriptors when receiving the corresponding
ACKs.

Ideally we should enable the Tx IRQ, but I still have no access to this
chip's datasheet despite having asked Marvell several times in one year
(Thomas has it though).

So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.

I still did not have time to run a new kernel on this device however :-(

Best regards,
Willy
Eric Dumazet
2013-11-17 17:41:38 UTC
Post by Willy Tarreau
So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.
BTW I have a very simple patch for TCP stack that could help this exact
situation...

The idea is to use TCP Small Queues so that we don't fill the qdisc/TX ring
with very small frames, and let tcp_sendmsg() have more chances to fill
complete packets.

Again, for this to work very well, you need the NIC to perform TX
completion in a reasonable amount of time...

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3dc0c6c..10456cf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 {
         if (tcp_send_head(sk)) {
                 struct tcp_sock *tp = tcp_sk(sk);
+                struct sk_buff *skb = tcp_write_queue_tail(sk);
 
                 if (!(flags & MSG_MORE) || forced_push(tp))
-                        tcp_mark_push(tp, tcp_write_queue_tail(sk));
+                        tcp_mark_push(tp, skb);
 
                 tcp_mark_urg(tp, flags);
-                __tcp_push_pending_frames(sk, mss_now,
-                                          (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+                if (flags & MSG_MORE)
+                        nonagle = TCP_NAGLE_CORK;
+                if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
+                        set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+                        nonagle = TCP_NAGLE_CORK;
+                }
+                __tcp_push_pending_frames(sk, mss_now, nonagle);
         }
 }
Arnaud Ebalard
2013-11-19 06:44:50 UTC
Hi,
Post by Eric Dumazet
Post by Willy Tarreau
So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.
BTW I have a very simple patch for TCP stack that could help this exact
situation...
The idea is to use TCP Small Queues so that we don't fill the qdisc/TX ring
with very small frames, and let tcp_sendmsg() have more chances to fill
complete packets.
Again, for this to work very well, you need the NIC to perform TX
completion in a reasonable amount of time...
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3dc0c6c..10456cf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -624,13 +624,19 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 {
         if (tcp_send_head(sk)) {
                 struct tcp_sock *tp = tcp_sk(sk);
+                struct sk_buff *skb = tcp_write_queue_tail(sk);
 
                 if (!(flags & MSG_MORE) || forced_push(tp))
-                        tcp_mark_push(tp, tcp_write_queue_tail(sk));
+                        tcp_mark_push(tp, skb);
 
                 tcp_mark_urg(tp, flags);
-                __tcp_push_pending_frames(sk, mss_now,
-                                          (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+                if (flags & MSG_MORE)
+                        nonagle = TCP_NAGLE_CORK;
+                if (atomic_read(&sk->sk_wmem_alloc) > 2048) {
+                        set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+                        nonagle = TCP_NAGLE_CORK;
+                }
+                __tcp_push_pending_frames(sk, mss_now, nonagle);
         }
 }
I did some tests regarding mvneta perf on the current Linus tree (commit
2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
armada-370: fix tclk frequencies"). The kernel has some debug options
enabled and the patch above is not applied. I will spend some time on
these two directions this evening. The idea was to get some numbers on
the impact of the TCP send window size and tcp_limit_output_bytes for
mvneta.

The test is done with a laptop (Debian, 3.11.0, e1000e) directly
connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
is running Debian armhf with an Apache2 serving a 1GB file from ext4
over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
a simple wget w/ -O /dev/null option.

With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
@1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
(cannot remember the kernel version, but sth between 3.8 and 3.10).

So with that setup:

w/ TCP send window set to 4MB: 17.4 MB/s
w/ TCP send window set to 2MB: 16.2 MB/s
w/ TCP send window set to 1MB: 15.6 MB/s
w/ TCP send window set to 512KB: 25.6 MB/s
w/ TCP send window set to 256KB: 57.7 MB/s
w/ TCP send window set to 128KB: 54.0 MB/s
w/ TCP send window set to 64KB: 46.2 MB/s
w/ TCP send window set to 32KB: 42.8 MB/s
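
The send window caps above were applied via the max field of tcp_wmem,
i.e. something like this for the 256KB case:

    echo "4096 16384 262144" > /proc/sys/net/ipv4/tcp_wmem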

Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
w/ TCP send window set to 256KB:

tcp_limit_output_bytes set to 512KB: 59.3 MB/s
tcp_limit_output_bytes set to 256KB: 58.5 MB/s
tcp_limit_output_bytes set to 128KB: 56.2 MB/s
tcp_limit_output_bytes set to 64KB: 32.1 MB/s
tcp_limit_output_bytes set to 32KB: 4.76 MB/s
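
Each value was set via the sysctl file, e.g. for the 32KB case:

    echo 32768 > /proc/sys/net/ipv4/tcp_limit_output_bytes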

As a side note, during the tests I sometimes get peaks at 90MB/s for a few
seconds at the beginning, which tends to confirm what Willy wrote,
i.e. that the hardware can do more.

Cheers,

a+
Eric Dumazet
2013-11-19 13:53:14 UTC
Post by Arnaud Ebalard
I did some tests regarding mvneta perf on the current Linus tree (commit
2d3c627502f2a9b0, w/ c9eeec26e32e "tcp: TSQ can use a dynamic limit"
reverted). It has Simon's tclk patch for mvebu (1022c75f5abd, "clk:
armada-370: fix tclk frequencies"). The kernel has some debug options
enabled and the patch above is not applied. I will spend some time on
these two directions this evening. The idea was to get some numbers on
the impact of the TCP send window size and tcp_limit_output_bytes for
mvneta.
Note the last patch I sent was not relevant to your problem, do not
bother trying it. It's useful for applications doing lots of consecutive
short writes, like an interactive ssh launching line-buffered commands.
Post by Arnaud Ebalard
The test is done with a laptop (Debian, 3.11.0, e1000e) directly
connected to a RN102 (Marvell Armada 370 @1.2GHz, mvneta). The RN102
is running Debian armhf with an Apache2 serving a 1GB file from ext4
over lvm over RAID1 from 2 WD30EFRX. The client is nothing fancy, i.e.
a simple wget w/ -O /dev/null option.
With the exact same setup on a ReadyNAS Duo v2 (Kirkwood 88f6282
@1.6GHz, mv643xx_eth), I managed to get a throughput of 108MB/s
(cannot remember the kernel version, but sth between 3.8 and 3.10).
w/ TCP send window set to 4MB: 17.4 MB/s
w/ TCP send window set to 2MB: 16.2 MB/s
w/ TCP send window set to 1MB: 15.6 MB/s
w/ TCP send window set to 512KB: 25.6 MB/s
w/ TCP send window set to 256KB: 57.7 MB/s
w/ TCP send window set to 128KB: 54.0 MB/s
w/ TCP send window set to 64KB: 46.2 MB/s
w/ TCP send window set to 32KB: 42.8 MB/s
One of the problems is that tcp_sendmsg() holds the socket lock for the
whole duration of the system call if it does not have to sleep. This model
doesn't allow incoming ACKs to be processed (they are put in the socket
backlog and will be processed at socket release time), nor TX completion
to queue the next chunk.

These strange results you have tend to show that if you have a big TCP
send window, the web server pushes a lot of bytes per system call and
might stall the ACK clocking or TX refills.
Post by Arnaud Ebalard
Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
tcp_limit_output_bytes set to 512KB: 59.3 MB/s
tcp_limit_output_bytes set to 256KB: 58.5 MB/s
tcp_limit_output_bytes set to 128KB: 56.2 MB/s
tcp_limit_output_bytes set to 64KB: 32.1 MB/s
tcp_limit_output_bytes set to 32KB: 4.76 MB/s
As a side note, during the tests I sometimes get peaks at 90MB/s for a few
seconds at the beginning, which tends to confirm what Willy wrote,
i.e. that the hardware can do more.
I would also check the receiver. I suspect packet drops because of a
bad driver overshooting skb->truesize.

nstat >/dev/null ; wget .... ; nstat
Willy Tarreau
2013-11-19 17:43:23 UTC
Hi Eric,
Post by Eric Dumazet
These strange results you have tend to show that if you have a big TCP
send window, the web server pushes a lot of bytes per system call and
might stall the ACK clocking or TX refills.
It's the tx refills which are not done in this case, from what I think I
understood in the driver. IIRC, the refill is done once at the beginning
of xmit and in the tx timer callback. So if you have too large a window
that fills the descriptors during a few tx calls during which no desc was
released, you could end up having to wait for the timer since you're not
allowed to send anymore.
Post by Eric Dumazet
Post by Arnaud Ebalard
Then, I started playing w/ tcp_limit_output_bytes (default is 131072),
tcp_limit_output_bytes set to 512KB: 59.3 MB/s
tcp_limit_output_bytes set to 256KB: 58.5 MB/s
tcp_limit_output_bytes set to 128KB: 56.2 MB/s
tcp_limit_output_bytes set to 64KB: 32.1 MB/s
tcp_limit_output_bytes set to 32KB: 4.76 MB/s
As a side note, during the tests I sometimes get peaks at 90MB/s for a few
seconds at the beginning, which tends to confirm what Willy wrote,
i.e. that the hardware can do more.
I would also check the receiver. I suspect packet drops because of a
bad driver overshooting skb->truesize.
When I first observed the issue, I suspected my laptop's driver, so I
tried with a dockstar instead and the issue disappeared... until I
increased tcp_rmem on it to match my laptop :-)

Arnaud, you might be interested in checking whether the following change
does something for you in mvneta.c :

- #define MVNETA_TX_DONE_TIMER_PERIOD 10
+ #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)

This can only have any effect if you run at 250 or 1000 Hz, but not at 100
of course. It should reduce the time to first IRQ.
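
For the record, the jiffies math behind this (msecs_to_jiffies() rounds up) :

    HZ=250 : msecs_to_jiffies(10)      = 3 jiffies = 12 ms per flush
             msecs_to_jiffies(1000/HZ) = 1 jiffy   =  4 ms per flush
    HZ=100 : 1000/HZ = 10 ms, so nothing changes there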

Willy
Eric Dumazet
2013-11-19 18:31:50 UTC
Post by Willy Tarreau
- #define MVNETA_TX_DONE_TIMER_PERIOD 10
+ #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
I suggested this in a prior mail :

#define MVNETA_TX_DONE_TIMER_PERIOD 1

But apparently it was triggering strange crashes...
Post by Willy Tarreau
This can only have any effect if you run at 250 or 1000 Hz, but not at 100
of course. It should reduce the time to first IRQ.
Willy Tarreau
2013-11-19 18:41:21 UTC
Post by Eric Dumazet
Post by Willy Tarreau
- #define MVNETA_TX_DONE_TIMER_PERIOD 10
+ #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
#define MVNETA_TX_DONE_TIMER_PERIOD 1
Ah sorry, I remember now.
Post by Eric Dumazet
But apparently it was triggering strange crashes...
Ah, when a bug hides another one, it's the situation I prefer, because
by working on one, you end up fixing two :-)

Cheers,
Willy
Arnaud Ebalard
2013-11-19 23:53:43 UTC
Hi,
Post by Willy Tarreau
Post by Eric Dumazet
Post by Willy Tarreau
- #define MVNETA_TX_DONE_TIMER_PERIOD 10
+ #define MVNETA_TX_DONE_TIMER_PERIOD (1000/HZ)
#define MVNETA_TX_DONE_TIMER_PERIOD 1
Ah sorry, I remember now.
Post by Eric Dumazet
But apparently it was triggering strange crashes...
Ah, when a bug hides another one, it's the situation I prefer, because
by working on one, you end up fixing two :-)
Follow me for just one sec: today, I got a USB 3.0 Gigabit Ethernet
adapter. More specifically an AX88179-based one (Logitec LAN-GTJU3H3),
about which there is currently a thread on the netdev and linux-usb
lists. Anyway, I decided to give it a try on my RN102 just to check what
performance I could achieve. So I basically did the same experiment as
yesterday (wget on the client against a 1GB file located on the filesystem
served by an apache on the NAS) except that this time the AX88179-based
adapter was used instead of the mvneta-based interface. Well, the
download started at a high rate (90MB/s) but then dropped, and I got some
SATA errors on the NAS (similar to the errors I already got during the
3.12.0-rc series [1] and finally *erroneously* considered an artefact).

So I decided to remove the SATA controllers and disks from the equation:
I switched to my ReadyNAS 2120, whose GbE interfaces are also based on the
mvneta driver and which comes w/ 2GB of RAM. The main additional difference
is that the device is a dual-core Armada @1.2GHz, where the RN102 is a
single-core Armada @1.2GHz. I created a dummy 1GB file *in RAM*
(/run/shm) to have it served by the apache2 instead of the file
previously stored on the disks.

I started w/ today's Linus tree (dec8e46178b) with Eric's revert patch
for c9eeec26e32e (tcp: TSQ can use a dynamic limit) and also the change
to the mvneta driver to have:

-#define MVNETA_TX_DONE_TIMER_PERIOD 10
+#define MVNETA_TX_DONE_TIMER_PERIOD 1

Here are the average speeds given by wget for the following TCP send
window sizes:

4 MB: 19 MB/s
2 MB: 21 MB/s
1 MB: 21 MB/s
512KB: 23 MB/s
384KB: 105 MB/s
256KB: 112 MB/s
128KB: 111 MB/s
64KB: 93 MB/s

Then, I decided to redo the exact same test w/o the change to
MVNETA_TX_DONE_TIMER_PERIOD (i.e. w/ the initial value of 10). I got the
exact same results as with MVNETA_TX_DONE_TIMER_PERIOD set to 1, i.e.:

4 MB: 20 MB/s
2 MB: 21 MB/s
1 MB: 21 MB/s
512KB: 22 MB/s
384KB: 105 MB/s
256KB: 112 MB/s
128KB: 111 MB/s
64KB: 93 MB/s

And then, I also dropped Eric's revert patch for c9eeec26e32e (tcp: TSQ
can use a dynamic limit), just to verify we came back to where the thread
started, but I got a surprise:

4 MB: 10 MB/s
2 MB: 11 MB/s
1 MB: 10 MB/s
512KB: 12 MB/s
384KB: 104 MB/s
256KB: 112 MB/s
128KB: 112 MB/s
64KB: 93 MB/s

Instead of the 256KB/s I had observed initially, the low value was now
10MB/s. I thought it was due to the switch from the RN102 to the RN2120, so
I came back to the RN102 w/o any specific patch for mvneta nor your revert
patch for c9eeec26e32e, i.e. only Linus' tree as it is today (dec8e46178b).
The file is served from the disk:

4 MB: 5 MB/s
2 MB: 5 MB/s
1 MB: 5 MB/s
512KB: 5 MB/s
384KB: 90 MB/s for 4s, then 3 MB/s
256KB: 80 MB/s for 3s, then 2 MB/s
128KB: 90 MB/s for 3s, then 3 MB/s
64KB: 80 MB/s for 3s, then 3 MB/S

Then, I allocated a dummy 400MB file in RAM (/run/shm) and redid the
test on the RN102:

4 MB: 8 MB/s
2 MB: 8 MB/s
1 MB: 92 MB/s
512KB: 90 MB/s
384KB: 90 MB/s
256KB: 90 MB/s
128KB: 90 MB/s
64KB: 60 MB/s

In the end, here are the conclusions *I* draw from this test session,
do not hesitate to correct me:

- Eric, it seems something changed in Linus' tree between the beginning
of the thread and now, which somehow reduces the effect of the
regression we were seeing: I never got back the 256KB/s.
- Your revert patch still improves the perf a lot.
- It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help.
- w/ your revert patch, I can confirm that the mvneta driver is capable of
doing line rate w/ a proper tweak of the TCP send window (256KB instead of
4M).
- It seems I will have to spend some time on the SATA issues I
previously thought were an artefact of not cleaning my tree during a
debug session [1], i.e. there is IMHO an issue.

What I do not get is what can cause the perf to drop from 90MB/s to
3MB/s (w/ a 256KB send window) when streaming from the disk instead of
the RAM. I have no issue having dd read from the fs @ 150MB/s and
mvneta streaming from RAM @ 90MB/s but both together get me 3MB/s after
a few seconds.

Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.

Cheers,

a+

[1]: http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508
Eric Dumazet
2013-11-20 00:08:49 UTC
Post by Arnaud Ebalard
Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.
;)

Alternative would be to use netperf or iperf to not use disk at all
and focus on TCP/network issues only.

Thanks !
Willy Tarreau
2013-11-20 00:35:19 UTC
Post by Eric Dumazet
Post by Arnaud Ebalard
Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.
;)
Alternative would be to use netperf or iperf to not use disk at all
and focus on TCP/network issues only.
Yes, that's for the same reason that I continue to use inject/httpterm
for such purposes :
- httpterm uses tee()+splice() to send pre-built pages without copying ;
- inject uses recv(MSG_TRUNC) to ack everything without copying.

Both of them are really interesting to test the hardware's capabilities
and to push components in the middle to their limits without causing too
much burden to the end points.

I don't know if either netperf or iperf can make use of this now, and
I'm used to my tools, but I should take a look again.

Cheers,
Willy
Eric Dumazet
2013-11-20 00:43:48 UTC
Post by Willy Tarreau
Post by Eric Dumazet
Post by Arnaud Ebalard
Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.
;)
Alternative would be to use netperf or iperf to not use disk at all
and focus on TCP/network issues only.
Yes, that's for the same reason that I continue to use inject/httpterm
- httpterm uses tee()+splice() to send pre-built pages without copying ;
- inject uses recv(MSG_TRUNC) to ack everything without copying.
Both of them are really interesting to test the hardware's capabilities
and to push components in the middle to their limits without causing too
much burden to the end points.
I don't know if either netperf or iperf can make use of this now, and
I'm used to my tools, but I should take a look again.
netperf -t TCP_SENDFILE does the zero copy at sender.

And more generally -V option does copy avoidance

(Use splice(sockfd -> nullfd) at receiver if I remember well)
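
For instance (<server> standing for the device under test) :

    netperf -H <server> -t TCP_STREAM    # plain test, copies on both sides
    netperf -H <server> -t TCP_SENDFILE  # same, but sendfile() at the sender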


Anyway, we should do the normal copy, because it might demonstrate
scheduling problems.

If you want to test raw speed, you could use pktgen ;)
Willy Tarreau
2013-11-20 00:52:14 UTC
Post by Eric Dumazet
Post by Willy Tarreau
Post by Eric Dumazet
Post by Arnaud Ebalard
Anyway, I think if the thread keeps going on improving mvneta, I'll do
all additional tests from RAM and will stop polluting netdev w/ possible
sata/disk/fs issues.
;)
Alternative would be to use netperf or iperf to not use disk at all
and focus on TCP/network issues only.
Yes, that's for the same reason that I continue to use inject/httpterm
- httpterm uses tee()+splice() to send pre-built pages without copying ;
- inject uses recv(MSG_TRUNC) to ack everything without copying.
Both of them are really interesting to test the hardware's capabilities
and to push components in the middle to their limits without causing too
much burden to the end points.
I don't know if either netperf or iperf can make use of this now, and
I'm used to my tools, but I should take a look again.
netperf -t TCP_SENDFILE does the zero copy at sender.
And more generally -V option does copy avoidance
(Use splice(sockfd -> nullfd) at receiver if I remember well)
OK thanks for the info.
Post by Eric Dumazet
Anyway, we should do the normal copy, because it might demonstrate
scheduling problems.
Yes, especially in this case, though I got the issue with GSO only,
so it might vary as well.
Post by Eric Dumazet
If you want to test raw speed, you could use pktgen ;)
Except I'm mostly focused on HTTP, as you know. And for generating higher
packet rates than pktgen, I have an absolutely ugly mvneta patch that I'm a
bit ashamed of, which multiplies the number of emitted descriptors
for a given skb by skb->sk->sk_mark (which I set using setsockopt). This
allows me to generate up to 1.488 Mpps on a USB-powered system, not that
bad in my opinion :-)

Willy
Thomas Petazzoni
2013-11-20 08:50:37 UTC
Arnaud,
Post by Arnaud Ebalard
- It seems I will have to spend some time on the SATA issues I
previously thought were an artefact of not cleaning my tree during a
debug session [1], i.e. there is IMHO an issue.
I don't remember in detail what your SATA problem was, but just to let
you know that we are currently debugging a problem that occurs on
Armada XP (more than one core is needed for the problem to occur), with
SATA (the symptom is that after some time of SATA usage, SATA traffic is
stalled, and no SATA interrupts are generated anymore). We're still
working on this one, and trying to figure out where the problem is.

Best regards,

Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
Arnaud Ebalard
2013-11-20 19:21:58 UTC
Hi Thomas,

I removed netdev from that reply
Post by Thomas Petazzoni
Arnaud,
Post by Arnaud Ebalard
- It seems I will have to spend some time on the SATA issues I
previously thought were an artefact of not cleaning my tree during a
debug session [1], i.e. there is IMHO an issue.
I don't remember in detail what your SATA problem was, but just to let
you know that we are currently debugging a problem that occurs on
Armada XP (more than one core is needed for the problem to occur), with
SATA (the symptom is that after some time of SATA usage, SATA traffic is
stalled, and no SATA interrupts are generated anymore). We're still
working on this one, and trying to figure out where the problem is.
The problem I had is described in the first email of:

http://thread.gmane.org/gmane.linux.ports.arm.kernel/271508

Then, yesterday, when testing with the USB 3.0 to ethernet dongle
connected to my RN102 (Armada 370) as primary interface, I got the
following. It is easily reproducible:

[ 317.412873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[ 317.419947] ata1.00: failed command: READ FPDMA QUEUED
[ 317.425118] ata1.00: cmd 60/00:00:00:07:2a/01:00:00:00:00/40 tag 0 ncq 131072 in
[ 317.425118] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 317.439926] ata1.00: status: { DRDY }
[ 317.443600] ata1.00: failed command: READ FPDMA QUEUED
[ 317.448756] ata1.00: cmd 60/00:08:00:08:2a/01:00:00:00:00/40 tag 1 ncq 131072 in
[ 317.448756] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 317.463565] ata1.00: status: { DRDY }
[ 317.467244] ata1: hard resetting link
[ 318.012913] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 318.020220] ata1.00: configured for UDMA/133
[ 318.024559] ata1.00: device reported invalid CHS sector 0
[ 318.030001] ata1.00: device reported invalid CHS sector 0
[ 318.035425] ata1: EH complete

And then again:

[ 381.342873] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[ 381.349947] ata1.00: failed command: READ FPDMA QUEUED
[ 381.355119] ata1.00: cmd 60/00:00:00:03:30/01:00:00:00:00/40 tag 0 ncq 131072 in
[ 381.355119] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 381.369927] ata1.00: status: { DRDY }
[ 381.373599] ata1.00: failed command: READ FPDMA QUEUED
[ 381.378756] ata1.00: cmd 60/00:08:00:04:30/01:00:00:00:00/40 tag 1 ncq 131072 in
[ 381.378756] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 381.393563] ata1.00: status: { DRDY }
[ 381.397242] ata1: hard resetting link
[ 381.942848] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 381.950167] ata1.00: configured for UDMA/133
[ 381.954496] ata1.00: device reported invalid CHS sector 0
[ 381.959958] ata1.00: device reported invalid CHS sector 0
[ 381.965383] ata1: EH complete

But, as the problem seems to happen when the dongle is connected and in
use (and considering current threads on the topic on USB and netdev ML),
I think I will wait for things to calm down and test again with a 3.12
and then a 3.13-rcX.

Anyway, if you find something on your bug, I can give patches a try on
my RN2120.

Cheers,

a+
Willy Tarreau
2013-11-20 19:11:45 UTC
Hi Arnaud,

first, thanks for all these tests.

On Wed, Nov 20, 2013 at 12:53:43AM +0100, Arnaud Ebalard wrote:
(...)
Post by Arnaud Ebalard
In the end, here are the conclusions *I* draw from this test session,
- Eric, it seems something changed in Linus' tree between the beginning
of the thread and now, which somehow reduces the effect of the
regression we were seeing: I never got back the 256KB/s.
- Your revert patch still improves the perf a lot.
- It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help.
- w/ your revert patch, I can confirm that the mvneta driver is capable of
doing line rate w/ a proper tweak of the TCP send window (256KB instead of
4M).
- It seems I will have to spend some time on the SATA issues I
previously thought were an artefact of not cleaning my tree during a
debug session [1], i.e. there is IMHO an issue.
Could you please try Eric's patch that was just merged into Linus' tree
if it was not yet in the kernel you tried :

98e09386c0e tcp: tsq: restore minimal amount of queueing

For me it restored the original performance (I saturate the Gbps with
about 7 concurrent streams).

Further, I wrote the small patch below for mvneta. I'm not sure it's
smp-safe, but it's a PoC. In mvneta_poll(), which currently is only called
upon Rx interrupts, it tries to flush all possible remaining Tx descriptors
if any. That significantly improved my transfer rate: now I easily achieve
1 Gbps using a single TCP stream on the mirabox. Not tried on the AX3 yet.

It also increased the overall connection rate by 10% on empty HTTP responses
(small packets), very likely by reducing the dead time between some segments!

You'll probably want to give it a try, so here it comes.

Cheers,
Willy
From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
From: Willy Tarreau <***@1wt.eu>
Date: Wed, 20 Nov 2013 19:47:11 +0100
Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
interrupts

Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
timer to flush Tx descriptors. This causes jerky output traffic with bursts
and pauses, making it difficult to reach line rate with very few streams.
This patch tries to improve the situation which is complicated by the lack
of public datasheet from Marvell. The workaround consists in trying to flush
pending buffers during the Rx polling. The idea is that for symmetric TCP
traffic, ACKs received in response to the packets sent will trigger the Rx
interrupt and will anticipate the flushing of the descriptors.

The results are quite good, a single TCP stream is now capable of saturating
a gigabit.

This is only a workaround, it doesn't address asymmetric traffic nor datagram
based traffic.

Signed-off-by: Willy Tarreau <***@1wt.eu>
---
drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5aed8ed..59e1c86 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
         }
 
         pp->cause_rx_tx = cause_rx_tx;
+
+        /* Try to flush pending Tx buffers if any */
+        if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
+                int tx_todo = 0;
+
+                mvneta_tx_done_gbe(pp,
+                                   (((1 << txq_number) - 1) &
+                                    MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
+                                   &tx_todo);
+
+                if (tx_todo > 0) {
+                        mod_timer(&pp->tx_done_timer,
+                                  jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
+                }
+                else {
+                        clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
+                        del_timer(&pp->tx_done_timer);
+                }
+        }
+
         return rx_done;
 }
--
1.7.12.1
Arnaud Ebalard
2013-11-20 19:26:16 UTC
Hi,
Post by Willy Tarreau
first, thanks for all these tests.
(...)
Post by Arnaud Ebalard
In the end, here are the conclusions *I* draw from this test session,
- Eric, it seems something changed in Linus' tree between the beginning
of the thread and now, which somehow reduces the effect of the
regression we were seeing: I never got back the 256KB/s.
- Your revert patch still improves the perf a lot.
- It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help.
- w/ your revert patch, I can confirm that the mvneta driver is capable of
doing line rate w/ a proper tweak of the TCP send window (256KB instead of
4M).
- It seems I will have to spend some time on the SATA issues I
previously thought were an artefact of not cleaning my tree during a
debug session [1], i.e. there is IMHO an issue.
Could you please try Eric's patch that was just merged into Linus' tree
98e09386c0e tcp: tsq: restore minimal amount of queueing
I have it in my quilt set.
Post by Willy Tarreau
For me it restored the original performance (I saturate the Gbps with
about 7 concurrent streams).
Further, I wrote the small patch below for mvneta. I'm not sure it's
smp-safe, but it's a PoC. In mvneta_poll(), which currently is only called
upon Rx interrupts, it tries to flush all possible remaining Tx descriptors
if any. That significantly improved my transfer rate: now I easily achieve
1 Gbps using a single TCP stream on the mirabox. Not tried on the AX3 yet.
It also increased the overall connection rate by 10% on empty HTTP responses
(small packets), very likely by reducing the dead time between some segments!
You'll probably want to give it a try, so here it comes.
hehe, I was falling short of patches to test tonight ;-) I will give it
a try now.

Cheers,

a+
Arnaud Ebalard
2013-11-20 21:28:50 UTC
Hi,
Post by Willy Tarreau
From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
Date: Wed, 20 Nov 2013 19:47:11 +0100
Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
interrupts
Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
timer to flush Tx descriptors. This causes jerky output traffic with bursts
and pauses, making it difficult to reach line rate with very few streams.
This patch tries to improve the situation which is complicated by the lack
of public datasheet from Marvell. The workaround consists in trying to flush
pending buffers during the Rx polling. The idea is that for symmetric TCP
traffic, ACKs received in response to the packets sent will trigger the Rx
interrupt and will anticipate the flushing of the descriptors.
The results are quite good, a single TCP stream is now capable of saturating
a gigabit.
This is only a workaround, it doesn't address asymmetric traffic nor datagram
based traffic.
---
drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5aed8ed..59e1c86 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
         }
 
         pp->cause_rx_tx = cause_rx_tx;
+
+        /* Try to flush pending Tx buffers if any */
+        if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
+                int tx_todo = 0;
+
+                mvneta_tx_done_gbe(pp,
+                                   (((1 << txq_number) - 1) &
+                                    MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
+                                   &tx_todo);
+
+                if (tx_todo > 0) {
+                        mod_timer(&pp->tx_done_timer,
+                                  jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
+                }
+                else {
+                        clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
+                        del_timer(&pp->tx_done_timer);
+                }
+        }
+
         return rx_done;
 }
With the current Linus tree (head being b4789b8e: aacraid: prevent invalid
pointer dereference), as a baseline, here is what I get:

w/ tcp_wmem left w/ default values (4096 16384 4071360)

via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
via wget against apache: 15.4 MB/s
via wget against nginx: 104 MB/s

w/ tcp_wmem set to 4096 16384 262144:

via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
via wget against apache: 63.3 MB/s
via wget against nginx: 104 MB/s

With your patch on top of it (and tcp_wmem kept at its default value):

via netperf: 939.16 / 935.44 Mbits/s
via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
and 72% CPU for apache2)
via wget against nginx: 106 MB/s


With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
(still w/ tcp_wmem kept at its default value):

via netperf: 939.12 / 935.84 Mbits/s
via wget against apache: 63.7 MB/s
via wget against nginx: 108 MB/s

So:

- First, Eric's patch sitting in Linus' tree does fix the regression
I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).

- As can be seen in the results of the first test, Eric's patch still
requires some additional tweaking of tcp_wmem to get netperf and
apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
15.4MB/s by setting the max tcp send buffer space to 256KB for apache).

- For unknown reasons, nginx manages to provide a 104MB/s download rate
even with tcp_wmem set to default and no specific patch of mvneta.

- Now, Willy's patch seems to make netperf happy (link saturated from
server to client), w/o tweaking tcp_wmem.

- Again, with Willy's patch I guess the "limitations" of the platform
(1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating the
link. All I can say is that the same test some months ago on a 1.6GHz
ARMv5TE (kirkwood 88f6282) w/ 256MB of RAM gave me 108MB/s. I do not
know if it is some apache regression, some mvneta vs mv643xx_eth
difference or some CPU frequency issue, but having netperf and nginx
happy makes me wonder about Apache.

- Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 does not
improve the already good values I get w/ your patch.


In the end if you iterate on your work to push a version of your patch
upstream, I'll be happy to test it. And thanks for the time you already
spent!

Cheers,

a+
Willy Tarreau
2013-11-20 21:54:35 UTC
Hi Arnaud,
Post by Arnaud Ebalard
With current Linus tree (head being b4789b8e: aacraid: prevent invalid
w/ tcp_wmem left w/ default values (4096 16384 4071360)
via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
via wget against apache: 15.4 MB/s
via wget against nginx: 104 MB/s
via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
via wget against apache: 63.3 MB/s
via wget against nginx: 104 MB/s
via netperf: 939.16 / 935.44 Mbits/s
via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
and 72% CPU for apache2)
via wget against nginx: 106 MB/s
With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
via netperf: 939.12 / 935.84 Mbits/s
via wget against apache: 63.7 MB/s
via wget against nginx: 108 MB/s
- First, Eric's patch sitting in Linus' tree does fix the regression
I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).
- As can be seen in the results of the first test, Eric's patch still
requires some additional tweaking of tcp_wmem to get netperf and
apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
15.4MB/s by setting the max tcp send buffer space to 256KB for apache).
- For unknown reasons, nginx manages to provide a 104MB/s download rate
even with tcp_wmem set to default and no specific patch of mvneta.
- Now, Willy's patch seems to make netperf happy (link saturated from
server to client), w/o tweaking tcp_wmem.
- Again, with Willy's patch I guess the "limitations" of the platform
(1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating the
link. All I can say is that the same test some months ago on a 1.6GHz
ARMv5TE (kirkwood 88f6282) w/ 256MB of RAM gave me 108MB/s. I do not
know if it is some apache regression, some mvneta vs mv643xx_eth
difference or some CPU frequency issue, but having netperf and nginx
happy makes me wonder about Apache.
- Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 does not
improve the already good values I get w/ your patch.
Great, thanks for your detailed tests! Concerning Apache, it's common to
see it consume more CPU than others, which makes it suffer more on small
devices like these ones (which BTW have a very small cache and only a 16-bit
RAM bus). Please also note that there could be a number of other differences,
such as Apache always using TCP_NODELAY and thus sending incomplete
segments at the end of each buffer, which consumes slightly more descriptors.
Post by Arnaud Ebalard
In the end if you iterate on your work to push a version of your patch
upstream, I'll be happy to test it. And thanks for the time you already
spent!
I'm currently trying to implement TX IRQ handling. I found the register
descriptions in the neta driver provided in Marvell's LSP kernel, which is
shipped with some devices using their CPUs. That code is utterly broken
(eg: splice fails with -EBADF) but I think the register descriptions can
be trusted.

I'd rather have real IRQ handling than just relying on mvneta_poll(), so
that we can use it for asymmetric traffic/routing/whatever.

Regards,
Willy
Willy Tarreau
2013-11-21 00:44:30 UTC
Hi Arnaud,
Post by Willy Tarreau
I'm currently trying to implement TX IRQ handling. I found the register
descriptions in the neta driver provided in Marvell's LSP kernel, which is
shipped with some devices using their CPUs. That code is utterly broken
(eg: splice fails with -EBADF) but I think the register descriptions can
be trusted.
I'd rather have real IRQ handling than just relying on mvneta_poll(), so
that we can use it for asymmetric traffic/routing/whatever.
OK it paid off. And very well :-)

I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.

I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.

without :
- need at least 12 streams to reach gigabit.
- 60% of idle CPU remains at 1 Gbps
- HTTP connection rate on empty objects is 9950 connections/s
- cumulated outgoing traffic on two ports reaches 1.3 Gbps

with the patch :
- a single stream easily saturates the gigabit
- 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
- HTTP connection rate on empty objects is 10250 connections/s
- I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.

BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13; I suspect some of the Tx queue improvements
that Eric has done in between account for this.

I cut the patch into 3 parts :
- one which reintroduces the hidden bits of the driver
- one which replaces the timer with the IRQ
- one which changes the default Tx coalesce from 16 to 4 packets
(larger was preferred with the timer, but smaller is better now;
a rough sketch of this last change follows below).
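For illustration, that third part boils down to a one-constant change;
the define name below is assumed from the driver's naming style, not
verified against the source:

/* With a real Tx-done IRQ, a smaller coalesce threshold keeps
 * completion latency low without flooding interrupts. */
#define MVNETA_TXDONE_COAL_PKTS		4	/* was 16 with the timer */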

I'm attaching them, please test them on your device.

Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.

Cheers,
Willy
Willy Tarreau
2013-11-21 18:38:34 UTC
Permalink
Hi Rob,

While we were diagnosing a network performance regression that we finally
found and fixed, it appeared during a test that Linus' tree shows a much
higher performance on Armada 370 (armv7) than its predecessors. I can
saturate the two Gig links of my Mirabox each with a single TCP flow and
keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
can achieve around 1.3 Gbps when the two ports are used in parallel.

Today I bisected these kernels to find what was causing this difference.
I found it was your patch below, which I copy entirely here:

commit 0589342c27944e50ebd7a54f5215002b6598b748
Author: Rob Herring <***@calxeda.com>
Date: Tue Oct 29 23:36:46 2013 -0500

of: set dma_mask to point to coherent_dma_mask

Platform devices created by DT code don't initialize dma_mask pointer to
anything. Set it to coherent_dma_mask by default if the architecture
code has not set it.

Signed-off-by: Rob Herring <***@calxeda.com>

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 9b439ac..c005495 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -216,6 +216,8 @@ static struct platform_device *of_platform_device_create_pdata(
dev->archdata.dma_mask = 0xffffffffUL;
#endif
dev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
+ if (!dev->dev.dma_mask)
+ dev->dev.dma_mask = &dev->dev.coherent_dma_mask;
dev->dev.bus = &platform_bus_type;
dev->dev.platform_data = platform_data;

And I can confirm that applying this patch on 3.10.20 + the fixes we found
yesterday substantially boosted my network performance (and reduced the CPU
usage when running on a single link).

I'm not at ease with these things, so I'd like to ask your opinion here:
is this supposed to be an improvement or a fix? Is this something we
should backport into stable versions, or is there something to fix in
the armada platform so that it works just as if the patch was applied?

Thanks,
Willy
Thomas Petazzoni
2013-11-21 19:04:33 UTC
Permalink
Dear Willy Tarreau,
Post by Willy Tarreau
While we were diagnosing a network performance regression that we finally
found and fixed, it appeared during a test that Linus' tree shows a much
higher performance on Armada 370 (armv7) than its predecessors. I can
saturate the two Gig links of my Mirabox each with a single TCP flow and
keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
can achieve around 1.3 Gbps when the two ports are used in parallel.
Interesting finding and analysis, once again!
Post by Willy Tarreau
I'm not at ease with these things so I'd like to ask your opinion here, is
this supposed to be an improvement or a fix ? Is this something we should
backport into stable versions, or is there something to fix in the armada
platform so that it works just as if the patch was applied ?
I guess the driver should have been setting its dma_mask to 0xffffffff,
since the platform is capable of doing DMA on the first 32 bits of the
physical address space, probably something like calling
pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that. I know
Russell has recently added some helpers to prevent stupid people (like
me) from doing mistakes when setting the DMA masks. Certainly worth
having a look.

Best regards,

Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
Willy Tarreau
2013-11-21 21:51:26 UTC
Permalink
Hi Thomas,
Post by Thomas Petazzoni
Post by Willy Tarreau
I'm not at ease with these things so I'd like to ask your opinion here, is
this supposed to be an improvement or a fix ? Is this something we should
backport into stable versions, or is there something to fix in the armada
platform so that it works just as if the patch was applied ?
I guess the driver should have been setting its dma_mask to 0xffffffff,
since the platform is capable of doing DMA on the first 32 bits of the
physical address space, probably something like calling
pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that.
Almost, yes. Thanks for the tip! There are so few drivers which do this
that I was convinced something was missing (nobody initializes dma_mask
on this platform), so calls to dma_set_mask() from drivers return -EIO
and are ignored.

I ended up with this at the end of mvneta_init():

/* setting DMA mask significantly improves transfer rates */
pp->dev->dev.parent->coherent_dma_mask = DMA_BIT_MASK(32);
pp->dev->dev.parent->dma_mask = &pp->dev->dev.parent->coherent_dma_mask;

This method changed in 3.12 with Russell's commit fa6a8d6 (DMA-API: provide
a helper to setup DMA masks), which does it in a cleaner and safer way
using dma_coerce_mask_and_coherent().
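For completeness, a minimal sketch of that cleaner 3.12+ way, assuming a
struct platform_device *pdev in the driver's probe function:

#include <linux/dma-mapping.h>

/* Points dev->dma_mask at coherent_dma_mask and applies the given mask
 * to both in one validated call. */
int ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
if (ret)
	dev_warn(&pdev->dev, "no suitable DMA mask available\n");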

Then Rob's commit 0589342 (of: set dma_mask to point to coherent_dma_mask) also
merged in 3.12 pre-initialized the dma_mask to point to &coherent_dma_mask for
all devices by default.
Post by Thomas Petazzoni
I know
Russell has recently added some helpers to prevent stupid people (like
me) from doing mistakes when setting the DMA masks. Certainly worth
having a look.
My change now allows me to proxy HTTP traffic at 1 Gbps between the two
ports of the mirabox, while it is limited to 650 Mbps without the change.
But it's not needed in mainline anymore. However it might be worth having
in older kernels (I don't know if it's suitable for stable since I don't
know if that's a bug), or at least in your own kernels if you have to
maintain an older branch for some customers.

That said, I tend to believe that applying Rob's patch will be better than
just the change above since it will cover all drivers, not only mvneta.
I'll have to test on the AX3 and the XP-GP to see the performance gain in
SMP and using the PCIe.

Best regards,
Willy
Rob Herring
2013-11-21 22:01:42 UTC
Permalink
Post by Willy Tarreau
Hi Rob,
While we were diagnosing a network performance regression that we finally
found and fixed, it appeared during a test that Linus' tree shows a much
higher performance on Armada 370 (armv7) than its predecessors. I can
saturate the two Gig links of my Mirabox each with a single TCP flow and
keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
can achieve around 1.3 Gbps when the two ports are used in parallel.
Today I bisected these kernels to find what was causing this difference.
commit 0589342c27944e50ebd7a54f5215002b6598b748
Date: Tue Oct 29 23:36:46 2013 -0500
of: set dma_mask to point to coherent_dma_mask
Platform devices created by DT code don't initialize dma_mask pointer to
anything. Set it to coherent_dma_mask by default if the architecture
code has not set it.
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 9b439ac..c005495 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -216,6 +216,8 @@ static struct platform_device *of_platform_device_create_pdata(
dev->archdata.dma_mask = 0xffffffffUL;
#endif
dev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
+ if (!dev->dev.dma_mask)
+ dev->dev.dma_mask = &dev->dev.coherent_dma_mask;
dev->dev.bus = &platform_bus_type;
dev->dev.platform_data = platform_data;
And I can confirm that applying this patch on 3.10.20 + the fixes we found
yesterday substantially boosted my network performance (and reduced the CPU
usage when running on a single link).
I'm not at ease with these things so I'd like to ask your opinion here, is
this supposed to be an improvement or a fix ? Is this something we should
backport into stable versions, or is there something to fix in the armada
platform so that it works just as if the patch was applied ?
The patch was to fix this issue[1]. It is fixed in the core code because
dma_mask not being set has been a known issue with DT probing for some
time. Since most drivers don't seem to care, we've gotten away with it.
I thought the normal failure mode was drivers failing to probe.

As to why it helps performance, I'm not really sure. Perhaps the missing
mask was causing some bounce buffers to be used.

Rob

[1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html
Willy Tarreau
2013-11-21 22:13:10 UTC
Permalink
Post by Rob Herring
The patch was to fix this issue[1]. It is fixed in the core code because
dma_mask not being set has been a known issue with DT probing for some
time. Since most drivers don't seem to care, we've gotten away with it.
I thought the normal failure mode was drivers failing to probe.
It seems that very few drivers try to set their mask, so the default
value was probably already OK, even though less performant.
Post by Rob Herring
As to why it helps performance, I'm not really sure. Perhaps it is
causing some bounce buffers to be used.
That's also the thing I have been thinking about, and given this device
only has a 16-bit DDR bus, bounce buffers can make a difference.
Post by Rob Herring
Rob
[1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html
Thanks for your quick explanation Rob!
Willy
Arnaud Ebalard
2013-11-21 21:51:09 UTC
Permalink
Hi,
Post by Willy Tarreau
OK it paid off. And very well :-)
I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.
I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.
- need at least 12 streams to reach gigabit.
- 60% of idle CPU remains at 1 Gbps
- HTTP connection rate on empty objects is 9950 connections/s
- cumulated outgoing traffic on two ports reaches 1.3 Gbps
- a single stream easily saturates the gigabit
- 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
- HTTP connection rate on empty objects is 10250 connections/s
- I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
that Eric has done in between account for this.
- one which reintroduces the hidden bits of the driver
- one which replaces the timer with the IRQ
- one which changes the default Tx coalesce from 16 to 4 packets
(larger was preferred with the timer, but less is better now).
I'm attaching them, please test them on your device.
Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.
Post by Willy Tarreau
Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.
I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!

Cheers,

a+
Willy Tarreau
2013-11-21 21:52:59 UTC
Permalink
Post by Arnaud Ebalard
Hi,
Post by Willy Tarreau
OK it paid off. And very well :-)
I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.
I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.
- need at least 12 streams to reach gigabit.
- 60% of idle CPU remains at 1 Gbps
- HTTP connection rate on empty objects is 9950 connections/s
- cumulated outgoing traffic on two ports reaches 1.3 Gbps
- a single stream easily saturates the gigabit
- 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
- HTTP connection rate on empty objects is 10250 connections/s
- I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
that Eric has done in between account for this.
- one which reintroduces the hidden bits of the driver
- one which replaces the timer with the IRQ
- one which changes the default Tx coalesce from 16 to 4 packets
(larger was preferred with the timer, but less is better now).
I'm attaching them, please test them on your device.
Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.
Post by Willy Tarreau
Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.
I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!
Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.

Cheers,
Willy
Eric Dumazet
2013-11-21 22:00:01 UTC
Permalink
Post by Willy Tarreau
Post by Arnaud Ebalard
Hi,
Post by Willy Tarreau
OK it paid off. And very well :-)
I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.
I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.
- need at least 12 streams to reach gigabit.
- 60% of idle CPU remains at 1 Gbps
- HTTP connection rate on empty objects is 9950 connections/s
- cumulated outgoing traffic on two ports reaches 1.3 Gbps
- a single stream easily saturates the gigabit
- 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
- HTTP connection rate on empty objects is 10250 connections/s
- I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
that Eric has done in between account for this.
- one which reintroduces the hidden bits of the driver
- one which replaces the timer with the IRQ
- one which changes the default Tx coalesce from 16 to 4 packets
(larger was preferred with the timer, but less is better now).
I'm attaching them, please test them on your device.
Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.
Post by Willy Tarreau
Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.
I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!
Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.
Now you might try to set a lower value
for /proc/sys/net/ipv4/tcp_limit_output_bytes.

Ideally, a value of 8192 (instead of 131072) allows queueing less data
per TCP flow and reacting faster to losses, as retransmits don't have
to wait for previous packets in the Qdisc to leave the host.

131072 bytes for an 80 Mbit flow means more than 11 ms of queueing :(
Arnaud Ebalard
2013-11-21 22:55:09 UTC
Permalink
Hi eric,
Post by Eric Dumazet
Post by Willy Tarreau
Post by Arnaud Ebalard
I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!
Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.
Now you might try to set a lower value
for /proc/sys/net/ipv4/tcp_limit_output_bytes
On the RN2120, for a file served from /run/shm (for apache and nginx):

 limit    Apache     nginx      netperf
131072:   102 MB/s   112 MB/s   941.11 Mb/s
 65536:   102 MB/s   112 MB/s   935.97 Mb/s
 32768:   101 MB/s   105 MB/s   940.49 Mb/s
 16384:    94 MB/s    90 MB/s   770.07 Mb/s
  8192:    83 MB/s    66 MB/s   556.79 Mb/s

On the RN102, this time for apache and nginx, the file is served from
disks (ext4/lvm/raid1):

 limit    Apache     nginx      netperf
131072:    66 MB/s   105 MB/s   925.63 Mb/s
 65536:    59 MB/s   105 MB/s   862.55 Mb/s
 32768:    62 MB/s   105 MB/s   918.99 Mb/s
 16384:    65 MB/s   105 MB/s   927.71 Mb/s
  8192:    60 MB/s   104 MB/s   915.63 Mb/s

Values above are for a single flow though.

Cheers,

a+
Rick Jones
2013-11-21 23:23:09 UTC
Permalink
Post by Arnaud Ebalard
Apache nginx netperf
131072: 102 MB/s 112 MB/s 941.11 Mb/s
65536: 102 MB/s 112 MB/s 935.97 Mb/s
32768: 101 MB/s 105 MB/s 940.49 Mb/s
16384: 94 MB/s 90 MB/s 770.07 Mb/s
8192: 83 MB/s 66 MB/s 556.79 Mb/s
If you want to make the units common across all three tests, netperf
accepts a global -f option to alter the output units. If you add -f M,
netperf will then emit results in MB/s (M == 1048576). I'm assuming, of
course, that the MB/s of Apache and nginx are also M == 1048576.

happy benchmarking,

rick jones

Willy Tarreau
2013-11-20 17:12:27 UTC
Permalink
Hi guys,
Post by Eric Dumazet
Post by Willy Tarreau
So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.
BTW I have a very simple patch for TCP stack that could help this exact
situation...
Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
very small frames, and let tcp_sendmsg() have more chance to fill
complete packets.
Again, for this to work very well, you need that NIC performs TX
completion in reasonable amount of time...
Eric, first I would like to confirm that I could reproduce Arnaud's issue
using 3.10.19 (160 kB/s in the worst case).

Second, I confirm that your patch partially fixes it and my performance
can be brought back to what I had with 3.10-rc7, but with a lot of
concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
the wire when transfering 7 concurrent streams (118.6 kB/s). With the patch
applied, performance is still only 27 MB/s at 7 concurrent streams, and I
need at least 35 concurrent streams to fill the pipe. Strangely, after
2 GB of cumulated data transferred, the bandwidth divided by 11-fold and
fell to 10 MB/s again.

If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
your latest patch, the performance is back to original.

Now I understand there's a major issue with the driver. But since the
patch emphasizes the situations where drivers take a lot of time to
wake the queue up, don't you think there could be an issue with low
bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
I'm a bit worried about what we might discover in this area I must
confess (despite generally being mostly focused on 10+ Gbps).

Best regards,
Willy
Eric Dumazet
2013-11-20 17:30:07 UTC
Permalink
Post by Willy Tarreau
Hi guys,
Eric, first I would like to confirm that I could reproduce Arnaud's issue
using 3.10.19 (160 kB/s in the worst case).
Second, I confirm that your patch partially fixes it and my performance
can be brought back to what I had with 3.10-rc7, but with a lot of
concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
the wire when transferring 7 concurrent streams (118.6 MB/s). With the patch
applied, performance is still only 27 MB/s at 7 concurrent streams, and I
need at least 35 concurrent streams to fill the pipe. Strangely, after
2 GB of cumulated data transferred, the bandwidth dropped 11-fold and
fell to 10 MB/s again.
If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
your latest patch, the performance is back to original.
Now I understand there's a major issue with the driver. But since the
patch emphasizes the situations where drivers take a lot of time to
wake the queue up, don't you think there could be an issue with low
bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
I'm a bit worried about what we might discover in this area I must
confess (despite generally being mostly focused on 10+ Gbps).
Well, all TCP performance results are highly dependent on the workload,
and on the behavior of both receivers and senders.

We made many improvements like TSO auto sizing, DRS (Dynamic Right
Sizing), and if the application used some specific settings (like
SO_SNDBUF / SO_RCVBUF or other tweaks), we cannot guarantee that the
same exact performance is reached from kernel version X to kernel
version Y.

We try to make forward progress; there is little gain in reverting all
this great work. Linux had a tendency to favor throughput by using
overly large skbs. It's time to do better.

As explained, some drivers are buggy, and need fixes.

If nobody wants to fix them, this really means no one is interested
in getting them fixed.

I am willing to help if you provide details, because otherwise I need
a crystal ball ;)

One known problem of TCP is that an incoming ACK making room in the
socket write queue immediately wakes up a blocked thread (POLLOUT), even
if only one MSS was acked and the write queue still has 2MB of
outstanding bytes.

All these scheduling problems should be identified and fixed, and yes,
this will require a dozen more patches.

max(128KB, 1-2 ms) of buffering per flow should be enough to reach
line rate, even for a single flow, but this means the sk_sndbuf value
for the socket must take into account the pipe size _plus_ 1 ms of
buffering.
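To put rough numbers on that rule, here is an illustrative helper; all
names are invented for the example, this is not kernel code:

#include <stdint.h>

/* sk_sndbuf target = pipe size + max(128KB, ~1ms at the pacing rate).
 * At 1 Gbps, rate >> 10 is ~122KB so the 128KB floor dominates; at
 * 10 Gbps the dynamic part (~1.2MB) takes over. */
static uint32_t sndbuf_target(uint64_t pacing_rate_bytes_per_sec,
                              uint32_t pipe_bytes)
{
	uint32_t one_ms = (uint32_t)(pacing_rate_bytes_per_sec >> 10);
	uint32_t floor_ = 128 * 1024;
	return pipe_bytes + (one_ms > floor_ ? one_ms : floor_);
}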
Willy Tarreau
2013-11-20 17:38:28 UTC
Permalink
Post by Eric Dumazet
Well, all TCP performance results are highly dependent on the workload,
and both receivers and senders behavior.
We made many improvements like TSO auto sizing, DRS (dynamic Right
Sizing), and if the application used some specific settings (like
SO_SNDBUF / SO_RCVBUF or other tweaks), we can not guarantee that same
exact performance is reached from kernel version X to kernel version Y.
Of course, which is why I only care when there's a significant
difference. If I need 6 streams in a version and 8 in another one to
fill the wire, I call them identical. It's only when we dig into the
details that we analyse the differences.
Post by Eric Dumazet
We try to make forward progress, there is little gain to revert all
these great works. Linux had this tendency to favor throughput by using
overly large skbs. Its time to do better.
I agree. Unfortunately our mails have crossed each other, so just to
keep this thread mostly linear, your next patch here:

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

fixes that regression, and performance is back to normal, which is
good.
Post by Eric Dumazet
As explained, some drivers are buggy, and need fixes.
Agreed!
Post by Eric Dumazet
If nobody wants to fix them, this really means no one is interested
getting them fixed.
I was actually reading the code when I found, with your patch above,
the window I was looking for :-)
Post by Eric Dumazet
I am willing to help if you provide details, because otherwise I need
a crystal ball ;)
One known problem of TCP is the fact that an incoming ACK making room in
socket write queue immediately wakeup a blocked thread (POLLOUT), even
if only one MSS was ack, and write queue has 2MB of outstanding bytes.
Indeed.
Post by Eric Dumazet
All these scheduling problems should be identified and fixed, and yes,
this will require a dozen more patches.
max (128KB , 1-2 ms) of buffering per flow should be enough to reach
line rate, even for a single flow, but this means the sk_sndbuf value
for the socket must take into account the pipe size _plus_ 1ms of
buffering.
Which is exactly what your patch above does, and I confirm it fixes the
problem.

Now looking at how to work around this lack of Tx IRQ.

Thanks!
Willy
David Miller
2013-11-20 18:52:21 UTC
Permalink
From: Eric Dumazet <***@gmail.com>
Date: Wed, 20 Nov 2013 09:30:07 -0800
Post by Eric Dumazet
max (128KB , 1-2 ms) of buffering per flow should be enough to reach
line rate, even for a single flow, but this means the sk_sndbuf value
for the socket must take into account the pipe size _plus_ 1ms of
buffering.
And we can implement this using the estimated pacing rate.
Willy Tarreau
2013-11-20 17:34:36 UTC
Permalink
Post by Willy Tarreau
Hi guys,
Post by Eric Dumazet
Post by Willy Tarreau
So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.
BTW I have a very simple patch for TCP stack that could help this exact
situation...
Idea is to use TCP Small Queue so that we dont fill qdisc/TX ring with
very small frames, and let tcp_sendmsg() have more chance to fill
complete packets.
Again, for this to work very well, you need that NIC performs TX
completion in reasonable amount of time...
Eric, first I would like to confirm that I could reproduce Arnaud's issue
using 3.10.19 (160 kB/s in the worst case).
Second, I confirm that your patch partially fixes it and my performance
can be brought back to what I had with 3.10-rc7, but with a lot of
concurrent streams. In fact, in 3.10-rc7, I managed to constantly saturate
the wire when transferring 7 concurrent streams (118.6 MB/s). With the patch
applied, performance is still only 27 MB/s at 7 concurrent streams, and I
need at least 35 concurrent streams to fill the pipe. Strangely, after
2 GB of cumulated data transferred, the bandwidth dropped 11-fold and
fell to 10 MB/s again.
If I revert both "0ae5f47eff tcp: TSQ can use a dynamic limit" and
your latest patch, the performance is back to original.
Now I understand there's a major issue with the driver. But since the
patch emphasizes the situations where drivers take a lot of time to
wake the queue up, don't you think there could be an issue with low
bandwidth links (eg: PPPoE over xDSL, 10 Mbps ethernet, etc...) ?
I'm a bit worried about what we might discover in this area I must
confess (despite generally being mostly focused on 10+ Gbps).
One important point: I was looking for the other patch you pointed to.
Post by Willy Tarreau
So
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
restored this minimal amount of buffering, and let the bigger amount for
40Gb NICs ;)
This one definitely restores original performance, so it's a much better
bet in my opinion :-)

Best regards,
Willy
Eric Dumazet
2013-11-20 17:40:22 UTC
Permalink
Post by Willy Tarreau
One important point, I was looking for the other patch you pointed
Post by Eric Dumazet
So
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
restored this minimal amount of buffering, and let the bigger amount for
40Gb NICs ;)
This one definitely restores original performance, so it's a much better
bet in my opinion :-)
I don't understand. I thought you were using this patch.

I guess we are spending time on an already solved problem.
Willy Tarreau
2013-11-20 18:15:32 UTC
Permalink
Post by Eric Dumazet
Post by Willy Tarreau
One important point, I was looking for the other patch you pointed
Post by Eric Dumazet
So
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
restored this minimal amount of buffering, and let the bigger amount for
40Gb NICs ;)
This one definitely restores original performance, so it's a much better
bet in my opinion :-)
I don't understand. I thought you were using this patch.
No, I was on latest stable (3.10.19) which exhibits the regression but does
not yet have your fix above. Then I tested the patch you proposed in this
thread, then this latest one. Since the patch is not yet even in Linus'
tree, I'm not sure Arnaud has tried it yet.
Post by Eric Dumazet
I guess we are spending time on an already solved problem.
That's possible indeed. Sorry if I was not clear enough, I tried.

Regards,
Willy
Eric Dumazet
2013-11-20 18:21:24 UTC
Permalink
Post by Willy Tarreau
No, I was on latest stable (3.10.19) which exhibits the regression but does
not yet have your fix above. Then I tested the patch you proposed in this
thread, then this latest one. Since the patch is not yet even in Linus'
tree, I'm not sure Arnaud has tried it yet.
Oh right ;)

BTW Linus' tree has the fix, I just checked this now.

(Linus got David's changes 19 hours ago)
Willy Tarreau
2013-11-20 18:29:19 UTC
Permalink
Post by Eric Dumazet
Post by Willy Tarreau
No, I was on latest stable (3.10.19) which exhibits the regression but does
not yet have your fix above. Then I tested the patch you proposed in this
thread, then this latest one. Since the patch is not yet even in Linus'
tree, I'm not sure Arnaud has tried it yet.
Oh right ;)
BTW Linus tree has the fix, I just checked this now.
(Linus got David changes 19 hours ago)
Ah yes I didn't look far enough back.

Thanks,
Willy
Arnaud Ebalard
2013-11-20 19:22:31 UTC
Permalink
Hi,
Post by Willy Tarreau
Post by Eric Dumazet
Post by Willy Tarreau
One important point, I was looking for the other patch you pointed
Post by Eric Dumazet
So
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee
restored this minimal amount of buffering, and let the bigger amount for
40Gb NICs ;)
This one definitely restores original performance, so it's a much better
bet in my opinion :-)
I don't understand. I thought you were using this patch.
No, I was on latest stable (3.10.19) which exhibits the regression but does
not yet have your fix above. Then I tested the patch you proposed in this
thread, then this latest one. Since the patch is not yet even in Linus'
tree, I'm not sure Arnaud has tried it yet.
This is the one I have been using since Eric published it a week ago:

http://www.spinics.net/lists/netdev/msg257396.html

Cheers,

a+
David Laight
2013-11-18 10:09:53 UTC
Permalink
Post by Willy Tarreau
I wonder if we can call mvneta_txq_done() from the IRQ handler, which would
cause some recycling of the Tx descriptors when receiving the corresponding
ACKs.
Ideally we should enable the Tx IRQ, but I still have no access to this
chip's datasheet despite having asked Marvell several times in one year
(Thomas has it though).
So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.
Or you have a significant number of active tcp connections.

Even if there were no requirement to free the skb quickly, you still
need to take a 'tx done' interrupt when the link is transmit rate limited.
There are scenarios where there is no receive traffic - eg asymmetric
routing - which are testable with netperf UDP transmits.

David
Willy Tarreau
2013-11-18 10:52:58 UTC
Permalink
Post by David Laight
Post by Willy Tarreau
So it is fairly possible that in your case you can't fill the link if you
consume too many descriptors. For example, if your server uses TCP_NODELAY
and sends incomplete segments (which is quite common), it's very easy to
run out of descriptors before the link is full.
Or you have a significant number of active tcp connections.
Even if there were no requirement to free the skb quickly you still
need to take a 'tx done' interrupt when the link is transmit rate limited.
There are scenarios when there is no receive traffic - eg asymmetric
routing, but testable with netperf UDP transmits.
Yes absolutely, but I was talking about the current situation that Arnaud
is facing and which I could reproduce with a large window.

Willy
Thomas Petazzoni
2013-11-18 10:26:01 UTC
Permalink
Willy, All,
Post by Willy Tarreau
Post by Arnaud Ebalard
Can you give a pre-3.11.7 kernel a try if you find the time? I
started working on RN102 during 3.10-rc cycle but do not remember
if I did the first performance tests on 3.10 or 3.11. And if you
find more time, 3.11.7 would be nice too ;-)
Still have not found time for this but I observed something
intriguing which might possibly match your experience : if I use
large enough send buffers on the mirabox and receive buffers on the
client, then the traffic drops for objects larger than 1 MB. I have
quickly checked what's happening and it's just that there are
pauses of up to 8 ms between some packets when the TCP send window
grows larger than about 200 kB. And since there are no drops, there
is no reason for the window to shrink. I suspect it's exactly
related to the issue explained by Eric about the timer used to
recycle the Tx descriptors. However last time I checked, these ones
were also processed in the Rx path, which means that the ACKs that
flow back should have had the same effect as a Tx IRQ (unless I'd
use asymmetric routing, which was not the case). So there might be
another issue. Ah, and it only happens with GSO.
I haven't read the entire discussion yet, but do you guys have
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
applied? It got merged recently, and it fixes a number of networking
problems on Armada 370.

I've added Simon Guinot in Cc, who is the author of this patch.

Best regards,

Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
Simon Guinot
2013-11-18 10:44:48 UTC
Permalink
Post by Thomas Petazzoni
Willy, All,
Post by Willy Tarreau
Post by Arnaud Ebalard
Can you give a pre-3.11.7 kernel a try if you find the time? I
started working on RN102 during 3.10-rc cycle but do not remember
if I did the first performance tests on 3.10 or 3.11. And if you
find more time, 3.11.7 would be nice too ;-)
Still have not found time for this but I observed something
intriguing which might possibly match your experience : if I use
large enough send buffers on the mirabox and receive buffers on the
client, then the traffic drops for objects larger than 1 MB. I have
quickly checked what's happening and it's just that there are
pauses of up to 8 ms between some packets when the TCP send window
grows larger than about 200 kB. And since there are no drops, there
is no reason for the window to shrink. I suspect it's exactly
related to the issue explained by Eric about the timer used to
recycle the Tx descriptors. However last time I checked, these ones
were also processed in the Rx path, which means that the ACKs that
flow back should have had the same effect as a Tx IRQ (unless I'd
use asymmetric routing, which was not the case). So there might be
another issue. Ah, and it only happens with GSO.
I haven't read the entire discussion yet, but do you guys have
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
applied? It got merged recently, and it fixes a number of networking
problems on Armada 370.
I've added Simon Guinot in Cc, who is the author of this patch.
I don't think it is related. We also have noticed a huge performance
regression. Reverting the following patch restores the rate:

c9eeec26 tcp: TSQ can use a dynamic limit

I don't understand why...

Regards,

Simon
Post by Thomas Petazzoni
Best regards,
Thomas
--
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
Stephen Hemminger
2013-11-18 16:54:07 UTC
Permalink
On Mon, 18 Nov 2013 11:44:48 +0100
Post by Simon Guinot
Post by Thomas Petazzoni
Willy, All,
Post by Willy Tarreau
Post by Arnaud Ebalard
Can you give a pre-3.11.7 kernel a try if you find the time? I
started working on RN102 during 3.10-rc cycle but do not remember
if I did the first performance tests on 3.10 or 3.11. And if you
find more time, 3.11.7 would be nice too ;-)
Still have not found time for this but I observed something
intriguing which might possibly match your experience : if I use
large enough send buffers on the mirabox and receive buffers on the
client, then the traffic drops for objects larger than 1 MB. I have
quickly checked what's happening and it's just that there are
pauses of up to 8 ms between some packets when the TCP send window
grows larger than about 200 kB. And since there are no drops, there
is no reason for the window to shrink. I suspect it's exactly
related to the issue explained by Eric about the timer used to
recycle the Tx descriptors. However last time I checked, these ones
were also processed in the Rx path, which means that the ACKs that
flow back should have had the same effect as a Tx IRQ (unless I'd
use asymmetric routing, which was not the case). So there might be
another issue. Ah, and it only happens with GSO.
I haven't read the entire discussion yet, but do you guys have
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
applied? It got merged recently, and it fixes a number of networking
problems on Armada 370.
I've added Simon Guinot in Cc, who is the author of this patch.
I don't think it is related. We also have noticed a huge performance
c9eeec26 tcp: TSQ can use a dynamic limit
But without that patch there was a performance regression for high-speed
interfaces which was caused by TSQ. 10G performance dropped to 8G.
Eric Dumazet
2013-11-18 17:13:28 UTC
Permalink
Post by Stephen Hemminger
On Mon, 18 Nov 2013 11:44:48 +0100
Post by Simon Guinot
c9eeec26 tcp: TSQ can use a dynamic limit
But without that patch there was a performance regression for high speed
interfaces whihc was caused by TSQ. 10G performance dropped to 8G
Yes, this made sure we could feed more than 2 TSO packets onto the TX ring.

But decreasing the minimal amount of queueing from 128KB to ~1ms at the
current rate did not please NICs which can have a big delay between
ndo_start_xmit() and the actual skb freeing (TX completion).

So
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=98e09386c0ef4dfd48af7ba60ff908f0d525cdee

restored this minimal amount of buffering, and left the bigger amounts to
40Gb NICs ;)
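The net effect on the limit computed in tcp_write_xmit() is roughly this
(paraphrased from the commit above, not a verbatim copy of the diff):

	/* sysctl_tcp_limit_output_bytes (128KB by default) is now the
	 * floor; fast NICs still get ~1ms of data at their pacing rate. */
	limit = max_t(u32, sysctl_tcp_limit_output_bytes,
		      sk->sk_pacing_rate >> 10);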
Willy Tarreau
2013-11-18 10:51:50 UTC
Permalink
Hi Thomas,
Post by Thomas Petazzoni
I haven't read the entire discussion yet, but do you guys have
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
applied? It got merged recently, and it fixes a number of networking
problems on Armada 370.
No, because my version was even older than the code which introduced this
issue :-)

The main issue is related to something we discussed a while ago which
surprised both of us: the use of a Tx timer to release the Tx descriptors.
I remember I considered that it was not a big issue because the flush was
also done in the Rx path (thus on ACKs), but I can't find any trace of
this code so my analysis was wrong. Thus we can hit some situations where
we fill the descriptors before filling the link.

Ideally we should have a Tx IRQ. At the very least we should call the tx
refill function in mvneta_poll() I believe. I can try to do it but I'd
rather have the Tx IRQ working instead.

Regards,
Willy
Florian Fainelli
2013-11-18 17:58:38 UTC
Permalink
Hello Willy, Thomas,
Post by Willy Tarreau
Hi Thomas,
Post by Thomas Petazzoni
I haven't read the entire discussion yet, but do you guys have
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/clk/mvebu?id=1022c75f5abd3a3b25e679bc8793d21bedd009b4
applied? It got merged recently, and it fixes a number of networking
problems on Armada 370.
No, because my version was even older than the code which introduced this
issue :-)
The main issue is related to something we discussed once ago which surprized
both of us, the use of a Tx timer to release the Tx descriptors. I remember
I considered that it was not a big issue because the flush was also done in
the Rx path (thus on ACKs) but I can't find trace of this code so my analysis
was wrong. Thus we can hit some situations where we fill the descriptors
before filling the link.
So long as you are using TCP this works, because the ACKs will somehow
create an artificial "forced" completion of your transmitted SKBs. How
about a UDP streamer use case? In that case you will quickly fill up
all of your descriptors and have to wait for the descriptors to be
freed by the 10ms timer. I do not think this is desirable at all, and
it will require very large UDP sender socket buffers. I remember
asking Thomas what the reason was for not using the TX completion IRQ
during the first incarnation of the driver, but I do not quite
remember what the answer was.

If the original mvneta driver authors' fear was that TX completion
could generate too many IRQs, they should use netif_stop_queue() /
netif_wake_queue() and mask interrupts off/on appropriately to slow
down the pace of TX interrupts.
Post by Willy Tarreau
Ideally we should have a Tx IRQ. At the very least we should call the tx
refill function in mvneta_poll() I believe. I can try to do it but I'd
rather have the Tx IRQ working instead.
Right, actually you should do both: free transmitted SKBs from your
NAPI poll callback and from the TX completion IRQ, to ensure SKBs are
freed in time no matter what workload/use case is being used.
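Something like this generic sketch; the foo_* names are placeholders,
not actual mvneta code, and only the netif_* calls are the real kernel
API:

/* Placeholder threshold: wake once a few packets' worth of descriptors
 * are free again (the value is arbitrary for the sketch). */
#define FOO_WAKE_THRESH		(MAX_SKB_FRAGS + 2)

static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct foo_priv *p = netdev_priv(dev);

	/* ... map skb and post it on the TX ring ... */

	if (foo_free_descs(p) <= MAX_SKB_FRAGS + 1)
		netif_stop_queue(dev);	/* throttle before we run dry */
	return NETDEV_TX_OK;
}

/* Called from both the TX completion IRQ and the NAPI poll path. */
static void foo_tx_complete(struct net_device *dev)
{
	struct foo_priv *p = netdev_priv(dev);

	foo_reclaim_descs(p);	/* free transmitted skbs */
	if (netif_queue_stopped(dev) && foo_free_descs(p) > FOO_WAKE_THRESH)
		netif_wake_queue(dev);	/* let the stack push again */
}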
--
Florian