Discussion:
Low network throughput on i.MX28
Jörg Krause
2016-10-12 23:09:13 UTC
Hi,

I am using a custom i.MX28 board similar to the i.MX28-EVK. For Wi-Fi
the board carries a BCM43362 from Broadcom and for Ethernet a
LAN8720A from Microchip. The board is running mainline Linux 4.7.

While both the wireless and the wired network interfaces work, the TCP
throughput measured with iperf is low.

The bandwidth for Ethernet is between 20 and 30 Mbits/sec, and for WLAN it is
about 4-5 Mbits/sec.

There is an Application Note, "i.MX28 Ethernet Performance on
Linux" [1], which shows a bandwidth of > 60 Mbits/sec. A user on the NXP
forum [2] reported achieving 20 Mbits/sec with a Qualcomm chip.

Note that these values were most probably measured with the legacy
Linux kernel 2.6.35 from NXP.

Has anybody done throughput tests on i.MX28 with a mainline kernel?
If so, what are the results? What might be the bottleneck?

[1] http://cache.freescale.com/files/32bit/doc/app_note/AN4544.pdf
[2] https://community.nxp.com/thread/353921

Best regards
Jörg Krause
Lothar Waßmann
2016-10-13 06:48:07 UTC
Hi,
Post by Jörg Krause
Has anybody done throughput tests on i.MX28 with a mainline kernel?
If so, what are the results? What might be the bottleneck?
This is the iperf output on a TX28 with current mainline kernel
(4.8.0-rc5):
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.100.56 port 60325 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 57.5 MBytes 48.2 Mbits/sec
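
As a cross-check of the figures above: the reported bandwidth matches the transfer size if iperf counts "MBytes" in units of 2^20 bytes and "Mbits" in units of 10^6 bits. A minimal sketch of that arithmetic in Python follows. The function name is made up for illustration, and the unit convention is an assumption that happens to fit the numbers in this output.

```python
MIB = 1024 * 1024  # assumption: iperf "MBytes" means 2**20 bytes


def iperf_mbits_per_sec(transfer_mbytes, interval_sec):
    """Convert an iperf transfer size and interval into Mbits/sec (10**6 bits)."""
    bits = transfer_mbytes * MIB * 8
    return bits / interval_sec / 1e6


# Values taken from the TX28 run above: 57.5 MBytes in 10.0 sec.
print(round(iperf_mbits_per_sec(57.5, 10.0), 1))  # prints 48.2
```

The same conversion can be applied to any other iperf line in this thread to compare runs of different length.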

You might check your kernel DEBUG configs (especially
CONFIG_DEBUG_PAGEALLOC).


Lothar Waßmann
Jörg Krause
2016-10-13 19:43:00 UTC
Hi Lothar,
Post by Lothar Waßmann
This is the iperf output on a TX28 with current mainline kernel
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.100.56 port 60325 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 57.5 MBytes 48.2 Mbits/sec
You might check your kernel DEBUG configs (especially
CONFIG_DEBUG_PAGEALLOC).
Thanks for sharing the iperf output. Which LAN transceiver does the TX28 have?

I checked the config: DEBUG_PAGEALLOC is not enabled, nor are any DEBUG options related to networking.

Best regards
Jörg Krause
Uwe Kleine-König
2016-10-13 20:42:35 UTC
Hello,
Post by Jörg Krause
Post by Lothar Waßmann
This is the iperf output on a TX28 with current mainline kernel
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.100.56 port 60325 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 57.5 MBytes 48.2 Mbits/sec
Just for the record: I have another i.MX28 system here and got 43
Mbits/sec with PREEMPT_NONE and 37 Mbits/sec with PREEMPT_RT both using
a 3.14 kernel.
Post by Jörg Krause
Post by Lothar Waßmann
You might check your kernel DEBUG configs (especially
CONFIG_DEBUG_PAGEALLOC).
Thanks for sharing the iperf output. Which LAN transceiver does the TX28 have?
My system has a Marvell Switch (88e6083) as "transceiver".

Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Lothar Waßmann
2016-10-14 06:13:49 UTC
Hi,
[...]
Post by Jörg Krause
Post by Lothar Waßmann
This is the iperf output on a TX28 with current mainline kernel
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.100.56 port 60325 connected with 192.168.100.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 57.5 MBytes 48.2 Mbits/sec
You might check your kernel DEBUG configs (especially
CONFIG_DEBUG_PAGEALLOC).
Thanks for sharing the iperf output. Which LAN transceiver does the TX28 have?
The ethernet PHY is an SMSC LAN8710A.


Lothar Waßmann
Jörg Krause
2016-10-15 08:46:11 UTC
Post by Lothar Waßmann
Hi,
[...]
Post by Jörg Krause
Post by Lothar Waßmann
This is the iperf output on a TX28 with current mainline kernel
------------------------------------------------------------
Client connecting to 192.168.100.1, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  3] local 192.168.100.56 port 60325 connected with 192.168.100.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  57.5 MBytes  48.2 Mbits/sec
You might check your kernel DEBUG configs (especially
CONFIG_DEBUG_PAGEALLOC).
Thanks for sharing the iperf output. Which LAN transceiver does the TX28 have?
The ethernet PHY is an SMSC LAN8710A.
Thanks!


For the record (note that these results are for the wireless interface):

I got one of my custom boards running the legacy Linux kernel 2.6.35
officially supported by Freescale (NXP) [1] with the bcmdhd driver from
the WICED project, and there I get >20 Mbps TCP throughput. The firmware
version reported is:

# wl ver
5.90 RC115.2
wl0: Apr 24 2014 14:08:41 version 5.90.195.89.24 FWID 01-bc2d0891


I also got it running with the Linux kernel 4.1.15 from Freescale [2],
which is not officially supported for the i.MX28 target, with the
latest bcmdhd version. There I get <7 Mbps TCP throughput (about the
same as I get with the brcmfmac driver). The firmware version reported
is:

# wl ver
1.107 RC5.0
wl0: Aug  8 2016 02:17:48 version 5.90.232 FWID 01-0

So, probably something is missing in the newer kernel version that is
present in the legacy kernel 2.6.35.

[1] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/log/?h=imx_2.6.35_1.1.0
[2] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/log/?h=imx_4.1.15_1.0.0_ga
Stefan Wahren
2016-10-15 08:59:41 UTC
Hi Jörg,
Post by Jörg Krause
So, probably something is missing in the newer kernel version that is
present in the legacy kernel 2.6.35.
During the implementation of DDR mode for the MMC driver [1], I noticed a
performance gap of a factor of 2 between the vendor kernel and mainline. I expect
that your wireless interface is connected via SDIO.

Stefan

[1] http://linux-arm-kernel.infradead.narkive.com/GNkqjvo8/patch-rfc-0-3-mmc-mxs-mmc-implement-ddr-support
Jörg Krause
2016-10-15 09:41:41 UTC
Hi Stefan,
Post by Stefan Wahren
During the implementation of DDR mode for the MMC driver [1], I noticed a
performance gap of a factor of 2 between the vendor kernel and mainline. I expect
that your wireless interface is connected via SDIO.
Yes, it is. I also suspected that the MMC or the DMA driver is the bottleneck.
Post by Stefan Wahren
[1] http://linux-arm-kernel.infradead.narkive.com/GNkqjvo8/patch-rfc-0-3-mmc-mxs-mmc-implement-ddr-support
Looks like the patches might help. Have you tried SDIO wifi so far?

Jörg
Stefan Wahren
2016-10-15 16:16:05 UTC
Post by Jörg Krause
Looks like the patches might help.
Unfortunately not, the performance gain is smaller than expected.
Post by Jörg Krause
Have you tried SDIO wifi so far?
No.
Post by Jörg Krause
Jörg
Jörg Krause
2016-10-28 23:07:08 UTC
You mentioned [1] an optimization in the Freescale vendor Linux kernel
[2]. I would really like to see this optimization in the mainline
kernel.

Did you ever try to port this code from Freescale to mainline?

Is it even possible, given that the mainline driver uses the DMA engine?

[1] http://linux-arm-kernel.infradead.narkive.com/GNkqjvo8/patch-rfc-0-3-mmc-mxs-mmc-implement-ddr-support#post8
[2] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/commit/?h=imx_2.6.35_maintain&id=b09358887fb4b67f6d497fac8cc48475c8bd292d

Best regards,
Jörg Krause
Stefan Wahren
2016-10-29 09:08:53 UTC
Post by Jörg Krause
You mentioned [1] an optimization in the Freescale vendor Linux kernel
[2]. I would really like to see this optimization in the mainline
kernel.
Did you ever try to port this code from Freescale to mainline?
Yes, I tried once, but I soon got frustrated because of the many required
changes and the resulting issues.
Post by Jörg Krause
Is it even possible, as the mainline driver uses the DMA engine?
I think the more important part would be to analyse why the mainline driver is
slower, i.e. to identify the bottleneck exactly.

I don't have enough time or equipment for this. I'd better concentrate on standby
support.
Jörg Krause
2016-10-29 13:08:34 UTC
Post by Stefan Wahren
Post by Jörg Krause
You mentioned [1] an optimization in the Freescale vendor Linux kernel
[2]. I would really like to see this optimization in the mainline
kernel.
Did you ever try to port this code from Freescale to mainline?
Yes, I tried once, but I soon got frustrated because of the many required
changes and the resulting issues.
I can imagine.
Post by Stefan Wahren
Post by Jörg Krause
Is it even possible, as the mainline driver uses the DMA engine?
I think the more important part would be to analyse why the mainline driver is
slower, i.e. to identify the bottleneck exactly.
I'll try to understand the driver implementation. However, I am not a
Linux kernel developer, so this will need some time for sure. Any help
will be appreciated!
Post by Stefan Wahren
I don't have enough time or equipment for this. I'd better concentrate on standby
support.
Many thanks for your work!

Jörg
Jörg Krause
2016-11-02 08:14:49 UTC
Post by Stefan Wahren
Post by Jörg Krause
You mentioned [1] an optimization in the Freescale vendor Linux kernel
[2]. I would really like to see this optimization in the mainline
kernel.
Did you ever try to port this code from Freescale to mainline?
Yes, I tried once, but I soon got frustrated because of the many required
changes and the resulting issues.
I got the PIO mode working for the mxs-mmc driver. For this I ported
the PIO code from the vendor kernel and removed the usage of the DMA
engine entirely.

Testing network bandwidth with iperf, I get about 10 Mbit/sec with PIO
mode compared to about 6.5 Mbit/sec with DMA mode for UDP, and about
6.5 Mbit/sec with PIO mode compared to about 4.5 Mbit/sec with DMA mode
for TCP.

Note that the vendor kernel implements a switch between PIO and DMA
mode for the ADTC command type depending on the data size. For this
test, I removed this switch and used PIO mode exclusively.
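
To put these numbers in proportion, here is a small Python sketch of the ratios. It uses only figures quoted in this thread; the ">20 Mbps" vendor-kernel TCP value is a lower bound, so the last ratio is a lower bound as well. The helper name is made up for illustration.

```python
def speedup(new_mbit, old_mbit):
    """Ratio of two throughput figures given in the same unit."""
    return new_mbit / old_mbit


udp = speedup(10.0, 6.5)   # PIO vs. DMA on mainline, UDP
tcp = speedup(6.5, 4.5)    # PIO vs. DMA on mainline, TCP
gap = speedup(20.0, 6.5)   # vendor kernel (>20 Mbps TCP) vs. mainline PIO, TCP

print(f"UDP: {udp:.2f}x, TCP: {tcp:.2f}x, remaining gap to vendor: >{gap:.1f}x")
# prints: UDP: 1.54x, TCP: 1.44x, remaining gap to vendor: >3.1x
```

So PIO mode helps by roughly a factor of 1.5, but a factor of more than 3 to the vendor kernel remains unexplained.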
Post by Stefan Wahren
Post by Jörg Krause
Is it even possible, as the mainline driver uses the DMA engine?
I think the more important part would be to analyse why the mainline
driver is slower, i.e. to identify the bottleneck exactly.
I will further investigate this issue.

Best regards,
Jörg Krause
Stefan Wahren
2016-11-02 08:24:12 UTC
Post by Jörg Krause
I got the PIO mode working for the mxs-mmc driver. For this I ported
the PIO code from the vendor kernel and removed the usage of the DMA
engine entirely.
Good job
Post by Jörg Krause
Testing network bandwidth with iperf, I get about 10 Mbit/sec with PIO
mode compared to about 6.5 Mbit/sec with DMA mode for UDP, and about
6.5 Mbit/sec with PIO mode compared to about 4.5 Mbit/sec with DMA mode
for TCP.
And how about MMC / SD card performance?
Jörg Krause
2016-11-02 08:30:18 UTC
Post by Stefan Wahren
Good job
Thanks!
Post by Stefan Wahren
Post by Jörg Krause
Testing network bandwidth with iperf, I get about 10 Mbit/sec with PIO
mode compared to about 6.5 Mbit/sec with DMA mode for UDP, and about
6.5 Mbit/sec with PIO mode compared to about 4.5 Mbit/sec with DMA mode
for TCP.
And how about MMC / SD card performance?
Can't tell, as my custom i.MX28 board does not have an SD card interface.

I will share the code after doing some cleanups and further tests so
you might test it as well.

Jörg
Jörg Krause
2016-11-04 18:44:57 UTC
Hi Shawn,
I noticed poor performance with the i.MX28 MMC and/or DMA driver using
the mainline kernel compared to the vendor Freescale kernel 2.6.35.
I've seen that you added the drivers to mainline some years ago.

My custom i.MX28 board has a Wi-Fi chip attached to the SSP2 interface.
Comparing the bandwidth with iperf, I get >20 Mbits/sec on the vendor
kernel and <5 Mbits/sec on the mainline kernel.

My best guess is that there is some kind of bottleneck in the drivers.
I have already started looking at the vendor drivers as well as the
mainline drivers, but I need to investigate further to understand their
complexity.

Do you have any idea what the bottleneck might be?

Best regards,
Jörg Krause
Stefan Wahren
2016-11-04 19:30:47 UTC
Hi Jörg,
Post by Jörg Krause
My custom i.MX28 board has a Wi-Fi chip attached to the SSP2 interface.
Comparing the bandwidth with iperf, I get >20 Mbits/sec on the vendor
kernel and <5 Mbits/sec on the mainline kernel.
There is one thing about the clock handling: I noticed that the vendor kernel
rounds the clock frequency up, whereas the mainline kernel rounds it down [1].
So don't trust the clock rates from the DT / board code. Better verify the
register settings or check the clock with an oscilloscope.

[1] - http://www.spinics.net/lists/linux-mmc/msg09132.html
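
To illustrate why the rounding direction matters, here is a small sketch in Python. It assumes the SSP bit clock is SSPCLK / (CLOCK_DIVIDE * (1 + CLOCK_RATE)), with CLOCK_DIVIDE an even value from 2 to 254 and CLOCK_RATE an 8-bit value, which is my reading of the i.MX28 reference manual. The function names are invented, and this is not the code of either kernel; it only shows how far apart the two policies can land for the same request.

```python
def ssp_candidates(ssp_clk_hz):
    """Yield (sck_hz, clock_divide, clock_rate) for every assumed-legal setting."""
    for clock_divide in range(2, 255, 2):   # even values 2..254 (assumption)
        for clock_rate in range(256):       # 8-bit field (assumption)
            sck_hz = ssp_clk_hz / (clock_divide * (1 + clock_rate))
            yield sck_hz, clock_divide, clock_rate


def pick_round_down(ssp_clk_hz, target_hz):
    """Fastest achievable bit clock that does not exceed target_hz."""
    usable = [c for c in ssp_candidates(ssp_clk_hz) if c[0] <= target_hz]
    if not usable:
        raise ValueError("target is below the slowest achievable rate")
    return max(usable, key=lambda c: c[0])


def pick_round_up(ssp_clk_hz, target_hz):
    """Slowest achievable bit clock that is not below target_hz.

    Falls back to the fastest achievable rate if the target is out of reach.
    """
    everything = list(ssp_candidates(ssp_clk_hz))
    usable = [c for c in everything if c[0] >= target_hz]
    if not usable:
        return max(everything, key=lambda c: c[0])
    return min(usable, key=lambda c: c[0])


for target in (25_000_000, 50_000_000):
    down = pick_round_down(96_000_000, target)
    up = pick_round_up(96_000_000, target)
    print(target, down[0], up[0])
```

Under these assumptions, a 25 MHz request on a 96 MHz SSP clock gives 24 MHz when rounding down and 48 MHz when rounding up, while a 50 MHz request gives 48 MHz either way. Whether this applies to a given board depends on the rate the card or chip actually negotiates.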
Jörg Krause
2016-11-04 20:56:14 UTC
Post by Stefan Wahren
There is one thing about the clock handling: I noticed that the vendor kernel
rounds the clock frequency up, whereas the mainline kernel rounds it down [1].
So don't trust the clock rates from the DT / board code. Better verify the
register settings or check the clock with an oscilloscope.
[1] - http://www.spinics.net/lists/linux-mmc/msg09132.html
I checked the clock rate setting by reading register 0x80014070
(HW_SSP2_TIMING). CLOCK_DIVIDE is 0x2 and CLOCK_RATE is 0x0. As the SSP clock
is 96 MHz, this gives a clock rate of 48 MHz.
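
The calculation above can be written out as a short Python sketch. The field layout (CLOCK_DIVIDE in bits 15:8, CLOCK_RATE in bits 7:0, timeout in the upper half) and the formula SSPCLK / (CLOCK_DIVIDE * (1 + CLOCK_RATE)) are assumptions based on my reading of the i.MX28 reference manual. The register value 0x0200 is a stand-in that carries only the two fields quoted here, not a full register dump.

```python
def decode_ssp_timing(reg_value, ssp_clk_hz):
    """Return (CLOCK_DIVIDE, CLOCK_RATE, bit clock in Hz) for a TIMING value."""
    clock_divide = (reg_value >> 8) & 0xFF   # assumed field position: bits 15:8
    clock_rate = reg_value & 0xFF            # assumed field position: bits 7:0
    if clock_divide == 0:
        raise ValueError("CLOCK_DIVIDE of 0 is not a valid setting")
    sck_hz = ssp_clk_hz / (clock_divide * (1 + clock_rate))
    return clock_divide, clock_rate, sck_hz


# CLOCK_DIVIDE = 0x2, CLOCK_RATE = 0x0, SSP clock = 96 MHz (values from the post)
divide, rate, sck_hz = decode_ssp_timing(0x0200, 96_000_000)
print(divide, rate, sck_hz / 1e6)  # prints: 2 0 48.0
```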

There was a discussion on the mailing list [1] suggesting that tasklets
might be slow.

Jörg
Jörg Krause
2016-11-04 22:42:39 UTC
Hi Stefan,

Sorry, I forgot the link in the previous mail.
I checked the clock rate setting by reading register 0x80014070
(HW_SSP2_TIMING). CLOCK_DIVIDE is 0x2 and CLOCK_RATE is 0x0. As the SSP clock
is 96 MHz, this gives a clock rate of 48 MHz.

There was a discussion on the mailing list [1] suggesting that tasklets
might be slow.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/043395.html

Jörg
Stefan Wahren
2016-11-05 11:33:14 UTC
Hi Jörg,
Post by Jörg Krause
There was a discussion on the mailing list [1] suggesting that tasklets
might be slow.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/043395.html
If I understand it right, the tasklet is not the problem, but rather the design of
the MXS DMA driver. Please refer to the chapter "General Design Notes" in the
documentation for DMA providers [2].
I think the MXS DMA driver is affected. Maybe you should ask Vinod Koul about
this.

[2] - https://www.kernel.org/doc/Documentation/dmaengine/provider.txt
Jörg Krause
2016-11-05 12:06:50 UTC
Hello Vinod,

As recommended by Stefan Wahren, I'm turning to you about this issue.
Please see below...
Post by Stefan Wahren
Hi Jörg,
Post by Jörg Krause
Hi Stefan,
sorry, I forget the link in the previous mail.
Post by Stefan Wahren
Hi Jörg,
2016
um 19:44
Hi Shawn,
Post by Stefan Wahren
Post by Jörg Krause
Post by Stefan Wahren
2016
um 01:07
You mentioned [1] an optimization in the Freescale vendor
Linux
kernel
[2]. I would really like to see this optimization in the
mainline
kernel.
Did you ever tried to port this code from Freescale to
mainline?
Yes, i tried once but i was frustrated soon because of the
lot of
required
changes and resulting issues.
I got the PIO mode working for the mxs-mmc driver. For this
I
ported
the PIO code from the vendor kernel and removed the usage of
the
DMA
engine entirely.
Good job
Post by Jörg Krause
Testing network bandwidth with iperf, I get about
~10Mbit/sec
with
PIO
mode compared to ~6.5Mbit/sec with DMA mode for UDP and about
~6.5Mbit/sec compared to ~4.5Mbit/sec with DMA mode for TCP.
And how about MMC / sd card performance?
I noticed poor performance with the i.MX28 MMC and/or DMA
driver
using
the mainline kernel compared to the vendor Freescale kernel 2.6.35.
I've seen that hou have added the drivers to mainline some
years
ago.
My custom i.MX28 board has a wifi chip attached to the SSP2 interface.
Comparing the bandwidth with iperf I get >20Mbits/sec on the vendor
kernel and <5Mbits/sec on the mainline kernel.
There is one thing about the clock handling. I noticed that the vendor kernel
rounds up the clock frequency and the mainline kernel rounds down the clock
frequency [1]. So don't trust the clock ratings from DT / board code. Better
verify the register settings or check them with an oscilloscope.
[1] - http://www.spinics.net/lists/linux-mmc/msg09132.html
I checked the clock rate setting by reading the register 0x80014070
(HW_SSP2_TIMING). CLOCK_DIVIDE is 0x2 and CLOCK_RATE is 0x0. As SSP CLK
is 96MHz this makes a clock rate of 48MHz.
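As a cross-check of the arithmetic above: the i.MX28 SSP block derives its serial clock as SSP_SCK = SSPCLK / (CLOCK_DIVIDE * (1 + CLOCK_RATE)), which is also how the mainline clock code computes it. The following is only a minimal sketch of that formula; the helper name `ssp_sck_hz` is an assumption made here for illustration and is not taken from any driver.

```python
def ssp_sck_hz(ssp_clk_hz, clock_divide, clock_rate):
    """Serial clock derived from the HW_SSP_TIMING fields (hypothetical helper)."""
    # CLOCK_DIVIDE is an even pre-divider in 2..254, CLOCK_RATE is 0..255.
    if clock_divide % 2 or not 2 <= clock_divide <= 254:
        raise ValueError("CLOCK_DIVIDE must be even and within 2..254")
    if not 0 <= clock_rate <= 255:
        raise ValueError("CLOCK_RATE must be within 0..255")
    return ssp_clk_hz // (clock_divide * (1 + clock_rate))


# Field values as read from the register in the mail above:
# CLOCK_DIVIDE = 0x2, CLOCK_RATE = 0x0, SSP CLK = 96 MHz.
print(ssp_sck_hz(96_000_000, 0x2, 0x0))
```

With the values quoted in the mail this gives 48000000, matching the 48MHz stated there.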
There was a discussion on the mailing list [1] about that tasklets
might be slow.
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2011-February/043395.html
If I understand it right, the tasklet is not the problem, but the design of the
MXS DMA driver. Please refer to the chapter "General Design Notes" in the
documentation of the DMA provider [2].
I think the MXS DMA driver is affected. Maybe you should ask Vinod Koul about
this.
[2] - https://www.kernel.org/doc/Documentation/dmaengine/provider.txt
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.

@ Stefan, all
My understanding is that the tasklet in this case is responsible for
reading the response registers of the DMA controller and returning the
response to the MMC host driver.

The vendor kernel does this in the interrupt routine of mxs-mmc by
calling complete(), whereas the mainline kernel does this in the
interrupt routine in mxs-dma by scheduling the tasklet.

To check whether this makes any difference I replaced the tasklet usage
with the complete() infrastructure. For this I hacked the DMA
engine and the MXS DMA driver. However, the performance stays the same.

So, if I understand correctly, this is not an issue here, right? So if
not the tasklet, what do you suspect?

Jörg
Koul, Vinod
2016-11-05 12:39:43 UTC
Permalink
Post by Jörg Krause
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.
Is this imx-dma or imx-sdma?
Post by Jörg Krause
@ Stefan, all
My understanding is that the tasklet in this case is responsible for
reading the response registers of the DMA controller and returning the
response to the MMC host driver.
The vendor kernel does this in the interrupt routine of mxs-mmc by
calling complete(), whereas the mainline kernel does this in the
interrupt routine in mxs-dma by scheduling the tasklet.
Is vendor kernel using dmaengine APIs or not?

Okay, if we talk about getting the best performance, I always advise folks
to issue the next transaction in the interrupt routine and then do
descriptor management and the callback in the tasklet.

Some drivers do that correctly, but some don't.

The tasklet can be an issue, but only if there is a huge scheduling delay for
the tasklet. You can check this using tracing tools and confirm.
Post by Jörg Krause
To check whether this makes any difference I replaced the tasklet usage
with the complete() infrastructure. For this I hacked the DMA
engine and the MXS DMA driver. However, the performance stays the same.
So, if I understand correctly, this is not an issue here, right? So if
not the tasklet, what do you suspect?
--
~Vinod
Jörg Krause
2016-11-05 12:47:35 UTC
Permalink
Post by Koul, Vinod
Post by Jörg Krause
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.
Is this imx-dma or imx-sdma?
It's mxs-dma.
Post by Koul, Vinod
Post by Jörg Krause
@ Stefan, all
My understanding is that the tasklet in this case is responsible for
reading the response registers of the DMA controller and returning the
response to the MMC host driver.
The vendor kernel does this in the interrupt routine of mxs-mmc by
calling complete(), whereas the mainline kernel does this in the
interrupt routine in mxs-dma by scheduling the tasklet.
Is vendor kernel using dmaengine APIs or not?
No. It's using a custom dmaengine.
Post by Koul, Vinod
Okay, if we talk about getting the best performance, I always advise folks
to issue the next transaction in the interrupt routine and then do
descriptor management and the callback in the tasklet.
Some drivers do that correctly, but some don't.
Do you have an example for a driver doing it correctly?
Post by Koul, Vinod
The tasklet can be an issue, but only if there is a huge scheduling delay for
the tasklet. You can check this using tracing tools and confirm.
I don't think the tasklet is an issue here, as I replaced the tasklets in
the dmaengine API with completions (which the vendor kernel uses) and
there is no performance benefit. However, I am not a Linux kernel
developer...

Jörg
Fabio Estevam
2016-11-05 12:48:12 UTC
Permalink
Hi Vinod,
Post by Koul, Vinod
Post by Jörg Krause
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.
Is this imx-dma or imx-sdma?
This is drivers/dma/mxs-dma.c, thanks.
Jörg Krause
2016-11-05 13:14:41 UTC
Permalink
Post by Koul, Vinod
Post by Jörg Krause
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.
Is this imx-dma or imx-sdma?
Post by Jörg Krause
@ Stefan, all
My understanding is that the tasklet in this case is responsible for
reading the response registers of the DMA controller and returning the
response to the MMC host driver.
The vendor kernel does this in the interrupt routine of mxs-mmc by
calling complete(), whereas the mainline kernel does this in the
interrupt routine in mxs-dma by scheduling the tasklet.
Is vendor kernel using dmaengine APIs or not?
It's this engine [1].

[1] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/arch/arm/plat-mxs/dmaengine.c?h=imx_2.6.35_1.1.0
Koul, Vinod
2016-11-05 15:45:36 UTC
Permalink
Post by Jörg Krause
Post by Koul, Vinod
Post by Jörg Krause
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.
Is this imx-dma or imx-sdma?
Post by Jörg Krause
@ Stefan, all
My understanding is that the tasklet in this case is responsible for
reading the response registers of the DMA controller and returning the
response to the MMC host driver.
The vendor kernel does this in the interrupt routine of mxs-mmc by
calling complete(), whereas the mainline kernel does this in the
interrupt routine in mxs-dma by scheduling the tasklet.
Is vendor kernel using dmaengine APIs or not?
It's this engine [1].
[1] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/arch/arm/plat-mxs/dmaengine.c?h=imx_2.6.35_1.1.0
Thanks for info, this looks okay.

First, can you confirm that the register configuration for the DMA transaction is
the same in both cases?

Second, looking at the driver, I see that the interrupt handler is not
pushing the next descriptor. Also, the tasklet is doing the callback action and
not pushing any descriptors. Did I miss anything here?

For good DMA throughput, you should have multiple DMA transactions
queued up and submitted as fast as possible. Can you check whether this is
being done?

We need to minimize/eliminate the delay between two transactions. This
can be done in SW or HW, depending on support from the HW. If the HW supports
chaining of descriptors, then the next transaction given to the
dmaengine driver should be appended at the end. If not, submit the
descriptor to the HW immediately on interrupt.

For a good example of the latter, please look at drivers/dma/sa11x0-dma.c
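
To put rough numbers on the advice above, here is a toy timing model, not driver code. It compares restarting the hardware from the tasklet with restarting it directly in the interrupt handler. The function name `total_time_us` and all figures are hypothetical and were not measured on any board in this thread.

```python
def total_time_us(n_transfers, transfer_us, tasklet_delay_us, restart_in_irq):
    """Time until the last completion callback has run (toy model)."""
    if restart_in_irq:
        # The IRQ handler starts the next descriptor at once, so the
        # hardware never idles; only the final callback adds the delay.
        return n_transfers * transfer_us + tasklet_delay_us
    # The next descriptor is only issued after the tasklet has run, so
    # every single transfer pays the scheduling delay.
    return n_transfers * (transfer_us + tasklet_delay_us)


# Hypothetical numbers: 100 transfers of 200 us each, 50 us tasklet delay.
slow = total_time_us(100, 200, 50, restart_in_irq=False)
fast = total_time_us(100, 200, 50, restart_in_irq=True)
print(slow, fast)
```

In this model the scheduling delay is paid once instead of once per transfer, which is the effect that chaining descriptors in hardware or resubmitting on interrupt is meant to achieve.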

HTH
--
~Vinod
Jörg Krause
2016-11-05 22:37:13 UTC
Permalink
Post by Koul, Vinod
Post by Jörg Krause
Post by Koul, Vinod
Post by Jörg Krause
@ Vinod
In short, I noticed poor performance in the SSP2 (MMC/SD/SDIO)
interface on a custom i.MX28 board with a wifi chip attached. Comparing
the bandwidth with iperf I get >20Mbits/sec on the vendor kernel and
<5Mbits/sec on the mainline kernel. I am trying to investigate what the
bottleneck is.
Is this imx-dma or imx-sdma?
Post by Jörg Krause
@ Stefan, all
My understanding is that the tasklet in this case is responsible for
reading the response registers of the DMA controller and returning the
response to the MMC host driver.
The vendor kernel does this in the interrupt routine of mxs-mmc by
calling complete(), whereas the mainline kernel does this in the
interrupt routine in mxs-dma by scheduling the tasklet.
Is vendor kernel using dmaengine APIs or not?
It's this engine [1].
[1] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/arch/arm/plat-mxs/dmaengine.c?h=imx_2.6.35_1.1.0
Thanks for info, this looks okay.
First, can you confirm that the register configuration for the DMA transaction is
the same in both cases?
They are almost identical. The difference is that the mainline MMC
driver has the SDIO IRQ enabled and the APB bus has burst mode enabled.
Neither has any influence.
Post by Koul, Vinod
Second, looking at the driver, I see that the interrupt handler is not
pushing the next descriptor. Also, the tasklet is doing the callback action and
not pushing any descriptors. Did I miss anything here?
Right. However, after observing the registers I noticed that the vendor
MMC kernel driver only issues one DMA command, whereas the mainline
driver issues two chained DMA commands. The relevant function in both
drivers is mxs_mmc_adtc().

The mainline function issues one DMA transaction that sets only the PIO
words and appends a second one carrying the data from the MMC host.

The vendor function copies the MMC host data from the scatterlist into
its own DMA buffer, sets the buffer address as the next command
address and issues the descriptor to the DMA engine.
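
The vendor approach described above trades a CPU copy for fewer DMA commands. As a rough illustration only (a toy model of the idea, not the vendor code; `linearize` and the segment sizes are assumptions made here), copying a scatterlist into one contiguous buffer could be sketched as:

```python
def linearize(segments, bounce):
    """Copy scattered segments into one contiguous buffer; return bytes used."""
    offset = 0
    for seg in segments:
        end = offset + len(seg)
        if end > len(bounce):
            raise ValueError("bounce buffer too small")
        bounce[offset:end] = seg  # CPU copy, one per scatterlist entry
        offset = end
    return offset


# Hypothetical scatterlist: three 512-byte segments, i.e. one 1536-byte packet.
segments = [bytes([i]) * 512 for i in range(3)]
bounce = bytearray(2048)
used = linearize(segments, bounce)
print(used)  # a single DMA command can now cover bounce[:used]
```

Whether the extra copy is cheaper than handling additional descriptors depends on the transfer size and the CPU, so this shows only the mechanism, not a performance claim.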
Post by Koul, Vinod
For good DMA throughput, you should have multiple DMA transactions
queued up and submitted as fast as possible. Can you check whether this is
being done?
We need to minimize/eliminate the delay between two transactions. This
can be done in SW or HW, depending on support from the HW. If the HW supports
chaining of descriptors, then the next transaction given to the
dmaengine driver should be appended at the end. If not, submit the
descriptor to the HW immediately on interrupt.
I see! In this particular case, the vendor driver reduces the chaining
of descriptors, whereas the mainline driver chains two DMA commands.
Note that the i.MX28 hardware does support chaining. So, might this be
a cause of the poor performance?

Jörg
Jörg Krause
2016-11-18 23:49:01 UTC
Permalink
Hi all,

[snip]

I did some time measurements on the wifi, mmc and dma driver to compare
the performance between the vendor and the mainline kernel. For this I
toggled some GPIOs and measured the time difference with an oscilloscope. I
started measuring the time before calling sdio_readsb() in the wifi
driver [1] and stopped the time when the call returns. Note that the
time was only measured for a packet length of 1536 bytes.

The vendor kernel took about 250 us to return whereas the mainline
kernel took about 325 us. To investigate where this additional time
comes from I divided the whole procedure into separate parts and
compared the time each consumed.
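
For orientation, the two timings can be turned into a throughput ceiling for the read path alone. This is only a sketch under the assumption that sdio_readsb() calls run back-to-back with no other work in between; the helper name `ceiling_mbit_per_s` is hypothetical.

```python
def ceiling_mbit_per_s(payload_bytes, call_time_us):
    """Upper bound on throughput if calls ran back-to-back (toy estimate)."""
    # Bits per microsecond is numerically equal to Mbit/s.
    return payload_bytes * 8 / call_time_us


vendor = ceiling_mbit_per_s(1536, 250)    # vendor kernel, 250 us per call
mainline = ceiling_mbit_per_s(1536, 325)  # mainline kernel, 325 us per call
print(f"{vendor:.1f} {mainline:.1f}")
```

Both ceilings lie well above the iperf figures reported earlier in the thread, so the per-call time of sdio_readsb() alone may not account for the whole gap.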

I noticed that the mainline kernel takes much longer to return
after the DMA request is done, signalled in this case by calling
mxs_mmc_dma_irq_callback() [2] in the mxs-mmc driver. From here it
takes about 150 us to get back to sdio_readsb().

An example of consuming much more time is the mainline mmc driver,
where it hangs in mmc_wait_done() [3] for about 50 us just calling
complete(), whereas the vendor mmc driver returns almost immediately
here.

I wonder why this call to complete consumes so much time? Any ideas?

[1] http://lxr.free-electrons.com/source/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcmsdh.c#L488

[2] http://lxr.free-electrons.com/source/drivers/mmc/host/mxs-mmc.c#L179

[3] http://lxr.free-electrons.com/source/drivers/mmc/core/core.c#L386

Best regards,
Jörg Krause
Stefan Wahren
2016-11-19 11:36:11 UTC
Permalink
Hi Jörg,
Post by Jörg Krause
Hi all,
[snip]
I did some time measurements on the wifi, mmc and dma driver to compare
the performance between the vendor and the mainline kernel. For this I
toggled some GPIOs and measured the time difference with an oscilloscope. I
started measuring the time before calling sdio_readsb() in the wifi
driver [1] and stopped the time when the call returns. Note that the
time was only measured for a packet length of 1536 bytes.
The vendor kernel took about 250 us to return whereas the mainline
kernel took about 325 us. To investigate where this additional time
comes from I divided the whole procedure into separate parts and
compared the time each consumed.
I noticed that the mainline kernel takes much longer to return
after the DMA request is done, signalled in this case by calling
mxs_mmc_dma_irq_callback() [2] in the mxs-mmc driver. From here it
takes about 150 us to get back to sdio_readsb().
An example of consuming much more time is the mainline mmc driver,
where it hangs in mmc_wait_done() [3] for about 50 us just calling
complete(), whereas the vendor mmc driver returns almost immediately
here.
I wonder why this call to complete consumes so much time? Any ideas?
I don't know why, but how about putting the SDIO clk signal in parallel with the
GPIOs on your oscilloscope? That way you could get a better view of the runtime
behavior.

Btw, you should also verify the time needed between two packets.

Stefan
Post by Jörg Krause
[1] http://lxr.free-electrons.com/source/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcmsdh.c#L488
[2] http://lxr.free-electrons.com/source/drivers/mmc/host/mxs-mmc.c#L179
[3] http://lxr.free-electrons.com/source/drivers/mmc/core/core.c#L386
Best regards,
Jörg Krause
Jörg Krause
2016-11-20 09:14:35 UTC
Permalink
Hi Stefan,
Post by Stefan Wahren
Hi Jörg,
Post by Jörg Krause
Hi all,
[snip]
I did some time measurements on the wifi, mmc and dma driver to compare
the performance between the vendor and the mainline kernel. For this I
toggled some GPIOs and measured the time difference with an oscilloscope. I
started measuring the time before calling sdio_readsb() in the wifi
driver [1] and stopped the time when the call returns. Note that the
time was only measured for a packet length of 1536 bytes.
The vendor kernel took about 250 us to return whereas the mainline
kernel took about 325 us. To investigate where this additional time
comes from I divided the whole procedure into separate parts and
compared the time each consumed.
I noticed that the mainline kernel takes much longer to return
after the DMA request is done, signalled in this case by calling
mxs_mmc_dma_irq_callback() [2] in the mxs-mmc driver. From here it
takes about 150 us to get back to sdio_readsb().
An example of consuming much more time is the mainline mmc driver,
where it hangs in mmc_wait_done() [3] for about 50 us just calling
complete(), whereas the vendor mmc driver returns almost immediately
here.
I wonder why this call to complete consumes so much time? Any ideas?
I don't know why, but how about putting the SDIO clk signal in parallel with the
GPIOs on your oscilloscope? That way you could get a better view of the runtime
behavior.
Unfortunately, the board layout does not allow me to access the SDIO
pins.

The main question for me is why the mmc core driver needs around 120
us from calling complete() in mmc_wait_done() [1] until
receiving the completion signal in mmc_wait_for_req_done() [2]. Why
does signaling the completion consume so much time?

For comparison, the time to do the mmc request (preparing request,
preparing DMA, doing DMA, waiting, reading response, starting signal
completion) takes about 215 us, whereas just sending the signal that
completion is done takes 120 us. For me this issue is the bottleneck.
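
Put as a share of the whole call, using only the two figures above (the helper name `completion_share` is hypothetical):

```python
def completion_share(request_us, completion_us):
    """Fraction of the total time spent on signaling the completion."""
    return completion_us / (request_us + completion_us)


share = completion_share(215, 120)
print(f"{share:.1%}")
```

So, on these measurements, roughly a third of each request is spent between complete() and the waiter resuming.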

Does anyone have an idea what may be responsible for signaling the
completion being so slow?

[1] http://lxr.free-electrons.com/source/drivers/mmc/core/core.c#L386
[2] http://lxr.free-electrons.com/source/drivers/mmc/core/core.c#L492
Post by Stefan Wahren
Btw, you should also verify the time needed between two packets.
Stefan
Post by Jörg Krause
[1] http://lxr.free-electrons.com/source/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcmsdh.c#L488
[2] http://lxr.free-electrons.com/source/drivers/mmc/host/mxs-mmc.c#L179
[3] http://lxr.free-electrons.com/source/drivers/mmc/core/core.c#L386
Best regards,
Jörg Krause
Jörg Krause
2016-10-15 11:18:36 UTC
Permalink
Post by Stefan Wahren
Hi Jörg,
10:46
Post by Jörg Krause
Thanks!
Note, this is the result for the wireless interface.
I got one of my custom boards running the legacy Linux Kernel 2.6.35
officially supported from Freescale (NXP) and the bcmdhd driver from
the Wiced project, where I get >20Mbps TCP throughput. The firmware
# wl ver
5.90 RC115.2
wl0: Apr 24 2014 14:08:41 version 5.90.195.89.24 FWID 01-bc2d0891
I got it also running with the Linux Kernel 4.1.15 from Freescale [2],
which is not officially supported for the i.MX28 target, with the
latest bcmdhd version where I get <7Mbps TCP throughput (which is much
the same I get with the brcmfmac driver). The firmware version reported
# wl ver
1.107 RC5.0
wl0: Aug  8 2016 02:17:48 version 5.90.232 FWID 01-0
So, probably something is missing in the newer Kernel version, which is
present in the legacy Kernel 2.6.35.
[1] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/log/?h=imx_2.6.35_1.1.0
[2] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/log/?h=imx_4.1.15_1.0.0_ga
During implementation of DDR mode for the mmc driver [1], I noticed a performance
gap between the vendor kernel and mainline by a factor of 2. I expect that your
wireless interface is connected via SDIO.
I wonder if this [2] might be related. As far as I can see it is not present in mainline.
Post by Stefan Wahren
[1] - http://linux-arm-kernel.infradead.narkive.com/GNkqjvo8/patch-rfc-0-3-mmc-mxs-mmc-implement-ddr-support
[2] http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/commit/?h=imx_2.6.35_1.1.0&id=c105f3ef1d461aaeedbc6361941096b6684cc812