Discussion:
i.MX 6 and PCIe DMA issues
Moese, Michael
2017-07-11 13:40:57 UTC
Hello ARM folks,
I turn to you in the hope that you have hints or directions on the problem I describe below.

I am currently investigating strange behavior of our i.MX 6 (Quad) board with an FPGA connected to PCI Express. This FPGA contains, among others, an Ethernet (10/100 Mbps) IP core.
The Ethernet relies completely on DMA transfers. There is one buffer descriptor table containing pointers to 64 RX and TX buffers.
Buffers are allocated using dma_alloc_coherent() and mapped using dma_map_single().
Prior to access, dma_sync_single_for_cpu() is called on the memory regions, afterwards dma_sync_single_for_device().

If a new frame is received, the driver reads the RX buffer and passes the frame up using skb_put().
When the issue was reported for an old (say 3.18.19) kernel, the buffer descriptor was read correctly (including a correct length), but the buffer contained all zeroes. When I map the physical address in userspace and dump the contents, I can see the correct buffer descriptor contents and, inside the buffers, valid Ethernet frames. So the DMA transfer itself is obviously working.
To avoid chasing already-fixed bugs, I switched to a 4.12.0 kernel and observed almost the same behavior, but this time the length was not read either.
On 3.18.19 I was able to read the buffers when I allocated them using kmalloc() instead of dma_alloc_coherent(); on 4.12 this did not have any impact.

I suspected the caches were the root of my issue, but I was not able to resolve it with calls to flush_cache_all(), which I assumed would invalidate the entire cache.

Unfortunately, our driver is legacy out-of-tree code, and I have started working on it to get it ready for submission. If it is of any help, I could send the driver code as well as our board's device tree.

I would highly appreciate any hint or direction that may help my troubleshooting.

Best Regards,
Michael
Andrew Lunn
2017-07-11 14:07:22 UTC
Post by Moese, Michael
I was suspecting the caches to be the root of my issue, but I was
not able to resolve the issue with calls to flush_cache_all(), which
I suppose should have invalidated the entire cache.
Flush and invalidate are different operations. flush_cache_all() will
not invalidate.

How are your RX buffers aligned? It is good to ensure that your
buffers don't share cache lines. What could be happening is that
reading the tail end of one buffer brings the head of the next
buffer into the cache, if they share cache lines.

Also, ARMv7 processors can do speculative reads, so you must get your
dma_sync_single_for_cpu() and dma_sync_single_for_device() correct.

Andrew
Robin Murphy
2017-07-11 14:50:19 UTC
Hello ARM folks, I turn to you in the hope that you have hints or
directions on the problem I describe below.
I am currently investigating strange behavior of our i.MX 6 (Quad)
board with an FPGA connected to PCI Express. This FPGA contains,
among others, an Ethernet (10/100 Mbps) IP core. The Ethernet relies
completely on DMA transfers. There is one buffer descriptor table
containing pointers to 64 RX and TX buffers. Buffers are allocated
using dma_alloc_coherent() and mapped using dma_map_single(). Prior
I don't much like the sound of that "and" there - coherent DMA
allocations are, as the name implies, already coherent for CPU and DMA
accesses, and require no maintenance; the streaming DMA API
(dma_{map,unmap,sync}_*) on the other hand is *only* for use on
kmalloced memory.

I'm far more of a DMA guy than a networking guy, so I don't know much
about the details of ethernet drivers, but typically, long-lived things
like descriptor rings would usually use coherent allocations, whilst
streaming DMA is used for the transient mapping/unmapping of individual
skbs.
to access, dma_sync_single_for_cpu() is called on the memory regions,
afterwards dma_sync_single_for_device().
If a new frame is received, the driver reads the RX buffer and passes
the frame up using skb_put(). When the issue was reported for an old
(say 3.18.19) kernel, the buffer descriptor was read correctly
(including a correct length), but the buffer contained all zeroes.
When I map the physical address in userspace and dump the contents, I
can see the correct buffer descriptor contents and, inside the
buffers, valid Ethernet frames. So the DMA transfer itself is
obviously working. To avoid chasing already-fixed bugs, I switched to
a 4.12.0 kernel and observed almost the same behavior, but this time
the length was not read either. On 3.18.19 I was able to read the
buffers when I allocated them using kmalloc() instead of
dma_alloc_coherent(); on 4.12 this did not have any impact.
I suspected the caches were the root of my issue, but I was not
able to resolve it with calls to flush_cache_all(), which I
assumed would invalidate the entire cache.
i.MX6Q has a PL310 L2 outer cache, which brings with it a whole load of
extra fun, but the primary effect is that it'll be extremely hard to
bodge things if the DMA API usage is incorrect in the first place.
AFAICS flush_cache_all() on a v7 CPU performs set/way operations on the
CPU caches, which means a) on that system it will only affect L1, and b)
it shouldn't really be used from an SMP context with the MMU on anyway.

The PL310 does have more than its fair share of wackiness, but unless
you also see DMA going wrong for the on-chip peripherals, the problem is
almost certainly down to the driver itself rather than the cache
configuration.

Robin.
Unfortunately, our driver is legacy out-of-tree code and I started
working on this driver to get it ready for submission. If it is of
any help, I could send the code of the driver as well as our board's
device tree.
I would highly appreciate any hint or direction that may help my troubleshooting.
Best Regards, Michael
m***@men.de
2017-07-13 07:07:19 UTC
Post by Robin Murphy
I don't much like the sound of that "and" there - coherent DMA
allocations are, as the name implies, already coherent for CPU and DMA
accesses, and require no maintenance; the streaming DMA API
(dma_{map,unmap,sync}_*) on the other hand is *only* for use on
kmalloced memory.
OK, I hope I have this right: I allocate my memory using dma_alloc_coherent()
once, the dma_handle is passed to the device, and no other dma_map_*()
or dma_sync_*() calls are needed?

Yesterday I observed some strange behavior when I added some debug prints to
the driver, printing the result from phys_to_virt() of my
virtual address and the dma_handle. I know these may be different, but
when I dump the memory (from userspace using mmap), I can see the data
at the address of my dma_handle, while the memory the driver holds a
pointer to has different contents. I don't understand this.
Post by Robin Murphy
The PL310 does have more than its fair share of wackiness, but unless
you also see DMA going wrong for the on-chip peripherals, the problem is
almost certainly down to the driver itself rather than the cache
configuration.
Well, I think I need to dive into this as well; my former co-worker
disabled DMA for SPI, for example, in the device tree. That may be
another hint. I think I will need to find out what is set up here.


Thank you for your advice, I will try to find out more.

Michael
Russell King - ARM Linux
2017-07-13 09:04:23 UTC
Post by m***@men.de
OK, I hope I have this right: I allocate my memory using dma_alloc_coherent()
once, the dma_handle is passed to the device, and no other dma_map_*()
or dma_sync_*() calls are needed?
Correct.
Post by m***@men.de
Yesterday I observed some strange behavior when I added some debug prints to
the driver, printing the result from phys_to_virt() of my
virtual address and the dma_handle. I know these may be different, but
when I dump the memory (from userspace using mmap), I can see the data
at the address of my dma_handle, while the memory the driver holds a
pointer to has different contents. I don't understand this.
phys_to_virt() and virt_to_phys() are only defined for the linear map
region.

DMA coherent allocations may or may not return addresses in the linear
region, which means virt_to_phys() on a virtual address returned from
dma_alloc_coherent() is not defined.

DMA addresses are not physical addresses, they are device addresses -
this is an important distinction when IOMMUs or address translators are
in the path between the device and memory. phys_to_virt() must not be
used with dma_addr_t data types.

You must always use the two addresses returned from dma_alloc_coherent()
and never try to translate them - the DMA address is for programming
into the hardware, and the virtual address for CPU accesses. If you do
try to pass these to the various translators (like virt_to_phys()) the
result is undefined.

Moreover, passing the dma_addr_t returned from dma_alloc_coherent() to
the dma_sync_*() functions also results in undefined behaviour, and
potential data corruption.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
m***@men.de
2017-07-13 10:00:31 UTC
Post by Russell King - ARM Linux
Post by m***@men.de
OK, I hope I have this right: I allocate my memory using dma_alloc_coherent()
once, the dma_handle is passed to the device, and no other dma_map_*()
or dma_sync_*() calls are needed?
Correct.
Great. Thanks for the clarification. I will make sure the usage is done
the right way.
Post by Russell King - ARM Linux
You must always use the two addresses returned from dma_alloc_coherent()
and never try to translate them - the DMA address is for programming
into the hardware, and the virtual address for CPU accesses. If you do
try to pass these to the various translators (like virt_to_phys()) the
result is undefined.
I was just trying to print some addresses for debugging; I have removed this
again. I'm confused because the DMA device address seems to be the
physical one. At least the memory there contains my data, while the
memory at the virtual address contains different data.

Could it be possible this mapping on my board is going wrong?

Michael
Russell King - ARM Linux
2017-07-13 10:18:45 UTC
Post by m***@men.de
Post by Russell King - ARM Linux
You must always use the two addresses returned from dma_alloc_coherent()
and never try to translate them - the DMA address is for programming
into the hardware, and the virtual address for CPU accesses. If you do
try to pass these to the various translators (like virt_to_phys()) the
result is undefined.
I was just trying to print some addresses for debugging; I have removed this
again. I'm confused because the DMA device address seems to be the
physical one. At least the memory there contains my data, while the
memory at the virtual address contains different data.
The DMA address might be a physical address if that's what the device
requires, but equally, it might not be.

If it is a physical address, phys_to_virt() will return the virtual
address in the lowmem mapping provided that the physical address
corresponds with a lowmem mapped page. If it's a highmem page, then
phys_to_virt() will still return something that looks like a virtual
address, but it could be anything. It's undefined.

Accessing memory through that pointer could end up hitting a cacheable
mapping, which would destroy the semantics of the DMA coherent mapping,
thereby leading to unpredictable behaviour.

As I've already said, use _both_ pointers for a DMA coherent allocation.
If you want to dump the data in the DMA coherent memory, use the virtual
address returned from dma_alloc_coherent() - this will have the least
side effects.

If you access it through phys_to_virt(dma_addr) then you could be
dragging the data into the caches, and the first time you print it, it
would look correct, but when the DMA device updates it, it becomes stale,
and even if you subsequently access it through the correct pointer, you
could continue receiving the stale data. Just don't do this - using
phys_to_virt() may seem like a short-cut, but it really isn't.
Post by m***@men.de
Could it be possible this mapping on my board is going wrong?
Unlikely.
Robin Murphy
2017-07-13 14:57:19 UTC
Post by m***@men.de
Post by Robin Murphy
I don't much like the sound of that "and" there - coherent DMA
allocations are, as the name implies, already coherent for CPU and DMA
accesses, and require no maintenance; the streaming DMA API
(dma_{map,unmap,sync}_*) on the other hand is *only* for use on
kmalloced memory.
OK, I hope I have this right: I allocate my memory using dma_alloc_coherent()
once, the dma_handle is passed to the device, and no other dma_map_*()
or dma_sync_*() calls are needed?
Yesterday I observed some strange behavior when I added some debug prints to
the driver, printing the result from phys_to_virt() of my
virtual address and the dma_handle. I know these may be different, but
when I dump the memory (from userspace using mmap), I can see the data
at the address of my dma_handle, while the memory the driver holds a
pointer to has different contents. I don't understand this.
Post by Robin Murphy
The PL310 does have more than its fair share of wackiness, but unless
you also see DMA going wrong for the on-chip peripherals, the problem is
almost certainly down to the driver itself rather than the cache
configuration.
Well, I think I need to dive into this as well; my former co-worker
disabled DMA for SPI, for example, in the device tree. That may be
another hint. I think I will need to find out what is set up here.
OK, the first thing I'd check is that you have the "arm,shared-override"
property on the DT node for the PL310 and that it's being applied
correctly (e.g. you're not entering the kernel with L2 already enabled).
Otherwise I think it might be possible to end up in a situation where
you get stale data into L2 that even reads from the non-cacheable remap
of the buffer can hit, which would probably look rather like what you're
describing.
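For reference, the property lives on the PL310 node; a fragment along these lines (register address as in the stock i.MX6 device trees, shown purely as an illustration of where arm,shared-override goes, not copied from the board in question):

```dts
l2-cache@a02000 {
	compatible = "arm,pl310-cache";
	reg = <0x00a02000 0x1000>;
	cache-unified;
	cache-level = <2>;
	arm,shared-override;
};
```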

Robin.
m***@men.de
2017-07-14 11:07:37 UTC
Post by Robin Murphy
I don't much like the sound of that "and" there - coherent DMA
allocations are, as the name implies, already coherent for CPU and DMA
accesses, and require no maintenance; the streaming DMA API
(dma_{map,unmap,sync}_*) on the other hand is *only* for use on
kmalloced memory.
I removed all the calls to dma_{map,unmap,sync}_* and took special care
with the handling of the dma_addr_t's. I now allocate my memory with
dma_alloc_coherent() and do nothing else with the dma_addr_t's. It seems
to work, at least partly, now. I had to grab a newer board, because we
killed the old one while soldering wires on to trace PCI Express traffic.

Maybe I had some weird old CPU or some strange U-Boot, or I simply missed
removing all the dma_sync_* calls or similar before.

Now I have other issues, but at least the DMA seems to be working now.
Thanks to you, I think I understand this topic now, thank you.

Now I can try to figure out why a read over PCI Express may stop all
traffic - but that's a different story.


Thanks,
Michael

Andrew Lunn
2017-07-11 17:56:59 UTC
Post by Moese, Michael
Hello ARM folks,
I turn to you in the hope that you have hints or directions on the problem I describe below.
I am currently investigating strange behavior of our i.MX 6 (Quad) board with an FPGA connected to PCI Express. This FPGA contains, among others, an Ethernet (10/100 Mbps) IP core.
FYI: I'm using an i.MX6 Quad with an Intel i210 PCIe Ethernet
controller. It works fine. So you might want to look at how the igb
driver does its buffer handling.

Andrew