Discussion:
[Question] devm_kmalloc() for DMA?
Masahiro Yamada
2017-03-08 10:59:48 UTC
Hi experts,

I have a question about
how to allocate a DMA-safe buffer.


In my understanding, kmalloc() returns memory with DMA-safe
alignment in order to avoid cache-line sharing problems when the
memory is used for DMA.

The alignment is decided by ARCH_DMA_MINALIGN. For example, on
modern ARM 32-bit boards, this value is typically 64, so memory
returned by kmalloc() has at least 64-byte alignment.


On the other hand, devm_kmalloc() does not return
sufficiently aligned memory.


On my board (ARM 32-bit), devm_kmalloc() returns
(an ARCH_DMA_MINALIGN-aligned address) + 0x10.



The reason for the 0x10 offset is obvious.

struct devres {
	struct devres_node node;
	/* -- 3 pointers */
	unsigned long long data[]; /* guarantee ull alignment */
};


The management data is located at the top of struct devres, and
devm_kmalloc() returns dr->data.

The "unsigned long long" member guarantees only 8-byte alignment of
the returned memory (hence the 0x10 offset after the three-pointer
header), but I think this may not be enough for DMA.

I noticed this while reading drivers/mtd/nand/denali.c.

The code looks as follows:


	denali->buf.buf = devm_kzalloc(denali->dev,
				       mtd->writesize + mtd->oobsize,
				       GFP_KERNEL);
	if (!denali->buf.buf) {
		ret = -ENOMEM;
		goto failed_req_irq;
	}

	/* Is 32-bit DMA supported? */
	ret = dma_set_mask(denali->dev, DMA_BIT_MASK(32));
	if (ret) {
		dev_err(denali->dev, "No usable DMA configuration\n");
		goto failed_req_irq;
	}

	denali->buf.dma_buf = dma_map_single(denali->dev, denali->buf.buf,
					     mtd->writesize + mtd->oobsize,
					     DMA_BIDIRECTIONAL);




The buffer is allocated with devm_kzalloc() and then
passed to dma_map_single().



Could this be a potential problem in general?

Is devm_kmalloc() not recommended
for buffers that will be DMA-mapped?


Any advice is appreciated.
--
Best Regards
Masahiro Yamada
Robin Murphy
2017-03-08 11:15:39 UTC
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/

If a particular device has a problem with:

	p = kmalloc(...);
	d = dma_map_single(p + 0x10, ...);
	do_something_with(d);

that's a separate issue altogether.

Robin.
Post by Masahiro Yamada
On my board (ARM 32bit), devm_kmalloc() returns
(ARCH_DMA_MINALIGN aligned address) + 0x10.
The reason of the offset 0x10 is obvious.
struct devres {
struct devres_node node;
/* -- 3 pointers */
unsigned long long data[]; /* guarantee ull alignment */
};
Management data is located at the top of struct devres.
Then, devm_kmalloc() returns dr->data.
The "unsigned long long" guarantees
the returned memory has 0x10 alignment,
but I think this may not be enough for DMA.
I noticed this when I was seeing drivers/mtd/nand/denali.c
denali->buf.buf = devm_kzalloc(denali->dev,
mtd->writesize + mtd->oobsize,
GFP_KERNEL);
if (!denali->buf.buf) {
ret = -ENOMEM;
goto failed_req_irq;
}
/* Is 32-bit DMA supported? */
ret = dma_set_mask(denali->dev, DMA_BIT_MASK(32));
if (ret) {
dev_err(denali->dev, "No usable DMA configuration\n");
goto failed_req_irq;
}
denali->buf.dma_buf = dma_map_single(denali->dev, denali->buf.buf,
mtd->writesize + mtd->oobsize,
DMA_BIDIRECTIONAL);
Memory buffer is allocated by devm_kzalloc(), then
passed to dma_map_single().
Could this be a potential problem in general?
Is devm_kmalloc() not recommended
for buffer that can be DMA-mapped?
Any advice is appreciated.
Masahiro Yamada
2017-03-08 18:06:34 UTC
Hi Robin,
Post by Robin Murphy
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/
I was just thinking of a traversal of the devres_node linked list
on a different thread, but that should not happen.

Please forget my stupid question.
Post by Robin Murphy
p = kmalloc(...);
d = dma_map_single(p + 0x10, ...);
do_something_with(d);
that's a separate issue altogether.
Right.
Each device has its own requirements for DMA alignment.

Thanks!
--
Best Regards
Masahiro Yamada
Lars-Peter Clausen
2017-03-08 19:48:31 UTC
Post by Masahiro Yamada
Hi Robin,
Post by Robin Murphy
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/
I just thought of traverse of the linked list of devres_node
on a different thread, but it should not happen.
When the DMA memory is mapped for reading from the device, the associated
cachelines are invalidated without writeback. There is no guarantee that the
changes made to the devres_node have made it to main memory yet, or is
there? So those updates could be lost and you'd get data structure corruption.
Russell King - ARM Linux
2017-03-08 19:59:01 UTC
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.

Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)

Updates that are made by the CPU on overlapping cache lines while the
memory is mapped for DMA may end up doing one of two things: either
overwriting the newly DMA'd data, or being lost altogether.

In the case of devm_* list manipulations, these should only ever happen
during device probe and tear down, and if DMA is active at those times,
the driver is seriously buggy.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Lars-Peter Clausen
2017-03-08 20:44:17 UTC
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
Russell King - ARM Linux
2017-03-08 21:19:00 UTC
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
There is no use of the term "undefined" in the document you refer to.

There is the recommendation that regions are cache line aligned, but
there is quite a bit of history in the kernel where DMA has been to
regions that are not cache line aligned, and where the DMA region
overlaps with data that has recent accesses made to it.

The situation is improving (in that DMA buffers are being allocated
separately, rather than being part of some other structure) but that
doesn't mean that it's safe to assume that overlapping cache lines can
be invalidated.

In any case, DMA with devm allocated buffers really is not a good idea.
The lifetime of devm allocated buffers is between their allocation and
the point that the driver is unbound (either via probe failure or
removal.) If that turns out to be shorter than DMA, then you end up
scribbling over freed memory.

Moreover, any memory that is dma_map'd must be dma_unmap'd before
being freed.

So, unless we're going to get devm_* stuff for DMA API and for ensuring
that DMA is disabled, I don't think using devm_k*alloc() with DMA is
really a good idea.

Take the fact that it returns memory that is not cache line aligned to
be a big clue that it shouldn't be used for this purpose.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Lars-Peter Clausen
2017-03-08 21:33:52 UTC
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
There is no use of the term "undefined" in the document you refer to.
There is the recommendation that regions are cache line aligned, but
there is quite a bit of history in the kernel where DMA has been to
regions that are not cache line aligned, and where the DMA region
overlaps with data that has recent accesses made to it.
It says: "Warnings: Memory coherency operates at a granularity called the
cache line width. In order for memory mapped by this API to operate
correctly, the mapped region must begin exactly on a cache line
boundary and end exactly on one." That doesn't sound like a recommendation
to me. "should" usually implies a recommendation while "must" indicates a
hard requirement.

I believe e.g. MIPS will align the address by masking the lower bits off,
without any flushes. I wouldn't be surprised if other architectures do the same.
Post by Russell King - ARM Linux
The situation is improving (in that DMA buffers are being allocated
separately, rather than being part of some other structure) but that
doesn't mean that it's safe to assume that overlapping cache lines can
be invalidated.
In any case, DMA with devm allocated buffers really is not a good idea.
I very much agree with that part.
Masahiro Yamada
2017-03-09 03:25:07 UTC
Hi Russell, Lars-Peter,

Thanks for your expert comments.
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
There is no use of the term "undefined" in the document you refer to.
There is the recommendation that regions are cache line aligned, but
there is quite a bit of history in the kernel where DMA has been to
regions that are not cache line aligned, and where the DMA region
overlaps with data that has recent accesses made to it.
It says: "Warnings: Memory coherency operates at a granularity called the
cache line width. In order for memory mapped by this API to operate
correctly, the mapped region must begin exactly on a cache line
boundary and end exactly on one." That doesn't sound like a recommendation
to me. "should" usually implies a recommendation while "must" indicates a
hard requirement.
I believe e.g. MIPS will align the address by masking the lower bits off,
without any flushes. Wouldn't be surprised if other architectures do the same.
Post by Russell King - ARM Linux
The situation is improving (in that DMA buffers are being allocated
separately, rather than being part of some other structure) but that
doesn't mean that it's safe to assume that overlapping cache lines can
be invalidated.
In any case, DMA with devm allocated buffers really is not a good idea.
I very much agree with that part.
I understand now that devm_k*alloc() is not good for DMA.

Now I am working on improving drivers/mtd/nand/denali.c,
and I'd like to make the situation better.


I have thought of three options.

Is any of them recommended, or should any of them be avoided?


(a) Change devm_kmalloc() to return an address aligned to ARCH_DMA_MINALIGN.



(b) Add the alignment in your driver if you still want to use devm_*:

	denali->buf = devm_kmalloc(dev, bufsize + ARCH_DMA_MINALIGN,
				   GFP_KERNEL | GFP_DMA);
	if (!denali->buf) {
		(error handling)
	}

	denali->buf = PTR_ALIGN(denali->buf, ARCH_DMA_MINALIGN);



(c) Use kmalloc() and kfree(). (Be careful about memory leaks.)




Thanks!
--
Best Regards
Masahiro Yamada
Russell King - ARM Linux
2017-03-09 09:20:42 UTC
Post by Masahiro Yamada
(c) Use kmalloc() and kfree(). (be careful for memory leak)
This is quite simple. For the first one, it doesn't seem that it's
DMA'd into, so there's no need to use GFP_DMA.

-	/* allocate a temporary buffer for nand_scan_ident() */
-	denali->buf.buf = devm_kzalloc(denali->dev, PAGE_SIZE,
-				       GFP_DMA | GFP_KERNEL);
-	if (!denali->buf.buf)
-		return -ENOMEM;

...

+	denali->buf.buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!denali->buf.buf)
+		return -ENOMEM;
+
 	/*
 	 * scan for NAND devices attached to the controller
 	 * this is the first stage in a two step process to register
 	 * with the nand subsystem
 	 */
 	ret = nand_scan_ident(mtd, denali->max_banks, NULL);
+	kfree(denali->buf.buf);
+
 	if (ret)
 		goto failed_req_irq;

-	/* allocate the right size buffer now */
-	devm_kfree(denali->dev, denali->buf.buf);

For the second one, I think the first thing to do is to move the
dma_set_mask() to the very beginning of the probe function - if that
fails, then we can't use DMA, and it's not something that requires
any cleanup.

With that gone, convert the other devm_kzalloc() there for buf.buf to
kzalloc(), and ensure that it's appropriately freed. Note that this
driver is _already_ buggy in that if:

	} else if (mtd->oobsize < (denali->bbtskipbytes +
		   ECC_8BITS * (mtd->writesize /
		   ECC_SECTOR_SIZE))) {
		pr_err("Your NAND chip OOB is not large enough to contain 8bit ECC correction codes");
		goto failed_req_irq;

fails, or these:

	ret = nand_scan_tail(mtd);
	if (ret)
		goto failed_req_irq;

	ret = mtd_device_register(mtd, NULL, 0);
	if (ret) {
		dev_err(denali->dev, "Failed to register MTD: %d\n", ret);
		goto failed_req_irq;
	}

it doesn't unmap the buffer. So, the driver is already broken.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Robin Murphy
2017-03-09 11:19:56 UTC
Post by Masahiro Yamada
Hi Robin,
Post by Robin Murphy
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/
I just thought of traverse of the linked list of devres_node
on a different thread, but it should not happen.
Please forget my stupid question.
On the contrary, my apologies for overlooking the subtlety here and
speaking too soon. It's not strictly the alignment of the allocation
that's at issue, it's more that the streaming DMA notion of "ownership"
of that particular allocation can be unwittingly violated by other agents.

Another thread traversing the list isn't actually a problem, as it's
effectively no different from the cache itself doing a speculative
prefetch, which we have to accommodate anyway. However, there could in
theory be a potential data loss if the devres node is *modified* (e.g.
an adjacent entry is added or removed) whilst the buffer is mapped for
DMA_FROM_DEVICE.

Now, as Russell says, doing streaming DMA on a managed allocation tied
to the device's lifetime doesn't make a great deal of sense, and if
you're allocating or freeing device-lifetime resources *while DMA is
active* you're almost certainly doing something wrong, so thankfully I
don't think we should need to worry about this in practice.

Looking at the Denali driver specifically, it does indeed look a bit
bogus. At least it's not doing DMA with a uint8-aligned array in the
middle of a live structure as it apparently once did, but in terms of
how it's actually using that buffer I think something along the lines of
the below would be appropriate (if anyone wants to try spinning a proper
patch out of it, feel free - I have no way of testing and am not
familiar with MTD in general).

Robin.

----->8-----
diff --git a/drivers/mtd/nand/denali.c b/drivers/mtd/nand/denali.c
index 73b9d4e2dca0..b2b4650a3923 100644
--- a/drivers/mtd/nand/denali.c
+++ b/drivers/mtd/nand/denali.c
@@ -1046,8 +1046,6 @@ static int write_page(struct mtd_info *mtd, struct nand_chip *chip,
 			mtd->oobsize);
 	}
 
-	dma_sync_single_for_device(denali->dev, addr, size, DMA_TO_DEVICE);
-
 	clear_interrupts(denali);
 	denali_enable_dma(denali, true);
 
@@ -1063,7 +1061,6 @@ static int write_page(struct mtd_info *mtd, struct nand_chip *chip,
 	}
 
 	denali_enable_dma(denali, false);
-	dma_sync_single_for_cpu(denali->dev, addr, size, DMA_TO_DEVICE);
 
 	return 0;
 }
@@ -1139,7 +1136,6 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	setup_ecc_for_xfer(denali, true, false);
 
 	denali_enable_dma(denali, true);
-	dma_sync_single_for_device(denali->dev, addr, size, DMA_FROM_DEVICE);
 
 	clear_interrupts(denali);
 	denali_setup_dma(denali, DENALI_READ);
@@ -1147,8 +1143,6 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	/* wait for operation to complete */
 	irq_status = wait_for_irq(denali, irq_mask);
 
-	dma_sync_single_for_cpu(denali->dev, addr, size, DMA_FROM_DEVICE);
-
 	memcpy(buf, denali->buf.buf, mtd->writesize);
 
 	check_erased_page = handle_ecc(denali, buf, irq_status, &max_bitflips);
@@ -1186,16 +1180,12 @@ static int denali_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip,
 	setup_ecc_for_xfer(denali, false, true);
 	denali_enable_dma(denali, true);
 
-	dma_sync_single_for_device(denali->dev, addr, size, DMA_FROM_DEVICE);
-
 	clear_interrupts(denali);
 	denali_setup_dma(denali, DENALI_READ);
 
 	/* wait for operation to complete */
 	wait_for_irq(denali, irq_mask);
 
-	dma_sync_single_for_cpu(denali->dev, addr, size, DMA_FROM_DEVICE);
-
 	denali_enable_dma(denali, false);
 
 	memcpy(buf, denali->buf.buf, mtd->writesize);
@@ -1466,16 +1456,6 @@ int denali_init(struct denali_nand_info *denali)
 	if (ret)
 		goto failed_req_irq;
 
-	/* allocate the right size buffer now */
-	devm_kfree(denali->dev, denali->buf.buf);
-	denali->buf.buf = devm_kzalloc(denali->dev,
-				       mtd->writesize + mtd->oobsize,
-				       GFP_KERNEL);
-	if (!denali->buf.buf) {
-		ret = -ENOMEM;
-		goto failed_req_irq;
-	}
-
 	/* Is 32-bit DMA supported? */
 	ret = dma_set_mask(denali->dev, DMA_BIT_MASK(32));
 	if (ret) {
@@ -1483,12 +1463,14 @@ int denali_init(struct denali_nand_info *denali)
 		goto failed_req_irq;
 	}
 
-	denali->buf.dma_buf = dma_map_single(denali->dev, denali->buf.buf,
-					     mtd->writesize + mtd->oobsize,
-					     DMA_BIDIRECTIONAL);
-	if (dma_mapping_error(denali->dev, denali->buf.dma_buf)) {
-		dev_err(denali->dev, "Failed to map DMA buffer\n");
-		ret = -EIO;
+	/* allocate the right size buffer now */
+	devm_kfree(denali->dev, denali->buf.buf);
+	denali->buf.buf = dmam_alloc_coherent(denali->dev,
+					      mtd->writesize + mtd->oobsize,
+					      &denali->buf.dma_buf,
+					      GFP_KERNEL);
+	if (!denali->buf.buf) {
+		ret = -ENOMEM;
 		goto failed_req_irq;
 	}
 
@@ -1598,7 +1580,5 @@ void denali_remove(struct denali_nand_info *denali)
 
 	nand_release(mtd);
 	denali_irq_cleanup(denali->irq, denali);
-	dma_unmap_single(denali->dev, denali->buf.dma_buf, bufsize,
-			 DMA_BIDIRECTIONAL);
 }
 EXPORT_SYMBOL(denali_remove);
