Discussion:
[Question] devm_kmalloc() for DMA?
Masahiro Yamada
2017-03-08 10:59:48 UTC
Hi experts,

I have a question about
how to allocate a DMA-safe buffer.


In my understanding, kmalloc() returns memory with DMA-safe
alignment in order to avoid cache-line sharing problems when the
memory is used for DMA.

The alignment is decided by ARCH_DMA_MINALIGN. For example, on
modern ARM 32-bit boards, this value is typically 64, so memory
returned by kmalloc() has at least 64-byte alignment.


On the other hand, devm_kmalloc() does not return
sufficiently aligned memory.


On my board (ARM 32-bit), devm_kmalloc() returns
(an ARCH_DMA_MINALIGN-aligned address) + 0x10.



The reason for the 0x10 offset is obvious.

struct devres {
	struct devres_node node;
	/* -- 3 pointers */
	unsigned long long data[]; /* guarantee ull alignment */
};


The management data is located at the top of struct devres, and
devm_kmalloc() returns dr->data.

The "unsigned long long" member guarantees only 8-byte alignment of
the returned memory (hence the 0x10 offset after the three-pointer
header), but I think this may not be enough for DMA.

I noticed this while reading drivers/mtd/nand/denali.c.

The code looks as follows:


	denali->buf.buf = devm_kzalloc(denali->dev,
				       mtd->writesize + mtd->oobsize,
				       GFP_KERNEL);
	if (!denali->buf.buf) {
		ret = -ENOMEM;
		goto failed_req_irq;
	}

	/* Is 32-bit DMA supported? */
	ret = dma_set_mask(denali->dev, DMA_BIT_MASK(32));
	if (ret) {
		dev_err(denali->dev, "No usable DMA configuration\n");
		goto failed_req_irq;
	}

	denali->buf.dma_buf = dma_map_single(denali->dev, denali->buf.buf,
					     mtd->writesize + mtd->oobsize,
					     DMA_BIDIRECTIONAL);




The buffer is allocated with devm_kzalloc() and then
passed to dma_map_single().



Could this be a potential problem in general?

Is devm_kmalloc() not recommended
for buffers that will be DMA-mapped?


Any advice is appreciated.
--
Best Regards
Masahiro Yamada
Robin Murphy
2017-03-08 11:15:39 UTC
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/

If a particular device has a problem with:

	p = kmalloc(...);
	d = dma_map_single(p + 0x10, ...);
	do_something_with(d);

that's a separate issue altogether.

Robin.
Post by Masahiro Yamada
On my board (ARM 32bit), devm_kmalloc() returns
(ARCH_DMA_MINALIGN aligned address) + 0x10.
The reason of the offset 0x10 is obvious.
struct devres {
struct devres_node node;
/* -- 3 pointers */
unsigned long long data[]; /* guarantee ull alignment */
};
Management data is located at the top of struct devres.
Then, devm_kmalloc() returns dr->data.
The "unsigned long long" guarantees
the returned memory has 0x10 alignment,
but I think this may not be enough for DMA.
I noticed this when I was seeing drivers/mtd/nand/denali.c
denali->buf.buf = devm_kzalloc(denali->dev,
mtd->writesize + mtd->oobsize,
GFP_KERNEL);
if (!denali->buf.buf) {
ret = -ENOMEM;
goto failed_req_irq;
}
/* Is 32-bit DMA supported? */
ret = dma_set_mask(denali->dev, DMA_BIT_MASK(32));
if (ret) {
dev_err(denali->dev, "No usable DMA configuration\n");
goto failed_req_irq;
}
denali->buf.dma_buf = dma_map_single(denali->dev, denali->buf.buf,
mtd->writesize + mtd->oobsize,
DMA_BIDIRECTIONAL);
Memory buffer is allocated by devm_kzalloc(), then
passed to dma_map_single().
Could this be a potential problem in general?
Is devm_kmalloc() not recommended
for buffer that can be DMA-mapped?
Any advice is appreciated.
Masahiro Yamada
2017-03-08 18:06:34 UTC
Hi Robin,
Post by Robin Murphy
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/
I was just thinking of a traversal of the devres_node linked list
on a different thread, but that should not happen.

Please forget my stupid question.
Post by Robin Murphy
p = kmalloc(...);
d = dma_map_single(p + 0x10, ...);
do_something_with(d);
that's a separate issue altogether.
Right.
Each device has its own requirements for DMA alignment.

Thanks!
--
Best Regards
Masahiro Yamada
Lars-Peter Clausen
2017-03-08 19:48:31 UTC
Post by Masahiro Yamada
Hi Robin,
Post by Robin Murphy
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/
I just thought of traverse of the linked list of devres_node
on a different thread, but it should not happen.
When the DMA memory is mapped for reading from the device, the associated
cachelines are invalidated without writeback. There is no guarantee that the
changes made to the devres_node have made it to main memory yet, or is
there? So those updates could be lost and you'd get data structure corruption.
Russell King - ARM Linux
2017-03-08 19:59:01 UTC
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.

Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)

Updates that are made by the CPU on overlapping cache lines while the
memory is mapped for DMA may end up doing one of two things: either
overwriting the newly DMA'd data, or being lost altogether.

In the case of devm_* list manipulations, these should only ever happen
during device probe and tear down, and if DMA is active at those times,
the driver is seriously buggy.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Lars-Peter Clausen
2017-03-08 20:44:17 UTC
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
Russell King - ARM Linux
2017-03-08 21:19:00 UTC
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
There is no use of the term "undefined" in the document you refer to.

There is the recommendation that regions are cache line aligned, but
there is quite a bit of history in the kernel where DMA has been to
regions that are not cache line aligned, and where the DMA region
overlaps with data that has recent accesses made to it.

The situation is improving (in that DMA buffers are being allocated
separately, rather than being part of some other structure) but that
doesn't mean that it's safe to assume that overlapping cache lines can
be invalidated.

In any case, DMA with devm allocated buffers really is not a good idea.
The lifetime of devm allocated buffers is between their allocation and
the point that the driver is unbound (either via probe failure or
removal.) If that turns out to be shorter than DMA, then you end up
scribbling over freed memory.

Moreover, any memory that is dma_map'd must be dma_unmap'd before
being freed.

So, unless we're going to get devm_* stuff for DMA API and for ensuring
that DMA is disabled, I don't think using devm_k*alloc() with DMA is
really a good idea.

Take the fact that it returns memory that is not cache line aligned to
be a big clue that it shouldn't be used for this purpose.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Lars-Peter Clausen
2017-03-08 21:33:52 UTC
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
There is no use of the term "undefined" in the document you refer to.
There is the recommendation that regions are cache line aligned, but
there is quite a bit of history in the kernel where DMA has been to
regions that are not cache line aligned, and where the DMA region
overlaps with data that has recent accesses made to it.
It says: "Warnings: Memory coherency operates at a granularity called the
cache line width. In order for memory mapped by this API to operate
correctly, the mapped region must begin exactly on a cache line
boundary and end exactly on one." That doesn't sound like a recommendation
to me. "should" usually implies a recommendation while "must" indicates a
hard requirement.

I believe e.g. MIPS will align the address by masking the lower bits off,
without any flushes. I wouldn't be surprised if other architectures do the same.
Post by Russell King - ARM Linux
The situation is improving (in that DMA buffers are being allocated
separately, rather than being part of some other structure) but that
doesn't mean that it's safe to assume that overlapping cache lines can
be invalidated.
In any case, DMA with devm allocated buffers really is not a good idea.
I very much agree with that part.
Masahiro Yamada
2017-03-09 03:25:07 UTC
Hi Russell, Lars-Peter,

Thanks for your expert comments.
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
Post by Russell King - ARM Linux
Post by Lars-Peter Clausen
When the DMA memory is mapped for reading from the device the associated
cachelines are invalidated without writeback. There is no guarantee that
the changes made to the devres_node have made it to main memory yet, or
is there?
That is incorrect.
Overlapping cache lines are always written back on transitions from CPU
to device ownership of the buffer (eg, dma_map_*().)
On ARM. But my understanding is that this is not a universal requirement
according to DMA-API.txt. It says that mappings must be cache line aligned
and otherwise behavior is undefined.
There is no use of the term "undefined" in the document you refer to.
There is the recommendation that regions are cache line aligned, but
there is quite a bit of history in the kernel where DMA has been to
regions that are not cache line aligned, and where the DMA region
overlaps with data that has recent accesses made to it.
It says: "Warnings: Memory coherency operates at a granularity called the
cache line width. In order for memory mapped by this API to operate
correctly, the mapped region must begin exactly on a cache line
boundary and end exactly on one." That doesn't sound like a recommendation
to me. "should" usually implies a recommendation while "must" indicates a
hard requirement.
I believe e.g. MIPS will align the address by masking the lower bits off,
without any flushes. Wouldn't be surprised if other architectures do the same.
Post by Russell King - ARM Linux
The situation is improving (in that DMA buffers are being allocated
separately, rather than being part of some other structure) but that
doesn't mean that it's safe to assume that overlapping cache lines can
be invalidated.
In any case, DMA with devm allocated buffers really is not a good idea.
I very much agree with that part.
I understand now that devm_k*alloc() is not good for DMA.

Now I am working on improving drivers/mtd/nand/denali.c,
and I'd like to make the situation better.


I have thought of three options.

Is any of them recommended, or should any of them be avoided?


(a) Change devm_kmalloc() to return an address aligned to ARCH_DMA_MINALIGN.



(b) Add the alignment in your driver if you still want to use devm_*:

	denali->buf = devm_kmalloc(dev, bufsize + ARCH_DMA_MINALIGN,
				   GFP_KERNEL | GFP_DMA);
	if (!denali->buf) {
		(error handling)
	}

	denali->buf = PTR_ALIGN(denali->buf, ARCH_DMA_MINALIGN);



(c) Use kmalloc() and kfree(). (Be careful about memory leaks.)




Thanks!
--
Best Regards
Masahiro Yamada
Russell King - ARM Linux
2017-03-09 09:20:42 UTC
Post by Masahiro Yamada
(c) Use kmalloc() and kfree(). (be careful for memory leak)
This is quite simple. For the first one, it doesn't seem that it's
DMA'd into, so there's no need to use GFP_DMA.

-	/* allocate a temporary buffer for nand_scan_ident() */
-	denali->buf.buf = devm_kzalloc(denali->dev, PAGE_SIZE,
-				       GFP_DMA | GFP_KERNEL);
-	if (!denali->buf.buf)
-		return -ENOMEM;

...

+	denali->buf.buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!denali->buf.buf)
+		return -ENOMEM;
+
 	/*
 	 * scan for NAND devices attached to the controller
 	 * this is the first stage in a two step process to register
 	 * with the nand subsystem
 	 */
 	ret = nand_scan_ident(mtd, denali->max_banks, NULL);
+	kfree(denali->buf.buf);
+
 	if (ret)
 		goto failed_req_irq;

-	/* allocate the right size buffer now */
-	devm_kfree(denali->dev, denali->buf.buf);

For the second one, I think the first thing to do is to move the
dma_set_mask() to the very beginning of the probe function - if that
fails, then we can't use DMA, and it's not something that requires
any cleanup.

With that gone, convert the other devm_kzalloc() there for buf.buf to
kzalloc(), and ensure that it's appropriately freed. Note that this
driver is _already_ buggy in that if:

	} else if (mtd->oobsize < (denali->bbtskipbytes +
		   ECC_8BITS * (mtd->writesize /
		   ECC_SECTOR_SIZE))) {
		pr_err("Your NAND chip OOB is not large enough to contain 8bit ECC correction codes");
		goto failed_req_irq;

fails, or these:

	ret = nand_scan_tail(mtd);
	if (ret)
		goto failed_req_irq;

	ret = mtd_device_register(mtd, NULL, 0);
	if (ret) {
		dev_err(denali->dev, "Failed to register MTD: %d\n", ret);
		goto failed_req_irq;
	}

it doesn't unmap the buffer. So, the driver is already broken.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
Robin Murphy
2017-03-09 11:19:56 UTC
Post by Masahiro Yamada
Hi Robin,
Post by Robin Murphy
Post by Masahiro Yamada
Hi experts,
I have a question about
how to allocate DMA-safe buffer.
In my understanding, kmalloc() returns
memory with DMA safe alignment
in order to avoid cache-sharing problem when used for DMA.
The alignment is decided by ARCH_DMA_MINALIGN.
For example, on modern ARM 32bit boards, this value is typically 64.
So, memory returned by kmalloc() has
at least 64 byte alignment.
On the other hand, devm_kmalloc() does not return
enough-aligned memory.
How so? If anything returned by kmalloc() is guaranteed to occupy some
multiple of ARCH_DMA_MINALIGN bytes in order to avoid two allocations
falling into the same cache line, I don't see how stealing the first 16
bytes *of a single allocation* could make it start sharing cache lines
with another? :/
I just thought of traverse of the linked list of devres_node
on a different thread, but it should not happen.
Please forget my stupid question.
On the contrary, my apologies for overlooking the subtlety here and
speaking too soon. It's not strictly the alignment of the allocation
that's at issue, it's more that the streaming DMA notion of "ownership"
of that particular allocation can be unwittingly violated by other agents.

Another thread traversing the list isn't actually a problem, as it's
effectively no different from the cache itself doing a speculative
prefetch, which we have to accommodate anyway. However, there could in
theory be a potential data loss if the devres node is *modified* (e.g.
an adjacent entry is added or removed) whilst the buffer is mapped for
DMA_FROM_DEVICE.

Now, as Russell says, doing streaming DMA on a managed allocation tied
to the device's lifetime doesn't make a great deal of sense, and if
you're allocating or freeing device-lifetime resources *while DMA is
active* you're almost certainly doing something wrong, so thankfully I
don't think we should need to worry about this in practice.

Looking at the Denali driver specifically, it does indeed look a bit
bogus. At least it's not doing DMA with a uint8-aligned array in the
middle of a live structure as it apparently once did, but in terms of
how it's actually using that buffer I think something along the lines of
the below would be appropriate (if anyone wants to try spinning a proper
patch out of it, feel free - I have no way of testing and am not
familiar with MTD in general).

Robin.

----->8-----
diff --git a/drivers/mtd/nand/denali.c b/drivers/mtd/nand/denali.c
index 73b9d4e2dca0..b2b4650a3923 100644
--- a/drivers/mtd/nand/denali.c
+++ b/drivers/mtd/nand/denali.c
@@ -1046,8 +1046,6 @@ static int write_page(struct mtd_info *mtd, struct nand_chip *chip,
 			mtd->oobsize);
 	}
 
-	dma_sync_single_for_device(denali->dev, addr, size, DMA_TO_DEVICE);
-
 	clear_interrupts(denali);
 	denali_enable_dma(denali, true);
 
@@ -1063,7 +1061,6 @@ static int write_page(struct mtd_info *mtd, struct nand_chip *chip,
 	}
 
 	denali_enable_dma(denali, false);
-	dma_sync_single_for_cpu(denali->dev, addr, size, DMA_TO_DEVICE);
 
 	return 0;
 }
@@ -1139,7 +1136,6 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	setup_ecc_for_xfer(denali, true, false);
 
 	denali_enable_dma(denali, true);
-	dma_sync_single_for_device(denali->dev, addr, size, DMA_FROM_DEVICE);
 
 	clear_interrupts(denali);
 	denali_setup_dma(denali, DENALI_READ);
@@ -1147,8 +1143,6 @@ static int denali_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	/* wait for operation to complete */
 	irq_status = wait_for_irq(denali, irq_mask);
 
-	dma_sync_single_for_cpu(denali->dev, addr, size, DMA_FROM_DEVICE);
-
 	memcpy(buf, denali->buf.buf, mtd->writesize);
 
 	check_erased_page = handle_ecc(denali, buf, irq_status, &max_bitflips);
@@ -1186,16 +1180,12 @@ static int denali_read_page_raw(struct mtd_info *mtd, struct nand_chip *chip,
 	setup_ecc_for_xfer(denali, false, true);
 	denali_enable_dma(denali, true);
 
-	dma_sync_single_for_device(denali->dev, addr, size, DMA_FROM_DEVICE);
-
 	clear_interrupts(denali);
 	denali_setup_dma(denali, DENALI_READ);
 
 	/* wait for operation to complete */
 	wait_for_irq(denali, irq_mask);
 
-	dma_sync_single_for_cpu(denali->dev, addr, size, DMA_FROM_DEVICE);
-
 	denali_enable_dma(denali, false);
 
 	memcpy(buf, denali->buf.buf, mtd->writesize);
@@ -1466,16 +1456,6 @@ int denali_init(struct denali_nand_info *denali)
 	if (ret)
 		goto failed_req_irq;
 
-	/* allocate the right size buffer now */
-	devm_kfree(denali->dev, denali->buf.buf);
-	denali->buf.buf = devm_kzalloc(denali->dev,
-				       mtd->writesize + mtd->oobsize,
-				       GFP_KERNEL);
-	if (!denali->buf.buf) {
-		ret = -ENOMEM;
-		goto failed_req_irq;
-	}
-
 	/* Is 32-bit DMA supported? */
 	ret = dma_set_mask(denali->dev, DMA_BIT_MASK(32));
 	if (ret) {
@@ -1483,12 +1463,14 @@ int denali_init(struct denali_nand_info *denali)
 		goto failed_req_irq;
 	}
 
-	denali->buf.dma_buf = dma_map_single(denali->dev, denali->buf.buf,
-					     mtd->writesize + mtd->oobsize,
-					     DMA_BIDIRECTIONAL);
-	if (dma_mapping_error(denali->dev, denali->buf.dma_buf)) {
-		dev_err(denali->dev, "Failed to map DMA buffer\n");
-		ret = -EIO;
+	/* allocate the right size buffer now */
+	devm_kfree(denali->dev, denali->buf.buf);
+	denali->buf.buf = dmam_alloc_coherent(denali->dev,
+					      mtd->writesize + mtd->oobsize,
+					      &denali->buf.dma_buf,
+					      GFP_KERNEL);
+	if (!denali->buf.buf) {
+		ret = -ENOMEM;
 		goto failed_req_irq;
 	}
 
@@ -1598,7 +1580,5 @@ void denali_remove(struct denali_nand_info *denali)
 
 	nand_release(mtd);
 	denali_irq_cleanup(denali->irq, denali);
-	dma_unmap_single(denali->dev, denali->buf.dma_buf, bufsize,
-			 DMA_BIDIRECTIONAL);
 }
 EXPORT_SYMBOL(denali_remove);
