Discussion:
kmalloc memory slower than malloc
Thommy Jakobsson
2013-09-06 07:48:02 UTC
Hi,

doing a project where I use DMA and a DMA-capable buffer in a driver. This
buffer is then mmap()ed to userspace, and the driver notifies userspace
when the device has filled the buffer. Pretty standard setup, I think.

The initial problem was that I noticed that the buffer I got through
dma_alloc_coherent was very slow to step through in my userspace program.
I figured it was because the allocated memory has to be coherent (my hw
doesn't have cache coherence for DMA), so I probably got memory with the
cache turned off. So I switched to kmalloc and dma_map_single; the plan
was to get more speed by doing cache invalidations instead.

After switching to kmalloc in the driver I still got lousy performance,
though. I ran the test driver and program below on a
Marvell Kirkwood 88F6281 (ARM9E, ARMv5TE) and an i.MX6 (Cortex-A9 MP, ARMv7)
with similar results. The test program loops through a 4k buffer
10000 times, just adding all the bytes and measuring how long it takes.
On the Kirkwood I get the following printout:

pa_dmabuf = 0x195d8000
va_dmabuf = 0x401e4000
pa_kmbuf = 0x19418000
va_kmbuf = 0x4031c000
dma_alloc_coherent 3037365us
kmalloc 3039321us
malloc 823403us

As you can see, the kmalloc buffer is ~3-4 times slower to step through
than a normal malloc one. The addresses at the beginning are just printouts
of where the buffers end up, both physical and virtual (in userspace)
addresses.

I would have expected the kmalloc buffer to have roughly the same speed as
a malloc one. Any ideas what I am doing wrong? Or are the assumptions
wrong?


BR,
Thommy

relevant driver part:
------------------------------------------------------------------
static long device_ioctl(struct file *file,
			 unsigned int cmd, unsigned long arg)
{
	dma_addr_t pa = 0;

	printk("entering ioctl cmd %d\r\n", cmd);
	switch (cmd) {
	case DMAMEM:
		/* coherent buffer: pa receives the bus address for the device */
		va_dmabuf = dma_alloc_coherent(0, BUFSIZE, &pa, GFP_KERNEL | GFP_DMA);
		pa_dmabuf = pa;
		break;
	case KMEM:
		/* cached buffer: take the physical address directly for the test */
		va_kmbuf = kmalloc(BUFSIZE, GFP_KERNEL);
		//pa = dma_map_single(0, va_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		pa = __pa(va_kmbuf);
		pa_kmbuf = pa;
		break;
	case DMAMEM_REL:
		dma_free_coherent(0, BUFSIZE, va_dmabuf, pa_dmabuf);
		break;
	case KMEM_REL:
		kfree(va_kmbuf);
		break;
	default:
		break;
	}

	printk("allocated pa = 0x%08X\r\n", pa);

	/* hand the physical address back to userspace (test hack only) */
	if (copy_to_user((void *)arg, &pa, sizeof(pa)))
		return -EFAULT;
	return 0;
}

static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size;
	int res = 0;

	size = vma->vm_end - vma->vm_start;
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	/* userspace passes the physical address as the mmap() offset,
	 * so vm_pgoff already holds the pfn to map */
	if (remap_pfn_range(vma, vma->vm_start,
			    vma->vm_pgoff, size, vma->vm_page_prot)) {
		res = -ENOBUFS;
		goto device_mmap_exit;
	}

	vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */

device_mmap_exit:
	return res;
}


relevant parts of userspace program
-----------------------------------------------------------------

/*
 * alloc memory with dma_alloc_coherent
 */
ioctl(fd, DMAMEM, &pa_dmabuf);
if (pa_dmabuf == 0) {
	printf("no dma pa returned\r\n");
	goto exito;
} else {
	printf("pa_dmabuf = %p\r\n", (void *)pa_dmabuf);
}

/* the physical address is passed as the mmap() offset (see driver) */
va_dmabuf = mmap(NULL, BUFSIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fd, pa_dmabuf);
if (va_dmabuf == NULL || va_dmabuf == MAP_FAILED) {
	perror("no valid va for dmabuf");
	goto exito;
} else {
	printf("va_dmabuf = %p\r\n", va_dmabuf);
}

/*
 * alloc memory with kmalloc
 */
ioctl(fd, KMEM, &pa_kmbuf);
if (pa_kmbuf == 0) {
	printf("no kmalloc pa returned\r\n");
	goto exito;
} else {
	printf("pa_kmbuf = %p\r\n", (void *)pa_kmbuf);
}

va_kmbuf = mmap(NULL, BUFSIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		fd, pa_kmbuf);
if (va_kmbuf == NULL || va_kmbuf == MAP_FAILED) {
	perror("no valid va for kmbuf");
	goto exito;
} else {
	printf("va_kmbuf = %p\r\n", va_kmbuf);
}


/*
 * test speed of dma_alloc_coherent buffer
 */
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		va_dmabuf[i]++;
}
gettimeofday(&t2, NULL);
printf("dma_alloc_coherent %ldus\n",
       (t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec));

/*
 * test speed of kmalloc buffer
 */
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		va_kmbuf[i]++;
}
gettimeofday(&t2, NULL);
printf("kmalloc %ldus\n",
       (t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec));

/*
 * test speed of malloc
 */
va_mbuf = malloc(BUFSIZE);

gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		va_mbuf[i]++;
}
gettimeofday(&t2, NULL);
printf("malloc %ldus\n",
       (t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec));
Russell King - ARM Linux
2013-09-06 08:07:05 UTC
Post by Thommy Jakobsson
Hi,
doing a project where I use DMA and a DMA-capable buffer in a driver. This
buffer is then mmap()ed to userspace, and the driver notifies userspace
when the device has filled the buffer. Pretty standard setup, I think.
Your driver appears to be exposing physical addresses to userspace.
This is a no-go. This is a massive security hole - it allows userspace
to map any physical address and write into that memory. That includes
system flash and all system RAM.

This gives userspace a way to overwrite the kernel with exploits,
retrieve sensitive and/or personal data, etc.

Therefore, I will not provide any assistance with this. Please change
your approach so you do not need physical addresses in userspace.

I know that some closed source libraries, particularly GPU and video
decode libraries like to take this approach. Everyone should be aware
that such approaches bypass all system security, especially if the GPU
or video device is accessible to any userspace process.

In your case, your device driver's special device node just has to be
accessible to any userspace process for this to be exploitable.
Thommy Jakobsson
2013-09-06 09:04:40 UTC
Post by Russell King - ARM Linux
Your driver appears to be exposing physical addresses to userspace.
This is a no-go. This is a massive security hole - it allows userspace
to map any physical address and write into that memory. That includes
system flash and all system RAM.
Sorry Russell, maybe I was unclear; the attached test was just a quick
hack to be able to compare kmalloc and dma_alloc_coherent with malloc. This
is not code that is part of the actual driver.
Post by Russell King - ARM Linux
This gives userspace a way to overwrite the kernel with exploits,
retrieve sensitive and/or personal data, etc.
Therefore, I will not provide any assistance with this. Please change
your approach so you do not need physical addresses in userspace.
I see your point, but as I said I do not do that in my driver. I do map
the DMA buffer into userspace though, so I expose a virtual mapping to
userspace. I have understood that to be a "normal" approach for speeding
things up, or do you consider that to be wrong as well?

thanks,
Thommy
Lucas Stach
2013-09-06 09:12:58 UTC
Hi Thommy,
Post by Thommy Jakobsson
[...]
static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned long size;
int res = 0;
size = vma->vm_end - vma->vm_start;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
This is the relevant part where you are mapping things uncached into
userspace, so no wonder it is slower than cached malloc memory. If you
want to use cached userspace mappings you need bracketed MMAP access,
where you tell the kernel by using an ioctl or something that userspace
is accessing the mapping so it can flush/invalidate caches at the right
points in time.

Before doing so read up on how conflicting page mappings can lead to
undefined behavior on ARMv7 systems and consider the consequences
carefully. If you aren't sure you understood the problem fully and know
how to mitigate the problems, back out and live with an uncached or
writecombined mapping.
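
A rough sketch of such bracketing, assuming a streaming mapping set up
with dma_map_single() (the BUF_SYNC_CPU/BUF_SYNC_DEV ioctl numbers and
the dev pointer are made up for illustration):

	case BUF_SYNC_CPU:
		/* device is done writing: invalidate stale cache lines
		 * so the CPU sees the DMA'd data */
		dma_sync_single_for_cpu(dev, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		break;
	case BUF_SYNC_DEV:
		/* CPU is done reading: hand the buffer back to the device */
		dma_sync_single_for_device(dev, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		break;

Userspace would then bracket each access to the cached mapping:

	ioctl(fd, BUF_SYNC_CPU, NULL);  /* before reading the mmap()ed buffer */
	/* ... read the buffer ... */
	ioctl(fd, BUF_SYNC_DEV, NULL);  /* before the device writes again */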
Post by Thommy Jakobsson
if (remap_pfn_range(vma, vma->vm_start,
vma->vm_pgoff, size, vma->vm_page_prot)) {
res = -ENOBUFS;
goto device_mmap_exit;
}
vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */
return res;
}
[...]

Regards,
Lucas
--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Thommy Jakobsson
2013-09-06 09:36:07 UTC
Post by Lucas Stach
Post by Thommy Jakobsson
static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned long size;
int res = 0;
size = vma->vm_end - vma->vm_start;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
This is the relevant part where you are mapping things uncached into
userspace, so no wonder it is slower than cached malloc memory. If you
want to use cached userspace mappings you need bracketed MMAP access,
where you tell the kernel by using an ioctl or something that userspace
is accessing the mapping so it can flush/invalidate caches at the right
points in time.
Well, that explains it. You'd think that calling a function named
"noncached" would have been a tell, but apparently not =). Thanks Lucas for
spotting that. I should not copy and paste so much, I guess.
Post by Lucas Stach
Before doing so read up on how conflicting page mappings can lead to
undefined behavior on ARMv7 systems and consider the consequences
carefully. If you aren't sure you understood the problem fully and know
how to mitigate the problems, back out and live with an uncached or
writecombined mapping.
I have read up a bit on it, but isn't it the case that I have conflicting
page mappings now in my test, since the kernel is accessing the buffer as
cached and userspace as noncached? If I used cached mappings in both places,
I assume I wouldn't have conflicting page mappings?

Thanks,
Thommy
Thommy Jakobsson
2013-09-10 09:54:03 UTC
Post by Lucas Stach
This is the relevant part where you are mapping things uncached into
userspace, so no wonder it is slower than cached malloc memory. If you
want to use cached userspace mappings you need bracketed MMAP access,
where you tell the kernel by using an ioctl or something that userspace
is accessing the mapping so it can flush/invalidate caches at the right
points in time.
Removing the pgprot_noncached() makes things more like what I expected.
Both buffers take about the same time to traverse in userspace. Thanks.

I changed the code in my test program and driver to do the same thing in
kernelspace as well, and now I don't understand the result. Stepping
through and adding all bytes in a page-sized buffer is about 4-5 times
faster in the kernel. These are the times for looping through the buffer
10000 times on an i.MX6:
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=86700000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=86700000)
malloc in userspace 0.566s (s=0)

The 's' inside the parentheses is the resulting sum; see below for the
actual code. I've read that the L2 cache controller (PL310) in the i.MX6
has speculative read, so I assume it is a performance advantage to have the
memory physically contiguous (like kmalloc). But that should be the same
after I have mapped it to userspace as well, right? There is no other load
on the target during the test run.

I don't really understand the different pgprot flags (some are obvious,
like L_PTE_MT_UNCACHED of course), so maybe I still have some errors in my
mmap. Can someone point me in the right direction, or does anyone have
ideas why it is so much faster in the kernel?

Thanks,
Thommy

code from testdriver:
--------------------
static long device_ioctl(struct file *file,
			 unsigned int cmd, unsigned long arg)
{
	dma_addr_t pa = 0;
	int i, j;
	unsigned long s = 0;

	printk("entering ioctl cmd %d\r\n", cmd);
	switch (cmd) {
	case DMAMEM:
		va_dmabuf = dma_alloc_coherent(0, BUFSIZE, &pa, GFP_KERNEL | GFP_DMA);
		//memset(va_dmabuf, 0, BUFSIZE);
		//va_dmabuf[15] = 23;
		pa_dmabuf = pa;
		printk("kernel va_dmabuf: 0x%p, pa_dmabuf 0x%08X\r\n", va_dmabuf, pa_dmabuf);
		break;
	case DMAMEM_TEST:
		for (j = 0; j < LOOPCNT; j++) {
			for (i = 0; i < BUFSIZE; i++)
				s += va_dmabuf[i];
		}
		break;
	case KMEM:
		va_kmbuf = kmalloc(BUFSIZE, GFP_KERNEL);
		//pa = virt_to_phys(va_kmbuf);
		//pa = __pa(va_kmbuf);
		pa = dma_map_single(0, va_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		pa_kmbuf = pa;
		dma_sync_single_for_cpu(0, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
		//memset(va_kmbuf, 0, BUFSIZE);
		//va_kmbuf[10] = 11;
		printk("kernel va_kmbuf: 0x%p, pa_kmbuf 0x%08X\r\n", va_kmbuf, pa_kmbuf);
		break;
	case KMEM_TEST:
		for (j = 0; j < LOOPCNT; j++) {
			for (i = 0; i < BUFSIZE; i++)
				s += va_kmbuf[i];
		}
		break;
	case DMAMEM_REL:
		dma_free_coherent(0, BUFSIZE, va_dmabuf, pa_dmabuf);
		va_dmabuf = 0;
		break;
	case KMEM_REL:
		kfree(va_kmbuf);
		va_kmbuf = 0;
		break;
	default:
		break;
	}

	if (cmd == DMAMEM_TEST || cmd == KMEM_TEST) {
		if (copy_to_user((void *)arg, &s, sizeof(s)))
			return -EFAULT;
	} else {
		pa_currentbuf = pa;
		if (copy_to_user((void *)arg, &pa, sizeof(pa)))
			return -EFAULT;
	}
	return 0;
}

static int device_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size;
	int res = 0;

	size = vma->vm_end - vma->vm_start;
	//vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	//vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
	//vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_WRITEBACK);
	//vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_DEV_CACHED);
	//vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_WRITETHROUGH);

	if (remap_pfn_range(vma, vma->vm_start,
			    pa_currentbuf >> PAGE_SHIFT, size, vma->vm_page_prot)) {
		res = -ENOBUFS;
		goto device_mmap_exit;
	}

	vma->vm_flags &= ~VM_IO; /* using shared anonymous pages */

device_mmap_exit:
	return res;
}


code from testapplication:
-------------------------
/*
 * test speed of dma_alloc_coherent buffer in kernel
 */
gettimeofday(&t1, NULL);
ioctl(fd, DMAMEM_TEST, &s);
gettimeofday(&t2, NULL);
printf("dma_alloc_coherent in kernel %.3fs (s=%lu)\n",
       ((t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec)) / 1000000.0, s);

/*
 * test speed of kmalloc buffer in kernel
 */
gettimeofday(&t1, NULL);
ioctl(fd, KMEM_TEST, &s);
gettimeofday(&t2, NULL);
printf("kmalloc in kernel %.3fs (s=%lu)\n",
       ((t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec)) / 1000000.0, s);

/*
 * test speed of dma_alloc_coherent buffer
 */
s = 0;
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		s += va_dmabuf[i];
}
gettimeofday(&t2, NULL);
printf("dma_alloc_coherent userspace %.3fs (s=%lu)\n",
       ((t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec)) / 1000000.0, s);

/*
 * test speed of kmalloc buffer
 */
s = 0;
gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		s += va_kmbuf[i];
}
gettimeofday(&t2, NULL);
printf("kmalloc in userspace %.3fs (s=%lu)\n",
       ((t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec)) / 1000000.0, s);

/*
 * test speed of malloc
 */
s = 0;
va_mbuf = malloc(BUFSIZE);

gettimeofday(&t1, NULL);
for (j = 0; j < LOOPCNT; j++) {
	for (i = 0; i < BUFSIZE; i++)
		s += va_mbuf[i];
}
gettimeofday(&t2, NULL);
printf("malloc in userspace %.3fs (s=%lu)\n",
       ((t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec)) / 1000000.0, s);
Lucas Stach
2013-09-10 10:10:16 UTC
Post by Thommy Jakobsson
Post by Lucas Stach
This is the relevant part where you are mapping things uncached into
userspace, so no wonder it is slower than cached malloc memory. If you
want to use cached userspace mappings you need bracketed MMAP access,
where you tell the kernel by using an ioctl or something that userspace
is accessing the mapping so it can flush/invalidate caches at the right
points in time.
Removing the pgprot_noncached() makes things more like what I expected.
Both buffers take about the same time to traverse in userspace. Thanks.
I changed the code in my test program and driver to do the same thing in
kernelspace as well, and now I don't understand the result. Stepping
through and adding all bytes in a page-sized buffer is about 4-5 times
faster in the kernel. These are the times for looping through the buffer:
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=86700000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=86700000)
malloc in userspace 0.566s (s=0)
How do you init the kmalloc memory? If you do a memset right before the
test loop, your "kmalloc in kernel" will most likely always hit the L1
cache; that's why it's so fast.

The userspace mapping of the kmalloc memory will get a different virtual
address than the kernel mapping. So if you do a memset in kernelspace,
but the test loop in userspace you'll always miss the cache as the ARM
v7 caches are virtually indexed. So the processor always fetches data
from memory. The performance advantage against an uncached mapping is
entirely due to the fact that you are fetching whole cache lines
(32bytes) from memory at once, instead of doing a memory/bus transaction
per byte.

Regards,
Lucas
--
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Duan Fugang-B38611
2013-09-10 10:42:44 UTC
From: linux-arm-kernel [mailto:linux-arm-kernel-***@lists.infradead.org] On Behalf Of Lucas Stach
Date: Tuesday, September 10, 2013 6:10 PM
To: Thommy Jakobsson
Subject: Re: kmalloc memory slower than malloc
[...]
About the diff:
dma_alloc_coherent in kernel 4.256s (s=0)
dma_alloc_coherent userspace 0.566s (s=0)

I think remap_pfn_range() is called with the page attributes
(vma->vm_page_prot) passed in from mmap(), which may be cacheable.
So the performance is the same as malloc/kmalloc in userspace.

Regards,
Andy
Thommy Jakobsson
2013-09-10 11:28:50 UTC
Post by Thommy Jakobsson
dma_alloc_coherent in kernel 4.256s (s=0)
dma_alloc_coherent userspace 0.566s (s=0)
I think it call remap_pfn_range() with page attribute (vma->vm_page_prot) transferred from mmap() maybe cacheable.
So the performance is the same as malloc/kmalloc in userspace.
That's probably true, or at least that is how I explained it to myself in
my head =)

Thanks,
Thommy
Duan Fugang-B38611
2013-09-10 11:36:34 UTC
From: Thommy Jakobsson [mailto:***@gmail.com]
Date: Tuesday, September 10, 2013 7:29 PM
To: Duan Fugang-B38611
Subject: RE: kmalloc memory slower than malloc
Post by Thommy Jakobsson
dma_alloc_coherent in kernel 4.256s (s=0)
dma_alloc_coherent userspace 0.566s (s=0)
I think it call remap_pfn_range() with page attribute (vma->vm_page_prot)
transferred from mmap() maybe cacheable.
Post by Thommy Jakobsson
So the performance is the same as malloc/kmalloc in userspace.
Thats probably true, or at least that is how I explained it to myself in
my head =)
Thanks,
Thommy
Can you add the code below to your device_mmap() and test the performance
for the above two cases:
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

I think the performance will then be the same.

Regards,
Andy
Russell King - ARM Linux
2013-09-10 11:44:20 UTC
Post by Duan Fugang-B38611
[...]
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
No, that does not match the page table settings that dma_mmap_coherent()
would use. That gets you strongly ordered memory, which will be
(a) a violation of the ARM architecture requirements, being a different
"memory type", and (b) a different mapping type compared to
that used by the virtual address returned from dma_alloc_coherent().

The appropriate modification here would be pgprot_dmacoherent().
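
A minimal sketch using the DMA API's own mmap helper, assuming the driver
has a real struct device *dev and kept va_dmabuf/pa_dmabuf from
dma_alloc_coherent():

	static int device_mmap(struct file *filp, struct vm_area_struct *vma)
	{
		/* picks a matching pgprot (pgprot_dmacoherent() on ARMv6+)
		 * and remaps the coherent buffer into the vma */
		return dma_mmap_coherent(dev, vma, va_dmabuf, pa_dmabuf,
					 vma->vm_end - vma->vm_start);
	}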
Thommy Jakobsson
2013-09-10 12:42:17 UTC
Post by Russell King - ARM Linux
[...]
No, that does not match the page table settings that dma_mmap_coherent()
would use. That gets you strongly ordered memory, which will be
(a) a violation of the ARM architecture requirements, being a different
"memory type", and (b) a different mapping type compared to
that used by the virtual address returned from dma_alloc_coherent().
The appropriate modification here would be pgprot_dmacoherent().
Using pgprot_dmacoherent() in mmap, the numbers look more similar. Still
a ~10-15% difference, but maybe that is normal for kernel vs. userspace.

dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=81370000)
dma_alloc_coherent userspace 4.907s (s=0)
kmalloc in userspace 1.815s (s=81370000)
malloc in userspace 0.566s (s=0)

Note that I was lazy and used the same pgprot for all mappings now, which
I guess is a violation.
//thommy
Russell King - ARM Linux
2013-09-10 12:50:48 UTC
Post by Thommy Jakobsson
Using pgprot_dmacoherent() in mmap, the numbers look more similar. Still
a ~10-15% difference, but maybe that is normal for kernel vs. userspace.
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=81370000)
dma_alloc_coherent userspace 4.907s (s=0)
kmalloc in userspace 1.815s (s=81370000)
malloc in userspace 0.566s (s=0)
Note that I was lazy and used the same pgprot for all mappings now, which
I guess is a violation.
What it means is that the results you end up with are documented to be
"unpredictable" which gives scope to manufacturers to come up with any
behaviour they desire in that situation - and it doesn't have to be
consistent.

What that means is that if you have an area of physical memory mapped as
"normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
it is entirely legal for an access via the strongly ordered mapping to
hit the cache if a cache line exists, whereas another implementation
may miss the cache line if it exists.

Furthermore, with such mappings (and this has been true since ARMv3 days)
if you have two such mappings - one cacheable and one non-cacheable, and
the cacheable mapping has dirty cache lines, the dirty cache lines can be
evicted at any moment, overwriting whatever you're doing via the non-
cacheable mapping.

I've recently had a hard-to-track bug doing exactly that in a non-mainline
kernel on ARMv7 because someone decided it was a good idea to bypass my
test in arch/arm/mm/ioremap.c preventing system RAM being ioremap()d. It
led to one boot in 20-ish locking up because a GPU command stream was
being overwritten by the dirty cache lines being evicted after the GPU
had started to read from that memory - or, if you typed "reboot" at the
right moment during a previous boot, you could get it to occur 100% of
the time.

I notice you turn off VM_IO - you don't want to do that...
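
For reference, remap_pfn_range() itself marks the vma with VM_IO, so the
fix is simply to drop the line that clears it:

	/* vma->vm_flags &= ~VM_IO; */	/* don't clear what remap_pfn_range() set */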
Thommy Jakobsson
2013-09-12 15:58:22 UTC
Post by Russell King - ARM Linux
[...]
What that means is that if you have an area of physical memory mapped as
"normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
it is entirely legal for an access via the strongly ordered mapping to
hit the cache if a cache line exists, whereas another implementation
may miss the cache line if it exists.
Furthermore, with such mappings (and this has been true since ARMv3 days)
if you have two such mappings - one cacheable and one non-cacheable, and
the cacheable mapping has dirty cache lines, the dirty cache lines can be
evicted at any moment, overwriting whatever you're doing via the non-
cacheable mapping.
But isn't the memory received with dma_alloc_coherent() given a noncached
mapping, or even a strongly ordered one? Will that not conflict with the
normal kernel mapping, which is cached?

Are all the mappings documented somewhere, i.e. which Linux mapping
corresponds to which mapping in the MMU? It seems the ARMv7 documentation
isn't free either, which isn't making things easier for me.

Coming back to the original issue: disassembling the code, I noticed that
the userspace code looked really stupid, with a lot of unnecessary memory
accesses. The kernel code looked much better. Even after commenting out the
actual memory access in userspace, leaving just the loop itself, I got
terrible times.

Previous times:
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=68620000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=68620000)
malloc in userspace 0.566s (s=0)

Commenting out the actual memory access (the loop is not optimized away,
as verified in the assembler):
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=84750000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.412s (s=0) << just looping
malloc in userspace 0.566s (s=0)

The kernel is built with -O2, so compiling the test program with -O2 as
well yields more reasonable results:
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=84560000)
dma_alloc_coherent userspace 0.124s (s=0)
kmalloc in userspace 0.124s (s=84560000)
malloc in userspace 0.113s (s=0)

As can be seen, all tests executed in userspace were cut to 1/4-1/5 of the
time. malloc is now a bit faster than kmalloc. It could be faster if the
physical memory is spread out over different banks, but on the other hand
cache prefetching should be easier if it is contiguous.
Post by Russell King - ARM Linux
I notice you turn off VM_IO - you don't want to do that...
Fixed

Thanks for all help,
Thommy
Russell King - ARM Linux
2013-09-12 16:19:55 UTC
Post by Thommy Jakobsson
[...]
But isn't the memory received with dma_alloc_coherent() given a noncached
mapping, or even a strongly ordered one? Will that not conflict with the
normal kernel mapping, which is cached?
dma_alloc_coherent() and dma_map_single()/dma_map_page() both know about
the issues and deal with any dirty cache lines - they also try and map
the memory as compatibly as possible with any existing mapping.

On pre-ARMv6, dma_alloc_coherent() will provide memory which is "non-cached
non-bufferable" - C = B = 0. This is also called "strongly ordered" on
ARMv6 and later. You get this with pgprot_noncached(), or
pgprot_dmacoherent() on pre-ARMv6 architectures.

On ARMv6+, it provides memory which is "memory like, uncached". This
is what you get when you use pgprot_dmacoherent() on ARMv6 or later.

On ARMv6+, there are three classes of mapping: strongly ordered, device,
and memory-like. Strongly ordered and device are both non-cacheable.
However, memory-like can be cacheable, and the cache properties can be
specified. All mappings of a physical address _should_ be of the same
"class".

dma_map_single()/dma_map_page() deal with the problem completely
differently - they don't set up a new mapping; instead they perform
manual cache maintenance to ensure that the data is appropriately
visible to either the CPU or the DMA engine after the appropriate
call(s).
Post by Thommy Jakobsson
Comning back to the original issue; dissassembling the code I noticed that
the userspace code looked really stupid with a lot of unnecessary memory
accesses. Kernel looked much better. Even after commenting the actual
memory access out in userspace, leaving just the loop itself, I got
terrible times.
Oh, you're not specifying any optimisation whatsoever? That'll be
the reason then - the compiler won't do _any_ optimisation unless you
ask it to. That means it'll do stuff like saving an iterator out on
the stack and then immediately reading it back in, incrementing it, and
writing it back out again.
Post by Thommy Jakobsson
Kernel is with -O2 so compiling the testprogram with -O2 aswell yield more
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=84560000)
dma_alloc_coherent userspace 0.124s (s=0)
kmalloc in userspace 0.124s (s=84560000)
malloc in userspace 0.113s (s=0)
Great, glad you solved it.

Note however that the kmalloc version is not representative of what's
required for the CPU to provide or read DMA data: between the CPU accessing
the data and the DMA engine accessing it, there needs to be a cache flush,
which will consume additional time. That's where the dma_map_*,
dma_unmap_* and dma_sync_* functions come in.
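
A rough sketch of that per-transfer sequence (start_dma() and
wait_for_dma_complete() are hypothetical hardware-specific helpers; dev
stands in for the real struct device):

	/* hand the cached buffer to the device for one transfer */
	dma_sync_single_for_device(dev, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);
	start_dma(pa_kmbuf, BUFSIZE);
	wait_for_dma_complete();
	/* take it back: invalidate stale lines before the CPU reads */
	dma_sync_single_for_cpu(dev, pa_kmbuf, BUFSIZE, DMA_FROM_DEVICE);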

Thommy Jakobsson
2013-09-10 11:27:07 UTC
Post by Lucas Stach
How do you init the kmalloc memory? If you do a memset right before the
test loop your "kmalloc in kernel" will most likely always hit in the L1
cache, that's why it's really fast to do.
I did do a memset previously, but I removed it to see if I still had the
difference, so now I don't initialize the memory at all. The run from which
I attached the times had no initialization. Besides, I loop through all the
bytes 10000 times, so I would assume everything would be in the cache after
the first loop.
Post by Lucas Stach
The userspace mapping of the kmalloc memory will get a different virtual
address than the kernel mapping. So if you do a memset in kernelspace,
but the test loop in userspace you'll always miss the cache as the ARM
v7 caches are virtually indexed. So the processor always fetches data
from memory. The performance advantage against an uncached mapping is
entirely due to the fact that you are fetching whole cache lines
(32bytes) from memory at once, instead of doing a memory/bus transaction
per byte.
I thought that the L1 data cache was physically indexed and tagged,
whereas the instruction cache used virtual indexing. But maybe I'm
wrong. The L2 cache is physically indexed and tagged though, right?

Thanks,
Thommy
Russell King - ARM Linux
2013-09-10 11:41:44 UTC
Post by Thommy Jakobsson
I changed the code in my test program and driver to do the same thing in
kernelspace as well, and now I don't understand the result. Stepping
through and adding all bytes in a page-sized buffer is about 4-5 times
faster in the kernel. These are the times for looping through the buffer:
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=86700000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=86700000)
malloc in userspace 0.566s (s=0)
How many times have you verified this result?

So, the obvious question is: does this kernel have kernel preemption
enabled?

The reason for asking that is that if you have kernel preemption
disabled, while you're running your buffer sum, no other thread will get
use of the CPU, so you'll have all the CPU cycles (with the exception
of interrupt handling) to yourself.

That won't be true in userspace.

You may also like to consider giving people the full source to your
tests so that it can be run on other platforms as well.
Thommy Jakobsson
2013-09-10 12:54:49 UTC
Permalink
Post by Russell King - ARM Linux
Post by Thommy Jakobsson
I changed the code in my test program and driver to do the same thing in
kernelspace as well, and now I don't understand the result. Stepping
through and adding all bytes in a page-sized buffer is about 4-5 times
faster in the kernel. These are the times for looping through the buffer:
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=86700000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=86700000)
malloc in userspace 0.566s (s=0)
How many times have you verified this result?
I haven't done any scientific study, but at least 20 times with restarts in
between. I got a similar result on another hw as well, but I haven't
checked what kernel or config I was running there, so it might not be the
same thing. Also, each buffer is looped through 10000 times, which I assume
removes the most severe randomness at least.
Post by Russell King - ARM Linux
So, the obvious question is: does this kernel have kernel preemption
enabled?
The reason for asking that is that if you have kernel preemption
disabled, while your running your buffer sum, no other thread will get
use of the CPU, so you'll have all the CPU cycles (with the exception
of interrupt handling) to yourself.
That won't be true in userspace.
It should be enabled:
zcat /proc/config.gz | grep PREEM
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y

I wouldn't be surprised if doing things in the kernel is quicker; it's
just the amount that surprises me.
Post by Russell King - ARM Linux
You may also like to consider giving people the full source to your
tests so that it can be run on other platforms as well.
Sure thing, I didn't include it in the mail to avoid cluttering it up too
much. One can find it here:
https://github.com/thommyj/buf-speedtest


Thanks,
Thommy