Discussion:
kernel 2.6 and fastfpe
Peter Teichmann
2003-11-07 21:44:17 UTC
Hi,

I am currently updating FastFPE to contain all the improvements I have done
this year. However, I am experiencing two problems (I am using kernel
2.6.0-test8-rmk1):

Problem 1:

If I want to use FastFPE, I need to compile in both NWFPE and FastFPE and
select FastFPE via "fpe=fastfpe". If I compile in only FastFPE, I get a
crash very shortly after FastFPE initialisation. I get exactly the same
crash without any FPE compiled into the kernel (see the very end of this
mail).

How can I understand this? I have been looking for something in the kernel
source that is just enabled for NWFPE but also needed by FastFPE, but did not
find anything.

Problem 2:

FastFPE is much slower than exactly the same emulator in kernel 2.4.19-rmk7.
On a 200MHz Strongarm I get 0.4 MFlops compared to 1.1 MFlops, approximately.
I looked further into this and found the following (not very exact) execution
times:

                 2.4.19-rmk7    2.6.0-test8-rmk1
ADFD             560ns          1990ns
MUFD             590ns          2060ns
DVFD             1550ns         3000ns
Trap overhead    570ns          2040ns
(trap overhead is measured as additional time per block of FP instructions)

Every one of these times is approximately 1500ns longer on 2.6, which is
around 300 cycles at 200MHz. I cannot imagine what causes these delays. Does
anyone know, or have an idea?
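
As a rough check of the arithmetic above, the following sketch computes the per-row slowdown and converts it to cycles. The figures are copied from the table; the struct and function names are made up for illustration, and the 200MHz clock is the one quoted for the StrongARM.

```c
#include <assert.h>
#include <stddef.h>

/* Timings in nanoseconds, copied from the table above. */
struct fpe_timing {
    const char *op;
    int ns_24;   /* 2.4.19-rmk7 */
    int ns_26;   /* 2.6.0-test8-rmk1 */
};

static const struct fpe_timing timings[] = {
    { "ADFD",           560, 1990 },
    { "MUFD",           590, 2060 },
    { "DVFD",          1550, 3000 },
    { "Trap overhead",  570, 2040 },
};

#define N_TIMINGS (sizeof(timings) / sizeof(timings[0]))

/* Extra time on 2.6 for row i, in nanoseconds. */
int extra_ns(size_t i)
{
    assert(i < N_TIMINGS);
    return timings[i].ns_26 - timings[i].ns_24;
}

/* Convert nanoseconds to CPU cycles for a clock given in MHz. */
int ns_to_cycles(int ns, int mhz)
{
    return ns * mhz / 1000;
}
```

Calling extra_ns() for the four rows gives 1430, 1470, 1450 and 1470ns, and ns_to_cycles(1500, 200) gives 300, which matches the "around 300 cycles" estimate.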

Peter Teichmann


Output of booting with just FastFPE enabled:
Linux version 2.6.0-test8-rmk1 (***@Acorn) (gcc version 2.95.4 20011002
(Debian prerelease)) #31 Wed Nov 5 23:47:14 CET 2003
CPU: StrongARM-110 [4401a102] revision 2 (ARMv4)
Machine: Acorn-RiscPC
ATAG_INITRD is deprecated; please update your bootloader.
Memory policy: ECC disabled, Data cache write back
On node 0 totalpages: 16384
DMA zone: 16384 pages, LIFO batch:4
Normal zone: 0 pages, LIFO batch:1
HighMem zone: 0 pages, LIFO batch:1
Building zonelist for node : 0
Kernel command line: BOOT=none root=/dev/hda4 console=ttyS0,19200
PID hash table entries: 1024 (order 10: 8192 bytes)
Console: colour dummy device 80x30
Memory: 16MB 16MB 16MB 16MB = 64MB total
Memory: 63040KB available (1297K code, 313K data, 72K init)
Calibrating delay loop... 134.75 BogoMIPS
Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: Testing write buffer coherency: ok
POSIX conformance testing by UNIFIX
NET: Registered protocol family 16
Probing expansion cards
Acornfb: 2048kB VRAM, VIDC20, using 640x480, 31.468kHz, 59Hz
Acornfb: Monitor: 0.000-99.999kHz, 0-199Hz
Fast Floating Point Emulator V0.9 (c) Peter Teichmann.
i2c /dev entries driver
Console: switching to colour frame buffer device 80x60
pty: 256 Unix98 ptys configured
registering 0-0050
Unable to handle kernel paging request at virtual address c1de1078
pgd = c0318000
[c1de1078] *pgd=00000000
Internal error: Oops: 10319005 [#1]
CPU: 0
PC is at kfree+0x3c/0x84
LR is at kobject_set_name+0xbc/0xc8
pc : [<c004f058>] lr : [<c00a064c>] Not tainted
sp : c024de34 ip : c024de50 fp : c024de4c
r10: c024de84 r9 : c03d5cf4 r8 : c03d5cc0
r7 : c03d5cc4 r6 : 20000013 r5 : 7473626f r4 : 00000006
r3 : c1de1070 r2 : 003d640e r1 : c03d5cc4 r0 : 7473626f
Flags: nzCv IRQs off FIQs on Mode SVC_32 Segment kernel
Control: 1031917F Table: 1031917F DAC: 0000001D
Process swapper (pid: 1, stack limit = 0xc024c0ec)
Stack: (0xc024de34 to 0xc024e000)
de20: 00000006 c03d5cc4 00000014
de40: c024de7c c024de50 c00a064c c004f028 00000000 c03d5c98 c0178630 c03d5cc0
de60: c03d5cf4 00000001 c00c1c84 00000000 c024dea8 c024de8c c00c85c0 c00a05a0
de80: c03d5cf4 20000013 00000003 c03d5c98 c03d5c80 c01785e0 c01785e0 c024debc
dea0: c024deac c00c8688 c00c857c c03d5c98 c024dedc c024dec0 c00a6bd8 c00c8678
dec0: 00000050 c03d5c80 00000050 c024defc c024df38 c024dee0 c00c1d80 c00a6a74
dee0: 00000000 00000050 c0240001 c024dee3 00010050 c00a0001 c024dee2 00000050
df00: c0240001 c024dee3 00010050 c00a0001 c024dee2 00000002 00000001 00000050
df20: c0178794 00000000 c01785e0 c024df70 c024df3c c00a7a54 c00c1c90 00000000
df40: 00000000 00000000 c0178730 c01787b0 c0173bcc 00000000 c0173bdc 00000000
df60: 00000000 c024df80 c024df74 c00c1da4 c00a7470 c024dfa4 c024df84 c00a67a0
df80: c00c1d98 c00197b4 c024c000 00000000 c00197f8 00000000 c024dfb4 c024dfa8
dfa0: c0013138 c00a66e4 c024dfd4 c024dfb8 c0008804 c0013130 00000000 00000000
dfc0: 00000000 00000000 c024dfe4 c024dfd8 c0008898 c00087c4 c024dff4 c024dfe8
dfe0: c001a0a4 c0008884 00000000 c024dff8 c0031d6c c001a088 00000000 00000000
Backtrace:
[<c004f01c>] (kfree+0x0/0x84) from [<c00a064c>] (kobject_set_name+0xbc/0xc8)
r6 = 00000014 r5 = C03D5CC4 r4 = 00000006
[<c00a0590>] (kobject_set_name+0x0/0xc8) from [<c00c85c0>] (device_add+0x50/0xfc)
r3 = 00000003 r2 = 20000013 r1 = C03D5CF4
[<c00c8570>] (device_add+0x0/0xfc) from [<c00c8688>] (device_register+0x1c/0x20)
r7 = C01785E0 r6 = C01785E0 r5 = C03D5C80 r4 = C03D5C98
[<c00c866c>] (device_register+0x0/0x20) from [<c00a6bd8>] (i2c_attach_client+0x170/0x19c)
r4 = C03D5C98
[<c00a6a68>] (i2c_attach_client+0x0/0x19c) from [<c00c1d80>] (pcf8583_attach+0xfc/0x108)
r6 = C024DEFC r5 = 00000050 r4 = C03D5C80
[<c00c1c84>] (pcf8583_attach+0x0/0x108) from [<c00a7a54>] (i2c_probe+0x5f0/0x620)
[<c00a7464>] (i2c_probe+0x0/0x620) from [<c00c1da4>] (pcf8583_probe+0x18/0x24)
[<c00c1d8c>] (pcf8583_probe+0x0/0x24) from [<c00a67a0>] (i2c_add_driver+0xc8/0x120)
[<c00a66d8>] (i2c_add_driver+0x0/0x120) from [<c0013138>] (pcf8583_init+0x14/0x1c)
r8 = 00000000 r7 = C00197F8 r6 = 00000000 r5 = C024C000
r4 = C00197B4
[<c0013124>] (pcf8583_init+0x0/0x1c) from [<c0008804>] (do_initcalls+0x4c/0xc0)
[<c00087b8>] (do_initcalls+0x0/0xc0) from [<c0008898>] (do_basic_setup+0x20/0x24)
r7 = 00000000 r6 = 00000000 r5 = 00000000 r4 = 00000000
[<c0008878>] (do_basic_setup+0x0/0x24) from [<c001a0a4>] (init+0x28/0xd4)
[<c001a07c>] (init+0x0/0xd4) from [<c0031d6c>] (do_exit+0x0/0x3c0)
Code: e5933000 e0822102 e0833182 e243370a (e5930008)
<0>Kernel panic: Attempted to kill init!


-------------------------------------------------------------------
Subscription options: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
FAQ/Etiquette: http://www.arm.linux.org.uk/armlinux/mailinglists.php
Russell King - ARM Linux
2003-11-08 10:10:01 UTC
Post by Peter Teichmann
If I want to use FastFPE, I need to compile in both NWFPE and FastFPE and
select FastFPE via "fpe=fastfpe". If I just compile in FastFPE, the I get a
crash already very short after FastFPE initialisation. I get exactly the same
crash without any FPE compiled into the kernel (look at the very end of this
mail).
There isn't a direct connection between the two, but it's a case of
pcf8583.c not zeroing the i2c client structure before calling
i2c_attach_client()
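
To illustrate why an unzeroed structure can blow up in this way, here is a minimal userspace sketch. It is not the pcf8583.c or i2c code: struct demo_client and the demo_* functions are hypothetical stand-ins, with free() playing the role of the kfree() call that appears under kobject_set_name() in the backtrace.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a client structure that embeds a named object. */
struct demo_client {
    int   addr;
    char *name;     /* owned by the client; NULL means "no name set yet" */
};

/* Replace the name, releasing any previous one first. This is only safe
 * if c->name is either NULL or a pointer obtained from malloc(). */
int demo_set_name(struct demo_client *c, const char *name)
{
    char *copy = malloc(strlen(name) + 1);

    if (!copy)
        return -1;
    strcpy(copy, name);
    free(c->name);
    c->name = copy;
    return 0;
}

/* Allocate a client and zero it before first use. Without the memset(),
 * c->name would hold whatever malloc() left in the block, and the free()
 * in demo_set_name() would be handed a garbage pointer. */
struct demo_client *demo_client_new(int addr)
{
    struct demo_client *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    memset(c, 0, sizeof(*c));
    c->addr = addr;
    return c;
}

/* Release the client and its name. */
void demo_client_free(struct demo_client *c)
{
    if (c) {
        free(c->name);
        free(c);
    }
}
```

Whether such a bug shows up depends entirely on what the uninitialised memory happens to contain, which would fit a failure that comes and goes with unrelated changes to the kernel configuration.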
Post by Peter Teichmann
FastFPE is much slower than exactly the same emulator in kernel 2.4.19-rmk7.
On a 200MHz Strongarm I get 0.4 MFlops compared to 1.1 MFlops, approximately.
I looked further into this and found the following (not very exact) execution
                 2.4.19-rmk7    2.6.0-test8-rmk1
ADFD             560ns          1990ns
MUFD             590ns          2060ns
DVFD             1550ns         3000ns
Trap overhead    570ns          2040ns
(trap overhead is measured as additional time per block of FP instructions)
For each time there is a difference of approximately 1500ns for all times,
that are around 300 cycles. I can not imagine what causes these delays. Does
someone know that or have an idea?
The undefined instruction handler in the kernel will now read the first
instruction for you, and partially decode it. It needs to do this because
we're now seeing other co-pro instructions being used for other purposes.

I did update fastfpe in 2.6 to take account of this though, so I'm not
sure why you're seeing quadruple overhead on the trap.

Peter Teichmann
2003-11-09 21:06:27 UTC
Post by Russell King - ARM Linux
Post by Peter Teichmann
If I want to use FastFPE, I need to compile in both NWFPE and FastFPE and
select FastFPE via "fpe=fastfpe". If I just compile in FastFPE, the I get
a crash already very short after FastFPE initialisation. I get exactly
the same crash without any FPE compiled into the kernel (look at the very
end of this mail).
There isn't a direct connection between the two, but it's a case of
pcf8583.c not zeroing the i2c client structure before calling
i2c_attach_client()
I see. But why is that only a problem when there is no NWFPE compiled into the
kernel?
Post by Russell King - ARM Linux
Post by Peter Teichmann
                 2.4.19-rmk7    2.6.0-test8-rmk1
ADFD             560ns          1990ns
MUFD             590ns          2060ns
DVFD             1550ns         3000ns
Trap overhead    570ns          2040ns
(trap overhead is measured as additional time per block of FP instructions)
The undefined instruction handler in the kernel will now read the first
instruction for you, and partially decode it. It needs to do this because
we're now seeing other co-pro instructions being used for other purposes.
I did update fastfpe in 2.6 to take account of this though, so I'm not
sure why you're seeing quadruple overhead on the trap.
Yes, I noticed that. But I do not believe that this is the cause for the
additional delay, because NWFPE seems not to suffer from that.

Also, my explanation of the numbers above was unclear. The times for
ADFD/MUFD/DVFD do _not_ include the trap overhead. For any block of FP
instructions, the execution time increased by 1500ns for the trap and by
1500ns for each of the instructions, and this cannot be caused by the change
to the undefined instruction handler. But what is it then? Could it be that,
through some strange circumstance, some data it accesses is never in the
cache, or is even uncacheable?

Peter Teichmann



Russell King - ARM Linux
2003-11-09 21:28:02 UTC
Post by Peter Teichmann
Post by Russell King - ARM Linux
There isn't a direct connection between the two, but it's a case of
pcf8583.c not zeroing the i2c client structure before calling
i2c_attach_client()
I see. But why is that just a problem when there is no NWFPE compiled
into the kernel?
It just happens to hit some area which was already initialised to zero
when NWFPE is compiled into the kernel?
Post by Peter Teichmann
Also, I explained the numbers above a bit unclear. Times for ADFD/MUFD/DVFD do
_not_ include the Trap overhead. For any block of FP instructions, the
execution time increased by 1500ns for the Trap and 1500ns for each of the
instructions, and this can not be caused by the change to the undefined
instruction handler. But what is it then? Could because of strange
circumstances some data it accesses is always not in Cache or even
uncacheable?
Hmm, no idea off hand. The only thing which does occur to me that might
cause a change like this is if the fpe wasn't in cacheable memory.
However, I'd only consider that if you have fastfpe as a module (which
you don't.)

Peter Teichmann
2003-11-09 21:43:06 UTC
Post by Russell King - ARM Linux
Post by Peter Teichmann
Also, I explained the numbers above a bit unclear. Times for
ADFD/MUFD/DVFD do _not_ include the Trap overhead. For any block of FP
instructions, the execution time increased by 1500ns for the Trap and
1500ns for each of the instructions, and this can not be caused by the
change to the undefined instruction handler. But what is it then? Could
because of strange circumstances some data it accesses is always not in
Cache or even uncacheable?
Hmm, no idea off hand. The only thing which does occur to me that might
cause a change like this is if the fpe wasn't in cacheable memory.
However, I'd only consider that if you have fastfpe as a module (which
you don't.)
But then again, that would not add a constant penalty of 1500ns; it would
scale the execution time. It must be some operation it performs that suddenly
takes much longer, and it occurs to me that this can only be a memory access.
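
The "constant penalty versus scaling" argument can be checked against the posted figures with a small sketch like the one below. The numbers are the ones from the timing table; the function names are hypothetical, and ratios are computed in integer percent to keep the code simple.

```c
#include <assert.h>
#include <stddef.h>

/* Timings in nanoseconds: ADFD, MUFD, DVFD, trap overhead. */
static const int ns_24[] = {  560,  590, 1550,  570 };  /* 2.4.19-rmk7 */
static const int ns_26[] = { 1990, 2060, 3000, 2040 };  /* 2.6.0-test8-rmk1 */

#define N_ROWS (sizeof(ns_24) / sizeof(ns_24[0]))

/* Largest value minus smallest value of v[0..n-1]. */
int spread(const int *v, size_t n)
{
    assert(n > 0);
    int lo = v[0], hi = v[0];

    for (size_t i = 1; i < n; i++) {
        if (v[i] < lo)
            lo = v[i];
        if (v[i] > hi)
            hi = v[i];
    }
    return hi - lo;
}

/* Spread of the absolute slowdowns (2.6 minus 2.4), in nanoseconds. */
int diff_spread_ns(void)
{
    int d[N_ROWS];

    for (size_t i = 0; i < N_ROWS; i++)
        d[i] = ns_26[i] - ns_24[i];
    return spread(d, N_ROWS);
}

/* Spread of the relative slowdowns (2.6 as a percentage of 2.4). */
int ratio_spread_percent(void)
{
    int r[N_ROWS];

    for (size_t i = 0; i < N_ROWS; i++)
        r[i] = ns_26[i] * 100 / ns_24[i];
    return spread(r, N_ROWS);
}
```

With these figures the absolute slowdowns differ by only 40ns across the four rows, while the ratios range from about 193% (DVFD) to about 357% (trap overhead). That is the pattern expected from a fixed additive cost rather than from uniformly slower execution.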

On the other hand, could it be that the buggy pcf8583.c modifies the code?

Peter Teichmann


Peter Teichmann
2003-11-10 19:17:21 UTC
Post by Russell King - ARM Linux
Post by Peter Teichmann
Also, I explained the numbers above a bit unclear. Times for
ADFD/MUFD/DVFD do _not_ include the Trap overhead. For any block of FP
instructions, the execution time increased by 1500ns for the Trap and
1500ns for each of the instructions, and this can not be caused by the
change to the undefined instruction handler. But what is it then? Could
because of strange circumstances some data it accesses is always not in
Cache or even uncacheable?
Hmm, no idea off hand. The only thing which does occur to me that might
cause a change like this is if the fpe wasn't in cacheable memory.
However, I'd only consider that if you have fastfpe as a module (which
you don't.)
It seems that the buggy pcf8583.c modified some of FastFPE's code. Amazing that
it was still functional... If I memset the i2c_client structure to 0 before
calling i2c_attach_client, I also regain the original performance for FastFPE.
So I believe that I will soon deliver a FastFPE patch to the patch system.

How about the pcf8583 bug: will you patch it, since it is simple, or shall I
supply a patch for this as well?

Peter Teichmann


Russell King - ARM Linux
2003-11-10 19:20:51 UTC
Post by Peter Teichmann
How about the pcf8583 bug, will you patch it because it is simple or shall I
supply a patch for this as well?
Sorry, I forgot about it for -test9-rmk1. However, I've just put the
fix in, so no patch is necessary.
