Discussion:
Understanding an 'unhandled page fault'
Marc Singer
2004-04-11 18:22:30 UTC
The user-mode faults have returned--perhaps the alignment of the
planets is helping.

I want to understand what this output means.

cp: unhandled page fault at 0x000005e4, code 0x807
pgd = c1210000
[000005e4] *pgd=c11c8011, *pte=00000000, *ppte=00000000
PC is at 0xab0c
LR is at 0x2405c
pc : [<0000ab0c>] lr : [<0002405c>] Not tainted
sp : beffdc44 ip : 4013f128 fp : 00000000
r10: 0003f008 r9 : 0003f178 r8 : beffdc44
r7 : 00002000 r6 : 00002000 r5 : ffffffff r4 : ffffffff
r3 : 0003f008 r2 : 00002000 r1 : 00000001 r0 : beffdc44
Flags: NzCv IRQs on FIQs on Mode USER_32 Segment user
Control: C000717F Table: C1210000 DAC: 00000015

The instruction at this location, as reported by GDB from the core file,
is:

0x0000ab0c: ldr r12, [pc, #4] ; 0xab18

The memory location the instruction is referencing, 0xab18, contains
0x2dddac.

Where does the 0x5e4 come from?

-------------------------------------------------------------------
Subscription options: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
FAQ: http://www.arm.linux.org.uk/armlinux/mlfaq.php
Etiquette: http://www.arm.linux.org.uk/armlinux/mletiquette.php
Russell King - ARM Linux
2004-04-11 18:54:23 UTC
Post by Marc Singer
The user-mode faults have returned--perhaps the alignment of the
planets is helping.
Interesting. Is this reproducible? If so, are the register dumps exactly
the same?

Which machine and/or cpu is this?

Marc Singer
2004-04-11 19:07:00 UTC
Post by Russell King - ARM Linux
Post by Marc Singer
The user-mode faults have returned--perhaps the alignment of the
planets is helping.
Interesting. Is this reproducible? If so, are the register dumps exactly
the same?
Not exactly. As I wrote, these faults just returned. There was one
day when I could run my NFS transfer test where nothing failed and I
got very good throughput. Today, nearly every transfer fails with
either an unhandled page fault or an illegal instruction. The dumps
are different every time.
Post by Russell King - ARM Linux
Which machine and/or cpu is this?
This is the Sharp LH7A400 and LH7A404. I have two boards. Both
exhibit the problem. Both CPUs have a 922T core.

I thought that this might be some sort of memory timing problem. I
trimmed the SDRAM timings back to CAS3 and RAStoCAS3, I set the CPU
clock down to 150MHz, and I cleared the iA bit in the control
register. The faults keep coming.

There is an errata for the CPU which states that LDM-2 instructions
(LDM's that load two registers) will produce incorrect results for
non-cached RAM. AFAICT, we don't have any of that. Right?

There's one other thing. It looks like the trouble occurs when RAM
fills with cached pages.

Marc Singer
2004-04-11 23:12:15 UTC
Post by Russell King - ARM Linux
Interesting. Is this reproducible? If so, are the register dumps exactly
the same?
I think I have a reproducible case.

To make things easier, I've written a brief implementation of cp. It
opens the files and then calls read() and write() on 32K pieces of the
file. Also, I'm using a file that contains only 0xe5 bytes. The file
with 0xff's also crashes, but not in the same way every time.

First, the dump.

t: unhandled page fault at 0x4001e1f8, code 0x80f
pgd = c13f4000
[4001e1f8] *pgd=c13f3011, *pte=c521805f, *ppte=c5218aae
PC is at 0x85d4
LR is at 0x85c8
pc : [<000085d4>] lr : [<000085c8>] Not tainted
sp : befffe10 ip : 00010814 fp : befffe34
r10: 4013f0d8 r9 : 00008504 r8 : 4013c5bc
r7 : 00000003 r6 : 00008600 r5 : 4001dc13 r4 : befffe64
r3 : 00008000 r2 : 00008000 r1 : 00011008 r0 : 00008000
Flags: nzcv IRQs on FIQs on Mode USER_32 Segment user
Control: C000717F Table: C13F4000 DAC: 00000015

And the disassembly.

0x000085c0 <main+188>: mov r2, #32768 ; 0x8000
0x000085c4 <main+192>: bl 0x83c0
0x000085c8 <main+196>: str r0, [r11, -#36]
0x000085cc <main+200>: ldr r3, [r11, -#36]
0x000085d0 <main+204>: cmp r3, #0 ; 0x0
0x000085d4 <main+208>: bne 0x85dc <main+216>
0x000085d8 <main+212>: b 0x85f0 <main+236>

The bl to 0x83c0 is the call to read(). The code called is a simple
thunk. What seems to have happened is that the main loop just read 32K
and it's about to call write. It must have taken an interrupt as it
was about to jump. Something in returning from the interrupt went
awry and we got the page fault. Unlike previous cases, this one
references a valid address *and* it references the same address every
time. The exact place within the code shifts a bit, but not much
since the program is so short.

When the program is running, 0x4001e1f8 is mapped as a code segment in
libc.

Do you have any suggestions for debugging this?

yomienews@hotmail
2004-04-12 06:12:08 UTC
I have a function that does only simple file handling, and it also dumps core.
It is reproducible on an at91rm9200 with linux-2.4.21-rmk1.
After I added a "printk" at an irrelevant position inside the user application
program, there were no more core dumps.

I want to know why as well
@.@

Marc Singer
2004-04-12 14:37:35 UTC
Post by ***@hotmail
I have a function that does only simple file handling, and it also dumps core.
It is reproducible on an at91rm9200 with linux-2.4.21-rmk1.
After I added a "printk" at an irrelevant position inside the user application
program, there were no more core dumps.
Interesting suggestion. How does one add a printk to user-mode code?
Post by ***@hotmail
I want to know why as well
@.@
yomienews@hotmail
2004-04-12 14:52:44 UTC
Oops, sorry, I mean "printf".

Actually, adding a big dummy for loop works as well.

Russell King - ARM Linux
2004-04-12 09:37:07 UTC
Post by Marc Singer
I think I have a reproducible case.
Ok. Could you try disabling the cache and see if that makes the problem
go away? If it does, also try writethrough and see what effect that has.

Marc Singer
2004-04-12 16:15:26 UTC
Post by Russell King - ARM Linux
Post by Marc Singer
I think I have a reproducible case.
Ok. Could you try disabling the cache and see if that makes the problem
go away? If it does, also try writethrough and see what effect that has.
Oh, my. Jeeze. Well, they have another errata about how Linux
crashes when the data cache is disabled. Don't worry, though. I'll
try these things. Maybe there's more to learn.

Do you have a hypothesis about the mysteriously appearing address?


Russell King - ARM Linux
2004-04-12 16:23:32 UTC
Post by Marc Singer
Post by Russell King - ARM Linux
Post by Marc Singer
I think I have a reproducible case.
Ok. Could you try disabling the cache and see if that makes the problem
go away? If it does, also try writethrough and see what effect that has.
Oh, my. Jeeze. Well, they have another errata about how Linux
crashes when the data cache is disabled. Don't worry, though. I'll
try these things. Maybe there's more to learn.
Do you have a hypothesis about the mysteriously appearing address?
Not really. The best I can come up with is the idea that maybe the
CPU is executing stale instructions which somehow got left in the
instruction cache.

That's about the only thing I can come up with for a "bne" instruction
to apparently cause a data abort. Reality is that this instruction can
not cause a data abort with the FSR indicating "page permission fault."

There is one step I haven't confirmed - whether your exact bne
instruction bit pattern happens to have bit 20 clear. The reason
is that bit 20 clear in a load/store instruction indicates a store, and
the "this is a write" bit in Linux's version of the FSR was set.

Marc Singer
2004-04-12 18:20:09 UTC
Post by Russell King - ARM Linux
Post by Marc Singer
Post by Russell King - ARM Linux
Post by Marc Singer
I think I have a reproducible case.
Ok. Could you try disabling the cache and see if that makes the problem
go away? If it does, also try writethrough and see what effect that has.
Oh, my. Jeeze. Well, they have another errata about how Linux
crashes when the data cache is disabled. Don't worry, though. I'll
try these things. Maybe there's more to learn.
Do you have a hypothesis about the mysteriously appearing address?
Not really. The best I can come up with is the idea that maybe the
CPU is executing stale instructions which somehow got left in the
instruction cache.
That's about the only thing I can come up with for a "bne" instruction
to apparently cause a data abort. Reality is that this instruction can
not cause a data abort with the FSR indicating "page permission fault."
There is one step I haven't confirmed - whether your exact bne
instruction bit pattern happens to have bit 20 clear. The reason
is that bit 20 clear in a load/store instruction indicates a store, and
the "this is a write" bit in Linux's version of the FSR was set.
Interestingly enough, disabling the cache has meant that location of
the failure has changed and is not consistent.

t: unhandled page fault at 0x4001e1f8, code 0x807
pgd = c109c000
[4001e1f8] *pgd=c11c7011, *pte=00000002, *ppte=00000000
PC is at 0x400e37b0
LR is at 0x85c8
pc : [<400e37b0>] lr : [<000085c8>] Not tainted
sp : befffe10 ip : 00010814 fp : befffe34
r10: 4013f0d8 r9 : 00008504 r8 : 4013c5bc
r7 : 00000003 r6 : 00008600 r5 : 4001dc13 r4 : befffe64
r3 : 00008000 r2 : 00008000 r1 : 00011008 r0 : 00008000
Flags: nzcv IRQs on FIQs on Mode USER_32 Segment user
Control: C000617B Table: C109C000 DAC: 00000015

The disassembly,

0x400e37a8 <read+8>: movcc pc, lr
0x400e37ac <read+12>: b 0x40036110 <__libc_start_main+312>
0x400e37b0 <write+0>: swi 0x00900004
0x400e37b4 <write+4>: cmn r0, #4096 ; 0x1000

Now it is the swi instruction that is the common source of the fault.
Every time.


Marc Singer
2004-04-12 18:24:38 UTC
Post by Marc Singer
Interestingly enough, disabling the cache has meant that location of
the failure has changed and is not consistent.
Gosh, this is completely wrong.

Disabling the cache means that the failure location is now very
consistent. It is always on the swi instruction.


Russell King - ARM Linux
2004-04-12 18:31:11 UTC
Post by Marc Singer
Interestingly enough, disabling the cache has meant that location of
the failure has changed and is not consistent.
t: unhandled page fault at 0x4001e1f8, code 0x807
pgd = c109c000
[4001e1f8] *pgd=c11c7011, *pte=00000002, *ppte=00000000
PC is at 0x400e37b0
LR is at 0x85c8
pc : [<400e37b0>] lr : [<000085c8>] Not tainted
sp : befffe10 ip : 00010814 fp : befffe34
r10: 4013f0d8 r9 : 00008504 r8 : 4013c5bc
r7 : 00000003 r6 : 00008600 r5 : 4001dc13 r4 : befffe64
r3 : 00008000 r2 : 00008000 r1 : 00011008 r0 : 00008000
Flags: nzcv IRQs on FIQs on Mode USER_32 Segment user
Control: C000617B Table: C109C000 DAC: 00000015
The disassembly,
0x400e37a8 <read+8>: movcc pc, lr
0x400e37ac <read+12>: b 0x40036110 <__libc_start_main+312>
0x400e37b0 <write+0>: swi 0x00900004
0x400e37b4 <write+4>: cmn r0, #4096 ; 0x1000
Now it is the swi instruction that is the common source of the fault.
Every time.
Hmmmmm. Now, let's see.

code 0x807:
0x800 == write (iow, bit 20 of instruction _clear_)
0x007 == page translation fault (iow page not present)

Now let's look at the disassembly:

0x400e37b0 <write+0>: swi 0x00900004

According to my toolchain, that instruction has the value
0xef900004, which has bit 20 _set_. Note the rather major discrepancy
between the actual instruction and the fault code - one says the bit
is cleared, the other says it's set.

Since it's the kernel itself which decides whether bit 11 of the fault
code is set, this tends to imply that the kernel fault handler read an
instruction from 0x400e37b0 which had bit 20 clear.

Please apply this patch so we can see the instructions we supposedly
executed:

--- orig/arch/arm/kernel/traps.c	Fri Mar 19 11:55:17 2004
+++ linux/arch/arm/kernel/traps.c	Mon Apr 12 19:29:16 2004
@@ -124,7 +124,7 @@ static void dump_mem(const char *str, un
 	set_fs(fs);
 }
 
-static void dump_instr(struct pt_regs *regs)
+void dump_instr(struct pt_regs *regs)
 {
 	unsigned long addr = instruction_pointer(regs);
 	const int thumb = thumb_mode(regs);
--- orig/arch/arm/mm/fault-common.c	Fri Mar 19 11:55:17 2004
+++ linux/arch/arm/mm/fault-common.c	Mon Apr 12 19:29:55 2004
@@ -137,6 +137,7 @@ __do_user_fault(struct task_struct *tsk,
 		       tsk->comm, addr, fsr);
 		show_pte(tsk->mm, addr);
 		show_regs(regs);
+		dump_instr(regs);
 	}
 #endif


Marc Singer
2004-04-12 18:04:37 UTC
Post by Russell King - ARM Linux
Post by Marc Singer
I think I have a reproducible case.
Ok. Could you try disabling the cache and see if that makes the problem
go away? If it does, also try writethrough and see what effect that has.
I've disabled the I and D caches as well as setting the WRITETHROUGH
option. Same failure. Same address.

I checked the CP15 register that controls caching. Indeed, the caches
are off. I notice that the write-buffer is still enabled, though.
Disabling it has no effect.



Russell King - ARM Linux
2004-04-12 11:31:19 UTC
Post by Marc Singer
There is an errata for the CPU which states that LDM-2 instructions
(LDM's that load two registers) will produce incorrect results for
non-cached RAM. AFAICT, we don't have any of that. Right?
Hmm, this may eventually end up causing problems actually, because:

- there is no guarantee that the compiler will avoid such code sequences.
- we may map data pages uncached into userspace in some situations to
avoid cache aliasing issues (occurs when the same set of pages are
mapped into user space more than once.)

Realistically, if the manufacturer of this CPU wants their chip to
be reliable, and know that software run on it is going to behave as
designed, they need to fix this errata.

That said, I think the second point may be a sufficiently rare situation
that it hardly ever occurs in real life.

Whatever, I don't think this errata is causing your present problem.

Marc Singer
2004-04-12 15:16:17 UTC
Post by Russell King - ARM Linux
Post by Marc Singer
There is an errata for the CPU which states that LDM-2 instructions
(LDM's that load two registers) will produce incorrect results for
non-cached RAM. AFAICT, we don't have any of that. Right?
- there is no guarantee that the compiler will avoid such code sequences.
- we may map data pages uncached into userspace in some situations to
avoid cache aliasing issues (occurs when the same set of pages are
mapped into user space more than once.)
Interesting cases.
Post by Russell King - ARM Linux
Realistically, if the manufacturer of this CPU wants their chip to
be reliable, and know that software run on it is going to behave as
designed, they need to fix this errata.
I'm certain that they agree. The errata is in the 'investigation'
phase. The CPU was released only last year. I suspect that by next
year, these issues will be ironed out. In the mean time, other OSs
have modified their runtime libraries to cope with the problem. It is
more difficult for us since there could be many BLT implementations
hidden throughout the kernel and user-mode libraries. Hence, I'm
banking on there not being any case where this occurs.
Post by Russell King - ARM Linux
That said, I think the second point may be a sufficiently rare situation
that it hardly ever occurs in real life.
Whatever, I don't think this errata is causing your present problem.
Good and bad.

Marc Singer
2004-04-12 16:17:40 UTC
Post by Marc Singer
I'm certain that they agree. The errata is in the 'investigation'
phase. The CPU was released only last year. I suspect that by next
year, these issues will be ironed out. In the mean time, other OSs
have modified their runtime libraries to cope with the problem. It is
more difficult for us since there could be many BLT implementations
hidden throughout the kernel and user-mode libraries. Hence, I'm
banking on there not being any case where this occurs.
Post by Dr. David Alan Gilbert
It wouldn't be impossible to stop gcc emitting ldm/stm's for pairs, and
to get binutils to shout at you if it tried to assemble any.
OK. That does mean I'd have to produce a whole toolchain, kernel &
BSP.

Do you believe I could use the binutils tools to *look* for examples
of the LDM-2 instructions? That would, at least, give me an idea if
there is any reason to pursue the path?
Dave
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ _________________________|_____ http://www.treblig.org |_______/
b***@fluff.org
2004-04-12 16:20:19 UTC
Post by Marc Singer
Post by Russell King - ARM Linux
Post by Marc Singer
There is an errata for the CPU which states that LDM-2 instructions
(LDM's that load two registers) will produce incorrect results for
non-cached RAM. AFAICT, we don't have any of that. Right?
- there is no guarantee that the compiler will avoid such code sequences.
- we may map data pages uncached into userspace in some situations to
avoid cache aliasing issues (occurs when the same set of pages are
mapped into user space more than once.)
Interesting cases.
Post by Russell King - ARM Linux
Realistically, if the manufacturer of this CPU wants their chip to
be reliable, and know that software run on it is going to behave as
designed, they need to fix this errata.
I'm certain that they agree. The errata is in the 'investigation'
phase. The CPU was released only last year. I suspect that by next
year, these issues will be ironed out. In the mean time, other OSs
have modified their runtime libraries to cope with the problem. It is
more difficult for us since there could be many BLT implementations
hidden throughout the kernel and user-mode libraries. Hence, I'm
banking on there not being any case where this occurs.
At least the manufacturer is investigating the problem... it gets bad
when you go on about a `feature`, provide oscilloscope dumps of what
is going on, and the manufacturer turns around and says `well, if it
doesn't work, then we'll remove that claim from the next revision of
the data-sheet'.

One has to wonder how you can break an ARM92x core that badly, since
the tech comes from ARM itself...
--
Ben

Q: What's a light-year?
A: One-third less calories than a regular year.


Marc Singer
2004-04-12 16:35:22 UTC
Post by b***@fluff.org
One has to wonder how you can break an ARM92x core that badly, since
the tech comes from ARM itself...
They're surprisingly frank with details. It has to do with the fact
that the CPU cannot always perform queued writes when there is an AHB
master, other than the CPU, on the bus. The cache appears to smooth
over this problem. Thus, they recommend that the cache always be
enabled. As Russell points out, there are situations where we want to
disable the cache because of aliasing. I don't know if the aliasing
is really a problem here. It depends on whether or not the cache
operates against physical addresses or virtual addresses.
Russell King - ARM Linux
2004-04-12 16:42:42 UTC
Post by Marc Singer
Post by b***@fluff.org
One has to wonder how you can break an ARM92x core that badly, since
the tech comes from ARM itself...
They're surprisingly frank with details. It has to do with the fact
that the CPU cannot always perform queued writes when there is an AHB
master, other than the CPU, on the bus. The cache appears to smooth
over this problem. Thus, they recommend that the cache always be
enabled. As Russell points out, there are situations where we want to
disable the cache because of aliasing. I don't know if the aliasing
is really a problem here. It depends on whether or not the cache
operates against physical addresses or virtual addresses.
The ARM922T has virtually indexed, virtually tagged caches - same as most of
the other ARM CPUs you're likely to come across today.

Louie Boczek
2004-04-12 17:44:57 UTC
I had a similar problem about a year ago on a project that I had using a
922T core with the 2.4.19-rmk4 kernel. I traced it down to the
"cpu_arm922_data_abort" routine. It was:

----------------------------------------------
/*
 * cpu_arm922_data_abort()
 *
 * obtain information about current aborted instruction
 *
 * r0 = address of aborted instruction
 *
 * Returns:
 *  r0 = address of abort
 *  r1 != 0 if writing
 *  r3 = FSR
 */
	.align	5
ENTRY(cpu_arm922_data_abort)
	ldr	r1, [r0]			@ read aborted instruction
	mrc	p15, 0, r0, c6, c0, 0		@ get FAR
	tst	r1, r1, lsr #21			@ C = bit 20
	mrc	p15, 0, r3, c5, c0, 0		@ get FSR
	sbc	r1, r1, r1			@ r1 = C - 1
	and	r3, r3, #255
	mov	pc, lr

-----------------------------------------------

Looking through the archives I saw some discussion from Russell and
others about the problem of doing the "ldr" prior to getting the FAR and
FSR, since the ldr could itself data abort and corrupt the FAR and FSR. (This is
vague since I did this about a year ago.) So I basically copied the
cpu_arm920_data_abort code and my problems went away. My code now
looks like this:

------------------------------------------------------


/*
 * cpu_arm922_data_abort()
 *
 * obtain information about current aborted instruction
 * Note: we read user space. This means we might cause a data
 * abort here if the I-TLB and D-TLB aren't seeing the same
 * picture. Unfortunately, this does happen. We live with it.
 *
 * r2 = address of aborted instruction
 *
 * Returns:
 *  r0 = address of abort
 *  r1 != 0 if writing
 *  r3 = FSR
 */
	.align	5
ENTRY(cpu_arm922_data_abort)
	mrc	p15, 0, r3, c5, c0, 0		@ get FSR
	mrc	p15, 0, r0, c6, c0, 0		@ get FAR
	ldr	r1, [r2]			@ read aborted instruction
	and	r3, r3, #255
	tst	r1, r1, lsr #21			@ C = bit 20
	sbc	r1, r1, r1			@ r1 = C - 1
	mov	pc, lr

----------------------------------------------------

I hope this helps. And I am sorry for not submitting a patch a long
time ago, but I thought it was in the works since the problem was
discussed by Russell and others. I will follow up better in the future.

louie



Russell King - ARM Linux
2004-04-12 17:47:08 UTC
Post by Louie Boczek
I had a similar problem about a year ago on a project that I had using a
922T core with the 2.4.19-rmk4 kernel. I traced it down to the
"cpu_arm922_data_abort" routine.
Given that it's the Sharp LH7A40x device, and that Marc has recently
submitted 2.6 patches for it, I'm assuming that Marc is seeing these
on a 2.6 kernel - in which case this fix is already present.

I hope Marc isn't using a 2.4 kernel with the problem in; it has been
fixed in any modern 2.4 kernel, but still...

Marc Singer
2004-04-12 18:23:43 UTC
Post by Russell King - ARM Linux
Post by Louie Boczek
I had a similar problem about a year ago on a project that I had using a
922T core with the 2.4.19-rmk4 kernel. I traced it down to the
"cpu_arm922_data_abort" routine.
Given that it's the Sharp LH7A40x device, and that Marc has recently
submitted 2.6 patches for it, I'm assuming that Marc is seeing these
on a 2.6 kernel - in which case this fix is already present.
I hope Marc isn't using a 2.4 kernel with the problem in; it has been
fixed in any modern 2.4 kernel, but still...
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.

Russell King - ARM Linux
2004-04-12 18:32:37 UTC
Post by Marc Singer
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.
Hmm, could this be an extra clue? Could the network driver be
scribbling over user pages?

[Before we go there, please apply the patch I just sent and post
a new register dump with the code line - thanks.]

Marc Singer
2004-04-12 18:37:59 UTC
Post by Russell King - ARM Linux
Post by Marc Singer
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.
Hmm, could this be an extra clue? Could the network driver be
scribbling over user pages?
I have some evidence that this might be the case. The error *is*
sensitive to the contents of the file being copied. In one case, when
copying the file containing 0xff's, I saw that the PC contained
0xffffffff. Since it wasn't consistent, I couldn't be sure what was
going on.
Post by Russell King - ARM Linux
[Before we go there, please apply the patch I just sent and post
a new register dump with the code line - thanks.]
Applying...

Marc Singer
2004-04-12 18:44:45 UTC
Permalink
Post by Russell King - ARM Linux
Post by Marc Singer
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.
Hmm, could this be an extra clue? Could the network driver be
scribbling over user pages?
[Before we go there, please apply the patch I just sent and post
a new register dump with the code line - thanks.]
Yipee!!!

t: unhandled page fault at 0x4001e1f8, code 0x807
pgd = c10e4000
[4001e1f8] *pgd=c10c3011, *pte=00000002, *ppte=00000000
PC is at 0x400e37b0
LR is at 0x85c8
pc : [<400e37b0>] lr : [<000085c8>] Not tainted
sp : befffdd0 ip : 00010814 fp : befffdf4
r10: 4013f0d8 r9 : 00008504 r8 : 4013c5bc
r7 : 00000003 r6 : 00008600 r5 : 4001dc13 r4 : befffe24
r3 : 00008000 r2 : 00008000 r1 : 00011008 r0 : 00008000
Flags: nzcv IRQs on FIQs on Mode USER_32 Segment user
Control: C000617B Table: C10E4000 DAC: 00000015
Code: e5e5e5e5 e5e5e5e5 e5e5e5e5 e5e5e5e5 (e5e5e5e5)

Which brings me to another question. One of your patches included the
smc91x driver. I don't see the driver in your latest submissions, nor
do I find it in the kernel. Where has it gone?


Russell King - ARM Linux
2004-04-12 18:52:14 UTC
Permalink
Post by Marc Singer
Which brings me to another question. One of your patches included the
smc91x driver. I don't see the driver in your latest submissions, nor
do I find it in the kernel. Where has it gone?
Haven't managed to get around to submitting it yet. I'm debating about
sending it to Jeff Garzik for a review before sending it on.

However, with an outstanding patch queue of presently around 500K, I'm
trying to avoid putting large things in it until Linus has taken this
set. As far as I can see, he's still on holiday.

Marc Singer
2004-04-12 19:26:42 UTC
Permalink
Post by Russell King - ARM Linux
Post by Marc Singer
Which brings me to another question. One of your patches included the
smc91x driver. I don't see the driver in your latest submissions, nor
do I find it in the kernel. Where has it gone?
Haven't managed to get around to submitting it yet. I'm debating about
sending it to Jeff Garzik for a review before sending it on.
However, with an outstanding patch queue of presently around 500K, I'm
trying to avoid putting large things in it until Linus has taken this
set. As far as I can see, he's still on holiday.
OK. I'm trying to get my JTAG box to break on a tracepoint. If I'm
successful, I believe I'll be able to find the culprit. And I hope it
isn't too embarrassing.

I wouldn't mind seeing your most recent version of the SMC driver so
that I can diff against it. I saw a small patch from Nico that looks
specific to his hardware. I can diff against the 2.6.0 version if
that's really all there is.


Marc Singer
2004-04-13 00:31:33 UTC
Permalink
Post by Marc Singer
t: unhandled page fault at 0x4001e1f8, code 0x807
pgd = c10e4000
[4001e1f8] *pgd=c10c3011, *pte=00000002, *ppte=00000000
PC is at 0x400e37b0
LR is at 0x85c8
pc : [<400e37b0>] lr : [<000085c8>] Not tainted
sp : befffdd0 ip : 00010814 fp : befffdf4
r10: 4013f0d8 r9 : 00008504 r8 : 4013c5bc
r7 : 00000003 r6 : 00008600 r5 : 4001dc13 r4 : befffe24
r3 : 00008000 r2 : 00008000 r1 : 00011008 r0 : 00008000
Flags: nzcv IRQs on FIQs on Mode USER_32 Segment user
Control: C000617B Table: C10E4000 DAC: 00000015
Code: e5e5e5e5 e5e5e5e5 e5e5e5e5 e5e5e5e5 (e5e5e5e5)
The PTE shown in the crash report isn't always the same. Here is
another.

[4001e1f8] *pgd=c0008011, *pte=c504e057, *ppte=c504eaa6

Is *pte the first level descriptor and *ppte the second level
descriptor? They kinda look like them. Would this then mean that the
physical address is 0xc504e1f8?


Russell King - ARM Linux
2004-04-13 07:27:25 UTC
Permalink
Post by Marc Singer
The PTE shown in the crash report isn't always the same. Here is
another.
[4001e1f8] *pgd=c0008011, *pte=c504e057, *ppte=c504eaa6
Is *pte the first level descriptor and *ppte the second level
descriptor?
*pte is the "Linux" version of the PTE, and "*ppte" is the hardware
version of the PTE.
Post by Marc Singer
They kinda look like them. Would this then mean that the
physical address is 0xc504e1f8?
Yup.
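
For reference, here is a minimal sketch of that arithmetic. It assumes the hardware PTE is an ARMv4/v5 small-page (4 KiB) descriptor, which the low two bits of 0xc504eaa6 indicate. The function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define SMALL_PAGE_SHIFT 12u
#define SMALL_PAGE_MASK  ((1u << SMALL_PAGE_SHIFT) - 1u)

/* The low two bits of a second-level descriptor select its type;
 * 0b10 means a small (4 KiB) page on ARMv4/v5. */
int pte_is_small_page(uint32_t hw_pte)
{
    return (hw_pte & 3u) == 2u;
}

/* Physical address = page frame from the descriptor, combined with the
 * offset within the page taken from the virtual address. */
uint32_t small_page_phys(uint32_t hw_pte, uint32_t vaddr)
{
    assert(pte_is_small_page(hw_pte));
    return (hw_pte & ~SMALL_PAGE_MASK) | (vaddr & SMALL_PAGE_MASK);
}
```

With *ppte=c504eaa6 and the fault address 0x4001e1f8, small_page_phys() returns 0xc504e1f8.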

Russell King - ARM Linux
2004-04-14 19:32:43 UTC
Permalink
Post by Marc Singer
Post by Russell King - ARM Linux
Post by Marc Singer
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.
Hmm, could this be an extra clue? Could the network driver be
scribbling over user pages?
[Before we go there, please apply the patch I just sent and post
a new register dump with the code line - thanks.]
Yipee!!!
t: unhandled page fault at 0x4001e1f8, code 0x807
pgd = c10e4000
[4001e1f8] *pgd=c10c3011, *pte=00000002, *ppte=00000000
PC is at 0x400e37b0
LR is at 0x85c8
pc : [<400e37b0>] lr : [<000085c8>] Not tainted
sp : befffdd0 ip : 00010814 fp : befffdf4
r10: 4013f0d8 r9 : 00008504 r8 : 4013c5bc
r7 : 00000003 r6 : 00008600 r5 : 4001dc13 r4 : befffe24
r3 : 00008000 r2 : 00008000 r1 : 00011008 r0 : 00008000
Flags: nzcv IRQs on FIQs on Mode USER_32 Segment user
Control: C000617B Table: C10E4000 DAC: 00000015
Code: e5e5e5e5 e5e5e5e5 e5e5e5e5 e5e5e5e5 (e5e5e5e5)
The problem here is the 0xe5e5e5e5 btw. That's:

0: e5e5e5e5 strb lr, [r5, #1509]!

which does appear to be the instruction that was executed: r5 + 1509 =
0x4001dc13 + 0x5e5 = 0x4001e1f8, the fault address.

The cause of this isn't that the page was unmapped; it's caused by
the program code being overwritten by 0xe5e5e5e5.

You could dump out the PTE for the code being executed:

show_pte(tsk->mm, instruction_pointer(regs));
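
The decoding above can be checked mechanically. The following is a rough user-space sketch, not kernel code, that pulls the fields out of an immediate-offset LDR/STR encoding and computes the address it would access. The struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of an ARM single data transfer (LDR/STR) with immediate offset. */
struct sdt_fields {
    unsigned rn, rd;     /* base register, source/destination register */
    unsigned imm12;      /* 12-bit unsigned immediate offset */
    int pre, up, byte, writeback, load;
};

/* Returns 1 and fills *f if insn is an immediate-offset LDR/STR, else 0. */
int decode_sdt(uint32_t insn, struct sdt_fields *f)
{
    assert(f != NULL);
    if (((insn >> 26) & 3u) != 1u || ((insn >> 25) & 1u) != 0u)
        return 0;        /* need bits[27:26] = 01 and I = 0 */
    f->pre       = (insn >> 24) & 1u;   /* P: pre-indexed */
    f->up        = (insn >> 23) & 1u;   /* U: add the offset */
    f->byte      = (insn >> 22) & 1u;   /* B: byte transfer */
    f->writeback = (insn >> 21) & 1u;   /* W: write base back */
    f->load      = (insn >> 20) & 1u;   /* L: 1 = load, 0 = store */
    f->rn        = (insn >> 16) & 0xfu;
    f->rd        = (insn >> 12) & 0xfu;
    f->imm12     = insn & 0xfffu;
    return 1;
}

/* Address accessed, given the current value of the base register. */
uint32_t sdt_address(const struct sdt_fields *f, uint32_t base)
{
    assert(f != NULL);
    if (!f->pre)
        return base;     /* post-indexed: offset applied afterwards */
    return f->up ? base + f->imm12 : base - f->imm12;
}
```

Decoding 0xe5e5e5e5 this way gives a pre-indexed byte store with writeback, Rn = r5, Rd = r14 (lr) and offset 1509. sdt_address() with r5 = 0x4001dc13 gives 0x4001e1f8.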


Marc Singer
2004-04-12 20:23:50 UTC
Permalink
Post by Russell King - ARM Linux
Post by Marc Singer
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.
Hmm, could this be an extra clue? Could the network driver be
scribbling over user pages?
[Before we go there, please apply the patch I just sent and post
a new register dump with the code line - thanks.]
Well, I believe I've done everything right. It looks like there might
be another explanation, though. I am breaking on references to the
specified address. However, it is probable that there is an alias for
the code pages being overwritten. Since disabling the cache provokes
the reproducible case, I suspect that that's what is happening. So,
the EmbeddedICE isn't going to break since the code address isn't
being written.

How can I find out if there are physical addresses being mapped twice?


Marc Singer
2004-04-12 22:43:10 UTC
Permalink
Post by Russell King - ARM Linux
Post by Marc Singer
Marc is using a 2.6.5 kernel. Marc has seen these problems since
2.6.3 when he first got the network driver working.
Hmm, could this be an extra clue? Could the network driver be
scribbling over user pages?
I can see a few different ways that I might experience the behavior
I'm seeing.

1) The physical page is being mapped to more than one virtual
address.
2) The physical page is present in the memory map in two places.
This might happen if the setup was done incorrectly.
3) There is a memory alias present in the memory map. Some
addressing bits are unused, so 0xc0000000 and 0xc2000000 point to
the same physical location.
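
Case 3 can be expressed as a small check. This sketch assumes the second address in that case is meant to be 0xc2000000, and that the set of address bits the memory controller actually decodes is known as a mask. The mask value and function names are hypothetical and not taken from the LH7A40x documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two physical addresses alias when they agree on every decoded bit. */
int addrs_alias(uint32_t a, uint32_t b, uint32_t decoded_mask)
{
    return ((a ^ b) & decoded_mask) == 0;
}

/* Count pairs of bank base addresses that would select the same memory. */
unsigned count_alias_pairs(const uint32_t *bases, size_t n,
                           uint32_t decoded_mask)
{
    unsigned pairs = 0;

    assert(bases != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (addrs_alias(bases[i], bases[j], decoded_mask))
                pairs++;
    return pairs;
}
```

If, say, bit 25 were not decoded (mask ~0x02000000), then 0xc0000000 and 0xc2000000 would be reported as aliases. With all bits decoded they would not.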

I've vetted my memory macros again. I've changed them to be more
correct even though the previous versions returned correct values.
This didn't solve the problem.

I've changed the memory map to put each bank in a separate node. I'm
reasonably confident that each physical page is represented once and
only once in the memory map.

I've been experimenting with reducing the amount of memory given to
the kernel. When I give it 8MiB, the problem doesn't seem to occur.
I've tried several combinations of memory regions. Seems, though,
that a negative result isn't very useful here.

It would be helpful to know if the page being overwritten is mapped
twice. Is there a facility for checking this?


Russell King - ARM Linux
2004-04-12 22:49:22 UTC
Permalink
Post by Marc Singer
I can see a few different ways that I might experience the behavior
I'm seeing.
1) The physical page is being mapped to more than one virtual
address.
If it's mapped into userspace, it will be mapped in two places,
though we know how to handle that. It will be mapped in kernel
space, and again in user space.
Post by Marc Singer
It would be helpful to know if the page being overwritten is mapped
twice. Is there a facility for checking this?
About the best idea I can give you at present is to write a function
to walk the page tables dumping the values found there. However,
there may be a lot to it, so I'd recommend not dumping level 1
page table entries containing zero.
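
As an illustration of the idea only, here is a user-space sketch that walks a simulated two-level table, skips empty level 1 entries, and counts how many virtual pages map a given physical page. The sim_pgd layout is an assumption made for this example. A real walker would use the kernel's pgd/pmd/pte accessors and would also have to handle 1 MiB section mappings, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define L1_ENTRIES 4096u  /* each level 1 entry covers 1 MiB */
#define L2_ENTRIES 256u   /* each level 2 entry covers 4 KiB */

/* Simulated level 1 table: NULL plays the role of a zero entry. */
struct sim_pgd {
    const uint32_t *l2[L1_ENTRIES];
};

/* Dump every non-zero level 2 entry and return how many virtual pages
 * map the physical page containing phys_page. */
unsigned walk_and_count(const struct sim_pgd *pgd, uint32_t phys_page)
{
    unsigned hits = 0;

    assert(pgd != NULL);
    for (uint32_t i = 0; i < L1_ENTRIES; i++) {
        const uint32_t *l2 = pgd->l2[i];

        if (l2 == NULL)
            continue;     /* skip empty level 1 entries */
        for (uint32_t j = 0; j < L2_ENTRIES; j++) {
            uint32_t vaddr = (i << 20) | (j << 12);

            if (l2[j] == 0)
                continue;
            printf("%08x -> %08x\n", (unsigned)vaddr, (unsigned)l2[j]);
            if ((l2[j] & ~0xfffu) == (phys_page & ~0xfffu))
                hits++;
        }
    }
    return hits;
}
```

A count greater than one for the page in question would show which virtual addresses share it.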

Marc Singer
2004-04-12 23:04:11 UTC
Permalink
Post by Russell King - ARM Linux
Post by Marc Singer
I can see a few different ways that I might experience the behavior
I'm seeing.
1) The physical page is being mapped to more than one virtual
address.
If it's mapped into userspace, it will be mapped in two places,
though we know how to handle that. It will be mapped in kernel
space, and again in user space.
Post by Marc Singer
It would be helpful to know if the page being overwritten is mapped
twice. Is there a facility for checking this?
About the best idea I can give you at present is to write a function
to walk the page tables dumping the values found there. However,
there may be a lot to it, so I'd recommend not dumping level 1
page table entries containing zero.
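static inline void check_overwrite (const char* sz)
{
    /* Fault address reported in the earlier register dump. */
    unsigned long addr = 0x4001e1f8;
    pgd_t* pgd = pgd_offset (&init_mm, addr);

    /* Read the address only if its first-level entry (looked up in
       init_mm) is non-zero, then compare against the fill pattern. */
    if (pgd_val(*pgd) && *(unsigned long*) addr == 0xe5e5e5e5)
        printk ("overwrite @%s\n", sz);
}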

Here's an interim plan to see if I can catch it in the act. If this
is happening in the network driver, I ought to see the user's code
space modified before the ISR returns. Will the code above printk
when the memory address is overwritten?



Yomie Chan
2004-04-13 01:57:19 UTC
Permalink
Dear Mr King:

I also get core dumps from a file-handling user application on my
at91rm9200, running linux-2.4.21-rmk1 and compiled with arm-linux-gcc
2.95.3. Could these core dumps be caused by the ARM CPU problem as well?

Thanks
-Yomie


-----Original Message-----
From: ben-***@fluff.org [mailto:ben-***@fluff.org]
Sent: Tuesday, April 13, 2004 12:20 AM
To: Marc Singer
Cc: linux-arm-***@lists.arm.linux.org.uk
Subject: Re: Understanding an 'unhandled page fault'
Post by Marc Singer
Post by Russell King - ARM Linux
Post by Marc Singer
There is an errata for the CPU which states that LDM-2 instructions
(LDM's that load two registers) will produce incorrect results for
non-cached RAM. AFAICT, we don't have any of that. Right?
- there is no guarantee that the compiler will avoid such code
sequences.
Post by Marc Singer
Post by Russell King - ARM Linux
- we may map data pages uncached into userspace in some situations to
avoid cache aliasing issues (occurs when the same set of pages are
mapped into user space more than once.)
Interesting cases.
Post by Russell King - ARM Linux
Realistically, if the manufacturer of this CPU wants their chip to
be reliable, and know that software run on it is going to behave as
designed, they need to fix this errata.
I'm certain that they agree. The errata is in the 'investigation'
phase. The CPU was released only last year. I suspect that by next
year, these issues will be ironed out. In the mean time, other OSs
have modified their runtime libraries to cope with the problem. It is
more difficult for us since there could be many BLT implementations
hidden throughout the kernel and user-mode libraries. Hence, I'm
banking on there not being any case where this occurs.
At least the manufacturer is investigating the problem... it gets bad
when you go on about a `feature`, provide oscilloscope dumps of what
is going on, and the manufacturer turns around and says `well, if it
doesn't work, then we'll remove that claim from the next revision of
the data-sheet'.

One has to wonder how you can break an ARM92x core that badly, since
the tech comes from ARM itself...
--
Ben

Q: What's a light-year?
A: One-third less calories than a regular year.

