Discussion:
runtime check for omap-aes bus access permission (was: Re: 3.13-rc3 (commit 7ce93f3) breaks Nokia N900 DT boot)
Nishanth Menon
2015-02-11 20:40:33 UTC
Anyhow, since checking the firewalls/APs to see if you have
permission will probably only get you yet another fault if
things are walled off, the robust way of dealing with this
sort of situation is by probing the device with a read
while trapping bus faults. This also handles modules that
are unreachable for other reasons, e.g. being disabled by
eFuse.
Is it possible to patch kernel code to mask or ignore that
fault? Can you help me with something like that?
As I mentioned, I'm still learning my way around the kernel,
so I don't feel very comfortable suggesting a concrete patch
just yet. I've been browsing arch/arm/mm/ however and my
impression is that all that would be required is editing
fault.c by making a copy of do_bad but containing
return user_mode(regs) || !fixup_exception(regs);
and hook it onto the appropriate fault codes. However, this
really needs the opinion of someone more familiar with this
code.
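To make the suggestion above concrete, here is a small userspace model of it, not kernel code. struct pt_regs_model, struct extable_entry, the table contents and the *_model function names are hypothetical stand-ins invented for illustration; only the return expression mirrors the line quoted above. As with do_bad, a non-zero return is taken to mean "fault not handled".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct pt_regs_model {
    unsigned long pc;   /* address of the faulting instruction */
    bool from_user;     /* true if the abort was taken from user mode */
};

/* One exception-table entry: faulting instruction -> recovery code. */
struct extable_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* Illustrative table: one probing load at 0x1000, recovery code at 0x2000. */
static const struct extable_entry extable[] = {
    { 0x1000, 0x2000 },
};

static bool user_mode_model(const struct pt_regs_model *regs)
{
    return regs->from_user;
}

/* Returns true and redirects pc if the faulting instruction has a fixup. */
static bool fixup_exception_model(struct pt_regs_model *regs)
{
    for (size_t i = 0; i < sizeof(extable) / sizeof(extable[0]); i++) {
        if (extable[i].insn == regs->pc) {
            regs->pc = extable[i].fixup;
            return true;
        }
    }
    return false;
}

/* Non-zero means "unhandled": user-mode faults and faults without a fixup. */
static int do_bus_fault_model(struct pt_regs_model *regs)
{
    return user_mode_model(regs) || !fixup_exception_model(regs);
}
```

In this model a kernel-mode fault at a registered probing instruction returns 0 and resumes at the fixup address, while everything else still falls through to the usual "unhandled fault" path.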
I do have an observation to make on the issue of fault
decoding: the list in fsr-2level.c may be "standard ARMv3 and
ARMv4 aborts" but they are quite wrong for ARMv7 which
[ 0] -
[ 1] alignment fault
[ 2] debug event
[ 3] section access flag fault
[ 4] instruction cache maintenance fault (reported via data abort)
[ 5] section translation fault
[ 6] page access flag fault
[ 7] page translation fault
[ 8] bus error on access
[ 9] section domain fault
[10] -
[11] page domain fault
[12] bus error on section table walk
[13] section permission fault
[14] bus error on page table walk
[15] page permission fault
[16] (TLB conflict abort)
[17] -
[18] -
[19] -
[20] (lockdown abort)
[21] -
[22] async bus error (reported via data abort)
[23] -
[24] async parity/ECC error (reported via data abort)
[25] parity/ECC error on access
[26] (coprocessor abort)
[27] -
[28] parity/ECC error on section table walk
[29] -
[30] parity/ECC error on page table walk
[31] -
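For reference, the list above can be turned into a lookup table. This is only a sketch: it assumes the short-descriptor fault status format (FS[3:0] in FSR bits 3:0, FS[4] in FSR bit 10), and the names v7_fault_names and v7_fault_name are made up here, not taken from fault.c.

```c
#include <stdint.h>

/* Extract the 5-bit fault status: FS[4] is FSR bit 10, FS[3:0] is bits 3:0. */
static unsigned int fsr_fs(uint32_t fsr)
{
    return (fsr & 0xfu) | ((fsr >> 6) & 0x10u);
}

/* Names copied from the list above; unlisted codes stay NULL. */
static const char *const v7_fault_names[32] = {
    [1]  = "alignment fault",
    [2]  = "debug event",
    [3]  = "section access flag fault",
    [4]  = "instruction cache maintenance fault (reported via data abort)",
    [5]  = "section translation fault",
    [6]  = "page access flag fault",
    [7]  = "page translation fault",
    [8]  = "bus error on access",
    [9]  = "section domain fault",
    [11] = "page domain fault",
    [12] = "bus error on section table walk",
    [13] = "section permission fault",
    [14] = "bus error on page table walk",
    [15] = "page permission fault",
    [16] = "(TLB conflict abort)",
    [20] = "(lockdown abort)",
    [22] = "async bus error (reported via data abort)",
    [24] = "async parity/ECC error (reported via data abort)",
    [25] = "parity/ECC error on access",
    [26] = "(coprocessor abort)",
    [28] = "parity/ECC error on section table walk",
    [30] = "parity/ECC error on page table walk",
};

/* Returns "-" for codes the list above leaves empty. */
static const char *v7_fault_name(uint32_t fsr)
{
    const char *name = v7_fault_names[fsr_fs(fsr)];
    return name ? name : "-";
}
```

With this encoding an FSR value whose low four bits are 8 and whose bit 10 is clear decodes to "bus error on access", whatever the domain and write bits say.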
Some entries are patched up near the bottom of fault.c but
many bogus messages remain, for example the "on linefetch" vs
"on non-linefetch" is misleading since no such thing can be
inferred from the fault status on v7. Also, the i-cache
maintenance fault handling looks wrong to me: it should fetch
the actual fault status from IFSR (even though the address
still comes from DFAR) and dispatch based on that.
Async external aborts (async bus error and async parity/ECC
error) give you basically no info. DFAR will contain garbage
hence displaying it will confuse rather than enlighten, a
traceback is pointless since the instruction that caused the
access is long retired, likewise user_mode() doesn't matter
since a transition to kernel space may have happened after
the access that caused the abort. Basically they should be
treated more as an IRQ than as a fault (note they can also be
masked just like irqs). In case of a bus error, it may be
appropriate to just warn about it, or perhaps send a signal
to the current process, although in the latter case it should
have some means to distinguish it from a synchronous bus
error.
At least on the Cortex-A8, a parity/ECC error (whether async
or not) is to be regarded as absolutely fatal. Quoth the
TRM: "No recovery is possible. The abort handler must disable
the caches, communicate the fail directly with the external
system, request a reboot."
Bit 10 no longer indicates an asynchronous (let alone
imprecise) fault. Apart from the debug events and async
aborts (and possibly some implementation-defined aborts), all
aborts listed are synchronous, and DFAR/IFAR is valid.
There's no technical obstruction to making these trappable via
the kernel exception handling mechanism. (Though at least in
case of parity/ECC errors one shouldn't.)
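The handling policy argued for in the last few paragraphs could be summarised roughly as below. This is a sketch of the argument only, not existing kernel code: enum abort_policy and classify_abort are hypothetical names, and the fault status numbers are the ones from the list earlier in this message.

```c
/* Hypothetical policy classes for a decoded 5-bit fault status. */
enum abort_policy {
    ABORT_TRAPPABLE,   /* synchronous, DFAR/IFAR valid: may use exception fixups */
    ABORT_WARN_ASYNC,  /* async bus error: treat like an IRQ, no useful address */
    ABORT_FATAL,       /* parity/ECC error: no recovery is possible */
    ABORT_OTHER        /* debug events, reserved and implementation-defined codes */
};

static enum abort_policy classify_abort(unsigned int fs)
{
    switch (fs) {
    case 24: case 25: case 28: case 30:
        return ABORT_FATAL;          /* parity/ECC, async or not */
    case 22:
        return ABORT_WARN_ASYNC;     /* async bus error */
    case 1: case 3: case 4: case 5: case 6: case 7: case 8: case 9:
    case 11: case 12: case 13: case 14: case 15:
        return ABORT_TRAPPABLE;      /* synchronous aborts from the list */
    default:
        return ABORT_OTHER;
    }
}
```

The point of splitting it this way is that only the first class would ever consult the exception table; the async and parity/ECC classes never look at the faulting address or at user_mode().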
Tony, Nishanth, or somebody else... can you help with memory
management? Or do you know some expert for arch/arm/mm/ code?
Folks in linux-arm-kernel are probably the right people, I suppose.
Looping them in.
--
---
Regards,
Nishanth Menon
Pali Rohár
2015-02-18 21:14:59 UTC
On Wed, Feb 11, 2015 at 2:28 PM, Pali Rohár
On Wednesday 11 February 2015 16:22:51 Matthijs van Duin
On 11 February 2015 at 13:39, Pali Rohár
Anyhow, since checking the firewalls/APs to see if you
have permission will probably only get you yet another
fault if things are walled off, the robust way of
dealing with this sort of situation is by probing the
device with a read while trapping bus faults. This also
handles modules that are unreachable for other reasons,
e.g. being disabled by eFuse.
Is it possible to patch kernel code to mask or ignore
that fault? Can you help me with something like that?
As I mentioned, I'm still learning my way around the
kernel, so I don't feel very comfortable suggesting a
concrete patch just yet. I've been browsing arch/arm/mm/
however and my impression is that all that would be
required is editing fault.c by making a copy of do_bad but
containing
return user_mode(regs) || !fixup_exception(regs);
and hook it onto the appropriate fault codes. However,
this really needs the opinion of someone more familiar
with this code.
I do have an observation to make on the issue of fault
decoding: the list in fsr-2level.c may be "standard ARMv3
and ARMv4 aborts" but they are quite wrong for ARMv7 which
[ 0] -
[ 1] alignment fault
[ 2] debug event
[ 3] section access flag fault
[ 4] instruction cache maintenance fault (reported via data abort)
[ 5] section translation fault
[ 6] page access flag fault
[ 7] page translation fault
[ 8] bus error on access
[ 9] section domain fault
[10] -
[11] page domain fault
[12] bus error on section table walk
[13] section permission fault
[14] bus error on page table walk
[15] page permission fault
[16] (TLB conflict abort)
[17] -
[18] -
[19] -
[20] (lockdown abort)
[21] -
[22] async bus error (reported via data abort)
[23] -
[24] async parity/ECC error (reported via data abort)
[25] parity/ECC error on access
[26] (coprocessor abort)
[27] -
[28] parity/ECC error on section table walk
[29] -
[30] parity/ECC error on page table walk
[31] -
Some entries are patched up near the bottom of fault.c but
many bogus messages remain, for example the "on linefetch"
vs "on non-linefetch" is misleading since no such thing
can be inferred from the fault status on v7. Also, the
i-cache maintenance fault handling looks wrong to me: it
should fetch the actual fault status from IFSR (even
though the address still comes from DFAR) and dispatch
based on that.
Async external aborts (async bus error and async parity/ECC
error) give you basically no info. DFAR will contain
garbage hence displaying it will confuse rather than
enlighten, a traceback is pointless since the instruction
that caused the access is long retired, likewise
user_mode() doesn't matter since a transition to kernel
space may have happened after the access that caused the
abort. Basically they should be treated more as an IRQ
than as a fault (note they can also be masked just like
irqs). In case of a bus error, it may be appropriate to
just warn about it, or perhaps send a signal to the
current process, although in the latter case it should
have some means to distinguish it from a synchronous bus
error.
At least on the Cortex-A8, a parity/ECC error (whether
async or not) is to be regarded as absolutely fatal.
Quoth the TRM: "No recovery is possible. The abort handler
must disable the caches, communicate the fail directly
with the external system, request a reboot."
Bit 10 no longer indicates an asynchronous (let alone
imprecise) fault. Apart from the debug events and async
aborts (and possibly some implementation-defined aborts),
all aborts listed are synchronous, and DFAR/IFAR is valid.
There's no technical obstruction to making these trappable
via the kernel exception handling mechanism. (Though at
least in case of parity/ECC errors one shouldn't.)
Tony, Nishanth, or somebody else... can you help with memory
management? Or do you know some expert for arch/arm/mm/
code?
Folks in linux-arm-kernel are probably the right people, I
suppose. Looping them in.
Hi folks in linux-arm-kernel!

Can you help us with the above problem? How can we catch an external
abort on non-linefetch in a kernel driver and prevent a kernel panic?

Here is that kernel panic log:
http://thread.gmane.org/gmane.linux.ports.arm.omap/108397/

We want to check for "Unhandled fault: external abort on
non-linefetch" and, if it happens, disable some functionality in the
omap-aes.ko kernel driver.
--
Pali Rohár
***@gmail.com
Pali Rohár
2015-02-19 18:20:41 UTC
Anyway, the Nokia Harmattan N9/N950 2.6.32 kernel contains this patch:

diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index 18ca74c..d530d55 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -7,7 +7,12 @@ static struct fsr_info fsr_info[] = {
 	{ do_bad, SIGBUS, BUS_ADRALN, "alignment exception" },
 	{ do_bad, SIGKILL, 0, "terminal exception" },
 	{ do_bad, SIGBUS, BUS_ADRALN, "alignment exception" },
+/* Do we need runtime check ? */
+#if __LINUX_ARM_ARCH__ < 6
 	{ do_bad, SIGBUS, 0, "external abort on linefetch" },
+#else
+	{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "I-cache maintenance fault" },
+#endif
 	{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "section translation fault" },
 	{ do_bad, SIGBUS, 0, "external abort on linefetch" },
 	{ do_page_fault, SIGSEGV, SEGV_MAPERR, "page translation fault" },

Maybe it is related?
--
Pali Rohár
***@gmail.com
Matthijs van Duin
2015-02-19 20:25:49 UTC
Post by Pali Rohár
Can you help us with above problem? How to catch external abort
on non-linefetch in kernel driver and prevent kernel panic?
Actually it's a synchronous bus error that you want to catch, which
however is misreported by linux as "external abort on non-linefetch"
(... but a bus error on a linefetch would produce exactly the same
error). Also, ARM apparently uses the term "external abort" as an
umbrella term for aborts triggered outside the MMU, which includes not
just bus errors but also (uncorrectable) parity/ECC errors.

Anyhow, the core question you mean to ask is: can the "exception"
mechanism currently already in place to trap MMU faults in e.g.
put_user() easily be extended to allow drivers to trap synchronous bus
errors? My impression is that this would in fact be quite easy and I
even outlined a suggested patch, but I'm still a kernel newbie so I
may be way off course.

Although its main use would be for auto-probing, it's maybe worth
mentioning I've met at least one peripheral which also reports bus
errors when writing inappropriate/unsupported *values* to a register.
(Of course when using posted writes you won't get an abort anyhow in
that case, it's only reported via interconnect error logs.)
Post by Pali Rohár
Anyway, in Nokia Harmattan N9/N950 2.6.32 kernel is this patch
In mainline linux the same fix-up is done at runtime rather than
compile time (in exceptions_init() at the bottom of fault.c). Either
way, in my post of the 11th I also mentioned that it looks wrong to
me. I-cache maintenance fault is really a special case in the fault
decoding logic since it means "although you got here via DAbort and
the relevant address is in DFAR, the exception happened on the
instruction side so you need to fetch the fault status from IFSR
instead."
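The special case described above can be modelled in a few lines. This is a sketch under assumptions, not the mainline code: struct abort_regs_model and decode_data_abort are hypothetical names, the register values are passed in rather than read from CP15, and IFSR is assumed to use the same short-descriptor status encoding as DFSR.

```c
#include <stdint.h>

/* Hypothetical snapshot of the fault registers at data abort entry. */
struct abort_regs_model {
    uint32_t dfsr;  /* data fault status */
    uint32_t ifsr;  /* instruction fault status */
    uint32_t dfar;  /* data fault address */
};

struct decoded_fault {
    unsigned int fs;  /* 5-bit fault status to dispatch on */
    uint32_t addr;    /* faulting address */
};

/* FS[4] is bit 10, FS[3:0] is bits 3:0 (short-descriptor format). */
static unsigned int fs_field(uint32_t fsr)
{
    return (fsr & 0xfu) | ((fsr >> 6) & 0x10u);
}

static struct decoded_fault decode_data_abort(const struct abort_regs_model *r)
{
    struct decoded_fault d = { fs_field(r->dfsr), r->dfar };

    /* Status 4 only says "look at IFSR"; the address stays in DFAR. */
    if (d.fs == 4)
        d.fs = fs_field(r->ifsr);
    return d;
}
```

The dispatch table is then indexed with the status taken from IFSR, instead of mapping status 4 to one fixed handler.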
Aaro Koskinen
2015-02-19 21:10:55 UTC
Hi,
Post by Pali Rohár
+/* Do we need runtime check ? */
+#if __LINUX_ARM_ARCH__ < 6
{ do_bad, SIGBUS, 0, "external abort on linefetch" },
+#else
+ { do_translation_fault, SIGSEGV, SEGV_MAPERR, "I-cache maintenance fault" },
+#endif
Maybe it is related?
That was unrelated. Also, the patch is in mainline,
see 8c0b742ca7a7d21de0ddc87eda6ef0b282e4de18 (ARM: 6134/1: Handle
instruction cache maintenance fault properly).

A.
Pali Rohár
2015-05-28 07:37:40 UTC
So pinging linux-arm-kernel again. Any idea how to handle that fault?
--
Pali Rohár
***@gmail.com
Tony Lindgren
2015-05-28 16:01:13 UTC
Post by Pali Rohár
So pinging linux-arm-kernel again. Any idea how to handle that fault?
Here's what might work.. You could patch drivers/bus/omap_l3*.c
code to probe the devices after the omap_l3 driver interrupts
are enabled.

For failed device access you get an interrupt so you know to not
create the struct device entry for that device. For the working
devices you can do the struct device entry and let it probe.

So basically we could make the omap_l3* drivers managers for
the omap bus code instead of probing them with "simple-bus"
and omap_device_build_from_dt().

No need to have these devices probe early, and they are all
internal devices so as long as we know the type and address
for each SoC the omap_l3 driver code could probe them.

It seems that trying to do this early just makes things more
complicated and should be done in the bootloader instead of
kernel if needed early.
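As a rough illustration of the idea above (a bus manager that only registers devices that answered a probing access), here is a userspace model. Everything in it is hypothetical: struct soc_dev_model, the device names and base values are placeholders, and probe_access_ok simply reads a flag where a real implementation would touch the hardware and wait for the interconnect error interrupt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-SoC table entry known to the bus manager. */
struct soc_dev_model {
    const char *name;
    uint32_t base;      /* placeholder address, not a real memory map */
    bool reachable;     /* stands in for "no interconnect error was raised" */
    bool registered;    /* set when a device entry would be created */
};

/* Stand-in for the probing access plus the check for an error interrupt. */
static bool probe_access_ok(const struct soc_dev_model *dev)
{
    return dev->reachable;
}

/* Registers only the devices that answered; returns how many were registered. */
static size_t register_reachable(struct soc_dev_model *devs, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        devs[i].registered = probe_access_ok(&devs[i]);
        if (devs[i].registered)
            count++;
    }
    return count;
}
```

A walled-off module simply never gets a device entry, so its driver is never probed and never faults.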

Regards,

Tony
Matthijs van Duin
2015-05-28 20:26:52 UTC
Post by Tony Lindgren
For failed device access you get an interrupt
Well for failed reads you get a bus error, and "catching" those (e.g.
using the existing exception mechanism used to catch MMU faults) is
the whole issue.

Though now that you mention it, it is true that for writes you won't
get any fault (at least on the DM814x and AM335x the posting point
appears to be the async bridge from MPUSS to the L3 interconnect) but
an interconnect error irq instead. It may be easier to make some kind
of harmless write (e.g. to the version register), wait a bit, and
check if the write triggered an interconnect error.

Feels hackish though: you'd need to be sure you waited long enough
(though using a read from another device on the same L4 interconnect
should be a reliable barrier in this case), and drivers for
receiving/interpreting interconnect errors are not implemented yet on
all SoCs (for some, like the AM335x, TI didn't even bother publishing
the relevant data in its TRM). Interconnect errors can also be lost in
some cases (multiple errors involving the same target in a short time
window) though that problem shouldn't arise in this particular case.

Also, presumably interconnect error reporting is unavailable on HS
devices given the fact that all interconnect registers seemed to be
inaccessible?

Matthijs
Tony Lindgren
2015-05-28 22:24:13 UTC
Post by Matthijs van Duin
Post by Tony Lindgren
For failed device access you get an interrupt
Well for failed reads you get a bus error, and "catching" those (e.g.
using the existing exception mechanism used to catch MMU faults) is
the whole issue.
Though now that you mention it, it is true that for writes you won't
get any fault (at least on the DM814x and AM335x the posting point
appears to be the async bridge from MPUSS to the L3 interconnect) but
an interconnect error irq instead. It may be easier to make some kind
of harmless write (e.g. to the version register), wait a bit, and
check if the write triggered an interconnect error.
Feels hackish though: you'd need to be sure you waited long enough
(though using a read from another device on the same L4 interconnect
should be a reliable barrier in this case), and drivers for
receiving/interpreting interconnect errors are not implemented yet on
all SoCs (for some, like the AM335x, TI didn't even bother publishing
the relevant data in its TRM). Interconnect errors can also be lost in
some cases (multiple errors involving the same target in a short time
window) though that problem shouldn't arise in this particular case.
Hmm I believe the interrupt happens immediately trying to access an
invalid device. But maybe I'm thinking about just errors if a device
is not powered or clocked. So obviously some experiments need to be
done :)

The advantage here would be that the l3 driver actually already knows
quite a bit about the devices on the bus.
Post by Matthijs van Duin
Also, presumably interconnect error reporting is unavailable on HS
devices given the fact that all interconnect registers seemed to be
inaccessible?
Oh OK yeah then that would not work for Pali's case. I guess it just
needs to be tested.

Regards,

Tony
Pali Rohár
2015-05-28 22:27:21 UTC
Ok, thanks for the info. Do you have some quick small patches for testing?
Or some pointers on what needs to be modified and how?
--
Pali Rohár
***@gmail.com
Tony Lindgren
2015-05-29 00:15:13 UTC
Post by Pali Rohár
Ok, thanks for the info. Do you have some quick small patches for testing?
Or some pointers on what needs to be modified and how?
Well I guess the initial test would be to make sure you have
CONFIG_OMAP_INTERCONNECT=y, comment out status = "disabled" in
omap3-n900.dts for aes, patch in the aes hwmod data, check that
you have CONFIG_CRYPTO_DEV_OMAP_AES=y, boot the kernel.

Do you get just the l3_smx interrupt instead of the "Unhandled fault"?

If so then we can use the interrupt handler to make the probe fail.
Not sure yet what would be the best way to do that though :)

Regards,

Tony
Matthijs van Duin
2015-05-29 00:58:29 UTC
Post by Tony Lindgren
Hmm I believe the interrupt happens immediately trying to access an
invalid device. But maybe I'm thinking about just errors if a device
is not powered or clocked.
It is only guaranteed to happen immediately (before the next
instruction is executed) if the error occurs before the posting-point
of the write. However, in that case the error is reported in-band to
the cpu, resulting in a (synchronous) bus error which takes precedence
over the out-of-band error irq (if any is signalled). Once the write
is posted however, the cpu will receive an ack on the write and
continue execution, and there's no reason to assume that an error irq
will happen *immediately* after the write.

Of course it typically will happen soon afterwards, possibly even
before the next instruction is executed, depending a bit on how soon
after the posting-point the error occurs versus how long it takes for
the write-ack to reach the cpu. On the other hand, it's also possible
the write, after becoming posted, gets stuck for a while due to a
burst of higher-priority traffic. (I also recall reading about some
situation where a request needs to wait for something to be
dynamically powered up before an error response could be generated,
but I think that was on the OMAP 4.)

So that's the icky part: it will very likely happen almost
immediately. There's however no *guarantee* that it will, and in fact
it's quite tricky to absolutely make sure a write is no longer in
transit. The usual solution is an "OCP barrier": a read that is known
to follow the same path as the write. Normally that means a read from
the same peripheral, but that would defeat the purpose in this case.
Fortunately, the L4 interconnects (unlike the L3) detect firewall
violations in the initiator agent rather than the target agents, hence
a read from any peripheral on the same L4 interconnect suffices.
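
A minimal sketch of the barrier idea described above, in C. The helper name and both pointer parameters are made up for illustration (this is not an existing kernel API). Plain volatile accesses are an assumption: real driver code would go through the kernel's I/O accessors, and whether extra CPU-side barriers are needed depends on how the region is mapped.

```c
#include <stdint.h>

/*
 * Hypothetical helper: issue a write that may be posted, then force it
 * out of the interconnect by reading a register that shares its path.
 *
 * target      - register being written (possibly firewalled)
 * barrier_reg - readable register of another peripheral on the same L4
 */
static inline uint32_t write_then_ocp_barrier(volatile uint32_t *target,
                                              uint32_t value,
                                              const volatile uint32_t *barrier_reg)
{
    /* The cpu may continue as soon as the write has been acked. */
    *target = value;

    /* A read cannot be posted: the cpu waits for the data to come back. */
    return *barrier_reg;
}
```

The value returned is irrelevant; only the completion of the read matters. Any out-of-band error report for the write would have to be checked after this call returns.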
Matthijs van Duin
2015-05-29 01:35:22 UTC
Post by Matthijs van Duin
It is only guaranteed to happen immediately (before the next
instruction is executed) if the error occurs before the posting-point
of the write. However, in that case the error is reported in-band to
the cpu, resulting in a (synchronous) bus error which takes precedence
over the out-of-band error irq (if any is signalled).
OK, all this was actually assuming linux uses device-type mappings for
device mappings, which was also the impression I got from
build_mem_type_table() in arch/arm/mm/mmu.c (although it's a bit of a
maze). A quick test however seems to imply otherwise:

~# ./bogus-dev-write
Bus error

So... linux actually uses strongly-ordered mappings? I really didn't
expect that, given the performance implications (especially on a
strictly in-order cpu like the Cortex-A8 which will really just sit
there picking its nose until the write completes) and I think I recall
having seen an OCP barrier being used somewhere in driver code...

Well, in that case everything I said is technically still true, except
the posting point is the peripheral itself. That also means the
interconnect error reporting mechanism is not really useful for
probing since you'll get a bus error before any error irq is
delivered.

So I'd say you're back at having to trap that bus error using the
exception handling mechanism, which I still suspect shouldn't be hard
to do.
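
As a rough userspace analogue of that idea (not the in-kernel mechanism), a process can trap the SIGBUS that Linux delivers for such a bus error and turn it into a return code. This is only a sketch: probe_with_trap() and the two access callbacks are hypothetical names, and fake_faulting_access() merely raises SIGBUS by hand so the logic can be exercised without hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

static sigjmp_buf probe_env;

/* SIGBUS handler: abandon the faulting access and jump back to the probe. */
static void probe_sigbus(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Example access: a single 32-bit read from the given address. */
static void read_word(volatile void *addr)
{
    (void)*(volatile uint32_t *)addr;
}

/* Stand-in for an access to a walled-off device (assumption for testing). */
static void fake_faulting_access(volatile void *addr)
{
    (void)addr;
    raise(SIGBUS);
}

/*
 * Hypothetical helper: run access(addr) with SIGBUS trapped.
 * Returns 0 if the access completed, -1 if it raised SIGBUS,
 * -2 if the handler could not be installed.
 */
static int probe_with_trap(void (*access)(volatile void *), volatile void *addr)
{
    struct sigaction sa, old;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) < 0)
        return -2;

    /* savemask=1 so the signal mask is restored after siglongjmp(). */
    if (sigsetjmp(probe_env, 1) == 0) {
        access(addr);
        rc = 0;
    } else {
        rc = -1;
    }

    sigaction(SIGBUS, &old, NULL);
    return rc;
}
```

With a real /dev/mem mapping, read_word() would be pointed at the device's version register, and a return value of -1 would mean the module is unreachable.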

Or perhaps you could probe the device using a DMA access and combine
that with the interconnect error reporting irq... ;-)
Tony Lindgren
2015-05-29 15:50:31 UTC
Post by Matthijs van Duin
Post by Matthijs van Duin
It is only guaranteed to happen immediately (before the next
instruction is executed) if the error occurs before the posting-point
of the write. However, in that case the error is reported in-band to
the cpu, resulting in a (synchronous) bus error which takes precedence
over the out-of-band error irq (if any is signalled).
OK, all this was actually assuming linux uses device-type mappings for
device mappings, which was also the impression I got from
build_mem_type_table() in arch/arm/mm/mmu.c (although it's a bit of a
maze). A quick test however seems to imply otherwise:
~# ./bogus-dev-write
Bus error
So... linux actually uses strongly-ordered mappings? I really didn't
expect that, given the performance implications (especially on a
strictly in-order cpu like the Cortex-A8 which will really just sit
there picking its nose until the write completes) and I think I recall
having seen an OCP barrier being used somewhere in driver code...
I believe some TI kernels use strongly-ordered mappings, mainline
kernel does not. Which kernel version are you using?
Post by Matthijs van Duin
Well, in that case everything I said is technically still true, except
the posting point is the peripheral itself. That also means the
interconnect error reporting mechanism is not really useful for
probing since you'll get a bus error before any error irq is
delivered.
Hmm if that's the case then yes we can't use the error irq. However,
what I've seen so far is that we only get the bus error if the
l3_* drivers are configured. I guess some more testing is needed.
Post by Matthijs van Duin
So I'd say you're back at having to trap that bus error using the
exception handling mechanism, which I still suspect shouldn't be hard
to do.
And in that case it makes sense to do that in the bootloader to
avoid adding any custom early boot code to Linux kernel.
Post by Matthijs van Duin
Or perhaps you could probe the device using a DMA access and combine
that with the interconnect error reporting irq... ;-)
Heh too many dependencies :)

Tony
Tony Lindgren
2015-05-29 18:16:15 UTC
Post by Tony Lindgren
Post by Matthijs van Duin
Post by Matthijs van Duin
It is only guaranteed to happen immediately (before the next
instruction is executed) if the error occurs before the posting-point
of the write. However, in that case the error is reported in-band to
the cpu, resulting in a (synchronous) bus error which takes precedence
over the out-of-band error irq (if any is signalled).
OK, all this was actually assuming linux uses device-type mappings for
device mappings, which was also the impression I got from
build_mem_type_table() in arch/arm/mm/mmu.c (although it's a bit of a
maze). A quick test however seems to imply otherwise:
~# ./bogus-dev-write
Bus error
So... linux actually uses strongly-ordered mappings? I really didn't
expect that, given the performance implications (especially on a
strictly in-order cpu like the Cortex-A8 which will really just sit
there picking its nose until the write completes) and I think I recall
having seen an OCP barrier being used somewhere in driver code...
I believe some TI kernels use strongly-ordered mappings, mainline
kernel does not. Which kernel version are you using?
Post by Matthijs van Duin
Well, in that case everything I said is technically still true, except
the posting point is the peripheral itself. That also means the
interconnect error reporting mechanism is not really useful for
probing since you'll get a bus error before any error irq is
delivered.
Hmm if that's the case then yes we can't use the error irq. However,
what I've seen so far is that we only get the bus error if the
l3_* drivers are configured. I guess some more testing is needed.
Post by Matthijs van Duin
So I'd say you're back at having to trap that bus error using the
exception handling mechanism, which I still suspect shouldn't be hard
to do.
And in that case it makes sense to do that in the bootloader to
avoid adding any custom early boot code to Linux kernel.
Post by Matthijs van Duin
Or perhaps you could probe the device using a DMA access and combine
that with the interconnect error reporting irq... ;-)
Heh too many dependencies :)
If we can't use the l3 interrupts, then something similar to commit
fdf4850cb5b2 ("ARM: BCM5301X: workaround suppress fault") might be
doable too.

Regards,

Tony
Matthijs van Duin
2015-05-30 15:22:11 UTC
Post by Tony Lindgren
I believe some TI kernels use strongly-ordered mappings, mainline
kernel does not. Which kernel version are you using?
Normally I periodically rebuild based on Robert C Nelson's -bone
kernel (but with heavily customized config). I also tried a plain
4.1.0-rc5-bone3, the generic 4.1.0-rc5-armv7-x0 (the most
vanilla-looking kernel I could find in my debian package list), and
for the heck of it also the classic 3.14.43-ti-r66.

In all cases I observed a synchronous bus error (dubiously reported as
"external abort on non-linefetch (0x1818)") on an AM335x with this
trivial test:

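#include <fcntl.h>      /* open, O_RDWR, O_DSYNC */
#include <sys/mman.h>   /* mmap, PROT_WRITE, MAP_SHARED, MAP_FAILED */

int main() {
	int fd = open( "/dev/mem", O_RDWR | O_DSYNC );
	if( fd < 0 ) return 1;
	/* map one page of physical address space at 0x42000000 */
	void *ptr = mmap( NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0x42000000 );
	if( ptr == MAP_FAILED ) return 1;
	*(volatile int *)ptr = 0;   /* the write that produces the bus error */
	return 0;
}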

I even considered for a moment that maybe the AM335x has some "all
writes non-posted" thing enabled (which I think is available as a
switch on OMAP 4/5?). It seemed unlikely, but since most of my
exploration of interconnect behaviour was done on a DM814x, I
double-checked by performing the same write in a baremetal test
program (with that region configured Device-type in the MMU). As
expected, no data abort occurred, so writes most certainly are posted.

So I have trouble coming up with any explanation for this other than
the use of strongly-ordered mappings.

(Curiously BTW, omitting O_DSYNC made no difference.)
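
For reference, the 0x1818 value in that message can be unpacked by hand. The sketch below assumes the ARMv7 short-descriptor DFSR layout (fault status in bit 10 and bits 3:0, domain in bits 7:4, WnR in bit 11, ExT in bit 12). The struct and function names are invented for illustration.

```c
#include <stdint.h>

/* Fields of the ARMv7 short-descriptor Data Fault Status Register. */
struct dfsr_fields {
    unsigned fs;      /* 5-bit fault status: bit 10 and bits 3:0 */
    unsigned domain;  /* bits 7:4 */
    int is_write;     /* WnR, bit 11: 1 if the abort was caused by a write */
    int ext;          /* ExT, bit 12: implementation-defined external abort type */
};

/* Hypothetical helper: split a raw DFSR value into its fields. */
static struct dfsr_fields decode_dfsr(uint32_t fsr)
{
    struct dfsr_fields f;

    f.fs = (fsr & 0xf) | ((fsr >> 6) & 0x10);  /* move bit 10 down to bit 4 */
    f.domain = (fsr >> 4) & 0xf;
    f.is_write = (fsr >> 11) & 1;
    f.ext = (fsr >> 12) & 1;
    return f;
}
```

decode_dfsr(0x1818) gives fs = 8 with the write bit set, i.e. the "bus error on access" entry in the fault list quoted earlier in the thread, rather than anything to do with linefetches.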
Tony Lindgren
2015-06-01 17:58:07 UTC
Post by Matthijs van Duin
Post by Tony Lindgren
I believe some TI kernels use strongly-ordered mappings, mainline
kernel does not. Which kernel version are you using?
Normally I periodically rebuild based on Robert C Nelson's -bone
kernel (but with heavily customized config). I also tried a plain
4.1.0-rc5-bone3, the generic 4.1.0-rc5-armv7-x0 (the most
vanilla-looking kernel I could find in my debian package list), and
for the heck of it also the classic 3.14.43-ti-r66.
In all cases I observed a synchronous bus error (dubiously reported as
"external abort on non-linefetch (0x1818)") on an AM335x with this
trivial test:
int main() {
int fd = open( "/dev/mem", O_RDWR | O_DSYNC );
if( fd < 0 ) return 1;
void *ptr = mmap( NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0x42000000 );
if( ptr == MAP_FAILED ) return 1;
*(volatile int *)ptr = 0;
return 0;
}
I even considered for a moment that maybe the AM335x has some "all
writes non-posted" thing enabled (which I think is available as a
switch on OMAP 4/5?). It seemed unlikely, but since most of my
exploration of interconnect behaviour was done on a DM814x, I
double-checked by performing the same write in a baremetal test
program (with that region configured Device-type in the MMU). As
expected, no data abort occurred, so writes most certainly are posted.
So I have trouble coming up with any explanation for this other than
the use of strongly-ordered mappings.
(Curiously BTW, omitting O_DSYNC made no difference.)
I think these kernels are missing the configuration for l3-noc
driver?

I tried it on omap4 that has l3-noc configured, and it first produces
"Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
and the L3 interrupt only after that. So yeah, you're right, we can't
use the interrupts here. I somehow remembered we'd get only the L3
interrupt if configured.

Regards,

Tony
Matthijs van Duin
2015-06-01 20:32:47 UTC
Post by Tony Lindgren
I think these kernels are missing the configuration for l3-noc
driver?
Yup. Since I'm pretty sure I have all the necessary info I was hoping
to look into that... somewhere in my copious spare time...
Post by Tony Lindgren
I tried it on omap4 that has l3-noc configured, and it first produces
"Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
(Though making a patch to fix that annoyingly wrong and useless
message is higher on my list of priorities)
Post by Tony Lindgren
and the L3 interrupt only after that. So yeah, you're right, we can't
use the interrupts here. I somehow remembered we'd get only the L3
interrupt if configured.
The bus error is not influenced by L3 error reporting config afaik,
and it will always win over the irq: even though the irq is almost
certainly asserted first, it can't be taken until the load/store
instruction completes, and then the fault will take precedence.

While implementing L3 error reporting in my forth system I ran into a
tricky scenario though: it turns out that if an irq occurs while the
cpu is waiting for instruction fetch, it does allow the irq to be
taken. The interrupted fetch is abandoned and any bus error it may
have produced is ignored since exception entry/exit is an implicit
instruction sync barrier. On return it is simply refetched...

Hence, the result from attempting to execute code from an invalid address:
fetching from [invalid]
irq entry (LR=[invalid])
L3 error displayed
irq exit
fetching from [invalid]
irq entry (LR=[invalid])
L3 error displayed
irq exit
fetching from [invalid]
...
(repeat until watchdog expires)


Anyhow, so we still have the puzzling fact that apparently neither of
us was expecting device memory to use a strongly-ordered mapping, yet
getting a bus error on a write (outside MPUSS itself) shows that it
does.

I've tried to read arch/arm/mm/mmu.c to find out why, but so far I'm
feeling hopelessly lost there... (the multitude of ARM architecture
versions/flavors supported aren't helping.)
Tony Lindgren
2015-06-01 20:52:18 UTC
Post by Matthijs van Duin
Post by Tony Lindgren
I think these kernels are missing the configuration for l3-noc
driver?
Yup. Since I'm pretty sure I have all the necessary info I was hoping
to look into that... somewhere in my copious spare time...
Post by Tony Lindgren
I tried it on omap4 that has l3-noc configured, and it first produces
"Unhandled fault: external abort on non-linefetch (0x1818) at 0xb6fd7000",
(Though making a patch to fix that annoyingly wrong and useless
message is higher on my list of priorities)
Post by Tony Lindgren
and the L3 interrupt only after that. So yeah, you're right, we can't
use the interrupts here. I somehow remembered we'd get only the L3
interrupt if configured.
The bus error is not influenced by L3 error reporting config afaik,
and it will always win from the irq: even though the irq is almost
certainly asserted first, it can't be taken until the load/store
instruction completes, and then the fault will take precedence.
While implementing L3 error reporting in my forth system I ran into a
tricky scenario though: it turns out that if an irq occurs while the
cpu is waiting for instruction fetch, it does allow the irq to be
taken. The interrupted fetch is abandoned and any bus error it may
have produced is ignored since exception entry/exit is an implicit
instruction sync barrier. On return it is simply refetched...
Hence, the result from attempting to execute code from an invalid address:
fetching from [invalid]
irq entry (LR=[invalid])
L3 error displayed
irq exit
fetching from [invalid]
irq entry (LR=[invalid])
L3 error displayed
irq exit
fetching from [invalid]
...
(repeat until watchdog expires)
OK that must be the case I've seen then. Probably that happens
when a device is not clocked.
Post by Matthijs van Duin
Anyhow, so we still have the puzzling fact that apparently neither of
us was expecting device memory to use a strongly-ordered mapping,
getting a bus error on a write (outside MPUSS itself) shows that it
does.
Hmm well it should be just MT_DEVICE for anything Linux ioremaps..
Care to verify that from a device driver that does ioremap on it
first?
Post by Matthijs van Duin
I've tried to read arch/arm/mm/mmu.c to find out why, but so far I'm
feeling hopelessly lost there... (the multitude of ARM architecture
versions/flavors supported aren't helping.)
Heh yeah too much hardware churn going on :)

Regards,

Tony
Matthijs van Duin
2015-06-02 04:21:20 UTC
Post by Tony Lindgren
OK that must be the case I've seen then. Probably that happens
when a device is not clocked.
It happens for any interconnect error reported as a result of
instruction fetch, but that is itself not a very common occurrence and
obviously doesn't apply to device memory.

Another case where the L3 error irq may be taken first is if the bus
error is asynchronous. I don't think this combo can be produced on
a dm81xx or am335x, though that's mainly due to the strictly in-order
Cortex-A8 making almost every abort synchronous. I'd expect async
aborts are more common on an A9.
Post by Tony Lindgren
Hmm well it should be just MT_DEVICE for anything Linux ioremaps..
Yikes, so both /dev/mem and uio are behaving unlike any device driver:
both use remap_pfn_range() after running the vm_page_prot through
pgprot_noncached() to set the memory type to L_PTE_MT_UNCACHED, which
counterintuitively turns out to mean strongly-ordered. o.O Especially
uio is acting inappropriately here imho.
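
To make the difference concrete, here is a small userspace model of what pgprot_noncached() does to the memory-type field in the 2-level page table format. The L_PTE_MT_* values are believed to match arch/arm/include/asm/pgtable-2level.h but should be checked against the actual tree. model_pgprot_noncached() is a made-up name for illustration, not the kernel macro itself.

```c
#include <stdint.h>

/*
 * Memory-type field of a Linux PTE (2-level format). Values are assumed
 * to match arch/arm/include/asm/pgtable-2level.h; verify before relying
 * on them.
 */
#define L_PTE_MT_UNCACHED    (0x00u << 2)  /* turns out to mean strongly-ordered */
#define L_PTE_MT_DEV_SHARED  (0x04u << 2)  /* type used by MT_DEVICE (ioremap) */
#define L_PTE_MT_MASK        (0x0fu << 2)

/* Model of pgprot_noncached(): replace the memory type, keep other bits. */
static uint32_t model_pgprot_noncached(uint32_t prot)
{
    return (prot & ~L_PTE_MT_MASK) | L_PTE_MT_UNCACHED;
}
```

Whatever memory type the protection bits carried before, the result is always the all-zero "uncached" type, which is why a /dev/mem or uio mapping behaves differently from an ioremap()ed one.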

But this is problematic... these ranges are already mapped by the
kernel, and ARM explicitly forbids mapping the same physical range
twice with different memory attributes (and it's not the only
architecture to do so). Hmmz...

Anyhow, drifting a bit off-topic here. I'm going to do some more reading
and thinking about this.
