Discussion:
4.4 BCM5301X ARM regression "External imprecise Data abort"
Rafał Miłecki
2016-04-04 06:13:00 UTC
Hi guys,

I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.

It happens that Linux 4.4 doesn't boot due to the following commits:
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")

In kernel 4.3 we had the abort workaround, which resulted in:
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort happened soon after freeing unused
memory, and ignoring it *once* did the trick. It never appeared
again.

With 4.4 similar (or the same?) abort happens earlier (during PCI host
driver init) and doesn't get ignored:
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem 0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)

At first I hoped that we simply needed to re-add the removed
workaround. I tried it, but it turned out that one abort is immediately
followed by another:
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000

So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting one abort late, we are now getting many of them, and a
bit earlier.

Reverting all three commits from the top of 4.4.6 gives me back a
working & booting kernel.

Do you have any idea how to fix this regression (and hopefully
the original problem as well)?
--
Rafał
Scott Branden
2016-04-04 21:08:45 UTC
Hi Rafal,

I do not work on BCM5301x SoCs but perhaps Jon Mason can comment.
A few comments inline as well.
Post by Rafał Miłecki
Hi guys,
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem 0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
We do not observe such issues in Cygnus and other SoCs that use this
PCIe driver (we do not use bcma either - I do not know if that is related).
Post by Rafał Miłecki
Reverting all three commits from the top of 4.4.6 gives me back a
working & booting kernel.
Do you have any idea how to fix this regression (and hopefully
original problem as well)?
I think the proper fix is to correct the issues in the bootloader. It
was my understanding from Jon Mason that this is the root of the
original problem.
Regards,
Scott
Hauke Mehrtens
2016-04-04 21:23:25 UTC
Hi Rafal,
Post by Scott Branden
Hi Rafal,
I do not work on BCM5301x SoCs but perhaps Jon Mason can comment.
A few comments inline as well.
Post by Rafał Miłecki
Hi guys,
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
I assume it can only throw one of these, and if it is deactivated it will
ignore the next one or overwrite it. So it could be that more than one
was thrown here.
Post by Scott Branden
Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem
0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem
0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
These commits made the kernel "listen" for such errors earlier, so that
they are reported when they occur and not some time later.
Post by Scott Branden
We do not observe such issues in Cygnus and other SoCs that use this
PCIe driver (we do not use bcma either - I do not know if that is related).
Post by Rafał Miłecki
Reverting all three commits from the top of 4.4.6 gives me back a
working & booting kernel.
Do you have any idea how to fix this regression (and hopefully
original problem as well)?
I think the proper fix is to correct the issues in the bootloader. It
was my understanding from Jon Mason that this is the root of the
original problem.
I think this is a new problem.

There was a comment in the Broadcom SDK saying that the bootloader is
probably broken and that this causes the fault, which was worked around
in the mainline kernel with the fault handler in the brcm code.

When I added the code, Arnd asked me if this SoC has a PCIe controller,
because he saw such a problem on another SoC with a PCIe controller.
https://www.spinics.net/lists/arm-kernel/msg298112.html

As this is now happening in the PCIe code I assume that this has
something to do with PCIe. ;-)

Have you tried deactivating PCIe support in the device tree to see what
happens? Have you tried loading the PCIe controller as a module later on
to see if that makes a difference?

Hauke
Rafał Miłecki
2016-04-08 06:45:15 UTC
Post by Hauke Mehrtens
Post by Scott Branden
Post by Rafał Miłecki
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
I assume it only can throw one of these and if it is deactivated it will
ignore the next one or overwrite it. So it could be that more than one
is thrown here.
Post by Scott Branden
Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem
0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem
0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
These commits made the kernel "listen" for such errors earlier, so that
they are reported when they occur and not some time later.
So AFAIU with kernel 4.3:
1) Aborts were masked (silent) until "Freeing unused kernel memory"
2) There was one (silent) abort caused by a bootloader
3) There were likely multiple aborts (silent) during early PCI init
4) After unmasking we got only a single abort reported and we were ignoring it

With kernel 4.4:
1) All aborts are reported immediately
2) Abort caused by a bootloader gets ignored by ARM code:
"Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
3) There are still multiple aborts during PCI init (reported immediately now)
4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one

Of course the proposed solution is an ugly workaround; we should have no
aborts reported in the first place.
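
The difference between the two lists above can be modelled roughly in userspace C. This is only a toy sketch: it assumes, as suggested earlier in the thread, that the core has a single pending-abort latch while asynchronous aborts are masked, so any number of masked aborts collapse into one. The names `abort_model`, `raise_abort`, `unmask_aborts` and `run_boot`, and the abort counts used with them, are hypothetical and not taken from the kernel.

```c
#include <stdbool.h>

/* Toy model of a core with one pending-abort latch (an assumption). */
struct abort_model {
	bool masked;            /* stands in for "asynchronous aborts masked" */
	bool pending;           /* single latch: cannot count more than one */
	unsigned int delivered; /* aborts actually taken by the "kernel" */
};

static void raise_abort(struct abort_model *m)
{
	if (m->masked)
		m->pending = true; /* further aborts collapse into the latch */
	else
		m->delivered++;    /* unmasked: reported immediately */
}

static void unmask_aborts(struct abort_model *m)
{
	m->masked = false;
	if (m->pending) {
		m->pending = false;
		m->delivered++;    /* exactly one abort fires on unmask */
	}
}

/*
 * Boot with aborts masked, raise `before` aborts, unmask, then raise
 * `after` more. Returns how many aborts the "kernel" had to handle.
 */
static unsigned int run_boot(unsigned int before, unsigned int after)
{
	struct abort_model m = { .masked = true };
	unsigned int i;

	for (i = 0; i < before; i++)
		raise_abort(&m);
	unmask_aborts(&m);
	for (i = 0; i < after; i++)
		raise_abort(&m);
	return m.delivered;
}
```

With made-up counts, `run_boot(5, 0)` models 4.3 (one bootloader abort plus four PCI aborts, all before the late unmask) and yields a single delivered abort, while `run_boot(1, 4)` models 4.4 (early unmask, PCI aborts afterwards) and yields five, which is why ignoring only the first one is no longer enough.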
Post by Hauke Mehrtens
Post by Scott Branden
We do not observe such issues in Cygnus and other SoCs that use this
PCIe driver (we do not use bcma either - I do not know if that is related).
Post by Rafał Miłecki
Reverting all three commits from the top of 4.4.6 gives me back a
working & booting kernel.
Do you have any idea how to fix this regression (and hopefully
original problem as well)?
I think the proper fix is to correct the issues in the bootloader. It
was my understanding from Jon Mason that this is the root of the
original problem.
I think this is a new problem.
In the Broadcom SDK was a comment saying that probably the bootloader is
broken and that causes this fault which was worked around in the
mainline kernel with the fault handler in the brcm code.
There is indeed some issue with the bootloader, but with kernel 4.4 it
seems to be handled by the ARM arch code. There is now this nice
workaround, which I see when booting 4.4:
[ 0.000000] Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask, this is most likely caused by a firmware/bootloader bug.

It seems we were lucky so far (in 4.3 and older) thanks to all aborts
being squashed into a single one. We meant to ignore the abort caused by
the bootloader, but we were also ignoring many more aborts triggered by iproc.
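
For reference, the two FSR values seen in this thread (0x1c06 for the bootloader abort, 0x1406 for the aborts during PCIe init) can be decoded with a few lines of userspace C. This is a sketch that assumes the ARMv7 short-descriptor DFSR layout: fault status in bits [3:0] plus bit 10, a write-not-read flag in bit 11 and an external-abort-type bit in bit 12. The helper names are hypothetical, and whether bit 11 is meaningful for an asynchronous abort should be checked against the architecture manual before drawing conclusions from it.

```c
#include <stdbool.h>

#define FSR_FS3_0 0x0fu        /* fault status, low four bits */
#define FSR_FS4   (1u << 10)   /* fault status, fifth bit */
#define FSR_WNR   (1u << 11)   /* write-not-read flag */
#define FSR_EXT   (1u << 12)   /* external abort type, implementation defined */

/* Combine FS[3:0] and FS[4] into one 5-bit fault status value. */
static unsigned int fsr_fault_status(unsigned int fsr)
{
	return (fsr & FSR_FS3_0) | ((fsr & FSR_FS4) >> 6);
}

/* True if the write-not-read bit is set in this FSR value. */
static bool fsr_wnr_set(unsigned int fsr)
{
	return (fsr & FSR_WNR) != 0;
}

/* 0x16 (16 + 6) is the status the kernel prints as "imprecise external abort". */
static bool fsr_is_imprecise_external_abort(unsigned int fsr)
{
	return fsr_fault_status(fsr) == 0x16;
}
```

Both values decode to fault status 0x16; they differ only in bit 11, which is set in 0x1c06 and clear in 0x1406.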
Post by Hauke Mehrtens
When I added the code Arnd asked me if this SoC has a PCIe controller
because he saw such a problem on an other SoC with a PCIe controller.
https://www.spinics.net/lists/arm-kernel/msg298112.html
As this is now happening in the PCIe code I assume that this has
something to do with PCIe. ;-)
Have you tried to deactivate PCIe support in Device tree and see what
happens? Have you tried to load the PCIe controller as a module later on
so if that makes a difference?
So this definitely looks like something PCIe related. I modified
OpenWrt config to build iproc as module and load it late.

As said earlier, the early on-unmask abort is handled nicely by the ARM
arch code:
[ 0.000000] Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask, this is most likely caused by a firmware/bootloader bug.

Then many modules load nicely, I'm seeing:
[ 2.959868] Freeing unused kernel memory: 212K (c0443000 - c0478000)
without any abort at this point.

And it goes cleanly farther until loading pcie-iproc-bcma.ko. At some
point PCIe initialization starts triggering aborts:
[ 10.547032] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 10.555199] External imprecise Data abort at addr=0x1c6e00f, fsr=0x1406 ignored.
[ 10.562635] Unhandled fault: imprecise external abort (0x1406) at 0x01c6e00f
(backtrace here, see 4.4-iproc-module.txt)

With kernel 4.3 the same place of PCIe init looked like this:
[ 2.926866] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.935051] pci 0001:02:00.0: unknown header type 12, ignoring device
[ 2.942154] pci 0001:02:03.0: [Firmware Bug]: reg 0x10: invalid BAR (can't size)
[ 2.949595] pci 0001:02:03.0: [Firmware Bug]: reg 0x14: invalid BAR (can't size)
[ 2.957014] pci 0001:02:03.0: [Firmware Bug]: reg 0x18: invalid BAR (can't size)
(...)


Arnd: did you find any solution for those aborts triggered during PCIe init?
--
Rafał
Lucas Stach
2016-04-08 08:43:13 UTC
Post by Rafał Miłecki
Post by Hauke Mehrtens
Post by Rafał Miłecki
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
I assume it only can throw one of these and if it is deactivated it will
ignore the next one or overwrite it. So it could be that more than one
is thrown here.
Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem
0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem
0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
These commits made the kernel "listen" for such errors earlier, so that
they are reported when they occur and not some time later.
1) Aborts were masked (silent) until "Freeing unused kernel memory"
2) There was one (silent) abort caused by a bootloader
3) There were likely multiple aborts (silent) during early PCI init
4) After unmasking we got only a single abort reported and we were ignoring it
1) All aborts are reported immediately
"Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
3) There are still multiple aborts during PCI init (reported immediately now)
4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
Of course proposed solution is an ugly workaround, we should have no
aborts reported in the first place.
A master abort on the PCI bus during probe of the PCI config space
(device enumeration) is expected. Most host bridges ignore those errors
and just return 0 for the read transaction.

Some bridges forward the error onto the AXI/AMBA bus and thus cause
imprecise external aborts on the ARM core. If your host bridge doesn't
have a way to disable error forwarding during PCI bus probe you need to
install an abort handler. Most implementations based on the designware
PCIe core do this already.
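
The shape of such an abort handler can be sketched in plain C. This is a userspace model, not a posted patch: `struct pt_regs` is only forward-declared as a stand-in for the kernel type, and `pcie_abort_handler` and `ignored_aborts` are hypothetical names. In an ARM kernel, a function with this signature would be registered for fault status 16 + 6 through `hook_fault_code()`, where returning 0 means "handled" and non-zero lets the fault be reported as usual.

```c
struct pt_regs; /* opaque stand-in for the kernel structure */

#define FS_IMPRECISE_EXTERNAL_ABORT 0x16u /* fault status 16 + 6 */

/* Hypothetical counter, handy for checking how many aborts were swallowed. */
static unsigned int ignored_aborts;

static int pcie_abort_handler(unsigned long addr, unsigned int fsr,
			      struct pt_regs *regs)
{
	/* FS[3:0] lives in bits [3:0], FS[4] in bit 10 of the FSR. */
	unsigned int fs = (fsr & 0x0fu) | ((fsr & (1u << 10)) >> 6);

	(void)addr; /* not meaningful for an imprecise abort */
	(void)regs;

	if (fs == FS_IMPRECISE_EXTERNAL_ABORT) {
		ignored_aborts++;
		return 0; /* handled: execution continues */
	}

	return 1; /* anything else should still be reported as a fault */
}
```

Note that a handler like this swallows every imprecise external abort, including ones that signal a real bus error, so a real driver would probably want to restrict it to the window in which config space is being probed.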

Regards,
Lucas
Ray Jui
2016-04-08 22:02:20 UTC
Hi Lucas,
Post by Lucas Stach
Post by Rafał Miłecki
Post by Hauke Mehrtens
Post by Rafał Miłecki
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
I assume it only can throw one of these and if it is deactivated it will
ignore the next one or overwrite it. So it could be that more than one
is thrown here.
Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem
0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem
0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
These commits made the kernel "listen" for such errors earlier, so that
they are reported when they occur and not some time later.
1) Aborts were masked (silent) until "Freeing unused kernel memory"
2) There was one (silent) abort caused by a bootloader
3) There were likely multiple aborts (silent) during early PCI init
4) After unmasking we got only a single abort reported and we were ignoring it
1) All aborts are reported immediately
"Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
3) There are still multiple aborts during PCI init (reported immediately now)
4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
Of course proposed solution is an ugly workaround, we should have no
aborts reported in the first place.
A master abort on the PCI bus during probe of the PCI config space
(device enumeration) is expected. Most host bridges ignore those errors
and just return 0 for the read transaction.
Some bridges forward the error onto the AXI/AMBA bus and thus cause
imprecise external aborts on the ARM core.
Yes, I suspect this is the case for these imprecise external aborts
triggered by the iProc PCIe.
Post by Lucas Stach
If your host bridge doesn't
have a way to disable error forwarding during PCI bus probe you need to
install an abort handler. Most implementations based on the designware
PCIe core do this already.
Is this as simple as registering an abort handler with the hook in the
iProc PCIe driver and, based on the fsr (0x1406 in our case), simply
ignoring the abort by returning zero from the abort handler?

Thanks,

Ray
Rafał Miłecki
2016-04-08 22:05:22 UTC
Post by Lucas Stach
Post by Rafał Miłecki
Post by Hauke Mehrtens
Post by Rafał Miłecki
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
I assume it only can throw one of these and if it is deactivated it will
ignore the next one or overwrite it. So it could be that more than one
is thrown here.
Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem
0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem
0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
These commits made the kernel "listen" for such errors earlier, so that
they are reported when they occur and not some time later.
1) Aborts were masked (silent) until "Freeing unused kernel memory"
2) There was one (silent) abort caused by a bootloader
3) There were likely multiple aborts (silent) during early PCI init
4) After unmasking we got only a single abort reported and we were ignoring it
1) All aborts are reported immediately
"Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
3) There are still multiple aborts during PCI init (reported immediately now)
4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
Of course proposed solution is an ugly workaround, we should have no
aborts reported in the first place.
A master abort on the PCI bus during probe of the PCI config space
(device enumeration) is expected. Most host bridges ignore those errors
and just return 0 for the read transaction.
Some bridges forward the error onto the AXI/AMBA bus and thus cause
imprecise external aborts on the ARM core.
Yes, I suspect this is the case for these imprecise external aborts triggered
by the iProc PCIe.
Post by Lucas Stach
If your host bridge doesn't
have a way to disable error forwarding during PCI bus probe you need to
install an abort handler. Most implementations based on the designware
PCIe core do this already.
Is this as simple as registering an abort handler to the hook in the iProc
PCIe driver, and based on the fsr (0x1406 in our case), simply ignore the
abort by returning zero from the abort handler?
This is what I did in OpenWrt an hour ago and it seems to be working:
http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
--
Rafał
Ray Jui
2016-04-08 22:08:12 UTC
Post by Rafał Miłecki
Post by Lucas Stach
Post by Rafał Miłecki
Post by Hauke Mehrtens
Post by Rafał Miłecki
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
I assume it only can throw one of these and if it is deactivated it will
ignore the next one or overwrite it. So it could be that more than one
is thrown here.
Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem
0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem
0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late we are getting now many of them and a
bit earlier.
These commits made the kernel "listen" for such errors earlier, so that
they are reported when they occur and not some time later.
1) Aborts were masked (silent) until "Freeing unused kernel memory"
2) There was one (silent) abort caused by a bootloader
3) There were likely multiple aborts (silent) during early PCI init
4) After unmasking we got only a single abort reported and we were ignoring it
1) All aborts are reported immediately
"Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
3) There are still multiple aborts during PCI init (reported immediately now)
4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
Of course proposed solution is an ugly workaround, we should have no
aborts reported in the first place.
A master abort on the PCI bus during probe of the PCI config space
(device enumeration) is expected. Most host bridges ignore those errors
and just return 0 for the read transaction.
Some bridges forward the error onto the AXI/AMBA bus and thus cause
imprecise external aborts on the ARM core.
Yes, I suspect this is the case for these imprecise external aborts triggered
by the iProc PCIe.
Post by Lucas Stach
If your host bridge doesn't
have a way to disable error forwarding during PCI bus probe you need to
install an abort handler. Most implementations based on the designware
PCIe core do this already.
Is this as simple as registering an abort handler to the hook in the iProc
PCIe driver, and based on the fsr (0x1406 in our case), simply ignore the
abort by returning zero from the abort handler?
http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
It looks good to me except that I think you should register the hook in
"iproc_pcie_setup" so both the BCMA and platform based iProc PCIe
drivers can use it.

Thanks,

Ray
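
A side note on the two FSR values quoted in this thread (0x1406 and
0x00001c06): they can be decoded by hand. The sketch below assumes the
ARMv7 short-descriptor DFSR layout (fault status in bits [10] and [3:0],
WnR in bit 11, ExT in bit 12). The struct and function names are made up
for illustration and are not kernel APIs. Under that layout both values
decode to fault status 22 (16 + 6), the asynchronous (imprecise) external
abort, and they differ only in the WnR bit: 0x1c06 was a write, 0x1406
was not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected DFSR fields (ARMv7 short-descriptor format, assumed here). */
struct fsr_fields {
	unsigned int fs; /* 5-bit fault status: FSR[10] : FSR[3:0] */
	bool write;      /* WnR, FSR[11]: the aborted access was a write */
	bool ext;        /* ExT, FSR[12]: implementation-defined abort type */
};

/* Hypothetical helper: split a raw FSR value into the fields above. */
static struct fsr_fields decode_fsr(uint32_t fsr)
{
	struct fsr_fields f;

	/* Bit 10 becomes bit 4 of the fault status index. */
	f.fs = (fsr & 0xf) | ((fsr >> 6) & 0x10);
	f.write = (fsr >> 11) & 1;
	f.ext = (fsr >> 12) & 1;
	return f;
}

/* Check the two values that appear in the logs of this thread. */
static void self_check(void)
{
	struct fsr_fields pci = decode_fsr(0x1406);
	struct fsr_fields pending = decode_fsr(0x1c06);

	assert(pci.fs == 16 + 6 && !pci.write && pci.ext);
	assert(pending.fs == 16 + 6 && pending.write && pending.ext);
}
```

The index 16 + 6 is what a handler for this fault would be keyed on, which
matches the "based on the fsr" wording in the question above.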
Rafał Miłecki
2016-04-08 22:11:35 UTC
Permalink
Post by Ray Jui
It looks good to me except that I think you should register the hook in
"iproc_pcie_setup" so both the BCMA and platform based iProc PCIe drivers
can use it.
Should I add some new field to struct iproc_pcie, like "bool
hook_abort_handler"?
--
Rafał
Ray Jui
2016-04-08 22:41:11 UTC
Permalink
Post by Rafał Miłecki
Should I add some new field to struct iproc_pcie, like "bool
hook_abort_handler"?
You want to enable/disable them based on platforms? I don't see a need
at this point...

Thanks,

Ray
Rafał Miłecki
2016-04-08 22:53:15 UTC
Permalink
Post by Ray Jui
You want to enable/disable them based on platforms? I don't see a need at
this point...
I was assuming we don't want this handler hooked on Northstar+, where
the issue doesn't occur. If you think it's not worth it, we can hook
it on all platforms.
--
Rafał
Ray Jui
2016-04-09 00:00:40 UTC
Permalink
Post by Rafał Miłecki
I was assuming we don't want this handler hooked on Northstar+, where
the issue doesn't occur. If you think it's not worth it, we can hook
it on all platforms.
I see. In this case, we might need a device tree based configuration
that allows us to enable/disable the abort handler for different
platforms (for all iProc platform bus based clients, including Cygnus,
NSP, NS2, etc.). It sounds like even among the iProc SoCs that do not
use BCMA, some need this abort hook and some don't.

Do you have any bandwidth to work on that? If not, you can leave the
hook always installed in the "iproc_pcie_setup" routine for now, and
later on when I have time I'll work out something for it.

Thanks,

Ray
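
The idea discussed above (install an abort hook from the common setup
routine, optionally gated per platform) can be modelled in user space.
This is only a sketch of the control flow, not the driver: every name
ending in _model is hypothetical, the flag mirrors the "bool
hook_abort_handler" field floated earlier, and the table stands in for
the ARM kernel's per-fault-status handler table (hooked through
hook_fault_code() in mainline, as far as I know). A handler returning
zero means "handled", as described in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_FAULT_CODES 32
#define FS_ASYNC_EXTERNAL_ABORT (16 + 6) /* fault status of fsr=0x1406 */

struct pt_regs_model { int unused; }; /* stand-in for struct pt_regs */

typedef int (*fault_fn)(unsigned long addr, unsigned int fsr,
			struct pt_regs_model *regs);

static fault_fn fault_table[NR_FAULT_CODES];
static unsigned int ignored_aborts;

/* Model of hooking one fault status: replace its handler. */
static void hook_fault_code_model(int nr, fault_fn fn)
{
	assert(nr >= 0 && nr < NR_FAULT_CODES);
	fault_table[nr] = fn;
}

/* Model of abort dispatch: 0 if a handler took it, 1 if unhandled. */
static int do_data_abort_model(unsigned long addr, unsigned int fsr)
{
	struct pt_regs_model regs = { 0 };
	unsigned int fs = (fsr & 0xf) | ((fsr >> 6) & 0x10);

	if (fault_table[fs])
		return fault_table[fs](addr, fsr, &regs);
	return 1;
}

/* Ignore every abort of this type, not only the first one. */
static int iproc_pcie_abort_handler_model(unsigned long addr,
					  unsigned int fsr,
					  struct pt_regs_model *regs)
{
	(void)addr; (void)fsr; (void)regs;
	ignored_aborts++;
	return 0;
}

struct iproc_pcie_model {
	bool hook_abort_handler; /* hypothetical per-platform switch */
};

/* Common setup path shared by the BCMA and platform front ends. */
static void iproc_pcie_setup_model(const struct iproc_pcie_model *pcie)
{
	if (pcie->hook_abort_handler)
		hook_fault_code_model(FS_ASYNC_EXTERNAL_ABORT,
				      iproc_pcie_abort_handler_model);
}
```

Leaving the flag out and always hooking, as suggested for now, would
amount to calling hook_fault_code_model() unconditionally in the setup
function.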

Rafał Miłecki
2016-04-07 18:48:26 UTC
Permalink
Do you know if the device causing it is a PCI multifunction device?
I don't know. What gets discovered on the first controller are two
devices: 14e4:d612 (kind of bridge I believe) and 14e4:4365 (wireless
with BCM4366).
Can you try regressing the PCI host driver and isolate that?
What do you mean by regressing PCI host driver?
--
Rafał