4.4 BCM5301X ARM regression "External imprecise Data abort"

Discussion:

Rafał Miłecki

2016-04-04 06:13:00 UTC

Hi guys,

I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.

It happens that Linux 4.4 doesn't boot due to the following commits:
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")

In kernel 4.3 we had that abort workaround, which resulted in:
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.

With 4.4 similar (or the same?) abort happens earlier (during PCI host
driver init) and doesn't get ignored:
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem 0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
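
For readers decoding the fault status value in the log above, here is a minimal Python sketch. It assumes the ARMv7 short-descriptor DFSR layout (FS[3:0] in bits 3:0, FS[4] in bit 10, WnR in bit 11, ExT in bit 12). The function name `decode_fsr` and the dictionary keys are illustrative and are not kernel identifiers.

```python
# Sketch only: split an ARMv7 DFSR value (short-descriptor format) into fields.
# Bit positions are assumed from the ARMv7-A layout; names are illustrative.

def decode_fsr(fsr: int) -> dict:
    """Return the main fault status fields of a short-descriptor DFSR value."""
    # FS[4] is stored in bit 10 and FS[3:0] in bits 3:0; fold them together.
    fs = (fsr & 0xF) | (((fsr >> 10) & 0x1) << 4)
    return {
        "fs": fs,                            # 5-bit fault status code
        "domain": (fsr >> 4) & 0xF,          # domain field, bits 7:4
        "write": bool(fsr & (1 << 11)),      # WnR: the access was a write
        "ext": bool(fsr & (1 << 12)),        # ExT: external abort type bit
        # FS == 0b10110 is the asynchronous (imprecise) external abort code
        "async_external_abort": fs == 0b10110,
    }


if __name__ == "__main__":
    print(hex(0x1406), decode_fsr(0x1406))
```

Under these assumptions 0x1406 decodes to fault status 0b10110 with the ExT bit set and WnR clear, which matches the "imprecise external abort" text in the log.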

At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
followed by another:
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000

So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late, we now get many of them, and a
bit earlier.

Reverting all three commits from the top of 4.4.6 gives me back a
working & booting kernel.

Do you have any idea how to fix this regression (and hopefully the
original problem as well)?

--
Rafał

Scott Branden

2016-04-04 21:08:45 UTC

Hi Rafal,

I do not work on BCM5301x SoCs but perhaps Jon Mason can comment.
A few comments inline as well.

Post by Rafał Miłecki
Hi guys,
I got regression reports from Netgear R8000 (BCM4709A0) users and did
some testing & regression tracking with Aditya.
bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
[ 5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
[ 5.694632] init: Console is alive
[ 5.698169] init: - watchdog -
[ 5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
As you can see, this abort was happening soon after freeing unused
memory and ignoring it *once* did the trick. It was never appearing
again.
With 4.4 similar (or the same?) abort happens earlier (during PCI host
driver init) and doesn't get ignored:
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem 0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
followed by another:
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late, we now get many of them, and a
bit earlier.

We do not observe such issues in Cygnus and other SoCs that use this
PCIe driver (we do not use bcma either - I do not know if that is related).

Post by Rafał Miłecki
Reverting all three commits from the top of 4.4.6 gives me back a
working & booting kernel.
Do you have any idea how to fix this regression (and hopefully the
original problem as well)?

I think the proper fix is to correct the issues in the bootloader. It
was my understanding from Jon Mason that this is the root of the
original problem.
Regards,
Scott

Hauke Mehrtens

2016-04-04 21:23:25 UTC

Hi Rafal,

Post by Scott Branden
Hi Rafal,
I do not work on BCM5301x SoCs but perhaps Jon Mason can comment.
A few comments inline as well.

I assume it can only raise one of these at a time, and if aborts are
masked it will ignore the next one or overwrite the pending one. So it
could be that more than one is raised here.

Post by Scott Branden

Post by Rafał Miłecki
With 4.4 similar (or the same?) abort happens earlier (during PCI host
driver init) and doesn't get ignored:
[ 2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 2.483451] pci 0000:00:00.0: bridge window [mem 0x08000000-0x085fffff]
[ 2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
[ 2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
[ 2.612657] pcie_iproc_bcma bcma0:8: link: UP
[ 2.617241] PCI: bus0: Fast back to back transfers disabled
[ 2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.631297] PCI: bus1: Fast back to back transfers disabled
[ 2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
(see 4.4.txt for the backtrace)
At first I was hoping that we simply need to re-add the removed
workaround. I tried it but it appeared that one abort is immediately
followed by another:
[ 2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
[ 2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
So it seems that commits bbeb920 and 9254970 broke something in PCI
host initialization (or maybe just exposed another bug?). Instead of
getting an abort once and late, we now get many of them, and a
bit earlier.

These commits made the kernel "listen" for such errors earlier, so that
they are shown at the time they occur and not sometime later.

Post by Scott Branden
We do not observe such issues in Cygnus and other SoCs that use this
PCIe driver (we do not use bcma either - I do not know if that is related).

I think the proper fix is to correct the issues in the bootloader. It
was my understanding from Jon Mason that this is the root of the
original problem.

I think this is a new problem.

There was a comment in the Broadcom SDK saying that the bootloader is
probably broken and that this causes the fault, which was worked around
in the mainline kernel with the fault handler in the brcm code.

When I added the code, Arnd asked me if this SoC has a PCIe controller
because he saw such a problem on another SoC with a PCIe controller.
https://www.spinics.net/lists/arm-kernel/msg298112.html

As this is now happening in the PCIe code I assume that this has
something to do with PCIe. ;-)

Have you tried deactivating PCIe support in the device tree to see what
happens? Have you tried loading the PCIe controller driver as a module
later on, to see if that makes a difference?

Hauke

Rafał Miłecki

2016-04-08 06:45:15 UTC

Post by Hauke Mehrtens

Post by Scott Branden

I assume it can only raise one of these at a time, and if aborts are
masked it will ignore the next one or overwrite the pending one. So it
could be that more than one is raised here.

Post by Scott Branden

These commits made the kernel "listen" for such errors earlier, so that
they are shown at the time they occur and not sometime later.

So AFAIU with kernel 4.3:
1) Aborts were masked (silent) until "Freeing unused kernel memory"
2) There was one (silent) abort caused by a bootloader
3) There were likely multiple aborts (silent) during early PCI init
4) After unmasking we got only a single abort reported and we were ignoring it

With kernel 4.4:
1) All aborts are reported immediately
2) Abort caused by a bootloader gets ignored by ARM code:
"Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
3) There are still multiple aborts during PCI init (reported immediately now)
4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
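
The two summaries above can be sketched as a toy model in Python. This is not kernel code. It assumes, as suggested earlier in the thread, that aborts raised while masked collapse into a single pending abort that is taken at unmask time. The names `aborts_taken` and `boots`, and the timestamps, are hypothetical and only loosely follow the logs.

```python
# Toy model, not kernel code. Assumption: while aborts are masked, any number
# of raised aborts collapse into one pending abort, taken at unmask time.

def aborts_taken(raised_at, unmask_at):
    """Return the times at which the CPU actually takes an abort."""
    taken = []
    pending = False
    for t in sorted(raised_at):
        if t < unmask_at:
            pending = True              # masked: at most one stays pending
        else:
            if pending:
                taken.append(unmask_at)  # pending abort fires on unmask
                pending = False
            taken.append(t)             # unmasked: taken immediately
    if pending:
        taken.append(unmask_at)
    return taken


def boots(taken, handled_by_arch, ignore_limit):
    """True if every abort is consumed by arch code or ignored by the platform hook."""
    return len(taken) - handled_by_arch <= ignore_limit


# Illustrative times: one abort left by the bootloader, two during PCI init.
raised = [-1.0, 2.945, 2.952]

v43 = aborts_taken(raised, unmask_at=5.0)   # 4.3: unmasked late in boot
v44 = aborts_taken(raised, unmask_at=0.0)   # 4.4: unmasked early

print("4.3:", v43, boots(v43, handled_by_arch=0, ignore_limit=1))
print("4.4 + old workaround:", v44, boots(v44, handled_by_arch=1, ignore_limit=1))
```

In this model the 4.3 case takes a single abort, so ignoring one is enough. The 4.4 case takes three, of which the arch code consumes the first, so a hook that ignores only one abort still ends in an unhandled fault.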

Of course the proposed solution is an ugly workaround; we should have no
aborts reported in the first place.

Post by Hauke Mehrtens

Post by Scott Branden
We do not observe such issues in Cygnus and other SoCs that use this
PCIe driver (we do not use bcma either - I do not know if that is related).

I think the proper fix is to correct the issues in the bootloader. It
was my understanding from Jon Mason that this is the root of the
original problem.

There is indeed some issue with the bootloader, but with kernel 4.4 it
seems to be handled by the ARM arch code. There is now this nice
workaround I see when booting 4.4:
[ 0.000000] Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask, this is most likely caused by a firmware/bootloader bug.

It seems we were lucky so far (in 4.3 and older) thanks to all aborts
being squashed into a single one. We meant to ignore the abort caused by
the bootloader, but we were also ignoring many more aborts triggered by
iproc.

Post by Hauke Mehrtens
When I added the code, Arnd asked me if this SoC has a PCIe controller
because he saw such a problem on another SoC with a PCIe controller.
https://www.spinics.net/lists/arm-kernel/msg298112.html
As this is now happening in the PCIe code I assume that this has
something to do with PCIe. ;-)
Have you tried deactivating PCIe support in the device tree to see what
happens? Have you tried loading the PCIe controller driver as a module
later on, to see if that makes a difference?

So this definitely looks like something PCIe related. I modified the
OpenWrt config to build iproc as a module and load it late.

As said earlier, this early on-unmask abort is handled nicely by the ARM
arch code:
[ 0.000000] Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask, this is most likely caused by a firmware/bootloader bug.

Then many modules load nicely, I'm seeing:
[ 2.959868] Freeing unused kernel memory: 212K (c0443000 - c0478000)
without any abort at this point.

And it continues cleanly until pcie-iproc-bcma.ko is loaded. At some
point PCIe initialization starts triggering aborts:
[ 10.547032] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 10.555199] External imprecise Data abort at addr=0x1c6e00f, fsr=0x1406 ignored.
[ 10.562635] Unhandled fault: imprecise external abort (0x1406) at 0x01c6e00f
(backtrace here, see 4.4-iproc-module.txt)

With kernel 4.3 the same stage of PCIe init looked like this:
[ 2.926866] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.935051] pci 0001:02:00.0: unknown header type 12, ignoring device
[ 2.942154] pci 0001:02:03.0: [Firmware Bug]: reg 0x10: invalid BAR (can't size)
[ 2.949595] pci 0001:02:03.0: [Firmware Bug]: reg 0x14: invalid BAR (can't size)
[ 2.957014] pci 0001:02:03.0: [Firmware Bug]: reg 0x18: invalid BAR (can't size)
(...)

Arnd: did you find any solution for those aborts triggered during PCIe init?

--
Rafał

Lucas Stach

2016-04-08 08:43:13 UTC