Discussion:
get_irq_regs() from soft IRQ
Jean Pihet
2009-06-29 14:31:18 UTC
Hi,

I am trying to get the latest IRQ registers from a timer or a work queue but I
am running into problems:
- get_irq_regs() returns NULL in some cases, so it is unusable and even causes
a crash when reading the register values through the returned pointer
- I never get user space registers, only kernel

The use case is that the performance unit (PMNC) of the Cortex A8 has a
serious bug; in short, the performance counter overflow IRQ is to be avoided.
The solution I am implementing is to read and reset the counters from a work
queue that is triggered by a timer.

Some questions:
- is there a way to get the last 'real' IRQ registers from a timer or work
queue handler?
- is there some other way to do it?

Any thoughts?

Thanks & regards,
Jean
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Russell King - ARM Linux
2009-06-29 15:19:31 UTC
Post by Jean Pihet
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases,
It will always return NULL outside of IRQ context - and only returns valid
pointers when used inside IRQ context.

It's one of these things that nests itself - when you have several IRQs
being processed on one CPU, there are several register contexts saved,
and get_irq_regs() returns the most recent one.
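
To illustrate the nesting described above, here is a toy Python model of a
per-CPU "saved registers" pointer. It is only a sketch: the class `Cpu`, the
method `handle_irq` and the dictionary standing in for the saved registers are
hypothetical names for this illustration, not kernel code. The only behaviour
taken from the discussion is that the pointer is valid inside IRQ handling,
refers to the most recent (innermost) interrupt, and is NULL (here `None`)
outside IRQ context.

```python
class Cpu:
    """Toy model of one CPU's saved-IRQ-registers pointer (not kernel code)."""

    def __init__(self):
        # Outside IRQ context there is no saved register set.
        self._irq_regs = None

    def get_irq_regs(self):
        """Return the registers saved by the innermost IRQ, or None."""
        return self._irq_regs

    def set_irq_regs(self, new_regs):
        """Install a new saved-register pointer and return the previous one."""
        old_regs = self._irq_regs
        self._irq_regs = new_regs
        return old_regs

    def handle_irq(self, regs, handler):
        """Run `handler` in a simulated IRQ context with `regs` saved."""
        old_regs = self.set_irq_regs(regs)
        try:
            handler(self)
        finally:
            # On exit the previous context (or None) becomes current again.
            self.set_irq_regs(old_regs)


cpu = Cpu()
seen = []


def inner_handler(c):
    seen.append(c.get_irq_regs()["pc"])


def outer_handler(c):
    seen.append(c.get_irq_regs()["pc"])
    # A second interrupt arrives while the first is still being handled.
    c.handle_irq({"pc": 0x2000}, inner_handler)
    seen.append(c.get_irq_regs()["pc"])


cpu.handle_irq({"pc": 0x1000}, outer_handler)
print([hex(pc) for pc in seen])
print(cpu.get_irq_regs())
```

In this model a timer or work queue callback that runs after `handle_irq` has
returned sees `None`, which matches the NULL pointer reported at the start of
the thread.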
Post by Jean Pihet
The use case is that the performance unit (PMNC) of the Cortex A8 has some
serious bug, in short the performance counters overflow IRQ is to be avoided.
I don't follow. None of the PMNC support code in the mainline kernel
uses get_irq_regs() outside of IRQ context.
Post by Jean Pihet
- is there a way to get the last 'real' IRQ registers from a timer or work
queue handler?
No. Outside of IRQ events, the saved IRQ context does not exist.

-------------------------------------------------------------------
List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
FAQ: http://www.arm.linux.org.uk/mailinglists/faq.php
Etiquette: http://www.arm.linux.org.uk/mailinglists/etiquette.php
Jean Pihet
2009-06-29 15:35:37 UTC
Post by Russell King - ARM Linux
Post by Jean Pihet
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases,
It will always return NULL outside of IRQ context - and only returns valid
pointers when used inside IRQ context.
Ok got it.
Post by Russell King - ARM Linux
It's one of these things that nests itself - when you have several IRQs
being processed on one CPU, there are several register contexts saved,
and get_irq_regs() returns the most recent one.
Post by Jean Pihet
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is to be
avoided.
I don't follow. None of the PMNC support code in the mainline kernel
uses get_irq_regs() outside of IRQ context.
That is correct. The Cortex A8 needs some special treatment.
The erratum says that if the counters overflow at the same time as a
coprocessor access is performed, the perf unit gets reset and/or locks up. In
short, counter overflow is to be avoided, and with it the PMNC IRQ.
Post by Russell King - ARM Linux
Post by Jean Pihet
- is there a way to get the last 'real' IRQ registers from a timer or
work queue handler?
No. Outside of IRQ events, the saved IRQ context does not exist.
Ok. I wonder how to implement it correctly from here.
The ultimate goal is to feed the registers to oprofile for statistics
gathering (mostly the PC). I do not see much benefit from oprofile without
the PC statistics.

Thanks,
Jean
Russell King - ARM Linux
2009-06-29 16:07:44 UTC
Post by Jean Pihet
Post by Russell King - ARM Linux
It's one of these things that nests itself - when you have several IRQs
being processed on one CPU, there are several register contexts saved,
and get_irq_regs() returns the most recent one.
Post by Jean Pihet
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is to be
avoided.
I don't follow. None of the PMNC support code in the mainline kernel
uses get_irq_regs() outside of IRQ context.
That is correct. The Cortex A8 needs some special treatment.
The errata says that if the counters are overflowing at the same time as a
coprocessor access is performed, the perf unit gets reset and/or locks up. In
short the counters overflow is to be avoided and so the PMNC IRQ.
Are you talking about 628216?
Jean Pihet
2009-06-29 16:12:51 UTC
Post by Russell King - ARM Linux
Post by Jean Pihet
Post by Russell King - ARM Linux
It's one of these things that nests itself - when you have several IRQs
being processed on one CPU, there are several register contexts saved,
and get_irq_regs() returns the most recent one.
Post by Jean Pihet
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is
to be avoided.
I don't follow. None of the PMNC support code in the mainline kernel
uses get_irq_regs() outside of IRQ context.
That is correct. The Cortex A8 needs some special treatment.
The errata says that if the counters are overflowing at the same time as
a coprocessor access is performed, the perf unit gets reset and/or locks
up. In short the counters overflow is to be avoided and so the PMNC IRQ.
Are you talking about 628216?
Yes, that is the one. Sorry for not mentioning it sooner.

Jean

Siarhei Siamashka
2009-06-29 16:36:57 UTC
Post by Jean Pihet
Hi,
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and even
causes crash when trying to get the registers values from the returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has some
serious bug, in short the performance counters overflow IRQ is to be
avoided. The solution I am implementing is to read and reset the counters
from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?

Are you trying to implement (in a clean way) something similar to
http://marc.info/?l=oprofile-list&m=123688347009580&w=2

Or is it going to be a different workaround?
--
Best regards,
Siarhei Siamashka

------------------------------------------------------------------------------
Jean Pihet
2009-06-29 16:58:41 UTC
Hi Siarhei Siamashka,
Post by Siarhei Siamashka
Post by Jean Pihet
Hi,
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and even
causes crash when trying to get the registers values from the returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is to be
avoided. The solution I am implementing is to read and reset the counters
from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
Are you trying to implement (in a clean way) something similar to
http://marc.info/?l=oprofile-list&m=123688347009580&w=2
Or is it going to be a different workaround?
I am trying a different approach, starting from the erratum description.
The idea is to prevent the counters from overflowing, which could cause a PMNC
unit reset or lock-up (or both).

Here are the implementation details:
- use a timer to read and reset the counters, then fire a work queue
- in the work queue, the counter values are converted to oprofile samples
- proper locking is used to avoid races between the various tasks

I am nearly done with it but I am now running into problems with PM
(suspend/resume) and get_irq_regs().

What do you think?
How far are you on your side? Did you stress test the solution? Is the PMNC
recovery always successful?

Regards,
Jean

------------------------------------------------------------------------------
Russell King - ARM Linux
2009-06-29 17:46:33 UTC
Post by Jean Pihet
I am trying to get a different approach, starting from the errata
description. The idea is to avoid the counters from overflowing,
which could cause a PMNC unit reset or lock-up (or both).
But this can't work.

Oprofile essentially works as follows:

You set the number (N) of events you wish to occur between each sample.
When N events have occurred, you record the stacktrace and reset the
counter so it fires after another N events.

Now, you could start the counters at zero every time, and then poll them
via a timer. When the counter value is larger than N, you could log a
stacktrace and zero the counter.

However, this suffers one very serious problem - if you're wanting to
measure something at an interval which occurs faster than your timer,
you're going to get misleading results.

You could set the timer to fire at a high rate, but then that's going
to upset things like cache miss, cache hit, etc measurements.
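
The difference between the two schemes can be sketched with a small Python
simulation. This is only an illustrative model under stated assumptions: the
function names, the integer microsecond timestamps and the event pattern are
all made up for the example, and the polling variant logs at most one sample
per timer tick, as in the description above.

```python
def overflow_samples(event_times, n):
    """Times at which a sample is taken when an IRQ fires every n events."""
    return [t for i, t in enumerate(event_times, start=1) if i % n == 0]


def polled_samples(event_times, n, period, end):
    """Times at which a sample is taken when a timer polls the counter.

    The counter starts at zero; on each tick, if at least n events have been
    counted since the last reset, one sample is logged and the counter zeroed.
    """
    samples = []
    count = 0
    idx = 0
    tick = period
    while tick <= end:
        # Count every event that happened up to and including this tick.
        while idx < len(event_times) and event_times[idx] <= tick:
            count += 1
            idx += 1
        if count >= n:
            samples.append(tick)
            count = 0
        tick += period
    return samples


# Hypothetical workload: one event per microsecond for 1000 microseconds.
events = list(range(1, 1001))
N = 100          # events between samples
PERIOD = 500     # polling timer period, in microseconds

by_overflow = overflow_samples(events, N)
by_polling = polled_samples(events, N, PERIOD, end=1000)
print(len(by_overflow), "samples from overflow IRQs")
print(len(by_polling), "samples from timer polling")
```

With these made-up numbers the events arrive five times faster than the
polling timer can account for, so the polled variant records far fewer
samples, and each one is stamped at a timer tick rather than at the moment the
Nth event occurred.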
Post by Jean Pihet
- use a timer to read and reset the counters, then fire a work queue
- in the work queue the counters values are converted to oprofile samples
- the proper locking is used to avoid some races between the various tasks
This sounds overcomplicated. I see no reason for a workqueue to be
involved anywhere near the oprofile sample code.
Post by Jean Pihet
I am nearly done with it but I am now running into problems with PM
(suspend/resume) and get_irq_regs().
You really really really can't use get_irq_regs() outside of IRQ context.
The stored registers just do not exist anymore - they've been overwritten
by whatever exception or system call you're currently in.

You can't create a copy of them - copies will be overwritten on the very
next (nested) interrupt. You don't know which interrupt is the first
interrupt to occur.

I really think that the only option here is to just accept that oprofile
is crucified by this erratum.
Jean Pihet
2009-06-29 17:57:34 UTC
Post by Russell King - ARM Linux
Post by Jean Pihet
I am trying to get a different approach, starting from the errata
description. The idea is to avoid the counters from overflowing,
which could cause a PMNC unit reset or lock-up (or both).
But this can't work.
You set the number (N) of events you wish to occur between each sample.
When N events have occurred, you record the stacktrace and reset the
counter so it fires after another N events.
Now, you could start the counters at zero every time, and then poll them
via a timer. When the counter value is larger than N, you could log a
stacktrace and zero the counter.
However, this suffers one very serious problem - if you're wanting to
measure something at an interval which occurs faster than your timer,
you're going to get misleading results.
The counters are 32-bit wide and the maximum counting frequency is 2 events
per cycle (cf. errata). That means you get plenty of time before the counters
overflow.
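
As a rough worked example of that headroom, the time to overflow can be
computed from the two figures given above (32-bit counters, at most 2 events
per cycle). The 600 MHz core clock below is purely an assumed value for
illustration; it does not come from this thread, so substitute the real clock
of the target board.

```python
COUNTER_BITS = 32            # counter width, from the message above
MAX_EVENTS_PER_CYCLE = 2     # worst-case counting rate, from the message above
ASSUMED_CLOCK_HZ = 600_000_000   # assumption for illustration only


def seconds_to_overflow(clock_hz,
                        events_per_cycle=MAX_EVENTS_PER_CYCLE,
                        bits=COUNTER_BITS):
    """Shortest time for a counter starting at zero to wrap around."""
    return (2 ** bits) / (clock_hz * events_per_cycle)


worst_case = seconds_to_overflow(ASSUMED_CLOCK_HZ)
print(f"worst-case time to overflow: {worst_case:.2f} s")
```

Under that assumption the worst case is a little over three and a half
seconds, so a timer that reads and resets the counters many times per second
would leave a wide margin against overflow.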
Post by Russell King - ARM Linux
You could set the timer to fire at a high rate, but then that's going
to upset things like cache miss, cache hit, etc measurements.
Correct.
You need a tradeoff for the timer period.
Post by Russell King - ARM Linux
Post by Jean Pihet
- use a timer to read and reset the counters, then fire a work queue
- in the work queue the counters values are converted to oprofile samples
- the proper locking is used to avoid some races between the various tasks
This sounds over complicated.
It is ;p
Post by Russell King - ARM Linux
I see no reason for a workqueue to be
involved anywhere near the oprofile sample code.
Got it.
Post by Russell King - ARM Linux
Post by Jean Pihet
I am nearly done with it but I am now running into problems with PM
(suspend/resume) and get_irq_regs().
You really really really can't use get_irq_regs() outside of IRQ context.
The stored registers just do not exist anymore - they've been overwritten
by whatever exception or system call you're currently in.
You can't create a copy of them - copies will be overwritten on the very
next (nested) interrupt. You don't know which interrupt is the first
interrupt to occur.
Doh!
Post by Russell King - ARM Linux
I really think that the only option here is to just accept that oprofile
is crucified by this errata.
Amen!

Thanks,
Jean
Siarhei Siamashka
2009-06-29 17:54:23 UTC
Post by Jean Pihet
Hi Siarhei Siamashka,
Post by Siarhei Siamashka
Post by Jean Pihet
Hi,
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and even
causes crash when trying to get the registers values from the returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is to
be avoided. The solution I am implementing is to read and reset the
counters from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
Are you trying to implement (in a clean way) something similar to
http://marc.info/?l=oprofile-list&m=123688347009580&w=2
Or is it going to be a different workaround?
I am trying to get a different approach, starting from the errata
description. The idea is to avoid the counters from overflowing, which
could cause a PMNC unit reset or lock-up (or both).
- use a timer to read and reset the counters, then fire a work queue
- in the work queue the counters values are converted to oprofile samples
- the proper locking is used to avoid some races between the various tasks
I am nearly done with it but I am now running into problems with PM
(suspend/resume) and get_irq_regs().
What do you think?
Russell was the first to reply :)

But we also discussed this "hybrid model" some time ago, and there is a clear
counterexample where it fails:
http://www.nabble.com/Re%3A--PATCH-0-1--OMAP-gptimer-based-event-monitor-driver-for-oprofile-p21374285.html
Post by Jean Pihet
How far are you on your side? Did you stress test the solution? Is the PMNC
recovery always successful?
I ended up just using a timer with a high sample-generation frequency. It
works without hassle and is sufficient for the majority of cases.
--
Best regards,
Siarhei Siamashka
Jean Pihet
2009-06-29 18:08:53 UTC
Post by Siarhei Siamashka
Post by Jean Pihet
Hi Siarhei Siamashka,
Post by Siarhei Siamashka
Post by Jean Pihet
Hi,
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and
even causes crash when trying to get the registers values from the
returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is
to be avoided. The solution I am implementing is to read and reset
the counters from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
Are you trying to implement (in a clean way) something similar to
http://marc.info/?l=oprofile-list&m=123688347009580&w=2
Or is it going to be a different workaround?
I am trying to get a different approach, starting from the errata
description. The idea is to avoid the counters from overflowing, which
could cause a PMNC unit reset or lock-up (or both).
- use a timer to read and reset the counters, then fire a work queue
- in the work queue the counters values are converted to oprofile samples
- the proper locking is used to avoid some races between the various tasks
I am nearly done with it but I am now running into problems with PM
(suspend/resume) and get_irq_regs().
What do you think?
Russell was the first to reply :)
But we also discussed this "hybrid model" some time ago, and there is a clear
counterexample where it fails:
http://www.nabble.com/Re%3A--PATCH-0-1--OMAP-gptimer-based-event-monitor-driver-for-oprofile-p21374285.html
All right, sorry I was not aware of that discussion. So the PMNC unit is
broken beyond repair. BTW good description and test results!
Post by Siarhei Siamashka
Post by Jean Pihet
How far are you on your side? Did you stress test the solution? Is the
PMNC recovery always successful?
I ended up just using a timer with high frequency of samples generation. it
works without hassle and is sufficient for the majority of cases.
Ok. It looks like it is the best we can do.

Thanks,
Jean
Russell King - ARM Linux
2009-06-29 17:37:57 UTC
Post by Siarhei Siamashka
Post by Jean Pihet
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and even
causes crash when trying to get the registers values from the returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has some
serious bug, in short the performance counters overflow IRQ is to be
avoided. The solution I am implementing is to read and reset the counters
from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
I don't think you can - triggering capture on overflow is precisely how
oprofile works.

The erratum talks about polling for overflow. By doing this, you are in
a well defined part of the kernel, which is obviously going to be shown
as a hot path for every counter, thus making oprofile useless for kernel
work.

Deferring the interrupt to a workqueue doesn't resolve the problem either.
The problem has nothing to do with what happens after the interrupt
occurs - it's about interrupts themselves being lost.

I think just accepting that this erratum breaks oprofile is the only
realistic solution. ;(
Jean Pihet
2009-06-29 17:52:20 UTC
Post by Russell King - ARM Linux
Post by Siarhei Siamashka
Post by Jean Pihet
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and even
causes crash when trying to get the registers values from the returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is to
be avoided. The solution I am implementing is to read and reset the
counters from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
I don't think you can - triggering capture on overflow is precisely how
oprofile works.
The erratum talks about polling for overflow. By doing this, you are in
a well defined part of the kernel, which is obviously going to be shown
as a hot path for every counter, thus making oprofile useless for kernel
work.
I think it is possible, if you set aside the get_irq_regs() problem.
The idea is to read and reset the counters before they overflow, instead of
loading them with a small negative value and waiting for the overflow to
happen.
Post by Russell King - ARM Linux
Deferring the interrupt to a workqueue doesn't resolve the problem either.
The problem has nothing to do with what happens after the interrupt
occurs - it's about interrupts themselves being lost.
The erratum is about a lost event and/or a lock-up of the PMNC unit at the time
of overflow.
Post by Russell King - ARM Linux
I think just accepting that this erratum breaks oprofile is the only
realistic solution. ;(
Completely agree. Still, it would be nice to have a workaround, as inelegant
as it may be ;(
Siarhei Siamashka
2009-06-29 18:38:59 UTC
Post by Russell King - ARM Linux
Post by Siarhei Siamashka
Post by Jean Pihet
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and even
causes crash when trying to get the registers values from the returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is to
be avoided. The solution I am implementing is to read and reset the
counters from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
I don't think you can - triggering capture on overflow is precisely how
oprofile works.
The erratum talks about polling for overflow. By doing this, you are in
a well defined part of the kernel, which is obviously going to be shown
as a hot path for every counter, thus making oprofile useless for kernel
work.
Deferring the interrupt to a workqueue doesn't resolve the problem either.
The problem has nothing to do with what happens after the interrupt
occurs - it's about interrupts themselves being lost.
I think just accepting that this erratum breaks oprofile is the only
realistic solution. ;(
I thought the same initially. But it still looks like the problem can be
worked around, admittedly in quite a dirty way.

We just need to use not a periodic timer, but a kind of watchdog (this can be
implemented with an OMAP GPTIMER).

As long as PMU interrupts keep coming fast, the watchdog is frequently reset
and never shows up anywhere. Everything works nicely.

Now if the PMU gets broken, the watchdog eventually fires and recovers the PMU
state. As the PMU could get broken something like 10 times per second in the
worst case in my experiments, ~10 ms for the watchdog trigger period seemed to
be a reasonable empirical value. Under these conditions, the PMU will be in a
nonworking state for less than roughly 10% of the time in the worst practical
case. Not very nice, but not completely ugly either.
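
The 10% figure follows from simple arithmetic, sketched below in Python. The
function name is hypothetical, and the model assumes that after each breakage
the PMU stays down for at most one full watchdog period before it is
recovered.

```python
def worst_case_broken_fraction(breakages_per_second, watchdog_period_ms):
    """Upper bound on the fraction of time the PMU is not working.

    Each breakage costs at most one watchdog period of lost profiling time.
    """
    broken_ms_per_second = breakages_per_second * watchdog_period_ms
    return broken_ms_per_second / 1000


# Figures quoted above: about 10 breakages per second, ~10 ms watchdog period.
fraction = worst_case_broken_fraction(10, 10)
print(f"PMU down at most {fraction:.0%} of the time")
```

On average a breakage happens somewhere in the middle of a watchdog period, so
the typical loss is smaller than this bound.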

Another problematic condition is when the PMU is fine but is not generating
events naturally (for example, we have configured it for cache misses, but are
burning CPU in a loop which does not access memory at all). In this case the
watchdog will be triggered periodically for no reason, generating "noise"
in the profiling statistics. This noise needs to be filtered out, and it seems
possible to do so. The trick is to have the watchdog interrupt reset the
watchdog counter to a lower value than the one normally set in the PMU IRQ
handler. This way, whenever a PMU interrupt is generated, we check whether the
watchdog counter is below the normal threshold. If it is lower, then we know
that the watchdog interrupt was triggered recently and this sample can be
ignored. The difference between the normal watchdog counter reset value and
the value set on watchdog interrupts should provide sufficient time to get out
of the watchdog interrupt handler and its related code, so that it does not
show up in the statistics that much.
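
The filtering trick can be sketched as a toy Python model. This is not the
proof-of-concept patch: the class name, the up-counting timer and the three
constants are hypothetical values picked for illustration. The model assumes a
counter that counts up and fires the watchdog when it reaches `OVERFLOW`.

```python
NORMAL_RESET = 1000    # hypothetical value loaded by the PMU IRQ handler
WATCHDOG_RESET = 900   # hypothetical lower value loaded by the watchdog
OVERFLOW = 1100        # hypothetical count at which the watchdog fires


class WatchdogFilter:
    """Toy model of the watchdog counter used to filter PMU samples."""

    def __init__(self):
        self.counter = NORMAL_RESET
        self.watchdog_fires = 0   # false alarms and real breakages alike

    def tick(self, n=1):
        """Advance the up-counting timer by n ticks."""
        for _ in range(n):
            self.counter += 1
            if self.counter >= OVERFLOW:
                # Watchdog interrupt: recover the PMU (not modelled here) and
                # restart from the lower value to mark "watchdog ran recently".
                self.watchdog_fires += 1
                self.counter = WATCHDOG_RESET

    def pmu_irq(self):
        """Handle a PMU interrupt; return True if the sample should be kept."""
        # After a PMU reset the counter is never below NORMAL_RESET, so a
        # lower reading means the watchdog fired a short while ago.
        keep_sample = self.counter >= NORMAL_RESET
        self.counter = NORMAL_RESET
        return keep_sample


wf = WatchdogFilter()
wf.tick(50)
first = wf.pmu_irq()     # PMU interrupts arriving normally
wf.tick(100)             # no PMU interrupt for a full watchdog period
wf.tick(20)
second = wf.pmu_irq()    # arrives shortly after the watchdog fired
wf.tick(10)
third = wf.pmu_irq()     # back to normal operation
print(first, second, third, wf.watchdog_fires)
```

The gap between `WATCHDOG_RESET` and `NORMAL_RESET` plays the role of the
"sufficient time" mentioned above: any PMU interrupt that arrives inside that
window is treated as polluted by the watchdog handler and dropped.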

A working proof-of-concept patch was posted here:
http://groups.google.com/group/beagleboard/msg/dd361f3b43fdeff0
Sorry for not posting it to one of the kernel mailing lists, but I thought
that the beagleboard mailing list was a good place to find users who may
want to try it and evaluate whether it has any practical value. Maybe that was
not a very wise decision.

Unfortunately I'm not a kernel hacker, and cleaning up the patch may take
too much time and effort given my current knowledge. I would
be happy if somebody else with more hands-on kernel experience could make a
clean and usable Cortex-A8 PMU workaround. I don't care whether I get any
credit for it; the end result is more important :)

One of the obvious problems with the patch (other than race conditions) is
that it is using OMAP-specific GPTIMER. Is there something more portable in
the kernel to provide similar functionality? Or are there any Cortex-A8 r1
cores other than OMAP3 in the wild?
--
Best regards,
Siarhei Siamashka

------------------------------------------------------------------------------
Jean Pihet
2009-06-29 18:49:59 UTC
Post by Siarhei Siamashka
Post by Russell King - ARM Linux
Post by Siarhei Siamashka
Post by Jean Pihet
I am trying to get the latest IRQ registers from a timer or a work queue
- get_irq_regs() returns NULL in some cases, so it is unusable and
even causes crash when trying to get the registers values from the
returned ptr
- I never get user space registers, only kernel
The use case is that the performance unit (PMNC) of the Cortex A8 has
some serious bug, in short the performance counters overflow IRQ is
to be avoided. The solution I am implementing is to read and reset
the counters from a work queue that is triggered by a timer.
Regarding this oprofile related part. I wonder how you can get oprofile
working properly (providing non-bogus results) without performance
counters overflow IRQ generation?
I don't think you can - triggering capture on overflow is precisely how
oprofile works.
The erratum talks about polling for overflow. By doing this, you are in
a well defined part of the kernel, which is obviously going to be shown
as a hot path for every counter, thus making oprofile useless for kernel
work.
Deferring the interrupt to a workqueue doesn't resolve the problem
either. The problem has nothing to do with what happens after the
interrupt occurs - it's about interrupts themselves being lost.
I think just accepting that this erratum breaks oprofile is the only
realistic solution. ;(
I also thought about the same initially. But the problem still looks like
it can be workarounded, admittedly in quite a dirty way.
We just need to use not a periodic timer, but kind of a watchdog (this can
be implemented with OMAP GPTIMER).
As long as PMU interrupts are coming fast, watchdog is frequently reset and
never shows up anywhere. Everything is working nice.
Now if PMU gets broken, watchdog gets triggered eventually and recovers PMU
state. As PMU could get broken something like 10 times per second in the
worst case in my experiments, having ~10 ms for a watchdog trigger period
seemed to be a reasonable empirical value. So in this conditions, PMU
will be in a nonworking state approximately less than 10% of the time in
the worst practical case. Not very nice, but not completely ugly either.
The accuracy is not very good.
Post by Siarhei Siamashka
Another problematic condition is when PMU is fine, but is not generating
events naturally (for example we have configured it for cache misses, but
are burning cpu in a loop which is not accessing memory at all). In this
case a watchdog will be triggered periodically for no reason, generating
the "noise" in profiling statistics. This noise needs to be filtered out,
and seems like it is possible to do it. The trick is to reset watchdog
counter to a lower value than it is typically reset in PMU IRQ handler.
This way, whenever PMU interrupt is generated, we check if watchdog counter
is below the normal threshold. If it is lower, then we know that watchdog
interrupt was triggered recently and this sample can be ignored. The
difference between normal watchdog counter reset value and the value which
gets set on watchdog interrupts should provide sufficient time to get out
of the watchdog interrupt handler and its related code, so that it does not
show up in statistics that much.
http://groups.google.com/group/beagleboard/msg/dd361f3b43fdeff0
Sorry for not posting it to one of the kernel mailing lists, but I thought
that beagleboard mailing list was a good place to find users who may
want to try it and evaluate if it has any practical value. Maybe it was not
a very wise decision.
Unfortunately I'm not a kernel hacker and cleaning up the patch may take
too much time and efforts, taking into account my current knowledge. I
would be happy if somebody else with more hands-on kernel experience could
make a clean and usable Cortex-A8 PMU workaround. I don't care about
getting some part of credit for it or not, the end result is more important
:)
I am happy to help.
Post by Siarhei Siamashka
One of the obvious problems with the patch (other than race conditions) is
that it is using OMAP-specific GPTIMER. Is there something more portable in
the kernel to provide similar functionality? Or are there any Cortex-A8 r1
cores other than OMAP3 in the wild?
You can use a 'struct timer_list' with setup_timer, mod_timer and
del_timer_sync. Another API is the high-resolution timers (hrtimers), but I do
not think we need such a high-precision timer here.

Jean
Siarhei Siamashka
2009-06-29 19:45:11 UTC
On Monday 29 June 2009 21:49:59 ext Jean Pihet wrote:
[...]
Post by Jean Pihet
Post by Siarhei Siamashka
We just need to use not a periodic timer, but kind of a watchdog (this
can be implemented with OMAP GPTIMER).
As long as PMU interrupts are coming fast, watchdog is frequently reset
and never shows up anywhere. Everything is working nice.
Now if PMU gets broken, watchdog gets triggered eventually and recovers
PMU state. As PMU could get broken something like 10 times per second in
the worst case in my experiments, having ~10 ms for a watchdog trigger
period seemed to be a reasonable empirical value. So in this
conditions, PMU will be in a nonworking state approximately less than 10%
of the time in the worst practical case. Not very nice, but not
completely ugly either.
The accuracy is not very good.
Yes, but that is the worst case. In the "normal" case, when the PMU is not
broken or only rarely broken, the statistics would be quite good. One of the
reasons I stopped working on this patch was also the fact that in some cases
the Cortex-A8 PMU works reliably enough :) Adding suspicious, weird extra
logic may not be welcomed by people who are quite satisfied even with the
current oprofile state on Cortex-A8 chips (number-crunching applications with
a relatively low number of syscalls, which hence rarely touch any coprocessor
registers, are mostly unaffected).

Some adaptive watchdog trigger period may be better (try to predict when the
next PMU interrupt is going to normally happen and tune watchdog timeout at
runtime), but also may be more complex and may theoretically still misbehave
in some cases.
Post by Jean Pihet
Post by Siarhei Siamashka
Another problematic condition is when PMU is fine, but is not generating
events naturally (for example we have configured it for cache misses, but
are burning cpu in a loop which is not accessing memory at all). In this
case a watchdog will be triggered periodically for no reason, generating
the "noise" in profiling statistics. This noise needs to be filtered out,
and seems like it is possible to do it. The trick is to reset watchdog
counter to a lower value than it is typically reset in PMU IRQ handler.
This way, whenever PMU interrupt is generated, we check if watchdog
counter is below the normal threshold. If it is lower, then we know that
watchdog interrupt was triggered recently and this sample can be ignored.
The difference between normal watchdog counter reset value and the value
which gets set on watchdog interrupts should provide sufficient time to
get out of the watchdog interrupt handler and its related code, so that
it does not show up in statistics that much.
I forgot to mention here that very low frequency events (with a frequency lower
than that of the watchdog) may be quite problematic and still distort the
statistics, because they will be filtered out. Tuning all the magic values may
turn out to be hell.

But at the very least, all the watchdog interrupts (both false alarms and real
cases of PMU breakage) can be counted and taken into account. These statistics
could be reported to the user, who could then decide whether the final
profiling statistics can be trusted and see for how long the PMU was actually
broken.
Post by Jean Pihet
Post by Siarhei Siamashka
http://groups.google.com/group/beagleboard/msg/dd361f3b43fdeff0
Sorry for not posting it to one of the kernel mailing lists, but I
thought that beagleboard mailing list was a good place to find users who
may want to try it and evaluate if it has any practical value. Maybe it
was not a very wise decision.
Unfortunately I'm not a kernel hacker and cleaning up the patch may take
too much time and efforts, taking into account my current knowledge. I
would be happy if somebody else with more hands-on kernel experience
could make a clean and usable Cortex-A8 PMU workaround. I don't care
about getting some part of credit for it or not, the end result is more
important
:)
I am ok to help
Post by Siarhei Siamashka
One of the obvious problems with the patch (other than race conditions)
is that it is using OMAP-specific GPTIMER. Is there something more
portable in the kernel to provide similar functionality? Or are there any
Cortex-A8 r1 cores other than OMAP3 in the wild?
You can use a 'struct timer_list' and the setup_timer, mod_timer,
del_timer_sync. Another API is the high resolution timers (HRT) but I do
not think we need such a high precision timer here.
Thanks
--
Best regards,
Siarhei Siamashka

------------------------------------------------------------------------------