Discussion:
[PATCH] nvme: Acknowledge completion queue on each iteration
Sinan Kaya
2017-07-17 22:36:23 UTC
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.

This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.

Signed-off-by: Sinan Kaya <***@codeaurora.org>
---
drivers/nvme/host/pci.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d10d2f2..33d9b5b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)

while (nvme_read_cqe(nvmeq, &cqe)) {
nvme_handle_cqe(nvmeq, &cqe);
+ nvme_ring_cq_doorbell(nvmeq);
consumed++;
}

- if (consumed) {
- nvme_ring_cq_doorbell(nvmeq);
+ if (consumed)
nvmeq->cqe_seen = 1;
- }
}

static irqreturn_t nvme_irq(int irq, void *data)
--
1.9.1
Keith Busch
2017-07-17 22:45:51 UTC
Post by Sinan Kaya
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.
This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.
That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.
Sinan Kaya
2017-07-17 22:46:11 UTC
Hi Keith,
Post by Keith Busch
Post by Sinan Kaya
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.
This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.
That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.
The problem is that the code is throttling the HW, as the HW cannot queue more
completions until SW gets a chance to clear the existing entries.

As an example:

for each in N
(
blk_layer()
)
ring door bell

The HW cannot queue new work until all N blk_layer() operations are processed
and queue-element ownership is passed back to the HW after the loop. The HW
just sits idle if no free queue entries are available.
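
To make the comparison concrete, here is a simplified sketch of the two
variants, using the same helper names as the patch (nvme_read_cqe(),
nvme_handle_cqe(), nvme_ring_cq_doorbell()). It only shows the structure of
nvme_process_cq(); it is not a drop-in replacement:

/* Current behaviour: one CQ doorbell write after the whole batch. */
static void process_cq_batched(struct nvme_queue *nvmeq)
{
        struct nvme_completion cqe;
        int consumed = 0;

        while (nvme_read_cqe(nvmeq, &cqe)) {
                nvme_handle_cqe(nvmeq, &cqe);   /* block layer callback */
                consumed++;
        }

        if (consumed) {
                nvme_ring_cq_doorbell(nvmeq);   /* single MMIO write */
                nvmeq->cqe_seen = 1;
        }
}

/* Proposed behaviour: hand each CQ entry back to the HW as it is reaped. */
static void process_cq_eager(struct nvme_queue *nvmeq)
{
        struct nvme_completion cqe;
        int consumed = 0;

        while (nvme_read_cqe(nvmeq, &cqe)) {
                nvme_handle_cqe(nvmeq, &cqe);
                nvme_ring_cq_doorbell(nvmeq);   /* one MMIO write per entry */
                consumed++;
        }

        if (consumed)
                nvmeq->cqe_seen = 1;
}

The only difference is where the CQ doorbell write happens: once per
interrupt versus once per reaped completion entry.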

Sinan
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Keith Busch
2017-07-17 22:56:15 UTC
Post by Sinan Kaya
Hi Keith,
Post by Keith Busch
Post by Sinan Kaya
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.
This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.
That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.
The problem is that the code is throttling the HW, as the HW cannot queue more
completions until SW gets a chance to clear the existing entries.
for each in N
(
blk_layer()
)
ring door bell
The HW cannot queue new work until all N blk_layer() operations are processed
and queue-element ownership is passed back to the HW after the loop. The HW
just sits idle if no free queue entries are available.
If no completion queue entries are available, then there can't possibly
be any submission queue entries for the HW to work on either.
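
A minimal standalone sketch of that sizing argument, assuming (as is the
case for the driver's queue pairs) that the CQ is at least as deep as the
SQ; the names below are made up purely for illustration:

#include <assert.h>

struct queue_pair {
        unsigned int sq_depth;          /* submission queue entries */
        unsigned int cq_depth;          /* completion queue entries, >= sq_depth */
        unsigned int outstanding;       /* submitted but not yet completed */
};

/* Host side: a command is only issued when a free SQ slot (tag) exists. */
static int submit(struct queue_pair *q)
{
        if (q->outstanding == q->sq_depth)
                return -1;              /* queue full: block layer backs off */
        q->outstanding++;
        return 0;
}

/* Device side: every outstanding command has a CQ slot reserved for it. */
static void complete_one(struct queue_pair *q)
{
        assert(q->outstanding <= q->cq_depth);  /* CQ can never overflow */
        q->outstanding--;
}

int main(void)
{
        struct queue_pair q = { .sq_depth = 4, .cq_depth = 4, .outstanding = 0 };

        while (submit(&q) == 0)         /* fill the submission queue */
                ;
        while (q.outstanding)           /* drain it: no CQ stall possible */
                complete_one(&q);
        return 0;
}

With equal depths, a completely full CQ means every possible command has
already completed, so there is no submitted work left for the HW anyway.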
o***@codeaurora.org
2017-07-17 23:07:00 UTC
Post by Keith Busch
Post by Sinan Kaya
Hi Keith,
Post by Keith Busch
Post by Sinan Kaya
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.
This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.
That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.
The problem is that the code is throttling the HW, as the HW cannot queue more
completions until SW gets a chance to clear the existing entries.
for each in N
(
blk_layer()
)
ring door bell
The HW cannot queue new work until all N blk_layer() operations are processed
and queue-element ownership is passed back to the HW after the loop. The HW
just sits idle if no free queue entries are available.
If no completion queue entries are available, then there can't possibly
be any submission queue entries for the HW to work on either.
Maybe I need to understand the design better. I was curious why the
completion and submission queues are protected by a single lock, causing
lock contention.

I was treating each queue independently. I saw slightly better performance
with an early doorbell; that was my reasoning.
Keith Busch
2017-07-18 14:36:17 UTC
Maybe I need to understand the design better. I was curious why the
completion and submission queues are protected by a single lock, causing
lock contention.
Ideally the queues are tied to CPUs, so you couldn't have one thread
submitting to a particular queue-pair while another thread is reaping
completions from it. Such a setup wouldn't get lock contention.

Some machines have so many CPUs, though, that sharing hardware queues
is required. We've experimented with separate submission and completion
locks for such cases, but I've never seen improved performance as a
result.
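
For reference, a hypothetical sketch of what such a split could look like;
the struct and function names are invented for illustration, the in-tree
driver keeps a single per-queue lock, and IRQ disabling is omitted for
brevity:

#include <linux/spinlock.h>

/* Hypothetical queue pair with separate submission/completion locks. */
struct nvme_queue_split {
        spinlock_t sq_lock;     /* guards sq_tail and the SQ doorbell */
        spinlock_t cq_lock;     /* guards cq_head, cq_phase and the CQ doorbell */
        /* ... queue memory, doorbell pointers, etc. ... */
};

static void split_submit(struct nvme_queue_split *q)
{
        spin_lock(&q->sq_lock);
        /* copy the command into the SQ, advance sq_tail, ring the SQ doorbell */
        spin_unlock(&q->sq_lock);
}

static void split_process_cq(struct nvme_queue_split *q)
{
        spin_lock(&q->cq_lock);
        /* reap completions, advance cq_head/cq_phase, ring the CQ doorbell */
        spin_unlock(&q->cq_lock);
}

Submitters and the completion path then stop serializing against each
other, but that only helps when multiple CPUs actually share a queue.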
Sinan Kaya
2017-07-18 18:52:26 UTC
Post by Keith Busch
Maybe I need to understand the design better. I was curious why the
completion and submission queues are protected by a single lock, causing
lock contention.
Ideally the queues are tied to CPUs, so you couldn't have one thread
submitting to a particular queue-pair while another thread is reaping
completions from it. Such a setup wouldn't get lock contention.
I do see that the NVMe driver creates a completion interrupt on each CPU
core for completions. No problem with that.

However, I don't think you can guarantee that there will always be a single
CPU core targeting one submission queue, especially with asynchronous IO.

Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
in my FIO tests.

Did I miss something?
Post by Keith Busch
Some machines have so many CPUs, though, that sharing hardware queues
is required. We've experimented with separate submission and completion
locks for such cases, but I've never seen improved performance as a
result.
I have also experimented with multiple locks with no significant gains.
However, I was curious if somebody else had a better implementation than mine.
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Keith Busch
2017-07-18 21:26:32 UTC
Post by Sinan Kaya
I do see that the NVMe driver creates a completion interrupt on each CPU
core for completions. No problem with that.
However, I don't think you can guarantee that there will always be a single
CPU core targeting one submission queue, especially with asynchronous IO.
Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
in my FIO tests.
Did I miss something?
I think that must mean your machine has many more CPUs than your nvme
controller has IO queues.
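
Roughly speaking (blk-mq's real assignment is topology-aware, so the modulo
below is only an illustration), more CPUs than IO queues means several CPUs
share a queue pair and therefore its nvmeq->lock:

/* Illustration only: not how blk-mq actually maps CPUs to queues. */
static unsigned int cpu_to_queue(unsigned int cpu, unsigned int nr_io_queues)
{
        return cpu % nr_io_queues;      /* e.g. 64 CPUs, 16 queues -> 4 CPUs per queue */
}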
Sagi Grimberg
2017-07-19 09:20:57 UTC
Post by Sinan Kaya
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.
This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.
---
drivers/nvme/host/pci.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d10d2f2..33d9b5b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
while (nvme_read_cqe(nvmeq, &cqe)) {
nvme_handle_cqe(nvmeq, &cqe);
+ nvme_ring_cq_doorbell(nvmeq);
consumed++;
}
- if (consumed) {
- nvme_ring_cq_doorbell(nvmeq);
+ if (consumed)
nvmeq->cqe_seen = 1;
- }
}
Agree with Keith that this is definitely not the way to go; it adds mmio
operations in the hot path with very little gain (if any).
Sinan Kaya
2017-07-19 10:37:22 UTC
Post by Sagi Grimberg
Post by Sinan Kaya
The code currently rings the completion queue doorbell only after processing
all completed events and sending their callbacks to the block layer.
This causes a performance drop when a lot of jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.
---
drivers/nvme/host/pci.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d10d2f2..33d9b5b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
while (nvme_read_cqe(nvmeq, &cqe)) {
nvme_handle_cqe(nvmeq, &cqe);
+ nvme_ring_cq_doorbell(nvmeq);
consumed++;
}
- if (consumed) {
- nvme_ring_cq_doorbell(nvmeq);
+ if (consumed)
nvmeq->cqe_seen = 1;
- }
}
Agree with Keith that this is definitely not the way to go; it adds mmio
operations in the hot path with very little gain (if any).
Understood; different architectures might have different latencies when
accessing the HW registers. It might be expensive on some platforms, as you
indicated, and this change would make it worse.

I'm doing a self-NACK as well.
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.