Discussion:
[PATCH v7 0/3] PCI/IOMMU: Reserve IOVAs for PCI inbound memory
Oza Pawandeep
2017-05-22 16:39:39 UTC
Permalink
The iproc-based PCI RC on the Stingray SoC has a limitation of addressing
only 512GB of memory at once.

IOVA allocation honors the device's coherent_dma_mask/dma_mask.
In the PCI case, the current code honors the DMA mask set by the EP; there
is no concept of a PCI host bridge dma-mask. There should be one, so that
it could truly reflect the limitation of the PCI host bridge.

However, even assuming Linux takes care of the largest possible dma_mask,
the limitation can still exist because of the way the memory banks are laid out.

For example, with memory banks:
<0x00000000 0x80000000 0x0 0x80000000>, /* 2G @ 2G */
<0x00000008 0x80000000 0x3 0x80000000>, /* 14G @ 34G */
<0x00000090 0x00000000 0x4 0x00000000>, /* 16G @ 576G */
<0x000000a0 0x00000000 0x4 0x00000000>; /* 16G @ 640G */

Consider running a user-space application (SPDK) which internally uses vfio
in order to access a PCI endpoint directly.

vfio uses huge pages, which could come from the 640G bank (0x000000a0).
The application maps the hugepage with its physical address as the IOVA:
VFIO_IOMMU_MAP_DMA ends up calling iommu_map and, in turn, arm_lpae_map,
mapping IOVAs that are out of the host bridge's range.
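A minimal sketch of that user-space mapping (illustrative only: container_fd,
hugepage_va, hugepage_pa and HUGEPAGE_SIZE are assumed to be set up already;
headers are <linux/vfio.h>, <sys/ioctl.h>, <stdint.h> and <stdio.h>):

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)hugepage_va,	/* user VA of the hugepage */
		.iova  = hugepage_pa,			/* IOVA == physical address */
		.size  = HUGEPAGE_SIZE,
	};

	/* This is the request that ends up in iommu_map()/arm_lpae_map(). */
	if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map))
		perror("VFIO_IOMMU_MAP_DMA");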

So the way the kernel allocates IOVAs (where it honours the device dma_mask)
and the way user space gets IOVAs are different.

A single large entry such as
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; /* 512GB @ 0 */
will not work, because the upper memory banks lie above 512GB.

Instead we have to go for scattered dma-ranges, leaving holes, and we have
to reserve those holes so that no IOVA allocation lands in them.
This patch set addresses only the IOVA allocation problem.
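For instance, a scattered dma-ranges layout mirroring the memory banks might
look like this (illustrative only; flags and cell counts depend on the actual
binding):

dma-ranges = <0x43000000 0x00 0x80000000 0x00 0x80000000 0x00 0x80000000>, /* 2G @ 2G */
             <0x43000000 0x08 0x80000000 0x08 0x80000000 0x03 0x80000000>, /* 14G @ 34G */
             <0x43000000 0x90 0x00000000 0x90 0x00000000 0x04 0x00000000>, /* 16G @ 576G */
             <0x43000000 0xa0 0x00000000 0xa0 0x00000000 0x04 0x00000000>; /* 16G @ 640G */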

Changes since v7:
- Addressed Robin's comment about removing the dependency
between the IOMMU and OF layers.
- Bjorn Helgaas's comments addressed.

Changes since v6:
- Robin's comments addressed.

Changes since v5:
Changes since v4:
Changes since v3:
Changes since v2:
- minor changes, redundant checks removed
- removed internal review

Changes since v1:
- Addressed Rob's comments.
- Add a get_dma_ranges() function to the of_bus struct.
- Convert the existing contents of the of_dma_get_range function to
of_bus_default_dma_get_ranges and add that to the
default of_bus struct.
- Make of_dma_get_range call of_bus_match() and then bus->get_dma_ranges.


Oza Pawandeep (3):
OF/PCI: expose inbound memory interface to PCI RC drivers.
IOMMU/PCI: reserve IOVA for inbound memory for PCI masters
PCI: add support for inbound windows resources

drivers/iommu/dma-iommu.c | 44 ++++++++++++++++++++--
drivers/of/of_pci.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 30 +++++++++++++--
include/linux/of_pci.h | 7 ++++
include/linux/pci.h | 1 +
5 files changed, 170 insertions(+), 8 deletions(-)
--
1.9.1
Oza Pawandeep
2017-05-22 16:39:40 UTC
Permalink
This patch exports an interface to PCIe RC drivers so that the drivers can
get their inbound memory configuration.

It provides the basis for IOVA reservations for inbound memory holes when
the RC is not capable of addressing all of host memory, specifically when
an IOMMU is enabled on ARMv8, where 64-bit IOVAs could be allocated.

It handles multiple inbound windows and returns them as resources; it is
left to the caller how it wants to use them.
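A minimal sketch of how an RC driver might consume this interface
(hypothetical driver code, not part of this series; np and dev are assumed
to be the bridge's device_node and struct device):

	LIST_HEAD(inbound_resources);
	struct resource_entry *entry;
	int ret;

	/* Parse the host bridge's dma-ranges into a resource list. */
	ret = of_pci_get_dma_ranges(np, &inbound_resources);
	if (ret)
		return ret;

	/* Inspect (or reserve IOVAs for) each inbound window. */
	resource_list_for_each_entry(entry, &inbound_resources)
		dev_info(dev, "inbound window %pR (offset %pa)\n",
			 entry->res, &entry->offset);

	pci_free_resource_list(&inbound_resources);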

Signed-off-by: Oza Pawandeep <***@broadcom.com>

diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
index c9d4d3a..20cf527 100644
--- a/drivers/of/of_pci.c
+++ b/drivers/of/of_pci.c
@@ -283,6 +283,102 @@ int of_pci_get_host_bridge_resources(struct device_node *dev,
return err;
}
EXPORT_SYMBOL_GPL(of_pci_get_host_bridge_resources);
+
+/**
+ * of_pci_get_dma_ranges - Parse PCI host bridge inbound resources from DT
+ * @dn: device node of the host bridge having the dma-ranges property
+ * @resources: list where the range of resources will be added after DT parsing
+ *
+ * It is the caller's job to free the @resources list.
+ *
+ * This function will parse the "dma-ranges" property of a
+ * PCI host bridge device node and setup the resource mapping based
+ * on its content.
+ *
+ * It returns zero if the range parsing has been successful or a standard error
+ * value if it failed.
+ */
+
+int of_pci_get_dma_ranges(struct device_node *dn, struct list_head *resources)
+{
+ struct device_node *node = of_node_get(dn);
+ int rlen;
+ int pna = of_n_addr_cells(node);
+ const int na = 3, ns = 2;
+ int np = pna + na + ns;
+ int ret = 0;
+ struct resource *res;
+ const u32 *dma_ranges;
+ struct of_pci_range range;
+
+ if (!node)
+ return -EINVAL;
+
+ while (1) {
+ dma_ranges = of_get_property(node, "dma-ranges", &rlen);
+
+ /* Ignore empty ranges, they imply no translation required. */
+ if (dma_ranges && rlen > 0)
+ break;
+
+ /* no dma-ranges, they imply no translation required. */
+ if (!dma_ranges)
+ break;
+
+ node = of_get_next_parent(node);
+
+ if (!node)
+ break;
+ }
+
+ if (!dma_ranges) {
+ pr_debug("pcie device has no dma-ranges defined for node(%s)\n",
+ dn->full_name);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ while ((rlen -= np * 4) >= 0) {
+ range.pci_space = be32_to_cpup((const __be32 *) &dma_ranges[0]);
+ range.pci_addr = of_read_number(dma_ranges + 1, ns);
+ range.cpu_addr = of_translate_dma_address(node,
+ dma_ranges + na);
+ range.size = of_read_number(dma_ranges + pna + na, ns);
+ range.flags = IORESOURCE_MEM;
+
+ dma_ranges += np;
+
+ /*
+ * If we failed translation or got a zero-sized region
+ * then skip this range.
+ */
+ if (range.cpu_addr == OF_BAD_ADDR || range.size == 0)
+ continue;
+
+ res = kzalloc(sizeof(struct resource), GFP_KERNEL);
+ if (!res) {
+ ret = -ENOMEM;
+ goto parse_failed;
+ }
+
+ ret = of_pci_range_to_resource(&range, dn, res);
+ if (ret) {
+ kfree(res);
+ continue;
+ }
+
+ pci_add_resource_offset(resources, res,
+ res->start - range.pci_addr);
+ }
+ return ret;
+
+parse_failed:
+ pci_free_resource_list(resources);
+out:
+ of_node_put(node);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(of_pci_get_dma_ranges);
#endif /* CONFIG_OF_ADDRESS */

/**
diff --git a/include/linux/of_pci.h b/include/linux/of_pci.h
index 518c8d2..0eafe86 100644
--- a/include/linux/of_pci.h
+++ b/include/linux/of_pci.h
@@ -76,6 +76,7 @@ static inline void of_pci_check_probe_only(void) { }
int of_pci_get_host_bridge_resources(struct device_node *dev,
unsigned char busno, unsigned char bus_max,
struct list_head *resources, resource_size_t *io_base);
+int of_pci_get_dma_ranges(struct device_node *np, struct list_head *resources);
#else
static inline int of_pci_get_host_bridge_resources(struct device_node *dev,
unsigned char busno, unsigned char bus_max,
@@ -83,6 +84,12 @@ static inline int of_pci_get_host_bridge_resources(struct device_node *dev,
{
return -EINVAL;
}
+
+static inline int of_pci_get_dma_ranges(struct device_node *np,
+ struct list_head *resources)
+{
+ return -EINVAL;
+}
#endif

#endif
--
1.9.1
Oza Pawandeep
2017-05-22 16:39:41 UTC
Permalink
This patch adds support for inbound memory windows
for PCI RC drivers.

It defines a new function, pci_create_root_bus2(), which
takes the inbound resources as an argument and fills them
into the PCI host bridge structure as inbound_windows.

Legacy RC drivers can continue to use pci_create_root_bus(),
but any RC driver that wants to reserve IOVAs for its
inbound memory holes should use the new API, pci_create_root_bus2().
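A minimal, hypothetical call site (names are illustrative only; the outbound
'resources' list is built as today and 'inbound_resources' comes from
of_pci_get_dma_ranges() in patch 1):

	bus = pci_create_root_bus2(dev, 0, &foo_pcie_ops, pcie,
				   &resources, &inbound_resources);
	if (!bus) {
		pci_free_resource_list(&inbound_resources);
		return -ENOMEM;
	}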

Signed-off-by: Oza Pawandeep <***@broadcom.com>

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 19c8950..a95b9bb 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -531,6 +531,7 @@ struct pci_host_bridge *pci_alloc_host_bridge(size_t priv)
return NULL;

INIT_LIST_HEAD(&bridge->windows);
+ INIT_LIST_HEAD(&bridge->inbound_windows);

return bridge;
}
@@ -726,6 +727,7 @@ int pci_register_host_bridge(struct pci_host_bridge *bridge)
struct pci_bus *bus, *b;
resource_size_t offset;
LIST_HEAD(resources);
+ LIST_HEAD(inbound_resources);
struct resource *res;
char addr[64], *fmt;
const char *name;
@@ -739,6 +741,8 @@ int pci_register_host_bridge(struct pci_host_bridge *bridge)

/* temporarily move resources off the list */
list_splice_init(&bridge->windows, &resources);
+ list_splice_init(&bridge->inbound_windows, &inbound_resources);
+
bus->sysdata = bridge->sysdata;
bus->msi = bridge->msi;
bus->ops = bridge->ops;
@@ -794,6 +798,10 @@ int pci_register_host_bridge(struct pci_host_bridge *bridge)
else
pr_info("PCI host bridge to bus %s\n", name);

+ /* Add inbound mem resource. */
+ resource_list_for_each_entry_safe(window, n, &inbound_resources)
+ list_move_tail(&window->node, &bridge->inbound_windows);
+
/* Add initial resources to the bus */
resource_list_for_each_entry_safe(window, n, &resources) {
list_move_tail(&window->node, &bridge->windows);
@@ -2300,7 +2308,8 @@ void __weak pcibios_remove_bus(struct pci_bus *bus)

static struct pci_bus *pci_create_root_bus_msi(struct device *parent,
int bus, struct pci_ops *ops, void *sysdata,
- struct list_head *resources, struct msi_controller *msi)
+ struct list_head *resources, struct list_head *in_res,
+ struct msi_controller *msi)
{
int error;
struct pci_host_bridge *bridge;
@@ -2313,6 +2322,9 @@ static struct pci_bus *pci_create_root_bus_msi(struct device *parent,
bridge->dev.release = pci_release_host_bridge_dev;

list_splice_init(resources, &bridge->windows);
+ if (in_res)
+ list_splice_init(in_res, &bridge->inbound_windows);
+
bridge->sysdata = sysdata;
bridge->busnr = bus;
bridge->ops = ops;
@@ -2329,11 +2341,20 @@ static struct pci_bus *pci_create_root_bus_msi(struct device *parent,
return NULL;
}

+struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
+ struct pci_ops *ops, void *sysdata, struct list_head *resources,
+ struct list_head *in_res)
+{
+ return pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, in_res, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_create_root_bus2);
+
struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
struct pci_ops *ops, void *sysdata, struct list_head *resources)
{
- return pci_create_root_bus_msi(parent, bus, ops, sysdata, resources,
- NULL);
+ return pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, NULL, NULL);
}
EXPORT_SYMBOL_GPL(pci_create_root_bus);

@@ -2415,7 +2436,8 @@ struct pci_bus *pci_scan_root_bus_msi(struct device *parent, int bus,
break;
}

- b = pci_create_root_bus_msi(parent, bus, ops, sysdata, resources, msi);
+ b = pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, NULL, msi);
if (!b)
return NULL;

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 33c2b0b..d2df107 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -432,6 +432,7 @@ struct pci_host_bridge {
void *sysdata;
int busnr;
struct list_head windows; /* resource_entry */
+ struct list_head inbound_windows; /* inbound memory */
void (*release_fn)(struct pci_host_bridge *);
void *release_data;
struct msi_controller *msi;
--
1.9.1
Bjorn Helgaas
2017-05-30 22:42:46 UTC
Permalink
Post by Oza Pawandeep
This patch adds support for inbound memory window
for PCI RC drivers.
It defines new function pci_create_root_bus2 which
takes inbound resources as an argument and fills in the
memory resource to PCI internal host bridge structure
as inbound_windows.
Legacy RC driver could continue to use pci_create_root_bus,
but any RC driver who wants to reserve IOVAs for their
inbound memory holes, should use new API pci_create_root_bus2.
...
+struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
+ struct pci_ops *ops, void *sysdata, struct list_head *resources,
+ struct list_head *in_res)
+{
+ return pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, in_res, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_create_root_bus2);
Based on your response to Lorenzo's "[RFC/RFT PATCH 03/18] PCI:
Introduce pci_scan_root_bus_bridge()", I'm hoping you can avoid adding
yet another variant of pci_create_root_bus().

So I think I can wait for that to settle out and look for a v8?

Bjorn
Oza Oza
2017-05-31 16:17:45 UTC
Permalink
Post by Bjorn Helgaas
Post by Oza Pawandeep
This patch adds support for inbound memory window
for PCI RC drivers.
It defines new function pci_create_root_bus2 which
takes inbound resources as an argument and fills in the
memory resource to PCI internal host bridge structure
as inbound_windows.
Legacy RC driver could continue to use pci_create_root_bus,
but any RC driver who wants to reserve IOVAs for their
inbound memory holes, should use new API pci_create_root_bus2.
...
+struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
+ struct pci_ops *ops, void *sysdata, struct list_head *resources,
+ struct list_head *in_res)
+{
+ return pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, in_res, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_create_root_bus2);
Introduce pci_scan_root_bus_bridge()", I'm hoping you can avoid adding
yet another variant of pci_create_root_bus().
So I think I can wait for that to settle out and look for a v8?
Bjorn
Sure Bjorn, please wait for v8.

But there is one more associated patch,
[PATCH v7 1/3] OF/PCI: Export inbound memory interface to PCI RC,
which basically aims to provide an interface to RC drivers for their
inbound resources.
RC drivers already get their outbound resources from
of_pci_get_host_bridge_resources(); this is a similar attempt for
inbound dma-ranges.

Thank you for looking into this.

Regards,
Oza.
Bjorn Helgaas
2017-06-01 17:08:59 UTC
Permalink
Post by Oza Oza
Post by Bjorn Helgaas
Post by Oza Pawandeep
This patch adds support for inbound memory window
for PCI RC drivers.
It defines new function pci_create_root_bus2 which
takes inbound resources as an argument and fills in the
memory resource to PCI internal host bridge structure
as inbound_windows.
Legacy RC driver could continue to use pci_create_root_bus,
but any RC driver who wants to reserve IOVAs for their
inbound memory holes, should use new API pci_create_root_bus2.
...
+struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
+ struct pci_ops *ops, void *sysdata, struct list_head *resources,
+ struct list_head *in_res)
+{
+ return pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, in_res, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_create_root_bus2);
Introduce pci_scan_root_bus_bridge()", I'm hoping you can avoid adding
yet another variant of pci_create_root_bus().
So I think I can wait for that to settle out and look for a v8?
Bjorn
Sure Bjorn, please wait for v8.
But there is one more associated patch
[PATCH v7 1/3] OF/PCI: Export inbound memory interface to PCI RC
which basically aims to provide an interface to RC drivers for their
inbound resources.
RC driver already get their outbound resources from
of_pci_get_host_bridge_resources,
similar attempt for inbound dma-ranges.
Not sure I understand. Patch 1/3 adds of_pci_get_dma_ranges(), but
none of the patches adds a caller, so I don't see the point of it yet.

In general, if I'm expecting another revision of one patch in a
series, I expect the next revision to include *all* the patches in the
series. I normally don't pick out and apply individual patches from
the series.

Bjorn
Oza Oza
2017-06-01 18:06:17 UTC
Permalink
Post by Bjorn Helgaas
Post by Oza Oza
Post by Bjorn Helgaas
Post by Oza Pawandeep
This patch adds support for inbound memory window
for PCI RC drivers.
It defines new function pci_create_root_bus2 which
takes inbound resources as an argument and fills in the
memory resource to PCI internal host bridge structure
as inbound_windows.
Legacy RC driver could continue to use pci_create_root_bus,
but any RC driver who wants to reserve IOVAs for their
inbound memory holes, should use new API pci_create_root_bus2.
...
+struct pci_bus *pci_create_root_bus2(struct device *parent, int bus,
+ struct pci_ops *ops, void *sysdata, struct list_head *resources,
+ struct list_head *in_res)
+{
+ return pci_create_root_bus_msi(parent, bus, ops, sysdata,
+ resources, in_res, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_create_root_bus2);
Introduce pci_scan_root_bus_bridge()", I'm hoping you can avoid adding
yet another variant of pci_create_root_bus().
So I think I can wait for that to settle out and look for a v8?
Bjorn
Sure Bjorn, please wait for v8.
But there is one more associated patch
[PATCH v7 1/3] OF/PCI: Export inbound memory interface to PCI RC
which basically aims to provide an interface to RC drivers for their
inbound resources.
RC driver already get their outbound resources from
of_pci_get_host_bridge_resources,
similar attempt for inbound dma-ranges.
Not sure I understand. Patch 1/3 adds of_pci_get_dma_ranges(), but
none of the patches adds a caller, so I don't see the point of it yet.
In general, if I'm expecting another revision of one patch in a
series, I expect the next revision to include *all* the patches in the
series. I normally don't pick out and apply individual patches from
the series.
Bjorn
Yes, it does not get called by anybody yet, because it is supposed to be
called by RC drivers that want to reserve IOVAs. Not every PCI host bridge
driver will call it, but an iproc-based PCI driver certainly has to.

In PATCH v8 I will include the PCI RC driver patch that calls it as well.
Thanks for the review.

Regards,
Oza.
Oza Pawandeep
2017-05-22 16:39:42 UTC
Permalink
This patch reserves the inbound memory holes for PCI masters.
ARM64-based SoCs may have scattered memory banks.
For example, an iproc-based SoC has:

<0x00000000 0x80000000 0x0 0x80000000>, /* 2G @ 2G */
<0x00000008 0x80000000 0x3 0x80000000>, /* 14G @ 34G */
<0x00000090 0x00000000 0x4 0x00000000>, /* 16G @ 576G */
<0x000000a0 0x00000000 0x4 0x00000000>; /* 16G @ 640G */

But the incoming PCI transaction addressing capability is limited by the
host bridge; for example, if the maximum incoming window capability is
512GB, then the banks at 0x00000090 and 0x000000a0 fall beyond it.

To address this problem, the IOMMU has to avoid allocating IOVAs that are
reserved, which in turn means no IOVA is allocated if it falls into a hole,
and the holes must be reserved before any IOVA allocation can happen.
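For the example banks above, and assuming the inbound windows exactly mirror
the memory banks and are sorted by address, the reserved IOVA regions would
roughly be (illustrative only):

0x00_0000_0000 .. 0x00_7fff_ffff	/* below the first bank (0 - 2G) */
0x01_0000_0000 .. 0x08_7fff_ffff	/* between 4G and 34G */
0x0c_0000_0000 .. 0x8f_ffff_ffff	/* between 48G and 576G */
0x94_0000_0000 .. 0x9f_ffff_ffff	/* between 592G and 640G */
0xa4_0000_0000 .. upper DMA limit	/* above the last bank */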

Signed-off-by: Oza Pawandeep <***@broadcom.com>

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366..efe3d07 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -171,16 +171,15 @@ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
{
struct pci_host_bridge *bridge;
struct resource_entry *window;
+ struct iommu_resv_region *region;
+ phys_addr_t start, end;
+ size_t length;

if (!dev_is_pci(dev))
return;

bridge = pci_find_host_bridge(to_pci_dev(dev)->bus);
resource_list_for_each_entry(window, &bridge->windows) {
- struct iommu_resv_region *region;
- phys_addr_t start;
- size_t length;
-
if (resource_type(window->res) != IORESOURCE_MEM)
continue;

@@ -193,6 +192,43 @@ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)

list_add_tail(&region->list, list);
}
+
+ /* PCI inbound memory reservation. */
+ start = length = 0;
+ resource_list_for_each_entry(window, &bridge->inbound_windows) {
+ end = window->res->start - window->offset;
+
+ if (start > end) {
+ /* multiple ranges assumed sorted. */
+ pr_warn("PCI: failed to reserve iovas\n");
+ return;
+ }
+
+ if (start != end) {
+ length = end - start - 1;
+ region = iommu_alloc_resv_region(start, length, 0,
+ IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+
+ list_add_tail(&region->list, list);
+ }
+
+ start += end + length + 1;
+ }
+ /*
+ * the last dma-range should honour based on the
+ * 32/64-bit dma addresses.
+ */
+ if ((start) && (start < DMA_BIT_MASK(sizeof(dma_addr_t) * 8))) {
+ length = DMA_BIT_MASK((sizeof(dma_addr_t) * 8)) - 1;
+ region = iommu_alloc_resv_region(start, length, 0,
+ IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+
+ list_add_tail(&region->list, list);
+ }
}
EXPORT_SYMBOL(iommu_dma_get_resv_regions);
--
1.9.1
Oza Oza
2017-07-19 12:07:39 UTC
Permalink
Hi Robin,

My apologies for the noise.

I have taken care of your comments,
but this whole patch set (especially the PCI patches adding the inbound
memory windows) depends on Lorenzo's patch set.
So I will post version 8 of the IOVA reservation patches soon after
Lorenzo's patches are merged.

Regards,
Oza.
Post by Oza Pawandeep
This patch reserves the inbound memory holes for PCI masters.
ARM64 based SOCs may have scattered memory banks.
For e.g as iproc based SOC has
But incoming PCI transaction addressing capability is limited
by host bridge, for example if max incoming window capability
is 512 GB, then 0x00000090 and 0x000000a0 will fall beyond it.
To address this problem, iommu has to avoid allocating IOVA which
are reserved.
Which in turn does not allocate IOVA if it falls into hole.
and the holes should be reserved before any of the IOVA allocations
can happen.
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8348f366..efe3d07 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -171,16 +171,15 @@ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
{
struct pci_host_bridge *bridge;
struct resource_entry *window;
+ struct iommu_resv_region *region;
+ phys_addr_t start, end;
+ size_t length;
if (!dev_is_pci(dev))
return;
bridge = pci_find_host_bridge(to_pci_dev(dev)->bus);
resource_list_for_each_entry(window, &bridge->windows) {
- struct iommu_resv_region *region;
- phys_addr_t start;
- size_t length;
-
if (resource_type(window->res) != IORESOURCE_MEM)
continue;
@@ -193,6 +192,43 @@ void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list)
list_add_tail(&region->list, list);
}
+
+ /* PCI inbound memory reservation. */
+ start = length = 0;
+ resource_list_for_each_entry(window, &bridge->inbound_windows) {
+ end = window->res->start - window->offset;
+
+ if (start > end) {
+ /* multiple ranges assumed sorted. */
+ pr_warn("PCI: failed to reserve iovas\n");
+ return;
+ }
+
+ if (start != end) {
+ length = end - start - 1;
+ region = iommu_alloc_resv_region(start, length, 0,
+ IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+
+ list_add_tail(&region->list, list);
+ }
+
+ start += end + length + 1;
+ }
+ /*
+ * the last dma-range should honour based on the
+ * 32/64-bit dma addresses.
+ */
+ if ((start) && (start < DMA_BIT_MASK(sizeof(dma_addr_t) * 8))) {
+ length = DMA_BIT_MASK((sizeof(dma_addr_t) * 8)) - 1;
+ region = iommu_alloc_resv_region(start, length, 0,
+ IOMMU_RESV_RESERVED);
+ if (!region)
+ return;
+
+ list_add_tail(&region->list, list);
+ }
}
EXPORT_SYMBOL(iommu_dma_get_resv_regions);
--
1.9.1
Alex Williamson
2017-05-22 19:18:38 UTC
Permalink
On Mon, 22 May 2017 22:09:39 +0530
Post by Oza Pawandeep
iproc based PCI RC and Stingray SOC has limitation of addressing only 512GB
memory at once.
IOVA allocation honors device's coherent_dma_mask/dma_mask.
In PCI case, current code honors DMA mask set by EP, there is no
concept of PCI host bridge dma-mask, should be there and hence
could truly reflect the limitation of PCI host bridge.
However assuming Linux takes care of largest possible dma_mask, still the
limitation could exist, because of the way memory banks are implemented.
When run User space (SPDK) which internally uses vfio in order to access
PCI EndPoint directly.
Vfio uses huge-pages which could come from 640G/0x000000a0.
And the way vfio maps the hugepage is to have phys addr as iova,
and ends up calling VFIO_IOMMU_MAP_DMA ends up calling iommu_map,
in turn arm_lpae_map mapping iovas out of range.
So the way kernel allocates IOVA (where it honours device dma_mask) and
the way userspace gets IOVA is different.
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; will not work.
Instead we have to go for scattered dma-ranges leaving holes.
Hence, we have to reserve IOVA allocations for inbound memory.
The patch-set caters to only addressing IOVA allocation problem.
The description here confuses me; with vfio the user owns the iova
allocation problem. Mappings are only identity mapped if the user
chooses to do so. The dma_mask of the device is set by the driver and
only relevant to the DMA-API. vfio is a meta-driver and doesn't know
the dma_mask of any particular device, that's the user's job. Is the
net result of what's happening here for the vfio case simply to expose
extra reserved regions in sysfs, which the user can then consume to
craft a compatible iova? Thanks,

Alex
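(For reference, user space can list an IOMMU group's reserved regions via
sysfs, e.g. /sys/kernel/iommu_groups/<group>/reserved_regions, one
"start end type" line per region. A minimal sketch, with the group number
hard-coded purely for illustration:)

	FILE *f = fopen("/sys/kernel/iommu_groups/0/reserved_regions", "r");
	unsigned long long start, end;
	char type[32];

	if (f) {
		while (fscanf(f, "%llx %llx %31s", &start, &end, type) == 3)
			printf("reserved: 0x%llx - 0x%llx (%s)\n",
			       start, end, type);
		fclose(f);
	}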
Post by Oza Pawandeep
- Robin's comment addressed
where he wanted to remove dependency between IOMMU and OF layer.
- Bjorn Helgaas's comments addressed.
- Robin's comments addressed.
- minor changes, redundant checks removed
- removed internal review
- address Rob's comments.
- Add a get_dma_ranges() function to of_bus struct..
- Convert existing contents of of_dma_get_range function to
of_bus_default_dma_get_ranges and adding that to the
default of_bus struct.
- Make of_dma_get_range call of_bus_match() and then bus->get_dma_ranges.
OF/PCI: expose inbound memory interface to PCI RC drivers.
IOMMU/PCI: reserve IOVA for inbound memory for PCI masters
PCI: add support for inbound windows resources
drivers/iommu/dma-iommu.c | 44 ++++++++++++++++++++--
drivers/of/of_pci.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 30 +++++++++++++--
include/linux/of_pci.h | 7 ++++
include/linux/pci.h | 1 +
5 files changed, 170 insertions(+), 8 deletions(-)
Oza Oza
2017-05-23 05:00:51 UTC
Permalink
On Tue, May 23, 2017 at 12:48 AM, Alex Williamson
Post by Alex Williamson
On Mon, 22 May 2017 22:09:39 +0530
Post by Oza Pawandeep
iproc based PCI RC and Stingray SOC has limitation of addressing only 512GB
memory at once.
IOVA allocation honors device's coherent_dma_mask/dma_mask.
In PCI case, current code honors DMA mask set by EP, there is no
concept of PCI host bridge dma-mask, should be there and hence
could truly reflect the limitation of PCI host bridge.
However assuming Linux takes care of largest possible dma_mask, still the
limitation could exist, because of the way memory banks are implemented.
When run User space (SPDK) which internally uses vfio in order to access
PCI EndPoint directly.
Vfio uses huge-pages which could come from 640G/0x000000a0.
And the way vfio maps the hugepage is to have phys addr as iova,
and ends up calling VFIO_IOMMU_MAP_DMA ends up calling iommu_map,
in turn arm_lpae_map mapping iovas out of range.
So the way kernel allocates IOVA (where it honours device dma_mask) and
the way userspace gets IOVA is different.
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>; will not work.
Instead we have to go for scattered dma-ranges leaving holes.
Hence, we have to reserve IOVA allocations for inbound memory.
The patch-set caters to only addressing IOVA allocation problem.
The description here confuses me, with vfio the user owns the iova
allocation problem. Mappings are only identity mapped if the user
chooses to do so. The dma_mask of the device is set by the driver and
only relevant to the DMA-API. vfio is a meta-driver and doesn't know
the dma_mask of any particular device, that's the user's job. Is the
net result of what's happening here for the vfio case simply to expose
extra reserved regions in sysfs, which the user can then consume to
craft a compatible iova? Thanks,
Alex
Hi Alex,

This is not a VFIO problem; the reason I mentioned VFIO is that I wanted
to present the problem statement as a whole (which includes both kernel
space and user space).
The way the SPDK pipeline is set up, yes, the mappings are identity mapped,
and whatever IOVA user space passes down, VFIO uses as-is, which is fine
and expected.

But the problem is that the user-space physical memory (hugepages) can
reside high enough in memory to be beyond the PCI RC's addressing
capability.

Again, this is neither VFIO's problem nor user space's.
In fact, both have nothing to do with the dma-mask either.
My reference to the dma-mask was for the Linux IOMMU framework (not for VFIO).

Regards,
Oza.
Post by Alex Williamson
Post by Oza Pawandeep
- Robin's comment addressed
where he wanted to remove dependency between IOMMU and OF layer.
- Bjorn Helgaas's comments addressed.
- Robin's comments addressed.
- minor changes, redundant checks removed
- removed internal review
- address Rob's comments.
- Add a get_dma_ranges() function to of_bus struct..
- Convert existing contents of of_dma_get_range function to
of_bus_default_dma_get_ranges and adding that to the
default of_bus struct.
- Make of_dma_get_range call of_bus_match() and then bus->get_dma_ranges.
OF/PCI: expose inbound memory interface to PCI RC drivers.
IOMMU/PCI: reserve IOVA for inbound memory for PCI masters
PCI: add support for inbound windows resources
drivers/iommu/dma-iommu.c | 44 ++++++++++++++++++++--
drivers/of/of_pci.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 30 +++++++++++++--
include/linux/of_pci.h | 7 ++++
include/linux/pci.h | 1 +
5 files changed, 170 insertions(+), 8 deletions(-)