Discussion:
[PATCH V4 0/6] iommu/arm-smmu: Add runtime pm/sleep support
Vivek Gautam
2017-07-06 09:36:59 UTC
Permalink
This series provides support for turning on the arm-smmu's
clocks/power domains using runtime pm. This is done using the
recently introduced device links patches, which let the smmu's
runtime PM follow the masters' runtime PM, so the smmu remains
powered only while the masters are using it.
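For reference, the device-link hookup in patch 3 boils down to a single
call when a master is added to the smmu; a minimal sketch (the exact
call site and flags in the patch may differ slightly):

/*
 * Make the smmu (supplier) runtime PM follow the master (consumer),
 * e.g. from arm_smmu_add_device().
 */
device_link_add(dev, smmu->dev, DL_FLAG_PM_RUNTIME);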

Took some reference from the exynos runtime pm patches [2].
Tested this with MDP, GPU, and VENUS devices on the apq8096-db820c board.

Previous version of the patchset [1].

[V4]
* Reworked the clock handling part. We now take clock names as data
in the driver for supported compatible versions, and loop over them
to get, enable, and disable the clocks.
* Using qcom,msm8996 based compatibles for bindings instead of a generic
qcom compatible.
* Refactor MMU500 patch to just add the necessary clock names data and
corresponding bindings.
* Added the pm_runtime_get/put() calls in the .unmap iommu op (fix added by
Stanimir on top of the previous patch version).
* Added a patch to fix error path in arm_smmu_add_device()
* Removed patch 3/5 of V3 patch series that added qcom,smmu-v2 bindings.

[V3]
* Reworked the patches to keep the clocks init/enabling function
separately for each compatible.

* Added clocks bindings for MMU40x/500.

* Added a new compatible for qcom,smmu-v2 implementation and
the clock bindings for the same.

* Rebased on top of 4.11-rc1

[V2]
* Split the patches a little differently.

* Addressed comments.

* Removed patch #4 [3] from the previous post
(arm-smmu context save/restore). Planning to
post this separately after reworking/addressing Robin's
feedback.

* Reversed the sequence so that clocks are disabled in the
opposite order to which they were enabled. This is required
for cases where the clocks are populated in a dependent
order in the DT.

[1] https://www.spinics.net/lists/arm-kernel/msg567488.html
[2] https://lkml.org/lkml/2016/10/20/70
[3] https://patchwork.kernel.org/patch/9389717/

Sricharan R (4):
iommu/arm-smmu: Add pm_runtime/sleep ops
iommu/arm-smmu: Invoke pm_runtime during probe, add/remove device
iommu/arm-smmu: Add the device_link between masters and smmu
iommu/arm-smmu: Add support for MMU40x/500 clocks

Vivek Gautam (2):
iommu/arm-smmu: Fix the error path in arm_smmu_add_device
iommu/arm-smmu: Add support for qcom,msm8996-smmu-v2 clocks

.../devicetree/bindings/iommu/arm,smmu.txt | 42 +++++
drivers/iommu/arm-smmu.c | 191 +++++++++++++++++++--
2 files changed, 222 insertions(+), 11 deletions(-)
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-06 09:37:00 UTC
Permalink
fwspec->iommu_priv is available only after the arm_smmu_master_cfg
instance has been allocated, so we shouldn't free it before that.
It is also more logical to free the master cfg itself without
checking for fwspec.

Signed-off-by: Vivek Gautam <***@codeaurora.org>
---
drivers/iommu/arm-smmu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 9a45117d90de..61b1f8729a7c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1349,15 +1349,15 @@ static int arm_smmu_add_device(struct device *dev)

ret = arm_smmu_master_alloc_smes(dev);
if (ret)
- goto out_free;
+ goto out_cfg_free;

iommu_device_link(&smmu->iommu, dev);

return 0;

+out_cfg_free:
+ kfree(cfg);
out_free:
- if (fwspec)
- kfree(fwspec->iommu_priv);
iommu_fwspec_free(dev);
return ret;
}
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-06 09:37:01 UTC
Permalink
From: Sricharan R <***@codeaurora.org>

The smmu needs to be functional only when the respective
masters using it are active. The device_link feature
helps to track such functional dependencies, so that the
iommu gets powered when the master device enables itself
using pm_runtime. By adapting the smmu driver for
runtime pm, this dependency can be addressed.

This patch adds the pm runtime/sleep callbacks to the
driver, along with functions to parse the smmu clocks
from DT, enabling them on resume and disabling them on
suspend.

Signed-off-by: Sricharan R <***@codeaurora.org>
Signed-off-by: Archit Taneja <***@codeaurora.org>
[vivek: Clock rework to loop over clock names data]
Signed-off-by: Vivek Gautam <***@codeaurora.org>
---
drivers/iommu/arm-smmu.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 94 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 61b1f8729a7c..bfe613f8939c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -48,6 +48,7 @@
#include <linux/of_iommu.h>
#include <linux/pci.h>
#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

@@ -196,6 +197,9 @@ struct arm_smmu_device {
u32 num_global_irqs;
u32 num_context_irqs;
unsigned int *irqs;
+ int num_clks;
+ struct clk **clocks;
+ const char * const *clk_names;

u32 cavium_id_base; /* Specific to Cavium */

@@ -272,6 +276,32 @@ static void parse_driver_options(struct arm_smmu_device *smmu)
} while (arm_smmu_options[++i].opt);
}

+static int arm_smmu_enable_clocks(struct arm_smmu_device *smmu)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < smmu->num_clks; ++i) {
+ ret = clk_prepare_enable(smmu->clocks[i]);
+ if (ret) {
+ dev_err(smmu->dev, "Couldn't enable %s clock\n",
+ smmu->clk_names[i]);
+ while (i--)
+ clk_disable_unprepare(smmu->clocks[i]);
+ break;
+ }
+ }
+
+ return ret;
+}
+
+static void arm_smmu_disable_clocks(struct arm_smmu_device *smmu)
+{
+ int i = smmu->num_clks;
+
+ while (i--)
+ clk_disable_unprepare(smmu->clocks[i]);
+}
+
static struct device_node *dev_get_dev_node(struct device *dev)
{
if (dev_is_pci(dev)) {
@@ -1626,6 +1656,36 @@ static int arm_smmu_id_size_to_bits(int size)
}
}

+static int arm_smmu_init_clocks(struct arm_smmu_device *smmu)
+{
+ int i, err;
+ struct device *dev = smmu->dev;
+
+ if (smmu->num_clks < 1)
+ return 0;
+
+ smmu->clocks = devm_kcalloc(dev, smmu->num_clks,
+ sizeof(*smmu->clocks), GFP_KERNEL);
+ if (!smmu->clocks)
+ return -ENOMEM;
+
+ for (i = 0; i < smmu->num_clks; i++) {
+ const char *cname = smmu->clk_names[i];
+ struct clk *c = devm_clk_get(dev, cname);
+
+ if (IS_ERR(c)) {
+ err = PTR_ERR(c);
+ if (err != -EPROBE_DEFER)
+ dev_err(dev, "Couldn't get clock: %s\n", cname);
+
+ return err;
+ }
+ smmu->clocks[i] = c;
+ }
+
+ return 0;
+}
+
static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
{
unsigned long size;
@@ -1833,10 +1893,12 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu)
struct arm_smmu_match_data {
enum arm_smmu_arch_version version;
enum arm_smmu_implementation model;
+ const char * const *clks;
+ int num_clks;
};

#define ARM_SMMU_MATCH_DATA(name, ver, imp) \
-static struct arm_smmu_match_data name = { .version = ver, .model = imp }
+static const struct arm_smmu_match_data name = { .version = ver, .model = imp }

ARM_SMMU_MATCH_DATA(smmu_generic_v1, ARM_SMMU_V1, GENERIC_SMMU);
ARM_SMMU_MATCH_DATA(smmu_generic_v2, ARM_SMMU_V2, GENERIC_SMMU);
@@ -1937,6 +1999,8 @@ static int arm_smmu_device_dt_probe(struct platform_device *pdev,
data = of_device_get_match_data(dev);
smmu->version = data->version;
smmu->model = data->model;
+ smmu->clk_names = data->clks;
+ smmu->num_clks = data->num_clks;

parse_driver_options(smmu);

@@ -2035,6 +2099,10 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
smmu->irqs[i] = irq;
}

+ err = arm_smmu_init_clocks(smmu);
+ if (err)
+ return err;
+
err = arm_smmu_device_cfg_probe(smmu);
if (err)
return err;
@@ -2120,10 +2188,35 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
return 0;
}

+#ifdef CONFIG_PM
+static int arm_smmu_resume(struct device *dev)
+{
+ struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
+ return arm_smmu_enable_clocks(smmu);
+}
+
+static int arm_smmu_suspend(struct device *dev)
+{
+ struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
+ arm_smmu_disable_clocks(smmu);
+
+ return 0;
+}
+#endif
+
+static const struct dev_pm_ops arm_smmu_pm_ops = {
+ SET_RUNTIME_PM_OPS(arm_smmu_suspend, arm_smmu_resume, NULL)
+ SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
+ pm_runtime_force_resume)
+};
+
static struct platform_driver arm_smmu_driver = {
.driver = {
.name = "arm-smmu",
.of_match_table = of_match_ptr(arm_smmu_of_match),
+ .pm = &arm_smmu_pm_ops,
},
.probe = arm_smmu_device_probe,
.remove = arm_smmu_device_remove,
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Stephen Boyd
2017-07-12 22:58:21 UTC
Permalink
Post by Vivek Gautam
[...]
[vivek: Clock rework to loop over clock names data]
---
General comment, we have a bulk clk API now, but I guess we
failed to add the clk_bulk_prepare_enable() API that could be
used here. Perhaps you can add that API and then use it here to
reduce lines of code.
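For illustration, a sketch of what the driver's clock handling could
look like with the bulk clk API, assuming clk_bulk_get() plus the
clk_bulk_prepare_enable()/clk_bulk_disable_unprepare() helpers being
discussed here:

struct clk_bulk_data *clks;	/* one entry per clock, .id = clock name */

/* probe: look up all clocks in one call */
err = clk_bulk_get(smmu->dev, smmu->num_clks, clks);
if (err)
	return err;

/* runtime resume */
err = clk_bulk_prepare_enable(smmu->num_clks, clks);

/* runtime suspend */
clk_bulk_disable_unprepare(smmu->num_clks, clks);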
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Stephen Boyd
2017-07-12 23:01:52 UTC
Permalink
Post by Stephen Boyd
[...]
General comment, we have a bulk clk API now, but I guess we
failed to add the clk_bulk_prepare_enable() API that could be
used here. Perhaps you can add that API and then use it here to
reduce lines of code.
Bjorn just sent a patch for that API an hour ago.
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-13 03:57:44 UTC
Permalink
Post by Stephen Boyd
[...]
General comment, we have a bulk clk API now, but I guess we
failed to add the clk_bulk_prepare_enable() API that could be
used here. Perhaps you can add that API and then use it here to
reduce lines of code.
Sure, will use the bulk clock APIs to handle the clocks.

Best regards
Vivek
Post by Stephen Boyd
Bjorn just sent a patch for that API an hour ago.
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-06 09:37:02 UTC
Permalink
From: Sricharan R <***@codeaurora.org>

The smmu device probe/remove and add/remove master device callbacks
get called when the smmu is not linked to its master, that is, without
the context of the master device. So call the runtime PM APIs
explicitly in those paths.

Signed-off-by: Sricharan R <***@codeaurora.org>
[stanimir: added runtime pm in .unmap iommu op]
Signed-off-by: Stanimir Varbanov <***@linaro.org>
[vivek: Cleanup pm runtime calls]
Signed-off-by: Vivek Gautam <***@codeaurora.org>
---
drivers/iommu/arm-smmu.c | 54 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index bfe613f8939c..ddbfa8ab69e6 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -897,11 +897,15 @@ static void arm_smmu_destroy_domain_context(struct iommu_domain *domain)
struct arm_smmu_device *smmu = smmu_domain->smmu;
struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
void __iomem *cb_base;
- int irq;
+ int ret, irq;

if (!smmu || domain->type == IOMMU_DOMAIN_IDENTITY)
return;

+ ret = pm_runtime_get_sync(smmu->dev);
+ if (ret < 0)
+ return;
+
/*
* Disable the context bank and free the page tables before freeing
* it.
@@ -916,6 +920,8 @@ static void arm_smmu_destroy_domain_context(struct iommu_domain *domain)

free_io_pgtable_ops(smmu_domain->pgtbl_ops);
__arm_smmu_free_bitmap(smmu->context_map, cfg->cbndx);
+
+ pm_runtime_put_sync(smmu->dev);
}

static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
@@ -1231,12 +1237,18 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
size_t size)
{
- struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+ size_t ret;

if (!ops)
return 0;

- return ops->unmap(ops, iova, size);
+ pm_runtime_get_sync(smmu_domain->smmu->dev);
+ ret = ops->unmap(ops, iova, size);
+ pm_runtime_put_sync(smmu_domain->smmu->dev);
+
+ return ret;
}

static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
@@ -1377,12 +1389,20 @@ static int arm_smmu_add_device(struct device *dev)
while (i--)
cfg->smendx[i] = INVALID_SMENDX;

- ret = arm_smmu_master_alloc_smes(dev);
+ ret = pm_runtime_get_sync(smmu->dev);
if (ret)
goto out_cfg_free;

+ ret = arm_smmu_master_alloc_smes(dev);
+ if (ret) {
+ pm_runtime_put_sync(smmu->dev);
+ goto out_cfg_free;
+ }
+
iommu_device_link(&smmu->iommu, dev);

+ pm_runtime_put_sync(smmu->dev);
+
return 0;

out_cfg_free:
@@ -1397,7 +1417,7 @@ static void arm_smmu_remove_device(struct device *dev)
struct iommu_fwspec *fwspec = dev->iommu_fwspec;
struct arm_smmu_master_cfg *cfg;
struct arm_smmu_device *smmu;
-
+ int ret;

if (!fwspec || fwspec->ops != &arm_smmu_ops)
return;
@@ -1405,8 +1425,21 @@ static void arm_smmu_remove_device(struct device *dev)
cfg = fwspec->iommu_priv;
smmu = cfg->smmu;

+ /*
+ * The device link between the master device and
+ * smmu is already purged at this point.
+ * So enable the power to smmu explicitly.
+ */
+
+ ret = pm_runtime_get_sync(smmu->dev);
+ if (ret < 0)
+ return;
+
iommu_device_unlink(&smmu->iommu, dev);
arm_smmu_master_free_smes(fwspec);
+
+ pm_runtime_put_sync(smmu->dev);
+
iommu_group_remove_device(dev);
kfree(fwspec->iommu_priv);
iommu_fwspec_free(dev);
@@ -2103,6 +2136,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
if (err)
return err;

+ platform_set_drvdata(pdev, smmu);
+ pm_runtime_enable(dev);
+
+ err = pm_runtime_get_sync(dev);
+ if (err < 0)
+ return err;
+
err = arm_smmu_device_cfg_probe(smmu);
if (err)
return err;
@@ -2144,9 +2184,9 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
return err;
}

- platform_set_drvdata(pdev, smmu);
arm_smmu_device_reset(smmu);
arm_smmu_test_smr_masks(smmu);
+ pm_runtime_put_sync(dev);

/*
* For ACPI and generic DT bindings, an SMMU will be probed before
@@ -2185,6 +2225,8 @@ static int arm_smmu_device_remove(struct platform_device *pdev)

/* Turn the thing off */
writel(sCR0_CLIENTPD, ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);
+ pm_runtime_force_suspend(smmu->dev);
+
return 0;
}
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Stephen Boyd
2017-07-12 22:54:59 UTC
Permalink
Post by Vivek Gautam
@@ -1231,12 +1237,18 @@ static int arm_smmu_map(struct iommu_domain *domain, unsigned long iova,
static size_t arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
size_t size)
{
- struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+ struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+ struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
+ size_t ret;
if (!ops)
return 0;
- return ops->unmap(ops, iova, size);
+ pm_runtime_get_sync(smmu_domain->smmu->dev);
Can these map/unmap ops be called from an atomic context? I seem
to recall that being a problem before.
Post by Vivek Gautam
+ ret = ops->unmap(ops, iova, size);
+ pm_runtime_put_sync(smmu_domain->smmu->dev);
+
+ return ret;
}
static phys_addr_t arm_smmu_iova_to_phys_hard(struct iommu_domain *domain,
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-13 05:13:59 UTC
Permalink
Hi Stephen,
Post by Stephen Boyd
[...]
Can these map/unmap ops be called from an atomic context? I seem
to recall that being a problem before.
That's something which was dropped in the following patch merged in master:
523d7423e21b iommu/arm-smmu: Remove io-pgtable spinlock

Looks like we don't need locks here anymore?

Best Regards
Vivek
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Sricharan R
2017-07-13 05:35:59 UTC
Permalink
Hi Vivek,
Post by Vivek Gautam
[...]
Can these map/unmap ops be called from an atomic context? I seem
to recall that being a problem before.
523d7423e21b iommu/arm-smmu: Remove io-pgtable spinlock
Looks like we don't need locks here anymore?
Apart from the locking, I wonder why an explicit pm_runtime call is
needed from unmap. It seems like some path in the master using it
should have enabled the pm?

Regards,
Sricharan
--
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Marek Szyprowski
2017-07-13 12:02:40 UTC
Permalink
Hi All,
Post by Sricharan R
[...]
Apart from the locking, I wonder why an explicit pm_runtime call is
needed from unmap. It seems like some path in the master using it
should have enabled the pm?
Yes, there are a bunch of scenarios where unmap can happen with
disabled master (but not in atomic context). On the gpu side we
opportunistically keep a buffer mapping until the buffer is freed
(which can happen after gpu is disabled). Likewise, v4l2 won't unmap
an exported dmabuf while some other driver holds a reference to it
(which can be dropped when the v4l2 device is suspended).
Since unmap triggers tbl flush which touches iommu regs, the iommu
driver *definitely* needs a pm_runtime_get_sync().
Afair unmap might be called from atomic context as well, for example as
a result of dma_unmap_page(). In the exynos IOMMU driver I simply check
the runtime PM state of the IOMMU device. The TLB flush is performed
only when the IOMMU is in the active state. If it is suspended, I assume
that the IOMMU controller's context is already lost and its respective
power domain might already be turned off, so there is no point in
touching the IOMMU registers.
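Roughly like the following sketch; the names approximate the exynos
driver from memory and are not exact code:

/*
 * Skip TLB maintenance while the IOMMU is runtime suspended: its
 * context is lost and its power domain may be off, so there is
 * nothing to flush and no registers that can safely be touched.
 */
static void sysmmu_tlb_invalidate_entry(struct sysmmu_drvdata *data,
					sysmmu_iova_t iova, size_t size)
{
	unsigned long flags;

	spin_lock_irqsave(&data->lock, flags);
	if (data->active)	/* tracked by runtime suspend/resume */
		__sysmmu_tlb_invalidate_entry(data, iova, size);
	spin_unlock_irqrestore(&data->lock, flags);
}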

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Rob Clark
2017-07-13 12:10:22 UTC
Permalink
On Thu, Jul 13, 2017 at 8:02 AM, Marek Szyprowski
Post by Marek Szyprowski
[...]
Afair unmap might be called from atomic context as well, for example as
a result of dma_unmap_page(). In the exynos IOMMU driver I simply check
the runtime PM state of the IOMMU device. The TLB flush is performed
only when the IOMMU is in the active state. If it is suspended, I assume
that the IOMMU controller's context is already lost and its respective
power domain might already be turned off, so there is no point in
touching the IOMMU registers.
that seems like an interesting approach.. although I wonder if there
can be some race w/ new device memory access once clks are enabled
before tlb flush completes? That would be rather bad, since this
approach is letting the backing pages of memory be freed before tlb
flush.

BR,
-R
Marek Szyprowski
2017-07-13 12:23:54 UTC
Permalink
Hi Rob,
Post by Rob Clark
[...]
that seems like an interesting approach.. although I wonder if there
can be some race w/ new device memory access once clks are enabled
before tlb flush completes? That would be rather bad, since this
approach is letting the backing pages of memory be freed before tlb
flush.
Exynos IOMMU has a spinlock to ensure that there is no race between PM
runtime suspend and unmap/tlb flush.

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Sricharan R
2017-07-13 13:53:02 UTC
Permalink
Hi,
[...]
Yes, there are a bunch of scenarios where unmap can happen with
disabled master (but not in atomic context). On the gpu side we
opportunistically keep a buffer mapping until the buffer is freed
(which can happen after gpu is disabled). Likewise, v4l2 won't unmap
an exported dmabuf while some other driver holds a reference to it
(which can be dropped when the v4l2 device is suspended).
Since unmap triggers tbl flush which touches iommu regs, the iommu
driver *definitely* needs a pm_runtime_get_sync().
Ok, with that being the case, there are two things here,

1) If the device links are still intact at the places where unmap is
called, then pm_runtime from the master would set up all the clocks.
That would avoid reintroducing the locking indirectly here.

2) If not, then doing it here is the only way. But in both cases, since
unmap can be called from atomic context, the resume handler here should
avoid doing clk_prepare_enable(); instead, move the clk_prepare() to
init, along the lines sketched below.
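Something along these lines, as a sketch rather than the posted patch:

/*
 * Prepare clocks once at probe time (sleepable context), so the
 * runtime PM callbacks only need the atomic-safe clk_enable()/
 * clk_disable(). Assumes smmu->clocks[] was filled via devm_clk_get().
 */
static int arm_smmu_prepare_clocks(struct arm_smmu_device *smmu)
{
	int i, err;

	for (i = 0; i < smmu->num_clks; i++) {
		err = clk_prepare(smmu->clocks[i]);
		if (err) {
			while (i--)
				clk_unprepare(smmu->clocks[i]);
			return err;
		}
	}

	return 0;
}

static int arm_smmu_runtime_resume(struct device *dev)
{
	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
	int i, err;

	for (i = 0; i < smmu->num_clks; i++) {
		err = clk_enable(smmu->clocks[i]);	/* atomic-safe */
		if (err) {
			while (i--)
				clk_disable(smmu->clocks[i]);
			return err;
		}
	}

	return 0;
}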

Regards,
Sricharan
--
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Rob Clark
2017-07-13 14:55:10 UTC
Permalink
Post by Sricharan R
[...]
Ok, with that being the case, there are two things here,
1) If the device links are still intact at the places where unmap is
called, then pm_runtime from the master would set up all the clocks.
That would avoid reintroducing the locking indirectly here.
2) If not, then doing it here is the only way. But in both cases, since
unmap can be called from atomic context, the resume handler here should
avoid doing clk_prepare_enable(); instead, move the clk_prepare() to init.
I do kinda like the approach Marek suggested.. of deferring the tlb
flush until resume. I'm wondering if we could combine that with
putting the mmu in a stalled state when we suspend (and not resume the
mmu until after the pending tlb flush)?

BR,
-R
Will Deacon
2017-07-14 17:07:43 UTC
Permalink
Post by Rob Clark
[...]
I do kinda like the approach Marek suggested.. of deferring the tlb
flush until resume. I'm wondering if we could combine that with
putting the mmu in a stalled state when we suspend (and not resume the
mmu until after the pending tlb flush)?
I'm not sure that a stalled state is what we're after here, because we need
to take care to prevent any table walks if we've freed the underlying pages.
What we could try to do is disable the SMMU (put into global bypass) and
invalidate the TLB when performing a suspend operation, then we just ignore
invalidation whilst the clocks are stopped and, on resume, enable the SMMU
again.
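As a rough sketch of that, reusing the register accessors already in
the driver and ignoring, for now, the racing problem noted below:

/*
 * Suspend in a "clean" state: invalidate the TLBs and put the SMMU
 * into global bypass (sCR0.CLIENTPD) before cutting the clocks, so
 * that no invalidation is needed while the clocks are stopped. On
 * resume, arm_smmu_device_reset() (or equivalent) would have to run
 * again to re-enable translation.
 */
static int arm_smmu_runtime_suspend(struct device *dev)
{
	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
	void __iomem *gr0_base = ARM_SMMU_GR0(smmu);

	writel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLH);
	writel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLNSNH);
	writel(sCR0_CLIENTPD, ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);

	arm_smmu_disable_clocks(smmu);

	return 0;
}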

That said, I don't think we can tolerate suspend/resume racing with
map/unmap, and it's not clear to me how we avoid that without penalising
the fastpath.

Will
Rob Clark
2017-07-14 17:42:13 UTC
Permalink
Post by Will Deacon
[...]
I'm not sure that a stalled state is what we're after here, because we need
to take care to prevent any table walks if we've freed the underlying pages.
What we could try to do is disable the SMMU (put into global bypass) and
invalidate the TLB when performing a suspend operation, then we just ignore
invalidation whilst the clocks are stopped and, on resume, enable the SMMU
again.
wouldn't a stalled state just block any memory transactions by
device(s) using the context bank? Putting it in bypass isn't really a
good thing if there is any chance the device can sneak in a memory
access before we've taken it back out of bypass (ie. makes the gpu a
giant userspace controlled root hole).

BR,
-R
Will Deacon
2017-07-14 18:06:16 UTC
Permalink
Post by Rob Clark
[...]
wouldn't a stalled state just block any memory transactions by
device(s) using the context bank? Putting it in bypass isn't really a
good thing if there is any chance the device can sneak in a memory
access before we've taken it back out of bypass (ie. makes the gpu a
giant userspace controlled root hole).
If it doesn't deadlock, then yes, it will stall transactions. However, that
doesn't mean it necessarily prevents page table walks. Instead of bypass, we
could configure all the streams to terminate, but this race still worries me
somewhat. I thought that the SMMU would only be suspended if all of its
masters were suspended, so if the GPU wants to come out of suspend then the
SMMU should be resumed first.

It would be helpful if somebody could figure out exactly what can race with
the suspend/resume calls here.

Will
Rob Clark
2017-07-14 18:25:45 UTC
Permalink
Post by Will Deacon
[...]
If it doesn't deadlock, then yes, it will stall transactions. However, that
doesn't mean it necessarily prevents page table walks.
btw, I guess the concern about pagetable walk is that the unmap could
have removed some sub-level of the pt that the tlb walk would hit?
Would deferring freeing those pages help?
Post by Will Deacon
Instead of bypass, we
could configure all the streams to terminate, but this race still worries me
somewhat. I thought that the SMMU would only be suspended if all of its
masters were suspended, so if the GPU wants to come out of suspend then the
SMMU should be resumed first.
I believe this should be true.. on the gpu side, I'm mostly trying to
avoid having to power the gpu back on to free buffers. (On the v4l2
side, somewhere in the core videobuf code would also need to wrap its
dma_unmap_sg() with pm_runtime_get/put()..)

BR,
-R
Will Deacon
2017-07-14 19:01:14 UTC
Permalink
Post by Rob Clark
[...]
If it doesn't deadlock, then yes, it will stall transactions. However, that
doesn't mean it necessarily prevents page table walks.
btw, I guess the concern about pagetable walk is that the unmap could
have removed some sub-level of the pt that the tlb walk would hit?
Would deferring freeing those pages help?
Could do, but it sounds like a lot of complication that I think we can fix
by making the suspend operation put the SMMU into a "clean" state.
Post by Rob Clark
Post by Will Deacon
Instead of bypass, we
could configure all the streams to terminate, but this race still worries me
somewhat. I thought that the SMMU would only be suspended if all of its
masters were suspended, so if the GPU wants to come out of suspend then the
SMMU should be resumed first.
I believe this should be true.. on the gpu side, I'm mostly trying to
avoid having to power the gpu back on to free buffers. (On the v4l2
side, somewhere in the core videobuf code would also need to wrap its
dma_unmap_sg() with pm_runtime_get/put()..)
Right, and we shouldn't have to resume it if we suspend it in a clean state,
with the TLBs invalidated.

Will
Rob Clark
2017-07-14 19:34:42 UTC
Permalink
Post by Will Deacon
[...]
If it doesn't deadlock, then yes, it will stall transactions. However, that
doesn't mean it necessarily prevents page table walks.
btw, I guess the concern about pagetable walk is that the unmap could
have removed some sub-level of the pt that the tlb walk would hit?
Would deferring freeing those pages help?
Could do, but it sounds like a lot of complication that I think we can fix
by making the suspend operation put the SMMU into a "clean" state.
[...]
Right, and we shouldn't have to resume it if we suspend it in a clean state,
with the TLBs invalidated.
I guess if the device_link() stuff ensured the attached device
(gpu/etc) was suspended before suspending the iommu, then I guess I
can't see how temporarily putting the iommu in bypass would be a
problem. I haven't looked at the device_link() stuff too closely, but
iommu being resumed first and suspended last seems like the only thing
that would make sense. I'm mostly just nervous about iommu in bypass
vs gpu since userspace has so much control over what address gpu
writes to / reads from, so getting it wrong w/ the iommu would be a
rather bad thing ;-)

BR,
-R
Will Deacon
2017-07-14 19:36:38 UTC
Permalink
Post by Rob Clark
[...]
I guess if the device_link() stuff ensured the attached device
(gpu/etc) was suspended before suspending the iommu, then I guess I
can't see how temporarily putting the iommu in bypass would be a
problem. I haven't looked at the device_link() stuff too closely, but
iommu being resumed first and suspended last seems like the only thing
that would make sense. I'm mostly just nervous about iommu in bypass
vs gpu since userspace has so much control over what address gpu
writes to / reads from, so getting it wrong w/ the iommu would be a
rather bad thing ;-)
Right, but we can also configure it to terminate if you don't want bypass.

Will
Rob Clark
2017-07-14 19:39:20 UTC
Permalink
Post by Will Deacon
[snip]
Right, but we can also configure it to terminate if you don't want bypass.
ok, terminate wfm

BR,
-R
Sricharan R
2017-07-17 11:46:56 UTC
Permalink
Hi,
Post by Will Deacon
[snip]
Right, but we can also configure it to terminate if you don't want bypass.
But one thing here is, with device links in the picture, iommu suspend/resume
is called along with the master's. That means we can end up cleaning even
active entries on the suspend path, if suspend is going to put the smmu
into a clean state every time? So if the masters follow the pm_runtime
sequence before a dma_map/unmap operation, that seems better.

Regards,
Sricharan
--
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Sricharan R
2017-07-17 12:28:31 UTC
Permalink
Hi,
Post by Sricharan R
[snip]
But one thing here is, with device links in the picture, iommu suspend/resume
is called along with the master's. That means we can end up cleaning even
active entries on the suspend path, if suspend is going to put the smmu
into a clean state every time? So if the masters follow the pm_runtime
sequence before a dma_map/unmap operation, that seems better.
Also, for the use case of unmap being done from masters like the GPU while
it is already suspended, following Marek's approach of checking the smmu
state in unmap and deferring the TLB flush till resume seems the correct
way. All of the above holds if we want to use device_link.

Regards,
Sricharan
--
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Vivek Gautam
2017-07-24 15:31:33 UTC
Permalink
Hi,
Post by Sricharan R
[snip]
But one thing here is, with device links in the picture, iommu suspend/resume
is called along with the master's. That means we can end up cleaning even
active entries on the suspend path, if suspend is going to put the smmu
into a clean state every time? So if the masters follow the pm_runtime
sequence before a dma_map/unmap operation, that seems better.
Also, for the use case of unmap being done from masters like the GPU while
it is already suspended, following Marek's approach of checking the smmu
state in unmap and deferring the TLB flush till resume seems the correct
way. All of the above holds if we want to use device_link.
This sounds like a plan.
I have a WIP version in which we just check in 'unmap' whether the mmu is
already suspended. If it is, we save the unmap request (domain, iova)
and defer it.
On resume, we check the pending tlb flush requests and call unmap.

I will give the venus use-case a try.
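
A minimal sketch of such a deferral scheme, with hypothetical names
(an illustration only, not the actual WIP patch): unmap queues the
request while the SMMU is suspended, and the runtime-resume handler
replays the pending invalidations once the clocks are on.

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical sketch: defer TLB maintenance until runtime resume. */
struct deferred_flush {
	struct list_head node;
	struct arm_smmu_domain *smmu_domain;
	unsigned long iova;
	size_t size;
};

static LIST_HEAD(pending_flushes);
static DEFINE_SPINLOCK(pending_lock);

/* Called from arm_smmu_unmap() when the SMMU is runtime suspended. */
static void queue_deferred_flush(struct arm_smmu_domain *smmu_domain,
				 unsigned long iova, size_t size)
{
	struct deferred_flush *df = kzalloc(sizeof(*df), GFP_ATOMIC);
	unsigned long flags;

	if (!df)
		return; /* a real patch needs a fallback, e.g. a full-flush flag */

	df->smmu_domain = smmu_domain;
	df->iova = iova;
	df->size = size;

	spin_lock_irqsave(&pending_lock, flags);
	list_add_tail(&df->node, &pending_flushes);
	spin_unlock_irqrestore(&pending_lock, flags);
}

/* Called from the SMMU's runtime-resume handler, with clocks enabled. */
static void replay_deferred_flushes(void)
{
	struct deferred_flush *df, *tmp;
	unsigned long flags;
	LIST_HEAD(local);

	spin_lock_irqsave(&pending_lock, flags);
	list_splice_init(&pending_flushes, &local);
	spin_unlock_irqrestore(&pending_lock, flags);

	list_for_each_entry_safe(df, tmp, &local, node) {
		/* Full context invalidation; a ranged flush would also do. */
		arm_smmu_tlb_inv_context(df->smmu_domain);
		list_del(&df->node);
		kfree(df);
	}
}

Splicing the pending list onto a local list keeps the spinlock hold
time short while the (register-touching) invalidations run.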

Regards
Vivek
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-13 13:57:20 UTC
Permalink
Post by Sricharan R
Hi Vivek,
[snip]
Apart from the locking, I wonder why an explicit pm_runtime is needed
from unmap. Somehow it looks like some path in the master using that
should have enabled the pm?
Right, the master should have done a runtime_get(), and with
device links the iommu will also resume.

The master will call unmap while it is attached to the iommu,
and therefore the iommu should be in the resumed state.
We shouldn't have an unmap without the master attached anyway.
I will investigate further whether we need the pm_runtime() calls
around unmap or not.

Best regards
Vivek
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-13 14:01:41 UTC
Permalink
On Thu, Jul 13, 2017 at 7:27 PM, Vivek Gautam
Post by Vivek Gautam
[snip]
My apologies. My email client didn't update the thread. So please ignore
this comment.
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Stephen Boyd
2017-07-13 06:48:43 UTC
Permalink
Post by Vivek Gautam
Hi Stephen,
Post by Stephen Boyd
[snip]
Can these map/unmap ops be called from an atomic context? I seem
to recall that being a problem before.
523d7423e21b iommu/arm-smmu: Remove io-pgtable spinlock
Looks like we don't need locks here anymore?
While removing the spinlock around the map/unmap path may be one
thing, I'm not sure that's all of them. Is there a path from an
atomic DMA allocation (GFP_ATOMIC sort of thing) mapped into an
IOMMU for a device that can eventually get down to here and
attempt to turn a clk on?
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Robin Murphy
2017-07-13 09:50:45 UTC
Permalink
Post by Stephen Boyd
[snip]
While removing the spinlock around the map/unmap path may be one
thing, I'm not sure that's all of them. Is there a path from an
atomic DMA allocation (GFP_ATOMIC sort of thing) mapped into an
IOMMU for a device that can eventually get down to here and
attempt to turn a clk on?
Yes, in the DMA path map/unmap will frequently be called from IRQ
handlers (think e.g. network packets). The whole point of removing the
lock was to allow multiple maps/unmaps to execute in parallel (since we
know they will be safely operating on different areas of the pagetable).
AFAICS this change is going to largely reintroduce that bottleneck via
dev->power.lock, which is anything but what we want :(

Robin.
Rob Clark
2017-07-13 11:53:42 UTC
Permalink
Post by Robin Murphy
[snip]
Yes, in the DMA path map/unmap will frequently be called from IRQ
handlers (think e.g. network packets). The whole point of removing the
lock was to allow multiple maps/unmaps to execute in parallel (since we
know they will be safely operating on different areas of the pagetable).
AFAICS this change is going to largely reintroduce that bottleneck via
dev->power.lock, which is anything but what we want :(
Maybe __pm_runtime_resume() needs some sort of fast path for the
already-enabled case? Otherwise we need some sort of flag to tell the
iommu that it cannot rely on the unmapping device being resumed?
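
A rough sketch of that fast-path idea, wrapped in a hypothetical
helper (smmu_rpm_get() is not an existing API; pm_runtime_get_if_in_use()
is, though it still takes dev->power.lock briefly, so this avoids the
full resume machinery rather than the lock itself):

#include <linux/device.h>
#include <linux/pm_runtime.h>

/*
 * Hypothetical wrapper: take a cheap reference when the SMMU is
 * already RPM-active, fall back to a full resume otherwise.
 */
static int smmu_rpm_get(struct device *smmu_dev)
{
	int ret;

	/* Fast path: takes a reference only if the device is active. */
	ret = pm_runtime_get_if_in_use(smmu_dev);
	if (ret > 0)
		return 0;

	/* Slow path: may sleep, so this is unsafe from atomic context. */
	ret = pm_runtime_get_sync(smmu_dev);
	if (ret < 0) {
		pm_runtime_put_noidle(smmu_dev);
		return ret;
	}

	return 0;
}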

BR,
-R
Vivek Gautam
2017-07-06 09:37:03 UTC
Permalink
From: Sricharan R <***@codeaurora.org>

Finally, add the device link between the master device and the
smmu, so that the smmu gets runtime enabled/disabled only when the
master needs it. This is done from the add_device callback, which is
called once when the master is added to the smmu.

Signed-off-by: Sricharan R <***@codeaurora.org>
---
drivers/iommu/arm-smmu.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ddbfa8ab69e6..75567d9698ab 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1348,6 +1348,7 @@ static int arm_smmu_add_device(struct device *dev)
struct arm_smmu_device *smmu;
struct arm_smmu_master_cfg *cfg;
struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+ struct device_link *link = NULL;
int i, ret;

if (using_legacy_binding) {
@@ -1403,6 +1404,16 @@ static int arm_smmu_add_device(struct device *dev)

pm_runtime_put_sync(smmu->dev);

+ /*
+ * Establish the link between smmu and master, so that the
+ * smmu gets runtime enabled/disabled as per the master's
+ * needs.
+ */
+ link = device_link_add(dev, smmu->dev, DL_FLAG_PM_RUNTIME);
+ if (!link)
+ dev_warn(smmu->dev, "Unable to create device link between %s and %s\n",
+ dev_name(smmu->dev), dev_name(dev));
+
return 0;

out_cfg_free:
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
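
For illustration, a consumer-side sketch of what the DL_FLAG_PM_RUNTIME
link buys (master_do_dma() is a hypothetical master-driver function,
not part of this series):

#include <linux/device.h>
#include <linux/pm_runtime.h>

/*
 * Hypothetical master-driver path. Because arm_smmu_add_device() did
 * device_link_add(dev, smmu->dev, DL_FLAG_PM_RUNTIME), the PM core
 * resumes the supplier (the SMMU) before resuming the consumer (this
 * master), and lets the SMMU suspend again once the master is idle.
 */
static int master_do_dma(struct device *dev)
{
	int ret;

	ret = pm_runtime_get_sync(dev);	/* resumes the SMMU first, then dev */
	if (ret < 0) {
		pm_runtime_put_noidle(dev);
		return ret;
	}

	/* ... map buffers and start DMA; the SMMU is powered here ... */

	pm_runtime_put(dev);	/* the SMMU may suspend once dev goes idle */
	return 0;
}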
Stephen Boyd
2017-07-12 22:55:47 UTC
Permalink
Post by Vivek Gautam
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ddbfa8ab69e6..75567d9698ab 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1348,6 +1348,7 @@ static int arm_smmu_add_device(struct device *dev)
struct arm_smmu_device *smmu;
struct arm_smmu_master_cfg *cfg;
struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+ struct device_link *link = NULL;
Unnecessary initialization?
[snip]
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-13 03:59:30 UTC
Permalink
Post by Stephen Boyd
Post by Vivek Gautam
[snip]
+ struct device_link *link = NULL;
Unnecessary initialization?
Right, will drop this.
Thanks.
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-06 09:37:04 UTC
Permalink
From: Sricharan R <***@codeaurora.org>

The MMU-40x/500 is an implementation of the SMMUv2
architecture specification. It is split into two blocks,
TBU and TCU. The TBU caches page table entries, is
instantiated locally for each master, and is clocked by
TBUn_clk. The TCU manages the address translation with
PTW, hosts the programming interface, and is clocked
using TCU_CLK. The TBU can also share the same clock
domain as the TCU, in which case both are clocked using
TCU_CLK.

This defines the clock bindings for these clocks and adds
the clock names to the compatible data.

Signed-off-by: Sricharan R <***@codeaurora.org>
[vivek: clock rework and cleanup]
Signed-off-by: Vivek Gautam <***@codeaurora.org>
---
.../devicetree/bindings/iommu/arm,smmu.txt | 24 ++++++++++++++++++++++
drivers/iommu/arm-smmu.c | 12 ++++++++++-
2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.txt b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
index 8a6ffce12af5..00331752d355 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.txt
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
@@ -71,6 +71,26 @@ conditions.
or using stream matching with #iommu-cells = <2>, and
may be ignored if present in such cases.

+- clock-names: Should be "tcu" and "iface" for "arm,mmu-400",
+ "arm,mmu-401" and "arm,mmu-500"
+
+ "tcu" clock is required for smmu's register access using the
+ programming interface and ptw for downstream bus access. This
+ clock is also used for access to the TBU connected to the
+ master locally. Sometimes however, TBU is clocked along with
+ the master.
+
+ "iface" clock is required to access the TCU's programming
+ interface, apart from the "tcu" clock.
+
+- clocks: Phandles for respective clocks described by clock-names.
+
+- power-domains: Phandles to SMMU's power domain specifier. This is
+ required even if SMMU belongs to the master's power
+ domain, as the SMMU will have to be enabled and
+ accessed before master gets enabled and linked to its
+ SMMU.
+
** Deprecated properties:

- mmu-masters (deprecated in favour of the generic "iommus" binding) :
@@ -95,6 +115,10 @@ conditions.
<0 36 4>,
<0 37 4>;
#iommu-cells = <1>;
+ clocks = <&gcc GCC_SMMU_CFG_CLK>,
+ <&gcc GCC_APSS_TCU_CLK>;
+
+ clock-names = "iface", "tcu";
};

/* device with two stream IDs, 0 and 7 */
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 75567d9698ab..7bb09280fa11 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1947,9 +1947,19 @@ struct arm_smmu_match_data {
ARM_SMMU_MATCH_DATA(smmu_generic_v1, ARM_SMMU_V1, GENERIC_SMMU);
ARM_SMMU_MATCH_DATA(smmu_generic_v2, ARM_SMMU_V2, GENERIC_SMMU);
ARM_SMMU_MATCH_DATA(arm_mmu401, ARM_SMMU_V1_64K, GENERIC_SMMU);
-ARM_SMMU_MATCH_DATA(arm_mmu500, ARM_SMMU_V2, ARM_MMU500);
ARM_SMMU_MATCH_DATA(cavium_smmuv2, ARM_SMMU_V2, CAVIUM_SMMUV2);

+static const char * const arm_mmu500_clks[] = {
+ "tcu", "iface",
+};
+
+static const struct arm_smmu_match_data arm_mmu500 = {
+ .version = ARM_SMMU_V2,
+ .model = ARM_MMU500,
+ .clks = arm_mmu500_clks,
+ .num_clks = ARRAY_SIZE(arm_mmu500_clks),
+};
+
static const struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v1", .data = &smmu_generic_v1 },
{ .compatible = "arm,smmu-v2", .data = &smmu_generic_v2 },
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
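
As an illustration of how a driver can consume such per-compatible
clock name data (the helper names here are hypothetical; this series
implements its own equivalent get/enable/disable loops):

#include <linux/clk.h>
#include <linux/device.h>
#include <linux/err.h>

/* Hypothetical helpers looping over a compatible's clock name list. */
static int smmu_init_clocks(struct device *dev, struct clk **clks,
			    const char * const *names, int num)
{
	int i;

	for (i = 0; i < num; i++) {
		clks[i] = devm_clk_get(dev, names[i]);
		if (IS_ERR(clks[i]))
			return PTR_ERR(clks[i]);
	}

	return 0;
}

static int smmu_enable_clocks(struct clk **clks, int num)
{
	int i, ret;

	for (i = 0; i < num; i++) {
		ret = clk_prepare_enable(clks[i]);
		if (ret) {
			while (--i >= 0)
				clk_disable_unprepare(clks[i]);
			return ret;
		}
	}

	return 0;
}

static void smmu_disable_clocks(struct clk **clks, int num)
{
	int i;

	/* Reverse order, in case the clocks have ordering dependencies. */
	for (i = num - 1; i >= 0; i--)
		clk_disable_unprepare(clks[i]);
}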
Rob Herring
2017-07-10 03:37:55 UTC
Permalink
Post by Vivek Gautam
The MMU-40x/500 is an implementation of the SMMUv2
architecture specification. It is split into two blocks,
TBU and TCU. The TBU caches page table entries, is
instantiated locally for each master, and is clocked by
TBUn_clk. The TCU manages the address translation with
PTW, hosts the programming interface, and is clocked
using TCU_CLK. The TBU can also share the same clock
domain as the TCU, in which case both are clocked using
TCU_CLK.
No TBU clock below. When is it shared or not? If that's an integration
option then the binding should always have a TBU clock with the same
parent as the TCU_CLK.
[snip]
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-11 05:18:17 UTC
Permalink
Hi Rob,
Post by Rob Herring
[snip]
No TBU clock below. When is it shared or not? If that's an integration
option then the binding should always have a TBU clock with the same
parent as the TCU_CLK.
Right. This is something that the ARM spec also says.
The TBU clock can either be in the same clock and power domain as
the TCU clock, or in a separate one.

As you said, we should have the TBU clock as well; based on the
integration, the TBU clock can either have the same parent as the
TCU clock or a different one.

I will change these bindings to include the TBU clock as well.


Best Regards
Vivek
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Vivek Gautam
2017-07-06 09:37:05 UTC
Permalink
qcom,msm8996-smmu-v2 is an arm,smmu-v2 implementation with
specific clock and power requirements. This smmu core is used
with multiple masters on msm8996, viz. mdss, video, etc.
Add bindings for the same.

Signed-off-by: Vivek Gautam <***@codeaurora.org>
---
Documentation/devicetree/bindings/iommu/arm,smmu.txt | 18 ++++++++++++++++++
drivers/iommu/arm-smmu.c | 13 +++++++++++++
2 files changed, 31 insertions(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.txt b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
index 00331752d355..5d8e79775fae 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.txt
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
@@ -17,6 +17,7 @@ conditions.
"arm,mmu-401"
"arm,mmu-500"
"cavium,smmu-v2"
+ "qcom,msm8996-smmu-v2"

depending on the particular implementation and/or the
version of the architecture implemented.
@@ -74,11 +75,16 @@ conditions.
- clock-names: Should be "tcu" and "iface" for "arm,mmu-400",
"arm,mmu-401" and "arm,mmu-500"

+ Should be "bus", and "iface" for "qcom,msm8996-smmu-v2"
+ implementation.
+
"tcu" clock is required for smmu's register access using the
programming interface and ptw for downstream bus access. This
clock is also used for access to the TBU connected to the
master locally. Sometimes however, TBU is clocked along with
the master.
+ "bus" clock for "qcom,msm8996-smmu-v2" is requierd for downstream
+ bus access and for the smmu ptw.

"iface" clock is required to access the TCU's programming
interface, apart from the "tcu" clock.
@@ -161,3 +167,15 @@ conditions.
iommu-map = <0 &smmu3 0 0x400>;
...
};
+
+ /* Qcom's arm,smmu-v2 implementation for msm8996 */
+ smmu4: iommu {
+ compatible = "qcom,msm8996-smmu-v2";
+ ...
+ #iommu-cells = <1>;
+ power-domains = <&mmcc MDSS_GDSC>;
+
+ clocks = <&mmcc SMMU_MDP_AXI_CLK>,
+ <&mmcc SMMU_MDP_AHB_CLK>;
+ clock-names = "bus", "iface";
+ };
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7bb09280fa11..fe8e7fd61282 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -110,6 +110,7 @@ enum arm_smmu_implementation {
GENERIC_SMMU,
ARM_MMU500,
CAVIUM_SMMUV2,
+ QCOM_MSM8996_SMMUV2,
};

/* Until ACPICA headers cover IORT rev. C */
@@ -1960,6 +1961,17 @@ struct arm_smmu_match_data {
.num_clks = ARRAY_SIZE(arm_mmu500_clks),
};

+static const char * const qcom_msm8996_smmuv2_clks[] = {
+ "bus", "iface",
+};
+
+static const struct arm_smmu_match_data qcom_msm8996_smmuv2 = {
+ .version = ARM_SMMU_V2,
+ .model = QCOM_MSM8996_SMMUV2,
+ .clks = qcom_msm8996_smmuv2_clks,
+ .num_clks = ARRAY_SIZE(qcom_msm8996_smmuv2_clks),
+};
+
static const struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v1", .data = &smmu_generic_v1 },
{ .compatible = "arm,smmu-v2", .data = &smmu_generic_v2 },
@@ -1967,6 +1979,7 @@ struct arm_smmu_match_data {
{ .compatible = "arm,mmu-401", .data = &arm_mmu401 },
{ .compatible = "arm,mmu-500", .data = &arm_mmu500 },
{ .compatible = "cavium,smmu-v2", .data = &cavium_smmuv2 },
+ { .compatible = "qcom,msm8996-smmu-v2", .data = &qcom_msm8996_smmuv2 },
{ },
};
MODULE_DEVICE_TABLE(of, arm_smmu_of_match);
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
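
For context, a sketch of how the probe path might pick up this match
data (the .clks/.num_clks fields follow the match data added in this
series; the smmu_init_clocks() helper and the clks array on struct
arm_smmu_device are assumptions, and the series' actual wiring may
differ):

#include <linux/of.h>
#include <linux/of_device.h>
#include <linux/platform_device.h>

/* Hypothetical probe-side consumption of the per-compatible clock list. */
static int arm_smmu_probe_clocks(struct platform_device *pdev,
				 struct arm_smmu_device *smmu)
{
	const struct arm_smmu_match_data *data;

	data = of_device_get_match_data(&pdev->dev);
	if (!data)
		return -EINVAL;

	smmu->num_clks = data->num_clks;
	return smmu_init_clocks(&pdev->dev, smmu->clks, data->clks,
				data->num_clks);
}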
Rob Herring
2017-07-10 03:40:57 UTC
Permalink
Post by Vivek Gautam
[snip]
@@ -74,11 +75,16 @@ conditions.
- clock-names: Should be "tcu" and "iface" for "arm,mmu-400",
"arm,mmu-401" and "arm,mmu-500"
+ Should be "bus", and "iface" for "qcom,msm8996-smmu-v2"
+ implementation.
+
"tcu" clock is required for smmu's register access using the
programming interface and ptw for downstream bus access. This
clock is also used for access to the TBU connected to the
master locally. Sometimes however, TBU is clocked along with
the master.
+ "bus" clock for "qcom,msm8996-smmu-v2" is requierd for downstream
s/requierd/required/
Post by Vivek Gautam
+ bus access and for the smmu ptw.
"iface" clock is required to access the TCU's programming
interface, apart from the "tcu" clock.
@@ -161,3 +167,15 @@ conditions.
iommu-map = <0 &smmu3 0 0x400>;
...
};
+
+ /* Qcom's arm,smmu-v2 implementation for msm8996 */
+ smmu4: iommu {
+ compatible = "qcom,msm8996-smmu-v2";
No registers?
Post by Vivek Gautam
+ ...
+ #iommu-cells = <1>;
+ power-domains = <&mmcc MDSS_GDSC>;
+
+ clocks = <&mmcc SMMU_MDP_AXI_CLK>,
+ <&mmcc SMMU_MDP_AHB_CLK>;
+ clock-names = "bus", "iface";
+ };
Vivek Gautam
2017-07-10 06:42:29 UTC
Permalink
Hi Rob,
Post by Rob Herring
[snip]
+ "bus" clock for "qcom,msm8996-smmu-v2" is requierd for downstream
s/requierd/required/
sure, will correct it.
Post by Rob Herring
Post by Vivek Gautam
[snip]
+ /* Qcom's arm,smmu-v2 implementation for msm8996 */
+ smmu4: iommu {
+ compatible = "qcom,msm8996-smmu-v2";
No registers?
It does have registers. Will add the complete binding example.

Thank you for the review.

Best Regards
Vivek

[snip]
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project