Pinning a user-space buffer for DMA from the Linux kernel

I'm writing a driver for devices that produce around 1 GB of data per second. Because of that, I decided to map the user buffer allocated by the application directly for DMA instead of copying through an intermediate kernel buffer.

The code works, more or less, but during long-running stress tests I see kernel oopses with "bad page state" triggered by unrelated applications (for instance updatedb), probably when the kernel wants to swap some pages:

[21743.515404] BUG: Bad page state in process PmStabilityTest  pfn:357518
[21743.521992] page:ffffdf844d5d4600 count:19792158 mapcount:0 mapping:          (null) index:0x12b011e012d0132
[21743.531829] flags: 0x119012c01220124(referenced|lru|slab|reclaim|uncached|idle)
[21743.539138] raw: 0119012c01220124 0000000000000000 012b011e012d0132 012e011e011e0111
[21743.546899] raw: 0000000000000000 012101300131011c 0000000000000000 012101240123012b
[21743.554638] page dumped because: page still charged to cgroup
[21743.560383] page->mem_cgroup:012101240123012b
[21743.564745] bad because of flags: 0x120(lru|slab)
[21743.569555] BUG: Bad page state in process PmStabilityTest  pfn:357519
[21743.576098] page:ffffdf844d5d4640 count:18219302 mapcount:18940179 mapping:          (null) index:0x0
[21743.585318] flags: 0x0()
[21743.587859] raw: 0000000000000000 0000000000000000 0000000000000000 0116012601210112
[21743.595599] raw: 0000000000000000 011301310127012f 0000000000000000 012f011d010d011a
[21743.603336] page dumped because: page still charged to cgroup
[21743.609108] page->mem_cgroup:012f011d010d011a
...
Entering kdb (current=0xffff8948189b2d00, pid 6387) on processor 6 Oops: (null)
due to oops @ 0xffffffff9c87f469
CPU: 6 PID: 6387 Comm: updatedb.mlocat Tainted: G    B      OE   4.10.0-42-generic #46~16.04.1-Ubuntu
...

Details:

The user buffer consists of frames, and neither the buffer nor the frames are page-aligned. The frames in the buffer are used in a circular manner for "infinite" live data transfers. For each frame I get the memory pages via get_user_pages_fast, then convert them to a scatter-gather table with sg_alloc_table_from_pages, and finally map that for DMA using dma_map_sg.

I rely on sg_alloc_table_from_pages to merge consecutive pages into one DMA descriptor to reduce the size of the S/G table sent to the device. The devices are custom-built and based on an FPGA. I took inspiration from many drivers doing similar mapping, especially the i915 and radeon video drivers, but none of them has all the pieces in one place, so I might have overlooked something.
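
For reference, the coalescing can be verified by walking the resulting table with the standard scatterlist iterator (a debugging sketch only, using the question's sgt):

```
/* Sketch: dump the segments sg_alloc_table_from_pages produced,
 * to check that physically contiguous pages were actually merged
 * into fewer, longer segments. */
struct scatterlist *sg;
int i;

for_each_sg(sgt->sgl, sg, sgt->nents, i)
        pr_debug("seg %d: offset=%u len=%u\n", i, sg->offset, sg->length);
```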

Related functions (pin_user_buffer and unpin_user_buffer are called from separate IOCTLs):

static int pin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
        const unsigned long bytes = cam->acq_frame_bytes;
        const unsigned long first =
                ( frame->uaddr              &  PAGE_MASK) >> PAGE_SHIFT;
        const unsigned long last =
                ((frame->uaddr + bytes - 1) &  PAGE_MASK) >> PAGE_SHIFT;
        const unsigned long offset =
                  frame->uaddr              & ~PAGE_MASK;
        int nr_pages = last - first + 1;
        int err;
        int n;
        struct page **pages;
        struct sg_table *sgt;

        if (frame->uaddr + bytes < frame->uaddr) {
                pr_err("%s: attempted user buffer overflow!\n", __func__);
                return -EINVAL;
        }

        if (bytes == 0) {
                pr_err("%s: user buffer has zero bytes\n", __func__);
                return -EINVAL;
        }

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages) {
                pr_err("%s: can't allocate udma_frame.pages\n", __func__);
                return -ENOMEM;
        }

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt) {
                pr_err("%s: can't allocate udma_frame.sgt\n", __func__);
                err = -ENOMEM;
                goto err_alloc_sgt;
        }

        /* (rw == READ) means read from device, write into memory area */
        err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);
        if (err < nr_pages) {
                if (err > 0) {
                        pr_err("%s: can't pin all %d user pages, got %d\n",
                               __func__, nr_pages, err);
                        nr_pages = err; /* release only the pinned pages */
                        err = -EFAULT;
                } else {
                        pr_err("%s: can't pin user pages\n", __func__);
                        nr_pages = 0;
                        if (err == 0)
                                err = -EFAULT;
                }
                goto err_get_pages;
        }

        for (n = 0; n < nr_pages; ++n)
                flush_dcache_page(pages[n]); //<--- Is this needed?

        err = sg_alloc_table_from_pages(sgt, pages, nr_pages, offset, bytes,
                                        GFP_KERNEL);
        if (err) {
                pr_err("%s: can't build sg_table for %d pages\n",
                       __func__, nr_pages);
                goto err_alloc_sgt2;
        }

        if (!dma_map_sg(&cam->pci_dev->dev, sgt->sgl, sgt->nents, DMA_FROM_DEVICE)) {
                pr_err("%s: can't map %u sg_table entries for DMA\n",
                       __func__, sgt->nents);
                err = -ENOMEM;
                goto err_dma_map;
        }

        frame->pages = pages;
        frame->nr_pages = nr_pages;
        frame->sgt = sgt;

        return 0;

err_dma_map:
        sg_free_table(sgt);

err_alloc_sgt2:
err_get_pages:
        for (n = 0; n < nr_pages; ++n)
                put_page(pages[n]);
        kfree(sgt);

err_alloc_sgt:
        kfree(pages);

        return err;
}

static void unpin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
        int n;

        dma_unmap_sg(&cam->pci_dev->dev, frame->sgt->sgl, frame->sgt->nents,
                     DMA_FROM_DEVICE);

        sg_free_table(frame->sgt);
        kfree(frame->sgt);
        frame->sgt = NULL;

        for (n = 0; n < frame->nr_pages; ++n) {
                struct page *page = frame->pages[n];
                set_page_dirty_lock(page);
                mark_page_accessed(page); //<--- Without this the Oops are more frequent
                put_page(page);
        }
        kfree(frame->pages);
        frame->pages = NULL;

        frame->nr_pages = 0;
}

static void unpin_user_buffer(struct my_dev *cam)
{
        if (cam->udma_frames) {
                int n;
                for (n = 0; n < cam->udma_frame_count; ++n)
                        unpin_user_frame(cam, &cam->udma_frames[n]);
                kfree(cam->udma_frames);
                cam->udma_frames = NULL;
        }
        cam->udma_frame_count = 0;
        cam->udma_buffer_bytes = 0;
        cam->udma_buffer = NULL;
        cam->udma_desc_count = 0;
}

static int pin_user_buffer(struct my_dev *cam)
{
        int err;
        int n;
        const u32 acq_frame_count = cam->acq_buffer_bytes / cam->acq_frame_bytes;
        struct udma_frame *udma_frames;
        u32 udma_desc_count = 0;

        if (!cam->acq_buffer) {
                pr_err("%s: user buffer is NULL!\n", __func__);
                return -EFAULT;
        }

        if (cam->udma_buffer == cam->acq_buffer
            && cam->udma_buffer_bytes == cam->acq_buffer_bytes
            && cam->udma_frame_count == acq_frame_count)
                return 0;

        if (cam->udma_buffer)
                unpin_user_buffer(cam);

        udma_frames = kcalloc(acq_frame_count, sizeof(*udma_frames),
                              GFP_KERNEL);
        if (!udma_frames) {
                pr_err("%s: can't allocate udma_frame array for %u frames\n",
                       __func__, acq_frame_count);
                return -ENOMEM;
        }

        for (n = 0; n < acq_frame_count; ++n) {
                struct udma_frame *frame = &udma_frames[n];
                frame->uaddr =
                        (unsigned long)(cam->acq_buffer + n * cam->acq_frame_bytes);
                err = pin_user_frame(cam, frame);
                if (err) {
                        pr_err("%s: can't pin frame %d (out of %u)\n",
                               __func__, n + 1, acq_frame_count);
                        for (--n; n >= 0; --n)
                                unpin_user_frame(cam, &udma_frames[n]);
                        kfree(udma_frames);
                        return err;
                }
                udma_desc_count += frame->sgt->nents; /* Cannot overflow */
        }
        pr_debug("%s: total udma_desc_count=%u\n", __func__, udma_desc_count);

        cam->udma_buffer = cam->acq_buffer;
        cam->udma_buffer_bytes = cam->acq_buffer_bytes;
        cam->udma_frame_count = acq_frame_count;
        cam->udma_frames = udma_frames;
        cam->udma_desc_count = udma_desc_count;

        return 0;
}

Related structures:

struct udma_frame {
        unsigned long   uaddr;      /* User address of the frame */
        int             nr_pages;   /* Nr. of pages covering the frame */
        struct page     **pages;    /* Actual pages covering the frame */
        struct sg_table *sgt;       /* S/G table describing the frame */
};

struct my_dev {
        ...
        u8 __user   *acq_buffer;   /* User-space buffer received via IOCTL */
        ...
        u8 __user   *udma_buffer;       /* User-space buffer for image */
        u32         udma_buffer_bytes;  /* Total image size in bytes */
        u32         udma_frame_count;   /* Nr. of items in udma_frames */
        struct udma_frame
                    *udma_frames;       /* DMA descriptors per frame */
        u32         udma_desc_count;    /* Total nr. of DMA descriptors */
        ...
};

Questions:

  1. How do I properly pin user-buffer pages and mark them as not movable?
  2. If one frame ends and the next frame starts in the same page, is it correct to handle them as two independent frames, i.e. to pin the shared page twice?
  3. The data comes from the device to the user buffer, and the app is supposed not to write to its buffer, but I have no control over that. Can I use DMA_FROM_DEVICE, or should I rather use DMA_BIDIRECTIONAL just in case?
  4. Do I need to use something like SetPageReserved/ClearPageReserved or mark_page_reserved/free_reserved_page?
  5. Is the IOMMU/swiotlb somehow involved? E.g. the i915 driver doesn't use sg_alloc_table_from_pages if swiotlb is active.
  6. What is the difference between the set_page_dirty, set_page_dirty_lock and SetPageDirty functions?

Thanks for any hint.

PS: I cannot change the way the application gets the data without breaking our library API, which has been maintained for many years. So please don't suggest, e.g., mmap-ing a kernel buffer...



Solution 1:[1]

Why do you pass "READ == READ" as the third parameter? You need to put a flags value there.

err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);

You need to pass FOLL_LONGTERM there, since the pages stay pinned for the lifetime of the mapping. On kernels that have it (v5.6+), use the pin_user_pages_fast() variant for this case; it sets FOLL_PIN internally. See https://www.kernel.org/doc/html/latest/core-api/pin_user_pages.html#case-2-rdma
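
On such kernels, the pinning and unpinning steps might look roughly like this (a sketch only; frame->uaddr, nr_pages and pages are the question's variables, and FOLL_WRITE is needed because the device writes into the memory):

```
/* Sketch: long-term pinning on kernels with pin_user_pages_fast() (v5.6+).
 * FOLL_PIN is added internally; the caller adds FOLL_LONGTERM because the
 * pages remain pinned across many DMA transfers. */
err = pin_user_pages_fast(frame->uaddr, nr_pages,
                          FOLL_WRITE | FOLL_LONGTERM, pages);
if (err < nr_pages) {
        if (err > 0)
                unpin_user_pages(pages, err); /* release the partial pin */
        return err < 0 ? err : -EFAULT;
}

/* ...and on teardown, instead of set_page_dirty_lock() + put_page(): */
unpin_user_pages_dirty_lock(pages, nr_pages, true);
```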

In addition, you need to take care of CPU/device memory coherence: call dma_sync_sg_for_device(...) before the DMA transfer and dma_sync_sg_for_cpu(...) after it.
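
Concretely, for the streaming DMA_FROM_DEVICE mapping in the question, the per-transfer ownership hand-off might look like this (a sketch; cam and sgt are the question's variables, and the nents value must be the same one that was passed to dma_map_sg):

```
/* Before starting the device: hand buffer ownership to the device. */
dma_sync_sg_for_device(&cam->pci_dev->dev, sgt->sgl, sgt->nents,
                       DMA_FROM_DEVICE);

/* ...device fills the frame via DMA, completion interrupt fires... */

/* After completion: hand ownership back to the CPU before user space
 * reads the frame. */
dma_sync_sg_for_cpu(&cam->pci_dev->dev, sgt->sgl, sgt->nents,
                    DMA_FROM_DEVICE);
```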

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
