Stream (Geo)JSON file and get startByte and endByte of each JSON record in the file

For very large JSON/GeoJSON files, I'd like to create a primitive key/value store that keeps track of the starting position and byte length of each JSON record in the file. This way, I could look up individual records later without reading the whole file into memory (using the fd.read API). It would be somewhat similar to a super simple database, but read-only and without the extra overhead.

The issue I'm facing is that I don't know how I could determine the starting position and byte length of each JSON record / GeoJSON feature in the original file.

Here's some pseudo-code showcasing what I'm trying to achieve. Note that in reality the geojsonStream.parse callback doesn't receive the startByte and length arguments.

Thanks for your help, also happy about any feedback outlining why this might be a bad idea :)

import geojsonStream from 'geojson-stream'
import { open } from 'fs/promises'
import { Buffer } from 'buffer'

function getFeaturePositionsInFile(fd) {
  return new Promise((resolve,reject) => {
    const featurePositionsInFile = []
    const stream = fd
      .createReadStream()
      .pipe(geojsonStream.parse((building, index, startByte, length) => {
          // The startByte and length callback arguments are not real unfortunately :(
          featurePositionsInFile.push({
            index,
            startPosition: startByte,
            length
          })
      }))
    stream.on('end', () => resolve(featurePositionsInFile))
    // pass the error on to reject instead of returning the function uncalled
    stream.on('error', reject)
  })
}

async function readSingleFeatureFromFile(fd, startPosition, length) {
  const buff = Buffer.alloc(length)
  const offset = 0
  // read `length` bytes starting at byte `startPosition` of the file
  const { buffer } = await fd.read(buff, offset, length, startPosition)
  return JSON.parse(buffer.toString())
}

const fd = await open('buildings.geojson')
const featurePositionsInFile = await getFeaturePositionsInFile(fd)
const featureIndexToRead = 0
const { startPosition, length } = featurePositionsInFile[featureIndexToRead]
const singleFeature = await readSingleFeatureFromFile(fd, startPosition, length)


Solution 1:[1]

Alright, since I couldn't find a suitable package for my needs, I created a simple (naïve) solution that uses a RegExp to extract single GeoJSON features.

It works given:

  • Each GeoJSON feature has a properties object, and that object is the last key in the feature object
  • The GeoJSON (including its properties) consists solely of ASCII characters

For GeoJSON files containing non-ASCII characters, the byte counting is off. I tried but couldn't really find out what exactly I'm doing wrong, so any help is appreciated!
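Two things seem likely to contribute to the miscount, shown here as a minimal sketch rather than a confirmed diagnosis. First, string indices (such as match.index) count UTF-16 code units, while file offsets count bytes. Second, calling toString() on a chunk that ends in the middle of a multi-byte character substitutes a replacement character, so the re-encoded length no longer matches the chunk's byte length. The helper byteOffsetOf is a hypothetical name introduced for illustration.

```javascript
// String indices count UTF-16 code units, file offsets count bytes.
const text = 'Köln'
const charLength = text.length                       // 4 code units
const byteLength = Buffer.byteLength(text, 'utf8')   // 5 bytes, 'ö' takes two

// Hypothetical helper: convert a character index into a byte offset
// by measuring the UTF-8 size of everything before that index.
function byteOffsetOf(str, charIndex) {
  return Buffer.byteLength(str.slice(0, charIndex), 'utf8')
}

// A chunk boundary can fall inside a multi-byte character.
const bytes = Buffer.from('ö', 'utf8')               // two bytes: 0xc3 0xb6
const firstHalf = bytes.subarray(0, 1)
// toString() substitutes U+FFFD, which re-encodes to three bytes instead of one
const reencodedLength = new TextEncoder().encode(firstHalf.toString()).length

// A streaming TextDecoder holds an incomplete sequence back until the next chunk
const decoder = new TextDecoder('utf-8')
const decoded =
  decoder.decode(firstHalf, { stream: true }) +
  decoder.decode(bytes.subarray(1), { stream: true })
```

Under these assumptions, counting the raw chunk length (d.length) for the running file position and converting match indices with a prefix measurement like byteOffsetOf would keep all bookkeeping in bytes.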

For a more general solution, I guess one would need to implement the byte counting logic in an existing library such as stream-json.

import { open } from 'fs/promises'
import { Buffer } from 'buffer'

const HIGHWATERMARK = 64 * 1024 / 8

function getFeaturePositionsInFile(fd) {
  return new Promise((resolve,reject) => {
    const featurePositionsInFile = []
    const stream = fd.createReadStream({highWaterMark: HIGHWATERMARK, autoClose: false});
    // this RegEx will solely work with standard GeoJSON without any foreign members:
    // https://datatracker.ietf.org/doc/html/rfc7946#section-6.1
    // The properties object has to be present, and has to be that last key in the GeoJSON object
    const jsonExtractor = /\{[\n\r\s]*?"type":[\n\r\s]*?"Feature"[\S\s]*?\}(?:[\n\r\s]*\})+/g

    let string = ''
    let endPos = 0
    
    stream.on('data', (d) => {
      const section = d.toString()      
      const sectionLength = (new TextEncoder().encode(section)).length
      string += section
      endPos+= sectionLength
      
      let match
      let latestEndPositionInString = 0
      while ((match = jsonExtractor.exec(string)) != null) {
        const startPositionInString = match.index
        const featureString = match[0]
        const endPositionInString = startPositionInString + featureString.length
        const curStringLength = (new TextEncoder().encode(string)).length
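        // calculate starting position in file
        // note: startPositionInString is a character index, not a byte offset,
        // so this sum is only correct while the input is pure ASCII
        const startPosition = endPos - curStringLength + startPositionInString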
        // calculate number of bytes in feature
        const byteLength = (new TextEncoder().encode(featureString)).length
        // store info for later in our lookup array
        featurePositionsInFile.push({
          startPosition,
          byteLength
        })        
        if (endPositionInString > latestEndPositionInString) {
          latestEndPositionInString = endPositionInString
        }
      }
      // remove features from string to free memory
      string = string.substring(latestEndPositionInString)
    })
    stream.on('end', () => resolve(featurePositionsInFile))
    // pass the error on to reject instead of returning the function uncalled
    stream.on('error', reject)
  })
}

async function readSingleFeatureFromFile(fd, startPosition, length) {
  const buff = Buffer.alloc(length)
  const offset = 0
  // read `length` bytes starting at byte `startPosition` of the file
  const { buffer } = await fd.read(buff, offset, length, startPosition)
  const featureString = buffer.toString()
  return JSON.parse(featureString)
}

async function getFeature(featureIndexToRead, featurePositionsInFile) {
  const { startPosition, byteLength } = featurePositionsInFile[featureIndexToRead]  
  const singleFeature = await readSingleFeatureFromFile(fd, startPosition, byteLength)
  return singleFeature
}

// source: https://raw.githubusercontent.com/node-geojson/geojson-stream/master/test/data/featurecollection.geojson
const path = 'featurecollection.geojson'
// -> has 3 features

const fd = await open(path, 'r');
const featurePositionsInFile = await getFeaturePositionsInFile(fd)
// get nth (e.g. 3rd) feature in file; the index is zero-based
const thirdFeature = await getFeature(2, featurePositionsInFile)
console.log(thirdFeature)

// done! make sure to close the filehandle
await fd.close()

https://gist.github.com/chrispahm/c226cca151b25147869288600151a5f8
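
As a rough sketch of the more general, byte-based approach mentioned above, the scanner below works on raw chunk bytes instead of strings. JSON's structural characters ({, }, [, ], ", \) are single ASCII bytes and never occur inside a UTF-8 multi-byte sequence, so no decoding is needed and every offset is a byte offset. createFeatureIndexer is a hypothetical helper, not part of any library. It assumes a FeatureCollection layout, treating every object that sits directly inside an array directly inside the root object as a feature, and it does not validate the JSON.

```javascript
// Hypothetical helper: scans raw bytes, so all recorded offsets are byte offsets.
function createFeatureIndexer() {
  const positions = []   // one { startPosition, byteLength } entry per feature
  const stack = []       // currently open containers: 0x7b '{' or 0x5b '['
  let inString = false   // true while inside a JSON string literal
  let escaped = false    // true right after a backslash inside a string
  let filePos = 0        // absolute byte offset of the byte being scanned
  let start = -1         // byte offset where the current feature began

  function push(chunk) {
    for (let i = 0; i < chunk.length; i++, filePos++) {
      const byte = chunk[i]
      if (inString) {
        if (escaped) escaped = false
        else if (byte === 0x5c) escaped = true       // backslash
        else if (byte === 0x22) inString = false     // closing quote
        continue
      }
      if (byte === 0x22) {
        inString = true
      } else if (byte === 0x7b || byte === 0x5b) {
        // Assumption: an object directly inside an array directly inside
        // the root object is a feature.
        if (byte === 0x7b && stack.length === 2 && stack[1] === 0x5b) start = filePos
        stack.push(byte)
      } else if (byte === 0x7d || byte === 0x5d) {
        stack.pop()
        if (byte === 0x7d && stack.length === 2 && start !== -1) {
          positions.push({ startPosition: start, byteLength: filePos - start + 1 })
          start = -1
        }
      }
    }
  }

  return { push, positions }
}
```

With a read stream, this could be fed via stream.on('data', (chunk) => indexer.push(chunk)), and the collected indexer.positions entries would then be passed to readSingleFeatureFromFile as before. Because the parser state lives outside the loop, features and multi-byte characters that span chunk boundaries need no special handling.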

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Christoph Pahmeyer