Stream (Geo)JSON file and get startByte and endByte of each JSON record in the file
For very large JSON/GeoJSON files, I'd like to create a primitive key/value store that keeps track of the starting position and byte length of each JSON record in the file. This way, I could look up individual records at a later stage without reading the whole file into memory (using the fd.read API). It would be somewhat similar to a super simple database, but read-only and without the extra overhead.
The issue I'm facing is that I don't know how I could determine the starting position and byte length of each JSON record / GeoJSON feature in the original file.
Here's some pseudo-code showcasing what I'm trying to achieve. Note that, in reality, the geojsonStream.parse callback doesn't receive the startByte and length arguments.
Thanks for your help. I'm also happy about any feedback outlining why this might be a bad idea :)
import geojsonStream from 'geojson-stream'
import { open } from 'fs/promises'
import { Buffer } from 'buffer'
function getFeaturePositionsInFile(fd) {
  return new Promise((resolve, reject) => {
    const featurePositionsInFile = []
    const stream = fd
      .createReadStream()
      .pipe(geojsonStream.parse((building, index, startByte, length) => {
        // The startByte and length callback arguments are not real unfortunately :(
        featurePositionsInFile.push({
          index,
          startPosition: startByte,
          length
        })
      }))
    stream.on('end', () => resolve(featurePositionsInFile))
    // reject the promise if the stream fails
    stream.on('error', reject)
  })
}
async function readSingleFeatureFromFile(fd, startPosition, length) {
  const buff = Buffer.alloc(length)
  const offset = 0
  // read exactly `length` bytes, starting at byte `startPosition` of the file
  const { buffer } = await fd.read(buff, offset, length, startPosition)
  const singleFeature = JSON.parse(buffer.toString())
  return singleFeature
}
const fd = await open('buildings.geojson')
const featurePositionsInFile = await getFeaturePositionsInFile(fd)
const featureIndexToRead = 0
const { startPosition, length } = featurePositionsInFile[featureIndexToRead]
const singleFeature = await readSingleFeatureFromFile(fd, startPosition, length)
Solution 1:[1]
Alright, since I couldn't find a suitable package for my needs, I created a simple (naïve) solution using RegExp to extract single GeoJSON features.
It works given:
- The GeoJSON has a properties object, and that object is the last key in the parent GeoJSON object
- The GeoJSON (properties) consists solely of ASCII characters
For GeoJSON files containing non-ASCII characters, the byte counting is off. I tried but couldn't really find out what exactly I'm doing wrong, so any help is appreciated!
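One likely source of the mismatch, sketched below: match.index and featureString.length count UTF-16 code units in the decoded string, whereas fd.read expects byte offsets, and the two diverge as soon as a non-ASCII character appears before the match in the buffered string. Separately, calling toString() on each chunk can split a multi-byte character at a chunk boundary, which yields replacement characters and throws off the running byte count. The sample string and the helper name byteOffsetOf are made up for illustration; this is a sketch of the problem, not a drop-in fix.

```javascript
import { Buffer } from 'buffer'
import { StringDecoder } from 'string_decoder'

// Hypothetical sample: "ü" takes 1 string index but 2 bytes in UTF-8
const text = '{"name":"Zürich"},{"type":"Feature"}'
const charIndex = text.indexOf('{"type"')

// Hypothetical helper: convert a string index into a UTF-8 byte offset
function byteOffsetOf(str, index) {
  return Buffer.byteLength(str.slice(0, index), 'utf8')
}

const byteIndex = byteOffsetOf(text, charIndex)
console.log({ charIndex, byteIndex }) // the byte offset is one larger than the string index

// A multi-byte character split across two chunks:
const bytes = Buffer.from('ü', 'utf8') // two bytes: 0xc3 0xbc
// decoding each half on its own produces two U+FFFD replacement characters
const naive = bytes.subarray(0, 1).toString() + bytes.subarray(1).toString()
// StringDecoder holds back the incomplete byte until the rest arrives
const decoder = new StringDecoder('utf8')
const safe = decoder.write(bytes.subarray(0, 1)) + decoder.write(bytes.subarray(1))
console.log({ naiveBytes: Buffer.byteLength(naive), safe })
```

Under these assumptions, converting string indices with a helper like byteOffsetOf and decoding chunks through a StringDecoder would address both effects, though working on the raw bytes avoids the conversion altogether.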
For a more general solution, I guess one would need to implement the byte counting logic in an existing library such as stream-json.
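As a rough idea of what such byte counting could look like without any library, here is a minimal sketch that scans the raw Buffer chunks instead of decoded strings. It relies on a property of UTF-8: the structural JSON bytes (quote, backslash, braces, brackets) are ASCII and never occur inside a multi-byte sequence. The function name createFeatureScanner and the sample data are hypothetical, and the sketch assumes that every object sitting directly inside an array of the root object is a feature, which holds for a plain FeatureCollection but not necessarily for files with foreign members.

```javascript
import { Buffer } from 'buffer'

const QUOTE = 0x22, BACKSLASH = 0x5c
const OPEN_BRACE = 0x7b, CLOSE_BRACE = 0x7d
const OPEN_BRACKET = 0x5b, CLOSE_BRACKET = 0x5d

// Hypothetical helper: returns a function to call with each chunk (a Buffer).
// onFeature receives { startPosition, byteLength } in absolute file bytes.
function createFeatureScanner(onFeature) {
  const stack = []      // currently open containers ({ or [), as byte values
  let inString = false  // inside a JSON string literal?
  let escaped = false   // previous byte was a backslash inside a string?
  let filePos = 0       // absolute byte offset of the byte being inspected
  let start = -1        // start offset of the feature currently open, if any
  return function scan(chunk) {
    for (let i = 0; i < chunk.length; i++, filePos++) {
      const byte = chunk[i]
      if (inString) {
        if (escaped) escaped = false
        else if (byte === BACKSLASH) escaped = true
        else if (byte === QUOTE) inString = false
      } else if (byte === QUOTE) {
        inString = true
      } else if (byte === OPEN_BRACE || byte === OPEN_BRACKET) {
        stack.push(byte)
        // root object -> array -> object: assumed to be a feature
        if (byte === OPEN_BRACE && stack.length === 3 && stack[1] === OPEN_BRACKET) {
          start = filePos
        }
      } else if (byte === CLOSE_BRACE || byte === CLOSE_BRACKET) {
        stack.pop()
        if (byte === CLOSE_BRACE && stack.length === 2 && start !== -1) {
          onFeature({ startPosition: start, byteLength: filePos - start + 1 })
          start = -1
        }
      }
    }
  }
}

// Made-up sample with a non-ASCII character and braces inside a string
const sample = Buffer.from(
  '{"type":"FeatureCollection","features":[' +
  '{"type":"Feature","properties":{"name":"Zürich {1}"},"geometry":null},' +
  '{"type":"Feature","properties":{},"geometry":null}]}', 'utf8')

const positions = []
const scan = createFeatureScanner((p) => positions.push(p))
// feed two arbitrary chunks to mimic consecutive 'data' events
scan(sample.subarray(0, 50))
scan(sample.subarray(50))

const features = positions.map(({ startPosition, byteLength }) =>
  JSON.parse(sample.subarray(startPosition, startPosition + byteLength).toString('utf8')))
console.log(positions, features.length)
```

With a real file, the same scan function could be attached via stream.on('data', scan), provided no encoding is set on the read stream so that the chunks arrive as Buffers. Because only a few state variables are kept between chunks, memory use stays constant regardless of feature size.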
import { open } from 'fs/promises'
import { Buffer } from 'buffer'
const HIGHWATERMARK = 64 * 1024 / 8
function getFeaturePositionsInFile(fd) {
  return new Promise((resolve, reject) => {
    const featurePositionsInFile = []
    const stream = fd.createReadStream({ highWaterMark: HIGHWATERMARK, autoClose: false })
    // this RegEx will solely work with standard GeoJSON without any foreign members:
    // https://datatracker.ietf.org/doc/html/rfc7946#section-6.1
    // The properties object has to be present, and has to be the last key in the GeoJSON object
    const jsonExtractor = /\{[\n\r\s]*?"type":[\n\r\s]*?"Feature"[\S\s]*?\}(?:[\n\r\s]*\})+/g
    let string = ''
    let endPos = 0
    stream.on('data', (d) => {
      const section = d.toString()
      const sectionLength = (new TextEncoder().encode(section)).length
      string += section
      endPos += sectionLength
      let match
      let latestEndPositionInString = 0
      while ((match = jsonExtractor.exec(string)) != null) {
        const startPositionInString = match.index
        const featureString = match[0]
        const endPositionInString = startPositionInString + featureString.length
        const curStringLength = (new TextEncoder().encode(string)).length
        // calculate starting position in file
        // (startPositionInString is a string index, so this only holds for ASCII content)
        const startPosition = endPos - curStringLength + startPositionInString
        // calculate number of bytes in feature
        const byteLength = (new TextEncoder().encode(featureString)).length
        // store info for later in our lookup array
        featurePositionsInFile.push({
          startPosition,
          byteLength
        })
        if (endPositionInString > latestEndPositionInString) {
          latestEndPositionInString = endPositionInString
        }
      }
      // remove features from string to free memory
      string = string.substring(latestEndPositionInString)
    })
    stream.on('end', () => resolve(featurePositionsInFile))
    // reject the promise if the stream fails
    stream.on('error', reject)
  })
}
async function readSingleFeatureFromFile(fd, startPosition, length) {
  const buff = Buffer.alloc(length)
  const offset = 0
  // read exactly `length` bytes, starting at byte `startPosition` of the file
  const { buffer } = await fd.read(buff, offset, length, startPosition)
  const featureString = buffer.toString()
  const singleFeature = JSON.parse(featureString)
  return singleFeature
}
async function getFeature(featureIndexToRead, featurePositionsInFile) {
  const { startPosition, byteLength } = featurePositionsInFile[featureIndexToRead]
  const singleFeature = await readSingleFeatureFromFile(fd, startPosition, byteLength)
  return singleFeature
}
// source: https://raw.githubusercontent.com/node-geojson/geojson-stream/master/test/data/featurecollection.geojson
const path = 'featurecollection.geojson'
// -> has 3 features
const fd = await open(path, 'r');
const featurePositionsInFile = await getFeaturePositionsInFile(fd)
// get nth (e.g. 3rd) feature in file; indices are zero-based
const thirdFeature = await getFeature(2, featurePositionsInFile)
console.log(thirdFeature)
// done! make sure to close the filehandle
await fd.close()
https://gist.github.com/chrispahm/c226cca151b25147869288600151a5f8
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Christoph Pahmeyer |
