Fastest, most efficient way to check for changes in a large XML file (100k+ entries) updated daily with PHP

I am working with XML files with 50-150k+ entries, roughly 50-100 MB in size, that change daily. All entries are unique and have 10-15 elements each (id, title, etc.). I'm currently pulling the XML file into a string, using SimpleXML to parse it, and then looping through each entry to check for changes.

Here's the basic code...

$data = file_get_contents("test.xml");

// Strip characters that are invalid in XML 1.0
$data = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $data);

$xml = simplexml_load_string($data);

$item_count = count($xml->entry);

foreach (range(0, $item_count - 1) as $num) {

    $id = (string) $xml->entry[$num]->id;
    $title = (string) $xml->entry[$num]->title;

    // ... etc. ...
}

I'm on a VPS (1 CPU, 4 GB RAM, 40 GB disk). Without any other code, just iterating over an XML file with 70k entries at about 80 MB takes 25-30 minutes. Performance wanes over time, with the CPU at 98% after a few minutes (RAM is fine). The first 30k entries take 10 minutes, while the next 40k take about 20.
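For what it's worth, part of the slowdown may be the indexed access itself: `$xml->entry[$num]` appears to re-scan the sibling list on every lookup, which would make the loop degrade quadratically as the file grows. Iterating the node list directly with foreach avoids that. A minimal sketch with an inlined two-entry feed (ids and field names are illustrative):

```php
<?php
// Minimal sketch: iterate <entry> nodes directly instead of indexing
// $xml->entry[$num], which re-walks siblings on each access.
// The inlined feed and its values are illustrative.
$data = '<feed>'
      . '<entry><id>a1</id><title>First</title></entry>'
      . '<entry><id>a2</id><title>Second</title></entry>'
      . '</feed>';

$xml = simplexml_load_string($data);

$titles = [];
foreach ($xml->entry as $entry) {   // single pass over the node list
    $titles[(string) $entry->id] = (string) $entry->title;
}

print_r($titles);
```

Streaming with XMLReader (Solution 1 below) is still the bigger win on 100 MB files, since the whole document never has to sit in memory as a SimpleXML tree.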

Is there a faster, more efficient way to do this than SimpleXML? Or a better method for checking large XML files that change daily (e.g. a MySQL import and queries)?

I've seen suggestions to use a SAX parser, but I like the easy access to elements that SimpleXML provides. If I can stream the XML and save RAM, that's preferable as well.

One last note, if it helps, my current logic for checking for changes is as follows:

  1. Create an array of previously imported entries with the entry ID as the key and a string of entry values to check as the value
  2. Loop through the XML file and check whether the current entry's id/value pair exists:

// Check if the current entry is in the array (the ?? null guard avoids
// an undefined-index notice for entries that are new)
if (($previously_imported_entries[$id] ?? null) === $value1 . " ---- " . $value2 . " ---- " . $value3) {

    // Item is in the array and the values are the same
    // No changes; skip it

} else {

    // Item isn't in the array, or its values have changed
    // Import the entry either way

}
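As a runnable sketch of that logic (the array contents are made up, and in practice $previously_imported_entries would come from the last import, e.g. a database query):

```php
<?php
// Hypothetical sketch of the change check described above: the previous
// import is an id => "value ---- value ---- value" map, and any entry
// that is new or whose joined values differ gets re-imported.
$previously_imported_entries = [
    '1001' => 'Old title ---- cat-a ---- 9.99',
    '1002' => 'Same title ---- cat-b ---- 4.50',
];

// Entries parsed from today's feed (made-up data)
$todays_entries = [
    ['id' => '1001', 'values' => ['New title', 'cat-a', '9.99']],  // changed
    ['id' => '1002', 'values' => ['Same title', 'cat-b', '4.50']], // unchanged
    ['id' => '1003', 'values' => ['Brand new', 'cat-c', '1.25']],  // new
];

$to_import = [];
foreach ($todays_entries as $entry) {
    $signature = implode(' ---- ', $entry['values']);
    if (($previously_imported_entries[$entry['id']] ?? null) !== $signature) {
        $to_import[] = $entry['id'];   // new entry or changed values
    }
}

print_r($to_import);
```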


Solution 1:[1]

XMLReader + SimpleXML.

Stream the xml through XMLReader, then for each entry, load it into SimpleXML to parse and easily access the elements.

Incredible!! Same concept, same files, new result time: 6-10 seconds.

Credit: Bartosz Pachołek, LinkedIn post "Parsing huge XML files with PHP"

*Thanks also to Stack Overflow user Maharramoff for pointing me to XMLReader

Here is Bartosz's example code:

<?php
//include "memcheck.php";
$start = time();

$xml = XMLReader::open('random5.xml');
//go to the first 'object' element
while ($xml->name !== 'object' && $xml->read()) {
    // read() in the condition prevents an infinite loop at end of file
}

do {
    $object = simplexml_load_string($xml->readOuterXml());
    $id = (string) $object->id;
    $name = (string) $object->name;
    $features = [];
    foreach($object->features->feature as $feature) {
        $features[(string)$feature->id] = (string) $feature->name;
    }

    $services = [];
    foreach($object->services->service as $service) {
        $services[(string)$service->id] = (string) $service->service;
    }

    //here again we have all data of an object
} while ($xml->next('object'));

//var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

This worked fantastic for me.

I just replaced 'object' on line 7 and on the 4th line from the bottom with 'entry' to loop through all of the entry tags in my XML files, and changed the specific element names/variables to match my files (e.g. $title = (string) $object->title instead of 'name', etc.).

*I commented out memcheck.php and the second-to-last line, as I didn't use them.
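Adapted to the entry tags as described, the loop looks roughly like this. The feed is inlined here so the sketch is self-contained and the field names are illustrative; a real run would open the file with XMLReader::open() as in the original:

```php
<?php
// The same pattern adapted to <entry> elements, as described above.
// The inlined feed is illustrative; a real run would use
// XMLReader::open('your-feed.xml') instead of XML().
$feed = '<feed>'
      . '<entry><id>a1</id><title>First</title></entry>'
      . '<entry><id>a2</id><title>Second</title></entry>'
      . '</feed>';

$xml = new XMLReader();
$xml->XML($feed);

// go to the first 'entry' element
while ($xml->name !== 'entry' && $xml->read()) {
}

$titles = [];
do {
    // load just this one entry into SimpleXML for easy element access
    $entry = simplexml_load_string($xml->readOuterXml());
    $titles[(string) $entry->id] = (string) $entry->title;
    // ...compare against the previously imported entries here...
} while ($xml->next('entry'));

print_r($titles);
```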

NOTE: For the XML files I used, I first had to remove all of the invalid characters to avoid "PCDATA invalid Char" errors. If it helps anyone, here's the script I used:

$data = file_get_contents('test.xml');

// Replace characters outside the valid XML 1.0 range with a space
$data = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $data);

file_put_contents('new_file.xml', $data);

Bartosz's article lays out several tests with 4 GB and other XML files, comparing the speeds of SimpleXML, DOM, the SAX Expat parser, XMLReader, etc. Very helpful. Check it out if you're interested in his other findings.

For the speed & efficiency I needed, plus the simplicity I enjoyed with SimpleXML, this solution is best.

Test and see if it works for you.

Solution 2:[2]

The XMLReader class might solve the problem:

"If you have to deal with massive XML documents, use XMLReader to process them. Don't try and gather an entire XML document into a PHP data structure using XMLReader and a PHP xml2assoc() function, you are reinventing the SimpleXML wheel. When parsing massive XML documents using XMLReader, gather the data you need to perform an operation then perform it before skipping to the next node."

You can read more here
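A minimal sketch of that advice, processing one node at a time and keeping nothing else in memory (the inlined feed and element names are placeholders):

```php
<?php
// Sketch of the advice above: walk the document with XMLReader, gather
// the fields needed for one <entry>, act on them, then move on, so
// memory use stays flat regardless of file size. Names are placeholders.
$reader = new XMLReader();
$reader->XML(
    '<feed>'
  . '<entry><id>a1</id><title>First</title></entry>'
  . '<entry><id>a2</id><title>Second</title></entry>'
  . '</feed>'
);

$seen = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'entry') {
        // hand just this one element to SimpleXML, use it, then discard it
        $entry = simplexml_load_string($reader->readOuterXml());
        $seen[(string) $entry->id] = (string) $entry->title;
    }
}
$reader->close();

print_r($seen);
```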

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: Maharramoff