Parsing XML file with duplicate tags

I currently use an XML parser to extract the name of a route from a GPX (XML) file.

Each GPX file contains a single "name" tag, which is what I've been extracting.

Here's the script:

#! /bin/bash

gpxpath=/mnt/gpxfiles; export gpxpath

for file in $gpxpath/*
do

filename=`ls "$file"`; export filename
gpxname=`$scripts/xmlparse.pl "$file"`

echo $filename "    "$gpxname >> gpxparse.tmp

done

sort -k 2,2 gpxparse.tmp > gpxparse.out

cat gpxparse.out

And here's xmlparse.pl:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

XML::Twig->new(
    twig_handlers => {
        'name' => sub { print $_ ->text }
    }
    )->parse( <> );

Here's an example GPX file:

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="creator" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <metadata>     
        <referrer>Referrer</referrer>
        <time>2019-06-17T06:02:23.000Z</time>
    </metadata>
    <trk>
        <name>Another GPX file</name>
        <trkseg>
            <trkpt lon="-1.91990" lat="53.00131">
                <ele>112.1</ele>
                <time>2019-06-17T06:02:23.000Z</time>
            </trkpt>
            <trkpt lon="-1.91966" lat="53.00126">
                <ele>113.6</ele>
                <time>2019-06-17T06:02:25.000Z</time>
            </trkpt>
            <trkpt lon="-1.91962" lat="53.00125">
                <ele>114.1</ele>
                <time>2019-06-17T06:02:25.000Z</time>
            </trkpt>
            <trkpt lon="-1.91945" lat="53.00120">
                <ele>115.5</ele>
                <time>2019-06-17T06:02:26.000Z</time>
            </trkpt>
        </trkseg>
    </trk>
</gpx>

I can successfully extract the name of the route using the scripts above. However, I'd additionally like to extract the first co-ordinate pair in each file.

A track is defined by a "trk" element, and a track can contain multiple segments ("trkseg"). Each trkseg in turn contains multiple "trkpt" (track point) elements.

A track point usually consists of a latitude and longitude co-ordinate pair along with elevation and timestamp information.

I'm only looking to extract the first lat and lon within the first trkpt of the GPX file. Ideally, once the script has found the first co-ordinate pair it should exit and move onto the next file.

I've tried crafting an additional Perl parse script using XML::Twig, but it seems to stumble when there are multiple elements with duplicate names.



Solution 1:[1]

Since you were originally going for a Perl solution,

perl -MXML::LibXML -e'
   my $doc = XML::LibXML->load_xml( location => $ARGV[0] );
   my $xpc = XML::LibXML::XPathContext->new();
   $xpc->registerNs( gpx => "http://www.topografix.com/GPX/1/1" );
   CORE::say
      join ",",
         $xpc->findnodes(q{/gpx:gpx/gpx:trk/gpx:name}, $doc),
         $xpc->findnodes(q{/gpx:gpx/gpx:trk/gpx:trkseg/gpx:trkpt[1]/@lat}, $doc),
         $xpc->findnodes(q{/gpx:gpx/gpx:trk/gpx:trkseg/gpx:trkpt[1]/@lon}, $doc);
' "$file"

(I used XML::LibXML instead of XML::Twig because I'm more familiar with that.)

Unlike the solution in the earlier answer,

  • This solution doesn't make fragile assumptions about what the default namespace might be.
  • This solution doesn't make fragile assumptions about where name elements might or might not appear.

Solution 2:[2]

This is very easy with xidel:

xidel -s input.xml -e 'join((//name,//trkpt[1]/@*),",")'
Another GPX file,-1.91990,53.00131

"Ideally, once the script has found the first co-ordinate pair it should exit and move onto the next file."

xidel, together with the integrated EXPath File Module, can do this very efficiently:

xidel -se 'file:list("/mnt/gpxfiles")'   # lists all files in '/mnt/gpxfiles' (and subdirs!)
xidel -se 'file:list("/mnt/gpxfiles",false(),"*.xml")'   # lists all xml-files in '/mnt/gpxfiles'

xidel -se '
  for $x in file:list("/mnt/gpxfiles") return
  doc("/mnt/gpxfiles/"||$x)/join((//name,//trkpt[1]/@*),",")
'   # iterate over and parse all xml-files in '/mnt/gpxfiles' AND extract the info you need.

Solution 3:[3]

I see some more elegant methods in other answers, but I'd probably use a brute force method:

grep name {file} | head -1
grep "trkpt lon" {file} | head -1

and then use perl or sed to edit the result to the parts wanted.
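
As a rough sketch of that last step, the two grep calls can be combined with sed in a small shell function. The function names gpx_summary and gpx_summary_dir are made up for illustration, and the sketch assumes that each "name" element and each "trkpt" opening tag sits on a single line, as in the example file. It is text matching rather than XML parsing, so it will pick up the first "name" element wherever it appears (for instance inside "metadata", if a file has one there).

```shell
#!/bin/sh

# gpx_summary FILE (hypothetical helper name)
# Prints "name,lat,lon" for the first <name> element and the first <trkpt>
# found in FILE. Assumes each of them fits on one line.
# The body runs in a subshell ( ... ) so its variables stay local.
gpx_summary() (
    file=$1

    # grep -m 1 stops reading the file after the first matching line.
    name_line=$(grep -m 1 '<name>' "$file")
    trkpt_line=$(grep -m 1 '<trkpt ' "$file")

    # Keep only the text between <name> and </name>.
    name=$(printf '%s\n' "$name_line" | sed -n 's:.*<name>\([^<]*\)</name>.*:\1:p')

    # Keep only the attribute values; the order of lat and lon does not matter.
    lat=$(printf '%s\n' "$trkpt_line" | sed -n 's/.* lat="\([^"]*\)".*/\1/p')
    lon=$(printf '%s\n' "$trkpt_line" | sed -n 's/.* lon="\([^"]*\)".*/\1/p')

    printf '%s,%s,%s\n' "$name" "$lat" "$lon"
)

# gpx_summary_dir DIR (hypothetical helper name)
# Runs gpx_summary on every regular file in DIR.
gpx_summary_dir() (
    for file in "$1"/*
    do
        [ -f "$file" ] || continue
        gpx_summary "$file"
    done
)
```

With the example file from the question, gpx_summary prints "Another GPX file,53.00131,-1.91990", and gpx_summary_dir /mnt/gpxfiles would print one such line per file.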

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Reino
Solution 3 WGroleau