I would like to use regex to retrieve text between two words. This text has XML tags but isn't XML.

For example, I have a bunch of unparsed text from a command that I am looping through, and I would like to get the text between two tags. I've tried (.*?) \([</Location>])$ and nothing happened. Not a single thing. So in this body of text, for example, I need the paths inside the <Location> element:

<?xml version="1.0" encoding="utf-16"?><AppMgmtDigest xmlns="http://schemas.microsoft.com/SystemCenterConfigurationManager/2009/AppMgmtDigest" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Application AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="Application_0487d42d-94f8-4424-bd10-693005c74d9c" Version="11"><DisplayInfo DefaultLanguage="en-US"><Info Language="en-US"><Title>Update BeyondTrust</Title><ReleaseDate>2022-01-14</ReleaseDate></Info></DisplayInfo><DeploymentTypes><DeploymentType AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="DeploymentType_3f86c80a-f4d6-4c63-b066-7c030730456a" Version="11"/></DeploymentTypes><Title ResourceId="Res_163096156">Update BeyondTrust</Title><ReleaseDate ResourceId="Res_2088816488">2022-01-14</ReleaseDate><Owners><User Qualifier="LogonName" Id="Admin.MH"/></Owners><Contacts><User Qualifier="LogonName" Id="Admin.MH"/></Contacts></Application><DeploymentType AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="DeploymentType_3f86c80a-f4d6-4c63-b066-7c030730456a" Version="11"><Title ResourceId="Res_1162077075">Update BeyondTrust</Title><DeploymentTechnology>GLOBAL/ScriptDeploymentTechnology</DeploymentTechnology><Technology>Script</Technology><Hosting>Native</Hosting><Installer Technology="Script"><ExecutionContext>System</ExecutionContext><Contents><Content ContentId="Content_27d453bb-3439-4440-a90b-ddd731e5a4a7" Version="1"><File Name="PrivilegeManagementConsoleAdapter_x64.msi" Size="7425536"/><File Name="PrivilegeManagementForWindows_x64.msi" Size="21287936"/><File Name="remediate.ps1" Size="3020"/><Location>\\pennoni.com\util\Software\BeyondTrust\PMCloud\application_sccm\</Location><PeerCache>true</PeerCache><OnFastNetwork>Download</OnFastNetwork><OnSlowNetwork>DoNothing</OnSlowNetwork></Content></Contents><DetectAction><Provider>Local</Provider><Args><Arg Name="ExecutionContext" Type="String">System</Arg><Arg Name="MethodBody" 
Type="String">&lt;?xml version="1.0" encoding="utf-16"?&gt;

Basically, in a body of text, I want to retrieve the text between

<Location> pathThatINeed </Location>


Solution 1:[1]

Lee Dailey's helpful answer offers a pragmatic solution that is easy to conceptualize.

To offer a single-operation alternative using the regex-based -replace operator:

# $text is assumed to contain the (incomplete) input XML text.
$text -replace '^.*<location>(.*?)</location>.*$', '$1'

Note: If the regex doesn't match the input, the input is returned as-is.

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
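
Because a failed match returns the input unchanged, it may be worth testing for a match before replacing. The following is only a minimal sketch: the sample text and the UNC path in it are shortened, hypothetical stand-ins for the real input, and it assumes $text is a single string that can contain line breaks. In that case the anchored regex only matches if the inline option (?s) is added, which makes . match newlines as well. Note that -match and -replace are case-insensitive by default, which is why <location> also matches <Location>.

```powershell
# Hypothetical, shortened sample input; note the embedded line break.
$text = @'
<Content><File Name="remediate.ps1" Size="3020"/><Location>\\server\share\app\</Location><PeerCache>true</PeerCache></Content><Arg Name="MethodBody"
Type="String">x</Arg>
'@

# (?s) lets . match newlines, so the anchored .* parts can span line breaks.
$pattern = '(?s)^.*<location>(.*?)</location>.*$'

# Only replace if the regex matches; otherwise -replace would hand back
# the whole input unchanged.
$location = if ($text -match $pattern) {
    $text -replace $pattern, '$1'
} else {
    $null   # no <Location> element found
}

$location
```

With the sample above, $location should contain only the path between the tags; for input without a <Location> element it is $null rather than the unmodified input.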


As an aside: As MikeSh notes, in the case at hand the regex can be simplified:

$text -replace '.*<location>(.*)</location>.*', '$1'
  • The start and end anchors, ^ and $, aren't strictly necessary, because the .* on either end implicitly ensures that a match will capture the entire input string, which is necessary for the logic of the replace operation - however, I've added them for conceptual clarity.

  • If the assumption is that only one location element is present in the input, (.*), as a greedy subexpression (one that matches as much as possible) works fine, because when the regex engine backtracks to the last instance of </location>, it'll by definition be the only one.

    • Generally, however, if the intent is to match non-greedily, i.e. only to the next, not the last instance of the subexpression that follows, (.*?) is required - in this case, it is a more readable alternative to [^<]* (match everything up to the next <)

The following example shows when (.*?) is required:

# !! WRONG: Greedy subexpression matches from the start of the 
# !!        *first* opening tag through the end of the *last* one:
# !!        -> 'a</el> <el>b'
'<el>a</el> <el>b</el>' -replace '<el>(.*)</el>', '$1'

# OK: Non-greedy subexpression matches only up to the *next* 
#     closing tag. However, the regex now matches *twice*.
#     -> 'a b'
'<el>a</el> <el>b</el>' -replace '<el>(.*?)</el>', '$1'

# OK: Start the regex with a greedy match-anything subexpression
#     in order to limit matching to the *last* element.
#     Note: For the reasons explained above, (.*) will 
#           *technically* do here, but using (.*?) for *conceptual*
#           reasons - to signal the intent - is advisable.
#     -> 'b'
'<el>a</el> <el>b</el>' -replace '.*<el>(.*?)</el>', '$1'

# OK: Start the regex with a non-greedy match-anything subexpression
#     and end it with a greedy one in order to limit matching 
#     to the *first* element.
#     -> 'a'
'<el>a</el> <el>b</el>' -replace '.*?<el>(.*?)</el>.*', '$1'
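
If the input can contain several elements and all of their values are wanted, -replace is no longer a good fit. One possible alternative, sketched here on the assumption that calling the .NET method [regex]::Matches() directly is acceptable, collects every capture separately. Unlike -replace, this method is case-sensitive unless the IgnoreCase option is passed.

```powershell
$text = '<el>a</el> <el>b</el>'

# [regex]::Matches() returns every non-overlapping match; Groups[1] holds
# what the non-greedy (.*?) capture group matched in each of them.
$values = [regex]::Matches($text, '<el>(.*?)</el>', 'IgnoreCase') |
    ForEach-Object { $_.Groups[1].Value }

$values   # -> 'a' and 'b', as two separate strings
```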

Solution 2:[2]

That should do the trick:

(?<=<Location>).*?(?=</Location>)

Output (for the sample input <Location> pathThatINeed </Location>):

 pathThatINeed 

Explanation:

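  • (?<=...): Positive lookbehind; asserts that the text in parentheses (here, the opening tag) immediately precedes the match, without making it part of the match
  • .*?: Matches any character (except a newline) between zero and unlimited times, as few times as possible (lazy)
  • (?=...): Positive lookahead; asserts that the text in parentheses (here, the closing tag) immediately follows the match, without making it part of the match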
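
A lookaround pattern of this kind can be applied in PowerShell with the -match operator. This is a minimal sketch that assumes the closing tag is written </Location> and reuses the sample line from the question. Because the lookarounds do not consume the tags, the overall match in $Matches[0] is already the text between them.

```powershell
$text = '<Location> pathThatINeed </Location>'

# The lookbehind and lookahead only assert that the tags are present;
# the match itself is just the text between them.
if ($text -match '(?<=<Location>).*?(?=</Location>)') {
    $path = $Matches[0].Trim()   # Trim() drops the surrounding spaces
}

$path   # -> 'pathThatINeed'
```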

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Cubix48