I would like to use regex to retrieve the text between two words. The text has XML tags but isn't XML.
For example, I have a bunch of unparsed text from a command that I am looping through, and I would like to get the text between the tags. I've tried (.*?) \([</Location>])$ and nothing happened. Not a single thing. So in this body of text, for example, I need the paths inside the <Location>
<?xml version="1.0" encoding="utf-16"?><AppMgmtDigest xmlns="http://schemas.microsoft.com/SystemCenterConfigurationManager/2009/AppMgmtDigest" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Application AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="Application_0487d42d-94f8-4424-bd10-693005c74d9c" Version="11"><DisplayInfo DefaultLanguage="en-US"><Info Language="en-US"><Title>Update BeyondTrust</Title><ReleaseDate>2022-01-14</ReleaseDate></Info></DisplayInfo><DeploymentTypes><DeploymentType AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="DeploymentType_3f86c80a-f4d6-4c63-b066-7c030730456a" Version="11"/></DeploymentTypes><Title ResourceId="Res_163096156">Update BeyondTrust</Title><ReleaseDate ResourceId="Res_2088816488">2022-01-14</ReleaseDate><Owners><User Qualifier="LogonName" Id="Admin.MH"/></Owners><Contacts><User Qualifier="LogonName" Id="Admin.MH"/></Contacts></Application><DeploymentType AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="DeploymentType_3f86c80a-f4d6-4c63-b066-7c030730456a" Version="11"><Title ResourceId="Res_1162077075">Update BeyondTrust</Title><DeploymentTechnology>GLOBAL/ScriptDeploymentTechnology</DeploymentTechnology><Technology>Script</Technology><Hosting>Native</Hosting><Installer Technology="Script"><ExecutionContext>System</ExecutionContext><Contents><Content ContentId="Content_27d453bb-3439-4440-a90b-ddd731e5a4a7" Version="1"><File Name="PrivilegeManagementConsoleAdapter_x64.msi" Size="7425536"/><File Name="PrivilegeManagementForWindows_x64.msi" Size="21287936"/><File Name="remediate.ps1" Size="3020"/><Location>\\pennoni.com\util\Software\BeyondTrust\PMCloud\application_sccm\</Location><PeerCache>true</PeerCache><OnFastNetwork>Download</OnFastNetwork><OnSlowNetwork>DoNothing</OnSlowNetwork></Content></Contents><DetectAction><Provider>Local</Provider><Args><Arg Name="ExecutionContext" Type="String">System</Arg><Arg Name="MethodBody" 
Type="String"><?xml version="1.0" encoding="utf-16"?>
Basically, in a body of text, I want to retrieve the text between
<Location> pathThatINeed </Location>
Solution 1:[1]
Lee Dailey's helpful answer offers a pragmatic solution that is easy to conceptualize.
To offer a single-operation alternative using the regex-based -replace operator:
# $text is assumed to contain the (incomplete) input XML text.
$text -replace '^.*<location>(.*?)</location>.*$', '$1'
Note: If the regex doesn't match the input, the input is returned as-is.
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
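As a quick check of the behavior described above, here is a minimal sketch on made-up sample strings (the sample text and variable names are assumptions, not the asker's real data). It also shows the (?s) inline option, which may be needed if the input spans several lines, because . doesn't match newlines by default.

```powershell
# Hypothetical sample input; the real text comes from the asker's command output.
$text = '<Content><File Name="remediate.ps1"/><Location>\\server\share\app\</Location><PeerCache>true</PeerCache></Content>'

# Single-line input: the whole string is replaced by capture group 1.
$path = $text -replace '^.*<location>(.*?)</location>.*$', '$1'
$path        # -> \\server\share\app\

# No match: -replace returns the input unchanged.
$noMatch = 'no tags here' -replace '^.*<location>(.*?)</location>.*$', '$1'
$noMatch     # -> no tags here

# Multi-line input: (?s) lets . match newlines too, so .* can span lines.
$multi = "first line`n<Location>\\server\share\app\</Location>`nlast line"
$multiPath = $multi -replace '(?s)^.*<location>(.*?)</location>.*$', '$1'
$multiPath   # -> \\server\share\app\
```

Because a failed match returns the input as-is, comparing the result with the original string is one simple way to detect that nothing was extracted.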
As an aside: As MikeSh notes, in the case at hand the regex can be simplified:
$text -replace '.*<location>(.*)</location>.*', '$1'
- The start and end anchors, ^ and $, aren't strictly necessary, because the .* on either end implicitly ensures that a match will capture the entire input string, which is necessary for the logic of the replace operation - however, I've added them for conceptual clarity.
- If the assumption is that only one location element is present in the input, (.*), as a greedy subexpression (one that matches as much as possible) works fine, because when the regex engine backtracks to the last instance of </location>, it'll by definition be the only one.
  - Generally, however, if the intent is to match non-greedily, i.e. only to the next, not the last instance of the subexpression that follows, (.*?) is required - in this case, it is a more readable alternative to [^<]* (match everything up to the next <)
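To illustrate the last point, the following sketch compares the lazy (.*?) with the negated character class [^<]* on a made-up string containing two elements (the sample and the variable names are assumptions for illustration only).

```powershell
# Made-up sample with two elements (an assumption for illustration).
$sample = '<Location>first</Location><Location>second</Location>'

# Lazy quantifier: the capture stops at the next closing tag.
$lazy = $sample -replace '.*?<location>(.*?)</location>.*', '$1'

# Negated character class: the capture stops at the next '<'.
$negated = $sample -replace '.*?<location>([^<]*)</location>.*', '$1'

$lazy      # -> first
$negated   # -> first
```

The two forms agree here because the element content contains no < character. If the content could itself contain markup, [^<]* would stop at the first < and the overall match would fail, whereas (.*?) would still run up to the next closing tag.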
The following example shows when (.*?) is required:
# !! WRONG: Greedy subexpression matches from the start of the
# !! *first* opening tag through the end of the *last* one:
# !! -> 'a</el> <el>b'
'<el>a</el> <el>b</el>' -replace '<el>(.*)</el>', '$1'
# OK: Non-greedy subexpression matches only up to the *next*
# closing tag. However, the regex now matches *twice*.
# -> 'a b'
'<el>a</el> <el>b</el>' -replace '<el>(.*?)</el>', '$1'
# OK: Start the regex with a greedy match-anything subexpression
# in order to limit matching to the *last* element.
# Note: For the reasons explained above, (.*) will
# *technically* do here, but using (.*?) for *conceptual*
# reasons - to signal the intent - is advisable.
# -> 'b'
'<el>a</el> <el>b</el>' -replace '.*<el>(.*?)</el>', '$1'
# OK: Start the regex with a non-greedy match-anything subexpression
# and end it with a greedy one in order to limit matching
# to the *first* element.
# -> 'a'
'<el>a</el> <el>b</el>' -replace '.*?<el>(.*?)</el>.*', '$1'
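The -replace operator always returns a single transformed string. If each element's content is wanted as a separate value instead, one option (not part of the original answer, shown here only as a sketch) is the .NET [regex]::Matches() method:

```powershell
$s = '<el>a</el> <el>b</el>'

# [regex]::Matches() returns every match; Groups[1] holds the capture group.
$values = [regex]::Matches($s, '<el>(.*?)</el>') |
    ForEach-Object { $_.Groups[1].Value }

$values.Count   # -> 2
$values[0]      # -> a
$values[1]      # -> b
```

Note that, unlike -replace, [regex]::Matches() is case-sensitive by default, so the pattern must match the casing of the tags, or the (?i) inline option can be added.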
Solution 2:[2]
That should do the trick:
(?<=<Location>).*?(?=</Location>)
Output:
THisismyDesiredText
Explanation:
- (?<=): Positive lookbehind
- .*?: Matches any character (except a newline, by default) between zero and unlimited times, as few times as possible (lazy)
- (?=): Positive lookahead
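A lookaround pattern of this kind could be applied in PowerShell roughly as follows. This is a minimal sketch on a made-up sample string (the sample text and the $location variable are assumptions); the lookahead here checks for the closing tag </Location>.

```powershell
# Hypothetical sample input (an assumption; substitute the real text).
$text = '<Content><Location>\\server\share\app\</Location><PeerCache>true</PeerCache></Content>'

# -match fills the automatic $Matches variable; index 0 is the whole match,
# which here is only the text between the tags, because the lookarounds
# assert the presence of the tags without consuming them.
if ($text -match '(?<=<Location>).*?(?=</Location>)') {
    $location = $Matches[0]
}

$location   # -> \\server\share\app\
```

Because the tags are not part of the match, no capture group or replacement string is needed; the trade-off is that -match reports only the first match in the input.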
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Cubix48 |
