Strip BOM from a generated string (not a file)

I'm working with strings that look like MS Office documents. Note that in this example there are two BOM "characters", one at the start of the string and one in the body. Sometimes there are several of them, sometimes none. In the PowerShell console, they print as ?

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=unicode"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
    <snip - bunch of style defs>
--></style></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1>
<p class=MsoNormal style='text-autospace:none'>
 <span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'></span>
 <span style='font-size:12.0pt;font-family:"Times New Roman","serif"'>Testing <o:p></o:p></span>
</p></div></body></html>

The strings come from an object, so I can't simply force UTF-8 encoding with Get-Content. How else might I strip them? I'm not worried about this being lossy: the output is only piped to the display, which is why I want to strip the extra characters. I'll also be stripping the HTML.



Solution 1:[1]

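If the string may contain other genuine UTF-8 characters, another way to do this is to re-decode the underlying bytes as UTF-8. It assumes the byte order mark is at the beginning, though: StreamReader only skips a BOM at the very start of the stream, so a mark further into the data is decoded as an ordinary U+FEFF character and stays in the output: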

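# Treat each character as one byte (assumes every character is below 0x100).
$bytes = @()
$strs | ForEach-Object { $bytes += [byte[]][char[]]$_ }

$memStream = New-Object System.IO.MemoryStream
$memStream.Write($bytes, 0, $bytes.Length)
$memStream.Position = 0

# Decode the bytes as UTF-8; the reader skips a BOM at the start of the stream.
$reader = New-Object System.IO.StreamReader($memStream, [System.Text.Encoding]::UTF8)
$reader.ReadToEnd()
$reader.Dispose()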
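
If the marks can also appear in the body, one possible variation is to decode the bytes directly and then remove every U+FEFF character. This is only a sketch: the sample string, the variable names, and the use of ISO-8859-1 (code page 28591) to get the original bytes back are assumptions, not part of the answer above, and any character above 0xFF would be lost in the round trip.

```powershell
# Hypothetical sample: UTF-8 BOM bytes (EF BB BF) that were mis-decoded as
# three single-byte characters, once at the start and once in the body.
$bomAsChars = -join ([char]0xEF, [char]0xBB, [char]0xBF)
$raw = $bomAsChars + '<p>Testing' + $bomAsChars + '</p>'

# ISO-8859-1 maps each character 0x00-0xFF back to the same byte value.
$latin1 = [System.Text.Encoding]::GetEncoding(28591)
$bytes = $latin1.GetBytes($raw)

# GetString does not skip a BOM, so each one is decoded as U+FEFF...
$decoded = [System.Text.Encoding]::UTF8.GetString($bytes)

# ...and every occurrence can then be removed, wherever it sits.
$clean = $decoded.Replace([string][char]0xFEFF, '')
$clean
```

Since the result only goes to the display, the lossy round trip through ISO-8859-1 should be acceptable here.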

Solution 2:[2]

You should include the code you use to get your output when you ask for help. Does this work?

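$s = "<output of your code>"  # placeholder: assign the string your code produces
# Matches a BOM either as U+FEFF or as its three UTF-8 bytes read as single characters
$s -replace '\uFEFF|\xEF\xBB\xBF'  # returns the output without the characters

Or

( code that creates output ) -replace '\uFEFF|\xEF\xBB\xBF'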
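
Because the characters print as ? in the console, it can help to check which code points the string actually holds before choosing a pattern. A minimal sketch, with a made-up sample string standing in for the real output:

```powershell
# Hypothetical sample: one real BOM character (U+FEFF) followed by text.
$s = [string][char]0xFEFF + 'Testing'

# Show the first few characters as hexadecimal code points.
$codes = $s.Substring(0, [Math]::Min(4, $s.Length)).ToCharArray() |
    ForEach-Object { 'U+{0:X4}' -f [int]$_ }
$codes -join ' '
```

A result starting with U+FEFF means the string holds a real BOM character. A result starting with U+00EF U+00BB U+00BF means the UTF-8 bytes were decoded as single-byte characters.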

Solution 3:[3]

Here's a PowerShell script that I use to remove embedded UTF-8 BOM characters from my source files:

$files = Get-ChildItem -Path . -Include @("*.h", "*.cpp") -Recurse
foreach ($f in $files)
{
    (Get-Content $f.PSPath) |
        ForEach-Object { $_ -replace "\xEF\xBB\xBF", "" } |
        Set-Content $f.PSPath
}
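
The pattern above matches when the BOM bytes have been decoded as three single-byte characters, which depends on how the file is read. A hedged alternative for a single file is to read it explicitly as UTF-8 and strip U+FEFF instead. The temp file name and the variable names are made up for the demo:

```powershell
# Hypothetical temp file containing a leading BOM and an embedded one.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-demo.cpp'
$bom = [string][char]0xFEFF
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllText($path, $bom + 'int a;' + $bom + 'int b;', $utf8NoBom)

# ReadAllText drops only the leading BOM; embedded ones arrive as U+FEFF.
$text = [System.IO.File]::ReadAllText($path, [System.Text.Encoding]::UTF8)
$clean = $text.Replace($bom, '')

# Write the result back as UTF-8 without a BOM.
[System.IO.File]::WriteAllText($path, $clean, $utf8NoBom)
```

Unlike the Get-Content pipeline, this rewrites the file as a single string, so the existing line endings are left as they were.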

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Keith Hill
Solution 2 Frode F.
Solution 3 Scott Smith