Tuesday, March 15, 2016

PowerShell v3: Testing Encoding and Out-File

One thing I run into a lot is the need to use Get-Content (and, for that matter, .NET methods) to parse files. To be sure I was using the proper encoding for a given cmdlet, I went searching earlier today for ways to check it. Three posts I ran across that help are:

  • http://franckrichard.blogspot.com/2010/08/powershell-get-encoding-file-type.html
  • Chad Miller's script (referenced above) - http://poshcode.org/2059
  • and Lee Holmes's variant - http://poshcode.org/2153
To give myself something to work with, I decided to explore the standard encodings that most cmdlets accept for their -Encoding parameter. These are the ones to be familiar with:
  • ASCII
  • Big Endian Unicode
  • Default
  • OEM
  • Unicode
  • UTF-32
  • UTF-7
  • UTF-8
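For a quick in-memory look at what some of these encodings do to a short string, the .NET encoding objects can be queried directly. This is only a minimal sketch: Get-EncodedBytes is a hypothetical helper name, and it assumes the static [System.Text.Encoding] properties are a fair stand-in for the matching cmdlet names. Default, OEM and UTF-7 are left out because the first two depend on the machine's code pages and UTF-7 has no byte order mark.

```powershell
# Hypothetical helper: returns the preamble (BOM) plus the encoded text
# as a space-separated list of decimal byte values.
function Get-EncodedBytes {
    param(
        [string] $Text,
        [System.Text.Encoding] $Encoding
    )

    # GetPreamble() is the BOM the encoding advertises (empty for ASCII);
    # GetBytes() is the encoded payload. @() guards against empty arrays.
    $bytes = @($Encoding.GetPreamble()) + @($Encoding.GetBytes($Text))
    $bytes -join ' '
}

# "Test" followed by CR LF, as Out-File would append.
$sample = "Test`r`n"

Get-EncodedBytes $sample ([System.Text.Encoding]::ASCII)
# 84 101 115 116 13 10
Get-EncodedBytes $sample ([System.Text.Encoding]::UTF8)
# 239 187 191 84 101 115 116 13 10
Get-EncodedBytes $sample ([System.Text.Encoding]::Unicode)
# 255 254 84 0 101 0 115 0 116 0 13 0 10 0
```

Nothing is written to disk here, so this shows what the encoders themselves produce rather than what a particular cmdlet does with them.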
In testing the output for these, I used this approach:
'unicode','utf7','utf8','utf32','ascii','bigendianunicode','default','oem' |
sort |
% {
      # Write the word Test with the current encoding, then read the raw bytes back.
      Out-File -FilePath "C:\data\Documents\Powershell\Projects\Encoding\test$_.txt" -InputObject Test -Encoding $_;
      $bytearray = Get-Content -Path "C:\data\Documents\Powershell\Projects\Encoding\test$($_).txt" -Encoding byte
      "$($_): $($bytearray -join ' ')"
}
which yielded this output: 
ascii: 84 101 115 116 13 10
bigendianunicode: 254 255 0 84 0 101 0 115 0 116 0 13 0 10
default: 84 101 115 116 13 10
oem: 84 101 115 116 13 10
unicode: 255 254 84 0 101 0 115 0 116 0 13 0 10 0
utf32: 255 254 0 0 84 0 0 0 101 0 0 0 115 0 0 0 116 0 0 0 13 0 0 0 10 0 0 0
utf7: 84 101 115 116 13 10
utf8: 239 187 191 84 101 115 116 13 10
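As you can see, the encodings share the same payload, but when working with encodings it is important to know which bytes are "expected" markers and which are purely data. In every case the common bytes are 84 101 115 116 13 10 (the text Test followed by CR LF), padded with zero bytes in the 16-bit and 32-bit encodings; anything in front of them is the byte order mark (BOM) that identifies the encoding. Alternatively, here is the same thing in hex.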
'unicode','utf7','utf8','utf32','ascii','bigendianunicode','default','oem' |
sort |
% {
      # Same test as before, but each byte is shown as two hex digits.
      Out-File -FilePath "C:\data\Documents\Powershell\Projects\Encoding\test$_.txt" -InputObject Test -Encoding $_;
      $bytearray = Get-Content -Path "C:\data\Documents\Powershell\Projects\Encoding\test$($_).txt" -Encoding byte
      "$($_): {0}" -f (($bytearray | % { [Convert]::ToString($_,16).PadLeft(2,"0")}) -join ' ')
}

ascii: 54 65 73 74 0d 0a
bigendianunicode: fe ff 00 54 00 65 00 73 00 74 00 0d 00 0a
default: 54 65 73 74 0d 0a
oem: 54 65 73 74 0d 0a
unicode: ff fe 54 00 65 00 73 00 74 00 0d 00 0a 00
utf32: ff fe 00 00 54 00 00 00 65 00 00 00 73 00 00 00 74 00 00 00 0d 00 00 00 0a 00 00 00
utf7: 54 65 73 74 0d 0a
utf8: ef bb bf 54 65 73 74 0d 0a
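The leading bytes in these dumps can be matched against a small table of known byte order marks. The following is a minimal sketch rather than a full detector: Get-BomName is a hypothetical helper, the names it returns simply mirror the cmdlet encoding names used above, and it only knows the four BOMs that appear in this test.

```powershell
# Hypothetical helper: name the BOM (if any) at the start of a byte array.
function Get-BomName {
    param([byte[]] $Bytes)

    # Longer signatures are checked first: the UTF-32 LE BOM (FF FE 00 00)
    # begins with the UTF-16 LE BOM (FF FE), so the order matters.
    $signatures = @(
        @{ Name = 'utf32';            Bom = 0xFF, 0xFE, 0x00, 0x00 },
        @{ Name = 'utf8';             Bom = 0xEF, 0xBB, 0xBF },
        @{ Name = 'unicode';          Bom = 0xFF, 0xFE },
        @{ Name = 'bigendianunicode'; Bom = 0xFE, 0xFF }
    )

    foreach ($signature in $signatures) {
        $bom = $signature.Bom
        if ($Bytes.Length -lt $bom.Length) { continue }

        # Compare the first bytes of the input with the signature.
        $head = $Bytes[0..($bom.Length - 1)] -join '-'
        if ($head -eq ($bom -join '-')) { return $signature.Name }
    }

    # No BOM found: ascii, default, oem and utf7 all look like this.
    'none'
}

Get-BomName ([byte[]](0xFF, 0xFE, 0x54, 0x00))   # unicode
Get-BomName ([byte[]](0x54, 0x65, 0x73, 0x74))   # none
```

A result of none only says that no BOM was present; as the dumps show, it cannot tell ascii, default, oem and utf7 apart. A UTF-16 file whose first character is U+0000 would also be misread as utf32 by this sketch.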
It is clear you need to be careful when you are dealing with unknown file formats. I will more than likely use Lee's function, as it covers some non-standard encodings:
function Get-FileEncoding
{
      ##############################################################################
      ##
      ## Get-FileEncoding
      ##
      ## From Windows PowerShell Cookbook (O'Reilly)
      ## by Lee Holmes (http://www.leeholmes.com/guide)
      ##
      ##############################################################################

      <#

      .SYNOPSIS

      Gets the encoding of a file

      .EXAMPLE

      Get-FileEncoding.ps1 .\UnicodeScript.ps1

      BodyName          : unicodeFFFE
      EncodingName      : Unicode (Big-Endian)
      HeaderName        : unicodeFFFE
      WebName           : unicodeFFFE
      WindowsCodePage   : 1200
      IsBrowserDisplay  : False
      IsBrowserSave     : False
      IsMailNewsDisplay : False
      IsMailNewsSave    : False
      IsSingleByte      : False
      EncoderFallback   : System.Text.EncoderReplacementFallback
      DecoderFallback   : System.Text.DecoderReplacementFallback
      IsReadOnly        : True
      CodePage          : 1201

      #>

      param(
          ## The path of the file to get the encoding of.
          $Path
      )

      Set-StrictMode -Version Latest

      ## The hashtable used to store our mapping of encoding bytes to their
      ## name. For example, "255-254 = Unicode"
      $encodings = @{}

      ## Find all of the encodings understood by the .NET Framework. For each,
      ## determine the bytes at the start of the file (the preamble) that the .NET
      ## Framework uses to identify that encoding.
      $encodingMembers = [System.Text.Encoding] |
          Get-Member -Static -MemberType Property

      $encodingMembers | Foreach-Object {
          $encodingBytes = [System.Text.Encoding]::($_.Name).GetPreamble() -join '-'
          $encodings[$encodingBytes] = $_.Name
      }

      ## Find out the lengths of all of the preambles.
      $encodingLengths = $encodings.Keys | Where-Object { $_ } |
