Saturday, January 30, 2016

PowerShell v2 Get-Content -ReadCount: Reading Really Large Files

At work we deal with really large files, as in multi-gibibyte files. In the past I tried to swallow these whole, only to run into OutOfMemory exceptions, even on x64 machines. After I asked for help on Twitter, mjolinor pointed me to an answer he gave to someone in a similar predicament:
How can I make this PowerShell script parse large files faster?
Near the end of Rob's solution was this:
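# Read $batch lines at a time; each $_ is an array of lines, not a single line.
Get-Content $path$infile -ReadCount $batch |
foreach {
  # Keep matching lines, drop header lines, then rewrite what remains to capture group 1.
  # '$1' is single-quoted so it reaches the regex engine instead of being expanded as a variable.
  $_ -match $match_regex -notmatch $header_regex -Replace $replace_regex, '$1' | Out-File $path$outfile -Append
}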
The magic bullet here is the -ReadCount parameter. $batch is simply an int set to 1000. In plain English, this parameter instructs Get-Content to read the file at $path$infile in chunks of 1,000 lines. So if I have a 5 million line file, no problem: I only work on 1,000 lines at a time. In a way this is like streaming, except that each object coming down the pipeline is an array of up to 1,000 lines rather than a single line. With this approach you can cut a large file into small, bite-sized pieces.
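To see the batching for yourself, here is a minimal sketch. The file name readcount-demo.txt and the variables $demoFile and $batchSizes are made up for illustration; the script writes 2,500 lines to a temporary file and records how many lines arrive in each pipeline object.

```powershell
# Hypothetical demo file: 2,500 numbered lines in the temp directory.
$demoFile = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
1..2500 | ForEach-Object { "line $_" } | Set-Content -Path $demoFile

# With -ReadCount 1000, each pipeline object is an array of up to 1,000 lines,
# so .Count reports the size of the batch rather than the length of one string.
$batchSizes = @(Get-Content -Path $demoFile -ReadCount 1000 | ForEach-Object { $_.Count })

$batchSizes -join ', '   # → 1000, 1000, 500

# Clean up the demo file.
Remove-Item -Path $demoFile
```

The last batch is simply whatever is left over, so the processing code should not assume every batch is full.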

To get more background on the parameter I went to
get-help get-content -parameter readcount
and got this:
-ReadCount <Int64>
Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.

This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in very large items.

Required? false
Position? named
Default value
Accept pipeline input? true (ByPropertyName)
Accept wildcard characters? false
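When Get-Content sends a batch down the pipeline, foreach {} (ForEach-Object) treats its script block as a process block and runs it once per batch, so $_ is an array of up to $batch lines. No inner loop is required, because -match and -notmatch act as filters when their left operand is an array and -replace is applied to every element. Note that, as the help text above says, larger values for -ReadCount slow the time to the first result, because Get-Content has to gulp up a whole chunk before anything is passed along.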
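The way the operators behave on a whole batch can be sketched without touching a file. The array $chunk and the patterns below are invented for illustration and stand in for one batch emitted by Get-Content -ReadCount; they are not the regexes from Rob's answer.

```powershell
# Hypothetical batch standing in for one chunk of lines from Get-Content -ReadCount.
$chunk = @('id,name', 'row 1: alpha', 'row 2: beta', 'noise')

# On an array, -match keeps the elements that match and -notmatch drops the ones that do;
# -replace then rewrites every surviving element. '$1' refers to the first capture group.
$result = @($chunk -match '^row' -notmatch 'beta' -replace '^row \d+: (.*)$', '$1')

$result   # → alpha
```

The three operators have equal precedence and are evaluated left to right, which is why they can be chained in a single expression as in the original one-liner.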
