I am trying to concatenate a large number (~4000) of netCDF files into a single file. Each input file is a spatial raster with x and y dimensions.
I am trying to work with ncecat:
ncecat -4 -L 5 -D 2 --open_ram --cnk_csh=1000000000 \
--cnk_dmn record,2000 --cnk_dmn x,10 --cnk_dmn y,10 \
$input_files output.nc
This gives me something like this:
netcdf test {
dimensions:
    record = UNLIMITED ; // (6 currently)
    y = 11250 ;
    x = 15000 ;
variables:
    float Band1(record, y, x) ;
        Band1:long_name = "GDAL Band Number 1" ;
        Band1:_FillValue = -3.4e+38f ;
        Band1:grid_mapping = "transverse_mercator" ;
        Band1:_Storage = "chunked" ;
        Band1:_ChunkSizes = 1, 10, 10 ;
        Band1:_DeflateLevel = 5 ;
        Band1:_Filter = "|1,5" ;
        Band1:_Shuffle = "true" ;
        Band1:_Endianness = "little" ;
As you can see from _ChunkSizes, the record dimension was not actually chunked.
I think I can run this command first and then use ncks on the output file to fix the record dimension and rechunk. However, ncks needs to read everything into RAM and is another time-costly operation, so I am searching for a way to tell ncecat that it should also treat the record dimension as a chunking dimension. I haven't found a way to do this yet.
Your command looks well-formed, though I have a few comments. First, the behavior you are seeing may be a bug, since the command should produce record-dimension chunks of size 2000 as requested. Second, please read the chunking documentation here; it suggests that adding the --cnk_plc=cnk_xpl option may help. Third, I suggest you concatenate and chunk the files with ncrcat rather than ncecat. The former is less memory-intensive than the latter, as described here.
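For instance, a sketch of your original command with the explicit chunking policy added (untested, all other options kept as in the question):

ncecat -4 -L 5 -D 2 --open_ram --cnk_plc=cnk_xpl --cnk_csh=1000000000 \
--cnk_dmn record,2000 --cnk_dmn x,10 --cnk_dmn y,10 \
$input_files output.nc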
My goal is to convert a multi-page PDF file into a number of .jpg files, in such a way that the images are written directly to the hard disk/SSD instead of being stored in memory.
In Python 3.11:
from pdf2image import convert_from_path
poppler_path = r".\poppler-22.12.0\Library\bin"
images = convert_from_path('test.pdf', output_folder='.', output_file='test',
                           poppler_path=poppler_path, paths_only=True)
pdf2image generates files with the following names:
'test_0001-1.jpg',
'test_0001-2.jpg',
etc.
Problem:
I would like the files to have names without the '_0001-' suffix (e.g. 'test1.jpg').
The only way so far seems to be to use convert_from_path WITHOUT output_folder and then save each image with image.save(). But that way the images are first stored in memory, which can easily amount to many megabytes.
Is it possible to change the way pdf2image generates the file names when saving images directly to files?
I don't know whether Poppler already has parameters to customize the generated file names, but you can always do this:
Run the command in an empty directory (e.g. in tempfile.TemporaryDirectory())
After command finishes, list the contents of the directory and store the result in a list
Iterate over the list with a regex that will match the numbers, and create a dict for the mapping (integer to file name)
At this point you are free to rename the files to whatever you like, or to process them (see the sketch after this list).
The benefit of this solution is that it's neutral, robust and works for many similar scenarios.
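A minimal sketch of that approach in Python, assuming the same pdf2image setup as in the question (fmt='jpeg' is my assumption, to make Poppler emit .jpg files):

import re
import shutil
import tempfile

from pdf2image import convert_from_path

poppler_path = r".\poppler-22.12.0\Library\bin"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Pages are written straight to disk inside the empty temp directory
    paths = convert_from_path('test.pdf', output_folder=tmp_dir,
                              output_file='test', fmt='jpeg',
                              paths_only=True, poppler_path=poppler_path)
    for path in paths:
        # Match the page number that pdf2image appends after the last '-'
        page = int(re.search(r'-(\d+)\.jpg$', path).group(1))
        # Rename to the desired pattern, e.g. 'test1.jpg'
        shutil.move(path, f'test{page}.jpg')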
Hi, have a look at the pdf2image codebase, in the file generators.py; the names come from counter_generator(prefix="", suffix="", padding_goal=4). At line 41 you have:
....
#threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)
....
I think you need to play with zfill() in the yield line:
The Python string zfill() method fills the string with zeroes on its left until it reaches a certain width; this is also called padding. If the string starts with a sign character (+ or -), the zeroes are added after the sign character rather than before it.
zfill() does not change the string if its length is already greater than the requested width.
Note: zfill() works like the rjust() method with '0' assigned to rjust()'s fillchar parameter.
https://www.tutorialspoint.com/python/string_zfill.htm
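For example:

>>> "7".zfill(4)
'0007'
>>> "-42".zfill(6)     # zeroes go after the sign
'-00042'
>>> "123456".zfill(4)  # already wider than 4: unchanged
'123456'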
Just use the Poppler utilities directly (or xpdf's pdftopng), simply calling them via a shell (add other options, such as -r 200, for resolutions other than the default 150).
I recommend PNG for better image fidelity; however, if you want .jpg, replace "-png" below with "-jpeg" (the direct answer as asked would be pdftoppm -jpeg -f 1 -l 9 -sep "" test.pdf "test"), but do follow the enhancement below for file sorting. Windows file sorting needs leading zeros, otherwise the sort order in a zip or folder is 1, 10, 11, ..., 2, 20, ..., which is often undesirable.
"path to bin\pdftoppm" -png "path to \in.pdf" "name"
Result =
name-1.png
name-2.png etc.
Adding leading digits is limited compared to other apps, so if you want "name-01.png" you need to output only pages 1-9 first:
\bin>pdftoppm -png -f 1 -l 9 -sep "0" in.pdf "name-"
then for pages 10 to 99 use the default (it will only use the page numbers that are actually available):
\bin>pdftoppm -png -f 10 -l 99 in.pdf "name"
Thus, for a 12-page file this second call would produce only -10, -11 and -12, as required.
Likewise, for up to 9999 pages you need 4 calls. If you don't want the '-', simply delete it; for a different output directory, adjust the output name accordingly.
set "name=%~dpn1"
set "bin=path to Poppler\Release-22.12.0-0\poppler-22.12.0\Library\bin"
"%bin%\pdftoppm" -png -r 200 -f 1 -l 9 -sep "0" "%name%.pdf" "%name%-00"
"%bin%\pdftoppm" -png -r 200 -f 10 -l 99 -sep "0" "%name%.pdf" "%name%-0"
"%bin%\pdftoppm" -png -r 200 -f 100 -l 999 -sep "0" "%name%.pdf" "%name%-"
"%bin%\pdftoppm" -png -r 200 -f 1000 -l 9999 -sep "" "%name%.pdf" "%name%-"
In the 12-page example above, the worst case is that the last calls reply with:
Wrong page range given: the first page (100) can not be after the last page (12).
(and the same for 1000). Those warnings can be ignored.
Those 4 lines could live in a Windows (or other OS) batch/script file that accepts arguments (for SendTo or drag-and-drop); then, from the system or from Python, very simply call pdf2png.bat input.pdf for each file, and in that simple case the output will land in the same directory.
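A minimal sketch of that call from Python (assuming the four lines above are saved as pdf2png.bat):

import subprocess

# shell=True lets cmd.exe resolve and run the batch file
subprocess.run('pdf2png.bat input.pdf', shell=True, check=True)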
I downloaded daily TRMM 3B42 data for a few days from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form 3B42_Daily.yyyymmdd.7.nc4.nc4, but the files do not contain a time dimension. Hence, when I use ncecat to concatenate the files, the date information is missing from the resulting file. I want to know how to add the time information to the combined dataset.
It seems that the timestamp is a part of the global attributes. Here is the relevant part from ncdump:
$ ncdump -h ~/Downloads/3B42_Daily.19980730.7.nc4.nc4
netcdf \3B42_Daily.19980730.7.nc4 {
dimensions:
lon = 201 ;
lat = 201 ;
variables:
float lon(lon) ;
...trimmed...
// global attributes:
:BeginDate = "1998-07-30" ;
:BeginTime = "01:30:00.000Z" ;
:EndDate = "1998-07-31" ;
:EndTime = "01:29:59.999Z" ;
...trimmed...
When I tried using ncecat 3B42_Daily.199808??.7.nc4.nc4 /tmp/daily.nc, it gives:
$ ncdump -h /tmp/daily.nc
netcdf daily {
dimensions:
record = UNLIMITED ; // (5 currently)
lon = 201 ;
lat = 201 ;
variables:
float lon(lon) ;
...trimmed...
// global attributes:
:BeginDate = "1998-08-01" ;
:BeginTime = "01:30:00.000Z" ;
:EndDate = "1998-08-02" ;
:EndTime = "01:29:59.999Z" ;
...trimmed...
Only the time information in the global attributes of the first file is retained, which is not very useful.
When trying to use xarray in Python, I face the same issue: again, I could not find out how the time information contained in the global attributes can be used when concatenating the data.
I can think of two possible solutions, but I do not know how to achieve either:
Add a timestamp to each file "by hand" first, using some command, and then use ncecat.
Make ncecat somehow read the global attributes and convert them into a dimension and a variable while concatenating.
From the documentation at http://nco.sourceforge.net/nco.html I could not figure out how to achieve either of these. Or is there a third, better way to achieve this (concatenation with the relevant time information added)?
Since the data files do not follow CF conventions, you will likely have to manually create a time coordinate after using ncecat to concatenate the files. It only takes a few commands if the times are regular, e.g.,
ncecat -u time in*.nc out.nc
ncap2 -O -s 'time[time]={0.0,1.0,2.0,3.0,4.0}' out.nc out.nc
ncatted -a units,time,o,c,"days since 1998-08-01" out.nc
or use ncap2's array facilities to generate arbitrary arithmetic sequences.
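For instance, a sketch using ncap2's array() function instead of an explicit value list (untested; assumes the record dimension was renamed to time as above):

ncap2 -O -s 'time[time]=array(0.0,1.0,$time)' out.nc out.nc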
I am trying to extract a single variable (DUEXTTAU) from multiple NC files, and then combine all the individual files into a single NC file. I am using nco, but have an issue with ncks.
The NC filenames follow:
MERRA2_100.tavgM_2d_aer_Nx.YYYYMM.nc4
Each file has 1 (monthly) time step, and the time coordinate carries no real value; the date changes only in the units and begin_date attributes. For example, the file MERRA2_100.tavgM_2d_aer_Nx.198001.nc4 has:
int time(time=1);
:long_name = "time";
:units = "minutes since 1980-01-01 00:30:00";
:time_increment = 60000; // int
:begin_date = 19800101; // int
:begin_time = 3000; // int
:vmax = 9.9999999E14f; // float
:vmin = -9.9999999E14f; // float
:valid_range = -9.9999999E14f, 9.9999999E14f; // float
:_ChunkSizes = 1U; // uint
I repeat this step for each file
ncks -v DUEXTTAU MERRA2_100.tavgM_2d_aer_Nx.YYYYMM.nc4 YYYYMM.nc4
and then
ncrcat YYYYMM.nc4 final.nc4
In final.nc4, the time coordinate has the same value (that of the first YYYYMM.nc4) for every time step. For example, after combining the 3 files for 198001, 198002 and 198003, the time coordinate equals 198001 for all time steps. How should I deal with this?
Firstly, this command should work:
ncrcat -v DUEXTTAU MERRA2_100.tavgM_2d_aer_Nx.??????.nc4 final.nc4
However, recent versions of NCO fail to correctly reconstruct or re-base the time coordinate when time is an integer, as it is in your case. The fix is in the latest NCO snapshot on GitHub and will be in version 4.9.3, to be released hopefully this week. If installing from source is not an option, manual intervention is required (e.g., change time to floating point in each input file with ncap2 -s 'time=float(time)' in.nc out.nc). In any case, the time_increment, begin_date, and begin_time attributes are non-standard and will simply be copied from the first file. But time itself should be correctly reconstructed if you use a non-broken version of ncrcat.
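A sketch of that manual workaround as a shell loop (untested; the fixed_ prefix is just an illustrative name):

for f in MERRA2_100.tavgM_2d_aer_Nx.??????.nc4; do
  ncap2 -O -s 'time=float(time)' "$f" "fixed_$f"
done
ncrcat -v DUEXTTAU fixed_MERRA2_100.tavgM_2d_aer_Nx.??????.nc4 final.nc4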
You can do this using CDO as well, but you need two steps:
cdo mergetime MERRA2_100.tavgM_2d_aer_Nx.??????.nc4 merged_file.nc
cdo selvar,DUEXTTAU merged_file.nc DUEXTTAU.nc
This should actually work if the begin dates are all set correctly. The problem is that merged_file.nc could be massive, so it may be better to loop through the files, extract the variable first, and then combine:
for file in MERRA2_100.tavgM_2d_aer_Nx.??????.nc4; do
  # strip the .nc4 extension before appending the variable suffix
  cdo selvar,DUEXTTAU $file ${file%.nc4}_duexttau.nc4
done
cdo mergetime MERRA2_100.tavgM_2d_aer_Nx.??????_duexttau.nc4 DUEXTTAU.nc
rm -f MERRA2_100.tavgM_2d_aer_Nx.??????_duexttau.nc4 # clean up
Good morning,
I'm new to PowerShell and I'd like to ask for your help.
I have a big CSV file, around 3.5 GB, and my goal is to load it with fread (a data.table function) in the R environment, but this function throws an error.
> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)
The error is:
Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line
I tried different ways to solve the problem (I put fill=TRUE in my code, but it doesn't work), without success.
After some research I found this kind of solution (again run from R):
>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")
The point of calling PowerShell from R is to delete a specific row and then load the file with fread.
My question is:
How can I delete a specific row from a CSV in PowerShell without specifying the length of the matrix?
Thank you in advance for every type of help.
Francesco
I'd hazard a guess that the invalid row's location is not known in advance. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit from manipulation, it can be done before reading it into R.
A file as large as 3.5 GiB is a bit too big to read into memory as a whole. Sure, it can be done in these days of 64-bit systems, but for simple row processing it's unwieldy. A scalable solution uses .NET methods and a row-by-row approach.
To process a file row by row, use the .NET methods for efficient line reading. A StringBuilder is created to store rows that contain valid data; the others are discarded. The StringBuilder is flushed to disk every so often. Even in the days of SSDs, one write operation per row is slow compared with writing in bulks of, say, 10,000 rows at a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$fieldCount = 29  # valid rows have 29 semicolon-separated fields (per the fread error)

while($null -ne ($line = $reader.ReadLine())) {
    # Split the line on semicolons
    $elements = $line -split ';'
    # If there were exactly $fieldCount elements, add the line to the builder
    if($elements.Count -eq $fieldCount) {
        # If $line's contents need modifications, do it here
        # before adding it into the builder
        [void]$sb.AppendLine($line)
        ++$i
    }
    # Write builder contents into the file every now and then
    if($i -ge $MaxRows) {
        Add-Content "MyCleanCsvFile.csv" $sb.ToString()
        [void]$sb.Clear()
        $i = 0
    }
}
$reader.Close()
# Flush the builder after the loop if there's data left
if($sb.Length -gt 0) {
    Add-Content "MyCleanCsvFile.csv" $sb.ToString()
}
This is easily done in PowerShell: read the CSV into a generic list, remove the line, and write it back:
Add-Type -AssemblyName System.Collections
[System.Collections.Generic.List[string]]$csvList = @()
$csvFile = 'C:\test\myfile.csv'
$csvList = [System.IO.File]::ReadLines( $csvFile )
$lineToDelete = 2
[void]$csvList.RemoveAt( $lineToDelete - 1 )
[System.IO.File]::WriteAllLines( $csvFile, $csvList ) | Out-Null
vonPryz's helpful answer offers the best solution, given the size of your input file.
The following works too, but will be slow: in general, due to the overhead of using a pipeline, but also because Get-Content itself is slow, since it decorates each line read with additional properties (see the green-lighted, but not yet implemented GitHub suggestion #7537):
# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv
The beneficial flip side of using the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).
I have been digging through the Apple Photos macOS app for a couple of weekends now and I am stuck. I am hoping the smart people at Stack Overflow can figure this out.
What I don't know:
How are new hex directories determined, and how do they correspond to RKFace.modelId? Perhaps RKFace.modelId mod 16, or mod 256?
After a while, the facetile hex value no longer corresponds to RKFace.modelId. For example, RKFace.modelId 61047 should be facetile_ee77.jpeg. The correct facetile, however, is face/20/01/facetile_1209b.jpeg. Hex value 1209b is decimal 73883, for which I have no RKFace.modelId.
Things I know:
Apple Photos leverages deep learning networks to detect and crop faces out of your imported photos. It saves a cropped JPEG of each detected face into your photo library under resources/media/face/00/00/facetile_1.jpeg.
A record corresponding to this facetile is inserted into RKFace, where the RKFace.modelId integer is the decimal equivalent of the hex tail of the filename. You can use a standard dec-to-hex converter to derive the correct values. For example, modelId 61047 corresponds to facetile_ee77.jpeg (0xee77 = 61047).
Each subdirectory, for example /00/00, will only hold a maximum of 256 facetiles before a new directory is started. The directory names are also in hex format, e.g. 3e, 3f.
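A quick way to sanity-check the dec/hex mapping (plain Python, nothing Photos-specific):

model_id = 61047
print(f"facetile_{model_id:x}.jpeg")  # facetile_ee77.jpeg
print(int("1209b", 16))               # 73883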
While trying to render photo mosaics, I stumbled upon this issue, too. Then I was lucky enough to find both a master image and the corresponding facetile, allowing me to grep around, searching for the decimal and hex equivalents of the numbers embedded in the filenames.
This is what I came up with (assuming you are searching for someone named NAME):
SELECT
printf('%04x', mr.modelId) AS tileId
FROM
RKModelResource mr, RKFace f, RKPerson p
WHERE
f.modelId = mr.attachedModelId
AND f.personId = p.modelId
AND p.displayName = 'NAME'
This SELECT prints out, in hex, all the RKModelResource.modelIds used to name the corresponding facetiles you were searching for. All that is needed now is the complete path to each facetile.
So, a complete bash script to copy all facetiles of a person (into a local folder out in the current directory) could be:
#!/bin/bash
set -eEu
PHOTOS_PATH=$HOME/Pictures/Photos\ Library.photoslibrary
DB_PATH=$PHOTOS_PATH/database/photos.db
NAME=$1  # the person's display name, passed as the first argument
echo $NAME
mkdir -p out/$NAME
TILES=( $(sqlite3 "$DB_PATH" "SELECT printf('%04x', mr.modelId) AS tileId FROM RKModelResource mr, RKFace f, RKPerson p WHERE f.modelId = mr.attachedModelId AND f.personId = p.modelId AND p.displayName='$NAME'") )
for TILE in "${TILES[@]}"; do
FOLDER=${TILE:0:2}
SOURCE="$PHOTOS_PATH/resources/media/face/$FOLDER/00/facetile_$TILE.jpeg"
[[ -e "$SOURCE" ]] || continue
TARGET=out/$NAME/$TILE.jpeg
[[ -e "$TARGET" ]] && continue
cp "$SOURCE" "$TARGET" || :
done