Chapter 9, page 163, of the AMPL book gives an example of reading a single parameter from a file:
For example, if you want to read the number of weeks and the
hours available each week for our simple production model (Figure 4-4),
param T > 0;
param avail {1..T} >= 0;
from a file week_data.txt containing
4
40 40 32 40
then you can give the command
read T, avail[1], avail[2], avail[3], avail[4] <week_data.txt;
This command fails in GLPK with error colon missing where expected. The Modeling Language GNU MathProg Language reference only contains table data IN, which serves for reading tabular data. Can GLPK read a single parameter from a file?
You can use the table statement to read parameters from a CSV files or SQL tables.
You can use data files for passing parameters, e.g. see this example of writing data files in AWK and Visual Basic.
AMPL and GMPL are related functional languages. GMPL contains a subset of the AMPL syntax but differs in several areas like the table statement.
One way to read a single parameter is to write the data into a file with a certain syntax, e.g. the contents below show a single parameter and a table:
param T := 4;
param avail :=
1 0
2 1
3 1
4 0;
end;
To verify the syntax, consider this code in the file problem.mod:
param T > 0;
param avail {1..T} >= 0;
var use {1..T} >= 0;
maximize usage: sum {t in 1..T} avail[t];
subject to constraint {t in 1..T}: use[t] <= avail[t];
solve;
end;
The result shows that it worked:
> glpsol -m problem.mod -d problem.dat
GLPSOL: GLPK LP/MIP Solver, v4.65
Parameter(s) specified in the command line:
-m problem.mod -d problem.dat
Reading model section from problem.mod...
13 lines were read
Reading data section from problem.dat...
9 lines were read
Generating usage...
Generating constraint...
Model has been successfully generated
glp_mpl_build_prob: row usage; constant term 2 ignored
GLPK Simplex Optimizer, v4.65
5 rows, 4 columns, 4 non-zeros
Preprocessing...
~ 0: obj = 2.000000000e+00 infeas = 0.000e+00
OPTIMAL SOLUTION FOUND BY LP PREPROCESSOR
Time used: 0.0 secs
Memory used: 0.1 Mb (110236 bytes)
Model has been successfully processed
Related
I am trying to concatenate a large (~4000) number of ncdf files into a single file. Each input file is a spatial raster, with a x and y dimension.
I am trying to work with ncecat:
ncecat -4 -L 5 -D 2 --open_ram --cnk_csh=1000000000 \
--cnk_dmn record,2000 --cnk_dmn x,10 --cnk_dmn y,10 \
$input_files output.nc
This gives me something like this:
netcdf test { dimensions:
record = UNLIMITED ; // (6 currently)
y = 11250 ;
x = 15000 ; variables:
float Band1(record, y, x) ;
Band1:long_name = "GDAL Band Number 1" ;
Band1:_FillValue = -3.4e+38f ;
Band1:grid_mapping = "transverse_mercator" ;
Band1:_Storage = "chunked" ;
Band1:_ChunkSizes = 1, 10, 10 ;
Band1:_DeflateLevel = 5 Band1:_Filter = "|1,5 ;
Band1:_Shuffle = "true" ;
Band1:_Endianness = "little" ;
, and the record dimension was not actually chunked.
I think I can run this command first, and then use ncks on the output file to fix the record dim and rechunk again, however, as ncks needs to read everything into ram, and is also another time-costly operation, I am searching a way to tell ncecat that it should also consider the record-dim as a chunking dimension. I haven't found a way to do this yet.
Your command looks well-formed, though there are a few comments I would make. First, the behavior you are seeing may be a bug, since the command should produce record dimension chunks of size 2000 as request. Second, please read the chunking documentation here. This leads to the possibility that adding the --cnk_plc=cnk_xpl option may help. Third, I suggest you concatenate and chunk the files with ncrcat not ncecat. The former is less memory-intensive than the latter, as described here.
I have some trouble using CombiTimeTable.
I want to fill the table using a txt file that contains two columns, the first is the time and the second is the related value (a current sample). Furthermore, I add #1 in the first line as the manual says.
Moreover, I add the following parameters:
tableOnFile=true,
fileName="C:/Users/gg/Desktop/CurrentDrivingCycle.txt"
I also have to add the parameter tableName but I don't know how to define it. I tried to define it using the name of the file (i.e. CurrentDrivingCycle) but I got this error message at the end of the simulation:
Table matrix "CurrentDrivingCycle" not found on file "C:/Users/ggalli/Desktop/CurrentDrivingCycle.txt".
simulation terminated by an assertion at initialization
Simulation process failed. Exited with code -1.
Do you know how can I solve this issue?
Thank you in advance!
See the documentation:
https://build.openmodelica.org/Documentation/Modelica.Blocks.Sources.CombiTimeTable.html
The name tab1(6,2) in the example of the documentation is the tableName. So yours should look something like:
#1
double CurrentDrivingCycle(6,2) # comment line
0 0
1 0
1 1
2 4
3 9
4 16
Good morning,
I'm new about powershell and I'd like to ask you if somebody can help me.
I have a big csv file around 3.5gb and my goal is to load it with fread (a data.table function) in R environment, but this function makes a error.
> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)
The error is:
Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line
I tried to use different way (I putted in my code fill=true, but doesn't work) to solve the problem, but I couldn't do it.
After different researches I found this kind of solution (always to do in R):
>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")
The focus about the use of powershell in R is to delete a specific row and to load with fread the file.
My question is:
How I can delete a specific row in a csv in powershell but without specifying the length of the matrix?
Thank you in advance for every type of help.
Francesco
I'd hazard a guess that the invalid row's location is not known. In such a case, it might be sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit of manipulation, it can be done before reading it into R.
A file as large as 3,5 GiB is a bit on the large side to read in memory as such. Sure, it can be done in the days of 64 bit systems, but for simple row processing it's unwieldy. A scalable solution uses .Net methods and row-by-row approach.
To process a file on row-by-row basis, use .Net methods for efficient row reading. A StringBuilder is created to store rows that contain valid data, others are discarded. The StringBuilder is flushed on disk every so often. Even on days of SSDs, a write operation for each row is relatively slow in respect to writing in a bulk of, say, 10 000 rows a time.
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
$colonCount = 30
while($null -ne ($line = $reader.ReadLine())) {
# Split the line on semicolons
$elements = $line -split ';'
# If there were $colonCount elements, add those to builder
if($elements.count -eq $colonCount) {
# If $line's contents need modifications, do it here
# before adding it into the builder
[void]$sb.AppendLine($line)
++$i
}
# Write builder contents into file every now and then
if($i -ge $MaxRows) {
add-content "MyCleanCsvFile.csv" $sb.ToString()
[void]$sb.Clear()
$i = 0
}
}
# Flush the builder after the loop if there's data
if($sb.Length -gt 0) {
add-content "MyCleanCsvFile.csv" $sb.ToString()
}
This is easy done in powershell: Read csv in generic list, remove line and write back:
Add-Type -AssemblyName System.Collections
[System.Collections.Generic.List[string]]$csvList = #()
$csvFile = 'C:\test\myfile.csv'
$csvList = [System.IO.File]::ReadLines( $csvFile )
$lineToDelete = 2
[void]$csvList.RemoveAt( $lineToDelete - 1 )
[System.IO.File]::WriteAllLines( $csvFile, $csvList ) | Out-Null
vonPryz's helpful answer offers the best solution, given the size of your input file.
The following works too, but will be slow - in general, due to the overhead of using a pipeline, but also because Get-Content itself is slow due to decorating each line read with additional properties (see green-lighted, but not yet implemented GitHub suggestion #7537):
# Exclude line number 458945 (0-based index 458944)
Get-Content C:/a/b/c/file.csv | Select-Object -SkipIndex 458944 > output.csv
The beneficial flip side of use of the pipeline is that it acts as a memory throttle, so the above command can be used to process arbitrarily large files (though it may take a long time).
I am experiencing read speeds that I believe are much slower than should be expected when trying to read a fairly large file in R with fread.
The file is ~60m rows x 147 columns, out of which I am only selecting 27 columns, directly in the fread call using select; only 23 of the 27 are found in the actual file. (Probably I inputted some of the strings incorrectly but I guess that matters less.)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
verbose = TRUE,
select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16 omp_get_thread_limit() = 2147483647 DTthreads = 0 RestoreAfterFork = true Input contains no \n. Taking this to be a filename to open [01] Check arguments Using 16 threads (omp_get_max_threads()=16, nth=16) NAstrings = [<<NA>>] None of the NAstrings look like numbers. show progress = 1 0/1 column will be read as integer [02] Opening the file Opening file ..\TOI\TOI_RAW_APextracted.csv File opened, size = 49.00GB (52608776250 bytes). Memory mapped ok [03] Detect and skip BOM [04] Arrange mmap to be \0 terminated \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal. [05] Skipping initial rows if needed Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>> [06] Detect separator, quoting rule, and ncolumns Detecting sep automatically ... sep=',' with 100 lines of 147 fields using quote rule 0 Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>> Quote rule picked = 0 fill=false and the most number of columns found is 147 [07] Detect column types, good nrow estimate and whether first row is column names Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216 Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555 Quote rule 0 Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0 Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0 Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0 Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0 'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows ===== Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points Bytes from first data row on line 2 to the end of last row: 52608774311 Line length: mean=956.51 sd=35.58 min=823 max=1063 Estimated number of rows: 52608774311 /
956.51 = 55000757 Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
===== [08] Assign column names [09] Apply user overrides on column types After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000 [10] Allocate memory for the datatable Allocating 23 column slots (147 - 124 dropped) with 60500832 rows [11] Read the data jumps=[0..50176), chunk_size=1048484, total_size=52608774311 |--------------------------------------------------| |==================================================| jumps=[0..50176), chunk_size=1048484, total_size=52608774311 |--------------------------------------------------| |==================================================| Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time [12] Finalizing the datatable Type counts:
124 : drop '0'
3 : int32 '5'
7 : float64 '7'
13 : string 'A'
=============================
0.000s ( 0%) Memory map 48.996GB file
0.035s ( 0%) sep=',' ncol=147 and header detection
0.001s ( 0%) Column type detection using 10045 sample rows
6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times) + 22.774s ( 1%) Transpose +
144.273s ( 8%) Waiting
24.545s ( 1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s Total Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out if this is just normal or if there is anything I can do to improve these reading speeds. Based on various benchmarks I've seen around and my own experience and intuition with fread using smaller files, I would have expected this to be read in much much quicker.
Also I was wondering if the multi-core capabilities are fully being used, as I have heard that under Windows this might not always be straightforward. My knowledge around this topic is pretty limited unfortunately, but it does appear from the verbose output that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft Open R; even more so if the cloud is Azure. Actually, there may be coordination between Open R and Azure client. Because of Intel's MKL and Microsoft built in enhancements, I find Microsoft Open R faster on Windows.
(2) I suspect 'Select' and 'Drop' work after a full file read. Maybe read all the file, subset or filter afterward.
(3) I think a restart is overkill. I run gc thrice every so often like this: 'gc();gc();gc();'' I have heard others say this does nothing. But at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table fread are implementing 'YAML' . This looks promising.
(5) setDTthread(0) uses all the cores. Too much parellitization can work against you. Trying halving your cores.
I am fully aware that similar questions may have been posted, but after searching it seems that the details of our questions are different (or at least I did not manage to find a solution that can be adopted in my case).
I currently have two files: "messyFile" and "wantedID". "messyFile" is of size 80,000,000 X 2,500, whereas "wantedID" is of size 1 x 462. On the 253rd line of "messyFile", there are 2500 IDs. However, all I want is the 462 IDs in the file "wantedID". Assuming that the 462 IDs are a subset of the 2500 IDs, how can I process the file "messyFile" such that it only contains information about the 462 IDs (ie. of size 80,000,000 X 462).
Thank you so much for your patience!
ps: Sorry for the confusion. But yeah, the question can be boiled down to something like this. In the 1st row of "File#1", there are 10 IDs. In the 1st row of "File#2", there are 3 IDs ("File#2" consists of only 1 line). The 3 IDs are a subset of the 10 IDs. Now, I hope to process "File#1" so that it contains only information about the 3 IDs listed in "File#2".
ps2: "messyFile" is a vcf file, whereas "wantedID" can be a text file (I said "can be" because it is small, so I can make almost any type for it)
ps3: "File#1" should look something like this:
sample#1 sample#2 sample#3 sample#4 sample#5
0 1 0 0 1
1 1 2 0 2
"File#2" should look something like this:
sample#2 sample#4 sample#5
Desired output should look like this:
sample#2 sample#4 sample#5
1 0 1
1 0 2
For parsing VCF format, use bcftools:
http://samtools.github.io/bcftools/bcftools.html
Specifically for your task see the view command:
http://samtools.github.io/bcftools/bcftools.html#view
Example:
bcftools view -Ov -S 462sample.list -r chr:pos -o subset.vcf superset.vcf
You will need to get the position of the SNP to specify chr:pos above.
You can do this using DbSNP:
http://www.ncbi.nlm.nih.gov/SNP/index.html
Just make sure to match the genome build to the one used in the VCF file.
You can also use plink:
https://www.cog-genomics.org/plink2
But, PLINK is finicky about duplicated SNPs and other things, so it may complain unless you address these issues.
I've done what you are attempting in the past using the awk programming language. For your sanity, I recommend using one of the above tools :)
Ok, I have no idea what a vcf file is but if the File#1 and File#2 samples you gave were files containing tab separated columns this will work:
declare -a data=(`head -1 data.txt`)
declare -a header=(`head -1 header.txt`)
declare fields
declare -i count
for i in "${header[#]}" ; do
count=0
for j in "${data[#]}" ; do
count=$count+1;
if [ $i == $j ] ; then
fields=$fields,$count
fi
done
done
cut -f ${fields:1} data.txt
If they aren't tab separated values perhaps it can be amended for the actual data format.