Read in a file in R up to a designated point

I want to be able to read a .txt file into R which is always going to be an n by 2 file (n rows and 2 columns). I don't know how big n is each time, but at the end of the file it breaks into just one column, so the end of the file may look like this:
abc 500
def 600
--- 0
Total 0
--- 0
Ideally I would like to stop reading when we hit the --- line. This is always consistent, so read in the data but stop when you get to the --- line.
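A minimal sketch of one way to do this, assuming the file is whitespace-delimited and the marker line starts with dashes (the file name and column names here are placeholders):
lines <- readLines("data.txt")                    # placeholder file name
stop_at <- which(grepl("^-{3,}", lines))[1]       # first line starting with ---
keep <- if (is.na(stop_at)) lines else lines[seq_len(stop_at - 1)]
df <- read.table(text = keep, header = FALSE,
                 col.names = c("name", "value"),  # assumed column names
                 stringsAsFactors = FALSE)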

Lookup Table (CombiTimeTable) in OpenModelica

I have some trouble using CombiTimeTable.
I want to fill the table using a .txt file that contains two columns: the first is the time and the second is the related value (a current sample). Furthermore, I add #1 on the first line, as the manual says.
Moreover, I add the following parameters:
tableOnFile=true,
fileName="C:/Users/gg/Desktop/CurrentDrivingCycle.txt"
I also have to add the parameter tableName but I don't know how to define it. I tried to define it using the name of the file (i.e. CurrentDrivingCycle) but I got this error message at the end of the simulation:
Table matrix "CurrentDrivingCycle" not found on file "C:/Users/ggalli/Desktop/CurrentDrivingCycle.txt".
simulation terminated by an assertion at initialization
Simulation process failed. Exited with code -1.
Do you know how I can solve this issue?
Thank you in advance!
See the documentation:
https://build.openmodelica.org/Documentation/Modelica.Blocks.Sources.CombiTimeTable.html
The name tab1(6,2) in the documentation's example is the tableName, i.e. the name of the table matrix inside the file, not the name of the file. With tableName="CurrentDrivingCycle", your file should look something like this:
#1
double CurrentDrivingCycle(6,2) # comment line
0 0
1 0
1 1
2 4
3 9
4 16

Optimizing `data.table::fread` speed advice

I am experiencing read speeds that I believe are much slower than expected when trying to read a fairly large file in R with fread.
The file is ~60M rows x 147 columns, of which I am only selecting 27 columns directly in the fread call using select; only 23 of the 27 are found in the actual file. (Probably I typed some of the column names incorrectly, but I guess that matters less.)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
                  verbose = TRUE,
                  select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555  Quote rule 0
  Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  |--------------------------------------------------|
  |==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  |--------------------------------------------------|
  |==================================================|
  Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
124 : drop '0'
3 : int32 '5'
7 : float64 '7'
13 : string 'A'
=============================
0.000s ( 0%) Memory map 48.996GB file
0.035s ( 0%) sep=',' ncol=147 and header detection
0.001s ( 0%) Column type detection using 10045 sample rows
6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s (  1%) Transpose
   +  144.273s (  8%) Waiting
  24.545s (  1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out whether this is just normal or whether there is anything I can do to improve these read speeds. Based on various benchmarks I've seen around, and my own experience and intuition with fread on smaller files, I would have expected this to be read much more quickly.
Also, I was wondering whether the multi-core capabilities are being fully used, as I have heard that under Windows this might not always be straightforward. My knowledge of this topic is pretty limited, unfortunately, but the verbose output does appear to show that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft R Open; even more so if the cloud is Azure, since there may be coordination between R Open and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft R Open faster on Windows.
(2) I suspect select and drop are applied after a full file read. Maybe read the whole file and subset or filter afterward.
(3) I think a restart is overkill. I run gc() a few times every so often, like this: gc(); gc(); gc(). I have heard others say this does nothing, but at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table's fread are adding YAML support. This looks promising.
(5) setDTthreads(0) uses all the cores. Too much parallelization can work against you. Try halving your cores.
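As a rough sketch of (5), assuming data.table 1.12.0 as in the question and the cols2Select vector defined there:
library(data.table)
getDTthreads()                  # check how many threads fread will use
setDTthreads(8)                 # e.g. try half of the 16 cores
dt <- fread("..\\TOI\\TOI_RAW_APextracted.csv",
            select = cols2Select,   # as defined in the question
            verbose = TRUE)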

Using Cat to merge Files, ignoring the last Byte of each file

I have written a rudimentary HTTP downloader that downloads parts of a file and stores them in temporary, partial files in a directory.
I can use
cat * > outputfilename
to concatenate the partial files together, as their order is given by the individual filenames.
However, the byte range of each file is something like:
File 1: 0 - 1000
File 2: 1000 - 2000
File 3: 2000 - 3000
for a file that is 3000 bytes in size, i.e. the last byte of each part overlaps the first byte of the next part. The cat command therefore duplicates the overlapping bytes in the new file.
Specifically, we can see this when the downloaded file is an image and it looks wrong. For example, using this image from imgur:
https://i.imgur.com/XEvBCtp.jpg
only 1/(the number of partial files) of the image renders correctly.
To note: The original image is 250 KB.
The concatenated image is 287 KB.
I will be implementing this in C99 on Unix, as a method that calls exec.
I'm not sure where to upload the partial files to assist with the question on Stack Overflow.

How to match ID in column in unix?

I am fully aware that similar questions may have been posted, but after searching it seems that the details of our questions are different (or at least I did not manage to find a solution that can be adopted in my case).
I currently have two files: "messyFile" and "wantedID". "messyFile" is of size 80,000,000 x 2,500, whereas "wantedID" is of size 1 x 462. On the 253rd line of "messyFile" there are 2,500 IDs. However, all I want is the 462 IDs in the file "wantedID". Assuming that the 462 IDs are a subset of the 2,500 IDs, how can I process "messyFile" so that it only contains information about the 462 IDs (i.e. so it is of size 80,000,000 x 462)?
Thank you so much for your patience!
ps: Sorry for the confusion. But yeah, the question can be boiled down to something like this. In the 1st row of "File#1", there are 10 IDs. In the 1st row of "File#2", there are 3 IDs ("File#2" consists of only 1 line). The 3 IDs are a subset of the 10 IDs. Now, I hope to process "File#1" so that it contains only information about the 3 IDs listed in "File#2".
ps2: "messyFile" is a vcf file, whereas "wantedID" can be a text file (I say "can be" because it is small, so I can make it almost any format)
ps3: "File#1" should look something like this:
sample#1 sample#2 sample#3 sample#4 sample#5
0 1 0 0 1
1 1 2 0 2
"File#2" should look something like this:
sample#2 sample#4 sample#5
Desired output should look like this:
sample#2 sample#4 sample#5
1 0 1
1 0 2
For parsing VCF format, use bcftools:
http://samtools.github.io/bcftools/bcftools.html
Specifically for your task see the view command:
http://samtools.github.io/bcftools/bcftools.html#view
Example:
bcftools view -Ov -S 462sample.list -r chr:pos -o subset.vcf superset.vcf
You will need to get the position of the SNP to specify chr:pos above.
You can do this using DbSNP:
http://www.ncbi.nlm.nih.gov/SNP/index.html
Just make sure to match the genome build to the one used in the VCF file.
You can also use plink:
https://www.cog-genomics.org/plink2
But, PLINK is finicky about duplicated SNPs and other things, so it may complain unless you address these issues.
I've done what you are attempting in the past using the awk programming language. For your sanity, I recommend using one of the above tools :)
OK, I have no idea what a vcf file is, but if the File#1 and File#2 samples you gave are files containing tab-separated columns, this will work:
# data.txt is File#1 (the big file); header.txt is File#2 (the single line of wanted IDs)
declare -a data=(`head -1 data.txt`)
declare -a header=(`head -1 header.txt`)
declare fields
declare -i count
for i in "${header[@]}" ; do
    count=0
    for j in "${data[@]}" ; do
        count=$count+1;
        if [ $i == $j ] ; then
            fields=$fields,$count    # collect the matching column numbers
        fi
    done
done
cut -f ${fields:1} data.txt
If they aren't tab-separated values, perhaps it can be amended for the actual data format.

R - read html files within a folder, count frequency, and export output

I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each html file within a folder, then
For each file, do frequency count of some particular words (e.g., "financial constraint" "oil export" etc.), then
Automatically write the output to a .csv file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name count_financial_constraint count_oil_export
1 3 4
2 0 3
3 4 0
4 1 2
Can anyone please let me know where I should start? So far I think I've figured out how to clean the HTML files and do the count, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders each containing about 1000 HTML files). Thanks!
Try this:
gethtml <- function(path = ".") {
  files <- list.files(path)
  setwd(path)
  html <- grepl("\\.html$", files)   # keep only the .html files
  files <- files[html]
  htmlcount <- vector()
  for (i in files) {
    htmlcount[i] <- ##### add function that reads html file and counts it
  }
  return(sum(htmlcount))
}
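A fuller sketch of the folder-to-CSV workflow could look like the following (a rough outline only: the folder path, the phrase list and the output file name are placeholders, and it simply counts regex matches in the raw HTML text rather than properly parsing the HTML):
# count how often each phrase appears in each .html file of a folder
count_phrases <- function(folder, phrases) {
  files <- list.files(folder, pattern = "\\.html$", full.names = TRUE)
  counts <- sapply(files, function(f) {
    txt <- paste(readLines(f, warn = FALSE), collapse = " ")
    sapply(phrases, function(p) {
      m <- gregexpr(p, txt, ignore.case = TRUE)[[1]]
      if (m[1] == -1) 0L else length(m)   # -1 means no match found
    })
  })
  data.frame(file_name = basename(files), t(counts), row.names = NULL)
}

phrases <- c("financial constraint", "oil export")    # example phrases
result <- count_phrases("path/to/folder", phrases)    # placeholder path
write.csv(result, "phrase_counts.csv", row.names = FALSE)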
R is not intended for rigorous text parsing, so the tools for such tasks are limited. If you insist on doing it in R, then you had better get familiar with regular expressions and have a look at this.
However, I highly recommend using Python with the BeautifulSoup library, which is specifically designed for this task.
