R: Extract value and lines after key word (text file mining) - r

Setting:
I have (simple) .csv and .dat files created from laboratory devices and other programs storing information on measurements or calculations. I have found this for other languages but nor for R
Problem:
Using R, I am trying to extract values to quickly display results w/o opening the created files. Hereby I have two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally I would like to loop over dozens of files in a folder and give me a summary (to make the picture complete: I will manage this part)
I woul appreciate any form of help.

ok, it works for the key value (although perhaps not very nice)
variable<-scan("file.csv", what=character(),sep="")
returns a charactor vector of everything
variable[grep("keyword", ks)+2] # + 2 as the actual value is stored two places ahead
returns characters of seaked values.
as.numeric(lapply(variable, gsub, patt=",", replace="."))
for completion: data had to be altered to number and "," and "." problem needed to be solved.
in a line:
data=as.numeric(lapply(ks[grep("Ks_Boden", ks)+2], gsub, patt=",", replace="."))
Perseverence is not to bad of an asset ;-)
The rest isn't finished, yet, I will post once finished.

Related

Character values stored in DATAFRAME with Double Quotes while reading into R

I have a csv file with almost 4 millions records and 30 + columns.
The Columns are of varied type that includes Numeric, Alphanumeric, Date Column, character etc.
Attempt 1:
When I first read the file in R using read.csv Function then only 2 millions of the records were read.
This may have happened because of some special characters in the DATA.
Attempt 2:
I provided the argument quote = "" in read.csv Function and all the records were read succesfully.
However this brings up 2 issues:
a. all teh Columns were appended with 'x.' modifier:
egs.: x.date , x.name
b. all the Character Columns were loaded in dataframe, enclosed with double quotes ""
Can someone, please advise me that how to resolve these 2 issues and get the data loaded in R succesfully?
I work for a financial insititution and the data is highly sensitive, hence cannot paste the screenshot over here.
I also tried to create the scenario at my home but all my efforts were of little or of no avail.
The below screenshot is closest I have came to the exact scenario:
DATAFRAME SCREENSHOT: Not exact copy

readcsv fails to read # character in Julia

I've been using asd=readcsv(filename) to read a csv file in Julia.
The first row of the csv file contains strings which describe the column contents; the rest of the data is a mix of integers and floats. readcsv reads the numbers just fine, but only reads the first 4+1/2 string entries.
After that, it renders "". If I ask the REPL to display asd[1,:], it tells me it is 1x65 Array{Any,2}.
The fifth column in the first row of the csv file (this seems to be the entry it chokes on) is APP #1 bias voltage [V]; but asd[1,5] is just APP . So it looks to me as though readcsv has choked on the "#" character.
I tried using "quotes=false" keyword in readcsv, but it didn't help.
I used to use xlsread in Matlab and it worked fine.
Has anybody out there seen this sort of thing before?
The comment character in Julia is #, and this applies when reading files from delimited text files.
But luckily, the readcsv() and readdlm() functions have an optional argument to help in these situations.
You should try readcsv(filename; comment_char = '/').
Of course, the example above assumes that you don't have any / characters in your first line. If you do, then you'll have to change that / above to something else.

Read lines by number from a large file

I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read-out the lines in one pass?
I was hoping for a C function that does it on one pass.
The trick is to use connection AND open it before read.table:
con<-file('filename')
open(con)
read.table(con,skip=5,nrow=1) #6-th line
read.table(con,skip=20,nrow=1) #27-th line
...
close(con)
You may also try scan, it is faster and gives more control.
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the ,nrows argument to read.csv or any of the read.table family. If not, you can combine the ,nrows and the ,skip arguments to repeatedly call read.csv (reading in a new row or group of contiguous rows with each call) and then rbind the results together.
If your file has fixed line lengths then you can use 'seek' to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many
errors in the Windows implementation of file positioning that
users are advised to use it only at their own risk, and asked not
to waste the R developers' time with bug reports on Windows'
deficiencies.
You can also use 'seek' from the standard C library in C, but I don't know if the above warning also applies!
Before I was able to get an R solution/answer, I've done it in Ruby:
#!/usr/bin/env ruby
NUM_SEQS = 14024829
linenumbers = (1..10).collect{(rand * NUM_SEQS).to_i}
File.open("./data/uniprot_2011_02.tab") do |f|
while line = f.gets
print line if linenumbers.include? f.lineno
end
end
runs fast (as fast as my storage can read the file).
I compile a solution based on the discussions here.
scan(filename,what=list(NULL),sep='\n',blank.lines.skip = F)
This will only show you number of lines but will read in nothing. If you really want to skip the blank lines, you could just set the last argument to TRUE.

repair data in csv file

I have a huge csv file, separated by comma's and I want to do a analysis with glm in R.
In one column there exists data with a comma implied, something like: bla,blabla
When reading the file in R with read.csv.sql there comes a error-message:
RS-DBI driver: (RS_sqlite_import: ./agp.csv line 47612 expected 37 columns of data but found 38)
This is due to the 'extra' comma in some of the data, not the whole column has an extra column.
How can I fix this? I want to remove this extra superfluous comma.
Thanks for the reaction,
André
The CSV format is very simple and can easily be hand edited. In order to include a comma in a value, you must surround the value with quotes quotes. Try this: "bla,blabla". If that data happens to contain any quotes, eg. blah,"thequotedblah",blah, those quotes need to be escaped with another quote, like this: "blah,""thequotedblah"",blah".
Although there is no official standard around it, there isn't much to the CSV format. Wikipedia has a great CSV reference that I have personally used to implement CSV support in applications. Spend 5-10 minutes reading it and you'll know everything you ever need to know to manually create/read/repair CSV data.
Is it just this one line that contains a non-quoted comma - or are there several such lines? Editing the .csv with an editor that can handle large files (e.g. Ultraedit) to sanitize that one record would certainly help. Asaph's suggestion of quoting is also a good 'un.

Is there a way to check the spelling of words in a character vector?

The text to be checked is in Greek, but I would like to know if it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, do you think of something better than Excel-vba?
Alternatively, OpenOffice ships with a dictionary that entries stored in a text file. You can read that and remove the word definitions to create your word list.
This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.
library(stringr)
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
is_word <- str_detect(dict, "^[^(]")
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]
This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
Once this is done, checking a word is as easy as
c("persnickety", "sqwrzib") %in% words # TRUE FALSE
There exists an open source GNU spell checker called Aspell with suppot for various languages. This is a command line program which I basically use for scanning bunches of text files at once (then the output is just given to the console).
But there also exists a C API and perhaps more interesting for you a Pipe mode which accepts streams of texts and outputs to the standard output.
Hope this helps.

Resources