fread - multiple separators in a string - r

I'm trying to read a table using fread.
The txt file contains text that looks like this:
"No","Comment","Type"
"0","he said:"wonderful|"","A"
"1","Pr/ "d/s". "a", n) ","B"
The R code I'm using is: dataset0 <- fread("data/test.txt", stringsAsFactors = F) with the development version of the data.table R package.
I expect to get a dataset with three columns; however, I get:
Error in fread(input = "data/stackoverflow.txt", stringsAsFactors = FALSE) :
Line 3 starting <<"1","Pr/ ">> has more than the expected 3 fields.
Separator 3 occurs at position 26 which is character 6 of the last field: << n) ","B">>.
Consider setting 'comment.char=' if there is a trailing comment to be ignored.
How to solve it?

The development version of data.table handles files like this where the embedded quotes have not been escaped. See point 10 on the wiki page.
I just tested it on your input and it works.
$ more unescaped.txt
"No","Comment","Type"
"0","he said:"wonderful."","A"
"1","The problem is: reading table, and also "a problem, yes." keep going on.","A"
> DT = fread("unescaped.txt")
> DT
No Comment Type
1: 0 he said:"wonderful." A
2: 1 The problem is: reading table, and also "a problem, yes." keep going on. A
> ncol(DT)
[1] 3

Use readLines to read the file line by line, then replace the delimiter and parse the result with read.table:
# read with no sep
x <- readLines("test.txt")
# introduce new sep - "|"
x <- gsub("\",\"", "\"|\"", x)
# read with new sep
read.table(text = x, sep = "|", header = TRUE)
# No Comment Type
# 1 0 he said:"wonderful." A
# 2 1 The problem is: reading table, and also "a problem, yes." keep going on. A
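The same idea can be sketched in base R without introducing a temporary separator: strip the outer quotes, then split each line on the literal "," sequence. This is only a minimal sketch. It assumes every field is quoted and that the three-character sequence "," never occurs inside a field, and the variable names are illustrative.

```r
# Sample lines with unescaped embedded quotes (taken from the answer above)
lines <- c(
  '"No","Comment","Type"',
  '"0","he said:"wonderful."","A"',
  '"1","The problem is: reading table, and also "a problem, yes." keep going on.","A"'
)

# Drop the opening and closing quote of each line
inner <- sub('^"', '', sub('"$', '', lines))

# Split on the literal field boundary "," (fixed = TRUE disables regex matching)
fields <- strsplit(inner, '","', fixed = TRUE)

# Every line should now have exactly three fields
stopifnot(all(lengths(fields) == 3))

# First line is the header, the remaining lines are data
dat <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(dat) <- fields[[1]]
```

All columns come back as character, so numeric columns such as No would still need as.integer() or similar.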

Related

Decimal read in does not change

I am trying to read in a .csv file with, for example, a column like this:
These values are meant to represent thousands of hours, not two or three hours and so on.
When I try to change the import options through
read.csv(file, sep = ";", dec = ".") nothing changes. It doesn't matter whether I set dec = "." or dec = ","; the numbers above always stay the same.
You can use the following code:
library(readr)
df <- read_csv('data.csv', locale = locale(grouping_mark = "."))
df
Output:
# A tibble: 4 × 1
`X-ray`
<dbl>
1 2771
2 3783
3 1267
4 7798
As you can see, the values are now thousands.
An elegant way (in my opinion) is to create a new class, which you then use in the reading process.
This way, you stay flexible when your data is (really) messed up and the decimal/thousands separator is not the same across all (numeric) columns.
# Define a new class of numbers
setClass("newNumbers")
# Define substitution of dots to nothing
setAs("character", "newNumbers", function(from) as.numeric(gsub("\\.", "", from)))
# Now read
str(data.table::fread( "test \n 1.235 \n 1.265", colClasses = "newNumbers"))
# Classes ‘data.table’ and 'data.frame': 2 obs. of 1 variable:
# $ test: num 1235 1265
The solution proposed by Quinten will work; however, it's worth adding that the function designed to parse numbers with a grouping mark is col_number.
with(asNamespace("readr"),
     read_delim(
       I("X-ray hours\n---\n2.771\n3.778\n3,21\n"),
       delim = ";",
       col_names = c("x_ray_hours"),
       col_types = cols(x_ray_hours = col_number()),
       na = c("---"),
       skip = 1
     ))
There is no need to define a locale just to handle this case. A locale setting also applies to the whole data set, whereas the intention here is to handle only that one column. From the docs:
?readr::parse_number
This drops any non-numeric characters before or after the first number.
Also if the columns use ; as a separator, read_delim is more appropriate.
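A likely reason dec appears to have no effect is that the dots here are grouping marks rather than decimal marks. The following is a minimal base R sketch that reads the column as text and strips the dots before converting. The sample values are an assumption (they mirror the tibble output shown earlier), and the approach only suits columns where every dot is a thousands separator.

```r
# Assumed sample input: one column whose dots are thousands separators
csv_text <- "X-ray\n2.771\n3.783\n1.267\n7.798"

# Read everything as character so nothing is interpreted as a decimal number;
# check.names = FALSE keeps the column name "X-ray" unchanged
raw <- read.csv(text = csv_text, sep = ";", colClasses = "character",
                check.names = FALSE)

# Remove the literal dots, then convert to numeric
raw[["X-ray"]] <- as.numeric(gsub(".", "", raw[["X-ray"]], fixed = TRUE))
```

After this, raw[["X-ray"]] holds the values in the thousands instead of 2.771, 3.783 and so on.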

Reading a file that has non fixed number of columns fread() in R

I am trying to read a file which by default is supposed to have 7 columns, but some strings probably contain commas, which causes some rows to have more than 7 columns.
Regardless of what is in the other columns, my only goal is to read the first 7 columns. However, fread does not read the whole file even after adding the argument select = 1:7
> data <- fread("dpp.DAT",header=FALSE, fill=T, select = 1:7, sep=",",stringsAsFactors = F)
Warning message:
In fread("dpp.DAT", header = FALSE, fill=T, select = 1:7,sep = ",", stringsAsFactors = F) :
Stopped early on line 45922. Expected 7 fields but found 8. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<84172666,DS,BRAND 4 - DERIVATIVE,#PL LOC BDD : BDD - BRAND 3 - DERIVATIVE,37324,BLEND-A-MD-INSPRD-BY-NTR-SGHH,BLEND B MAR INSPIRED BY OTHER CHAMOMILE, VAG + HHHH>>
Is there a trick you can suggest to read all the rows of the file?
Sample dataset
Say we have a text file "test.txt" like this:
a,b,c
d,e,f
g,h,i,j
k,l,m
We can read it in with fill=T and then drop the final column:
> fread("test.txt", fill=T)[,-4]
V1 V2 V3
1: a b c
2: d e f
3: g h i
4: k l m
Or, set select=1:3:
> fread("test.txt", fill=T, select = 1:3)
V1 V2 V3
1: a b c
2: d e f
3: g h i
4: k l m
EDIT
The solution was to use the cut unix command as such:
terminal$ cut Test_Fread_column.DAT -d',' -f1-7 > tmp
R> fread("tmp")
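Where a Unix shell is not available, the effect of the cut command can be approximated in base R. The sketch below keeps only the first n_keep comma-separated pieces of each line before parsing. It uses the small a,b,c sample from the answer above rather than the real file, the names lines, n_keep and trimmed are illustrative, and, like cut, it splits on every comma without regard to quoting.

```r
# Sample lines; the third one has an extra field
lines <- c("a,b,c", "d,e,f", "g,h,i,j", "k,l,m")
n_keep <- 3  # number of leading fields to keep (7 in the original question)

# Split each line on commas and glue the first n_keep pieces back together
parts <- strsplit(lines, ",", fixed = TRUE)
trimmed <- vapply(
  parts,
  function(p) paste(p[seq_len(min(n_keep, length(p)))], collapse = ","),
  character(1)
)

# Every line now has at most n_keep fields, so a plain CSV reader can parse it
dat <- read.csv(text = trimmed, header = FALSE, stringsAsFactors = FALSE)
```

For a file on disk, lines would come from readLines(), which loads the whole file into memory; for very large files the cut approach is likely to be lighter.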
data.table gets finicky about extra columns showing up in the middle of the file as opposed to the beginning, which is why select and fill don't work here. What you can do is take all the rows it gives you up front and then try again, using skip to pass over the rows you've already loaded. On that second (or later) attempt the extra columns will be at the beginning, so fill and select work as expected. There are probably more elegant ways to do the following, but this works:
library(data.table)
# filename must hold the path to the data file (it is not defined in this snippet)
# capture warnings so we can evaluate what happened last in code
tempfile='tmp321364.txt'
conn<-file(tempfile, open="r+")
sink(file=conn, type='message')
DT<-list()
while(TRUE) {
  DT[[length(DT)+1]] <- fread(filename, header=FALSE,stringsAsFactors = F, fill=T, select=1:7, skip=ifelse(length(DT)>0,sum(sapply(DT, nrow)),0))
  if(nrow(DT[[length(DT)]])==0) break
  warns<-readLines(conn)
  if(length(warns)==3) { #The warning about extra columns is 3 lines long
    DT[[length(DT)+1]]<- fread(filename, header=FALSE,stringsAsFactors = F, fill=T, select=1:7, skip=sum(sapply(DT, nrow)))
    if(nrow(DT[[length(DT)]])==0) break
  } else { #an error about skipping too many rows is not 3 lines, assuming away other issues
    break
  }
}
DT<-rbindlist(DT)
sink(NULL, type='message')
close(conn)
rm(tempfile)
With your exact data, you don't need the while(TRUE) loop but if, for example, there were a 10th column that shows up even further down then this will work for those cases.
Dean's answer provides more automation than mine. Whenever I hit this problem (which is actually probably poorly formatted data), I resort to manually finding and then rebuilding the extract with rbind:
s1 <- fread("Extract.txt",
            nrows=674170,
            strip.white = TRUE,
            fill = TRUE,
            blank.lines.skip = TRUE,
            encoding="UTF-8")
s2 <- fread("Extract.txt",
            strip.white = TRUE,
            fill = TRUE,
            blank.lines.skip = TRUE,
            skip=674170,
            encoding="UTF-8")
# ad.infinitum until you complete "Extract.txt"
s3 <- rbind(s1,s2)
rm(s1)
rm(s2)

R: Read in a subset of lines and turn it into a conventional format (data.table approach preferred)

I have a file that has over a hundred million rows, and scattered throughout it there are extra tab delimiters in fields. I need to read only the problematic rows into R, ignoring the others, because of the large file size involved.
Example txt file with extra delimiters in some rows:
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
The first thing I tried was the readLines function; however, while I can specify the row to stop on, it still reads everything up to that point, which could still be too much:
readLines(textConnection(text_file), n = 4)
[1] "My\tname\tis\tAlpha" "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta"
I then realised that I could also use the other dataset import functions if I specified the delimiter to be something that should never appear. The fread function from the data.table package would be perfect for this, as it is the fastest way to deal with large datasets like mine; however, when I tried it, the data came back in a format that I couldn't really work with further:
library(data.table)
library(stringi)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 3)
> lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
> invalid_delimiter_rows <- which(stri_count_regex(lines, "\\t") != 3)
Warning message:
In stri_count_regex(lines, "\\t") :
argument is not an atomic vector; coercing
Preferably I shouldn't have to convert this data after importing; however, when I tried changing it to a character vector or list, it was still in a bad format (the c() call is treated as part of the string and not as a function). What is the most time-efficient way to approach this issue?
> class(lines)
[1] "data.table" "data.frame"
> as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\")"
Let's replicate the process up to the fread() import:
# your example string
text_file <-"My\tname\tis\tAlpha\nMy\tname\tis\t\t\tBravo\nMy\tname\tis\tCharlie\nMy\tname\tis\t\t\tDelta\nMy\tname\tis\tEcho"
# import
library(data.table)
lines <- fread(text_file, sep = NULL, header = FALSE, skip = 1, nrows = 5)
lines
V1
1: My\tname\tis\t\t\tBravo
2: My\tname\tis\tCharlie
3: My\tname\tis\t\t\tDelta
4: My\tname\tis\tEcho
When you try
as.character(lines)
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
it converts the whole data.table to character, so each column becomes a single string containing the deparsed vector. See below:
as.character(data.table(lines$V1, lines$V1))
[1] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
[2] "c(\"My\\tname\\tis\\t\\t\\tBravo\", \"My\\tname\\tis\\tCharlie\", \"My\\tname\\tis\\t\\t\\tDelta\", \"My\\tname\\tis\\tEcho\")"
What you want is to extract just lines$V1, which is already a character vector:
lines$V1
[1] "My\tname\tis\t\t\tBravo" "My\tname\tis\tCharlie" "My\tname\tis\t\t\tDelta" "My\tname\tis\tEcho"
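Once the lines are available as a plain character vector, the tab count from the question can be computed directly. The following is a small base R sketch (no stringi needed) using the four lines shown above; the names x, tab_counts and problem_lines are illustrative.

```r
# The lines as they come back in lines$V1 (copied from the output above)
x <- c("My\tname\tis\t\t\tBravo",
       "My\tname\tis\tCharlie",
       "My\tname\tis\t\t\tDelta",
       "My\tname\tis\tEcho")

# Count literal tab characters per line
tab_counts <- lengths(regmatches(x, gregexpr("\t", x, fixed = TRUE)))

# Rows that do not have exactly three delimiters
invalid_delimiter_rows <- which(tab_counts != 3)
problem_lines <- x[invalid_delimiter_rows]
```

Here the first and third lines are flagged, since each contains five tabs instead of three.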

Unimplemented type list when trying to write.table

I have the following data.table (data.frame) called output:
> head(output)
Id Title IsProhibited
1 10000074 Renault Logan, 2005 0
2 10000124 Ñêëàäñêîå ïîìåùåíèå, 345 ì<U+00B2> 0
3 10000175 Ñó-øåô 0
4 10000196 3-ê êâàðòèðà, 64 ì<U+00B2>, 3/5 ýò. 0
5 10000387 Samsung galaxy S4 mini GT-I9190 (÷¸ðíûé) 0
6 10000395 Êàðòèíà ""Êðûì. Ïîñåëîê Àðîìàò"" (õîëñò, ìàñëî) 0
I am trying to export it to a CSV like so:
> write.table(output, 'output.csv', sep = ',', row.names = FALSE, append = T)
However, when doing so I get the following error:
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type 'list' in 'EncodeElement'
In addition: Warning message:
In write.table(output, "output.csv", sep = ",", row.names = FALSE, :
appending column names to file
I have tried converting the Title to a string so that it is no longer of type list like so:
toString(output$Title)
But, I get the same error. My types are:
> class(output)
[1] "data.frame"
> class(output$Id)
[1] "integer"
> class(output$Title)
[1] "list"
> class(output$IsProhibited)
[1] "factor"
Can anyone tell me how I can export my data.frame to CSV?
Another strange thing that I've noticed, is that if I write head(output) my text is not encoded properly (as shown above) whereas if I simply write output$Title[0:3] it will display the text correctly like so:
> output$Title[0:3]
[[1]]
[1] "Renault Logan, 2005"
[[2]]
[1] "Складское помещение, 345 м²"
[[3]]
[1] "Су-шеф"
Any ideas regarding that? Is it relevant to my initial problem?
Edit: Here is my new output:
Id Title IsProhibited
10000074 Renault Logan, 2005 0
10000124 СкладÑкое помещение, 345 м<U+00B2> 0
10000175 Су-шеф 0
10000196 3-к квартира, 64 м<U+00B2>, 3/5 ÑÑ‚. 0
10000387 Samsung galaxy S4 mini GT-I9190 (чёрный) 0
10000395 Картина \\"Крым. ПоÑелок Ðромат\"\" (холÑÑ‚ маÑло)" 0
10000594 КальÑн 25 Ñм 0
10000612 1-к квартира, 45 м<U+00B2>, 6/17 ÑÑ‚. 0
10000816 Гараж, 18 м<U+00B2> 0
10000831 Платье 0
10000930 Карбюраторы К-22И, К-22Г от газ 21 и газ 51 0
Notice how the line with ID 10000395 is messed up? It seems to contain quotes of its own, which are messing up the CSV. How can I fix that?
Do this, irrespective of how many columns you have:
df <- apply(df,2,as.character)
Then do write.csv.
As mentioned in the comments, you should be able to do something like this (untested) to "flatten" your list into a character vector:
output$Title <- vapply(output$Title, paste, collapse = ", ", character(1L))
As also mentioned, if you wanted to try the unlist approach, you could "expand" each row by the individual values in output$Title, something like this:
x <- vapply(output$Title, length, 1L) ## How many items per list element
output <- output[rep(rownames(output), x), ] ## Expand the data frame
output$Title <- unlist(output$Title, use.names = FALSE) ## Replace with raw values
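To make the flattening approach concrete, here is a minimal self-contained sketch. The two-row data frame is made up for illustration (it only imitates the structure of output from the question), and out_file is a temporary path.

```r
# Made-up data frame with a list column, imitating the question's structure
output <- data.frame(Id = c(10000074L, 10000124L),
                     IsProhibited = factor(c(0, 0)))
output$Title <- list("Renault Logan, 2005", c("part one", "part two"))
stopifnot(is.list(output$Title))  # this is the column type write.table rejects

# Collapse each list element into a single string
output$Title <- vapply(output$Title, paste, character(1L), collapse = ", ")

# The data frame now contains only atomic columns and can be written out
out_file <- tempfile(fileext = ".csv")
write.csv(output, out_file, row.names = FALSE)

# Read it back to check the round trip
back <- read.csv(out_file, stringsAsFactors = FALSE)
```

Because write.csv quotes character fields by default, titles that contain commas survive the round trip.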
There is a new function (introduced in November 2016) in the data.table package that handles writing a data.table object to CSV quite well, even when a column of the data.table is a list.
fwrite(DT, file = "myDT.csv") # DT is your data.table
Another easy solution: one or more columns may be of type list, so we need to convert them to character, or convert the whole data frame. There are two easy options.
First option: convert each column with as.character:
df$col1 = as.character(df$col1)
df$col2 = as.character(df$col2)
... and so on
Second (better) option: convert df into a matrix:
df = as.matrix(df)
Now write df to CSV. Works for me.
Those are all elegant solutions.
For the curious reader who would prefer some R code to ready-made packages, here's an R function that returns a non-list data frame that can be exported and saved as .csv.
output is the "troublesome" data frame in question.
df_unlist<-function(df){
  df<-as.data.frame(df)
  nr<-nrow(df)
  c.names<-colnames(df)
  # find the list columns (apply() would coerce df to a matrix first, so test each column directly)
  lscols<-which(vapply(df,is.list,logical(1)))
  if(length(lscols)!=0){
    for(i in lscols){
      temp<-as.vector(unlist(df[,i]))
      if(length(temp)!=nr){
        # pad with leading zeros when the unlisted column is shorter than the data frame
        adj<-nr-length(temp)
        temp<-c(rep(0,adj),temp)
      }
      df[,i]<-temp
    } #end for
    df<-as.data.frame(df)
    colnames(df)<-c.names
  }
  return(df)
}
Apply the function on dataframe "output" :
newDF<-df_unlist(output)
You can next confirm that the new (newDF) data frame is not 'listed' via apply(). This should successfully return FALSE.
apply(newDF,2,is.list) #2 for column-wise step.
Proceed to save the new dataframe, newDF as a .csv file to a path of your choice.
write.csv(newDF,"E:/Data/newDF.csv")
Assuming that
the path you want to save to is Path, i.e. path=Path, and
df is the dataframe you want to save,
follow these steps:
Save df as txt document:
write.table(df,"Path/df.txt",sep="|")
Read the text file into R:
Data = read.table("Path/df.txt",sep="|")
Now save as csv:
write.csv(Data, "Path/df.csv")
That's it.
# First coerce the data.frame to all-character
df = data.frame(lapply(output, as.character), stringsAsFactors=FALSE)
# write file
write.csv(df,"output.csv")

Replace strings in text based on dictionary

I am new to R and need suggestions.
I have a dataframe with 1 text field in it. I need to fix the misspelled words in that text field. To help with that, I have a second file (dictionary) with 2 columns - the misspelled words and the correct words to replace them.
How would you recommend doing it? I wrote a simple "for loop" but the performance is an issue.
The file has ~120K rows and the dictionary has ~5k rows and the program's been running for hours. The text can have a max of 2000 characters.
Here is the code:
output<-source_file$MEMO_MANUAL_TXT
for (i in 1:nrow(fix_file)) { #dictionary file
  target<-paste0(" ", fix_file$change_to_target[i], " ")
  replace<-paste0(" ", fix_file$target[i], " ")
  output<-gsub(target, replace, output, fixed = TRUE)
}
I would try agrep. I'm not sure how well it scales though.
Eg.
> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
Also check out pmatch and charmatch although I feel they won't be as useful to you.
Here is an example illustrating @joran's comment, using a data.table join. It is very fast (instantaneous here).
library(data.table)
n1 <- 120e3
n2 <- 1e3
set.seed(1)
## create vocab
tt <- outer(letters,letters,paste0)
vocab <- as.vector(outer(tt,tt,paste0))
## create the dictionary
dict <- data.table(miss=sample(vocab,n2,rep=F),
good=sample(letters,n2,rep=T),key='miss')
## the text table
orig <- data.table(miss=sample(vocab,n1,rep=TRUE),key='miss')
orig[dict]
miss good
1: aakq v
2: adac t
3: adxj r
4: aeye t
5: afji g
---
1027: zvia d
1028: zygp p
1029: zyjm x
1030: zzak t
1031: zzvs q
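The lookup idea behind the join can also be sketched in base R for free text: split each text into words once, then replace the words that appear in the dictionary with a single vectorised lookup, instead of running one gsub per dictionary entry. This is a minimal sketch under assumptions: the dictionary and texts are made up, words are separated by single spaces, and fix_words is a hypothetical helper name.

```r
# Made-up dictionary: names are the misspellings, values are the corrections
dict <- c(teh = "the", recieve = "receive")

# Made-up text field
text <- c("teh cat will recieve teh prize", "nothing to fix here")

# Hypothetical helper: replace every word that is a key of dict
fix_words <- function(s, dict) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  hit <- words %in% names(dict)     # which words are known misspellings
  words[hit] <- dict[words[hit]]    # vectorised lookup by name
  paste(words, collapse = " ")
}

fixed <- vapply(text, fix_words, character(1), dict = dict, USE.NAMES = FALSE)
```

Each text is scanned once regardless of dictionary size, which is the main difference from the loop in the question. Punctuation attached to words would need extra handling.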

Resources