I am trying to read a rather big file with the read.table.ffdf method from the library ff. Unfortunately, the column-names of this table contain whitespaces, tabs and other special characters. It looks roughly like this (but with ~400 columns):
attribute_1;next attribute;who creates, these horrible) column&nämes
198705;RXBR ;2017-07-05 00:00:00
This isn't pretty, I know, but I am forced to work with this, so I have to set check.names to FALSE.
Furthermore, I am generating a list with the column-class-types which I do like this:
path <- 'path_to_csv-file'
headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2, check.names = FALSE)
#print(headset)
headclasses <- vector(mode = 'character', length = 0)
#heavily simplified version - switch_statement is in an extra function
for(i in colnames(headset)){
headclasses[[i]] <- switch (i,
'attribute_1' = 'numeric',
'next attribute' = 'factor',
'who creates, these horrible) column&nämes' = 'POSIXct'
)
}
#print(colnames(headset))
#print(headclasses)
Now, if I call:
df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 1e4, VERBOSE = TRUE)
I get the following error:
Error in repnam(colClasses, colnames(x), default = NA) :
the following argument names do not match'next attribute','(who creates, these horrible column&nämes)'
Why do I get this error? And how can I fix it so that I have the uglier strings as column names?
Note that in the previous call, check.names is set to FALSE.
My work so far:
1. Trying with proper names but wrong check.names option when calling read.table.ffdf
If I let R choose proper column names (i.e. check.names = TRUE in the first call to a read method) and adjust the switch statement accordingly, I get no error at all (only a warning), even if I set check.names = FALSE in the read.table.ffdf call:
headset <- read.csv(path, sep= ';', dec= '.', header = TRUE, nrows = 2)
print(headset)
headclasses <- vector(mode = 'character', length = 0)
#heavily simplified version - switch_statement is in an extra function
for(i in colnames(headset)){
headclasses[[i]] <- switch (i,
'attribute_1' = 'numeric',
'next.attribute' = 'factor',
'who.creates..these.horrible..column.nämes' = 'POSIXct'
)
}
print(colnames(headset))
print(headclasses)
my_df <- read.table.ffdf(file=path, levels = NULL, appendLevels = TRUE, FUN = 'read.table', na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2, VERBOSE = TRUE)
print(my_df)
print(colnames(my_df))
"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"
Warning message:
In read.table(na.strings = c("\N", ""), sep = ";", dec = ".", colClasses > = list( :
not all columns named in 'colClasses' exist
So this works, when it shouldn't?
Of course, leaving out check.names when calling read.table.ffdf works in the same way, so somewhere something goes missing.
2. Checking the source code of read.table.ffdf
I went to the rdrr.io site (read.table.ffdf-source-code) to check the source code and tried to understand what I am doing wrong. To cut it short, this is what happens to my file:
rt.args <- list(na.strings = c('\\N',''), sep= ';', dec= '.', colClasses = headclasses, check.names = FALSE , header = TRUE, nrows = 2)
rt.args$file <- path
asffdf_args <- list()
FUN <- 'read.table'
dat <- do.call(FUN, rt.args)
x <- do.call("as.ffdf", c(list(dat), asffdf_args))
#print(colnames(dat))
#print(colnames(x))
and this yields
"attribute_1" "next attribute" "who creates, these horrible) column&nämes"
"attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"
Ok, so this is where it goes wrong.
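Just to confirm the pattern: the mangling matches what make.names() produces, which is also the transformation check.names = TRUE applies in read.table (a quick check, output as I see it; the ä surviving depends on the locale):
make.names(c('attribute_1', 'next attribute',
             'who creates, these horrible) column&nämes'), unique = TRUE)
## "attribute_1" "next.attribute" "who.creates..these.horrible..column.nämes"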
I don't know which asffdf_args to pass, and since I am kind of new to R, I am not sure what exactly to look for other than some kind of check.names equivalent. I already had a look at the as.ffdf.data.frame method via
getAnywhere(as.ffdf.data.frame)
but that didn't help me understand what I should put in.
So, how can I make read.table.ffdf-work with the uglier column-names? Which 'asffdf_args' do I have to pass to make check.names = FALSE work in said method?
I could adapt my switch-statement (for roughly 400 columns), read the file with check.names = TRUE and after read.table.ffdf is done, I could set the column names to the desired ones (since I have to work with the nastier names later on). But this classifies as a workaround for me and does not satisfy me at all.
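For completeness, a rough sketch of that workaround (the key renaming mirrors what check.names = TRUE / make.names() does, so the 400 switch cases would not need hand-editing; whether the final rename of the ffdf works is an assumption, so it is left commented out):
orig_names <- names(headclasses)                              # the original, ugly names
names(headclasses) <- make.names(orig_names, unique = TRUE)   # mangle the keys like check.names = TRUE would
my_df <- read.table.ffdf(file = path, FUN = 'read.table',     # other arguments as in the call above
                         na.strings = c('\\N', ''), sep = ';', dec = '.',
                         colClasses = headclasses, header = TRUE,
                         nrows = 1e4, VERBOSE = TRUE)
# names(my_df) <- orig_names   # assumption: renaming an ffdf this way is untested;
#                              # otherwise keep orig_names around as a lookup table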
This is my first question here, so be gentle with me, if I am overlooking something major and feel free to push me in the right direction.
Thanks in advance for the help.
As is, you probably cannot pass arguments the way you would like to.
as.ffdf.data.frame() calls ffdf() on its last line.
ffdf in turn calls make.names a few times, without checking any arguments.
If you edit ffdf(), and comment out the line vnam <- make.names(vnam, unique = TRUE) towards the very end of the function, then as.ffdf.data.frame() will be able to retain your funky column names.
I am not providing the modified version of ffdf as the function is more than 300 lines long.
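One way to get ffdf_new without retyping it is to patch the deparsed source programmatically (a sketch; it assumes that exact line appears verbatim in the installed ff version):
src <- deparse(ff::ffdf)
idx <- grep("vnam <- make.names(vnam, unique = TRUE)", src, fixed = TRUE)
src[idx] <- paste("#", src[idx])            # comment the offending line out
ffdf_new <- eval(parse(text = src))
environment(ffdf_new) <- asNamespace("ff")  # so it still finds ff's internal helpers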
I have tested with a new function ffdf_new, injecting it as follows:
# save original version
orig <- ff::ffdf
# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("ffdf", ffdf_new)
# simple test below
DF <- data.frame(
'attribute_1' = 1:10,
'next attribute' = 3:12,
'who creates, these horrible) column&nämes' = 11:20,
check.names = FALSE
)
as.ffdf.data.frame(DF)[["who creates, these horrible) column&nämes"]]
## ff (open) integer length=10 (10)
## [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
## 11 12 13 14 15 16 17 18 19 20
# switch back
godmode:::assignAnywhere("ffdf", orig)
Related
I am using lapply to read a list of files. The files have multiple rows and columns, and I am interested in the first row of the first column. The code I am using is:
lapply(file_list, read.csv,sep=',', header = F, col.names=F, nrow=1, colClasses = c('character', 'NULL', 'NULL'))
The first row has three columns, but I am only reading the first one. From other posts on Stack Overflow I found that the way to do this is to use colClasses = c('character', 'NULL', 'NULL'). While this approach works, I would like to know the underlying issue that causes the following warning to be generated, and hopefully prevent it from popping up:
"In read.table(file = file, header = header, sep = sep, quote = quote, :
cols = 1 != length(data) = 3"
It's to let you know that you're keeping only one column of the data out of three, because read.table doesn't know how to handle a colClasses entry of "NULL". Note that your NULL is in quotation marks.
An example:
write.csv(data.frame(fi=letters[1:3],
fy=rnorm(3,500,1),
fo=rnorm(3,50,2))
,file="a.csv",row.names = F)
write.csv(data.frame(fib=letters[2:4],
fyb=rnorm(3,5,1),
fob=rnorm(3,50,2))
,file="b.csv",row.names = F)
file_list=list("a.csv","b.csv")
lapply(file_list, read.csv,sep=',', header = F, col.names=F, nrow=1, colClasses = c('character', 'NULL', 'NULL'))
Which results in:
[[1]]
FALSE.
1 fi
[[2]]
FALSE.
1 fib
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
cols = 1 != length(data) = 3
Which is the same as if you used:
lapply(file_list, read.csv,sep=',', header = F, col.names=F,
nrow=1, colClasses = c('character', 'asdasd', 'asdasd'))
But the warning goes away (and you get the rest of the row as a result) if you do:
lapply(file_list, read.csv,sep=',', header = F, col.names=F,
nrow=1, colClasses = c( 'character',NULL, NULL))
You can see where errors and warnings come from in the source code of a function by entering, for example, read.table directly without parentheses, then searching for your particular warning within it.
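For example (a rough sketch):
src <- deparse(read.table)        # or just type read.table and read the printed source
grep("cols", src, value = TRUE)   # lines involved in the "cols = ... != length(data) = ..." warning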
I have a list containing 2 or more dataframes:
d <- data.frame(x=1:3, y=letters[1:3])
f <- data.frame(x=11:13, y=letters[11:13])
df <- list(d, f)
to save them as .csv, I use the following syntax:
filenames = paste0('C:/Output_', names(df), '.csv')
Map(write.csv, df, filenames)
But I would like to pass some additional arguments to obtain a specific format, like:
quote = FALSE, row.names = FALSE, sep = "\t", na = "", col.names = FALSE
The thing is, I am not sure where to add these arguments. Wherever I put them, I get a warning saying they have been ignored.
> Warning messages:
1: In (function (...) : attempt to set 'col.names' ignored
2: In (function (...) : attempt to set 'sep' ignored
3: In (function (...) : attempt to set 'col.names' ignored
4: In (function (...) : attempt to set 'sep' ignored
Any suggestions? In BaseR preferably!
Why you're still getting col.names warnings: farther down in the documentation (?write.csv) you'll see
These wrappers [write.csv and write.csv2] are deliberately inflexible: they are designed to
ensure that the correct conventions are used to write a valid
file. Attempts to change ‘append’, ‘col.names’, ‘sep’, ‘dec’ or
‘qmethod’ are ignored, with a warning.
Should go away if you use write.table() instead.
You need to use an anonymous function in order to be able to pass further arguments, i.e.
Map(function(...) write.csv(..., quote = FALSE, row.names = FALSE, sep = "\t", na = ""), df, filenames)
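Putting both points together, something along these lines (a sketch using write.table, which does honour sep and col.names) should run without the warnings:
Map(function(d, f) write.table(d, f, quote = FALSE, row.names = FALSE,
                               sep = "\t", na = "", col.names = FALSE),
    df, filenames)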
I'm importing a lot of datasets. All of them have some empty lines at the top (before the header), but it's not always the same number of rows that I need to skip.
Right now I'm using:
df2 <- read_delim("filename.xls",
"\t", escape_double = FALSE,
guess_max=10000,
locale = locale(encoding = "ISO-8859-1"),
na = "empty", trim_ws = TRUE, skip = 9)
But sometimes I only need to skip 3 lines, for example.
Can I somehow set up a rule to skip the rows where my column B (in Excel) starts with one of the following words:
Datastatistik
Overførte records
FI-CA
Oprettet
Column A is always empty, but I delete it in code after the import.
This is an example of my data (I have hidden personal numbers):
My first variable header is called "Bilagsnummer" or "Bilagsnr.".
I don't know if it's possible to set up a rule that says something like "the first occurrence of this word is my header". Really I'm just brainstorming here, because I have no idea how to automate this data import.
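For illustration, a sketch of what such a rule could look like (it assumes the header line really begins with "Bilagsnummer" or "Bilagsnr.", possibly preceded by the empty column A):
temp <- readLines("filename.xls")
hdr  <- grep("^\t?(Bilagsnummer|Bilagsnr\\.)", temp)[1]   # first line that looks like the header
df2  <- read_delim("filename.xls", "\t", escape_double = FALSE,
                   guess_max = 10000,
                   locale = locale(encoding = "ISO-8859-1"),
                   na = "empty", trim_ws = TRUE, skip = hdr - 1)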
---EDIT---
I looked at the post @Bram linked to, and it did solve some of my problem.
I changed some of it.
This is the code I used:
temp <- readLines("file.xls")
skipline <- which(grepl("\tDatastatistik", temp) |
grepl("\tOverførte", temp) |
grepl("FI-CA", temp) |
grepl("Oprettet", temp) |
temp == "")
So the skipline integer vector that I made contains the lines that need to be skipped. These are correctly identified using the grepl function (since the wording at the end of the sentence changes from time to time).
Now, I still have a problem though.
When I use skip = skipline in my read.delim, it only works for the first row.
I get the warning message:
In if (skip > 0L) readLines(file, skip) :
the condition has length > 1 and only the first element will be used
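(The warning appears because skip= expects a single number, not a vector of line numbers.) A possible sketch around this, assuming temp holds the lines read above: drop the unwanted lines yourself and hand the remainder to read_delim as literal data:
keep <- temp[-skipline]                       # drop the lines identified above
df2  <- read_delim(paste(keep, collapse = "\n"), "\t",
                   escape_double = FALSE, guess_max = 10000,
                   na = "empty", trim_ws = TRUE)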
May have found a solution, but not the optimal one. Let's see.
Import your df with the empty lines:
df2 <- read_delim("filename.xls",
"\t", escape_double = FALSE,
guess_max=10000,
locale = locale(encoding = "ISO-8859-1"),
na = "empty", trim_ws = TRUE)
Find the number of empty rows at the beginning:
NonNAindex <- which(!is.na(df2[,2]))
lastEmpty <- (min(NonNAindex)-1)
Re-import your document using that info:
df2 <- read_delim("filename.xls",
"\t", escape_double = FALSE,
guess_max=10000,
locale = locale(encoding = "ISO-8859-1"),
na = "empty", trim_ws = TRUE, skip = lastEmpty)
I'm new to R and seek some digestible guidance. I wish to create a data.frame so I can create columns and establish variables in my data. I start by reading the URL into R and saving it to Excel:
data <- read.delim("http://weather.uwyo.edu/cgi-bin/wyowx.fcgi?
TYPE=sflist&DATE=20170303&HOUR=current&UNITS=M&STATION=wmkd",
fill = TRUE, header = TRUE,sep = "\t" stringsAsFactors = TRUE,
na.strings = " ", strip.white = TRUE, nrows = 27, skip = 9)
write.xlsx(data, "E:/Self Tutorial/R/data.xls")
This data has missing values somewhere in the middle of the elements, which makes the lengths irregular. Due to the irregular length I use write.table instead of data.frame.
As a first attempt, in the global environment dat.table shows up as a NULL value, not as data:
dat.table = write.table(data)
str(dat.table) # just checking #result NULL?
try again
dat.table = write.table(data,"E:/Self Tutorial/R/data.xls", sep = "\t", quote = FALSE)
dat.table ##print nothing
remove sep =
dat.table = write.table(data, "E:/Self Tutorial/R/data.xls", quote = FALSE)
dat.table ##print nothing
Since it's not working, I try read.table:
dat.read <- read.table("E:/Self Tutorial/R/data.xls", header = T, sep = "\t")
The data loads in the R console, but, as expected, with an irregular column distribution (even though I already use na.strings = " " and strip.white = TRUE in the read.delim arguments).
What should I understand from this mistake, and what is it? Thank you.
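(For context, write.table() only writes a file and returns nothing, which is why dat.table ends up NULL; an object in memory comes from data.frame() itself or from the read functions. Roughly, with a hypothetical path:)
dat <- data.frame(a = 1:3, b = c("x", "y", "z"))                           # object in memory
write.table(dat, "E:/Self Tutorial/R/dat.txt", sep = "\t", quote = FALSE)  # side effect only, returns NULL
dat.back <- read.table("E:/Self Tutorial/R/dat.txt", header = TRUE, sep = "\t")  # back into memory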
I would be delighted and most grateful if anyone can explain to me why I am having a problem exporting some data from a function which extracts coefficients from a linear model. I have hundreds to do, so I'm hoping to build a loop to handle it, but have fallen at an earlier hurdle.
I am using methods borrowed from someone much smarter at this stuff:
https://stat.ethz.ch/pipermail/r-sig-ecology/2008-May/000062.html
The relevant bits (data creation, the function, and finally my attempt to export my data) are below, but first I will mention that the data, “export”, is written to the experiment.csv file as a single COLUMN. I am told that the append argument of the write.table functions only works with rows. Consequently, it overwrites previous runs of the same set of commands rather than successfully appending to them.
The warning messages are of the form below (they are all the same, one for each piece of information):
Warning messages:
1: In write.csv(export, file = "experiment.csv", append = TRUE, quote = TRUE, :
attempt to set 'append' ignored
#DATA CREATION
# create an empty list
mod <- list()
# start a loop for create 5 objects of class 'lm'
for (i in 1:5) {
x <- rnorm(i*10)
y <- rnorm(i*10)
mod[[paste("run",i,sep="")]] <- lm(y ~ x)
}
# FUNCTION TO EXTRACT DATA
myFun <-
function(lm)
{
out <- c(lm$coefficients[1],
lm$coefficients[2],
length(lm$model$y),
summary(lm)$coefficients[2,2],
pf(summary(lm)$fstatistic[1], summary(lm)$fstatistic[2],
summary(lm)$fstatistic[3], lower.tail = FALSE),
summary(lm)$r.squared)
names(out) <- c("intercept","slope","n","slope.SE","p.value","r.squared")
return(out)}
# FAILED ATTEMPT TO EXPORT
export <- myFun(mod$run1)
write.csv(export, file = "experiment.csv", append = TRUE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = FALSE,
col.names = FALSE, qmethod = c("escape", "double"),
fileEncoding = "")
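For reference, ?write.csv notes that attempts to change append (and sep, col.names, qmethod) are deliberately ignored; a sketch using write.table instead, which does honour append, writing each model's results as one row:
# header once, then one row per model appended beneath it
write.table(t(myFun(mod$run1)), file = "experiment.csv", sep = ",",
            row.names = FALSE, col.names = TRUE)
for (m in mod[-1]) {
  write.table(t(myFun(m)), file = "experiment.csv", append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}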