make csv data import case insensitive - r

I realize this is a total newbie question (as always in my case), but I'm trying to learn R, and I need to import hundreds of CSV files that have the same structure, but in some the column names are uppercase and in some they are lowercase.
so I have (for now)
flow0300csv <- Sys.glob("./csvfiles/*0300*.csv")
for (fileName in flow0300csv) {
  flow0300 <- read.csv(fileName, header = TRUE, sep = ";",
                       colClasses = "character")[, c("CODE", "CLASS", "NAME")]
}
but I get an error because of the lowercase names. I have tried to apply tolower() but I can't make it work. Any tips?

The problem here isn't in reading the CSV files, it's in trying to index using column names that don't actually exist in your "lowercase" data frames.
You can instead use grep() with ignore.case = TRUE to index to the columns you want.
tmp <- read.csv(fileName, header = TRUE, sep = ";",
                colClasses = "character")
ind <- grep(pattern = "code|class|name", x = colnames(tmp),
            ignore.case = TRUE)
tmp[, ind]
You may want to look into readr::read_csv2() or even data.table::fread() for better performance.
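For example, a minimal sketch of the same case-insensitive selection with data.table::fread(), reading all columns as character as in the question (untested, so treat it as an assumption rather than a drop-in answer):
library(data.table)
tmp <- fread(fileName, sep = ";", colClasses = "character")
ind <- grep("code|class|name", colnames(tmp), ignore.case = TRUE)
tmp[, ..ind]  # the ..prefix tells data.table that ind holds column positions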

After reading the .csv file you may want to convert the column names to all uppercase with
flow0300 <- read.csv(fileName, header = T, sep = ";", colClasses = "character")
colnames(flow0300) <- toupper(colnames(flow0300))
flow0300 <- flow0300[, c("CODE", "CLASS", "NAME")]
EDIT: Extended solution with the input of @xraynaud.
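Putting the pieces together for the hundreds of files from the question, a minimal sketch (the read_one helper name is hypothetical, not from the original answers):
flow0300csv <- Sys.glob("./csvfiles/*0300*.csv")
read_one <- function(fileName) {
  flow <- read.csv(fileName, header = TRUE, sep = ";",
                   colClasses = "character")
  colnames(flow) <- toupper(colnames(flow))  # normalise the case first
  flow[, c("CODE", "CLASS", "NAME")]
}
flow0300 <- do.call(rbind, lapply(flow0300csv, read_one))  # one combined data frame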

Related

numeric fields turning into "char" while using stringsAsFactors = F

I am trying to import a few csv files from a specific folder:
setwd("C://Users//XYZ//Test")
filelist = list.files(pattern = ".*.csv")
datalist = lapply(filelist, FUN = read.delim, sep = ',', header = TRUE,
                  stringsAsFactors = F)
for (i in 1:length(datalist)) {
  datalist[[i]] <- cbind(datalist[[i]], filelist[i])
}
Data = do.call("rbind", datalist)
After I use the above code, a few columns are of type character despite containing numbers. If I don't use stringsAsFactors = F, the fields are read as factor, which turns into missing values when I use as.numeric(as.character()) later on.
Is there any solution so that I can keep some fields as numeric? The fields that I want to be as numeric look like this:
Price.Plan  Feature.Charges
$180.00     $6,307.56
$180.00     $5,431.25
Thanks
The $ and , characters are not considered numeric, so with stringsAsFactors = FALSE in read.delim() those columns are assigned type character. To change that, remove the $ and , with gsub(), convert to numeric, and assign the result back to the particular columns:
cols <- c("Price.Plan", "Feature.Charges")
df[cols] <- lapply(df[cols], function(x) as.numeric(gsub("[$,]", "", x)))
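A quick check, constructing a small data frame from the sample rows above:
df <- data.frame(Price.Plan = c("$180.00", "$180.00"),
                 Feature.Charges = c("$6,307.56", "$5,431.25"),
                 stringsAsFactors = FALSE)
cols <- c("Price.Plan", "Feature.Charges")
df[cols] <- lapply(df[cols], function(x) as.numeric(gsub("[$,]", "", x)))
str(df)
# 'data.frame': 2 obs. of  2 variables:
#  $ Price.Plan     : num  180 180
#  $ Feature.Charges: num  6308 5431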

Adding a column of characters based on file name in R

I have several hundred .pet files containing information organized by date code (19960101 is YYYYMMDD format). I'm trying to add a column, NDate, with the date code:
for (pet.atual in files.pet) {
  data.pet.atual <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  data.pet.atual <- cbind(data.pet.atual, NDate = pet.atual)
}
What I'm trying to achieve, for example, is NDate = 19960101 for 01-01-1996, NDate = 19960102 for 02-01-1996, and so on. Still, the for loop just replaces the NDate field every time it runs with the latest pet.atual. Ideas? Thanks
A small modification should do the trick:
data.pet.atual <- NULL
for (pet.atual in files.pet) {
  tmp.data <-
    read.table(file = pet.atual,
               header = FALSE,
               sep = ",",
               quote = "\"",
               comment.char = ";")
  tmp.data <- cbind(tmp.data, NDate = pet.atual)
  data.pet.atual <- rbind(data.pet.atual, tmp.data)
}
You can also replace the tmp.data <- cbind(...) line with tmp.data$NDate <- pet.atual.
You may also try fread() and rbindlist() from the data.table package (untested due to lack of a reproducible example):
library(data.table)
result <- rbindlist(lapply(files.pet, fread), idcol = "NDate")
result[, NDate := anytime::anydate(files.pet[NDate])]
lapply() "loops" over all entries in files.pet executing fread() for each entry and returns a list with the data.tables fread has created from reading each file. rbindlist() is used to combine all pieces into one large data.table. The parameter idcol = NDate generates an index column named NDate to identify the origin of each row in the final output. The ids are integer numbers 1 to the length of the list (if the list is not named).
Finally, the id number is used to lookup the file name in files.pet which is directly converted to class Date using the anytime package.
EDIT: Perhaps it might be more efficient to convert the file names to Date first, before looking them up:
result[, NDate := anytime::anydate(files.pet)[NDate]]
Although fread() is pretty smart in analysing and guessing the right parameters for reading the files, it might be necessary (and perhaps faster as well) to supply additional parameters, e.g.:
result <- rbindlist(lapply(files.pet, fread, header = FALSE, sep = ","), idcol = "NDate")
Yes, lapply will help, as Frank suggests. And you want to use rbind to keep the dates different for each file. Something along the lines of:
I'm assuming files.pet is a list of all the files you want to include...
my.fun <- function(file) {
  data <- read.table(file = file,
                     header = FALSE,
                     sep = ",",
                     quote = "\"",
                     comment.char = ";")
  data$NDate <- file
  return(data)
}
data.pet.atual <- do.call(rbind.data.frame, lapply(files.pet, FUN = my.fun))
I can't test this without a reproducible example, so you may need to play with it a bit, but the general approach should work!

Importing multiple textfiles in R (with a twist)

Good day,
I have 33 *.tsv (tab-delimited) files in a directory. All files have the same row.names, but different columns.
I want to simultaneously import all files, and the final product should be a list of 33 data frames (or matrices) with names according to their file names.
data <- lapply(dir(), read.table) is not working as intended. The resulting list entries are factors due to row.names.
data <- lapply(dir(), read.table, row.names = 1, header = TRUE, sep = "\t", dec = ".") does not work because of duplicate row.names errors.
The same is true when applying a solution presented here.
Another option would be to import one big single file and then split it into 33 objects by header names (which are suffixed with _1, _2, _3, and so on, also including character strings after the underscore).
Any help is appreciated as usual.
Not very elegant, but how about
data <- lapply(dir(), read.table, header = TRUE, sep = "\t", dec = ".")
data <- lapply(data, function(x) { rownames(x) <- x[, 1]; x })
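To also get the list entries named after their file names, as the question asks, a minimal sketch (assuming the working directory holds only the 33 .tsv files):
files <- dir(pattern = "\\.tsv$")
data <- lapply(files, read.table, header = TRUE, sep = "\t", dec = ".")
data <- lapply(data, function(x) { rownames(x) <- x[, 1]; x[, -1] })  # first column becomes the row names
names(data) <- tools::file_path_sans_ext(files)  # name each entry after its file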

How to remove outlier values while reading a file

I have a large number of files, each in tab-delimited format. I need to apply some modeling (glm/gbm etc.) on each of these files. They are obtained from hospital data where, in exceptional cases, entries may not be in the proper format. For example, when entering the glucose level for a patient, the data entry operator may enter N or A by mistake instead of an actual number.
While reading these files in a loop, I am encountering problems because such columns (glucose) are treated as factor while they should be numeric. It is painful to investigate each file and look for errors. I am reading the files in the following way, but it is certainly not a good approach.
read.table(fn, header = TRUE, sep = "\t", na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A"))
Is there any other function through which I can treat those values/outliers as NA?
You can flag which elements are valid numbers (for the glucose case); FALSE marks the bad entries:
data = read.csv(file, as.is = TRUE, sep = '\t')  # don't convert strings to factors
glucose = data$glucose
sapply(glucose, function(x) !is.na(as.numeric(x)), USE.NAMES = FALSE)
Then you can work with these indexes (interpolate or remove).
To loop the files:
files = list.files(path, '*.csv')
for (file in files) {
  data = read.csv(file, sep = '\t', as.is = TRUE)
  gluc = data$glucose
  idxs = sapply(gluc, function(x) !is.na(as.numeric(x)), USE.NAMES = FALSE)
  # interpolate or remove here
}
Use the colClasses argument to read.table() and friends to specify which columns should be numeric; then R does not need to guess. Note that for a column designated numeric, only entries listed in na.strings are converted to NA; any other non-numeric entry will raise an error rather than being converted silently, so combine colClasses with a suitable na.strings vector (as in the question), or read the column as character and convert it afterwards.
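A sketch of that read-as-character-then-convert variant, which turns every malformed entry into NA with a warning instead of stopping the read (assuming a glucose column as in the question):
dat <- read.table(fn, header = TRUE, sep = "\t", colClasses = "character")
dat$glucose <- as.numeric(dat$glucose)  # "N", "A", etc. become NA with a warning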

How can I write out multiple files with different filenames in R

I have one BIG file (>10,000 lines of data) and I want to write out a separate file by ID. I have 50 unique ID names and I want a separate text file for each one. Here's what I've got so far, and I keep getting errors. My ID is actually a character string; I would prefer to name each file after that character string.
for (i in 1:car$ID) {
  a <- data.frame(car[, i])
  carib <- car1[, (c("x", "y", "time", "sd"))]
  myfile <- gsub("( )", "", paste("C:/bridge", carib, "_", i, ".txt"))
  write.table(a, file = myfile,
              sep = "", row.names = F, col.names = T, quote = FALSE, append = FALSE)
}
One approach would be to use the plyr package and the d_ply() function. d_ply() expects a data.frame as input. You also provide the column(s) you want to slice and dice that data.frame by, so each piece can be operated on independently of the others. In this case, you have the column ID. This specific function does not return an object, and is thus useful for plotting, making charts iteratively, etc. Here's a small working example:
library(plyr)
dat <- data.frame(ID = rep(letters[1:3], 2), x = rnorm(6), y = rnorm(6))
d_ply(dat, "ID", function(x)
  write.table(x, file = paste(x$ID[1], "txt", sep = "."), sep = "\t",
              row.names = FALSE))
This will generate three tab-separated files with the ID values as the file names (a.txt, b.txt, c.txt).
EDIT: to address the follow-up question.
You could always subset the columns you want before passing it into d_ply(). Alternatively, you can use/abuse the [ operator and select the columns you want within the call itself:
dat <- data.frame(ID = rep(letters[1:3], 2), x = rnorm(6), y = rnorm(6),
                  foo = rnorm(6))
d_ply(dat, "ID", function(x)
  write.table(x[, c("x", "foo")], file = paste(x$ID[1], "txt", sep = "."),
              sep = "\t", row.names = FALSE))
For the data frame called mtcars separated by mtcars$cyl:
lapply(split(mtcars, mtcars$cyl),
       function(x) write.table(x, file = paste(x$cyl[1], ".txt", sep = "")))
This produces "4.txt", "6.txt", "8.txt" with the corresponding data. This should be faster than looping/subsetting since the subsetting (splitting) is vectorized.
