I have a large number of files, each in tab-delimited format. I need to apply some modeling (glm/gbm etc.) to each of these files. They are obtained from hospital data, where in exceptional cases entries may not be in the proper format. For example, when entering the glucose level for a patient, the data entry operator may mistakenly enter N or A instead of an actual number.
While reading these files in a loop, I am running into a problem: such columns (glucose) are treated as a factor when they should be numeric. It is painful to investigate each file and look for the errors. I am reading the files in the following way, but it is certainly not a good approach.
read.table(fn, header = TRUE, sep= "\t" , na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A"))
Is there any other function through which I can treat those values/outliers as NA?
You can inspect which elements are not numbers (for the glucose case):
data = read.csv(file, as.is = TRUE, sep = '\t')  # don't convert strings to factors
glucose = data$glucose
# TRUE for entries that cannot be parsed as numbers
sapply(glucose, function(x) is.na(as.numeric(x)), USE.NAMES = FALSE)
Then you can work with those indices (interpolate over them or remove those rows).
To loop over the files:
files = list.files(path, pattern = '\\.csv$', full.names = TRUE)
for (file in files)
{
  data = read.csv(file, sep = '\t', as.is = TRUE)
  gluc = data$glucose
  # TRUE for entries that cannot be parsed as numbers
  idxs = sapply(gluc, function(x) is.na(as.numeric(x)), USE.NAMES = FALSE)
  # interpolate or remove here
}
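Inside the loop, in place of the "# interpolate or remove here" comment, a minimal option (a sketch, assuming it is enough to turn the bad entries into NA rather than interpolate) is:

  # coerce the whole column: anything that is not a valid number becomes NA
  data$glucose = suppressWarnings(as.numeric(data$glucose))
  # ...or drop the offending rows instead
  # data = data[!idxs, ]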
Use the colClasses argument to read.table and friends to specify which columns should be numeric; then R does not need to try to guess. Values listed in na.strings become NA. Be aware that any other entry that cannot be converted will make read.table stop with an error naming the offending value, so you can also read the column as character and coerce it afterwards with as.numeric(), which turns anything non-numeric into NA (with a warning).
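For example, a minimal sketch (assuming the glucose column is literally named glucose; a named colClasses vector leaves the remaining columns to be guessed):

d <- read.table(fn, header = TRUE, sep = "\t",
                colClasses = c(glucose = "numeric"),
                na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A"))
# or: read everything as-is and coerce just the problem column afterwards
# d$glucose <- suppressWarnings(as.numeric(as.character(d$glucose)))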
I realize this is a total newbie question (as always in my case), but I'm trying to learn R, and I need to import hundreds of csv files that have the same structure, except that in some the column names are uppercase and in some they are lowercase.
so I have (for now)
flow0300csv <- Sys.glob("./csvfiles/*0300*.csv")
for (fileName in flow0300csv) {
  flow0300 <- read.csv(fileName, header = TRUE, sep = ";",
                       colClasses = "character")[, c('CODE', 'CLASS', 'NAME')]
}
but I get an error because of the lowercase names. I have tried to apply tolower but I can't make it work. Any tips?
The problem here isn't in reading the CSV files, it's in trying to index using column names that don't actually exist in your "lowercase" data frames.
You can instead use grep() with ignore.case = TRUE to index into the columns you want.
tmp <- read.csv(fileName, header = TRUE, sep = ";",
                colClasses = "character")
ind <- grep(pattern = "code|class|name", x = colnames(tmp),
            ignore.case = TRUE)
tmp[, ind]
You may want to look into readr::read_csv2() or even data.table::fread() for better performance.
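For instance, a rough sketch with data.table::fread (assuming the same semicolon-separated files as above), combined with the case-insensitive grep() from before:

library(data.table)
tmp <- fread(fileName, sep = ";", colClasses = "character")
ind <- grep("code|class|name", names(tmp), ignore.case = TRUE)
tmp[, ..ind]  # data.table syntax for selecting columns by an index vector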
After reading the .csv file you may want to convert the column names to all uppercase with
flow0300 <- read.csv(fileName, header = TRUE, sep = ";", colClasses = "character")
colnames(flow0300) <- toupper(colnames(flow0300))
flow0300 <- flow0300[, c("CODE", "CLASS", "NAME")]
EDIT: Extended solution with the input of @xraynaud.
I have several CSV files which contain numbers in the local German style, i.e. with a comma as the decimal separator and a point as the thousands separator, e.g. 10.380,45. The values in the CSV files are separated by ";". The files also contain columns of the classes character, Date, Date & Time, and logical.
The problem with the read.table functions is that you can specify the decimal separator with dec = ",", but NOT the thousands separator. (If I'm wrong, please correct me.)
I know that preprocessing is a workaround, but I want to write my code in a way that others can use it without me.
I found a way to read the CSV file the way I want it with read.csv2, by setting my own classes, as can be seen in the following example.
Based on Most elegant way to load csv with point as thousands separator in R
# Create test example
df_test_write <- cbind.data.frame(
  c("a","b","c","d","e","f","g","h","i","j", rep("k", times = 200)),
  c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00",
    "80,33","20.500.000,00","10,00", rep("3.133,33", times = 200)),
  c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016",
    "08.08.2016","08.08.2016","08.08.2016","08.08.2016", rep("08.08.2016", times = 200)),
  stringsAsFactors = FALSE)
colnames(df_test_write) <- c("col_text", "col_num", "col_date")

# write test csv
write.csv2(df_test_write, file = "Test.csv", quote = FALSE, row.names = FALSE)
#### read with read.csv2 ####
# First, define your own classes

# define your own numeric class
setClass('myNum')
# define conversion
setAs("character", "myNum",
      function(from) as.numeric(gsub(",", "\\.", gsub("\\.", "", from))))

# own date class
library(lubridate)
setClass('myDate')
setAs("character", "myDate", function(from) dmy(from))

# Read the csv file; in colClasses the column classes can be defined
df_test_readcsv <- read.csv2(paste0(getwd(), "/Test.csv"),
                             stringsAsFactors = FALSE,
                             colClasses = c(
                               col_text = "character",
                               col_num  = "myNum",
                               col_date = "myDate"
                             ))
My problem now is that the different datasets have up to 200 columns and 350,000 rows. With the above solution I need between 40 and 60 seconds to load one CSV file, and I would like to speed this up.
Through my research I found fread() from the data.table package, which is really fast. It takes approximately 3 to 5 seconds to load the CSV file.
Unfortunately, there is also no way to specify the thousands separator. So I tried to use my solution with colClasses, but there seems to be the issue that you can't use custom classes with fread: https://github.com/Rdatatable/data.table/issues/491
See also my following test code:
##### read with fread ####
library(data.table)

# Test without colClasses
df_test_readfread1 <- fread(paste0(getwd(), "/Test.csv"),
                            stringsAsFactors = FALSE,
                            dec = ",",
                            sep = ";",
                            verbose = TRUE)
str(df_test_readfread1)
# PROBLEM: In my real dataset it turns the number into a numeric column;
# unfortunately it sees the "." as the decimal separator, so it turns
# e.g. 10.550 into 10.5
# Here it keeps everything as character

# Test with colClasses
df_test_readfread2 <- fread(paste0(getwd(), "/Test.csv"),
                            stringsAsFactors = FALSE,
                            colClasses = c(
                              col_text = "character",
                              col_num  = "myNum",
                              col_date = "myDate"
                            ),
                            sep = ";",
                            verbose = TRUE)
str(df_test_readfread2)
# Keeps everything as character
So my question is: Is there a way to read CSV files with numeric values like 10.380,45 with fread?
(Alternatively: What is the fastest way to read a CSV with such numeric values?)
I have never used the package myself, but it's from Hadley Wickham, so it should be good stuff:
https://cran.r-project.org/web/packages/readr/readr.pdf
It's supposed to handle locales:
locale(date_names = "en", date_format = "%AD", time_format = "%AT",
       decimal_mark = ".", grouping_mark = ",", tz = "UTC",
       encoding = "UTF-8", asciify = FALSE)
decimal_mark and grouping_mark are what you're looking for.
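For example, a minimal sketch with read_delim() (assuming the semicolon-separated Test.csv from above):

library(readr)
# German-style numbers: comma as decimal mark, point as grouping mark
de <- locale(decimal_mark = ",", grouping_mark = ".")
df <- read_delim("Test.csv", delim = ";", locale = de,
                 col_types = cols(col_num = col_number()))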
EDIT from PhiSeu: Solution
Thanks to your suggestion, here are two solutions with read_csv2() from the readr package. For my 350,000-row CSV file it takes approximately 8 seconds, which is much faster than the read.csv2 solution.
(Another helpful package from Hadley and RStudio, thanks!)
library(readr)

# solution 1 with specified columns
df_test_readr <- read_csv2(paste0(getwd(), "/Test.csv"),
                           locale = locale("de"),
                           col_names = TRUE,
                           cols(
                             col_text = col_character(),
                             col_num  = col_number(),  # numbers are recognized automatically through locale = "de"
                             col_date = col_date(format = "%d.%m.%Y")  # date specification
                           ))

# solution 2 with an overall definition of the date format
df_test_readr <- read_csv2(paste0(getwd(), "/Test.csv"),
                           locale = locale("de", date_format = "%d.%m.%Y"),  # date format for the whole file
                           col_names = TRUE)
Remove all the thousands separators (the dots) first, maybe.
filepath <- paste0(getwd(), "/Test.csv")
filestring <- readChar(filepath, file.info(filepath)$size)
# strip the thousands separators, then let fread parse the comma decimals
filestring <- gsub('.', '', filestring, fixed = TRUE)
fread(filestring, dec = ",")
I have a big data set which consists of around 94 columns and 3 million rows. This file has single as well as multiple spaces as the delimiter between columns. I need to read some columns from this file in R. For this I tried using read.table() with the options that can be seen in the code below:
### Define the columns to be read from the file: the first 5 columns, then skip the next 24, then read the next 5 columns. The last 60 columns are not read in.
col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))
### Reading first 100 rows of the data
data <- read.table(file, sep = " ", header = FALSE, nrows = 100, na.strings = "", stringsAsFactors = FALSE)
Since the file to be read in has more than one space as the delimiter between some of the columns, the above method does not work. Is there any method with which we can read in this file efficiently?
You need to change your delimiter. " " refers to one whitespace character; "" means that any amount of whitespace acts as the delimiter.
data <- read.table(file, sep = "", header = FALSE, nrows = 100,
                   na.strings = "", stringsAsFactors = FALSE)
From the manual:
If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.
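Tying this together with the column selection from the question, a rough sketch (the col_classes vector is copied from the question; "NULL" skips a column entirely):

col_classes <- c(rep("character", 2), rep("numeric", 3),
                 rep("NULL", 24), rep("numeric", 5), rep("NULL", 60))
data <- read.table(file, sep = "", header = FALSE, nrows = 100,
                   colClasses = col_classes, na.strings = "",
                   stringsAsFactors = FALSE)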
Also, with a large data file you may want to consider data.table::fread to read data straight into a data.table very quickly. I was using this function myself this morning. It is still experimental, but I find it works very well indeed.
If you want to use the tidyverse (specifically the readr package) instead, you can use read_table.
read_table(file, col_names = TRUE, col_types = NULL,
           locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
           guess_max = min(n_max, 1000), progress = show_progress(), comment = "")
And see here in the description:
read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) columns of space.
If your fields have a fixed width, you should consider using read.fwf(), which might handle missing values better.
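For example, a minimal read.fwf() sketch (the widths below are purely hypothetical; replace them with the actual field widths of your file):

# two 10-character ID fields followed by three 8-character numeric fields;
# blank fields come back as NA
data <- read.fwf(file, widths = c(10, 10, 8, 8, 8),
                 header = FALSE, na.strings = "", stringsAsFactors = FALSE)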
I have a 3 million row, 500 column dataset. Although the columns are numeric, when importing from the csv file they are all treated as factor, not numeric. I am trying to convert them back to numeric with the command
wikifixedn<-as.numeric(as.character(wikifixed))
wikifixed is the dataframe.
It's taking forever... My MacBook Pro, with 16 GB RAM and a 2.3 GHz Core i7, has been churning at this for more than an hour. Can I see somewhere how far along I am in the process, or whether the process is moving along at all? Is there another, faster method to deal with the conversion problem?
BTW: I tried, when importing the csv file, to force the columns to be treated as numeric using
> wikifixed<-read.csv('~/OneDrive/kredible/finaldata/wutao/wikipediausers.csv', header = TRUE, stringsAsFactors=F)
Yet, when checking I get
> is.numeric(wikifixed)
[1] FALSE
See here
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
You probably should create a vector for colClasses (see the sketch after the manual excerpts below).
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
stringsAsFactors
logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.
colClasses
character. A vector of classes to be assumed for the columns. Recycled as necessary, or if the character vector is named, unspecified values are taken to be NA.
Possible values are NA (the default, when type.convert is used), "NULL" (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.
Note that colClasses is specified per column (not per variable) and so includes the column of row names (if any).
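A rough sketch of that (assuming all 500 columns really are numeric; adjust the vector if some columns, e.g. IDs or dates, need another class):

wikifixed <- read.csv('~/OneDrive/kredible/finaldata/wutao/wikipediausers.csv',
                      header = TRUE,
                      colClasses = rep("numeric", 500))
# Note: is.numeric() on a whole data frame is always FALSE; check the columns
# instead, e.g. all(sapply(wikifixed, is.numeric))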
ALSO see here in case you want to switch to data.table, because you may run into more issues:
fread in R imports a large .csv file as a data frame with one row
require(data.table)
fread("pre2012_alldatapoints.csv", sep = ",", header= TRUE)
and read the data.table FAQ at https://github.com/Rdatatable/data.table/wiki