Why is read.csv getting wrong classes?

I have to read a big .csv file, and read.csv is taking a while. I read that I should use read.csv to read just a few rows, get the column classes from those, and then read the whole file with the classes specified. I tried to do that:
library(magrittr)  # for the %>% pipe

read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double) data was incorrectly classified as integer, probably because some numeric columns contain only whole numbers (many zeros) in the first rows. So I tried to increase the number of rows in the first read.csv call, but that did not work. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this, the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks

I suspect you are correct: in your row sample some columns contain only integers, but outside the sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify the column type for the columns where you see this happening, as sketched below.
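For instance, a minimal sketch of the second option, overriding only the problem column by name before the full read (the column name "flux" here is hypothetical):
# Override only the misclassified column; keep the guessed classes
# for everything else ("flux" stands in for your problem column).
col.classes["flux"] <- "numeric"
df_astro_data <- read.csv(full_path_astro_data,
                          colClasses = col.classes,
                          nrows = 47000,
                          stringsAsFactors = FALSE)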
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))

Related

Read a csv file with quotation marks and regex in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as the column names. The fields are all separated by commas, and the values are in quotation marks except for the first one, and I think that is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later, that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, all the values end up in one column.
I work in France, but I don't know which locale or encoding the file was created with (by the way, I'm not French, so I'm not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values all end up in the first column;
then with
library(readr)
df <- read_delim("file.csv",
                 delim = ",",
                 quote = "",
                 escape_double = FALSE,
                 escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, while the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect.
Both read_delim and fread give warning messages; I can include them if they might be useful.
Update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it still splits the regex and msg columns in two, but ne and class are distinct.
I tried read.csv with the input you provided and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. an escaped double-quote character: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 row and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# > 1
ncol(df)
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

Set column types for csv with read.csv.ffdf

I am using a payments dataset from Austin Text Open Data. I am trying to load the data with the following code:-
library(ff)
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       first.rows = 100, next.rows = 50,
                       FUN = "read.csv", VERBOSE = TRUE)
This shows me the following error:
read.table.ffdf 301..Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '7AHM'
This happens at the 339th line of the csv file, in the 5th column of the dataset. I think the reason is that all earlier values in the 5th column are integers, whereas this one is a string; the actual type of the column should be string.
So I wanted to know if there is a way I could set the types of the columns.
Below I am providing the types for all the columns in a vector:
c("character","integer","integer","character","character", "character","character","character","character","character","integer","character","character","character","character","character","character","character","integer","character","character","character","character","character","integer","integer","integer","character","character","character","character","double","character","integer")
You can also find the type of each column from the description of the dataset.
Please also keep in mind that I am very new to this library. Practically just found out about it today.
Maybe you need to transform your data types. The following is just an example that may help you:
data <- transform(
  data,
  age = as.integer(age),
  sex = as.factor(sex),
  cp = as.factor(cp),
  trestbps = as.integer(trestbps),
  choi = as.integer(choi),
  fbs = as.factor(fbs),
  restecg = as.factor(restecg),
  thalach = as.integer(thalach),
  exang = as.factor(exang),
  oldpeak = as.numeric(oldpeak),
  slope = as.factor(slope),
  ca = as.factor(ca),
  thai = as.factor(thai),
  num = as.factor(num)
)
sapply(data, class)
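That converts the columns after the fact, though; since the failure happens during the read, it may be more direct to pass the column types through read.table.ffdf, which forwards extra arguments to its FUN (here read.csv). A sketch, assuming the vector from the question is stored as col_types:
library(ff)
# col_types is the vector of 34 column classes listed in the question;
# it is forwarded to read.csv via read.table.ffdf's "..." argument.
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       FUN = "read.csv",
                       colClasses = col_types,
                       first.rows = 100,
                       next.rows = 50,
                       VERBOSE = TRUE)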

How to convert a factor type into a numeric type in R after reading a csv file?

After reading a csv file
data <- read.table(paste0("C:/Users/data/", "30092017ARB.csv"),
                   header = TRUE, sep = ";")
I get factor as the type for nearly all numeric variables, especially the last column.
I tried all the suggestions here. However, I get a warning for all of them:
Warning message:
NAs introduced by coercion
Someone even mentioned in this post:
"Every answer in this post failed to generate results for me, NAs were getting generated."
Any idea how I can solve this problem?
Addendum: I also tried one possible approach suggested here (shown in a screenshot that is not reproduced), but I always get the same NA.
The percent sign is clearly the problem. Replace the "%" with the empty string, "", and then convert to numeric.
data[[3]] <- sub("%", "", data[[3]])
data[[3]] <- as.numeric(data[[3]])
You can do this in one line of code,
data[[3]] <- as.numeric(sub("%", "", data[[3]]))
Also, two notes on reading the data in.
First, some files use the semicolon as the column separator. This is common in countries where the comma is the decimal mark, and it is why R has two functions to read files in the CSV format.
These functions are both calls to read.table with some defaults changed.
read.csv - Sets arguments header = TRUE and sep = ",".
read.csv2 - Sets arguments header = TRUE, sep = ";" and dec = ",".
For a full explanation see read.table or at an R prompt run help("read.table").
Second, you can avoid factor problems if you use argument stringsAsFactors = FALSE from the start, when reading in the data.
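Putting the two notes together with the percent fix, a minimal sketch (assuming the file from the question, with percent values like "12,5%" in the third column):
# Semicolon-separated file: read.csv2 sets sep = ";" and dec = ",";
# stringsAsFactors = FALSE keeps text columns as character.
data <- read.csv2("C:/Users/data/30092017ARB.csv",
                  stringsAsFactors = FALSE)
# Strip the percent sign, turn the comma decimal into a dot
# (as.numeric() expects a dot), then convert.
data[[3]] <- as.numeric(sub(",", ".", sub("%", "", data[[3]])))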

write.table unintendedly adds subscript x to header

I have a comma-delimited csv document with predefined headers and a few rows. I just want to change the comma delimiter to a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X subscripts in header columns. If I set check.names = TRUE, the column headers have a X subscript.
Now I am trying to write a new csv with pipe-delimiter.
write.table(myData, file = "C:/test_pipe.CSV",
            row.names = FALSE, na = "", col.names = TRUE, sep = "|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
The import seems fine, but unfortunately the X subscripts in the column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original csv test.csv was created with Excel, of course without X subscripts in column headers.
Thanks in advance
You have to keep using check.names = FALSE, also the second time.
Otherwise your header will be modified, because apparently it contains variable names that would not be considered syntactically valid names for the columns of a data.frame: special characters would be replaced by dots (.), and names starting with a number would be prefixed with X.
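A minimal sketch of the full round trip with check.names = FALSE on both reads (paths as in the question):
# Read, write with a pipe delimiter, and re-read; check.names = FALSE
# on both reads keeps the original Excel headers untouched.
myData <- read.csv("C:/test.CSV", header = TRUE, sep = ",",
                   check.names = FALSE)
write.table(myData, file = "C:/test_pipe.CSV",
            row.names = FALSE, na = "", col.names = TRUE, sep = "|")
mydata.test <- read.csv("C:/test_pipe.CSV", header = TRUE, sep = "|",
                        check.names = FALSE)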

How to delete specific rows from multiple columns

I am importing some columns from multiple csv files in R. I want to delete all the data from row 1472 onward.
temp = list.files(pattern="*.csv") #Importing csv files
Normalyears <- c(temp[1], temp[2], temp[3], temp[5], temp[6], temp[7],
                 temp[9], temp[10], temp[11], temp[13], temp[14], temp[15],
                 temp[17], temp[18], temp[19], temp[21], temp[22], temp[23])
leapyears <- c(temp[4], temp[8], temp[12], temp[16], temp[20]) # separating csv files based on leap years and normal years
Importing only the second column of each csv file.
myfiles_Normalyears = lapply(Normalyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
myfiles_leapyears = lapply(leapyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
new.data.leapyears <- NULL
for(i in 1:length(myfiles_leapyears)) {
in.data <- read.table(myfiles_leapyears[i], skip = c(1472:4399), sep = ",")
new.data.leapyears <- rbind(new.data.leapyears, in.data)}
The loop is supposed to delete all the rows from 1472 to 4399.
Error: Error in read.table(myfiles_leapyears[i], skip = c(1472:4399), sep = ",") :
'file' must be a character string or connection
There is an nrows parameter to read.table, so why not try
read.table(myfiles_leapyears[i], nrows = 1471, sep = ",")
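A minimal sketch of the whole loop with that change, iterating over the file names in leapyears rather than the already-read myfiles_leapyears list (header = TRUE is an assumption matching the read.delim defaults above):
# Keep only the first 1471 data rows of each leap-year file, then stack them.
new.data.leapyears <- NULL
for (f in leapyears) {
  in.data <- read.table(f, nrows = 1471, sep = ",", header = TRUE)
  new.data.leapyears <- rbind(new.data.leapyears, in.data)
}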
Your myfiles_leapyears is a list. When subsetting a list, you need double brackets to access a single element, otherwise you just get a sublist of length 1.
So replace
myfiles_leapyears[i]
with
myfiles_leapyears[[i]]
that will at least take care of the invalid subscript type 'list' errors. I'd second Josh W. that the nrows argument seems smarter than the skip argument.
Alternatively, if you read the files with sapply ("s" for simplify) instead of lapply ("l" for list), you'll probably be fine using [i]:
myfiles_leapyears = sapply(leapyears, read.delim, colClasses = c('NULL', 'numeric'), sep = ",")
It worked out fine. I just turned the data from a list into a data frame:
df <- as.data.frame(myfiles_leapyears, byrow = T)
leap_df <- head(df, -2928)
