When reading a CSV file with fread and using colClasses to read the columns as numeric, I am having trouble with data that uses commas instead of dots as the decimal separator. Because the data files come from different sources, some use "." and some use ",".
dt <- data.table(a=c("1,4","2,0","4,5","3,5","6,9"),c=(10:14))
write.csv(dt,"dt.csv",row.names=F)
dcsv <- fread("dt.csv", colClasses = list(numeric = 1:2), dec = ",")
I have 2 problems:
I want to read both columns as numerics. So I tried using dec = ",". I now get an error: Column number 2 (colClasses[[1]][2]) is out of range [1,ncol=1]
So I changed to colClasses = list(numeric = 1), but don't quite understand this.
Still the first column turns out to be character type instead of numeric.
How can I also set dec to accept both "." and ",", since I don't know in advance which decimal separator any of the hundreds of files uses? I tried passing a vector, but that did not work. What am I missing? Thanks for any help!
It is not normal to have a file with two different decimal separators.
You should question the source of the file first.
Nevertheless, if you have such a file, the correct way to read it is to read the variables that use a comma separator as strings and then convert them to numeric.
library(data.table)
dt <- data.table(a=c("1,4","2,0","4,5","3,5","6,9"),c=(10:14))
write.csv(dt,"dt.csv",row.names=F)
dcsv <- fread("dt.csv", dec = ".")
dcsv[, a:= as.numeric(gsub("\"", "", gsub(",", ".", a)))]
If you don't know whether a variable uses a comma or a dot as the decimal separator, you can loop over your variables, test whether each one is a string made up only of digits and a comma, and convert only the ones that meet that condition.
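A minimal sketch of that test-and-convert loop in base R. The helper name `comma_to_numeric` and the sample data are hypothetical, and the pattern assumes values such as "1,4" (optional sign, digits, one comma, digits) with no thousands separators:

```r
# Hypothetical helper: convert a character vector to numeric only if every
# non-missing value looks like a comma-decimal number such as "1,4".
comma_to_numeric <- function(x) {
  vals <- x[!is.na(x)]
  if (is.character(x) && length(vals) > 0 &&
      all(grepl("^-?[0-9]+,[0-9]+$", vals))) {
    as.numeric(sub(",", ".", x, fixed = TRUE))
  } else {
    x  # leave every other column untouched
  }
}

# Illustrative data: one comma-decimal column, one text column, one integer column
df <- data.frame(a = c("1,4", "2,0", "4,5"),
                 b = c("x", "y", "z"),
                 c = 10:12,
                 stringsAsFactors = FALSE)

# Apply the helper to every column while keeping the data frame structure
df[] <- lapply(df, comma_to_numeric)
str(df)
```

On a data.table returned by fread, the same helper could be applied with something like `cols <- names(dcsv); dcsv[, (cols) := lapply(.SD, comma_to_numeric), .SDcols = cols]`.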
ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines. The first one serves as the column names; all fields are separated by commas, and the values are in quotation marks except for the first one, which I think is what causes the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and unselect the ones I don't want later, it's no worries.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the dataframe has the columns' names nicely separated but the values are all in the first one
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect it.
Both read.delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried the input you provided with read.csv and had no problems; each column is accessible when subsetting. As for your other options, you're getting the quote option wrong: it needs to be "\"", because the double quote character has to be escaped, i.e. df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
After reading a csv file
data<-read.table(paste0('C:/Users/data/','30092017ARB.csv'),header=TRUE, sep=";")
Almost all numeric variables come out as factors, especially the last column.
I tried all the suggestions here; however, I get a warning for every one of them:
Warning message:
NAs introduced by coercion
Someone even mentioned in this post:
"Every answer in this post failed to generate results for me , NAs were getting generated."
Any idea how I can solve this problem?
Addendum: in the following pic you can see one possible approach suggested in here.
However, I always get the same NA.
The percent sign is clearly the problem. Replace the "%" by the empty string, "", and then convert to numeric.
data[[3]] <- sub("%", "", data[[3]])
data[[3]] <- as.numeric(data[[3]])
You can do this in one line of code,
data[[3]] <- as.numeric(sub("%", "", data[[3]]))
Also, two notes on reading the data in.
First, some files use the semicolon as a column separator. This is common in countries where the decimal mark is the comma. That is why R has two functions to read files in the CSV format.
These functions are both calls to read.table with some defaults changed.
read.csv - Sets arguments header = TRUE and sep = ",".
read.csv2 - Sets arguments header = TRUE, sep = ";" and dec = ",".
For a full explanation see read.table or at an R prompt run help("read.table").
Second, you can avoid factor problems if you use argument stringsAsFactors = FALSE from the start, when reading in the data.
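As a rough illustration of both notes, the sketch below reads the same text with read.csv and with read.csv2. The inline data (column names `id`, `price`, `share`) is made up for the example and is not the asker's file:

```r
# Hypothetical file contents: semicolon-separated, comma as decimal mark,
# plus a percent column like the one discussed above.
txt <- "id;price;share\n1;2,5;10%\n2;3,75;25%"

# read.csv with only the separator changed still expects "." as decimal mark
d1 <- read.csv(text = txt, sep = ";", stringsAsFactors = FALSE)

# read.csv2 sets sep = ";" and dec = "," together
d2 <- read.csv2(text = txt, stringsAsFactors = FALSE)

class(d1$price)  # stays character, because "2,5" is not numeric when dec = "."
class(d2$price)  # numeric

# The percent column is character in both cases and still needs the sub() step
d2$share <- as.numeric(sub("%", "", d2$share, fixed = TRUE))
```

Because stringsAsFactors = FALSE is set when reading, the text columns arrive as character and sub() can be applied to them directly.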
I have data files that contain the following:
the first 10 columns are numbers, the last column is text. They are separated by space. The problem is that the text in the last column may also contain space. So when I used read.table() I got the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 21 did not have 11 elements
what's the easiest way of reading the first 10 columns into a data matrix, and the last column into a string vector? Should I use readLines() first then process it?
If you cannot re-export or recreate your data files with different, non-whitespace separators or with quotation marks around the last column to avoid that problem, you can use read.table(... , fill = TRUE) to read in a file with unequal columns and then combine columns 11+ with dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep=" ")) (or something like that) and then drop the now unwanted columns with dat[11:(ncol(dat)-1)] <- NULL. Finally, you may need to trim the whitespace from the end of the eleventh column with trimws(dat$col11).
Note that fill only considers the first five lines of your file, so you may need to find out the number of 'pseudo-columns' in the longest line manually and specify an appropriate number of col.names in read.table (see the linked answer).
Hinted by the useful fill = TRUE option of read.table() function, I used the following to solve my problem:
dat <- read.table(fname, fill = T)
dat <- dat[subset(1:nrow(dat),!((1:nrow(dat)) %in% (which(dat[,11]=="No") + 1))),]
The fill = TRUE option puts everything after the first space of the 11th column to a new row (redundant rows that the original data do not have). The code above removes the redundant rows based on three assumptions: (1) the number of space separators in the 11th column is no more than 11 such that we know there is only one more row of text after a line whose 11th column contains space (that's what the +1 does); (2) we know the line whose 11th column starts with a certain word (in my case it is "No") (3) Keeping only the first word in the 11th column would be sufficient (without ambiguity).
The following solved my problem:
# sep belongs to count.fields(), not to max()
nc <- max(count.fields(fname, sep = " "))
data <- read.table(fname, fill = TRUE, col.names = paste0("V", seq_len(nc)), sep = " ", header = FALSE)
Then the first 10 columns will be the numeric results I want and the remaining nc-10 columns can be combined into one string vector.
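A small end-to-end sketch of this approach, including the final step of combining the trailing columns. It uses a made-up three-line file with 3 numeric columns instead of 10 (`n_num` is an assumption you would set to 10 for the data described in the question), and it assumes the text contains no quote or `#` characters:

```r
# Hypothetical sample file: 3 numeric columns followed by free text
lines <- c("1 2.5 3 No issues found",
           "4 5.5 6 Yes",
           "7 8.5 9 No data for this run")
fname <- tempfile(fileext = ".txt")
writeLines(lines, fname)

n_num <- 3  # number of leading numeric columns (10 in the question)

# Widest line determines how many pseudo-columns read.table must create
nc <- max(count.fields(fname, sep = " "))
dat <- read.table(fname, fill = TRUE, header = FALSE, sep = " ",
                  col.names = paste0("V", seq_len(nc)),
                  stringsAsFactors = FALSE)

# Numeric part as a matrix
num <- as.matrix(dat[, seq_len(n_num)])

# Paste the remaining columns back into one string per row and
# trim the padding left by the filled-in empty fields
txt <- trimws(do.call(paste, c(dat[, (n_num + 1):nc, drop = FALSE], sep = " ")))
```

If a word in the text column could itself be read as a number, passing colClasses = "character" for the trailing columns would be a safer choice.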
The most helpful post is:
How can you read a CSV file in R with different number of columns
You could reformat your file before reading it in R.
For example, using perl in a terminal:
perl -pe 's/(?<=[0-9]) /,/g' myfile.txt > myfile.csv
This replaces every space that is preceded by a digit with a comma. Note that this also affects spaces inside the text column that follow a digit.
Then read it into R using read.csv:
df = read.csv("myfile.csv")
This is partially related to reading in files in so-called European way, more in How to read in numbers with a comma as decimal separator?. I have data with a row such as "Invoice","1324","Name","John","Age","10","Height","143,5","Products","1;2;3;4","ProductIDs","01;02;03;04" where a comma acts as a separator of field values and inside the field values, delimited with double-quotes, comma acts as a decimal separator.
Inside field values, the semicolon also acts as another separator, but we can set that aside for now and concentrate on first reading the file in correctly, with commas having different meanings in different places.
How to read in numbers with a comma as a decimal separator and a field separator in R?
It might be possible to do using the dec parameter depending on how you're reading the file in. Here is how I would do it using data.table:
dat <- fread('"Name", "Age"
"Joe", "1,2"')
dat[, Age := as.numeric(gsub(",", ".", Age))]
# Name Age
# 1: Joe 1.2
How about this?
read.table("file.name", sep=",", quote = "\"", dec=",")
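To try that suggestion without a file, here is a sketch that feeds the sample row from the question through read.table's text argument. The default column names V1 to V12 come from header = FALSE; whether this fits the real file depends on every row having the same layout:

```r
# The sample row from the question, passed as text instead of a file name
row <- '"Invoice","1324","Name","John","Age","10","Height","143,5","Products","1;2;3;4","ProductIDs","01;02;03;04"'

d <- read.table(text = row, sep = ",", quote = "\"", dec = ",",
                header = FALSE, stringsAsFactors = FALSE)

# The quotes protect the comma inside "143,5" from being treated as a field
# separator; dec = "," then lets the value be converted to a number.
d$V8
class(d$V8)

# Fields such as the semicolon lists are not numbers and stay character
d$V10
```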
I am very new to R and I am having trouble accessing a dataset I've imported. I'm using RStudio and used the Import Dataset function when importing my csv-file and pasted the line from the console-window to the source-window. The code looks as follows:
setwd("c:/kalle/R")
stuckey <- read.csv("C:/kalle/R/stuckey.csv")
point <- stuckey$PTS
time <- stuckey$MP
However, the data isn't integer or numeric as I am used to but factors so when I try to plot the variables I only get histograms, not the usual plot. When checking the data it seems to be in order, just that I'm unable to use it since it's in factor form.
Both the data import function (here: read.csv()) and a global option let you set stringsAsFactors=FALSE, which should fix this.
By default, read.csv reads each column as text and then checks whether its values can all be treated as numeric. If it finds non-numeric values, it assumes the variable is character data, and (with stringsAsFactors = TRUE, the default before R 4.0.0) character variables are converted to factors.
It looks like the PTS and MP variables in your dataset contain non-numerics, which is why you're getting unexpected results. You can force these variables to numeric with
point <- as.numeric(as.character(point))
time <- as.numeric(as.character(time))
But any values that can't be converted will become missing. (The R FAQ gives a slightly different method for factor -> numeric conversion but I can never remember what it is.)
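A short sketch of why the as.character() step matters, using a made-up factor. The last line is the variant from the R FAQ (7.10), which converts the levels once and then indexes them by the factor:

```r
# Hypothetical factor, as read.csv might produce for a column with one bad value
f <- factor(c("10", "25", "n/a", "7"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)

# Going through character gives the values; "n/a" becomes NA
# (suppressWarnings hides the "NAs introduced by coercion" warning)
vals_chr <- suppressWarnings(as.numeric(as.character(f)))

# R FAQ variant: convert the levels, then index by the factor
vals_lvl <- suppressWarnings(as.numeric(levels(f))[f])

print(codes)
print(vals_chr)
print(vals_lvl)
```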
You can set this globally for all read.csv/read.* commands with
options(stringsAsFactors=F)
Then read the file as follows:
my.tab <- read.table( "filename.csv", as.is=T )
When importing CSV data files, the import command should reflect both the separator between columns (;) and the decimal separator for your numeric values (for a numeric value such as 2,5 this would be ",").
The command for importing a CSV therefore has to be a bit more comprehensive, with more arguments:
stuckey <- read.csv2("C:/kalle/R/stuckey.csv", header=TRUE, sep=";", dec=",")
This should import all variables as either integers or numeric.
None of these answers mention the colClasses argument which is another way to specify the variable classes in read.csv.
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "numeric") # all variables to numeric
or you can specify which columns to convert:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = c("PTS" = "numeric", "MP" = "numeric")) # specific columns to numeric
Note that if a variable can't be converted to numeric then it will be converted to factor as default which makes it more difficult to convert to number. Therefore, it can be advisable just to read all variables in as 'character' colClasses = "character" and then convert the specific columns to numeric once the csv is read in:
stuckey <- read.csv("C:/kalle/R/stuckey.csv", colClasses = "character")
point <- as.numeric(stuckey$PTS)
time <- as.numeric(stuckey$MP)
I'm new to R as well and faced the exact same problem. But then I looked at my data and noticed that it was caused by my csv file using a comma as a thousands separator (,) in all numeric columns (e.g. 1,233,444.56 instead of 1233444.56).
I removed the comma separator in my csv file and then reloaded into R. My data frame now recognises all columns as numbers.
I'm sure there's a way to handle this within the read.csv function itself.
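One way to do this from R, rather than inside read.csv itself, is to read the column as text and strip the commas afterwards. The sketch below uses made-up data (columns `item` and `amount`) and assumes the numbers are quoted in the file, since an unquoted comma would be taken as a field separator:

```r
# Hypothetical file contents with quoted, comma-grouped numbers
txt <- 'item,amount\nA,"1,233,444.56"\nB,"2,500.00"\nC,"17.25"'

# The amount column cannot be parsed as numeric, so it is read as character
d <- read.csv(text = txt, stringsAsFactors = FALSE)

# Remove the thousands separators, then convert
d$amount <- as.numeric(gsub(",", "", d$amount, fixed = TRUE))
str(d)
```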
This only worked right for me when including strip.white = TRUE in the read.csv command.
(I found the solution here.)
For me, the solution was to include skip = 0
(skip is the number of rows to skip at the top of the file; it can also be set > 0):
mydata <- read.csv(file = "file.csv", header = TRUE, sep = ",", skip = 22)