Numeric variables converted to factors when reading a CSV file - r

I'm trying to read a .csv file into R where all the columns are numeric. However, they get converted to factors every time I import them.
Here's a sample of what my CSV looks like:
This is my code:
options(StringsAsFactors=F)
data<-read.csv("in.csv", dec = ",", sep = ";")
As you can see, I set dec to , and sep to ;. Still, all the vectors that should be numerics are factors!
Can someone give me some advice? Thanks!

Your NA strings in the csv file, N/A, are interpreted as character, so the whole column is read as character. If you have stringsAsFactors = TRUE in options or in read.csv (the default before R 4.0.0), the column is further converted to factor. You can use the argument na.strings to tell read.csv which strings should be interpreted as NA.
A small example:
df <- read.csv(text = "x;y
N/A;2,2
3,3;4,4", dec = ",", sep = ";")
str(df)
df <- read.csv(text = "x;y
N/A;2,2
3,3;4,4", dec = ",", sep = ";", na.strings = "N/A")
str(df)
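When it is not obvious which strings are blocking the numeric conversion, one option is to read everything as text first and list the values that fail to parse. The following is a minimal sketch using the sample data above; the helper name find_non_numeric is made up for illustration, and it assumes a decimal comma.

```r
# Read every column as character so nothing is coerced yet
raw <- read.csv(text = "x;y\nN/A;2,2\n3,3;4,4", sep = ";",
                colClasses = "character")

# Hypothetical helper: return the distinct strings in a column that
# cannot be parsed as numbers once the decimal comma is replaced
find_non_numeric <- function(col, dec = ",") {
  num <- suppressWarnings(as.numeric(sub(dec, ".", col, fixed = TRUE)))
  unique(col[!is.na(col) & is.na(num)])
}

bad <- lapply(raw, find_non_numeric)
bad
```

Any strings listed this way are candidates for na.strings, or for cleaning before conversion.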
Update following comment
Although not apparent from the sample data provided, there is also a problem with instances of '$' concatenated to the numbers, e.g. '$3,3'. Such values will be interpreted as character, and then the dec = "," doesn't help us. We need to replace both the '$' and the ',' before the variable is converted to numeric.
df <- read.csv(text = "x;y;z
N/A;1,1;2,2$
$3,3;5,5;4,4", dec = ",", sep = ";", na.strings = "N/A")
df
str(df)
df[] <- lapply(df, function(x) {
  # strip the literal '$', then turn the decimal comma into a point
  x2 <- gsub(pattern = "$", replacement = "", x = x, fixed = TRUE)
  x3 <- gsub(pattern = ",", replacement = ".", x = x2, fixed = TRUE)
  as.numeric(x3)
})
df
str(df)

You could actually have gotten your original code to work: there's a tiny typo (it's 'stringsAsFactors', not 'StringsAsFactors'). The options command won't complain about the wrong name, but it just won't work. When done correctly, it'll read the columns as character instead of factor. You can then convert columns to whatever format you want.
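A small sketch of why the typo goes unnoticed: options() accepts any name, so the misspelled call just creates a new, unrelated option. Passing stringsAsFactors directly to read.csv avoids relying on the global option at all. The inline data is the sample from the answer above, not the asker's file.

```r
before <- getOption("stringsAsFactors")

# Capital S: this silently creates a separate option and changes nothing else
old <- options(StringsAsFactors = FALSE)
untouched <- identical(getOption("stringsAsFactors"), before)
options(old)  # remove the misspelled option again

# Setting the argument in the call itself does not depend on options()
df <- read.csv(text = "x;y\nN/A;2,2\n3,3;4,4", dec = ",", sep = ";",
               stringsAsFactors = FALSE)
sapply(df, class)
```

Here x stays character because of the N/A string, which is the separate problem addressed by na.strings.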

I just had this same issue, and tried all the fixes on this and other duplicate posts. None really worked all that well. The way I went about fixing it was actually on the Excel side. If you highlight all the columns in your source file (in Excel), right-click ==> Format Cells, then select 'Number', it'll import perfectly fine (so long as you have no non-numeric characters below the header).

Related

NA introduced by coercion

I have a Notepad text file, inflation.txt, that looks something like this:
1950-1 0.0084490544865279
1950-2 −0.0050487986543660
1950-3 0.0038461526886055
1950-4 0.0214293914558992
1951-1 0.0232839389540449
1951-2 0.0299121323429455
1951-3 0.0379293285389640
1951-4 0.0212773984472849
From a previous stackoverflow post, I learned how to import this file into R:
data <- read.table("inflation.txt", sep = "" , header = F ,
na.strings ="", stringsAsFactors= F, encoding = "UTF-8")
However, this code reads the file as a character. When I try to convert this file to numeric format, all negative values are replaced with NA:
b=as.numeric(data$V2)
Warning message:
In base::as.numeric(x) : NAs introduced by coercion
> head(b)
[1] 0.008449054 NA 0.003846153 0.021429391 0.023283939 0.029912132
Can someone please show me what I am doing wrong? Is it possible to save the inflation.txt file as a data.frame?
I would read the file using space as a separator, then spin out two separate columns for the year and quarter from your R script:
data <- read.table("inflation.txt", sep = " ", header=FALSE,
na.strings="", stringsAsFactors=FALSE, encoding="UTF-8")
names(data) <- c("ym", "vals")
data$year <- as.numeric(sub("-.*$", "", data$ym))
data$quarter <- as.numeric(sub("^\\d+-", "", data$ym))
data <- data[, c("year", "quarter", "vals")]
The issue is that the "−" in your data is not the ASCII hyphen-minus ("-") that R parses as a negative sign; it is a different, similar-looking Unicode character, hence the column is being read as character.
You have two options.
Open the file in any text editor, find and replace every "−" with "-", and then read.table will work directly.
data <- read.table("inflation.txt")
If you can't change the data in the original file then replace them with sub after reading the data into R.
data$V2 <- as.numeric(sub('−', '-', data$V2, fixed = TRUE))
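To confirm which character is in the file, it can help to look at its code point. The sketch below works on two values typed in by hand, and it assumes the sign in the file is U+2212 (the Unicode minus sign), written here with an escape so the script does not depend on how the source file is encoded.

```r
# Two sample values; the second starts with U+2212 rather than ASCII "-"
vals <- c("0.0084490544865279", "\u22120.0050487986543660")

# Code point of the leading character: 45 would be the ASCII hyphen-minus
lead <- utf8ToInt(substr(vals[2], 1, 1))
lead

# Replace the Unicode minus with the ASCII one, then convert
fixed <- as.numeric(sub("\u2212", "-", vals, fixed = TRUE))
fixed
```

If utf8ToInt reports a different code point for your file, use that character in the sub call instead.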

How to avoid factors in R when reading csv data

I have data in a csv file. When I read it in, the columns are factors, so I cannot do any computation on them.
I used as.numeric(df$variablename), but it returns a completely different set of values for the variable.
original data in the variable: 2961,488,632,
as.numeric output: 1,8,16
When reading data using read.table you can specify
how your data is separated: sep =
what the decimal point is: dec =
what NA values look like: na.strings =
that you do not want to convert strings to factors: stringsAsFactors = FALSE
In your case you could use something like:
read.table("mycsv.csv", header = TRUE, sep = ",", dec = ".", stringsAsFactors = F,
na.strings = c("", "-"))
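The "completely different" numbers in the question are the factor's internal level codes, not the original values. If the data has already been read in as a factor, converting through character recovers the values. A minimal sketch with the three values from the question:

```r
f <- factor(c("2961", "488", "632"))

# Returns the integer level codes, which is what the question observed
codes <- as.numeric(f)

# Go through the labels instead to recover the actual numbers
values <- as.numeric(as.character(f))

# Equivalent form that converts each level only once
values2 <- as.numeric(levels(f))[f]
```

This only repairs the symptom; fixing the import as described above avoids the factor in the first place.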
In addition to the answer by Cettt, there's also colClasses.
If you know in advance what data types the columns your csv file has, you can specify this. This stops R from "guessing" what the datatype is, and lets you know when something isn't right, rather than deciding it must be a string. e.g. if your 4-column csv file has columns that are Text, Factors, Integer, Numeric, you can use
read.table("mycsv.csv", header = T, sep = ",", dec = ".",
colClasses=c("character", "factor", "integer", "numeric"))
Edited to add:
As pointed out by gersht, the issue is likely some non-number in the numbers column. Often, this can be how the value NA was coded. Specifying colClasses causes R to give an error message when it encounters any such "not numeric or NA" values, so you can easily see the issue. If it's a non-default coding of NA, use the argument na.strings = c("NA", "YOUR NA VALUE"). If it's another issue, you'll likely have to fix the file before importing. For example:
read.table(sep=",",
colClasses=c("character", "numeric"),
text="
cat,11
canary,12
dog,1O") # NB not a 10; it's a 1 and a capital-oh.
gives
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '1O'
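In a script, that error can be caught and reported rather than stopping the run. The following is a rough sketch using the same three lines of sample data; wrapping the call in tryCatch is my own addition, not part of the answer above.

```r
txt <- "cat,11\ncanary,12\ndog,1O"  # last value is a 1 and a capital-oh

# Capture the error message instead of aborting
msg <- tryCatch({
  read.table(text = txt, sep = ",",
             colClasses = c("character", "numeric"))
  NA_character_  # reached only if the read succeeds
}, error = function(e) conditionMessage(e))

msg
```

The message names the offending value, which is usually enough to find it in the file.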

read.table that delete 0 of each rows in r

I have a file in which every row is a string of numbers. Example of a row: 0234
Example of this file:
00020
04921
04622
...
When I use read.table it deletes all the leading 0s of each row (00020 becomes 20, 04921 -> 4921, ...). I use:
example <- read.table(fileName, sep="\t",check.names=FALSE)
After this, to obtain a vector I use as.vector(unlist(example)).
I tried different options of read.table but the problem remains.
By default, read.table inspects the column values and chooses the column type accordingly. If we want a custom type, we can specify it with colClasses:
example <- read.table(fileName, sep="\t",check.names=FALSE,
colClasses = "character", stringsAsFactors = FALSE)
When colClasses is not specified, the function uses type.convert to assign the column types automatically based on the values:
read.table # function
...
...
data[[i]] <- if (is.na(colClasses[i]))
type.convert(data[[i]], as.is = as.is[i], dec = dec,
numerals = numerals, na.strings = character(0L))
...
...
If I understand the issue correctly, you read in your data file with read.table but since you want a vector, not a data frame, you then unlist the df. And you want to keep the leading zeros.
There is a simpler way of doing the same: use scan.
example <- scan(file = fileName, what = character(), sep = "\t")
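A short sketch of the difference, using the three sample rows from the question as inline text. The last step assumes every code is exactly five digits wide, which is true of the sample but is only an assumption about the real file.

```r
txt <- "00020\n04921\n04622"

# Default behaviour: the column is guessed to be integer, zeros are lost
as_int <- read.table(text = txt)$V1

# Forcing character keeps the strings exactly as written
as_chr <- read.table(text = txt, colClasses = "character")$V1

# If the data was already read as integer, the zeros can be restored
# only when the original width is known (assumed to be 5 here)
repadded <- sprintf("%05d", as_int)
```

Reading as character is the safer route, since re-padding cannot recover codes of varying width.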

Issues importing csv data into R where the data contains additional commas

I have a very large data set that for illustrative purposes looks something like the following.
Cust_ID , Sales_Assistant , Store
123 , Mary, Worthington, 22
456 , Jack, Charles , 42
The real data has many more columns and millions of rows. I'm using the following code to import it into R but it is falling over because one or more of the columns has a comma in the data (see Sales_Assistant above).
df <- read.csv("C:/dataextract.csv", header = TRUE , as.is = TRUE , sep = "," , na.strings = "NA" , quote = "" , fill = TRUE , dec = "." , allowEscapes = FALSE , row.names=NULL)
Adding row.names=NULL imported all the data but it split the Sales_Assistant column over two columns and threw all the other data out of alignment. If I run the code without this I get an error...
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
...and the data won't load.
Can you think of a way around this that doesn't involve tackling the data at source, or opening it in a text editor? Is there a solution in R?
First and foremost, it is a csv file, so "Mary, Worthington" is read as two columns. If you have commas in your values, consider saving the data as tsv (tab-separated values) instead.
However, if your data has the same number of commas in every row, so that the fields stay aligned, I would consider ignoring the first row of the data frame (which holds the column names as you read the file) and assigning proper column names afterwards.
For instance, in your case you can replace Sales_Assistant by
Sales_Assistant_First_Name, Sales_Assistant_Last_Name
which makes perfect sense. Then I could basically do
df <- df[-1, ]
colnames(df) <- c("Cust_ID" , "Sales_Assistant_First_Name" , "Sales_Assistant_Last_Name", "Store")
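# Alternatively, read the body and the header separately, then merge the split columns
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE, as.is = TRUE)
df_cnames <- read.csv("C:/dataextract.csv", nrows = 1, header = FALSE, as.is = TRUE)
df <- within(df, V2V3 <- paste(V2, V3, sep = ''))
df <- subset(df, select = c("V1", "V2V3", "V4"))
# unlist() turns the one-row header data frame into a plain character vector
colnames(df) <- trimws(unlist(df_cnames))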
It may need some modification depending on the actual source
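If the number of extra commas varies from row to row, a more general approach is to count the fields per line and rebuild the rows by hand. This is a sketch only: it uses the three sample lines from the question as inline text, and it assumes the stray commas occur only in the middle column, so the first and last fields of each line are always Cust_ID and Store. The helper name fix_row is made up for illustration.

```r
txt <- c("Cust_ID , Sales_Assistant , Store",
         "123 , Mary, Worthington, 22",
         "456 , Jack, Charles , 42")

# Count comma-separated fields on each line to spot the misaligned rows
tc <- textConnection(txt)
n_fields <- count.fields(tc, sep = ",", quote = "")
close(tc)

# Hypothetical helper: keep the first and last field, glue the rest together
fix_row <- function(p) {
  k <- length(p)
  c(p[1], paste(trimws(p[2:(k - 1)]), collapse = " "), p[k])
}

rows <- strsplit(txt[-1], ",", fixed = TRUE)
m <- t(vapply(rows, fix_row, character(3)))

df <- data.frame(Cust_ID = as.integer(trimws(m[, 1])),
                 Sales_Assistant = m[, 2],
                 Store = as.integer(trimws(m[, 3])),
                 stringsAsFactors = FALSE)
df
```

For a file with millions of rows, readLines followed by this kind of splitting may be slow, so it is worth trying on a subset first.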

Specifying colClasses in read.table using the class function

Is there a way to use read.table() to read all or part of the file in, use the class function to get the column types, modify the column types, and then re-read the file?
Basically I have columns which are zero-padded integers that I'd like to treat as strings. If I let read.table() just do its thing, it of course assumes these are numbers, strips off the leading zeros, and makes the column type integer. Thing is, I have a fair number of columns, so while I can create a character vector specifying each one, I only want to change a couple from what R's best guess is. What I'd like to do is read the first few lines:
myTable <- read.table("//myFile.txt", sep="\t", quote="\"", header=TRUE, stringsAsFactors=FALSE, nrows = 5)
Then get the column classes:
colTypes <- sapply(myTable, class)
Change a couple of column types i.e.:
colTypes[1] <- "character"
And then re-read the file in using the modified column types:
myTable <- read.table("//myFile.txt", sep="\t", quote="\"", colClasses=colTypes, header=TRUE, stringsAsFactors=FALSE, nrows = 5)
While this seems like an infinitely reasonable thing to do, and colTypes = c("character") works fine, when I actually try it I get a:
scan() expected 'an integer', got '"000001"'
class(colTypes) and class(c("character")) both return "character" so what's the problem?
You can use read.table's colClasses argument to specify the columns you want read as character. For example:
txt <-
"var1, var2, var3
0001, 0002, 1
0003, 0004, 2"
df <-
read.table(
text = txt,
sep = ",",
header = TRUE,
colClasses = "character") ## read all as characters
df
df2 <-
read.table(
text = txt,
sep = ",",
header = TRUE,
colClasses = c("character", "character", "numeric")) ## the third column is numeric
df2
[updated...] or, you could set and re-set colClasses with a vector...
df <-
read.table(
text = txt,
sep = ",",
header = TRUE)
df
## they're all currently read as integer
myColClasses <-
sapply(df, class)
## create a vector of column names for zero padded variable
zero_padded <-
c("var1", "var2")
## if a name is in zero_padded, return "character", else leave it be
myColClasses <-
ifelse(names(myColClasses) %in% zero_padded,
"character",
myColClasses)
## read in with colClasses set to myColClasses
df2 <-
read.table(
text = txt,
sep = ",",
colClasses = myColClasses,
header = TRUE)
df2
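A shorter route to the same result, as far as I know, is to give colClasses a named vector: the read.table documentation says names are matched against the column names and unspecified columns are left to the usual guessing. The sketch below uses a variant of the sample text without spaces after the commas, so the values carry no leading blanks.

```r
txt <- "var1,var2,var3
0001,0002,1
0003,0004,2"

# Only the zero-padded columns are named; var3 is still guessed by R
df3 <- read.table(text = txt, sep = ",", header = TRUE,
                  colClasses = c(var1 = "character", var2 = "character"))
str(df3)
```

This avoids reading the file twice just to learn the default classes.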

Resources