I am trying to read data from a CSV file into a data frame. The data contains names which I do not want to have as factors. I cannot use the stringsAsFactors=FALSE argument since there are other columns which I want to have as factors.
How do I achieve the desired behavior?
Note: The data has thousands of columns. I need to modify the data type of only one column; the types assigned by default to the rest are all fine.
Use the colClasses argument to specify the type of each column. For example:
x <- read.csv("myfile.csv", colClasses=c("numeric","factor","character"))
You could specify the column classes. From ?read.table
colClasses: character. A vector of classes to be assumed for the
columns. Recycled as necessary, or if the character vector
is named, unspecified values are taken to be 'NA'.
Possible values are 'NA' (the default, when 'type.convert' is
used), '"NULL"' (when the column is skipped), one of the
atomic vector classes (logical, integer, numeric, complex,
character, raw), or '"factor"', '"Date"' or '"POSIXct"'.
Otherwise there needs to be an 'as' method (from package
'methods') for conversion from '"character"' to the specified
formal class.
Note that 'colClasses' is specified per column (not per
variable) and so includes the column of row names (if any).
So something like:
types = c("numeric", "character", "factor")
read.table("file.txt", colClasses = types)
should do the trick.
Personally, I would just read the columns in as strings or factors and then change the columns you want.
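A minimal sketch of that read-then-convert approach. The column name b and the inline data are assumptions made for illustration, and stringsAsFactors = TRUE is passed explicitly so the result does not depend on the R version's default:

```r
# Read everything with strings as factors, then fix the one column afterwards.
# The column name "b" and the inline data are illustrative assumptions.
df <- read.csv(text = "a,b,c
1,asdf,morning
5,fiewhn,evening", stringsAsFactors = TRUE)

df$b <- as.character(df$b)   # convert only the names column back to character
sapply(df, class)
```

This costs one extra pass over a single column, which is usually negligible compared with reading the file.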
As the documentation in a previous answer states, if you know the name of the column before reading in your data, you can use a named character vector to specify that column only.
types <- c(b="character") #Set the column named "b" to character
df <- read.table(header=TRUE,sep=",",colClasses=types,text="
a,b,c,d,e
1,asdf,morning,4,greeting
5,fiewhn,evening,12,greeting
9,ddddd,afternoon,292,farewell
33,eianzpod,evening,1111,farewell
191,dnmxzcv,afternoon,394,greeting
")
sapply(df,class)
# a b c d e
# "integer" "character" "factor" "integer" "factor"
If there is no header, you can also do it by position:
types <- c(V2="character") #Set the second column to character
df <- read.table(header=FALSE,sep=",",colClasses=types,text="
1,asdf,morning,4,greeting
5,fiewhn,evening,12,greeting
9,ddddd,afternoon,292,farewell
33,eianzpod,evening,1111,farewell
191,dnmxzcv,afternoon,394,greeting
")
sapply(df,class)
# V1 V2 V3 V4 V5
#"integer" "character" "factor" "integer" "factor"
And finally, if you know the position but have a header, you can build the vector of appropriate length. For colClasses, NA means default.
types <- rep.int(NA_character_,5) #make this length the number of columns
types[2] <- "character" #force the second column as character
df <- read.table(header=TRUE,sep=",",colClasses=types,text="
a,b,c,d,e
1,asdf,morning,4,greeting
5,fiewhn,evening,12,greeting
9,ddddd,afternoon,292,farewell
33,eianzpod,evening,1111,farewell
191,dnmxzcv,afternoon,394,greeting
")
sapply(df,class)
# a b c d e
#"integer" "character" "factor" "integer" "factor"
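With thousands of columns, the length of that vector may not be known up front. One possible way around this, sketched here under the assumption that reading the header row on its own is cheap, is to read a single row first and size the vector from it. The temporary file, its contents, and the column name b are made up for the example:

```r
# Write a small example file; in practice this would be the real CSV.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d,e",
             "1,asdf,morning,4,greeting",
             "5,fiewhn,evening,12,greeting"), path)

# Read only one row to learn the column names and their count.
header <- names(read.csv(path, nrows = 1))

# NA means "use the default type detection" for that column.
types <- rep(NA_character_, length(header))
types[header == "b"] <- "character"   # "b" is the assumed names column

# stringsAsFactors is set explicitly so the result is the same on any R version.
df <- read.csv(path, colClasses = types, stringsAsFactors = TRUE)
sapply(df, class)
```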
On the output of read.table, as.vector produces an m x 1 matrix rather than a length m vector:
# data.txt contains one integer per line and nothing else
dataframe = read.table("data.txt", encoding='UTF-8', header=F)
v = as.vector(dataframe)
is.vector(v)
[1] FALSE
length(v)
[1] 1
dim(v)
[1] 19783 1
Consider readLines instead of read.table which imports the one column directly into a vector:
data <- readLines(con="data.txt", n=-1L, encoding='UTF-8', warn=FALSE)
is.vector(data)
#[1] TRUE
To summarise the above data types:
Data frame: A tabular object where each column can be a different type. A data frame is really a list.
Matrix: A tabular object where all values must have the same type.
Vector: A one dimensional object; all values must have the same type.
Hence it doesn't (in general) make sense to convert from a data frame to a vector.
In your example, you can either
unlist(dataframe)
or convert to a matrix, then use as.vector
as.vector(data.matrix(dataframe))
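The two options can be compared on a small stand-in for the imported data. This is only a sketch: the three-row data frame is made up, and the [[1]] extraction is an extra alternative not mentioned above.

```r
# A one-column data frame standing in for the result of read.table().
dataframe <- data.frame(V1 = c(3L, 1L, 4L))

v1 <- dataframe[[1]]                        # extract the single column directly
v2 <- unlist(dataframe, use.names = FALSE)  # flatten the list of columns
v3 <- as.vector(data.matrix(dataframe))     # convert to a matrix, then drop dims

is.vector(v1)
identical(v1, v2)
identical(v1, v3)
```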
I have a data frame that I construct as such:
> yyz <- data.frame(a = c("1","2","n/a"), b = c(1,2,"n/a"))
> apply(yyz, 2, class)
a b
"character" "character"
I am attempting to convert the last column to numeric while still maintaining the first column as a character. I tried this:
> yyz$b <- as.numeric(as.character(yyz$b))
> yyz
    a  b
1   1  1
2   2  2
3 n/a NA
But when I run the apply class it is showing me that they are both character classes.
> apply(yyz, 2, class)
a b
"character" "character"
Am I setting up the data frame wrong? Or is it the way R is interpreting the data frame?
If we need only one column to be numeric
yyz$b <- as.numeric(as.character(yyz$b))
But if all the columns need to be changed to numeric, use lapply to loop over the columns, converting each one to character first (because the columns are factors) and then to numeric.
yyz[] <- lapply(yyz, function(x) as.numeric(as.character(x)))
Both columns in the OP's post are factors because of the string "n/a". This is easily avoided when reading the file by passing na.strings = "n/a" to read.table/read.csv. When using data.frame, passing stringsAsFactors=FALSE keeps the columns as character (the default was stringsAsFactors=TRUE before R 4.0.0).
Regarding the usage of apply, it converts the dataset to matrix and matrix can hold only a single class. To check the class, we need
lapply(yyz, class)
Or
sapply(yyz, class)
Or check
str(yyz)
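As a rough illustration of the na.strings suggestion, the same values can be read from inline text. The inline CSV is an assumption standing in for the real file, and the colClasses entry for a is added here only to keep that column as character:

```r
# "n/a" is treated as missing, so column b is detected as a number.
yyz <- read.csv(text = "a,b
1,1
2,2
n/a,n/a", na.strings = "n/a", colClasses = c(a = "character"))

sapply(yyz, class)   # checks each column separately, unlike apply()
```

Note that na.strings applies to every column, so the "n/a" in column a also becomes NA.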
On creating a column whose contents contain duplicate values, I notice the following with regard to factors.
1. If a column with duplicate character values is made part of a data frame at the time of data frame creation, it is of class factor, but if the same column is appended later, it is of class character, though the values in both cases are the same. Why is this?
#creating a data frame
name = c('waugh','waugh','smith')
age = c(21,21,27)
df = data.frame(name,age)
#adding a new column which has the same values as the 'name' column above, to the data frame
df$newcol = c('waugh','waugh','smith')
#you can see that the classes of the two are different though the values are the same
class(df$name)
## [1] "factor"
class(df$newcol)
## [1] "character"
2. Only the column which has duplicate alphabetic contents becomes a factor; if a column contains duplicate numeric values, it is not treated as a factor. Why is that? I could very well mean 1 = Male, 0 = Female, in which case shouldn't it be a factor?
#note that both these columns contain duplicate values
class(df$name)
## [1] "factor"
class(df$age)
## [1] "numeric"
This was basically answered in the comments, but I'll put the answer here to close out the question.
When you use data.frame() to create a data frame, the function manipulates the arguments you pass in. Specifically, it has a parameter named stringsAsFactors, which defaults to TRUE (in R versions before 4.0.0), so all character vectors you pass in are converted to factors. The reasoning is that such values are normally treated as categorical variables in statistical tests, and storing them as a factor can be more efficient when many values are repeated in the vector.
df <- data.frame(name,age)
class(df$name)
# [1] "factor"
df <- data.frame(name,age, stringsAsFactors=FALSE)
class(df$name)
# [1] "character"
Note that the data.frame itself doesn't remember the "stringsAsFactors" value used during its construction. This is only used when you actually run data.frame(). So if you add columns by assigning them via the $<- syntax or cbind(), the coercion will not happen
df1 <- data.frame(name,age)
df2 <- data.frame(name,age, stringsAsFactors=FALSE)
df1$name2 <- name
df2$name2 <- name
df3 <- cbind(data.frame(name,age), name2=name)
class(df1$name2)
# [1] "character"
class(df2$name2)
# [1] "character"
class(df3$name2)
# [1] "character"
If you want to add the column as a factor, you will need to convert to factor yourself
df = data.frame(name,age)
df$name2 <- factor(name)
class(df$name2)
# [1] "factor"
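The same explicit conversion applies to numeric codes, which data.frame() never turns into factors on its own. A small sketch, where the 1 = Male, 0 = Female coding is taken from the question and the variable names are made up:

```r
# Numeric codes stay numeric unless you convert them yourself.
sex_code <- c(1, 0, 1, 1)   # hypothetical coding: 1 = Male, 0 = Female

# levels lists the codes, labels gives each code a readable name.
sex <- factor(sex_code, levels = c(0, 1), labels = c("Female", "Male"))

df <- data.frame(sex_code, sex)
sapply(df, class)
levels(df$sex)
```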
I am taking a data.table:
DT <- data.table(num=c(1,4,6,7,8,12,13, 15), let=rep(c("A","B"), each=4))
And then I have the following result:
> sapply(DT, class)
num let
"numeric" "character"
Which is ok.
Then, adding a line:
DT<-rbind(DT, as.list(c(8, "B")))
And then:
> sapply(DT, class)
num let
"character" "character"
I find it nasty that R changed the first column's type to character; I did not expect it. I can change the column back to numeric afterwards, but it's painful if I have to check after every insert.
Is there a way to add a row without this drawback?
Your first problem stems from your use of c, the function to combine arguments into a vector. This produces an atomic vector (in this case - you are combining two length one atomic vectors, namely the vector 8 and the vector "B") which may be of only one data type, so in your example c(8,"B") is evaluated first, resulting in:
str( c(8, "B") )
# chr [1:2] "8" "B"
Therefore you should not expect any other result!
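A list, unlike an atomic vector, can hold one value of each type, so building the new row with list() avoids the coercion. The sketch below uses a base data frame to stay free of package dependencies; that rbind on a data.table accepts the same kind of list is an assumption here, not something shown above:

```r
# A base data frame standing in for the data.table in the question.
df <- data.frame(num = c(1, 4, 6), let = c("A", "A", "B"),
                 stringsAsFactors = FALSE)

# list() keeps 8 numeric and "B" character; c() would coerce both to character.
new_row <- list(8, "B")
str(new_row)

df2 <- rbind(df, new_row)   # unnamed list elements are matched by position
sapply(df2, class)
```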
I have the following data frame:
name1 name2
A B
B D
C C
D A
The columns "name1" and "name2" are treated as factors, and therefore A, B, C, and D are treated as levels. However, I want to convert this data frame so that it becomes
name1 name2
"A" "B"
"B" "D"
"C" "C"
"D" "A"
In other words, convert it so that A, B, C, and D are treated as strings.
How can I do that?
You're looking for as.character, which you need to apply to each column of the data.frame.
Assuming X is your data.frame
If fctr.cols are the names of your factor columns, then you can use:
X[fctr.cols] <- lapply(X[fctr.cols], as.character)
You can collect your factor columns using is.factor:
fctr.cols <- sapply(X, is.factor)
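Put together on a small made-up data frame, the steps might look like this. The extra numeric column n is an assumption, added to show that non-factor columns are left alone:

```r
# Example data; stringsAsFactors = TRUE reproduces the factor columns
# from the question regardless of the R version's default.
X <- data.frame(name1 = c("A", "B", "C", "D"),
                name2 = c("B", "D", "C", "A"),
                n = 1:4,
                stringsAsFactors = TRUE)

fctr.cols <- sapply(X, is.factor)                   # logical: which columns are factors
X[fctr.cols] <- lapply(X[fctr.cols], as.character)  # convert only those columns

sapply(X, class)
```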
This may be a little simpler than the answer above.
#where your dataframe = df
df$name1 <- as.character(df$name1)
df$name2 <- as.character(df$name2)
I need to do things like this all the time at work because the data is so messy. I have been able to do it on import with stringsAsFactors=FALSE, but in the newest version of R I am getting an error on read.csv. Ideally I will figure that out soon. In the meantime I have been using this as a quick and effective method.
It takes the old variable, foo, which is factor type, and converts it to a new variable, fooChar, which is character type. I usually do it in situ by naming the new variable the same as the old one, but you may want to play with it before you trust it to replace values.
#Convert from Factor to Char
#Data frame named data
#Old Variable named foo, factor type
#New Variable named fooChar, character type
data$fooChar <-as.character(data$foo)
#confirm the data looks the same:
table (data$fooChar)
#confirm structure of new variable
str(data)
If you want to convert only the selected column of factor variable instead of all the factor variable columns in the data frame, you can use:
file1[, n] <- as.character(file1[, n])
where n is the column number.