I have a data.table and I want to convert some of its columns to numeric and then compute rowSums over them. I am trying to figure out how to convert all of the selected columns to numeric at once.
This is what I have tried.
# Obtain data
tbl<-get_data(sqlquery = tqry, dbase=db1, server=serv)
# Names of the columns that need to be converted to numeric
score<-names(tbl)[grep('score',names(tbl),ignore.case = T)]
tbl[,class(AcceptingNewPatientsScore)]
[1] "character"
### Wrong - Having problem here
tbl[,eval(score):=as.numeric(get(score))]
tbl[,class(AcceptingNewPatientsScore)]
[1] "numeric" # It converted but jumbled scores.
tbl[,tscore:=rowSums(.SD,na.rm = FALSE),.SDcols=score]
Thanks to @Frank for his suggestion:
tbl[,(score):=lapply(.SD, as.numeric),.SDcols=score]
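A minimal sketch of the accepted approach on made-up data, assuming the data.table package is installed. The table contents and the QualityScore column are hypothetical stand-ins for the result of get_data(); only the conversion and rowSums lines mirror the thread.

```r
library(data.table)

# Hypothetical data: score columns stored as character, as in the question
tbl <- data.table(
  id = 1:3,
  AcceptingNewPatientsScore = c("1", "2", NA),
  QualityScore = c("0.5", "1.5", "2.5")
)

# Names of the columns to convert (value = TRUE returns the names directly)
score <- grep("score", names(tbl), ignore.case = TRUE, value = TRUE)

# Convert each selected column in place, one column at a time
tbl[, (score) := lapply(.SD, as.numeric), .SDcols = score]

# Row-wise total over the converted columns; NA propagates since na.rm = FALSE
tbl[, tscore := rowSums(.SD, na.rm = FALSE), .SDcols = score]

print(sapply(tbl, class))
print(tbl$tscore)
```

Because lapply(.SD, ...) returns one converted vector per column, each column on the left-hand side receives its own values rather than a copy of a single column.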
On the output of read.table, as.vector produces an m x 1 matrix rather than a length m vector:
# data.txt contains one integer per line and nothing else
dataframe = read.table("data.txt", encoding='UTF-8', header=F)
v = as.vector(dataframe)
is.vector(v)
[1] FALSE
length(v)
[1] 1
dim(v)
[1] 19783 1
Consider readLines instead of read.table; it reads the file straight into a vector, one element per line. The result is a character vector, so wrap it in as.integer() if you need numbers:
data <- readLines(con="data.txt", n=-1L, encoding='UTF-8', warn=FALSE)
is.vector(data)
#[1] TRUE
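As a rough illustration, the sketch below writes a small temporary file (a hypothetical stand-in for data.txt) and reads it back two ways. readLines() yields character data that needs converting, while scan() is an alternative that parses the integers directly.

```r
# Hypothetical stand-in for data.txt: one integer per line
path <- tempfile(fileext = ".txt")
writeLines(c("3", "1", "4", "1", "5"), path)

# readLines() returns a character vector, one element per line
lines <- readLines(path, warn = FALSE)
values <- as.integer(lines)

# scan() parses the values as integers while reading
scanned <- scan(path, what = integer(), quiet = TRUE)

print(is.vector(values))
print(identical(values, scanned))
```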
To summarise the above data types:
Data frame: A tabular object where each column can be a different type. A data frame is really a list.
Matrix: A tabular object where all values must have the same type.
Vector: A one dimensional object; all values must have the same type.
Hence it doesn't (in general) make sense to convert from a data frame to a vector.
In your example, you can either
unlist(dataframe)
or convert to a matrix, then use as.vector
as.vector(data.matrix(dataframe))
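The options above can be compared on a small one-column data frame. This is only a sketch: the column values are invented, and dataframe[[1]] is added as a further way to pull out a single column that the answer does not mention.

```r
# Hypothetical one-column data frame, like the result of read.table
dataframe <- data.frame(V1 = c(10L, 20L, 30L))

# A data frame is a list of columns
print(is.list(dataframe))

# Option 1: flatten the list of columns into one atomic vector
v1 <- unlist(dataframe, use.names = FALSE)

# Option 2: convert to a matrix first, then drop the dimensions
v2 <- as.vector(data.matrix(dataframe))

# Extra option (not in the answer): extract the single column directly
v3 <- dataframe[[1]]

print(is.vector(v1))
print(is.null(dim(v2)))
```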
I have encountered an issue that I do not understand and could not find an explanation so far. Here is an example :
x = matrix(data = "test", nrow = 5, ncol = 3)
typeof(x[1, 1])
> "character"
x = as.data.frame(x)
typeof(x[1, 1])
> "integer"
Any idea why as.data.frame() coerces the data to integer type, and how to prevent it?
A matrix can hold only a single type. Matrices are normally used for numeric elements, but if even one element is non-numeric, the whole matrix is converted to character.
In the OP's example, the matrix holds character elements. as.data.frame() turns it into a data.frame, but with stringsAsFactors=TRUE (the default before R 4.0.0) each character column is converted to factor class. A factor is stored as integer codes plus a set of levels, so typeof returns "integer".
This can be avoided by using stringsAsFactors=FALSE:
x1 <- as.data.frame(x, stringsAsFactors=FALSE)
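To make the difference visible regardless of the R version's default, the following sketch passes stringsAsFactors explicitly in both directions and inspects the first cell. The variable names as_factor and as_char are illustrative only.

```r
x <- matrix(data = "test", nrow = 5, ncol = 3)

# Explicitly request factor conversion (the pre-4.0.0 default behaviour)
as_factor <- as.data.frame(x, stringsAsFactors = TRUE)

# Explicitly keep the character columns as they are
as_char <- as.data.frame(x, stringsAsFactors = FALSE)

# class() shows the factor; typeof() shows the underlying integer codes
print(class(as_factor[1, 1]))
print(typeof(as_factor[1, 1]))
print(typeof(as_char[1, 1]))
```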
On creating a column whose contents contain duplicate values, I notice the following with regard to factors.
1. If a column with duplicate character values is made part of a data frame at the time of data frame creation, it is of class factor, but if the same column is appended later, it is of class character, though the values in both cases are the same. Why is this?
#creating a data frame
name = c('waugh','waugh','smith')
age = c(21,21,27)
df = data.frame(name,age)
#adding a new column which has the same values as the 'name' column above, to the data frame
df$newcol = c('waugh','waugh','smith')
#you can see that the classes of the two are different though the values are the same
class(df$name)
## [1] "factor"
class(df$newcol)
## [1] "character"
2. Only the column with duplicate alphabetic contents becomes a factor; a column containing duplicate numeric values is not treated as a factor. Why is that? I could very well mean 1 = Male, 0 = Female, in which case it should be a factor?
#note that both these columns contain duplicate values
class(df$name)
## [1] "factor"
class(df$age)
## [1] "numeric"
This was basically answered in the comments, but I'll put the answer here to close out the question.
When you use data.frame() to create a data.frame, the function processes the arguments you pass in before building the object. In R versions before 4.0.0, its stringsAsFactors parameter defaults to TRUE, so every character vector you pass in is converted to a factor. The rationale is that such values are usually treated as categorical variables in statistical tests, and a factor can be more efficient to store when many values repeat. Numeric vectors are never converted automatically, whether or not they contain duplicates.
df <- data.frame(name,age)
class(df$name)
# [1] "factor"
df <- data.frame(name,age, stringsAsFactors=FALSE)
class(df$name)
# [1] "character"
Note that the data.frame itself doesn't remember the stringsAsFactors value used during its construction; the setting only applies while data.frame() runs. So if you add columns by assigning them via the $<- syntax or cbind(), the coercion will not happen:
df1 <- data.frame(name,age)
df2 <- data.frame(name,age, stringsAsFactors=FALSE)
df1$name2 <- name
df2$name2 <- name
df3 <- cbind(data.frame(name,age), name2=name)
class(df1$name2)
# [1] "character"
class(df2$name2)
# [1] "character"
class(df3$name2)
# [1] "character"
If you want to add the column as a factor, you will need to convert to factor yourself
df = data.frame(name,age)
df$name2 <- factor(name)
class(df$name2)
# [1] "factor"
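For the numeric case raised in the question (1 for Male, 0 for Female), the same manual conversion applies. The sketch below is an assumption-laden illustration: the sex column and its coding are made up to match the question's example, and the labels are attached through factor()'s levels and labels arguments.

```r
# Hypothetical data: sex coded numerically, 1 = Male, 0 = Female
df <- data.frame(
  name = c("waugh", "waugh", "smith"),
  sex = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Numeric columns stay numeric unless converted explicitly
print(class(df$sex))

# Convert to a factor and attach readable labels to the codes
df$sex <- factor(df$sex, levels = c(0, 1), labels = c("Female", "Male"))

print(class(df$sex))
print(levels(df$sex))
```

R cannot tell from the values alone whether a numeric column is a measurement or a category code, which is why this conversion is left to the user.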
I am trying to read data from a CSV file into a data frame. The data contains names which I do not want to have as factors. I cannot use the stringsAsFactors=FALSE argument since there are other columns which I want to have as factors.
How do I achieve the desired behavior?
Note: The data has thousands of columns. I need to modify the data type for only one column; the types assigned by default to the rest are all fine.
Use the colClasses argument to specify the type of each column. For example:
x <- read.csv("myfile.csv", colClasses=c("numeric","factor","character"))
You could specify the column classes. From ?read.table
colClasses: character. A vector of classes to be assumed for the
columns. Recycled as necessary, or if the character vector
is named, unspecified values are taken to be 'NA'.
Possible values are 'NA' (the default, when 'type.convert' is
used), '"NULL"' (when the column is skipped), one of the
atomic vector classes (logical, integer, numeric, complex,
character, raw), or '"factor"', '"Date"' or '"POSIXct"'.
Otherwise there needs to be an 'as' method (from package
'methods') for conversion from '"character"' to the specified
formal class.
Note that 'colClasses' is specified per column (not per
variable) and so includes the column of row names (if any).
So something like:
types = c("numeric", "character", "factor")
read.table("file.txt", colClasses = types)
should do the trick.
Personally, I would just read the columns in as strings or factors and then change the columns you want.
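A minimal sketch of that read-then-fix approach, using inline text in place of a real file. The column names and values are invented, and stringsAsFactors = TRUE is passed explicitly so that the other string columns become factors on any R version.

```r
# Hypothetical CSV content standing in for the real file
csv_text <- "id,name,group
1,alice,a
2,bob,b
3,carol,a"

# Read everything with factor conversion switched on
df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Then change only the one column that should stay as plain strings
df$name <- as.character(df$name)

print(sapply(df, class))
```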
As the documentation in a previous answer states, if you know the name of the column before reading in your data, you can use a named character vector to specify that column only.
types <- c(b="character") #Set the column named "b" to character
df <- read.table(header=TRUE,sep=",",colClasses=types,text="
a,b,c,d,e
1,asdf,morning,4,greeting
5,fiewhn,evening,12,greeting
9,ddddd,afternoon,292,farewell
33,eianzpod,evening,1111,farewell
191,dnmxzcv,afternoon,394,greeting
")
sapply(df,class)
# a b c d e
# "integer" "character" "factor" "integer" "factor"
If there is no header, read.table names the columns V1, V2, and so on, so you can use those default names to target a column by position:
types <- c(V2="character") #Set the second column to character
df <- read.table(header=FALSE,sep=",",colClasses=types,text="
1,asdf,morning,4,greeting
5,fiewhn,evening,12,greeting
9,ddddd,afternoon,292,farewell
33,eianzpod,evening,1111,farewell
191,dnmxzcv,afternoon,394,greeting
")
sapply(df,class)
# V1 V2 V3 V4 V5
#"integer" "character" "factor" "integer" "factor"
And finally, if you know the position but have a header, you can build the vector of appropriate length. For colClasses, NA means default.
types <- rep.int(NA_character_,5) #make this length the number of columns
types[2] <- "character" #force the second column as character
df <- read.table(header=TRUE,sep=",",colClasses=types,text="
a,b,c,d,e
1,asdf,morning,4,greeting
5,fiewhn,evening,12,greeting
9,ddddd,afternoon,292,farewell
33,eianzpod,evening,1111,farewell
191,dnmxzcv,afternoon,394,greeting
")
sapply(df,class)
# a b c d e
# "integer" "character" "factor" "integer" "factor"
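With thousands of columns, the length of the types vector need not be typed by hand. One possible sketch, assuming the file is a regular CSV with a header: read a single row first to count the columns, then build colClasses from that count. The temporary file and its contents are hypothetical, and stringsAsFactors = TRUE is passed explicitly so the remaining string columns become factors on any R version.

```r
# Hypothetical file standing in for the real CSV
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,asdf,morning", "5,fiewhn,evening"), path)

# Read just one data row to learn how many columns there are
n_cols <- ncol(read.csv(path, nrows = 1))

# NA means "use the default type detection" for that column
types <- rep(NA_character_, n_cols)
types[2] <- "character"  # force only the second column to character

df <- read.csv(path, colClasses = types, stringsAsFactors = TRUE)
print(sapply(df, class))
```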