I am trying to create a function that allows the conversion of selected columns of a data frame to categorical data type (factor) before running a regression analysis.
The question is: how do I select a particular column from a data frame using a string (character value)?
Example:
strColumnNames <- "Admit,Rank"
strDelimiter <- ","
strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
for( strColName in strSplittedColumnNames[[1]] ){
dfData$as.name(strColName) <- factor(dfData$get(strColName))
}
Tried:
dfData$as.name()
dfData$get(as.name())
dfData$get()
Error Msg:
Error: attempt to apply non-function
Any help would be greatly appreciated! Thank you!!!
You need to change
dfData$as.name(strColName) <- factor(dfData$get(strColName))
to
dfData[[strColName]] <- factor(dfData[[strColName]])
You may read ?"[[" for more.
In your case the column names are generated programmatically, so you need [[ rather than $, because $ does not evaluate its argument. This example illustrates the problem with $:
dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"
dat$z
# [1] NULL
dat[[z]]
# [1] 1 2 3 4 5
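Putting this together, the function the question asks for could be sketched roughly as follows. The helper name convert_to_factors, its arguments and the sample data are made up for illustration; only the [[ indexing comes from the answer above.

```r
# Hypothetical helper: convert the columns named in a delimited string to factors.
convert_to_factors <- function(df, col_string, delim = ",") {
  # Split "Admit,Rank" into c("Admit", "Rank") and drop stray whitespace
  cols <- trimws(strsplit(col_string, delim, fixed = TRUE)[[1]])

  # Fail early if a requested column does not exist
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }

  # [[ accepts a character value, which $ does not
  for (col in cols) {
    df[[col]] <- factor(df[[col]])
  }
  df
}

# Made-up data, for illustration only
dfData <- data.frame(Admit = c(0, 1, 1, 0),
                     Rank  = c(1, 2, 3, 2),
                     Gpa   = c(3.1, 3.6, 3.9, 2.8))
dfData <- convert_to_factors(dfData, "Admit,Rank")
str(dfData)
```

Returning the modified data frame, rather than changing it in place, follows R's usual copy semantics, so the caller has to reassign the result.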
Regarding the other answer
apply definitely does not work here, because the function you apply is as.factor or factor. apply always works on a matrix (if you feed it a data frame, it converts it to a matrix first) and here returns a matrix, and a matrix cannot hold factor columns. Consider this example:
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d"
# $ x2: chr "A" "B" "C" "D"
# $ x3: int 1 2 3 4
Note that you still have character variables rather than factors. As I said, we have to use lapply:
x[1:2] <- lapply(x[1:2], as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int 1 2 3 4
Now we see the factor class in x1 and x2.
Using apply for a data frame is never a good idea. If you read the source code of apply:
dl <- length(dim(X))
if (is.object(X))
X <- if (dl == 2L)
as.matrix(X)
else as.array(X)
You see that a data frame (which has two dimensions) is coerced to a matrix first. This can be slow for a large data frame. If the columns of the data frame have different classes, the resulting matrix has only one type, and the result of that coercion may not be what you expect.
Moreover, apply is written in R, not C, with an ordinary for loop:
for (i in 1L:d2) {
tmp <- forceAndCall(1, FUN, newX[, i], ...)
if (!is.null(tmp))
ans[[i]] <- tmp
so it is no better than an explicit for loop you write yourself.
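The coercion described above is easy to observe directly. This is a small sketch with made-up data; the variable names are arbitrary.

```r
# A data frame with one integer and one character column
mixed <- data.frame(num = 1:3, chr = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

# as.matrix() has to pick a single type for the whole matrix
mat <- as.matrix(mixed)
typeof(mat)                    # "character"

# apply() therefore sees only character columns ...
by_apply <- apply(mixed, 2, class)

# ... while sapply() works on the list of columns and sees the real classes
by_sapply <- sapply(mixed, class)

by_apply
by_sapply
```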
I would use a different method.
Create a vector of column names you want to change to factors:
factorCols <- c("Admit", "Rank")
Then find the indices of these columns:
myCols <- which(names(dfData) %in% factorCols)
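Finally, use lapply to change these columns to factors (indexing with a single argument, so that the result stays a data frame even if only one column matches):
dfData[myCols] <- lapply(dfData[myCols], as.factor)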
In data frames, [-indexing can be performed using a single character string, e.g. mtcars["mpg"].
On the other hand, trying the same on a matrix results in NA, e.g.
m = cbind(A = 1:5, B = 1:5)
m["A"]
# NA
...implying that this is somehow an invalid way to subset a matrix.
Is this normal R behavior? If so, where is it documented?
cbind() creates a matrix, by default. mtcars is a data frame.
class(cbind(A = 1:5, B = 1:5))
# [1] "matrix" "array"
class(mtcars)
# [1] "data.frame"
Because data frames are built as lists of columns, dataframe["column_name"], using one argument in [, defaults to treating the data frame as a list, allowing you to select columns, mostly the same as dataframe[, "column_name"].
A matrix has no such list underpinnings, so if you use [ with one argument, it doesn't assume you want columns. Use matrix[, "column_name"] to select columns from a matrix.
cbind is a bad way to create data frames from scratch. You can specify cbind.data.frame(A = 1:5, B = 1:5), but it's simpler and clearer to use data.frame(A = 1:5, B = 1:5). However, if you are adding multiple columns to an existing data frame then cbind(my_data_frame, A = 1:5, B = 1:5) is fine, and will result in a data frame as long as one of the arguments is already a data frame.
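The difference between the two kinds of object can be sketched side by side. The objects df and m are made up for illustration; drop = FALSE is not mentioned above, but it is the standard argument of [ for keeping the matrix shape.

```r
df <- data.frame(A = 1:5, B = 11:15)
m  <- as.matrix(df)            # integer matrix with column names A and B

# Data frame: a single index is list-style and selects columns
one_col_df <- df["A"]          # still a data frame, with one column
vec_df     <- df[, "A"]        # plain integer vector

# Matrix: a single index is vector-style, so "A" is looked up in names(m)
no_match   <- m["A"]           # NA, because a matrix has no names attribute
vec_m      <- m[, "A"]         # column selected through the dimnames
one_col_m  <- m[, "A", drop = FALSE]  # stays a 5 x 1 matrix

str(one_col_df)
str(one_col_m)
```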
This behaviour is documented in ?"[", section "Matrices and arrays":
Matrices and arrays are vectors with a dimension attribute and so
all the vector forms of indexing can be used with a single index.
It means that if you use just a single index, the object to subset is treated as an object without dimensions and so if the index is a character vector, the method will look for the names attribute, which is absent in this case (try names(m) on your matrix to check this). What you did in the question is totally equivalent to (c(1:5, 1:5))["A"]. If you use a double index instead, the method will search for the dimnames attribute to subset. Even if confusing, a matrix may have both names and dimnames. Consider this:
m<-matrix(c(1:5,1:5), ncol = 2, dimnames = list(LETTERS[1:5], LETTERS[1:2]))
names(m)<-LETTERS[1:10]
#check whether the attributes are set
str(m)
# int [1:5, 1:2] 1 2 3 4 5 1 2 3 4 5
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# ..$ : chr [1:2] "A" "B"
# - attr(*, "names")= chr [1:10] "A" "B" "C" "D" ...
We have set rownames, colnames and names. Let's subset it:
#a column
m[,"A"]
#A B C D E
#1 2 3 4 5
#a row
m["A",]
# A B
#1 1
#an element
m["A"]
#A
#1
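There are two cases here. A matrix is an atomic vector with a dim attribute:
m = cbind(A = 1:5, B = 11:15)
typeof(m)
# [1] "integer"
whereas a data frame is a list:
typeof(mtcars)
# [1] "list"
So indexing works differently. The matrix case needs a comma:
cbind(A = 1:5, B = 11:15)[,"A"]
# [1] 1 2 3 4 5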
I have been wondering about this for a long time. The data.frame class in base R only allows the columns to be vectors. I was looking for a package which generalizes this so that each "column" can be a 2-d or even n-d array, with methods similar to those of the original data.frame class, such as subsetting with "[]", merge, aggregate, etc.
My reason for such a class is to deal with Monte Carlo simulation data. For example, for each simulation the result can be expressed as a data frame in which the row indices are dates, and columns include character and numeric. If I simulate 1000 times then I get 1000 such data frames. If there is a class in R with which I can store the results in one object and has the convenience of most of the data.frame methods, it'll make my coding a lot easier.
As I couldn't find such a package I attempted to create my own with no success. I came across this package "S4Vectors" with a "DataFrame" class, which "supports the storage of any type of object (with length and [ methods) as columns." Here is my attempt.
library(S4Vectors)
test <- matrix(1:6,2,3)
test1 <- matrix(7:12,2,3)
setClass("Column", slots=list(), contains = "matrix")
setMethod("length", "Column", function(x) {nrow(x)})
'[.Column' <- function(x, i, j, ...) {
i <- ((i-1)*ncol(x)+1):(i*(ncol(x)))
NextMethod()
}
testColumn <- new("Column", test)
testColumn1 <- new("Column", test1)
length(testColumn)
testColumn[1]
testDataFrame <- DataFrame(Col1 = testColumn, Col2 = testColumn1)
I did get the length and [ method to work but the last statement gives an error "cannot coerce class "Column" to a DataFrame".
Has anyone ever tried to do something similar?
Update: Thanks to G. Grothendieck I now know a data frame can take a matrix as a column by using the I() function. Now I am wondering if there is a way to preserve such a structure in all operations. An example would be to aggregate the data frame
data.frame(v = c(1,1,2,2), m = I(diag(4)))
by v so that the result is
data.frame(v = c(1,2), m = I(matrix(c(1,1,0,0,0,0,1,1), 2, 4, byrow = T))).
Data frames do allow matrix columns:
m <- diag(4)
v <- 1:4
DF <- data.frame(v, m = I(m))
str(DF)
giving:
'data.frame': 4 obs. of 2 variables:
$ v: int 1 2 3 4
$ m: 'AsIs' num [1:4, 1:4] 1 0 0 0 0 1 0 0 0 0 ...
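As a small follow-up sketch (not part of the original answer), a matrix column can also be created by assignment, and row subsetting keeps it as a matrix. The object names are arbitrary.

```r
DF <- data.frame(v = 1:4)

# Assigning a matrix with a matching number of rows creates one matrix column
DF$m <- diag(4)
dim(DF)                  # 4 rows, 2 columns (v and m)
dim(DF$m)                # the column itself is a 4 x 4 matrix

# Row subsetting of the data frame also subsets the rows of the matrix column
sub <- DF[DF$v > 2, ]
dim(sub$m)               # 2 x 4

# Individual rows of the matrix column are reached with ordinary indexing
DF$m[2, ]
```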
Update 1
The R aggregate function can create matrix columns. For example,
DF <- data.frame(v = 1:4, g = c(1, 1, 2, 2))
ag <- aggregate(v ~ g, DF, function(x) c(sum = sum(x), mean = mean(x)))
str(ag)
giving:
'data.frame': 2 obs. of 2 variables:
$ g: num 1 2
$ v: num [1:2, 1:2] 3 7 1.5 3.5
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "sum" "mean"
Update 2
I don't think the aggregation discussed in the comments is nicely supported in R but you can use the following workaround:
m <- matrix(1:16, 4)
v <- c(1, 1, 2, 2)
DF <- data.frame(v, m = I(m))
nr <- nrow(DF)
ag2 <- aggregate(list(sum = 1:nr), DF["v"], function(ix) colSums(DF$m[ix, ]))
str(ag2)
giving:
'data.frame': 2 obs. of 2 variables:
$ v : num 1 2
$ sum: num [1:2, 1:4] 3 7 11 15 19 23 27 31
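For the specific case of group-wise sums, base R's rowsum() may be a simpler route than the index trick above. This is a sketch of a different technique, not the answer's method, using the diag(4) example from the question's update.

```r
m <- diag(4)
v <- c(1, 1, 2, 2)
DF <- data.frame(v, m = I(m))

# rowsum() adds up the rows of a matrix within each group;
# unclass() strips the "AsIs" class that I() attached
sums <- rowsum(unclass(DF$m), group = DF$v)

# Rebuild a data frame with one row per group and a matrix column
ag <- data.frame(v = as.numeric(rownames(sums)), m = I(unname(sums)))
str(ag)
```

This only covers sums; other summaries would still need something like the aggregate workaround.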
I have a character vector of classes that I would like to apply to a dataframe, so as to convert the current class of each field in that dataframe to the corresponding entry in the vector. For example:
frame <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
With a for-loop, I know that this can be accomplished using lapply. For example:
for(i in 1:2) { frame[i] <- lapply(frame[i], paste("as", classes[i], sep = ".")) }
For my purposes, however, a for-loop cannot work. Is there another solution that I am missing?
Thank you in advance for your input!
Note: I have been informed that this might be a duplicate of this post. And, yes, my question is similar to it. But I have looked at the class() approach before, and it does not seem to deal effectively with converting fields to factors. The lapply approach, on the other hand, does it well. But, unfortunately, I cannot use a for-loop in this instance.
If you're not averse to using lapply without a for loop, you can try something like the following.
frame[] <- lapply(seq_along(frame), function(x) {
FUN <- paste("as", classes[x], sep = ".")
match.fun(FUN)(frame[[x]])
})
str(frame)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
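As a variation on the same idea, Map() can walk over the columns and the class names in parallel, which avoids indexing by position. This is only a sketch; the helper name convert_col is made up.

```r
frame <- data.frame(A = 2:5, B = 3:6)
classes <- c("character", "factor")

# Hypothetical helper: look up as.<class> by name and apply it to one column
convert_col <- function(col, cls) {
  match.fun(paste0("as.", cls))(col)
}

# Map() pairs the i-th column with the i-th class name;
# frame[] <- keeps the data.frame structure and the column names
frame[] <- Map(convert_col, frame, classes)
sapply(frame, class)
```

Like the lapply version, this assumes classes has one entry per column, in column order.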
However, a better option is to try to apply the correct classes when you're reading the data in to begin with.
x <- tempfile() # Just to pretend....
write.csv(frame2, x, row.names = FALSE) # ... that we are reading a csv
frame3 <- read.csv(x, colClasses = classes)
str(frame3)
# 'data.frame': 4 obs. of 2 variables:
# $ A: chr "2" "3" "4" "5"
# $ B: Factor w/ 4 levels "3","4","5","6": 1 2 3 4
Sample data:
frame <- frame2 <- data.frame(A = c(2:5), B = c(3:6))
classes <- c("character", "factor")
Why does this work,
# add ONE column to dataframe with zero rows
x <- data.frame(a=character(0))
x["b"] <- character(0)
while this does not?
# add SEVERAL columns to dataframe with zero rows
x <- data.frame(a=character(0))
x[c("b", "c")] <- character(0)
error in value[[jvseq[[jjj]]]] : index out of limits [... freely translated]
Note that this works perfectly well when the data frame has a non-zero number of rows.
x <- data.frame(a=1)
x["b"] <- NA
x <- data.frame(a=1)
x[c("b", "c")] <- NA
And what would be a simple alternative to add multiple columns to zero row dataframes?
From help("[.data.frame"):
Data frames can be indexed in several modes. When [ and [[ are used
with a single vector index (x[i] or x[[i]]), they index the data frame
as if it were a list.
From help("["):
Recursive (list-like) objects
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
Thus, you need to pass a list (or data.frame):
x <- data.frame(a=character(0))
x[c("b", "c")] <- list(character(0), character(0))
str(x)
#'data.frame': 0 obs. of 3 variables:
# $ a: Factor w/ 0 levels:
# $ b: chr
# $ c: chr
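For an arbitrary number of new columns, the list can be built with rep(). The helper below is a sketch with made-up names (add_empty_cols, template), extending the list-based assignment shown above.

```r
# Hypothetical helper: add several empty columns of the same type
add_empty_cols <- function(df, cols, template = character(0)) {
  # One zero-length vector per new column, wrapped in a list
  df[cols] <- rep(list(template), length(cols))
  df
}

x <- data.frame(a = character(0))
x <- add_empty_cols(x, c("b", "c", "d"))
str(x)
```

Passing template = numeric(0) would give numeric columns instead.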
What is an easy way to find out what class each column is in a data frame?
One option is to use lapply and class. For example:
> foo <- data.frame(c("a", "b"), c(1, 2), stringsAsFactors = TRUE)
> names(foo) <- c("SomeFactor", "SomeNumeric")
> lapply(foo, class)
$SomeFactor
[1] "factor"
$SomeNumeric
[1] "numeric"
Another option is str:
> str(foo)
'data.frame': 2 obs. of 2 variables:
$ SomeFactor : Factor w/ 2 levels "a","b": 1 2
$ SomeNumeric: num 1 2
You can simply use the built-in lapply or sapply functions.
lapply returns a list:
lapply(dataframe, class)
while sapply simplifies the result where it can, for example to a character vector:
sapply(dataframe, class)
Both commands return every column name together with its class.
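One caveat worth a sketch: when a column has more than one class, sapply cannot simplify and falls back to a list. Collapsing the classes inside vapply gives a character vector in every case. The data below is made up for illustration.

```r
df <- data.frame(
  n = 1:2,
  t = as.POSIXct(c("2020-01-01", "2020-01-02"), tz = "UTC")
)

# The date-time column has two classes, so the result cannot be simplified
as_list <- sapply(df, class)
is.list(as_list)

# vapply() with a collapsed string always returns one string per column
as_vector <- vapply(df, function(x) paste(class(x), collapse = ","),
                    character(1))
as_vector
```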
I was looking for the same thing; another option is
unlist(lapply(mtcars,class))
I wanted a more compact output than the great answers above using lapply, so here's an alternative wrapped as a small function.
# Example data
df <-
data.frame(
w = seq.int(10),
x = LETTERS[seq.int(10)],
y = factor(letters[seq.int(10)]),
z = seq(
as.POSIXct('2020-01-01'),
as.POSIXct('2020-10-01'),
length.out = 10
)
)
# Function returning compact column classes
col_classes <- function(df) {
t(as.data.frame(lapply(df, function(x) paste(class(x), collapse = ','))))
}
# Return example data's column classes
col_classes(df)
[,1]
w "integer"
x "character"
y "factor"
z "POSIXct,POSIXt"
You can use purrr as well, whose map functions are similar to the apply family:
as.data.frame(purrr::map_chr(mtcars, class))
purrr::map_df(mtcars, class)