I have data in the form of an m*n matrix. I would like to split the matrix by column and save each column separately in a different vector.
E.g.
data<-matrix(1:9, ncol=3)
I would like to have vec1 containing the first column so
vec1 will be transpose of [1,2,3], a column matrix with dimension 3*1 which is basically the first column of data. Similarly, vec2 represents the 2nd column and vec3 represents the last column.
I understand that I can do this manually by repeating
vec1<-data[,1],
vec2<-data[,2]
...
vecn<-data[,n].
However, this is not feasible when n is large.
So I would like to know whether it is feasible to use a loop to do this.
As has been pointed out, this is probably not a good idea. However, if you still felt the need to proceed down this path, rather than using a list or just using the source matrix itself, the easiest approach would probably be to use a combination of list2env and data.frame.
Here's a demo, step-by-step:
data <- matrix(1:9, ncol=3)
ls() # Only one object in my workspace
# [1] "data"
data_list <- unclass(data.frame(data))
str(data_list)
# List of 3
# $ X1: int [1:3] 1 2 3
# $ X2: int [1:3] 4 5 6
# $ X3: int [1:3] 7 8 9
# - attr(*, "row.names")= int [1:3] 1 2 3
ls() # Two objects now
# [1] "data" "data_list"
list2env(data_list, envir = .GlobalEnv)
ls() # Five objects now
# [1] "data" "data_list" "X1" "X2" "X3"
X1
# [1] 1 2 3
If you want single-column data.frames, you can use split.default:
list2env(setNames(split.default(data.frame(data), seq(ncol(data))),
paste0("var", seq(ncol(data)))), envir = .GlobalEnv)
var1
# X1
# 1 1
# 2 2
# 3 3
Putting this together, you can actually do this all at once (without first having to create "data_list") like this:
list2env(data.frame(data), envir = .GlobalEnv)
But again, you should have a good reason to do so!
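For comparison, here is a minimal sketch of the list-based alternative mentioned above, which keeps the columns together in one object instead of creating n separate variables. The names cols and vec1 ... vecn are only illustrative.

```r
data <- matrix(1:9, ncol = 3)

# One list element per column; seq_len() is also safe when ncol(data) is 0
cols <- lapply(seq_len(ncol(data)), function(j) data[, j])
names(cols) <- paste0("vec", seq_len(ncol(data)))

cols$vec1        # first column as a plain integer vector
cols[["vec3"]]   # last column, selected by a name held in a string
```

Because everything lives in one list, you can loop over it with lapply() or pick a column by a computed name, which is awkward with free-standing variables.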
In data frames, [-indexing can be performed using a single character. E.g. mtcars["mpg"].
On the other hand, trying the same on a matrix, results in NA, e.g.
m = cbind(A = 1:5, B = 1:5)
m["A"]
# NA
...implying that this is somehow an invalid way to subset a matrix.
Is this normal R behavior? If so, where is it documented?
cbind() creates a matrix, by default. mtcars is a data frame.
class(cbind(A = 1:5, B = 1:5))
# [1] "matrix" "array"
class(mtcars)
# [1] "data.frame"
Because data frames are built as lists of columns, dataframe["column_name"], using one argument in [, defaults to treating the data frame as a list, allowing you to select columns, mostly the same as dataframe[, "column_name"].
A matrix has no such list underpinnings, so if you use [ with one argument, it doesn't assume you want columns. Use matrix[, "column_name"] to select columns from a matrix.
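A small side-by-side sketch of the difference, using a toy matrix and data frame made up for illustration:

```r
m  <- cbind(A = 1:5, B = 11:15)             # a matrix
df <- data.frame(A = 1:5, B = 11:15)        # a data frame

df["A"]                  # list-style indexing: a one-column data frame
df[, "A"]                # matrix-style indexing: a plain vector

m[, "A"]                 # the column as a plain vector
m[, "A", drop = FALSE]   # the column kept as a 5 x 1 matrix
m["A"]                   # single index: looks in names(m), which is NULL, so NA
```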
cbind is a bad way to create data frames from scratch. You can specify cbind.data.frame(A = 1:5, B = 1:5), but it's simpler and clearer to use data.frame(A = 1:5, B = 1:5). However, if you are adding multiple columns to an existing data frame then cbind(my_data_frame, A = 1:5, B = 1:5) is fine, and will result in a data frame as long as one of the arguments is already a data frame.
This behaviour is documented in ?"[", section "Matrices and arrays":
Matrices and arrays are vectors with a dimension attribute and so
all the vector forms of indexing can be used with a single index.
It means that if you use just a single index, the object to subset is treated as an object without dimensions, and so if the index is a character vector, the method will look for the names attribute, which is absent in this case (try names(m) on your matrix to check this). What you did in the question is equivalent to (c(1:5, 1:5))["A"]. If you use a double index instead, the method will search the dimnames attribute to subset. Confusingly, a matrix may have both names and dimnames. Consider this:
m<-matrix(c(1:5,1:5), ncol = 2, dimnames = list(LETTERS[1:5], LETTERS[1:2]))
names(m)<-LETTERS[1:10]
#check whether the attributes are set
str(m)
# int [1:5, 1:2] 1 2 3 4 5 1 2 3 4 5
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# ..$ : chr [1:2] "A" "B"
# - attr(*, "names")= chr [1:10] "A" "B" "C" "D" ...
We have set rownames, colnames and names. Let's subset it:
#a column
m[,"A"]
#A B C D E
#1 2 3 4 5
#a row
m["A",]
# A B
#1 1
#an element
m["A"]
#A
#1
There are two cases here. A matrix is an atomic vector with dimensions:
m = cbind(A = 1:5, B = 11:15)
typeof(m)
"integer"
whereas a data frame is a list:
typeof(mtcars)
"list"
So the two are indexed differently. The first case (the matrix) needs a comma:
cbind(A = 1:5, B = 11:15)[,"A"]
[1] 1 2 3 4 5
I have a data.frame that arbitrarily defines parameter names and sequence boundaries:
dfParameterValues <- data.frame(ParameterName = character(), seqFrom = integer(), seqTo = integer(), seqBy = integer())
row1 <- data.frame(ParameterName = "parameterA", seqFrom = 1, seqTo = 2, seqBy = 1)
row2 <- data.frame(ParameterName = "parameterB", seqFrom = 5, seqTo = 7, seqBy = 1)
row3 <- data.frame(ParameterName = "parameterC", seqFrom = 10, seqTo = 11, seqBy = 1)
dfParameterValues <- rbind(dfParameterValues, row1)
dfParameterValues <- rbind(dfParameterValues, row2)
dfParameterValues <- rbind(dfParameterValues, row3)
I would like to use this approach to create a grid of c parameter columns based on the number of unique ParameterNames that contain r rows of all possible combinations of the sequences given by seqFrom, seqTo, and seqBy. The result would therefore look somewhat like this or should have a content like the following:
ParameterA ParameterB ParameterC
1 5 10
1 5 11
1 6 10
1 6 11
1 7 10
1 7 11
2 5 10
2 5 11
2 6 10
2 6 11
2 7 10
2 7 11
Edit: Note that the parameter names and their numbers are not known in advance. The data.frame comes from elsewhere so I cannot use the standard static expand.grid approach and need something like a flexible function that creates the expanded grid based on any dataframe with the columns ParameterName, seqFrom, seqTo, seqBy.
I've been playing around with for loops (which is bad to begin with) and it hasn't led me to any elegant ideas. I can't seem to find a way to come up with the result by using tidyr without constructing the sequences separately first, either. Do you have any elegant approaches?
Bonus kudos for extending this to include not only numerical sequences, but vectors/sets of characters / other factors, too.
Many thanks!
Going off CPak's answer, you could use
my_table <- expand.grid(apply(dfParameterValues, 1, function(x) seq(as.numeric(x['seqFrom']), as.numeric(x['seqTo']), as.numeric(x['seqBy']))))
names(my_table) <- c("ParameterA", "ParameterB", "ParameterC")
my_table <- my_table[order(my_table$ParameterA, my_table$ParameterB), ]
@smanski's answer is technically correct (and should arguably be accepted since it motivated this), but it is also a good example of why you need to be careful when using apply with data.frames. In this case, the frame contains at least one character column, so all columns are converted to character, which is why as.numeric is needed. The safer alternative is to pull only the columns needed, such as either of:
expand.grid(apply(dfParameterValues[,-1], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
I prefer the second, because it only pulls what it needs and therefore what it "knows" should be numeric. (I find explicit is often safer.)
The reason this is happening is that apply silently converts the data to a matrix, so to see the effects, try:
str(as.matrix(dfParameterValues))
# chr [1:3, 1:4] "parameterA" "parameterB" "parameterC" " 1" " 5" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:4] "ParameterName" "seqFrom" "seqTo" "seqBy"
str(as.matrix(dfParameterValues[c("seqFrom","seqTo","seqBy")]))
# num [1:3, 1:3] 1 5 10 2 7 11 1 1 1
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:3] "seqFrom" "seqTo" "seqBy"
(Note the chr on the first and the num on the second.)
Neither one preserves the parameter names. To do that, just sandwich the call with setNames:
setNames(
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) )),
dfParameterValues$ParameterName)
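As an alternative sketch that avoids apply (and therefore the matrix conversion) altogether, Map can build one sequence per row directly from the numeric columns. This is not taken from the answers above; the names seqs and grid are illustrative, and the data frame is rebuilt here only so the example runs on its own.

```r
dfParameterValues <- data.frame(
  ParameterName = c("parameterA", "parameterB", "parameterC"),
  seqFrom = c(1, 5, 10),
  seqTo   = c(2, 7, 11),
  seqBy   = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# One sequence per row; Map always returns a list, even when all
# sequences happen to have the same length
seqs <- Map(function(from, to, by) seq(from, to, by = by),
            dfParameterValues$seqFrom,
            dfParameterValues$seqTo,
            dfParameterValues$seqBy)
names(seqs) <- as.character(dfParameterValues$ParameterName)

grid <- expand.grid(seqs)
# expand.grid varies the first column fastest; reorder if you want the
# first parameter to vary slowest, as in the question's table
grid <- grid[do.call(order, unname(as.list(grid))), ]
```

Since expand.grid accepts any named list of vectors, the same pattern should extend to character or factor sets: replace an element of seqs with, say, c("low", "high") before calling expand.grid.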
I am trying to create a function that allows the conversion of selected columns of a data frame to categorical data type (factor) before running a regression analysis.
The question is: how do I select a particular column from a data frame using a string (character) holding its name?
Example:
strColumnNames <- "Admit,Rank"
strDelimiter <- ","
strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
for( strColName in strSplittedColumnNames[[1]] ){
dfData$as.name(strColName) <- factor(dfData$get(strColName))
}
Tried:
dfData$as.name()
dfData$get(as.name())
dfData$get()
Error Msg:
Error: attempt to apply non-function
Any help would be greatly appreciated! Thank you!!!
You need to change
dfData$as.name(strColName) <- factor(dfData$get(strColName))
to
dfData[[strColName]] <- factor(dfData[[strColName]])
You may read ?"[[" for more.
In your case, column names are generated programmatically, so [[ ]] is the way to go. Maybe this example will be clear enough to illustrate the problem with $:
dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"
dat$z
# [1] NULL
dat[[z]]
# [1] 1 2 3 4 5
Regarding the other answer
apply definitely does not work here, because the function you apply is as.factor or factor. apply always works on a matrix (if you feed it a data frame, it will convert it into a matrix first) and returns a matrix, and a matrix cannot hold factor columns. Consider this example:
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d"
# $ x2: chr "A" "B" "C" "D"
# $ x3: int 1 2 3 4
Note that you still have character variables rather than factors. As I said, we have to use lapply:
x[1:2] <- lapply(x[1:2], as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int 1 2 3 4
Now we see the factor class in x1 and x2.
Using apply for a data frame is never a good idea. If you read the source code of apply:
dl <- length(dim(X))
if (is.object(X))
X <- if (dl == 2L)
as.matrix(X)
else as.array(X)
You see that a data frame (which has 2 dimensions) will be coerced to a matrix first. This is very slow. If your data frame columns have several different classes, the resulting matrix will have only one, and the result of such coercion can be surprising.
Moreover, apply is written in R, not C, with an ordinary for loop:
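A quick sketch of what that coercion does to a mixed-type data frame (the toy columns id and label are made up for illustration):

```r
df <- data.frame(id = 1:2, label = c("x", "y"), stringsAsFactors = FALSE)

sapply(df, class)   # the columns start out as integer and character

m <- as.matrix(df)  # the same conversion apply() performs internally
typeof(m)           # "character": the integer column was converted too
```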
for (i in 1L:d2) {
tmp <- forceAndCall(1, FUN, newX[, i], ...)
if (!is.null(tmp))
ans[[i]] <- tmp
so it is no better than an explicit for loop you write yourself.
I would use a different method.
Create a vector of column names you want to change to factors:
factorCols <- c("Admit", "Rank")
Then extract these columns by index:
myCols <- which(names(dfData) %in% factorCols)
Finally, use lapply to change these columns to factors:
dfData[,myCols] <- lapply(dfData[,myCols],as.factor)
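The pieces above can be wrapped into a small function, as the question asked. This is only a sketch: to_factor_cols is a hypothetical helper name, and the toy dfData (including the GPA column) is invented for the example. It uses list-style indexing, df[cols], so it also works when only one column is selected.

```r
# Hypothetical helper: convert the named columns of a data frame to factors
to_factor_cols <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  # df[cols] stays a data frame (a list of columns) even for one column
  df[cols] <- lapply(df[cols], as.factor)
  df
}

# Toy data standing in for the question's dfData
dfData <- data.frame(Admit = c(0, 1, 1, 0),
                     Rank  = c(1, 2, 2, 3),
                     GPA   = c(3.1, 3.6, 3.9, 2.8))

cols <- strsplit("Admit,Rank", ",", fixed = TRUE)[[1]]
out  <- to_factor_cols(dfData, cols)
str(out)
```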
I have a data frame called input. The first column refers to an Article ID (ArtID), the subsequent columns will be used to create the matrix.
Based on the ArtID, I want R to generate a 2x2 matrix (more precise: It needs to be a numeric 2x2 matrix). Specifically, I want to create a matrix for the first row (ArtID == 1), the second row(ArtID == 2) and so on...
What I came up with so far is this:
for(i in 1:3) {stored.matrix = matrix(input[which(ArtID ==i),-1],nrow = 2)}
This gives me a 2x2 matrix, but it is not numeric (which it needs to be).
If I apply as.numeric, the matrix is no longer a 2x2 matrix.
How do I get a 2x2 numerical matrix?
Minimal reproducible example:
ArtID = c(1,2,3)
AC_AC = c(1,1,1)
MKT_AC = c(0.5,0.6,0.2)
AC_MKT = c(0.5,0.6,0.2)
MKT_MKT = c(1,1,1)
input = data.frame(ArtID, AC_AC, MKT_AC, AC_MKT, MKT_MKT)
stored.matrix = matrix(input[which(ArtID ==i),-1],nrow = 2)
# [,1] [,2]
#[1,] 1 0.5
#[2,] 0.5 1
is.numeric(stored.matrix)
# [1] FALSE
as.numeric(stored.matrix)
## [1] 1.0 0.5 0.5 1.0
As you can see after applying as.numeric() the matrix is no longer 2x2.
Can anyone help?
You could use unlist():
matrix(unlist(input[ArtID ==i,-1]),2)
or use
storage.mode(m) <- "numeric"
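To get one numeric matrix per ArtID rather than overwriting a single variable in a loop, the unlist() idea can be applied to each row and the results collected in a list. A minimal sketch, with mats as an illustrative name:

```r
input <- data.frame(ArtID   = c(1, 2, 3),
                    AC_AC   = c(1, 1, 1),
                    MKT_AC  = c(0.5, 0.6, 0.2),
                    AC_MKT  = c(0.5, 0.6, 0.2),
                    MKT_MKT = c(1, 1, 1))

mats <- lapply(input$ArtID, function(id) {
  row <- input[input$ArtID == id, -1]   # still a one-row data frame (a list)
  matrix(unlist(row), nrow = 2)         # unlist() flattens it to a numeric vector
})
names(mats) <- input$ArtID              # names become "1", "2", "3"

mats[["2"]]
is.numeric(mats[["2"]])
```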
When you have only numerical values in your data frame, it is more appropriate to use a matrix. Converting your data frame to a matrix solves the problem:
input <- data.matrix(input)
ArtID = c(1,2,3)
AC_AC = c(1,1,1)
MKT_AC = c(0.5,0.6,0.2)
AC_MKT = c(0.5,0.6,0.2)
MKT_MKT = c(1,1,1)
input = data.frame(ArtID, AC_AC, MKT_AC, AC_MKT, MKT_MKT)
input <- data.matrix(input) ## <- this line
stored.matrix = matrix(input[which(ArtID ==i),-1], 2)
is.numeric(stored.matrix)
# [1] TRUE
So what was the problem?
If input is a data frame, row subsetting with input[which(ArtID == i),-1] still returns a data frame. A data frame is a special type of list. When you feed a list to matrix(), you end up with a list matrix, i.e. a matrix whose elements are list elements.
If you read ?matrix for what data it can take, you will see:
data: an optional data vector (including a list or ‘expression’
vector). Non-atomic classed R objects are coerced by
‘as.vector’ and all attributes discarded.
Note that a list is also of vector data type (e.g., is.vector(list(a = 1)) gives TRUE), so it is legitimate to feed a list to matrix. You can try
test <- matrix(list(a = 1, b = 2, c = 3, d = 4), 2)
# [,1] [,2]
#[1,] 1 3
#[2,] 2 4
This is indeed a matrix in the sense that class(test) includes "matrix", but
str(test)
#List of 4
# $ : num 1
# $ : num 2
# $ : num 3
# $ : num 4
# - attr(*, "dim")= int [1:2] 2 2
typeof(test)
# [1] "list"
so it is not the usual numerical matrix we refer to.
The input list can be ragged, too.
test <- matrix(list(a = 1, b = 2:3, c = 4:6, d = 7:10), 2)
# [,1] [,2]
#[1,] 1 Integer,3
#[2,] Integer,2 Integer,4
str(test)
#List of 4
# $ : num 1
# $ : int [1:2] 2 3
# $ : int [1:3] 4 5 6
# $ : int [1:4] 7 8 9 10
# - attr(*, "dim")= int [1:2] 2 2
And I was wondering why typeof() gives me list... :)
Yes, so you had already noticed something unusual. The storage mode of a matrix is determined by that of its elements. For a list matrix, the elements are list elements, hence the matrix has "list" mode.
Suppose I have a list of 3 elements and each element is a list of 2 other elements: the first a vector of length 4 and the second, say, a character string. The following code will produce a list exactly as I just described it:
x <- NULL
for(i in 1:3){
set.seed(i); a <- list(sample(1:4, 4, replace = T), LETTERS[i])
x <- c(x, list(a))
}
Its structure is therefore of the following type (the exact values may change since I used the sample function):
> str(x)
List of 3
$ :List of 2
..$ : int [1:4] 2 2 3 4
..$ : chr "A"
$ :List of 2
..$ : int [1:4] 1 3 3 1
..$ : chr "B"
$ :List of 2
..$ : int [1:4] 1 4 2 2
..$ : chr "C"
Now, I have an other 4-dimensional vector, say y:
y <- 1:4
Finally I want to create a matrix resulting from the operation (say sum) between y and each 4-dimensional vector stored in the list. For the given example, this matrix would give the following result:
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 4 5 6
[3,] 6 6 5
[4,] 8 5 6
Question: How can I create the above matrix in a simple and elegant way? I was searching for some solution that could use some apply function or that could use directly the sum function in some way that I'm not aware of.
Try this:
# you can also simply write: sapply(x, function(x) x[[1]]) + y
foo <- function(x) x[[1]]
sapply(x, foo) + y
The function foo extracts the vector inside the list;
sapply returns those vectors as a matrix;
Finally, we use recycling rule for addition.
Update 1
Well, since @Frank mentioned it, I will add a little explanation. The '[[' operator in R (note the quotes!) is a function taking two arguments. The first is a vector-type object, like a vector or list, while the second is the index you want to refer to. For example:
a <- 1:4
a[[2]] # 2
'[['(a, 2) # 2
Though my original answer is easier to digest, it is not the most efficient, because for each list element, two function calls are invoked to take out the vector. While if we use '[[' directly, only one function call is invoked. Therefore, we get time savings by reducing function call overhead. Function call overhead can be noticeable, when the function is small and does not do much work.
Operators in R are essentially functions. +, *, etc are arithmetic operators and you can find them by ?'+'. Similarly, you can find ?'[['. Don't worry too much if you can't follow this at the moment. Sooner or later you will get to it.
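Putting this together, a sketch of the version that passes "[[" to sapply directly. The list x is written out by hand here with the values shown in the question, so the result does not depend on sample().

```r
x <- list(list(c(2L, 2L, 3L, 4L), "A"),
          list(c(1L, 3L, 3L, 1L), "B"),
          list(c(1L, 4L, 2L, 2L), "C"))
y <- 1:4

# sapply() calls `[[`(element, 1) for each element of x and binds the
# resulting vectors as the columns of a matrix; y is then added to each
# column because a matrix is stored column by column
res <- sapply(x, "[[", 1) + y
res

# Operators are ordinary functions and can be called by name
`+`(1, 2)   # same as 1 + 2
```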
Update 2
I don't understand how it actually does the job. When I simply ask for [[1]] at the console, I get the first element of the list (both the integer vector and the char value), not just the vector. I guess the remainder should be the magics of the sapply function.
Ah, if you have difficulty in understanding sapply (or similarly lapply), consider the following. But I will start from lapply.
output <- lapply(x, foo) is doing something like:
output <- vector("list", length = length(x))
for (i in 1:length(x)) output[[i]] <- foo(x[[i]])
So lapply returns a list:
> output
[[1]]
[1] 2 2 3 4
[[2]]
[1] 1 4 4 3
[[3]]
[1] 3 1 1 1
Yes, lapply loops through the elements of x, applies the function foo to each, and returns the results in another list.
sapply takes the similar idea, but returns a vector/matrix. You may think that sapply collapses the result of lapply to a vector/matrix.
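To make the "collapse" idea concrete, here is a sketch comparing sapply with binding the lapply result by hand, plus vapply, which additionally checks that every result has the declared type and length. The list x is written out explicitly so the example is reproducible.

```r
x <- list(list(c(2L, 2L, 3L, 4L), "A"),
          list(c(1L, 3L, 3L, 1L), "B"),
          list(c(1L, 4L, 2L, 2L), "C"))
foo <- function(el) el[[1]]

as_list   <- lapply(x, foo)            # a list of three integer vectors
by_hand   <- do.call(cbind, as_list)   # bind them as columns: a 4 x 3 matrix
as_matrix <- sapply(x, foo)            # sapply does the simplification for you

# vapply errors if any result is not an integer vector of length 4
checked <- vapply(x, foo, FUN.VALUE = integer(4))
```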
Of course, this part of the explanation is only meant to make things understandable. lapply and sapply are not really implemented as R loops; they are more efficient.