So I have a list of length 9 where each element is a dataframe. I wanted to extract specific columns from each dataframe in the most efficient way possible, so I used the call below.
Down <- lapply(tables, "[", 2)
This successfully extracted the information I wanted, but why? What is "[" and how does it satisfy the semantics of lapply?
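A short sketch of what is going on, using three copies of the built-in mtcars as a stand-in for the original `tables` list (an assumption, since the real data is not shown). In R, `[` is an ordinary function, so `x[2]` is the same call as `` `[`(x, 2) ``; `lapply` resolves the string "[" to that function and passes `2` along as an extra argument.

```r
# Stand-in for the original list of data frames (assumption)
tables <- list(mtcars, mtcars, mtcars)

# `[` is a regular function: x[2] and `[`(x, 2) are the same call
identical(mtcars[2], `[`(mtcars, 2))

# lapply() looks up the function named "[" and forwards 2 as its second argument
by_name <- lapply(tables, "[", 2)
by_hand <- lapply(tables, function(x) x[2])
identical(by_name, by_hand)

# Each element is still a one-column data frame, not a bare vector
class(by_name[[1]])
names(by_name[[1]])
```

Note that `x[2]` keeps the data frame structure, whereas `x[, 2]` or `x$cyl` would return a plain vector.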
To augment Ritchie Sacramento's very nice explanation, you can also access the dataframe columns in a list using conventional notation:
cars1 <- mtcars
cars2 <- cars1
cars3 <- cars2
tables <- list(cars1, cars2, cars3)
lapply(tables, function(x) x$cyl)
[[1]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
lapply(tables, function(x) x[, 2])
[[1]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
c. What is the correlation coefficient between “beds” and “baths” for the houses whose total rooms do not exceed 6 (i.e., are less than or equal to 6), where total rooms = “beds” + “baths”?
Data:
myData$beds + myData$baths
[1] 5 5 6 5 5 5 5 7 6 5 5 5 6 6 5 6 5 6 5 6 6 5 7 5 5 8 5 5 7 5 7 7 6 7 8 6 7
[38] 7 7 6 5 7 6 7 6 7 5 7 6 5 5 7 5 5 6 7 7 5 5 7 5 7 8 8 5 6 6 5 5 6 6 8 6 7
[75] 7 6 6 6 7 6 6 7 7 6 6 6 8 7 6 8 6 9 10 7 7 7 6 6 6 7 8 7 6 7 7 7 7 7
How do I write code that computes cor(myData$beds, myData$baths), with the condition that the sum of both is at most 6 (less than or equal to 6)?
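One possible approach is to build a logical index from the condition and apply it to both columns before calling `cor`. The sketch below uses a small made-up data frame in place of `myData` (the values are hypothetical, only the column names come from the question).

```r
# Hypothetical stand-in for myData; only the column names match the question
myData <- data.frame(
  beds  = c(3, 4, 2, 5, 3, 4),
  baths = c(2, 3, 1, 4, 3, 2)
)

# TRUE for houses whose total rooms do not exceed 6
keep <- myData$beds + myData$baths <= 6

# Correlation computed on the filtered rows only
r_indexed <- cor(myData$beds[keep], myData$baths[keep])

# Equivalent formulation with subset()
r_subset <- with(subset(myData, beds + baths <= 6), cor(beds, baths))

r_indexed
```

If the condition really is meant to be strict, `<=` can be swapped for `<`.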
How to create a vector sequence of:
2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
I tried to use:
2:8+rep(0:6,each=6)
but the result is:
2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 .... 12 13 14
Please help. Thanks.
This should accomplish what you're looking for:
x = 2
VecSeq = c(x:8)
while (x < 7) {
x = x + 1
calc = c(x:8)
VecSeq = c(VecSeq, calc)
}
VecSeq # Your desired vector
You could do this:
library(purrr)
unlist(map(2:7, ~.x:8))
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
And a little function in base R:
funky_vec <- function(from, to) {
  unlist(sapply(from:(to - 1), `:`, to))
}
funky_vec(2,8)
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
This is made really easy with sequence (since R 4.0.0):
sequence(7:2, 2:7)
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
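To make the two arguments less magical: the first argument of `sequence` is the length of each run and `from` is where each run starts. A minimal sketch of how both could be derived from the endpoints (the names `first`, `last`, `starts` and `run_lengths` are made up for illustration; the `from` argument needs R 4.0.0 or later):

```r
first <- 2
last  <- 8

starts      <- first:(last - 1)   # 2 3 4 5 6 7: where each run begins
run_lengths <- last - starts + 1  # 7 6 5 4 3 2: how many values each run holds

result <- sequence(nvec = run_lengths, from = starts)

# Same vector built the long way, for comparison
reference <- unlist(lapply(starts, function(s) s:last))
all(result == reference)
```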
How can I generate the vector (1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9)?
I cannot see which parameters of seq() and rep() would produce it.
I read the help docs but failed to find a way.
You could try
(1:5) + rep(0:4,each=5)
#[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
NOTE: (1:5) and 0:4 can be replaced by seq(1,5) and seq(0,4)
Another one:
as.vector(outer(1:5,0:4,"+"))
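Both answers rely on the same idea, which the following sketch spells out: a base pattern plus a per-block offset. In the first form the shorter vector is recycled; in the second, `outer` builds the full 5 x 5 table of sums and `as.vector` reads it column by column. The variable names are illustrative only.

```r
pattern <- 1:5                  # repeated in every block of five
offsets <- rep(0:4, each = 5)   # 0 0 0 0 0 1 1 1 1 1 ... 4 4 4 4 4

# pattern (length 5) is recycled five times to match offsets (length 25)
via_recycling <- pattern + offsets

# m[i, j] is i + (j - 1); flattening goes down column 1, then column 2, ...
m <- outer(1:5, 0:4, "+")
via_outer <- as.vector(m)

all(via_recycling == via_outer)
```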
I have a numeric vector z as below:
> sort(z)
[1] 1 5 5 5 6 6 7 7 7 7 7 9 9
I would like to renumber its values sequentially so as to have
> z
[1] 1 2 2 2 3 3 4 4 4 4 4 5 5
I guess converting z to a factor and using it as an index should be the way.
You answered it yourself really:
as.integer(factor(sort(z)))
I know this has been accepted already but I decided to look inside factor() to see how it's done there. It more or less comes down to this:
x <- sort(z)
match(x, unique(x))
Which is an extra line I suppose but it should be faster if that matters.
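As a quick check, here are both versions applied to the values from the question. The unsorted input order below is made up; the sorted values match the question's `sort(z)`.

```r
# Same values as in the question, in an arbitrary (assumed) order
z <- c(7, 5, 9, 1, 7, 5, 6, 7, 9, 5, 7, 6, 7)
x <- sort(z)

via_factor <- as.integer(factor(x))   # integer codes of the sorted levels
via_match  <- match(x, unique(x))     # position of each value among the distinct ones

via_factor
# [1] 1 2 2 2 3 3 4 4 4 4 4 5 5
identical(via_factor, via_match)
```

The `match` version gives ranks only because `x` is already sorted, so `unique(x)` is in increasing order.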
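This should do the trick:
z = sort(sample(1:10, 100, replace = TRUE))
cumsum(diff(z) != 0) + 1
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[26] 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6
[51] 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
The comparison diff(z) != 0 matters: with a plain cumsum(diff(z)), a gap larger than 1 between consecutive values (such as 1 followed by 5 in the question) would advance the rank by more than 1. Note also that diff omits the first element of the series. So to compensate:
c(1, cumsum(diff(z) != 0) + 1)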
Alternative using rle:
z = sort(sample(1:10, 100, replace = TRUE))
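rle_result = rle(z)
# Number the runs 1, 2, 3, ... and repeat each number by its run length;
# repeating rle_result$values instead would only rebuild z itself
rep(seq_along(rle_result$lengths), rle_result$lengths)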
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[26] 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6
[51] 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
Or more compactly, with x being the sorted vector (rle(x)$l partially matches rle(x)$lengths):
rep(seq_along(rle(x)$l), rle(x)$l)
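The random samples above are very likely to contain every value from 1 to 10, which hides what happens when values are missing. A small sketch on the gappy values from the question, comparing two run-based formulations (counting value changes and numbering the runs found by `rle`):

```r
# Sorted values from the question; note the gaps (no 2, 3, 4 or 8)
x <- c(1, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 9, 9)

# Number each run of equal values, then repeat that number by the run length
runs    <- rle(x)
via_rle <- rep(seq_along(runs$lengths), runs$lengths)

# Count how many times the value has changed so far, starting from rank 1
via_changes <- c(1, cumsum(diff(x) != 0) + 1)

via_rle
# [1] 1 2 2 2 3 3 4 4 4 4 4 5 5
all(via_rle == via_changes)
```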
I have a dataframe of about 500,000 rows. One of its columns, say column A, contains positive integer values. Let there be another column B.
I now need to create a second dataframe with a number of rows equal to sum(dataframe$A). This is done.
A question of performance arises when I need to fill this new dataframe with data. I am trying to create a column A2 for this second frame as follows:
A2<-vector()
for (i in 1:nrow(dataframe)){
A2<-c(A2,rep(dataframe$B[i],dataframe$A[i]))
}
The external loop is obviously very slow for the large number of rows being processed. Any suggestions on how to achieve this task with faster processing?
Thanks for any responses.
You simply do not need the loop at all. rep is already vectorized.
A2 <- rep(dataframe$B, dataframe$A)
That should work. As a reproducible example, here is your approach using the built-in mtcars dataset.
x <- vector()
for(i in 1:nrow(mtcars)) {x <- c(x, rep(mtcars$cyl[i], mtcars$gear[i]))}
> x
[1] 6 6 6 6 6 6 6 6 4 4 4 4 6 6 6 8 8 8 6 6 6 8 8 8 4 4 4 4 4 4 4 4 6 6 6 6 6
[38] 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8
[75] 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 6 6 6 6 6 8 8
[112] 8 8 8 4 4 4 4
and vectorized, it is:
x2 <- rep(mtcars$cyl, mtcars$gear)
> x2
[1] 6 6 6 6 6 6 6 6 4 4 4 4 6 6 6 8 8 8 6 6 6 8 8 8 4 4 4 4 4 4 4 4 6 6 6 6 6
[38] 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8
[75] 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 8 8 8 8 8 6 6 6 6 6 8 8
[112] 8 8 8 4 4 4 4
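which will be far faster than the loop on large inputs, because the loop copies the ever-growing vector with c() on every iteration while rep does the whole job in one vectorized call.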
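For anyone who wants to measure the difference, here is a rough timing sketch on synthetic data. The data frame is made up (2,000 rows rather than 500,000, so the loop finishes quickly), and the actual timings will depend on the machine, so none are quoted here.

```r
set.seed(1)
n <- 2000  # kept small on purpose; the loop gets much slower as n grows

# Synthetic stand-in for the original dataframe (assumption)
dataframe <- data.frame(
  A = sample(1:5, n, replace = TRUE),  # positive integer repeat counts
  B = runif(n)                         # values to be repeated
)

# Loop version: grows the result with c() on every iteration
loop_time <- system.time({
  A2_loop <- vector()
  for (i in seq_len(nrow(dataframe))) {
    A2_loop <- c(A2_loop, rep(dataframe$B[i], dataframe$A[i]))
  }
})

# Vectorized version: a single call to rep()
vec_time <- system.time(A2_vec <- rep(dataframe$B, dataframe$A))

identical(A2_loop, A2_vec)          # both approaches give the same vector
length(A2_vec) == sum(dataframe$A)  # one element per requested repetition
loop_time["elapsed"]
vec_time["elapsed"]
```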