I have 2 objects:
A data frame with 3 variables:
v1 <- 1:10
v2 <- 11:20
v3 <- 21:30
df <- data.frame(v1,v2,v3)
A numeric vector with 3 elements:
nv <- c(6,11,28)
I would like to compare the first variable to the first number, the second variable to the second number and so on.
which(df$v1 > nv[1])
which(df$v2 > nv[2])
which(df$v3 > nv[3])
Of course in reality my data frame has a lot more variables so manually typing each variable is not an option.
I encounter these kinds of problems quite frequently. What kind of documentation would I need to read to be fluent in these matters?
One option is to compare objects of equal size. We can replicate each element of 'nv' by the number of rows of 'df' (rep(nv, each=nrow(df))) and compare the result with df, or index 'nv' with col(df), which produces the same expanded vector as rep.
which(df > nv[col(df)], arr.ind=TRUE)
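To see why the col() trick works, here is a minimal sketch using the question's df and nv. It checks that nv[col(df)] is the same vector as rep(nv, each = nrow(df)), then regroups the row/col matrix returned by which(..., arr.ind = TRUE) by column. The names expanded, hits and by_col are introduced here for illustration only.

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# col(df) is a 10 x 3 integer matrix holding each cell's column number,
# so indexing nv with it repeats each threshold once per row
expanded <- nv[col(df)]
same_as_rep <- identical(expanded, rep(nv, each = nrow(df)))

# One row per TRUE cell, with columns "row" and "col"
hits <- which(df > expanded, arr.ind = TRUE)

# Regroup the row indices by column to mirror the three manual which() calls
by_col <- split(hits[, "row"], hits[, "col"])
```

Each element of by_col should hold the same row indices as the corresponding which(df$vN > nv[N]) call from the question.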
If you need a logical matrix that holds the comparison of each column with the corresponding element of 'nv':
sweep(df, 2, nv, FUN='>')
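As a rough illustration of what the logical matrix is good for, the sketch below counts how many rows in each column exceed their threshold. The name cmp is hypothetical and colSums is just one possible follow-up step.

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# TRUE where a cell is greater than the threshold for its column
cmp <- sweep(df, 2, nv, FUN = ">")

# TRUE counts as 1, so column sums give the number of hits per variable
n_hits <- colSums(cmp)
```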
You could also use mapply:
mapply(FUN=function(x, y)which(x > y), x=df, y=nv)
#$v1
#[1] 7 8 9 10
#
#$v2
#[1] 2 3 4 5 6 7 8 9 10
#
#$v3
#[1] 9 10
I think these sorts of situations are tricky because normal looping solutions (e.g. the apply function) only loop through one object, but you need to loop both through df and nv simultaneously. One approach is to loop through the indices and to use them to grab the appropriate information from both df and nv. A convenient way to loop through indices is the sapply function:
sapply(seq_along(nv), function(x) which(df[,x] > nv[x]))
# [[1]]
# [1] 7 8 9 10
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 9 10
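If you would rather not index by position, a possible alternative is Map, which walks over df and nv in parallel and always returns a list (sapply may simplify to a matrix when all results happen to have the same length). This is only a sketch; the argument names column and threshold are made up for readability.

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# Map pairs the i-th column of df with the i-th element of nv
res <- Map(function(column, threshold) which(column > threshold), df, nv)

# The result keeps the column names of df
names(res)
```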
I have a dataframe like this:
a=c(rep(1,3), rep(2,2))
b=c(2,4,7,9,1)
df <- data.frame(a,b)
> df
a b
1 1 2
2 1 4
3 1 7
4 2 9
5 2 1
I want to create a list with as many elements as different values in column "a" (in this case "2") and store the values of column "b" in the list according to column "a". I am trying something like this:
lst <-list()
ff <-function(){lili[[df$a]] <- df$b}
apply(ff, df)
Which obviously is not working...But what I basically want to do is:
lst <- list(c(2,4,7), c(9,1))
but using apply over the rows of a large df to populate the list.
split(df$b, df$a)
$`1`
[1] 2 4 7
$`2`
[1] 9 1
This is extra nice because the list names will be the values of a by default.
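A small sketch of how those names can be used afterwards, assuming the df from the question. Note that the names are character strings, so elements are looked up with "1" and "2" rather than with the numbers 1 and 2. The variable group_sums is just an illustrative follow-up.

```r
df <- data.frame(a = c(rep(1, 3), rep(2, 2)), b = c(2, 4, 7, 9, 1))

lst <- split(df$b, df$a)

# Names are the distinct values of a, converted to character
group_names <- names(lst)

# Look up one group by its name
second_group <- lst[["2"]]

# Apply a function per group; the result is named by group
group_sums <- sapply(lst, sum)
```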
That said, I agree with alistaire's comment. This seems like an XY problem - there's a good chance that whatever you do next would be done easily by data.table or dplyr without creating this separate list.
Try this: lapply(unique(df$a),function(x) df$b[df$a==x])
Here is an option using unstack
unstack(df, b~a)
#$`1`
#[1] 2 4 7
#$`2`
#[1] 9 1
I have a vector of integers like this:
a <- c(2,3,4,1,2,1,3,5,6,3,2)
values<-c(1,2,3,4,5,6)
I want to list, for every unique value in my vector (the unique values being ordered), the positions of its occurrences. My desired output:
rep_indx<-data.frame(c(4,6),c(1,5,11),c(2,7,10),c(3),c(8),c(9))
split fits pretty well here, which returns a list of indexes for each unique value in a:
indList <- split(seq_along(a), a)
indList
# $`1`
# [1] 4 6
#
# $`2`
# [1] 1 5 11
#
# $`3`
# [1] 2 7 10
#
# $`4`
# [1] 3
#
# $`5`
# [1] 8
#
# $`6`
# [1] 9
And you can access the index by passing the value as a character, i.e.:
indList[["1"]]
# [1] 4 6
You can do this, using sapply. The ordering that you need is ensured by the sort function.
sapply(sort(unique(a)), function(x) which(a %in% x))
#### [[1]]
#### [1] 4 6
####
#### [[2]]
#### [1] 1 5 11
#### ...
It will result in a list, giving the indices of your repetitions. It can't be a data.frame because a data.frame needs to have columns of same lengths.
sort(unique(a)) is exactly your values vector.
NOTE: you can also use lapply to force the output to be a list. With sapply, you get a list except if by chance the number of replicates is always the same, then the output will be a matrix... so, your choice!
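To make the sapply caveat concrete, here is a minimal sketch. The vector b is a made-up example in which every value occurs exactly twice, so sapply simplifies the result to a matrix, while lapply keeps a list.

```r
a <- c(2, 3, 4, 1, 2, 1, 3, 5, 6, 3, 2)

# Group sizes differ, so sapply cannot simplify and returns a list
uneven <- sapply(sort(unique(a)), function(x) which(a == x))

# Hypothetical vector where each value occurs the same number of times
b <- c(1, 2, 1, 2)

# Equal group sizes: sapply simplifies to a 2 x 2 matrix (one column per value)
even <- sapply(sort(unique(b)), function(x) which(b == x))

# lapply never simplifies, so the output type does not depend on the data
always_list <- lapply(sort(unique(b)), function(x) which(b == x))
```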
Perhaps this also works
order(match(a, values))
#[1] 4 6 1 5 11 2 7 10 3 8 9
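Note that this returns one flat vector rather than one element per value. The sketch below, using the question's a and values, checks that it is the split() result flattened; the names flat and grouped are only for illustration.

```r
a <- c(2, 3, 4, 1, 2, 1, 3, 5, 6, 3, 2)
values <- c(1, 2, 3, 4, 5, 6)

# Positions of a, ordered by where each element falls in values
flat <- order(match(a, values))

# Same positions, but kept as one list element per value
grouped <- split(seq_along(a), factor(a, levels = values))

# Flattening the list gives back the order() result
same <- identical(flat, unlist(grouped, use.names = FALSE))
```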
You can use the lapply function to return a list with the indexes.
lapply(values, function (x) which(a == x))
A question from a relative n00b: I’d like to split a vector into three vectors of different lengths, with the values assigned to each vector at random. For example, I’d like to split the vector of length 12 below into vectors of length 2, 3, and 7.
I can get three equal sized vectors using this:
test<-1:12
split(test,sample(1:3))
Any suggestions on how to split test into vectors of 2,3, and 7 instead of three vectors of length 4?
You could use rep to create the indices for each group and then split based on that
split(1:12, rep(1:3, c(2, 3, 7)))
If you wanted the items to be randomly assigned so that it's not just the first 2 items in the first vector, the next 3 items in the second vector, ..., you could just add call to sample
split(1:12, sample(rep(1:3, c(2, 3, 7))))
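A quick way to convince yourself that the group sizes stay fixed while the membership is random is to check lengths() on the result. This is just a sketch; the seed value and the name sizes are arbitrary choices, not part of the original answer.

```r
set.seed(42)  # arbitrary seed, only to make the run repeatable

sizes <- c(2, 3, 7)

# Build the group labels (two 1s, three 2s, seven 3s), then shuffle them
labels <- sample(rep(seq_along(sizes), sizes))

groups <- split(1:12, labels)

# Group sizes are always 2, 3 and 7, whatever the shuffle
group_sizes <- lengths(groups)
```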
If you don't have the specific lengths (2,3,7) in mind but just don't want it to be equal length vectors every time then SimonO101's answer is the way to go.
How about using sample slightly differently...
set.seed(123)
test<-1:12
split( test , sample(3, 12 , replace = TRUE) )
#$`1`
#[1] 1 6
#$`2`
#[1] 3 7 9 10 12
#$`3`
#[1] 2 4 5 8 11
set.seed(1234)
test<-1:12
split( test , sample(3, 12 , replace = TRUE) )
#$`1`
#[1] 1 7 8
#$`2`
#[1] 2 3 4 6 9 10 12
#$`3`
#[1] 5 11
The first argument to sample is the number of groups to split the vector into, and the second is the number of elements in the vector. Each successive element is randomly assigned to one of the 3 groups, so the group sizes vary from run to run. For 4 groups, use split( test , sample(4, 12 , replace = TRUE) ).
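One thing to be aware of with this approach: because each element is assigned independently, a group can occasionally receive no elements at all, in which case split() on a plain integer vector would return fewer than 3 groups. A possible safeguard, sketched below, is to convert the labels to a factor with explicit levels so that empty groups are kept. The name grp is illustrative.

```r
test <- 1:12

# Random group label (1, 2 or 3) for each element
grp <- sample(3, length(test), replace = TRUE)

# Explicit levels make split() return all 3 groups, even empty ones
groups <- split(test, factor(grp, levels = 1:3))

n_groups <- length(groups)
n_total <- sum(lengths(groups))
```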
It is easier than you think. To split the vector in three new randomly chosen sets run the following code:
test <- 1:12
split(sample(test), 1:3)
Every time you run this code you get a new random distribution into three equally sized sets (perfect for k-fold cross-validation).
You get:
> split(sample(test), 1:3)
$`1`
[1] 5 8 7 3
$`2`
[1] 4 1 10 9
$`3`
[1] 2 11 12 6
> split(sample(test), 1:3)
$`1`
[1] 12 6 4 1
$`2`
[1] 3 8 7 5
$`3`
[1] 9 2 10 11
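This works because split() recycles the grouping vector 1:3 along the shuffled data. A sketch that spells the recycling out with rep_len, which may be easier to adapt to k folds; k and fold_id are names chosen here, not part of the answer above.

```r
test <- 1:12
k <- 3  # number of folds

# Explicit version of the recycled labels: 1 2 3 1 2 3 ...
fold_id <- rep_len(seq_len(k), length(test))

# Shuffle the data, then deal it out to the folds in turn
folds <- split(sample(test), fold_id)

fold_sizes <- lengths(folds)
```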
You could use an auxiliary vector to format the way you want to split your data. Example:
Data <- c(1,2,3,4,5,6)
Format <- c("X","Y","X","Y","Z","Z")
output <- split(Data,Format)
Will generate the output:
$X
[1] 1 3
$Y
[1] 2 4
$Z
[1] 5 6
I have two lists of equal length: one is a list of data frames, the other is a list of vectors, such that the length of each vector matches the number of rows in the corresponding data frame of the first list. I want to assign each vector from the second list as the first column of the corresponding data frame. It is probably easier to explain with the code below:
for (i in seq_along(data)){
data[[c(i, 1)]] = links[[i]]
}
Here, data is a list of data frames and links is a list of vectors. This code works fine, and speed-wise there is no particular need to avoid for loops, but I wonder whether there is another way to do the same thing without for.
Since data and links have the same length, and you are replacing one-for-one, Map() and/or mapply() would be a good choice. Using the data from the other answer,
data <- list(data.frame(a = 1:3, b = 4:6), data.frame(a = 10:14, b = 15:19))
links <- list(7:9, 20:24)
You can do
Map("[<-", data, 1, value = links)
# [[1]]
# a b
# 1 7 4
# 2 8 5
# 3 9 6
#
# [[2]]
# a b
# 1 20 15
# 2 21 16
# 3 22 17
# 4 23 18
# 5 24 19
Although only the R gods know how safe this is. It would be safer to use an anonymous function.
Map(function(x, y, z) { x[y] <- z; x }, data, 1, links)
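Keep in mind that Map returns a new list and leaves data untouched, so the result has to be assigned somewhere. The sketch below uses a two-argument variant of the anonymous function (d and v are illustrative names) and compares the result with the for loop from the question.

```r
data <- list(data.frame(a = 1:3, b = 4:6), data.frame(a = 10:14, b = 15:19))
links <- list(7:9, 20:24)

# Replace the first column of each data frame with the matching vector
updated <- Map(function(d, v) { d[[1]] <- v; d }, data, links)

# Reference result built with the explicit loop from the question
expected <- data
for (i in seq_along(expected)) {
  expected[[c(i, 1)]] <- links[[i]]
}

# Should report no differences between the two approaches
all.equal(updated, expected)

# The original list is unchanged unless you overwrite it: data <- updated
```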
Since your data is already separated into independent list components, there is no way to do this operation without some kind of loop, hidden or explicit. This is because it is impossible to operate on multiple independent list components in any kind of vectorized or monolithic way; the nature of a list is a sequence of (allowably) heterogeneous objects which cannot be accessed as a unit.
You can replace the for-loop with the hidden loop in lapply(), but there isn't much point in going to the trouble:
l1 <- list(data.frame(a=1:3,b=4:6), data.frame(a=10:14,b=15:19) );
l2 <- list(7:9, 20:24 );
invisible(lapply(seq_along(l1), function(i) l1[[i]][,1] <<- l2[[i]] ));
l1;
## [[1]]
## a b
## 1 7 4
## 2 8 5
## 3 9 6
##
## [[2]]
## a b
## 1 20 15
## 2 21 16
## 3 22 17
## 4 23 18
## 5 24 19
Alternatively, you can restructure your data, either temporarily or permanently, although this is somewhat outside the scope of your question. For example, you can temporarily rbind()/c() the inputs, replace the first column as a unit, and then split() the result back into the original list form. Two caveats: I still had to call sapply() once to get the lengths used for splitting the combined data.frame back into separate data.frames, and this only works if all data.frames have the same number and names of columns:
unname(split(transform(do.call(rbind,l1),a=do.call(c,l2)),rep(seq_along(l2),sapply(l2,length))));
## [[1]]
## a b
## 1 7 4
## 2 8 5
## 3 9 6
##
## [[2]]
## a b
## 4 20 15
## 5 21 16
## 6 22 17
## 7 23 18
## 8 24 19
As you can see, a for-loop is clearly more straightforward and sensible than these alternatives.