using which command in R to copy subset of an array

I want to use which in R to copy a segment of an array. However, it seems like which skips the repeated elements. Here is an example:
a <- c(1,2,3,4,1,2,2,3)
b <- c(1,2)
a <- a[which(a==b)]
a
[1] 1 2 1 2
I want to have an output like:
a
[1] 1 2 1 2 2
Any ideas?

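I think you want %in%. It returns a logical vector that is TRUE wherever a value of a is also in b. If you then index a with that vector, the result is those values of a that are also in b. Your version with a==b compares element by element and recycles b along a, which is why the last 2 is dropped.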
> a <- c(1,2,3,4,1,2,2,3)
> b <- c(1,2)
> a[a %in% b]
[1] 1 2 1 2 2
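To see why the original attempt drops the last 2, it may help to look at the two masks side by side. The following is a minimal sketch using the vectors from the question; the names eq_mask, in_mask, eq_result and in_result are only illustrative.

```r
a <- c(1, 2, 3, 4, 1, 2, 2, 3)
b <- c(1, 2)

# == recycles b to the length of a, so a is compared against 1,2,1,2,1,2,1,2
eq_mask <- a == b
# %in% asks, for each element of a, whether it occurs anywhere in b
in_mask <- a %in% b

eq_result <- a[which(eq_mask)]
in_result <- a[in_mask]
print(eq_result)
print(in_result)
```

The seventh element of a is 2, but the recycled b holds 1 at that position, so == reports FALSE there while %in% reports TRUE.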

Related

populating a list with data extracted from a dataframe

I have a dataframe like this:
a=c(rep(1,3), rep(2,2))
b=c(2,4,7,9,1)
df <- data.frame(a,b)
> df
a b
1 1 2
2 1 4
3 1 7
4 2 9
5 2 1
I want to create a list with as many elements as different values in column "a" (in this case "2") and store the values of column "b" in the list according to column "a". I am trying something like this:
lst <-list()
ff <-function(){lili[[df$a]] <- df$b}
apply(ff, df)
Which obviously is not working...But what I basically want to do is:
lst <- list(c(2,4,7), c(9,1))
but using apply over the rows of a large df to populate the list.
split(df$b, df$a)
$`1`
[1] 2 4 7
$`2`
[1] 9 1
This is extra nice because the list names will be the values of a by default.
That said, I agree with alistaire's comment. This seems like an XY problem - there's a good chance that whatever you do next would be done easily by data.table or dplyr without creating this separate list.
Try this: lapply(unique(df$a),function(x) df$b[df$a==x])
Here is an option using unstack
unstack(df, b~a)
#$`1`
#[1] 2 4 7
#$`2`
#[1] 9 1
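As a rough sketch of how the list from split might be used afterwards (assuming the df from the question; group_one and group_sums are illustrative names), the elements can be looked up by name and processed per group:

```r
df <- data.frame(a = c(rep(1, 3), rep(2, 2)), b = c(2, 4, 7, 9, 1))

lst <- split(df$b, df$a)

# Elements are named after the values of a, so they can be looked up by name
group_one <- lst[["1"]]
# sapply then runs the function once per group
group_sums <- sapply(lst, sum)
print(group_one)
print(group_sums)
```

Note that the names are character strings, so lst[["1"]] looks up the group where a is 1 by name rather than by position.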

How to combine two vectors with missing values?

I have two vectors of the same length and I'm trying to combine them such that they fill out each other's missing values. For example:
a=c("",1,2,"")
b=c(5,"","",6)
I'm looking for this output:
5 1 2 6
Thanks much
In this case, the normally numeric comparison via pmax also works:
as.numeric(pmax(a,b))
#[1] 5 1 2 6
This is because R will resort to alphanumeric sorting when max/min etc are applied to character data:
max(c("b","a"))
#[1] "b"
And:
as.numeric(paste(a,b))
[1] 5 1 2 6
Or:
a[a==""] <- b[b!=""]
as.numeric(a)
# [1] 5 1 2 6
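Or replace the empty strings with 0 and add the two vectors:
a[a == ""] <- 0
b[b == ""] <- 0
a <- as.numeric(a)
b <- as.numeric(b)
output <- a + b
output
# [1] 5 1 2 6
Or, starting again from the original character vectors:
as.numeric(ifelse(a != "", a, b))
# [1] 5 1 2 6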
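Several of the answers above rely on the blanks in a and b being exactly complementary. A sketch that makes the missingness explicit may be easier to reason about, assuming an empty string always means "missing"; the names a_num, b_num and combined are illustrative.

```r
a <- c("", 1, 2, "")
b <- c(5, "", "", 6)

# Mark the empty strings as missing, then convert to numeric
a_num <- as.numeric(replace(a, a == "", NA))
b_num <- as.numeric(replace(b, b == "", NA))

# Take a where it is present, otherwise fall back to b
combined <- ifelse(is.na(a_num), b_num, a_num)
print(combined)
```

If both vectors are missing at the same position, this version leaves an NA there instead of silently producing a wrong number.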

Assignment to the result of a function changes variable

Looking through the ave function, I found a remarkable line:
split(x, g) <- lapply(split(x, g), FUN) # From ave
Interestingly, this line changes the value of x, which I found unexpected. I expected that split(x,g) would result in a list, which could be assigned to, but discarded afterward. My question is, why does the value of x change?
Another example may explain better:
a <- data.frame(id=c(1,1,2,2), value=c(4,5,7,6))
# id value
# 1 1 4
# 2 1 5
# 3 2 7
# 4 2 6
split(a,a$id) # Split a row-wise by id into a list of size 2
# $`1`
# id value
# 1 1 4
# 2 1 5
# $`2`
# id value
# 3 2 7
# 4 2 6
# Find the row with highest value for each id
lapply(split(a,a$id),function(x) x[which.max(x$value),])
# $`1`
# id value
# 2 1 5
# $`2`
# id value
# 3 2 7
# Assigning to the split changes the data.frame a!
split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),])
a
# id value
# 1 1 5
# 2 1 5
# 3 2 7
# 4 2 7
Not only has a changed, but it changed to a value that does not look like the right hand side of the assignment! Even if assigning to split(a,a$id) somehow changes a (which I don't understand), why does it result in a data.frame instead of a list?
Note that I understand that there are better ways to accomplish this task. My question is why does split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),]) change a?
The help page for split says in its header: "The replacement forms replace values corresponding to such a division." So it really should not be unexpected, although I admit it is not widely used. I do not understand how your example illustrates that the assigned values "do not look like the RHS of the assignment!". The max values are assigned to the 'value' lists within categories defined by the second argument factor.
(I do thank you for the question. I had not realized that split<- was at the core of ave. I guess it is more widely used than I realized, since I think ave is a wonderfully useful function.)
Just after definition of a, perform split(a, a$id)=1, the result would be:
> a
id value
1 1 1
2 1 1
3 1 1
4 1 1
The key here is that split<- actually modifies the object on the LHS using the RHS values.
Here's an example:
> x <- c(1,2,3);
> split(x,x==2)
$`FALSE`
[1] 1 3
$`TRUE`
[1] 2
> split(x,x==2) <- split(c(10,20,30),c(10,20,30)==20)
> x
[1] 10 20 30
Note the line where I re-assign split(x,x==2) <- . This actually reassigns x.
As the comments below have stated, you can look up the definition of split<- like so
> `split<-.default`
function (x, f, drop = FALSE, ..., value)
{
    ix <- split(seq_along(x), f, drop = drop, ...)
    n <- length(value)
    j <- 0
    for (i in ix) {
        j <- j%%n + 1
        x[i] <- value[[j]]
    }
    x
}
<bytecode: 0x1e18ef8>
<environment: namespace:base>
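The reason x changes is that R treats an assignment of the form f(x, ...) <- value as a call to the replacement function `f<-`, whose result is assigned back to x. The sketch below illustrates this with split and with a made-up replacement function; `second<-` is hypothetical and only defined here for illustration.

```r
x <- c(1, 2, 3)
g <- x == 2
value <- list(`FALSE` = c(10, 30), `TRUE` = 20)

# Replacement form, as used inside ave
x1 <- x
split(x1, g) <- value

# Roughly what R evaluates for the line above
x2 <- `split<-`(x, g, value = value)

# A hypothetical replacement function that overwrites the second element
`second<-` <- function(x, value) {
  x[2] <- value
  x
}
y <- c(1, 2, 3)
second(y) <- 99

print(x1)
print(x2)
print(y)
```

Because the replacement function returns the modified x, the result keeps the type of x (a vector here, a data.frame in the question) rather than becoming a list.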

R: How can I sum across variables, within cases, while counting NA as zero

Fake data for illustration:
df <- data.frame(a=c(1,2,3,4,5), b=c(2,2,2,2,NA),
c=c(NA,2,3,4,5))
This would get me the answer I want IF it weren't for the NA values:
df$count <- with(df, (a==1) + (b==2) + (c==3))
Also, would there be an even more elegant way if I was only interested in, e.g. variables==2?
df$count <- with(df, (a==2) + (b==2) + (c==2))
Many thanks!
The following works for your specific example, but I have a suspicion that your real use case is more complicated:
df$count <- apply(df,1,function(x){sum(x == 1:3,na.rm = TRUE)})
> df
a b c count
1 1 2 NA 2
2 2 2 2 1
3 3 2 3 2
4 4 2 4 1
5 5 NA 5 0
but this general approach should work. Note that once count has been added, df has four columns, so restrict the comparison to the original ones. For instance, your second example would be something like this:
df$count <- apply(df[c("a","b","c")],1,function(x){sum(x == 2,na.rm = TRUE)})
or more generally you could allow yourself to pass in a variable for the comparison:
df$count <- apply(df[c("a","b","c")],1,function(x,compare){sum(x == compare,na.rm = TRUE)},compare = 1:3)
Another way is to subtract your target vector from each row of your data.frame, negate and then do rowSums with na.rm=TRUE:
target <- 1:3
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 2 1 2 1 0
target <- rep(2,3)
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 1 3 1 1 0
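A comparison can also be applied to the data frame directly, which avoids both apply and the subtraction trick. This is a sketch under the assumption that only columns a, b and c should be counted; vars, count_two and count_each are illustrative names.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(2, 2, 2, 2, NA),
                 c = c(NA, 2, 3, 4, 5))
vars <- c("a", "b", "c")  # compare only these columns, not a later 'count'

# Single target: comparing a data frame with a scalar yields a logical matrix
count_two <- rowSums(df[vars] == 2, na.rm = TRUE)

# One target per column: sweep() compares each column with its own target
target <- 1:3
count_each <- rowSums(sweep(as.matrix(df[vars]), 2, target, "=="), na.rm = TRUE)

print(count_two)
print(count_each)
```

With na.rm = TRUE the NA comparisons are simply dropped from the sum, which has the same effect as counting them as zero.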

R data frame select by global variable

I'm not sure how to do this without getting an error. Here is a simplified example of my problem.
Say I have this data frame DF
a b c d
1 2 3 4
2 3 4 5
3 4 5 6
Then I have a variable
x <- min(c(1,2,3))
Now I want to do the following
y <- DF[a == x]
But when I try to refer to some variable like "x" I get an error because R is looking for a column "x" in my data frame. I get the "undefined columns selected" error
How can I do what I am trying to do in R?
You may benefit from reading an Introduction to R, especially on matrices, data.frames and indexing. Your a is a column of a data.frame, your x is a scalar. The comparison you have there does not work.
Maybe you meant
R> DF$a == min(c(1,2,3))
[1] TRUE FALSE FALSE
R> DF[,"a"] == min(c(1,2,3))
[1] TRUE FALSE FALSE
R>
which tells you that the first row fits but not the other two. Wrapping this in which() gives you indices instead.
I think this is what you're looking for:
> x <- min(DF$a)
> DF[DF$a == x,]
a b c d
1 1 2 3 4
An easier way (avoiding the 'x' variable) would be this:
> DF[which.min(DF$a),]
a b c d
1 1 2 3 4
or this:
> subset(DF, a==min(a))
a b c d
1 1 2 3 4
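For completeness, here is a self-contained sketch that rebuilds the DF from the question and filters it with a variable defined outside the data frame. The construction of DF is an assumption based on the printed table. With a single index, DF[i] selects columns, so the row condition has to go before a comma.

```r
DF <- data.frame(a = 1:3, b = 2:4, c = 3:5, d = 4:6)
x <- min(c(1, 2, 3))

# Row filter: logical vector before the comma, empty column slot after it
y <- DF[DF$a == x, ]
# which() turns the logical vector into the matching row numbers
rows <- which(DF$a == x)

print(y)
print(rows)
```

Writing DF$a rather than a bare a also makes sure the comparison uses the column and not some unrelated object named a in the workspace.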
