subsetting a vector/data frame in R yields different results

Recently, I was asked about subsetting a data frame in R. My colleague had this line of code:
dd2 <- subset(dd, tret == c("T1", "T2", "T3", "T4"))
which returns only a quarter of the expected rows. The standard form
dd2 <- subset(dd, tret == "T1" | tret == "T2" | tret == "T3" | tret == "T4")
returns 960 rows, whereas the first line returns only 240.
The same thing happens with vectors. For instance,
x <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
y <- x[x == 1 | x == 2]
gives a different vector from
y <- x[x == c(1,2)]
Any insight into the difference? Thank you.

The issue is vector recycling: when == compares two vectors of different lengths, R repeats the shorter one until it matches the length of the longer one and then compares element by element.
x == 1:2
#[1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
where
x
#[1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
and the comparison works in the following way
rep(1:2, length.out = length(x))
#[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
In the above example, 1 is compared with the first element of x, 2 with the second, 1 again with the third, 2 with the fourth, and so on until the end of x. To test whether each element of x matches any value in a set, use %in% instead:
identical(x[x == 1 | x == 2], x[x %in% 1:2])
#[1] TRUE
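To make the difference concrete, here is a small sketch using the x from the question. It uses only base R; the names by_eq and by_in are made up for illustration.

```r
x <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)

# == recycles c(1, 2) along x, so an element is kept only when the
# recycled value at its position happens to equal it
by_eq <- x[x == c(1, 2)]

# %in% asks, for each element of x, whether it appears anywhere in c(1, 2)
by_in <- x[x %in% c(1, 2)]

by_eq
#[1] 1 1 2 2
by_in
#[1] 1 1 1 1 2 2 2 2
```

Because 16 is a multiple of 2, R recycles silently here; it only warns when the longer length is not a multiple of the shorter one.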

Related

Flag based on multiple conditions

This is my initial dataset:
x <- c("a","a","b","b","c","c","d","d")
y <- c("a","a","a","b","c","c", "d", "d")
z <- c(5,1,2,6,1,1,5,6)
df <- data.frame(x,y,z)
I am trying to create a column in the data frame that flags rows meeting the following condition:
The row has the same value in the "x" and "y" columns, and at least one row of the dataset with that "x" and "y" has a "z" value >= 5.
With the example provided, the output should be:
  x y z  flag
1 a a 5  TRUE
2 a a 1  TRUE
3 b a 2 FALSE
4 b b 6  TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5  TRUE
8 d d 6  TRUE
Thank you!

I use the data.table package for all my aggregations. With it, I would do the following:
library(data.table)
dt <- as.data.table(df)
# by = .(x, y) groups the rows by x and y.
# Within each group, flag is TRUE when
# 1. the maximum z value is >= 5, and
# 2. the group has more than one row (.N is data.table's row count for the group).
# := assigns the result back into the original data.table by reference.
# Note: (b, b) has a single row, so .N > 1 flags row 4 as FALSE, unlike the
# expected output in the question; dropping .N > 1 from the second form matches it.
dt[, flag := max(z) >= 5 & .N > 1, by = .(x, y)]
# Does x need to equal y? If so, use this instead
dt[, flag := max(z) >= 5 & .N > 1 & x == y, by = .(x, y)]
# view the result
dt[]
# convert back to a data.frame
df <- as.data.frame(dt)
df
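For comparison, here is a base R sketch that reproduces the expected output from the question. It assumes the intended rule is "x equals y, and some row with that (x, y) pair has z >= 5", which is what the expected output suggests rather than something the question states outright. The names key and group_max are made up for illustration.

```r
df <- data.frame(
  x = c("a", "a", "b", "b", "c", "c", "d", "d"),
  y = c("a", "a", "a", "b", "c", "c", "d", "d"),
  z = c(5, 1, 2, 6, 1, 1, 5, 6),
  stringsAsFactors = FALSE
)

# One key per (x, y) combination that actually occurs in the data
key <- paste(df$x, df$y)

# Largest z within each (x, y) group, repeated for every row of the group
group_max <- ave(df$z, key, FUN = max)

# Flag rows where x equals y and the group contains a z of at least 5
df$flag <- df$x == df$y & group_max >= 5
df$flag
#[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
```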

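You can try the code below. Note that it tests z >= 5 row by row rather than per (x, y) group, so row 2 is flagged FALSE, unlike the requested output.
> within(df, flag <- x==y & z>=5)
  x y z  flag
1 a a 5  TRUE
2 a a 1 FALSE
3 b a 2 FALSE
4 b b 6  TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5  TRUE
8 d d 6  TRUE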

Changing values of a column based on whether another column satisfies a criterion

I want to subtract 1 from the values of column A if column B is <= 20.
A = c(1,2,3,4,5)
B = c(10,20,30,40,50)
df = data.frame(A,B)
Desired output:
  A  B
1 0 10
2 1 20
3 3 30
4 4 40
5 5 50
My data set is very large, so I would prefer not to use a loop. Is there a computationally efficient method in R?

You can do
df$A[df$B <= 20] <- df$A[df$B <= 20] - 1
# A B
#1 0 10
#2 1 20
#3 3 30
#4 4 40
#5 5 50
We can break this down step by step to understand how it works.
First we check which numbers in B are less than or equal to 20, which gives us a logical vector
df$B <= 20
#[1] TRUE TRUE FALSE FALSE FALSE
Using that logical vector we can select the numbers in A
df$A[df$B <= 20]
#[1] 1 2
Subtract 1 from those numbers
df$A[df$B <= 20] - 1
#[1] 0 1
and replace these values for the same indices in A.
With dplyr we can also use case_when
library(dplyr)
df %>%
  mutate(A = case_when(B <= 20 ~ A - 1,
                       TRUE ~ A))

Another possibility:
df$A <- ifelse(df$B <= 20, df$A - 1, df$A)

And here is a data.table solution:
library(data.table)
setDT(df)
df[B <= 20, A := A - 1]
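
As a further sketch in base R, the logical index can be computed once and reused, or the condition can be used directly in arithmetic. The names idx, by_index and by_arith are made up for illustration, and the example assumes B contains no missing values.

```r
df <- data.frame(A = c(1, 2, 3, 4, 5), B = c(10, 20, 30, 40, 50))

# Compute the logical index once and reuse it on both sides
idx <- df$B <= 20
by_index <- df$A
by_index[idx] <- by_index[idx] - 1

# TRUE/FALSE are coerced to 1/0 in arithmetic, so this subtracts 1
# exactly where the condition holds
by_arith <- df$A - (df$B <= 20)

by_index
#[1] 0 1 3 4 5
identical(by_index, by_arith)
#[1] TRUE
```

Both forms are vectorized, so neither needs an explicit loop.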

Logical operators: AND acting like OR

I'm having a hard time understanding how R treats the AND and OR operators when I use filter from dplyr.
Here's an example to illustrate:
library(dplyr)
xy <- data.frame(x=1:6, y=c("a", "b"), z= c(rep("d",3), rep("g",3)))
> xy
x y z
1 1 a d
2 2 b d
3 3 a d
4 4 b g
5 5 a g
6 6 b g
Using filter, I want to eliminate all rows where x == 1 and z == "d". This led me to believe I should use the AND operator, &:
> filter(xy, x != 1 & z != "d")
  x y z
1 4 b g
2 5 a g
3 6 b g
But this removes every row that has either x == 1 or z == "d". What's more confusing is that the OR operator, |, gives the desired result:
> filter(xy, x != 1 | z != "d")
  x y z
1 2 b d
2 3 a d
3 4 b g
4 5 a g
5 6 b g
This also works, but it is less convenient when stringing together == and != in the same condition:
> filter(xy, !(x == 1 & z == "d"))
  x y z
1 2 b d
2 3 a d
3 4 b g
4 5 a g
5 6 b g
Can someone explain what I'm missing?

This is a question of Boolean algebra (De Morgan's laws). The logical expression !(x == 1 & z == "d") is equivalent to x != 1 | z != "d", just as -(x + y) is equivalent to -x - y. When you eliminate the parentheses, every == becomes != and every & becomes | (and vice versa). This means that
!(x == 1 & z == "d")
is NOT the same as
x != 1 & z != "d"
but rather
x != 1 | z != "d"
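
The equivalence can be checked directly on the xy data from the question. This is a minimal base R sketch; the names negated, with_or and with_and are made up for illustration.

```r
xy <- data.frame(x = 1:6, y = c("a", "b"), z = c(rep("d", 3), rep("g", 3)))

negated  <- !(xy$x == 1 & xy$z == "d")  # "not (x is 1 and z is d)"
with_or  <- xy$x != 1 | xy$z != "d"     # De Morgan equivalent
with_and <- xy$x != 1 & xy$z != "d"     # a different, stricter condition

identical(negated, with_or)
#[1] TRUE
identical(negated, with_and)
#[1] FALSE
sum(negated)   # rows kept by the intended filter
#[1] 5
sum(with_and)  # rows kept by the & version
#[1] 3
```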

A couple of tips that won't fit in a comment:
If you're having trouble understanding how something works in R, I'd highly recommend running each individual piece of the operation. With dplyr, it's easy to keep track of intermediate steps and display them all:
mutate(xy,
  A = x != 1,
  B = z != 'd',
  A_and_B = A & B,
  A_or_B = A | B
)
#   x y z     A     B A_and_B A_or_B
# 1 1 a d FALSE FALSE   FALSE  FALSE
# 2 2 b d  TRUE FALSE   FALSE   TRUE
# 3 3 a d  TRUE FALSE   FALSE   TRUE
# 4 4 b g  TRUE  TRUE    TRUE   TRUE
# 5 5 a g  TRUE  TRUE    TRUE   TRUE
# 6 6 b g  TRUE  TRUE    TRUE   TRUE
I think that if you look at the definition of each column, its values will make perfect sense. Then, after going one step at a time, hopefully the results will make sense too.
As others have stated in various ways, you're setting yourself up for a hard time from the start with
"Using filter I want to eliminate all rows where x==1 and z==d"
Don't think of filter as eliminating rows; think of it as keeping rows. If you mentally invert your goal to "keep all rows where...", you'll set yourself up for a more direct translation of words to code.

The result of filter is the rows where the specified condition is true.
Take for example x != 1 & z != "d". What are the rows where this condition is true? The output you got. The other rows were removed, because the condition was not true for those rows.
In this example, your real intention was to eliminate rows where x == 1 and z == "d".
In other words, you want to keep the rows where the condition x == 1 and z == "d" is false.
Putting that into code gives filter(xy, !(x == 1 & z == "d")).
It's ironic that this looks much like your intention, and very different from what you actually tried to write.
If you forget this logic of filter, you can remind yourself with a simpler experiment: filter(xy, TRUE) returns all rows, and filter(xy, FALSE) returns none.

# x != 1 & z != "d" evaluates to a single TRUE/FALSE vector which subsets the data
# note how & and | behave in isolation:
TRUE & TRUE # T AND T = T
## [1] TRUE
TRUE & FALSE # T AND F = F
## [1] FALSE
FALSE & FALSE # F AND F = F
## [1] FALSE
TRUE | TRUE # T OR T = T
## [1] TRUE
TRUE | FALSE # T OR F = T
## [1] TRUE
FALSE | FALSE # F OR F = F
## [1] FALSE
# Apply over vectors
(x1 <- xy$x != 1)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
(z1 <- xy$z != "d")
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
x1 & z1 # you get last 3 rows
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
x1 | z1 # you get all but 1st row (which contains 1 and d)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE

Nested lapply() in a list?

I have a list l, which has the following features:
It has 3 elements
Each element is a numeric vector of length 5
Each vector contains numbers from 1 to 5
l = list(a = c(2, 3, 1, 5, 1), b = c(4, 3, 3, 5, 2), c = c(5, 1, 3, 2, 4))
I want to do two things:
First
I want to know how many times each number occurs in the entire list and I want each result in a vector (or any form that can allow me to perform computations with the results later):
Code 1:
> a <- table(sapply(l, "["))
> x <- as.data.frame(a)
> x
Var1 Freq
1 1 3
2 2 3
3 3 4
4 4 2
5 5 3
Is there any way to do it without using the table() function? I would like to do it "manually". My attempt is below.
Code 2: (I know this is not very efficient!)
x <- data.frame(
"1" <- sum(sapply(l, "[")) == 1
"2" <- sum(sapply(l, "[")) == 2
"3" <- sum(sapply(l, "[")) == 3
"4" <- sum(sapply(l, "[")) == 4
"5" <- sum(sapply(l, "[")) == 5)
I tried the following, but it did not work, and I did not understand the result.
> sapply(l, "[") == 1:5
a b c
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE TRUE TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
> sum(sapply(l, "[") == 1:5)
[1] 2
Second
Now, I would like to get the number of times each number appears in the list, but now in each element $a, $b and $c. I thought about using the lapply() but I don't know how exactly. Following is what I tried, but it is inefficient just like Code 2:
lapply(l, function(x) sum(x == 1))
lapply(l, function(x) sum(x == 2))
lapply(l, function(x) sum(x == 3))
lapply(l, function(x) sum(x == 4))
lapply(l, function(x) sum(x == 5))
What I get with these 5 lines of code are 5 lists of 3 elements each containing a single numeric value. For example, the second line of code tells me how many times number 2 appears in each element of l.
Code 3:
> lapply(l, function(x) sum(x == 2))
$a
[1] 1
$b
[1] 1
$c
[1] 1
What I would like to obtain is a list with three elements containing all the information I am looking for.
Please, use the references "Code 1", "Code 2" and "Code 3" in your answers. Thank you very much.

Just use table(unlist(l)) for the first part and data.frame(lapply(l, tabulate)) for the second.
> table(unlist(l))
1 2 3 4 5
3 3 4 2 3
> data.frame(lapply(l, tabulate))
  a b c
1 2 0 1
2 1 1 1
3 1 2 1
4 0 1 1
5 1 1 1
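
One caveat with tabulate is that its result is only as long as the largest value in each vector, so elements with different maxima would give counts of different lengths. A hedged sketch that fixes the length with the nbins argument, assuming the values of interest are the integers 1 to 5 (per_element is a made-up name):

```r
l <- list(a = c(2, 3, 1, 5, 1), b = c(4, 3, 3, 5, 2), c = c(5, 1, 3, 2, 4))

# Counts of 1..5 within each element; nbins = 5 fixes every result at length 5
per_element <- vapply(l, tabulate, FUN.VALUE = integer(5), nbins = 5)
per_element
#      a b c
# [1,] 2 0 1
# [2,] 1 1 1
# [3,] 1 2 1
# [4,] 0 1 1
# [5,] 1 1 1

# Counts over the whole list (Code 1) are the row sums
rowSums(per_element)
#[1] 3 3 4 2 3
```

The original attempt, sapply(l, "[") == 1:5, summed to 2 because 1:5 was recycled down each column and compared position by position, rather than each value being counted.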

For code 1/2, you could use sapply to obtain the counts for whichever values you wanted:
l = list(a = c(2, 3, 1, 5, 1), b = c(4, 3, 3, 5, 2), c = c(5, 1, 3, 2, 4))
data.frame(number = 1:5,
freq = sapply(1:5, function(x) sum(unlist(l) == x)))
# number freq
# 1 1 3
# 2 2 3
# 3 3 4
# 4 4 2
# 5 5 3
For code 3, if you wanted to get the counts for lists a, b, and c, you could just apply your frequency function to each element of the list with the lapply function:
freqs = lapply(l, function(y) sapply(1:5, function(x) sum(unlist(y) == x)))
data.frame(number = 1:5, a=freqs$a, b=freqs$b, c=freqs$c)
# number a b c
# 1 1 2 0 1
# 2 2 1 1 1
# 3 3 1 2 1
# 4 4 0 1 1
# 5 5 1 1 1

Here is another example with nested lapply().
Data:
list = NULL
list[[1]] = c(1:5)
list[[2]] = c(1:5)+3
list[[2]] = c(1:5)+4 # note: same index as the line above, so this overwrites list[[2]]
list[[3]] = c(1:5)-1
list[[4]] = c(1:5)*3
list2 = NULL
list2[[1]] = rep(1,5)
list2[[2]] = rep(2,5)
list2[[3]] = rep(0,5)
The call below subtracts each element of list from every element of list2, returning one inner list per element of list:
lapply(list, function(d){ lapply(list2, function(a,b) {a-b}, b=d)})

find indices of non zero elements in matrix

I want to get the indices of the nonzero elements in a matrix. For example,
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2)
should give me something like this
row col
  1   1
  1   3
  2   1
  2   3
Can anyone please tell me how to do that?

which(X != 0, arr.ind = TRUE)
     row col
[1,]   1   1
[2,]   2   1
[3,]   1   3
[4,]   2   3
With arr.ind = TRUE and X an array, the result is a matrix in which each row gives the (row, col) indices of one matching element of X.

There's an error in your example code - True is not defined, use TRUE.
X <-matrix(c(1,0,3,4,0,5), byrow = TRUE, nrow = 2)
which() should do it:
which(!X == 0)
X[ which(!X == 0)]
#[1] 1 4 3 5
to get the row/col indices:
row(X)[which(!X == 0)]
col(X)[which(!X == 0)]
to use those to index back into the matrix:
X[cbind(row(X)[which(!X == 0)], col(X)[which(!X == 0)])]
#[1] 1 4 3 5
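
which() reports matches in column-major order, whereas the question lists them row by row. A small sketch of reordering the index matrix, assuming that row-wise order is what is wanted (idx and idx_by_row are made-up names):

```r
X <- matrix(c(1, 0, 3, 4, 0, 5), byrow = TRUE, nrow = 2)

idx <- which(X != 0, arr.ind = TRUE)

# Reorder by row, then by column within each row
idx_by_row <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]
idx_by_row
#      row col
# [1,]   1   1
# [2,]   1   3
# [3,]   2   1
# [4,]   2   3

# A two-column index matrix selects the matching elements directly
X[idx_by_row]
#[1] 1 3 4 5
```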
