Select groups where all values are larger than x [duplicate] - r

I have a data frame which is grouped by 'subject'. I want to keep only those 'subject' groups where all 'values' are above a certain threshold, here values > 0.5.
Example df:
df1 <- data.frame(subject = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
values = c(.4, .6, .6, .6, .6, .6, .6, .6, .6, .4))
df1
subject values
1 1 0.4
2 2 0.6
3 3 0.6
4 4 0.6
5 5 0.6
6 1 0.6
7 2 0.6
8 3 0.6
9 4 0.6
10 5 0.4
Desired output:
> df1
subject values
1 2 0.6
2 3 0.6
3 4 0.6
4 2 0.6
5 3 0.6
6 4 0.6

You can use all() inside a grouped filter using dplyr:
library(dplyr)
df1 %>%
group_by(subject) %>%
filter(all(values > .5)) %>%
ungroup()
Output:
# A tibble: 6 x 2
subject values
<dbl> <dbl>
1 2 0.6
2 3 0.6
3 4 0.6
4 2 0.6
5 3 0.6
6 4 0.6

Using min in ave.
df1[with(df1, ave(values, subject, FUN=min)) > .5, ]
# subject values
# 2 2 0.6
# 3 3 0.6
# 4 4 0.6
# 7 2 0.6
# 8 3 0.6
# 9 4 0.6
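To see why the ave() one-liner works, it can help to look at the intermediate vector it builds. This is a minimal sketch on the question's df1; the names group_min, keep and result are only illustrative.

```r
df1 <- data.frame(subject = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                  values = c(.4, .6, .6, .6, .6, .6, .6, .6, .6, .4))

# ave() returns a vector as long as the input, holding each row's group minimum
group_min <- ave(df1$values, df1$subject, FUN = min)

# a group passes only if its smallest value is above the threshold,
# which is the same as saying all of its values are above it
keep <- group_min > 0.5
result <- df1[keep, ]
result
```

Because the minimum is repeated for every row of its group, a single comparison selects or drops whole groups at once.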

A data.table approach:
library(data.table)
setDT(df1)[, .SD[all(values > 0.5)], by = .(subject)][]
# subject values
# 1: 2 0.6
# 2: 2 0.6
# 3: 3 0.6
# 4: 3 0.6
# 5: 4 0.6
# 6: 4 0.6

Related

R: Compare elements from a column based upon other column conditions?

I would like to create a new df based on whether each subject's value in the second or third condition is greater than its value in the first condition.
Example df:
df1 <- data.frame(subject = rep(1:5, 3),
condition = rep(c("first", "second", "third"), each = 5),
values = c(.4, .4, .4, .4, .4, .6, .6, .6, .6, .4, .6, .6, .6, .4, .4))
> df1
subject condition values
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
5 5 first 0.4
6 1 second 0.6
7 2 second 0.6
8 3 second 0.6
9 4 second 0.6
10 5 second 0.4
11 1 third 0.6
12 2 third 0.6
13 3 third 0.6
14 4 third 0.4
15 5 third 0.4
The resulting df would be this:
> df2
subject condition values
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
6 1 second 0.6
7 2 second 0.6
8 3 second 0.6
9 4 second 0.6
11 1 third 0.6
12 2 third 0.6
13 3 third 0.6
14 4 third 0.4
Here, subject #5 is dropped because it does not meet the criterion: neither its second nor its third value is greater than its first.
Thanks.
We may group by 'subject' and keep the groups where any of the second or third 'values' is greater than the first:
library(dplyr)
df1 %>%
group_by(subject) %>%
filter(any(values[2:3] > first(values))) %>%
ungroup
Output:
# A tibble: 12 × 3
subject condition values
<int> <chr> <dbl>
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
5 1 second 0.6
6 2 second 0.6
7 3 second 0.6
8 4 second 0.6
9 1 third 0.6
10 2 third 0.6
11 3 third 0.6
12 4 third 0.4
Using ave.
df1[with(df1, ave(values, subject, FUN=\(x) any(x[2:3] > x[1])) == 1), ]
# subject condition values
# 1 1 first 0.4
# 2 2 first 0.4
# 3 3 first 0.4
# 4 4 first 0.4
# 6 1 second 0.6
# 7 2 second 0.6
# 8 3 second 0.6
# 9 4 second 0.6
# 11 1 third 0.6
# 12 2 third 0.6
# 13 3 third 0.6
# 14 4 third 0.4
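Both answers index by position (values[2:3] against the first value), which relies on each subject's rows being ordered first, second, third. If that ordering is not guaranteed, a sketch that looks up the baseline by the condition label could look like this; first_vals, baseline, exceeds and keep_subjects are illustrative names, and it assumes every subject has exactly one "first" row.

```r
df1 <- data.frame(subject = rep(1:5, 3),
                  condition = rep(c("first", "second", "third"), each = 5),
                  values = c(.4, .4, .4, .4, .4, .6, .6, .6, .6, .4, .6, .6, .6, .4, .4))

# baseline value per subject, named by subject id
first_rows <- df1[df1$condition == "first", ]
first_vals <- setNames(first_rows$values, first_rows$subject)

# each row's baseline, looked up by its subject id
baseline <- first_vals[as.character(df1$subject)]

# rows from the other conditions that exceed their subject's baseline
exceeds <- df1$condition != "first" & df1$values > baseline

# keep every row of any subject with at least one such row
keep_subjects <- unique(df1$subject[exceeds])
df2 <- df1[df1$subject %in% keep_subjects, ]
df2
```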

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of big data.frame based on multiple conditions.
My data frame looks like this.
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are bigger than 0.1 and smaller than 1. Rows from the other time points should stay as they are.
I want my data.frame to look like this.
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or if you have too much free time and you decided to avoid dplyr:
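# flag the time-1 rows whose value lies strictly between 0.1 and 1
ind <- with(data, time == 1 & value > 0.1 & value < 1)
ind <- ifelse(data$time == 1 & data$value > 0.1 & data$value < 1, TRUE, FALSE)
# the two lines above do the same thing
data$ind <- ind
# drop the time-1 rows that were not flagged, then remove the helper column
data <- data[!(data$time == 1 & !data$ind), ]
data$ind <- NULL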
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results in a row wise manner.
rbind(
subset(data, time == 1 & value > 0.1 & value < 1),
subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
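The same condition can also be written as one logical index in base R, which keeps the original row order without binding two subsets together. A minimal sketch on the question's data; the name kept is illustrative.

```r
data <- data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                   time  = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                   value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))

# keep a row if it is not at time 1, or if its value lies strictly between 0.1 and 1
kept <- data[data$time != 1 | (data$value > 0.1 & data$value < 1), ]
kept
```

This is the dplyr condition from above with the negation pushed inside (De Morgan's law): "not (time 1 and out of range)" equals "not time 1, or in range".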

making a dataset of multiple x values in one y value

I have a correlation dataset that looks like this:
V1 V2 R2
1 2 0.4
1 3 0.5
3 5 0.3
I want to convert it to a two-column dataset in which every integer from V1 to V2 becomes an x value (column V) paired with that row's y value (column R2), for scatter plotting. It would look like this:
V R2
1 0.4
2 0.4
1 0.5
2 0.5
3 0.5
3 0.3
4 0.3
5 0.3
How can I do this in R?
In the tidyverse, you can make a list column of the required vectors with purrr::map2 to iterate seq over each pair of start and end points, and then expand with tidyr::unnest:
df <- data.frame(V1 = c(1L, 1L, 3L),
V2 = c(2L, 3L, 5L),
R2 = c(0.4, 0.5, 0.3))
library(tidyverse)
df %>% transmute(V = map2(V1, V2, seq), R2) %>% unnest()
#> R2 V
#> 1 0.4 1
#> 2 0.4 2
#> 3 0.5 1
#> 4 0.5 2
#> 5 0.5 3
#> 6 0.3 3
#> 7 0.3 4
#> 8 0.3 5
In base R, there isn't a simple equivalent of unnest, so it's easier to use Map (the multivariate lapply, roughly equivalent to purrr::map2 above) to build a list of data frames, complete with the R2 value (recycled by data.frame), which can then be do.call(rbind, ...)ed into a single data frame:
do.call(rbind,
Map(function(v1, v2, r2){data.frame(V = v1:v2, R2 = r2)},
df$V1, df$V2, df$R2))
#> V R2
#> 1 1 0.4
#> 2 2 0.4
#> 3 1 0.5
#> 4 2 0.5
#> 5 3 0.5
#> 6 3 0.3
#> 7 4 0.3
#> 8 5 0.3
Check out the intermediate products of each to get a feel for how they work.
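As one way to inspect those intermediate products for the base R version, the list returned by Map can be stored before it is bound together. A small sketch; pieces and combined are illustrative names.

```r
df <- data.frame(V1 = c(1L, 1L, 3L),
                 V2 = c(2L, 3L, 5L),
                 R2 = c(0.4, 0.5, 0.3))

# one small data frame per input row; R2 is recycled to the length of v1:v2
pieces <- Map(function(v1, v2, r2) data.frame(V = v1:v2, R2 = r2),
              df$V1, df$V2, df$R2)

length(pieces)   # one list element per row of df
pieces[[2]]      # the expansion of the second row (V1 = 1, V2 = 3)

# stack the pieces into a single data frame
combined <- do.call(rbind, pieces)
combined
```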
Here is one option using data.table
library(data.table)
setDT(df)[, .(V = V1:V2, R2), by = .(grp = 1:nrow(df))][, grp := NULL][]
# V R2
#1: 1 0.4
#2: 2 0.4
#3: 1 0.5
#4: 2 0.5
#5: 3 0.5
#6: 3 0.3
#7: 4 0.3
#8: 5 0.3

Lagging variable by group does not work in dplyr

I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me, no idea why.
This is my problem:
library(dplyr)
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
arrange(monthvec) %>%
mutate(growth=ifelse(monthvec==1, 0.3,
ifelse(monthvec==2, 0.5,
ifelse(monthvec==3, 0.7,
ifelse(monthvec==4, 0.1,
ifelse(monthvec==5, 0.6,NA))))))
df%>%
group_by(monthvec) %>%
mutate(lag.growth = lag(growth, order_by=monthvec))
Source: local data frame [13 x 3]
Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 0.3
3 2 0.5 NA
4 2 0.5 0.5
5 3 0.7 NA
6 3 0.7 0.7
7 3 0.7 0.7
8 4 0.1 NA
9 4 0.1 0.1
10 4 0.1 0.1
11 5 0.6 NA
12 5 0.6 0.6
13 5 0.6 0.6
This is what I'd like it to be in the end:
df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)
monthvec growth lag.growth
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
I believe that one problem is that my groups are not of equal length...
Thanks for helping out.
Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.
library(dplyr)
df %>%
group_by(monthvec) %>%
mutate(cnt = n()) %>%
ungroup() %>%
mutate(lag.growth = lag(growth, first(cnt))) %>%
group_by(monthvec) %>%
mutate(lag.growth = first(lag.growth)) %>%
select(-cnt)
which gives,
# A tibble: 13 x 3
# Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
You may join your original data with a dataframe with a shifted "monthvec".
left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec")
# monthvec growth.x growth.y
# 1 1 0.3 NA
# 2 1 0.3 NA
# 3 2 0.5 0.3
# 4 2 0.5 0.3
# 5 3 0.7 0.5
# 6 3 0.7 0.5
# 7 3 0.7 0.5
# 8 4 0.1 0.7
# 9 4 0.1 0.7
# 10 4 0.1 0.7
# 11 5 0.6 0.1
# 12 5 0.6 0.1
# 13 5 0.6 0.1
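The attempt in the question returns NA for the first row of every month because, after group_by(monthvec), lag() only ever sees rows from the same month. What is wanted is really a lookup of the previous month's value, and that can be sketched in base R with match(). The names growth_by_month and lookup are illustrative, and the sketch assumes growth is constant within a month, as in the example.

```r
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df[order(df$monthvec), , drop = FALSE]

# same growth values as the nested ifelse() in the question
growth_by_month <- c(0.3, 0.5, 0.7, 0.1, 0.6)
df$growth <- growth_by_month[df$monthvec]

# one row per month, then look up the row for the month before
lookup <- unique(df[, c("monthvec", "growth")])
df$lag.growth <- lookup$growth[match(df$monthvec - 1, lookup$monthvec)]
df
```

Month 1 has no predecessor in lookup, so match() returns NA there, which gives the leading NA values in the desired result.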

Reshape matrix to data frame

I have an association matrix file that looks like this (4 rows and 3 columns).
test=read.table("test.csv", sep=",", header=T)
head(test)
LosAngeles SanDiego Seattle
1 2 3
A 1 0.1 0.2 0.2
B 2 0.2 0.4 0.2
C 3 0.3 0.5 0.3
D 4 0.2 0.5 0.1
What I want to do is reshape this matrix into a data frame. The result should look something like this (12 (= 4 * 3) rows and 3 columns):
RowNum ColumnNum Value
1 1 0.1
2 1 0.2
3 1 0.3
4 1 0.2
1 2 0.2
2 2 0.4
3 2 0.5
4 2 0.5
1 3 0.2
2 3 0.2
3 3 0.3
4 3 0.1
That is, if my matrix file has 100 rows and 90 columns, I want to make a new data frame that contains 9000 (= 100 * 90) rows and 3 columns. I've tried to use the reshape package but I do not seem to be able to get it right. Any suggestions on how to solve this problem?
Use as.data.frame.table. It's the boss:
m <- matrix(data = c(0.1, 0.2, 0.2,
0.2, 0.4, 0.2,
0.3, 0.5, 0.3,
0.2, 0.5, 0.1),
nrow = 4, byrow = TRUE,
dimnames = list(row = 1:4, col = 1:3))
m
# col
# row 1 2 3
# 1 0.1 0.2 0.2
# 2 0.2 0.4 0.2
# 3 0.3 0.5 0.3
# 4 0.2 0.5 0.1
as.data.frame.table(m)
# row col Freq
# 1 1 1 0.1
# 2 2 1 0.2
# 3 3 1 0.3
# 4 4 1 0.2
# 5 1 2 0.2
# 6 2 2 0.4
# 7 3 2 0.5
# 8 4 2 0.5
# 9 1 3 0.2
# 10 2 3 0.2
# 11 3 3 0.3
# 12 4 3 0.1
This should do the trick:
test <- as.matrix(read.table(text="
1 2 3
1 0.1 0.2 0.2
2 0.2 0.4 0.2
3 0.3 0.5 0.3
4 0.2 0.5 0.1", header=TRUE))
data.frame(which(test==test, arr.ind=TRUE),
Value=test[which(test==test)],
row.names=NULL)
# row col Value
#1 1 1 0.1
#2 2 1 0.2
#3 3 1 0.3
#4 4 1 0.2
#5 1 2 0.2
#6 2 2 0.4
#7 3 2 0.5
#8 4 2 0.5
#9 1 3 0.2
#10 2 3 0.2
#11 3 3 0.3
#12 4 3 0.1
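If integer row and column numbers are wanted (as.data.frame.table returns the dimnames as factors), another base R sketch uses row() and col(), which give the row and column index of every cell. The column names follow the question's desired output; long is an illustrative name.

```r
m <- matrix(c(0.1, 0.2, 0.2,
              0.2, 0.4, 0.2,
              0.3, 0.5, 0.3,
              0.2, 0.5, 0.1),
            nrow = 4, byrow = TRUE)

# as.vector() walks the matrix column by column, so the three vectors line up
long <- data.frame(RowNum    = as.vector(row(m)),
                   ColumnNum = as.vector(col(m)),
                   Value     = as.vector(m))
long
```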
