Subset dataframe based on levels - r

I have the following data frame in R. I want to subset it based on three criteria, applied to each unique value of x within each level of id:
If there is only one row for a value of x, keep that row.
If x has the same value of z with two different values of y, keep the row where y is not 1.3.
If x has three values of z, keep the two rows where y is not 1.3.
id x y z
a 1 0.2 100
a 2 1 200
a 2 1.3 200
b 1 0.5 400
b 1 1 500
b 1 1.3 600
the solution would look like this:
id x y z
a 1 0.2 100
a 2 1 200
b 1 0.5 400
b 1 1 500
Any help would be appreciated

We can group by 'id' and 'x', then filter based on the conditions:
library(dplyr)
df1 %>%
  group_by(id, x) %>%
  filter(n() == 1 | (n() > 1 & y != 1.3))
data
df1 <- structure(list(id = c("a", "a", "a", "b", "b", "b"), x = c(1L,
2L, 2L, 1L, 1L, 1L), y = c(0.2, 1, 1.3, 0.5, 1, 1.3), z = c(100L,
200L, 200L, 400L, 500L, 600L)), class = "data.frame", row.names = c(NA,
-6L))
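For readers without dplyr, the same rule can be sketched in base R. This is a minimal sketch rather than part of the original answer: the names n_in_group, y_is_target, and result are illustrative, and the tolerance-based comparison is an assumption added in case y is computed rather than typed in.

```r
# Same data as in the answer above
df1 <- data.frame(
  id = c("a", "a", "a", "b", "b", "b"),
  x  = c(1L, 2L, 2L, 1L, 1L, 1L),
  y  = c(0.2, 1, 1.3, 0.5, 1, 1.3),
  z  = c(100L, 200L, 200L, 400L, 500L, 600L)
)

# Number of rows in each id/x combination, repeated for every row
n_in_group <- ave(seq_len(nrow(df1)), df1$id, df1$x, FUN = length)

# Compare with a small tolerance instead of testing y == 1.3 directly
y_is_target <- abs(df1$y - 1.3) < 1e-8

# Keep singleton rows, and otherwise every row whose y is not 1.3
result <- df1[n_in_group == 1 | !y_is_target, ]
result
```

The grouping is done by ave(), which returns one value per row, so the filter can be written as a plain logical index.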

Calculate minimum distance between groups of points in data frame

my data frame looks like this:
Time, Value, Group
0, 1.0, A
1, 2.0, A
2, 3.0, A
0, 4.0, B
1, 6.0, B
2, 6.0, B
0, 7.0, C
1, 7.0, C
2, 9.0, C
For each combination (A, B), (A, C), (B, C), I need to find the maximum absolute difference in Value across the corresponding Time points.
For example, comparing A and B gives the maximum distance at t=1, which is 6 (B) - 2 (A) = 4.
The full output should be something like this:
combination,time,distance
AB, 1, 4
AC, 0, 6
BC, 0, 3
One way in base R, using combn:
do.call(rbind, combn(unique(df$Group), 2, function(x) {
  df1 <- subset(df, Group == x[1])
  df2 <- subset(df, Group == x[2])
  df3 <- merge(df1, df2, by = 'Time')
  value <- abs(df3$Value.x - df3$Value.y)
  data.frame(combn = paste(x, collapse = ''),
             time = df3$Time[which.max(value)],
             max_difference = max(value))
}, simplify = FALSE))
# combn time max_difference
#1 AB 1 4
#2 AC 0 6
#3 BC 0 3
We create all combinations of the unique Group values, subset the data for each pair, and merge the two subsets on Time. We then take the absolute difference of the corresponding Value columns and return the largest difference together with the Time at which it occurs.
data
df <- structure(list(Time = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9), Group = c("A", "A",
"A", "B", "B", "B", "C", "C", "C")),
class = "data.frame", row.names = c(NA, -9L))
One dplyr option, which returns the distance for each pair of groups at every Time point, could be:
library(dplyr)
df %>%
  inner_join(df, by = "Time") %>%
  filter(Group.x != Group.y) %>%
  group_by(Time,
           Group = paste(pmax(Group.x, Group.y), pmin(Group.x, Group.y), sep = "-")) %>%
  summarise(Max_Distance = abs(max(Value.x[Group.x == first(Group.x)]) - max(Value.y[Group.y == first(Group.y)])))
Time Group Max_Distance
<int> <chr> <dbl>
1 0 B-A 3
2 0 C-A 6
3 0 C-B 3
4 1 B-A 4
5 1 C-A 5
6 1 C-B 1
7 2 B-A 3
8 2 C-A 6
9 2 C-B 3
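Another way to get one row per pair of groups is to reshape the values into a Time-by-Group matrix first and then compare columns. The following is only a sketch under the assumption that every Group has exactly one Value per Time point, as in the question's data; the names wide, pairs, and res are illustrative.

```r
# Data as shown in the question
df <- data.frame(
  Time  = rep(0:2, times = 3),
  Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9),
  Group = rep(c("A", "B", "C"), each = 3)
)

# Time-by-Group matrix of values (assumes one Value per Time/Group cell)
wide <- tapply(df$Value, list(df$Time, df$Group), FUN = mean)

# One column per pair of groups
pairs <- combn(colnames(wide), 2)

res <- data.frame(
  combination = apply(pairs, 2, paste, collapse = ""),
  time = NA_integer_,
  distance = NA_real_
)
for (k in seq_len(ncol(pairs))) {
  d <- abs(wide[, pairs[1, k]] - wide[, pairs[2, k]])
  best <- which.max(d)  # the first maximum wins when there are ties
  res$time[k] <- as.integer(rownames(wide)[best])
  res$distance[k] <- d[[best]]
}
res
```

Because which.max() returns the first maximum, a pair such as A and C, which is equally far apart at Time 0 and Time 2, is reported at Time 0.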

Finding minimum by groups and among columns

I am trying to find the minimum value among different columns and group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by the groups and for each group, find the row which contains the minimum group score among both group scores and then also get the name of the column which contains the minimum (group_score_1 or group_score_2),
so basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas and eventually ended up dividing the data into several new data frames, filtering by group, selecting the relevant columns, and then using which.min(), but I'm sure there's a much more efficient way to do it. I'm not sure what I am missing.
We can use data.table methods
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
For each group, you can calculate the minimum value and select the row in which that value exists in one of the columns.
library(dplyr)
df %>%
  group_by(group) %>%
  filter({tmp = min(group_score_1, group_score_2);
          group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. If you have many such columns, it is not practical to list each one of them as group_score_1 == tmp | group_score_2 == tmp and so on. In that case, convert the data to long format, find the cut value that corresponds to the minimum value in each group, and join back to the original data. This assumes cut is unique within each group.
df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
Here is a base R option using pmin + ave + subset:
subset(
  df,
  as.logical(ave(
    do.call(pmin, df[grep("group_score_\\d+", names(df))]),
    group,
    FUN = function(x) x == min(x)
  ))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
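The question also asks for the name of the column that holds the minimum. A possible base R sketch follows; the min_col column and the helper names score_cols, scores, and row_min are illustrative and not taken from the answers. It relies on max.col() applied to the negated scores, which picks the column of the row-wise minimum.

```r
df <- data.frame(
  group = c("a", "b", "a", "b", "a", "b"),
  cut = c(1L, 2L, 0L, 3L, 2L, 1L),
  group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L),
  group_score_2 = c(5, 4, 2.5, 4, 6, 1)
)

score_cols <- grep("^group_score_", names(df), value = TRUE)
scores <- as.matrix(df[score_cols])

# Row-wise minimum and the name of the column that holds it
row_min <- do.call(pmin, df[score_cols])
df$min_col <- score_cols[max.col(-scores, ties.method = "first")]

# Keep the row(s) whose row-wise minimum equals the group-wise minimum
result <- df[row_min == ave(row_min, df$group, FUN = min), ]
result
```

If two rows in a group share the same minimum, both are kept, and ties between columns within a row resolve to the first score column.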
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

check if numbers in a column are ascending by a certain value (R dataframe)

I have a column of numbers (index) in a data frame like the one below. I am attempting to check whether these numbers ascend in steps of 1 within each group. For example, groups B and C do not ascend by 1. While I can check by sight, my data frame is thousands of rows long, so I'd prefer to automate this. Does anyone have advice? Thank you!
group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2
...
I think this works. diff calculates the difference between consecutive values, and all then checks whether every difference within a group equals 1. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
  group_by(group) %>%
  summarize(Result = all(diff(index) == 1)) %>%
  ungroup()
dat2
# # A tibble: 3 x 2
# group Result
# <chr> <lgl>
# 1 A TRUE
# 2 B FALSE
# 3 C FALSE
DATA
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2",
header = TRUE, stringsAsFactors = FALSE)
Maybe aggregate could help
> aggregate(.~group,df1,function(v) all(diff(v)==1))
group index
1 A TRUE
2 B FALSE
3 C FALSE
We can group by 'group', get the difference between the current and previous value (using shift), and check whether all the differences are equal to 1.
library(data.table)
setDT(df1)[, .(Result = all((index - shift(index))[-1] == 1)), group]
# group Result
#1: A TRUE
#2: B FALSE
#3: C FALSE
data
df1 <- structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), index = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 2L, 0L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-13L))
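With thousands of rows it may also help to see where a group breaks the pattern, not only whether it does. The sketch below uses base R only; the names ascending, step, and bad_rows are illustrative, and treating the first row of each group as valid is an assumption.

```r
dat <- data.frame(
  group = rep(c("A", "B", "C"), times = c(5, 4, 4)),
  index = c(0, 1, 2, 3, 4, 0, 1, 2, 2, 0, 3, 1, 2)
)

# Per-group check: TRUE when every consecutive difference is exactly 1
ascending <- tapply(dat$index, dat$group, function(v) all(diff(v) == 1))

# Per-row step from the previous row in the same group; the first row of
# each group is given a step of 1 so that it is never flagged
step <- ave(dat$index, dat$group, FUN = function(v) c(1, diff(v)))

# Row numbers where the step is not 1
bad_rows <- which(step != 1)
bad_rows
```

Note that a group with a single row yields TRUE in ascending, because all() of an empty set of differences is TRUE.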

How to extract a column based on column name?

I have a data frame df
m n o p
a 1 1 2 5
b 1 2 0 4
c 3 3 3 3
I can extract column m by:
df[,"m"]
Now the problem is that the column name is generated somewhere else (multiple times, in a for loop). For example, the column name m was generated by choosing a specific element of the data frame gen in one loop iteration:
> gen[i,1]
[1] m
How do I extract the column based on gen[i,1]?
Just nest the subsetting.
dat[,"m"]
# [1] 1 1 3
i <- 13
gen[i, 1]
# [1] "m"
dat[, gen[i, 1]]
# [1] 1 1 3
Or, if you don't want the column to be dropped:
dat[, gen[i, 1], drop=FALSE]
# m
# a 1
# b 1
# c 3
Data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
We can use select from dplyr
library(dplyr)
i <- 13
dat %>%
select(gen[i, 1])
# m
#a 1
#b 1
#c 3
data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
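One caveat: the unquoted m printed in the question suggests that gen[i, 1] may be a factor there, which is what data.frame() produced by default for strings before R 4.0. A factor used as an index can be interpreted through its integer codes rather than its label, so converting it with as.character() first is a cautious option. The sketch below forces a factor column with stringsAsFactors = TRUE purely for illustration; col_name and col_m are hypothetical names.

```r
dat <- data.frame(
  m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L), p = 5:3,
  row.names = c("a", "b", "c")
)

# Force a factor column to mimic older R defaults (an assumption)
gen <- data.frame(letters, stringsAsFactors = TRUE)

i <- 13
# Convert the factor to its label before using it as a column name
col_name <- as.character(gen[i, 1])

col_m <- dat[[col_name]]                   # column as a plain vector
col_df <- dat[, col_name, drop = FALSE]    # column kept as a data frame
col_m
```

Using [[ with a character name also avoids the drop behaviour of dat[, name] when only one column is selected.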

Replace a subset of data frame

I have a data frame with some errors:
T item V1 V2
1 a 2 .1
2 a 5 .8
1 b 1 .7
2 b 2 .2
I have another data frame with corrections for items concerning V1 only
T item V1
1 a 2
2 a 6
How do I get the final data frame? Should I use merge or rbind? Note: the actual data frames are big.
An option would be a data.table join on 'T' and 'item', assigning 'V1' the corresponding 'V1' value (i.V1) from the second dataset:
library(data.table)
setDT(df1)[df2, V1 := i.V1, on = .(T, item)]
df1
# T item V1 V2
#1: 1 a 2 0.1
#2: 2 a 6 0.8
#3: 1 b 1 0.7
#4: 2 b 2 0.2
data
df1 <- structure(list(T = c(1L, 2L, 1L, 2L), item = c("a", "a", "b",
"b"), V1 = c(2L, 5L, 1L, 2L), V2 = c(0.1, 0.8, 0.7, 0.2)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L)),
class = "data.frame", row.names = c(NA,
-2L))
This should work:
library(dplyr)
df1 %>%
  left_join(df2, by = c("T", "item")) %>%
  mutate(
    V1 = coalesce(as.numeric(V1.y), as.numeric(V1.x))
  ) %>%
  select(-V1.x, -V1.y)
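If neither package is available, a similar update can be sketched in base R with match() on a composite key. This is an illustrative sketch, not a benchmarked solution for large data; the names key1, key2, idx, has_fix, and corrected are hypothetical, and the "|" separator is assumed not to occur in the key columns.

```r
df1 <- data.frame(
  T = c(1L, 2L, 1L, 2L),
  item = c("a", "a", "b", "b"),
  V1 = c(2L, 5L, 1L, 2L),
  V2 = c(0.1, 0.8, 0.7, 0.2)
)
df2 <- data.frame(T = 1:2, item = c("a", "a"), V1 = c(2L, 6L))

# Composite key per row, built from the two join columns
key1 <- paste(df1$T, df1$item, sep = "|")
key2 <- paste(df2$T, df2$item, sep = "|")

# Position of each df1 row in df2, or NA when there is no correction
idx <- match(key1, key2)
has_fix <- !is.na(idx)

# Overwrite V1 only where a correction exists; other rows are untouched
corrected <- df1
corrected$V1[has_fix] <- df2$V1[idx[has_fix]]
corrected
```

Unlike the left join above, this keeps V1 as an integer column and does not create temporary V1.x and V1.y columns.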
