Subset a data frame based on another data frame in R

I have two data sets, dat1 and dat2, that look like this:
a<-c(rep(1,5), rep(2,3), rep(1,2), rep(2,4), rep(1,2))
b<-c(rep("AA", 8), rep("BB", 6), rep("CC", 2))
v<-c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
"x4", "x5", "x6", "x7", "x8", "x9", "x5", "x8")
ab<-c(1,2,5,6,58,2,4,14,2,25,23,1,12,14,15,14)
dat1<-data.frame(a,b,v,ab)
names(dat1)<-c("loc", "point", "sp", "ab")
a<-c(rep(1,8), rep(2,4), rep(3, 2), rep(1,4))
b<-c(rep("AA", 8), rep("BB", 6), rep("DD", 4))
v<-c("y1", "y2", "y3", "y4", "y6", "y7", "y8", "y12",
"y1", "y2", "y3", "y4", "y5", "y6", "y1", "y2", "y3", "y6")
ab<-c(1,2,45,14,1,12,14,15,10,2,32,14,1,12,18,9,6,7)
dat2<-data.frame(a,b,v,ab)
names(dat2)<-c("loc", "point", "sp", "ab")
I need to subset each of these data frames so that it keeps only the rows whose combination of loc and point occurs in both dat1 and dat2.
My result should look like:
res1
loc point sp ab
1 1 AA x1 1
2 1 AA x2 2
3 1 AA x3 5
4 1 AA x4 6
5 1 AA x5 58
11 2 BB x6 23
12 2 BB x7 1
13 2 BB x8 12
14 2 BB x9 14
res2
loc point sp ab
1 1 AA y1 1
2 1 AA y2 2
3 1 AA y3 45
4 1 AA y4 14
5 1 AA y6 1
6 1 AA y7 12
7 1 AA y8 14
8 1 AA y12 15
9 2 BB y1 10
10 2 BB y2 2
11 2 BB y3 32
12 2 BB y4 14
I tried merge() and then splitting the result into two data frames, but the data sets do not have the same number of rows, so rows from the smaller one were duplicated to fill the gaps. My attempts with subset() also failed.
This is similar to "Subset a data frame based on another", but I haven't succeeded even when I tried the solutions given there (e.g. intersect).
Thx for help!

IMHO you can try:
merge(dat1, unique(dat2[,1:2]))
merge(dat2, unique(dat1[,1:2]))

semi_join in the dplyr package is designed for this:
library(dplyr)
# get just the rows in dat1 that have matches in dat2
dat1 %>% semi_join(dat2, by=c('loc', 'point'))
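For comparison, here is a minimal base R sketch of the same filtering idea that needs no packages and keeps the original row names, as in the expected res1 and res2. It assumes a combined key built with paste() is acceptable, i.e. that the separator "|" never occurs in loc or point; the names key1 and key2 are only illustrative.

```r
# Rebuild the question's data
dat1 <- data.frame(
  loc   = c(rep(1, 5), rep(2, 3), rep(1, 2), rep(2, 4), rep(1, 2)),
  point = c(rep("AA", 8), rep("BB", 6), rep("CC", 2)),
  sp    = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
            "x4", "x5", "x6", "x7", "x8", "x9", "x5", "x8"),
  ab    = c(1, 2, 5, 6, 58, 2, 4, 14, 2, 25, 23, 1, 12, 14, 15, 14),
  stringsAsFactors = FALSE
)
dat2 <- data.frame(
  loc   = c(rep(1, 8), rep(2, 4), rep(3, 2), rep(1, 4)),
  point = c(rep("AA", 8), rep("BB", 6), rep("DD", 4)),
  sp    = c("y1", "y2", "y3", "y4", "y6", "y7", "y8", "y12",
            "y1", "y2", "y3", "y4", "y5", "y6", "y1", "y2", "y3", "y6"),
  ab    = c(1, 2, 45, 14, 1, 12, 14, 15, 10, 2, 32, 14, 1, 12, 18, 9, 6, 7),
  stringsAsFactors = FALSE
)

# One key per row describing its loc/point combination
# (assumption: "|" does not appear in either column)
key1 <- paste(dat1$loc, dat1$point, sep = "|")
key2 <- paste(dat2$loc, dat2$point, sep = "|")

# Keep the rows of each data frame whose key also occurs in the other one
res1 <- dat1[key1 %in% key2, ]
res2 <- dat2[key2 %in% key1, ]
```

Because this only filters rows, nothing is duplicated and nothing is reordered, which avoids the row multiplication seen with a plain merge() of the two full data frames.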

Related

How to combine members of one column and then count other columns in R?

I have a data frame:
df <- structure(list(ID = c("x1", "x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x2", "x3", "x3", "x3", "x3", "x3", "x3", "x1", "x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x2", "x3", "x3", "x3", "x3", "x3", "x3"), col1=c("a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2"), col2 = c("a", "b", "c", "d", "e", "f", "a", "b", "c", "d", "e", "f","a", "b", "c", "d", "e", "f","a", "b", "c", "d", "e", "f", "a", "b", "c", "d", "e", "f","a", "b", "c", "d", "e", "f"), col3 = c(2,13,1,21,0,5,3,0,6,4,50,0,0,0,0,9,5,0,51,3,6,0,0,9,89,4,29,1,4,17,6,16,9,1,0,0)),
class = "data.frame", row.names = c(NA,-36L))
ID col1 col2 col3
x1 a1 a 2
x1 a1 b 13
x1 a1 c 1
x1 a1 d 21
x1 a1 e 0
x1 a1 f 5
x2 a1 a 3
x2 a1 b 0
x2 a1 c 6
x2 a1 d 4
x2 a1 e 50
x2 a1 f 0
x3 a1 a 0
x3 a1 b 0
x3 a1 c 0
x3 a1 d 9
x3 a1 e 5
x3 a1 f 0
x1 a2 a 51
x1 a2 b 3
x1 a2 c 6
x1 a2 d 0
x1 a2 e 0
x1 a2 f 9
x2 a2 a 89
x2 a2 b 4
x2 a2 c 29
x2 a2 d 1
x2 a2 e 4
x2 a2 f 17
x3 a2 a 6
x3 a2 b 16
x3 a2 c 9
x3 a2 d 1
x3 a2 e 0
x3 a2 f 0
I want to count the unique IDs that have "a", "b" or "c" >0 (more than zero), then "d" or "e" >0, and finally "f">0.
Then get the sum of all (abc), (de) and (f) separately in a different column.
So the result would look like the following:
df2<- structure(list(col1=c("a1","a1","a1","a2","a2","a2"), col2 = c("abc", "de", "f", "abc", "de", "f"), count.ID = c(2,3,1,3,2,2), total=c(25,89,5,213,6,26)),
class = "data.frame", row.names = c(NA,-6L))
col1 col2 count.ID total
a1 abc 2 25
a1 de 3 89
a1 f 1 5
a2 abc 3 213
a2 de 2 6
a2 f 2 26
How is this possible in R?
Thanks
One approach is to create a lookup frame that maps each old col2 value to its new combined group, and then join it to the original data.
dplyr
library(dplyr)
groups <- data.frame(col2=c("a","b","c","d","e","f"), col2b=c("abc","abc","abc","de","de","f"))
left_join(df, groups, by = "col2") %>%
group_by(col1, col2 = col2b) %>%
summarize(count.ID = length(unique(ID[col3 > 0])), total = sum(col3)) %>%
ungroup()
# # A tibble: 6 x 4
# col1 col2 count.ID total
# <chr> <chr> <int> <dbl>
# 1 a1 abc 2 25
# 2 a1 de 3 89
# 3 a1 f 1 5
# 4 a2 abc 3 213
# 5 a2 de 2 6
# 6 a2 f 2 26
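The same lookup idea can also be sketched in base R, in case dplyr is not available. This is only an illustration: the names lookup, grp, pieces and res are made up here, and the data frame is rebuilt with rep() rather than copied from the structure() call in the question.

```r
# Rebuild the question's data
df <- data.frame(
  ID   = rep(rep(c("x1", "x2", "x3"), each = 6), times = 2),
  col1 = rep(c("a1", "a2"), each = 18),
  col2 = rep(c("a", "b", "c", "d", "e", "f"), times = 6),
  col3 = c(2, 13, 1, 21, 0, 5, 3, 0, 6, 4, 50, 0, 0, 0, 0, 9, 5, 0,
           51, 3, 6, 0, 0, 9, 89, 4, 29, 1, 4, 17, 6, 16, 9, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Named vector mapping each old col2 value to its combined group
lookup <- c(a = "abc", b = "abc", c = "abc", d = "de", e = "de", f = "f")
df$grp <- unname(lookup[df$col2])

# Split by col1 and group, then summarise each piece
pieces <- split(df, list(df$col1, df$grp), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(d) {
  data.frame(
    col1     = d$col1[1],
    col2     = d$grp[1],
    count.ID = length(unique(d$ID[d$col3 > 0])),  # distinct IDs with a positive value
    total    = sum(d$col3),
    stringsAsFactors = FALSE
  )
}))
res <- res[order(res$col1, res$col2), ]
rownames(res) <- NULL
res
```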

How to check if pairs remain the same between years?

I have a data frame with pairs of individual birds (male and female) that were observed in several years. I am trying to figure out whether these pairs have changed from one year to the next so that I can do some further analyses.
My data is structured like this:
dat <- tibble(year = rep(1:3, each = 3),
Male = c("A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"),
Female = c("X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"))
# A tibble: 9 x 3
year Male Female
<int> <chr> <chr>
1 1 A1 X1
2 1 B1 Y1
3 1 C1 Z1
4 2 A1 X1
5 2 B1 Y2
6 2 C1 Z2
7 3 A1 X1
8 3 B1 Y2
9 3 C2 Z2
And my expected output is something like:
# A tibble: 9 x 5
year Male Female male_state female_state
<int> <chr> <chr> <chr> <chr>
1 1 A1 X1 new new
2 1 B1 Y1 new new
3 1 C1 Z1 new new
4 2 A1 X1 reunited reunited
5 2 B1 Y2 divorced new
6 2 C1 Z2 divorced new
7 3 A1 X1 reunited reunited
8 3 B1 Y2 reunited reunited
9 3 C2 Z2 new divorced
I cannot figure out how to check whether a value from a different column is the same in the year before (e.g. if the male ID is the same for a certain female in year 2 or 3 as in the year prior). Any ideas?
This (probably overcomplicated) pipe produces the following output.
library(dplyr)
dat <- tibble(year = rep(1:3, each = 3),
Male = c("A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"),
Female = c("X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"))
dat %>%
mutate(pair=paste0(Male,Female)) %>%
arrange(pair,year) %>%
mutate(check = if_else((pair==lag(pair)) & (year>lag(year)), 'old couple', 'new couple')) %>%
mutate(check = if_else(is.na(check), 'new couple', check)) %>%
mutate(divorced = if_else((Male == lag(Male)) & (Female != lag(Female)), 'divorce', '')) %>%
mutate(divorced = if_else(is.na(divorced), '', divorced))
OUTPUT:
# A tibble: 9 × 6
year Male Female pair check divorced
<int> <chr> <chr> <chr> <chr> <chr>
1 1 A1 X1 A1X1 new couple ""
2 2 A1 X1 A1X1 old couple ""
3 3 A1 X1 A1X1 old couple ""
4 1 B1 Y1 B1Y1 new couple ""
5 2 B1 Y2 B1Y2 new couple "divorce"
6 3 B1 Y2 B1Y2 old couple ""
7 1 C1 Z1 C1Z1 new couple ""
8 2 C1 Z2 C1Z2 new couple "divorce"
9 3 C2 Z2 C2Z2 new couple ""
Try this:
library(tidyverse)
dat <- tibble(
year = rep(1:3, each = 3),
Male = c(
"A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"
),
Female = c(
"X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"
)
)
dat |>
mutate(pairing = str_c(Male, "|", Female)) |>
add_count(pairing) |>
group_by(pairing) |>
mutate(male_state = if_else(pairing == lag(pairing), "reunited", NA_character_),
female_state = if_else(pairing == lag(pairing), "reunited", NA_character_)) |>
group_by(Male) |>
mutate(
male_state = if_else(row_number() == 1, "new", male_state),
male_state = if_else(is.na(male_state), "divorced", male_state)
) |>
group_by(Female) |>
mutate(
female_state = if_else(row_number() == 1, "new", female_state),
female_state = if_else(is.na(female_state), "divorced", female_state)
) |>
arrange(year, Male)
#> # A tibble: 9 × 7
#> # Groups: Female [5]
#> year Male Female pairing n male_state female_state
#> <int> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 1 A1 X1 A1|X1 3 new new
#> 2 1 B1 Y1 B1|Y1 1 new new
#> 3 1 C1 Z1 C1|Z1 1 new new
#> 4 2 A1 X1 A1|X1 3 reunited reunited
#> 5 2 B1 Y2 B1|Y2 2 divorced new
#> 6 2 C1 Z2 C1|Z2 1 divorced new
#> 7 3 A1 X1 A1|X1 3 reunited reunited
#> 8 3 B1 Y2 B1|Y2 2 reunited reunited
#> 9 3 C2 Z2 C2|Z2 1 new divorced
Created on 2022-05-03 by the reprex package (v2.0.1)
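For comparison, a base R sketch that looks up each bird's partner from the previous year directly. It rests on two assumptions that are not stated in the question: each individual appears at most once per year, and a bird that was not observed in the previous year counts as "new". The helper names prev_partner and pair_state are made up for illustration.

```r
dat <- data.frame(
  year   = rep(1:3, each = 3),
  Male   = c("A1", "B1", "C1", "A1", "B1", "C1", "A1", "B1", "C2"),
  Female = c("X1", "Y1", "Z1", "X1", "Y2", "Z2", "X1", "Y2", "Z2"),
  stringsAsFactors = FALSE
)

# Partner each individual had in the previous year (NA if not observed then)
prev_partner <- function(id, partner, year) {
  idx <- match(paste(id, year - 1), paste(id, year))
  partner[idx]
}

# Classify the current pairing relative to last year's partner
pair_state <- function(partner, prev) {
  ifelse(is.na(prev), "new",
         ifelse(partner == prev, "reunited", "divorced"))
}

dat$male_state   <- pair_state(dat$Female,
                               prev_partner(dat$Male, dat$Female, dat$year))
dat$female_state <- pair_state(dat$Male,
                               prev_partner(dat$Female, dat$Male, dat$year))
dat
```

On the sample data this gives the same male_state and female_state columns as the expected output in the question. If a bird can skip a year and should still be compared with its last known partner, the lookup would need to search for the most recent earlier year instead of year - 1.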

How to collect members of a column based on the value of a specific member in that column in R

In the following data frame, I want to collect the members of B1 whose value in B2 is equal to or greater than the value of "b" in B2 (within the same ID). Then, from this new data, I want to count how many times each B1 member occurs.
dataframe:
ID B1 B2
z1 a 2.5
z1 b 1.7
z1 c 170
z1 c 9
z1 d 3
y2 a 0
y2 b 21
y2 c 15
y2 c 101
y2 d 30
y2 d 3
y2 d 15.5
x3 a 30.8
x3 a 54
x3 a 0
x3 b 30.8
x3 c 30.8
x3 d 7
so the result would be:
ID B1 B2
z1 a 2.5
z1 c 170
z1 c 9
z1 d 3
y2 c 101
y2 d 30
x3 a 30.8
x3 a 54
x3 c 30.8
and
ID B1 count
z1 a 1
z1 c 2
z1 d 1
y2 a 0
y2 c 1
y2 d 1
x3 a 2
x3 c 1
x3 d 0
Group by 'ID', then filter the rows where 'B2' is greater than or equal to the 'B2' value of the row where 'B1' is 'b', and add a second condition that drops the 'b' rows themselves.
library(dplyr)
out1 <- df1 %>%
group_by(ID) %>%
filter(any(B1 == "b") & B2 >= min(B2[B1 == "b"]), B1 != 'b')
-output
> out1
# A tibble: 9 × 3
# Groups: ID [3]
ID B1 B2
<chr> <chr> <dbl>
1 z1 a 2.5
2 z1 c 170
3 z1 c 9
4 z1 d 3
5 y2 c 101
6 y2 d 30
7 x3 a 30.8
8 x3 a 54
9 x3 c 30.8
For the second output, group by 'B1' as well, use summarise to get the number of rows, and then fill in the missing combinations with complete.
library(tidyr)
out1 %>%
group_by(B1, .add = TRUE) %>%
summarise(count = n(), .groups = "drop_last") %>%
complete(B1 = unique(.$B1), fill = list(count = 0)) %>%
ungroup
# A tibble: 9 × 3
ID B1 count
<chr> <chr> <int>
1 x3 a 2
2 x3 c 1
3 x3 d 0
4 y2 a 0
5 y2 c 1
6 y2 d 1
7 z1 a 1
8 z1 c 2
9 z1 d 1
data
df1 <- structure(list(ID = c("z1", "z1", "z1", "z1", "z1", "y2", "y2",
"y2", "y2", "y2", "y2", "y2", "x3", "x3", "x3", "x3", "x3", "x3"
), B1 = c("a", "b", "c", "c", "d", "a", "b", "c", "c", "d", "d",
"d", "a", "a", "a", "b", "c", "d"), B2 = c(2.5, 1.7, 170, 9,
3, 0, 21, 15, 101, 30, 3, 15.5, 30.8, 54, 0, 30.8, 30.8, 7)),
class = "data.frame", row.names = c(NA,
-18L))
Using tidyverse:
library(tidyverse)
df %>%
group_by(ID) %>%
# keep rows at or above the group's "b" value, excluding the "b" row itself
filter(B2 >= B2[B1 == "b"], B1 != "b") %>%
group_by(ID, B1) %>%
count(name = "count") %>%
as.data.frame()
#> ID B1 count
#> 1 x3 a 2
#> 2 x3 c 1
#> 3 y2 c 1
#> 4 y2 d 1
#> 5 z1 a 1
#> 6 z1 c 2
#> 7 z1 d 1
Created on 2022-04-26 by the reprex package (v2.0.1)

Extract data based on a time series column in R

I have annual daily time series pixel data in a data frame, arranged so that each date occurs multiple times, once for each pixel. I would like to extract/subset this data based on a set of dates stored in another data frame. How can I do this in R using dplyr?
Sample data
X Y T Value
X1 Y1 1/1/2004 1
X2 Y2 1/1/2004 2
X3 Y3 1/1/2004 3
X1 Y1 1/2/2004 4
X2 Y2 1/2/2004 5
X3 Y3 1/2/2004 6
X1 Y1 1/3/2004 7
X2 Y2 1/3/2004 8
X3 Y3 1/3/2004 9
Dates of interest
1/1/2004
1/2/2004
Code
library(dplyr)
X = c("X1", "X2", "X3", "X1", "X2", "X3", "X1", "X2", "X3")
Y = c("Y1", "Y2", "Y3", "Y1", "Y2", "Y3", "Y1", "Y2", "Y3")
T = c("1/1/2004", "1/2/2004", "1/3/2004", "1/1/2004", "1/2/2004", "1/3/2004","1/1/2004", "1/2/2004", "1/3/2004")
Value = c("1", "2", "3", "4", "5", "6", "7", "8", "9")
df = data.frame(X, Y, T, Value)
# Desired dates
TS = read.csv("TS.csv")
TS
"1/1/2004", "1/2/2004"
#stuck...___
If your TS is TS = c("1/1/2004", "1/2/2004"), you can simply use filter:
library(dplyr)
df %>%
filter(T %in% TS)
X Y T Value
1 X1 Y1 1/1/2004 1
2 X2 Y2 1/2/2004 2
3 X1 Y1 1/1/2004 4
4 X2 Y2 1/2/2004 5
5 X1 Y1 1/1/2004 7
6 X2 Y2 1/2/2004 8
If your TS is a single string, TS = ("1/1/2004, 1/2/2004"), split it first:
library(stringr)
df %>%
filter(T %in% str_split(gsub("\\s+", "", TS), ",", simplify = TRUE))
Base R:
> df[df$T %in% TS,]
X Y T Value
1 X1 Y1 1/1/2004 1
2 X2 Y2 1/2/2004 2
4 X1 Y1 1/1/2004 4
5 X2 Y2 1/2/2004 5
7 X1 Y1 1/1/2004 7
8 X2 Y2 1/2/2004 8
>
If TS is "1/1/2004, 1/2/2004", use stringr:
> df[df$T %in% stringr::str_split(TS, ", ", simplify=TRUE),]
X Y T Value
1 X1 Y1 1/1/2004 1
2 X2 Y2 1/2/2004 2
4 X1 Y1 1/1/2004 4
5 X2 Y2 1/2/2004 5
7 X1 Y1 1/1/2004 7
8 X2 Y2 1/2/2004 8
>
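Note that the answers above compare the dates as plain strings, so "01/01/2004" would not match "1/1/2004". A hedged sketch of a more forgiving variant converts both sides to Date first. It assumes the dates are written as month/day/year (the question does not say which), and the zero-padded value in TS below is invented to show the difference.

```r
df <- data.frame(
  X     = rep(c("X1", "X2", "X3"), times = 3),
  Y     = rep(c("Y1", "Y2", "Y3"), times = 3),
  T     = rep(c("1/1/2004", "1/2/2004", "1/3/2004"), times = 3),
  Value = as.character(1:9),
  stringsAsFactors = FALSE
)

# Dates of interest; the first one is zero-padded on purpose (made-up example)
TS <- c("01/01/2004", "1/2/2004")

# Assumption: month/day/year; use "%d/%m/%Y" instead if the data is day/month/year
fmt <- "%m/%d/%Y"

# Compare parsed dates rather than raw strings
keep <- as.Date(df$T, format = fmt) %in% as.Date(TS, format = fmt)
res  <- df[keep, ]
res
```

The same logical vector works inside dplyr's filter() if you prefer that style.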

Finding ALL indices of positions in vector matching columns of DF in R

I have a dataframe A whose columns I want to match with the row.names of another dataframe B.
# A
v1 v2
X1 X3
X1 X5
X1 X15
X2 X3
X2 X4
...
# row.names of B (some values are duplicated)
row_names_B=c('X17', 'X1', 'X2', 'X15', 'X3', 'X3', 'X1', 'X5', 'X4', ...)
I want to match the columns of A with the positions of row_names_B, such that I can return a list of ALL positions in B for each row in A.
#my results:
v1_index v2_index
2 5 #matches X1 in pos 2, X3 in pos 5
2 6 #matches X1 in pos 2, X3 in pos 6
7 5 #matches X1 in pos 7, X3 in pos 5
7 6 #matches X1 in pos 7, X3 in pos 6
2 8 #matches X1 in pos 2, X5 in pos 8
7 8 #matches X1 in pos 7, X5 in pos 8
...
Note that I want to find all possible solutions.
I understand that this should be with some variant of match or which as given in this example, but I'm not sure how to do the explosion for each of the matches. The way I see it is by running it through for loops, row by row, but perhaps there is a better way to do this?
You could create a list of positions keyed by name and then, for each value in dataframe A, randomly pick one of its positions from that list.
C <- A
ref <- split(seq_along(row_names_B), row_names_B)
C[] <- lapply(A, function(y) sapply(ref[y],
function(x) if(length(x) == 1) x else sample(x, 1)))
C
# v1 v2
#1 2 5
#2 2 8
#3 7 4
#4 3 5
#5 3 9
data
A <- structure(list(v1 = c("X1", "X1", "X1", "X2", "X2"), v2 = c("X3",
"X5", "X15", "X3", "X4")), class = "data.frame", row.names = c(NA, -5L))
row_names_B <- c("X17", "X1", "X2", "X15", "X3", "X3", "X1", "X5", "X4")
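If all position pairs are needed rather than one random pick, the same ref list can be expanded per row with expand.grid. This is a sketch rather than part of the answer above: it assumes every name in A occurs at least once in row_names_B, and the extra row column and the name all_idx are added here only to show which row of A each pair came from.

```r
A <- data.frame(
  v1 = c("X1", "X1", "X1", "X2", "X2"),
  v2 = c("X3", "X5", "X15", "X3", "X4"),
  stringsAsFactors = FALSE
)
row_names_B <- c("X17", "X1", "X2", "X15", "X3", "X3", "X1", "X5", "X4")

# Positions of every name in row_names_B, keyed by name
ref <- split(seq_along(row_names_B), row_names_B)

# For each row of A, build every combination of v1 positions and v2 positions
all_idx <- do.call(rbind, lapply(seq_len(nrow(A)), function(i) {
  combos <- expand.grid(
    v1_index = ref[[A$v1[i]]],
    v2_index = ref[[A$v2[i]]]
  )
  cbind(row = i, combos)
}))
all_idx
```

For the first row of A (X1, X3) this yields the four pairs (2, 5), (7, 5), (2, 6) and (7, 6), and the remaining rows are expanded in the same way.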
