How to fill a new dataframe column by conditional summing in R

I would like to build a new variable by summing values under multiple conditions (plus an extra constraint, see below). Here is my R code up to the point where I hit the issue.
# The raw dataframe
area <- c("A", "A", "B", "A", "C", "B", "A", "B", "A", "C")
varclass <- c("Z1", "Z1", "Z1", "Z2", "Z1", "Z1", "Z2", "Z1", "Z2", "Z2")
count <- c(45, 56, 2, 8, 345, 3, 98, 2, 6, 9)
df1 <- data.frame(area,
                  varclass,
                  count,
                  stringsAsFactors = FALSE)
df1
# This is what df1 looks like...
# area varclass count
#1 A Z1 45
#2 A Z1 56
#3 B Z1 2
#4 A Z2 8
#5 C Z1 345
#6 B Z1 3
#7 A Z2 98
#8 B Z1 2
#9 A Z2 6
#10 C Z2 9
# Building the final dataframe
df2 <- data.frame(unique(df1$area),
                  stringsAsFactors = FALSE)
names(df2)[1] <- "area"
# This is what df2 looks like...
# area
#1 A
#2 B
#3 C
# The new variable to build
df2$Z1_sum <- sum(df1[df1$varclass == "Z1" & df1$area == df2$area,]$count)
# doesn't work: df1$area (length 10) and df2$area (length 3) have different
# lengths, so == recycles df2$area and the condition is wrong
# What I hope for:
# area Z1_sum
#1 A 101
#2 B 7
#3 C 345
As you can see in the last lines, I would like to build a new variable, Z1_sum, in the df2 data frame. Z1_sum is the sum of count from df1 where varclass == "Z1" and df1$area matches the value in the current row of df2$area (in MS Excel terms, something like a $A2 cell reference).
Please note that I'm not looking for solutions that build df2 directly from df1 with a group-by operation or the dcast function; I only want a formula which returns the correct values in my new column. That's my extra condition. Why? Because I next have to build other variables with more sophisticated formulas than a plain sum. By understanding how to perform such conditional operations, I hope to move on.
Thanks for your help.
Jeff
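A minimal sketch of the row-wise formula being asked for, assuming df1 and df2 as built above: for each df2$area value, sum the matching "Z1" counts from df1.
# For each row of df2, sum df1$count where varclass is "Z1" and area matches
df2$Z1_sum <- sapply(df2$area, function(a) {
  sum(df1$count[df1$varclass == "Z1" & df1$area == a])
})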

Do you mean:
df2 <- setNames(
  aggregate(
    count ~ area,
    df1[df1$varclass == "Z1", ],
    sum
  ),
  c("area", "Z1_sum")
)
df2
area Z1_sum
1 A 101
2 B 7
3 C 345
or
df2$Z1_sum <- aggregate(count ~ area, df1[df1$varclass == "Z1", ], sum)$count
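The one-liner above assumes df2$area appears in the same order as aggregate()'s output; if that isn't guaranteed, a merge keeps the sums aligned (a sketch):
z1 <- aggregate(count ~ area, df1[df1$varclass == "Z1", ], sum)
df2 <- merge(df2, setNames(z1, c("area", "Z1_sum")), by = "area", all.x = TRUE)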
Edit to address your comment.
Try with:
df2 <- aggregate(
  count ~ area + varclass,
  df1,
  sum
)
That will give you your data in "long" format:
df2
area varclass count
1 A Z1 101
2 B Z1 7
3 C Z1 345
4 A Z2 112
5 C Z2 9
Now you need to reshape it to the "wide" format using something like:
df2 <- xtabs(count ~ area + varclass, df2)
varclass
area Z1 Z2
A 101 112
B 7 0
C 345 9
or:
df2 <- reshape(df2, idvar = "area", timevar = "varclass", direction = "wide")
area count.Z1 count.Z2
1 A 101 112
2 B 7 NA
3 C 345 9
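Note that xtabs() returns a contingency table rather than a data frame; if you need a data frame, one possible conversion from the long-format df2 (a sketch):
df2 <- as.data.frame.matrix(xtabs(count ~ area + varclass, df2))
df2$area <- rownames(df2)  # restore area as a regular column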

Just subset based on whichever Z* you want counted in your final sum.
df1Z1 <- df1[df1$varclass %in% c("Z1"), ]
aggregate(count ~ area, data = df1Z1, FUN = sum)
area count
1 A 101
2 B 7
3 C 345
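The same idea extends to every varclass in one loop, writing each sum into df2 (a sketch; the "_sum" column names are assumptions):
for (v in unique(df1$varclass)) {
  s <- aggregate(count ~ area, df1[df1$varclass == v, ], sum)
  # match() aligns the sums with df2$area; areas with no rows get NA
  df2[[paste0(v, "_sum")]] <- s$count[match(df2$area, s$area)]
}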

You can get your desired result using dplyr:
library(dplyr)
df2 <- df1 %>%
  group_by(area) %>%
  filter(varclass == "Z1") %>%
  summarize(Z1_sum = sum(count))
df2
#> # A tibble: 3 x 2
#> area Z1_sum
#> <chr> <dbl>
#> 1 A 101
#> 2 B 7
#> 3 C 345
The dplyr verbs should be pretty self-explanatory, and %>% is the pipe operator, taking the output of one function and making it the first input to the next. group_by here groups by the column area, so when we calculate the sum (in summarize) it's the sum for each area group. The filter subsets the data.
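To illustrate the pipe, these two calls are equivalent (total is just an illustrative name):
summarize(group_by(df1, area), total = sum(count))
df1 %>% group_by(area) %>% summarize(total = sum(count))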

Related

How do I merge data frames (and duplicate values) by the nearest date for each individual ID?

I have two data frames that I am trying to join by date (grouped by individual).
I have made example data frames of both (the real df1 is 5700 rows, and the real df2 is 287 rows).
df1 has IDs (including some not in df2), dates, and behavior values.
df2 has IDs (though fewer than df1), dates (fewer than those in df1), and hormone values.
My goal is to match the hormones for a given individual from the nearest date in df2 to the nearest date in df1 (matching as closely as possible but only duplicating the values of hormones from df2 in df1 when the nearest dates are less than or equal to 2 days apart).
I would like to have the hormones that don't match a behavioral observation printed at the bottom of the new data frame with their date, so that they aren't lost (example in df3).
df1
ID Date behavior
a 1-12-2020 0
b 1-12-2020 1
b 1-13-2020 1
c 1-12-2020 2
d 1-12-2020 0
c 1-13-2020 1
c 1-14-2020 0
c 1-15-2020 1
c 1-16-2020 2
df2
ID Date hormone
a 1-10-2020 20
b 1-18-2019 70
c 1-10-2020 80
c 1-16-2020 90
#goal dataframe
df3
ID Date behavior hormone
a 1-12-2020 0 20
b 1-12-2020 1 NA [> 2 days from hormone]
b 1-13-2020 1 NA [> 2 days from hormone]
c 1-12-2020 2 80
d 1-12-2020 0 NA [no matching individual in df2]
c 1-13-2020 1 NA [> 2 days from hormone]
c 1-14-2020 0 90
c 1-15-2020 1 90
c 1-16-2020 2 90
b 1-18-2019 NA 70 [unmatched hormone at bottom of df3]
Here is the code to create these data frames:
df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c", "c", "c"),
                  date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020", "1-13-2020", "1-14-2020", "1-15-2020", "1-16-2020"),
                  behavior = c(0, 1, 1, 2, 0, 1, 0, 1, 2))
df2 <- data.frame(ID = c("a", "b", "c", "c"),
                  date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
                  hormone = c(20, 70, 80, 90))
df1$date <- as.factor(df1$date)
df1$date <- strptime(df1$date, format = "%m-%d-%Y")
# for nearest date function to work
df1$date <- as.Date(df1$date, "%m/%d/%y")
df2$date <- as.factor(df2$date)
df2$date <- strptime(df2$date, format = "%m-%d-%Y")
# for nearest date function to work
df2$date <- as.Date(df2$date, "%m/%d/%y")
I have been able to use a function from a previous question on the forum (link and code below) to match the nearest dates and duplicate to fill, but am not able to limit the time frame of matches, or print unmatched dates in new rows. Is there a way to do this?
This is what I started working from (code below):
How to match by nearest date from two data frames?
# Needs purrr (for map_int) and dplyr (for mutate and left_join)
library(purrr)
library(dplyr)
# Function to get the index specifying closest or after
Ind_closest_or_after <- function(d1, d2){
  which.min(ifelse(d1 - d2 < 0, Inf, d1 - d2))
}
# Calculate the indices
closest_or_after_ind <- map_int(.x = df1$date, .f = Ind_closest_or_after, d2 = df2$date)
# Add index columns to the data frames and join
df2 <- df2 %>%
  mutate(ind = 1:nrow(df2))
df1 <- df1 %>%
  mutate(ind = closest_or_after_ind)
df3 <- left_join(df2, df1, by = 'ind')
This answer seems the closest but doesn't limit the values:
Merge two data frames by nearest date and ID
#function to do all but limit dates and print unmatched
library(data.table)
setDT(df2)[, date := date]
df2[df1, on = .(ID, date = date), roll = 'nearest']
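One way to add the 2-day limit on top of that rolling join (a sketch; hormone_date is a helper column introduced here, and the date columns must already be Date class as above):
library(data.table)
setDT(df1); setDT(df2)
df2[, hormone_date := date]  # keep df2's own date through the join
res <- df2[df1, on = .(ID, date), roll = "nearest"]
# blank out matches more than 2 days apart
res[abs(as.numeric(date - hormone_date)) > 2, hormone := NA_real_]
Appending the unmatched df2 rows at the bottom would still need a separate step (e.g. an anti-join followed by rbind(..., fill = TRUE)).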
You can join the tables by filtering all possible combinations (cross product using expand_grid):
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c", "c", "c"),
                  date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020", "1-13-2020", "1-14-2020", "1-15-2020", "1-16-2020"),
                  behavior = c(0, 1, 1, 2, 0, 1, 0, 1, 2))
df2 <- data.frame(ID = c("a", "b", "c", "c"),
                  date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
                  hormone = c(20, 70, 80, 90))
joined <-
  df1 %>%
  rename_all(~ paste0(., ".1")) %>%
  expand_grid(df2 %>% rename_all(~ paste0(., ".2"))) %>%
  mutate(across(starts_with("date"), ~ .x %>% parse_date(format = "%m-%d-%Y"))) %>%
  mutate(time_diff = abs(date.1 - date.2)) %>%
  filter(time_diff <= days(2) & ID.1 == ID.2) %>%
  select(ID = ID.1, behavior = behavior.1, hormone = hormone.2)
joined
#> # A tibble: 5 x 3
#> ID behavior hormone
#> <chr> <dbl> <dbl>
#> 1 a 0 20
#> 2 c 2 80
#> 3 c 0 90
#> 4 c 1 90
#> 5 c 2 90
df1 %>%
  left_join(joined) %>%
  full_join(df2) %>%
  as_tibble() %>%
  distinct(ID, behavior, .keep_all = TRUE) %>%
  arrange(ID, behavior)
#> Joining, by = c("ID", "behavior")
#> Joining, by = c("ID", "date", "hormone")
#> # A tibble: 9 x 4
#> ID date behavior hormone
#> <chr> <chr> <dbl> <dbl>
#> 1 a 1-12-2020 0 20
#> 2 a 1-10-2020 NA 20
#> 3 b 1-12-2020 1 NA
#> 4 b 1-18-2019 NA 70
#> 5 c 1-14-2020 0 90
#> 6 c 1-13-2020 1 90
#> 7 c 1-12-2020 2 80
#> 8 c 1-10-2020 NA 80
#> 9 d 1-12-2020 0 NA
Created on 2022-02-18 by the reprex package (v2.0.0)
This results in one row for each (ID, behavior) pair. You can replace the distinct() columns, e.g. with ID, date, to keep only one row per ID at any given time point.
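For example, the per-(ID, date) variant would look like this (a sketch):
df1 %>%
  left_join(joined) %>%
  full_join(df2) %>%
  as_tibble() %>%
  distinct(ID, date, .keep_all = TRUE) %>%
  arrange(ID, date)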

Find unique entries in otherwise identical rows

I am currently trying to find a way to identify unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview over all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
  # with duplicates removed
  distinct() %>%
  # keeping only rows whose ID value is duplicated
  janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that are different in some columns (rows where ID = 1 and 3, and columns x2, x3 and x5).
What I want is an overview of which column entries are not duplicated for each ID:
df %>%
  distinct() %>%
  janitor::get_dupes(ID) %>%
  # Here I want a way to find columns with unidentical entries:
  find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
A data.table alternative: coerce the data frame to a data.table (setDT), then melt the data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[ , if(uniqueN(value) == .N) .SD, by = .(ID, variable)], ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
A bit simpler than yours, I think:
library(dplyr)
library(janitor)
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = c("a", "a", "b", "b", "c", "d"),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2),
  "x5" = c("x", "p", "y", "y", "z", "q"),
  "x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
d %>%
  group_by(ID) %>%
  # Check for each ID which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y)) %>%
  do.call(what = cbind) %>% # Bind results for all IDs
  apply(1, any) %>% # TRUE if there are differences anywhere
  c(T, .) %>% # Keep ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit: a type-safe version that compares elements with identical() instead of !=:
d %>%
  group_by(ID) %>%
  # Check for each ID which row elements differ from those of the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
  do.call(what = cbind) %>% # Bind results for all IDs
  apply(1, any) %>% # TRUE if there are differences anywhere
  c(T, .) %>% # Keep ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it took more steps than I would have thought necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
# Make vector of unique values from the duplicated ID values
l <- distinct(df, ID) %>% unlist()
# Lapply on each ID
df <- lapply(
  l,
  function(x) {
    # Filter rows for the duplicated ID
    dplyr::filter(df, ID == x) %>%
      # Transpose dataframe (converts it into a matrix)
      t() %>%
      # Convert back to data frame
      as.data.frame() %>%
      # Filter columns that are not identical
      dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
      # Transpose back
      t() %>%
      # Convert back to data frame
      as.data.frame()
  }
) %>%
  # Bind the dataframes in the list together
  bind_rows() %>%
  # Finally the columns are moved back in ascending order
  relocate(x2, .before = x3)
# Remove row names (not necessary)
row.names(df) <- NULL
df
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
  "ID" = rep(1:3, each = 2),
  "x1" = rep(4:6, each = 2),
  "x2" = rep(letters[1:3], each = 2),
  "x3" = c(7, 10, 8, 8, 9, 11),
  "x4" = rep(letters[4:6], each = 2)
)
df %>%
  distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f

How can I extract a subset of data based on another data frame and grab observations before and after that subset

I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub, so that the resulting data frame is df_sub plus the observations that occur immediately before and after it.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
                 var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
                     var2 = c(2, 45))
They look like
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
2 x 1
3 x 2
4 y 45
5 z 56
I was thinking of using an inner_join or something similar
We could use match to get the indices, add or subtract 1 from those indices, take the unique values, and subset the rows:
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)))
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))), ]
Output:
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
Or create a 'flag' column in df_sub, do a left_join, and then filter based on the lead/lag values of 'flag':
library(dplyr)
df %>%
  left_join(df_sub %>%
              mutate(flag = TRUE)) %>%
  filter(flag | lag(flag) | lead(flag)) %>%
  select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number column to keep track of the rows selected by the join, then subset the data to include the minimum row number - 1 and the maximum row number + 1.
library(dplyr)
tmp <- df %>%
  mutate(row = row_number()) %>%
  inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56
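If the last matched row were already the final row of df, max(tmp$row) + 1 would index past the end and return an NA row; clamping the indices avoids that (a sketch under the same setup):
rows <- unique(c(max(min(tmp$row) - 1, 1), tmp$row, min(max(tmp$row) + 1, nrow(df))))
df[rows, ]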

R using combn with apply

I have a data frame that has percentage values for a number of variables and observations, as follows:
obs <- data.frame(Site = c("A", "B", "C"),
                  X = c(11, 22, 33),
                  Y = c(44, 55, 66),
                  Z = c(77, 88, 99))
I need to prepare this data as an edge list for network analysis, with "Site" as the nodes and the remaining variables as the edges. The result should look like this:
Node1 Node2 Weight Type
A B 33 X
A C 44 X
...
B C 187 Z
So for "Weight" we calculate the sum over all possible pairs of sites, separately for each column (whose name ends up in "Type").
I suppose the answer has to involve apply on a combn expression, as in Applying combn() function to data frame, but I haven't quite been able to work it out.
I can do this all by hand, taking the combinations for "Site":
sites <- combn(obs$Site, 2)
Then the individual columns like so:
combX <- combn(obs$X, 2, function(x) sum(x))
and binding those datasets together, but this obviously becomes annoying very soon.
I have tried to do all the variable columns in one go like this:
b <- apply(newdf[, -1], 1, function(x) {
  sum(utils::combn(x, 2))
})
but there is something wrong with that.
Can anyone help, please?
One option would be to create a function and then map that function to all the columns that you have.
library(dplyr)
library(purrr)
func1 <- function(var) {
  obs %>%
    transmute(Node1 = combn(Site, 2)[1, ],
              Node2 = combn(Site, 2)[2, ],
              Weight = combn(!!sym(var), 2, function(x) sum(x)),
              Type = var)
}
map(colnames(obs)[-1], func1) %>% bind_rows()
Here is an example using combn:
do.call(
  rbind,
  combn(1:nrow(obs),
        2,
        FUN = function(k) cbind(data.frame(t(obs[k, 1])), stack(data.frame(as.list(colSums(obs[k, -1]))))),
        simplify = FALSE
  )
)
which gives
X1 X2 values ind
1 A B 33 X
2 A B 99 Y
3 A B 165 Z
4 A C 44 X
5 A C 110 Y
6 A C 176 Z
7 B C 55 X
8 B C 121 Y
9 B C 187 Z
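To get the requested column names, the result can be renamed (a sketch; res is assumed to hold the output above):
names(res) <- c("Node1", "Node2", "Weight", "Type")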
Try it this way:
library(tidyverse)
obs_long <- obs %>% pivot_longer(-Site, names_to = "type")
sites <- combn(obs$Site, 2) %>% t() %>% as_tibble()
Type <- tibble(type = c("X", "Y", "Z"))
merge(sites, Type) %>%
  left_join(obs_long, by = c("V1" = "Site", "type" = "type")) %>%
  left_join(obs_long, by = c("V2" = "Site", "type" = "type")) %>%
  mutate(res = value.x + value.y) %>%
  select(-c(value.x, value.y))
V1 V2 type res
1 A B X 33
2 A C X 44
3 B C X 55
4 A B Y 99
5 A C Y 110
6 B C Y 121
7 A B Z 165
8 A C Z 176
9 B C Z 187

Merging rows while defining different parameters for each row being combined in R

I have a dataframe with a different parameter in each column. I'd like to merge groups of rows, applying a different aggregation rule to each parameter.
Here is my sample data ZZ:
ZZ <- data.frame(Name = c("A", "B", "C", "D", "E", "F"),
                 A1 = c(19, 20, 21, 23, 45, 67),
                 A2 = c(1, 2, 3, 4, 5, 6),
                 A3 = c(7, 8, 13, 24, 88, 90),
                 x = c(4, 5, 6, 8, 23, 16),
                 y = c(-3, -7, -6, -9, 3, 2))
> ZZ
Name A1 A2 A3 x y
1 A 19 1 7 4 -3
2 B 20 2 8 5 -7
3 C 21 3 13 6 -6
4 D 23 4 24 8 -9
5 E 45 5 88 23 3
6 F 67 6 90 16 2
I want to aggregate rows A, B, C and D, E, F such that a new name is defined for each group (e.g. C1 and C2), A1, A2 and A3 are combined by sum, while x and y use the mean.
How can this be done please? The result should be:
> ZZ2
Name A1 A2 A3 x y
1 C1 60 6 28 5.000 -5.333
2 C2 135 15 202 15.667 -1.333
Based on how I interpreted your question, I believe this should give you what you want using dplyr:
library(dplyr)
result <- ZZ %>%
  mutate(Name = ifelse(Name %in% c("A", "B", "C"), "C1", "C2")) %>%
  group_by(Name) %>%
  summarise(A1 = sum(A1), A2 = sum(A2), A3 = sum(A3), x = mean(x), y = mean(y)) %>%
  ungroup()
Depending on how many rows you have with different names, there might be better alternatives for mutating the Name variable into the groups.
EDIT: Example if 4 cases exist
result <- ZZ %>%
  mutate(Name = case_when(Name %in% c("A", "B", "C") ~ "C1",
                          Name %in% c("D", "E") ~ "C2",
                          Name %in% c("F", "G") ~ "C3",
                          Name %in% c("H", "I") ~ "C4")) %>%
  group_by(Name) %>%
  summarise(A1 = sum(A1), A2 = sum(A2), A3 = sum(A3), x = mean(x), y = mean(y)) %>%
  ungroup()
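If there are many A* columns, across() (dplyr >= 1.0) avoids spelling each one out; a compact sketch of the same summary for the two-group case:
ZZ %>%
  mutate(Name = ifelse(Name %in% c("A", "B", "C"), "C1", "C2")) %>%
  group_by(Name) %>%
  # sum the A* columns, average x and y, and drop the grouping in one call
  summarise(across(A1:A3, sum), across(c(x, y), mean), .groups = "drop")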
