How to remove duplicates based on two columns with a condition? - r

I'd like to remove some duplicates but not all of them. I'll explain after showing the data I'm working with.
Here is a sample of my dataframe:
df <- data.frame("S" = c("A", "B", "C", "D", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "004", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B", "B"),
"Q" = c(1, 2, 3, 4, 5, 6),
"U" = c(rep("A", 6)),
"P" = c(2, 3, 4, 4, 7, 7),
stringsAsFactors = FALSE)
And now some code I'm applying to this dataframe:
library(dplyr)
df$P <- round(as.double(df$P), digits = 2)
df <- df[order(df$R, df$P),]
df <- df %>%
group_by(R) %>%
mutate(price = P - min(P)) %>%
ungroup()
df$Ecart <- df$price * as.double(df$Q)
df <- df %>%
group_by(R) %>%
mutate(EcartTotal = cumsum(Ecart)) %>%
ungroup()
The result I'm expecting :
result <- data.frame("S" = c("A", "B", "C", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B"),
"Q" = c(1, 2, 3, 5, 6),
"U" = c(rep("A", 5)),
"P" = c(2, 3, 4, 7, 7),
"price" = c(0, 1, 0, 3, 3),
"Ecart" = c(0, 2, 0, 15, 18),
"EcartTotal" = c(NA, 2, NA, NA, 33),
stringsAsFactors = FALSE)
To obtain this, I'd like to remove the duplicates in column R, but only those whose price is equal to 0.
I'd also like to replace each EcartTotal value with NA unless it equals the maximum value for its R group.

We can filter on that condition, then group by 'R' and replace the 'EcartTotal' values that are not the group maximum with NA:
library(dplyr)
df %>%
filter(!(duplicated(R) & price == 0)) %>%
group_by(R) %>%
mutate(EcartTotal = replace(EcartTotal, EcartTotal != max(EcartTotal), NA))
# A tibble: 5 x 12
# Groups: R [2]
# S D N R RF Des Q U P price Ecart EcartTotal
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 01/01/2019 001 ABC1 ABC1F A 1 A 2 0 0 NA
#2 B 01/02/2019 002 ABC1 ABC1F A 2 A 3 1 2 2
#3 C 01/03/2019 003 ABC2 ABC2F B 3 A 4 0 0 NA
#4 E 01/05/2019 005 ABC2 ABC2F B 5 A 7 3 15 NA
#5 F 01/06/2019 006 ABC2 ABC2F B 6 A 7 3 18 33
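Or apply the filter after the group_by step. Here NA^(condition) evaluates to NA where the condition is TRUE and to 1 where it is FALSE (NA^0 is 1 in R), so the multiplication keeps only the group maximum: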
df %>%
group_by(R) %>%
filter(!(row_number() > 1 & price == 0)) %>%
mutate(EcartTotal = EcartTotal * NA^(EcartTotal != max(EcartTotal)))
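For comparison, the same two rules can be sketched in base R. This is only a rough cross-check, assuming a reduced copy of the question's data (just the columns the rules depend on) and using ave() for the per-group computations; the names res and is_max are made up for this example.

```r
# Reduced copy of the question's data: only the columns the rules need
df <- data.frame(
  S = c("A", "B", "C", "D", "E", "F"),
  R = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2", "ABC2"),
  Q = c(1, 2, 3, 4, 5, 6),
  P = c(2, 3, 4, 4, 7, 7),
  stringsAsFactors = FALSE
)

# Same preparation as in the question, written with ave() instead of group_by()
df <- df[order(df$R, df$P), ]
df$price <- df$P - ave(df$P, df$R, FUN = min)
df$Ecart <- df$price * df$Q
df$EcartTotal <- ave(df$Ecart, df$R, FUN = cumsum)

# Rule 1: drop repeated R values whose price is 0 (the first row per R is kept)
res <- df[!(duplicated(df$R) & df$price == 0), ]

# Rule 2: keep EcartTotal only on the row(s) holding the group maximum
is_max <- res$EcartTotal == ave(res$EcartTotal, res$R, FUN = max)
res$EcartTotal[!is_max] <- NA
res
```

With this data only row D is dropped, and EcartTotal survives on rows B and F.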

Related

Assign value to new column based on values in 2 other columns

Here is an example code:
Group <- c("A", "A", "A", "A", "A", "B", "B", "B","B", "B")
Actor <- c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
df <- data.frame(Group,Actor)
df
Now, what I want to do is to create three new columns (Sex, Status, SexStat) based on the data in the Group and Actor columns.
For example, if Group = A and Actor = 1, then Sex = M, Status = Dom, and SexStat = DomM. If Group = A and Actor = 3, then Sex = F, Status = Med, and SexStat = MedF (and so on).
The numbers do not always align with the same rank/sexes in every group, and with 5500 lines of data, I would love it if there was a way to not do this manually! Any help would be much appreciated.
You can create conditions for Sex and Status with case_when() and then paste them together to create SexStat:
library(dplyr)
Group <- c("A", "A", "A", "A", "A", "B", "B", "B","B", "B")
Actor <- c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3)
df <- data.frame(Group,Actor)
df
df %>%
mutate(
Sex = case_when(
Group == "A" & Actor == 1 ~ "M",
Group == "A" & Actor == 3 ~ "F",
TRUE ~ ""
),
Status = case_when(
Group == "A" & Actor == 1 ~ "Dom",
Group == "A" & Actor == 3 ~ "Med",
TRUE ~ ""
),
SexStat = paste0(Status,Sex)
)
Group Actor Sex Status SexStat
1 A 1 M Dom DomM
2 A 3 F Med MedF
3 A 6
4 A 4
5 A 1 M Dom DomM
6 B 2
7 B 2
8 B 6
9 B 4
10 B 3
We can also do this by joining with a key/value dataset:
library(dplyr)
library(tidyr)
library(stringr)
keydat <- tibble(Group = "A", Actor = c(1, 3), Sex = c("M", "F"), Status = c("Dom", "Med"))
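df %>%
  # name the join keys explicitly to avoid the "Joining, by = ..." message
  left_join(keydat, by = c("Group", "Actor")) %>%
  # rows without a match in keydat get an empty string instead of NA
  mutate(across(c(Sex, Status), ~ replace_na(.x, "")),
         SexStat = str_c(Status, Sex))
Output: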
Group Actor Sex Status SexStat
1 A 1 M Dom DomM
2 A 3 F Med MedF
3 A 6
4 A 4
5 A 1 M Dom DomM
6 B 2
7 B 2
8 B 6
9 B 4
10 B 3
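The lookup idea can also be sketched without any packages. This is a minimal sketch, assuming the key table holds only the two combinations given in the question (the real table would need one row per Group/Actor pair); the names key and idx are illustrative.

```r
df <- data.frame(
  Group = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Actor = c(1, 3, 6, 4, 1, 2, 2, 6, 4, 3),
  stringsAsFactors = FALSE
)

# Key table: one row per known (Group, Actor) combination
key <- data.frame(
  Group = "A", Actor = c(1, 3),
  Sex = c("M", "F"), Status = c("Dom", "Med"),
  stringsAsFactors = FALSE
)

# Position of each df row in the key table (NA when the combination is unknown)
idx <- match(paste(df$Group, df$Actor), paste(key$Group, key$Actor))

df$Sex     <- ifelse(is.na(idx), "", key$Sex[idx])
df$Status  <- ifelse(is.na(idx), "", key$Status[idx])
df$SexStat <- paste0(df$Status, df$Sex)
df
```

With 5500 rows, the key table stays small because it only lists each distinct combination once.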

Get most recent observation for variables asked at different time points

Someone asked this already in a simpler version here, but I cannot quite get it to work for my case.
I have observational data on a number of individuals across multiple years for a set of questions, but not everyone is asked every question every year. I want to generate a new dataframe that has the most recent answer for each individual.
The data looks like this:
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"), time = c(1:4), questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"), questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA))
The resulting dataframe for this example should look like this:
most_recent <- data.frame(individual = c("A", "B", "C"), questionA = c("No", "Yes", "No"), questionB = c(5, 4, 5))
Ideally I am looking for a dplyr solution. Thank you!
We can use dplyr's across() for this:
df %>%
group_by(individual) %>%
summarize(across(starts_with("question"), ~ last(na.omit(.))))
# # A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
# 1 A No 5
# 2 B Yes 4
# 3 C No 5
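If it helps to see the logic spelled out, here is a rough base R sketch of the same "last non-missing value per individual" idea. The helper last_non_na is a made-up name for this example, and the sketch assumes rows can be ordered by time within each individual.

```r
df <- data.frame(
  individual = rep(c("A", "B", "C"), each = 4),
  time = rep(1:4, times = 3),
  questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"),
  questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA),
  stringsAsFactors = FALSE
)

# Last non-NA element of x, or a typed NA if every element is missing
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[length(keep)] else x[NA_integer_]
}

# Make sure "last" really means "most recent"
df <- df[order(df$individual, df$time), ]
ids <- unique(df$individual)

most_recent <- data.frame(
  individual = ids,
  questionA = vapply(ids, function(i) last_non_na(df$questionA[df$individual == i]),
                     character(1), USE.NAMES = FALSE),
  questionB = vapply(ids, function(i) last_non_na(df$questionB[df$individual == i]),
                     numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
most_recent
```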
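My take in base R: it keeps each person's row at their most recent time. Note that the sample data below differs from the question's, so that the last row of each individual has no NA; the code returns that row as-is and does not look back for earlier answers.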
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
time = c(1:4),
questionA = c("Yes", NA, NA, "No", "No", NA, NA, "Yes", "No", NA, NA, "No"),
questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 3, 5), stringsAsFactors = FALSE)
#new column to use with %in%
df$match <- paste(df$individual, df$time)
#find the most recent sample for each individual
id <- unique(df$individual)
most_recent <- sapply(id, function(id){
time <- max(df$time[df$individual == id])
return(paste(id,time))
})
#filter df by most recent
final <- df[df$match %in% most_recent,]
final
individual time questionA questionB match
4 A 4 No 5 A 4
8 B 4 Yes 4 B 4
12 C 4 No 5 C 4
We could use slice_tail after filling the NA values in the 'question' columns with the previous non-NA value, with the data ordered by 'time' and grouped by 'individual':
library(dplyr)
library(tidyr)
df %>%
arrange(individual, time) %>%
select(-time) %>%
group_by(individual) %>%
fill(starts_with('question')) %>%
slice_tail(n = 1) %>%
ungroup()
Output:
# A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
#1 A No 5
#2 B Yes 4
#3 C No 5

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
This can also be a strategy (it works even if the number of conditions differs between groups):
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df used:
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the above code:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond afterwards if desired.
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
# look up each drug's result at the reference condition (cond == 0)
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data :
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")

2-group heterogeneity index

I have a dataset with two distinct groups (A and B) belonging to 3 different categories (1, 2, 3):
library(tidyverse)
set.seed(100)
df <- tibble(Group = sample(c(1, 2, 3), 20, replace = TRUE), Company = sample(c('A', 'B'), 20, replace = TRUE))
I want to come up with a metric that characterizes group composition across the timespan.
Thus far, I have used an index based on Shannon's index, which gives a measure of heterogeneity between 0 and 1, with 1 being perfectly heterogeneous (equal representation of each group) and 0 being completely homogeneous (only one group is represented):
df %>%
group_by(Group, Company) %>%
summarise(n=n()) %>%
mutate(p = n / sum(n)) %>%
mutate(Shannon = -(p*log2(p) + (1-p)))
Yielding:
Group Company n p Shannon
<dbl> <chr> <int> <dbl> <dbl>
1 A 2 0.6666667 0.05664167
1 B 1 0.3333333 -0.13834583
2 A 4 0.5000000 0.00000000
2 B 4 0.5000000 0.00000000
3 A 1 0.1111111 -0.53667500
3 B 8 0.8888889 0.03993333
However, I am looking for an index in [-1, +1] that yields -1 when only group A is present at a time point, +1 when only group B is present, and 0 when both are equally represented.
How can I create such an index? I have looked at measures such as Moran's I as inspiration, but they do not seem to suit the need.
A simple solution might be to calculate a mean.
I recoded Company as a numeric value, with A = -1 and B = 1, and calculated its mean by Group.
The result is one index per Group: -1 when Company contains only "A"s, 1 when it contains only "B"s, and 0 when both are equally represented.
Data
df <- structure(list(Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1,
2, 2, 3, 2, 2, 1, 1, 3), Company = c("A", "A", "A", "A", "B",
"B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "A", "B", "B",
"A", "B")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Code
df %>%
mutate(value = ifelse(Company == "A", -1, 1)) %>%
group_by(Group) %>%
summarise(index = mean(value))
Output
# A tibble: 3 x 2
Group index
<dbl> <dbl>
1 1 0.333
2 2 -0.429
3 3 0.429
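The mean of the -1/+1 coding is the same thing as (n_B - n_A) / (n_A + n_B), i.e. the share of B minus the share of A. A small base R sketch, assuming the same data as above (the names counts and index are only illustrative), makes that explicit:

```r
df <- data.frame(
  Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1, 2, 2, 3, 2, 2, 1, 1, 3),
  Company = c("A", "A", "A", "A", "B", "B", "B", "B", "A", "B",
              "B", "B", "A", "A", "B", "A", "B", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Rows are Groups, columns are Companies
counts <- table(df$Group, df$Company)

# (n_B - n_A) / (n_A + n_B): -1 if only A, +1 if only B, 0 if balanced
index <- (counts[, "B"] - counts[, "A"]) / rowSums(counts)
index
```

Written this way, the bounds are easy to read off: the numerator can never exceed the group size in absolute value.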

Insert specified values in R grouped df and fill up missing values using another df (R)

I have 2 dfs : df & xdf.
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
In df, for each group in the "id" column (a and b), I would like to insert only the row of xdf with the matching id. How can I do this? I have tried the following commands, but all of the values of xdf$x are inserted into each group.
ndf <- df %>%
group_by(id) %>%
do(add_row(., id = .$id[1], x = xdf$x))
> ndf
# A tibble: 12 x 3
# Groups: id [2]
id x y
<chr> <dbl> <dbl>
1 a 1 0.2
2 a 2 0
3 a 3 0.9
4 a 4 7
5 a 2 NA
6 a 3.5 NA
7 b 1 1
8 b 2 0.3
9 b 3 5
10 b 4 5.1
11 b 2 NA
12 b 3.5 NA
# expected result should be : ndf <- ndf[c(-6,-11),]
My end goal is to fill these newly created NAs in ndf with the approx() function. But the issue remains, because I'm using xout = xdf$x, which brings in the superfluous values. How can I overcome this? Can you help me write a function that makes xout vary by group?
f <- function(z)
{
fdf <- approx(z$x, z$y, xout = xdf$x, method = "linear")
return(data.frame(nx= fdf$x, y.out = fdf$y, id = unique(z$id)))
}
# ddply() and .() come from the plyr package
jdf <- as.data.frame(plyr::ddply(ndf, plyr::.(id), f))
zdf <- subset(jdf, select = c(id, nx, y.out))
> zdf
id nx y.out
1 a 2.0 0.00
2 a 3.5 3.95
3 b 2.0 0.30
4 b 3.5 5.05
# expected results
id nx y.out
1 a 2.0 0.00
2 b 3.5 5.05
Any helpful tips are welcome. Many thanks!
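One option is to append xdf to df, interpolate within each id, and keep only the appended rows (the ones where y is NA):
library(dplyr)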
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
ndf <- df %>%
bind_rows(xdf) %>%
arrange(id)
zdf <- ndf %>%
group_by(id) %>%
group_modify(~mutate(., y_approx = approx(.$x, .$y, .$x, method = "linear")[["y"]])) %>%
ungroup() %>%
filter(is.na(y)) %>%
select(id, y_approx)
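If the appended rows are not needed for anything else, the interpolation can also be done directly, one id at a time. This is a minimal base R sketch, assuming xdf has exactly one row per id; the column name y.out is borrowed from the question.

```r
df <- data.frame(
  id = rep(c("a", "b"), each = 4),
  x = rep(1:4, times = 2),
  y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1),
  stringsAsFactors = FALSE
)
xdf <- data.frame(id = c("a", "b"), x = c(2, 3.5), stringsAsFactors = FALSE)

# For each row of xdf, interpolate y at that row's x using only the matching id
xdf$y.out <- mapply(
  function(id, xout) {
    g <- df[df$id == id, ]
    approx(g$x, g$y, xout = xout, method = "linear")$y
  },
  xdf$id, xdf$x,
  USE.NAMES = FALSE
)
xdf
```

Here xout changes with each id because mapply() walks over the rows of xdf in parallel, which is what the original f() could not do with a fixed xout = xdf$x.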
