pivot_wider introduces NAs - r

I am doing data management for a project and I am running into difficulties with what I thought would be a basic reshape from long format to wide.
The data looks something like this:
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
Time = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 1, 1, 2, 2),
Type = c("A", "B", "C", "D", "A", "B","C", "D", "A", "A", "B", "C", "D", "A", "B"),
Value = c(100, NA, 40, 123, 95, NA, 45, 1234, 100, 70, NA, 50, 12345, 75, NA)),
row.names = c(NA, 15L), class = "data.frame")
Based on previous Stack Overflow answers, I am trying to use pivot_wider like this:
library(dplyr)
library(tidyr)
df.wide <- df %>%
  group_by(ID, Type) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Type, values_from = Value)
However, this returns a data frame with NA values at max(Time) for each ID, which looks like this:
# A tibble: 5 x 7
ID Time row A B C D
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 100 NA 40 123
2 1 2 2 95 NA 45 1234
3 1 3 3 100 NA NA NA
4 2 1 1 70 NA 50 12345
5 2 2 2 75 NA NA NA
What am I doing wrong? My Google and Stack Overflow-fu have not been able to help me.
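One possible reading, sketched here rather than taken from the original thread: the NAs in C and D are not created by pivot_wider itself. They mark ID/Time/Type combinations that have no row in the long data (ID 1 has only Type A at Time 3, and ID 2 has only A and B at Time 2), while B is already NA in every input row. The snippet below lists the absent combinations; missing_rows is a name made up for this illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  Time = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 1, 1, 2, 2),
  Type = c("A", "B", "C", "D", "A", "B", "C", "D", "A",
           "A", "B", "C", "D", "A", "B"),
  Value = c(100, NA, 40, 123, 95, NA, 45, 1234, 100,
            70, NA, 50, 12345, 75, NA)
)

# Build every Type for each ID/Time pair that occurs in the data,
# then keep only the combinations that have no row in df.
missing_rows <- df %>%
  tidyr::expand(tidyr::nesting(ID, Time), Type) %>%
  anti_join(df, by = c("ID", "Time", "Type"))

missing_rows
```

If this reading is right, the row helper column is not needed either, since ID and Time already identify each output row.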

Related

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
A self-join is another strategy (it works even if the number of conditions differs between groups):
library(dplyr)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df used for this example:
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the code above:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond afterwards if desired.
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
# look up each drug's reference (cond == 0) result by drug name, not by position
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data:
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
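A grouped lookup is one more way to write the same idea. This is only a sketch, and it assumes every drug has exactly one cond == 0 row; with zero or several reference rows per drug it would need a different rule.

```r
library(dplyr)

df <- data.frame(
  drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
  result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)

df_wider0 <- df %>%
  group_by(drug) %>%
  # within each drug, copy the single reference result to every row
  mutate(result0 = result[cond == 0]) %>%
  ungroup() %>%
  # drop the reference rows themselves
  filter(cond != 0)

df_wider0
```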

2-group heterogeneity index

I have a dataset with two distinct groups (A and B) belonging to 3 different categories (1, 2, 3):
library(tidyverse)
set.seed(100)
df <- tibble(
  Group = sample(c(1, 2, 3), 20, replace = TRUE),
  Company = sample(c("A", "B"), 20, replace = TRUE)
)
I want to come up with a metric that characterizes group composition across the timespan.
Thus far, I have used an index based on Shannon's index, which gives a measure of heterogeneity between 0 and 1, with 1 being perfectly heterogeneous (equal representation of each group) and 0 being completely homogeneous (only one group is represented):
df %>%
group_by(Group, Company) %>%
summarise(n=n()) %>%
mutate(p = n / sum(n)) %>%
mutate(Shannon = -(p*log2(p) + (1-p)))
Yielding:
Group Company n p Shannon
<dbl> <chr> <int> <dbl> <dbl>
1 A 2 0.6666667 0.05664167
1 B 1 0.3333333 -0.13834583
2 A 4 0.5000000 0.00000000
2 B 4 0.5000000 0.00000000
3 A 1 0.1111111 -0.53667500
3 B 8 0.8888889 0.03993333
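For comparison, the usual two-class Shannon entropy has a log2 factor on both terms, which keeps it within [0, 1]. The expression above omits that factor on (1 - p), which would explain the negative values in the table. A minimal sketch (binary_entropy is a hypothetical helper name, and 0 * log2(0) is taken to be 0 by convention):

```r
# Two-class Shannon entropy in bits: -(p*log2(p) + (1-p)*log2(1-p))
binary_entropy <- function(p) {
  # q * log2(q), with the q == 0 case defined as 0
  term <- function(q) ifelse(q > 0, q * log2(q), 0)
  -(term(p) + term(1 - p))
}

binary_entropy(c(0, 0.5, 1))
```

With this definition, p = 0.5 gives 1 (equal representation) and p = 0 or p = 1 gives 0 (only one company present).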
However, I am looking for an index in [-1, +1] that yields -1 when only group A is present at a time point, +1 when only group B is present, and 0 when both are equally represented.
How can I create such an index? I have looked at measures such as Moran's I as inspiration, but they do not seem to suit the need.
A simple solution might be to calculate the mean.
I transformed Company into a numeric value, with A = -1 and B = 1, and calculated the mean by Group.
The result will be an index for each Group, with -1 when Company has just "A"s or 1 when there are just "B"s.
Data
df <- structure(list(Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1,
2, 2, 3, 2, 2, 1, 1, 3), Company = c("A", "A", "A", "A", "B",
"B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "A", "B", "B",
"A", "B")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Code
df %>%
mutate(value = ifelse(Company == "A", -1, 1)) %>%
group_by(Group) %>%
summarise(index = mean(value))
Output
# A tibble: 3 x 2
Group index
<dbl> <dbl>
1 1 0.333
2 2 -0.429
3 3 0.429
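The same index can be written without the intermediate -1/+1 column, since the mean of those values equals 2 * (share of B) - 1, i.e. (n_B - n_A) / (n_A + n_B). A sketch under the assumption that Company only ever takes the values "A" and "B" (index_by_group is a name chosen for this illustration):

```r
library(dplyr)

df <- data.frame(
  Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1, 2, 2, 3, 2, 2, 1, 1, 3),
  Company = c("A", "A", "A", "A", "B", "B", "B", "B", "A", "B",
              "B", "B", "A", "A", "B", "A", "B", "B", "A", "B")
)

index_by_group <- df %>%
  group_by(Group) %>%
  # mean(Company == "B") is the share of B; rescale [0, 1] to [-1, 1]
  summarise(index = 2 * mean(Company == "B") - 1)

index_by_group
```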

How to remove duplicates based on two columns with a condition?

I'd like to remove some duplicates but not all of them. I'll explain after showing the data I'm working with.
Here is a sample of my dataframe:
df <- data.frame("S" = c("A", "B", "C", "D", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "004", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B", "B"),
"Q" = c(1, 2, 3, 4, 5, 6),
"U" = c(rep("A", 6)),
"P" = c(2, 3, 4, 4, 7, 7),
stringsAsFactors = FALSE)
And now some code I'm applying to this dataframe (the pipes need library(dplyr)):
df$P <- round(as.double(df$P), digits = 2)
df <- df[order(df$R, df$P),]
df <- df %>%
group_by(R) %>%
mutate(price = P - min(P)) %>%
ungroup()
df$Ecart <- df$price * as.double(df$Q)
df <- df %>%
group_by(R) %>%
mutate(EcartTotal = cumsum(Ecart)) %>%
ungroup()
The result I'm expecting :
result <- data.frame("S" = c("A", "B", "C", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B"),
"Q" = c(1, 2, 3, 5, 6),
"U" = c(rep("A", 5)),
"P" = c(2, 3, 4, 7, 7),
"price" = c(0, 1, 0, 3, 3),
"Ecart" = c(0, 2, 0, 15, 18),
"EcartTotal" = c(NA, 2, NA, NA, 33),
stringsAsFactors = FALSE)
To obtain this, I'd like to remove the duplicates in column R only if their price is equal to 0.
I'd also like to replace the value of EcartTotal with NA wherever it is not equal to the max value for its R.
We can filter based on the condition and then replace the value of 'EcartTotal' to NA after grouping by 'R'
library(dplyr)
df %>%
filter(!(duplicated(R) & price == 0)) %>%
group_by(R) %>%
mutate(EcartTotal = replace(EcartTotal, EcartTotal != max(EcartTotal), NA))
# A tibble: 5 x 12
# Groups: R [2]
# S D N R RF Des Q U P price Ecart EcartTotal
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 01/01/2019 001 ABC1 ABC1F A 1 A 2 0 0 NA
#2 B 01/02/2019 002 ABC1 ABC1F A 2 A 3 1 2 2
#3 C 01/03/2019 003 ABC2 ABC2F B 3 A 4 0 0 NA
#4 E 01/05/2019 005 ABC2 ABC2F B 5 A 7 3 15 NA
#5 F 01/06/2019 006 ABC2 ABC2F B 6 A 7 3 18 33
Or apply the filter after the group_by step:
df %>%
group_by(R) %>%
filter(!(row_number() > 1 & price == 0)) %>%
mutate(EcartTotal = EcartTotal * NA^(EcartTotal != max(EcartTotal)))
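The NA^(...) expression relies on R returning 1 for NA^0 and NA for NA^1, so multiplying by it keeps a value where the condition is FALSE and blanks it where the condition is TRUE. A small sketch on a made-up vector x, compared with the replace() form used in the first version:

```r
x <- c(0, 15, 33)

# NA^FALSE is 1 (value kept), NA^TRUE is NA (value blanked)
keep_max_a <- x * NA^(x != max(x))

# the more explicit equivalent
keep_max_b <- replace(x, x != max(x), NA)

keep_max_a
keep_max_b
```

Both give NA for every element except the maximum; the replace() form is arguably easier to read.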

How can I fill columns based on values in another column? [duplicate]

I have a large dataframe containing a cross table of keys from other tables. Instead of having multiple instances of key1 coupled with different values for key2 I would like there to be one row for each key1 with several columns instead.
I tried doing this with a for loop but I couldn't get it to work.
Here's an example. I have a data frame with the structure df1 and I would like it to have the structure of df2.
df1 <- data.frame(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9))
names(df1) <- c("key1", "key2")
df2 <- data.frame(c("a", "b", "c", "d"), c(1, 2, 1, 9), c(2, 3, 2, NA), c(3, NA, 3, NA), c(NA, NA, 4, NA), c(NA, NA, 5, NA))
names(df2) <- c("key1", "key2_1", "key2_2", "key2_3", "key2_4", "key2_5")
I suspect this is possible using an approach utilizing apply but I haven't found a way yet. Any help is appreciated!
library(dplyr)
library(tidyr)
df1 %>%
group_by(key1) %>%
mutate(var = paste0("key2_", seq(n()))) %>%
spread(var, key2)
# # A tibble: 4 x 6
# # Groups: key1 [4]
# key1 key2_1 key2_2 key2_3 key2_4 key2_5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 2 3 NA NA
# 2 b 2 3 NA NA NA
# 3 c 1 2 3 4 5
# 4 d 9 NA NA NA NA
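spread() still works but has been superseded in tidyr by pivot_wider(). A sketch of the same reshape with the newer function, assuming tidyr 1.0.0 or later (df2_wide is a name chosen for this illustration):

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  key1 = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),
  key2 = c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9)
)

df2_wide <- df1 %>%
  group_by(key1) %>%
  # number the key2 values within each key1 to build the new column names
  mutate(var = paste0("key2_", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = var, values_from = key2)

df2_wide
```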

Calculate median for multiple columns by group based on subsets defined by other columns

I am trying to calculate the median (but that could be substituted by similar metrics) by group for multiple columns based on subsets defined by other columns. This is a direct follow-on question from a previous post of mine. I have attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by @Frank, but that didn't work. Let me illustrate:
Calculate median for A and B by groups GRP1 and GRP2
df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"), GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"), A = c(0,4,6,7,0,1,9,0,0,8,3,4), B = c(6,0,4,8,6,7,0,9,9,7,3,0))
med <- aggregate(.~GRP1+GRP2,df,FUN=median)
Simple. Now add columns defining which rows are to be used for calculating the median, i.e. rows with NAs should be dropped: column a defines which rows are used for the median of column A, and column b does the same for column B:
a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)
As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.
med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])
This is the result I'm looking for:
GRP1 GRP2 A B
1 A A 4 5
2 B A 9 9
3 A B 4 7
4 B B 4 3
Any help will be much appreciated!
Using data.table
library(data.table)
setDT(df1)
df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]
GRP1 GRP2 A B
1: A A 4 5
2: A B 4 7
3: B A 9 9
4: B B 4 3
Same logic in dplyr
library(dplyr)
df1 %>%
group_by(GRP1, GRP2) %>%
summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))
The original df1:
df1 <- data.frame(
GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)
With dplyr:
library(dplyr)
df1 %>%
mutate(A = ifelse(is.na(a), NA, A),
B = ifelse(is.na(b), NA, B)) %>%
# I use this to put as NA the values we don't want to include
group_by(GRP1, GRP2) %>%
summarise(A = median(A, na.rm = TRUE),
B = median(B, na.rm = TRUE))
# A tibble: 4 x 4
# Groups: GRP1 [?]
GRP1 GRP2 A B
<fct> <fct> <dbl> <dbl>
1 A A 4 5
2 A B 4 7
3 B A 9 9
4 B B 4 3
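If there are many value/flag column pairs, the masking step can be driven by two vectors of column names instead of being written out per column. This is a base R sketch of that idea, not something from the answers above; value_cols, flag_cols and masked are names chosen for this illustration, and the two vectors are assumed to be in matching order.

```r
df1 <- data.frame(
  GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
  A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
  B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
  a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
  b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)

value_cols <- c("A", "B")  # columns to summarise
flag_cols <- c("a", "b")   # matching columns whose NAs mark rows to skip

# Set a value to NA wherever its flag column is NA
masked <- df1
masked[value_cols] <- Map(
  function(v, f) replace(v, is.na(f), NA),
  df1[value_cols], df1[flag_cols]
)

# Median per group, ignoring the masked values
med1 <- aggregate(
  masked[value_cols],
  by = masked[c("GRP1", "GRP2")],
  FUN = median, na.rm = TRUE
)

med1
```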
