gather() per grouped variables in R for specific columns

I have a long data frame of decisions made by players who worked in groups.
I need to convert the data so that each row (individual observation) contains the decisions of all group members (so we can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format within each group, and only for some of the variables (in this example, specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
           player_id=c(1,2),
           player_decision=c(10,20),
           player_1_contribution=c(6,6),
           player_2_contribution=c(5,5),
           player_3_contribution=c(4,4)
)
  group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1        1         1              10                     6                     5                     4
2        1         2              20                     6                     5                     4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!

Here is a solution using tidyr and dplyr.
Make a data frame with one column per player's contribution, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
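# Spread each group's contributions into one column per player.
# With names_from and values_from set aside, group_id is the remaining id column.
wide <- pivot_wider(df, id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
# Attach the group-level columns to every original row.
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide, by = "group_id")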
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1

Related

How many times does the value for column B appear for a value in column A?

I am having the hardest time coming up with code that matches a topic (Col B) to a name (Col A) and creates a frequency column counting how many times each B value has appeared together with each A value. Col A and Col B are codes for longer names.
I thought of using the count function from plyr but can't make it work. Could you give me an idea of what I could use?
For example I have a table:
Col A  Col B
1      38
1      6
1      38
2      38
2      7
2      7
2      8
2      7
The result that I am looking for is
Col A  Col B  freq
1      38     2
1      6      1
2      38     1
2      7      3
2      8      1
So the number 38 has appeared with "1" two times, 6 has appeared one time, and so on.
I have 600 rows of data and can't come up with useful code, or anything close.
Thank you so much for your help!
Summarise and count using dplyr:
library(dplyr)
df2 <- df %>%
  group_by(col1, col2) %>%
  summarise(count = n()) %>%
  ungroup()
returns:
col1 col2 count
<dbl> <dbl> <int>
1 1 6 1
2 1 38 2
3 2 7 3
4 2 8 1
5 2 38 1
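The answer above uses placeholder names (col1, col2) and does not define df. A minimal self-contained sketch with the question's data follows; the column names col_a and col_b are assumptions made for this illustration. It uses dplyr's count(), which is shorthand for the group_by/summarise(n()) pattern, with the name argument producing the freq column the question asks for.

```r
library(dplyr)

# Example data from the question; col_a and col_b are assumed column names
df <- data.frame(
  col_a = c(1, 1, 1, 2, 2, 2, 2, 2),
  col_b = c(38, 6, 38, 38, 7, 7, 8, 7)
)

# count() groups by the given columns and tallies the rows in each group;
# `name` sets the name of the tally column
freq <- df %>% count(col_a, col_b, name = "freq")
freq
```

The result has one row per (col_a, col_b) pair that occurs in the data, so combinations that never occur are simply absent.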

R: Duplicating a subset of row values, based on condition, across a whole dataframe

I have a dataframe df containing count data at different sites, across two days:
day site count
1 A 2
1 B 3
2 A 10
2 B 12
I would like to add a new column day1count that represents the count value at day 1, for each unique site. So, on rows where day==1, count and day1count would be identical. The new df would look like:
day site count day1count
1 A 2 2
1 B 3 3
2 A 10 2
2 B 12 3
So far I've created a new column that has duplicate values for day 1 rows, and NA for everything else:
df$day1count= ifelse(df$day==1, df$count, NA)
day site count day1count
1 A 2 2
1 B 3 3
2 A 10 NA
2 B 12 NA
How can I now replace the NA entries with values corresponding to each unique site from day 1?
I figured it out. It's not very elegant (and I invite others to submit a more efficient approach) but...
Do NOT create the new column with df$day1count= ifelse(df$day==1, df$count, NA) as I did in the original example. Instead, start by making a duplicate of df, but which only contains rows from day 1
tmpdf = df[df$day==1,]
Rename count as day1count, and remove day column
# this rename syntax (named vector of old = new) is from the plyr package
tmpdf = plyr::rename(tmpdf, c("count"="day1count"))
tmpdf$day = NULL
Merge the two dataframes by site
newdf = merge(x=df,y=tmpdf, by="site")
newdf
site day count day1count
1 A 1 2 2
2 A 2 10 2
3 B 1 3 3
4 B 2 12 3
With tidyverse you could do the following:
library(tidyverse)
df %>%
group_by(site) %>%
mutate(day1count = first(count))
Output
# A tibble: 4 x 4
# Groups: site [2]
day site count day1count
<int> <fct> <int> <int>
1 1 A 2 2
2 1 B 3 3
3 2 A 10 2
4 2 B 12 3
Data
df <- read.table(
text =
"day site count
1 A 2
1 B 3
2 A 10
2 B 12", header = T
)
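Note that first(count) relies on the day 1 row coming first within each site. A sketch of a variant that selects the day 1 value explicitly, so row order does not matter, might look like this. It assumes every site has exactly one day 1 row; a site with none would make mutate() fail, and a site with several would need an extra rule.

```r
library(dplyr)

df <- data.frame(
  day   = c(1, 1, 2, 2),
  site  = c("A", "B", "A", "B"),
  count = c(2, 3, 10, 12)
)

# Pick the day-1 count explicitly instead of relying on row order.
# Assumes exactly one day-1 row per site.
result <- df %>%
  group_by(site) %>%
  mutate(day1count = count[day == 1]) %>%
  ungroup()
result
```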

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To do this, I calculate a diff over the timestamps and want to start a new group whenever the diff exceeds a certain value. I tried the code below. However, it does not work, because the dialog variable is not available inside the mutate call that creates it.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
mutate(t_diff = c(NA, diff(time))) %>%
# This generates an error as dialog is not available as a variable at this point
mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group. The groups are split at points where the diff to the previous element is at least 500.
Unfortunately, I have not found a clever workaround to achieve this efficiently using dplyr. Iterating over the data.frame with a loop would obviously work, but would be very inefficient.
Is there a way to achieve this in dplyr?
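One possible approach, sketched here under the assumptions stated in the question (threshold of 500, rows already sorted by time), avoids referring to dialog inside its own definition: flag the rows that start a new group, turn the flags into a running group number with cumsum(), and then take the first id within each group. The helper column names new_group and grp are made up for this illustration.

```r
library(dplyr)

df <- data.frame(
  time = c(1, 2, 3, 4, 5, 510, 511, 512, 513),
  id   = 1:9
)

result <- df %>%
  mutate(
    t_diff = c(NA, diff(time)),
    # TRUE on the first row and wherever the gap reaches the threshold
    new_group = is.na(t_diff) | t_diff >= 500,
    # running count of group starts gives each row a group number
    grp = cumsum(new_group)
  ) %>%
  group_by(grp) %>%
  mutate(dialog = first(id)) %>%  # id of the first row in each group
  ungroup() %>%
  select(time, id, t_diff, dialog)
result
```

Because cumsum() and the grouped first() are both vectorised, this should scale much better than an explicit loop over rows.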

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
# the regex drops the last character, so L1.1 and L1.2 share the prefix "L1."
df_m[, id := .GRP, by = .(gsub("(.*).", "\\1", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 1 5 9 8
This should also be easily expandable to multiple groups of columns. For example, if you also have L3.1 and L3.2 columns, it should work with those as well.

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left join, left_join(df1, df3), you recover the modes not in df3, but sex appears as NA, and the same happens with left_join(df2, df3).
So how can you do both left joins to recover all absent combinations, with cases = 0? dplyr is preferred, but sqldf is an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format:
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then I left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
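One caveat with complete(mode, sex) is that it only expands over values that actually occur in the data, so a mode that never appears in df3 would still be missing. A sketch that combines both answers, summarising first and then completing against explicit value lists, might look like this. The vectors all_modes and all_sexes are hypothetical stand-ins for the values held in df1 and df2.

```r
library(dplyr)
library(tidyr)

df3 <- data.frame(
  mode  = c(1, 1, 2, 3, 1),
  sex   = c(1, 1, 2, 1, 2),
  cases = c(9, 2, 7, 2, 5)
)

# Hypothetical lookup vectors listing every value that should appear
all_modes <- c(1, 2, 3)
all_sexes <- c(1, 2)

result <- df3 %>%
  group_by(mode, sex) %>%
  summarise(cases = sum(cases), .groups = "drop") %>%
  # expand to every mode/sex pair; pairs absent from the data get cases = 0
  complete(mode = all_modes, sex = all_sexes, fill = list(cases = 0)) %>%
  arrange(mode, sex)
result
```

Summarising before completing keeps the filled-in zeros out of the sum, although here the order makes no difference to the totals.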
