Merging rows based on multiple variables [closed] - r

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
Working with a dataset that looks like this:
UserID PartnerID Happiness Result
1 2 30 1
2 1 20 1
As you can see this is repetitive. I'd like to take those two rows above and merge them into a single row. I have searched around but haven't found a solution that would work here. My ideal output would be this:
UserID PartnerID Happiness1 Happiness2 Result
1 2 30 20 1

If you have no aversion to using packages, I would recommend you use tidyverse for this. The following piece of code should get your desired output:
# install.packages("tidyverse")  # tidyverse is available on CRAN
library(tidyverse)
# Create a data.frame
dff <- structure(list(UserID = c(1, 2, 3, 4, 5, 6),
PartnerID = c(2,1, 4, 3, 6, 5),
Happiness = c(30, 20, 40, 50, 30, 20),
Result = c(1, 1, 1, 1, 1, 1)),
.Names = c("UserID", "PartnerID", "Happiness","Result"),
row.names = c(NA, 6L),
class = "data.frame")
# UserID PartnerID Happiness Result
# 1 2 30 1
# 2 1 20 1
# 3 4 40 1
# 4 3 50 1
# 5 6 30 1
# 6 5 20 1
# Reshape the data.frame
dff %>%
  mutate(grouper = paste(UserID, PartnerID, sep = "")) %>%
  mutate(grouper = unlist(map(strsplit(grouper, ""),
                              function(x) paste0(sort(x), collapse = "")))) %>%
  group_by(grouper) %>%
  mutate(Happiness = toString(Happiness)) %>%
  ungroup() %>%
  dplyr::filter(!duplicated(grouper)) %>%
  separate(Happiness, into = c("Happiness1", "Happiness2")) %>%
  select(-grouper)
This solution chains operations with the %>% operator.
The idea is to create a grouping column (called grouper) by first concatenating the UserID and PartnerID columns and then sorting the characters in each row. After this step, the grouper column holds the user's ID and their partner's ID in sorted order, so a user and their partner share the same grouper value. Note that sorting individual characters can produce colliding keys when IDs have more than one digit (for example, the pairs 1/12 and 11/2 both yield "112"), so this works as written for single-digit IDs like the ones in this example.
You can then use group_by (from dplyr) to group the data by grouper and, within each group, collapse the Happiness values into one comma-separated string (that is what toString does). Next, ungroup and filter out the duplicated grouper values so that one row per pair remains. Then separate the Happiness column into two columns, Happiness1 and Happiness2. Finally, drop the grouper column with select(-grouper).
That should yield:
# UserID PartnerID Happiness1 Happiness2 Result
# 1 2 30 20 1
# 3 4 40 50 1
# 5 6 30 20 1
I hope this helps.
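A possible variation on the answer above, offered as a minimal sketch rather than the original author's method: build the pair key from pmin() and pmax() instead of sorting characters, so it stays unambiguous for multi-digit IDs. It uses base R only, reuses the dff toy data, and assumes each pair appears in exactly two rows. The names key, first, second, and merged are illustrative.

```r
# Toy data copied from the answer above
dff <- data.frame(UserID    = c(1, 2, 3, 4, 5, 6),
                  PartnerID = c(2, 1, 4, 3, 6, 5),
                  Happiness = c(30, 20, 40, 50, 30, 20),
                  Result    = c(1, 1, 1, 1, 1, 1))

# Order-independent pair key: smaller ID first, larger ID second
key <- paste(pmin(dff$UserID, dff$PartnerID),
             pmax(dff$UserID, dff$PartnerID), sep = "_")

# Assumption: every pair occurs exactly twice, so the first occurrence
# goes to `first` and the repeat goes to `second`
first  <- dff[!duplicated(key), ]
second <- dff[duplicated(key), ]

merged <- data.frame(
  UserID     = first$UserID,
  PartnerID  = first$PartnerID,
  Happiness1 = first$Happiness,
  # look up the partner row that carries the same pair key
  Happiness2 = second$Happiness[match(key[!duplicated(key)],
                                      key[duplicated(key)])],
  Result     = first$Result
)
merged
```

Unlike the separate() step, this keeps Happiness1 and Happiness2 numeric.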

Maybe something like this, suppose your data is (I just added more toy data for the sake of clarity):
> df
# UserID PartnerID Happiness Result
# 1 4 30 1
# 2 3 20 0
# 3 2 10 0
# 4 1 15 1
# 10 13 20 1
# 13 10 25 1
# 5 6 10 0
# 11 12 10 1
# 6 5 10 0
# 12 11 15 1
Then this:
dups <- duplicated(t(apply(df[,c(1,2)],1,sort)))
cbind(df[, c(1,3)], df[match(df$UserID,df$PartnerID), c(1,3,4)])[dups,]
Will give you your desired output:
# UserID Happiness UserID Happiness Result
# 3 10 2 20 0
# 4 15 1 30 1
# 13 25 10 20 1
# 6 10 5 10 0
# 12 15 11 10 1

Related

How to duplicate each row based on a new column? [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I'm not exactly sure how to ask the question since English isn't my first language. What I want is to duplicate the row for each unique id 13 times and create a new column whose values range from -8 to 4, filling those 13 duplicated rows. I think my sample data and expected data will provide a better explanation.
sample data:
data <- data.frame(id = seq(1,100,1),
letters = sample(c("A", "B", "C", "D"), replace = TRUE))
> head(data)
id letters
1 1 A
2 2 B
3 3 B
4 4 C
5 5 A
6 6 B
the expected data:
newcol id letters
1 -8 1 A
2 -7 1 A
3 -6 1 A
4 -5 1 A
5 -4 1 A
6 -3 1 A
7 -2 1 A
8 -1 1 A
9 0 1 A
10 1 1 A
11 2 1 A
12 3 1 A
13 4 1 A
14 -8 2 B
15 -7 2 B
16 -6 2 B
17 -5 2 B
So I guess I could say that I want to create a new column with values ranging from -8 to 4 (so 13 different values) for each unique row in the id column.
Also, if possible, I would like to know how to do it in base R and with the data.table package.
Thank you and sorry for my poor grammar.
We can use uncount
library(tidyr)
library(dplyr)
data %>%
  uncount(13) %>%
  group_by(id) %>%
  mutate(newcol = -8:4) %>%
  ungroup()
Or in base R
data1 <- data[rep(seq_len(nrow(data)), each = 13),]
data1$newcol <- -8:4
Or using data.table
library(data.table)
setDT(data)[rep(seq_len(.N), each = 13)][, newcol := rep(-8:4, length.out = .N)][]
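Another way to look at the problem, sketched here under the assumption that a plain cross join is acceptable: base R's merge() returns the Cartesian product when by = NULL, so every id row gets paired with every value of -8:4. The small data frame and the name offsets are illustrative, not from the question.

```r
# Small deterministic stand-in for the question's data
data <- data.frame(id = 1:3, letters = c("A", "B", "B"))

# One-column data frame holding the 13 values to attach to every row
offsets <- data.frame(newcol = -8:4)

# by = NULL asks merge() for the Cartesian product of the two inputs;
# the rows of the first argument vary fastest
out <- merge(offsets, data, by = NULL)
head(out, 14)
```

Putting offsets first also places newcol as the leading column, as in the expected output.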

gather() per grouped variables in R for specific columns

I have a long data frame with the decisions of players who worked in groups.
I need to convert the data in such a way that each row (individual observation) contains the decisions of all group members (so we can basically see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format within each group, and only for some of these variables (in this example, specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id = c(1, 1),
           player_id = c(1, 2),
           player_decision = c(10, 20),
           player_1_contribution = c(6, 6),
           player_2_contribution = c(5, 5),
           player_3_contribution = c(4, 4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with one column per player's contribution, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
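# One row per group, one column per player's contribution
wide <- pivot_wider(df, id_cols = group_id,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
# Attach the group-level columns back to every player row
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide,
                    by = "group_id")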
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
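For readers who prefer to avoid extra packages, the same two steps (widen, then join back) can be sketched in base R with reshape() and merge(). This is an alternative under the same assumptions as the answer above, not its author's code. Note that reshape() names the new columns player_contribution.1, player_contribution.2, and so on.

```r
# Data from the question
df <- data.frame(group_id = c(rep(1, 3), rep(2, 3)),
                 player_id = rep(1:3, 2),
                 player_decision = seq(10, 60, 10),
                 player_contribution = seq(6, 1, -1))

# Step 1: one row per group, one contribution column per player
wide <- reshape(df[, c("group_id", "player_id", "player_contribution")],
                idvar = "group_id", timevar = "player_id",
                direction = "wide")

# Step 2: join the group-level columns back onto every player row
answer <- merge(df[, c("group_id", "player_id", "player_decision")],
                wide, by = "group_id")
answer
```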

Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that looks like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you can see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is whether there is a way to run a command, without a loop, that places NAs where there is no data. So I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging it with your original data (and using all.x = TRUE, which is like a left join in SQL), the combinations that are missing from dat show up as rows filled with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT @jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
#                         cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = TRUE)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
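Since the question mentions many more columns in the real dataset, here is a small sketch of how the same merge behaves with an additional column. The column extra is hypothetical, added only for illustration. The grid is derived from the data with unique(), so nothing is hard-coded.

```r
# Question's data plus one hypothetical additional column
dat <- data.frame(id    = c(1, 1, 1, 2, 2, 2, 3, 3, 4),
                  cycle = c(1, 2, 3, 1, 2, 3, 1, 3, 2),
                  value = 1:9,
                  extra = letters[1:9])

# All id/cycle combinations that occur anywhere in the data
grid_dat <- expand.grid(id = unique(dat$id), cycle = unique(dat$cycle))

# Left join: every non-key column is NA where a combination was missing
filled <- merge(grid_dat, dat, by = c("id", "cycle"), all.x = TRUE)
filled <- filled[order(filled$id, filled$cycle), ]
filled
```

Only the key columns need to be listed in by; every other column is carried along automatically.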
A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Subtracting the smaller column from the greater column in a dataframe in R

I have the input below and I would like to subtract the two columns, but I always want to subtract the lowest value from the highest value,
because I don't want negative values as a result, and sometimes the highest value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), names = c("PaternalOrigin", "MaternalOrigin"), row.names = c(NA, -6L), class = "data.frame")
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap with abs
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
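If it helps to see the "highest minus lowest" wording spelled out literally, a minimal base R sketch with pmax() and pmin() gives the same numbers as abs() for this data:

```r
# Data from the question
df <- data.frame(PaternalOrigin = c(16, 3, 11, 1, 1, 3),
                 MaternalOrigin = c(20, 6, 0, 3, 4, 11))

# Row-wise larger value minus row-wise smaller value
df$Results <- pmax(df$PaternalOrigin, df$MaternalOrigin) -
  pmin(df$PaternalOrigin, df$MaternalOrigin)
df
```

For two columns the abs() version is shorter. The pmax()/pmin() form extends to more than two columns, since both functions accept any number of vectors.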

Select first observed data and utilize mutate

I am running into an issue with my data where I want to take the first observed score for each individual id and subtract the last observed score from it.
The problem with asking for the first observation minus the last observation is that sometimes the first observation is missing.
Is there any way to ask for the first observed score for each individual, thus skipping any missing data?
I built the below df to illustrate my problem.
help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
id ob score
1 5 1 NA
2 5 2 2
3 5 3 3
4 5 4 4
5 5 5 3
6 12 1 7
7 12 2 3
8 12 3 4
9 17 1 3
10 17 2 4
11 20 1 NA
12 20 2 1
13 20 3 4
And what I am hoping to run is code that will give me...
id ob score es
1 5 1 NA -1
2 5 2 2 -1
3 5 3 3 -1
4 5 4 4 -1
5 5 5 3 -1
6 12 1 7 3
7 12 2 3 3
8 12 3 4 3
9 17 1 3 -1
10 17 2 4 -1
11 20 1 NA -3
12 20 2 1 -3
13 20 3 4 -3
I am attempting to work out of dplyr and I understand the use of the 'group_by' command, however, not sure how to 'select' only first observed scores and then mutate to create es.
I would use first() and last() (both dplyr functions) and na.omit() (from the default stats package).
First, I would make sure your score column is a numeric column with proper NA values (not strings):
help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
then you can do
library(dplyr)
help %>% group_by(id) %>% arrange(ob) %>%
  mutate(es = first(na.omit(score)) - last(na.omit(score)))
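library(dplyr)
# One es value per id, computed from the non-missing scores only
temp <- help %>%
  group_by(id) %>%
  arrange(ob) %>%
  filter(!is.na(score)) %>%
  mutate(es = first(score) - last(score)) %>%
  select(id, es) %>%
  distinct()
# Attach es to every original row, including those with a missing score
help %>% left_join(temp, by = "id")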
This solution is a little verbose, only because it relies on a couple of helper functions, FIRST and LAST:
# The position (index) of the first value that evaluates to TRUE.
FIRST <- function(x, none = NA) {
  x[is.na(x)] <- FALSE
  if (any(x)) which.max(x) else none
}
# The position (index) of the last value that evaluates to TRUE.
LAST <- function(x, none = NA) {
  out <- FIRST(rev(x), none = none)
  if (identical(none, out)) none else length(x) - out + 1
}
# Returns the first non-missing value minus the last non-missing value,
# which matches the sign of es in the question
diff2 <- function(x)
  x[FIRST(!is.na(x))] - x[LAST(!is.na(x))]
library(dplyr)
help %>%
group_by(id) %>%
arrange(ob) %>%
summarise(diff = diff2(score))
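For comparison, here is a compact base R sketch of the same idea that keeps every row, as in the question's expected output. It assumes the rows are already ordered by ob within each id. The names help_df and first_minus_last are illustrative.

```r
# Data from the question (renamed to avoid masking the help() function)
help_df <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
                      ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
                      score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

# First observed score minus last observed score, ignoring NAs
first_minus_last <- function(x) {
  obs <- x[!is.na(x)]
  if (length(obs) == 0) return(NA_real_)  # no observed score at all
  obs[1] - obs[length(obs)]
}

# ave() applies the function per id and recycles the single result
# across all rows of that id
help_df$es <- ave(help_df$score, help_df$id, FUN = first_minus_last)
help_df
```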
