R: Matching and repeating occurrence [duplicate]

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Closed 2 years ago.
(Sample code below.) I have two data sets. One is a library of products; the other holds customer id, date, the product viewed, and one more detail. I want a merged result that shows, for each id and date, every product in the library along with the rows where a match occurred. I have tried full_join, merge, and right and left joins, but none of them repeat the library rows. Below is a sample of what I am trying to achieve.
id=c(1,1,1,1,2,2)
date=c(1,1,2,2,1,3)
offer=c('a','x','y','x','y','a')
section=c('general','kitchen','general','general','general','kitchen')
t=data.frame(id,date,offer,section)
offer=c('a','x','y','z')
library=data.frame(offer)
######
t table
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 2 y general
4 1 2 x general
5 2 1 y general
6 2 3 a kitchen
library table
offer
1 a
2 x
3 y
4 z
And I want to get this:
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 1 y NA
4 1 1 z general
...
(there would have to be 6*4 observations)
I realize that because I match by offer the rows will not repeat like this, but what other option is there? Thanks a lot!

You can use complete to get all combinations of library$offer for each id and date.
tidyr::complete(t, id, date, offer = library$offer)
# A tibble: 24 x 4
# id date offer section
# <dbl> <dbl> <chr> <chr>
# 1 1 1 a general
# 2 1 1 x kitchen
# 3 1 1 y NA
# 4 1 1 z NA
# 5 1 2 a NA
# 6 1 2 x general
# 7 1 2 y general
# 8 1 2 z NA
# 9 1 3 a NA
#10 1 3 x NA
# … with 14 more rows
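Note that complete(t, id, date, ...) expands every id against every date, so it also produces pairs that never occurred (for example id 1 on date 3). If only the observed (id, date) pairs are wanted, a base R sketch along these lines may help. The names views and catalog are made up here to avoid masking t() and library(); in tidyr, wrapping id and date in nesting() inside complete() should have a similar effect.

```r
# Same data as in the question, under hypothetical names
views <- data.frame(
  id = c(1, 1, 1, 1, 2, 2),
  date = c(1, 1, 2, 2, 1, 3),
  offer = c("a", "x", "y", "x", "y", "a"),
  section = c("general", "kitchen", "general", "general", "general", "kitchen")
)
catalog <- data.frame(offer = c("a", "x", "y", "z"))

# Keep only the (id, date) pairs that actually occur: 4 pairs here
observed <- unique(views[, c("id", "date")])

# by = NULL makes merge() return the Cartesian product
grid <- merge(observed, catalog, by = NULL)

# Left join: every grid row is kept, unmatched offers get NA in section
result <- merge(grid, views, by = c("id", "date", "offer"), all.x = TRUE)
result <- result[order(result$id, result$date, result$offer), ]
```

This gives 4 pairs x 4 offers = 16 rows, with section set to NA wherever the customer did not view that offer.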

You can use tidyr and dplyr to get the data. The crossing() function creates all combinations of the variables you pass in.
library(dplyr)
library(tidyr)
t %>%
  select(id, date) %>%
  {crossing(id=.$id, date=.$date, library)} %>%
  left_join(t, by = c("id", "date", "offer"))

Related

R, dplyr: Is there a way to add order of groups when there are multiple rows per group without creating a new data frame? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 2 years ago.
I have data from an experiment that has multiple rows per item (each row has the reading time for one word of a sentence of n words), and multiple items per subject. Items can be varying numbers of rows. Items were presented in a random order, and their order in the data as initially read in reflects the sequence they saw the items in. What I'd like to do is add a column that contains the order in which the subject saw that item (i.e., 1 for the first item, 2 for the second, etc.).
Here's an example of some input data that has the relevant properties:
d <- data.frame(Subject = c(1,1,1,1,1,2,2,2,2,2),
Item = c(2,2,2,1,1,1,1,2,2,2))
Subject Item
1 2
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
2 2
And here's the output I want:
Subject Item order
1 2 1
1 2 1
1 2 1
1 1 2
1 1 2
2 1 1
2 1 1
2 2 2
2 2 2
2 2 2
I know I can do this by setting up a temp data frame that filters d to unique combinations of Subject and Item, adding order to that as something like 1:n() or row_number(), and then using a join function to put it back together with the main data frame. What I'd like to know is whether there's a way to do this without having to create a new data frame just to store the order. Can this be done inside dplyr's mutate somehow, for instance if I group by Subject and Item?
Here's one way:
library(dplyr)
d %>%
  group_by(Subject) %>%
  # position of each Item among the distinct Items, in order of first appearance
  mutate(order = match(Item, unique(Item))) %>%
  ungroup()
# # A tibble: 10 x 3
# Subject Item order
# <dbl> <dbl> <int>
# 1 1 2 1
# 2 1 2 1
# 3 1 2 1
# 4 1 1 2
# 5 1 1 2
# 6 2 1 1
# 7 2 1 1
# 8 2 2 2
# 9 2 2 2
# 10 2 2 2
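To see why match(Item, unique(Item)) gives the presentation order, here is a small base R sketch on a single subject's items. The second half is an assumption about a case the question does not cover: if an item could reappear after another item, match() would reuse its earlier number, while a run-based count (via rle()) would not.

```r
# One subject's items, in the order they were seen
items <- c(2, 2, 2, 1, 1)
first_seen <- unique(items)             # distinct items in order of appearance
order_seen <- match(items, first_seen)  # position of each item in that list

# Hypothetical case: item 2 comes back after item 1
revisited <- c(2, 2, 1, 2)
by_first_seen <- match(revisited, unique(revisited))

# Numbering contiguous runs instead of distinct values
runs <- rle(revisited)
by_run <- rep(seq_along(runs$values), runs$lengths)

print(order_seen)
print(by_first_seen)
print(by_run)
```

For the data in the question the two approaches agree, because each item forms one contiguous block per subject.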
Here is a base R option
transform(d,
order = ave(Item, Subject, FUN = function(x) as.integer(factor(x, levels = unique(x))))
)
or
transform(d,
order = ave(Item, Subject, FUN = function(x) match(x, unique(x)))
)
both giving
Subject Item order
1 1 2 1
2 1 2 1
3 1 2 1
4 1 1 2
5 1 1 2
6 2 1 1
7 2 1 1
8 2 2 2
9 2 2 2
10 2 2 2

Creating a sequential ranking based on previous ratings

I have an issue with sequentially updating rankings and no matter how I try to search for a solution - or come up with one myself - I fail.
I am trying to analyse results of an experiment of sequential choice in which participants had to find the best possible option (the option with the highest rating). They were presented with a rating in every trial.
I have an ID, an order and a rating variable for every choice. ID is the participant, rating represents how good the option is (the higher the rating the better) and order is the number of the trial (in this example there were 4 trials)
ID rating order
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4
I would like to create a new variable called "current_rank" which is basically the ranking of the rating of the current choice. This variable always needs to take into consideration all previous trials and ratings so e.g. for the participant with ID "1" this would be:
Trial 1: rating = 4, which means this is the best rating so far, current_rank = 1
Trial 2: rating = 3, which means this is the second best rating so far, current_rank = 2
Trial 3: rating = 5, which means this is the best rating so far, making it the new number 1 so, current_rank = 1
Trial 4: rating = 2, which means this is nowhere near the best, current_rank = 4
If I could do this with all participants and all choices my database should look like this:
ID rating order current_rank
1 4 1 1
1 3 2 2
1 5 3 1
1 2 4 4
2 3 1 1
2 5 2 1
2 2 3 3
2 1 4 4
I could successfully create an overall ranking variable like this:
db %>%
arrange(ID, order) %>%
group_by(ID) %>%
mutate(ovr_rank = min_rank(desc(rating)))
But my goal is to create a variable that is something of a sequential ranking. This would make it possible to see what kind of opinion the participant may have formed about the current rating based on the previous ratings, without knowing what future ratings might be. I tried creating loops or use the apply functions, but couldn't come up with a solution yet.
Any and all ideas are greatly appreciated!
Use the runner package to apply any R function over a cumulative (or rolling) window. Below, runner() walks through rating and applies rank() to the data available up to each row, which gives a cumulative rank. Uncomment print to see what is passed to function(x).
library(dplyr)
library(runner)
data %>%
  arrange(ID, order) %>%
  group_by(ID) %>%
  mutate(
    current_rank = runner(
      x = rating,
      f = function(x) {
        # print(x)
        rank_available_at_the_moment <- rank(-x, ties.method = "last")
        tail(rank_available_at_the_moment, 1)
      }
    )
  )
# # A tibble: 8 x 4
# # Groups: ID [2]
# ID rating order current_rank
# <int> <int> <int> <int>
# 1 1 4 1 1
# 2 1 3 2 2
# 3 1 5 3 1
# 4 1 2 4 4
# 5 2 3 1 1
# 6 2 5 2 1
# 7 2 2 3 3
# 8 2 1 4 4
data
data <- read.table(text = "ID rating order
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4", header = TRUE)
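If installing an extra package is not an option, the same cumulative idea can be sketched in base R: for each trial, count how many ratings seen so far are strictly higher and add one. The names ratings and rank_so_far are made up for this sketch, and ties are assumed to share the better rank.

```r
ratings <- data.frame(
  ID = rep(1:2, each = 4),
  rating = c(4, 3, 5, 2, 3, 5, 2, 1),
  order = rep(1:4, times = 2)
)
# Make sure trials are in presentation order within each participant
ratings <- ratings[order(ratings$ID, ratings$order), ]

# Rank of each element among the elements up to and including it
rank_so_far <- function(x) {
  vapply(
    seq_along(x),
    function(i) sum(x[seq_len(i)] > x[i]) + 1L,
    integer(1)
  )
}

# ave() applies the function within each ID and keeps the row order
ratings$current_rank <- ave(ratings$rating, ratings$ID, FUN = rank_so_far)
print(ratings)
```

This is quadratic in the number of trials per participant, which should be fine for a handful of trials.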
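This chunk of code will work:
library(tibble)
df <- tibble(
  ID = c(1,1,1,1,2,2,2,2),
  rating = c(4,3,5,2,3,5,2,1),
  rank = c(1,0,0,0,0,0,0,0)
)
for(i in 2:nrow(df)){
  if(df$ID[i] != df$ID[i-1]){
    # first row of a new participant
    df$rank[i] <- 1
  } else {
    # ratings this participant has seen so far, including the current one
    seen <- df$rating[1:i][df$ID[1:i] == df$ID[i]]
    # match() returns the first position, so tied ratings share the better rank
    df$rank[i] <- match(df$rating[i], sort(seen, decreasing = TRUE))
  }
}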
Explanation:
Note that I assume your dataframe is already ordered based on ID and order. In my df there is no order column, but it is mainly for simplicity (and it is not necessarily needed in my solution, again, assuming the rows are already ordered by ID and order).
The for loop checks whether the ID of the current row differs from the row above; if so, the row automatically gets rank 1. Otherwise, it takes the subset of df from row 1 to row i, keeps the rows with the same ID, sorts the ratings in that subset (including the current rating) in descending order, and assigns the position of the current rating as its rank.
I hope this answers your question and gives you insight.
Here are 2 options using data.table:
1) non-equi join to find all trials before and incl current trial, rank the rating and extract the current rank:
DT[, cr := .SD[.SD, on=.(ID, trial<=trial), by=.EACHI, order(order(-rating))[.N]]$V1]
2) non-equi join to find number of ratings that are higher than current rating in trials before current trial:
DT[, cr2 := DT[DT, on=.(ID, trial<=trial, rating>rating), by=.EACHI, .N + 1L]$V1]
Note that there might be ties in ratings and it will be good to specify how ratings ties should be handled.
output:
ID rating trial cr cr2
1: 1 4 1 1 1
2: 1 3 2 2 2
3: 1 5 3 1 1
4: 1 2 4 4 4
5: 2 3 1 1 1
6: 2 5 2 1 1
7: 2 2 3 3 3
8: 2 1 4 4 4
data:
library(data.table)
DT <- fread("ID rating trial
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4")
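The remark about ties can be made concrete with a tiny base R sketch. It assumes a participant who sees the same rating twice; which rank the second trial should get is a design decision the question leaves open.

```r
tied <- c(4, 4)  # two trials with the same rating

# "min": tied ratings share the best rank
rank_min <- rank(-tied, ties.method = "min")

# "first": the earlier trial ranks ahead of the later one
rank_first <- rank(-tied, ties.method = "first")

# Counting strictly higher earlier ratings behaves like "min"
count_based <- sum(tied > tied[2]) + 1

print(rank_min)
print(rank_first)
print(count_based)
```

With "min" the second trial is ranked 1, with "first" it is ranked 2, so the choice changes current_rank whenever ratings repeat.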

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format per group, but only for some of these variables (in this example, specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with one column per player's contribution, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
# one row per group, one contribution column per player
wide <- pivot_wider(df, id_cols = group_id,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide, by = "group_id")
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
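The same widen-then-join idea can also be sketched in base R with reshape() and merge(), in case tidyr is not available. This is only one possible approach; the column names it produces follow reshape()'s own naming (value name, separator, player id).

```r
df <- data.frame(
  group_id = rep(1:2, each = 3),
  player_id = rep(1:3, times = 2),
  player_decision = seq(10, 60, 10),
  player_contribution = 6:1
)

# One row per group, one player_contribution_<id> column per player
wide <- reshape(
  df[, c("group_id", "player_id", "player_contribution")],
  idvar = "group_id",
  timevar = "player_id",
  direction = "wide",
  sep = "_"
)

# Attach the group-level columns to every player's row
answer <- merge(
  df[, c("group_id", "player_id", "player_decision")],
  wide,
  by = "group_id"
)
print(answer)
```

Every row of a group then carries the contributions of all three group members.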

Filter ids with having count > 1 in data.table [duplicate]

This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed last month.
I would like to subset my data frame to keep only groups that have 3 or more observations on DIFFERENT days. I want to get rid of groups that have fewer than 3 observations, or whose observations do not fall on 3 different days.
Here is a sample data set:
Group Day
1 1
1 3
1 5
1 5
2 2
2 2
2 4
2 4
3 1
3 2
3 3
4 1
4 5
So for the above example, group 1 and group 3 will be kept and group 2 and 4 will be removed from the data frame.
I hope this makes sense, I imagine the solution will be quite simple but I can't work it out (I'm quite new to R and not very fast at coming up with solutions to things like this). I thought maybe the diff function could come in handy but didn't get much further.
With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
group_by(Group) %>%
filter(n_distinct(Day) >= 3)
which gives the same result.
One idea using dplyr
library(dplyr)
df %>%
group_by(Group) %>%
filter(length(unique(Day)) >= 3)
#Source: local data frame [7 x 2]
#Groups: Group [2]
# Group Day
# (int) (int)
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#5 3 1
#6 3 2
#7 3 3
We can use base R
i1 <- rowSums(table(df1)!=0)>=3
subset(df1, Group %in% names(i1)[i1])
# Group Day
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#9 3 1
#10 3 2
#11 3 3
Or a one-liner base R would be
df1[with(df1, as.logical(ave(Day, Group, FUN = function(x) length(unique(x)) >=3))),]
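For anyone who wants to run the answers above, here is a reproducible base R sketch that rebuilds the sample data and makes the counting step explicit. The names n_days and keep are only for illustration.

```r
df1 <- data.frame(
  Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Day   = c(1, 3, 5, 5, 2, 2, 4, 4, 1, 2, 3, 1, 5)
)

# Number of distinct days per group (a named vector, names are the groups)
n_days <- tapply(df1$Day, df1$Group, function(x) length(unique(x)))

# Groups observed on at least 3 different days
keep <- names(n_days)[n_days >= 3]

result <- df1[df1$Group %in% keep, ]
print(n_days)
print(result)
```

Group 1 has four rows but only three distinct days, which is still enough; group 2 has four rows on just two days and is dropped.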

In a dataframe, find the index of the next smaller value for each element of a column

Question:
In a dataframe, I want to create a new column as the indices of the next smaller value of an existing column.
For example, the data looks like this. It is already arranged in item, day.
item day val
1 1 2 3
2 1 4 2
3 1 5 1
4 2 1 1
5 2 3 2
6 2 5 3
First I would like to use group_by(item) in dplyr to select the sub-dataframe of each item.
Then for row 1, I look down the rows and find that row 2 has a smaller val. This is what I want, so I record the day corresponding to that row. Similar for row 2.
Note that for row 3 and 6, they are the last rows of corresponding sub-dataframes, so there is no next smaller value. For row 4 and 5, there is no smaller val when I look down the rows.
The dataframe with the new column should look like this.
item day val next.smaller.day
1 1 2 3 4
2 1 4 2 5
3 1 5 1 -1
4 2 1 1 -1
5 2 3 2 -1
6 2 5 3 -1
I wonder if there is any way of using dplyr to implement this, or any codes in r other than a for loop.
I found a thread asking the algorithm of this question. Given an array, find out the next smaller element for each element .
It is relevant, and the proposed algorithm beats mine in terms of time complexity, but I still find it hard to implement in my scenario.
Thank you!
Update:
Here is another example to re-illustrate what I'm looking for.
item day val next.smaller.day
1 1 2 2 5
2 1 4 3 5
3 1 5 1 -1
4 2 1 3 3
5 2 3 1 -1
6 2 5 2 -1
You can group your data by item, calculate the difference between consecutive rows with the diff function, and check whether it is smaller than zero. This yields a logical vector that you can use to pick up the next day. Since you are picking up the next day, you need the lead function to shift the day column forward so that it lines up with the rows where the value belongs.
Side note: Since diff returns a vector one element shorter than the original, and the last row of each group never has a next row, we can pad the diff result with FALSE.
library(dplyr)
df %>% group_by(item) %>% mutate(smaller = c(diff(val) < 0, FALSE),
                                 next.smaller.day = ifelse(smaller, lead(day), -1)) %>%
  select(-smaller)
# Source: local data frame [6 x 4]
# Groups: item [2]
# item day val next.smaller.day
# <int> <int> <int> <dbl>
# 1 1 2 3 4
# 2 1 4 2 5
# 3 1 5 1 -1
# 4 2 1 1 -1
# 5 2 3 2 -1
# 6 2 5 3 -1
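It may help to trace the diff-based check on a single group in base R. The vectors below are item 1 from the question's second example; the sketch shows that the check only compares each row with the row immediately after it, so a smaller value further down is missed.

```r
val <- c(2, 3, 1)
day <- c(2, 4, 5)

# TRUE where the very next value is smaller; padded with FALSE for the last row
smaller <- c(diff(val) < 0, FALSE)

# Base R stand-in for dplyr::lead(day)
next_day <- c(day[-1], NA)

result <- ifelse(smaller, next_day, -1)
print(smaller)
print(result)
```

Row 1 gets -1 here even though a smaller value (1, on day 5) appears two rows later, so this approach only matches the first example in the question.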
Update:
find.next.smaller <- function(ini = 1, vec) {
if(length(vec) == 1) NA
else c(ini + min(which(vec[1] > vec[-1])),
find.next.smaller(ini + 1, vec[-1]))
} # the recursive function will go element by element through the vector and find out
# the index of the next smaller value.
df %>% group_by(item) %>% mutate(next.smaller.day = day[find.next.smaller(1, val)],
next.smaller.day = replace(next.smaller.day, is.na(next.smaller.day), -1))
# Source: local data frame [6 x 4]
# Groups: item [2]
#
# item day val next.smaller.day
# <int> <int> <dbl> <dbl>
# 1 1 2 2 5
# 2 1 4 3 5
# 3 1 5 1 -1
# 4 2 1 1 -1
# 5 2 3 2 -1
# 6 2 5 3 -1
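The question mentions a faster "next smaller element" algorithm. A minimal base R sketch of the usual stack-based version is given below; the function name next_smaller_index is made up, and the data frame is the question's second example. It assumes the rows are already sorted by item and day.

```r
# For each position, the index of the next strictly smaller value (NA if none)
next_smaller_index <- function(val) {
  n <- length(val)
  out <- rep(NA_integer_, n)
  stack <- integer(n)  # indices still waiting for a smaller value
  top <- 0L
  for (i in seq_len(n)) {
    # val[i] resolves every waiting index whose value is larger
    while (top > 0L && val[stack[top]] > val[i]) {
      out[stack[top]] <- i
      top <- top - 1L
    }
    top <- top + 1L
    stack[top] <- i
  }
  out
}

df <- data.frame(
  item = c(1, 1, 1, 2, 2, 2),
  day  = c(2, 4, 5, 1, 3, 5),
  val  = c(2, 3, 1, 3, 1, 2)
)

# Apply per item; split() keeps groups in sorted order, matching the row order
df$next.smaller.day <- unlist(lapply(split(df, df$item), function(g) {
  d <- g$day[next_smaller_index(g$val)]
  ifelse(is.na(d), -1, d)
}), use.names = FALSE)
print(df)
```

Each index is pushed and popped at most once, so the work per group grows linearly with the group size, and the function produces no warnings when no smaller value exists.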
