A "special" case of merging in R


I am a fairly inexperienced R user facing the following problem:
I would like to merge two data tables, dt1 and dt2.
dt1 contains one variable, Assessment.
dt2 contains two variables, ID and Frequency.
Now I would also like to have the Assessment observations in dt2.
For simplicity, consider this example:
library(dplyr)
library(data.table)
dt1 <- data.table(c("perfect", "perfect", "okay", "unsufficient", "good", "good", "okay", "perfect"))
colnames(dt1) <- "Assessment"
dt2 <- data.table(cbind(c(1,2,3,4,5,6),c(1,3,1,1,1,1)))
colnames(dt2) <- c("ID", "Frequency")
Hence, dt1 looks like this:
Assessment
perfect
perfect
okay
unsufficient
good
good
okay
perfect
dt2 looks like this:
ID Frequency
1 1
2 3
3 1
4 1
5 1
6 1
My aim would be to get something like:
ID Frequency Assessment
1 1 perfect
2 3 perfect;okay;unsufficient
3 1 good
4 1 good
5 1 okay
6 1 perfect
I have no idea how to get there and would appreciate any help very much! Thanks a lot!

library(tidyr)  # uncount() comes from tidyr, not dplyr
dt1 %>%
  bind_cols(
    dt2 %>%
      uncount(Frequency)  # repeat each ID row Frequency times
  ) %>%
  group_by(ID) %>%
  summarise(Assessment = paste0(Assessment, collapse = ";"))
# A tibble: 6 x 2
ID Assessment
<dbl> <chr>
1 1 perfect
2 2 perfect;okay;unsufficient
3 3 good
4 4 good
5 5 okay
6 6 perfect
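The result above has no Frequency column, whereas the requested output keeps it. A minimal sketch of one way to retain it, assuming (as the question does) that the rows of dt1 are already in ID order. It uses plain data frames for brevity, relies on the .remove = FALSE argument of tidyr::uncount(), and adds a row-count guard that is not part of the original answer:

```r
library(dplyr)
library(tidyr)

dt1 <- data.frame(Assessment = c("perfect", "perfect", "okay", "unsufficient",
                                 "good", "good", "okay", "perfect"))
dt2 <- data.frame(ID = 1:6, Frequency = c(1, 3, 1, 1, 1, 1))

# Guard (my addition): positional pairing only makes sense if the counts line up.
stopifnot(sum(dt2$Frequency) == nrow(dt1))

result <- dt2 %>%
  uncount(Frequency, .remove = FALSE) %>%  # one row per observation, Frequency kept
  bind_cols(dt1) %>%                       # pair rows by position
  group_by(ID, Frequency) %>%
  summarise(Assessment = paste(Assessment, collapse = ";"), .groups = "drop")

result
```

Grouping by both ID and Frequency is harmless here because Frequency is constant within each ID.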

If you trust that the rows are already in the right order, as the question implies, you can repeat the IDs according to their frequencies with rep.int.
dt2[dt1[, list(Assessment=toString(Assessment)), by=list(ID=with(dt2, rep.int(ID, Frequency)))], on=.(ID)]
# ID Frequency Assessment
# 1: 1 1 perfect
# 2: 2 3 perfect, okay, unsufficient
# 3: 3 1 good
# 4: 4 1 good
# 5: 5 1 okay
# 6: 6 1 perfect
or
dt2[dt1[, list(Assessment=list(Assessment)), by=list(ID=with(dt2, rep.int(ID, Frequency)))], on=.(ID)]
# ID Frequency Assessment
# 1: 1 1 perfect
# 2: 2 3 perfect,okay,unsufficient
# 3: 3 1 good
# 4: 4 1 good
# 5: 5 1 okay
# 6: 6 1 perfect
The difference is that in the second version Assessment is a list column.
Note: if dt2 doesn't contain any other columns, there's no need to merge at all, and it simplifies to
dt1[, list(Assessment=toString(Assessment)), by=list(ID=with(dt2, rep.int(ID, Frequency)))]
# ID Assessment
# 1: 1 perfect
# 2: 2 perfect, okay, unsufficient
# 3: 3 good
# 4: 4 good
# 5: 5 okay
# 6: 6 perfect
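For readers without data.table, the same rep.int idea can be sketched in base R. This is only an illustration under the same "rows are already in order" assumption; the helper names ids and collapsed are mine:

```r
dt1 <- data.frame(Assessment = c("perfect", "perfect", "okay", "unsufficient",
                                 "good", "good", "okay", "perfect"))
dt2 <- data.frame(ID = c(1, 2, 3, 4, 5, 6), Frequency = c(1, 3, 1, 1, 1, 1))

# Expand the IDs so that there is one ID per row of dt1: 1 2 2 2 3 4 5 6
ids <- rep.int(dt2$ID, dt2$Frequency)
stopifnot(length(ids) == nrow(dt1))

# Collapse the assessments per expanded ID, then look them up by ID
collapsed <- tapply(dt1$Assessment, ids, paste, collapse = ";")
dt2$Assessment <- as.vector(collapsed[as.character(dt2$ID)])

dt2
```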

Related

R: Matching and repeating occurence [duplicate]

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Closed 2 years ago.
(Sample code below.) I have two data sets. One is a library of products; the other holds customer id, date, viewed product, and one further detail. For each id AND date, I want a merged table that lists the whole product library and shows where a match occurred. I have tried full_join, merge, and right and left joins, but they do not repeat the rows. Below is a sample of what I am trying to achieve.
id=c(1,1,1,1,2,2)
date=c(1,1,2,2,1,3)
offer=c('a','x','y','x','y','a')
section=c('general','kitchen','general','general','general','kitchen')
t=data.frame(id,date,offer,section)
offer=c('a','x','y','z')
library=data.frame(offer)
######
t table
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 2 y general
4 1 2 x general
5 2 1 y general
6 2 3 a kitchen
library table
offer
1 a
2 x
3 y
4 z
and I want to get this:
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 1 y NA
4 1 1 z general
...
(there would have to be 6*4 observations)
I realize because I match by offer it is not going to repeat the values like so, but what is another option to do that? Thanks a lot!!

You can use complete to get all combinations of library$offer for each id and date.
tidyr::complete(t, id, date, offer = library$offer)
# A tibble: 24 x 4
# id date offer section
# <dbl> <dbl> <chr> <chr>
# 1 1 1 a general
# 2 1 1 x kitchen
# 3 1 1 y NA
# 4 1 1 z NA
# 5 1 2 a NA
# 6 1 2 x general
# 7 1 2 y general
# 8 1 2 z NA
# 9 1 3 a NA
#10 1 3 x NA
# … with 14 more rows

You can use tidyr and dplyr to get the data. The crossing() function will create all combinations of the variables you pass in
library(dplyr)
library(tidyr)
t %>%
select(id, date) %>%
{crossing(id=.$id, date=.$date, library)} %>%
left_join(t)
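Both answers build every id x date combination, including pairs such as id 2 on date 2 that never occur in the data. If only the observed (id, date) pairs should be expanded, tidyr's nesting() helper can be used inside complete(). A sketch, with the data frames renamed to views and catalog (my names, not the asker's) to avoid shadowing the base functions t() and library():

```r
library(tidyr)

views <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2),
  date    = c(1, 1, 2, 2, 1, 3),
  offer   = c("a", "x", "y", "x", "y", "a"),
  section = c("general", "kitchen", "general", "general", "general", "kitchen"),
  stringsAsFactors = FALSE
)
catalog <- data.frame(offer = c("a", "x", "y", "z"), stringsAsFactors = FALSE)

# nesting(id, date) keeps only the (id, date) pairs present in views;
# each such pair is then crossed with every offer in the catalog.
observed_only <- complete(views, nesting(id, date), offer = catalog$offer)

nrow(observed_only)  # 4 observed pairs x 4 offers
```

Whether this or the full 24-row grid is wanted depends on how "per each id AND date" is meant; the question does not settle it.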

Creating a sequential ranking based on previous ratings

I have an issue with sequentially updating rankings and no matter how I try to search for a solution - or come up with one myself - I fail.
I am trying to analyse results of an experiment of sequential choice in which participants had to find the best possible option (the option with the highest rating). They were presented with a rating in every trial.
I have an ID, an order and a rating variable for every choice. ID is the participant, rating represents how good the option is (the higher the rating the better) and order is the number of the trial (in this example there were 4 trials)
ID rating order
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4
I would like to create a new variable called "current_rank" which is basically the ranking of the rating of the current choice. This variable always needs to take into consideration all previous trials and ratings so e.g. for the participant with ID "1" this would be:
Trial 1: rating = 4, which means this is the best rating so far, current_rank = 1
Trial 2: rating = 3, which means this is the second best rating so far, current_rank = 2
Trial 3: rating = 5, which means this is the best rating so far, making it the new number 1 so, current_rank = 1
Trial 4: rating = 2, which means this is nowhere near the best, current_rank = 4
If I could do this with all participants and all choices my database should look like this:
ID rating order current_rank
1 4 1 1
1 3 2 2
1 5 3 1
1 2 4 4
2 3 1 1
2 5 2 1
2 2 3 3
2 1 4 4
I could successfully create an overall ranking variable like this:
db %>%
arrange(ID, order) %>%
group_by(ID) %>%
mutate(ovr_rank = min_rank(desc(rating)))
But my goal is to create a variable that is something of a sequential ranking. This would make it possible to see what kind of opinion the participant may have formed about the current rating based on the previous ratings, without knowing what future ratings might be. I tried creating loops or use the apply functions, but couldn't come up with a solution yet.
Any and all ideas are greatly appreciated!

Use runner to apply any R function over a cumulative (or rolling) window. Below, runner rolls over rating and applies rank to the data available at each point (a cumulative rank). Uncomment print to see what is passed to function(x).
library(dplyr)
library(runner)
data %>%
  arrange(ID, order) %>%
  group_by(ID) %>%
  mutate(
    current_rank = runner(
      x = rating,
      f = function(x) {
        # print(x)
        rank_available_at_the_moment <- rank(-x, ties.method = "last")
        tail(rank_available_at_the_moment, 1)
      }
    )
  )
# # A tibble: 8 x 4
# # Groups: ID [2]
# ID rating order current_rank
# <int> <int> <int> <int>
# 1 1 4 1 1
# 2 1 3 2 2
# 3 1 5 3 1
# 4 1 2 4 4
# 5 2 3 1 1
# 6 2 5 2 1
# 7 2 2 3 3
# 8 2 1 4 4
data
data <- read.table(text = "ID rating order
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4", header = TRUE)
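If adding a package is not an option, the same cumulative ranking can be sketched in base R. The helper cum_rank is a hypothetical name of mine; it counts how many earlier-or-current ratings are strictly higher than the current one, so tied ratings share the better rank (which differs from ties.method = "last" above):

```r
data <- read.table(text = "ID rating order
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4", header = TRUE)

# Make sure trials are in chronological order within each participant
data <- data[order(data$ID, data$order), ]

# Rank of each element among the elements seen up to and including it
cum_rank <- function(x) {
  vapply(seq_along(x),
         function(i) sum(x[seq_len(i)] > x[i]) + 1L,
         integer(1))
}

# ave() applies cum_rank separately within each ID
data$current_rank <- ave(data$rating, data$ID, FUN = cum_rank)
data
```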

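This chunk of code will work:
library(tibble)
df <- tibble(
  ID = c(1,1,1,1,2,2,2,2),
  rating = c(4,3,5,2,3,5,2,1),
  rank = c(1,0,0,0,0,0,0,0)
)
for (i in 2:nrow(df)) {
  if (df$ID[i] != df$ID[i-1]) {
    df$rank[i] <- 1
  } else {
    # ratings seen so far (rows 1..i) for the current participant
    seen <- df$rating[1:i][df$ID[1:i] == df$ID[i]]
    # position of the current rating in descending order; [1] guards against ties
    df$rank[i] <- which(sort(seen, decreasing = TRUE) == df$rating[i])[1]
  }
}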
Explanation:
Note that I assume your dataframe is already ordered based on ID and order. In my df there is no order column, but it is mainly for simplicity (and it is not necessarily needed in my solution, again, assuming the rows are already ordered by ID and order).
The for loop checks whether the ID of the current row differs from that of the row above; if it does, the row automatically gets rank 1. Otherwise, it takes the subset of df from row 1 to row i, keeps only the rows with the same ID, sorts the ratings in that subset (including the current rating) in descending order, and assigns the position of the current rating as its rank.
I hope this answers your question and gives you insight.

Here are 2 options using data.table:
1) non-equi join to find all trials before and incl current trial, rank the rating and extract the current rank:
DT[, cr := .SD[.SD, on=.(ID, trial<=trial), by=.EACHI, order(order(-rating))[.N]]$V1]
2) non-equi join to find number of ratings that are higher than current rating in trials before current trial:
DT[, cr2 := DT[DT, on=.(ID, trial<=trial, rating>rating), by=.EACHI, .N + 1L]$V1]
Note that there might be ties in ratings and it will be good to specify how ratings ties should be handled.
output:
ID rating trial cr cr2
1: 1 4 1 1 1
2: 1 3 2 2 2
3: 1 5 3 1 1
4: 1 2 4 4 4
5: 2 3 1 1 1
6: 2 5 2 1 1
7: 2 2 3 3 3
8: 2 1 4 4 4
data:
library(data.table)
DT <- fread("ID rating trial
1 4 1
1 3 2
1 5 3
1 2 4
2 3 1
2 5 2
2 2 3
2 1 4")
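To make the note about ties concrete, here is a small illustration with hypothetical ratings (not the asker's data) in which the third trial ties with the first. It shows how the rank of the current (last) trial depends on the ties.method passed to base R's rank():

```r
ratings <- c(4, 5, 4)  # hypothetical: trial 3 ties with trial 1

# Rank in descending order under three tie-handling rules
r_min   <- rank(-ratings, ties.method = "min")    # ties share the better rank
r_first <- rank(-ratings, ties.method = "first")  # earlier trial ranks better
r_last  <- rank(-ratings, ties.method = "last")   # later trial ranks better

# Rank assigned to the current trial under each rule
c(min = r_min[3], first = r_first[3], last = r_last[3])
```

Under "min" and "last" the current trial is ranked 2, while under "first" it is ranked 3, so the choice changes the result whenever ratings repeat.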

How to crosstabulate the missings with data.table

Say we have this toy example:
prueba <- data.table(aa=1:7,bb=c(1,2,NA, NA, 3,1,1),
cc=c(1,2,NA, NA, 3,1,1) , YEAR=c(1,1,1,2,2,2,2))
aa bb cc YEAR
1: 1 1 1 1
2: 2 2 2 1
3: 3 NA NA 1
4: 4 NA NA 2
5: 5 3 3 2
6: 6 1 1 2
7: 7 1 1 2
I want to create a table with the values of something by YEAR.
In this simple example I will just ask for the table that says how many missing and non-missing I have.
This is an ugly way to do it, specifying everything by hand:
prueba[,.(sum(is.na(.SD)),sum(!is.na(.SD))), by=YEAR]
Although it doesn't label the new columns automatically, it shows that I have 2 missing and 7 non-missing values for year 1, and so on:
YEAR V1 V2
1: 1 2 7
2: 2 2 10
It works but what I would really like is to be able to use table() or some data.table equivalent command instead of specifying by hand every term. That would be much more efficient if I have many of them or if we don't know them beforehand.
I've tried with:
prueba[,table(is.na(.SD)), by=YEAR]
but it doesn't work, I get this:
YEAR V1
1: 1 7
2: 1 2
3: 2 10
4: 2 2
How can I get the same format as above?
I've tried as.data.table, unlist, lapply, and other things, without luck. I think some people use dcast, but I don't know how to use it here.
Is there a simple way to do it?
My real table is very large.
Is it better to use the names of the columns instead of .SD?

You can convert the table to a list if you want it as two separate columns
prueba[, as.list(table(is.na(.SD))), by=YEAR]
# YEAR FALSE TRUE
# 1: 1 7 2
# 2: 2 10 2
I suggest not using TRUE and FALSE as column names though.
prueba[, setNames(as.list(table(is.na(.SD))), c('notNA', 'isNA'))
, by = YEAR]
# YEAR notNA isNA
# 1: 1 7 2
# 2: 2 10 2
Another option is to add a new column and then dcast
na_summ <- prueba[, table(is.na(.SD)), by = YEAR]
na_summ[, vname := c('notNA', 'isNA'), YEAR]
dcast(na_summ, YEAR ~ vname, value.var = 'V1')
# YEAR isNA notNA
# 1: 1 2 7
# 2: 2 2 10
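One caveat with table(is.na(.SD)): if some group has no missing values at all, the table has only a FALSE cell, so the groups would not all produce the same columns. A hedged sketch of one way around this is to fix the levels up front with factor(). The data below is the toy example with the last row moved to a hypothetical YEAR 3 that has no NAs:

```r
library(data.table)

prueba <- data.table(aa = 1:7,
                     bb = c(1, 2, NA, NA, 3, 1, 1),
                     cc = c(1, 2, NA, NA, 3, 1, 1),
                     YEAR = c(1, 1, 1, 2, 2, 2, 3))  # YEAR 3 is my modification

na_counts <- prueba[, {
  is_missing <- is.na(unlist(.SD, use.names = FALSE))
  # Fixed levels guarantee both cells exist, even when a count is zero
  as.list(table(factor(is_missing,
                       levels = c(FALSE, TRUE),
                       labels = c("notNA", "isNA"))))
}, by = YEAR]

na_counts
```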

Dplyr: subtracting within uneven factor levels

I am trying to learn dplyr, and I cannot find an answer for a relatively simple question on Stackoverflow or the documentation. I thought I'd ask it here.
I have a data.frame that looks like this:
set.seed(1)
dat<-data.frame(rnorm(10,20,20),rep(seq(5),2),rep(c("a","b"),5))
names(dat)<-c("number","factor_1","factor_2")
dat<-dat[order(dat$factor_1,dat$factor_2),]
dat<-dat[c(-3,-7),]
number factor_1 factor_2
1 7.470924 1 a
6 3.590632 1 b
2 23.672866 2 b
3 3.287428 3 a
8 34.766494 3 b
4 51.905616 4 b
5 26.590155 5 a
10 13.892232 5 b
I would like to use dplyr to subtract the values in the number column associated with factor_2=="b" from those associated with factor_2=="a" within each level of factor_1.
The first line of the resulting data.frame would look like:
diff factor_1
1 3.880291 1
A caveat is that there are not always values for each level of factor_2 within each level of factor_1. Should this be the case, I would like to assign 0 to the number associated with the missing factor level.
Thank you for your help.

Here is one approach:
set.seed(1)
dat<-data.frame(rnorm(10,20,20),rep(seq(5),2),rep(c("a","b"),5))
names(dat)<-c("number","factor_1","factor_2")
dat<-dat[order(dat$factor_1,dat$factor_2),]
dat<-dat[c(-3,-7),]
# number factor_1 factor_2
#1 7.470924 1 a
#6 3.590632 1 b
#2 23.672866 2 b
#3 3.287428 3 a
#8 34.766494 3 b
#4 51.905616 4 b
#5 26.590155 5 a
#10 13.892232 5 b
library(dplyr)
dat %>%
group_by(factor_1) %>%
summarize(diff=number[match('a',factor_2)]-number[match('b',factor_2)]) ->
d2
d2$diff[is.na(d2$diff)] <- 0
d2
# Source: local data frame [5 x 2]
#
# factor_1 diff
# 1 1 3.880291
# 2 2 0.000000
# 3 3 -31.479066
# 4 4 0.000000
# 5 5 12.697923
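Note that the approach above sets the difference itself to 0 when a level is missing. The question can also be read literally: treat the missing level's number as 0 and still subtract. A sketch of that reading using tidyr::pivot_wider() with values_fill, on small made-up numbers rather than the rnorm data:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: factor_1 == 2 has no "a" row
dat <- data.frame(number   = c(7.5, 3.5, 23.5, 3.0, 34.5),
                  factor_1 = c(1, 1, 2, 3, 3),
                  factor_2 = c("a", "b", "b", "a", "b"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  # One column per level of factor_2; absent combinations become 0
  pivot_wider(names_from = factor_2, values_from = number, values_fill = 0) %>%
  mutate(diff = a - b) %>%
  select(factor_1, diff)

res
```

Here factor_1 == 2 yields 0 - 23.5 = -23.5 rather than 0; which of the two behaviours is intended is for the asker to decide.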

Here's a quick data.table solution using your data (next time please use set.seed when producing a data set with rnorm)
library(data.table)
setDT(dat)[order(-factor_2), if(.N == 1L) 0 else diff(number), by = factor_1]
# factor_1 V1
# 1: 1 18.20020
# 2: 2 0.00000
# 3: 3 -51.88444
# 4: 4 0.00000
# 5: 5 61.90332

Is my way of duplicating rows in data.table efficient?

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.
My approach is as follows: Duplicating the annual data for every month and then join the monthly and annual data. And now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way to do it, so some opinions would be great.
Here is an example data.table DT for my annual data and how I currently duplicate:
library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
values = 10:15,
startMonth = seq(from=1, by=2, length=6),
endMonth = seq(from=3, by=3, length=6))
DT
ID values startMonth endMonth
[1,] a_1 10 1 3
[2,] a_2 11 3 6
[3,] a_3 12 5 9
[4,] b_1 13 7 12
[5,] b_2 14 9 15
[6,] b_3 15 11 18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT, ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
[...]
The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns of DT, so I was wondering if I could get rid of the last three lines of my code, i.e. the setkey and join operations. It turns out you can:
#2. Alternative: More intuitive and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:
DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") :
maxn (4) is not exact multiple of this j column's length (3)
To summarize: I know how to get it done, but I am wondering whether this is the best way, because I'm still struggling a little with the syntax of data.table and often read in posts and on the wiki that there are good and bad ways of doing things. Also, I don't quite understand why I get an error when using .SD. I thought it was just an easy way to tell data.table that you want all columns. What am I missing?

Looking at this I realized that the answer was only possible because ID was a unique key (without duplicates). Here is another answer with duplicates. But, by the way, some NA seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).
library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))
DT[,rep:=1L][c(2,7),rep:=c(2L,3L)] # duplicate row 2 and triple row 7
DT[,num:=1:.N] # to group each row by itself
DT
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 2 1 3
4: 1 3 1 4
5: 2 1 1 5
6: 2 1 1 6
7: 3 2 3 7
DT[,cbind(.SD,dup=1:rep),by="num"]
num x y rep dup
1: 1 1 1 1 1
2: 2 1 1 1 NA # why these NA?
3: 2 1 1 2 NA
4: 3 1 2 1 1
5: 4 1 3 1 1
6: 5 2 1 1 1
7: 6 2 1 1 1
8: 7 3 2 3 1
9: 7 3 2 3 2
10: 7 3 2 3 3
Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD) :
DT[rep(num,rep)]
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 1 2 2
4: 1 2 1 3
5: 1 3 1 4
6: 2 1 1 5
7: 2 1 1 6
8: 3 2 3 7
9: 3 2 3 7
10: 3 2 3 7
Note that in this example data the column rep happens to have the same name as the base function rep().

Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.
In the meantime, try :
DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
Also, just to check you've seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then just roll join to it. Your example data has overlapping month ranges though, so that complicates it.

Here is a function I wrote which mimics disaggregate (I needed something that handled complex data). It might be useful for you, if it isn't overkill. To expand only rows, set the argument fact to c(1,12) where 12 would be for 12 'month' rows for each 'year' row.
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
  fact <- as.integer(round(fact))
  switch(as.character(length(fact)),
    '1' = xfact <- yfact <- fact,
    '2' = {xfact <- fact[1]; yfact <- fact[2]},
    {xfact <- fact[1]; yfact <- fact[2]; warning(' fact is too long. First two values used.')})
  if (xfact < 1) { stop('fact[1] must be > 0') }
  if (yfact < 1) { stop('fact[2] must be > 0') }
  # new nonloop method, seems to work just ducky
  bigtmp <- matrix(rep(t(inarray), each = xfact), nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
  # does column expansion
  bigx <- t(matrix(rep(bigtmp, each = yfact), ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
  return(invisible(bigx))
}

The fastest and most succinct way of doing it (note the + 1, since startMonth:endMonth includes both endpoints):
DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
We can also enumerate the rows within each group:
dd <- DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
dd[, nn := 1:.N, by = ID]
dd
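Putting the row-repetition idea together with the MONTH column from the question, a minimal end-to-end sketch might look as follows. It assumes ID is unique in DT (as in the example data) and uses the fact that startMonth:endMonth contains endMonth - startMonth + 1 values:

```r
library(data.table)

DT <- data.table(ID = paste(rep(c("a", "b"), each = 3), c(1:3, 1:3), sep = "_"),
                 values = 10:15,
                 startMonth = seq(from = 1, by = 2, length.out = 6),
                 endMonth = seq(from = 3, by = 3, length.out = 6))

# Repeat each row once per month in its range; all columns come along,
# so no column names need to be hardcoded.
dd <- DT[rep(seq_len(nrow(DT)), endMonth - startMonth + 1)]

# Number the copies within each ID to recover the month
# (this step relies on ID being unique in DT)
dd[, MONTH := startMonth + seq_len(.N) - 1, by = ID]

dd[ID == "a_2"]
```

Because the subset carries every column of DT, this also sidesteps the .SD question for the case where the column names are not known in advance.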
