Identify repeats in r

Identify repeats in r - r

Hi I have a dataframe and it looks like this:
test = data.frame("Year" = c("2015","2015","2016","2017","2018"),
"UserID" = c(1,2,1,1,3), "PurchaseValue" = c(1,5,3,3,5))
where "Year" is the time of purchase and "UserID" is the buyer.
I want to create a variable "RepeatedPurchase" that gives "1" if it is a repeated purchase and else 0 (if it is the only purchase/ if it is the first time purchase).
Thus, the desired output would look like this:
I tried to achieve this by first creating a variable "Se" that tells if that purchase is the 1st/ 2nd/ 3rd... purchase of that buyer but my code didn't work. Wondering what's wrong with my code or is there a better way I can identify repeated purchase? Thanks!
library(dplyr)
df %>% arrange(UserID, Year) %>% group_by(UserID) %>% mutate(Se = seq(n())) %>% ungroup()

You do not need dplyr. You can use duplicated() as following:
test=data.frame("Year" = c("2015","2015","2016","2017","2018"), "UserID" = c(1,2,1,1,3), "PurchaseValue" = c(1,5,3,3,5))
repeated<-duplicated(test$UserID)
# [1] FALSE FALSE TRUE TRUE FALSE
test$RepeatedPurchase<-ifelse(repeated==T,1,0)
test
# Year UserID PurchaseValue RepeatedPurchase
# 1 2015 1 1 0
# 2 2015 2 5 0
# 3 2016 1 3 1
# 4 2017 1 3 1
# 5 2018 3 5 0
Cheers!,

We can start by counting the number of purchases for each UserID and assign 1 when it exceeds 1
test %>% group_by(UserID) %>% mutate(RepeatedPurchase = ifelse(1:n()>1, 1, 0))
# A tibble: 5 x 4
# Groups: UserID [3]
Year UserID PurchaseValue Repeatedpurchase
<fct> <dbl> <dbl> <dbl>
1 2015 1.00 1.00 0
2 2015 2.00 5.00 0
3 2016 1.00 3.00 1.00
4 2017 1.00 3.00 1.00
5 2018 3.00 5.00 0

Here is another dplyr solution. We can group_by the UserID and PurchaseValue, and then use as.integer(n() > 1) to evaluate if the count is larger than 1.
library(dplyr)
test2 <- test %>%
group_by(UserID, PurchaseValue) %>%
mutate(RepeatedPurchase = as.integer(n() > 1)) %>%
ungroup()
test2
# # A tibble: 5 x 4
# Year UserID PurchaseValue RepeatedPurchase
# <fct> <dbl> <dbl> <int>
# 1 2015 1 1 0
# 2 2015 2 5 0
# 3 2016 1 3 1
# 4 2017 1 3 1
# 5 2018 3 5 0

Related

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' in that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!

It's best to try and avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, dplyr library is perfect and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
df2 = df1 %>%
dplyr::group_by(subject, part) %>%
dplyr::summarise(
sad_mean = mean(na.omit(sad)),
na_count = (sum(is.na(sad) / n()) * 100)
)
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25

For each subject and part you can calculate mean of sad and calculate ratio of NA value using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]

Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And for full interaction with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4

Dplyr solution using slice and group

Ciao, Here is my replicating example.
a=c(1,2,3,4,5,6)
a1=c(15,17,17,16,14,15)
a2=c(0,0,1,1,1,0)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
x1 = mydata2 %>%
group_by(id) %>%
slice(which(drop==1)[1])
x2 = mydata2 %>%
group_by(id) %>%
slice(which(drop==0)[1])
I have data "mydata2" which is tall such that every ID has many rows.
I want to make new data set "x" such that every ID has one row that is based on if they drop or not.
The first of drop1 drop2 drop3 drop4 that equals to 1, I want to take the year of that and put that in a variable dropYEAR. If none of drop1 drop2 drop3 drop4 equals to 1 I want to put the last data point in year1 year2 year3 year4 in the variable dropYEAR.
Ultimately every ID should have 1 row and I want to create 2 new columns: didDROP equals to 1 if the ID ever dropped or 0 if the ID did not ever drop. dropYEAR equals to the year of drop if didDROP equals to 1 or equals to the last reported year1 year2 year3 year4 if the ID did not ever drop. I try to do this in dplyr but this gives part of what I want only because it gets rid of ID values that equals to 0.
This is desired output, thank you to #Wimpel

First mydata2 %>% arrange(id) to understand the dataset, then using dplyr first and lastwe can pull the first year where drop==1 and the last year in case of drop never get 1 where drop is not null. Usingcase_when to check didDROP as it has a nice magic in dealing with NAs.
library(dplyr)
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
slice(1)
#Update
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year),dropY),
didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
TRUE ~ 0)) %>%
select(-dropY) %>% slice(1)
# A tibble: 6 x 9
# Groups: id [6]
id age gender Year year drop X dropYEAR didDROP
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 15 0 1 2010 1 1 2010 1
2 2 17 0 1 2010 0 2 2012 1
3 3 17 1 1 2010 NA 3 2014 0
4 4 16 1 1 2010 NA 4 2012 1
5 5 14 1 1 2010 0 5 2014 0
6 6 15 0 1 2010 NA 6 2014 0
I hope this what you're looking for.

You can sort by id, drop and year, conditionally on dropping or not:
library(dplyr)
mydata2 %>%
mutate(drop=ifelse(is.na(drop),0,drop)) %>%
arrange(id,-drop,year*(2*drop-1)) %>%
group_by(id) %>%
slice(1) %>%
select(id,age,gender,didDROP=drop,dropYEAR=year)
# A tibble: 6 x 5
# Groups: id [6]
id age gender didDROP dropYEAR
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 15 0 1 2010
2 2 17 0 1 2012
3 3 17 1 0 2014
4 4 16 1 1 2012
5 5 14 1 0 2014
6 6 15 0 0 2014

Select first positive match per ID per date in R

I have a dataframe with different observations over time. As soon as an ID has a positive value for "Match", the rows with the ID in the dates that follow has to be removed. This is an example dataframe:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 5 1
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 5 0
2018-06-08 6 1
2018-06-08 7 1
2018-06-08 8 1
Desired output:
Date ID Match
2018-06-06 5 1
2018-06-06 6 0
2018-06-07 6 0
2018-06-07 7 1
2018-06-08 6 1
2018-06-08 8 1
In other words, because ID=5 has a positive match on 2018-06-06, the rows with ID=5 are removed for the following days BUT the row with the first positive match for this ID is kept.
Reproducable example:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)
Thank you in advance

One way:
library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format
df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]
ID Date Match
1: 5 2018-06-06 1
2: 6 2018-06-06 0
3: 6 2018-06-07 0
4: 6 2018-06-08 1
5: 7 2018-06-07 1
6: 8 2018-06-08 1
We want to drop rows after the first Match == 1.
cumsum takes the cumulative sum of Match. It is zero until the first Match == 1. We want to keep the latter row and so check cumsum on the preceding row with shift.

Here's an alternative approach, where we spot the minimum row number where Match = 1 (i.e. first row with positive match) for each ID and we filter on that:
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match))
library(dplyr)
df %>%
group_by(ID) %>% # for each ID
mutate(min_row = min(row_number()[Match == 1])) %>% # get the first row where you have 1
filter(row_number() <= min_row) %>% # keep previous rows and that row
ungroup() %>% # forget the grouping
select(-min_row) # remove unnecessary column
# # A tibble: 6 x 3
# Date ID Match
# <fct> <fct> <fct>
# 1 2018-06-06 5 1
# 2 2018-06-06 6 0
# 3 2018-06-07 6 0
# 4 2018-06-07 7 1
# 5 2018-06-08 6 1
# 6 2018-06-08 8 1
You can run the code step by step to see how it works. I've created min_row column to help you understand. You can re-write the above as
df %>%
group_by(ID) %>%
filter(row_number() <= min(row_number()[Match == 1])) %>%
ungroup()

Inspired by #Frank answer's
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag = cumsum(as.numeric(Match))) %>%
filter(Match==0 & Flag==0 | Match==1 & Flag==1)
# A tibble: 6 x 4
# Groups: ID [4]
Date ID Match Flag
<chr> <chr> <chr> <dbl>
1 2018-06-06 5 1 1
2 2018-06-06 6 0 0
3 2018-06-07 6 0 0
4 2018-06-07 7 1 1
5 2018-06-08 6 1 1
6 2018-06-08 8 1 1
Data
Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- as.data.frame(cbind(Date,ID,Match),stringsAsFactors = F)

I have another way to do it with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
# You can use order(Date) if you don't want to coerce Date into date object
mutate(ord = order(Date), first_match = min(ord[Match > 0]), ind = seq_along(Date)) %>%
filter(ind <= first_match) %>%
select(Date:Match)
# A tibble: 6 x 3
# Groups: ID [4]
Date ID Match
<chr> <dbl> <dbl>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1

Here is another dplyr option:
library(dplyr)
df %>%
mutate(Date = as.Date(Date)) %>%
group_by(ID) %>%
mutate(first_match = min(Date[Match == 1])) %>%
filter((Match == 1 & Date == first_match) | (Match == 0 & Date < first_match)) %>%
ungroup() %>%
select(-first_match)
# A tibble: 6 x 3
Date ID Match
<date> <fct> <fct>
1 2018-06-06 5 1
2 2018-06-06 6 0
3 2018-06-07 6 0
4 2018-06-07 7 1
5 2018-06-08 6 1
6 2018-06-08 8 1

Reorder a single column in a dataframe within each level of another column

Probably the solution to this problem is really easy but I just can't see it. Here is my sample data frame:
df <- data.frame(id=c(1,1,1,2,2,2), value=rep(1:3,2), level=rep(letters[1:3],2))
df[6,2] <- NA
And here is the desired output that I would like to create:
df$new_value <- c(3,2,1,NA,2,1)
So the order of all columns is the same, and for the new_value column the value column order is reversed within each level of the id column. Any ideas? Thanks!

As I understood your question, it's a coincidence that your data is sorted, if you just want to reverse the order without sorting:
library(dplyr)
df %>% group_by(id) %>% mutate(new_value = rev(value)) %>% ungroup
# A tibble: 6 x 4
id value level new_value
<dbl> <int> <fctr> <int>
1 1 1 a 3
2 1 2 b 2
3 1 3 c 1
4 2 1 a NA
5 2 2 b 2
6 2 NA c 1

A slightly different approach, using the parameters in the sort function:
library(dplyr)
df %>% group_by(id) %>%
mutate(value = sort(value, decreasing=TRUE, na.last=FALSE))
Output:
# A tibble: 6 x 3
# Groups: id [2]
id value level
<dbl> <int> <fctr>
1 1.00 3 a
2 1.00 2 b
3 1.00 1 c
4 2.00 NA a
5 2.00 2 b
6 2.00 1 c
Hope this helps!

We can use order on the missing values and on the column itself
library(dplyr)
df %>%
group_by(id) %>%
mutate(new_value = value[order(!is.na(value), -value)])
# A tibble: 6 x 4
# Groups: id [2]
# id value level new_value
# <dbl> <int> <fctr> <int>
#1 1.00 1 a 3
#2 1.00 2 b 2
#3 1.00 3 c 1
#4 2.00 1 a NA
#5 2.00 2 b 2
#6 2.00 NA c 1
Or using the arrange from dplyr
df %>%
arrange(id, !is.na(value), desc(value)) %>%
transmute(new_value = value) %>%
bind_cols(df, .)
Or using base R and specify the na.last option as FALSE in order
with(df, ave(value, id, FUN = function(x) x[order(-x, na.last = FALSE)]))
#[1] 3 2 1 NA 2 1

Calculate relative changes in a time series with respect to a baseline by group. NA if no baseline value was measured

I'd like to calculate relative changes of measured variables in a data.frame by group with dplyr.
The changes are with respect to a first baseline value at time==0.
I can easily do this in the following example:
# with this easy example it works
df.easy <- data.frame( id =c(1,1,1,2,2,2)
,time=c(0,1,2,0,1,2)
,meas=c(5,6,9,4,5,6))
df.easy %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative =
meas/meas[time==0])
# Source: local data frame [6 x 4]
# Groups: id [2]
#
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1.00
# 2 1 1 6 1.20
# 3 1 2 9 1.80
# 4 2 0 4 1.00
# 5 2 1 5 1.25
# 6 2 2 6 1.50
However, when there are id's with no measuremnt at time==0, this doesn't work.
A similar question is this, but I'd like to get an NA as a result instead of simply taking the first occurence as baseline.
# how to output NA in case there are id's with no measurement at time==0?
df <- data.frame( id =c(1,1,1,2,2,2,3,3)
,time=c(0,1,2,0,1,2,1,2)
,meas=c(5,6,9,4,5,6,5,6))
# same approach now gives an error:
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = meas/meas[time==0])
# Error in mutate_impl(.data, dots) :
# incompatible size (0), expecting 2 (the group size) or 1
Let's try to return NA in case no measurement at time==0 was taken, using ifelse
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas/meas[time==0], NA) )
# Source: local data frame [8 x 4]
# Groups: id [3]
#
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1
# 2 1 1 6 1
# 3 1 2 9 1
# 4 2 0 4 1
# 5 2 1 5 1
# 6 2 2 6 1
# 7 3 1 5 NA
# 8 3 2 6 NA>
Wait, why is above the relative measurement 1?
identical(
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas, NA) ),
df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas[time==0], NA) )
)
# TRUE
It seems that the ifelse prevents meas to pick the current line, but selects always the subset where time==0.
How can I calculate relative changes when there are IDs with no baseline measurement?

Your issue was in the ifelse(). According to the ifelse documentation it returns "A vector of the same length...as test". Since any(time==0) is of length 1 for each group (TRUE or FALSE) only the first observation of the meas / meas[time==0] was being selected. This was then repeated to fill each group.
To fix this all I did was rep the any() to be the length of the group. I believe this should work:
df %>% dplyr::group_by(id) %>%
dplyr::mutate(meas.relative = ifelse(rep(any(time==0),times = n()), meas/meas[time==0], NA) )
# id time meas meas.relative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 5 1.00
# 2 1 1 6 1.20
# 3 1 2 9 1.80
# 4 2 0 4 1.00
# 5 2 1 5 1.25
# 6 2 2 6 1.50
# 7 3 1 5 NA
# 8 3 2 6 NA
To see how this was working incorrectly in your case try:
ifelse(TRUE,c(1,2,3),NA)
#[1] 1
Edit: A data.table solution with the same concept:
as.data.table(df)[, meas.rel := ifelse(rep(any(time==0), .N), meas/meas[time==0], NA_real_)
,by=id]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Identify repeats in r - r

Related

Determine percentage of rows with missing values in a dataframe in R

Dplyr solution using slice and group

Select first positive match per ID per date in R

Reorder a single column in a dataframe within each level of another column

Calculate relative changes in a time series with respect to a baseline by group. NA if no baseline value was measured

Categories

Resources