Group dataframe rows by creating a unique ID column based on the amount of time passed between entries and variable values - r

I'm trying to group the rows of my dataframe into "courses": runs where the same variables appear at regular date intervals. When there is a gap in the time frequency, or when one of the variables changes, I would like to assign a new course ID.
To give an example, my data looks something like this:
Date Name Item
1 2018-06-02 Johan Apple
2 2018-07-05 Johan Apple
3 2018-08-02 Johan Apple
4 2019-04-15 Johan Apple
5 2019-05-15 Johan Apple
6 2019-05-30 Samantha Orange
7 2019-06-12 Samantha Orange
8 2019-06-27 Samantha Orange
9 2018-02-15 Mary Lemon
10 2018-04-10 Mary Lemon
11 2018-06-12 Mary Lemon
12 2018-08-13 Mary Lime
13 2018-08-27 Mary Lime
14 2017-03-09 George Kiwi
Each different combination of Name and Item should generate a new course ID.
However (the tricky part), if there is a significant time gap between two transactions where the other variables are constant, defined as either more than 6 months or more than three times the average interval up to that date for that specific combination of Item and Name, then it should be given a new CourseID.
In my example:
Because Johan had a break after August 2018, his transactions after that should get a new CourseID. Ideally the interval used to check for future breaks would then be based on the average within this new group.
Samantha is buying oranges on a biweekly basis with no significant gap, so all her transactions get one CourseID.
Mary is buying lemons at a regular interval but then switches to buying limes at a regular interval, so these have two CourseIDs.
George just bought the one Kiwi, so a single CourseID.
Code to reproduce:
df1 <- data.frame(
  Date = as.Date(c("2018-06-02", "2018-07-05", "2018-08-02", "2019-04-15",
                   "2019-05-15", "2019-05-30", "2019-06-12", "2019-06-27",
                   "2018-02-15", "2018-04-10", "2018-06-12", "2018-08-13",
                   "2018-08-27", "2017-03-09")),
  Name = c(rep("Johan", 5), rep("Samantha", 3), rep("Mary", 5), "George"),
  Item = c(rep("Apple", 5), rep("Orange", 3), rep("Lemon", 3), rep("Lime", 2), "Kiwi")
)
I'd like to create an additional column with a unique identifier for each course, e.g. generated with stringi or similar.
Ideally the output would look something like this:
Date Name Item CourseID
1 2018-06-02 Johan Apple q3J
2 2018-07-05 Johan Apple q3J
3 2018-08-02 Johan Apple q3J
4 2019-04-15 Johan Apple f8j
5 2019-05-15 Johan Apple f8j
6 2019-05-30 Samantha Orange p8U
7 2019-06-12 Samantha Orange p8U
8 2019-06-27 Samantha Orange p8U
9 2018-02-15 Mary Lemon wi9
10 2018-04-10 Mary Lemon wi9
11 2018-06-12 Mary Lemon wi9
12 2018-08-13 Mary Lime q8U
13 2018-08-27 Mary Lime q8U
14 2017-03-09 George Kiwi jJ0
I've tried going about this using max/min on the date variable; however, I'm stumped when it comes to identifying a break based on the previous purchasing pattern.
There may be a package I don't know of that has something for this, but I've been working with the tidyverse so far.

Here's a dplyr approach that calculates the gap and the rolling average gap within each Name/Item group, flags large gaps, and assigns a new group for each large gap or change in Name or Item.
df1 %>%
  group_by(Name, Item) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(purch_num = row_number(),
         time_since_first = Date - first(Date),
         # a distant default date makes the first row of every Name/Item
         # group look like a huge gap, so each combination starts a new course
         gap = Date - lag(Date, default = as.Date("1900-01-01")),
         # average interval between purchases so far in this group
         avg_gap = time_since_first / (purch_num - 1),
         new_grp_flag = gap > 180 | gap > 3 * avg_gap) %>%
  ungroup() %>%
  mutate(group = cumsum(new_grp_flag))
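To get the short random-string CourseIDs shown in the desired output rather than integer group numbers, stringi's stri_rand_strings() can map each group number to an ID. A minimal sketch, where the group vector stands in for the cumsum() result above:

```r
library(stringi)

# Integer course numbers, e.g. the cumsum(new_grp_flag) result above
group <- c(1, 1, 1, 2, 2, 3)

set.seed(42)                                        # reproducible IDs
ids <- stri_rand_strings(length(unique(group)), 3)  # one 3-char ID per course
CourseID <- ids[group]
```

Rows belonging to the same course share an ID, so this slots in as the final column of the grouped result.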

Related

Replacing values in one table from a corresponding key in another table by specific column

I am processing a large dataset from a questionnaire that contains coded responses in some but not all columns. I would like to replace the coded responses with actual values. The key/dictionary is stored in another database. The complicating factor is that different questions (stored as columns in the original dataset) use the same codes (typically numeric), but a code's meaning depends on the column (question).
How can I replace the coded values in the original dataset with different values from a corresponding key stored in the dictionary table, doing so by specific column name (also stored in the dictionary table)?
Below is an example of the original dataset and the dictionary table, as well as desired result.
original <- data.frame(
name = c('Jane','Mary','John', 'Billy'),
home = c(1,3,4,2),
car = c('b','b','a','b'),
shirt = c(3,2,1,1),
shoes = c('Black','Black','Black','Brown')
)
keymap <- data.frame(
column_name=c('home','home','home','home','car','car','shirt','shirt','shirt'),
value_old=c('1','2','3','4','a','b','1','2','3'),
value_new=c('Single family','Duplex','Condo','Apartment','Sedan','SUV','White','Red','Blue')
)
result <- data.frame(
name = c('Jane','Mary','John', 'Billy'),
home = c('Single family','Condo','Apartment','Duplex'),
car = c('SUV','SUV','Sedan','SUV'),
shirt = c('Blue','Red','White','White'),
shoes = c('Black','Black','Black','Brown')
)
> original
name home car shirt shoes
1 Jane 1 b 3 Black
2 Mary 3 b 2 Black
3 John 4 a 1 Black
4 Billy 2 b 1 Brown
> keymap
column_name value_old value_new
1 home 1 Single family
2 home 2 Duplex
3 home 3 Condo
4 home 4 Apartment
5 car a Sedan
6 car b SUV
7 shirt 1 White
8 shirt 2 Red
9 shirt 3 Blue
> result
name home car shirt shoes
1 Jane Single family SUV Blue Black
2 Mary Condo SUV Red Black
3 John Apartment Sedan White Black
4 Billy Duplex SUV White Brown
I have tried different approaches using dplyr but have not gotten far as I do not have a robust understanding of the mutate/join syntax.
We may loop across the unique values of keymap's column_name in original, subset keymap to the rows matching the current column (cur_column()), select columns 2 and 3, deframe to a named vector, and index that vector with the column's values to do the replacement:
library(dplyr)
library(tibble)
original %>%
  mutate(across(all_of(unique(keymap$column_name)), ~
    (keymap %>%
       filter(column_name == cur_column()) %>%
       select(-column_name) %>%
       deframe)[as.character(.x)]))
Output:
name home car shirt shoes
1 Jane Single family SUV Blue Black
2 Mary Condo SUV Red Black
3 John Apartment Sedan White Black
4 Billy Duplex SUV White Brown
Or an approach in base R
lst1 <- split(with(keymap, setNames(value_new, value_old)), keymap$column_name)
original[names(lst1)] <- Map(\(x, y) y[as.character(x)],
original[names(lst1)], lst1)
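Since the question asks about the mutate/join syntax specifically, here is a join-style sketch (assuming tidyr is available): reshape to long form, join against keymap on the column name and old value, then reshape back. Columns absent from keymap (shoes) pass through unchanged via coalesce().

```r
library(dplyr)
library(tidyr)

original <- data.frame(
  name = c('Jane','Mary','John','Billy'),
  home = c(1,3,4,2),
  car = c('b','b','a','b'),
  shirt = c(3,2,1,1),
  shoes = c('Black','Black','Black','Brown')
)
keymap <- data.frame(
  column_name = c('home','home','home','home','car','car','shirt','shirt','shirt'),
  value_old = c('1','2','3','4','a','b','1','2','3'),
  value_new = c('Single family','Duplex','Condo','Apartment',
                'Sedan','SUV','White','Red','Blue')
)

# Long form -> join on (column_name, value_old) -> back to wide form
result <- original %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(-name, names_to = "column_name", values_to = "value_old") %>%
  left_join(keymap, by = c("column_name", "value_old")) %>%
  mutate(value = coalesce(value_new, value_old)) %>%
  select(name, column_name, value) %>%
  pivot_wider(names_from = column_name, values_from = value)
```

Because the join key includes the column name, the same code can map to different labels in different columns, which is exactly the complication described above.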
The code below uses factor() to replace the values in one column with data from another dataframe, here keymap. Note, however, that because keymap$value_old repeats the same codes for different columns, factor() matches against the wrong entries here; the shirt values in the output below come out as home labels.
library(tidyverse)
original %>% mutate(home = factor(home, keymap$value_old, keymap$value_new),
                    car = factor(car, keymap$value_old, keymap$value_new),
                    shirt = factor(shirt, keymap$value_old, keymap$value_new))
Created on 2023-02-04 with reprex v2.0.2
name home car shirt shoes
1 Jane Single family SUV Condo Black
2 Mary Condo SUV Duplex Black
3 John Apartment Sedan Single family Black
4 Billy Duplex SUV Single family Brown

imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code works, but it only pulls the first matching value from df2 and ignores the rest.
Another solution using just base R:
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles
I don't think you even need df1 in this case; you can build it entirely from df2:
df1 <- df2 %>%
  group_by(V1) %>%
  summarise(V2 = paste0(V2, collapse = ", "))
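To actually fill the NAs in the existing df1, as the original match() attempt tried to, one base R sketch collapses df2's values per key with toString and then matches back (df1/df2 recreated here from the question):

```r
# Recreate the question's data
df1 <- data.frame(V1 = c("apple", "cheese", "butter"), V2 = NA)
df2 <- data.frame(
  V1 = rep(c("apple", "cheese", "butter"), 3),
  V2 = c("jacks", "whiz", "scotch",
         "turnover", "sliders", "chicken",
         "sauce", "doodles", "milk")
)

# Collapse all of df2's values per key into one comma-separated string,
# then fill df1 by matching on V1
collapsed <- aggregate(V2 ~ V1, data = df2, FUN = toString)
df1$V2 <- collapsed$V2[match(df1$V1, collapsed$V1)]
```

Unlike the single match() in the question, the aggregate step gathers every value per key before the lookup, so nothing is dropped.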

R: how to aggregate rows by count

This is my data frame
ID = c(1,2,3,4,5,6,7,8,9,10,11,12)
favFruit = c('apple','lemon','pear',
             'apple','apple','pear',
             'apple','lemon','pear',
             'pear','pear','pear')
surveyDate = c('1/1/2005','1/1/2005','1/1/2005',
               '2/1/2005','2/1/2005','2/1/2005',
               '3/1/2005','3/1/2005','3/1/2005',
               '4/1/2005','4/1/2005','4/1/2005')
df <- data.frame(ID, favFruit, surveyDate)
I need to aggregate it so I can plot a line graph in R of the count of favFruit by date, split by favFruit, but I am unable to create the aggregate table. My data has 45,000 rows, so a manual solution is not possible. The desired output:
surveyYear favFruit count
1/1/2005 apple 1
1/1/2005 lemon 1
1/1/2005 pear 1
2/1/2005 apple 2
2/1/2005 lemon 0
2/1/2005 pear 1
... etc
I tried this but R printed an error
df2 <- aggregate(df, favFruit, FUN = sum)
and I tried this, another error
df2 <- aggregate(df, date ~ favFruit, sum)
I checked for solutions online, but their data generally included a column of quantities, which I don't have, and the solutions were overly complex. Is there an easy way to do this? Thanks in advance. Thank you to whoever suggested the link as a possible duplicate, but it only covers date and number of rows; my question needs the number of rows by date and favFruit (one more column).
Update:
Ronak Shah's solution worked. Thanx!
The solution provided by Ronak is very good. In case you prefer to keep the zero counts in your dataframe, you could use the table function:
data.frame(with(df, table(favFruit, surveyDate)))
Output:
favFruit surveyDate Freq
1 apple 1/1/2005 1
2 lemon 1/1/2005 1
3 pear 1/1/2005 1
4 apple 2/1/2005 2
5 lemon 2/1/2005 0
6 pear 2/1/2005 1
7 apple 3/1/2005 1
8 lemon 3/1/2005 1
9 pear 3/1/2005 1
10 apple 4/1/2005 0
11 lemon 4/1/2005 0
12 pear 4/1/2005 3
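For completeness, a count-based dplyr sketch (not necessarily the referenced accepted answer) that also keeps the zero rows, using tidyr's complete() to restore the missing date/fruit combinations:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = 1:12,
  favFruit = c('apple','lemon','pear', 'apple','apple','pear',
               'apple','lemon','pear', 'pear','pear','pear'),
  surveyDate = rep(c('1/1/2005','2/1/2005','3/1/2005','4/1/2005'), each = 3)
)

# count() gives one row per observed combination; complete() adds back
# the zero-count combinations so every date has all three fruits
counts <- df %>%
  count(surveyDate, favFruit, name = "count") %>%
  complete(surveyDate, favFruit, fill = list(count = 0))
```

The resulting 12-row table is ready to feed straight into a grouped line plot.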

Match and count total words from an external list with text strings (tweets) in r

I am attempting to conduct emotional sentiment analysis of a large corpus of Tweets (91k) with an external list of emotionally charged words (from the NRC Emotion Lexicon). To do this, I want to count and sum the total number of times any word from the words-of-joy list appears in each Tweet. Ideally this would be a partial match of the word rather than an exact match. I would like the total to appear in a new column of the df.
The df and column name for the Tweets are Tweets_with_Emotions$full_text and the list is Words_of_joy$word.
Example 1
> head(Tweets_with_Emotions, n=10)
ID Date full_text
1 58150 2012-09-12 I love an excellent cookie
2 12357 2012-09-28 Oranges are delicious and excellent
3 50788 2012-10-04 Eager to visit Disneyland
4 66038 2012-10-11 I wish my boyfriend would propose already
5 18119 2012-10-11 Love Maggie Smith
6 48349 2012-10-14 The movie was excellent, loved it.
7 23328 2012-10-16 Pineapples are so delicious and excellent
8 66038 2012-10-26 Eager to see the Champions Cup next week
9 32717 2012-10-28 Hating this show
10 11345 2012-11-08 Eager for the food
Example 2
> head(words_of_joy, n=5)
word
1 eager
2 champion
3 delicious
4 excellent
5 love
Desired output
> head(New_df, n=10)
ID Date full_text joy_count
1 58150 2012-09-12 I love an excellent cookie 2
2 12357 2012-09-28 Oranges are delicious and excellent 2
3 50788 2012-10-04 Eager to visit Disneyland 1
4 66038 2012-10-11 I wish my boyfriend would propose already 0
5 18119 2012-10-11 Love Maggie Smith 1
6 48349 2012-10-14 The movie was excellent, loved it. 2
7 23328 2012-10-16 Pineapples are so delicious and excellent 2
8 66038 2012-10-26 Eager to see the Champions Cup next week 2
9 32717 2012-10-28 Hating this show 0
10 11345 2012-11-08 Eager for the food 1
I've managed to run the emotion list against the Tweets so that it returns a yes or no as to whether any word from the emotion list is contained in each Tweet (no = 0, yes = 1), but I cannot figure out how to count the matches and return the totals in a new column:
new_df <- Tweets_with_Emotions[stringr::str_detect(Tweets_with_Emotions$full_text, paste(Words_of_negative$words,collapse = '|')),]
I'm extremely new to R (and Stack Overflow!) and have been struggling to figure this out for a few days, so any help would be greatly appreciated!
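One sketch of the counting step uses stringr::str_count() with a single alternation pattern built from the word list. Matching is partial and case-insensitive, so "loved" and "Champions" still count, as in the desired output; the small tweets and joy_words objects below are stand-ins for the real data:

```r
library(stringr)

tweets <- data.frame(
  full_text = c("I love an excellent cookie",
                "Eager to see the Champions Cup next week",
                "Hating this show"),
  stringsAsFactors = FALSE
)
joy_words <- c("eager", "champion", "delicious", "excellent", "love")

# One regex alternation over all joy words; str_count() tallies every
# (partial, case-insensitive) match per tweet
pattern <- regex(paste(joy_words, collapse = "|"), ignore_case = TRUE)
tweets$joy_count <- str_count(tweets$full_text, pattern)
```

Swapping str_detect() for str_count() is the key change from the filtering attempt above: str_detect() only says whether any word matched, while str_count() returns how many matches occurred.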

R - Create new column with cumulative means by group

I have the following data frame, listing the spends for each category for each day
Dataframe: actualSpends
Date Category Spend ($)
2017/01/01 Apple 10
2017/01/02 Apple 12
2017/01/03 Apple 8
2017/01/01 Banana 13
2017/01/02 Banana 15
2017/01/03 Banana 7
I want to create a new data frame that lists the average amount spent on each category for each day of the month (e.g. on the 3rd of the month, the average over all the days of the month that have passed so far, running from the 1st to the 31st of each month).
EDIT:
So the output should look something like..
Date Category AvgSpend ($)
2017/01/01 Apple 10
2017/01/02 Apple 11
2017/01/03 Apple 10
2017/01/01 Banana 13
2017/01/02 Banana 14
2017/01/03 Banana 11.7
For each category, the average spend on each day is an average over all the days passed so far: the 1st is the average of the 1st; the 2nd is the average of the 1st and 2nd; the 3rd is the average of the 1st, 2nd and 3rd. Is there a workaround for this?
We can use the cummean function from the dplyr package to calculate cumulative averages for each category, then melt the result into a new column:
library(dplyr)
library(reshape2)
unq <- unique(df$Category)
df$AvgSpend <- melt(
  sapply(seq_along(unq),
         function(i) cummean(df$Spending[which(df$Category == unq[i])])))$value
Output:
Date Category Spending AvgSpend
1 2017/01/01 Apple 10 10.00000
2 2017/01/02 Apple 12 11.00000
3 2017/01/03 Apple 8 10.00000
4 2017/01/01 Banana 13 13.00000
5 2017/01/02 Banana 15 14.00000
6 2017/01/03 Banana 7 11.66667
Sample data:
df <- data.frame(Date=c("2017/01/01","2017/01/02","2017/01/03",
"2017/01/01","2017/01/02","2017/01/03"),
Category=c("Apple","Apple","Apple",
"Banana","Banana","Banana"),
Spending=c(10,12,8,13,15,7))
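The same cumulative averages can also be computed more directly with a grouped mutate, with no reshaping step; a sketch on the sample data above:

```r
library(dplyr)

df <- data.frame(
  Date = c("2017/01/01","2017/01/02","2017/01/03",
           "2017/01/01","2017/01/02","2017/01/03"),
  Category = rep(c("Apple","Banana"), each = 3),
  Spending = c(10,12,8,13,15,7)
)

# cummean() inside a grouped mutate gives the running average per
# category while keeping the original row order
out <- df %>%
  group_by(Category) %>%
  mutate(AvgSpend = cummean(Spending)) %>%
  ungroup()
```

Because mutate() keeps one output row per input row, this avoids the sapply/melt bookkeeping and also works when groups have unequal sizes.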
Here is a tidyverse option
library(tidyverse)
df %>%
group_by(Date, Category) %>%
summarise(Spending = mean(Spending, na.rm = TRUE))
# A tibble: 4 x 3
# Groups: Date [?]
# Date Category Spending
# <fctr> <fctr> <dbl>
#1 2017/01/01 Apple 11
#2 2017/01/02 Banana 14
#3 2017/01/03 Apple 8
#4 2017/01/03 Banana 7
You can use the 'sqldf' package (https://cran.r-project.org/web/packages/sqldf/sqldf.pdf):
install.packages("sqldf")
library(sqldf)
actualSpends <- data.frame(
  Date = c('2017/01/01','2017/01/02','2017/01/03',
           '2017/01/01','2017/01/02','2017/01/03'),
  Category = c('Apple','Apple','Apple','Banana','Banana','Banana'),
  Spend = c(10,12,8,13,15,7))
sqldf("select Date, Category, sum(Spend) from actualSpends
       group by Date, Category")
