Count number of rows within 10 minutes of each other in R

I have a table of transactions with their date-times. I'd like to count the number of clusters of visits that fall within 10 minutes of each other.
input=
ID visitTime
1 11/10/2017 15:01
1 11/10/2017 15:02
1 11/10/2017 15:19
1 11/10/2017 15:21
1 11/10/2017 15:25
1 11/11/2017 15:32
1 11/11/2017 15:39
1 11/11/2017 15:41
Here, there is a cluster starting on 11/10/2017 15:01 with 2 adjacent visits, and one starting on 11/10/2017 15:19 with 3 visits (2 clusters on the date of 11/10/2017). There is another cluster on 11/11/2017 15:32 with 3 visits. This gives the table below.
output =
ID Date Cluster_count Clusters_with_3ormore_visits
1 11/10/2017 2 1
1 11/11/2017 1 1
What I did:
input %>% group_by(ID) %>% arrange(visitTime) %>%
  mutate(nextvisit = lead(visitTime),
         gapTime = as.numeric(interval(visitTime, nextvisit), 'minutes'),
         repeated = ifelse(gapTime <= 10, 1, 0))
This can show where a sequence of visits starts and ends, but it doesn't give me a key to separate the clusters and group by.
Appreciate any hints/ideas.

In general, cumsum solves these problems when you have a column that flags whether a data point starts a new group (i.e. sits in a different group than the previous row).
I made a few small changes, namely lastvisit (via lag) instead of nextvisit, and base R's difftime instead of lubridate's interval. Note that difftime's units must be passed by name, since its third positional argument is tz.
input %>% group_by(ID) %>% arrange(visitTime) %>%
  mutate(lastvisit = lag(visitTime),
         gapTime = as.numeric(difftime(visitTime, lastvisit, units = 'mins')),
         newCluster = is.na(gapTime) | gapTime > 10,
         cluster = cumsum(newCluster))
(is.na(gapTime) only handles the first row of each ID, for which gapTime isn't defined.)
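From there, getting the per-date summary the question asks for takes two more grouped summaries. A sketch, assuming visitTime is already parsed as POSIXct (e.g. with lubridate::mdy_hm), written self-contained so it recomputes the cluster key:
library(dplyr)

input %>%
  group_by(ID) %>%
  arrange(visitTime, .by_group = TRUE) %>%
  mutate(newCluster = is.na(lag(visitTime)) |
           as.numeric(difftime(visitTime, lag(visitTime), units = "mins")) > 10,
         cluster = cumsum(newCluster)) %>%
  # one row per cluster, dated by its first visit
  group_by(ID, cluster) %>%
  summarise(Date = as.Date(first(visitTime)), visits = n(), .groups = "drop") %>%
  # then count clusters per ID and date
  group_by(ID, Date) %>%
  summarise(Cluster_count = n(),
            Clusters_with_3ormore_visits = sum(visits >= 3),
            .groups = "drop")
On the example data this gives 2 clusters (1 with 3 or more visits) for 11/10/2017 and 1 cluster (also 3 or more visits) for 11/11/2017, matching the desired output.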

How to do aggregate sum by time range in R?

I have a dataframe as below:
df
Cust_name time freq
Andrew 0 4
Dillain 1 2
Alma 2 3
Andrew 1 4
Kiko 2 1
Sarah 2 8
Sarah 0 3
I want to calculate the sum of freq over a given time range for each cust_name. Example: if I select the time range 0 to 2 for Andrew, it gives me the sum of freq 4 + 4 = 8. And for Sarah, it gives me 8 + 3 = 11. I have tried the following just to select the time range, but I do not know how to do the rest, as I am very new to R:
df[(df$time>=0 & df$time<=2),]
You can do this with dplyr.
To make your code reproducible, you should include the code that creates your dataframe in your post; copying and pasting everything by hand is time consuming.
library(dplyr)
df <- data.frame(
  cust_name = c('Andrew', 'Dillain', 'Alma', 'Andrew', 'Kiko', 'Sarah', 'Sarah'),
  time = c(0, 1, 2, 1, 2, 2, 0),
  freq = c(4, 2, 3, 4, 1, 8, 3)
)

df %>%
  filter(time >= 0, time <= 2) %>%
  group_by(cust_name) %>%
  summarise(sum_freq = sum(freq))
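If the time range should be selectable rather than hard-coded, the same pipeline can be wrapped in a small helper; sum_freq_range is an illustrative name, not from the question:
# hypothetical helper: sum freq per customer over a chosen time window
sum_freq_range <- function(data, t_min, t_max) {
  data %>%
    filter(time >= t_min, time <= t_max) %>%
    group_by(cust_name) %>%
    summarise(sum_freq = sum(freq))
}

sum_freq_range(df, 0, 2)
# Andrew: 4 + 4 = 8; Sarah: 8 + 3 = 11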

Find the GROWTH RATE of FaceValue for 5 days in percentage

I'm trying to add another column holding the growth rate of the FaceValue column per day, in percent.
Day FaceValue
1   ₦72,077,680.94
2   ₦112,763,770.99
3   ₦118,146,250.01
4   ₦74,446,035.80
5   ₦77,026,183.71
Here is the code, but it's not working:
value_performance %>%
  mutate(change = (value_performance$FaceValue - lag(FaceValue, 5)) / lag(FaceValue, 5) * 100)
Thanks
Three problems:
FaceValue appears to be a string, not numeric; try fixing that first with as.numeric;
(Almost) never use value_performance$ inside a dplyr-pipe verb. ("Almost" because there are rare times when you need it. Otherwise you are at best being inefficient, and possibly using incorrect values, depending on what happens in the pipe before its use.); and
You say "per day" but you are lagging by 5. While I'm assuming your real data has more than 5 rows, you are still not calculating a day-over-day change.
Try this.
value_performance %>%
mutate(
FaceValue = as.numeric(gsub("[^0-9.]", "", FaceValue)),
change = (FaceValue - lag(FaceValue))/lag(FaceValue)
)
# Day FaceValue change
# 1 1 7.21e+07 NA
# 2 2 1.13e+08 0.5645
# 3 3 1.18e+08 0.0477
# 4 4 7.44e+07 -0.3699
# 5 5 7.70e+07 0.0347
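As an aside, if readr is available, parse_number handles the currency symbol and the thousands separators in one step; a sketch of the same pipeline, assuming FaceValue comes in as text:
library(dplyr)
library(readr)

value_performance %>%
  mutate(FaceValue = parse_number(as.character(FaceValue)),  # strips the naira sign and commas
         change = (FaceValue - lag(FaceValue)) / lag(FaceValue) * 100)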
With similar data:
Day <- c(1,2,3,4,5)
FaceValue <- c(72077680.94, 112763770.99, 118146250.01, 74446035.80, 77026183.71)
df <- data.frame(Day, FaceValue)
df
df %>%
  mutate(change = 100 * (FaceValue / lag(FaceValue) - 1))
Results in:
Day FaceValue change
1 1 72077681 NA
2 2 112763771 56.447557
3 3 118146250 4.773234
4 4 74446036 -36.988236
5 5 77026184 3.465796
Not sure what is wrong. Maybe check your data classes and make sure FaceValue is numeric.

R data frame: Change value in 1 column depending on value in another

I have a data frame called nurse. At the moment it contains several columns, but only one (nurse$word) is relevant here. I want to create a new column, nurse$w.frequency, that looks at the words in the nurse$word column and, whenever it finds a specified word, sets the corresponding nurse$w.frequency value to a specified integer.
nurse <- read.csv(...)
file word w.frequency
1 determining
2 journey
3 journey
4 serving
5 work
6 journey
... ...
The word frequencies for determining and journey, for instance, are 1590 and 4650 respectively. So it should look like the following:
file word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
... ...
I have tried it with an ifelse statement (below), which seems to work; however, every time I try to change the actual word and frequency it overwrites the results from before.
nurse$w.frequency <- ifelse(nurse$word == "determining", nurse$w.frequency[nurse$word["determining"]] <- 1590, "")
You could first initialise an empty column
nurse$w.frequency <- NA
then populate it with the data you want
nurse$w.frequency[nurse$word == "determining"] <- 1590
nurse$w.frequency[nurse$word == "journey"] <- 4650
Using dplyr:
nurse %>%
  mutate(w.frequency = case_when(
    word == "determining" ~ "1590",
    word == "journey" ~ "4650",
    TRUE ~ ""
  ))
Gives us:
word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
Data:
nurse <- data.frame(word = c("determining", "journey", "journey", "serving", "work", "journey"))
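One caveat with the case_when version: all branches must return the same type, so combining numbers with "" would error, which is why the frequencies are strings and w.frequency comes out as character. If the column should stay numeric, NA is the usual filler; a sketch:
library(dplyr)

nurse %>%
  mutate(w.frequency = case_when(
    word == "determining" ~ 1590,
    word == "journey" ~ 4650,
    TRUE ~ NA_real_  # NA instead of "" keeps the column numeric
  ))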

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe, price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing weeks for each Car name in price1, capture the row corresponding to that Car name in price2, and insert it into price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my searches on SO lead to answers about handling missing values (NAs), which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, 540 rows in total.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have 2 dataframes and you want to fill the missing values in price1 with the corresponding values from price2 for the matching car name?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the Price value in this row is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace it with the corresponding values in price2
    j <- match(price1$Name[i], price2$Name)
    price1$Price[i] <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks, use complete() from tidyr, or you can even fudge it and right_join a table that you purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column, as in the sketch below.
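Putting that together, a sketch of the complete()-based approach, assuming price1 and price2 look like the tables above and the missing weeks should be filled with the price2 averages:
library(dplyr)
library(tidyr)

price1 %>%
  complete(Name, Week = 1:54) %>%        # one row per car per week, NAs where missing
  left_join(price2, by = "Name") %>%     # attach each car's averages
  mutate(Price = coalesce(Price, AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)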

R-convert transaction format dataset to basket format for sequence mining

ORIGINAL TABLE
CELL NUMBER  ACTIVITY  TIME
001          call a    12.23
002          call b    01.00
002          call d    01.09
001          call b    12.25
003          call a    12.23
002          call a    02.07
003          call b    12.25
REQUIRED-
To mine the highest-occurring sequences of ACTIVITY from a dataset of 400,000 rows.
ABOVE EXAMPLE SHOULD SHOW
[call a-12.23,call b-12.25] frequency 2
[call b-01.00,call d-01.09,call a-02.07] frequency 1
I'm aware that this can be achieved using arulesSequences. What transformations do I need to carry out on the dataset, and how, so as to use the arulesSequences package?
Current db format: transactions with 3 columns, like the sample above.
df<-read.table(header=T,sep="|",text="CELL NUMBER|ACTIVITY|TIME
001|call a|12.23
002|call b|01.00
002|call d|01.09
001|call b|12.25
003|call a|12.23
002|call a|02.07
003|call b|12.25")
require(plyr) # for count() function
freqs<-count(df[,-1]) # [,-1] to exclude the CELL NUMBER column from the group
freqs[order(-freqs$freq),]
ACTIVITY TIME freq
2 call a 12.23 2
4 call b 12.25 2
1 call a 2.07 1
3 call b 1.00 1
5 call d 1.09 1
EDIT - Updated like this:
unique(ddply(freqs, .(-freq), summarise,
             calls = paste0("[", paste0(paste0(ACTIVITY, "-", TIME), collapse = ","), "]",
                            "frequency", freq)))
#  -freq                                       calls
#1    -2        [call a-12.23,call b-12.25]frequency2
#3    -1 [call a-2.07,call b-1,call d-1.09]frequency1
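For the arulesSequences part of the question: cspade() expects transactions annotated with a sequenceID and an eventID. A sketch of the usual round-trip through read_baskets(), assuming df from above; the underscore munging of item labels and the support threshold are illustrative choices:
library(dplyr)
library(arulesSequences)

# one item per event, e.g. "call_a-12.23"; spaces would break the
# whitespace-separated basket format, hence the underscores
events <- df %>%
  mutate(item = paste0(gsub(" ", "_", ACTIVITY), "-", TIME)) %>%
  group_by(CELL.NUMBER) %>%
  arrange(TIME, .by_group = TRUE) %>%
  mutate(eventID = row_number()) %>%   # event order within each cell number
  ungroup() %>%
  transmute(sequenceID = as.integer(CELL.NUMBER), eventID, SIZE = 1L, item) %>%
  arrange(sequenceID, eventID)

# basket format on disk: sequenceID eventID SIZE item
tmp <- tempfile()
write.table(events, tmp, row.names = FALSE, col.names = FALSE, quote = FALSE)
trans <- read_baskets(tmp, info = c("sequenceID", "eventID", "SIZE"))

# mine frequent sequences and inspect them with their supports
seqs <- cspade(trans, parameter = list(support = 0.3))
as(seqs, "data.frame")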