I'm trying to add another column that holds the day-over-day growth rate of the FaceValue column, in percent.
Day FaceValue
1   ₦72,077,680.94
2   ₦112,763,770.99
3   ₦118,146,250.01
4   ₦74,446,035.80
5   ₦77,026,183.71
Here is the code, but it's not working:
value_performance%>%
mutate(change=(value_performance$FaceValue-lag(FaceValue,5))/lag(FaceValue,5)*100)
Thanks
Three problems:
1. FaceValue appears to be a string, not numeric; fix that first with as.numeric (after stripping the currency symbol and commas);
2. (Almost) never use value_performance$ inside a dplyr-pipe verb. ("Almost" because there are rare times when you need it. Otherwise you are at best being inefficient, and possibly using incorrect values, depending on what happens earlier in the pipe.); and
3. You say "per day" but you are lagging by 5. While I'm assuming your real data has more than 5 rows, you are still not calculating a by-day change.
Try this:
library(dplyr)

value_performance %>%
  mutate(
    # drop the currency symbol and thousands separators, then convert to numeric
    FaceValue = as.numeric(gsub("[^0-9.]", "", FaceValue)),
    # day-over-day change as a fraction; multiply by 100 if you want a percentage
    change = (FaceValue - lag(FaceValue)) / lag(FaceValue)
  )
# Day FaceValue change
# 1 1 7.21e+07 NA
# 2 2 1.13e+08 0.5645
# 3 3 1.18e+08 0.0477
# 4 4 7.44e+07 -0.3699
# 5 5 7.70e+07 0.0347
With similar data:
library(dplyr)

Day <- c(1, 2, 3, 4, 5)
FaceValue <- c(72077680.94, 112763770.99, 118146250.01, 74446035.80, 77026183.71)
df <- data.frame(Day, FaceValue)

df %>%
  mutate(change = 100 * (FaceValue / lag(FaceValue) - 1))
Results in:
Day FaceValue change
1 1 72077681 NA
2 2 112763771 56.447557
3 3 118146250 4.773234
4 4 74446036 -36.988236
5 5 77026184 3.465796
Not sure what is wrong. Maybe check your data classes and make sure FaceValue is numeric.
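For instance, a quick check, plus a fix in case FaceValue is still a character column carrying the currency symbol as in the posted data (a minimal sketch):

str(value_performance)  # check column classes; FaceValue should be num, not chr

# If it is character, strip everything except digits and the decimal point:
value_performance$FaceValue <- as.numeric(gsub("[^0-9.]", "", value_performance$FaceValue))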
I already tried my best but am still pretty much a newbie to R.
Based on roughly 500 MB of input data that currently looks like this:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days
1 2818 5829821 335511.0 1
2 20168 5829746 335265.2 3
3 25428 5830640 331534.6 0
4 27886 5832156 332003.1 3
5 28658 5830888 329727.2 3
6 28871 5829980 332071.3 7
I need to calculate a conditional sum of reviews_last30days - the condition being a specific, changing area range for each respective record, i.e. R should sum only those reviews for which the calc.latitude and calc.longitude do not deviate more than +/-500 from the latitude and longitude values in each row.
EXAMPLE:
ROW 1 has a calc.latitude 5829821 and a calc.longitude 335511.0, so R should take the sum of all reviews_last30days for which the following ranges apply: calc.latitude 5829321 to 5830321 (value of Row 1 latitude +/-500)
calc.longitude 335011.0 to 336011.0 (value of Row 1 longitude +/-500)
So my intended output would look somewhat like this in column 5:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days reviewsper1000
1 2818 5829821 335511.0 1 4
2 20168 5829746 335265.2 3 4
3 25428 5830640 331534.6 0 10
4 27886 5832156 332003.1 3 3
5 28658 5830888 331727.2 3 10
6 28871 5829980 332071.3 7 10
Hope I calculated correctly in my head, but you get the idea.
So far I have particularly struggled with the fact that my sum conditions are dynamic and "newly assigned", since the latitude and longitude bounds have to be adjusted for each record.
My current code looks like this, but it obviously doesn't work that way:
review1000 <- function(TOTALLISTINGS = NULL) {
  # tibble to return
  to_return <- TOTALLISTINGS %>%
    group_by(listing_id) %>%
    summarise(
      reviews1000 = sum(reviews_last30days[(calc.latitude >= (calc.latitude - 500) | calc.latitude <= (calc.latitude + 500))])
    )
  return(to_return)
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)
I know I would also have to add something for longitude in the code above.
Does anyone have an idea how to fix this?
Any help or hints highly appreciated & thanks in advance! :)
See whether the below code will help.
library(dplyr)  # for between()

TOTALLISTINGS$reviews1000 <- sapply(1:nrow(TOTALLISTINGS), function(r) {
  currentLATI <- TOTALLISTINGS$calc.latitude[r]
  currentLONG <- TOTALLISTINGS$calc.longitude[r]
  # sum the reviews of every listing within +/-500 of this row's coordinates
  sum(TOTALLISTINGS$reviews_last30days[
    between(TOTALLISTINGS$calc.latitude,  currentLATI - 500, currentLATI + 500) &
    between(TOTALLISTINGS$calc.longitude, currentLONG - 500, currentLONG + 500)
  ])
})
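With roughly 500 MB of data this row-by-row sapply can get slow, since it scans the whole table once per row. A non-equi join in data.table does the same aggregation in one pass; this is a sketch of an alternative, not the answer above, and the window columns (lat_lo, lat_hi, lon_lo, lon_hi) are helper names I made up:

library(data.table)

DT <- as.data.table(TOTALLISTINGS)

# one search window per listing
win <- DT[, .(listing_id,
              lat_lo = calc.latitude - 500, lat_hi = calc.latitude + 500,
              lon_lo = calc.longitude - 500, lon_hi = calc.longitude + 500)]

# for each window, sum the reviews of all listings that fall inside it
res <- DT[win,
          on = .(calc.latitude >= lat_lo, calc.latitude <= lat_hi,
                 calc.longitude >= lon_lo, calc.longitude <= lon_hi),
          .(reviews1000 = sum(reviews_last30days)), by = .EACHI]

DT[, reviews1000 := res$reviews1000]  # res rows come back in the same order as DT rows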
I have a table of transactions with their date-times. I'd like to count the number of clusters of visits within 10 minutes of each other.
input=
ID visitTime
1 11/10/2017 15:01
1 11/10/2017 15:02
1 11/10/2017 15:19
1 11/10/2017 15:21
1 11/10/2017 15:25
1 11/11/2017 15:32
1 11/11/2017 15:39
1 11/11/2017 15:41
Here, there is a cluster starting on 11/10/2017 15:01 with 2 adjacent visits, and one starting on 11/10/2017 15:19 with 3 visits (2 clusters on the date of 11/10/2017). There is another cluster on 11/11/2017 15:32 with 3 visits. Giving the table below.
output =
ID Date Cluster_count Clusters_with_3ormore_visits
1 11/10/2017 2 1
1 11/11/2017 1 1
What I did:
input %>%
  group_by(ID) %>%
  arrange(visitTime) %>%
  mutate(nextvisit = lead(visitTime),
         gapTime   = as.numeric(interval(visitTime, nextvisit), 'minutes'),
         repeated  = ifelse(gapTime <= 10, 1, 0))
This can show the start and end of a sequence of visits, but doesn't give me a key to separate them and group them by.
Appreciate any hints/ideas.
In general, cumsum typically solves these issues when you have a column that says if a specific data point is in a different group than the previous one.
I made a few small changes, namely used lastvisit instead of nextvisit and difftime instead of interval (not sure where that function comes from).
input %>%
  group_by(ID) %>%
  arrange(visitTime) %>%
  mutate(lastvisit  = lag(visitTime),
         gapTime    = as.numeric(difftime(visitTime, lastvisit, units = "mins")),
         newCluster = is.na(gapTime) | gapTime > 10,
         cluster    = cumsum(newCluster))
(is.na(gapTime) only handles the first row per ID, for which gapTime isn't defined; newCluster is TRUE exactly when a row starts a new cluster, which is why cumsum turns it into a cluster id. Note that difftime takes its unit via the units argument, spelled "mins".)
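To get from the cluster ids to the per-day counts in your desired output, one more round of grouping works; a sketch, assuming visitTime is already parsed as a date-time (e.g. with lubridate::mdy_hm):

input %>%
  group_by(ID) %>%
  arrange(visitTime) %>%
  mutate(lastvisit  = lag(visitTime),
         gapTime    = as.numeric(difftime(visitTime, lastvisit, units = "mins")),
         newCluster = is.na(gapTime) | gapTime > 10,
         cluster    = cumsum(newCluster)) %>%
  group_by(ID, Date = as.Date(visitTime), cluster) %>%
  summarise(visits = n(), .groups = "drop_last") %>%      # one row per cluster
  summarise(Cluster_count                = n(),           # clusters per ID and day
            Clusters_with_3ormore_visits = sum(visits >= 3),
            .groups = "drop")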
I am stuck with a project where I need to merge two data frames. They look something like this:
Data1
Traffic Source Registrations Hour Minute
organic 1 6 13
social 1 8 54
Data2
Email Hour2 Minute2
test@domain.com 6 13
test2@domain2.com 8 55
I have the following line of code to merge the 2 data frames:
merge.df <- merge(Data1, Data2, by.x = c( "Hour", "Minute"),
by.y = c( "Hour2", "Minute2"))
It would work great if the variable time (hours & minutes) wasn't slightly off between the two data sets. Is there a way to make the column "Minute" match with "Minute2" if it's + or - one minute off?
I thought I could create 2 new columns for data set one:
Data1
Traffic Source Registrations Hour Minute Minute_plus1 Minute_minus1
organic 1 6 13 14 12
social 1 8 54 55 53
Is it possible to merge the 2 data frames if "Minute2" matches any variable from either "Minute", "Minute_plus1", or "Minute_minus1"? Or is there a more efficient way to accomplish this merge?
For stuff like this I usually turn to SQL:
library(sqldf)
x = sqldf("
SELECT *
FROM Data1 d1 JOIN Data2 d2
ON d1.Hour = d2.Hour2
AND ABS(d1.Minute - d2.Minute2) <= 1
")
Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:
library(dplyr)
x = Data1 %>%
left_join(Data2, by = c("Hour" = "Hour2")) %>%
filter(abs(Minute - Minute2) <= 1)
though you could do the same thing with base functions.
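For instance, a minimal base-R sketch of the same join-then-filter idea:

# inner join on the hour, then keep only pairs at most one minute apart
x <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")
x <- x[abs(x$Minute - x$Minute2) <= 1, ]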
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1- Customer X is listed in the January database, but he left and is not listed in February
2- Customer X is listed in both the January and February databases
3- Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers that appear in either dataframe, together with their respective information. However, for a customer listed in both dataframes, I want to pick his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of whether I choose all, all.x, or all.y, I get the same undesired output in data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to first take only the February rows that have no match in January (an anti-join), and then merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though; perhaps you should consider #flodel's comments. Also note there are pitfalls when your original Jan data has genuine NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
I haven't tested it since there's no sample data, but if you just join on the ID column from Feb, it should keep only the customers that appear in both frames, with January's information.
#user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- merge(jan, feb, by = "id", all = TRUE)  # merge on data.tables returns a data.table
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- merge(x, y, by = "id", all = TRUE)
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL]                 # get rid of the extra columns
  setnames(join, 2:4, c("age", "city", "gender"))  # rename the columns that remain
  return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
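Scaling this to all eleven months is then just a matter of reading the files into a list first. A sketch with hypothetical file names, assuming each csv has the same id/age/city/gender layout as the sample tables above:

library(data.table)

# Hypothetical file names; adjust to however your monthly csv files are named.
files  <- sprintf("customers_%02d.csv", 1:11)  # Jan .. Nov
months <- lapply(files, fread)                 # read each csv as a data.table

all_customers <- Reduce(f, months)             # keep each customer's first entry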