Calculate change using time and not lag in R

I'm trying to calculate change from one quarter to the next. It's a little complex as I'm grouping variables, and I'd like to not use lag if possible. I'm having some difficulty finding a solution. Does anyone have a suggestion?
This is where I am now:
library(dplyr)
library(lubridate)

x <- mutate(x, QTR.YR = quarter(Month.of.Day.Timestamp, with_year = TRUE))

New.DF <- x %>%
  group_by(location_country, Device.Type, QTR.YR) %>%  # grouping columns to summarise by
  summarise(Sum.Basket = sum(Baskets.Created),
            Sum.Visits = sum(Visits),
            Sum.Checkout.Starts = sum(Checkout.Starts),
            Sum.Converted.Visit = sum(Converted.Visits),
            Sum.Store.Find = sum(store_find),
            Sum.Time.on.Page = sum(time_on_pages),
            Sum.Time.on.Product.Page = sum(time_on_product_pages),
            # problem line: lag() inside summarise does not pick up
            # the previous quarter's total here
            Visit.Change = sum(Visits) / sum(lag(Visits), 4) - 1)
View(New.DF)
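One way to get the quarter-over-quarter change without lag() is to index each quarter with an integer and self-join every quarter to its predecessor. This is a minimal sketch, assuming the column names from the question and carrying only Visits through:
library(dplyr)
library(lubridate)

# Index quarters by a single integer so the previous quarter is simply
# qtr_index - 1, with no lag() and no row-order assumptions
qtr_visits <- x %>%
  mutate(qtr_index = year(Month.of.Day.Timestamp) * 4 +
                     quarter(Month.of.Day.Timestamp)) %>%
  group_by(location_country, Device.Type, qtr_index) %>%
  summarise(Sum.Visits = sum(Visits))

# Join each quarter to its predecessor within the same group,
# then compute the relative change
qtr_change <- qtr_visits %>%
  mutate(prev_index = qtr_index - 1) %>%
  left_join(qtr_visits %>% rename(Prev.Visits = Sum.Visits),
            by = c("location_country", "Device.Type",
                   "prev_index" = "qtr_index")) %>%
  mutate(Visit.Change = Sum.Visits / Prev.Visits - 1)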

Related

R - Difference between two rows conditional on variables

Lately, for everything I manage on my own, there seems to be a point I need to clarify on this board, so thanks in advance for your help.
The issue is as follows:
library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1, 1, 2, 2, 3, 3, 4, 3, 4, 5),
  Code_ID_Buy = c("100D", "100D", "100D", "100D", "100D", "100D", "100D", "102D", "102D", "102D"),
  Code_ID_Sell = c("98C", "99C", "98C", "99C", "98C", "96V", "25A", "25A", "25A", "25A"),
  Contract_Size = c(100, 20, 120, 300, 120, 30, 25, 60, 80, 90)
)
setDT(Data)  # := needs a data.table, not a data.frame
Data[, totalcont := sum(Contract_Size), by = c("Code_ID_Buy", "Month")]
View(Data)
I would like to calculate the difference in total contract size from one period to the next. Does anybody know whether this is possible? I have been working on it for weeks without finding the solution...
Kind regards,
This should solve it:
library(tidyverse)

Data <- data.frame(
  Month = c(1, 1, 2, 2, 3, 3, 4, 3, 4, 5),
  Code_ID_Buy = c("100D", "100D", "100D", "100D", "100D", "100D", "100D", "102D", "102D", "102D"),
  Code_ID_Sell = c("98C", "99C", "98C", "99C", "98C", "96V", "25A", "25A", "25A", "25A"),
  Contract_Size = c(100, 20, 120, 300, 120, 30, 25, 60, 80, 90)
)

Data %>%
  group_by(Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))
If you need to respect another grouping variable, do:
Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))
If you need to maintain the current structure, a left_join may be right:
result <- Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))

Data %>%
  left_join(result)
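For completeness, the same grouped difference can be written in data.table, which the question already loads; a sketch using shift(), data.table's counterpart to lag():
library(data.table)

DT <- as.data.table(Data)
totals <- DT[, .(total_contract = sum(Contract_Size)),
             by = .(Code_ID_Buy, Month)]
setorder(totals, Code_ID_Buy, Month)  # make sure periods are in order
totals[, dif_contract := total_contract - shift(total_contract),
       by = Code_ID_Buy]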

Inconsistent results with normalisation in dplyr

I am trying to normalise a column ROE with this normalisation function:
2 * ((PATORBIS$ROE - min(PATORBIS$ROE, na.rm = TRUE)) /
     (max(PATORBIS$ROE, na.rm = TRUE) - min(PATORBIS$ROE, na.rm = TRUE))) - 1
When I run the expression above it gives me the correct normalisation, whereas the exact same expression inside dplyr's mutate gives incorrect results.
Sample data:
PATORBIS <- data.frame(Company = c("ACHAOGEN", "ACHAOGEN", "ACHAOGEN", "ACHAOGEN"),
                       year = c(2013, 2014, 2015, 2016),
                       ROE = c(-170, -31.2, -62.8, NA))

plot2 <- PATORBIS %>%
  select(Company, year, ROE) %>%
  filter(!is.na(ROE)) %>%
  mutate(ROE = 2 * (ROE - min(ROE, na.rm = TRUE)) /
               (max(ROE, na.rm = TRUE) - min(ROE, na.rm = TRUE)) - 1)
Does anyone have a similar issue of inconsistent results with mutate in dplyr?
I get the same results from both, and exactly the same after removing the filter (the select is not necessary either):
library(dplyr)

PATORBIS <- data.frame(Company = c("ACHAOGEN", "ACHAOGEN", "ACHAOGEN", "ACHAOGEN"),
                       year = c(2013, 2014, 2015, 2016),
                       ROE = c(-170, -31.2, -62.8, NA))

plot2 <- PATORBIS %>%
  # select(Company, year, ROE) %>%
  # filter(!is.na(ROE)) %>%
  mutate(ROE = 2 * (ROE - min(ROE, na.rm = TRUE)) /
               (max(ROE, na.rm = TRUE) - min(ROE, na.rm = TRUE)) - 1)
plot2

2 * ((PATORBIS$ROE - min(PATORBIS$ROE, na.rm = TRUE)) /
     (max(PATORBIS$ROE, na.rm = TRUE) - min(PATORBIS$ROE, na.rm = TRUE))) - 1
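If the concern is transcription slips between the standalone expression and the mutate call, one option is to wrap the scaling in a small helper so the formula exists in exactly one place. A minimal sketch (normalise is a hypothetical name):
# Min-max scaling to [-1, 1], written once and reused everywhere
normalise <- function(x) {
  2 * (x - min(x, na.rm = TRUE)) /
    (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) - 1
}

PATORBIS %>% mutate(ROE_scaled = normalise(ROE))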

Summarize each category of rows in one column using R

I'm wondering if this is something possible in R:
I have 2 columns. Column A (primaryhistory2.DEPT) has a bunch of categorical data, column B (primaryhistry2.ACT.ENROLL) has numbers and NAs.
I want to get a summary of column B for each category in column A.
Something like: for "NUT" in column A, I want to see min, max, mean, median, number of NAs, etc. for column B, and I would like to see this for every category, like when you use the summary() command.
Not sure if this is possible... Thank you all in advance!
@Moody_Mudskipper: The results are what I'm looking for, but without column names they're hard to read. And the base R version doesn't count NAs, of which I see a lot in my file.
Very possible using the dplyr library:
library(dplyr)

most.of.the.answer <- df %>%
  group_by(primaryhistory2.DEPT) %>%
  summarise(min = min(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
            max = max(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
            mean = mean(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
            median = median(primaryhistry2.ACT.ENROLL, na.rm = TRUE))
(assuming your dataframe is called df)
For counting NA's, try dplyr's filter feature:
count.NAs <- df %>%
  filter(is.na(primaryhistry2.ACT.ENROLL)) %>%
  group_by(primaryhistory2.DEPT) %>%
  summarise(count.NA = n())
I'll leave it to you to merge the two dataframes.
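A sketch of that merge, in case it helps (a left_join keeps departments with no NAs at all, whose count.NA then comes back as NA and is replaced with zero):
library(dplyr)

full.answer <- most.of.the.answer %>%
  left_join(count.NAs, by = "primaryhistory2.DEPT") %>%
  mutate(count.NA = coalesce(count.NA, 0L))  # departments with no NAs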
With base R you can do this:
# na.action = na.pass keeps the NAs so they can actually be counted;
# the formula interface drops them by default
temp <- aggregate(primaryhistry2.ACT.ENROLL ~ primaryhistory2.DEPT, df,
                  function(x) c(mean = mean(x, na.rm = TRUE),
                                median = median(x, na.rm = TRUE),
                                min = min(x, na.rm = TRUE),
                                max = max(x, na.rm = TRUE),
                                nas = sum(is.na(x))),
                  na.action = na.pass)
res <- cbind(temp[1], temp[[2]])
If you want to use summary:
summary1 <- sapply(unique(df$primaryhistory2.DEPT),
                   function(x) summary(subset(df, primaryhistory2.DEPT == x)$primaryhistry2.ACT.ENROLL))
colnames(summary1) <- unique(df$primaryhistory2.DEPT)

Converting continuous variable to categorical with dplyrXdf

I'm trying to perform some initial exploration of some data. I am analysing one-ways of continuous variables by converting them to factors and calculating frequencies by band.
I would like to do this with dplyrXdf, but it doesn't seem to work the same as normal dplyr for what I'm attempting.
sample_data <- RxXdfData("./data/test_set.xdf")  # sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data)   # same data as a dataframe
Importing my sample data as a dataframe, the code below calculates frequency by Buildings Sum Insured band:
buildings_ad_fr <- as_data_frame %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
But I can't do the same thing using the xdf version of the data:
buildings_ad_fr_xdf <- sample_data %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
A workaround I can think of would be to use rxDataStep to create the new column, passing bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000)) in the transforms argument, but it shouldn't be necessary to have an intermediate step.
I've also tried the .rxArgs argument before the group_by expression, but that doesn't seem to work either:
buildings_ad_fr <- sample_data %>%
  mutate(.rxArgs = list(transforms = list(
    bd_cut = cut(BD_INSURED_VALUE, seq(150000, 10000000, 5000000))))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
Both times on the xdf file it gives the error Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions
Now, I know this package can factorise variables, but I am not sure how to use that to split up a continuous variable.
Does anyone know how to do this?
The mutate should be fine. The summarise is different for xdf files:
Internally, summarise will run rxCube or rxSummary by default, which automatically remove NAs, so you don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise first and then compute the expression:
xdf %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE),
            clms = sum(ACT_AD_PD_CLAIM_COUNT)) %>%
  mutate(ad_pd_f = clms / exposure)
I've also just updated dplyrXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7, along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.
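For reference, a sketch of the rxDataStep workaround the question mentions: materialise the banded column once into an intermediate xdf file, then run the grouped summary on that (the output path here is made up for illustration):
banded_data <- rxDataStep(
  inData = sample_data,
  outFile = "./data/test_set_banded.xdf",  # hypothetical intermediate file
  transforms = list(
    bd_cut = cut(BD_INSURED_VALUE,
                 breaks = seq(from = 150000, to = 10000000, by = 5000000))),
  overwrite = TRUE
)

buildings_ad_fr_xdf <- banded_data %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE),
            clms = sum(ACT_AD_PD_CLAIM_COUNT)) %>%
  mutate(ad_pd_f = clms / exposure)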

R: optimize finding max values of function on a data frame and then trim the rest

First of all, my data comes from Temperature.xls, which can be downloaded from this link: RBook
My code is this:
library(plyr)  # for ddply

temp <- read.table("Temperature.txt", header = TRUE)
length(unique(temp$Year))  # number of unique values in the Year vector

res <- ddply(temp, c("Year", "Month"), summarise,
             Mean = mean(Temperature, na.rm = TRUE))
res1 <- ddply(temp, .(Year, Month), summarise,
              SD = sd(Temperature, na.rm = TRUE),
              N = sum(!is.na(Temperature)))
# ordering res1 by year and SD:
res1 <- res1[order(res1$Year, res1$SD), ]

# finding the maximum SD per year and displaying those rows in a separate data frame
res1_maxsd <- ddply(res1, .(Year), summarise, MaxSD = max(SD, na.rm = TRUE))  # max SD in each Year
res1_max <- merge(res1_maxsd, res1, all = FALSE)     # merge with the original to see the other variables at the max rows
res1_m <- res1_max[res1_max$MaxSD == res1_max$SD, ]  # keep the rows corresponding to the max value
res1_mm <- res1_m[complete.cases(res1_m), ]          # trim all others (which are NAs)
I know that the last 4 lines can be condensed. Can I somehow execute the last 2 lines in one command? I have stumbled across:
res1_m <- res1_max[complete.cases(res1_max$MaxSD == res1_max$SD), ]
But this does not give me what I want, which is a smaller data frame containing only the rows (with all the variables) that hold the max SD.
Rather than fixing the last 2 lines, why not start from res1? Reversing the sort order on SD and taking the first row per year gives you an equivalent final data set:
res1 <- res1[order(res1$Year, -res1$SD), ]   # SD descending within each year
res_final <- res1[!duplicated(res1$Year), ]  # first row per year is the max SD
The last four lines can be cut down if you use the dplyr package. Since you want to keep some information from the original data set, you probably don't want summarise: it only returns the summarised columns, and you would have to merge back with the original data set. mutate and filter are a better choice:
library(dplyr)
res1_mm1 <- res1 %>% group_by(Year) %>% filter(SD == max(SD, na.rm = TRUE))
You can also use mutate to create the new column MaxSD, which in this case equals SD in the resulting data frame:
res1_mm1 <- res1 %>%
  group_by(Year) %>%
  mutate(MaxSD = max(SD, na.rm = TRUE)) %>%
  filter(SD == MaxSD)
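In more recent dplyr versions (1.0 and later), slice_max expresses the same idea in one step; a quick sketch:
library(dplyr)

# Keep the row with the largest SD per Year; set with_ties = TRUE
# to keep several months when they tie on SD
res1_mm2 <- res1 %>%
  group_by(Year) %>%
  slice_max(SD, n = 1, with_ties = FALSE)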
