Inconsistent results with normalisation in dplyr - r

I am trying to normalise a column, ROE, with this normalisation expression:
2*((PATORBIS$ROE-min(PATORBIS$ROE,na.rm = T))/
((max(PATORBIS$ROE,na.rm = T)-min(PATORBIS$ROE,na.rm = T))))-1
When I run the expression above on its own, it gives me the correct normalisation, whereas the exact same expression inside mutate() from dplyr gives incorrect results.
Sample data:
PATORBIS <- data.frame(Company=c("ACHAOGEN","ACHAOGEN","ACHAOGEN","ACHAOGEN"),year=as.numeric(c("2013","2014","2015","2016")),ROE=as.numeric(c("-170","-31.2","-62.8",NA)))
plot2 <- PATORBIS %>%
  select("Company","year","ROE") %>%
  filter(!is.na(ROE)) %>%
  mutate(ROE=2*(ROE-min(ROE,na.rm = T))/(max(ROE,na.rm = T)-min(ROE,na.rm = T))-1)
Does anyone have a similar issue of inconsistent results with mutate in dplyr?

I get the same results from both approaches, and they stay exactly the same after removing the filter (the select is not necessary either):
library(dplyr)
PATORBIS <- data.frame(Company=c("ACHAOGEN","ACHAOGEN","ACHAOGEN","ACHAOGEN"),year=as.numeric(c("2013","2014","2015","2016")),ROE=as.numeric(c("-170","-31.2","-62.8",NA)))
plot2 <- PATORBIS %>%
  # select("Company","year","ROE") %>%
  # filter(!is.na(ROE)) %>%
  mutate(ROE=2*(ROE-min(ROE,na.rm = T))/(max(ROE,na.rm = T)-min(ROE,na.rm = T))-1)
plot2
2*((PATORBIS$ROE-min(PATORBIS$ROE,na.rm = T))/
((max(PATORBIS$ROE,na.rm = T)-min(PATORBIS$ROE,na.rm = T))))-1
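To check the claim rather than compare printed output by eye, here is a minimal sketch that runs both versions on the sample data and compares them with all.equal(). The helper rescale_pm1() is a hypothetical name introduced only for this illustration; it wraps the same formula as the question.

```r
library(dplyr)

PATORBIS <- data.frame(
  Company = rep("ACHAOGEN", 4),
  year    = c(2013, 2014, 2015, 2016),
  ROE     = c(-170, -31.2, -62.8, NA)
)

# Hypothetical helper: rescale a numeric vector to [-1, 1], ignoring NAs.
rescale_pm1 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  2 * (x - rng[1]) / (rng[2] - rng[1]) - 1
}

# Base R version, applied to the whole column.
base_result <- rescale_pm1(PATORBIS$ROE)

# dplyr version, applied inside mutate().
dplyr_result <- PATORBIS %>%
  mutate(ROE = rescale_pm1(ROE)) %>%
  pull(ROE)

all.equal(base_result, dplyr_result)
```

One situation where the two could legitimately differ, which is not shown in the question, is a grouped data frame: after group_by(), mutate() computes min() and max() within each group rather than over the whole column.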

Related

How to use %>% in tidymodels in R?

I am trying to split a dataset from tidymodels in R.
library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
I want to describe the distribution of the training dataset, but the following error occurs.
Sac_train %>%
  select(price) %>%
  summarize(min_sell_price = min(),
            max_sell_price = max(),
            mean_sell_price = mean(),
            sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf
However, the following code works.
Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))
My question is: why is select(price) not working in the first example? Thanks.
Assuming your data are a data frame, despite having only one column selected, you still need to tell R/dplyr what column you want to summarize.
In other words, it doesn't treat a single-column data frame as a vector that you can pass through a function - i.e.:
Sac_train.vec <- 1:25
mean(Sac_train.vec)
# [1] 13
will calculate the mean, whereas
Sac_train.df <- data.frame(price = 1:25)
mean(Sac_train.df)
does not: it returns NA with a warning that the argument is not numeric or logical.
In the special case of only one column, this may be more parsimonious code:
# Example Data
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])
Sac_train %>%
  select(price) %>%
  summarize(across(everything(),
                   list(min = min, max = max, mean = mean, sd = sd)))
Output:
#   price_min price_max price_mean price_sd
# 1         1        25         13 7.359801
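If you really do want a plain vector to pass to mean() and friends, a small sketch using dplyr's pull() shows the difference between a one-column data frame and the column itself. The data here are the same toy data as above, not the real Sacramento set.

```r
library(dplyr)

# Toy data, same shape as the example above
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])

# select() keeps a data frame, even with a single column
one_col_df <- Sac_train %>% select(price)
is.data.frame(one_col_df)

# pull() extracts the column as an ordinary vector
price_vec <- Sac_train %>% pull(price)
is.numeric(price_vec)

mean(price_vec)
```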

r.squared matrix of predictions vs actual values in R

I want to create a matrix that displays the r.squared coefficient of determination of some predictions made over the years and the actual values.
My goal is to display a matrix that looks something like this.
The only way I found is to make multiple lists, calculate each row/ column individually using map2_dbl(l.predicted_line1, l.actual, ~ summary(lm(.x ~ .y))$r.squared), and then add the resulting vectors in a matrix with some code. This would create 9 lists, which I want to avoid.
Is there any way of doing this more efficiently?
#sample data
l.actual <- list(
overall_15 = c(59,65,73,73,64,69,64,69,63,NA,82,60,NA,73,NA,73,73,NA,69,
69,71,66,65,70,72,72,NA,64,69,67,64,71,NA,62,62,71,67,63,64,76,72),
overall_16 = c(60,68,75,74,68,71,NA,72,64,69,82,66,64,77,NA,71,72,NA,69,
69,75,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,65,76,73),
overall_17 = c(63,68,NA,74,72,72,NA,73,66,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
overall_18 = c(NA,68,NA,78,73,72,NA,72,68,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
overall_19 = c(NA,NA,NA,77,73,72,NA,71,69,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
overall_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
overall_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
overall_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
l.predicted <- list(
potential_15 = c(59,68,74,76,65,75,64,72,66,NA,85,60,NA,76,NA,73,75,NA,71,
71,71,67,65,70,72,72,NA,68,74,67,64,71,NA,62,62,71,71,63,67,78,72),
potential_16 = c(60,71,75,75,68,73,NA,74,66,69,83,66,64,77,NA,71,74,NA,70,
70,76,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,66,76,73),
potential_17 = c(63,69,NA,75,72,72,NA,73,69,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
potential_18 = c(NA,68,NA,78,73,72,NA,72,69,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
potential_19 = c(NA,NA,NA,77,73,72,NA,71,70,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
potential_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
potential_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
potential_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
Here is a solution using some tidyverse packages. The key thing is to use the function expand_grid() to get all combinations of the elements of each list. This results in a tibble with two named list columns. Next we can use mutate() to pull out the names of the list and assign them to new columns, and extract the numeric IDs. Use filter() to retain only the rows where potential is less than or equal to overall. Finally get the R-squared for each row using your suggested code, and plot. (Note I did not try too hard to get the plot to look just like yours.)
library(purrr)
library(dplyr)
library(ggplot2)
library(tidyr)
r_squared_combinations <- expand_grid(l.actual, l.predicted) %>%
  mutate(overall = names(l.actual),
         potential = names(l.predicted),
         overall_n = as.numeric(gsub('overall_', '', overall)),
         potential_n = as.numeric(gsub('potential_', '', potential))) %>%
  filter(potential_n <= overall_n) %>%
  mutate(r_squared = map2_dbl(l.predicted, l.actual, ~ summary(lm(.x ~ .y))$r.squared))
ggplot(r_squared_combinations, aes(x = overall, y = potential, fill = r_squared, label = round(r_squared, 3))) +
  geom_tile() +
  geom_text(color = 'white')
Side note: incidentally the base function expand.grid() would work about as well as tidyr::expand_grid() but expand_grid() returns a tibble by default which may be more convenient if you are using tidyverse functions otherwise.
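If the goal is an actual matrix object rather than a long tibble, one possible sketch in base R is a nested sapply(). The two short lists below are made-up toy data standing in for l.actual and l.predicted, and r2() is a hypothetical helper that wraps the lm() call from the question.

```r
# Toy data, made up for illustration only
actual <- list(
  overall_15 = c(1, 2, 3, 4, 5),
  overall_16 = c(2, 1, 4, 3, 6)
)
predicted <- list(
  potential_15 = c(1.1, 2.3, 2.9, 4.2, 4.8),
  potential_16 = c(2, 2, 3, 3, 5)
)

# Hypothetical helper: R-squared of a simple regression of pred on act.
r2 <- function(pred, act) summary(lm(pred ~ act))$r.squared

# Rows are predictions, columns are actual values.
r2_matrix <- sapply(actual, function(act) {
  sapply(predicted, function(pred) r2(pred, act))
})

round(r2_matrix, 3)
```

This computes every combination, so any cells you do not want (for example predictions made after the actual year) would still have to be set to NA afterwards.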

Converting continuous variable to categorical with dplyrXdf

I'm trying to perform some initial exploration of some data. I am busy analysing one-ways of continuous variables by converting them to factors and calculating frequencies by bands.
I would like to do this with dplyrXdf, but it doesn't seem to work the same way as normal dplyr for what I'm attempting.
sample_data <- RxXdfData("./data/test_set.xdf") #sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data) #same data as dataframe
# Calculate freq by Buildings Sum Insured band
Importing my sample data as a dataframe the below code works
buildings_ad_fr <- as_data_frame %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))
But I can't do the same thing using the xdf version of the data:
buildings_ad_fr_xdf <- sample_data %>%
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))
A workaround I can think of would be to use rxDataStep to create the new column by passing bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000)) in the transforms argument, but it shouldn’t be necessary to have an intermediate step.
I've tried using the .rxArgs argument before the group_by expression, but that also doesn't seem to work:
buildings_ad_fr <- sample_data %>%
  mutate(sample_data, .rxArgs = list(transforms = list(
    bd_cut = cut(BD_INSURED_VALUE, seq(150000, 10000000, 5000000))))) %>%
  group_by(bd_cut) %>%
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))
Both times on the xdf file it gives the error Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions
Now I know this package can factorise variables but I am not sure how to use it to split up a continuous variable
Does anyone know how to do this?
The mutate should be fine. The summarise is different for Xdf files:
Internally summarise will run rxCube or rxSummary by default, which automatically remove NAs. You don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise and then compute the expression:
xdf %>%
  group_by(*) %>%
  summarise(expos=sum(expos), pd=sum(clms)) %>%
  mutate(pd=pd/expos)
I've also just updated dplyrXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7 along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.
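As a rough illustration of the summarise-then-mutate pattern with the column names from the question, here is a sketch on an ordinary in-memory data frame with plain dplyr. The values are made up, the names policies and claim_count are hypothetical, and the upper break of the cut is extended so that there are two bands; it has not been run against an xdf file.

```r
library(dplyr)

# Made-up data using the question's column names
policies <- data.frame(
  BD_INSURED_VALUE      = c(200000, 400000, 6000000, 7000000, 9000000),
  BENEFIT_EXPOSURE      = c(1, 0.5, 1, 0.25, 1),
  ACT_AD_PD_CLAIM_COUNT = c(0, 1, 2, 0, 1)
)

buildings_ad_fr <- policies %>%
  # Band the continuous variable first
  mutate(bd_cut = cut(BD_INSURED_VALUE,
                      seq(from = 150000, to = 10150000, by = 5000000))) %>%
  group_by(bd_cut) %>%
  # Summarise named variables only, with no expressions
  summarise(exposure    = sum(BENEFIT_EXPOSURE),
            claim_count = sum(ACT_AD_PD_CLAIM_COUNT)) %>%
  # Compute the ratio afterwards
  mutate(ad_pd_f = claim_count / exposure)

buildings_ad_fr
```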

Error All select() inputs must resolve to integer column positions. The following do not:

I am trying to use dplyr computation as below and then call this in a function where I can change the column name and dataset name. The code is as below:
sample_table <- function(byvar = TRUE, dataset = TRUE) {
  tcount <-
    df2 %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(tcount = n) %>%
    left_join(
      select(
        dataset %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(scount = n), byvar, scount
      ), by = c("byvar")
    ) %>%
    mutate_each(funs(replace(., is.na(.), 0)), -byvar) %>% mutate(
      tperc = round(tcount / rcount, digits = 2), sperc = round(scount / samplesize, digits = 2),
      absdiff = abs(sperc - tperc)
    ) %>%
    select(byvar, tcount, tperc, scount, sperc, absdiff)
  return(tcount)
}
category_Sample1 <- sample_table(byvar = "category", dataset = Sample1)
My function name is sample_table.
The error message is as follows:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* byvar
I know this is a repeat question and I have gone through the below links:
Function writing passing column reference to group_by
Error when combining dplyr inside a function
I am not sure where I am going wrong. rcount is the number of rows in df2 and samplesize is the number of rows in "dataset" dataframe. I have to compute the same thing for another variable with three different "dataset" names.
You are mixing column references given as strings (byvar, standard evaluation) with bare column names (tcount, tperc, etc., non-standard evaluation).
Make sure you use one or the other consistently, with the appropriate function: select() or select_(). You can fix your issue by using
select(one_of(c(byvar,'tcount')))
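For what it is worth, here is a minimal sketch of the same idea in more recent dplyr (this assumes dplyr 1.0 or later, where all_of() and across() are available). The function count_by() and the small Sample1 data frame are hypothetical and only illustrate passing a column name as a string.

```r
library(dplyr)

# Hypothetical helper: count rows per level of a column named by a string.
count_by <- function(dataset, byvar) {
  dataset %>%
    group_by(across(all_of(byvar))) %>%  # string -> grouping column
    tally(name = "scount") %>%
    select(all_of(byvar), scount)        # mix a string and a bare name safely
}

# Made-up sample data
Sample1 <- data.frame(category = c("a", "b", "a", "c", "a"))

category_Sample1 <- count_by(Sample1, "category")
category_Sample1
```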

R Calculate change using time and not lag

I'm trying to calculate change from one quarter to the next. It's a little complex as I'm grouping variables, and I'd like to avoid lag if possible. I'm having some difficulty finding a solution. Does anyone have a suggestion?
This is where I am now.
library(dplyr)
library(lubridate)
x <- mutate(x,QTR.YR = quarter(x$Month.of.Day.Timestamp, with_year = TRUE))
New.DF <- x %>% group_by(location_country, Device.Type,QTR.YR) %>% #Select grouping columns to summarise
  summarise(Sum.Basket = sum(Baskets.Created), Sum.Visits = sum(Visits),
            Sum.Checkout.Starts = sum(Checkout.Starts), Sum.Converted.Visit = sum(Converted.Visits),
            Sum.Store.Find = sum(store_find), Sum.Time.on.Page = sum(time_on_pages),
            Sum.Time.on.Product.Page = sum(time_on_product_pages),
            Visit.Change = sum(Visits)/sum((lag(Visits)),4)-1)
View(New.DF)
