Converting continuous variable to categorical with dplyrXdf - r

I'm trying to perform some initial exploration of some data. I am busy analysing one-ways of continuous variables by converting them to factors and calculating frequencies by bands.
I would like to do this with dplyrXdf, but it doesn't seem to work the same way as regular dplyr for what I'm attempting.
sample_data <- RxXdfData("./data/test_set.xdf") #sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data) #same data as dataframe
# Calculate freq by Buildings Sum Insured band
With the sample data imported as a data frame, the code below works:
buildings_ad_fr <- as_data_frame %>%
mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>%
group_by(bd_cut) %>%
summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))
But I can't do the same thing using the xdf version of the data:
buildings_ad_fr_xdf <- sample_data %>%
mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>%
group_by(bd_cut) %>%
summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))
A workaround I can think would be to use rxDataStep to create the new column by passing through bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000)) in the transforms argument, but it shouldn’t be necessary to have an intermediate step.
I've also tried passing the .rxArgs argument to mutate before the group_by expression, but that doesn't seem to work either:
buildings_ad_fr <- sample_data %>%
mutate(sample_data,.rxArgs = list(transforms = list(bd_cut = cut(BD_INSURED_VALUE,
seq(150000,
10000000,
5000000)))))%>%
group_by(bd_cut) %>%
summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))
Both times on the xdf file it gives the error Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions
Now I know this package can factorise variables but I am not sure how to use it to split up a continuous variable
Does anyone know how to do this?

The mutate should be fine. The summarise is different for Xdf files:
Internally summarise will run rxCube or rxSummary by default, which automatically remove NAs. You don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise and then compute the expression:
xdf %>%
group_by(grp) %>%  # grp is a placeholder for your grouping variable(s), e.g. bd_cut
summarise(expos=sum(expos), pd=sum(clms)) %>%
mutate(pd=pd/expos)
I've also just updated dplyrXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7 along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.
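The two-step pattern described above (sum named columns first, then derive the ratio) can be sketched on an ordinary in-memory data frame with plain dplyr. This is only an analogue for illustration: it does not exercise dplyrXdf or an Xdf file, and the toy data, column names and break points are all made up.

```r
library(dplyr)

# Toy in-memory data; all column names here are hypothetical stand-ins
toy <- data.frame(
  insured_value = c(100, 250, 300, 450, 600, 800),
  exposure      = c(1, 2, 1, 2, 1, 1),
  claims        = c(0, 1, 0, 1, 1, 0)
)

result <- toy %>%
  # band the continuous variable
  mutate(band = cut(insured_value, breaks = c(0, 300, 600, 900))) %>%
  group_by(band) %>%
  # step 1: summarise only plain sums of named columns
  summarise(exposure = sum(exposure), claims = sum(claims)) %>%
  # step 2: compute the ratio from the already-summarised columns
  mutate(freq = claims / exposure)

result
```

Separately, note that seq(from = 150000, to = 10000000, by = 5000000) yields only two break points (150000 and 5150000), so cut produces a single band and every value outside it becomes NA; a smaller by may be what was intended.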

Related

How to mutate a list of dataframes simultaneously in R

I am trying to mutate all the data frames in a list of data frames at the same time in R.
Here is the code I am running on one data frame; it mutates, groups and summarises as intended:
ebird_tod_1 <- ebird_split[[1]] %>% #ebird_split is the df list.
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
This works superbly for one data frame in the list, but I have 45, and I would rather not have pages of this code to create the 45 variables. Hence I am looking for a method that would increment the "ebird_tod_1" variable to "ebird_tod_2", "ebird_tod_3", etc., while the data frame being modified changes to "ebird_split[[2]]", "ebird_split[[3]]", and so on.
I have tried unsuccessfully to use the repeat and map functions.
I hope that is all the info someone needs to help; I am new to R.
Thank you.
As you provided no example data the following code is not tested. But a general approach would be to put your code inside a function and to use lapply or purrr::map to loop over your list of data frames and store the result in a list (instead of creating multiple objects):
myfun <- function(x) {
x %>%
mutate(tod_bins = cut(time_observations_started,
breaks = breaks,
labels = labels,
include.lowest = TRUE),
tod_bins = as.numeric(as.character(tod_bins))) %>%
group_by(tod_bins) %>%
summarise(n_checklists = n(),
n_detected = sum(species_observed),
det_freq = mean(species_observed))
}
ebird_tod <- lapply(ebird_split, myfun)
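As a rough, runnable illustration of the same function-plus-lapply pattern, here is a sketch on the built-in mtcars data. The list toy_split, the function summarise_bins and the break points are hypothetical stand-ins for ebird_split, myfun and the asker's breaks; the breaks and labels are passed as arguments here rather than read from the global environment.

```r
library(dplyr)

# Toy stand-in for ebird_split: a named list of data frames
toy_split <- split(mtcars, mtcars$cyl)

# breaks and labels are explicit arguments instead of globals
summarise_bins <- function(x, breaks, labels) {
  x %>%
    mutate(bin = cut(mpg, breaks = breaks, labels = labels,
                     include.lowest = TRUE),
           bin = as.numeric(as.character(bin))) %>%
    group_by(bin) %>%
    summarise(n_rows = n(), mean_hp = mean(hp))
}

breaks <- c(10, 20, 30, 40)
labels <- c("15", "25", "35")  # bin midpoints, stored as text for cut()

# One summary per list element, kept together in a single list
toy_result <- lapply(toy_split, summarise_bins,
                     breaks = breaks, labels = labels)

names(toy_result)
toy_result[["4"]]
```

Keeping the results in one named list means each summary stays addressable by name or position without creating 45 separate objects.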
In your example it seems like you want to create data.frames in the global environment from that list of data.frames. To do this we could use rlang::env_bind:
library(tidyverse)
# a list of data.frames
data_ls <- iris %>%
nest_by(Species) %>%
pull(data)
# name the list of data frames
data_ls <- set_names(data_ls, paste("iris", seq_along(data_ls), sep = "_"))
data_ls %>%
# use map or lapply to make some operations
map(~ mutate(.x, new = Sepal.Length + Sepal.Width) %>%
summarise(across(everything(), mean),
n = n())) %>%
# pipe into env_bind and splice list of data.frames
rlang::env_bind(.GlobalEnv, !!! .)
Created on 2022-05-02 by the reprex package (v2.0.1)

Is there a way of creating a loop that will create a new variable for each of the original 18 variables?

I have a data set with 4 variables; one of them is a dummy stating whether the individual graduated from a particular program (exits). I need to create a loop that will, for each of the other 3 variables, create two new variables (the mean for dummy = 1 and the mean for dummy = 0). This is my code. I want to make it more efficient, since afterwards I want to create a new data.frame for exits == 0 and subtract one from the other.
summary_means_1 = bf %>%
filter(exits == 1) %>%
summarise(
v1_1 = as.double(mean(bf$v25_grad, na.rm = TRUE)),
v2_1 = as.double(mean(bf$v29_read, na.rm = TRUE)),
v3_1 = as.double(mean(bf$v30_math, na.rm = TRUE))
)
You can do this with the dplyr package (group_by comes from dplyr, not plyr, and dplyr also provides the %>% pipe):
Say this is your data (simplified):
df <- data.frame(Dummy=sample(0:1, 10, T), V1=rnorm(10, 10), V2=rpois(10, 0.5))
This code will calculate the mean of each column, split by dummy:
library(dplyr)
df %>%
group_by(Dummy) %>%
summarise(Mean_V1=mean(V1, na.rm = T),
Mean_V2=mean(V2, na.rm = T))
You'll need to add a new line in the summarise call for each column.
Using base R you can use colMeans with subsetted data:
colMeans(df[df$Dummy==0, -1])
colMeans(df[df$Dummy==1, -1])
Or you could combine them like this:
data.frame(Col=c("V1", "V2"),
Mean_0=colMeans(df[df$Dummy==0, -1]),
Mean_1=colMeans(df[df$Dummy==1, -1]))
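If adding one line per column becomes tedious, a possible alternative is dplyr's across() (available from dplyr 1.0.0), which applies the same function to every non-grouping column. The sketch below uses simulated data with a fixed seed and balanced groups, which is an assumption for illustration only, and also shows the group difference the question asks for.

```r
library(dplyr)

set.seed(1)
# Simulated data: 5 rows per dummy level
df <- data.frame(Dummy = rep(0:1, each = 5),
                 V1 = rnorm(10, 10),
                 V2 = rpois(10, 0.5))

# Mean of every non-grouping column, by Dummy
means <- df %>%
  group_by(Dummy) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Difference between the two groups (Dummy == 1 minus Dummy == 0)
diff_1_0 <- colMeans(df[df$Dummy == 1, -1]) - colMeans(df[df$Dummy == 0, -1])

means
diff_1_0
```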

Summarize each category of rows in one column using R

I'm wondering if this is something possible in R:
I have 2 columns. Column A (primaryhistory2.DEPT) has a bunch of categorical data, column B (primaryhistry2.ACT.ENROLL) has numbers and NAs.
I want to get a summary of column B for each category in column A.
Something like, for "NUT" in column A, I want to see min, max, mean, median, NAs, etc. And I would like to see this for every category. Like when you use summary() command.
Not sure if this is possible.. Thank you all in advance!
@Moody_Mudskipper
The results are what I'm looking for, but without column names they are hard to read.
As for the base R version, it isn't counting NAs, and I do see a lot of NAs in my file.
Very possible using dplyr library:
library(dplyr)
most.of.the.answer = df %>%
group_by(primaryhistory2.DEPT) %>%
summarise(min = min(primaryhistry2.ACT.ENROLL, na.rm = TRUE), max = max(primaryhistry2.ACT.ENROLL, na.rm = TRUE), mean = mean(primaryhistry2.ACT.ENROLL, na.rm = TRUE), median = median(primaryhistry2.ACT.ENROLL, na.rm = TRUE))
(assuming your dataframe is called df)
For counting NA's, try dplyr's filter feature:
count.NAs = df %>% filter(is.na(primaryhistry2.ACT.ENROLL)) %>%
group_by(primaryhistory2.DEPT) %>%
summarise(count.NA = n())
I'll leave it to you to merge the two dataframes.
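The merge can probably be avoided by counting the NAs inside the same summarise call with sum(is.na(...)). A minimal sketch on made-up data follows; the columns dept and enroll are hypothetical stand-ins for the asker's two columns.

```r
library(dplyr)

# Toy data; dept and enroll stand in for the real column names
toy <- data.frame(
  dept   = c("NUT", "NUT", "NUT", "BIO", "BIO"),
  enroll = c(10, NA, 30, 5, NA)
)

toy_summary <- toy %>%
  group_by(dept) %>%
  summarise(
    min    = min(enroll, na.rm = TRUE),
    max    = max(enroll, na.rm = TRUE),
    mean   = mean(enroll, na.rm = TRUE),
    median = median(enroll, na.rm = TRUE),
    n_na   = sum(is.na(enroll))   # NA count in the same pass
  )

toy_summary
```

One caveat: for a group whose values are all NA, min and max with na.rm = TRUE return Inf and -Inf with a warning.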
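With base R you can do this (na.action = na.pass is needed because the formula interface of aggregate drops rows with NAs by default, which would make the NA count always zero):
temp <- aggregate(primaryhistory2..ACT.ENROLL ~ primaryhistory2.DEPT,df,function(x){c(mean = mean(x,na.rm=T),median = median(x,na.rm=T),min = min(x,na.rm=T),max = max(x,na.rm=T),nas=sum(is.na(x)))}, na.action = na.pass)
res <- cbind(temp[1],temp[[2]])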
If you want to use summary:
summary1 <- sapply(unique(df$primaryhistory2.DEPT),function(x) summary(subset(df,primaryhistory2.DEPT == x)$primaryhistory2..ACT.ENROLL))
colnames(summary1) <- unique(df$primaryhistory2.DEPT)

R Calculate change using time and not lag

I'm trying to calculate change from one quarter to the next. It's a little complex as I'm grouping variables, and I'd like to avoid lag if possible. I'm having some difficulty finding a solution. Does anyone have a suggestion?
This is where I am now.
library(dplyr)
library(lubridate)
x <- mutate(x,QTR.YR = quarter(x$Month.of.Day.Timestamp, with_year = TRUE))
New.DF <- x %>% group_by(location_country, Device.Type,QTR.YR) %>% #Select grouping columns to summarise
summarise(Sum.Basket = sum(Baskets.Created), Sum.Visits = sum(Visits),
Sum.Checkout.Starts = sum(Checkout.Starts), Sum.Converted.Visit = sum(Converted.Visits),
Sum.Store.Find = sum(store_find), Sum.Time.on.Page = sum(time_on_pages),
Sum.Time.on.Product.Page = sum(time_on_product_pages),
Visit.Change = sum(Visits)/sum((lag(Visits)),4)-1)
View(New.DF)

R: optimize finding max values of function on a data frame and then trim the rest

First of all my data comes from Temperature.xls which can be downloaded from this link: RBook
My code is this:
library(plyr) # provides ddply
temp = read.table("Temperature.txt", header = TRUE)
length(unique(temp$Year)) # number of unique values in the Year vector.
res = ddply(temp, c("Year","Month"), summarise, Mean = mean(Temperature, na.rm = TRUE))
res1 = ddply(temp, .(Year,Month), summarise,
SD = sd(Temperature, na.rm = TRUE),
N = sum(!is.na(Temperature))
)
# ordering res1 by sd and year:
res1 = res1[order(res1$Year,res1$SD),];
# finding maximum of SD in res1 by year and displaying just them in a separate data frame
res1_maxsd = ddply(res1, .(Year), summarise, MaxSD = max(SD, na.rm = TRUE)) # find the maxSD in each Year
res1_max = merge(res1_maxsd,res1, all = FALSE) # merge it with the original to see other variables at the max's rows
res1_m = res1_max[res1_max$MaxSD==res1_max$SD,] # find which rows are the ones corresponding to the max value
res1_mm = res1_m[complete.cases(res1_m),] # trim all others (which are NA's)
I know that the last 4 lines can be condensed into fewer. Can I somehow execute the last 2 lines in one command? I have stumbled across:
res1_m = res1_max[complete.cases(res1_max$MaxSD==res1_max$SD),]
But this does not give me what I want which is eventually a smaller data frame only with the rows (with all the variables) that contain the maxSD.
Rather than fixing the last 2 lines why not start with res1? Reversing the order of SD and taking the first row per year gives you an equivalent final data set...
res1 <- res1[order(res1$Year,-res1$SD),]
res_final <- res1[!duplicated(res1$Year),]
The last four lines can be cut down if you use dplyr package. Since you want to keep some information from original data set, you probably don't want to use summarise because it only returns summarized information and you have to merge with original dataset, so mutate and filter would be a better choice:
library(dplyr)
res1_mm1 <- res1 %>% group_by(Year) %>% filter(SD == max(SD, na.rm = T))
You can also use a mutate function to create the new column MaxSD which is the same as SD in the result data frame for your case:
res1_mm1 <- res1 %>% group_by(Year) %>% mutate(MaxSD = max(SD, na.rm = T)) %>%
filter(SD == MaxSD)
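One difference between the approaches above is worth checking on your own data: filter(SD == max(SD)) keeps every row tied for the maximum, while the order plus !duplicated approach keeps exactly one row per year. A small sketch on made-up data (the values are invented for illustration):

```r
library(dplyr)

# Toy data with a tie in 2000 and an NA in 2001
toy <- data.frame(
  Year  = c(2000, 2000, 2000, 2001, 2001),
  Month = c(1, 2, 3, 1, 2),
  SD    = c(1.5, 2.0, 2.0, 0.7, NA)
)

# dplyr: keeps all rows equal to the yearly maximum (ties included);
# rows where the comparison is NA are dropped by filter
by_filter <- toy %>%
  group_by(Year) %>%
  filter(SD == max(SD, na.rm = TRUE)) %>%
  ungroup()

# base R: sort by Year, then descending SD (NAs sort last),
# and keep the first row of each year
sorted   <- toy[order(toy$Year, -toy$SD), ]
by_first <- sorted[!duplicated(sorted$Year), ]

nrow(by_filter)
nrow(by_first)
```

If ties are possible and only one row per year is wanted, the base R version (or an explicit tie-breaking rule) may be the safer choice.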
