Extracting beta coefficient from group_map() - r

I am working on a data frame where I want to regress one column on another (scores on a female dummy) within groups defined by a third column (country), and then extract the coefficient on the female dummy.
I have tried using dplyr, first grouping my data frame by country with group_by(), then fitting a regression with group_map(). First, the coefficients shown in the result are the same for every group. Second, I cannot seem to extract only the second coefficient; when I try, I get an error saying the function cannot be applied to a list.
f1 %>% group_by(background) %>%
group_map(~ coef(lm(pv1math ~ female, data = f1))) %>%
group_map(~ coef[2])
I essentially want a vector containing the second coefficient for each group.
I keep getting an error from group_split:
error in UseMethod("group_split") :
no applicable method for 'group_split' applied to an object of class "list"
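
A minimal sketch of one way to get per-group slopes, offered as an illustration rather than a definitive answer. Two things seem to go wrong in the code above: data = f1 refits the model on the full data frame for every group, and the second group_map() receives a plain list rather than a grouped data frame. Inside group_map(), .x refers to the rows of the current group. Since f1 is not available here, the built-in mtcars data is used as a stand-in (cyl plays the role of background, mpg of pv1math, am of female).

```r
library(dplyr)

# mtcars is only a stand-in for f1:
#   cyl ~ background, mpg ~ pv1math, am ~ female (a 0/1 dummy)
coefs <- mtcars %>%
  group_by(cyl) %>%
  # .x is the data for the current group, so each model sees only its own rows;
  # the slope is extracted inside the same call, by name rather than position
  group_map(~ coef(lm(mpg ~ am, data = .x))[["am"]]) %>%
  unlist()

print(coefs)  # one slope per group, in the order of the group keys
```

With the original column names, the same pattern would presumably read group_map(~ coef(lm(pv1math ~ female, data = .x))[["female"]]).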

Related

R: summarise multiple columns with different summation functions using dplyr results in error?

I am transforming a customer journey dataset from user aggregation level to a day level aggregation. The problem is that I cannot simply sum or mean all columns, as not all variables can be aggregated in the same way. For example, duration is a variable that I want to summarise via mean, while purchase_own is a variable that I want to summarise via sum.
I used dplyr to get this working, but it gives me an error. I tried the following code:
CJd <- CJre %>% group_by(date) %>% summarise_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp, POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), sum)
%>% summarise_at(vars(duration, difference), mean) %>% summarise_at(CountTP, max)
This results in an error:
Error in .f(.x[[i]], ...) : object 'duration' not found
I suspect that this means that summarise_at(vars(duration, difference), mean) is not allowed as second summarise code. Now my question is, how can I write the summarise function so that summation is different for some variables?
The actual result is that only the first summarise_at gets executed, which leaves variables missing from my dataset. The missing variables need to be summarised with mean and max, respectively. The expected outcome is that these variables, grouped by date and summarised by mean or max, are added to the dataset.
The issue is that the first summarise_at did not include 'duration', so that column no longer exists in the summarised data when the second summarise_at runs. Instead, we can use mutate_at to compute the sums while keeping every column, then add the summed columns to the grouping and summarise the remaining ones:
CJre %>%
group_by(date) %>%
mutate_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp,
POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), sum) %>%
group_by(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp,
POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch, add = TRUE) %>%
summarise_at(vars(duration, difference), mean)
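
As an alternative sketch, assuming dplyr 1.0.0 or later is available, a single summarise() call can apply a different function to each set of columns through across(), so no column is dropped before it is needed. The small CJre data frame below is invented purely for illustration and contains only a few of the columns named in the question.

```r
library(dplyr)

# Toy stand-in for CJre with a handful of the columns from the question
CJre <- data.frame(
  date         = c("d1", "d1", "d2", "d2"),
  purchase_own = c(1, 0, 1, 1),
  purchase_any = c(1, 1, 0, 1),
  duration     = c(10, 20, 30, 50),
  CountTP      = c(2, 5, 1, 3)
)

CJd <- CJre %>%
  group_by(date) %>%
  summarise(
    across(c(purchase_own, purchase_any), sum),  # columns that should be summed
    across(duration, mean),                      # columns that should be averaged
    across(CountTP, max)                         # columns that need the maximum
  )

print(CJd)
```

Each across() call sees the original grouped columns, which is why the order of the calls does not matter here.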

Lag function usage within a dplyr subset

My basic goal is to subset a data set and summarise it with new columns that use the lag function. I understand how to subset the data set, but I am struggling to use the lag function within it.
I have already tried a few different ways of implementing it, but have been unsuccessful.
gapminder %>%
na.omit() %>%
group_by(country) %>%
summarise(prevPeriod = lag(year),
lifeExpGrowth = lag(lifeExp),
popGrowth = lag(pop),
gdppcGrowth = 100*(gdpPercap/lag(gdpPercap) - 1))
My code currently runs the lag based on the country, not the year. gdppcGrowth is also supposed to return a percent, but I am getting an error:
Column `gdppcGrowth` must be length 1 (a summary value), not 12
For each of the functions, I want to analyze the data by country focusing on growth rates. I want to use the lag(x) function to access the previous value of a series or vector so that 100*(x/lag(x) - 1) computes standard (arithmetic) growth rates of x expressed as a percent.
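
A hedged sketch of one likely fix: summarise() expects one value per group, while lag() returns one value per row, so mutate() is probably the better fit. The toy data frame below is made up for illustration and stands in for gapminder; only the column names country, year and gdpPercap are taken from the question.

```r
library(dplyr)

# Invented stand-in for gapminder: two countries, three periods each
gap <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 2),
  gdpPercap = c(100, 110, 121, 200, 180, 198)
)

growth <- gap %>%
  arrange(country, year) %>%   # make sure lag() looks at the previous year
  group_by(country) %>%        # so the lag never crosses a country boundary
  mutate(
    prevPeriod  = lag(year),
    gdppcGrowth = 100 * (gdpPercap / lag(gdpPercap) - 1)
  ) %>%
  ungroup()

print(growth)
```

The first row of each country has no previous period, so its growth rate is NA.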

how to apply lm() to datasets split by factors

In a housing dataset, there are three variables: bsqft (the building size of the house), county (a factor variable with 9 levels) and price. I would like to fit a separate regression line of price on bsqft for each county. Instead of calling the lm() function repeatedly, I would prefer to use an apply function in R, but I have no idea how to write it. Could anyone help me with that? Thanks a lot.
You can use dplyr and broom to run the regressions by group and collect the results back into a data frame:
library(dplyr)
library(broom)
your_dataset %>%
group_by(county) %>%
do(tidy(lm(price ~ bsqft, data=.)))
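
Since the question asks specifically about the apply family, here is a base R sketch using split() and lapply(). The housing data frame is simulated for illustration (three counties instead of nine), so the numbers themselves mean nothing.

```r
# Simulated stand-in for the housing data; only the column names come from the question
set.seed(42)
housing <- data.frame(
  county = factor(rep(c("a", "b", "c"), each = 10)),
  bsqft  = runif(30, min = 800, max = 3000)
)
housing$price <- 50000 + 150 * housing$bsqft + rnorm(30, sd = 10000)

# split() gives one data frame per county; lapply() fits one model to each
fits <- lapply(split(housing, housing$county),
               function(d) lm(price ~ bsqft, data = d))

# Collect intercepts and slopes into a matrix with one row per county
coefs <- t(sapply(fits, coef))
print(coefs)
```

fits is a named list, so an individual model can be inspected with, for example, summary(fits[["a"]]).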

Regression model over factor levels using dplyr : getting repeated errors

I'm trying to run a logistic regression model over several factor levels in my dataframe, and I'm getting the same results repeated for each factor level instead of a separate set of model parameters per level. The same thing happens when I run this code on the diamonds dataset:
diamonds$E <-
if_else(diamonds$color=='E',1,0) #Make 'E' binary
fitted_models <- diamonds %>%
group_by(clarity) %>% #Group by clarity
do(model=glm(E~price,#regress price on E
data=diamonds,
family=binomial(link='logit')))
fitted_models %>%
tidy(model)%>%
View #use broom package to look
I'm stuck as to why I'm having this particular issue.
The issue is in your glm call. Remove data=diamonds and replace it with data = . (a single dot):
fitted_models <- diamonds %>%
group_by(clarity) %>% #Group by clarity
do(model=glm(E~price,#regress price on E
data = .,
family=binomial(link='logit')))
fitted_models %>%
tidy(model)
Whenever you use do(), you need to refer to the grouped data frame with a single dot (.). As your code currently reads, you are referencing the original, ungrouped frame, not the one passed to do() by the pipe. For example, you cannot just call for the column E; you need to use .$E. An alternative solution would be glm(.$E~.$price)
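
For comparison, a sketch of the same idea with group_modify() instead of do(), assuming a recent dplyr and that ggplot2 is installed (it supplies the diamonds data). The key point is unchanged: the model must be fitted on the per-group data (.x here), not on the full diamonds frame.

```r
library(dplyr)

# diamonds ships with ggplot2; E flags diamonds of colour "E"
diamonds <- ggplot2::diamonds %>%
  mutate(E = if_else(color == "E", 1, 0))

coef_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  group_modify(~ {
    # .x holds only the rows of the current clarity level
    cf <- coef(glm(E ~ price, data = .x, family = binomial(link = "logit")))
    data.frame(term = names(cf), estimate = unname(cf))
  }) %>%
  ungroup()

print(coef_by_clarity)  # two rows (intercept and slope) per clarity level
```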

Correlations between vectors in two groups (defined by: group_by)

I want to make a correlation between two vectors in two different groups (defined by group_by). The solution needs to be based on dplyr.
My data is in the so-called CDISC format. For simplicity: here is some dummy data (note that one column ("values") holds all the data):
n=5
bmi<-rnorm(n=n,mean=25)
glucose<-rnorm(n=n,mean=5)
insulin<-rnorm(n=n,mean=10)
id<-rep(paste0("id",1:n),3)
myData<-data.frame(id=id,measurement=c(rep("BMI",n),rep("glucose",n),rep("insulin",n)),values=c(bmi,glucose,insulin))
Keep in mind that all my functions for working with this kind of data use the dplyr package, for example:
myData %>% group_by(measurement) %>% summarise(mean(values), n())
How do I get the correlation between glucose and insulin (cor(glucose, insulin))? Or in a more general way: how do I get the correlation between two groups?
The following solution is obviously very wrong (but may help to understand my question):
myData %>% group_by(measurement) %>% summarise(cor(glucose,insulin))
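
One possible dplyr-only sketch: rather than grouping by measurement, keep the data ungrouped and pick out the two sets of values by filtering on the measurement column inside summarise(). This assumes each id appears exactly once per measurement, and it sorts by id first so the two vectors line up. The dummy data from the question is rebuilt with a fixed seed for reproducibility.

```r
library(dplyr)

set.seed(1)
n <- 5
myData <- data.frame(
  id          = rep(paste0("id", 1:n), 3),
  measurement = rep(c("BMI", "glucose", "insulin"), each = n),
  values      = c(rnorm(n, mean = 25), rnorm(n, mean = 5), rnorm(n, mean = 10))
)

result <- myData %>%
  arrange(id) %>%  # align the subjects across measurements
  summarise(
    r = cor(values[measurement == "glucose"],
            values[measurement == "insulin"])
  )

print(result)
```

Reshaping to one column per measurement first would be another route, but that needs a package beyond dplyr.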
