I've seen a few variations of this question, but they don't seem to specifically answer what I'm trying to accomplish. I have a data frame (df):
month ter dist emp_count var1 var2
1 1 10 21 3000 5120
2 1 10 20 3100 5340
3 1 10 20 3100 5543
4 1 10 21 3250 5625
5 1 10 24 3200 5254
6 1 10 25 3300 5634
7 1 10 26 3600 5435
8 1 10 26 3900 7546
. . . . . .
. . . . . .
. . . . . .
ter holds the values 1, 2, 3, or 4. And dist can be any number 1 thru 50. I want to do a simple regression multiple times based on ter or dist
I have this:
model = lm(var1 ~ emp_count, data = df)
summary(model)
But I'd rather not write out a regression 50 times if I want to compare based on dist.
Split by your iter or dist, then lapply your fit and summary
lapply(split(df, df$dist), function(x) summary(lm(var1 ~ emp_count, data = x)))
Other solution can be accomplished using dplyr and broom packages. Here is the code for your example. First you need to extract the coefficients and p values for the intercept and slope for every linear model (lm), grouped by the variable dist. tidy is like the summary function used for summary(lm).
library(dplyr)
library(broom)
lmodelsCoef <- df %>%
group_by(dist) %>%
do(tidy(lm(ar1 ~ emp_count, .)))
lmodelsCoef <- lmodelsCoef %>%
group_by(dist) %>%
summarize(intercept = estimate[1],
p.value_intercept = p.value[1],
slope = estimate[2],
p.value_slope = p.value[2])
Next you need to extract the r squared value. However, this value is found in the lm object (not in the summary(lm) one). Therefore, you need to use glance for that.
lmodelsCoef2 <- df %>%
group_by(dist) %>%
do(glance(lm(ar1 ~ emp_count, .)))
lmodelsCoef2 <- lmodelsCoef2 %>%
group_by(dist) %>%
summarize(r.squared = r.squared)
#Get the final df
df_lm<-data.frame(lmodelsCoef,
r.squared = lmodelsCoef2$r.squared)
If you only need the coefficients and p-values, then you can use lmList from nlme or lme4:
library(lme4)
df = data.frame(dist=rep(1:50,each=50),
month=sample(1:12,2500,replace=TRUE),
emp_count=rpois(20,2500),
var1=rpois(2500,40),var2=rpois(2500,50))
lmList(var1 ~ emp_count | dist,data=df)
Call: lmList(formula = var1 ~ emp_count | dist, data = df)
Coefficients:
(Intercept) emp_count
1 9.9885028 1.257080e-02
2 96.5774029 -2.238488e-02
3 11.5427710 1.143071e-02
4 37.5422288 8.699393e-04
5 -44.4468575 3.367506e-02
6 50.4651290 -4.084562e-03
To get p-values,std error etc:
summary(lmList(var1 ~ emp_count | dist,data=df))
Related
I am beginner of R. I need to transfer some Eviews code to R. There are some loop code to add 10 or more columns\variables with some function in data in Eviews.
Here are eviews example code to estimate deflator:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used dplyr package and use mutate function. But it is very hard to add many variables.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv,
deflator_gov=nominal_gov/real_gov,
deflator_exp=nominal_exp/real_exp)
print(df)
Please help me to this in R by loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to the nice dplyr-compatible data.frame.
You need to gather your data using tidyr::gather. However, because you have 2 sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps, there may be a better way though.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
We can loop over column names using purrr::map_dfc then apply a custom function over the selected columns (i.e. the columns that matched the current name from nms)
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the ouptut as a list not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: Learn more about quo_name, !!, and ensym which they are tools for programming with dplyr here
This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 4 years ago.
I have a table:
CityData ->
City Price Bathrooms Bedrooms Porch
Milwaukee 2300 2 3 yes
Chicago 3400 3 2 yes
Springfield 2300 1 1 no
Chicago 2390 2 1 yes
I would like to run a regression for each city (multiple rows per city) to give me coefficients for each city. I want to regress price on the other confounding variables (bathrooms, bedrooms, porch).
I tried the dplyr library:
library(dplyr)
fitted_models = CityData %>%
group_by(CityData$City) %>%
do(model = lm(CityData$Price ~ CityData$Bathrooms +
CityData$Porch + CityData$Bedrooms, data = CityData))
But the output is just
14 lm list
14 lm list
14 lm list
Any suggestions?
You might try something like this. Here I'll use the mtcars data as an example.
df <- mtcars
models <- df %>% group_by(cyl) %>% summarise(mod = list(lm(mpg ~ wt)))
This will give you a new variable mod that holds all the info for your model. You can call the coefficients like:
models$mod[[1]]$coefficients
(Intercept) wt
39.571196 -5.647025
You can get more complex with it too.
models <- df %>% group_by(cyl) %>% summarise(mod = list(lm(mpg ~ wt + hp)))
models$mod[[1]]$coefficients
(Intercept) wt hp
45.83607319 -5.11506233 -0.09052672
Of course models will also still also hold the info for the group
models$cyl
[1] 4 6 8
Alright, I'm waving my white flag.
I'm trying to compute a loess regression on my dataset.
I want loess to compute a different set of points that plots as a smooth line for each group.
The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.
Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.
I just can't figure out how to make this work on a per-group basis.
Here are some examples of my failed attempts.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))
> test2
# A tibble: 136 x 2
# Groups: CpG [4]
CpG x
<chr> <dbl>
1 cg01003813 0.781
2 cg01003813 0.793
3 cg01003813 0.805
4 cg01003813 0.816
5 cg01003813 0.829
6 cg01003813 0.841
7 cg01003813 0.854
8 cg01003813 0.866
9 cg01003813 0.878
10 cg01003813 0.893
This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::do({
predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
})
This one simply fails with the following error.
"Error: Results 1, 2, 3, 4 must be data frames, not numeric"
Also it still isn't applied as a new column with dplyr::mutate
fems <- fems %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))
This was my fist attempt and mostly resembles what I want to do. Problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.
I am really stuck here. I read online that the purr package might help, but I'm having trouble figuring it out.
data looks like this:
> head(test)
X geneID CpG CellLine Meth AVGMOrder neworder Group SmoothMeth
1 40 XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210 1 1 5 0.7808767
2 94 XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120 1 1 5 0.7927130
3 148 XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448 1 1 5 0.8045080
4 202 XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890 1 1 5 0.8163997
5 27 XG cg25296477 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880 2 2 3 0.8285259
6 81 XG cg01003813 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240 2 2 3 0.8409501
unique(test$CpG)
[1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"
So, to be clear, I want to do a loess regression on each unique CpG in my dataframe, apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).
My actual dataset has a few thousand of those CpG's, not just the four.
https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing
This is a neat Tidyverse way to make it work:
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
models <- fems %>%
tidyr::nest(-CpG) %>%
dplyr::mutate(
# Perform loess calculation on each CpG group
m = purrr::map(data, loess,
formula = Meth ~ AVGMOrder, span = .5),
# Retrieve the fitted values from each model
fitted = purrr::map(m, `[[`, "fitted")
)
# Apply fitted y's as a new column
results <- models %>%
dplyr::select(-m) %>%
tidyr::unnest()
# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
geom_point() +
geom_line(aes(y = fitted))
You may have already figured this out -- but if not, here's some help.
Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.
So for your case:
fems <- fems %>%
group_by(CpG) %>%
arrange(CpG, AVGMOrder) %>%
mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))
Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.
Unfortunately, the approaches described above did not work in my case. Thus, I implemented the Loess prediction into a regular function, which worked very well. In the example below, the data is contained in the df data frame while we group by df$profile and want to fit the Loess prediction into the df$daily_sum values.
# Define important variables
span_60 <- 60/365 # 60 days of a year
span_365 <- 365/365 # a whole year
# Group and order the data set
df <- as.data.frame(
df %>%
group_by(profile) %>%
arrange(profile, day) %>%
)
)
# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
# Declare that the loess column exists, but is blank
df$loess_60 <- NA
df$loess_365 <- NA
# Identify all unique profilee IDs
all_ids <- unique(x$profile)
# Iterate through the unique profilee IDs, determine the length of each vector (which should correspond to 365 days)
# and isolate the according rows that belong to the profilee ID.
for (i in all_ids) {
len_entries <- length(which(x$profile == i))
queried_rows <- result <- x[which(x$profile == i), ]
# Run the loess fit and write the result to the according column
fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_60))
fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_365))
x[which(x$profile == i), "loess_60"] <- fit_60
x[which(x$profile == i), "loess_365"] <- fit_365
}
# Return the initial data frame
return(x)
}
# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)
This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I have imported a CSV with 3 columns , 2 columns for Y and X and the third column which identifies the category for X ( I have 20 groups/categories). I am able to run a regression at overall level but I want to run regression for the 20 categories separately and store the co-efs.
I tried the following :
list2env(split(sample, sample$CATEGORY_DESC), envir = .GlobalEnv)
Now I have 20 files, how do I run a regression on these 20 files and store the co-effs somewhere.
Since no data was provided, I am generating some sample data to show how you can run multiple regressions and store output using dplyr and broom packages.
In the following, there are 20 groups and different x/y values per group. 20 regressions are run and output of these regressions is provided as a data frame:
library(dplyr)
library(broom)
df <- data.frame(group = rep(1:20, 10),
x = rep(1:20, 10) + rnorm(200),
y = rep(1:20, 10) + rnorm(200))
df %>% group_by(group) %>% do(tidy(lm(x ~ y, data = .)))
Sample output:
Source: local data frame [40 x 6]
Groups: group [20]
group term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 0.42679228 1.0110422 0.4221310 0.684045203
2 1 y 0.45625124 0.7913256 0.5765657 0.580089051
3 2 (Intercept) 1.99367392 0.4731639 4.2134955 0.002941805
4 2 y 0.05101438 0.1909607 0.2671460 0.796114398
5 3 (Intercept) 3.14391308 0.8417638 3.7349114 0.005747126
6 3 y 0.08418715 0.2453441 0.3431391 0.740336702
Quick solution with lmList (package nlme):
library(nlme)
lmList(x ~ y | group, data=df)
Call:
Model: x ~ y | group
Data: df
Coefficients:
(Intercept) y
1 0.4786373 0.04978624
2 3.5125369 -0.94751894
3 2.7429958 -0.01208329
4 -5.2231576 2.24589181
5 5.6370824 -0.24223131
6 7.1785581 -0.08077726
7 8.2060808 -0.18283134
8 8.9072851 -0.13090764
9 10.1974577 -0.18514527
10 6.0687105 0.37396911
11 9.0682622 0.23469187
12 15.1081915 -0.29234452
13 17.3147636 -0.30306692
14 13.1352411 0.05873189
15 6.4006623 0.57619151
16 25.4454182 -0.59535396
17 22.0231916 -0.30073768
18 27.7317267 -0.54651597
19 10.9689733 0.45280604
20 23.3495704 -0.14488522
Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226
Borrowed the data df from #Gopala answer.
Consider also a base solution with lapply():
regressionList <- lapply(unique(df$group),
function(x) lm(x ~ y, df[df$group==x,]))
And only the coefficients:
coeffList <- lapply(unique(df$group),
function(x) lm(x ~ y, df[df$group==x,])$coefficients)
Even list of summaries:
summaryList <- lapply(unique(df$group),
function(x) summary(lm(x ~ y, df[df$group==x,])))
I'm new in R and what I want to do is something very simple but I need help.
I have a database that looks like the one above; where spot number = "name" of a protein, grupo = group I and II and APF = fluorescent reading.
I want to do a tstudent test to each protein, by comparing groups I and II, but in a loop.
In the database above there only 1 protein (147) but im my real database i have 444 proteins.
Starting with some fake data:
set.seed(0)
Spot.number <- rep(147:149, each=10)
grupo <- rep(rep(1:2, each=5), 3)
APF <- rnorm(30)
gel <- data.frame(Spot.number, grupo, APF)
> head(gel)
Spot.number grupo APF
1 147 1 2.1780699
2 147 1 -0.2609347
3 147 1 -1.6125236
4 147 1 1.7863384
5 147 1 2.0325473
6 147 2 0.6261739
You can use lapply to loop through the subsets of gel, split by the Spot.number:
tests <- lapply(split(gel, gel$Spot.number), function(spot) t.test(APF ~ grupo, spot))
or just
tests <- by(gel, gel$Spot.number, function(spot) t.test(APF ~ grupo, spot))
You can then move on to e.g. taking only the p values:
sapply(tests, "[[", "p.value")
# 147 148 149
#0.2941609 0.9723856 0.5726007
or confidence interval
sapply(tests, "[[", "conf.int")
# 147 148 149
# [1,] -0.985218 -1.033815 -0.8748502
# [2,] 2.712395 1.066340 1.4240488
And the resulting vector or matrix will already have the Spot.number as names which can be very helpful.
You can perform a t.test within each group using dplyr and my broom package. If your data is stored in a data frame called dat, you would do:
library(dplyr)
library(broom)
results <- dat %>%
group_by(Spot.number) %>%
do(tidy(t.test(APF ~ grupo, .)))
This works by performing t.test(APF ~ grupo, .) on each group defined by Spot.number. The tidy function from broom then turns it into a one-row data frame so that it can be recombined. The results data frame will then contain one row per protein (Spot.number) with columns including estimate, statistic, and p.value.
See this vignette for more on the combination of dplyr and broom.