I need to create a new column containing the median of the ETR variable within each industry (SIC) for a sample of firms.
However, I need to exclude the firm itself before calculating the industry (SIC) median of ETR.
Does anyone have any suggestions on how I could accomplish this?
Any help would be appreciated.
Thank you!
Sample Data:
Firm SIC ETR
1 20 10
2 20 15
3 20 20
4 20 25
5 20 30
6 21 50
7 21 55
8 21 60
9 21 65
10 21 70
Should Become:
Firm SIC ETR ETR_Median
1 20 10 22.5
2 20 15 22.5
3 20 20 20
4 20 25 17.5
5 20 30 17.5
6 21 50 62.5
7 21 55 62.5
8 21 60 60
9 21 65 57.5
10 21 70 57.5
So, firm #4, for example, has an industry (SIC) median of 17.5 when only considering the other firms in the same industry (SIC).
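For reference, the sample data can be reconstructed like this (the answers below assume it sits in a data frame called df):
df <- data.frame(
  Firm = 1:10,
  SIC  = rep(c(20, 21), each = 5),
  ETR  = c(10, 15, 20, 25, 30, 50, 55, 60, 65, 70)
)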
Consider splitting the data into SIC groups and, within each group, iterating across its Firm values, excluding each one in turn from the median calculation. Specifically, using:
by (for grouping into subset dfs)
sapply (to iterate across Firm values and call median)
unlist (to convert the list to a vector for df column binding)
Altogether:
# note: unlist() concatenates the groups in sorted SIC order, so this
# assignment assumes df is already ordered by SIC (as the sample is)
df$ETR_median <- unlist(by(df, df$SIC, function(sub)
  sapply(sub$Firm, function(f) median(sub$ETR[sub$Firm != f]))
))
df
# Firm SIC ETR ETR_median
# 1 1 20 10 22.5
# 2 2 20 15 22.5
# 3 3 20 20 20.0
# 4 4 20 25 17.5
# 5 5 20 30 17.5
# 6 6 21 50 62.5
# 7 7 21 55 62.5
# 8 8 21 60 60.0
# 9 9 21 65 57.5
# 10 10 21 70 57.5
You could create a function that excludes the current observation before conducting the median calculation:
median_excl <- function(x){
# pre-allocate our result vector:
med_excl <- vector(length = length(x))
# loop through our vector, excluding the current index and taking the median:
for(i in seq_along(x)){
x_excl <- x[-i]
med <- median(x_excl)
med_excl[i] <- med
}
return(med_excl)
}
Then simply apply it per group using dplyr or however you choose:
library(dplyr)
df %>% group_by(SIC) %>% mutate(ETR_Median = median_excl(ETR))
# Firm SIC ETR ETR_Median
# 1 1 20 10 22.5
# 2 2 20 15 22.5
# 3 3 20 20 20.0
# 4 4 20 25 17.5
# 5 5 20 30 17.5
# 6 6 21 50 62.5
# 7 7 21 55 62.5
# 8 8 21 60 60.0
# 9 9 21 65 57.5
# 10 10 21 70 57.5
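For larger panels, the same leave-one-out idea carries over to data.table (a sketch, using the df from the question):
library(data.table)
dt <- as.data.table(df)
# within each SIC group, take the median of ETR with row i dropped
dt[, ETR_Median := sapply(seq_len(.N), function(i) median(ETR[-i])), by = SIC]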
I'm trying to create a simple function that summarizes a variable of choice via a dplyr chain. Here's my attempt:
get_mutated_df <- function(data, outcome){
{{data}} %>% group_by(speed) %>%
summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
print()
}
data(cars)
get_mutated_df(cars, "dist")
However, this returns a tibble of NAs:
# A tibble: 19 × 2
speed dist_mean
<dbl> <dbl>
1 4 NA
2 7 NA
3 8 NA
4 9 NA
What's the appropriate way of doing this?
Drop the {{ }} around data and remove the quotes around dist, and it will work:
get_mutated_df <- function(data, outcome){
data %>% group_by(speed) %>%
summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
print()
}
get_mutated_df(cars, dist)
speed dist_mean
<dbl> <dbl>
1 4 6
2 7 13
3 8 16
4 9 10
5 10 26
6 11 22.5
7 12 21.5
8 13 35
9 14 50.5
10 15 33.3
11 16 36
12 17 40.7
13 18 64.5
14 19 50
15 20 50.4
16 22 66
17 23 54
18 24 93.8
19 25 85
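If you prefer to keep passing the column name as a quoted string, the .data pronoun handles that case (a sketch; the function name get_mutated_df_chr is hypothetical):
get_mutated_df_chr <- function(data, outcome){
  data %>%
    group_by(speed) %>%
    # .data[[ ]] looks the column up by its string name
    summarize(dist_mean = mean(.data[[outcome]], na.rm = TRUE))
}
get_mutated_df_chr(cars, "dist")  # works with the original quoted call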
Code for the pre_ variant (looking up the column named pre_<outcome>):
carss <- cars
carss$pre_dist <- cars$dist
get_mutated_df_2 <- function(data, outcome){
  outcome <- deparse(substitute(outcome))  # capture the bare name as a string, e.g. "dist"
  outcome <- paste0("pre_", outcome)       # build "pre_dist"
  outcome <- as.symbol(outcome)            # convert back to a symbol for tidy evaluation
  data %>% group_by(speed) %>%
    summarize(dist_mean := mean({{outcome}}, na.rm = TRUE)) %>%
    print()
}
get_mutated_df_2(carss, dist)
speed dist_mean
<dbl> <dbl>
1 4 6
2 7 13
3 8 16
4 9 10
5 10 26
6 11 22.5
7 12 21.5
8 13 35
9 14 50.5
10 15 33.3
11 16 36
12 17 40.7
13 18 64.5
14 19 50
15 20 50.4
16 22 66
17 23 54
18 24 93.8
19 25 85
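A lighter-weight sketch of the same pre_ lookup builds the prefixed name as a string and uses the .data pronoun instead of symbol manipulation (the function name get_mutated_df_2b is hypothetical):
get_mutated_df_2b <- function(data, outcome){
  # capture the bare column name, e.g. dist -> "dist", then prefix it
  nm <- paste0("pre_", deparse(substitute(outcome)))
  data %>%
    group_by(speed) %>%
    summarize(dist_mean = mean(.data[[nm]], na.rm = TRUE))
}
get_mutated_df_2b(carss, dist)  # same result as above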
I guess something similar has been asked before; however, I could only find answers for Python and SQL. Please let me know in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value of n positions and subtracting from it the average value of the next n positions; let's say n = 5 for now.
What I tried
I used the method below; however, when I apply it to a bigger data frame it takes a huge amount of time, so I wonder whether there is a faster approach.
calc <- function( pos ) {
this.five <- df %>% slice(pos:(pos+4))
next.five <- df %>% slice((pos+5):(pos+9))
differ = mean(this.five$value)- mean(next.five$value)
data.frame(dif= differ)
}
df %>%
group_by(position) %>%
do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
# left-aligned rolling means (positions i..i+4) of each column in .SD
df[, c("roll.position","rollmean") := lapply(.SD, frollmean, n = 5, fill = NA, align = "left")]
# difference between the window starting at each row and the one starting 5 rows later
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[, .(position, value, rollmean, result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
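For comparison, the same left-aligned rolling mean also fits into the dplyr pipeline the question already uses, via zoo (a sketch, assuming the zoo package is installed):
library(dplyr)
library(zoo)
df %>%
  mutate(rollmean = rollmean(value, k = 5, fill = NA, align = "left"),
         dif = rollmean - lead(rollmean, 5))  # window here minus window 5 rows on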
I have two data frames: one, A, with all the data, and a smaller one, B, that contains a unique identifier of A and column names of A. I am trying to add a column to A based on what B points to. In other words, I need to pull from A the values that B points at.
For example
A<-airquality
B<-data.frame(Month=unique(A$Month),col=c("Ozone","Solar.R", "Wind", "wind","Solar.R"))
This would give me the following
> head(A)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> B
Month col
1 5 Ozone
2 6 Solar.R
3 7 Wind
4 8 wind
5 9 Solar.R
The result should be something like
> head(A)
Ozone Solar.R Wind Temp Month Day ADDED
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
4 18 313 11.5 62 5 4 18
5 NA NA 14.3 56 5 5 NA
6 28 NA 14.9 66 5 6 28
> tail(A)
Ozone Solar.R Wind Temp Month Day ADDED
148 14 20 16.6 63 9 25 20
149 30 193 6.9 70 9 26 193
150 NA 145 13.2 77 9 27 145
151 14 191 14.3 75 9 28 191
152 18 131 8.0 76 9 29 131
153 20 223 11.5 68 9 30 223
The only way I can do it is
for(i in 1:nrow(B))
{
j<-A$Month==B$Month[i]
k<-subset(A, select=B$col[i])[j,]
A$ADDED[j]<-k
}
While this does work, it becomes extremely slow as I have a big dataset. I feel like I am doing it the dumb way. What is a good way of doing this?
Thanks!
You could do this with the sapply or lapply functions.
ADDED <- sapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED <- unlist(ADDED)
For partial matching, where B only covers some months, you would have to filter the data to get only the rows for which B has values and then assign. But before that, we have to assign a value for all rows of the ADDED column; in this case NA.
A$ADDED = NA
A[A$Month %in% B$Month,]$ADDED <- unlist(ADDED)
That already takes only about a third of the time compared to a for loop:
appl <- function(){
ADDED <- sapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED1 <- unlist(ADDED)
}
lappl <- function(){
ADDED <- lapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED1 <- unlist(ADDED)
}
forlo <- function(){
for(i in 1:nrow(B)) {
j<-A$Month==B$Month[i]
k<-subset(A, select=B$col[i])[j,]
A$ADDED[j]<-k
}
}
library(microbenchmark)
mc <- microbenchmark(times = 1000,
sapply = appl(),
lapply = lappl(),
forloop = forlo()
)
mc
Unit: microseconds
expr min lq mean median uq max neval cld
sapply 337.478 359.2125 378.6964 369.7775 385.474 2324.913 1000 a
lapply 319.367 340.7990 366.8448 349.2510 362.532 9051.828 1000 a
forloop 964.136 1013.6415 1074.5584 1032.5070 1059.825 5116.802 1000 b
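If every entry of B$col matches a column name of A exactly, a fully vectorized sketch with matrix indexing avoids looping over B altogether (unmatched names, like the lowercase "wind" in the example, come back as NA):
# which column name does B prescribe for each row's Month?
col_for_row <- B$col[match(A$Month, B$Month)]
# pick one (row, column) cell per row of A via matrix indexing
A$ADDED <- A[cbind(seq_len(nrow(A)), match(col_for_row, names(A)))]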
I am trying to compare the tensile strength of different weights of material in R. The tensile data is as follows:
tensile <- read.table(text=" Weight Strength Replicate
1 15 7 1
2 15 7 2
3 15 15 3
4 15 11 4
5 15 9 5
6 20 12 1
7 20 17 2
8 20 12 3
9 20 18 4
10 20 18 5
11 25 14 1
12 25 18 2
13 25 18 3
14 25 19 4
15 25 19 5
16 30 19 1
17 30 25 2
18 30 22 3
19 30 19 4
20 30 23 5
21 35 7 1
22 35 10 2
23 35 11 3
24 35 15 4
25 35 11 5", header=TRUE)
The variable Weight should be regarded as a factor (explanatory/independent variable) for the purpose of this analysis:
tensile$Weight <- factor(tensile$Weight)
I first fitted a one-way ANOVA model to my data:
tensile.aov <- aov(Strength ~ Weight, data = tensile)
According to the ANOVA, there appears to be a difference among the weights with respect to the response (strength). So I then decided to do pairwise comparisons using the LSD (Least Significant Difference):
LSD.aov(tensile.aov)
However, this LSD function was provided through a separate file, so I'm unfortunately unable to share the code here.
I calculated the LSD for my data and obtained a table of pairwise p-values.
Note that, according to the raw p-values, the comparisons between the 35 and 15 weights and between the 25 and 20 weights are the only ones that are not significantly different at the alpha = 0.05 level; all other pairwise comparisons are significantly different. I want to create a letter summary to illustrate this, where groups share a letter only if they are not significantly different from each other, and groups that do not share a letter are significantly different:
How can I go about creating such a table in R?
I'm also totally open to a 'manual' solution. By this, I mean manually creating a table using vectors and such. I'm new to R, so I don't have a good grasp on even the most basic aspects.
The multcompView package can turn p-values into letters, but in this case, the emmeans package can do both the comparison and the letters.
library(emmeans)
em <- emmeans(tensile.aov, ~Weight)
summary(pairs(em, adjust="none"), infer=TRUE)
#> contrast estimate SE df lower.CL upper.CL t.ratio p.value
#> 15 - 20 -5.6 1.79555 20 -9.3454518 -1.8545482 -3.119 0.0054
#> 15 - 25 -7.8 1.79555 20 -11.5454518 -4.0545482 -4.344 0.0003
#> 15 - 30 -11.8 1.79555 20 -15.5454518 -8.0545482 -6.572 <.0001
#> 15 - 35 -1.0 1.79555 20 -4.7454518 2.7454518 -0.557 0.5838
#> 20 - 25 -2.2 1.79555 20 -5.9454518 1.5454518 -1.225 0.2347
#> 20 - 30 -6.2 1.79555 20 -9.9454518 -2.4545482 -3.453 0.0025
#> 20 - 35 4.6 1.79555 20 0.8545482 8.3454518 2.562 0.0186
#> 25 - 30 -4.0 1.79555 20 -7.7454518 -0.2545482 -2.228 0.0375
#> 25 - 35 6.8 1.79555 20 3.0545482 10.5454518 3.787 0.0012
#> 30 - 35 10.8 1.79555 20 7.0545482 14.5454518 6.015 <.0001
#>
#> Confidence level used: 0.95
cld(em, adjust="none")
#> Weight emmean SE df lower.CL upper.CL .group
#> 15 9.8 1.269646 20 7.151566 12.44843 1
#> 35 10.8 1.269646 20 8.151566 13.44843 1
#> 20 15.4 1.269646 20 12.751566 18.04843 2
#> 25 17.6 1.269646 20 14.951566 20.24843 2
#> 30 21.6 1.269646 20 18.951566 24.24843 3
#>
#> Confidence level used: 0.95
#> significance level used: alpha = 0.05
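For reference, the multcompView route mentioned at the top would look roughly like this (a sketch: the p-values are copied from the pairwise table above, with <.0001 approximated as 0.0001):
library(multcompView)
# named vector of raw p-values; names follow the "level-level" convention
pvals <- c("15-20" = 0.0054, "15-25" = 0.0003, "15-30" = 0.0001,
           "15-35" = 0.5838, "20-25" = 0.2347, "20-30" = 0.0025,
           "20-35" = 0.0186, "25-30" = 0.0375, "25-35" = 0.0012,
           "30-35" = 0.0001)
multcompLetters(pvals, threshold = 0.05)
# 15 and 35 share a letter, 20 and 25 share another, 30 stands alone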
I managed to do it as follows:
Weight = c(15, 20, 25, 30, 35)
mean = c(9.8, 15.4, 17.6, 21.6, 10.8)
letters = c("a", "b", "b", "c", "a")  # 30 gets its own letter, per the cld output
LSDletterSummary <- data.frame(Weight, mean, letters)
LSDletterSummary
If anyone has a better way to go about it, feel free to share.
In the "An Introduction to reshape2" package Sean C. Anderson presents the following example.
He uses the airquality data and renames the column names
names(airquality) <- tolower(names(airquality))
The data look like
# ozone solar.r wind temp month day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
Then he melts them by
aql <- melt(airquality, id.vars = c("month", "day"))
to get
# month day variable value
# 1 5 1 ozone 41
# 2 5 2 ozone 36
# 3 5 3 ozone 12
# 4 5 4 ozone 18
# 5 5 5 ozone NA
# 6 5 6 ozone 28
Finally he gets the original one (different column order) by
aqw <- dcast(aql, month + day ~ variable)
My Question
Assume now that we do not have ID variables (i.e. month and day) and have melted the data as follows
aql <- melt(airquality)
which look like
# variable value
# 1 ozone 41
# 2 ozone 36
# 3 ozone 12
# 4 ozone 18
# 5 ozone NA
# 6 ozone 28
My question is how can I get the original ones? The original ones would correspond to
# ozone solar.r wind temp
# 1 41 190 7.4 67
# 2 36 118 8.0 72
# 3 12 149 12.6 74
# 4 18 313 11.5 62
# 5 NA NA 14.3 56
# 6 28 NA 14.9 66
Another option is unstack:
out <- unstack(aql, value ~ variable)
head(out)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
As the question is about dcast, we can create a sequence column within each variable and then use dcast:
aql$indx <- with(aql, ave(seq_along(variable), variable, FUN=seq_along))
out1 <- dcast(aql, indx~variable, value.var='value')[,-1]
head(out1)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
If you are using data.table, the devel version (v1.9.5) also has a dcast method. Instructions to install the devel version are here:
library(data.table)#v1.9.5+
setDT(aql)[, indx:=1:.N, variable]
dcast(aql, indx~variable, value.var='value')[,-1]
One option uses split:
out <- data.frame(sapply(split(aql, aql$variable), `[[`, 2))
Here the data is split by the variable column, and the second column of each chunk (the values) is extracted and bound back into a data frame (sapply applies the [[ function with argument 2 to each chunk).
head(out)
# ozone solar.r wind temp month day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
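Since melt() with no id variables simply stacks the columns in order, a base R sketch can fold the value vector straight back into a matrix (this assumes every variable occurs equally often and in the original order, which holds for a plain melt of airquality):
out <- as.data.frame(matrix(aql$value,
                            ncol = nlevels(aql$variable),
                            dimnames = list(NULL, levels(aql$variable))))
head(out)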