Putting the output of the describe function in a table in R

> structure(dat_de$total_all)
[1] 11 11 9 6 9 15 10 6 11 10 10 9 7 13 7 5 5 8 10 14 9 10 13 6 10 11 12 22 11 1 7 9 12 7 7 11 9 7 15 10 6 10
[43] 8 10 9 8 14 5 10 12 14 9 10 18 8 8 15
> structure(dat_en$total_all)
[1] 25 10 12 17 10 11 11 9 9 25 14 10 13 22 13 10 11 15 20 11 9 15 9 14 10 19 10 9 8 14 4 18 16 7 10 13 9 11 12
This is my variable "total_all" in the German and English versions.
I want to put the results of the describe function (see below) of these two variables in a presentable table. Preferably one table for both variables, if that is possible.
> describe(dat_de$total_all)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 57 9.81 3.45 10 9.62 2.97 1 22 21 0.73 1.81 0.46
> describe(dat_en$total_all)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 39 12.69 4.69 11 12.24 2.97 4 25 21 1.01 0.61 0.75
I'm grateful for your help :)

I'm not quite sure which library the describe function comes from [edit: it looks like you're using the one from the psych package], but you can make a simple, nice-looking table using the kable function from knitr:
library(knitr)
library(psych)
de_dat_descr <- data.frame(describe(dat_de$total_all), row.names = "de_dat_descr")
en_dat_descr <- data.frame(describe(dat_en$total_all), row.names = "en_dat_descr")
dat.df <- t(rbind.data.frame(de_dat_descr, en_dat_descr))
kable(dat.df)
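A self-contained sketch of the same idea, with hypothetical stand-in vectors in place of the unshared data (swap in dat_de$total_all and dat_en$total_all); the hand-rolled summary mirrors part of describe()'s column set, and kable's digits argument rounds for readability:

```r
library(knitr)

# Hypothetical stand-ins; replace with dat_de$total_all / dat_en$total_all
de <- c(11, 11, 9, 6, 9, 15)
en <- c(25, 10, 12, 17, 10, 11)

# Compute a describe()-like row of statistics for one vector
summarise_vec <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
}

# One table with a row per language version, rounded to two decimals
tab <- rbind(German = summarise_vec(de), English = summarise_vec(en))
kable(tab, digits = 2)
```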

Related

For Loop Alternative on large data frame that runs a different filter with each iteration

I'm running a loop that takes the ranking from R1[i] and filters a data frame of all rankings in the specified range, while at the same time filtering a different column, R2[i], to find the ranking of an opponent. I end up with a new data frame that only includes matches involving players in those specific ranking ranges, so that I can find the mean of a column for only those matches.
For Example: Player 1 is Ranked 10th and Player 2 is Ranked 34th. The following code takes every match including players ranked between 5-15 +/- 20% of 10 and players ranked between 29-39 +/- 20% of 34.
Then it finds the mean of Data_Dif and returns to the initial DF in row [i] and does so for every row.
This code works fine but it's a bit messy and it takes 4 hours to run 57,000 matches. Does anyone have a faster solution please? I have to run this every day.
for (i in 1:nrow(Data)) {
  Rank <- Data %>%
    filter(between(R1, Data$R1[i] - 5 - Data$R1[i] * 0.2, Data$R1[i] + 5 + Data$R1[i] * 0.2) |
             between(R1, Data$R2[i] - 5 - Data$R2[i] * 0.2, Data$R2[i] + 5 + Data$R2[i] * 0.2)) %>%
    filter(between(R2, Data$R1[i] - 5 - Data$R1[i] * 0.2, Data$R1[i] + 5 + Data$R1[i] * 0.2) |
             between(R2, Data$R2[i] - 5 - Data$R2[i] * 0.2, Data$R2[i] + 5 + Data$R2[i] * 0.2))
  Rank_Difference <- Data$Rank_Dif[i]
  Rank <- Rank %>% filter(Rank_Dif >= Rank_Difference - 5)
  Data$Rank_Adv[i] <- mean(Rank$Data_Dif)
}
Data
R1 R2 Rank_Dif Data_Dif Rank_Adv
1 2 1 1 -0.272 0.037696970
2 10 34 24 0.377 0.146838617
3 10 29 19 0.373 0.130336232
4 2 5 3 0.134 0.076242424
5 34 17 17 -0.196 0.094226519
6 1 18 17 0.144 0.186158879
7 17 25 8 0.264 0.036212219
8 42 18 24 0.041 0.102343915
9 5 13 8 -0.010 0.091952381
10 34 21 13 -0.226 0.060790576
11 2 14 12 0.022 0.122350649
12 10 158 148 0.330 0.184901961
13 11 1 10 -0.042 0.109918367
14 29 52 23 0.463 0.054469108
15 10 1000 990 0.628 0.437600000
16 17 329 312 0.445 0.307750000
17 11 20 9 0.216 0.072621875
18 417 200 217 -0.466 0.106737401
19 5 53 48 0.273 0.243890710
20 14 7 7 -0.462 0.075739414
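One way this often speeds up dramatically (a hypothetical sketch, using only the columns shown above on a small subset of the sample data): extract the columns as plain vectors once, then do the range tests with logical indexing inside vapply, avoiding the cost of rebuilding dplyr pipelines on the full data frame in every iteration.

```r
# Hypothetical subset of the data shown above, for illustration
Data <- data.frame(
  R1 = c(2, 10, 10, 2, 34),
  R2 = c(1, 34, 29, 5, 17),
  Rank_Dif = c(1, 24, 19, 3, 17),
  Data_Dif = c(-0.272, 0.377, 0.373, 0.134, -0.196)
)

# TRUE where x falls within r - 5 - 20% of r .. r + 5 + 20% of r
in_range <- function(x, r) x >= r - 5 - 0.2 * r & x <= r + 5 + 0.2 * r

r1 <- Data$R1; r2 <- Data$R2
rank_dif <- Data$Rank_Dif; data_dif <- Data$Data_Dif

Data$Rank_Adv <- vapply(seq_len(nrow(Data)), function(i) {
  # Same three filters as the loop above, combined into one logical mask
  keep <- (in_range(r1, r1[i]) | in_range(r1, r2[i])) &
          (in_range(r2, r1[i]) | in_range(r2, r2[i])) &
          rank_dif >= rank_dif[i] - 5
  mean(data_dif[keep])
}, numeric(1))
```

The logic is unchanged; only the per-row overhead differs, which is what matters at 57,000 iterations.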

Sliding window means with dplyr & zoo

I have a data frame containing per-base coverage along a genome. A much smaller example version is below:
> head(per_base_cov)
contig_id position coverage
1 contig_1 1 40
2 contig_1 2 33
3 contig_1 3 40
4 contig_1 4 32
5 contig_1 5 36
6 contig_1 6 30
7 contig_1 7 40
8 contig_1 8 38
9 contig_1 9 36
10 contig_1 10 40
11 contig_2 11 38
12 contig_2 12 39
13 contig_2 13 34
14 contig_2 14 39
15 contig_2 15 39
16 contig_2 16 32
17 contig_2 17 30
18 contig_2 18 37
19 contig_2 19 33
20 contig_2 20 35
I would like to calculate sliding window means for each contig, every 4 positions and overlapping by 2 positions. I've tried the following using dplyr and zoo:
library(dplyr)
library(zoo)

per_base_cov %>%
  group_by(contig_id) %>%
  mutate(cov.win.mean = rollapply(coverage, 4, mean, by = 2))
But I get the error message:
Error: Problem with `mutate()` input `cov.win.mean`.
x Input `cov.win.mean` can't be recycled to size 10.
ℹ Input `cov.win.mean` is `rollapply(coverage, 4, mean, by = 2)`.
ℹ Input `cov.win.mean` must be size 10 or 1, not 4.
ℹ The error occurred in group 1: contig_id = "contig_1".
Does anyone know how I could solve this? I would like an output that looks something like the following:
contig_id mean_coverage
1 contig_1 36.25
2 contig_1 34.50
3 contig_1 36.00
4 contig_1 38.50
5 contig_2 37.5
6 contig_2 36
7 contig_2 34.5
8 contig_2 33.75
Many thanks in advance.
I managed to find a solution with the help of Ronak below:
win_means <- per_base_cov %>%
  group_by(contig_id) %>%
  mutate(cov.win.mean = rollapply(coverage, 4, mean, by = 2, fill = NA))
win_means_complete <- win_means[complete.cases(win_means), ]
win_means_final <- win_means_complete[, c(1, 2, 4)]
win_means_final <- as.data.frame(win_means_final)
head(win_means_final)
contig_id position cov.win.mean
1 contig_1 2 36.25
2 contig_1 4 34.50
3 contig_1 6 36.00
4 contig_1 8 38.50
5 contig_2 12 37.50
6 contig_2 14 36.00
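The summary-style output originally asked for can also come out of a single pipeline: unlike mutate(), summarise() accepts a result shorter than its group, which sidesteps the recycling error entirely (a sketch; dplyr >= 1.0 permits multi-row summaries, with a deprecation warning in newer versions that suggests reframe()):

```r
library(dplyr)
library(zoo)

# Small stand-in for per_base_cov, using the coverage values shown above
per_base_cov <- data.frame(
  contig_id = rep(c("contig_1", "contig_2"), each = 10),
  position  = 1:20,
  coverage  = c(40, 33, 40, 32, 36, 30, 40, 38, 36, 40,
                38, 39, 34, 39, 39, 32, 30, 37, 33, 35)
)

# Width-4 windows advancing by 2 positions, computed per contig
win_means <- per_base_cov %>%
  group_by(contig_id) %>%
  summarise(mean_coverage = rollapply(coverage, 4, mean, by = 2),
            .groups = "drop")
```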

How to Calculate Industry Medians with Own Firm Excluded

I need to create a new column with the median ETR variable within a certain industry (SIC) for a sample of firms.
However, I need to exclude the own firm before calculating the industry (SIC) median for ETR.
Does anyone have any suggestions on how I could accomplish this?
Any help would be appreciated.
Thank you!
Sample Data:
Firm SIC ETR
1 20 10
2 20 15
3 20 20
4 20 25
5 20 30
6 21 50
7 21 55
8 21 60
9 21 65
10 21 70
Should Become:
Firm SIC ETR ETR_Median
1 20 10 22.5
2 20 15 22.5
3 20 20 20
4 20 25 17.5
5 20 30 17.5
6 21 50 62.5
7 21 55 62.5
8 21 60 60
9 21 65 57.5
10 21 70 57.5
So firm #4, for example, has an industry (SIC) median of 17.5 when only considering the other firms in the same industry (SIC).
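The arithmetic for firm 4 can be checked directly: dropping its own ETR (25) from the SIC 20 group leaves four values whose median is 17.5.

```r
# ETR values in SIC 20, in firm order (from the sample data above)
etr_sic20 <- c(10, 15, 20, 25, 30)

# Exclude firm 4 (the 4th element) before taking the median
median(etr_sic20[-4])  # median of 10, 15, 20, 30 -> 17.5
```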
Consider splitting by SIC groups and running across each group's Firm values, excluding each firm from its own median calculation. Specifically, using:
by (for grouping into subset data frames)
sapply (to iterate across Firm values and call median)
unlist (to convert the list to a vector for column binding)
Altogether:
df$ETR_median <- unlist(by(df, df$SIC, function(sub)
sapply(sub$Firm, function(f) median(sub$ETR[sub$Firm != f]))
))
df
# Firm SIC ETR ETR_median
# 1 1 20 10 22.5
# 2 2 20 15 22.5
# 3 3 20 20 20.0
# 4 4 20 25 17.5
# 5 5 20 30 17.5
# 6 6 21 50 62.5
# 7 7 21 55 62.5
# 8 8 21 60 60.0
# 9 9 21 65 57.5
# 10 10 21 70 57.5
You could create a function that excludes the current observation before conducting the median calculation:
median_excl <- function(x){
# pre-allocate our result vector:
med_excl <- vector(length = length(x))
# loop through our vector, excluding the current index and taking the median:
for(i in seq_along(x)){
x_excl <- x[-i]
med <- median(x_excl)
med_excl[i] <- med
}
return(med_excl)
}
Then simply apply it using dplyr or however you choose:
df %>% group_by(SIC) %>% mutate(ETR_Median = median_excl(ETR))
# Firm SIC ETR ETR_Median
# 1 1 20 10 22.5
# 2 2 20 15 22.5
# 3 3 20 20 20.0
# 4 4 20 25 17.5
# 5 5 20 30 17.5
# 6 6 21 50 62.5
# 7 7 21 55 62.5
# 8 8 21 60 60.0
# 9 9 21 65 57.5
# 10 10 21 70 57.5

How can I create a letter summary for these Least Significant Difference (LSD) results in R?

I am trying to compare the tensile strength of different weights of material in R. The tensile data is as follows:
tensile <- read.table(text=" Weight Strength Replicate
1 15 7 1
2 15 7 2
3 15 15 3
4 15 11 4
5 15 9 5
6 20 12 1
7 20 17 2
8 20 12 3
9 20 18 4
10 20 18 5
11 25 14 1
12 25 18 2
13 25 18 3
14 25 19 4
15 25 19 5
16 30 19 1
17 30 25 2
18 30 22 3
19 30 19 4
20 30 23 5
21 35 7 1
22 35 10 2
23 35 11 3
24 35 15 4
25 35 11 5", header=TRUE)
The variable Weight should be regarded as a factor (explanatory/independent variable) for the purpose of this analysis:
tensile$Weight <- factor(tensile$Weight)
I first fitted a one-way ANOVA model to my data:
tensile.aov <- aov(Strength ~ Weight, data = tensile)
According to the ANOVA, there appears to be a difference among the weights with respect to the response (strength). So I then decided to do pairwise comparisons using the LSD (Least Significant Difference):
LSD.aov(tensile.aov)
However, this LSD function was provided through a separate file, so I'm unfortunately unable to share the code here.
I calculated the LSD for my data and got a table of pairwise comparisons (not reproduced here).
Note that, according to the raw p-values, the pairwise comparisons between the 35 and 15 and between the 25 and 20 weights are the only ones that are not significantly different from each other at the alpha = 0.05 significance level; the other pairwise comparisons are all significantly different. I want to create a letter summary to illustrate this, where groups have the same letter only if they are not significantly different from each other, and groups which do not share a letter are significantly different from each other.
How can I go about creating such a table in R?
I'm also totally open to a 'manual' solution. By this, I mean manually creating a table using vectors and such. I'm new to R, so I don't have a good grasp on even the most basic aspects.
The multcompView package can turn p-values into letters, but in this case, the emmeans package can do both the comparison and the letters.
library(emmeans)
em <- emmeans(tensile.aov, ~Weight)
summary(pairs(em, adjust="none"), infer=TRUE)
#> contrast estimate SE df lower.CL upper.CL t.ratio p.value
#> 15 - 20 -5.6 1.79555 20 -9.3454518 -1.8545482 -3.119 0.0054
#> 15 - 25 -7.8 1.79555 20 -11.5454518 -4.0545482 -4.344 0.0003
#> 15 - 30 -11.8 1.79555 20 -15.5454518 -8.0545482 -6.572 <.0001
#> 15 - 35 -1.0 1.79555 20 -4.7454518 2.7454518 -0.557 0.5838
#> 20 - 25 -2.2 1.79555 20 -5.9454518 1.5454518 -1.225 0.2347
#> 20 - 30 -6.2 1.79555 20 -9.9454518 -2.4545482 -3.453 0.0025
#> 20 - 35 4.6 1.79555 20 0.8545482 8.3454518 2.562 0.0186
#> 25 - 30 -4.0 1.79555 20 -7.7454518 -0.2545482 -2.228 0.0375
#> 25 - 35 6.8 1.79555 20 3.0545482 10.5454518 3.787 0.0012
#> 30 - 35 10.8 1.79555 20 7.0545482 14.5454518 6.015 <.0001
#>
#> Confidence level used: 0.95
cld(em, adjust="none")
#> Weight emmean SE df lower.CL upper.CL .group
#> 15 9.8 1.269646 20 7.151566 12.44843 1
#> 35 10.8 1.269646 20 8.151566 13.44843 1
#> 20 15.4 1.269646 20 12.751566 18.04843 2
#> 25 17.6 1.269646 20 14.951566 20.24843 2
#> 30 21.6 1.269646 20 18.951566 24.24843 3
#>
#> Confidence level used: 0.95
#> significance level used: alpha = 0.05
I managed to do it as follows:
Weight = c(15, 20, 25, 30, 35)
mean = c(9.8, 15.4, 17.6, 21.6, 10.8)
letters = c("a", "b", "b", "c", "a")
LSDletterSummary <- data.frame(Weight, mean, letters)
LSDletterSummary
If anyone has a better way to go about it, feel free to share.
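Alternatively, the multcompView package can derive the letters straight from the unadjusted pairwise p-values reported above (a sketch; pair names follow multcompLetters()'s "a-b" convention, and the "<.0001" entries are approximated as 0.0001, which doesn't change any decision at alpha = 0.05):

```r
library(multcompView)

# Unadjusted pairwise p-values from the emmeans output above
pvals <- c("15-20" = 0.0054, "15-25" = 0.0003, "15-30" = 0.0001,
           "15-35" = 0.5838, "20-25" = 0.2347, "20-30" = 0.0025,
           "20-35" = 0.0186, "25-30" = 0.0375, "25-35" = 0.0012,
           "30-35" = 0.0001)

# Weights that are not significantly different share a letter
multcompLetters(pvals)$Letters
```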

R: Deleting Columns When Trying To Replace

I'm trying to replace NA values in a column in a data frame with the value from another column in the same row. Instead of replacing the values the entire column seems to be deleted.
fDF is a data frame where some values are NA. When column 1 has an NA value I want to replace it with the value in column 2.
fDF[columns[1]] = if(is.na(fDF[columns[1]]) == TRUE &
is.na(fDF[columns[2]]) == FALSE) fDF[columns[2]]
I'm not sure what I'm doing wrong here.
Thanks
You can adapt the following code to your data:
> ddf
xx yy zz
1 1 10 11.88
2 2 9 NA
3 3 11 12.20
4 4 9 12.48
5 5 7 NA
6 6 6 13.28
7 7 9 13.80
8 8 8 14.40
9 9 5 NA
10 10 4 15.84
11 11 6 16.68
12 12 6 17.60
13 13 5 18.60
14 14 4 19.68
15 15 6 NA
16 16 8 22.08
17 17 4 23.40
18 18 6 24.80
19 19 8 NA
20 20 11 27.84
21 21 8 29.48
22 22 10 31.20
23 23 9 33.00
>
>
> idx = is.na(ddf$zz)
> idx
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[22] FALSE FALSE
>
> ddf$zz[idx]=ddf$yy[idx]
>
> ddf
xx yy zz
1 1 10 11.88
2 2 9 9.00
3 3 11 12.20
4 4 9 12.48
5 5 7 7.00
6 6 6 13.28
7 7 9 13.80
8 8 8 14.40
9 9 5 5.00
10 10 4 15.84
11 11 6 16.68
12 12 6 17.60
13 13 5 18.60
14 14 4 19.68
15 15 6 6.00
16 16 8 22.08
17 17 4 23.40
18 18 6 24.80
19 19 8 8.00
20 20 11 27.84
21 21 8 29.48
22 22 10 31.20
23 23 9 33.00
>
You want an ifelse() expression:
fDF[columns[1]] <- ifelse(is.na(fDF[columns[1]]), fDF[columns[2]], fDF[columns[1]])
not trying to assign the result of an if statement to a vector, which doesn't make sense here: in R, if is not vectorized, so it only handles a scalar condition and cannot perform element-wise replacement.
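Another common idiom for this pattern is dplyr's coalesce(), which returns the first non-NA value across its arguments, element-wise (a sketch with hypothetical column names standing in for columns[1] and columns[2]):

```r
library(dplyr)

# Hypothetical frame: column a has gaps to fill from column b
fDF <- data.frame(a = c(1, NA, 3), b = c(9, 8, 7))

# Keep a where present, fall back to b where a is NA
fDF$a <- coalesce(fDF$a, fDF$b)
fDF$a  # 1 8 3
```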
