Creating summary statistic table from subsets of data in R

I have a table that looks something like this:
Time Carbon OD
0 Sucrose 1.13
0 Citric acid 1.54
24 Histidine 2.1
24 Glutamine 1.7
48 Maleic acid 2.1
48 Furm acid 3.1
72 Tryptophan 2.3
72 Serine 1.2
72 etc etc
It has four time points, and 9 different carbons that can be split into three groups (organic acids, sugars, amino acids).
EDIT: if it's helpful, the OD was measured for each carbon at each time point 8 times. Previously I used this code to create summary statistics for the entire thing:
summary <- aggregate(dataset2$OD,
                     by = list(Time = dataset2$Time, Carbon = dataset2$Carbon),
                     FUN = function(x) c(mean = mean(x), sd = sd(x),
                                         n = length(x)))
summary <- do.call(data.frame, summary)
summary$se <- summary$x.sd / sqrt(summary$x.n)
But now I would like to generate the same summary statistics for the means of each of the three groups, if possible, so I would get something like this:
Time Group OD SD n SE
0 Group 1
24 Group 1
48 Group 1
72 Group 1
0 Group 2
I'm not quite sure how to specify this in my code?

Using dplyr (assuming a Group column already exists; one way to create it is sketched below):
dataset2 %>%
  group_by(Time, Group) %>%
  summarise(SD = sd(OD),       # compute SD first, before OD is replaced by its mean
            OD = mean(OD),
            n = n(),
            SE = SD / sqrt(n))
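If the data does not yet have a Group column, it can be added with case_when(), mapping each carbon to its group. This is only a sketch: the question does not say which carbon goes in which group, so the assignments below are assumptions based on the names shown and should be adjusted to your nine carbons (spellings must match the data exactly):
library(dplyr)
dataset2 <- dataset2 %>%
  mutate(Group = case_when(
    Carbon %in% c("Sucrose") ~ "Sugars",
    Carbon %in% c("Histidine", "Glutamine", "Tryptophan", "Serine") ~ "Amino acids",
    Carbon %in% c("Citric acid", "Maleic acid", "Furm acid") ~ "Organic acids",
    TRUE ~ NA_character_     # anything not listed above
  ))
With the Group column in place, the group_by(Time, Group) summary above returns one row per time point and group.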

Related

Frequency across multiple variables

My data have more than 50 variables with the same set of values distributed across them. I need to tabulate each value (more than 1000 values in total) with its frequency across the variables.
I need a frequency table for each value across all variables, for example: F447 5, A257 4, G1229 5, C245 2.
You can use table to get the frequency of each value across the variables, then convert to a data frame, move the names into a column, and finally arrange by frequency. If you want descending order instead, use arrange(desc(frequency)).
library(dplyr)
data.frame(frequency = unclass(table(unlist(df)))) %>%
tibble::rownames_to_column("variable") %>%
arrange(frequency)
Output
variable frequency
1 C245 1
2 A257 2
3 F447 3
4 G1229 3
Or with base R, you could do:
results <- data.frame(frequency = unclass(table(unlist(df))))
results$variable <- row.names(results)
row.names(results) <- NULL
results <- results[order(results$frequency),c(2,1)]
If you would like an additional summary and a visualization of the frequencies, epiDisplay is another good option.
library(epiDisplay)
tab1(unlist(df), sort.group = "decreasing", cum.percent = TRUE)
Output
Frequency Percent Cum. percent
G1229 3 33.3 33.3
F447 3 33.3 66.7
A257 2 22.2 88.9
C245 1 11.1 100.0
Total 9 100.0 100.0
Data
df <- structure(list(Var1 = c("F447", "A257", "G1229"), Var2 = c("G1229",
"F447", "A257"), Var3 = c("C245", "F447", "G1229")), class = "data.frame", row.names = c(NA,
-3L))

Looking for advice to analyse this particular objective and data in R

Thank you in advance for any assistance.
Aim: I have a 5-day food intake survey dataset that I am trying to analyse in R. I am interested in calculating the mean, se, min and max intake for the weight of a specific food consumed per day.
I could do this more easily in Excel, but due to the scale of the data I need R to complete this.
Example question: What is a person's daily intake (g) of lettuce? [mean, standard deviation, standard error, min, and max]
Example extraction dataset: please note the actual dataset includes a number of foods and a large no. of participants.
participant  day  code  foodname  weight
132          1    62    lettuce   53
84           3    62    lettuce   23
132          3    62    lettuce   32
153          4    62    lettuce   26
142          2    62    lettuce   23
123          3    62    lettuce   23
131          3    62    lettuce   30
153          5    62    lettuce   16
At present:
# import dataset
library(foreign)   # for read.spss()
library(dplyr)     # for filter()
foodsurvey <- read.spss("foodsurvey.sav", to.data.frame = T, use.value.labels = T)
summary(foodsurvey)
# keep my relevant columns
myvariables = subset(foodsurvey, select = c(1, 2, 3, 4, 5))
# rename columns
colnames(myvariables) <- c('participant', 'day', 'code', 'foodname', 'foodweight')
# create values
day <- myvariables$day
participant <- myvariables$participant
foodcode <- myvariables$code
foodname <- myvariables$foodname
foodweight <- myvariables$foodweight
# extract lettuce by ID code to be analysed
lettuce <- filter(myvariables, foodcode == "62")
dim(lettuce)
str(lettuce)
# errors arise attempting to analyse consumption (weight) of lettuce per day using ops.factor function
# to analyse the outputs
summary(lettuce/days)
quantile(lettuce/foodweight)
max(lettuce)
min(lettuce)
median(lettuce)
mean(lettuce)
This should give you the mean, standard deviation, standard error, min, and max food weight for each participant and food-type combination (here, lettuce) across the days:
library(dplyr)
myvariables %>%
  filter(foodname == "lettuce") %>%
  group_by(participant) %>%
  summarise(mean = mean(foodweight, na.rm = T),
            max_val = max(foodweight),
            min_val = min(foodweight),
            sd = sd(foodweight, na.rm = T),
            se = sqrt(var(foodweight, na.rm = T) / length(foodweight)))
Here's a method that groups by participant and food itself to give summaries across everything.
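The code below assumes the example rows from the question are already in a data frame called dat; a minimal sketch of that data (column types guessed) is:
dat <- data.frame(
  participant = c(132, 84, 132, 153, 142, 123, 131, 153),
  day         = c(1, 3, 3, 4, 2, 3, 3, 5),
  code        = 62,
  foodname    = "lettuce",
  weight      = c(53, 23, 32, 26, 23, 23, 30, 16)
)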
dplyr
library(dplyr)
dat %>%
group_by(participant, foodname) %>%
summarize(
across(weight, list(min = min, mean = mean, max = max,
sigma = sd, se = ~ sd(.) / sqrt(n())))
) %>%
ungroup()
# # A tibble: 6 x 7
# participant foodname weight_min weight_mean weight_max weight_sigma weight_se
# <int> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 84 lettuce 23 23 23 NA NA
# 2 123 lettuce 23 23 23 NA NA
# 3 131 lettuce 30 30 30 NA NA
# 4 132 lettuce 32 42.5 53 14.8 10.5
# 5 142 lettuce 23 23 23 NA NA
# 6 153 lettuce 16 21 26 7.07 5
Once you have those summaries, you can easily filter for one participant, a specific food, etc. If you need to also group by code, just add it to the group_by.
The premise of using summarise(across(...)) is that the first argument includes whichever variables you want to summarize (just weight here, but you can add others if it makes sense), and the second argument is a list of functions in various forms. It accepts just a function symbol (e.g., mean), a tilde-function facilitated by rlang (e.g., ~ sd(.) / sqrt(n()), where n() is a dplyr-special function), or a regular anonymous function (e.g., function(z) sd(z) / sqrt(length(z)); a sketch of this form is below). The "name" on the LHS of each listed function is used in the resulting column name.
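For instance, the se summary written with a plain anonymous function instead of the tilde form would look like this (an equivalent sketch, producing the same weight_se column):
dat %>%
  group_by(participant, foodname) %>%
  summarize(across(weight, list(se = function(z) sd(z) / sqrt(length(z))))) %>%
  ungroup()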

Integrate functions for depth integrated species abundance

Hi,
I am trying to calculate the organism quantity per class over the entire depth range (e.g., from 10 m to 90 m). To do that I have the abundance at certain depths (e.g., 10, 30 and 90 m) and I use an integrate function which calculates:
the average abundance between each pair of depths, multiplied by the difference between those depths. The values are summed over the entire water column to get a total abundance for the water column.
See an example (only a tiny part of a bigger data set with several locations and years, and more classes and depths):
View(df)
Class Depth organismQuantity
1 Ciliates 10 1608.89
2 Ciliates 30 2125.09
3 Ciliates 90 1184.92
4 Dinophyceae 10 0.00
5 Dinoflagellates 30 28719.60
6 Dinoflagellates 90 4445.26
integrate = function(x) {
averages = (x$organismQuantity[1:length(x)-1] + x$organismQuantity[2:length(x)]) / 2
sum(averages * diff(x$Depth))
}
library(plyr)
result = ddply(df, .(Class), integrate)
print(result)
But I got this result, with a warning message and NA values for some classes:
Class V1
1 Ciliates 136640.1
2 Dinoflagellates NA
3 Dinophyceae 0.0
Warning messages:
1: In averages * diff(x$Depth) :
longer object length is not a multiple of shorter object length
I don't understand why Dinoflagellates got an NA value... It is the same for several other classes in my complete data set (for some classes the integration works, for others I get the warning message).
thank you for the help!!
Cheers,
Lucie
Here is a way using function trapz from package caTools, adapted to the problem.
#
# library(caTools)
# Author(s)
# Jarek Tuszynski
#
# Original, adapted
trapz <- function(DF, x, y){
x <- DF[[x]]
y <- DF[[y]]
idx <- seq_along(x)[-1]
as.double( (x[idx] - x[idx-1]) %*% (y[idx] + y[idx-1]) ) / 2
}
library(plyr)
ddply(df, .(Class), trapz, x = "Depth", y = "organismQuantity")
# Class V1
#1 Ciliates 136640.1
#2 Dinoflagellates 994945.8
#3 Dinophyceae NA
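If the caTools package itself is available, its trapz(x, y) can be applied per class in the same way (a quick sketch of the same idea as the adapted function above):
library(caTools)
library(plyr)
ddply(df, .(Class), function(d) trapz(d$Depth, d$organismQuantity))
As a sanity check, the Ciliates value works out by hand to (1608.89 + 2125.09)/2 * 20 + (2125.09 + 1184.92)/2 * 60 = 37339.8 + 99300.3 = 136640.1, matching the result above.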
Data
df <- read.table(text = "
Class Depth organismQuantity
1 Ciliates 10 1608.89
2 Ciliates 30 2125.09
3 Ciliates 90 1184.92
4 Dinophyceae 10 0.00
5 Dinoflagellates 30 28719.60
6 Dinoflagellates 90 4445.26
", header = TRUE)

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters might work. However, I cannot even think of a starting point, because changing numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pass it the res component of the dunnTest() output and create a table that gives the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
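To attach the letters to the summary table from the question, the cldList() result (which has Group and Letter columns, with the group labels as text) can be joined back onto ozone_summary. A sketch, assuming the Group values come out as the month numbers (check cld$Group first):
library(dplyr)
cld <- rcompanion::cldList(P.adj ~ Comparison, data = Result)
ozone_summary %>%
  mutate(Group = as.character(Month)) %>%
  left_join(cld, by = "Group") %>%
  select(Month, Mean, Letter)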

For loop that references prior rows

I'm interested in filtering out data based on a set of rules.
I have a dataset that contains play data for all games in which a team had a .8 win probability at some point. What I'd like to do is find that point in which the win probability reached .8 and remove every play thereafter until the next game data begins. The dataset contains numerous games so once a game ends data from a new one begins in which the win probability goes back to around .5.
Here are the relevant columns and each row is a play in the game:
game_id = unique num for each game
team = team that will eventually get an .8 win prob
play_id = num that is increased (but not necessarily in sequential order for some reason) after each play
win_per = num showing what the teams win percentage chance at the start of that recorded play was
Example df
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
So in this small example, I have recorded 5 plays of two teams (a and b) who both obtained a win_prob of at least .8 at some point in the game. In both example cases, I would want to have all the plays removed AFTER they attained this .8 mark regardless of whether the win_prob kept rising or fell back below .8.
So team a would have the final two rows of data removed (win_prob == .81 and .85) and team b would have the final row removed (win_prob = .77)
I'm imagining running a for loop that checks whether the team in a row is the same team as in the prior row, and if so, finds the win_per >= .8 with the lowest play_id (as this would be the first time the team reached .8), and then somehow removes the rest of the rows following that match UNTIL the team != the prior row's team.
Of course, you might know a better way as well. Thank you so much for helping me out!
No need to use a loop; that whole selection can be performed in a single pipeline using the dplyr package:
df = data.frame(game_id = c(122,122,122,122,122,144,144,144,144,144),
team = c("a","a","a","a","a", "b","b","b","b","b"),
play_id = c(1,5,22,25,34, 45,47,55,58,66),
win_per = c(.5,.6,.86,.81,.85,.54,.43,.47,.81,.77))
library(dplyr)
#group by team
#find the first row that exceeds .80 and add temp column
#save the row from 1 to the row that exceeds 0.80
#remove temp column
df %>% group_by(team, game_id) %>%
mutate(g80= min(which(win_per>=0.80))) %>%
slice(1:g80) %>%
select(-g80)
# A tibble: 7 x 4
# Groups: team [2]
game_id team play_id win_per
<dbl> <fct> <dbl> <dbl>
1 122 a 1 0.5
2 122 a 5 0.6
3 122 a 22 0.86
4 144 b 45 0.54
5 144 b 47 0.43
6 144 b 55 0.47
7 144 b 58 0.81
Here is a base R way using cumsum in ave
subset(df, ave(win_per > 0.8, game_id, FUN = function(x) c(0, cumsum(x)[-length(x)])) == 0)
# game_id team play_id win_per
#1 122 a 1 0.50
#2 122 a 5 0.60
#3 122 a 22 0.86
#6 144 b 45 0.54
#7 144 b 47 0.43
#8 144 b 55 0.47
#9 144 b 58 0.81
and using a similar concept in dplyr
library(dplyr)
df %>% group_by(game_id) %>% filter(lag(cumsum(win_per > 0.8) == 0, default = TRUE))

Resources