Compare effect of environment between two data frames in R

I have two dataframes as follows:
a <- structure(list(Bacteria_A = c(12, 23, 45, 32, 34, 0), Bacteria_B = c(23,
12, 33, 44, 55, 3), Bacteria_C = c(25, 10, 50, 38, 3, 34), Group = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "soil")), class = "data.frame", row.names = c("Sample_1",
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))
b <- structure(list(Bacteria_A = c(14, 10, 40, 40, 37, 3), Bacteria_B = c(25,
14, 32, 23, 45, 35), Bacteria_C = c(12, 34, 45, 22, 7, 23), Group = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "water")), class = "data.frame", row.names = c("Sample_1",
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))
> a
Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 12 23 25 soil
Sample_2 23 12 10 soil
Sample_3 45 33 50 soil
Sample_4 32 44 38 soil
Sample_5 34 55 3 soil
Sample_6 0 3 34 soil
> b
Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 14 25 12 water
Sample_2 10 14 34 water
Sample_3 40 32 45 water
Sample_4 40 23 22 water
Sample_5 37 45 7 water
Sample_6 3 35 23 water
I want to compare each bacterium between soil and water across samples.
For example, for Bacteria_A I want to know whether there is a difference between soil and water, and the same for Bacteria_B and Bacteria_C (I have 900 bacteria). I thought of a t-test but am not sure how to do it with two data frames.
I forgot to mention that not all bacteria are present in both data frames, so it can happen that a bacterium is missing from one of the environments. Bacteria found in both environments have exactly the same name.
The original data frame is 160 samples by 500 bacteria, and the data are not normally distributed.
Thanks for your help.

First of all, I want to mention that there are statistical methods for this comparison which are more adequate than a t-test. They take into account the distribution the counts come from (usually Negative Binomial). You can check out the DESeq2 package, for instance.
As to your technical issue, I would do:
for (bac in setdiff(intersect(colnames(a), colnames(b)), "Group")) {
  print(t.test(a[, bac], b[, bac]))
}
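If you do go the DESeq2 route, here is a rough sketch of how the setup might look (this is only an illustration, assuming the values are raw counts rather than relative abundances, and keeping only the bacteria present in both environments):
library(DESeq2)

# bacteria shared by both environments (names match across a and b)
common <- setdiff(intersect(colnames(a), colnames(b)), "Group")

# DESeq2 expects integer counts with features (bacteria) in rows and samples
# in columns, plus a colData table describing each sample
counts <- t(as.matrix(rbind(a[common], b[common])))
colnames(counts) <- paste0("sample_", seq_len(ncol(counts)))   # unique sample ids
coldata <- data.frame(Group = factor(rep(c("soil", "water"),
                                         times = c(nrow(a), nrow(b)))),
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(countData = round(counts),
                              colData   = coldata,
                              design    = ~ Group)
dds <- DESeq(dds)
results(dds)   # log2 fold changes and adjusted p-values per bacterium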

Your values do not seem to follow a normal or near-normal distribution, so you should stay away from the t-test. If you are unsure which distribution you are dealing with, you can use a wilcox.test instead.
You can stick your two data frames together quite easily, then convert them to long format before running the appropriate tests:
library(tidyr)
library(dplyr)

bind_rows(a, b) %>%
  pivot_longer(c(Bacteria_A, Bacteria_B, Bacteria_C)) %>%
  group_by(name) %>%
  summarise(mean_soil  = mean(value[Group == "soil"]),
            mean_water = mean(value[Group == "water"]),
            pvalue     = wilcox.test(value ~ Group)$p.value)
Which gives you
#> # A tibble: 3 x 4
#>   name       mean_soil mean_water pvalue
#>   <chr>          <dbl>      <dbl>  <dbl>
#> 1 Bacteria_A      24.3       24    0.936
#> 2 Bacteria_B      28.3       29    0.873
#> 3 Bacteria_C      26.7       23.8  0.748
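Since there are hundreds of bacteria you will be running hundreds of tests, so it is usually worth adjusting the p-values for multiple testing. A minimal extension of the pipeline above (a sketch; it uses pivot_longer(-Group) so you do not have to list every bacterium, and assumes each bacterium tested has values in both environments):
bind_rows(a, b) %>%
  pivot_longer(-Group) %>%
  group_by(name) %>%
  summarise(pvalue = wilcox.test(value ~ Group)$p.value) %>%
  mutate(padj = p.adjust(pvalue, method = "BH"))   # Benjamini-Hochberg adjustment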

This finds the bacteria names that exist in both data frames and then runs a t.test between the columns with the same name, giving a list L whose names are the bacteria names. The last line uses broom::tidy to convert L to a data frame. You can replace t.test with wilcox.test if you prefer a non-parametric test. (Of course this does not take into account the problems of performing multiple hypothesis tests; it just does the calculations.)
Name <- intersect(names(Filter(is.numeric, a)), names(Filter(is.numeric, b)))
L <- Map(t.test, a[Name], b[Name])
library(broom)
cbind(Name, do.call("rbind", lapply(L, tidy)))
The last line gives the following data frame:
                 Name   estimate estimate1 estimate2   statistic   p.value
Bacteria_A Bacteria_A  0.3333333  24.33333  24.00000  0.03485781 0.9728799
Bacteria_B Bacteria_B -0.6666667  28.33333  29.00000 -0.07312724 0.9435532
Bacteria_C Bacteria_C  2.8333333  26.66667  23.83333  0.30754940 0.7650662
           parameter   conf.low conf.high                  method alternative
Bacteria_A  9.988603  -20.97689  21.64356 Welch Two Sample t-test   two.sided
Bacteria_B  7.765869  -21.80026  20.46692 Welch Two Sample t-test   two.sided
Bacteria_C  9.492873  -17.84326  23.50993 Welch Two Sample t-test   two.sided
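For example, the non-parametric swap mentioned above is a drop-in replacement, optionally followed by a multiple-testing adjustment (L2 and res are just illustrative names):
L2  <- Map(wilcox.test, a[Name], b[Name])
res <- cbind(Name, do.call("rbind", lapply(L2, tidy)))
res$p.adjusted <- p.adjust(res$p.value, method = "BH")   # adjust across all bacteria
res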
Note
LinesA <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 12 23 25 soil
Sample_2 23 12 10 soil
Sample_3 45 33 50 soil
Sample_4 32 44 38 soil
Sample_5 34 55 3 soil
Sample_6 0 3 34 soil"
LinesB <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 14 25 12 water
Sample_2 10 14 34 water
Sample_3 40 32 45 water
Sample_4 40 23 22 water
Sample_5 37 45 7 water
Sample_6 3 35 23 water"
a <- read.table(text = LinesA, as.is = TRUE)
b <- read.table(text = LinesB, as.is = TRUE)

Related

R: regression line interrupted in ggplot while a continuous line is expected

I created a multilevel regression model with the nlme package and now I would like to illustrate the regression line obtained for some patients (unfortunately I cannot use geom_smooth with nlme).
Using the model, I obtained the following predicted values (predicted_value) at different times (date_day), shown here for two patients (ID 1 and ID 2).
df <- data.frame(ID = c(rep(1, 10), rep(2, 10)),
                 date_day = c(7:16, 7:16),
                 predicted_value = c(33, 33, 33, 33, 33, NA, 34, NA, NA, NA,
                                     55, NA, NA, 53.3, NA, NA, 51.6, NA, 50.5, NA))
ID date_day predicted_value
1 1 7 33.0
2 1 8 33.0
3 1 9 33.0
4 1 10 33.0
5 1 11 33.0
6 1 12 NA
7 1 13 34.0
8 1 14 NA
9 1 15 NA
10 1 16 NA
11 2 7 55.0
12 2 8 NA
13 2 9 NA
14 2 10 53.3
15 2 11 NA
16 2 12 NA
17 2 13 51.6
18 2 14 NA
19 2 15 50.5
20 2 16 NA
Now I would like to draw the regression line for each of these patients, so I tried the following:
ggplot(df %>% filter(ID %in% c("1", "2"))) +
  aes(x = date_day, y = predicted_value) +
  geom_point(shape = "circle", size = 1.5, colour = "#112446", na.rm = TRUE) +
  geom_line(aes(y = predicted_value), na.rm = TRUE, size = 1) +
  theme_minimal() +
  facet_wrap(vars(ID)) +
  scale_x_continuous(name = "days", limits = c(7, 16)) +
  scale_y_continuous(name = "predicted values", limits = c(0, 60))
But I end up with the following plots: for patient 1 the line is interrupted, and for patient 2 there is no line at all. How can I fix that?
Thanks a lot.
Thank you @BenBolker; indeed, changing the first line
ggplot(df %>% filter(ID %in% c("1", "2")))
to
ggplot(na.omit(df) %>% filter(ID %in% c("1", "2")))
solved the problem.
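An equivalent approach, if you prefer not to drop rows with an NA in any column via na.omit(), is to filter only on the plotted column before handing the data to ggplot (a sketch with the same aesthetics as above):
library(dplyr)
library(ggplot2)

df %>%
  filter(ID %in% c("1", "2"), !is.na(predicted_value)) %>%
  ggplot(aes(x = date_day, y = predicted_value)) +
  geom_point(shape = "circle", size = 1.5, colour = "#112446") +
  geom_line(size = 1) +
  theme_minimal() +
  facet_wrap(vars(ID)) +
  scale_x_continuous(name = "days", limits = c(7, 16)) +
  scale_y_continuous(name = "predicted values", limits = c(0, 60))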

Classify table based on value 'moving window' range and proportions?

I have a dataset of forest stands, each containing several tree layers of different age and volume.
I want to classify the stands as even- or uneven-aged, combining volume and age data. A forest is considered even-aged if more than 80% of the volume falls in age classes that are within 20 years of each other. I wonder how to implement the 'within 20 years apart' condition. I can easily calculate the sum of volume and its share for individual tree layers (strat), but how do I check how many years apart they are? Is it some sort of moving window?
Dummy example:
# investigate volume by age classes?
library(dplyr)

df <- data.frame(stand = c("id1", "id1", "id1", "id1",
                           "id2", "id2", "id2"),
                 strat = c(1, 2, 3, 4,
                           1, 2, 3),
                 v     = c(4, 10, 15, 20,
                           11, 15, 18),
                 age   = c(5, 10, 65, 80,
                           10, 15, 20))

# even-aged = more than 80% of the volume is allocated to layers within a 20-year range
df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v)) %>%
  mutate(V_share = v / V_tot * 100)
Expected outcome:
  stand strat     v   age V_tot V_share quality
  <fct> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 id1       1     4     5    49    8.16 uneven-aged
2 id1       2    10    10    49   20.4  uneven-aged
3 id1       3    15    65    49   30.6  uneven-aged
4 id1       4    20    80    49   40.8  uneven-aged  # because age classes 65 and 80, although less than 20 years apart, hold only ~70% of the total volume
5 id2       1    11    10    44   25    even-aged
6 id2       2    15    15    44   34.1  even-aged
7 id2       3    18    20    44   40.9  even-aged
Another tidyverse solution, implementing a moving-window sum of the volume shares:
library(tidyverse)
df <- structure(list(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"), strat = c(1, 2, 3, 4, 1, 2, 3), v = c(4, 10, 15, 20, 11, 15, 18), age = c(5, 10, 65, 80, 10, 15, 20), V_tot = c(49, 49, 49, 49, 44, 44, 44), V_share = c(8.16326530612245, 20.4081632653061, 30.6122448979592, 40.8163265306122, 25, 34.0909090909091, 40.9090909090909)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))
df %>%
  group_by(stand) %>%
  mutate(range20 = map_dbl(age, ~ sum(V_share[which(abs(age - .x) <= 20)])),
         quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))
#> # A tibble: 7 × 8
#> # Groups: stand [2]
#> stand strat v age V_tot V_share range20 quality
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 id1 1 4 5 49 8.16 28.6 uneven-aged
#> 2 id1 2 10 10 49 20.4 28.6 uneven-aged
#> 3 id1 3 15 65 49 30.6 71.4 uneven-aged
#> 4 id1 4 20 80 49 40.8 71.4 uneven-aged
#> 5 id2 1 11 10 44 25 100 even-aged
#> 6 id2 2 15 15 44 34.1 100 even-aged
#> 7 id2 3 18 20 44 40.9 100 even-aged
Created on 2021-09-08 by the reprex package (v2.0.1)
Interesting issue; I think I have a solution using the runner package:
library(runner)

df %>%
  group_by(stand) %>%
  mutate(
    V_tot = sum(v),
    V_share = v / V_tot * 100,
    test = sum_run(
      V_share,
      k = 20L,
      idx = age,
      na_rm = TRUE,
      na_pad = FALSE
    ),
    quality = if_else(any(test >= 80), "even-aged", "uneven-aged")
  ) %>%
  select(-test)
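For reference, the same windowed check can also be written in base R without extra packages. A sketch: for each age class, sum the volume share of all classes within 20 years of it, per stand, then flag the stand as even-aged if any window exceeds 80%:
res <- do.call(rbind, lapply(split(df, df$stand), function(d) {
  d$V_tot   <- sum(d$v)
  d$V_share <- d$v / d$V_tot * 100
  # share of volume within +/- 20 years of each age class
  d$range20 <- sapply(d$age, function(x) sum(d$V_share[abs(d$age - x) <= 20]))
  d$quality <- if (any(d$range20 > 80)) "even-aged" else "uneven-aged"
  d
}))
res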

How can I pick an element from a matrix depending on a set of conditions?

I have a dataframe containing n rows and m columns. Each row is an individual and each column is information on that individual.
df
id age income
1 18 12
2 24 24
3 36 12
4 18 24
. . .
. . .
. . .
I also have an r x c matrix with age buckets as rows and income buckets as columns; each element of the matrix is the % of people in that age-income bucket.
       income
age      12    24    36   ...
  18   0.15  0.12  0.11   ...
  24   0.12  0.6   0.2    ...
  36   0.02  0.16  0.16   ...
  ...
For each individual in the dataframe, I need to find the right element of the matrix given the age and income bucket of the individual.
The desired output should look like this
df2
id age income y
1 18 12 0.15
2 24 24 0.6
3 36 12 0.02
4 18 24 0.12
. . .
. . .
. . .
I tried with a series of ifs inside a loop (like in the example):
for (i in 1:nrow(df)) {
  workingset <- df[i, ]
  if (workingset$age == 18) {
    temp <- mat[1, ]   # "mat" stands for the age/income matrix above
    workingset$y <- ifelse(workingset$income < 12, temp[1],
                    ifelse(workingset$income < 24, temp[2], temp[3]))
  } else if (workingset$age == 24) {
    temp <- mat[2, ]
    workingset$y <- ifelse(workingset$income < 12, temp[1],
                    ifelse(workingset$income < 24, temp[2], temp[3]))
  } else if (...) {
    ...
  }
  if (i == 1) {
    df2 <- workingset
  } else {
    df2 <- rbind(df2, workingset)
  }
}
This approach works, but it takes too long. Is there a way to do this job efficiently?
Assuming your data looks exactly as shown, you could use dplyr and tidyr.
First convert your matrix (I name it my_mat) into a data.frame:
library(dplyr)
library(tidyr)

my_mat %>%
  as.data.frame() %>%
  mutate(age = rownames(.)) %>%
  pivot_longer(cols = -age, names_to = "income", values_to = "y") %>%
  mutate(across(where(is.character), as.numeric))
returns
# A tibble: 9 x 3
age income y
<dbl> <dbl> <dbl>
1 18 12 0.15
2 18 24 0.12
3 18 36 0.11
4 24 12 0.12
5 24 24 0.6
6 24 36 0.2
7 36 12 0.02
8 36 24 0.16
9 36 36 0.16
This can be left joined with your data.frame df, so in one go:
my_mat %>%
  as.data.frame() %>%
  mutate(age = rownames(.)) %>%
  pivot_longer(cols = -age, names_to = "income", values_to = "y") %>%
  mutate(across(where(is.character), as.numeric)) %>%
  left_join(df, ., by = c("age", "income"))
gives you
# A tibble: 4 x 4
id age income y
<dbl> <dbl> <dbl> <dbl>
1 1 18 12 0.15
2 2 24 24 0.6
3 3 36 12 0.02
4 4 18 24 0.12
Data
my_mat <- structure(c(0.15, 0.12, 0.02, 0.12, 0.6, 0.16, 0.11, 0.2, 0.16
), .Dim = c(3L, 3L), .Dimnames = list(c("18", "24", "36"), c("12",
"24", "36")))
df <- structure(list(id = c(1, 2, 3, 4), age = c(18, 24, 36, 18), income = c(12,
24, 12, 24)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), spec = structure(list(cols = list(
id = structure(list(), class = c("collector_double", "collector"
)), age = structure(list(), class = c("collector_double",
"collector")), income = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))

Display only rows for which the difference within a column is below 30

I keep trying, unsuccessfully, to filter data from an Excel file so that I keep only the rows where consecutive values in a column are less than 30 units apart. For example, in the following table:
Name    age  height  speed
Helen   12   1.20    40
Alan    14   1.40    75
Hector  15   1.25    80
Ana     11   1.02    81
Sophie  16   1.40    50
When the difference in the speed column between consecutive rows is below 30, it should give as a result:
Name    age  height  speed
Alan    14   1.40    75
Hector  15   1.25    80
Ana     11   1.02    81
Thank you!!!
If your data is like this:
x = structure(list(Name = structure(c(4L, 1L, 3L, 2L, 5L), .Label = c("Alan",
"Ana", "Hector", "Helen", "Sophie"), class = "factor"), age = c(12,
14, 15, 11, 16), height = c(1.2, 1.4, 1.25, 1.02, 1.4), speed = c(40L,
75L, 80L, 81L, 50L)), class = "data.frame", row.names = c(NA,
-5L))
Hope I got the numbers right:
Name age height speed
1 Helen 12 1.20 40
2 Alan 14 1.40 75
3 Hector 15 1.25 80
4 Ana 11 1.02 81
5 Sophie 16 1.40 50
Then do:
x[diff(x$speed)<30,]
Name age height speed
2 Alan 14 1.40 75
3 Hector 15 1.25 80
4 Ana 11 1.02 81
Next time you post here, it is useful to include some toy data, like below:
rm(list = ls())

#### Toy data ####
dfnames <- c("Name", "age", "height", "speed")
size    <- 20   # number of rows
name    <- LETTERS[1:size]
age     <- sample(20:26, size, replace = TRUE)
height  <- sample(160:180, size, replace = TRUE)
speed   <- sample(0:60, size, replace = TRUE)
df      <- cbind.data.frame(name, age, height, speed)
Solution:
for (i in 1:(nrow(df) - 1)) {
  df[i, "test"] <- (df[i + 1, "speed"] - df[i, "speed"]) < 30
}
df[nrow(df), "test"] <- "last_row"
df <- df[df[, "test"] != FALSE, ]
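For reference, a vectorised sketch of the same idea with dplyr: keep a row when the jump to the next row is below 30, and always keep the last row, mirroring the loop above:
library(dplyr)

df %>%
  mutate(test = lead(speed) - speed < 30) %>%   # difference to the following row
  filter(test | is.na(test)) %>%                # NA marks the last row; keep it
  select(-test)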

ggfortify autoplot confidence intervals level

I am trying to plot 95% confidence intervals using autoplot from ggfortify, but I am not able to achieve it. If I use the forecast package and forecast 3 weeks ahead with a 95% CI, it works fine. See below:
wt <- structure(list(DOC = c(3, 10, 17, 24, 31, 38, 45, 52, 59, 66,
73, 80, 87, 94, 101), AvgWeight = c(1, 1.66666666666667, 2.06666666666667,
2.275, 3.83333333333333, 6.2, 7.4, 8.5, 10.25, 11.1, 13.625,
15.2, 16.375, 17.8, 21.5), PondName = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Pond01", class = "factor"),
SampleDate = structure(c(1182585600, 1183190400, 1183795200,
1184400000, 1185004800, 1185609600, 1186214400, 1186819200,
1187424000, 1188028800, 1188633600, 1189238400, 1189843200,
1190448000, 1191052800), class = c("POSIXct", "POSIXt"))), .Names = c("DOC",
"AvgWeight", "PondName", "SampleDate"), row.names = c(NA, 15L
), class = "data.frame")
wt$SampleDate <- as.Date(wt$SampleDate)
wt
DOC AvgWeight PondName SampleDate
1 3 1.000000 Pond01 2007-06-23
2 10 1.666667 Pond01 2007-06-30
3 17 2.066667 Pond01 2007-07-07
4 24 2.275000 Pond01 2007-07-14
5 31 3.833333 Pond01 2007-07-21
6 38 6.200000 Pond01 2007-07-28
7 45 7.400000 Pond01 2007-08-04
8 52 8.500000 Pond01 2007-08-11
9 59 10.250000 Pond01 2007-08-18
10 66 11.100000 Pond01 2007-08-25
11 73 13.625000 Pond01 2007-09-01
12 80 15.200000 Pond01 2007-09-08
13 87 16.375000 Pond01 2007-09-15
14 94 17.800000 Pond01 2007-09-22
15 101 21.500000 Pond01 2007-09-29
library(forecast)
library(ggfortify)
library(ggplot2)
library(xts)
pond <- as.xts(wt$AvgWeight,order.by=seq(as.Date("2007-06-23"), by=7, len=15))
pond
d.arima <- auto.arima(pond)
d.arima;fitted(d.arima)
d.forecast <- forecast(d.arima, level = c(95), h = 3)
d.forecast
> d.forecast
Point Forecast Lo 95 Hi 95
106 25.2 23.14483 27.25517
113 28.9 24.30450 33.49550
120 32.6 24.91026 40.28974
I get the correct 95% confidence intervals when I plot a forecast package object (d.forecast in this case):
autoplot(d.forecast, ts.colour = 'dodgerblue', predict.colour = 'green',
         predict.linetype = 'dashed', ts.size = 1.5, conf.int.fill = 'azure3') +
  xlab('DOC') + ylab('AvgWeight-grs') + theme_bw()
But if I do:
ggfortify::autoplot(d.arima, predict = predict(d.arima, n.ahead = 3), conf.int = TRUE,
                    predict.alpha = 0.05, fitted.colour = "green",
                    predict.colour = "red", predict.linetype = "solid")
It defaults to 80% confidence intervals. I tried to set the confidence level inside predict() but it gets ignored. I also tried setting level inside autoplot(), and that did not work either. Questions: how can I get different confidence levels using autoplot from ggfortify? Is it correct to use predict.alpha here, or is it intended for the alpha colour of the predicted point estimate?
Also, is it possible to connect the fitted green line to the predicted red line?
I'm surprised you're not getting an error and are seeing the plots you're showing; unfortunately I cannot reproduce them.
When I load ggfortify after forecast, I cannot find a way to use forecast's autoplot. That's because ggfortify does not actually export autoplot; instead it overwrites the autoplot methods from forecast. So ggfortify::autoplot(...) shouldn't work, and should throw the error
Error: 'autoplot' is not an exported object from 'namespace:ggfortify'
There is also no predict argument to autoplot.forecast or autoplot.ts, so I'm not sure where that comes from.
Is there a reason why you want to use both forecast and ggfortify? Why not stick with forecast's autoplot for plotting? Here is an example based on your sample data and d.arima:
autoplot(forecast(d.arima)) + theme_minimal()
The light and dark areas correspond to the 95% and 80% CIs, respectively.
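If you specifically want a single 95% band in that plot, you can set the level in the forecast() call before plotting (a minimal sketch, reusing d.arima from above):
autoplot(forecast(d.arima, h = 3, level = 95)) + theme_minimal()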
Tested using forecast_8.10 and ggfortify_0.4.7.
