R mean compare two dataset with the bootstrap method

R mean compare two dataset with the bootstrap method - r

I am having two sets of data values are arranged in different bins I want to compare two data sets mean accross bins of dataset1 and dataset2 visualize in line plot or any method to visualize I am new to this kind of analysis any suggestion will be very helpful
dataset1 and dataset2 actual data bin size is different bin1-bin200 on both datasets and the number of data is varing(300-200) so metioned below sample dataset I wanted use bootstrap method take random data example 100 from both datasets and take mean accross all bins dataset1 and 2 why I am doing boostrap in both data bin size is similar but datas are varing may infer in taking mean and also presence of outlier extream low and high values accross the bins may alter the result so I wanted to use bootstrap method take random dataset take mean across all bins
any suggestions how can I do this in R I am newbie to R and I am in learning phase please help me
dataset1=structure(list(genenames = c("data1", "data2", "data3", "data4", "data5", "data6"),
bin1 = c(0,20,9,0,2,0),
bin2 = c(5,20,8,30,10,0),
bin3 = c(0,0,1,1,3,0),
bin4 =c(6, 20, 10, 5, 0, 1),
bin5 =c(10,15,30,10,9, 4)),
class = "data.frame", row.names = c(NA, -6L))
dataset2=structure(list(genenames = c("data10", "data11", "data12", "data13", "data14", "data15"),
bin1 = c(0,30,0,0,20,0),
bin2 = c(0,0,8,10,20,0),
bin3 = c(0,10,19,15,3,10),
bin4 =c(30, 0, 0, 25, 0, 20),
bin5 =c(0,5,0,20,30, 29)),
class = "data.frame", row.names = c(NA, -6L))
dataset1_mean=colMeans(dataset1[,-1])
dataset2_mean=colMeans(dataset2[,-1])
any statisticl method to remove this outlier or any problem to use bootstrap method please mention
Thank you

Here is one way: After some data wrangling you could use boxplot and mark the mean with a red point:
library(dplyr)
library(ggplot2)
library(tidyr)
dataset1 <- dataset1 %>%
mutate(df = "df1")
dataset2 <- dataset2 %>%
mutate(df = "df2")
bind_rows(dataset1, dataset2) %>%
pivot_longer(
cols = starts_with("bin"),
names_to = "name",
values_to = "value"
) %>%
ggplot(aes(df, value))+
geom_boxplot() +
stat_summary(fun=mean,
geom="point",
shape=20,
size=4,
color="red",
position = position_dodge2 (width = 0.7, preserve = "single"))

Related

gtsummary tbl_merge with multiple columns of variable length

I am doing a meta-analysis and would like to use gtsummary for Table 1 (Description of the Included Studies). I would like to have each column be a detail of the study (e.g. Authors, Intervention, Number, etc). Within this MA, there are some studies that have more than 2 interventions, so the rows won't be equal among studies (i.e. first column has 1 row per study, second column variable rows per study, etc).
Here is a dataset for the problem that matches my own dataset.
library(tidyverse)
#Create dataset
MA <-
tibble(
Study = c("Study 1", "Study 2"),
Intervention1 = c("Placebo", "Control"),
Intervention2 = c("Walking", "Running"),
Intervention3 = c("Running", NA),
Number_Int1 = c(21, 19),
Number_Int2 = c(19, 20),
Number_Int3 = c(20, NA)
)
Created on 2022-06-27 by the reprex package (v2.0.1)
I've tried to use tbl_summary and tbl_merge to generate a summary table, but to no avail.
Here is what I would like the table to look like:
Any help would be appreciated.
Ben

I've managed to find a solution using the gt package. Here is the code:
MA %>% pivot_longer(
cols = !Study,
names_to = c(".value", ".value"),
names_pattern = "(.)(.)",
values_drop_na = TRUE
) %>%
rename(Intervention = In) %>%
rename(Number = Nu) %>%
gt(groupname_col = "Study") %>%
tab_stubhead(label = "Study") %>%
tab_options(row_group.as_column = TRUE)
This gives the following output table:
If anyone has any solutions using the gtsummary package, that'd be great.
Thanks,
Ben

Create a multiline plot from a dataset with time on one axis and genes on the other

I have a dataset with mean gene counts for each decade as shown below:
structure(list(decade_0 = c(92.500989948184, 2788.27384875413,
28.6937227408861, 1988.03831525414, 1476.83143096418), decade_1 = c(83.4606306426572,
537.725421951383, 10.2747132062782, 235.380422949258, 685.043600629146
), decade_2 = c(188.414375201462, 2091.84249935145, 17.080858894829,
649.55107199935, 1805.3484565514), decade_3 = c(43.3316024314987,
141.64396529835, 2.77851259926935, 94.7748265692319, 413.248354335235
), decade_4 = c(54.4891626582901, 451.076574268175, 12.4298374245007,
346.102609621018, 769.215535857077), decade_5 = c(85.5621750431284,
131.822699578988, 13.3130607062134, 151.002200923853, 387.727911723968
), decade_6 = c(112.860998806804, 4844.59668489898, 19.7317645111144,
2084.76584309876, 766.375852567831), decade_7 = c(73.2198969730458,
566.042952305845, 3.2457873699886, 311.853982701609, 768.801733767044
), decade_8 = c(91.8161648275608, 115.161700090147, 10.7289451320065,
181.747670625714, 549.21661120626), decade_9 = c(123.31045087146,
648.23694540667, 17.7690326882018, 430.301803845829, 677.187054208271
)), row.names = c("ANK1", "NTN4", "PTPRH", "JAG1", "PLAT"), class = "data.frame")
I would like to plot a line graph with the changes in counts over time for each of >30 genes as shown here in excel.
To do this with ggplot I have to convert it to col1: decade, col2: gene, col3: counts.
My question is, either how to convert my table into this ggplot friendly table, or if there is a better way to produce the plot with a different tool?
Thanks!

One possibility: transpose your data frame, convert rownames to columns, then gather ("make long"). Plotting is then easy.
library(tidyverse)
mydat <- structure(list(decade_0 = c(92.500989948184, 2788.27384875413,
28.6937227408861, 1988.03831525414, 1476.83143096418), decade_1 = c(83.4606306426572,
537.725421951383, 10.2747132062782, 235.380422949258, 685.043600629146
), decade_2 = c(188.414375201462, 2091.84249935145, 17.080858894829,
649.55107199935, 1805.3484565514), decade_3 = c(43.3316024314987,
141.64396529835, 2.77851259926935, 94.7748265692319, 413.248354335235
), decade_4 = c(54.4891626582901, 451.076574268175, 12.4298374245007,
346.102609621018, 769.215535857077), decade_5 = c(85.5621750431284,
131.822699578988, 13.3130607062134, 151.002200923853, 387.727911723968
), decade_6 = c(112.860998806804, 4844.59668489898, 19.7317645111144,
2084.76584309876, 766.375852567831), decade_7 = c(73.2198969730458,
566.042952305845, 3.2457873699886, 311.853982701609, 768.801733767044
), decade_8 = c(91.8161648275608, 115.161700090147, 10.7289451320065,
181.747670625714, 549.21661120626), decade_9 = c(123.31045087146,
648.23694540667, 17.7690326882018, 430.301803845829, 677.187054208271
)), row.names = c("ANK1", "NTN4", "PTPRH", "JAG1", "PLAT"), class = "data.frame")
newdat <- mydat %>% t() %>% as.data.frame() %>% tibble::rownames_to_column('decade') %>%
pivot_longer(-decade, names_to = 'gene', values_to = 'count')
ggplot(newdat) + geom_line(aes(decade, count, color = gene, group = gene))
Created on 2020-02-14 by the reprex package (v0.3.0)

Plotting every three rows from data frame

I would like to make some plots from my data. Unfortunately, it is hard to predict how many plots I will generate because it depends on data and may be different. It is a reason why I would like to make it easy adjustable. However, it will be most often a plot from group of 3 rows each time.
So, I would like to plot from rows 1:3, 4-6,7-9, etc.
This is data:
> dput(DF_final)
structure(list(AC = c(0.0031682160632777, 0.00228591145206846,
0.00142094444568728, 0.000661218113472149, 0.0010078157353918,
0.000400289437089513, 40.4634784175177, 40.5055070858594, 0.0183737773741582
), SD = c(0.00250647379467532, 0.0013244185401148, 0.000469332241199189,
0.000294558308707343, 0.000385553400676202, 0.000104447914881357,
11.0693842400794, 8.78768774254084, 0.00696532251341454), ln_AC = c(-5.75458660556339,
-6.08099044923792, -6.556433525855, -7.32142679754668, -6.89996992823399,
-7.8233226797995, 3.70039979980691, 3.70143794229703, -3.99683077355773
), ln_SD = c(-5.98887837626238, -6.62678175351058, -7.66419963690747,
-8.13003358225542, -7.86083085139947, -9.16682203300101, 2.40418312097106,
2.17335162163583, -4.96681136795312), Percent_AC = c(126.401324043689,
172.597361244303, 302.758754023937, 224.477834753288, 261.394591157605,
383.243109777925, 365.544076706723, 460.934756361151, 263.789326894369
), Percent_SD = c(100, 100, 100, 100, 100, 100, 100, 100, 100
), TP = c(0, 40, 80, 0, 40, 80, 0, 40, 80)), row.names = c("Tim_0",
"Tim_40", "Tim_80", "Jack_0", "Jack_40", "Jack_80", "Tom_0",
"Tom_40", "Tom_80"), class = "data.frame")
Column ln_AC should be set as an Y axis and column TP as X axis. First of all I would like to have all of them on separate graphs next to each other (remember about issue that the number of plots may be igh at some point) and if possible everything at the same graph. It should be a point plot with trend line.
Is it also possible to get a slope, SD slope, R^2 on a plot from linear regression ?
I manage to do it a for a single plot but regression line looks strange...
The code below was used to generate this plot and regression line.
fit <- lm(DF_final$ln_AC~DF_final$TP, data=DF_final)
plot(DF_final[1:3,7], DF_final[1:3,3], type = "p", ylim = c(-10,0), xlim=c(0,100), col = "red")
lines(DF_final$TP, fitted(fit), col="blue")

In base R (without so many packages), you can do:
# splits every 3 rows
DF = split(DF_final,gsub("_[^ ]*","",rownames(DF_final) ))
# you can also do
# DF = split(DF_final,(1:nrow(DF_final) - 1) %/%3 ))
To store your values:
slopes = vector("numeric",3)
names(slopes) = names(DF)
rsq = vector("numeric",3)
names(rsq) = names(DF)
To plot:
par(mfrow=c(1,3))
for(i in names(DF)){
fit <- lm(ln_AC~TP, data=DF[[i]])
plot(DF[[i]]$TP, DF[[i]]$ln_AC, type = "p", col = "red",main=i)
abline(fit, col="blue")
slopes[i]=round(fit$coefficients[2],digits=2)
rsq[i]=round(summary(fit)$r.squared,digits=2)
mtext(side=1,paste("slope=",slopes[i],"\nrsq=",rsq[i]),
padj=-2,cex=0.7)
}
And your values:
slopes
Jack Tim Tom
-0.01 -0.01 -0.10
rsq
Jack Tim Tom
0.29 0.99 0.75

If I understand correctly, the reason you want 3 observation per graph is because you have different individuals (Jack,Tim,Tom) . Is that so?
If you don't want to worry about that number, you can do this
# move rownames to column
data$person <- rownames(data)
data$person <- gsub("\\_.*","",data$person) # remove TP from names
# better to use library(data.table) for this step
data <- melt(data,id.vars=c("person","TP","ln_AC"))
ggplot(data,aes(x=TP, y=ln_AC)) + geom_point() +
geom_smooth(method = "lm") + facet_grid(~person)
This results in a plot like #giocomai, but it will work also if you have 4,5,6 or whatever persons in your data.
---- Edit
If you want to add R2 values, you can do something like this. Note, that it may not be the best and elegant solution, but it works.
data <- data.frame(...)
data$person <- rownames(data)
data$person <- gsub("\\_.*","",data$person)
# run lm for all persons and save them in a data.frame
nomi <- unique(data$person)
#lmStats <- data.frame()
lmStats <- sapply(nomi,
function(ita){
model <- lm(ln_AC~TP,data= data[which(data$person == ita),])
lmStat <- summary(model)
# I only save r2, but you can get all the statistics you need
lmRow <- data.frame("r2" = lmStat$r.squared )
#lmStats <- rbind(lmStats,lmRow)
}
)
lmStats <- do.call(rbind,lmStats)
# format the output,and create a dataframe we will use to annotate facet_grid
lmStats <- as.data.frame(lmStats)
rownames(lmStats) <- gsub("\\..*","",rownames(lmStats))
lmStats$person <- rownames(lmStats)
colnames(lmStats)[1] <- "r2"
lmStats$r2 <- round(lmStats$r2,2)
lmStats$TP <- 40
lmStats$ln_AC <- 0
lmStats$lab <- paste0("r2= ",lmStats$r2)
# melt and add r2 column to the data (not necessary, but I like to have everything I plot in teh data)
data <- melt(data,id.vars=c("person","TP","ln_AC"))
data$r2 <- lmStats[match(data$person,rownames(lmStats)),1]
ggplot(data,aes(x=TP, y=ln_AC)) + geom_point() +
geom_smooth(method = "lm") + facet_grid(~person) +
geom_text(data=lmStats,label=lmStats$lab)
An easier way (less steps) would be to use facet_grid(~r2), so that you have the R.square value in the title.

If I understand correctly what you mean, assuming you will always have three observation per graph, your main issue would be creating a categorical variable to separate them. Here's one way to accomplish it. Depending on the layout you prefer, you may want to check facet_wrap instead of facet_grid.
library("dplyr")
library("ggplot2")
DF_final <- structure(list(AC = c(0.0031682160632777, 0.00228591145206846,
0.00142094444568728, 0.000661218113472149, 0.0010078157353918,
0.000400289437089513, 40.4634784175177, 40.5055070858594, 0.0183737773741582
), SD = c(0.00250647379467532, 0.0013244185401148, 0.000469332241199189,
0.000294558308707343, 0.000385553400676202, 0.000104447914881357,
11.0693842400794, 8.78768774254084, 0.00696532251341454), ln_AC = c(-5.75458660556339,
-6.08099044923792, -6.556433525855, -7.32142679754668, -6.89996992823399,
-7.8233226797995, 3.70039979980691, 3.70143794229703, -3.99683077355773
), ln_SD = c(-5.98887837626238, -6.62678175351058, -7.66419963690747,
-8.13003358225542, -7.86083085139947, -9.16682203300101, 2.40418312097106,
2.17335162163583, -4.96681136795312), Percent_AC = c(126.401324043689,
172.597361244303, 302.758754023937, 224.477834753288, 261.394591157605,
383.243109777925, 365.544076706723, 460.934756361151, 263.789326894369
), Percent_SD = c(100, 100, 100, 100, 100, 100, 100, 100, 100
), TP = c(0, 40, 80, 0, 40, 80, 0, 40, 80)), row.names = c("Tim_0",
"Tim_40", "Tim_80", "Jack_0", "Jack_40", "Jack_80", "Tom_0",
"Tom_40", "Tom_80"), class = "data.frame")
DF_final %>%
mutate(id = as.character(sapply(1:(nrow(DF_final)/3), rep, 3))) %>%
ggplot(aes(x=TP, y=ln_AC)) +
geom_point() +
geom_smooth(method = "lm") +
facet_grid(~id)
Created on 2020-02-06 by the reprex package (v0.3.0)

How draw a loess line in ts plot

I tried hours to figure out how I can make my loess line work. The problem is I do not know much (lets say near nothing). I only have to use R for one course in university.
I created a fake table the real table is for download here
I have to make a timeline plot that worked surprisingly well. But now I have to add two loess lines with different spans. My Problem is I don't know how the command really works. I mean I know it should be something like loess(..~.., data=..). The step where I'm stuck is marked with "WHAT BELONGS HERE" in the given code below.
table <- structure(list(
Months = c("1980-06", "1980-07", "1980-08", "1980-09",
"1980-10", "1980-11", "1980-12", "1981-01"),
Total = c(75000, 70000, 60000, 73000, 72000, 71000, 76000, 71000)),
.Names = c("Monts", "Total of Killed Pigs"),
row.names = c(NA, 4L), class = "data.frame")
ts.obj <- ts(table$`Total of Killed Pigs`, start = c(1980, 1), frequency = 2)
plot(ts.obj)
trend1 <- loess(# **WHAT BELONGS HERE?**, data = table, span =1)
predict1 <- predict(trend1)
lines(predict1, col ="blue")
That is my original code:
obj <- read.csv(file="PATH/monthly-total-number-of-pigs-sla.csv", header=TRUE, sep=",")
ts.obj <- ts(obj$Monthly.total.number.of.pigs.slaughtered.in.Victoria..Jan.1980...August.1995, start = c(1980, 1), frequency = 12)
plot(ts.obj)
trend1 <- loess (WHAT BELONGS HERE?, data = obj, span =1)
predict1 <- predict (trend1)
lines(predict1, col="blue")

We can do away with the data argument as the time series is univariate (just one variable).
The formula ts.obj ~ index(ts.obj) can be read as
value as a function of time
as ts.obj will give you the values, and index(ts.obj) will give you the time index for those values, and the tilde ~ specifies that the first is a function of, or dependent on, the other.
library(zoo) # for index()
plot(ts.obj)
trend1 <- loess(ts.obj ~ index(ts.obj), span=1)
trend2 <- loess(ts.obj ~ index(ts.obj), span=2)
trend3 <- loess(ts.obj ~ index(ts.obj), span=3)
pred <- sapply(list(trend1, trend2, trend3), predict)
matlines(index(ts.obj), pred, lty=1, col=c("blue", "red", "orange"))
zoo isn't strictly required. If you replace index(ts.obj) with as.numeric(time(ts.obj)) you should be fine, I think.

In case you were wanting to go with ggplot2:
library(ggplot2)
library(dplyr)
table <- structure(list(
Months = c("1980-06", "1980-07", "1980-08", "1980-09",
"1980-10", "1980-11", "1980-12", "1981-01"),
Total = c(75000, 70000, 60000, 73000, 72000, 71000, 76000, 71000)),
.Names = c("Months", "Total"),
row.names = c(NA, 8L), class = "data.frame")
Change to proper dates:
table <- table %>% mutate(Months = as.Date(paste0(Months,"-01")))
Plot:
ggplot(table, aes(x=Months, y=Total)) +
geom_line() +
geom_smooth(span=1, se= FALSE, color ="red") +
geom_smooth(span=2, se= FALSE, color ="green") +
geom_smooth(span=3, se= FALSE) +
theme_minimal()

box plot using column of different length

I want to do some box plots, but I have data with a different number of rows for each column.
My data looks like:
OT1 OT2 OT3 OT4 OT5 OT6
22,6130653 16,6666667 20,259481 9,7431602 0,2777778 16,0678643
21,1122919 32,2946176 11,396648 10,9458023 4,7128509 10,8938547
23,5119048 19,5360195 23,9327541 39,5634921 0,6715507 12,2591613
16,9880885 39,5365943 7,7568134 22,7453205 3,6410445 11,7610063
32,768937 25,2897351 9,6288027 4,1629535 3,7251656
40,7819933 15,6320021 5,9171598
23,7961828 14,3728125 2,1887585
I'd like to have a box plot for each column (OT1, OT2…), but with the first three and the last three grouped together.
I tried:
>mydata <- read.csv('L5.txt', header = T, sep = "\t")
>mydata_t <- t(mydata)
>boxplot(mydata_t, ylab = "OTU abundance (%)",las=2, at=c(1,2,3 5,6,7))
But it didn't work…
How can I do?
Thanks!

Combining both answers and extenting Henrik's answer, you can also group the OT's together in boxplot() as well:
dat <- read.table(text='OT1 OT2 OT3 OT4 OT5 OT6
22,6130653 16,6666667 20,259481 9,7431602 0,2777778 16,0678643
21,1122919 32,2946176 11,396648 10,9458023 4,7128509 10,8938547
23,5119048 19,5360195 23,9327541 39,5634921 0,6715507 12,2591613
16,9880885 39,5365943 7,7568134 22,7453205 3,6410445 11,7610063
32,768937 25,2897351 9,6288027 4,1629535 3,7251656
40,7819933 15,6320021 5,9171598
23,7961828 14,3728125 2,1887585',header=TRUE,fill=TRUE)
dat <- sapply(dat,function(x)as.numeric(gsub(',','.',x)))
dat.m <- melt(dat)
dat.m <- transform(dat.m,group=ifelse(grepl('1|2|3','4|5|6'),
'group1','group2'))
as.factor(dat.m$X2)
boxplot(dat.m$value~dat.m$X2,data=dat.m,
axes = FALSE,
at = 1:6 + c(0.2, 0, -0.2),
col = rainbow(6))
axis(side = 1, at = c(2, 5), labels = c("Group_1", "Group_2"))
axis(side = 2, at = seq(0, 40, by = 10))
legend("topright", legend = c("OT1", "OT2", "OT3", "OT4", "OT5", "OT6"), fill = rainbow(6))
abline(v = 3.5, col = "grey")
box()

Not easy to group boxplots using R basic plots, better to use ggplot2 here. Whatever the difficulty here is how to reformat your data and reshape them in the long format.
dat <- read.table(text='OT1 OT2 OT3 OT4 OT5 OT6
22,6130653 16,6666667 20,259481 9,7431602 0,2777778 16,0678643
21,1122919 32,2946176 11,396648 10,9458023 4,7128509 10,8938547
23,5119048 19,5360195 23,9327541 39,5634921 0,6715507 12,2591613
16,9880885 39,5365943 7,7568134 22,7453205 3,6410445 11,7610063
32,768937 25,2897351 9,6288027 4,1629535 3,7251656
40,7819933 15,6320021 5,9171598
23,7961828 14,3728125 2,1887585',header=TRUE,fill=TRUE)
dat = sapply(dat,function(x)as.numeric(gsub(',','.',x)))
dat.m <- melt(dat)
dat.m <- transform(dat.m,group=ifelse(grepl('1|2|3',Var2),
'group1','group2'))
ggplot(dat.m)+
geom_boxplot(aes(x=group,y=value,fill=Var2))

Or with boxplot, using #agstudy's 'dat':
df <- melt(dat)
boxplot(value ~ Var2, data = df, at = 1:6 + c(0.2, 0, -0.2))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R mean compare two dataset with the bootstrap method - r

Related

gtsummary tbl_merge with multiple columns of variable length

Create a multiline plot from a dataset with time on one axis and genes on the other

Plotting every three rows from data frame

How draw a loess line in ts plot

box plot using column of different length

Categories

Resources