What does it mean to have something differ between the coefficients? - r

I need to create a model where the presence/absence of pox can differ between species
and between elevations while also allowing for the effect of elevation to differ between species.
This is what I have:
library(car)
library(effects)
modPox=glm(Activepox ~ Species + Elev, data = datPox, family = binomial)
summary(modPox)
Anova(modPox)
plot(allEffects(modPox))
However, what does it mean for pox to differ between the two coefficients? What do I need to add so that pox can differ between species and elevation? What does it mean for the effect of elevation to differ between species?
Thank you.
This is what the data looks like:
Site Species Bandno Date Sex Age Oldpox Activepox Malaria Elev
1 AIN APAP 159174793 7/22/2004 U H 0 0 2 mid
2 AIN APAP 159174964 7/6/2004 M H 0 1 2 mid
3 AIN APAP 159174965 7/7/2004 F H 0 0 2 mid
Data:
datPox <- data.frame(
stringsAsFactors = FALSE,
Site = c("AIN", "AIN", "AIN"),
Species = c("APAP", "APAP", "APAP"),
Bandno = c(159174793L, 159174964L, 159174965L),
Date = c("7/22/2004", "7/6/2004", "7/7/2004"),
Sex = c("U", "M", "F"),
Age = c("H", "H", "H"),
Oldpox = c(0L, 0L, 0L),
Activepox = c(0L, 1L, 0L),
Malaria = c(2L, 2L, 2L),
Elev = c("mid", "mid", "mid")
)
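To let the effect of elevation differ between species, the usual approach is to add a Species-by-Elev interaction term to the model. The sketch below uses simulated data, because the three sample rows contain only one species and one elevation. The second species code "SPB", the elevation levels and the 30% infection rate are made-up assumptions, not values from the real dataset.

```r
# Simulated stand-in for datPox (hypothetical values, see the note above)
set.seed(1)
n <- 400
sim <- data.frame(
  Species = factor(sample(c("APAP", "SPB"), n, replace = TRUE)),
  Elev    = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                   levels = c("low", "mid", "high"))
)
sim$Activepox <- rbinom(n, 1, 0.3)

# Additive model: species and elevation each shift the log-odds of pox
mod_add <- glm(Activepox ~ Species + Elev, data = sim, family = binomial)

# Interaction model: Species * Elev expands to Species + Elev + Species:Elev,
# so the elevation effect is allowed to differ between species
mod_int <- glm(Activepox ~ Species * Elev, data = sim, family = binomial)

# Likelihood-ratio test of whether the interaction terms improve the fit
anova(mod_add, mod_int, test = "Chisq")
```

In the additive model each coefficient moves the log-odds of active pox by the same amount for every bird. The extra `Species:Elev` coefficients in the second model are what "the effect of elevation differs between species" means in practice.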

Related

Making a table that contains Mean and SD of a Dataset

I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
Mean Standard Deviation
Age: 44.27 16.715
Height: 70.25 3.009219
Weight: 189.3 36.55036
Desired Weight: 178.6 26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided sucked up all the memory on my laptop, and that much data isn't needed to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:
library(dplyr)
library(tidyr)
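starwars |>
  group_by(sex) |>
  summarise(across(
    where(is.numeric),
    # pass na.rm inside each function; extra arguments to across() are deprecated
    .fns = list(Mean = \(x) mean(x, na.rm = TRUE), SD = \(x) sd(x, na.rm = TRUE)),
    .names = "{col}__{fn}"
  )) |>
  pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")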
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
Just make a data frame from colMeans and the column SDs. Note that subset can also select columns.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The pipe |> do.call(what=data.frame) is only needed to get rid of matrix columns, which is useful if you want to process the data further.
Note: R >= 4.1 used.
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")

Proportion Tables in R

I have the following data in R:
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
my_data = data.frame(gender, status, disease)
I want to make a table that shows:
What percent of male immigrants have the disease?
What percent of male non-immigrants have the disease?
What percent of female immigrants have the disease?
What percent of female non-immigrants have the disease?
I tried to do this with the following code:
t1 <- xtabs(disease ~ gender + status, data=my_data)
But I get this error:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
‘sum’ not meaningful for factors
Can someone please show me what I am doing wrong and how to fix this?
Thank you!
Since all the columns are factors, use count from dplyr and then compute the proportions:
library(dplyr)
library(tidyr)
my_data %>%
dplyr::count(across(everything())) %>%
pivot_wider(names_from = disease, values_from =n, values_fill = 0) %>%
group_by(gender) %>%
mutate(100 *across(No:Yes, proportions)) %>%
ungroup
Output:
# A tibble: 4 × 4
gender status No Yes
<fct> <fct> <dbl> <dbl>
1 Female Citizen 69.4 72.4
2 Female Immigrant 30.6 27.6
3 Male Citizen 70.4 68.7
4 Male Immigrant 29.6 31.3
With xtabs, if we convert the column to integer, it could work as
apply(xtabs(n ~ disease + gender + status,
transform(my_data, n = as.integer(disease))), c(1, 2), proportions) * 100
, , gender = Female
disease
status No Yes
Citizen 69.36724 72.41993
Immigrant 30.63276 27.58007
, , gender = Male
disease
status No Yes
Citizen 70.40185 68.68687
Immigrant 29.59815 31.31313
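The tables above give, within each gender and disease group, the share of citizens and immigrants. If the goal is instead the percentage of each gender/status group that has the disease, as the question lists, one possible sketch is to normalise the counts within each gender/status cell. The seed is an assumption added for reproducibility; the rest follows the question's setup.

```r
# Recreate data like the question's (seed added so the result is reproducible)
set.seed(42)
n <- 5000
my_data <- data.frame(
  gender  = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  status  = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

# A one-sided formula makes xtabs count rows instead of summing a factor
counts <- xtabs(~ gender + status + disease, data = my_data)

# Percentages within each gender/status combination (they sum to 100 over disease)
pct <- 100 * prop.table(counts, margin = c(1, 2))

# Percent with the disease, by gender (rows) and status (columns)
pct[, , "Yes"]
```

The original error arises because a factor on the left-hand side of the xtabs formula is treated as a quantity to sum; leaving the left-hand side empty asks for plain counts.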

Wilcoxon Test in a loop for a number of datasets at the same time

I have a question about whether I can run a Wilcoxon test in a loop for all the tables I generated.
Basically, I want to run a paired Wilcoxon test between two variables in each dataset; the two variables are in the same position (e.g. the xth and yth columns) in every dataset. (For those familiar with biology, these are RPKM values for control and treated samples for some repetitive elements.) I would like to end up with a table of the Wilcoxon p-values, one per dataset.
I have already generated all the tables (data frames) using the code below. I think I need to continue inside the loop to run a Wilcoxon test on each one, but I don't know how:
data=sample_vs_norm
filter=unique(data$family)
for(i in 1:length(filter)){
  table_name=paste('table_', filter[i], sep="")
  print(table_name)
  assign(table_name, data[data$Subfamily == filter[i]])
}
Here is the structure of a single dataset.
Basically, I would like to run a Wilcoxon test between the variables "R009_initial_filter_rpkm" and "normal_filter_rpkm":
Chr Start End Mappability Strand R009_initial_filter_NormalizedCounts
1: chr11 113086868 113087173 1 - 2
2: chr2 24290845 24291132 1 - 11
3: chr4 15854425 15854650 1 - 0
4: chr6 43489623 43489676 1 + 11
normal_filter_NormalizedCounts R009_initial_filter_rpkm
1: 14.569000 0.169752
2: 1.000000 0.992191
3: 14.815900 0.000000
4: 0.864262 5.372810
normal_filter_rpkm FoldChange p.value FDR FoldChangeFPKM
1: 1.236560 0.137278 0.999862671 1.000000000 0.1372776
2: 0.000000 11.000000 0.003173828 0.008149271 Inf
3: 1.704630 0.000000 1.000000000 1.000000000 0.0000000
4: 0.422137 12.727600 0.003173828 0.008149271 12.7276453
structure(list(Chr = structure(1:4, .Label = c("chr11", "chr2",
"chr4", "chr6"), class = "factor"), Start = c(113086868L, 24290845L,
15854425L, 43489623L), End = c(113087173L, 24291132L, 15854650L,
43489676L), Mappability = c(1L, 1L, 1L, 1L), Strand = structure(c(1L,
1L, 1L, 2L), .Label = c("-", "+"), class = "factor"), R009_initial_filter_NormalizedCounts = c(2L,
11L, 0L, 11L), normal_filter_NormalizedCounts = c(14.569,
1, 14.8159, 0.864262), R009_initial_filter_rpkm = c(0.169752,
0.992191, 0, 5.37281), normal_filter_rpkm = c(1.23656,
0, 1.70463, 0.422137), FoldChange = c(0.137278, 11, 0, 12.7276
), p.value = c(0.999862671, 0.003173828, 1, 0.003173828), FDR = c(1,
0.008149271, 1, 0.008149271), FoldChangeFPKM = c(0.1372776, Inf,
0, 12.7276453)), class = "data.frame", row.names = c(NA,
-4L))
I'm sorry if I use incorrect terminology as I am a newbie in R, and thank you so much for the help
One approach is to use grouping with by = in data.table.
library(data.table)
setDT(data)
data[,wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = TE_Subfamily]
# TE_Subfamily statistic p.value
#1: AluYf4 7.5 1
You can group by any number of variables, for example TE_Subfamily and Chr:
data[TE_Subfamily %in% filter,
wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = .(TE_Subfamily,Chr)]
# TE_Subfamily Chr statistic p.value
#1: AluYf4 chr11 0 1
#2: AluYf4 chr2 1 1
#3: AluYf4 chr4 0 1
#4: AluYf4 chr6 1 1
If you need to only perform comparisons for certain TE_Subfamily, you could try something like this:
filter <- c("AluYf4")
data[TE_Subfamily %in% filter,
wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = TE_Subfamily]
# TE_Subfamily statistic p.value
#1: AluYf4 7.5 1
For bonus points, you can correct for multiple testing:
data[TE_Subfamily %in% filter,
wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = TE_Subfamily][,adjusted.p.value := p.adjust(p.value,method = "bonferroni")][]
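The question asks for a paired test, while the calls above use the default unpaired form of wilcox.test. A minimal base R sketch of the paired version follows, on simulated data: the family names famA to famC and the RPKM values are invented for illustration, and only the two RPKM column names come from the question.

```r
# Simulated stand-in for the real table (hypothetical families and values)
set.seed(1)
toy <- data.frame(
  TE_Subfamily             = rep(c("famA", "famB", "famC"), each = 12),
  R009_initial_filter_rpkm = rexp(36),
  normal_filter_rpkm       = rexp(36)
)

# One paired Wilcoxon signed-rank test per subfamily; each row is one pair
pvals <- sapply(split(toy, toy$TE_Subfamily), function(d) {
  wilcox.test(d$R009_initial_filter_rpkm, d$normal_filter_rpkm,
              paired = TRUE)$p.value
})

# Collect the p-values in a table and correct for multiple testing
res <- data.frame(
  TE_Subfamily     = names(pvals),
  p.value          = unname(pvals),
  adjusted.p.value = unname(p.adjust(pvals, method = "bonferroni"))
)
res
```

With data.table the same change should amount to adding paired = TRUE inside the wilcox.test call in the grouped expressions above.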

Can you merge your data without creating separate dataframe in R?

My data frame looks something like the following:
sex year country value
F 2010 AU 350
F 2011 GE 258
M 2010 AU 250
F 2012 GE 928
In order to create another data frame that is merged by year and country, with sex and value being what you want to compare, you must first create separate data frames, like:
f <- subset(df, sex=="F")
m <- subset(df, sex=="M")
df_new <- merge(f, m, by=c("country", "year"), suffixes=c("_f", "_m"))
In this way, you can obtain a new data frame with year, and country being matched and just the value being different.
However, I would rather not create separate data frames just to merge them. Is it possible to get the same data frame with a single line of code?
Considering dput(dft) as :
structure(list(sex = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"),
year = c(2010, 2011, 2010, 2012),
country = structure(c(1L, 2L, 1L, 2L), .Label = c("AU", "GE"), class = "factor"),
value = c(350, 258, 250, 928)), .Names = c("sex", "year", "country", "value"),row.names = c(NA, -4L), class = "data.frame")
you can use tidyr (part of the tidyverse) and do:
library(tidyr)
dft %>% spread(sex, value)
which gives:
# year country F M
#1 2010 AU 350 250
#2 2011 GE 258 NA
#3 2012 GE 928 NA
We can do a split and then with Reduce/merge can get the expected output
Reduce(function(...) merge(..., by = c("country", "year"),
suffixes = c("_f", "_m")), split(df, df$sex))
# country year sex_f value_f sex_m value_m
#1 AU 2010 F 350 M 250
NOTE: This should also work when there are 'n' number of unique elements in the split by column (without the suffixes or its modification)
A reshaping option with data.table is
library(data.table)
na.omit(dcast(setDT(df), country + year ~ rowid(country, year),
value.var = c("sex", "value")))
# country year sex_1 sex_2 value_1 value_2
#1: AU 2010 F M 350 250
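Base R's reshape can also do this in a single call without splitting first. A sketch using the four rows from the question (the object name wide is just for illustration):

```r
# The four rows shown in the question
df <- data.frame(
  sex     = c("F", "F", "M", "F"),
  year    = c(2010, 2011, 2010, 2012),
  country = c("AU", "GE", "AU", "GE"),
  value   = c(350, 258, 250, 928)
)

# One row per country/year, one value column per sex
wide <- reshape(df, idvar = c("country", "year"), timevar = "sex",
                direction = "wide", sep = "_")
wide

# Keep only the country/year pairs present for both sexes, like merge() does
na.omit(wide)
```

Unlike merge, reshape keeps country/year pairs that have only one sex and fills the missing side with NA, which is why the na.omit step is needed to reproduce the merged result.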

Plotting mean and st-dev from a dataset with multiple y values for an x value

My data is organized as such:
Distance r^2
0 1
0 0.9
0 0
0 0.8
0 1
1 0.5
1 0.45
1 0.56
1 1
2 0
2 0.9
3 0
3 0.1
3 0.2
3 0.3
...
300 1
300 0.8
I want to plot r^2 decay with distance, meaning I want to plot a mean value + st-dev for every unique distance value. So I should have 1 point at x=0, 1 point at x=1... but I have multiple x=0 values.
What is the best way to achieve this, given how the data is organized? I would like to do it in R if possible.
Thank you,
Adrian
Edit:
I have tried:
> dd <-structure(list(Distance = dist18, r.2 = a18[,13]), Names = c("Distance", "r^2"), class = "data.frame", row.names = c(NA, -15L))
> ggplot(dd, aes(x=Distance, y=r.2)) + stat_summary(fun.data="mean_sdl")
Error in data.frame(x = c(42L, 209L, 105L, 168L, 63L, 212L, 148L, 175L, : arguments imply differing number of rows: 126877, 15
> head(dist18)
[1] 42 209 105 168 63 212
> head(dd)
Distance r.2
1 42 0.89
2 209 0.92
3 105 0.91
4 168 0.81
5 63 0.88
6 212 0.88
Is this because my data is not sorted?
You can also plot your SD as an area around the mean similar to CI plotting (assuming temp is your data set)
library(data.table)
library(ggplot2)
temp <- setDT(temp)[, list(Mean = mean(r.2), SD = sd(r.2)), by = Distance]
ggplot(temp) + geom_point(aes(Distance, Mean)) + geom_ribbon(aes(x = Distance, y = Mean, ymin = (Mean - SD), ymax = (Mean + SD)), fill = "skyblue", alpha = 0.4)
Using dplyr it will be something like this:
df = data.frame(distance = rep(1:300, each = 10), r2 = runif(3000))
library(dplyr)
df_group = group_by(df, distance)
summarise(df_group, mn = mean(r2), s = sd(r2))
Source: local data frame [300 x 3]
distance mn s
1 300 0.4977758 0.3565554
2 299 0.4295891 0.3281598
3 297 0.5346428 0.3424429
4 296 0.4623368 0.3163320
5 291 0.3224376 0.2103655
6 290 0.3916658 0.2115264
7 288 0.6147680 0.2953960
8 287 0.3405524 0.2032616
9 286 0.5690844 0.2458538
10 283 0.2901744 0.2835524
.. ... ... ...
Where df is the data.frame with your data, and distance and r2 the two column names.
This should work:
# Create a data frame like yours
df=data.frame(sample(50,size=300,replace=TRUE),runif(300))
colnames(df)=c('Distance','r^2')
#initialize empty data frame with columns x, mean and stdev
results=data.frame(x=numeric(0),mean=numeric(0),stdev=numeric(0))
count=1
for (i in 0:max(df$Distance)){
results[count,'x']=i
temp_mean=mean(df[which(df$Distance==i),'r^2'])
results[count,'mean']=temp_mean
temp_sd=sd(df[which(df$Distance==i),'r^2'])
results[count,'stdev']=temp_sd
count=count+1
}
# Plot your results
plot(results$x,results$mean,xlab='distance',ylab='r^2')
epsilon=0.02 #to add the little horizontal bar to the error bars
for (i in 1:nrow(results)){
up = results$mean[i] + results$stdev[i]
low = results$mean[i] - results$stdev[i]
segments(results$x[i],low , results$x[i], up)
segments(results$x[i]-epsilon, up , results$x[i]+epsilon, up)
segments(results$x[i]-epsilon, low , results$x[i]+epsilon, low)
}
Here's the result http://imgur.com/ED7PwD8
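The per-distance loop above can also be written without explicit iteration. A sketch in base R using tapply and arrows, run on the 15 example rows from the question:

```r
# Example rows from the question
dd <- data.frame(
  Distance = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  r.2      = c(1, 0.9, 0, 0.8, 1, 0.5, 0.45, 0.56, 1, 0, 0.9, 0, 0.1, 0.2, 0.3)
)

# Mean and standard deviation of r^2 for every distinct distance
means <- tapply(dd$r.2, dd$Distance, mean)
sds   <- tapply(dd$r.2, dd$Distance, sd)
results <- data.frame(
  x     = as.numeric(names(means)),
  mean  = as.vector(means),
  stdev = as.vector(sds)
)

# One point per distance, with +/- 1 SD error bars drawn by arrows()
plot(results$x, results$mean,
     ylim = range(results$mean - results$stdev, results$mean + results$stdev),
     xlab = "distance", ylab = "r^2")
arrows(results$x, results$mean - results$stdev,
       results$x, results$mean + results$stdev,
       angle = 90, code = 3, length = 0.05)
```

Because tapply only visits distances that occur in the data, there are no empty groups to produce NaN means.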
If you want to plot mean and +/- 1 sd for each point, the ggplot function makes this easy. With the test data
dd<-structure(list(Distance = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
2L, 2L, 3L, 3L, 3L, 3L), r.2 = c(1, 0.9, 0, 0.8, 1, 0.5, 0.45,
0.56, 1, 0, 0.9, 0, 0.1, 0.2, 0.3)), .Names = c("Distance", "r.2"
), class = "data.frame", row.names = c(NA, -15L))
you can just run
library(ggplot2)
library(Hmisc)
ggplot(dd, aes(x=Distance, y=r.2)) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1))
which produces a point at the mean with a +/- 1 sd range for each distance.
I tried with your real data and got
real <- read.table("http://pelinfamily.ca/bio/GDR-18_conc.ld", header=FALSE)
dd <- data.frame(Distance=real[,2]-real[,1], r.2=real[,13])
ggplot(dd, aes(x=Distance, y=r.2)) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1), geom="ribbon", alpha=.4) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1), geom="line")
