I am working with the dataset twinData in R. I have two questions related to data wrangling.
How would I go about listing only the combinations of cohort and zygosity where the twins’ heights are significantly similar?
My approach was to create a new variable indicating whether the correlation coefficient between ht1 and ht2 in each subgroup is greater than 0.5, with 95 percent confidence.
library(dplyr)
library(broom)
library(OpenMx)  # provides the twinData dataset

sig_twin_cor <- twinData %>%
  group_by(cohort, zygosity) %>%
  do(tidy(cor.test(~ ht1 + ht2, alternative = "greater", data = .))) %>%
  arrange(desc(estimate)) %>%
  # "greater than 0.5 with 95 percent confidence" means checking the lower
  # confidence bound, not just the point estimate
  mutate(Greater0.5 = ifelse(conf.low > 0.5, "Yes", "No"))
sig_twin_cor
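To list only the significant combinations, you can then filter on the new flag:

sig_twin_cor %>% filter(Greater0.5 == "Yes")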
Second, I need to transform the dataset twinData into a narrow form using gather(). Can someone show me how to do this?
Thanks!
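For the gather() part, a minimal sketch, assuming you want the two height columns stacked into one row per twin (the names twin and height are just illustrative):

library(dplyr)
library(tidyr)

# gather ht1/ht2 into a key column ("twin") and a value column ("height")
twin_long <- twinData %>%
  gather(key = "twin", value = "height", ht1, ht2)
head(twin_long)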
I am trying to produce estimates of the 25th percentile of a continuous variable for a series of sub-groups, where the data come from a survey that uses sampling weights. I am doing this in R using the survey and srvyr packages.
The issue I face is that in a small minority of cases a sub-group has only one observation, so a 25th percentile is meaningless. That would be fine, except that it triggers an error which prevents the percentiles from being calculated even for the sub-groups with sufficient observations.
Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
need at least two non-NA values to interpolate
The code runs when the offending groups are removed; however, I have had to identify them manually, which is far from ideal.
Is there a way to achieve the same outcome where, for single-observation groups, an NA (or just the value of that observation) is returned rather than an error? Alternatively, is there a neat way of automatically excluding such groups from the calculation?
Below is a reproducible example to illustrate my issue using the apistrat dataset from the survey package.
library(dplyr)
library(survey)
library(srvyr)
data(api)
#25th percentile of api00 by school type and whether school is year round or not
apistrat %>%
  as_survey(strata = stype, weights = pw) %>%
  group_by(yr.rnd, stype, .drop = TRUE) %>%
  summarise(q25 = survey_quantile(api00, 0.25, na.rm = TRUE))
#Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
#need at least two non-NA values to interpolate
apistrat %>% group_by(yr.rnd, stype) %>% tally() %>% filter(n==1)
#one group out of 6 has only a single api00 observation and therefore a quantile can't be interpolated
#Removing that one group means the code can now run as intended
apistrat %>%
  as_survey(strata = stype, weights = pw) %>%
  filter(!(yr.rnd == "Yes" & stype == "H")) %>%
  group_by(yr.rnd, stype, .drop = TRUE) %>%
  summarise(q25 = survey_quantile(api00, 0.25, na.rm = TRUE))
#Get the same error if you do it the 'survey' package way
dstrat <- svydesign(id=~1,strata=~stype,data=apistrat, fpc=~fpc)
svyby(~api99, ~stype+yr.rnd, dstrat, svyquantile, quantiles=0.25)
One work-around is to wrap the call to svyquantile() using tryCatch()
> svyq<-function( ...){tryCatch(svyquantile(...), error=function(e) matrix(NA,1,1))}
> svyby(~api99, ~stype+yr.rnd, dstrat, svyq, quantiles=0.25,keep.var=FALSE,na.rm=TRUE)
stype yr.rnd statistic
E.No E No 560.50
H.No H No 532.75
M.No M No 509.00
E.Yes E Yes 456.00
H.Yes H Yes NA
M.Yes M Yes 436.00
With quantiles and svyby you need to be explicit about whether you want standard errors; the code above doesn't request them (keep.var=FALSE). If you want standard errors, you'd need the error= branch of tryCatch to return an actual svyquantile object with NAs in it.
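For the other half of the question, automatically excluding the offending groups, something along these lines might work (a sketch, not from the thread; the paste() trick is just one convenient way to match group combinations):

library(dplyr)
library(survey)
library(srvyr)
data(api)

# identify the sub-groups with fewer than two observations
small <- apistrat %>% count(yr.rnd, stype) %>% filter(n < 2)

# build the design first, then filter it (srvyr's filter() keeps the design
# information intact), excluding the offending sub-group(s)
apistrat %>%
  as_survey(strata = stype, weights = pw) %>%
  filter(!(paste(yr.rnd, stype) %in% paste(small$yr.rnd, small$stype))) %>%
  group_by(yr.rnd, stype) %>%
  summarise(q25 = survey_quantile(api00, 0.25, na.rm = TRUE))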
I need to run a clustering or similarity analysis on some biological data, and I am looking for an output like the one SIMPROF gives, i.e. a dendrogram or hierarchical cluster.
However, I have 3200 observations/rows per group. SIMPROF (see the example below)
library(clustsig)
usarrests <- USArrests[, c(1, 2, 4)]
rownames(usarrests) <- state.abb
# Run simprof on the data
res <- simprof(data = usarrests,
               method.distance = "braycurtis")
# Graph the result
pl.color <- simprof.plot(res)
seems to expect only one observation per group (US state in this example).
Now, again, my biological data (140k rows total) has about 3200 observations per group.
I am trying to cluster together the groups that have a similar representation in the variables provided.
It is as if, in the example above, AK were represented by more than one observation.
What's my best bet for a function/package/analysis?
Cheers,
Mo
Example from a paper: [figure omitted]
The solution became obvious upon further reflection.
Instead of using all observations (200k) in the long format, I combined longitude and depth of sampling into one variable, used like sampling units along a transect. This left me with 3800 columns of longitude-depth combinations and 61 rows for the taxa, with the value variable being the abundance of each taxon (if you want to cluster sampling units instead, transpose the data frame). This is feasible for hclust or SIMPROF, since the quadratic complexity now applies to only 61 rows (as opposed to the ~200k I tried at the beginning).
Cheers
Here is some code:
library(reshape2)
library(dplyr)

d4 <- d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))
# make one variable of longitude and depth that can be used for all taxa
# measured, like community-ecology sampling units
d4$sampling_units <- paste(d4$LONGITUDE_DEC, d4$BIN_MIDDEPTH_M)
d5 <- d4 %>% select(PREDICTED_GROUP, CONCENTRATION_IND_M3, sampling_units)
d5 <- d5 %>% na.omit()
# dcast the data frame so that you get the taxa as rows, sampling units as
# columns, and concentration/abundance as values
d6 <- dcast(d5, PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7 <- d6 %>% na.omit()
d7$PREDICTED_GROUP <- as.factor(d7$PREDICTED_GROUP)
# give the rownames the taxa names
rownames(d7) <- paste(d7$PREDICTED_GROUP)
# delete the variable that is no longer needed
d7$PREDICTED_GROUP <- NULL

library(vegan)
# calculate the dissimilarity matrix with vegdist so you can use the
# Sorensen/Bray method
distBray <- vegdist(d7, method = "bray")
# calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
clust1
# plot the cluster dendrogram with dendextend
library(dendextend)
library(ggdendro)
library(ggplot2)

dend <- clust1 %>%
  as.dendrogram() %>%
  set("branches_k_color", k = 5) %>%
  set("branches_lwd", 0.5) %>%
  set("clear_leaves") %>%
  set("labels_colors", k = 5) %>%
  set("leaves_cex", 0.5) %>%
  set("labels_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
I frequently want to create summary tables for studies where I compare several variables between two groups, listing values for each variable along with the difference between the groups.
For example, say I want to compare age groups (young and old) and the proportion of males between two groups, A and B. I’d like to end up with a table with a row for each variable (age, proportion of males) and, repeated for each group, columns for the numerator, denominator, and rate, plus the difference between the two rates, its 95% CI, and a p-value from a chi-squared test.
I’m looking for a general approach to this type of table.
Let’s say I have the following table:
library(dplyr)
AgeGroup <- sample(c("Young", "Old"), 10, replace = TRUE)
Gender <- sample(c("Male", "Female"), 10, replace = TRUE)
df <- data.frame(AgeGroup, Gender)
df
I can create a summary table without the comparison easily:
df1 <- df %>%
group_by(AgeGroup) %>%
summarise(num_M = sum(Gender == "Male"),
den_M = n(),
prop_M = num_M/den_M)
df1
But I can’t figure out how to create additional columns comparing the different rows of grouped data. Let’s say I want to run a chi-squared test on the proportion of males in each AgeGroup and add the p-value to the summary table above.
It would look like this (numbers, obviously, are examples), Y = Young, O = Old: [example table omitted]
Any gentle nudges in the right direction would be greatly appreciated.
Thanks!
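As one gentle nudge (a sketch, not a polished solution): the chi-squared test here compares the whole Gender-by-AgeGroup table, so it can be computed once and attached to the grouped summary. Note that with only 10 rows, chisq.test() will warn that the approximation may be inaccurate.

library(dplyr)

# chi-squared test of Gender vs AgeGroup, computed once on the full table
p_gender <- chisq.test(table(df$AgeGroup, df$Gender))$p.value

df1 <- df %>%
  group_by(AgeGroup) %>%
  summarise(num_M = sum(Gender == "Male"),
            den_M = n(),
            prop_M = num_M / den_M) %>%
  mutate(p_chisq = p_gender)  # the same test result, repeated on each row
df1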
I like the finalfit package for summary tables. If you need to add custom summary functions, it might not be flexible enough, but its default stats cover everything you've asked for in your example, e.g. numbers in each group, proportions, and a chi-squared test. If you have continuous variables it will calculate means and SDs in each group.
library(finalfit)
finalfit::summary_factorlist(
df,
dependent = "Gender",
explanatory = "AgeGroup",
total_col = TRUE,
p = TRUE
)
Output:
label levels Female Male Total p
1 AgeGroup Old 0 (0.0) 6 (100.0) 6 0.197
2 Young 1 (25.0) 3 (75.0) 4
Update: Solved!
I'm currently trying to create a regression model for football that predicts a team's total points based on their pass yards and rush yards. I was able to get all the way to figuring out the regression equation, but from here I do not know how to "plug in" the formula.
The data table is essentially all 32 NFL teams listed in rows with their offensive stats listed in columns.
Code:
# 1. Import
Offense <- read.csv(file.choose(), header = TRUE)
# 2. View
show(Offense)
# 3. Attach so headers can be referenced directly
attach(Offense)
# 4. Create the regression model
mod1 <- lm(Total.Points ~ Pass.Yds + Rush.Yds)
summary(mod1)
# Formula obtained from summary: -255.60178 + 0.10565*(Pass) + 0.12154*(Rush)
# Plug in the regression equation
predict(mod1)
Output: https://imgur.com/a/AbTNF
I see that at the end it applied the regression equation to all 32 rows, but how do I
1. get it to display as a ranked list, and
2. get it to display the team name alongside the projected score (so I don't have to wonder which team "1" or "2" refers to)?
Since I have the equation, could I also just write a loop that runs the equation for every row of data I have and prints the results?
I'm a beginner so much appreciated!
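As for the loop idea: predict() already applies the equation to every row for you, but purely as a learning exercise, here is a sketch of the explicit loop (coefficients hard-coded from the summary above; it assumes the columns are named Team, Pass.Yds, and Rush.Yds):

# manually apply the regression equation to each row and print the result
for (i in seq_len(nrow(Offense))) {
  pts <- -255.60178 + 0.10565 * Offense$Pass.Yds[i] + 0.12154 * Offense$Rush.Yds[i]
  cat(as.character(Offense$Team[i]), round(pts, 1), "\n")
}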
Update: Came up with this
#### Part 2. Interpretation
# 1. Examine the quality of the model
summary(mod1)
cor(Pass.Yds, Rush.Yds)
# 2. Formula obtained from summary: -255.60178 + 0.10565*(Pass) + 0.12154*(Rush)
# 3. Predicted points (descending order)
proj <- sort(predict(mod1), decreasing = TRUE)
proj
# 4. Corresponding team names (descending)
name <- Team[order(predict(mod1), decreasing = TRUE)]
name
# Combine into a data frame
Projections <- data.frame(name, proj)
Projections
While bbrot provided a much simpler version:
Assuming that Teams is the vector of team names, something like cbind(Teams[order(predict(mod1), decreasing = TRUE)], sort(predict(mod1), decreasing = TRUE)) should do...
Edit: Your Teams vector seems to be a factor. In this case, the following commands are going to work:
# returns a character matrix
cbind(as.character(Teams)[order(predict(mod1), decreasing = TRUE)],
sort(predict(mod1), decreasing = TRUE))
# returns a data frame
data.frame(Teams = Teams[order(predict(mod1), decreasing = TRUE)],
Points = sort(predict(mod1), decreasing = TRUE))
How can I detect the outliers in a data set (all continuous columns) based on a categorical variable and replace them with NA? I want to use the Tukey technique, but applied within each level of a categorical variable. For example, replace the outliers of mtcars[, -c(8, 9)] with NA based on each level of mtcars$am.
Or: how can I modify this code to work for all variables within each level of am?
lapply(mtcars, function(x) sort(boxplot.stats(x)$out))
EDIT: outliers are now defined as values beyond 1.5*IQR from the quartiles, as specified in the comment.
This replaces the outliers in the qsec column, per group in the am column, with NAs. It does so by first constructing a data frame called limits, which contains the lower and upper bounds per am group. That data frame is then joined to the original data frame, and outliers are filtered out.
library(dplyr)

# Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for qsec within each am group
limits <- mtcars %>%
  group_by(am) %>%
  summarise(lower = quantile(qsec, 0.25) - 1.5 * IQR(qsec),
            upper = quantile(qsec, 0.75) + 1.5 * IQR(qsec))

# join the fences on, replace outliers with NA, then drop the helper columns
df <- mtcars %>% left_join(limits, by = "am")
df$qsec <- ifelse(df$qsec < df$lower | df$qsec > df$upper, NA, df$qsec)
df <- df %>% select(-upper, -lower)
The 1.5 multiplier can be adjusted to control how extreme a value must be before it counts as an outlier.
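To extend this to all continuous columns at once (the mtcars[, -c(8, 9)] case from the question), a sketch using across() from dplyr >= 1.0; replace_outliers is a hypothetical helper, vs is excluded by name because it is binary, and am is excluded automatically as the grouping variable:

library(dplyr)

# replace values outside the Tukey fences of x with NA
replace_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  ifelse(x < q[1] - fence | x > q[2] + fence, NA, x)
}

df_all <- mtcars %>%
  group_by(am) %>%
  mutate(across(-vs, replace_outliers)) %>%
  ungroup()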