Below is a part of my data (pairht_protein)
I am trying to run a t-test on all the variables (columns) between two groups, which are:
Resistant_group <- c(PAIR-01, PAIR-12, PAIR-09)
Sensitive_group <- c(PAIR-07, PAIR-02, PAIR-05)
Before writing a function, I picked one of the variables and tried:
t.test(m_pHSL660 ~ Subject, data = subset(pairht_protein, Subject %in% c("Resistant_group", "Sensitive_group")))
But it gave me an error: 'grouping factor must have exactly 2 levels'
Is there a way to run a t-test between these groups, and possibly turn it into a function?
First, you must correct how you define the groups: the subject IDs have to be quoted strings (unquoted, PAIR-01 is parsed as the subtraction PAIR - 01):
Resistant_group <- c('PAIR-01', 'PAIR-12', 'PAIR-09')
Sensitive_group <- c('PAIR-07','PAIR-02','PAIR-05')
Then, using the dplyr package, create another factor variable with only two levels:
library(dplyr)
# assuming pairht_protein is your dataset name
pairht_protein <- pairht_protein %>%
  mutate(sub = case_when(Subject %in% Resistant_group ~ 1,
                         Subject %in% Sensitive_group ~ 2),
         sub = as.factor(sub))
Because this new variable is NA for subjects outside your two groups, and the formula interface of t.test drops rows with missing values by default, you don't need to subset:
t.test(m_pHSL660 ~ sub, data = pairht_protein)
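The question also asked for a function that covers every column. The following is only a sketch under stated assumptions: `toy` is made-up data standing in for pairht_protein, `m_other` is an invented second measurement column, and `run_t_tests` is a hypothetical helper name. It runs a Welch t-test per numeric column and collects the p-values.

```r
# Group definitions from the question (IDs as quoted strings)
Resistant_group <- c("PAIR-01", "PAIR-12", "PAIR-09")
Sensitive_group <- c("PAIR-07", "PAIR-02", "PAIR-05")

# Made-up data standing in for pairht_protein (values are random, not real)
set.seed(1)
toy <- data.frame(
  Subject   = c(Resistant_group, Sensitive_group, "PAIR-03"),
  m_pHSL660 = rnorm(7),
  m_other   = rnorm(7),   # hypothetical second variable
  stringsAsFactors = FALSE
)

# Hypothetical helper: one t-test per numeric column, returning p-values
run_t_tests <- function(data, id_col, group_a, group_b) {
  ids <- data[[id_col]]
  num_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  vapply(num_cols, function(v) {
    t.test(data[[v]][ids %in% group_a],
           data[[v]][ids %in% group_b])$p.value
  }, numeric(1))
}

p_values <- run_t_tests(toy, "Subject", Resistant_group, Sensitive_group)
p_values
```

Subjects that belong to neither group (here "PAIR-03") are simply never selected, so no subsetting step is needed. With many columns, a multiple-testing correction such as p.adjust(p_values) is worth considering.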
I have two datasets that are the same except for one variable. For example, as demonstrated below, I have two datasets called boys_miss and boys_miss2. boys_miss2 has an extra binary variable (called type) that boys_miss doesn't. I would like to determine the type variable in boys_miss using the variables observed in both datasets. I am not sure what the best way to do this would be. Any solutions or suggestions would be greatly appreciated.
# load relevant packages using the pacman package
pacman::p_load(
  mice) # for the boys dataset
# set seed
set.seed(2347723)
# generate a small sample of the boys dataset
# (note: sample() on a data frame shuffles its columns, not its rows)
boys_miss <- sample(head(boys, 100))
# create another dataset that has our variable of interest
boys_miss2 <- boys_miss[sample(1:nrow(boys_miss)), ]
# create the variable of interest
boys_miss2$type <- as.factor(sample(c("runner", "swimmer"),
                                    size = nrow(boys_miss2),
                                    replace = TRUE,
                                    prob = c(.76, .24)))
# Goal: replicate the type variable in `boys_miss` by matching on the values
# shared by `boys_miss` and `boys_miss2`
Try match on strings pasted together from the shared columns of both datasets, then use the resulting index to return the corresponding 'type' values:
boys_miss$type <- boys_miss2$type[match(do.call(paste, boys_miss),
do.call(paste, boys_miss2[-ncol(boys_miss2)]))]
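To see what the match/paste idiom does, here is a minimal sketch on made-up data (the data frames `a` and `b` and their columns are invented for illustration). It uses an explicit separator when pasting, which is an addition to the answer above, to reduce the chance that two different rows collapse into the same key.

```r
# Two toy data frames with the same rows in a different order
a <- data.frame(id = c(1, 2, 3), x = c("p", "q", "r"),
                stringsAsFactors = FALSE)
b <- a[c(3, 1, 2), ]
b$type <- c("swimmer", "runner", "runner")

# Build one key string per row from the shared columns
key_a <- do.call(paste, c(a, sep = "\r"))
key_b <- do.call(paste, c(b[c("id", "x")], sep = "\r"))

# match() gives, for each row of a, the position of the same row in b
a$type <- b$type[match(key_a, key_b)]
a
```

Rows of `a` with no counterpart in `b` would get NA, because match() returns NA for keys it cannot find.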
Good morning everyone.
I have a problem that I have not been able to solve for quite some time (please see the screenshot of my data set): https://i.stack.imgur.com/g2eTM.jpg
I have a column (status) containing two values, 1 and 2. These are dummies representing two categories (or statuses) of dependent variable (say Pp and Pt) that I need for a regression. Their actual values are in the last column, Pp.Pt (Pp.Pt is just a name, nothing more).
I need to run two separate regressions, one for Pp and one for Pt, each using the matching values in the Pp.Pt column (every value in that column has either status 1 or status 2). My question is: how do I separate or group the rows into these two categories (1 = Pp, 2 = Pt) so that I can clearly identify them?
Thank you very much for your kind help.
Best
Ludovic
Split-Apply-Combine method :
# Using the mtcars dataset as an example:
df <- mtcars
# Allocate some memory for a list storing the split data.frame:
# df_list => empty list with the number of elements of the unique
# values of the cyl vector
df_list <- vector("list", length(unique(df$cyl)))
# Split the data.frame by the cyl vector:
df_list <- split(df, df$cyl)
# Apply the regression model, return the summary data:
lapply(df_list, function(x) {
  summary(lm(mpg ~ hp, data = x))
})
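Applied to the layout described in the question, the same split-apply idea might look like the sketch below. The data are made up, and the predictor `x` is a hypothetical stand-in because the real regressors are only visible in the screenshot; only `status` and `Pp.Pt` come from the question.

```r
# Made-up data: status (1 or 2), a hypothetical predictor x, and Pp.Pt
set.seed(42)
toy <- data.frame(status = rep(c(1, 2), each = 10), x = rnorm(20))
toy$Pp.Pt <- 2 * toy$x + rnorm(20)

# Label the two statuses so the groups are easy to identify
toy$group <- factor(toy$status, levels = c(1, 2), labels = c("Pp", "Pt"))

# One regression per group
fits <- lapply(split(toy, toy$group), function(d) lm(Pp.Pt ~ x, data = d))

# One row of coefficients per group
coef_table <- t(sapply(fits, coef))
coef_table
```

Each element of `fits` is an ordinary lm object, so summary(fits$Pp) or summary(fits$Pt) gives the full output for one category.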
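This approach may also help. It adds a 1/0 indicator for whether each value falls below the median of a column (yourdata, columntosplit and classofyourcolumn are placeholders):
library(dplyr)
yourdata %>%
  mutate(classofyourcolumn = ifelse(columntosplit < quantile(columntosplit, 0.5), 1, 0))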
I have a data set of plant demographics from 5 years across 10 sites with a total of 37 transects within the sites. Below is a link to a GoogleDoc with some of the data:
https://docs.google.com/spreadsheets/d/1VT-dDrTwG8wHBNx7eW4BtXH5wqesnIDwKTdK61xsD0U/edit?usp=sharing
In total, I have 101 unique combinations.
I need to subset each unique set of data, so that I can run each through some code. This code will give me one column of output that I need to add back to the original data frame so that I can run LMs on the entire data set. I had hoped to write a for-loop where I could subset each unique combination, run the code on each, and then append the output for each model back onto the original dataset. My attempts at writing a subset loop have all failed to produce even a simple output.
I created a column, "SiteTY", with unique Site, Transect, Year combinations. So "PWR 832015" is site PWR Transect 83 Year 2015. I tried to use that to loop through and fill an empty matrix, as proof of concept.
transect=unique(dat$SiteTY)
ntrans=length(transect)
tmpout=matrix(NA, nrow=ntrans, ncol=2)
for (i in 1:ntrans) {
  df=subset(dat, SiteTY==i)
  tmpout[i,]=(unique(df$SiteTY))
}
When I do this, I notice that df has no observations. If I replace "i" with a known value (like PWR 832015) and run each line of the for-loop individually, it populates correctly. If I use is.factor() for i or PWR 832015, both return FALSE.
This particular code also gives me the error:
Error in `[<-`(`*tmp*`, , i, value = mean(df$Year)) : subscript out of bounds
I can only assume this happens because the data frame is empty.
I've read enough SO posts to know that for-loops are tricky, but I've tried more iterations than I can remember to try to make this work in the last 3 years to no avail.
Any tips on loops or ways to avoid them while getting the output I need would be appreciated.
Per your needs ("I need to subset each unique set of data", run a function, take the output and calculate a new value), consider two routes:
Using ave if your function expects and returns a single numeric column.
Using by if your function expects a data frame and returns anything.
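As an aside on the loop in the question: it returns empty subsets because `SiteTY == i` compares the labels with the loop counter (1, 2, 3, ...) rather than with the i-th label, `transect[i]`. A minimal sketch of a corrected loop, using made-up data and a hypothetical `n_rows` output column:

```r
# Made-up data with a combined Site/Transect/Year label
dat <- data.frame(
  SiteTY = c("PWR 832015", "PWR 832015", "PWR 842015", "ABC 12016"),
  Year   = c(2015, 2015, 2015, 2016),
  stringsAsFactors = FALSE
)

transect <- unique(dat$SiteTY)
ntrans   <- length(transect)

# Pre-allocate one output row per unique label
tmpout <- data.frame(SiteTY = character(ntrans),
                     n_rows = integer(ntrans),
                     stringsAsFactors = FALSE)

for (i in seq_len(ntrans)) {
  # Compare against the i-th label, not the counter itself
  df <- subset(dat, SiteTY == transect[i])
  tmpout$SiteTY[i] <- transect[i]
  tmpout$n_rows[i] <- nrow(df)
}
tmpout
```

The two routes described next avoid the explicit loop altogether.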
ave
Returns a grouped inline aggregate column with repeated value for every member of group. Below, with is used as context manager to avoid repeated dat$ references.
# BY SITE GROUPING
dat$New_Column <- with(dat, ave(Numeric_Column, Site, FUN=myfunction))
# BY SITE AND TRANSECT GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, FUN=myfunction))
# BY SITE AND TRANSECT AND YEAR GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, Year, FUN=myfunction))
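For a concrete picture of what ave returns, here is a small sketch on made-up data. The `Height` column and the `center` function are invented for illustration and stand in for Numeric_Column and myfunction.

```r
# Made-up data: two sites, transects within sites, one numeric column
dat <- data.frame(
  Site     = c("A", "A", "A", "B", "B", "B"),
  Transect = c(1, 1, 2, 1, 1, 1),
  Height   = c(10, 12, 20, 5, 7, 9)
)

# Hypothetical per-group function: subtract the group mean
center <- function(x) x - mean(x)

# The result has one value per row, aligned with the original data frame
dat$Height_centered <- with(dat, ave(Height, Site, Transect, FUN = center))
dat
```

Because the output is aligned row by row with `dat`, it can be assigned straight back as a new column without any merge step.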
by
Returns a named list of objects, or whatever your function returns, for each possible grouping. With more than one grouping factor, tryCatch is used because some of the possible combinations may yield an empty data frame, for which myfunction could raise an error.
# BY SITE GROUPING
obj_list <- by(dat, dat$Site, function(sub) {
  myfunction(sub) # RUN ANY OPERATION ON sub DATA FRAME
})
# BY SITE AND TRANSECT GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
  tryCatch(myfunction(sub),
           error = function(e) NULL)
})
# BY SITE AND TRANSECT AND YEAR GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect", "Year")], function(sub) {
  tryCatch(myfunction(sub),
           error = function(e) NULL)
})
# FILTERS OUT ALL NULLs (I.E., NO LENGTH)
obj_list <- Filter(length, obj_list)
# BUILDS SINGLE OUTPUT IF MATRIX OR DATA FRAME
final_obj <- do.call(rbind, obj_list)
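The by route can be tried end to end on made-up data. In this sketch `summarise_group` is a hypothetical stand-in for myfunction that returns a one-row data frame per group, and `Height` is an invented column.

```r
# Made-up data: the Site B / Transect 2 combination does not occur
dat <- data.frame(
  Site     = c("A", "A", "A", "B", "B", "B"),
  Transect = c(1, 1, 2, 1, 1, 1),
  Height   = c(10, 12, 20, 5, 7, 9),
  stringsAsFactors = FALSE
)

# Hypothetical per-group function returning a one-row data frame
summarise_group <- function(sub) {
  data.frame(Site = sub$Site[1],
             Transect = sub$Transect[1],
             n = nrow(sub),
             mean_height = mean(sub$Height))
}

obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
  tryCatch(summarise_group(sub), error = function(e) NULL)
})

# Drop the NULL entries left by combinations with no data, then stack
obj_list  <- Filter(length, obj_list)
final_obj <- do.call(rbind, obj_list)
final_obj
```

`final_obj` can then be joined back onto the original data with merge(dat, final_obj, by = c("Site", "Transect")).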
Here's another approach using the dplyr library, in which I'm creating a data.frame of summary statistics for each group and then just joining it back on:
library(dplyr)
# Group by species (site, transect, etc) and summarise
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            mean.Sepal.Width = mean(Sepal.Width))
# A data.frame with one row per species, one column per statistic
species_summary
# Join the summary stats back onto the original data
iris_plus <- iris %>% left_join(species_summary, by = "Species")
head(iris_plus)
I have a data frame df containing 3 numerical variables, 1 outcome and 1 categorical variable.
I need to filter df by each level of category (A or B) and then pass the subsets to a function such as binnedplot to check for interaction between the categorical and numerical variables.
sample df:
library(dplyr)  # for %>% and filter()
library(arm)    # for binnedplot()
set.seed(10)
df <- data.frame(num1 = sample(100, 60),
                 num2 = sample(100, 60),
                 num3 = sample(100, 60),
                 category = as.factor(rep(c("A", "B"), 30)),
                 outcome = sample(c(0, 1), 60, replace = TRUE))
df1 <- df %>% filter(category == "A")
df2 <- df %>% filter(category == "B")
binnedplot(df1$num1, df1$outcome)
binnedplot(df2$num1, df2$outcome)
binnedplot(df1$num2, df1$outcome)
binnedplot(df2$num2, df2$outcome)
binnedplot(df1$num3, df1$outcome)
binnedplot(df2$num3, df2$outcome)
Update:
split.dfs<-split(df, df$category)
par(mar=c(1,1,1,1))
par(mfcol=c(2,1))
lapply(split.dfs, function(x) lapply(df[1:3], function(x) binnedplot(x, df$outcome, main=df$category)))
Initially I wondered how to do this with a function in a more scalable way, so that I can handle more numerical and categorical columns without too much repetition.
Now, with the updated code (which still has a bug), my main issue is how to label the three 2x1 panels with the correct category header and how to label the x axis with num1/num2/num3 for clarity.
You can use a combination of by and lapply:
library(arm)
by(df, df$category,
   function(x) lapply(subset(x, select = -c(category, outcome)),
                      binnedplot, x$outcome))
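For the labelling part of the question, one option is to loop over column names and category levels explicitly, so both are available as strings for `xlab` and `main`. This is only a sketch: `plot_by_group` and its `plot_fun` argument are hypothetical names, and the default is base `plot` so the code runs without arm installed.

```r
# Sample data as in the question
set.seed(10)
df <- data.frame(num1 = sample(100, 60),
                 num2 = sample(100, 60),
                 num3 = sample(100, 60),
                 category = factor(rep(c("A", "B"), 30)),
                 outcome = sample(c(0, 1), 60, replace = TRUE))

# Hypothetical helper: one plot per (numeric column, category level) pair
plot_by_group <- function(data, group_col, outcome_col, plot_fun = plot) {
  num_cols <- setdiff(names(data), c(group_col, outcome_col))
  titles <- character(0)
  for (v in num_cols) {
    for (g in levels(data[[group_col]])) {
      sub <- data[data[[group_col]] == g, ]
      # The column name labels the x axis, the level labels the panel
      plot_fun(sub[[v]], sub[[outcome_col]],
               xlab = v, main = paste(group_col, g))
      titles <- c(titles, paste(v, g))
    }
  }
  invisible(titles)
}
```

A call such as par(mfcol = c(2, 1)); plot_by_group(df, "category", "outcome", plot_fun = arm::binnedplot) should then give one 2x1 panel per numeric column, since binnedplot accepts xlab and main.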
I have a data frame with 251 observations and 45 variables. There are 6 observations in the middle of the data frame that I'd like to exclude from my analyses. All 6 belong to one level of a factor. It is easy to generate a new data frame that, when printed, appears to exclude the 6 observations. When I use the new data frame to plot variables by the factor in question, however, the supposedly excluded level is still included in the plot (sans observations). Using str() confirms that the level is still present in some form. Also, the index for the new data frame skips 6 values where the observations formerly resided.
How can I create a new data frame that excludes the 6 observations and does not continue to recognize the excluded factor level when plotting? Can the new data frame be made to "re-index", so that the new index does not skip values formerly assigned to the excluded factor level?
I've provided an example with made up data:
# ---------------------------------------------
# data
char <- c( rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3) )
a <- 1:15 / pi
b <- seq(1, 8, .5)
d <- rep(c(3, 8, 5), 5)
dat <- data.frame(char, a, b, d)
dat
# two ways to remove rows that contain a string
datNew1 <- dat[-which(dat$char == "nam"), ]
datNew1
datNew2 <- dat[grep("nam", dat[ ,"char"], invert=TRUE), ]
datNew2
# plots still contain the factor level that was excluded
boxplot(datNew1$a ~ datNew1$char)
boxplot(datNew2$a ~ datNew2$char)
# str confirms that it's still there
str(datNew1)
str(datNew2)
# ---------------------------------------------
You can use the drop.levels() function from the gdata package to reduce the factor levels down to the actually used ones -- apply it on your column after you created the new data.frame.
Also try a search for r and drop.levels here (but you need to make the search term [r] drop.levels which I can't here as it interferes with the formatting logic).
Starting with R version 2.12.0, there is a function droplevels, which can be applied either to factor columns or to the entire dataframe. When applied to the dataframe, it will remove zero-count levels from all factor columns. So your example will become simply:
# two ways to remove rows that contain a string
datNew1 <- droplevels( dat[-which(dat$char == "nam"), ] )
datNew2 <- droplevels( dat[grep("nam", dat[ ,"char"], invert=TRUE), ] )
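The question also asked about re-indexing. A minimal sketch, assuming a cut-down version of the example data with `char` built explicitly as a factor: after dropping the unused level, setting the row names to NULL resets them to 1, 2, 3, and so on.

```r
# Cut-down version of the example data, with char as an explicit factor
char <- factor(c(rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3)))
dat  <- data.frame(char, a = 1:15 / pi)

# Remove the rows, then drop the now-unused factor level
datNew <- droplevels(dat[dat$char != "nam", ])

# Reset the row names so the index no longer skips values
rownames(datNew) <- NULL

levels(datNew$char)
rownames(datNew)
```

With the level dropped, boxplot(a ~ char, data = datNew) no longer shows an empty slot for "nam".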
I have pasted something from my code. I have an enclosure experiment in a lake, with measurements from the enclosures and from the lake, but mostly I don't want to deal with the lake.
My variable is called "t.level" and its levels were control, low, medium, high and lake.
This code makes it possible to use nolk$ or data=nolk to get the data without "lake":
nolk <- subset(mylakedata, t.level == "control" |
                           t.level == "low" |
                           t.level == "medium" |
                           t.level == "high")
# drop unused levels from every factor column
nolk[] <- lapply(nolk, function(col) if (is.factor(col)) col[drop = TRUE] else col)