Identify INDICES value that 'breaks' by() [or, equivalently, tapply()] - r

When using by(), a data subset (as determined by the INDICES argument) sometimes 'breaks' by() (technically it breaks FUN, which in turn breaks by()).
Is there a way to identify the 'bad' value of the list passed to INDICES without writing an explicit loop over the list?

Can you make a reproducible example?
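You can try something like this:
customfun <- function(x) try(somefun(x), silent = TRUE)
# simplify = FALSE keeps each result (and its "try-error" class) as a separate list element
byresult <- by(mydata, mydata$group, FUN = customfun, simplify = FALSE)
# names of the groups for which somefun() raised an error
whichbad <- names(byresult)[sapply(byresult, inherits, "try-error")]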
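Since no reproducible example was given, here is a minimal sketch of the try() approach on made-up data. The data frame, the column names, and the body of somefun are all assumptions for illustration; only the pattern (wrap FUN in try(), then test each result with inherits()) comes from the answer.

```r
# Hypothetical data: group "b" contains a value that makes somefun() fail
mydata <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                     value = c(1, 2, 3, 4, -5, 6, 7, 8, 9))

# Hypothetical function that errors on negative input
somefun <- function(d) {
  if (any(d$value < 0)) stop("negative value in group")
  mean(d$value)
}

# Wrap the function so an error becomes a "try-error" object instead of aborting by()
customfun <- function(d) try(somefun(d), silent = TRUE)

# simplify = FALSE keeps the per-group results as list elements,
# so the "try-error" class is not lost when all results have length one
byresult <- by(mydata, mydata$group, FUN = customfun, simplify = FALSE)

# Flag the groups whose result is an error, then look up their names
is_bad <- vapply(byresult, inherits, logical(1), what = "try-error")
whichbad <- names(byresult)[is_bad]
whichbad
```

The error message for a failing group is still available afterwards, for example via as.character(byresult[[whichbad[1]]]).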

Related

T test in r: How to change the x and y arguments of a t-test using a lapply function

I have this dataframe:
I'm performing a t-test using this lapply approach:
columns = colnames(my_data)[-1]
my_t_test <- lapply(my_data[columns], function(x) t.test(x ~ my_data$Treatment, alternative='less'))
But it seems that t.test takes x = my_data$Control and y = my_data$Stress, which makes the results nonsensical. Since my alternative hypothesis is that the mean is lower in the Stress group, I want the x argument to be my_data$Stress. One option is to switch to alternative='greater', but that is an awful approach; another is to change the order of the groups in my data frame, but I'm looking for a programmatic solution.
Any suggestions?
We can use the formula method after subsetting the columns of 'my_data' by looping over the vector of column names:
lapply(columns, function(nm) t.test(. ~ Treatment, my_data[c('Treatment', nm)]))
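A possible way to control which group acts as x, sketched under assumptions: with the formula interface, t.test compares the groups in the order of the factor levels, so putting "Stress" first makes it the first group. The data frame below is made up (the original one was not shown), and the column names height and weight are hypothetical.

```r
# Hypothetical data: the original data frame was not shown in the question
set.seed(1)
my_data <- data.frame(
  Treatment = rep(c("Control", "Stress"), each = 10),
  height = c(rnorm(10, mean = 10), rnorm(10, mean = 8)),
  weight = c(rnorm(10, mean = 5), rnorm(10, mean = 4))
)

# Put "Stress" first so it plays the role of x in the two-sample test
my_data$Treatment <- factor(my_data$Treatment, levels = c("Stress", "Control"))

columns <- colnames(my_data)[-1]
my_t_test <- lapply(my_data[columns], function(x)
  t.test(x ~ my_data$Treatment, alternative = "less"))

# Group means are reported in factor-level order: Stress first, then Control
names(my_t_test$height$estimate)
```

With this level order, alternative = "less" tests whether the Stress mean is lower than the Control mean, which should match calling t.test(stress_values, control_values, alternative = "less") directly.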

Using apply() function to iterate over different data types doesn't work

I want to write a function that dynamically uses different correlation methods depending on the scale of measurement of each feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e. column), check its scale of measurement (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately, my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is false for every column. I don't know why this happens. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = TRUE, prob = c(0.7, 0.3))
bar <- sample(c(1, 2, 3, 4, 5), 200, replace = TRUE, prob = c(0.5, 0.05, 0.1, 0.1, 0.25))
y <- sample(c(1, 2, 3, 4, 5), 200, replace = TRUE, prob = c(0.25, 0.1, 0.1, 0.05, 0.5))
data <- data.frame(foo, bar, y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x, y) {
  # print out the structure of every column (str() prints by itself)
  str(x)
  # if the feature is numeric and has more than two outcomes, use corr.test
  if (is.numeric(x) && length(unique(x)) > 2) {
    result <- corr.test(x, y)[['r']]
  } else {
    result <- "else"
  }
  result
}
result <- apply(features, 2, dyn.corr, y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work (I could not test it, because I don't know which package provides the corr.test function):
result <- sapply(features, dyn.corr, y)
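The coercion described above can be seen directly on a small example. This is only an illustrative sketch; the two-column data frame is made up, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Small mixed-type data frame (made up for illustration)
features <- data.frame(foo = c("x", "y", "x"),
                       bar = c(1, 2, 3),
                       stringsAsFactors = FALSE)

# apply() first converts the data frame to a matrix, so every column becomes character
via_apply <- apply(features, 2, class)

# sapply() iterates over the columns as they are, so the types survive
via_sapply <- sapply(features, class)

via_apply
via_sapply
```

Here via_apply reports "character" for both columns, while via_sapply reports "character" for foo and "numeric" for bar.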

Using a custom function (ifelse) with dcast

I'm interested in reshaping a dataframe, but instead of using standard dcast functions like the mean, I'd like to use a custom function. Specifically, I'm interested in using an ifelse statement to assign binary values.
Here's a reproducible example:
# dataframe that includes extraneous information
df <- data.frame(sale_id=c(1,1,1,2,2,2,3,3,4,5),project_id=c(501,502,503,501,502,503,501,502,504,505),
sale_year=c(1990,1991,1993,1990,1992,1990,1991,1993,1990,1992),
var1=c(5,4,3,6,5,4,4,7,2,9),var2=c(7,3,4,8,5,8,2,3,5,7))
# list of the variables I actually need (I don't need 'sale_year')
varlist <- c("var1","var2")
# selecting out id variables and variables I'm interested in manipulating
dfvars <- df[,c("sale_id","project_id",varlist)]
# melt dataframe
library(reshape2)
mdata <- melt(dfvars, id=c('sale_id','project_id'))
# create custom ifelse function, assign '1' if mean is above a critical value, and '0' if not
funx <- function(u){ifelse(mean(u)>5,1,0)}
# cast data using this function
cdata <- dcast(mdata, sale_id~variable, funx)
It works if I just use a standard function, like mean (ex):
cdata <- dcast(mdata, sale_id~variable, mean)
But with my ifelse() function, I get an error about data types (logical vs. double), which doesn't make sense to me, since the result of "mean(u) > 5" should be returning a logical result (TRUE or FALSE), to then be used by the ifelse() part.
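I believe this has to do with the details of type coercion. The return value of your custom function is treated as double for some sets of observations, but as logical for others. One way this can happen: for an empty input, mean(u) is NaN, so mean(u) > 5 is NA, and ifelse(NA, 1, 0) returns a logical NA rather than a double. The code works when you make the return type explicit.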
Example:
# Works
funx1 <- function(u){ifelse(mean(u)>5,TRUE,FALSE)}
funx2 <- function(u){as.logical(ifelse(mean(u)>5,1,0))}
funx3 <- function(u){as.numeric(ifelse(mean(u)>5,1,0))}
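The type mismatch can be reproduced in base R without reshape2. This is a sketch under one assumption, namely that dcast evaluates the aggregation function on an empty vector at some point (for example to determine a fill value), which is what the logical vs. double error suggests.

```r
# The original function from the question
funx <- function(u) ifelse(mean(u) > 5, 1, 0)

# Non-empty input: the test is TRUE or FALSE, so the result is a double (1 or 0)
typeof(funx(c(6, 7)))

# Empty input: mean(numeric(0)) is NaN, NaN > 5 is NA,
# and ifelse(NA, 1, 0) returns NA of type logical
typeof(funx(numeric(0)))

# Forcing the return type makes both cases consistent
funx3 <- function(u) as.numeric(ifelse(mean(u) > 5, 1, 0))
typeof(funx3(c(6, 7)))
typeof(funx3(numeric(0)))
```

Because the input is always reduced to a single mean, as.numeric(mean(u) > 5) would be an equivalent and slightly shorter way to write funx3.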

Looping a rep() function in r

df is a frequency table, where the values in a were reported as many times as recorded in column x,y,z. I'm trying to convert the frequency table to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call for each column, you can use an apply-family function. Note that you cannot store the result directly in a data.frame because the objects have different lengths, but you can store it in a list and access the elements much as you would in a data.frame. Something like this works:
df2 <- sapply(df[, 2:4], function(x) rep(df[, 1], x))
This sapply call says: for each column x in df[, 2:4] (that is, df[, 2], df[, 3], or df[, 4]), evaluate rep(df[, 1], x). Because the results differ in length, sapply returns a named list here.
The code below just checks that sapply gives the same result as your original approach.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
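As a variation on the same idea, here is a sketch that uses lapply (which always returns a list, whatever the lengths) and pads the shorter vectors with NA via `length<-`. The names expanded, n_max and padded are made up for illustration.

```r
# Frequency table from the question
df <- data.frame(a = 1:10, x = 6:15, y = 11:20, z = 16:25)

# One expanded vector per count column; lapply always returns a list
expanded <- lapply(df[-1], function(counts) rep(df$a, times = counts))
lengths(expanded)

# Pad every vector with NA up to the longest length, then combine into a data frame
n_max <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, `length<-`, n_max))
dim(padded)
```

The padding NAs are placeholders rather than data, so functions applied to padded usually need na.rm = TRUE or an explicit filter.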

r for loop cor.test function

I have a general question about loops in R.
Could someone explain the error in this?
for (i in seq(length=2:ncol(df))) { z <- cor.test(df$SEASON, df[,i], method="spearman");z}
I would simply like to apply cor.test(x, y) between the column called SEASON and every other column of my data frame "df".
I also want R to print the result "z" after each calculation.
First, you don't really need a for() loop. You can use the apply() function to get the correlation of SEASON with all of the other columns of the data frame df.
# some fake data
n <- 20
df <- data.frame(SEASON=runif(n), A=runif(n), B=runif(n), C=runif(n))
# print the correlation
apply(df[, -1], 2, cor.test, df$SEASON, method="spearman")
Second, you are not using the seq() function properly. The length.out argument of seq() is the "desired length of the sequence". You keep supplying the length.out argument with a vector, instead of the scalar (a vector of length one) it is expecting. That is why you get a warning message when you submit something like, seq(length.out=2:ncol(df)). The function just uses the first element, so the result is the same (without a warning message) as for seq(length.out=2). If you wanted to use seq() to give you the desired result, you would use seq(from=2, to=ncol(df)). This is fine, but I think it simpler and cleaner to simply use 2:ncol(df) as previous posters suggest.
If you really want to use a for loop, this should do the trick. Note that results are not auto-printed inside a for loop, so print() is needed:
for(i in 2:ncol(df)) print(cor.test(df$SEASON, df[, i], method="spearman"))
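If the goal is to keep the results rather than only print them, one option is to store the tests in a list and extract the pieces of interest. This is a sketch on the same kind of fake data as above; the names tests, rho, p and summary_table are made up.

```r
# Fake data, as in the answer above (seed added for reproducibility)
set.seed(1)
n <- 20
df <- data.frame(SEASON = runif(n), A = runif(n), B = runif(n), C = runif(n))

# One cor.test result per column other than SEASON, kept in a named list
tests <- lapply(df[-1], function(col) cor.test(df$SEASON, col, method = "spearman"))

# Extract the Spearman coefficient and the p-value from each test
rho <- sapply(tests, function(h) unname(h$estimate))
p <- sapply(tests, function(h) h$p.value)

summary_table <- data.frame(rho = rho, p.value = p)
summary_table
```

Each full test is still available by name, for example tests$A.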
