I clearly still don't understand plyr syntax, as illustrated below. Can someone help me see what I'm missing?
The following code works fine, as expected:
# make a data frame to use dlply on
f <- as.factor(c(rep("a", 3), rep("b", 3)))
y <- rnorm(6)
df <- data.frame(f=f, y=y)
# split the data frame by the factor and perform t-tests
l <- dlply(df, .(f), function(d) t.test(y, mu=0))
However, the following causes an error
l_bad <- dlply(df, .(f), t.test, .mu=0)
Error in if (stderr < 10 * .Machine$double.eps * abs(mx)) stop("data are essentially constant") : missing value where TRUE/FALSE needed
Which looks a bit as if R is trying to perform a t.test on the factor. Why would that be? Many thanks.
dlply splits df into several data frames. That means that whatever function you hand off to dlply must expect a data frame as input. t.test expects a vector as its first argument.
Your anonymous function in dlply declares d as its only argument. But then in your call to t.test you pass only y. R doesn't automatically know to look in the data frame d for y. So instead it's probably finding the y that you defined in the global environment.
Simply changing that to t.test(d$y,mu = 0) in your first example should make it work.
The second form, where you pass the function directly, will only work if the function to be applied expects a data frame as input (e.g. summarise or transform).
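Here is a minimal sketch of the corrected first example. It uses base R's split() and lapply() so that it runs without plyr; the equivalent dlply call is shown in a comment. The seed and the names dat and tests are illustrative choices, not part of the original question.

```r
# Same toy data as in the question; the seed is an arbitrary choice
set.seed(1)
dat <- data.frame(f = factor(rep(c("a", "b"), each = 3)), y = rnorm(6))

# Each piece handed to the function is a data frame,
# so the y column has to be pulled out of it explicitly
tests <- lapply(split(dat, dat$f), function(d) t.test(d$y, mu = 0))

# With plyr loaded, the equivalent call would be:
# tests <- dlply(dat, .(f), function(d) t.test(d$y, mu = 0))

# One p-value per group
p_values <- sapply(tests, function(tt) tt$p.value)
print(p_values)
```

The point is only that the function receives d (a data frame) and must reach into it with d$y; t.test itself is unchanged.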
Related
I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e. column), check its scale of measure (numeric, factor with two levels, factor with more than two levels) and then use the appropriate correlation function. Unfortunately my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x,y){
# print out structure of every column
print(str(x))
# if feature is numeric and has more than two outcomes use corr.test
if(is.numeric(x) & length(unique(x))>2){
result <- corr.test(x,y)[['r']]
} else {
result <- "else"
}
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply to a data frame, the first thing that happens is coercing your data frame to a matrix. A matrix can only have one data type, so all columns of your data are converted to the most general type among them when this happens.
Use sapply or lapply to work with columns of a data frame.
This should work fine (I tried to test, but I don't know what package to load to get the corr.test function.)
result <- sapply(features, dyn.corr, y)
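The coercion is easy to see on a tiny data frame. This is only a sketch; the data frame d and the result names are made up for illustration.

```r
# A small mixed-type data frame (one factor column, one numeric column)
d <- data.frame(foo = c("x", "y", "x"), bar = c(1, 2, 3),
                stringsAsFactors = TRUE)

# apply() first converts d to a matrix, so every column becomes character
by_apply <- apply(d, 2, class)

# sapply() walks over the columns as they are, so the types survive
by_sapply <- sapply(d, class)

print(by_apply)
print(by_sapply)
```

With apply() both columns report "character"; with sapply() they report "factor" and "numeric", which is what a check like is.numeric(x) needs.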
I'm trying to extract prop.test p-values over a set of columns in a dataframe existing in the global environment (df) and save them as a dataframe. I have a criteria column and 19 variable columns (among others)
proportiontest <- function() {
prop_df <- data.frame()
for(i in 1:19) {
x <- paste("df$var_", i, sep="")
y <- (prop.test(table(df$criteria, x), correct=FALSE))$p.value
z <- cbind (x, y)
prop_df <- rbind(prop_df, z)
}
assign("prop_df",prop_df,envir = .GlobalEnv)
}
proportiontest()
When I run this I get the error:
Error in table(df$criteria, x) : all arguments must have the same length
When I manually insert the column name into the function (instead of x) everything runs fine. e.g.
y <- (prop.test(table(df$criteria, df$var_1), correct=FALSE))$p.value
I seem to have the problem of using the variable (x) value generated via the for loop as the argument.
What am I missing or doing wrong in this case? I have tried passing x into the table() function as.String(x) as.character(x) among countless others to no avail. I cannot seem to understand in which form the argument must be. I'm probably misunderstanding something very basic in R but it's driving me insane and I cannot seem to formulate the question in a manner where google/SO can help me.
Currently in your function x is just a string. If you want to use a column from your data frame df you can do this in your for loop:
x <- df[,i]
You'll then need to change z or you'll be cbinding a column to a single p value, maybe just change to this:
z <- cbind(i,y)
so that you know which df column belongs to each p value.
You should be careful as well since the function will search for df created within itself and then move to the parent environment if it doesn't find it, so maybe you could pass the df as an argument to avoid any mistakes.
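Putting these suggestions together, one possible sketch passes the data frame in as an argument and looks each column up by name. Note that df[, i] only picks the right columns if var_1 to var_19 happen to be the first 19 columns, so the lookup here goes through the name instead. The function name proportion_test, the n_vars parameter and the toy data are assumptions made for illustration.

```r
# Hypothetical rewrite: the data frame is an argument, not a global
proportion_test <- function(df, n_vars = 19) {
  var_names <- paste0("var_", seq_len(n_vars))
  p_values <- vapply(var_names, function(nm) {
    # df[[nm]] fetches the column whose name is stored in the string nm
    prop.test(table(df$criteria, df[[nm]]), correct = FALSE)$p.value
  }, numeric(1))
  data.frame(variable = var_names, p_value = unname(p_values))
}

# Toy data with two binary variables (made up for this example)
toy <- data.frame(
  criteria = rep(c("a", "b"), each = 50),
  var_1 = rep(c(0, 1), times = 50),
  var_2 = c(rep(0, 40), rep(1, 10), rep(0, 10), rep(1, 40))
)
prop_df <- proportion_test(toy, n_vars = 2)
print(prop_df)
```

Returning the data frame and assigning it at the call site avoids the assign() into the global environment.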
df is a frequency table, where the values in a were reported as many times as recorded in column x,y,z. I'm trying to convert the frequency table to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call for every column, you can always try using an apply function. Note that you cannot store the result in a data.frame because the objects are of different lengths, but you could store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
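As a variation, here is a sketch that uses lapply(), which always returns a list regardless of the element lengths, and then pads each element with NA explicitly. The names expanded, max_len and padded are illustrative.

```r
# Frequency table from the question
df <- data.frame(a = 1:10, x = 6:15, y = 11:20, z = 16:25)

# One expanded vector per count column; lapply() always gives a list
expanded <- lapply(df[, c("x", "y", "z")], function(counts) rep(df$a, counts))
print(lengths(expanded))

# Pad every vector with NA up to the longest one, then combine
max_len <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, function(v) {
  length(v) <- max_len  # extending a vector fills the new slots with NA
  v
}))
print(dim(padded))
```

The padding NAs are not observations, so functions applied to padded will usually need na.rm = TRUE or similar.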
I have a general question about loops in R:
Could someone explain the error in this?
for (i in seq(length=2:ncol(df))) { z <- cor.test(df$SEASON, df[,i], method="spearman");z}
I would simply like to run the cor.test(x,y) function between the column called SEASON and every other column of my data frame "df".
I also want R to print the results "z" after this calculation.
First, you don't really need a for() loop. You can use the apply() function to get the correlation of SEASON with all of the other columns of the data frame df.
# some fake data
n <- 20
df <- data.frame(SEASON=runif(n), A=runif(n), B=runif(n), C=runif(n))
# print the correlation
apply(df[, -1], 2, cor.test, df$SEASON, method="spearman")
Second, you are not using the seq() function properly. The length.out argument of seq() is the "desired length of the sequence". You keep supplying the length.out argument with a vector, instead of the scalar (a vector of length one) it is expecting. That is why you get a warning message when you submit something like, seq(length.out=2:ncol(df)). The function just uses the first element, so the result is the same (without a warning message) as for seq(length.out=2). If you wanted to use seq() to give you the desired result, you would use seq(from=2, to=ncol(df)). This is fine, but I think it simpler and cleaner to simply use 2:ncol(df) as previous posters suggest.
If you really wanted to use a for loop, this should do the trick. Values are not auto-printed inside a loop, so the call is wrapped in print():
for(i in 2:ncol(df)) print(cor.test(df$SEASON, df[, i], method="spearman"))
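If the aim is to keep the results rather than only print them, one option is to collect the estimate and p-value for each column. This is a sketch on the same kind of fake data; the seed and the name res are arbitrary.

```r
# Fake data in the same shape as above; the seed is an arbitrary choice
set.seed(1)
n <- 20
df <- data.frame(SEASON = runif(n), A = runif(n), B = runif(n), C = runif(n))

# For every column except SEASON, keep Spearman's rho and the p-value
res <- sapply(df[-1], function(col) {
  ct <- cor.test(df$SEASON, col, method = "spearman")
  c(rho = unname(ct$estimate), p.value = ct$p.value)
})

# res is a matrix with one column per variable
print(res)
```

Each column of res can then be read off by name, e.g. res["rho", "A"].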
First time posting so apologies if it doesn't meet standards! I'm working on a simulation in which I'm using cut() to create ordinal variables drawn from a multivariate normal distribution. I'm attempting to create a data.frame using expand.grid() containing all of my conditions and then run mdply() to 'sew it all together.' A particular column in that df is df$k, which contains vectors of cut points which will be fed into cut():
n <- c(25,50)
p <- c(0,.1)
k <- list(c(-Inf,0,Inf), c(-Inf,-1,1,Inf))
factors <- list(n=n, p=p, k=k)
df <- expand.grid(factors)
fx <- function(n, p, k){
data <- mvrnorm(n,c(0,0),matrix(c(1,p,p,1),2,2))
x <- cut(data[,1],k)
y <- cut(data[,2],k)
dat <- cbind(x,y)
}
My issue is that although my table of conditions looks correct, df$k is actually a list of 8 lists, and when fed into mdply() it gives the error: Error: (list) object cannot be coerced to type 'integer' (I'm assuming the list column is why the error is popping up.) So essentially I'd like to unlist each cell so that each cell is simply a vector such as c(-Inf, 0, Inf). It will look exactly like df does now, but I believe it will solve the problem. unlist(df$k) creates one long vector and loses the cuts I need; unlist(df$k[[x]]) does the trick, but I need to do that to each element of the column.
Here's the code that produces the error:
test <- mdply(df, .fun = runSim) #runSim contains fx above and an estimation function