I have a general question about loops in R:
Could someone explain the error in this code?
for (i in seq(length=2:ncol(df))) { z <- cor.test(df$SEASON, df[,i], method="spearman");z}
I would simply like to run the cor.test(x,y) function between the column called SEASON and every other column of my data frame "df".
I would also like R to print the result "z" after each calculation.
First, you don't really need a for() loop. You can use the apply() function to get the correlation of SEASON with all of the other columns of the data frame df.
# some fake data
n <- 20
df <- data.frame(SEASON=runif(n), A=runif(n), B=runif(n), C=runif(n))
# print the correlation
apply(df[, -1], 2, cor.test, df$SEASON, method="spearman")
Second, you are not using the seq() function properly. The length.out argument of seq() (which you abbreviated to length) is the "desired length of the sequence". You are supplying it with a vector instead of the scalar (a vector of length one) it expects. That is why you get a warning message when you submit something like seq(length.out=2:ncol(df)). The function just uses the first element, so the result is the same (without a warning message) as for seq(length.out=2), which is the sequence 1, 2. If you wanted to use seq() to give you the desired result, you would use seq(from=2, to=ncol(df)). This is fine, but I think it is simpler and cleaner to use 2:ncol(df), as previous posters suggest.
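To make the difference concrete, here is a small sketch comparing the ways of building the column index. The four-column data frame is made up for illustration, and seq_len(ncol(df))[-1] is an extra option not mentioned in the posts above; it is only there to cover the edge case of a data frame with a single column.

```r
# Made-up data frame with four columns (assumption for illustration only)
df <- data.frame(SEASON = 1:3, A = 4:6, B = 7:9, C = 10:12)

idx_colon <- 2:ncol(df)                  # 2 3 4
idx_seq   <- seq(from = 2, to = ncol(df))
idx_safe  <- seq_len(ncol(df))[-1]       # drop index 1, keep the rest

all(idx_colon == idx_seq)    # same indices
all(idx_colon == idx_safe)   # same indices

# Edge case: with a single column, 2:ncol() counts down instead of being empty
one_col  <- data.frame(SEASON = 1:3)
idx_down <- 2:ncol(one_col)              # 2 1
idx_none <- seq_len(ncol(one_col))[-1]   # integer(0), so a loop body never runs
```

For the question as asked, 2:ncol(df) is perfectly adequate; the seq_len() form only matters if the number of columns can drop to one.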
If you really want to use a for loop, this should do the trick. Results are not printed automatically inside a loop, so wrap the call in print():
for(i in 2:ncol(df)) print(cor.test(df$SEASON, df[, i], method="spearman"))
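If the full printout of every test is more than you need, one possible variation is to keep only the estimate and the p-value from each test and collect them in a matrix. This is a sketch rather than a definitive answer; the seed, the helper name others and the choice of which fields to keep are my own assumptions.

```r
# Fake data as in the answer above (the seed is an arbitrary choice)
set.seed(42)
n <- 20
df <- data.frame(SEASON = runif(n), A = runif(n), B = runif(n), C = runif(n))

# Every column except SEASON
others <- setdiff(names(df), "SEASON")

# One row per column: Spearman's rho and the p-value
res <- t(sapply(others, function(v) {
  ct <- cor.test(df$SEASON, df[[v]], method = "spearman")
  c(rho = unname(ct$estimate), p.value = ct$p.value)
}))
print(res)
```

Indexing by column name instead of position means the result stays correct even if SEASON is not the first column.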
I have a dataframe as follows:
a <- c(1,45,5,23,78,NA,NA)
b <- c(1,4,5,NA,NA,NA,NA)
c <- c(4,NA,NA,NA,NA,NA,NA)
d <- c(4,6,7,3,4,23,4)
df <- data.frame(a,b,c,d)
Now I would like to get a vector with the correlation coefficient of each column against its own index sequence, omitting NAs.
For example: cor(df$a[!is.na(df$a)], 1:length(df$a[!is.na(df$a)])) which returns the linear correlation coefficient of (1,45,5,23,78) with (1,2,3,4,5)
When I apply the above written code on one single column, it works.
However, when I include the function in the lapply function to get it for all the columns, I get an 'incompatible dimensions' error. I understand that the incompatible dimensions error indicates that different vector sizes are correlated. However, how is this possible when I am correlating the vector with its length itself...?
result <- lapply(df, function(x){ o <-cor(x[!is.na(x)], 1:length(x[!is.na(x)]))})
I also tried the following, which returned the same error:
result <- lapply(df, function(x) {o <-cor(c(x[!is.na(x)]),c(1:length(x[!is.na(x)])))})
Have you tried:
apply(df, 2, cor, y=1:nrow(df),use="complete.obs")
It's a more elegant way of coding your function. It may work better for you as well.
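For comparison, here is a sketch that stays closer to the original idea of dropping the NAs from each column and correlating what is left with 1, 2, 3, and so on. The helper name trend_cor and the decision to return NA for columns with fewer than two values are my assumptions, not part of the question. Note that the apply() line above pairs each value with its row number instead, so the two approaches agree only when the NAs sit at the end of the column, as they do in this data.

```r
a <- c(1, 45, 5, 23, 78, NA, NA)
b <- c(1, 4, 5, NA, NA, NA, NA)
c <- c(4, NA, NA, NA, NA, NA, NA)
d <- c(4, 6, 7, 3, 4, 23, 4)
df <- data.frame(a, b, c, d)

# Hypothetical helper: correlate the non-NA values with their own index
trend_cor <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(NA_real_)  # a correlation needs at least two points
  cor(x, seq_along(x))
}

result <- sapply(df, trend_cor)
result
```

Column c has a single non-NA value, so its entry is NA rather than an error or a warning.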
I'm trying to extract prop.test p-values over a set of columns in a dataframe existing in the global environment (df) and save them as a dataframe. I have a criteria column and 19 variable columns (among others)
proportiontest <- function() {
prop_df <- data.frame()
for(i in 1:19) {
x <- paste("df$var_", i, sep="")
y <- (prop.test(table(df$criteria, x), correct=FALSE))$p.value
z <- cbind (x, y)
prop_df <- rbind(prop_df, z)
}
assign("prop_df",prop_df,envir = .GlobalEnv)
}
proportiontest()
When I run this I get the error:
Error in table(df$criteria, x) : all arguments must have the same length
When I manually insert the column name into the function (instead of x) everything runs fine. e.g.
y <- (prop.test(table(df$criteria, df$var_1), correct=FALSE))$p.value
I seem to have the problem of using the variable (x) value generated via the for loop as the argument.
What am I missing or doing wrong in this case? I have tried passing x into the table() function as.String(x) as.character(x) among countless others to no avail. I cannot seem to understand in which form the argument must be. I'm probably misunderstanding something very basic in R but it's driving me insane and I cannot seem to formulate the question in a manner where google/SO can help me.
Currently in your function x is just a string holding the text "df$var_1", not the column itself. If you want to use a column from your data frame df, look it up by name in your for loop:
x <- df[[paste0("var_", i)]]
You'll then need to change z, or you'll be cbinding a whole column to a single p-value. Maybe just change it to this:
z <- cbind(i,y)
so that you know which df column belongs to each p value.
You should be careful as well, since the function will first look for a df created within itself and, if it doesn't find one, move on to the environment in which the function was defined (here the global environment). You could pass df as an argument to avoid any mistakes.
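Putting these suggestions together, a minimal sketch might look like the following. The data are invented, and the names proportion_test, vars and by are hypothetical. The function takes the data frame as an argument and returns the result instead of calling assign() into the global environment, which is a design choice on my part rather than something the question requires.

```r
# Invented data: a two-level criteria column and three binary var_ columns
set.seed(1)
n <- 200
dat <- data.frame(
  criteria = sample(c("yes", "no"), n, replace = TRUE),
  var_1 = sample(0:1, n, replace = TRUE),
  var_2 = sample(0:1, n, replace = TRUE),
  var_3 = sample(0:1, n, replace = TRUE)
)

# Hypothetical helper: one prop.test p-value per named column
proportion_test <- function(data, vars, by = "criteria") {
  p <- vapply(vars, function(v) {
    # data[[v]] looks the column up by name; a bare string would not work
    prop.test(table(data[[by]], data[[v]]), correct = FALSE)$p.value
  }, numeric(1))
  data.frame(variable = vars, p.value = unname(p))
}

prop_df <- proportion_test(dat, paste0("var_", 1:3))
prop_df
```

For the original data the call would presumably be proportion_test(df, paste0("var_", 1:19)).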
df is a frequency table in which each value in a was reported as many times as recorded in columns x, y and z. I'm trying to convert the frequency table back to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call, you can use one of the apply functions. Note that you cannot store the result in a data.frame because the objects have different lengths, but you can store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-lapply(df[,2:4],function(x) rep(df[,1],x))
What this lapply call is saying is: for each column in df[,2:4], apply the rep(df[,1],x) function to it, where x is one of your columns (df[,2], df[,3], or df[,4]). lapply always returns a list, whereas sapply would simplify the result to a matrix if all the lengths happened to be equal.
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
When using the function by, at times I will have a data subset (as determined by the INDICES argument) that 'breaks' by (technically it breaks FUN which in turn breaks by).
Is there a way to identify the 'bad' value of the list passed to INDICES? (without writing an explicit loop over the list)
Can you make a reproducible example?
You can try something like,
customfun <- function(x)try(somefun(x))
byresult <- by(mydata, mydata$group, FUN=customfun)
whichbad <- which(sapply(byresult, inherits, "try-error"))
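Since there is no reproducible example in the question, here is a made-up one showing the idea end to end. The data, the choice of t.test() as the function that fails and the name safe_fun are all assumptions for illustration. Group "b" has a single observation, which makes t.test() throw an error.

```r
# Made-up data: group "b" has only one row, which breaks t.test()
mydata <- data.frame(
  group = c("a", "a", "a", "b", "c", "c", "c"),
  y = c(1.2, 0.8, 1.5, 2.0, 0.3, 0.9, 0.4)
)

# Wrap the function in try() so that one failure does not stop by()
safe_fun <- function(d) try(t.test(d$y), silent = TRUE)
byresult <- by(mydata, mydata$group, FUN = safe_fun)

# TRUE for every group whose result is an error object
is_bad <- vapply(byresult, inherits, logical(1), what = "try-error")
bad_groups <- names(byresult)[is_bad]
bad_groups
```

The error message for each bad group is still available in the corresponding element of byresult, which helps with diagnosing why it failed.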
I clearly still don't understand plyr syntax, as illustrated below. Can someone help me see what I'm missing?
The following code works fine, as expected:
# make a data frame to use dlply on
f <- as.factor(c(rep("a", 3), rep("b", 3)))
y <- rnorm(6)
df <- data.frame(f=f, y=y)
# split the data frame by the factor and perform t-tests
l <- dlply(df, .(f), function(d) t.test(y, mu=0))
However, the following causes an error
l_bad <- dlply(df, .(f), t.test, .mu=0)
Error in if (stderr < 10 * .Machine$double.eps * abs(mx)) stop("data are essentially constant") : missing value where TRUE/FALSE needed
Which looks a bit as if R is trying to perform a t.test on the factor. Why would that be? Many thanks.
dlply splits df into several data frames. That means that whatever function you hand off to dlply must expect a data frame as input. t.test expects a vector as its first argument.
Your anonymous function in dlply declares d as its only argument. But then in your call to t.test you pass only y. R doesn't automatically know to look in the data frame d for y. So instead it's probably finding the y that you defined in the global environment.
Simply changing that to t.test(d$y,mu = 0) in your first example should make it work.
The second example will only work if the function to be applied expects a data frame as input (e.g. see summarise or transform).
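The same point can be seen without plyr. The sketch below uses base R's split() and lapply() as a stand-in for dlply (a swapped-in technique, not the plyr call itself), so that it runs without extra packages. Each piece d is a data frame, so the column has to be pulled out explicitly with d$y.

```r
# Same shape of data as in the question (the seed is an arbitrary choice)
set.seed(1)
df <- data.frame(f = factor(rep(c("a", "b"), each = 3)), y = rnorm(6))

# split() yields one data frame per level of f; test that piece's own y column
l <- lapply(split(df, df$f), function(d) t.test(d$y, mu = 0))

# One p-value per group
sapply(l, function(tt) tt$p.value)
```

Writing t.test(y, mu = 0) inside the anonymous function would silently test the global y (all six values) for every group, which is why the first example in the question appeared to work.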