sapply(): combining commands in R

I was experimenting with R and had trouble combining my sapply() commands into one expression.
For example, my data table was called height_weight.
I want to calculate the usual summary statistics (mean, median, maximum, minimum and sample size) for columns 2 to 7.
As sample code, I used this for the mean:
sapply(height_weight[2:7],mean,na.rm=TRUE)
and this for the maximum:
sapply(height_weight[2:7],max,na.rm=TRUE)
I'm just wondering, how would I combine the two into one expression? I have tried simply placing them next to each other, however that shows an error message.

There are many ways to do this.
For example, use summary and subset the appropriate rows:
sapply(height_weight[2:7], summary)[c("Mean", "Max."), ]
Or use an anonymous function that returns both measures:
sapply(height_weight[2:7], function(x) c(Mean=mean(x, na.rm=TRUE), Max=max(x, na.rm=TRUE)))
Placing the two functions beside each other won't work because sapply accepts only one function. Every further argument that is not a parameter of sapply itself is passed on to that function.
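To illustrate how extra arguments are forwarded, here is a small sketch. The height_weight data frame below is a made-up stand-in, not the asker's data.

```r
# Hypothetical stand-in for height_weight (not the asker's data)
height_weight <- data.frame(
  id     = 1:4,
  height = c(150, 160, NA, 180),
  weight = c(50, NA, 70, 90)
)

# na.rm = TRUE is not a parameter of sapply, so it is forwarded to mean()
forwarded <- sapply(height_weight[2:3], mean, na.rm = TRUE)

# Equivalent call with the argument spelled out in an anonymous function
explicit <- sapply(height_weight[2:3], function(x) mean(x, na.rm = TRUE))

print(forwarded)
print(all.equal(forwarded, explicit))
```

Both calls should give the same named vector, which is why a second function placed after the first is treated as just another argument for the first one.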

If you want to calculate the usual summary statistics, you can just use summary:
summary(height_weight[2:7])
sapply(height_weight[2:7],summary) # just to use sapply
Otherwise, if you want to define your own summary statistics (in this case mean and max), then you can write a function mysummary and use sapply just as before:
mysummary <- function(x, ...) {
c(mean=mean(x, ...),
max=max(x, ...))
}
sapply(height_weight[2:7], mysummary , na.rm=TRUE)
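The same idea can be stretched to all five statistics from the question. This is only a sketch: full_summary is a hypothetical name, and the data frame is invented for illustration.

```r
# Hypothetical stand-in for height_weight (not the asker's data)
height_weight <- data.frame(
  id     = 1:5,
  height = c(150, 160, NA, 170, 180),
  weight = c(50, 60, 70, NA, 90)
)

# Hypothetical extension of mysummary; na.rm is a named argument here so
# that it is only handed to the functions that accept it
full_summary <- function(x, na.rm = TRUE) {
  c(mean   = mean(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    min    = min(x, na.rm = na.rm),
    max    = max(x, na.rm = na.rm),
    n      = sum(!is.na(x)))  # sample size counts non-missing values only
}

# One column of results per data column, one row per statistic
stats <- sapply(height_weight[2:3], full_summary)
print(stats)
```

Because every call returns a named vector of the same length, sapply simplifies the result to a matrix.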

Related

use function on multiple columns (variables) in r

I am trying to run tests of homogeneity of variance using the leveneTest function from the car package. I can run the test on a single variable like so (using the iris dataset as an example)
library(car)
library(datasets)
data(iris)
leveneTest(iris$Sepal.Length, iris$Species)
However, I would like to run the test on all the dependent variables in the dataset simultaneously (so Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). I am guessing it has something to do with the apply family of functions (sapply, lapply, tapply) but I just can't figure out how. The closest I came is something like this:
lapply(iris, leveneTest(group = iris$Species))
However I get the error
Error in leveneTest.default(group = iris$Species) :
argument "y" is missing, with no default
Which I understand is probably because it isn't able to specify the outcome variables. I am certain I must be missing some obvious use of the apply functions, but I just don't understand what it is. Apologies for the basic question, but I am relatively new to R and am often applying the same function to multiple variables (usually by copying the code several times), so it would be great to understand how to use these functions properly :)
Common parameters to the function need to be passed to ... within lapply. Like this:
lapply(subset(iris, select = -Species), leveneTest, group = iris$Species)
help("lapply") explains that ... is for "optional arguments to FUN" (meaning optional for lapply not for FUN) and provides lapply(x, quantile, probs = 1:3/4) as an example.
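If car is not installed, the same pattern can be tried with base R. The sketch below swaps in bartlett.test from the stats package (a different homogeneity-of-variance test, not Levene's test) purely to show how the grouping argument travels through lapply's ... argument.

```r
data(iris)

# Each numeric column becomes the first argument (x) of bartlett.test;
# g = iris$Species is forwarded unchanged on every call
tests <- lapply(iris[-5], bartlett.test, g = iris$Species)

# Collect one p-value per variable into a named vector
p_values <- sapply(tests, function(t) t$p.value)
print(p_values)
```

The result of lapply is a list with one test object per column, so any component can be pulled out afterwards with sapply.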
Piggybacking on @Roland's answer, you can do the following in base R as well:
lapply(iris[,-5], leveneTest, group = iris$Species)
The -5 is obviously specific to the iris dataset. You could replace it with a variable like
lapply(iris[,-length(iris)]....
and that would let you remove the last column of the data frame, assuming your grouping variable is last.
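If the grouping variable is not guaranteed to be last, selecting by name or by type may be more robust. A minimal sketch using iris:

```r
data(iris)

# Keep every column except the grouping variable, selected by name
by_name <- iris[, setdiff(names(iris), "Species")]

# Or keep only the numeric columns, whatever their position
by_type <- Filter(is.numeric, iris)

print(names(by_name))
print(names(by_type))
```

Either result can be passed to lapply in place of iris[,-5].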
Additionally, as a data.table fan, I'll add an option using that package as well, if you're interested (assuming dt.iris is iris converted to a data.table):
dt.iris[, lapply(.SD, leveneTest, group = Species), .SDcols = !'Species']
This code lets you 'remove' the Species column from your lapply call in a similar manner to the base R examples above, but by naming it explicitly via .SD and .SDcols. Then you run your analysis in a fairly straightforward manner. Hope this helps!

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I run them, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert the columns into the numeric type I need for cor(). I receive the "y must be numeric" error when trying to run the cor() function. I tried putting several arguments with and without '' or "", without success.
When I ran str(cor.condition.cols) I only got character strings, which makes me think that my function somehow misuses as.numeric. Any suggestions for how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
# sapply keeps one numeric column per gene; as.numeric() on a whole matrix would drop its dimensions
res2 <- sapply(res[cor.condition.cols], as.numeric)
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
or possibly the shorter variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
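For columns that arrive as character strings, one possible route is to convert them first and then correlate each one with the condition column. The data below is invented, and only two gene columns are shown.

```r
# Hypothetical miniature of res: gene columns stored as character
res <- data.frame(
  Genes1_Acc4 = c("1", "2", "3", "4", "5"),
  Genes2_Acc4 = c("5", "4", "3", "2", "1"),
  Condition   = c(2, 4, 6, 8, 10),
  stringsAsFactors = FALSE
)

cols <- paste0("Genes", 1:2, "_Acc4")

# Convert column by column so the data frame structure is preserved
res[cols] <- lapply(res[cols], as.numeric)

# One coefficient per gene column; y and use are forwarded to cor()
cors <- sapply(res[cols], cor, y = res$Condition, use = "complete.obs")
print(cors)
```

The result is a named vector, which can be wrapped in data.frame() if a table is preferred.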

How to subset rows with strings

I want to use a function to repeatedly create data sets with different names.
For example, suppose I have 5 random vectors:
number1<-sample(1:10, 3)
number2<-sample(1:10, 3)
number3<-sample(1:10, 3)
number4<-sample(1:10, 3)
number5<-sample(1:10, 3)
Then I use these vectors to select rows from a raw data set (i.e. a data frame):
testset1<-raw[number1,]
testset2<-raw[number2,]
testset3<-raw[number3,]
testset4<-raw[number4,]
testset5<-raw[number5,]
Writing out each command takes a lot of space in the manuscript, so I'm trying to shorten these commands using a function.
However, I found it hard to use variables inside a function to build the object names. For example, it is easy to use variables like this:
mean_function<-function(x){
mean(x)
}
But I want to use a function like this:
testset "number with 1-5" <-raw[number"number 1-5",]
I would really appreciate your help.
You don't need to create a function for this task, simply use lapply to loop over the list of elements produced by mget(), then set some names and finally put all results in the global environment:
rowSelected <-lapply(mget(paste0("number", 1:5)), function(x) raw[x, ])
names(rowSelected) <- paste0("testset", 1:5)
list2env(rowSelected, envir = .GlobalEnv)
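An alternative that avoids separate global objects is to keep both the index vectors and the subsets in lists. This is a sketch with a made-up raw data frame.

```r
set.seed(42)  # only to make the random row choices reproducible

# Hypothetical raw data set with 10 rows
raw <- data.frame(id = 1:10, value = (1:10) * 10)

# Draw the five index vectors directly into a list
numbers <- replicate(5, sample(1:10, 3), simplify = FALSE)

# One subset of raw per index vector, named testset1 ... testset5
testsets <- lapply(numbers, function(idx) raw[idx, ])
names(testsets) <- paste0("testset", 1:5)

print(testsets$testset1)
```

Each subset is then available as testsets$testset1, testsets[["testset2"]] and so on, and the whole list can be fed to further lapply calls.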

apply multiple functions in sapply

I have a list of .stat files in tmp directory.
sample:
a.stat=>
abc,10
abc,20
abc,30
b.stat=>
xyz,10
xyz,30
xyz,70
and so on
I need to find a summary of all .stat files.
Currently I am using
filelist<-list.files(path="/tmp/",pattern=".stat")
data<-sapply(paste("/tmp/",filelist,sep=''), read.csv, header=FALSE)
However, I need to apply summary to all files being read. Put simply, for n .stat files I need the summary of the 2nd column.
Using
data<-sapply(paste("/tmp/",filelist,sep=''), summary, read.csv, header=FALSE)
does not work and gives me a summary with class character, which is not what I intend.
sapply(filelist, function(filename){df <- read.csv(filename, header=F);print(summary(df[,2]))}) works fine. However my overall objective is to find values that are more than 2 standard deviations away on either side (outliers). So I use sd, but at the same time need to check if all values in the file currently read come under 2SD range.
To apply multiple functions at once:
f <- function(x){
list(sum(x),mean(x))
}
sapply(x, f)
In your case you want to apply them sequentially, so first read csv data then do summary:
sapply(lapply(paste("/tmp/",filelist,sep=''), read.csv, header=FALSE), summary)
To run summary on a particular column only, change the outer sapply function from summary to function(x) summary(x[[2]]).
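Putting the pieces together, including the 2 SD check from the question, might look like the sketch below. It writes demo files to a temporary directory instead of /tmp/, and c.stat is an invented file added only so that one value actually falls outside the range.

```r
# Write small demo files to a temporary directory (stand-in for /tmp/)
tmp <- file.path(tempdir(), "stat_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("abc,10", "abc,20", "abc,30"), file.path(tmp, "a.stat"))
writeLines(c("xyz,10", "xyz,30", "xyz,70"), file.path(tmp, "b.stat"))
# Hypothetical third file containing one extreme value
writeLines(paste0("pqr,", c(rep(10, 9), 100)), file.path(tmp, "c.stat"))

# "\\.stat$" matches a literal dot at the end of the file name
files <- list.files(tmp, pattern = "\\.stat$", full.names = TRUE)

# Step 1: read every file; header = FALSE is forwarded to read.csv
tables <- lapply(files, read.csv, header = FALSE)
names(tables) <- basename(files)

# Step 2: summary of the 2nd column of each file
summaries <- sapply(tables, function(d) summary(d[[2]]))
print(summaries)

# Step 3: values more than 2 standard deviations from the file's mean
outliers <- lapply(tables, function(d) {
  v <- d[[2]]
  v[abs(v - mean(v)) > 2 * sd(v)]
})
print(outliers)
```

Reading once into a named list and reusing it keeps the summary and the outlier check from each re-reading the files.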
For short functions you don't want to save in the environment, it can also just be done within the sapply call. For @flxflks's example:
sapply(df, function(x) c(min = min(x), avg = mean(x)))
Adding to @Jangorecki's answer, I changed the function to return a vector and not a list. Only then did it work for me. I am unsure why my function worked and not the other.
f <- function(x){
c(min = min(x), avg = mean(x))
}
sapply(df, f)
I found the solution at https://www.r-bloggers.com/applying-multiple-functions-to-data-frame/
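A likely reason for the difference is how sapply simplifies its result. The sketch below, on a made-up data frame, compares the two return types.

```r
# Hypothetical data frame for illustration
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Returning a list: sapply still builds a 2 x 2 matrix, but each cell
# is a list element rather than a plain number
as_list <- sapply(df, function(x) list(min(x), mean(x)))

# Returning a named vector: sapply builds an ordinary numeric matrix
as_vec <- sapply(df, function(x) c(min = min(x), avg = mean(x)))

print(is.list(as_list))
print(is.numeric(as_vec))
print(as_vec)
```

The list version is still usable, but arithmetic and printing behave differently on it, so the vector version is usually the more convenient one.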

Quantiles of a data.frame

I have a data.frame whose columns I'd like to calculate quantiles for:
tert <- c(0:3)/3
data <- dbGetQuery(dbCon, "SELECT * FROM tablename")
quans <- mapply(quantile, data, probs=tert, name=FALSE)
But the result only contains the last element of the list returned by quantile, not the whole result. I also get the warning "longer argument not a multiple of length of shorter". How can I modify my code to make it work?
PS: The function alone works like a charm, so I could use a for loop:
quans <- quantile(a$fileName, probs=tert, name=FALSE)
PPS: What also works is not specifying probs:
quans <- mapply(quantile, data, name=FALSE)
The problem is that mapply is trying to apply the given function to each of the elements of all of the specified arguments in sequence. Since you only want to do this for one argument, you should use lapply, not mapply:
lapply(data, quantile, probs=tert, name=FALSE)
Alternatively, you can still use mapply but specify the arguments that are not to be looped over in the MoreArgs argument.
mapply(quantile, data, MoreArgs=list(probs=tert, name=FALSE))
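To make the pairing behaviour visible, here is a small sketch on invented data (the original data comes from a database query that cannot be reproduced here).

```r
# Hypothetical data frame standing in for the query result
data <- data.frame(x = 1:10, y = (1:10)^2)
tert <- (0:3) / 3

# mapply walks through its arguments in parallel: column x is paired
# with probability 0 and column y with probability 1
paired <- mapply(function(col, p) quantile(col, probs = p, names = FALSE),
                 data, c(0, 1))

# Fixed arguments go through ... (sapply/lapply) or MoreArgs (mapply),
# so every column receives the full probs vector
via_sapply <- sapply(data, quantile, probs = tert, names = FALSE)
via_mapply <- mapply(quantile, data,
                     MoreArgs = list(probs = tert, names = FALSE))

print(paired)
print(via_sapply)
```

This also suggests why wrapping quantile in a helper that hard-codes probs behaves well with mapply: the only argument left to iterate over is the data.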
I finally found a workaround which I don't like, but it kind of works. Perhaps someone can tell me the right way to do it:
q <- function(x) { quantile(x, probs=c(0:3)/3, names=FALSE) }
mapply(q, data)
This works, though I have no idea where the difference is.
