I would like to analyze a dataset by group. The data is set up like this:
Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0
I have a function that calculates the following:
sumStats <- function(obs, cens) {
  detects <- obs[cens == 0]
  nondetects <- obs[cens == 1]
  mean.detects <- mean(detects)
  return(mean.detects)
}
This is of course a simple function for illustration purposes. Is there a function in R that will let me apply this home-made function, which takes 2 input variables, to the data by group?
I looked into the by function, but it seems to take in only 1 column of data at a time.
Import your data:
test <- read.table(header=TRUE,textConnection("Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0"))
Though there are many ways to do this, using by specifically you could do something like this (assuming your data frame is called test):
by(test,test$Group,function(x) mean(x$Result[x$cens==1]))
which will give you the mean of all the Result values within each group that have cens==1.
Output looks like:
test$Group: A
[1] 1.3
----------------------------------------------------------------------
test$Group: B
[1] 1.2
To help you understand how this might work with your function, consider this:
If you just ask the by statement to return the contents of each group, you will get:
> by(test,test$Group,function(x) return(x))
test$Group: A
Group Result cens
1 A 1.3 1
2 A 2.4 0
3 A 2.1 0
-----------------------------------------------------------------------
test$Group: B
Group Result cens
4 B 1.2 1
5 B 1.7 0
6 B 1.9 0
...which is actually 2 data frames with only the rows for each group, stored as a list.
This means you can access parts of the data frames for each group just as you could before they were split up. The x in the above functions refers to the whole sub-data-frame for each of the groups, i.e. you can use individual variables within x to pass to functions. A basic example:
> by(test,test$Group,function(x) x$Result)
test$Group: A
[1] 1.3 2.4 2.1
-------------------------------------------------------------------
test$Group: B
[1] 1.2 1.7 1.9
Now, to finally get around to answering your specific query!
If you take an example function that computes the mean of each of two inputs separately:
sumStats <- function(var1, var2) {
  res1 <- mean(var1)
  res2 <- mean(var2)
  output <- c(res1, res2)
  return(output)
}
You could call this using by to get the mean of both Result and cens like so:
> by(test,test$Group,function(x) sumStats(x$Result,x$cens))
test$Group: A
[1] 1.9333333 0.3333333
----------------------------------------------------------------------
test$Group: B
[1] 1.6000000 0.3333333
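For completeness, here is a sketch plugging in your original function (with the cens=1 typo corrected to cens==1), which returns the mean of the detects within each group:
sumStats <- function(obs, cens) {
  detects <- obs[cens == 0]   # detected values only
  mean(detects)
}
> by(test, test$Group, function(x) sumStats(x$Result, x$cens))
test$Group: A
[1] 2.25
----------------------------------------------------------------------
test$Group: B
[1] 1.8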
Hope that is helpful.
The aggregate function is designed for this.
aggregate(dfrm$cens, dfrm["Group"], FUN=mean)
You can get the mean value of several columns at once, each within 'Group':
aggregate(dfrm[ , c("Result", "cens") ], dfrm["Group"], FUN=mean)
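As an aside, aggregate's formula interface can filter to the detects first; a minimal sketch using the test data imported above:
# Mean Result per Group, using only the detects (cens == 0)
aggregate(Result ~ Group, data = subset(test, cens == 0), FUN = mean)
#   Group Result
# 1     A   2.25
# 2     B   1.80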
Related
So, I have a data frame containing 100 different variables. Now, I want to create 100 new variables, one corresponding to each variable in the original data frame. Currently, I am trying loops and lapply to figure this out, but haven't had much luck so far.
Here is a snapshot of what the data frame looks like (suppose my data frame is named er):
a b c d
1 2 3 4
5 6 7 8
9 0 1 2
and using each of these 4 variables I have to create a new variable, hence a total of 4 new variables. My new variables should be, let's say, a1 = 0.5 + a, b1 = 0.5 + b, and so on.
I am trying the following two approaches:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
and alternatively, I am trying lapply as follows:
dep <- lapply(er, function(x) {
x<-0.5+er
}
But neither of them is working. Can anyone tell me what's wrong with this code, or suggest an efficient way to do this? I have shown just 4 variables here for demonstration; I have around 100 of them.
You could directly add 0.5 (or any number) to the dataframe.
er[paste0(names(er), '1')] <- er + 0.5
er
# a b c d a1 b1 c1 d1
#1 1 2 3 4 1.5 2.5 3.5 4.5
#2 5 6 7 8 5.5 6.5 7.5 8.5
#3 9 0 1 2 9.5 0.5 1.5 2.5
Ronak's answer provides the most efficient way of solving your problem. I'll focus on why your attempts didn't work.
er <- data.frame(a = c(1, 5, 9), b = c(2, 6, 0), c = c(3, 7, 1), d = c(4, 8, 2))
A. for loop:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
Think of how R interprets each element of your loop. It will go from 1 to however many columns er has, using i as a placeholder, so on the first iteration it will run:
[[1]] <- 0.5 + [[1]]
which doesn't make sense, because you're not indicating what object you are indexing at all. Instead, what you want is:
for (i in 1:ncol(er)) {
er[[i]] <- 0.5 + er[[i]]
}
Here, each iteration means "assign to the ith column of er the ith column of er plus 0.5". If you further want to create new variables, you would do the following (which is somewhat similar to Ronak's answer, just less efficient):
for (i in 1:ncol(er)) {
er[[paste0(names(er)[i], "1")]] <- 0.5 + er[[i]]
}
As a side note, it is preferred to use seq_along(er) instead of 1:ncol(er).
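A quick illustration of why, using a hypothetical zero-column data frame:
empty <- data.frame()
1:ncol(empty)     # [1] 1 0 -- the loop would run twice over columns that don't exist
seq_along(empty)  # integer(0) -- the loop correctly runs zero times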
B. lapply:
dep <- lapply(er, function(x) {
x<-0.5+er
}
When creating a function, you need to specify what you want it to return. Here, function(x) { x + 0.5 } is sufficient to indicate that you want to return the variable plus 0.5. Since lapply() returns a list (the function's name is short for "list apply"), you'll want to wrap the call in as.data.frame():
as.data.frame(lapply(er, function(x) { x + 0.5 }))
However, this doesn't change the variable names, so you'll need to rename the new columns and bind them to the original data frame yourself:
dep <- as.data.frame(lapply(er, function(x) { x + 0.5 }))
names(dep) <- paste0(names(dep), "1")
cbind(er, dep)
a b c d a1 b1 c1 d1
1 1 2 3 4 1.5 2.5 3.5 4.5
2 5 6 7 8 5.5 6.5 7.5 8.5
3 9 0 1 2 9.5 0.5 1.5 2.5
C. Another way would be using dplyr syntax, which is more elegant and readable:
library(dplyr)
mutate(er, across(everything(), ~ . + 0.5, .names = "{.col}1"))
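Note that across() and its .names argument require dplyr 1.0.0 or later. This gives the same result as the approaches above:
#   a b c d  a1  b1  c1  d1
# 1 1 2 3 4 1.5 2.5 3.5 4.5
# 2 5 6 7 8 5.5 6.5 7.5 8.5
# 3 9 0 1 2 9.5 0.5 1.5 2.5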
Related
I did some linear regression and I want to forecast the moment of exceeding a certain value.
This means I have three columns:
a= slope
b = intercept
c = target value
On every row I want to calculate
solve(a,(c-b))
How do I do this in an efficient way, without using a loop (it is an extensive dataset)?
So you basically want to solve the equation
c = a*x + b
for x, for each row? That has the simple closed-form solution
x = (c - b)/a
which is a vectorized operation in R. No loop necessary:
dd <- data.frame(
a = 1:5,
b = -2:2,
c = 10:14
)
transform(dd, solution=(c-b)/a)
# a b c solution
# 1 1 -2 10 12.0
# 2 2 -1 11 6.0
# 3 3 0 12 4.0
# 4 4 1 13 3.0
# 5 5 2 14 2.4
In addition to the aforementioned responses, you could also use the mutate function from the tidyverse, like so:
library(magrittr)
library(tidyverse)
dataframe %<>% mutate(prediction = (c - b) / a)
In this example we assume the columns 'a', 'b', and 'c' are in a table called 'dataframe'. We then use the %<>% compound-assignment pipe from the magrittr library to say "apply the function that follows to dataframe, and assign the result back". Note that solve() expects a matrix and is not vectorized over rows, so plain arithmetic (c - b) / a is used here instead.
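A quick run using the dd data frame from the answer above (assuming dplyr and magrittr are loaded):
dd %<>% mutate(prediction = (c - b) / a)
dd
#   a  b  c prediction
# 1 1 -2 10       12.0
# 2 2 -1 11        6.0
# 3 3  0 12        4.0
# 4 4  1 13        3.0
# 5 5  2 14        2.4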
Here is a simple way using the Vectorize function (with the dd data frame from above):
solve_vec <- Vectorize(solve)
> solve_vec(dd$a, dd$c - dd$b)
[1] 12.0  6.0  4.0  3.0  2.4
Related
Example Dataset:
A 2 1.5
A 2 1.5
B 3 2.0
B 3 2.5
B 3 2.6
C 4 3.2
C 4 3.5
So here I would like to create 3 frequency histograms based on the first two columns, i.e. one each for A2, B3, and C4. I am new to R, so any help would be greatly appreciated. Should I flatten out the data so it's like this:
A 2 1.5 1.5
B 3 2.0 2.5 2.6 etc...
Thank you
Here's an alternative solution, based on the by function, which is just a wrapper for the tapply that Jilber suggested. You might find the 'ex' variable useful:
set.seed(1)
dat <- data.frame(First = LETTERS[1:3], Second = 1:2, Num = rnorm(60))
# Extract third column per each unique combination of columns 'First' and 'Second'
ex <- by(dat, INDICES =
# Create names like A.1, A.2, ...
apply(dat[,c("First","Second")], MARGIN=1, FUN=function(z) paste(z, collapse=".")),
# Extract third column per each unique combination
FUN=function(x) x[,3])
# Draw histograms
par(mfrow=c(3,2))
for(i in seq_along(ex)){
hist(ex[[i]], main=names(ex)[i], xlim=extendrange(unlist(ex)))
}
Assuming your dataset is called x and the columns are a, b, c respectively, I think this command should do the trick:
library(lattice)
histogram(~c|a+b,x)
Notice that this requires you to have the package lattice installed
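A self-contained sketch of that call; the column names a, b, and c are assumptions, since the example dataset was posted without a header:
library(lattice)
# Reconstruct the example dataset with assumed column names
x <- data.frame(a = factor(c("A","A","B","B","B","C","C")),
                b = c(2, 2, 3, 3, 3, 4, 4),
                c = c(1.5, 1.5, 2.0, 2.5, 2.6, 3.2, 3.5))
# One panel of the distribution of c per a/b combination (A2, B3, C4)
histogram(~ c | a + b, data = x)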
Related
I have a data frame with columns grade.equivalent and scaled.score, both numeric. I'd like to find the percent of students at or above a given scaled.score, among all students at or above each grade.equivalent.
For example, given the following data frame:
df.ex <- data.frame(grade.equivalent=c(2.4,2.7,3.1,2.5,1.4,2.2,2.3,1.7,1.3,2.2),
scaled.score=c(187,277,308,268,236,305,298,246,241,138)
)
I'd like to know, for each grade.equivalent, what percent of students scored at or above 301 out of all students scoring at or above that grade.equivalent.
To do this I did the following:
find.percent.basic <- function(cut.ge, data, cut.scaled.score){
df.sub <- subset(data, grade.equivalent >= cut.ge & !is.na(scaled.score))
denom <- nrow(df.sub)
df.sub <- subset(df.sub, scaled.score >= cut.scaled.score)
numer <- nrow(df.sub)
return(numer/denom)
}
grade.equivs <- unique(df.ex$grade.equivalent)
grade.equivs <- grade.equivs[order(grade.equivs)]
just.percs <- sapply(grade.equivs, find.percent.basic, data=df.ex, cut.scaled.score=301)
new.df <- data.frame(grade.equivalent=grade.equivs, perc=just.percs)
I plan to wrap this in a function and use it with plyr.
My question is, is there a better way to do this? It seems like this might be a base function of R or in a common package that I just don't know about.
Thanks for any thoughts.
EDIT for clarification
The code above produces the following result, which is what I'm looking for:
grade.equivalent perc
1 1.3 0.2000000
2 1.4 0.2222222
3 1.7 0.2500000
4 2.2 0.2857143
5 2.3 0.2000000
6 2.4 0.2500000
7 2.5 0.3333333
8 2.7 0.5000000
9 3.1 1.0000000
Edited for clarification a second time, per observations from #DWin
The mean of a logical vector is the proportion that are TRUE, so something like this should do it:
mean(data$scaled.score >= cut.ss, na.rm=TRUE)
As in your comment, yes, that's exactly what you need to do. I'd choose to access scaled.score slightly differently, but no real difference.
gs <- sort(unique(df.ex$grade.equivalent))
ps <- sapply(gs, function(cut.ge) {
mean(df.ex$scaled.score[df.ex$grade.equivalent>=cut.ge] >= 301, na.rm=TRUE)
})
data.frame(gs, ps)
# gs ps
# 1.3 0.2000000
# 1.4 0.2222222
# 1.7 0.2500000
# 2.2 0.2857143
# 2.3 0.2000000
# 2.4 0.2500000
# 2.5 0.3333333
# 2.7 0.5000000
# 3.1 1.0000000
I don't think this is the kind of thing that will work well with plyr's split-apply-combine methodology, because you can't split the data into discrete subsets for each grade-equivalent; instead, some rows will appear in multiple subsets.
Another option would be to split the scores (or the entire data frame, if needed) yourself into the desired sections, and then to apply whatever functions you wanted; this would be the same methodology as plyr, though more by hand.
scores <- lapply(gs, function(x) df.ex$scaled.score[df.ex$grade.equivalent>=x])
sapply(scores, function(x) mean(x >= 301, na.rm=TRUE))
And a final option would be to put them in order to start with and then compute a cumulative mean, and remove the duplicated grade.equivalent values, like this.
df2 <- df.ex[rev(order(df.ex$grade.equivalent)),]
df2$perc <- cumsum(df2$scaled.score >= 301)/1:nrow(df2)
df2 <- df2[nrow(df2):1,c("grade.equivalent", "perc")]
df2[!duplicated(df2$grade.equivalent),]
with(df.ex, tapply(scaled.score, INDEX=grade.equivalent,
FUN=function(s) 100*sum(s>301)/length(s) ) )
#1.3 1.4 1.7 2.2 2.3 2.4 2.5 2.7 3.1
# 0 0 0 50 0 0 0 0 100
Related
Suppose I have a numeric vector like:
x <- c(1.0, 2.5, 3.0)
and data.frame:
df<-data.frame(key=c(0.5,1.0,1.5,2.0,2.5,3.0),
value=c(-1.187,0.095,-0.142,-0.818,-0.734,0.511))
df
key value
1 0.5 -1.187
2 1.0 0.095
3 1.5 -0.142
4 2.0 -0.818
5 2.5 -0.734
6 3.0 0.511
I want to extract all the rows of df whose key values equal the values in x, with a result like:
df.x$value
[1] 0.095 -0.734 0.511
Is there an efficient way to do this, please? I've tried data.frame, the hash package, and data.table, all with no success. Thanks for the help!
Thanks guys. I actually tried a similar thing but had df$key and x reversed. Is it possible to do this with the hash() function (in the 'hash' package)? I see hash can do things like:
h <- hash( keys=letters, values=1:26 )
h$a # 1
h$foo <- "bar"
h[ "foo" ]
h[[ "foo" ]]
z <- letters[3:5]
h[z]
<hash> containing 3 key-value pair(s).
c : 3
d : 4
e : 5
But it seems like it doesn't take a vector of keys, such as:
h[[z]]
Error in h[[z]] : wrong arguments for subsetting an environment
But I need just the values, as a vector rather than a hash. Otherwise it would be perfect, and we could get rid of the data.frame by using a "real" hash concept.
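A hedged sketch: if the hash package's values() accessor works the way I recall (treat this as an assumption to check against the package docs), you can flatten the subset h[z] back into a plain named vector:
library(hash)
h <- hash(keys = letters, values = 1:26)
z <- letters[3:5]
values(h[z])   # expected: a named vector with c = 3, d = 4, e = 5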
Try
df[df$key %in% x, "value"]   # just the values
df[df$key %in% x, ]          # or the whole rows, respectively
Using an OR (|) condition, you can modify this so that your vector may match in either of your columns, as shown in the sketch below. General tip: also have a look at which.
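A minimal sketch of that modification (the column names key and value are taken from the question):
# Keep rows where x matches either the key or the value column
df[df$key %in% x | df$value %in% x, ]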
Have you tried testing which values of df$key are in x and extracting the corresponding entries in the value column? I only say this out loud because StackOverflow doesn't like one-line answers:
> x
[1] 1.0 2.5 3.0
> df
key value
1 0.5 -0.7398436
2 1.0 0.6324852
3 1.5 1.8699257
4 2.0 1.0038996
5 2.5 1.2432679
6 3.0 -0.6850663
> df[df$key %in% x,'value']
[1] 0.6324852 1.2432679 -0.6850663
BIG WARNING: comparisons of floating point numbers with == can be a bad idea; read R FAQ 7.31 for more info.
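A quick illustration of the pitfall, and a tolerance-based workaround:
0.1 + 0.2 == 0.3                 # FALSE, thanks to binary floating point
x2 <- c(1.0, 2.5, 3.0) + 1e-12   # keys that are only approximately equal
df[df$key %in% x2, "value"]      # numeric(0) -- %in% misses all of them
# Match within a tolerance instead:
tol <- 1e-8
df[sapply(df$key, function(k) any(abs(k - x2) < tol)), "value"]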