I'm trying to run an aggregate SD function, but I either get an error message I can't resolve or get output I can't use. I'm including sample data; the goal is to run this on a larger data set, but I can't even get the aggregate function to work on these three columns.
dput(droplevels(controls2[1:20, 1:3]))
structure(list(Experiment = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Ceres- Clipping",
"FGI- Defoliation"), class = "factor"), Grain = c(489.9, 698.5,
430.6, 244.9, 476.5, 545.4, 570.2, 463.1, 285.1, 407.6, 244.9,
401.9, 126.3, 179.9, 382.7, 266, 653, 653, 606.6, 606.6), Environment = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("Morris.1", "St. Paul.1"), class = "factor")), row.names = c(3565L,
3566L, 3567L, 3568L, 3569L, 3570L, 3571L, 3572L, 3573L, 3574L,
3575L, 3576L, 3577L, 3578L, 3579L, 3580L, 2379L, 2380L, 2381L,
2382L), class = "data.frame")
controlSDs <- aggregate(x = controls2, by = list(controls2$Experiment, controls2$Environment), FUN = "sd")
I get an error message:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
However, the only column that I'm trying to perform sd() on, controls2$Grain, is numeric:
names(controls2)
[1] "Experiment" "Grain" "Environment"
class(controls2$Grain)
[1] "numeric"
I understand controls2$Environment and controls2$Experiment are factors, but I have run this command before with factors in the by = list() command and it has worked. I've also tried the following:
controlSDs <- aggregate(cbind(Experiment, Environment) ~ Grain, data = controls2, sd)
This does not return an error message; however, the values for controlSDs$Experiment and controlSDs$Environment have been replaced with 0s and NAs, so I cannot use them to combine the data set with a data frame of means calculated using a similar aggregate function.
head(controlSDs)
Grain Experiment Environment
1 0.0 0 0
2 30.0 NA NA
3 44.0 NA NA
4 44.3 NA NA
5 46.0 NA NA
6 48.0 NA NA
Any advice on how to get this aggregate SD function to work correctly would be much appreciated. I would be happy with a solution that simply allows me to calculate the SD of the Grain column, but ideally I could scale this up to a 100+ column of a data set that is entirely numeric aside from the Environment and Experiment columns. I've updated R and R Studio within the last two weeks. I'm still learning how to make reproducible questions so please let me know if there's anything I can do to improve this question.
Are you looking for this? When you specify the formula, the numeric variable goes on the left of ~:
#Code
controlSDs <- aggregate(data = controls2,Grain~.,
FUN = sd)
Output:
controlSDs
Experiment Environment Grain
1 Ceres- Clipping Morris.1 154.67734
2 FGI- Defoliation St. Paul.1 26.78905
Based on your attempts, this can also work:
#Code2
controlSDs <- aggregate(Grain~Experiment+Environment , data = controls2, sd)
Same output.
We can use dplyr
library(dplyr)
controls2 %>%
group_by(Experiment, Environment) %>%
summarise(Grain = sd(Grain))
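A minimal base-R sketch of how this might scale to many numeric columns. The original call failed because x = controls2 hands every column, including the two factors, to sd(). Here only the numeric columns are passed as x. The data frame toy and its Yield column are made up for illustration and stand in for the real controls2.

```r
# Toy stand-in for controls2; Yield is a hypothetical second numeric column
toy <- data.frame(
  Experiment  = factor(rep(c("A", "B"), each = 4)),
  Environment = factor(rep(c("E1", "E2"), each = 4)),
  Grain       = c(1, 2, 3, 4, 10, 20, 30, 40),
  Yield       = c(2, 4, 6, 8, 5, 5, 5, 5)
)

# Pick out the numeric columns so sd() is never applied to a factor
num_cols <- names(toy)[sapply(toy, is.numeric)]

# Naming the list elements keeps readable grouping column names in the result
toy_sds <- aggregate(
  toy[num_cols],
  by  = list(Experiment = toy$Experiment, Environment = toy$Environment),
  FUN = sd
)
toy_sds
```

Because the grouping columns keep their names, the result should merge cleanly with a data frame of means built the same way, for example with merge(means, toy_sds, by = c("Experiment", "Environment")), where means is a hypothetical data frame of that shape.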
When I try to create several columns within summarize(), I can't reference a newly created column name in the same summarize statement.
Example:
Goal: Try to calculate the standard error ("se") based on the standard deviation ("sd").
Step 1 (start to assign sd for se):
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd)))
returns
style score_mean score_sd score_se
* <fct> <dbl> <dbl> <dbl>
1 S1 3.5 0.707 0.707
Step 2: calculate se based on sd
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd/sqrt(nrow(score)))))
returns
Error: Problem with `summarise()` input `..1`.
x non-numeric argument to binary operator
ℹ Input `..1` is `across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(data))))`.
ℹ The error occured in group 1: style = "S1".
Step 3 debugging assignment term
3a) grouped data reference
I replaced the grouped data in nrow(score)) by the other column names or even nrow(data), but they all led to the same error message.
3b) assignment operation
I replaced the assignment for se, sd/sqrt(nrow(score)), with several variations, all leading to the same error. The simplest was sd/2, so even dividing by a constant doesn't work.
3c) assignment reference
I replaced sd by score_sd to reference the new column created, as seen in the output (Step 1). Still the same error message.
Question: Why does Step 1 work but not Step 2?
The error message just refers to the whole across() statement, so doesn't help to narrow down the root cause.
My hunch is that I have to reference the grouped data somehow, but I tried
se = sd(.)/sqrt(nrow(data) with no success.
Would be grateful for any hints...
Minimal reproducible example:
data <- structure(list(style = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("S1", "S2", "S3", "S4", "S5"), class = "factor"),
param = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), score = c(4,
1, 1, 3, 3, 3, 5, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
After many trial & error attempts, I found the solution myself. This is for everyone who is not yet familiar with the across function, as dplyr 1.0.0 is not yet released.
So the answer to my question is:
You must reference the grouped data with the . placeholder, BUT ONLY IF you use the purrr-style formula notation ~!
However, you must NOT pass . to n(), as n() takes no arguments.
The second point took endless trials to find out, and is the reason why I wanted to share this solution.
It may not be intuitive that, even though n() is written with parentheses, it never takes . as an argument, because it always refers to the current group.
This is what the double trick looks like:
data %>%
group_by(style) %>%
summarise(across(
score,
list(mean = mean, sd = sd, se = ~sd(.)/sqrt(n()))
))
If you know it, it's easy :-)
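A small base-R sketch of why Step 1 worked and Step 2 did not, as far as I can tell: in Step 1 the list entry se = sd is a function, while in Step 2 the expression sd/sqrt(...) tries to divide that function object by a number while the list is being built, before any data is involved. The helper name se_fun is made up for illustration.

```r
# The two S1 scores from the example data
score <- c(4, 3)

# Dividing the function object `sd` by a number fails immediately;
# capture the message instead of stopping the script
msg <- tryCatch(sd / 2, error = function(e) conditionMessage(e))
msg

# Wrapping the expression in a function (which is what ~ does inside across())
# delays evaluation until the group's values are supplied
se_fun <- function(x) sd(x) / sqrt(length(x))
se_fun(score)
```

Inside across(), n() plays the role that length(x) plays here.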
Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
Slightly different approach: only take factors you want, drop the factor variable (because you want to treat them as a group) and sum up all NUMBER values while grouping by all other variables. df is your data.
aggregate(formula = NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk will make a new column in df named group.sum, holding the sum of NUMBER within each group defined by group_by().
If you want to condense the data frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do this, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum  # overwrite NUMBER with the group total
df <- df[,-ncol(df)]       # drop group.sum, which is the last column
df <- unique(df)
A side note: I wouldn't recommend doing the above chunk because you lose information in your data, and your data is tidier with just the extra column group.sum.
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp and ran this code. It works for me.
# copy PROG, then relabel its 20 levels in their existing order:
# levels 4 to 8 (Grad.H, Grad2, Grad3, Grad2.Qual, Grad3.Qual) become "Grad",
# everything else becomes "Other"; repeated labels are merged into one level
temp$PROG2 <- temp$PROG
levels(temp$PROG2) <- c(rep("Other",3),rep("Grad",5),rep("Other",12))
temp
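Putting the levels() idea together with the summing step, here is a minimal end-to-end sketch in base R. The toy data frame uses the three counts from the question (50, 25, 2) plus a made-up "Basic" row, and it assumes the level names as they appear in the dput output ("Grad2", "Grad3", "Grad.H").

```r
# Toy data: the three Grad rows from the question plus one hypothetical other row
df <- data.frame(
  YEAR   = factor(rep("92/93", 4)),
  AGE    = factor(rep("20-24", 4)),
  PROG   = factor(c("Grad2", "Grad3", "Grad.H", "Basic")),
  NUMBER = c(50L, 25L, 2L, 999L)
)

# Relabel the three Grad levels; assigning a repeated label merges the levels
lv <- levels(df$PROG)
lv[lv %in% c("Grad2", "Grad3", "Grad.H")] <- "Grad"
levels(df$PROG) <- lv

# Keep only Grad, drop the unused levels, and sum NUMBER per YEAR/AGE
grad <- aggregate(
  NUMBER ~ YEAR + AGE + PROG,
  data = droplevels(subset(df, PROG == "Grad")),
  FUN  = sum
)
grad
```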
I have created a data frame from which I'm trying to obtain a subset using some criteria as outlined below. In simple terms, I'd like to write a function to select half of all male ids below the age of 4. I need this function subsequently to do other things for me hence the need for writing a function.
I've tried to write some simple code, given below, but an error comes back. I would really appreciate some help with this. I think the error is coming from the filter() call, but I have no idea why.
library(dplyr)
set.seed(2016)
longData <- data.frame(patient=rep(paste(letters[1:20], sep = "_", 1:20), each=5),
age=rep(runif(20, 1, 10), each = 5),var=runif(100, 50, 1000),
time=rep(1:5, 20), gender = factor(rep(c("male", "female"), c(60, 40))))
ptNo <- function(data, id = "patient")
{
ptNumber <- data %>% filter(!duplicated(data[id])) %>% nrow()
return(ptNumber)
}
randSubsetFun <- function(df, gender, seedNumb){
data <- filter(df, gender, age < 4)
set.seed(seedNumb)
n <- round(ptNo(data)/2)
fd <- sample_n(data, n)
return(fd)
}
randSubsetFun(df = longData, gender = "male", seedNumb = 1)
The message (a warning rather than an error) reads as follows:
Warning message:
In Ops.factor(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, :
‘&’ not meaningful for factors
The annoying thing is when I tried running these codes individually outside of the function, it worked fine. It just doesn't work within the function.
Any help would be warmly appreciated. Any suggestions as to how I can improve this code would be welcome too. I guess the problem really is how do I call a factor from the filter function via this function (If that makes sense)?
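One possible reading of the warning: inside the function, filter(df, gender, age < 4) passes the factor column gender itself as a condition rather than a comparison such as gender == "male", so the factor ends up combined with age < 4 via &. Below is a hedged base-R sketch of the intended selection. The argument name sex is hypothetical, chosen so it cannot be confused with the gender column, and sampling whole patient ids rather than rows is an assumption based on the stated goal.

```r
set.seed(2016)
longData <- data.frame(
  patient = rep(paste(letters[1:20], 1:20, sep = "_"), each = 5),
  age     = rep(runif(20, 1, 10), each = 5),
  var     = runif(100, 50, 1000),
  time    = rep(1:5, 20),
  gender  = factor(rep(c("male", "female"), c(60, 40)))
)

# sex: hypothetical argument name; compared explicitly against the gender column
randSubsetFun <- function(df, sex, seedNumb) {
  keep <- df[df$gender == sex & df$age < 4, ]
  ids  <- unique(keep$patient)
  set.seed(seedNumb)
  # sample half of the distinct patient ids, then return all of their rows
  chosen <- sample(ids, round(length(ids) / 2))
  keep[keep$patient %in% chosen, ]
}

res <- randSubsetFun(df = longData, sex = "male", seedNumb = 1)
res
```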
This is a more focussed question based on another question I have open at Vectorize/Speed up Code with Nested For Loops
Basically, I want to speed up the execution of this code. I was thinking of using one of the apply family of functions. The apply function would have to use/perform the following:
Input: loop over regions 1 to 10; vectors sed and borewidth with preallocated dimensions filled with NAs
Process: fill data in each of sed and borewidth in the manner implemented in the inner for loop
Output: sed and borewidth vectors
Assumptions (h/t Simon Urbanek): the begin, finish points of each row are contiguous, sequential and for each region, begin at 0.
Code is as below:
for (region in 1:10) {
# subset standRef and sample by region code
standRef.region <- standRef[which(standRef$region == region),]
sample.region <- sample[which(sample$region == region),]
for (i in 1:nrow(sample.region))
{
# create a dataframe - locations - that includes:
# 1) those indices of standRef.region in which the value of the location column is greater than the value of the ith row of the begin column of sample.region
# 2) those indices of standRef.region in which the value of the location column is less than the value of the ith row of the finish column of sample.region
locations <- standRef.region[which((standRef.region$location > sample.region$begin[i]) & (standRef.region$location < sample.region$finish[i])),]
sed[end_tracker:(end_tracker + nrow(locations))] <- sample.region$sed[i]
borewidth[end_tracker:(end_tracker + nrow(locations))] <- sample.region$borewidth[i]
# update end_tracker to the number of locations rows for this iteration
end_tracker <- end_tracker + nrow(locations)
}
cat("Finished region", region,"\n")
}
Sample data for borewidth and sed. Edit: corrected formatting error in dput
structure(list(region = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
begin = c(0L, 2253252L, 7091077L, 9120205L, 0L, 135094L,
941813L, 5901391L, 6061324L), finish = c(2253252L, 7091077L,
9120205L, 17463033L, 135094L, 941813L, 5901391L, 6061324L,
7092402L), sed = c(3.31830840984048, 1.38014704208403, 6.13049140975458,
2.10349875097134, 0.48170587509345, 0.13058713509175, 9.13509713513509,
6.13047153058701, 3.81734081501503), borewidth = c(3L, 5L,
2L, 1L, 1L, 1L, 2L, 4L, 4L)), .Names = c("region", "begin",
"finish", "sed", "borewidth"), class = "data.frame", row.names = c(NA,
-9L))
TIA.
With some extra assumptions based on the data you posted (incl. the other question), this is one way you could do it:
index <- unlist(lapply (unique(standRef$region), function(reg) {
reg.filter <- which(standRef$region == reg)
samp.filter <- which(sample$region == reg)
samp.filter[cut(standRef$location[reg.filter],c(0L,sample$finish[samp.filter]),labels=F)]
}))
sed <- sample$sed[index]
borewidth <- sample$borewidth[index]
The extra assumption is that your samples are contiguous, sequential (all your examples were) and start at 0. This allows us to use cut() on the $finish instead of treating each interval separately. One difference is that your code left gaps at the breaks, but I'm assuming that was not intentional.
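To make the cut() step concrete, here is a small sketch for a single region. The finish and sed values are the region 1 rows of the sample data; the location values are made up, since standRef was not posted here.

```r
# Region 1 finish points and sed values from the sample data
finish <- c(2253252L, 7091077L, 9120205L, 17463033L)
sed    <- c(3.31830840984048, 1.38014704208403, 6.13049140975458, 2.10349875097134)

# Hypothetical standRef locations for this region
location <- c(10L, 2253252L, 5000000L, 9120206L)

# With breaks 0, finish[1], finish[2], ... each location falls in an interval
# (begin, finish]; labels = FALSE returns the interval number instead of a factor
idx <- cut(location, breaks = c(0L, finish), labels = FALSE)
idx

# Index the per-interval values once, instead of looping over intervals
sed[idx]
```

Note that with the default right = TRUE a location equal to a finish point is assigned to the interval it closes, and a location of exactly 0 would come back as NA unless include.lowest = TRUE is set.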