Difference between indexing with $ and [[]] in R

I have a question regarding indexing a data frame in R. This is the code:
Gewicht <- data %>%
group_by(data[[376]]) %>%
summarise(weights = mean(data[[10190]], na.rm = TRUE))
Gewicht2 <- data %>%
group_by(data[[376]]) %>%
summarise(weights = mean(Weights, na.rm = TRUE))
a <- seq(1:10)
b <- rep(c("male", "female"),5)
c <- seq(1:10)
data <- as.data.frame(cbind(a,b,c))
data$c <- as.numeric(data$c)
newdata <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(c, na.rm = TRUE))
newdata2 <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(data[[3]], na.rm = TRUE))
print(newdata)
print(newdata2)
I get different results for the two data frames. The desired result is the one in "newdata". Can you tell me why I get different values for these two calculations?
I need brackets for a more complex custom function, but it seems to write the mean of the whole data frame, where I would hope to get the mean for each group.
How do I use [] or [[]] correctly here?

a <- c(1,2,3,4,5,6,7,8,9,10)
b <- rep(c("male", "female"),5)
c <- c(1,2,3,4,5,6,7,8,9,10)
data <- data.frame(cbind(a,b,c))
data$c <- as.numeric(as.character(data$c))
c
data$c
print(newdata)
print(newdata2)
newdata <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(c, na.rm = TRUE))
newdata2 <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(data[[3]], na.rm = TRUE))
newdata
newdata2
updated code, still different results :(

Gewicht <- aggregate(data[[varGewicht]], by=list(data[[varx]]), FUN=mean, na.rm = TRUE)
Aggregate function works :-)
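
A likely explanation for the difference: inside summarise(), a bare column name such as c is looked up in the current group, whereas data[[3]] refers to the full, ungrouped data frame, so every group receives the overall mean. The sketch below assumes dplyr 1.0 or later and uses the .data pronoun and across(all_of()) to select columns by name while still respecting the grouping. The variable names group_col and value_col are illustrative, not part of the original code.

```r
library(dplyr)

data <- data.frame(
  a = 1:10,
  b = rep(c("male", "female"), 5),
  c = as.numeric(1:10)
)

# data[[3]] is the whole column, regardless of grouping:
# every group gets the overall mean
whole <- data %>%
  group_by(b) %>%
  summarise(Mean = mean(data[[3]], na.rm = TRUE))

# Look columns up by position via their names, then let dplyr
# resolve them inside each group with the .data pronoun
group_col <- names(data)[2]
value_col <- names(data)[3]

per_group <- data %>%
  group_by(across(all_of(group_col))) %>%
  summarise(Mean = mean(.data[[value_col]], na.rm = TRUE))

print(whole)
print(per_group)
```

Here whole repeats the overall mean for both groups, while per_group holds one mean per group.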

Related

Can you pipe data into a pairwise.t.test?

I'm wondering if the following code can be simplified to allow the data to be piped directly from the summarise command to the pairwise.t.test, without creating the intermediary object?
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT))
pairwise.t.test(x = data_for_PTT$meanRT, g = data_for_PTT$TT, paired = TRUE)
I tried x = .$meanRT but it didn't like it, returning:
Error in match.arg(p.adjust.method) :
'arg' must be NULL or a character vector
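When the dot appears only inside a nested expression such as .$meanRT, magrittr still passes the left-hand side as the first unnamed argument, so the data frame ends up matched to p.adjust.method. Wrapping the call in curly braces suppresses that: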
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT)) %>%
{pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}
Reproducible:
df <- data.frame(X1 = runif(1000), X2 = runif(1000), subj = rep(c("A", "B")))
df %>%
{pairwise.t.test(.$X1, .$subj, paired = TRUE)}
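
To see the effect of the braces in isolation, here is a small sketch. count_args is a hypothetical helper that only reports how many arguments it received; the data frame is made up for illustration.

```r
library(magrittr)

# Hypothetical helper: returns the number of arguments it was called with
count_args <- function(...) length(list(...))

df <- data.frame(value = c(1, 2, 3, 4), grp = c("a", "a", "b", "b"))

# The dot is only used inside a nested call, so df is still
# inserted as the first argument: count_args(df, df$value)
without_braces <- df %>% count_args(.$value)

# Braces turn the right-hand side into a plain expression,
# so only the arguments written explicitly are passed
with_braces <- df %>% { count_args(.$value) }

print(without_braces)
print(with_braces)
```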

Apply different data to a function in R

I have the following data frame:
library(tidyverse)
set.seed(1234)
df <- data.frame(
x = seq(1, 100, 1),
y = rnorm(100)
)
Where I apply a smooth spline using different knots:
nknots <- seq(4, 15, 1)
output <- map(nknots, ~ smooth.spline(x = df$x, y = df$y, nknots = .x))
What I need to do now is to apply the same function using 2-point and 3-point averages:
# funs() is deprecated in newer dplyr; passing mean directly is equivalent here
df_2 <- df %>%
group_by(., x = round(.$x/2)*2) %>%
summarise_all(mean)
df_3 <- df %>%
group_by(., x = round(.$x/3)*3) %>%
summarise_all(mean)
In summary, I need to apply the function I used in output with the following data frames:
df
df_2
df_3
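Of course, this is a minimal example, so I am looking for an efficient way of doing it, preferably with the purrr package.
You can use lapply together with the zoo package to compute the moving averages more simply. Note that rollmean computes an overlapping moving average, which is not identical to the binned averages in the question: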
library(zoo)
lapply(1:3,function(roll){
dftemp <- as.data.frame(rollmean(df,roll))
map(nknots, ~ smooth.spline(x = dftemp$x, y = dftemp$y, nknots = .x))
})
Here's one possible solution:
library(tidyverse)
set.seed(1234)
df <- data.frame(x = seq(1, 100, 1),
y = rnorm(100))
# function to get v-point averages
GetAverages = function(v) {
df %>%
group_by(., x = round(.$x/v)*v) %>%
summarise_all(mean) } # funs() is deprecated in newer dplyr
# specify number of knots
nknots <- seq(4, 15, 1)
dt_res = tibble(v=1:3) %>% # specify v-point averages
mutate(d = map(v, GetAverages)) %>% # get data for each v-point
crossing(., data.frame(nknots=nknots)) %>% # combine each dataset with a knot
mutate(res = map2(d, nknots, ~smooth.spline(x = .x$x, y = .x$y, nknots = .y))) # apply smooth spline
You can use dt_res$res[dt_res$v == 1] to see all results for your original dataset, dt_res$res[dt_res$v == 2] to see results for your 2-point averages, etc.
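
As an alternative sketch, the same idea can be written as two nested purrr::map calls over a list of data frames. bin_average is a hypothetical helper introduced here for illustration; it uses base aggregate() instead of dplyr for the binning.

```r
library(purrr)

set.seed(1234)
df <- data.frame(x = 1:100, y = rnorm(100))

# Hypothetical helper: average y within bins of width v along x
bin_average <- function(d, v) {
  aggregate(y ~ x, data = transform(d, x = round(x / v) * v), FUN = mean)
}

# One data frame per averaging width (v = 1 leaves the data unchanged)
datasets <- map(1:3, ~ bin_average(df, .x))

nknots <- 4:15

# Outer map: data sets; inner map: number of knots
fits <- map(datasets, function(d) {
  map(nknots, ~ smooth.spline(x = d$x, y = d$y, nknots = .x))
})
```

fits[[2]][[1]] would then be the fit for the 2-point averages with 4 knots.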

How to get formulas of multiple regressions by vectorizing

Suppose I have the following code that makes multiple regressions and stores the lm and lm with stepwise selection models in tibbles:
library(dplyr)
library(tibble)
library(MASS)
set.seed(1)
df <- data.frame(A = sample(3, 10, replace = T),
B = sample(100, 10, replace = T),
C = sample(100, 10, replace = T))
df <- df %>% arrange(A)
formula_df <- as_tibble(NA) # as.tibble() is deprecated in favour of as_tibble()
aic_df <- as_tibble(NA)
for (i in unique(df$A)){
temp <- df %>% filter(A == i)
formula_df[i, 1] <- temp %>%
do(model = lm(B ~ C, data = .))
aic_df[i, 1] <- temp %>%
do(model = stepAIC(formula_df[[1,1]], direction = "both", trace = F))
}
Is it possible to vectorize to make it faster, for example using the *pply functions? The loop becomes extremely slow when the data gets larger. Thank you in advance.
You could try something like:
model <- df %>% group_by(A) %>%
summarise(formula_model = list(lm(B ~ C))) %>%
mutate(aic_model = list(stepAIC(.[[1,2]], direction = "both", trace = F)))
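
Another option, sketched here under the assumption that one model per level of A is wanted, is to split the data and fit both models inside a single function, so that stepAIC can still find the data if it needs to refit. The example data are made up, and whether this is faster than the loop on large data is not something this sketch demonstrates.

```r
library(MASS)

set.seed(1)
df <- data.frame(
  A = rep(1:3, each = 10),
  B = sample(100, 30, replace = TRUE),
  C = sample(100, 30, replace = TRUE)
)

# Fit lm and the stepwise-selected model for each group in one pass
models <- lapply(split(df, df$A), function(d) {
  fit <- lm(B ~ C, data = d)
  list(
    lm_model  = fit,
    aic_model = stepAIC(fit, direction = "both", trace = FALSE)
  )
})

# Formula chosen by stepAIC for each group
aic_formulas <- lapply(models, function(m) formula(m$aic_model))
```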

Use variable names in function in dplyr for sum and cumsum

dplyr programming question here. I am trying to write a dplyr function that takes column names as inputs and also filters on a value passed to the function. What I am trying to recreate is the following, called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked at some examples of using enquo, and this thread gets me halfway there:
Use variable names in functions of dplyr
But the issue is that I get two different data frame outputs for test and test2. The output from the function does not have data from the logical column referenced by y.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
This was based on the thread: Pass a data.frame column name to a function
But the output in my test2 column is also wildly different, and it seems to do some kind of recursive accumulation, which again differs from my test data frame.
If anyone can help, that would be much appreciated.
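
Two things probably explain the differences. In the first attempt, !!cumulative_y == TRUE is parsed as !!(cumulative_y == TRUE), because ! has lower precedence than ==. Writing (!!cumulative_y) == TRUE avoids that. In the second attempt, data[[cumulative_y]] refers to the full, ungrouped column, so each group sums over all rows. The sketch below assumes a dplyr and rlang version that supports the {{ }} (embrace) operator, which sidesteps the precedence problem. The seed and the as.Date() conversion are additions for illustration.

```r
library(dplyr)

set.seed(42)
my_df <- data.frame(
  x = sample(1:100, 10),
  y = sample(c(TRUE, FALSE), 10, replace = TRUE),
  date = seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by = 1)
)

cumsum_df <- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
  data %>%
    group_by({{ date_field }}) %>%
    # {{ }} unquotes the column before the comparison is evaluated
    summarise(total = n(), total_2 = sum({{ cumulative_y }} == TRUE, na.rm = TRUE)) %>%
    mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
    ungroup() %>%
    filter({{ date_field }} >= as.Date(minimum_date))
}

test2 <- cumsum_df(my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
print(test2)
```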

For Loop Ending Early - Why?

I have a loop that should run about 300,000 times, but it ends at 55 when I bind data to a data frame and I have no clue what is happening.
The loop in question is:
TrendlineMeta <- data.frame("FutureRecord" = character(), "System" = numeric(), "Intercepts" = numeric(), "Slopes" = numeric(), stringsAsFactors = FALSE)
for (i in unique(TrendingData$FutureRecord)){
FilteredList <- TrendingData[TrendingData$FutureRecord == i,]
Regressed <- lm(FilteredList$Value ~ FilteredList$Time)#, na.action = na.omit)
newrow <- c("FutureRecord"=j, "System"=max(as.character(FilteredList$System)), "Intercepts"=summary(Regressed)$coefficients[1,1], "Slopes"=summary(Regressed)$coefficients[2,1])
TrendlineMeta <- rbind(TrendlineMeta, data.frame(as.list(newrow), stringsAsFactors = FALSE))
}
and ends after 55 iterations.
However, this loop:
TrendlineMeta <- data.frame("FutureRecord" = character(), "System" = numeric(), "Intercepts" = numeric(), "Slopes" = numeric(), stringsAsFactors = FALSE)
for (i in unique(TrendingData$FutureRecord)){
FilteredList <- TrendingData[TrendingData$FutureRecord == i,]
Regressed <- lm(FilteredList$Value ~ FilteredList$Time)#, na.action = na.omit)
#newrow <- c("FutureRecord"=j, "System"=max(as.character(FilteredList$System)), "Intercepts"=summary(Regressed)$coefficients[1,1], "Slopes"=summary(Regressed)$coefficients[2,1])
#TrendlineMeta <- rbind(TrendlineMeta, data.frame(as.list(newrow), stringsAsFactors = FALSE))
}
completes fine.
What am I doing wrong here? I am new to R, so nothing is jumping out at me.
So, this is just a stab at your issue, but it's a little difficult without seeing the underlying dataset. I'm using Hadley's purrr, tidyr, plyr and dplyr packages.
It may accomplish what you're trying to do without using the loop.
partA <- TrendingData %>%
split(.$FutureRecord) %>%
map(~ lm(Value ~ Time, data = .)) %>%
map(summary) %>%
map("coefficients") %>%
map(data.frame) %>%
map(~ select(.x, Estimate) %>%
mutate(coef = row.names(.))) %>%
ldply(rbind) %>%
rename(FutureRecord = .id) %>%
spread(coef, Estimate)
From here,
partB <- TrendingData %>%
select(FutureRecord, System) %>%
group_by(FutureRecord) %>%
filter(System == max(System)) %>%
ungroup
Then,
left_join(partA, partB)
Does that work?
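
A base R sketch of the same idea, without the extra packages, is shown below. The TrendingData here is a small made-up stand-in, since the real data set is not shown. Two details of the original loop may be worth checking: newrow uses j although the loop variable is i, and c() with mixed character and numeric values coerces everything to character. Building each row as a data.frame avoids the second issue.

```r
# Hypothetical stand-in for the real TrendingData
TrendingData <- data.frame(
  FutureRecord = rep(c("r1", "r2", "r3"), each = 4),
  System = rep(c(1, 2, 3), each = 4),
  Time = rep(1:4, 3),
  Value = c(1, 3, 5, 7, 2, 2.5, 3, 3.5, 10, 8, 6, 4),
  stringsAsFactors = FALSE
)

# One regression per FutureRecord, each returning a one-row data frame
rows <- lapply(split(TrendingData, TrendingData$FutureRecord), function(d) {
  fit <- lm(Value ~ Time, data = d)
  data.frame(
    FutureRecord = d$FutureRecord[1],
    System = max(d$System),
    Intercepts = unname(coef(fit)[1]),
    Slopes = unname(coef(fit)[2]),
    stringsAsFactors = FALSE
  )
})

# Bind all rows once at the end instead of growing the data frame in the loop
TrendlineMeta <- do.call(rbind, rows)
print(TrendlineMeta)
```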
