I have a loop that should run about 300,000 times, but it ends at 55 when I bind data to a data frame and I have no clue what is happening.
The loop in question is:
TrendlineMeta <- data.frame("FutureRecord" = character(), "System" = numeric(), "Intercepts" = numeric(), "Slopes" = numeric(), stringsAsFactors = FALSE)
for (i in unique(TrendingData$FutureRecord)){
FilteredList <- TrendingData[TrendingData$FutureRecord == i,]
Regressed <- lm(FilteredList$Value ~ FilteredList$Time)#, na.action = na.omit)
newrow <- c("FutureRecord"=j, "System"=max(as.character(FilteredList$System)), "Intercepts"=summary(Regressed)$coefficients[1,1], "Slopes"=summary(Regressed)$coefficients[2,1])
TrendlineMeta <- rbind(TrendlineMeta, data.frame(as.list(newrow), stringsAsFactors = FALSE))
}
and ends after 55 iterations.
However, this loop:
TrendlineMeta <- data.frame("FutureRecord" = character(), "System" = numeric(), "Intercepts" = numeric(), "Slopes" = numeric(), stringsAsFactors = FALSE)
for (i in unique(TrendingData$FutureRecord)){
FilteredList <- TrendingData[TrendingData$FutureRecord == i,]
Regressed <- lm(FilteredList$Value ~ FilteredList$Time)#, na.action = na.omit)
#newrow <- c("FutureRecord"=j, "System"=max(as.character(FilteredList$System)), "Intercepts"=summary(Regressed)$coefficients[1,1], "Slopes"=summary(Regressed)$coefficients[2,1])
#TrendlineMeta <- rbind(TrendlineMeta, data.frame(as.list(newrow), stringsAsFactors = FALSE))
}
completes fine.
What about this am I doing wrong? I am new to R so nothing is jumping out at me.
So, this is just a stab at your issue, but it's a little difficult without seeing the underlying dataset. I'm using Hadley's purrr, tidyr, plyr and dplyr packages.
It may accomplish what you're trying to do without using the loop.
partA <- TrendingData %>%
split(.$FutureRecord) %>%
map(~ lm(Value ~ Time, data = .)) %>%
map(summary) %>%
map("coefficients") %>%
map(data.frame) %>%
map(~ select(.x, Estimate) %>%
mutate(coef = row.names(.))) %>%
ldply(rbind) %>%
rename(FutureRecord = .id) %>%
spread(coef, Estimate)
From here,
partB <- TrendingData %>%
select(FutureRecord, System) %>%
group_by(FutureRecord) %>%
filter(System == max(System)) %>%
ungroup() %>%
distinct() # keep one row per FutureRecord so the join below does not duplicate rows
Then,
left_join(partA, partB)
Does that work?
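One possible explanation for the early stop, which cannot be confirmed without the data: if some FutureRecord has only one usable row, lm() cannot estimate a slope, summary(Regressed)$coefficients then has a single row, and indexing it with [2,1] raises a "subscript out of bounds" error that ends the loop. Below is a minimal base-R sketch on made-up data (the column names follow the question, the values are invented). It uses coef(), which keeps an NA for a slope that could not be estimated, and builds the result once instead of growing it with rbind() inside the loop.

```r
# Invented example data: records A and B have five rows each, C has only one.
TrendingData <- data.frame(
  FutureRecord = c(rep("A", 5), rep("B", 5), "C"),
  System       = c(rep(1, 5), rep(2, 5), 3),
  Time         = c(1:5, 1:5, 1),
  Value        = c(2 * (1:5) + 1, 3 * (1:5) - 2, 7),
  stringsAsFactors = FALSE
)

# Fit one regression per record and collect one result row per record.
rows <- lapply(unique(TrendingData$FutureRecord), function(rec) {
  grp <- TrendingData[TrendingData$FutureRecord == rec, ]
  fit <- lm(Value ~ Time, data = grp)
  cf  <- coef(fit)  # slope is NA (not missing) when it cannot be estimated
  data.frame(
    FutureRecord = rec,
    System       = max(grp$System),
    Intercepts   = unname(cf[1]),
    Slopes       = unname(cf[2]),
    stringsAsFactors = FALSE
  )
})

# Bind all rows in a single step.
TrendlineMeta <- do.call(rbind, rows)
print(TrendlineMeta)
```

Record C still gets a row here, with an NA slope, so the problematic groups can be inspected afterwards instead of stopping the loop.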
Not sure if you all will be able to help me without reproducible example data, but I have a problem with running the code below. I am attempting to use the multidplyr package, but it doesn't seem to find my columns. I am running the code below:
cl <- detectCores()
cl
models_prep <-
bookings_prep %>%
inner_join(pipeline_prep_, by = c("booking_type", "group")) %>%
crossing(biz_day) %>%
left_join(closed_pipeline, by = c("booking_type", "group")) %>%
select(-opportunity_forecast_category)
group1 <- rep(1:cl, length.out = nrow(models_prep))
models_prep1 <- bind_cols(tibble(group1), models_prep)
cluster <- new_cluster(cl)
cluster %>%
cluster_library("tidyr")
cluster %>%
cluster_library("purrr")
cluster %>%
cluster_library("plyr")
cluster %>%
cluster_library("dplyr")
cluster_copy(cluster, "rmf")
cluster_copy(cluster, "fc_xreg")
#cluster_assign(cluster, "rmf")
#cluster_copy(cluster,c("rmf","fc_xreg"))
by_group <- models_prep %>%
group_by(group) %>%
partition(cluster)
by_group1 <- models_prep1 %>%
group_by(group1) %>%
partition(cluster)
models <- by_group %>%
mutate(
xreg_arima = pmap(list(data = pipeline, h = 1,name = group, bookings = bookings, type = booking_type,
biz_day = biz_day, no_bookings = no_bookings,
sparse_pipeline = sparse_pipeline,
closed_forecast_cat = pipeline_amount, FUN = "fc_xreg"), rmf))
Everything runs up to models <- correctly, but it fails there saying it cannot find the object group. Here is what the by_group data frame looks like.
Sometimes arguments just need to be quoted, particularly in dplyr-ish situations.
models <- by_group %>%
mutate(
xreg_arima = pmap(list(data = pipeline, h = 1,name = "group", bookings = "bookings", type = "booking_type",
biz_day = "biz_day", no_bookings = "no_bookings",
sparse_pipeline = "sparse_pipeline",
closed_forecast_cat = "pipeline_amount", FUN = "fc_xreg"), rmf))
I'm wondering if the following code can be simplified to allow the data to be piped directly from the summarise command to the pairwise.t.test, without creating the intermediary object?
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT))
pairwise.t.test(x = data_for_PTT$meanRT, g = data_for_PTT$TT, paired = TRUE)
I tried x = .$meanRT but it didn't like it, returning:
Error in match.arg(p.adjust.method) :
'arg' must be NULL or a character vector
You can use curly braces:
data_for_PTT <- data %>%
group_by(subj, TT) %>%
summarise(meanRT = mean(RT)) %>%
{pairwise.t.test(x = .$meanRT, g = .$TT, paired = TRUE)}
Reproducible:
df <- data.frame(X1 = runif(1000), X2 = runif(1000), subj = rep(c("A", "B")))
df %>%
{pairwise.t.test(.$X1, .$subj, paired = TRUE)}
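As for why the braces matter: when the dot appears only inside nested expressions such as .$meanRT, magrittr still inserts the left-hand side as the first unnamed argument. In the original attempt x, g and paired were all named, so the data frame ended up in the next free parameter, p.adjust.method, which is what match.arg complained about. Wrapping the call in braces switches that insertion off. A small sketch of the same mechanism, assuming the magrittr package is installed and using an invented data frame with tapply() in place of pairwise.t.test():

```r
library(magrittr)

# Invented data: two groups with two scores each.
df <- data.frame(score = c(1, 2, 3, 4), grp = c("a", "a", "b", "b"))

# Without braces, df %>% tapply(.$score, .$grp, mean) would be evaluated as
# tapply(df, df$score, df$grp, mean), because df is also inserted as the
# first argument. With braces, only the explicit dots are filled in:
means <- df %>% { tapply(.$score, .$grp, mean) }
print(means)
```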
I have built a function which seems to work, but I don't understand why.
My initial problem was to take a data.frame which contains counts of a population and expand it to re-create the original population. This is easy enough if you know the column names in advance.
library(tidyverse)
set.seed(121)
test_counts <- tibble(Population = letters[1:4], Length = c(1,1,2,1),
Number = sample(1:100, 4))
expand_counts_v0 <- function(Length, Population, Number) {
tibble(Population = Population,
Length = rep(Length, times = Number))
}
test_counts %>% pmap_dfr(expand_counts_v0) %>% # apply it
group_by(Population, Length) %>% # test it
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
However, I wanted to generalise it to a function which didn't need to know the column names of the data.frame, and I'm interested in NSE, so I wrote:
test_counts1 <- tibble(Population = letters[1:4],
Length = c(1,1,2,1),
Number = sample(1:100, 4),
Height = c(100, 50, 45, 90),
Width = c(700, 50, 60, 90)
)
expand_counts_v1 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% pmap_dfr(make_tbl)
}
But, when I test this function it seems to duplicate rows 4 times:
test_counts %>% expand_counts_v1(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ sum(.$Number)/sum(test_counts$Number)}
# [1] 4
This led me to guess a solution, which was
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
make_tbl <- function(...) {
expr(tibble(!!!cols)) %>% eval(envir = df)
}
df %>% make_tbl
}
This seems to work:
test_counts %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts)}
# [1] TRUE
test_counts1 %>% expand_counts_v2(count = Number) %>%
group_by(Population, Length, Height, Width) %>%
summarise(Number = n()) %>%
ungroup %>%
{ all.equal(., test_counts1)}
# [1] TRUE
But I don't understand why. How is it evaluating for each row, even though I'm not using pmap anymore? The function needs to be applied to each row in order to work, so it must be somehow, but I can't see how it's doing that.
EDIT
After Artem's correct explanation of what was going on, I realised I could do this
expand_counts_v2 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
expr(tibble(!!!cols)) %>% eval_tidy(data = df)
}
Which gets rid of the unnecessary make_tbl function. However, as Artem said, that only really works because rep is vectorised. So it works, but not by rewriting the _v0 function and pmapping it, which is the process I was trying to replicate. Eventually I discovered rlang::new_function and wrote:
expand_counts_v3 <- function(df, count = NULL) {
countq <- enexpr(count)
names <- df %>% select(-!!countq) %>% names
namesq <- names %>% map(as.name)
cols <- map(namesq, ~ expr(rep(!!., times = !!countq))
) %>% set_names(namesq)
all_names <- df %>% names %>% map(as.name)
args <- rep(0, times = length(all_names)) %>% as.list %>% set_names(all_names)
correct_function <- new_function(args, # this makes the function as in _v0
expr(tibble(!!!cols)) )
pmap_dfr(df, correct_function) # applies it as in _v0
}
which is longer, and probably uglier, but works the way I originally wanted.
The issue is in eval( envir = df ), which exposes the entire data frame to make_tbl(). Notice that you never use the ... argument inside make_tbl(). Instead, the function effectively computes the equivalent of
with( df, tibble(Population = rep(Population, times = Number),
Length = rep(Length, times=Number)) )
regardless of what arguments you provide to it. When you call the function via pmap_dfr(), it essentially computes the above four times (once for each row) and concatenates the results by-row, resulting in the duplication of entries you've observed. When you remove pmap_dfr(), the function is called once, but since rep is itself vectorized (try doing rep( test_counts$Population, test_counts$Number ) to see what I mean), make_tbl() computes the entire result in one go.
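The effect described above can be reproduced in base R. This is only an illustrative sketch: the counts data frame is invented, expand_all() is a hypothetical stand-in for make_tbl(), and lapply() plus rbind() stand in for pmap_dfr().

```r
# Invented counts table with three rows.
counts <- data.frame(
  Population = c("a", "b", "c"),
  Length     = c(1, 1, 2),
  Number     = c(2, 1, 3),
  stringsAsFactors = FALSE
)

# rep() is vectorised over `times`, so one call expands every row at once.
rep(counts$Population, times = counts$Number)

# Like make_tbl(), this function ignores `...` and always sees the whole table.
expand_all <- function(...) {
  with(counts, data.frame(
    Population = rep(Population, times = Number),
    Length     = rep(Length, times = Number)
  ))
}

# Called once: the correct expansion.
once <- expand_all()

# Called once per row and row-bound: one full copy of the expansion per row.
per_row <- do.call(rbind, lapply(seq_len(nrow(counts)), function(i) {
  expand_all(counts[i, ])
}))

nrow(once)     # 6, the sum of Number
nrow(per_row)  # 18, three copies of the 6-row result
```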
I have a question regarding indexing a data frame in R. This is the code:
Gewicht <- data %>%
group_by(data[[376]]) %>%
summarise(weights = mean(data[[10190]], na.rm = TRUE))
Gewicht2 <- data %>%
group_by(data[[376]]) %>%
summarise(weights = mean(Weights, na.rm = TRUE))
a <- seq(1:10)
b <- rep(c("male", "female"),5)
c <- seq(1:10)
data <- as.data.frame(cbind(a,b,c))
data$c <- as.numeric(data$c)
newdata <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(c, na.rm = TRUE))
newdata2 <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(data[[3]], na.rm = TRUE))
print(newdata)
print(newdata2)
I get different results for the two data frames. The desired result is the one in "newdata". Can you tell me why I get different values for these two calculations?
I need brackets for a more complex custom function, but it seems to compute the mean over the whole data frame, where I would hope to get the mean for each group.
How do I use [] or [[]] correctly here?
a <- c(1,2,3,4,5,6,7,8,9,10)
b <- rep(c("male", "female"),5)
c <- c(1,2,3,4,5,6,7,8,9,10)
data <- data.frame(cbind(a,b,c))
data$c <- as.numeric(as.character(data$c))
c
data$c
print(newdata)
print(newdata2)
newdata <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(c, na.rm = TRUE))
newdata2 <- data %>%
group_by(data[[2]]) %>%
summarise(Mean = mean(data[[3]], na.rm = TRUE))
newdata
newdata2
updated code, still different results :(
Gewicht <- aggregate(data[[varGewicht]], by=list(data[[varx]]), FUN=mean, na.rm = TRUE)
Aggregate function works :-)
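On the why: inside summarise(), data[[3]] refers to the complete column of the original data frame rather than to the rows of the current group, so every group receives the overall mean. One way to keep bracket-style indexing in dplyr is to translate the position into a column name and use the .data pronoun. The sketch below assumes dplyr 1.0 or later; group_col and value_col are hypothetical helper names. The data frame is built with data.frame() directly, because cbind() on mixed numeric and character vectors turns everything into character.

```r
library(dplyr)

# Same shape as the example above, but c stays numeric from the start.
data <- data.frame(
  a = 1:10,
  b = rep(c("male", "female"), 5),
  c = as.numeric(1:10)
)

# Translate column positions into names once.
group_col <- names(data)[2]
value_col <- names(data)[3]

# .data[[value_col]] is evaluated per group.
by_group <- data %>%
  group_by(across(all_of(group_col))) %>%
  summarise(Mean = mean(.data[[value_col]], na.rm = TRUE))

# data[[3]] is the full column, so each group gets the overall mean.
whole_column <- data %>%
  group_by(across(all_of(group_col))) %>%
  summarise(Mean = mean(data[[3]], na.rm = TRUE))

print(by_group)
print(whole_column)
```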
I have this code that works but I would like to learn how to replace it by a function.
library(tidyverse)
l1_1617 <- read.csv("http://www.football-data.co.uk/mmz4281/1617/F1.csv", stringsAsFactors = FALSE)
l1_1516 <- read.csv("http://www.football-data.co.uk/mmz4281/1516/F1.csv", stringsAsFactors = FALSE)
l1_1415 <- read.csv("http://www.football-data.co.uk/mmz4281/1415/F1.csv", stringsAsFactors = FALSE)
l1_1314 <- read.csv("http://www.football-data.co.uk/mmz4281/1314/F1.csv", stringsAsFactors = FALSE)
l1_1617_sel <- l1_1617 %>%
select(Date:AST) %>%
mutate(season = 1617)
l1_1516_sel <- l1_1516 %>%
select(Date:AST) %>%
mutate(season = 1516)
l1_1415_sel <- l1_1415 %>%
select(Date:AST) %>%
mutate(season = 1415)
l1_1314_sel <- l1_1314 %>%
select(Date:AST) %>%
mutate(season = 1314)
l1_1317 <- bind_rows(l1_1617_sel, l1_1516_sel, l1_1415_sel, l1_1314_sel)
For the first step I have tried something like this but it obviously failed:
dl_l1 <-function(x){
df_x <- read.csv("http://www.football-data.co.uk/mmz4281/x/F1.csv", stringsAsFactors = FALSE)
}
dl_l1(1617)
The x inside the quoted URL is a literal character, not the function argument, so you need paste0() to build the URL from x. The code below should work.
dl_l1 <-function(x){
read.csv(paste0("http://www.football-data.co.uk/mmz4281/",x,"/F1.csv"), stringsAsFactors = FALSE) %>%
select(Date:AST) %>%
mutate(season = x)
}
dl_l1(1617)
#final output
l1_1317 <- bind_rows(dl_l1(1617), dl_l1(1516), dl_l1(1415), dl_l1(1314))
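The URL-building step can be checked on its own without downloading anything. In this small sketch season_url() is a hypothetical helper name; paste0() is vectorised, so a vector of seasons gives a vector of URLs.

```r
# Build the download URL for one or more season codes.
season_url <- function(season) {
  paste0("http://www.football-data.co.uk/mmz4281/", season, "/F1.csv")
}

# The original attempt: the x is part of the string and is never replaced.
literal <- "http://www.football-data.co.uk/mmz4281/x/F1.csv"

urls <- season_url(c(1617, 1516))
print(urls)
```

If a season code starts with a zero, pass it as a character string, since a number would lose the leading zero.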
library(tidyverse)
ids <- as.character(c(1617, 1516, 1415, 1314))
data <- lapply(ids, function(i) {
read.csv(paste0("http://www.football-data.co.uk/mmz4281/", i ,"/F1.csv"), stringsAsFactors = FALSE) %>%
select(Date:AST) %>%
mutate(season = i)
})
data <- do.call(rbind, data)
I would put a for loop inside a function so that you can iterate over a vector of season codes.
The function football() takes a number or a vector of numbers and starts from an empty data.frame. For each number in the vector, it pastes the number into the URL, reads the file, and adds the number as a year column with mutate(). Each result is appended to football_df with bind_rows(), and the combined football_df is returned at the end.
library(dplyr)
football <- function(numbers){
football_df <- data.frame()
for (i in seq_along(numbers)){
df <- read.csv(paste("http://www.football-data.co.uk/mmz4281/",numbers[i],"/F1.csv", sep=""), stringsAsFactors = FALSE) %>%
mutate(year = numbers[i])
football_df <- bind_rows(football_df, df)
}
return(football_df)
}
years <- c(1617, 1415, 1314)
final_df <- football(years)