Iteratively rbind 10% of the data from data frame and plotting - r

I have three data frames, each with one column but a different number of rows: 100, 100, and 1000 for df1, df2, and df3 respectively. I want to rbind them iteratively and repeatedly calculate measures such as the mean on small chunks of the data, taking 10% of each data frame at a time. That is, the first iteration should use 10 rows from df1, 10 from df2, and 100 from df3; I need the mean of that set, and the process should repeat 10 times. I also need to plot the mean (y-axis) against the iteration and get an overall mean with this procedure. Any suggestions?
df1<- data.frame(A=c(1:100))
df2<- data.frame(A=c(1:100))
df3<- data.frame(A=c(1:1000))
library(dplyr)
for i in (1:10)
{ df[i]<- rbind_list(df1,df2,df3)
mean=mean(df$A)}

You're making things complicated by trying to keep separate data frames. Add a "group" column---call it "iteration" if you prefer---and get your data in one data frame:
df1$group = rep(1:10, each = nrow(df1) / 10)
df2$group = rep(1:10, each = nrow(df2) / 10)
df3$group = rep(1:10, each = nrow(df3) / 10)
df = rbind(df1, df2, df3)
means = group_by(df, group) %>% summarize(means = mean(A))
means
# Source: local data frame [10 x 2]
#
# group means
# 1 1 43
# 2 2 128
# 3 3 213
# 4 4 298
# 5 5 383
# 6 6 468
# 7 7 553
# 8 8 638
# 9 9 723
# 10 10 808
Your overall mean is mean(df$A). You can plot with with(means, plot(group, means)).
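For reference, here is a minimal base-R sketch of the same idea, including the plot. The helper name add_group and the use of tapply() instead of dplyr are my own choices for illustration, and it assumes every row count divides evenly by 10:

```r
# Example data from the question
df1 <- data.frame(A = 1:100)
df2 <- data.frame(A = 1:100)
df3 <- data.frame(A = 1:1000)

# Hypothetical helper: label consecutive 10% chunks 1..n_chunks
# (assumes nrow(d) is divisible by n_chunks)
add_group <- function(d, n_chunks = 10) {
  d$group <- rep(seq_len(n_chunks), each = nrow(d) / n_chunks)
  d
}

df <- rbind(add_group(df1), add_group(df2), add_group(df3))

# Mean of A within each chunk, and the overall mean
chunk_means <- tapply(df$A, df$group, mean)
overall_mean <- mean(df$A)

# Mean per iteration, with the overall mean as a dashed reference line
plot(seq_along(chunk_means), chunk_means, type = "b",
     xlab = "Iteration", ylab = "Mean of A")
abline(h = overall_mean, lty = 2)
```

Because every chunk has the same size here (120 rows), the average of the chunk means equals the overall mean; that would not hold with unequal chunks.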
Edits:
If the groups don't come out exactly, here's how I'd assign the group column. Make sure your dplyr is up to date: this uses the .id argument of bind_rows(), which was added in version 0.4.3.
library(dplyr)
# requires dplyr >= 0.4.3
df = bind_rows(df1, df2, df3, .id = "id")
df = df %>% group_by(id) %>%
mutate(group = (0:(n() - 1)) %/% (n() / 10) + 1)
The id column tells you which data frame the row came from, and the group column splits it into 10 groups. The rest of the code from above should work just fine.
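To see what the integer-division formula does when the row count is not a multiple of 10, here is a small sketch in base R. The function name assign_groups and the example size of 25 rows are made up for illustration:

```r
# Same formula as in the mutate() call, written as a plain function.
# n is the number of rows; row i (0-based) goes to group floor(i / (n / n_groups)) + 1.
assign_groups <- function(n, n_groups = 10) {
  (0:(n - 1)) %/% (n / n_groups) + 1
}

g <- assign_groups(25)

# Tabulate how many rows land in each group
group_sizes <- table(g)
print(group_sizes)
```

With 25 rows the groups cannot all be the same size, so the formula hands out a mix of 2-row and 3-row groups while still producing exactly 10 groups.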

Related

How to partition into equal sum subsets in R?

I have a dataset with a column, X1, of various values. I would like to order this dataset by the value of X1, and then partition into K number of equal sum subsets. How can this be accomplished in R? I am able to find quartiles for X1 and append the quartile groupings as a new column to the dataset, however, quartile is not quite what I'm looking for. Thank you in advance!
df <- data.frame(replicate(10,sample(0:1000,1000,rep=TRUE)))
df <- within(df, quartile <- as.integer(cut(X1, quantile(X1, probs=0:4/4), include.lowest=TRUE)))
Here's a rough solution (using set.seed(47) if you want to reproduce exactly). I calculate the proportion of the sum for each row, and do the cumsum of that proportion, and then cut that into the desired number of buckets.
library(dplyr)
n_groups = 10
df %>% arrange(X1) %>%
mutate(
prop = X1 / sum(X1),
cprop = cumsum(prop),
bins = cut(cprop, breaks = n_groups - 1)
) %>%
group_by(bins) %>%
summarize(
group_n = n(),
group_sum = sum(X1)
)
# # A tibble: 9 × 3
# bins group_n group_sum
# <fct> <int> <int>
# 1 (-0.001,0.111] 322 54959
# 2 (0.111,0.222] 141 54867
# 3 (0.222,0.333] 111 55186
# 4 (0.333,0.444] 92 55074
# 5 (0.444,0.556] 80 54976
# 6 (0.556,0.667] 71 54574
# 7 (0.667,0.778] 66 55531
# 8 (0.778,0.889] 60 54731
# 9 (0.889,1] 57 55397
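This could of course be simplified: you don't need to keep the extra columns around. mutate(bins = cut(cumsum(X1 / sum(X1)), breaks = n_groups - 1)) adds just the bins column to the original data, and the group_by() %>% summarize() step is only there to diagnose the result. Note that cut() with a single number for breaks creates that many intervals, so breaks = n_groups - 1 yields 9 bins here, as the output shows; use breaks = n_groups if you want exactly n_groups bins.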
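As a rough check of the cumulative-proportion idea, here is a base-R sketch on a single vector. It assumes you want exactly n_groups bins and therefore passes breaks = n_groups; the variable names are my own:

```r
set.seed(47)
n_groups <- 10

# One column of values, sorted ascending as in arrange(X1)
x <- sort(sample(0:1000, 1000, replace = TRUE))

# Running share of the total sum, cut into n_groups equal-width slices
cprop <- cumsum(x / sum(x))
bins <- cut(cprop, breaks = n_groups)

# Diagnostics: rows per bin and sum per bin
group_n <- table(bins)
group_sums <- tapply(x, bins, sum)
print(data.frame(group_n = as.vector(group_n), group_sum = as.vector(group_sums)))
```

The sums are only approximately equal, because a row cannot be split across two bins; the low-value bins need many more rows than the high-value ones to reach the same total.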

Using group_by() function on multiple data frames?

I have data that were collected from a year but are broken up by months. For my code, I labeled them df1-df12 for each corresponding month. I am trying to group these data using the group_by function to group all the dataframes similarly. When I do the following code- it works fine alone:
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
However, I would like to streamline this code so that I can use this function for all 12 dataframes without having to copy/paste 12 times, since there is a lot of data to go through. Here is what I have tried to do to that end:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
}
yr19<-c(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
map(yr19, func1)
However, I get the following error message: Error in UseMethod("group_by") :
no applicable method for 'group_by' applied to an object of class "character". As stated above, I don't get this error message if I go through and do it individually, but there are many months and many years to be analyzed, and from a time perspective I don't think running this code manually is feasible. Thanks for your help.
Two ways you can approach this, first using the approach suggested by @ktiu:
## Create example data
library(dplyr) # for pipe and group_by()
set.seed(914)
df1 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
df2 <- tibble(
date = sample(1:30, 50, replace = T),
id = sample(1:10, 50, replace = T),
var1 = rnorm(50, mean = 10, sd = 3)
)
Modifying your function so that it returns the result explicitly:
func1<-function(df)
{
df <- df %>%
group_by(date,id) %>%
slice(n()) %>%
ungroup()
df
}
## And using list rather than c to combine data frames.
yr19 <- list(df1, df2)
yr19_data <- lapply(yr19, func1)
# This will return a list of data frames you can access with `yr19_data[[1]]`
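The reason list() matters here is that c() does not keep data frames intact: it flattens them into one long list of columns, so a function mapped over the result receives plain vectors rather than data frames. A small base-R sketch with made-up toy data:

```r
# Toy data frames (made up for illustration)
df1 <- data.frame(date = 1:2, id = c("a", "b"))
df2 <- data.frame(date = 3:4, id = c("c", "d"))

# c() flattens: the result is a list of the four columns
flat <- c(df1, df2)
length(flat)                 # one element per column, not per data frame
sapply(flat, class)          # plain vectors, including character ones

# list() keeps each data frame as a single element
nested <- list(df1, df2)
length(nested)               # one element per data frame
sapply(nested, is.data.frame)
```

Mapping a function that calls group_by() over flat therefore hands it a vector, which is consistent with the "no applicable method" error in the question.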
An alternative approach is to add a variable identifying the source data frame, then collapse everything into a single data frame and manipulate it from there. Which approach makes more sense will depend on what else you want to do later.
func2 <- function(df.name){
mutate(get(df.name), source = df.name)
}
# This is set up to get objects given their names, so we'll use a character vector
# of names to iterate off of.
yr19 = c("df1", "df2")
df.list <- lapply(yr19, func2)
df.long <- do.call(bind_rows, df.list)
df.long
# # A tibble: 100 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 27 9 9.31 df1
# 2 5 3 16.5 df1
# 3 28 3 2.67 df1
# 4 24 4 8.94 df1
# 5 13 3 1.68 df1
At this point you can manipulate one data frame in your original pipe:
df <- df.long %>%
group_by(source, date,id) %>%
slice(n()) %>%
ungroup()
df
# # A tibble: 93 x 4
# date id var1 source
# <int> <int> <dbl> <chr>
# 1 1 8 9.89 df1
# 2 2 4 10.9 df1
# 3 4 3 8.45 df1
# 4 5 3 16.5 df1
# 5 5 7 10.6 df1
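A more compact variant of the single-data-frame approach, sketched here under the assumption that the monthly objects exist in the calling environment: mget() collects them into a named list and the .id argument of bind_rows() turns the names into the source column. The tiny data frames below are made up so the result is easy to check by eye:

```r
library(dplyr)

# Made-up monthly data: duplicated (date, id) pairs on purpose
df1 <- tibble(date = c(1, 1, 2), id = c(1, 1, 1), var1 = c(10, 11, 12))
df2 <- tibble(date = c(1, 2, 2), id = c(2, 2, 2), var1 = c(20, 21, 22))

# Named list of data frames, stacked with the list names as a "source" column
df.long <- bind_rows(mget(c("df1", "df2")), .id = "source")

# Keep the last row of every source/date/id combination
result <- df.long %>%
  group_by(source, date, id) %>%
  slice(n()) %>%
  ungroup()

print(result)
```

For twelve months the name vector could be built with paste0("df", 1:12) instead of typing each name.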

Subset the Original dataframe with different combinations of 2 factor variables

I have a dataset with 11 columns and 18350 observations, including the variables company and region. There are 9 companies (company-0) spread across 5 regions (region-0 to region-5), and not all companies are present in all regions. I want to create a separate dataframe for each combination of company and region. The combinations look like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R. No other combinations are possible.
Any other approach would be highly appreciated.
Thanks in Advance
I used split function to get a list-
p<-split(tsog1,list(tsog1$company),drop=TRUE)
Now I have a list of dataframes, and I can't convert each element of that list into an individual dataframe.
I tried using loops too, but can't get a unique named dataframe.
v<-c(1:9)
p<-levels(tsog1$company)
for (x in v)
{
x.tsog1<-subset(tsog1,tsog1$company==p[x])
}
You can create a column for the region-company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3 ) %>%
mutate(x = round(rnorm((54*3)),2)) %>%
select(-dummy) %>% as_tibble()
# Create the column to split, and split.
df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
What to do once you have the list of data frames depends on your next steps. If you want to save them, for example, you can use walk or lapply.
For saving:
df_list <- df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
iwalk(df_list,function(df, nm){
write_csv(df, paste0(nm,'.csv'))
})
Or if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4
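The same split can also be done in base R without building a helper column, by passing both variables to split() and dropping the combinations that never occur. This is only a sketch: the data below is made up, and tsog1 simply reuses the name from the question:

```r
# Made-up stand-in for the asker's data
tsog1 <- data.frame(
  company = c("company0", "company0", "company1", "company2", "company0"),
  region  = c("region1",  "region7",  "region5",  "region0",  "region1"),
  x       = 1:5
)

# Split on the interaction of both variables; drop = TRUE removes
# company/region pairs that have no rows
parts <- split(tsog1, list(tsog1$company, tsog1$region), drop = TRUE)

names(parts)                  # one name per existing combination
parts[["company0.region1"]]   # access a single piece by name
```

If separate objects in the workspace are really needed, list2env(parts, envir = .GlobalEnv) would create one per list element, though keeping them in the list is usually easier to work with.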

Aggregating data from value and count attributes

In R, I have a large list of large dataframes, each consisting of two columns, value and count. The function I use in the previous step returns the value of an observation in value; the corresponding count column shows how many times that value has been observed. The following code produces one dataframe as an example; note, however, that the dataframes in the list all have different values and value ranges:
d <- as.data.frame(
cbind(
value = runif(n = 1856, min = 921, max = 4187),
count = runif(n = 1856, min = 0, max = 20000)
)
)
Now I would like to aggregate the data so that I can create readable visualizations. The aggregation should be applied to all dataframes in the list, each of which has a different value range. I am looking for a function that cuts the data into new values and counts, a bit like a histogram function. For example, the counts for all data with a value from 0 to 100 should be summed, and so on at a defined interval, with a clean starting border such as 0.
My first try was to create a simple value vector, where each value is repeated in a number of times that is determined by the count field. Then, the next step would have been applying the hist() function without plotting to obtain the aggregated values and counts which can be defined in the hist()'s arguments. However, this produces too large vectors (some Gb for each) that R cannot handle anymore. I appreciate any solutions or hints!
I am not entirely sure I understand your question correctly, but this might solve your problem or at least point you in a direction. I make a list of data frames and then generate a new column containing the result of applying binfunction to each data frame by using map from the purrr package.
library(tidyverse)
d1 <- d2 <- tibble(
value = runif(n = 1856, min = 921, max = 4187),
count = runif(n = 1856, min = 0, max = 20000)
)
d <- tibble(name = c('d1', 'd2'), data = list(d1, d2))
binfunction <- function(data) {
data %>% mutate(bin = value - (value %% 100)) %>%
group_by(bin) %>%
mutate(sum = sum(count)) %>%
select(bin, sum)
}
d_binned <- d %>%
mutate(binned = map(data, binfunction)) %>%
select(-data) %>%
unnest(binned) %>%
group_by(name, bin) %>%
slice(1L)
d_binned
#> Source: local data frame [66 x 3]
#> Groups: name, bin [66]
#>
#> # A tibble: 66 x 3
#> name bin sum
#> <chr> <dbl> <dbl>
#> 1 d1 900 495123.8
#> 2 d1 1000 683108.6
#> 3 d1 1100 546524.4
#> 4 d1 1200 447077.5
#> 5 d1 1300 604759.2
#> 6 d1 1400 506225.4
#> 7 d1 1500 499666.5
#> 8 d1 1600 541305.9
#> 9 d1 1700 514080.9
#> 10 d1 1800 586892.9
#> # ... with 56 more rows
d_binned %>%
ggplot(aes(x = bin, y = sum, fill = name)) +
geom_col() +
facet_wrap(~name)
See this comment for my inspiration for the binning. It bins the data in groups of 100, so e.g. bin 1100 represents 1100 to <1200 etc. I imagine you can adapt the binfunction to your needs.
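If you would rather avoid the nested tibble, the binning can be sketched in base R as well. The function name bin_counts and its width argument are hypothetical, and the data is simulated the same way as in the question:

```r
set.seed(1)
d <- data.frame(
  value = runif(n = 1856, min = 921, max = 4187),
  count = runif(n = 1856, min = 0, max = 20000)
)

# Hypothetical helper: round each value down to the lower edge of its bin
# (e.g. 1100 covers 1100 to <1200 when width = 100), then sum counts per bin
bin_counts <- function(d, width = 100) {
  d$bin <- floor(d$value / width) * width
  aggregate(count ~ bin, data = d, FUN = sum)
}

binned <- bin_counts(d)
head(binned)

# Applying it to a whole list of data frames
binned_list <- lapply(list(d1 = d, d2 = d), bin_counts)
```

Because the counts are summed directly, nothing is ever expanded into one element per observation, which is what made the hist() approach run out of memory.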

How do you pivot data from a list of data frames in R?

Let's say I have a list of data frames ldf:
df1 <- data.frame(date = c(1,2), value = c(4,5))
df2 <- data.frame(date = c(1,2), value = c(4,5))
ldf <- list(df1, df2)
What is the best way to get the sum (or any other function) of values by date, i.e. some data frame like:
data.frame(date = c(1,2), value = c(8,10))
You could use:
library(data.table)
dt1 <- rbindlist(ldf)
setkey(dt1,'date')
dt1[,list(value=sum(value)), by='date']
date value
1: 1 8
2: 2 10
If these rows were all in the same data frame, you would use aggregate to do the sum. You can combine them with rbind so they are in the same data frame:
aggregate(value ~ date, data=do.call(rbind, ldf), FUN=sum)
date value
1 1 8
2 2 10
If the date columns in all the data frames are identical, you can easily use Reduce to do the sum:
Reduce(function(x, y) data.frame(date=x$date, value=x$value+y$value), ldf)
date value
1 1 8
2 2 10
This should be a lot faster than rbind-ing the data together and aggregating.
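The Reduce shortcut silently gives wrong sums if the date columns are not in the same order, so a guarded version may be worth having. A sketch, with the function name sum_by_date being my own invention:

```r
# Hypothetical wrapper: use the fast element-wise sum only when every
# data frame has exactly the same date column, otherwise fall back to aggregate()
sum_by_date <- function(ldf) {
  dates <- ldf[[1]]$date
  aligned <- all(vapply(ldf, function(d) identical(d$date, dates), logical(1)))
  if (aligned) {
    Reduce(function(x, y) data.frame(date = x$date, value = x$value + y$value), ldf)
  } else {
    aggregate(value ~ date, data = do.call(rbind, ldf), FUN = sum)
  }
}

df1 <- data.frame(date = c(1, 2), value = c(4, 5))
df2 <- data.frame(date = c(1, 2), value = c(4, 5))
df3 <- data.frame(date = c(2, 1), value = c(1, 2))   # same dates, different order

aligned_result <- sum_by_date(list(df1, df2))     # takes the Reduce path
unaligned_result <- sum_by_date(list(df1, df3))   # takes the aggregate path
```

Whether the speed difference matters depends on the size of the list, so it is worth timing both paths on real data before relying on either.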
Another option is to stack the list with bind_rows() from "dplyr" (older versions of "tidyr" also accepted unnest(ldf) here) and then use the typical grouping and aggregation functions:
library(dplyr)
bind_rows(ldf) %>%
group_by(date) %>%
summarise(value = sum(value))
# Source: local data frame [2 x 2]
#
# date value
# 1 1 8
# 2 2 10

Resources