I have a large dataset that I would like to filter and select with dplyr to create 12 separate dataframes.
Essentially, I only need a couple of columns from the larger dataset. I filter by the "plot" number and by a condition on another column ("pos_ID"). I want to write a loop that filters by plot number (I tried plot==[i]) and by the pos_ID condition, and then creates a new dataframe. The loop would repeat 12 times (because plot ranges from 1 to 12).
Here is the code that I used without a loop (based on sample data)
p1_Germ <- data %>% #p1 stands for plot 1
filter(plot==1, pos_ID<21) %>%
select(germ_bin)
Here is the code that I tried to incorporate a loop (based on sample data)
for(i in seq_along(plot)) {
data %>%
group_by(plot[[i]], pos_ID<21) %>%
select(germ_bin)
}
Here is some sample data
plot <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
dataset <- data.frame(plot, germ_bin, pos_ID)
dataset
My guess is to use a list, but I'm not familiar with loops and lists and could not find a solution online. I need to create 12 dataframes because I want to convert each of them into a matrix afterwards for another function. Any help would be much appreciated!
We can use group_split and map to filter based on the criteria and get a list of dataframes.
library(dplyr)
library(purrr)
dataset %>%
group_split(plot) %>%
map(. %>% filter(pos_ID < 21) %>% select(germ_bin))
#[[1]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 0
#2 0
#[[2]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#[[3]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#....
For the shared example, if you want to drop empty groups you can filter first
dataset %>%
filter(pos_ID < 21) %>%
group_split(plot) %>%
map(. %>% select(germ_bin))
As far as your attempt with the for loop is concerned, you can correct it like this:
unique_plot <- unique(dataset$plot)
plot_list <- vector("list", length(unique_plot)) # preallocate an empty list with one slot per plot
for(i in seq_along(unique_plot)) {
plot_list[[i]] <- dataset %>%
filter(plot == unique_plot[i], pos_ID<21) %>%
select(germ_bin)
}
Or keeping it completely in base R
lapply(split(dataset, dataset$plot), function(x)
subset(x, pos_ID < 21, select = germ_bin, drop = FALSE))
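Since the end goal in the question is one matrix per plot, the base R version can be extended to return matrices directly. This is only a sketch built on the sample data: the list name germ_list and the "p<plot>_Germ" naming scheme are assumptions that mirror the p1_Germ name used in the question.

```r
# Sample data from the question
plot <- rep(1:12, each = 2)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- 1:24
dataset <- data.frame(plot, germ_bin, pos_ID)

# Filter first so that plots without matching rows are dropped
kept <- dataset[dataset$pos_ID < 21, ]

# One single-column matrix per remaining plot
germ_list <- lapply(split(kept, kept$plot),
                    function(x) as.matrix(x["germ_bin"]))

# Hypothetical naming scheme, modeled on p1_Germ from the question
names(germ_list) <- paste0("p", names(germ_list), "_Germ")

germ_list[["p1_Germ"]]
```

Keeping the matrices in one named list means the downstream function can be applied to all of them with a single lapply() call.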
Related
Related to a previous question, I want to add some random noise to every value in a column in dplyr. However, when I tried the code below, I got identical values back. I understand why this is happening (runif(1) generates a single random number, and that same number is then added to every value). Is there any way to prevent this?
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000236
# 2 1.000236
# 3 1.000236
# 4 1.000236
# 5 1.000236
Here is a solution with jitter:
library(dplyr)
set.seed(2020) # Make the results reproducible
data <- data.frame(value=c(1,1,1,1,1)) %>% mutate(value = jitter(value))
data
# value
#1 1.0058761
#2 0.9957690
#3 1.0047401
#4 0.9990756
#5 0.9854439
You could generate your random vector externally and then add it to data$value:
nrows <- nrow(data)
rands <- 1e-3 * runif(nrows)
data$value <- data$value + rands
Stepwise clarity works better for me.
Found my own answer. Adding rowwise() evaluates each row individually and thus gives a new random number.
data <- data.frame(value=c(1,1,1,1,1)) %>% rowwise() %>% mutate(value = value + 1e-3*runif(1)) %>% print
# print(data)
# value
# 1 1.000625
# 2 1.000764
# 3 1.000588
# 4 1.000536
# 5 1.000079
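As an alternative to rowwise(), runif() can be asked for one draw per row, which keeps the operation vectorized. The following is a minimal sketch; set.seed(1) is only there to make the run reproducible and is not part of the original code.

```r
library(dplyr)

set.seed(1) # assumption: fixed seed, only for reproducibility

# n() is the number of rows in the current group (here: the whole data frame),
# so runif(n()) returns a different random number for each row
data <- data.frame(value = c(1, 1, 1, 1, 1)) %>%
  mutate(value = value + 1e-3 * runif(n()))

data
```

On large data frames this should be noticeably faster than rowwise(), because runif() is called once instead of once per row.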
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the expected one:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain,.drop = TRUE) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
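The second group_by()/summarise() pair in the dplyr answer can also be written with count(). This is a sketch of the same idea, not a different method; the result name chain_counts is made up for illustration.

```r
library(dplyr)

Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest, State)

chain_counts <- df %>%
  group_by(Guest) %>%
  # Sort the states within each guest so that "TX, IA" and "IA, TX" give the same chain
  summarise(Chain = paste(sort(State), collapse = "-")) %>%
  # Count how many guests share each chain
  count(Chain, name = "N")

chain_counts
```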
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summary using mutate. It's computationally wasteful, since it computes the same summary once for every row in each group and then discards everything but the first row of each group. But that might be fine if your data is not too large.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
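Another way to add the summary variables independently is to collect the summary expressions in a named list and splice them into a single summarise() call with !!!. The sketch below runs on a local data frame; because the expressions are plain n() and n_distinct() calls, it seems likely to translate under dbplyr as well, but that is an assumption and has not been checked here. The list name summaries is made up for illustration.

```r
library(dplyr)

summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val = c(1, 2, 2))

# Build the list of summary expressions one condition at a time;
# quote() stores each call unevaluated
summaries <- list()
if (summarize_num) summaries$n <- quote(n())
if (summarize_num_distinct) summaries$n_unique <- quote(n_distinct(val))

# !!! splices the named expressions in as if they had been typed
# directly as arguments: summarise(n = n(), ...)
summ <- data %>% summarise(!!!summaries)

summ
```

Each new condition adds one line, so the number of clauses grows linearly rather than combinatorially.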
I have a dataset with 11 columns and 18350 observations, which includes the variables company and region. There are 9 companies (company-0) spread across 5 regions (region-0 to region-5), and not all companies are present in all regions. I want to create a separate dataframe for each combination of company and region, like this:
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R. No other combinations are possible.
Any other approach would be highly appreciated.
Thanks in Advance
I used the split function to get a list:
p<-split(tsog1,list(tsog1$company),drop=TRUE)
Now I have a list of dataframes, but I can't convert each element of that list into an individual dataframe.
I tried using loops too, but couldn't get uniquely named dataframes.
v<-c(1:9)
p<-levels(tsog1$company)
for (x in v)
{
x.tsog1<-subset(tsog1,tsog1$company==p[x])
}
You can create a column for the region company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3 ) %>%
mutate(x = round(rnorm((54*3)),2)) %>%
select(-dummy) %>% as_tibble()
# Create the column to split, and split.
df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
What to do once you have the list of data frames depends on your next steps. If, for example, you want to save them, you can use walk or lapply.
For saving:
df_list <- df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
iwalk(df_list,function(df, nm){
write_csv(df, paste0(nm,'.csv'))
})
Or, if you simply want to access one of them:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4
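If individually named data frames in the workspace are really needed, base R can split on both columns at once and then copy the list elements into an environment. This is a sketch on a small made-up data frame: tsog_demo and its values are invented stand-ins for tsog1, and only the column names company and region come from the question.

```r
# Hypothetical stand-in for tsog1, with only a few company/region combinations
tsog_demo <- data.frame(
  company = c("company0", "company0", "company1", "company2"),
  region  = c("region1", "region7", "region5", "region0"),
  x       = 1:4
)

# Split on both columns; drop = TRUE discards combinations that never occur
pieces <- split(tsog_demo, list(tsog_demo$company, tsog_demo$region), drop = TRUE)

names(pieces) # one "<company>.<region>" name per existing combination

# Copy each list element into the global environment under its own name
list2env(pieces, envir = .GlobalEnv)

company0.region1
```

Keeping everything in the list is usually easier to work with afterwards, so list2env() is best treated as a last step.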
I have a list of dataframes containing different time series of different lengths. I want to summarize the count of a variable and then normalize it by the number of years of data that is contained in that particular dataset.
so with a sample dataframe:
data_list <- list(data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
data.frame(temp_bin = rep(1:4, 4:7), value = runif(22)))
# this might be ~10 different data sets with ~ 100k observations each
count <- lapply(data_list, function(x) {nrow(x)/5} )
# for real data this would be divided by 8760 for the # of hours in a year.
Here is approximately what I want to do, but the n()/count doesn't work because count is a list.
data_bin <- data_list %>%
lapply(., group_by, temp_bin) %>%
lapply(., summarise, n = n()/count)
I tried doing an lapply or mapply within the definition of n, but that didn't seem to work. I also tried doing it in two steps (getting a raw n value and then dividing in the next step with mapply), but that didn't work either.
If you put the count step inside your data_bin step, I think it accomplishes what you want, though I am a little hazy on exactly what you mean. (Note that you can remove the . from the first argument of lapply; passing the left-hand side as the first argument is the default behavior of %>%.)
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
# We need x so I put summarize in a manual function
lapply(function(x){summarize(x,n = 5*n()/nrow(x))}) # move the 5 to numerator
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 0.7142857
2 2 1.0714286
3 3 1.4285714
4 4 1.7857143
Is this what you wanted? You can double-check that the summarize part is doing what you want by just returning the nrow(x) result.
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
lapply(function(x){summarize(x,n = nrow(x))})
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 14
2 2 14
3 3 14
4 4 14
I would try to avoid using lapply on every step of a dplyr pipeline. You could wrap the transformation of an individual data.frame in a function and then lapply that function over data_list:
library(dplyr)
ret_db <- function(df) {
db <- df %>%
group_by(.,temp_bin) %>%
summarise(.,n=n()/(nrow(df)/5))
return(db)
}
data_bin <- lapply(data_list,ret_db)