I have several large data frames that I need to append data to. The data structure is simulated below:
Orders
set.seed(2)
N=1e2
tbl.orders <- tibble(
ID=1:N,
nb_products_ordered = sample(1:15, N, replace = TRUE),
type = sample(c("keyboard", "mouse", "other"), N, replace = TRUE),
grade= sample(LETTERS[1:5], N, replace=TRUE)
)
# A tibble: 100 x 4
ID nb_products_ordered type grade
<int> <int> <chr> <chr>
1 1 5 other A
2 2 15 keyboard A
3 3 6 other C
4 4 6 keyboard E
5 5 8 other C
Bulk pricing. Prices are variable depending on the number of products ordered.
The table shows the minimum number of products at which bulk pricing applies.
Other products are not counted and their price is to be considered "NA".
tbl.prices <- tibble::tribble(
~min_products_ordered, ~per_unit_cost, ~type,
1L, 39.7, "mouse",
2L, 23.1, "mouse",
3L, 18.6, "mouse",
4L, 15, "mouse",
5L, 14.3, "mouse",
6L, 11, "mouse",
9L, 10.9, "mouse",
1L, 11, "keyboard",
9L, 10.9, "keyboard"
)
My solution, below, seems overly complex and is rather slow when run on my large data frames (500K+ rows each). Is there a simpler, faster way? Ideally I want it in a pipe, because I have transformations before and after.
tbl.orders%>%
group_by(type)%>%
group_split()%>%
lapply(., function(x)
{
#if included in price list
if (x$type[1] %in% levels(factor(tbl.prices$type))) {
df.priceparameters <- tbl.prices %>%
filter(type == x$type[1])
x %>% mutate(
per_unit_cost =
as.numeric(
as.character(
cut(
x[["nb_products_ordered"]],
breaks = c(df.priceparameters$min_products_ordered, Inf),
#returns per unit cost
labels = df.priceparameters$per_unit_cost,
right = FALSE
)))
)
} else{
x %>% mutate(per_unit_cost = NA)
}
})%>%
do.call("rbind",.)%>%
arrange(ID)
Result
# A tibble: 100 x 5
ID nb_products_ordered type grade per_unit_cost
<int> <int> <chr> <chr> <dbl>
1 1 5 other A NA
2 2 15 keyboard A 10.9
3 3 6 other C NA
4 4 6 keyboard E 11
5 5 8 other C NA
Here is my attempt, which also seems a bit convoluted:
We write a function to select the correct price row for each ID.
library(dplyr)
select_row <- function(type, nb_products_ordered, min_products_ordered){
if(any(type == 'other')) return(TRUE)
else{
tmp <- first(nb_products_ordered) - min_products_ordered
inds <- tmp >= 0
if(any(inds)) return(tmp == min(tmp[inds], na.rm = TRUE))
else TRUE
}
}
Join the dataframe by type and select the row for each ID.
tbl.orders %>%
left_join(tbl.prices, by = 'type') %>%
group_by(ID) %>%
filter(select_row(type, nb_products_ordered, min_products_ordered))
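Another option that avoids both the per-group `cut()` and the expanded join is a sorted lookup with base R's `findInterval()`. The sketch below is only an illustration: `lookup_cost()` is a hypothetical helper that does not appear in either solution above, and the small `orders` tibble is made-up data standing in for `tbl.orders`. It assumes that any type missing from the price table should get `NA`.

```r
library(dplyr)
library(tibble)

# Price tiers, copied from the question
tbl.prices <- tribble(
  ~min_products_ordered, ~per_unit_cost, ~type,
  1L, 39.7, "mouse",
  2L, 23.1, "mouse",
  3L, 18.6, "mouse",
  4L, 15,   "mouse",
  5L, 14.3, "mouse",
  6L, 11,   "mouse",
  9L, 10.9, "mouse",
  1L, 11,   "keyboard",
  9L, 10.9, "keyboard"
)

# Small stand-in for tbl.orders (assumed data, for illustration only)
orders <- tibble(
  ID = 1:4,
  nb_products_ordered = c(5L, 15L, 6L, 3L),
  type = c("other", "keyboard", "keyboard", "mouse")
)

# Hypothetical helper: per-unit cost for a vector of quantities of one type
lookup_cost <- function(n, type, prices) {
  tiers <- prices[prices$type == type, ]
  # Types without any price tier get NA
  if (nrow(tiers) == 0) return(rep(NA_real_, length(n)))
  tiers <- tiers[order(tiers$min_products_ordered), ]
  # Index of the largest threshold that is <= n (0 if n is below all thresholds)
  idx <- findInterval(n, tiers$min_products_ordered)
  idx[idx == 0] <- NA_integer_
  tiers$per_unit_cost[idx]
}

priced <- orders %>%
  group_by(type) %>%
  mutate(per_unit_cost = lookup_cost(nb_products_ordered, first(type), tbl.prices)) %>%
  ungroup()

print(priced)
```

Because the lookup runs once per type rather than once per order, it stays inside a single pipe and does not multiply the rows the way a join on `type` does. Whether it is actually faster on 500K+ rows would need to be measured.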
I have a problem on how to recategorize a variable according to whether it meets a certain condition or not. That is, if the category does not meet the criteria, it is assigned to another category that does.
My data has the following form:
data = data.frame(firm_size = c("Micro", "Small", "Medium","Big"),
employees = c(5,10,100,1000))
> data
firm_size employees
1 Micro 5
2 Small 10
3 Medium 100
4 Big 1000
So, if my condition is that I must group the companies that have fewer than 10 employees and combine them with another category that does meet the criterion, the result would be:
> new_data
firm_size employees
1 Micro-Small 15
3 Medium 100
4 Big 1000
What I'm trying to do is write a function that generalizes this procedure, for example, that also works if my data is
> data
firm_size employees
1 Micro 5
2 Small 8
3 Medium 9
4 Big 1000
> new_data
firm_size employees
1 Micro-Small-Medium 22
4 Big 1000
I think that this can be done with the tools of the tidyverse.
Thanks in advance
Here's an approach with tally:
library(dplyr)
size <- 10
data %>%
# Sort so that firms above the threshold come first; the grouping vector below is built in that order
arrange(desc(employees)) %>%
group_by(firm_size = c(as.character(firm_size[employees > size]),
rep(paste(firm_size[employees <= size], collapse = "-"),
sum(employees <= size)))) %>%
tally(employees, name = "employees")
## A tibble: 3 x 2
# firm_size employees
# <chr> <dbl>
#1 Big 1000
#2 Medium 100
#3 Small-Micro 15
And for your second set of data:
data2 %>%
# Sort so that firms above the threshold come first; the grouping vector below is built in that order
arrange(desc(employees)) %>%
group_by(firm_size = c(as.character(firm_size[employees > size]),
rep(paste(firm_size[employees <= size], collapse = "-"),
sum(employees <= size)))) %>%
tally(employees, name = "employees")
## A tibble: 2 x 2
# firm_size employees
# <chr> <int>
#1 Big 1000
#2 Medium-Small-Micro 22
Data
data <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5,
10, 100, 1000)), class = "data.frame", row.names = c(NA, -4L))
data2 <- structure(list(firm_size = structure(c(3L, 4L, 2L, 1L), .Label = c("Big",
"Medium", "Micro", "Small"), class = "factor"), employees = c(5L,
8L, 9L, 1000L)), class = "data.frame", row.names = c("1", "2",
"3", "4"))
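The grouping vector above is built positionally, so it depends on the rows being sorted with the large firms first. A minimal order-independent sketch is shown below; `merge_small()` and the helper column `is_small` are hypothetical names introduced only for this illustration, and the threshold is treated as inclusive (`<=`) to match the answer above.

```r
library(dplyr)

data <- data.frame(
  firm_size = c("Micro", "Small", "Medium", "Big"),
  employees = c(5, 10, 100, 1000)
)

# Hypothetical helper: merge every firm size at or below `size` into one category
merge_small <- function(dat, size) {
  dat %>%
    mutate(
      firm_size = as.character(firm_size),
      is_small  = employees <= size,
      # Small firms all receive the same combined label; others keep their own
      firm_size = if_else(is_small,
                          paste(firm_size[is_small], collapse = "-"),
                          firm_size)
    ) %>%
    group_by(firm_size) %>%
    summarise(employees = sum(employees), .groups = "drop")
}

new_data <- merge_small(data, 10)
print(new_data)
```

The combined label is assembled in the row order of the input, so sorting the data beforehand only changes how the label reads, not which firms are merged.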
You can use the great forcats package
library(tidyverse)
data <- data.frame(
firm_size = c("Micro", "Small", "Medium", "Big", "Small"),
employees = c(5, 10, 100, 1000, 10)
)
# If you need n groups
data %>%
mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump(n = 2, w = employees)) %>%
group_by(firm_size2) %>%
summarise(sum_emp = sum(employees),.groups = "drop")
#> # A tibble: 3 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Medium 100
#> 2 Big 1000
#> 3 Other 25
# If you need at least x on the sum of a vector
data %>%
mutate(firm_size2 = firm_size %>% as_factor() %>% fct_lump_min(min = 10, w = employees)) %>%
group_by(firm_size2) %>%
summarise(sum_emp = sum(employees),.groups = "drop")
#> # A tibble: 4 x 2
#> firm_size2 sum_emp
#> <fct> <dbl>
#> 1 Small 20
#> 2 Medium 100
#> 3 Big 1000
#> 4 Other 5
Created on 2020-06-11 by the reprex package (v0.3.0)
Yet another solution, set into a custom function:
library(tidyverse)
mymerge <- function(dat, min) {
merged_dat <- dat %>%
filter(employees <= min) %>%
summarize(firm_size = str_flatten(firm_size, collapse = " - "),
employees = sum(employees))
dat %>%
filter(employees > min) %>%
bind_rows(merged_dat)
}
mymerge(data, 30)
firm_size employees
1 Medium 100
2 Big 1000
3 Micro - Small 15
mymerge(data, 300)
firm_size employees
1 Big 1000
2 Micro - Small - Medium 115
Three doctors diagnose each patient.
Question 1: how do I filter the patients whom all 3 doctors diagnosed with disease B (no matter whether B.1, B.2 or B.3)?
Question 2: how do I filter the patients whom any of the 3 doctors diagnosed with disease A?
set.seed(20200107)
df <- data.frame(id = rep(1:5, each = 3),
disease = sample(c('A','B'), 15, replace = TRUE))
df$disease <- as.character(df$disease)
df[1,2] <- 'A'
df[4,2] <- 'B.1'
df[5,2] <- 'B.2'
df[6,2] <- 'B.3'
df
I have a method in mind, but I don't know how to write the code. I think any() or all() should be used.
First, I want to group patients by id.
Second, check whether all the diseases in each group are A or B.
The code would be something like this:
df %>% group_by(id) %>% filter_all(all_vars(disease == B))
You can use all assuming every patient is checked by 3 doctors only.
library(dplyr)
df %>% group_by(id) %>% summarise(disease_B = all(grepl('B', disease)))
# id disease_B
# <int> <lgl>
#1 1 FALSE
#2 2 TRUE
#3 3 FALSE
#4 4 FALSE
#5 5 FALSE
If you want to subset the rows of the patient, we can use filter
df %>% group_by(id) %>% filter(all(grepl('B', disease)))
For question 2, we can similarly use any:
df %>% group_by(id) %>% summarise(disease_A = any(grepl('A', disease)))
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), disease = c("A", "A", "A", "B.1", "B.2",
"B.3", "B", "A", "A", "B", "A", "A", "B", "A", "B")), row.names = c(NA,
-15L), class = "data.frame")
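Both checks can also be computed in one pass. The sketch below is an illustration under one assumption: the subtype is always written after a dot (as in `B.1`), so stripping everything from the first dot onward leaves the base disease. The column names `base`, `all_B` and `any_A` are introduced here and are not part of the answer above.

```r
library(dplyr)

df <- data.frame(
  id = rep(1:5, each = 3),
  disease = c("A", "A", "A", "B.1", "B.2", "B.3", "B", "A", "A",
              "B", "A", "A", "B", "A", "B")
)

flags <- df %>%
  # Drop the subtype suffix: "B.1" becomes "B", "A" stays "A"
  mutate(base = sub("\\..*$", "", disease)) %>%
  group_by(id) %>%
  summarise(all_B = all(base == "B"),
            any_A = any(base == "A"),
            .groups = "drop")

print(flags)

# Rows of the patients that every doctor diagnosed with some form of B
df %>%
  group_by(id) %>%
  filter(all(sub("\\..*$", "", disease) == "B")) %>%
  ungroup()
```

Comparing the stripped value with `==` is stricter than `grepl('B', disease)`, which would also match any other label that merely contains the letter B.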
For question 1, you can replace B.1, B.2, ... with B, then count the occurrences of each "Disease" per patient and keep only the rows where the count equals 3 and the disease is B:
library(tidyverse)
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease)) %>%
count(Disease) %>%
filter(n == 3 & Disease == "B")
# A tibble: 2 x 3
# Groups: id [2]
id Disease n
<int> <chr> <int>
1 2 B 3
2 4 B 3
For question 2, similarly, you can replace B.1, ... with B, keep only the rows where Disease is A, then count the rows per patient. The output is the patient id and the number of doctors who diagnosed disease A:
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease))%>%
filter(Disease == "A") %>%
count(id)
# A tibble: 3 x 2
# Groups: id [3]
id n
<int> <int>
1 1 1
2 3 3
3 5 2
I want to generate a list of dataframes and apply the same functions to each of them. I do not know how to do this elegantly without a very large number of lines of code.
From a dataframe df,
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),
rep("silver", 4),
rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
I create a bunch of other datasets with a simple threshold rule based on one variable
df_25 <- df[df$x < 25,]
df_20 <- df[df$x < 20,]
# and so on
I then apply functions to each dataset; I can do this to each dataset individually, or to a list of datasets
# individually
df <- df_18 %>%
dplyr::group_by(id) %>%
dplyr::mutate(nb1= sum(x),
nb2 = sum(x != 25))
# to a list
ls1 <- list(df_25, df_20)
func_1 <- function(x) {
x <- x %>%
dplyr::group_by(id) %>%
dplyr::mutate(nb1= sum(x),
nb2 = sum(x != 25))
}
ls1 <- lapply(ls1, function(x) {x[c("id","x")]
<- lapply(x[c("id","x")], func_1)
x})
df_25 <- ls1[[1]]
df_20 <- ls1[[2]]
In any case, this takes both a lot of lines and a lot of time, as I am dealing with very large datasets. How could I simplify and speed up both the generation of datasets with proper, recognisable names and the creation of the new variables through the functions defined above?
I have not found a proper answer to this two-part question yet and would welcome your help!
You could define a threshold vector and lapply your aggregation. In base R this could look like this:
threshold <- c(22, 24, 26)
res <- setNames(lapply(threshold, function(s) {
sst <- df[df$x < s, ]
merge(sst,
with(sst, aggregate(list(nb1=x, nb2=x != 25),
by=list(id=id), sum)),
by="id")
}), threshold)
res
# $`22`
# id x y z type nb1 nb2
# 1 a 20.92786 37.61272 69976.23 gold 20.92786 1
# 2 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 3 c 18.58916 46.08353 69985.98 silver 18.58916 1
#
# $`24`
# id x y z type nb1 nb2
# 1 a 22.73948 44.29524 70002.81 gold 43.66734 2
# 2 a 20.92786 37.61272 69976.23 gold 43.66734 2
# 3 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 4 c 18.58916 46.08353 69985.98 silver 18.58916 1
#
# $`26`
# id x y z type nb1 nb2
# 1 a 22.73948 44.29524 70002.81 gold 43.66734 2
# 2 a 20.92786 37.61272 69976.23 gold 43.66734 2
# 3 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 4 c 18.58916 46.08353 69985.98 silver 44.24036 2
# 5 c 25.65120 44.85778 70008.81 bronze 44.24036 2
# 6 d 24.84056 49.22505 69993.87 bronze 24.84056 1
Data
df <- structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
x = c(22.7394803492982, 20.927856140076, 30.2395154764033,
26.6955462205898, 20.6427460111819, 18.589158456851, 25.6511987559726,
24.8405634272769, 28.8534602413068, 26.5376546472448), y = c(44.2952365501829,
37.6127198429065, 45.2842176546081, 40.3835729432985, 38.0205610647157,
46.083525703352, 44.8577760657779, 49.2250487481642, 40.2699166395278,
49.3740993403725), z = c(70002.8091832317, 69976.2314543058,
70000.9974233725, 70011.435897774, 69997.249180665, 69985.9786882474,
70008.8088326676, 69993.8665395223, 69998.7334115052, 70001.2935411788
), type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L), .Label = c("bronze", "gold", "silver"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Using purrr::map to loop over the vector of thresholds
library(dplyr)
library(purrr)
map(c(18,20,25) %>%set_names() , ~ df %>% filter(x<.x) %>%
group_by(id) %>%
mutate(nb1= sum(x),
nb2 = sum(x != 25)))
Or use map_if to apply the calculation only when the subset of df has more than one row.
map_if(c(18,20,25) %>%set_names(), ~df %>% filter(x<.x) %>% nrow()>1,
~df %>% filter(x<.x) %>% group_by(id) %>%
mutate(nb1= sum(x),
nb2 = sum(x != 25)), .else = ~NA)
Using tidyverse, we can combine all these operations in one chain.
library(tidyverse)
df %>%
group_split(x > 25, keep = FALSE) %>%
map(. %>% group_by(id) %>% mutate(nb1= sum(x),nb2 = sum(x != 25)))
#[[1]]
# A tibble: 6 x 7
# Groups: id [5]
# id x y z type nb1 nb2
# <fct> <dbl> <dbl> <dbl> <fct> <dbl> <int>
#1 a 21.4 42.9 70001. gold 21.4 1
#2 b 18.0 45.3 70005. silver 18.0 1
#3 c 23.3 42.7 70006. bronze 23.3 1
#4 d 23.4 40.9 69990. bronze 46.7 2
#5 d 23.3 41.2 70000. bronze 46.7 2
#6 e 22.3 55.9 69991. bronze 22.3 1
#[[2]]
# A tibble: 4 x 7
# Groups: id [3]
# id x y z type nb1 nb2
# <fct> <dbl> <dbl> <dbl> <fct> <dbl> <int>
#1 a 25.8 40.5 69995. gold 25.8 1
#2 b 28.3 41.5 69996. silver 54.5 2
#3 b 26.3 49.3 69993. silver 54.5 2
#4 c 26.5 44.5 69986. silver 26.5 1
Here, I have split the data into two groups based on the value of x: the first group has values of 25 or below and the second group has values above 25. You can change the logic to suit your requirements.
This gives you a list of data frames as output, which you can access individually.
data
set.seed(1234)
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),rep("silver", 4),rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
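Since the question also asks for recognisable names, one way to get them is to name the threshold vector before looping over it. The sketch below is only an illustration: `add_group_stats()`, `thresholds` and the `df_` prefix are names chosen here, and the data is a cut-down version of the example with only `id` and `x`.

```r
library(dplyr)

set.seed(1234)
df <- data.frame(
  id = c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e'),
  x  = rnorm(n = 10, mean = 25, sd = 3)
)

thresholds <- c(20, 25)

# Hypothetical helper wrapping the per-id calculations from the question
add_group_stats <- function(d) {
  d %>%
    group_by(id) %>%
    mutate(nb1 = sum(x),
           nb2 = sum(x != 25)) %>%
    ungroup()
}

# Name the thresholds first so the resulting list carries names like "df_20"
res <- lapply(
  setNames(thresholds, paste0("df_", thresholds)),
  function(thr) add_group_stats(df[df$x < thr, ])
)

names(res)
res$df_25
```

Keeping the results in one named list, rather than as separate `df_20`, `df_25` objects, means any later step can be applied to all of them with another `lapply()`.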
I am trying to iterate through columns, and if the column is a whole year, it should be duplicated four times, and renamed to quarters
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for colnames with 4 digits then take the last two digits from those columns
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
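If the names need to follow the `Q1-00` pattern from the question rather than `00.Q1`, the same re-indexing idea can be adapted. This is a sketch that assumes year columns are named with exactly four digits and that the existing quarter columns already use the `Q1-01` style; `is_year`, `new_names` and `out` are names introduced here.

```r
# One-row example shaped like the question's input
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# TRUE for columns whose name is a four-digit year
is_year <- grepl("^[0-9]{4}$", names(df))

# Year columns expand to four quarter names; other names are kept as they are
new_names <- unlist(
  Map(function(nm, yr) {
    if (yr) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
  }, names(df), is_year),
  use.names = FALSE
)

# Repeat year columns four times, every other column once, then rename
out <- df[rep(seq_along(df), ifelse(is_year, 4, 1))]
names(out) <- new_names

print(out)
```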
I've been using the dplyr package to create aggregated data tables, for example using the following code:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(price1, na.rm = TRUE),2),
mean_price2 = round(mean(price2, na.rm = TRUE),2))
However, I would like to only calculate the mean over the distinct values of price1 and price2 within groups
e.g:
Price1: 1 1 2 1 2 2 1
Goes to (before aggregation):
Price1: 1 2 1 2 1
(and in general price1 and price2 don't have the same number of values after removal). I would also like to calculate a count for each of price1 and price2, counting only distinct values within groups. (Groups are defined as two or more identical values adjacent to each other.)
I have tried:
agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),
mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))
And also tried wrapping the columns within the select function with distinct(), but both these throw errors.
Is there a way to do this using dplyr or another similar package without having to write something from scratch?
To satisfy your requirement for distinct, we need to remove successive values that are the same. For numeric vectors, this can be accomplished by:
x <- x[c(1, which(diff(x) != 0)+1)]
The default use of diff computes the difference between adjoining elements in the vector. We use this to detect successive values that are different, for which diff(x) != 0. Since the output differences are lagged by 1, we add 1 to the indices of these distinct elements, and we also want the first element as distinct. For example:
x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1
We can then use this with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))
or, better yet, define the function:
remove.repeats <- function(x) {
x[c(1,which(diff(x) != 0)+1)]
}
and use it with dplyr:
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(remove.repeats(price1)),
mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
count_non_rep_2 = length(remove.repeats(price2)),
mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))
Using this on some example data that is hopefully similar to yours:
df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3),
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)),
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
## calc.method price1 price2
##1 A 1 1
##2 A 1 1
##3 A 2 1
##4 A 1 1
##5 A 2 1
##6 A 2 1
##7 A 1 1
##8 B 1 2
##9 B 1 1
##10 B 2 2
##11 B 2 1
##12 B 2 2
##13 B 2 1
##14 B 1 2
##15 B 3 1
We get:
print(agg_data)
### A tibble: 2 x 6
## calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
## <fctr> <int> <int> <dbl> <int> <dbl>
##1 A 7 5 1.40 1 1.0
##2 B 8 4 1.75 8 1.5
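As an aside, base R's run-length encoding gives the same collapsed vector as `remove.repeats` for inputs like the ones above. The sketch below assumes there are no missing values in the vector; `remove_repeats_rle` is a name introduced here for comparison only.

```r
# Alternative to remove.repeats(): keep one value per run of identical neighbours
remove_repeats_rle <- function(x) {
  rle(x)$values
}

x <- c(1, 1, 2, 1, 2, 2, 1)
collapsed <- remove_repeats_rle(x)
print(collapsed)

# Number of runs and their mean, as used in the summarise() call
length(collapsed)
mean(collapsed)
```

Unlike the `diff()` version, `rle()` also accepts character vectors, which may be convenient if the grouped column is not numeric.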