I have individuals that belong to different categories and are located in different
zones. These populations are expected to grow from the population value below
to the demand value.
population_and_demand_by_category_and_zone <- tibble::tribble(
~category, ~zone, ~population, ~demand,
"A", 1, 115, 138,
"A", 2, 121, 145,
"A", 3, 112, 134,
"A", 4, 76, 91,
"B", 1, 70, 99,
"B", 2, 59, 83,
"B", 3, 86, 121,
"B", 4, 139, 196,
"C", 1, 142, 160,
"C", 2, 72, 81,
"C", 3, 29, 33,
"C", 4, 58, 66,
"D", 1, 22, 47,
"D", 2, 23, 49,
"D", 3, 16, 34,
"D", 4, 45, 96
)
Zones have a given capacity. The current population is below this threshold, but demand
will exceed capacity in some zones.
demand_and_capacity_by_zone <- tibble::tribble(
~zone, ~demand, ~capacity, ~capacity_exceeded,
1, 444, 465, FALSE,
2, 358, 393, FALSE,
3, 322, 500, FALSE,
4, 449, 331, TRUE
)
So we will need to move those individuals to a new zone (we assume we have
enough total capacity).
Each individual that we need to move incurs a cost, which depends on its
category and destination zone. These costs are given below.
costs <- tibble::tribble(
~category, ~zone, ~cost,
"A", 1, 0.1,
"A", 2, 0.1,
"A", 3, 0.1,
"A", 4, 1.3,
"B", 1, 16.2,
"B", 2, 38.1,
"B", 3, 1.5,
"B", 4, 0.1,
"C", 1, 0.1,
"C", 2, 12.7,
"C", 3, 97.7,
"C", 4, 46.3,
"D", 1, 25.3,
"D", 2, 7.7,
"D", 3, 67.3,
"D", 4, 0.1
)
I wish to find the distribution of individuals across zones and categories that
minimizes the total cost; basically, I want a new column new_population
in the table population_and_demand_by_category_and_zone described above.
If several solutions are possible, any will do; if the result is a non-integer
population, that's fine.
The real use case has about 20 categories and 30 zones, so it is bigger, but not all that big.
It seems like a common enough problem, so I'm hoping there is a convenient way to solve it in R.
This can be modeled as a small LP (Linear Programming) model. We introduce non-negative variables move(c,z,z') indicating the number of persons of category c to be moved from zone z to zone z'. The mathematical model can look like:
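A sketch of the formulation, reconstructed from the constraints described in the notes below (alloc(c,z) is the new allocation of category c in zone z):

$$
\begin{aligned}
\min\; & \sum_{c,z,z'} \mathrm{cost}_{c,z'}\,\mathrm{move}_{c,z,z'} \\
\text{s.t.}\; & \mathrm{alloc}_{c,z} = \mathrm{demand}_{c,z} + \sum_{z'} \mathrm{move}_{c,z',z} - \sum_{z'} \mathrm{move}_{c,z,z'} \quad \forall c,z \\
& \sum_{c} \mathrm{alloc}_{c,z} \le \mathrm{capacity}_{z} \quad \forall z \\
& \mathrm{move}_{c,z,z'} \ge 0, \qquad \mathrm{move}_{c,z,z} = 0
\end{aligned}
$$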
This can be implemented using any LP solver. A solution can look like:
---- 83 VARIABLE move.L  moves needed to meet capacity

                 zone1       zone2       zone3
catA.zone1                       6
catA.zone4                      29          62
catC.zone4          27

---- 83 VARIABLE alloc.L  new allocation

             zone1       zone2       zone3       zone4
catA           132         180         196
catB            99          83         121         196
catC           187          81          33          39
catD            47          49          34          96

---- 83 VARIABLE totcost.L = 12.400  total cost
Notes:
Interestingly, the solution shows that we move people out of zone 1 to make room for people from zone 4. So in some cases, making two moves to resettle one person is cheaper. Of course, that depends very much on the cost structure.
The main constraint says: allocation = demand + inflow - outflow.
The constraint move(c,z,z)=0 makes sure we don't move from z to z itself. It is not strictly needed (the positive costs already rule such moves out); I added it for clarity. In practice I implemented it by setting the upper bound of move(c,z,z) to zero rather than as an explicit constraint. For very large models another option is to not even generate the move(c,z,z) variables, but this model is small, so there is no need for that. You can leave it out completely if you want.
I don't use population in the model. I don't think it is needed, unless we look at the next point.
There is a subtlety to think about: can we move only the new persons? (i.e. should the people already in place stay where they are?)
I've taken Erwin's formulation, modified it so that alloc must be at least the population for every zone and category (which means individuals already present don't move), and implemented it using the {lpSolve} package, which doesn't require installing external system libraries.
Erwin's solution can be obtained by using move_new_only <- FALSE below.
SETUP
library(tidyverse)
library(lpSolve)
move_new_only <- TRUE # means population in place can't be reallocated
categories <- unique(population_and_demand_by_category_and_zone$category)
zones <- unique(population_and_demand_by_category_and_zone$zone)
n_cat <- length(categories)
n_zones <- length(zones)
# empty coefficient arrays
move_coefs_template <- array(0, c(n_zones, n_zones, n_cat),
dimnames = list(zones, zones, categories))
alloc_coefs_template <- matrix(0, n_zones, n_cat,
dimnames = list(zones, categories))
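# reshape a long (category, zone, value) table into a zone-by-category matrix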
build_zone_by_category_matrix <- function(data, col) {
data %>%
pivot_wider(
id_cols = zone, names_from = category, values_from = {{col}}) %>%
as.data.frame() %>%
`row.names<-`(.$zone) %>%
select(-zone) %>%
as.matrix()
}
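# zone-by-category matrices of demand, moving cost and current population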
demand_mat <- build_zone_by_category_matrix(
population_and_demand_by_category_and_zone, demand)
cost_mat <- build_zone_by_category_matrix(costs, cost)
population_mat <- build_zone_by_category_matrix(
population_and_demand_by_category_and_zone, population)
OBJECTIVE FUNCTION : total cost
# stack the cost matrix vertically to build an array of all move coefficients
coefs_obj <- move_coefs_template
for(i in 1:n_zones) {
coefs_obj[i,,] <- cost_mat
}
# flatten it for `lp`s `objective.in` argument, adding alloc coefs
f.obj <- c(coefs_obj, alloc_coefs_template)
CONSTRAINT 1 : allocation = demand + inflow - outflow
coefs_con <- list()
for (z in zones) {
coefs_con_zone <- list()
for(categ in categories) {
coefs_arrivals <- move_coefs_template
coefs_arrivals[,z, categ] <- 1
coefs_departures <- move_coefs_template
coefs_departures[z,, categ] <- 1
coefs_moves <- coefs_arrivals - coefs_departures
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, categ] <- -1
# flatten the array
coefs_con_zone[[categ]] <- c(coefs_moves, coefs_alloc)
}
coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}
# stack the flattened arrays to build a matrix
f.con1 <- do.call(rbind, coefs_con)
f.dir1 <- rep("==", n_zones * n_cat)
f.rhs1 <- -c(t(demand_mat)) # transposing so we start with all zone 1 and so on
CONSTRAINT 2 : Allocation never exceeds capacity
coefs_con <- list()
for (z in zones) {
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, ] <- 1
coefs_con[[z]] <- c(move_coefs_template, coefs_alloc)
}
# stack the flattened arrays to build a matrix
f.con2 <- do.call(rbind, coefs_con)
f.dir2 <- rep("<=", n_zones)
f.rhs2 <- demand_and_capacity_by_zone$capacity
CONSTRAINT 3 : Allocation >= Population
i.e. we don't move people that were already there.
If we decide we can move them, the rule becomes Allocation >= 0, and we get Erwin's answer.
coefs_con <- list()
for (z in zones) {
coefs_con_zone <- list()
for(categ in categories) {
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, categ] <- 1
# flatten the array
coefs_con_zone[[categ]] <- c(move_coefs_template, coefs_alloc)
}
coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}
# stack the flattened arrays to build a matrix
f.con3 <- do.call(rbind, coefs_con)
f.dir3 <- rep(">=", n_zones * n_cat)
if (move_new_only) {
f.rhs3 <- c(t(population_mat))
} else {
f.rhs3 <- rep(0, n_zones * n_cat)
}
CONCATENATE OBJECTS
f.con <- rbind(f.con1, f.con2, f.con3)
f.dir <- c(f.dir1, f.dir2, f.dir3)
f.rhs <- c(f.rhs1, f.rhs2, f.rhs3)
SOLVE
# compute the solution and visualize it in the array
results_raw <- lp("min", f.obj, f.con, f.dir, f.rhs)$solution
results_moves <- move_coefs_template
results_moves[] <-
results_raw[1:length(results_moves)]
results_allocs <- alloc_coefs_template
results_allocs[] <-
results_raw[length(results_moves)+(1:length(results_allocs))]
results_moves
#> , , A
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 3 0
#> 3 0 0 0 0
#> 4 13 0 2 0
#>
#> , , B
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 0 57 0
#>
#> , , C
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 8 0 0 0
#>
#> , , D
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 38 0 0
results_allocs
#> A B C D
#> 1 151 99 168 47
#> 2 142 83 81 87
#> 3 139 178 33 34
#> 4 76 139 58 58
TIDY RESULTS
# format as tidy data frame
results_df <-
as.data.frame.table(results_moves) %>%
setNames(c("from", "to", "category", "n")) %>%
filter(n != 0) %>%
mutate_at(c("from", "to"), as.numeric) %>%
mutate_if(is.factor, as.character)
results_df
#> from to category n
#> 1 4 1 A 13
#> 2 2 3 A 3
#> 3 4 3 A 2
#> 4 4 3 B 57
#> 5 4 1 C 8
#> 6 4 2 D 38
UPDATE TABLES
population_and_demand_by_category_and_zone <-
bind_rows(
results_df %>%
group_by(category, zone = to) %>%
summarize(correction = sum(n), .groups = "drop"),
results_df %>%
group_by(category, zone = from) %>%
summarize(correction = -sum(n), .groups = "drop"),
) %>%
left_join(population_and_demand_by_category_and_zone, ., by = c("category", "zone")) %>%
replace_na(list(correction = 0)) %>%
mutate(new_population = demand + correction)
population_and_demand_by_category_and_zone
#> # A tibble: 16 × 6
#> category zone population demand correction new_population
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 115 138 13 151
#> 2 A 2 121 145 -3.00 142
#> 3 A 3 112 134 5.00 139
#> 4 A 4 76 91 -15.0 76
#> 5 B 1 70 99 0 99
#> 6 B 2 59 83 0 83
#> 7 B 3 86 121 57 178
#> 8 B 4 139 196 -57 139
#> 9 C 1 142 160 8 168
#> 10 C 2 72 81 0 81
#> 11 C 3 29 33 0 33
#> 12 C 4 58 66 -8 58
#> 13 D 1 22 47 0 47
#> 14 D 2 23 49 38 87
#> 15 D 3 16 34 0 34
#> 16 D 4 45 96 -38 58
demand_and_capacity_by_zone <-
population_and_demand_by_category_and_zone %>%
group_by(zone) %>%
summarise(population = sum(population), correction = sum(correction), new_population = sum(new_population)) %>%
left_join(demand_and_capacity_by_zone, ., by = "zone")
#> `summarise()` ungrouping output (override with `.groups` argument)
demand_and_capacity_by_zone
#> # A tibble: 4 × 7
#> zone demand capacity capacity_exceeded population correction new_population
#> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl>
#> 1 1 444 465 FALSE 349 21 465
#> 2 2 358 393 FALSE 275 35 393
#> 3 3 322 500 FALSE 243 62 384
#> 4 4 449 331 TRUE 318 -118 331
We see that the population never decreases and that the new allocation stays within capacity.
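A quick way to verify this programmatically (a small check on the two updated tables):

# population never decreases, and the new zone totals respect capacity
stopifnot(
  all(population_and_demand_by_category_and_zone$new_population >=
        population_and_demand_by_category_and_zone$population),
  all(demand_and_capacity_by_zone$new_population <=
        demand_and_capacity_by_zone$capacity)
)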
Related
I have data on bird individuals and their feeding locations. The feeding locations move, and so I want to create a variable that calculates the distance from yesterday's feeding location to "today's" feeding options.
Here is a reprex that exemplifies what I'm talking about. The 'bird' column represents the bird individuals' ids, feedLoc represents the possible feeding locations for each day, and date is the date of that observation. h (horizontal) and v (vertical) represent the coordinates of the feeding locations on a grid. Finally, bp indicates whether that individual was identified at the feeding location or not.
reprex <- tibble(bird = c("A", "A", "A", "B", "B", "B", "C", "C"),
feedLoc = c("x","y", "x", "x", "y", "x", "y", "z"),
date = as.Date(c("2020-05-10", "2020-05-11", "2020-05-11",
"2020-05-24", "2020-05-25", "2020-05-25",
"2020-05-22", "2020-05-23")),
h = c(100, 123, 45, 75, 89, 64, 99, 101),
v = c(89, 23, 65, 92, 29, 90, 120, 34),
bp = c(1, 1, 0, 1, 0, 1, 1, 0))
Which produces this:
# A tibble: 8 × 6
bird feedLoc date h v bp
<chr> <chr> <date> <dbl> <dbl> <dbl>
1 A x 2020-05-10 100 89 1
2 A y 2020-05-11 123 23 1
3 A x 2020-05-11 45 65 0
4 B x 2020-05-24 75 92 1
5 B y 2020-05-25 89 29 0
6 B x 2020-05-25 64 90 1
7 C y 2020-05-22 99 120 1
8 C z 2020-05-23 101 34 0
My question is: I want to make a new variable that calculates the distance from yesterday's feeding choice (i.e. the row from the previous day where bp == 1) to each of the current day's feeding location options for each bird, using the coordinate data. How would I do this? Thanks!
I initially tried to group by the bird and feedLoc ids, arrange by date, and then lag the h and v variables so that I could use the distance formula to calculate the distance from yesterday's ant swarm choice. However, the issue with that is that in this data set the previous row, once arranged, is not always exactly "yesterday".
Create a dataframe filtered to bp == 1, add 1 to the date to match rows to the next day, then left_join() to your original data to compute distances:
library(dplyr)
yesterday <- reprex %>%
filter(bp == 1) %>%
transmute(bird, date = date + 1, h.yest = h, v.yest = v)
reprex %>%
left_join(yesterday) %>%
mutate(
dist = sqrt((h - h.yest)^2 + (v - v.yest)^2)
) %>%
select(!h.yest:v.yest)
# A tibble: 8 × 7
bird feedLoc date h v bp dist
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A x 2020-05-10 100 89 1 NA
2 A y 2020-05-11 123 23 1 69.9
3 A x 2020-05-11 45 65 0 60.0
4 B x 2020-05-24 75 92 1 NA
5 B y 2020-05-25 89 29 0 64.5
6 B x 2020-05-25 64 90 1 11.2
7 C y 2020-05-22 99 120 1 NA
8 C z 2020-05-23 101 34 0 86.0
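Note that left_join() here matches on the shared columns (bird and date); you can make that explicit with left_join(yesterday, by = c("bird", "date")).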
Try something like this dplyr approach, which first restricts the manipulation to just bp == 1, then checks whether the feeding location is different and the previous date is exactly one day behind, then calculates the change in h and v. After all that it adds back in the bp == 0 data and rearranges (this approach avoids a more convoluted case_when statement). If this isn't exactly what you need, post an example of the desired output and I will edit. Good luck!
library(dplyr)
reprex %>%
group_by(bird) %>%
filter(bp == 1) %>%
arrange(date) %>%
mutate(h_change = case_when(
feedLoc != lag(feedLoc) & lag(date) == date - 1 ~ h - lag(h)),
v_change = case_when(
feedLoc != lag(feedLoc) & lag(date) == date - 1 ~ v - lag(v)
)) %>%
right_join(reprex) %>% arrange(bird, date)
Output:
# bird feedLoc date h v bp h_change v_change
# <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A x 2020-05-10 100 89 1 NA NA
# 2 A y 2020-05-11 123 23 1 23 -66
# 3 A x 2020-05-11 45 65 0 NA NA
# 4 B x 2020-05-24 75 92 1 NA NA
# 5 B x 2020-05-25 64 90 1 NA NA
# 6 B y 2020-05-25 89 29 0 NA NA
# 7 C y 2020-05-22 99 120 1 NA NA
# 8 C z 2020-05-23 101 34 0 NA NA
I want to calculate the 95% CI of a value that is duplicated in a column. I have the following data, where the values of total_rt (sum of rt by participant and condition) are duplicated within the same participant and condition:
data <- tribble(
~ participant, ~ condition, ~ region, ~ rt,
"p1", "a", 1, 26,
"p1", "a", 2, 56,
"p1", "a", 3, 43,
"p1", "b", 1, 15,
"p1", "b", 2, 42,
"p1", "b", 3, 36,
"p2", "a", 1, 37,
"p2", "a", 2, 67,
"p2", "a", 3, 37,
"p2", "b", 1, 7,
"p2", "b", 2, 57,
"p2", "b", 3, 27
) |>
group_by(
participant,
condition
) |>
mutate(
total_rt = sum(rt)
) |>
ungroup()
# A tibble: 12 × 5
participant condition region rt total_rt
<chr> <chr> <dbl> <dbl> <dbl>
1 p1 a 1 26 125
2 p1 a 2 56 125
3 p1 a 3 43 125
4 p1 b 1 15 93
5 p1 b 2 42 93
6 p1 b 3 36 93
7 p2 a 1 37 141
8 p2 a 2 67 141
9 p2 a 3 37 141
10 p2 b 1 7 91
11 p2 b 2 57 91
12 p2 b 3 27 91
I want to add columns for the mean and 95% CI of total_rt as shown below. I want to do so by adding a sequence of data manipulation steps using pipes |>, so that I can avoid creating a new data frame only for the calculation of the summary statistics of total_rt:
# A tibble: 12 × 8
participant condition region rt total_rt mean_total_rt upper_total_rt lower_total_rt
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 p1 a 1 26 125 133 149. 117.
2 p1 a 2 56 125 133 149. 117.
3 p1 a 3 43 125 133 149. 117.
4 p1 b 1 15 93 92 94.0 90.
5 p1 b 2 42 93 92 94.0 90.
6 p1 b 3 36 93 92 94.0 90.
7 p2 a 1 37 141 133 149. 117.
8 p2 a 2 67 141 133 149. 117.
9 p2 a 3 37 141 133 149. 117.
10 p2 b 1 7 91 92 94.0 90.
11 p2 b 2 57 91 92 94.0 90.
12 p2 b 3 27 91 92 94.0 90.
To obtain the result above, I currently have to provide a new data frame, called newdata here, for the computation of summary statistics using distinct(), and attach it to the original data frame data using left_join(), as shown below. I want to replace these procedures with some sequence of data manipulation that directly follows the procedure to create data (i.e. follows ... |> ungroup() in the example above). Does anybody have a way to achieve this?
newdata <- data |>
group_by(condition) |>
distinct(
.keep_all = TRUE,
participant
) |>
mutate(
mean_total_rt = mean(total_rt),
sd_total_rt = sd(total_rt),
upper_total_rt = mean_total_rt + qnorm(0.975) * sd(total_rt) / sqrt(n()),
lower_total_rt = mean_total_rt - qnorm(0.975) * sd(total_rt) / sqrt(n())
) |>
ungroup()
Data <- left_join(
data,
newdata,
by = c("participant", "condition")
)
Data |> View()
Note that the calculation of the 95% CIs (and of the standard deviation used to compute them) fails if I avoid using distinct(), since n() wrongly counts the duplicated values.
data |>
group_by(condition) |>
mutate(
mean_total_rt = mean(total_rt),
false_sd_total_rt = sd(total_rt),
false_upper_total_rt = mean_total_rt + qnorm(0.975) * sd(total_rt) / sqrt(n()),
false_lower_total_rt = mean_total_rt - qnorm(0.975) * sd(total_rt) / sqrt(n())
) |>
ungroup() |>
View()
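One way to keep everything in a single pipe (a sketch, not from the original exchange) is to compute the statistics over one row per participant inside mutate(), subsetting with !duplicated(participant) and counting with n_distinct():

library(dplyr)

data |>
  group_by(condition) |>
  mutate(
    # use one total_rt value per participant so duplicates don't inflate n or the sd
    mean_total_rt  = mean(total_rt[!duplicated(participant)]),
    sd_total_rt    = sd(total_rt[!duplicated(participant)]),
    upper_total_rt = mean_total_rt +
      qnorm(0.975) * sd_total_rt / sqrt(n_distinct(participant)),
    lower_total_rt = mean_total_rt -
      qnorm(0.975) * sd_total_rt / sqrt(n_distinct(participant))
  ) |>
  ungroup()

This should reproduce the mean_total_rt, upper_total_rt and lower_total_rt columns shown in the desired output (plus the intermediate sd_total_rt).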
I have a dataset like this.
> dataset
  id   a b  c   d
1  1  10 1 30  50
2  2   5 0  5  60
3  3  20 1 18  90
4  4 103 0 20  80
5  5  16 1 56 100
And so on, up to n columns depending on the requirement.
My colleagues in the research area carry out certain analyses, and what they give me as input is a data frame in which the row names correspond to the variables of interest for a new dataset.
Something like this.
> rownames(Features)
[1] "a"   "b"   "d"   "a:d" "b:d" "b:c" "a:c"
Here the colon (:) represents "the product of". So, in order to continue working, I need to include those products in the original dataset.
I have manually created a function that reproduces the multiplications in the following way (where x represents my original dataset):
Products <- function(x) {
  x <- x %>% mutate(Product1 = x$a * x$d)
  x <- x %>% mutate(Product2 = x$b * x$d)
  x <- x %>% mutate(Product3 = x$b * x$c)
  x <- x %>% mutate(Product4 = x$a * x$c)
  return(x)
}
However, given that the number of products to create is variable every time, I want to find a way to automate the creation of these column multiplications with the names that they give me as input. I'm sure my approach isn't the best, so any help is very welcome.
This seems to be part of some modelling work, and the choice of : to stand for a product is fitting, since that is exactly what it means in R's formula notation. In base R, you could do:
model.matrix(reformulate(c(-1,rownames(Features))), dataset)
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
I added -1 in order to remove the intercept; otherwise you could keep it. Also note that the result is a matrix; you can then convert it to a data frame.
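For instance (a small sketch, not from the original answer):

# build the expanded matrix and convert it back to a data frame
mm <- model.matrix(reformulate(c(-1, rownames(Features))), dataset)
new_dataset <- as.data.frame(mm)
names(new_dataset) <- colnames(mm)  # keep the "a:d"-style names, which may otherwise be made syntactic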
I am not sure if the following code works for you; here eval() + gsub() are used to produce the product columns:
with(
dataset,
list2DF(
setNames(
lapply(
gsub(":", "*", rownames(Features)),
function(x) eval(str2lang(x))
),
rownames(Features)
)
)
)
which gives
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
Data
> dput(dataset)
structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
> dput(Features)
structure(list(Features = 1:7), class = "data.frame", row.names = c("a",
"b", "d", "a:d", "b:d", "b:c", "a:c"))
We could use strsplit to split the names that have :, select the columns of 'dataset' based on those split names, Reduce with * to do elementwise multiplication, and assign those 'Product' columns back to the original 'dataset'.
nm1 <- grep(':', rownames(Features), value = TRUE)
lst1 <- lapply(strsplit(nm1, ":", fixed = TRUE),
function(x) Reduce(`*`, dataset[x]))
dataset[paste0("Product", seq_along(lst1))] <- lst1
-output
dataset
# id a b c d Product1 Product2 Product3 Product4
#1 1 10 1 30 50 500 50 30 300
#2 2 5 0 5 60 300 0 0 25
#3 3 20 1 18 90 1800 90 18 360
#4 4 103 0 20 80 8240 0 0 2060
#5 5 16 1 56 100 1600 100 56 896
data
dataset <- structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
Features <- structure(1:7, .Dim = c(7L, 1L), .Dimnames = list(c("a", "b",
"d", "a:d", "b:d", "b:c", "a:c"), NULL))
This is similar to many of the solutions above. One thing about R is that you can achieve the same result in many different ways, though the underlying principle is still the same.
library(dplyr)
dataset <- tibble(id = c(1, 2, 3, 4, 5),
a = c(10, 5, 20, 103, 16),
b = c(1, 0, 1, 0, 1),
c = c(30, 5, 18, 20, 56),
d = c(50, 60, 90, 80, 100))
features = c("a", "b", "d", "a:d", "b:d", "b:c", "a:c")
final <- bind_cols(
  dataset,
  # only the product terms need to be computed; a, b and d are already in dataset
  map(grep(":", features, value = TRUE),
      function(x) {
        dataset %>%
          mutate(!!x := eval(parse(text = gsub(":", "*", x)))) %>%
          select(!!x)
      }
  )
)
Final dataset.
# A tibble: 5 x 9
id a b c d `a:d` `b:d` `b:c` `a:c`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 1 30 50 500 50 30 300
2 2 5 0 5 60 300 0 0 25
3 3 20 1 18 90 1800 90 18 360
4 4 103 0 20 80 8240 0 0 2060
5 5 16 1 56 100 1600 100 56 896
I'm working on a data cleaning problem where I'm stuck. I've started receiving csv-files in the format shown below and I need to clean it before I can do any analysis. There are several such columns and there can be a couple hundred variables in each cell that need to be extracted.
Original <- structure(list(CustNum = c(0, 1),
Sales = c("[1000, 345, Zero, 56]", "[987, 879, 325, 4568]"),
Amounts = c("[10, 2, 0, 98]", "[57, 25, 52, 75]"),
Number = c("['1', '2', '3', '4']", "['4', '3', '2', '1']"),
Identifier = c("A", "B")),
row.names = c(NA, -2L),
class = c("tbl_df", "tbl", "data.frame"))
What I'm trying to do is wrangle it into this format.
Desired <- tibble(CustNum = c(0, 0, 0, 0, 1, 1, 1, 1),
Sales = c(1000, 345, "Zero", 56, 987, 879, 325, 4568),
Amounts = c(10, 2, 0, 98, 57, 25, 52, 75),
Number = c(1, 2, 3, 4, 4, 3, 2, 1),
Identifier = c("A", "A", "A", "A", "B", "B", "B", "B"))
I've tried a number of different variations of the following type but can't get anywhere.
Original$Sales %>%
str_replace("\\[", "") %>%
str_replace("\\]", "") %>%
str_replace("'", "")
It's easy to do the cleaning in Power Query in Excel, but I would like to find a way to do it in R so I don't have to use several different tools. Can anyone show me how to do this?
Try with this:
library(dplyr) # must be version >= 1.0.0
library(stringr)
Original %>%
mutate(across(everything(), str_remove_all, pattern = "\\[|\\]|\\'")) %>%
mutate(across(everything(), str_split, pattern = ",")) %>%
tidyr::unnest(everything()) %>%
mutate(across(everything(), str_trim)) %>%
mutate(across(c(CustNum, Amounts, Number), as.numeric))
# A tibble: 8 x 5
CustNum Sales Amounts Number Identifier
<dbl> <chr> <dbl> <dbl> <chr>
1 0 1000 10 1 A
2 0 345 2 2 A
3 0 Zero 0 3 A
4 0 56 98 4 A
5 1 987 57 4 B
6 1 879 25 3 B
7 1 325 52 2 B
8 1 4568 75 1 B
Basically:
Remove [ ] '
Split by ,
Unnest the lists
Trim out unnecessary spaces
Set to numeric where necessary
You can try this approach
library(tidyverse)
library(stringr)
Original2 <- Original %>%
mutate_at(vars(Sales, Amounts, Number), ~str_replace_all(., "\\[|\\'|\\]|\\s", "")) %>%
separate_rows(c("Sales", "Amounts", "Number"), sep = ",")
# CustNum Sales Amounts Number Identifier
# <dbl> <chr> <chr> <chr> <chr>
# 1 0 1000 10 1 A
# 2 0 345 2 2 A
# 3 0 Zero 0 3 A
# 4 0 56 98 4 A
# 5 1 987 57 4 B
# 6 1 879 25 3 B
# 7 1 325 52 2 B
# 8 1 4568 75 1 B
Here we remove [, ], ' and whitespace, then use separate_rows() from the tidyr package to split the rows. It takes two steps to achieve the goal.
I would suggest this approach: reshape your Original data to long format first and then separate the rows with sep = ','. After that, clean the variable to remove some special characters. Then you can create an id variable by group in order to transform the data back to wide, as you want in Desired:
library(tidyverse)
#Reshape
Original %>%
pivot_longer(cols = -c(CustNum,Identifier)) %>%
separate_rows(value,sep = ',') %>%
mutate(value=trimws(gsub("[[:punct:]]", " ", value))) %>%
group_by(name) %>% mutate(id=1:n()) %>%
pivot_wider(names_from = name,values_from=value) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 8 x 5
CustNum Identifier Sales Amounts Number
<dbl> <chr> <chr> <chr> <chr>
1 0 A 1000 10 1
2 0 A 345 2 2
3 0 A Zero 0 3
4 0 A 56 98 4
5 1 B 987 57 4
6 1 B 879 25 3
7 1 B 325 52 2
8 1 B 4568 75 1
In my real data, I have multiple outliers for multiple variables. My data looks something like the example below, but the numbers are completely random.
I would like to pull in all data points that are more than 2 SD above or below the mean, using winsorization.
df <- read.table(header = TRUE, sep = ",", strip.white = TRUE, text = "id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
So far I have identified my outliers for the variables of test1 and test2 for each group using the following code:
outlier <- function(x, SD = 2){
mu <- mean(x)
sigma <- sd(x)
out <- x < mu - SD*sigma | x > mu + SD*sigma
out
}
# identify the outliers for each variable by each group
with(df, ave(test1, group, FUN = outlier))
with(df, ave(test2, group, FUN = outlier))
# add these new-found outliers to the data set
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
I am aware of the 'winsorize' function in the 'robustHD' package, but am not sure:
1) how to tailor the command to a 90% winsorization (2 SD), 2) how to ensure the winsorization accounts for the 2 different groups, and 3) how to include multiple variables in that winsorization.
Additionally (but not necessary): is there a way to see which numbers the 'winsorize' function changed, and what they were changed to?
First, make clear how you want to winsorize your data. You have several options:
1) Use the mean +/- 2 sd limits as extreme values and replace all values outside by those limits.
2) Use the observed values closest to (but inside) the mean +/- 2 sd limits.
3) Use the 5% and 95% quantiles (a 90% winsorization).
With options 1 and 3 you will possibly introduce values into your winsorized variable that were never observed; with option 2 you will only have observed values. Note also that the (5%, 95%) quantiles will not necessarily be near the mean +/- 2 sd limits unless your data are reasonably well-behaved and normally distributed.
For the winsorization itself you can use DescTools::Winsorize(), which accepts both probs and explicit values for the limits.
Implementation 1)
library(DescTools)

x <- rnorm(100)
w1 <- Winsorize(x,
minval = mean(x) - 2*sd(x),
maxval = mean(x) + 2*sd(x))
For 2) you could use something like
w2 <- Winsorize(x,
minval = max(Coalesce(x[x <= mean(x)-2*sd(x)], mean(x)-2*sd(x))),
maxval = min(Coalesce(x[x >= mean(x)+2*sd(x)], mean(x)+2*sd(x))))
The mean +/- 2 sd limits are provided as fallback values for cases where no observations lie outside the limits. Coalesce() returns the first non-empty argument, so Winsorize() will always get a valid limit.
Option 3) is the default for the function
w3 <- Winsorize(x, probs=c(0.05, 0.95))
For the group-wise application, wrap the call in a function and apply it per group (here for option 1):
df$w1 <- unsplit(
tapply(df$test1, df$group,
function(x) Winsorize(x,
minval = mean(x) - 2*sd(x),
maxval = mean(x) + 2*sd(x)) )
, f=df$group)
The replaced values can be found with
cbind(x, w1)[x!=w1,]
Here's a start - hopefully someone has a better solution for you.
library(tidyverse)
df <- tibble::tribble(
~id, ~group, ~test1, ~test2,
1, 0, 57, 82,
2, 0, 77, 80,
3, 0, 67, 90,
4, 0, 15, 70,
5, 0, 58, 72,
6, 1, 18, 44,
7, 1, 44, 44,
8, 1, 18, 46,
9, 1, 20, 44,
10, 1, 14, 38
)
df
#> # A tibble: 10 x 4
#> id group test1 test2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82
#> 2 2 0 77 80
#> 3 3 0 67 90
#> 4 4 0 15 70
#> 5 5 0 58 72
#> 6 6 1 18 44
#> 7 7 1 44 44
#> 8 8 1 18 46
#> 9 9 1 20 44
#> 10 10 1 14 38
library(DescTools)
df %>%
group_by(group) %>%
mutate(
test2_winsorized = DescTools::Winsorize(
test2,
maxval = quantile(df$test2, 0.90),
minval = quantile(df$test2, 0.10)
),
test1_winsorized = DescTools::Winsorize(
test1,
maxval = quantile(df$test1, 0.90),
minval = quantile(df$test1, 0.10)
)
)
#> # A tibble: 10 x 6
#> # Groups: group [2]
#> id group test1 test2 test2_winsorized test1_winsorized
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82 82 57
#> 2 2 0 77 80 80 68
#> 3 3 0 67 90 82.8 67
#> 4 4 0 15 70 70 15
#> 5 5 0 58 72 72 58
#> 6 6 1 18 44 44 18
#> 7 7 1 44 44 44 44
#> 8 8 1 18 46 46 18
#> 9 9 1 20 44 44 20
#> 10 10 1 14 38 43.4 14.9
Created on 2019-06-06 by the reprex package (v0.2.1)