90% Winsorization on multiple variables - r

In my real data, I have multiple outliers for multiple variables. My data looks something like the example below but the numbers are completely random.
I would like to pull all data points that are greater than or less than 2 SD using a winsorization.
df<-read.table(header=T, text="id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
So far I have identified my outliers for the variables of test1 and test2 for each group using the following code:
outlier <- function(x, SD = 2){
mu <- mean(x)
sigma <- sd(x)
out <- x < mu - SD*sigma | x > mu + SD*sigma
out
}
# identify the outliers for each variable by each group
with(df, ave(test1, group, FUN = outlier))
with(df, ave(test2, group, FUN = outlier))
# add these new-found outliers to the data set
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
I am aware of the 'winsorize' function in the 'robustHD' package but am not sure:
1). how to tailor the command to a 90% winsorization (2 SD), 2). ensuring the winsorization accounts for the 2 different groups, 3). and including multiple variables in that winsorization.
Additionally, but not necessary...is there a way to see what the 'winsorize' function changed the numbers from to what the numbers were changed to?

Make first clear, how you want to winsorize your data. You have several options.
Use the mean+/-2sd limits as extreme values and replace all values outside by those
Use the observed value next to the mean+/-2sd limits
Use the 90% quantile
In option 1 and 3 you will possibly introduce values into your winsorized variable, which were not observed, in option 2 you will only have observed values. Note also, that the (5%, 95%)-quantile will not necessarily be near to 2*sd if you don't have reasonably well behaved normally distributed data.
For the winsorization process you can use DescTools::Winsorize(), which accepts both, probs and values for the limits.
Implementation 1)
x <- rnorm(100)
w1 <- Winsorize(x,
minval = mean(x) - 2*sd(x),
maxval = mean(x) + 2*sd(x))
For 2) you could use something like
w2 <- Winsorize(x,
minval = max(Coalesce(x[x <= mean(x)-2*sd(x)], mean(x)-2*sd(x))),
maxval = min(Coalesce(x[x >= mean(x)+2*sd(x)], mean(x)+2*sd(x))))
Provide some escalating values for cases where there are no values outside the limits. Coalesce() returns the first non empty value, so Winsorize() will always get a valid limit.
Option 3) is the default for the function
w3 <- Winsorize(x, probs=c(0.05, 0.95))
Define a function for the groupwise apply as (here for option 1):
df$w1 <- unsplit(
tapply(df$test1, df$group,
function(x) Winsorize(x,
minval = mean(x) - 2*sd(x),
maxval = mean(x) + 2*sd(x)) )
, f=df$group)
The replaced values can be found with
cbind(x, w1)[x!=w1,]

Here's a start - hopefully someone has a better solution for you.
library(tidyverse)
df <- tibble::tribble(
~id, ~group, ~test1, ~test2,
1, 0, 57, 82,
2, 0, 77, 80,
3, 0, 67, 90,
4, 0, 15, 70,
5, 0, 58, 72,
6, 1, 18, 44,
7, 1, 44, 44,
8, 1, 18, 46,
9, 1, 20, 44,
10, 1, 14, 38
)
df
#> # A tibble: 10 x 4
#> id group test1 test2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82
#> 2 2 0 77 80
#> 3 3 0 67 90
#> 4 4 0 15 70
#> 5 5 0 58 72
#> 6 6 1 18 44
#> 7 7 1 44 44
#> 8 8 1 18 46
#> 9 9 1 20 44
#> 10 10 1 14 38
library(DescTools)
df %>%
group_by(group) %>%
mutate(
test2_winsorized = DescTools::Winsorize(
test2,
maxval = quantile(df$test2, 0.90),
minval = quantile(df$test2, 0.10)
),
test1_winsorized = DescTools::Winsorize(
test1,
maxval = quantile(df$test1, 0.90),
minval = quantile(df$test1, 0.10)
)
)
#> # A tibble: 10 x 6
#> # Groups: group [2]
#> id group test1 test2 test2_winsorized test1_winsorized
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82 82 57
#> 2 2 0 77 80 80 68
#> 3 3 0 67 90 82.8 67
#> 4 4 0 15 70 70 15
#> 5 5 0 58 72 72 58
#> 6 6 1 18 44 44 18
#> 7 7 1 44 44 44 44
#> 8 8 1 18 46 46 18
#> 9 9 1 20 44 44 20
#> 10 10 1 14 38 43.4 14.9
Created on 2019-06-06 by the reprex package (v0.2.1)

Related

Minimise cost of reallocating individuals

I have individuals that belong to different categories, they are located in different
zones, these populations are expected to grow from the population value below
to the demand value.
population_and_demand_by_category_and_zone <- tibble::tribble(
~category, ~zone, ~population, ~demand,
"A", 1, 115, 138,
"A", 2, 121, 145,
"A", 3, 112, 134,
"A", 4, 76, 91,
"B", 1, 70, 99,
"B", 2, 59, 83,
"B", 3, 86, 121,
"B", 4, 139, 196,
"C", 1, 142, 160,
"C", 2, 72, 81,
"C", 3, 29, 33,
"C", 4, 58, 66,
"D", 1, 22, 47,
"D", 2, 23, 49,
"D", 3, 16, 34,
"D", 4, 45, 96
)
Zones have a given capacity, current population is below this threshold, but demand
will exceed capacity in some zones.
demand_and_capacity_by_zone <- tibble::tribble(
~zone, ~demand, ~capacity, ~capacity_exceeded,
1, 444, 465, FALSE,
2, 358, 393, FALSE,
3, 322, 500, FALSE,
4, 449, 331, TRUE
)
So we will need to move those individuals to a new zone (we assume we have
enough total capacity).
Each individual that we need to move incurs a cost, which depends on its
category and destination zone. These costs are given below.
costs <- tibble::tribble(
~category, ~zone, ~cost,
"A", 1, 0.1,
"A", 2, 0.1,
"A", 3, 0.1,
"A", 4, 1.3,
"B", 1, 16.2,
"B", 2, 38.1,
"B", 3, 1.5,
"B", 4, 0.1,
"C", 1, 0.1,
"C", 2, 12.7,
"C", 3, 97.7,
"C", 4, 46.3,
"D", 1, 25.3,
"D", 2, 7.7,
"D", 3, 67.3,
"D", 4, 0.1
)
I wish to find the distribution of individuals across zones and categories so that
the total cost is minimized. So basically have a new column new_population
in the table population_and_demand_by_category_and_zone described above.
If several solutions are possible, any will do, if the result is a non integer
population, that's fine.
The real use case has about 20 categories and 30 zones, so bigger but not all that big.
It seems like a problem that would be common enough so I'm hoping that there is a convenient way to solve this in R.
This can be modeled as a small LP (Linear Programming) model. We introduce non-negative variables move(c,z,z') indicating the number of persons of category c to be moved from zone z to zone z'. The mathematical model can look like:
This can be implemented using any LP solver. A solution can look like:
---- 83 VARIABLE move.L moves needed to meet capacity
zone1 zone2 zone3
catA.zone1 6
catA.zone4 29 62
catC.zone4 27
---- 83 VARIABLE alloc.L new allocation
zone1 zone2 zone3 zone4
catA 132 180 196
catB 99 83 121 196
catC 187 81 33 39
catD 47 49 34 96
---- 83 VARIABLE totcost.L = 12.400 total cost
Notes:
Interestingly the solution shows that we move people out of zone 1 to make room for people from zone 4. So in some cases, making 2 moves to resettle one person is cheaper. Of course, that depends very much on the cost structure.
The main constraint says: allocation = demand + inflow - outflow
The constraint move(c,z,z)=0 makes sure we don't move from z to z itself. This constraint is not really needed (it is implicitly enforced by the cost). I have added it for clarity. Actually, I implemented this by setting the upper bound of move(c,z,z) to zero (i.e. without an explicit constraint). For very large models I would use another possibility: don't even generate the variables move(c,z,z). This model is small, so no need for that. You can leave it out completely if you want.
I don't use population in the model. I don't think it is needed, that is unless we look at the next bullet.
There are some subtleties to think about: can we only move new persons? (i.e. original people should be allowed to stay)
I've taken Erwin's formulation, modified it to consider that alloc should be more than the population for every zone and category, (which means already present individuals don't move), and implemented it using the {lpSolve} package, which doesn't require installing external system libraries.
Erwin's solution can be obtained by using move_new_only <- FALSE below.
SETUP
library(tidyverse)
library(lpSolve)
move_new_only <- TRUE # means population in place can't be reallocated
categories <- unique(population_and_demand_by_category_and_zone$category)
zones <- unique(population_and_demand_by_category_and_zone$zone)
n_cat <- length(categories)
n_zones <- length(zones)
# empty coefficient arrays
move_coefs_template <- array(0, c(n_zones, n_zones, n_cat),
dimnames = list(zones, zones, categories))
alloc_coefs_template <- matrix(0, n_zones, n_cat,
dimnames = list(zones, categories))
build_zone_by_category_matrix <- function(data, col) {
data %>%
pivot_wider(
id_cols = zone, names_from = category, values_from = {{col}}) %>%
as.data.frame() %>%
`row.names<-`(.$zone) %>%
select(-zone) %>%
as.matrix()
}
demand_mat <- build_zone_by_category_matrix(
population_and_demand_by_category_and_zone, demand)
cost_mat <- build_zone_by_category_matrix(costs, cost)
population_mat <- build_zone_by_category_matrix(
population_and_demand_by_category_and_zone, population)
OBJECTIVE FUNCTION : total cost
# stack the cost matrix vertically to build an array of all move coefficients
coefs_obj <- move_coefs_template
for(i in 1:n_zones) {
coefs_obj[i,,] <- cost_mat
}
# flatten it for `lp`s `objective.in` argument, adding alloc coefs
f.obj <- c(coefs_obj, alloc_coefs_template)
CONSTRAINT 1 : allocation = demand + inflow - outflow
coefs_con <- list()
for (z in zones) {
coefs_con_zone <- list()
for(categ in categories) {
coefs_arrivals <- move_coefs_template
coefs_arrivals[,z, categ] <- 1
coefs_departures <- move_coefs_template
coefs_departures[z,, categ] <- 1
coefs_moves <- coefs_arrivals - coefs_departures
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, categ] <- -1
# flatten the array
coefs_con_zone[[categ]] <- c(coefs_moves, coefs_alloc)
}
coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}
# stack the flattened arrays to build a matrix
f.con1 <- do.call(rbind, coefs_con)
f.dir1 <- rep("==", n_zones * n_cat)
f.rhs1 <- -c(t(demand_mat)) # transposing so we start with all zone 1 and so on
CONSTRAINT 2 : Allocation never exceeds capacity
coefs_con <- list()
for (z in zones) {
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, ] <- 1
coefs_con[[z]] <- c(move_coefs_template, coefs_alloc)
}
# stack the flattened arrays to build a matrix
f.con2 <- do.call(rbind, coefs_con)
f.dir2 <- rep("<=", n_zones)
f.rhs2 <- demand_and_capacity_by_zone$capacity
CONSTRAINT 3 : Allocation >= Population
i.e. we don't move people that were already there.
If we decide we can move them the rule becomes Allocation >= 0, and we get Erwin's answer.
coefs_con <- list()
for (z in zones) {
coefs_con_zone <- list()
for(categ in categories) {
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, categ] <- 1
# flatten the array
coefs_con_zone[[categ]] <- c(move_coefs_template, coefs_alloc)
}
coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}
# stack the flattened arrays to build a matrix
f.con3 <- do.call(rbind, coefs_con)
f.dir3 <- rep(">=", n_zones * n_cat)
if (move_new_only) {
f.rhs3 <- c(t(population_mat))
} else {
f.rhs3 <- rep(0, n_zones * n_cat)
}
CONCATENATE OBJECTS
f.con <- rbind(f.con1, f.con2, f.con3)
f.dir <- c(f.dir1, f.dir2, f.dir3)
f.rhs <- c(f.rhs1, f.rhs2, f.rhs3)
SOLVE
# compute the solution and visualize it in the array
results_raw <- lp("min", f.obj, f.con, f.dir, f.rhs)$solution
results_moves <- move_coefs_template
results_moves[] <-
results_raw[1:length(results_moves)]
results_allocs <- alloc_coefs_template
results_allocs[] <-
results_raw[length(results_moves)+(1:length(results_allocs))]
results_moves
#> , , A
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 3 0
#> 3 0 0 0 0
#> 4 13 0 2 0
#>
#> , , B
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 0 57 0
#>
#> , , C
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 8 0 0 0
#>
#> , , D
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 38 0 0
results_allocs
#> A B C D
#> 1 151 99 168 47
#> 2 142 83 81 87
#> 3 139 178 33 34
#> 4 76 139 58 58
TIDY RESULTS
# format as tidy data frame
results_df <-
as.data.frame.table(results_moves) %>%
setNames(c("from", "to", "category", "n")) %>%
filter(n != 0) %>%
mutate_at(c("from", "to"), as.numeric) %>%
mutate_if(is.factor, as.character)
results_df
#> from to category n
#> 1 4 1 A 13
#> 2 2 3 A 3
#> 3 4 3 A 2
#> 4 4 3 B 57
#> 5 4 1 C 8
#> 6 4 2 D 38
UPDATE TABLES
population_and_demand_by_category_and_zone <-
bind_rows(
results_df %>%
group_by(category, zone = to) %>%
summarize(correction = sum(n), .groups = "drop"),
results_df %>%
group_by(category, zone = from) %>%
summarize(correction = -sum(n), .groups = "drop"),
) %>%
left_join(population_and_demand_by_category_and_zone, ., by = c("category", "zone")) %>%
replace_na(list(correction =0)) %>%
mutate(new_population = demand + correction)
population_and_demand_by_category_and_zone
#> # A tibble: 16 × 6
#> category zone population demand correction new_population
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 115 138 13 151
#> 2 A 2 121 145 -3.00 142
#> 3 A 3 112 134 5.00 139
#> 4 A 4 76 91 -15.0 76
#> 5 B 1 70 99 0 99
#> 6 B 2 59 83 0 83
#> 7 B 3 86 121 57 178
#> 8 B 4 139 196 -57 139
#> 9 C 1 142 160 8 168
#> 10 C 2 72 81 0 81
#> 11 C 3 29 33 0 33
#> 12 C 4 58 66 -8 58
#> 13 D 1 22 47 0 47
#> 14 D 2 23 49 38 87
#> 15 D 3 16 34 0 34
#> 16 D 4 45 96 -38 58
demand_and_capacity_by_zone <-
population_and_demand_by_category_and_zone %>%
group_by(zone) %>%
summarise(population = sum(population), correction = sum(correction), new_population = sum(new_population)) %>%
left_join(demand_and_capacity_by_zone, ., by = "zone")
#> `summarise()` ungrouping output (override with `.groups` argument)
demand_and_capacity_by_zone
#> # A tibble: 4 × 7
#> zone demand capacity capacity_exceeded population correction new_population
#> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl>
#> 1 1 444 465 FALSE 349 21 465
#> 2 2 358 393 FALSE 275 35 393
#> 3 3 322 500 FALSE 243 62 384
#> 4 4 449 331 TRUE 318 -118 331
We see that the population never decreases and stays under capacity.

Discretizing a continous variable keeping out zeros

I want to discretize a column which contains of a continous variable.
the data looks like ;
c(0,25,77,423,6,8,3,65,32,22,10,0,8,0,15,0,10,1,2,4,5,5,6)
I want turn the numbers into categorical by discretizing, but zeros represent a different category. Sometimes directly discretizing could keep different numbers with zero.
I thought if I keep zeros out then discretize my wish comes true. But in a dataframe column I can't do it because of indexes:
here is an example dput() output
structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65,
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA,
-23L))
for example, if I'd like to use 2 breaks, categories should be; zero and the other 3 discretized ones, totally 4 categories. it should be better if I could write function that discretizes a column that can be directly created with dplyr::mutate()
thanks in advance.
If I understood it correctly, your goal is to keep "0" as a separate category when discretizing. Here's a solution using arules::discretize to make a new function that can accomplish this:
library(arules)
#> Loading required package: Matrix
#>
#> Attaching package: 'arules'
#> The following objects are masked from 'package:base':
#>
#> abbreviate, write
library(tidyverse)
df <- structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65,
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA,
-23L))
discretize_keep <- function(vec, keep, ...) {
vec2 <- vec
vec2[vec2==keep] <- NA
dsc <- arules::discretize(vec2, ...)
fct_explicit_na(dsc, na_level = str_glue("[{keep}]"))
}
df %>%
mutate(discrete_column = discretize_keep(dummy_column, keep = 0, breaks = 3))
#> dummy_column discrete_column
#> 1 0 [0]
#> 2 25 [15,423]
#> 3 77 [15,423]
#> 4 423 [15,423]
#> 5 6 [6,15)
#> 6 8 [6,15)
#> 7 3 [1,6)
#> 8 65 [15,423]
#> 9 32 [15,423]
#> 10 22 [15,423]
#> 11 10 [6,15)
#> 12 0 [0]
#> 13 8 [6,15)
#> 14 0 [0]
#> 15 15 [15,423]
#> 16 0 [0]
#> 17 10 [6,15)
#> 18 1 [1,6)
#> 19 2 [1,6)
#> 20 4 [1,6)
#> 21 5 [1,6)
#> 22 5 [1,6)
#> 23 6 [6,15)
If you have breaks c(20,50) like below, you can try cut to discretize dummy_column, e.g.,
breaks <- c(20, 50)
df %>%
mutate(discrete = cut(dummy_column, c(-1, 0, breaks, max(dummy_column))))
which gives
dummy_column discrete
1 0 (-1,0]
2 25 (20,50]
3 77 (50,423]
4 423 (50,423]
5 6 (0,20]
6 8 (0,20]
7 3 (0,20]
8 65 (50,423]
9 32 (20,50]
10 22 (20,50]
11 10 (0,20]
12 0 (-1,0]
13 8 (0,20]
14 0 (-1,0]
15 15 (0,20]
16 0 (-1,0]
17 10 (0,20]
18 1 (0,20]
19 2 (0,20]
20 4 (0,20]
21 5 (0,20]
22 5 (0,20]
23 6 (0,20]

Create column which tells source of number between the first two columns

I have a data frame which has three columns:
df <- structure(list(lowage = c(45, 15, 9, 51, 22, 45, 4, 4, 9, 25),
highage = c(50, 21, 14, 60, 24, 50, 8, 8, 14, 30)), .Names = c("lowage",
"highage"), row.names = c(NA, 10L), class = "data.frame")
df$random_number <- apply(df, 1, function(x) sample(seq(x[1], x[2]), 1))
I want to create a fourth column that tells us the source of the where the random_number comes from. So for example, in the first row, the column lowage = 45 and highage = 46. Say, the random number generated is 46 (for example). I'd like to create a fourth column where it says as a label 'highage' since it comes from the highage column. And so on...
If the solution can be in dplyr, that would be great!
Is this what you want?
df %>%
mutate(newcol =
case_when(random_number == lowage ~ "lowage",
random_number == highage ~ "highage",
TRUE ~ "between"))
# lowage highage random_number newcol
# 1 45 50 47 between
# 2 15 21 18 between
# 3 9 14 13 between
# 4 51 60 57 between
# 5 22 24 23 between
# 6 45 50 49 between
# 7 4 8 4 lowage
# 8 4 8 6 between
# 9 9 14 9 lowage
# 10 25 30 27 between

Is there a way to automate the multiplication of columns in R in relation to their names?

I have a dataset like this.
> dataset
id
a
b
c
d
1
10
1
30
50
2
5
0
5
60
3
20
1
18
90
4
103
0
20
80
5
16
1
56
100
And so on up to 'n' number of columns in relation to the requirement.
My colleagues in the research area carry out certain analyzes where what they give me as an input is a data frame. In which the row names correspond to the variables of interest for a new dataset.
Something like this.
> rownames(Features)
a
b
d
a:d
b:d
b:c
a:c
Where the colon (:) represents the "product of". So in order to continue working I need to include those products in the original dataset.
I have manually created a function that reproduces the multiplications in the following way (where x represents my original dataset):
Products<- function(x){x<- x %>% mutate(Product1=x$a*x$d)
x<- x %>% mutate(Product2=x$b*x$d)
x<- x %>% mutate(Product3=x$b*x$c)
x<- x %>% mutate(Product4=x$a*x$c)
return(x)}
However, given that the number of products to create is variable every time, I want to find a way to automate the creation of these column multiplications with the names that they give me as input. I'm sure my approach isn't the best, so any help is very welcome.
This seems to be some modelling being done and the choice of : to stand for product is quite intriguing as that is often the case in the modeling world. In base R, you could do:
model.matrix(reformulate(c(-1,rownames(Features))), dataset)
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
I added -1 in order to remove the intercept. Otherwise you could maintain it. Also note that this is a matrix, you can then change it to a dataframe.
I am not sure if the following code works for your, where eval + gsub are used to produce the product columns
with(
dataset,
list2DF(
setNames(
lapply(
gsub(":", "*", rownames(Features)),
function(x) eval(str2lang(x))
),
rownames(Features)
)
)
)
which gives
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
Data
> dput(dataset)
structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
> dput(Features)
structure(list(Features = 1:7), class = "data.frame", row.names = c("a",
"b", "d", "a:d", "b:d", "b:c", "a:c"))
We could use strsplit to split the names that have :, select the column in the 'dataset' based on that splitted named, Reduce with * to do elementwise multiplication, and assign those 'Product' columns to the original 'dataset'
nm1 <- grep(':', rownames(Features), value = TRUE)
lst1 <- lapply(strsplit(nm1, ":", fixed = TRUE),
function(x) Reduce(`*`, dataset[x]))
dataset[paste0("Product", seq_along(lst1))] <- lst1
-output
dataset
# id a b c d Product1 Product2 Product3 Product4
#1 1 10 1 30 50 500 50 30 300
#2 2 5 0 5 60 300 0 0 25
#3 3 20 1 18 90 1800 90 18 360
#4 4 103 0 20 80 8240 0 0 2060
#5 5 16 1 56 100 1600 100 56 896
data
dataset <- structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
Features <- structure(1:7, .Dim = c(7L, 1L), .Dimnames = list(c("a", "b",
"d", "a:d", "b:d", "b:c", "a:c"), NULL))
This is similar to many solutions above. One thing about R is you can achieve the same results with so many different ways though the underline principle is still the same.
library(dplyr)
dataset <- tibble(id = c(1, 2, 3, 4, 5),
a = c(10, 5, 20, 103, 16),
b = c(1, 0, 1, 0, 1),
c = c(30, 5, 18, 20, 56),
d = c(50, 60, 90, 80, 100))
features = c("a", "b", "d", "a:d", "b:d", "b:c", "a:c")
final <- bind_cols(
map(features,
function(x) {
dataset %>%
mutate(!!x := eval(parse(text=gsub(":", "*", x)))) %>%
select(!!x)
}
))
Final dataset.
# A tibble: 5 x 9
id a b c d `a:d` `b:d` `b:c` `a:c`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 1 30 50 500 50 30 300
2 2 5 0 5 60 300 0 0 25
3 3 20 1 18 90 1800 90 18 360
4 4 103 0 20 80 8240 0 0 2060
5 5 16 1 56 100 1600 100 56 896

R ddply looping ; multiple factors

I want to use ddply to summarize data from multiple variables, by multiple factors.
I have the following test data:
site block plot rep name weight height dtf
Alberta 1 2 1 A 43 139 54
Alberta 2 5 2 A 46 139 46
Alberta 4 10 3 A 49 136 54
Nunavut 1 1 1 A 49 136 59
Nunavut 2 4 2 A 51 135 50
Nunavut 3 8 3 A 52 133 56
Alberta 5 13 1 B 55 132 50
Alberta 4 12 2 B 55 125 46
Alberta 5 15 3 B 56 120 46
Nunavut 5 14 1 B 57 119 54
Nunavut 5 13 2 B 58 119 55
Nunavut 4 11 3 B 59 118 51
...
and so on.
I want to take the variables "weight", "height", "dtf", and summarize them according to the factors "site" and "name".
I started with vectors of column names:
data.factors <- NULL
data.variables <- NULL
for(n in 1:length(data)){if(is.factor(data[[n]])){ data.factors <- c(data.factors,colnames(data[n]))} else next}
for(n in 1:length(data)){if(is.numeric(data[[n]]) || is.integer(data[[n]])){ data.variables <- c(data.variables,colnames(data[n]))} else next}
This worked for performing multiple single-factor anovas:
for(variables in data.variables){
for(factors in data.factors){
output1 <- aov(lm(data[[variables]]~data[[factors]]))
cat(variables)
cat(" by ")
cat(factors)
cat("\n")
print(summary(output1))
}}
But I cannot get it to work with ddply.
for (x in data.variables){
variable.summary <- ddply(data, .(site,name), summarise,
N = sum(!is.na(x[1])),
min = min(x[1], na.rm=TRUE),
max = max(x[1], na.rm=TRUE),
mean = mean(x[1], na.rm=TRUE),
sd = sd(x[1], na.rm=TRUE),
se = sd / sqrt(N)
)
print(variable.summary)
}
All I get is the following:
site name N min max mean sd se
1 Alberta A 1 weight weight NA NA NA
2 Alberta B 1 weight weight NA NA NA
3 Alberta C 1 weight weight NA NA NA
4 Alberta D 1 weight weight NA NA NA
5 Alberta E 1 weight weight NA NA NA
6 Nunavut A 1 weight weight NA NA NA
7 Nunavut B 1 weight weight NA NA NA
8 Nunavut C 1 weight weight NA NA NA
9 Nunavut D 1 weight weight NA NA NA
10 Nunavut E 1 weight weight NA NA NA
....
Were I to test ddply using a single variable (typed in directly rather that referenced through "x") it would work fine.
Is there a trick to getting the function to recognize the referenced column ID? I'm used to PERL, with its $Scalars that can be referenced anywhere, and was hoping a similar system was available in R.
The successor to ddply, dplyr, can do this really easily using group_by() and summarise_each(), with no need to loop anything:
df <- data.frame(site = c("Alberta", "Alberta", "Alberta", "Nunavut", "Nunavut", "Nunavut", "Alberta", "Alberta", "Alberta", "Nunavut", "Nunavut", "Nunavut"),
block = c(1, 2, 4, 1, 2, 3, 5, 4, 5, 5, 5, 4),
plot = c(2, 5, 10, 1, 4, 8, 13, 12, 15, 14, 13, 11),
rep = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
weight = c(43, 46, 49, 49, 51, 52, 55, 55, 56, 57, 58, 59),
height = c(139, 139, 136, 136, 135, 133, 132, 125, 120, 119, 119, 118),
dtf = c(54, 46, 54, 59, 50, 56, 50, 46, 46, 54, 55, 51))
library(dplyr)
df.summary <- df %>%
group_by(site, name) %>%
summarise_each(funs(sum, min, max, mean, sd), weight, height, dtf)
Which results in a data frame like this:
> df.summary
Source: local data frame [4 x 17]
Groups: site
site name weight_length height_length dtf_length weight_min height_min dtf_min
1 Alberta A 3 3 3 43 136 46
2 Alberta B 3 3 3 55 120 46
3 Nunavut A 3 3 3 49 133 50
4 Nunavut B 3 3 3 57 118 51
Variables not shown: weight_max (dbl), height_max (dbl), dtf_max (dbl), weight_mean (dbl),
height_mean (dbl), dtf_mean (dbl), weight_sd (dbl), height_sd (dbl), dtf_sd (dbl)
You can pass any function you want to the funs() inside summarise_each, so if you want a column for standard errors, just make the function first:
se <- function(x) {
N <- sum(!is.na(x[1]))
return(sd / sqrt(N))
}
And pass it through: summarise_each(funs(sum, min, max, mean, sd, se)...)
Try with data.table:
> testdt = data.table(testdf)
> testdt[,list(meanwt=mean(weight),meanht=mean(height) ),by=list(site,name)]
site name meanwt meanht
1: Alberta A 46.00000 138.0000
2: Nunavut A 50.66667 134.6667
3: Alberta B 55.33333 125.6667
4: Nunavut B 58.00000 118.6667
Max, min etc can be added to function list.

Resources