I want to use ddply to summarize data from multiple variables, by multiple factors.
I have the following test data:
site block plot rep name weight height dtf
Alberta 1 2 1 A 43 139 54
Alberta 2 5 2 A 46 139 46
Alberta 4 10 3 A 49 136 54
Nunavut 1 1 1 A 49 136 59
Nunavut 2 4 2 A 51 135 50
Nunavut 3 8 3 A 52 133 56
Alberta 5 13 1 B 55 132 50
Alberta 4 12 2 B 55 125 46
Alberta 5 15 3 B 56 120 46
Nunavut 5 14 1 B 57 119 54
Nunavut 5 13 2 B 58 119 55
Nunavut 4 11 3 B 59 118 51
...
and so on.
I want to take the variables "weight", "height", "dtf", and summarize them according to the factors "site" and "name".
I started with vectors of column names:
data.factors <- NULL
data.variables <- NULL
for(n in 1:length(data)){if(is.factor(data[[n]])){ data.factors <- c(data.factors,colnames(data[n]))} else next}
for(n in 1:length(data)){if(is.numeric(data[[n]]) || is.integer(data[[n]])){ data.variables <- c(data.variables,colnames(data[n]))} else next}
This worked for performing multiple single-factor anovas:
for(variables in data.variables){
for(factors in data.factors){
output1 <- aov(lm(data[[variables]]~data[[factors]]))
cat(variables)
cat(" by ")
cat(factors)
cat("\n")
print(summary(output1))
}}
But I cannot get it to work with ddply.
for (x in data.variables){
variable.summary <- ddply(data, .(site,name), summarise,
N = sum(!is.na(x[1])),
min = min(x[1], na.rm=TRUE),
max = max(x[1], na.rm=TRUE),
mean = mean(x[1], na.rm=TRUE),
sd = sd(x[1], na.rm=TRUE),
se = sd / sqrt(N)
)
print(variable.summary)
}
All I get is the following:
site name N min max mean sd se
1 Alberta A 1 weight weight NA NA NA
2 Alberta B 1 weight weight NA NA NA
3 Alberta C 1 weight weight NA NA NA
4 Alberta D 1 weight weight NA NA NA
5 Alberta E 1 weight weight NA NA NA
6 Nunavut A 1 weight weight NA NA NA
7 Nunavut B 1 weight weight NA NA NA
8 Nunavut C 1 weight weight NA NA NA
9 Nunavut D 1 weight weight NA NA NA
10 Nunavut E 1 weight weight NA NA NA
....
Were I to test ddply using a single variable (typed in directly rather that referenced through "x") it would work fine.
Is there a trick to getting the function to recognize the referenced column ID? I'm used to PERL, with its $Scalars that can be referenced anywhere, and was hoping a similar system was available in R.
The successor to ddply, dplyr, can do this really easily using group_by() and summarise_each(), with no need to loop anything:
df <- data.frame(site = c("Alberta", "Alberta", "Alberta", "Nunavut", "Nunavut", "Nunavut", "Alberta", "Alberta", "Alberta", "Nunavut", "Nunavut", "Nunavut"),
block = c(1, 2, 4, 1, 2, 3, 5, 4, 5, 5, 5, 4),
plot = c(2, 5, 10, 1, 4, 8, 13, 12, 15, 14, 13, 11),
rep = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
weight = c(43, 46, 49, 49, 51, 52, 55, 55, 56, 57, 58, 59),
height = c(139, 139, 136, 136, 135, 133, 132, 125, 120, 119, 119, 118),
dtf = c(54, 46, 54, 59, 50, 56, 50, 46, 46, 54, 55, 51))
library(dplyr)
df.summary <- df %>%
group_by(site, name) %>%
summarise_each(funs(sum, min, max, mean, sd), weight, height, dtf)
Which results in a data frame like this:
> df.summary
Source: local data frame [4 x 17]
Groups: site
site name weight_length height_length dtf_length weight_min height_min dtf_min
1 Alberta A 3 3 3 43 136 46
2 Alberta B 3 3 3 55 120 46
3 Nunavut A 3 3 3 49 133 50
4 Nunavut B 3 3 3 57 118 51
Variables not shown: weight_max (dbl), height_max (dbl), dtf_max (dbl), weight_mean (dbl),
height_mean (dbl), dtf_mean (dbl), weight_sd (dbl), height_sd (dbl), dtf_sd (dbl)
You can pass any function you want to the funs() inside summarise_each, so if you want a column for standard errors, just make the function first:
se <- function(x) {
N <- sum(!is.na(x[1]))
return(sd / sqrt(N))
}
And pass it through: summarise_each(funs(sum, min, max, mean, sd, se)...)
Try with data.table:
> testdt = data.table(testdf)
> testdt[,list(meanwt=mean(weight),meanht=mean(height) ),by=list(site,name)]
site name meanwt meanht
1: Alberta A 46.00000 138.0000
2: Nunavut A 50.66667 134.6667
3: Alberta B 55.33333 125.6667
4: Nunavut B 58.00000 118.6667
Max, min etc can be added to function list.
Related
I have individuals that belong to different categories, they are located in different
zones, these populations are expected to grow from the population value below
to the demand value.
population_and_demand_by_category_and_zone <- tibble::tribble(
~category, ~zone, ~population, ~demand,
"A", 1, 115, 138,
"A", 2, 121, 145,
"A", 3, 112, 134,
"A", 4, 76, 91,
"B", 1, 70, 99,
"B", 2, 59, 83,
"B", 3, 86, 121,
"B", 4, 139, 196,
"C", 1, 142, 160,
"C", 2, 72, 81,
"C", 3, 29, 33,
"C", 4, 58, 66,
"D", 1, 22, 47,
"D", 2, 23, 49,
"D", 3, 16, 34,
"D", 4, 45, 96
)
Zones have a given capacity, current population is below this threshold, but demand
will exceed capacity in some zones.
demand_and_capacity_by_zone <- tibble::tribble(
~zone, ~demand, ~capacity, ~capacity_exceeded,
1, 444, 465, FALSE,
2, 358, 393, FALSE,
3, 322, 500, FALSE,
4, 449, 331, TRUE
)
So we will need to move those individuals to a new zone (we assume we have
enough total capacity).
Each individual that we need to move incurs a cost, which depends on its
category and destination zone. These costs are given below.
costs <- tibble::tribble(
~category, ~zone, ~cost,
"A", 1, 0.1,
"A", 2, 0.1,
"A", 3, 0.1,
"A", 4, 1.3,
"B", 1, 16.2,
"B", 2, 38.1,
"B", 3, 1.5,
"B", 4, 0.1,
"C", 1, 0.1,
"C", 2, 12.7,
"C", 3, 97.7,
"C", 4, 46.3,
"D", 1, 25.3,
"D", 2, 7.7,
"D", 3, 67.3,
"D", 4, 0.1
)
I wish to find the distribution of individuals across zones and categories so that
the total cost is minimized. So basically have a new column new_population
in the table population_and_demand_by_category_and_zone described above.
If several solutions are possible, any will do, if the result is a non integer
population, that's fine.
The real use case has about 20 categories and 30 zones, so bigger but not all that big.
It seems like a problem that would be common enough so I'm hoping that there is a convenient way to solve this in R.
This can be modeled as a small LP (Linear Programming) model. We introduce non-negative variables move(c,z,z') indicating the number of persons of category c to be moved from zone z to zone z'. The mathematical model can look like:
This can be implemented using any LP solver. A solution can look like:
---- 83 VARIABLE move.L moves needed to meet capacity
zone1 zone2 zone3
catA.zone1 6
catA.zone4 29 62
catC.zone4 27
---- 83 VARIABLE alloc.L new allocation
zone1 zone2 zone3 zone4
catA 132 180 196
catB 99 83 121 196
catC 187 81 33 39
catD 47 49 34 96
---- 83 VARIABLE totcost.L = 12.400 total cost
Notes:
Interestingly the solution shows that we move people out of zone 1 to make room for people from zone 4. So in some cases, making 2 moves to resettle one person is cheaper. Of course, that depends very much on the cost structure.
The main constraint says: allocation = demand + inflow - outflow
The constraint move(c,z,z)=0 makes sure we don't move from z to z itself. This constraint is not really needed (it is implicitly enforced by the cost). I have added it for clarity. Actually, I implemented this by setting the upper bound of move(c,z,z) to zero (i.e. without an explicit constraint). For very large models I would use another possibility: don't even generate the variables move(c,z,z). This model is small, so no need for that. You can leave it out completely if you want.
I don't use population in the model. I don't think it is needed, that is unless we look at the next bullet.
There are some subtleties to think about: can we only move new persons? (i.e. original people should be allowed to stay)
I've taken Erwin's formulation, modified it to consider that alloc should be more than the population for every zone and category, (which means already present individuals don't move), and implemented it using the {lpSolve} package, which doesn't require installing external system libraries.
Erwin's solution can be obtained by using move_new_only <- FALSE below.
SETUP
library(tidyverse)
library(lpSolve)
move_new_only <- TRUE # means population in place can't be reallocated
categories <- unique(population_and_demand_by_category_and_zone$category)
zones <- unique(population_and_demand_by_category_and_zone$zone)
n_cat <- length(categories)
n_zones <- length(zones)
# empty coefficient arrays
move_coefs_template <- array(0, c(n_zones, n_zones, n_cat),
dimnames = list(zones, zones, categories))
alloc_coefs_template <- matrix(0, n_zones, n_cat,
dimnames = list(zones, categories))
build_zone_by_category_matrix <- function(data, col) {
data %>%
pivot_wider(
id_cols = zone, names_from = category, values_from = {{col}}) %>%
as.data.frame() %>%
`row.names<-`(.$zone) %>%
select(-zone) %>%
as.matrix()
}
demand_mat <- build_zone_by_category_matrix(
population_and_demand_by_category_and_zone, demand)
cost_mat <- build_zone_by_category_matrix(costs, cost)
population_mat <- build_zone_by_category_matrix(
population_and_demand_by_category_and_zone, population)
OBJECTIVE FUNCTION : total cost
# stack the cost matrix vertically to build an array of all move coefficients
coefs_obj <- move_coefs_template
for(i in 1:n_zones) {
coefs_obj[i,,] <- cost_mat
}
# flatten it for `lp`s `objective.in` argument, adding alloc coefs
f.obj <- c(coefs_obj, alloc_coefs_template)
CONSTRAINT 1 : allocation = demand + inflow - outflow
coefs_con <- list()
for (z in zones) {
coefs_con_zone <- list()
for(categ in categories) {
coefs_arrivals <- move_coefs_template
coefs_arrivals[,z, categ] <- 1
coefs_departures <- move_coefs_template
coefs_departures[z,, categ] <- 1
coefs_moves <- coefs_arrivals - coefs_departures
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, categ] <- -1
# flatten the array
coefs_con_zone[[categ]] <- c(coefs_moves, coefs_alloc)
}
coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}
# stack the flattened arrays to build a matrix
f.con1 <- do.call(rbind, coefs_con)
f.dir1 <- rep("==", n_zones * n_cat)
f.rhs1 <- -c(t(demand_mat)) # transposing so we start with all zone 1 and so on
CONSTRAINT 2 : Allocation never exceeds capacity
coefs_con <- list()
for (z in zones) {
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, ] <- 1
coefs_con[[z]] <- c(move_coefs_template, coefs_alloc)
}
# stack the flattened arrays to build a matrix
f.con2 <- do.call(rbind, coefs_con)
f.dir2 <- rep("<=", n_zones)
f.rhs2 <- demand_and_capacity_by_zone$capacity
CONSTRAINT 3 : Allocation >= Population
i.e. we don't move people that were already there.
If we decide we can move them the rule becomes Allocation >= 0, and we get Erwin's answer.
coefs_con <- list()
for (z in zones) {
coefs_con_zone <- list()
for(categ in categories) {
coefs_alloc <- alloc_coefs_template
coefs_alloc[z, categ] <- 1
# flatten the array
coefs_con_zone[[categ]] <- c(move_coefs_template, coefs_alloc)
}
coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}
# stack the flattened arrays to build a matrix
f.con3 <- do.call(rbind, coefs_con)
f.dir3 <- rep(">=", n_zones * n_cat)
if (move_new_only) {
f.rhs3 <- c(t(population_mat))
} else {
f.rhs3 <- rep(0, n_zones * n_cat)
}
CONCATENATE OBJECTS
f.con <- rbind(f.con1, f.con2, f.con3)
f.dir <- c(f.dir1, f.dir2, f.dir3)
f.rhs <- c(f.rhs1, f.rhs2, f.rhs3)
SOLVE
# compute the solution and visualize it in the array
results_raw <- lp("min", f.obj, f.con, f.dir, f.rhs)$solution
results_moves <- move_coefs_template
results_moves[] <-
results_raw[1:length(results_moves)]
results_allocs <- alloc_coefs_template
results_allocs[] <-
results_raw[length(results_moves)+(1:length(results_allocs))]
results_moves
#> , , A
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 3 0
#> 3 0 0 0 0
#> 4 13 0 2 0
#>
#> , , B
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 0 57 0
#>
#> , , C
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 8 0 0 0
#>
#> , , D
#>
#> 1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 0 38 0 0
results_allocs
#> A B C D
#> 1 151 99 168 47
#> 2 142 83 81 87
#> 3 139 178 33 34
#> 4 76 139 58 58
TIDY RESULTS
# format as tidy data frame
results_df <-
as.data.frame.table(results_moves) %>%
setNames(c("from", "to", "category", "n")) %>%
filter(n != 0) %>%
mutate_at(c("from", "to"), as.numeric) %>%
mutate_if(is.factor, as.character)
results_df
#> from to category n
#> 1 4 1 A 13
#> 2 2 3 A 3
#> 3 4 3 A 2
#> 4 4 3 B 57
#> 5 4 1 C 8
#> 6 4 2 D 38
UPDATE TABLES
population_and_demand_by_category_and_zone <-
bind_rows(
results_df %>%
group_by(category, zone = to) %>%
summarize(correction = sum(n), .groups = "drop"),
results_df %>%
group_by(category, zone = from) %>%
summarize(correction = -sum(n), .groups = "drop"),
) %>%
left_join(population_and_demand_by_category_and_zone, ., by = c("category", "zone")) %>%
replace_na(list(correction =0)) %>%
mutate(new_population = demand + correction)
population_and_demand_by_category_and_zone
#> # A tibble: 16 × 6
#> category zone population demand correction new_population
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 115 138 13 151
#> 2 A 2 121 145 -3.00 142
#> 3 A 3 112 134 5.00 139
#> 4 A 4 76 91 -15.0 76
#> 5 B 1 70 99 0 99
#> 6 B 2 59 83 0 83
#> 7 B 3 86 121 57 178
#> 8 B 4 139 196 -57 139
#> 9 C 1 142 160 8 168
#> 10 C 2 72 81 0 81
#> 11 C 3 29 33 0 33
#> 12 C 4 58 66 -8 58
#> 13 D 1 22 47 0 47
#> 14 D 2 23 49 38 87
#> 15 D 3 16 34 0 34
#> 16 D 4 45 96 -38 58
demand_and_capacity_by_zone <-
population_and_demand_by_category_and_zone %>%
group_by(zone) %>%
summarise(population = sum(population), correction = sum(correction), new_population = sum(new_population)) %>%
left_join(demand_and_capacity_by_zone, ., by = "zone")
#> `summarise()` ungrouping output (override with `.groups` argument)
demand_and_capacity_by_zone
#> # A tibble: 4 × 7
#> zone demand capacity capacity_exceeded population correction new_population
#> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl>
#> 1 1 444 465 FALSE 349 21 465
#> 2 2 358 393 FALSE 275 35 393
#> 3 3 322 500 FALSE 243 62 384
#> 4 4 449 331 TRUE 318 -118 331
We see that the population never decreases and stays under capacity.
I have a dataset like this.
> dataset
id
a
b
c
d
1
10
1
30
50
2
5
0
5
60
3
20
1
18
90
4
103
0
20
80
5
16
1
56
100
And so on up to 'n' number of columns in relation to the requirement.
My colleagues in the research area carry out certain analyzes where what they give me as an input is a data frame. In which the row names correspond to the variables of interest for a new dataset.
Something like this.
> rownames(Features)
a
b
d
a:d
b:d
b:c
a:c
Where the colon (:) represents the "product of". So in order to continue working I need to include those products in the original dataset.
I have manually created a function that reproduces the multiplications in the following way (where x represents my original dataset):
Products<- function(x){x<- x %>% mutate(Product1=x$a*x$d)
x<- x %>% mutate(Product2=x$b*x$d)
x<- x %>% mutate(Product3=x$b*x$c)
x<- x %>% mutate(Product4=x$a*x$c)
return(x)}
However, given that the number of products to create is variable every time, I want to find a way to automate the creation of these column multiplications with the names that they give me as input. I'm sure my approach isn't the best, so any help is very welcome.
This seems to be some modelling being done and the choice of : to stand for product is quite intriguing as that is often the case in the modeling world. In base R, you could do:
model.matrix(reformulate(c(-1,rownames(Features))), dataset)
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
I added -1 in order to remove the intercept. Otherwise you could maintain it. Also note that this is a matrix, you can then change it to a dataframe.
I am not sure if the following code works for your, where eval + gsub are used to produce the product columns
with(
dataset,
list2DF(
setNames(
lapply(
gsub(":", "*", rownames(Features)),
function(x) eval(str2lang(x))
),
rownames(Features)
)
)
)
which gives
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
Data
> dput(dataset)
structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
> dput(Features)
structure(list(Features = 1:7), class = "data.frame", row.names = c("a",
"b", "d", "a:d", "b:d", "b:c", "a:c"))
We could use strsplit to split the names that have :, select the column in the 'dataset' based on that splitted named, Reduce with * to do elementwise multiplication, and assign those 'Product' columns to the original 'dataset'
nm1 <- grep(':', rownames(Features), value = TRUE)
lst1 <- lapply(strsplit(nm1, ":", fixed = TRUE),
function(x) Reduce(`*`, dataset[x]))
dataset[paste0("Product", seq_along(lst1))] <- lst1
-output
dataset
# id a b c d Product1 Product2 Product3 Product4
#1 1 10 1 30 50 500 50 30 300
#2 2 5 0 5 60 300 0 0 25
#3 3 20 1 18 90 1800 90 18 360
#4 4 103 0 20 80 8240 0 0 2060
#5 5 16 1 56 100 1600 100 56 896
data
dataset <- structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
Features <- structure(1:7, .Dim = c(7L, 1L), .Dimnames = list(c("a", "b",
"d", "a:d", "b:d", "b:c", "a:c"), NULL))
This is similar to many solutions above. One thing about R is you can achieve the same results with so many different ways though the underline principle is still the same.
library(dplyr)
dataset <- tibble(id = c(1, 2, 3, 4, 5),
a = c(10, 5, 20, 103, 16),
b = c(1, 0, 1, 0, 1),
c = c(30, 5, 18, 20, 56),
d = c(50, 60, 90, 80, 100))
features = c("a", "b", "d", "a:d", "b:d", "b:c", "a:c")
final <- bind_cols(
map(features,
function(x) {
dataset %>%
mutate(!!x := eval(parse(text=gsub(":", "*", x)))) %>%
select(!!x)
}
))
Final dataset.
# A tibble: 5 x 9
id a b c d `a:d` `b:d` `b:c` `a:c`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 1 30 50 500 50 30 300
2 2 5 0 5 60 300 0 0 25
3 3 20 1 18 90 1800 90 18 360
4 4 103 0 20 80 8240 0 0 2060
5 5 16 1 56 100 1600 100 56 896
I have data from a social survey. One of the categorical variables is education. It breaks down the population into 6 groups: "1" for those who have primary education, "2" for those who have O level, "3" for those who have complee secondary education. 4 for bachelors' degree. 5 for master's, 6 for PhD.
wage age sex edu area satisf
1 NA 76 2 6 1 <NA>
2 17000 26 2 6 1 1
3 NA 74 1 6 1 <NA>
4 NA 73 2 6 1 <NA>
5 NA 49 2 5 1 1
6 25000 31 1 6 1 2
I have previously run a regression and decided that I don't need categories 1-3. SO I would like to have the following structure:
DUMMY1(edu)={█(1,&edu="1" ,2,3#0,&edu=else)┤
DUMMY2(edu)={█(1,&edu="4" #0,&edu=else)┤
DUMMY3(edu)={█(1,&edu="5" #0,&edu=else)┤
DUMMY4(edu)={█(1,&edu="6" #0,&edu=else)┤
I need to learn how to use mutate and ifelse functions. My current command is the following:
vova5 <- mutate(vova4,bedu=ifelse(vova4$edu<=3,vova4$edu2 <- 1,
vova4$edu2 <- vova4$edu-2))
but it does not seem to work.
Without your data, I would try case_when, something like:
Data sample:
vova4 <- data.frame(
edu = c(1, 2, 3, 4, 4, 5, 5, 6, 6),
age = c(70, 56, 66, 67, 34, 55, 33, 44, 32))
Try this:
library(tidyverse)
vova5 <- vova4 %>%
mutate(Bedu = case_when(edu<=3 ~ 1,
edu==4 ~ 2,
edu==5 ~ 3,
TRUE ~ 4))
vova5
Or:
vova5 <- vova4 %>%
mutate(Bedu = case_when(edu<=3 ~ 1,
edu==4 ~ 2,
edu==5 ~ 3,
edu==6 ~ 4))
You will get:
> vova5
edu age Bedu
1 1 70 1
2 2 56 1
3 3 66 1
4 4 67 2
5 4 34 2
6 5 55 3
7 5 33 3
8 6 44 4
9 6 32 4
In my real data, I have multiple outliers for multiple variables. My data looks something like the example below but the numbers are completely random.
I would like to pull all data points that are greater than or less than 2 SD using a winsorization.
df<-read.table(header=T, text="id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
So far I have identified my outliers for the variables of test1 and test2 for each group using the following code:
outlier <- function(x, SD = 2){
mu <- mean(x)
sigma <- sd(x)
out <- x < mu - SD*sigma | x > mu + SD*sigma
out
}
# identify the outliers for each variable by each group
with(df, ave(test1, group, FUN = outlier))
with(df, ave(test2, group, FUN = outlier))
# add these new-found outliers to the data set
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
I am aware of the 'winsorize' function in the 'robustHD' package but am not sure:
1). how to tailor the command to a 90% winsorization (2 SD), 2). ensuring the winsorization accounts for the 2 different groups, 3). and including multiple variables in that winsorization.
Additionally, but not necessary...is there a way to see what the 'winsorize' function changed the numbers from to what the numbers were changed to?
Make first clear, how you want to winsorize your data. You have several options.
Use the mean+/-2sd limits as extreme values and replace all values outside by those
Use the observed value next to the mean+/-2sd limits
Use the 90% quantile
In option 1 and 3 you will possibly introduce values into your winsorized variable, which were not observed, in option 2 you will only have observed values. Note also, that the (5%, 95%)-quantile will not necessarily be near to 2*sd if you don't have reasonably well behaved normally distributed data.
For the winsorization process you can use DescTools::Winsorize(), which accepts both, probs and values for the limits.
Implementation 1)
x <- rnorm(100)
w1 <- Winsorize(x,
minval = mean(x) - 2*sd(x),
maxval = mean(x) + 2*sd(x))
For 2) you could use something like
w2 <- Winsorize(x,
minval = max(Coalesce(x[x <= mean(x)-2*sd(x)], mean(x)-2*sd(x))),
maxval = min(Coalesce(x[x >= mean(x)+2*sd(x)], mean(x)+2*sd(x))))
Provide some escalating values for cases where there are no values outside the limits. Coalesce() returns the first non empty value, so Winsorize() will always get a valid limit.
Option 3) is the default for the function
w3 <- Winsorize(x, probs=c(0.05, 0.95))
Define a function for the groupwise apply as (here for option 1):
df$w1 <- unsplit(
tapply(df$test1, df$group,
function(x) Winsorize(x,
minval = mean(x) - 2*sd(x),
maxval = mean(x) + 2*sd(x)) )
, f=df$group)
The replaced values can be found with
cbind(x, w1)[x!=w1,]
Here's a start - hopefully someone has a better solution for you.
library(tidyverse)
df <- tibble::tribble(
~id, ~group, ~test1, ~test2,
1, 0, 57, 82,
2, 0, 77, 80,
3, 0, 67, 90,
4, 0, 15, 70,
5, 0, 58, 72,
6, 1, 18, 44,
7, 1, 44, 44,
8, 1, 18, 46,
9, 1, 20, 44,
10, 1, 14, 38
)
df
#> # A tibble: 10 x 4
#> id group test1 test2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82
#> 2 2 0 77 80
#> 3 3 0 67 90
#> 4 4 0 15 70
#> 5 5 0 58 72
#> 6 6 1 18 44
#> 7 7 1 44 44
#> 8 8 1 18 46
#> 9 9 1 20 44
#> 10 10 1 14 38
library(DescTools)
df %>%
group_by(group) %>%
mutate(
test2_winsorized = DescTools::Winsorize(
test2,
maxval = quantile(df$test2, 0.90),
minval = quantile(df$test2, 0.10)
),
test1_winsorized = DescTools::Winsorize(
test1,
maxval = quantile(df$test1, 0.90),
minval = quantile(df$test1, 0.10)
)
)
#> # A tibble: 10 x 6
#> # Groups: group [2]
#> id group test1 test2 test2_winsorized test1_winsorized
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82 82 57
#> 2 2 0 77 80 80 68
#> 3 3 0 67 90 82.8 67
#> 4 4 0 15 70 70 15
#> 5 5 0 58 72 72 58
#> 6 6 1 18 44 44 18
#> 7 7 1 44 44 44 44
#> 8 8 1 18 46 46 18
#> 9 9 1 20 44 44 20
#> 10 10 1 14 38 43.4 14.9
Created on 2019-06-06 by the reprex package (v0.2.1)
I have a beginner R user:
This is my dataset
factor1 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8,8,9, 9, 10, 10)
factor2 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,16,17, 18, 19, 20)
factor3 <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d")
factor4 <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,160,170, 180, 190, NA)
dataset <- data.frame(factor1, factor2, factor3, factor4)
I created a new variable this way:
dataset$newvar <-"NA"
How to do the following:
I want newvar to take the value 1 if factor1>=5 and factor2<19 and (factor3="b" or factor3="c") and factor4 is different from missing and newvar is equal to missing
Ideally I want to specify different conditions, so some observations will be value 1, 2, 3 and 4 in the variable newvar dependent on the values of several other variables.
This is very simple and intuitive in STATA and would like to know if there is a simple and intuitive way to do the same in R.
Generate a new variable based on several conditions for several values.
This bit of the question was not explicitly addressed:
Ideally I want to specify different conditions, so some observations will be value 1, 2, 3 and 4 in the variable newvar dependent on the values of several other variables.
A simple solution would be to use case_when. Similar to Stata's recode it allows you to specify several values simultaneously.
It works the following way:
newvar = case_when(
condition1 ~ target value,
condition2 ~ target value)
e.g. var1 == 1 ~ 0
Important you need a , after each line.
library(dplyr)
dataset <- mutate(dataset,
newvar = case_when(
factor1 >= 5 & factor2<19 & (factor3 =="b" | factor3 =="c") ~ 1,
factor1 == 1 ~ 2,
factor1 == 2 ~ 3,
TRUE ~ NA_real_ # This is for all other values
)) # not covered by the above.
dataset
# factor1 factor2 factor3 factor4 newvar
# 1 1 1 a 10 2
# 2 1 2 a 20 2
# 3 2 3 a 30 3
# 4 2 4 a 40 3
# 5 3 5 a 50 NA
# 6 3 6 b 60 NA
# 7 4 7 b 70 NA
# 8 4 8 b 80 NA
# 9 5 9 b 90 1
# 10 5 10 b 100 1
# 11 6 11 c 110 1
# 12 6 12 c 120 1
# 13 7 13 c 130 1
# 14 7 14 c 140 1
# 15 8 15 c 150 1
# 16 8 16 d 160 NA
# 17 9 17 d 170 NA
# 18 9 18 d 180 NA
# 19 10 19 d 190 NA
# 20 10 20 d NA NA
Note, you can not use NA (missing) as a target value, instead use one of the following
NA_character_
NA_real_
NA_complex_
NA_double_
In base R you can just do (promoting my comment to an answer):
dataset$newvar <- NA
dataset[dataset$factor1 >= 5 & dataset$factor2 < 19 & (dataset$factor3=="b" | dataset$factor3 =="c"), "newvar"] <- 1
or:
dataset$newvar <- NA
indx <- dataset$factor1 >= 5 & dataset$factor2 < 19 & (dataset$factor3=="b" | dataset$factor3 =="c") & !is.na(dataset$factor4)
dataset[indx, "newvar"] <- 1
Using dplyr
library(dplyr)
dataset %>%
mutate(newvar = ifelse(factor1 > 5 &
factor2 < 19 &
(factor3=="b" | factor3=="c") &
!is.na(factor4), 1, NA))