Discretizing a continuous variable while keeping out zeros - r

I want to discretize a column that contains a continuous variable.
The data looks like this:
c(0,25,77,423,6,8,3,65,32,22,10,0,8,0,15,0,10,1,2,4,5,5,6)
I want to turn the numbers into categories by discretizing, but the zeros represent a separate category of their own. If I discretize directly, other numbers can end up in the same bin as zero.
I thought that if I kept the zeros out and then discretized, I would get what I want, but I can't do that inside a data frame column because the indexes no longer match.
Here is an example dput() output:
structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65,
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA,
-23L))
For example, if I'd like to use 2 break points, the categories should be zero plus the other 3 discretized ones, 4 categories in total. Ideally I could write a function that discretizes a column and can be used directly inside dplyr::mutate().
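For illustration, a direct discretization (here with base R's cut(), just to show the issue; the bin count is arbitrary) puts the zeros into the lowest interval together with other small values:
x <- c(0, 25, 77, 423, 6, 8, 3, 65, 32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)
# The zeros end up in the lowest bin along with many non-zero values,
# so they do not form a category of their own.
table(cut(x, breaks = 3))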
Thanks in advance.

If I understood it correctly, your goal is to keep "0" as a separate category when discretizing. Here's a solution using arules::discretize to make a new function that can accomplish this:
library(arules)
#> Loading required package: Matrix
#>
#> Attaching package: 'arules'
#> The following objects are masked from 'package:base':
#>
#> abbreviate, write
library(tidyverse)
df <- structure(list(dummy_column = c(0, 25, 77, 423, 6, 8, 3, 65,
32, 22, 10, 0, 8, 0, 15, 0, 10, 1, 2, 4, 5, 5, 6)), class = "data.frame", row.names = c(NA,
-23L))
discretize_keep <- function(vec, keep, ...) {
  # Temporarily turn the value to keep into NA so it is ignored by the binning
  vec2 <- vec
  vec2[vec2 == keep] <- NA
  dsc <- arules::discretize(vec2, ...)
  # Bring the NAs back as an explicit factor level, e.g. "[0]"
  fct_explicit_na(dsc, na_level = str_glue("[{keep}]"))
}
df %>%
  mutate(discrete_column = discretize_keep(dummy_column, keep = 0, breaks = 3))
#> dummy_column discrete_column
#> 1 0 [0]
#> 2 25 [15,423]
#> 3 77 [15,423]
#> 4 423 [15,423]
#> 5 6 [6,15)
#> 6 8 [6,15)
#> 7 3 [1,6)
#> 8 65 [15,423]
#> 9 32 [15,423]
#> 10 22 [15,423]
#> 11 10 [6,15)
#> 12 0 [0]
#> 13 8 [6,15)
#> 14 0 [0]
#> 15 15 [15,423]
#> 16 0 [0]
#> 17 10 [6,15)
#> 18 1 [1,6)
#> 19 2 [1,6)
#> 20 4 [1,6)
#> 21 5 [1,6)
#> 22 5 [1,6)
#> 23 6 [6,15)
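To confirm that zero really is kept as its own category (4 levels in total, as requested), you can inspect the factor levels of the new column:
levels(discretize_keep(df$dummy_column, keep = 0, breaks = 3))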

If you have specific break points such as c(20, 50) like below, you can use cut to discretize dummy_column, e.g.,
breaks <- c(20, 50)
df %>%
  mutate(discrete = cut(dummy_column, c(-1, 0, breaks, max(dummy_column))))
which gives
dummy_column discrete
1 0 (-1,0]
2 25 (20,50]
3 77 (50,423]
4 423 (50,423]
5 6 (0,20]
6 8 (0,20]
7 3 (0,20]
8 65 (50,423]
9 32 (20,50]
10 22 (20,50]
11 10 (0,20]
12 0 (-1,0]
13 8 (0,20]
14 0 (-1,0]
15 15 (0,20]
16 0 (-1,0]
17 10 (0,20]
18 1 (0,20]
19 2 (0,20]
20 4 (0,20]
21 5 (0,20]
22 5 (0,20]
23 6 (0,20]
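If you would rather see a readable name than (-1,0] for the zero bin, cut() also accepts a labels argument; the label names below are only illustrative:
breaks <- c(20, 50)
df %>%
  mutate(discrete = cut(dummy_column,
                        breaks = c(-1, 0, breaks, max(dummy_column)),
                        labels = c("zero", "low", "mid", "high")))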

Related

R dataframe with special cumsum

I have a data frame like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp) that stays within maxCumSum and minCumSum. The resulting table dt2 should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water storage that has a certain maximum capacity and whose level cannot sink below zero.
The normal cumsum obviously does not do the trick, since it has no upper or lower limit. Does anyone have a suggestion how to achieve this? In the real data frame there are of course more than 2 groups and far more than 5 time steps.
Many thanks!
What you can do is create a function that adds each value to the running total while clamping the result between the minimum and maximum, and apply it with Reduce(), like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)  # clamp the running total
df %>%
  group_by(grp) %>%
  mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)
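The work is done by Reduce() with accumulate = TRUE: it applies f to a running total and returns every intermediate result, and the leading initial value 0 is dropped with [-1]. A standalone sketch with the group "a" values:
f <- function(x, y) max(min(x + y, 8), 0)   # maxCumSum = 8, minCumSum = 0
Reduce(f, c(-1, 5, 9, -15, 6), 0, accumulate = TRUE)
# returns 0 0 5 8 0 6; dropping the first element gives the CumSum column above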

Create a column which tells the source of a number between the first two columns

I have a data frame which has three columns:
df <- structure(list(lowage = c(45, 15, 9, 51, 22, 45, 4, 4, 9, 25),
highage = c(50, 21, 14, 60, 24, 50, 8, 8, 14, 30)), .Names = c("lowage",
"highage"), row.names = c(NA, 10L), class = "data.frame")
df$random_number <- apply(df, 1, function(x) sample(seq(x[1], x[2]), 1))
I want to create a fourth column that tells us where the random_number comes from. For example, in the first row lowage = 45 and highage = 50; say the random number generated is 50. Then the fourth column should contain the label 'highage', since the number comes from the highage column. And so on...
If the solution can be in dplyr, that would be great!
Is this what you want?
df %>%
  mutate(newcol = case_when(random_number == lowage ~ "lowage",
                            random_number == highage ~ "highage",
                            TRUE ~ "between"))
# lowage highage random_number newcol
# 1 45 50 47 between
# 2 15 21 18 between
# 3 9 14 13 between
# 4 51 60 57 between
# 5 22 24 23 between
# 6 45 50 49 between
# 7 4 8 4 lowage
# 8 4 8 6 between
# 9 9 14 9 lowage
# 10 25 30 27 between

transform 3 categorical variables into 1 dummy with mutate and ifelse functions

I have data from a social survey. One of the categorical variables is education. It breaks down the population into 6 groups: "1" for those who have primary education, "2" for O level, "3" for complete secondary education, "4" for a bachelor's degree, "5" for a master's, and "6" for a PhD.
wage age sex edu area satisf
1 NA 76 2 6 1 <NA>
2 17000 26 2 6 1 1
3 NA 74 1 6 1 <NA>
4 NA 73 2 6 1 <NA>
5 NA 49 2 5 1 1
6 25000 31 1 6 1 2
I have previously run a regression and decided that I don't need to distinguish categories 1-3, so I would like to have the following structure:
DUMMY1 = 1 if edu is 1, 2, or 3; 0 otherwise
DUMMY2 = 1 if edu is 4; 0 otherwise
DUMMY3 = 1 if edu is 5; 0 otherwise
DUMMY4 = 1 if edu is 6; 0 otherwise
I need to learn how to use mutate and ifelse functions. My current command is the following:
vova5 <- mutate(vova4, bedu = ifelse(vova4$edu <= 3, vova4$edu2 <- 1,
                                     vova4$edu2 <- vova4$edu - 2))
but it does not seem to work.
Without your data, I would try case_when, something like:
Data sample:
vova4 <- data.frame(
  edu = c(1, 2, 3, 4, 4, 5, 5, 6, 6),
  age = c(70, 56, 66, 67, 34, 55, 33, 44, 32))
Try this:
library(tidyverse)
vova5 <- vova4 %>%
  mutate(Bedu = case_when(edu <= 3 ~ 1,
                          edu == 4 ~ 2,
                          edu == 5 ~ 3,
                          TRUE ~ 4))
vova5
Or:
vova5 <- vova4 %>%
  mutate(Bedu = case_when(edu <= 3 ~ 1,
                          edu == 4 ~ 2,
                          edu == 5 ~ 3,
                          edu == 6 ~ 4))
You will get:
> vova5
edu age Bedu
1 1 70 1
2 2 56 1
3 3 66 1
4 4 67 2
5 4 34 2
6 5 55 3
7 5 33 3
8 6 44 4
9 6 32 4
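If you actually need four separate 0/1 indicator columns matching the DUMMY1-DUMMY4 definitions above, here is a minimal sketch with mutate() and ifelse() (the column names are just illustrative):
library(dplyr)
vova5 <- vova4 %>%
  mutate(DUMMY1 = ifelse(edu <= 3, 1, 0),  # primary, O level, or complete secondary
         DUMMY2 = ifelse(edu == 4, 1, 0),  # bachelor's
         DUMMY3 = ifelse(edu == 5, 1, 0),  # master's
         DUMMY4 = ifelse(edu == 6, 1, 0))  # PhD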

90% Winsorization on multiple variables

In my real data, I have multiple outliers for multiple variables. My data looks something like the example below but the numbers are completely random.
I would like to pull in all data points that are more than 2 SD above or below the mean, using winsorization.
df <- read.table(header = TRUE, text = "id group test1 test2
1 0 57 82
2 0 77 80
3 0 67 90
4 0 15 70
5 0 58 72
6 1 18 44
7 1 44 44
8 1 18 46
9 1 20 44
10 1 14 38")
So far I have identified my outliers for the variables test1 and test2 for each group using the following code:
outlier <- function(x, SD = 2){
  mu <- mean(x)
  sigma <- sd(x)
  out <- x < mu - SD*sigma | x > mu + SD*sigma
  out
}
# identify the outliers for each variable by each group
with(df, ave(test1, group, FUN = outlier))
with(df, ave(test2, group, FUN = outlier))
# add these new-found outliers to the data set
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
I am aware of the 'winsorize' function in the 'robustHD' package but am not sure:
1) how to tailor the command to a 90% winsorization (2 SD),
2) how to ensure the winsorization accounts for the 2 different groups, and
3) how to include multiple variables in that winsorization.
Additionally, but not necessarily: is there a way to see which numbers the 'winsorize' function changed, and what they were changed to?
First, make clear how you want to winsorize your data. You have several options:
1) Use the mean +/- 2*sd limits as extreme values and replace all values outside them with those limits.
2) Use the observed values closest to the mean +/- 2*sd limits.
3) Use the 90% quantile range (the 5% and 95% quantiles).
With options 1 and 3 you will possibly introduce values into your winsorized variable that were not observed; with option 2 you will only have observed values. Note also that the (5%, 95%) quantiles will not necessarily lie near mean +/- 2*sd unless your data are reasonably well-behaved and normally distributed.
For the winsorization itself you can use DescTools::Winsorize(), which accepts both probabilities (probs) and explicit values (minval/maxval) for the limits.
Implementation 1)
x <- rnorm(100)
w1 <- Winsorize(x,
                minval = mean(x) - 2*sd(x),
                maxval = mean(x) + 2*sd(x))
For 2) you could use something like
w2 <- Winsorize(x,
                minval = max(Coalesce(x[x <= mean(x) - 2*sd(x)], mean(x) - 2*sd(x))),
                maxval = min(Coalesce(x[x >= mean(x) + 2*sd(x)], mean(x) + 2*sd(x))))
Coalesce() provides a fallback for cases where no values lie outside the limits: it returns the first non-empty argument, so Winsorize() always gets a valid limit.
Option 3) is the default for the function:
w3 <- Winsorize(x, probs=c(0.05, 0.95))
Define the groupwise application as follows (here for option 1):
df$w1 <- unsplit(
  tapply(df$test1, df$group,
         function(x) Winsorize(x,
                               minval = mean(x) - 2*sd(x),
                               maxval = mean(x) + 2*sd(x))),
  f = df$group)
The replaced values can be found with
cbind(x, w1)[x!=w1,]
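If you prefer dplyr for the groupwise application, here is a sketch of option 1 for one variable (assuming DescTools and dplyr are loaded; the new column name is illustrative):
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(test1_w = Winsorize(test1,
                             minval = mean(test1) - 2*sd(test1),
                             maxval = mean(test1) + 2*sd(test1)))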
Here's a start - hopefully someone has a better solution for you.
library(tidyverse)
df <- tibble::tribble(
~id, ~group, ~test1, ~test2,
1, 0, 57, 82,
2, 0, 77, 80,
3, 0, 67, 90,
4, 0, 15, 70,
5, 0, 58, 72,
6, 1, 18, 44,
7, 1, 44, 44,
8, 1, 18, 46,
9, 1, 20, 44,
10, 1, 14, 38
)
df
#> # A tibble: 10 x 4
#> id group test1 test2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82
#> 2 2 0 77 80
#> 3 3 0 67 90
#> 4 4 0 15 70
#> 5 5 0 58 72
#> 6 6 1 18 44
#> 7 7 1 44 44
#> 8 8 1 18 46
#> 9 9 1 20 44
#> 10 10 1 14 38
library(DescTools)
df %>%
group_by(group) %>%
mutate(
test2_winsorized = DescTools::Winsorize(
test2,
maxval = quantile(df$test2, 0.90),
minval = quantile(df$test2, 0.10)
),
test1_winsorized = DescTools::Winsorize(
test1,
maxval = quantile(df$test1, 0.90),
minval = quantile(df$test1, 0.10)
)
)
#> # A tibble: 10 x 6
#> # Groups: group [2]
#> id group test1 test2 test2_winsorized test1_winsorized
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 57 82 82 57
#> 2 2 0 77 80 80 68
#> 3 3 0 67 90 82.8 67
#> 4 4 0 15 70 70 15
#> 5 5 0 58 72 72 58
#> 6 6 1 18 44 44 18
#> 7 7 1 44 44 44 44
#> 8 8 1 18 46 46 18
#> 9 9 1 20 44 44 20
#> 10 10 1 14 38 43.4 14.9
Created on 2019-06-06 by the reprex package (v0.2.1)
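One caveat: quantile(df$test2, ...) inside the grouped mutate uses the whole column, so the limits are not computed per group. To get group-specific limits, refer to the grouped columns directly; a hedged sketch (across() requires dplyr >= 1.0):
df %>%
  group_by(group) %>%
  mutate(across(c(test1, test2),
                ~ DescTools::Winsorize(.x,
                                       minval = quantile(.x, 0.10),
                                       maxval = quantile(.x, 0.90)),
                .names = "{.col}_winsorized"))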

Inquiring about a better way to write code in R

I am new to R, and I'd like help in finding a better way to write the following code I've written. Any help would be appreciated.
df$rank[between(df$score,0,1.2)] <- 1
df$rank[between(df$score,1.2,2.1)] <- 2
df$rank[between(df$score,2.1,2.9)] <- 3
df$rank[between(df$score,2.9,3.7)] <- 4
df$rank[between(df$score,3.7,4.5)] <- 5
df$rank[between(df$score,4.5,5.4)] <- 6
You can use cut; the third argument here is labels = FALSE, so it returns the integer bin index rather than interval labels:
df$rank <- cut(x = df$score, breaks = c(0, 1.2, 2.1, 2.9, 3.7, 4.5, 5.4, Inf), labels = FALSE)
library(dplyr)
set.seed(1234)
df <- data.frame(rank = rep(0, 15),
                 score = runif(15, 0, 6))
df
#> rank score
#> 1 0 0.68222047
#> 2 0 3.73379643
#> 3 0 3.65564840
#> 4 0 3.74027665
#> 5 0 5.16549230
#> 6 0 3.84186363
#> 7 0 0.05697454
#> 8 0 1.39530304
#> 9 0 3.99650255
#> 10 0 3.08550685
#> 11 0 4.16154775
#> 12 0 3.26984901
#> 13 0 1.69640150
#> 14 0 5.54060091
#> 15 0 1.75389504
df %>%
  mutate(rank = case_when(between(score, 0, 1.2) ~ 1,
                          between(score, 1.2, 2.1) ~ 2,
                          between(score, 2.1, 2.9) ~ 3,
                          between(score, 2.9, 3.7) ~ 4,
                          between(score, 3.7, 4.5) ~ 5,
                          between(score, 4.5, 5.4) ~ 6))
#> rank score
#> 1 1 0.68222047
#> 2 5 3.73379643
#> 3 4 3.65564840
#> 4 5 3.74027665
#> 5 6 5.16549230
#> 6 5 3.84186363
#> 7 1 0.05697454
#> 8 2 1.39530304
#> 9 5 3.99650255
#> 10 4 3.08550685
#> 11 5 4.16154775
#> 12 4 3.26984901
#> 13 2 1.69640150
#> 14 NA 5.54060091
#> 15 2 1.75389504
Created on 2018-04-29 by the reprex package (v0.2.0).
As you didn't add a reproducible example, I created a little one (but keep in mind you should always add an example).
Using ifelse from base R you could do it this way:
library(data.table)
df = data.table(rank = c(1.2, 3.3, 2.5, 3.7, 5.8, 6, 3, 1.1, 0.5))
df$rank2 = ifelse(df$rank > 0 & df$rank <= 1.2, 1,
           ifelse(df$rank > 1.2 & df$rank <= 2.1, 2,
           ifelse(df$rank > 2.1 & df$rank <= 2.9, 3,
           ifelse(df$rank > 2.9 & df$rank <= 3.7, 4,
           ifelse(df$rank > 3.7 & df$rank <= 4.5, 5, 6)))))
The last ifelse handles your maximum rank value, so its "no" argument covers the last range.
If this is a recurring problem you should create a function (see the sketch below).
Hope it helps.
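A minimal sketch of such a helper (the function name and break points are just illustrative), reusing cut() from the first answer:
score_to_rank <- function(score,
                          breaks = c(0, 1.2, 2.1, 2.9, 3.7, 4.5, 5.4, Inf)) {
  # labels = FALSE returns the integer bin index instead of interval labels
  cut(score, breaks = breaks, labels = FALSE)
}
df$rank <- score_to_rank(df$score)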
