How to transform a factor into a numeric binary variable in R?

I have a column with different types of sites (a factor):
Localisation
A
A
B
A
B
B
I would like to create a new column with binary values (numeric) that correspond to the Localisation column:
A = 1 and B = 0
Localisation Binom
A 1
A 1
B 0
A 1
B 0
B 0
Thanks!

A dplyr approach, handy when there are more than two if/else conditions:
library(dplyr)

df <- read.table(stringsAsFactors = TRUE, header = TRUE, text = "Localisation
A
A
B
A
B
B")

df %>%
  mutate(Binom = case_when(Localisation == "A" ~ 1,  #condition1
                           Localisation == "B" ~ 0)  #condition2
  )
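Note that case_when() returns NA for rows that match none of the conditions; if you prefer an explicit catch-all, here is a minimal sketch (assuming only the A/B coding above matters):
library(dplyr)

df %>%
  mutate(Binom = case_when(
    Localisation == "A" ~ 1,
    Localisation == "B" ~ 0,
    TRUE ~ NA_real_   # explicit fallback for any other value
  ))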

Here are several alternatives.
library(dplyr)
library(microbenchmark)
mb <- microbenchmark(
  NelsonGon = ifelse(df$Localisation %in% "A", 1, 0),
  Edward = ifelse(df$Localisation == "A", 1, 0),
  Edward2 = +(df$Localisation == "A"),
  Rui = 2L - as.integer(df$Localisation),
  massisenergy = df %>% mutate(Binom = case_when(Localisation == "A" ~ 1,  #condition1
                                                 Localisation == "B" ~ 0)  #condition2
  )
)
print(mb, unit = "relative", order = "median")
#Unit: relative
# expr min lq mean median uq max neval cld
# Rui 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
# NelsonGon 4.107345 3.041659 2.490878 2.679642 2.341985 0.4714148 100 ab
# Edward2 4.358608 3.339862 2.834451 3.032853 2.741840 1.0814972 100 ab
# Edward 7.631876 5.320371 4.330419 4.575165 3.967027 1.4852678 100 b
# massisenergy 247.792745 161.000287 117.762537 131.729545 96.032138 22.5566734 100 c
Data.
Localisation <- scan(what = character(), text = '
A
A
B
A
B
B')
df <- data.frame(Localisation)

Another option is
+(df$Localisation != "B")
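The unary + only coerces the logical comparison to integer; an equivalent, more explicit spelling (a small base R sketch) would be:
# as.integer() makes the logical-to-0/1 coercion explicit
df$Binom <- as.integer(df$Localisation != "B")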

How to change a variable at the nth occurrence of a value in another variable?

I have a data.table:
library(data.table)
car <- data.table(no = 1:100, turn = sample(1:5,100,replace = TRUE),
dis = sample(1:10,100,replace = TRUE))
I want to change "dis" to -1 at the nth occurrence of turn == 3, say the third time that "turn" is 3.
I can select the third row of turn == 3:
car[turn == 3, .SD[3]]
However, I don't manage to update "dis" at this row:
car[turn == 3, .SD[3]][, dis := -1]
A related Q&A: Conditionally replacing column values with data.table.
Some alternatives. Use rowid or cumsum to create a counter of rows within groups. Add the counter to your condition in i.
I use a slightly smaller toy data set, just to make it easier to track the changes:
d <- data.table(x = 1:3, y = 1:12)
d[rowid(x) == 3 & x == 3, y := -1]
# @mt1022
d[cumsum(x == 3) == 3 & (x == 3), y := -1]
# @docendo discimus
d[(ix <- x == 3) & cumsum(ix) == 3, y := -1]
Although the OP didn't mention speed as an issue, I was still curious to time the different approaches on a larger vector. Unsurprisingly, @Frank's method is the fastest, especially so when the number of unique values to search among increases:
frank << docendo < henrik < mt1022
microbenchmark(henrik = d[rowid(x) == 3 & x == 3, y := -1],
               mt1022 = d[cumsum(x == 3) == 3 & (x == 3), y := -1],
               docendo = d[(ix <- x == 3) & cumsum(ix) == 3, y := -1],
               frank = d[d[x == 3, which = TRUE][3], y := -1], unit = "relative")
d <- data.table(x = sample(1:3, 1e6, replace = TRUE), y = 1:1e6)
# Unit: relative
# expr min lq mean median uq max neval cld
# henrik 4.417303 4.369407 4.133514 4.319839 4.329658 1.260394 100 b
# mt1022 5.461961 5.285562 5.174559 5.186404 5.239738 1.608712 100 c
# docendo 3.572646 3.624369 3.788678 3.589705 3.576637 1.733272 100 b
# frank 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
d <- data.table(x = sample(1:30, 1e6, replace = TRUE), y = 1:1e6)
# Unit: relative
# expr min lq mean median uq max neval cld
# henrik 22.64881 19.54375 18.81963 18.91335 19.78559 5.507692 100 bc
# mt1022 24.58258 21.17535 19.84417 20.96256 22.76020 3.625263 100 c
# docendo 19.40044 16.75912 16.23321 16.47953 18.06264 4.234100 100 b
# frank 1.00000 1.00000 1.00000 1.00000 1.00000 1.000000 100 a
d <- data.table(x = sample(1:300, 1e6, replace = TRUE), y = 1:1e6)
# Unit: relative
# expr min lq mean median uq max neval cld
# henrik 31.81237 32.51122 28.79490 30.35766 28.63560 8.236282 100 b
# mt1022 34.71984 35.45341 33.20405 33.57394 31.50914 21.556367 100 c
# docendo 27.99046 28.15855 26.56954 26.60644 25.20044 7.847163 100 b
# frank 1.00000 1.00000 1.00000 1.00000 1.00000 1.000000 100 a
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# henrik 60.655582 76.455531 83.061266 77.632036 78.57818 203.224042 100 c
# mt1022 66.701182 84.133034 87.967300 84.937201 85.72464 201.167914 100 c
# docendo 52.938545 67.214360 71.558130 68.003891 68.51897 184.178346 100 b
# frank 1.977821 2.494039 2.629852 2.663577 2.76089 3.613905 100 a
Here's another way:
car[car[turn == 3, which = TRUE][3], dis := -1 ]
Comment. For an operation like this, you might want to turn verbose on, so you can see whether any change was made. For example, if we look for the 111th occurrence of 3:
car[car[turn == 3, which = TRUE][111], dis := -1, verbose = TRUE ]
# Detected that j uses these columns: dis
# Assigning to 0 row subset of 100 rows
It shows that 0 rows were edited.
See also the Q&A "Using .I to return row numbers with data.table package" for more on the use of which = TRUE.
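If this kind of update comes up repeatedly, the idiom can be wrapped in a small helper; set_nth() below is a hypothetical name, just to sketch the pattern:
library(data.table)

# hypothetical helper: set column `col` to `value` at the nth TRUE of `cond`
set_nth <- function(dt, cond, n, col, value) {
  i <- which(cond)[n]                       # row number of the nth match (NA if fewer than n)
  if (!is.na(i)) set(dt, i = i, j = col, value = value)
  invisible(dt)
}

set_nth(car, car$turn == 3, 3, "dis", -1)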

Fill all entries between two specified values

I have a long vector, thousands of entries, which has elements 0, 1, 2 in it sporadically. 0 means "no signal", 1 means "signal on", and 2 means "signal off". I am trying to find the runs from 1 to the next occurrence of 2 and fill the space with 1s. I also need to do the same thing between a 2 and the next occurrence of 1 but fill the space with 0s.
I currently have a solution for this issue using loops but it's slow and incredibly inefficient:
example vector:
exp = c(1,1,1,0,0,1,2,0,2,0,1,0,2)
desired result:
1,1,1,1,1,1,2,0,0,0,1,1,2
Thank you
You could use rle & shift from the data.table-package in the following way:
library(data.table)
# the question's example vector (called exp there)
x <- c(1,1,1,0,0,1,2,0,2,0,1,0,2)

# create the run-length object
rl <- rle(x)
# create indexes of the spots in the run-length object that need to be replaced
idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
# replace these values
rl$values[idx1] <- 1
rl$values[idx0] <- 0
Now you will get the desired result by using inverse.rle:
> inverse.rle(rl)
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2
As an alternative for the shift-function, you could also use the lag and lead functions from dplyr.
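For example, the same two indexes could be built like this (a sketch, assuming dplyr is attached and rl is the rle object from above):
library(dplyr)

idx1 <- rl$values == 0 & lag(rl$values, default = 0) == 1 &
  lead(rl$values, default = 0) %in% 1:2
idx0 <- rl$values == 2 & lag(rl$values, default = 0) == 0 &
  lead(rl$values, default = 2) %in% 0:1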
If you want to assess the speed of both approaches, the microbenchmark-package is a useful tool. Below you'll find 3 benchmarks, each for a different vector size:
# create functions for both approaches
jaap <- function(x) {
  rl <- rle(x)
  idx1 <- rl$values == 0 & shift(rl$values, fill = 0) == 1 & shift(rl$values, fill = 0, type = 'lead') %in% 1:2
  idx0 <- rl$values == 2 & shift(rl$values, fill = 0) == 0 & shift(rl$values, fill = 2, type = 'lead') %in% 0:1
  rl$values[idx1] <- 1
  rl$values[idx0] <- 0
  inverse.rle(rl)
}

john <- function(x) {
  # relies on the transition function f defined in the Reduce answer below
  Reduce(f, x, 0, accumulate = TRUE)[-1]
}
Execute the benchmarks (using the microbenchmark package):
# benchmark on the original data
> microbenchmark(jaap(x), john(x), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x) 58.766 61.2355 67.99861 63.8755 72.147 143.841 100 b
john(x) 13.684 14.3175 18.71585 15.7580 23.902 50.705 100 a
# benchmark on a somewhat larger vector
> x2 <- rep(x, 10)
> microbenchmark(jaap(x2), john(x2), times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
jaap(x2) 69.778 72.802 84.46945 76.9675 87.3015 184.666 100 a
john(x2) 116.858 121.058 127.64275 126.1615 130.4515 223.303 100 b
# benchmark on a much larger vector
> x3 <- rep(x, 1e6)
> microbenchmark(jaap(x3), john(x3), times = 20)
Unit: seconds
expr min lq mean median uq max neval cld
jaap(x3) 1.30326 1.337878 1.389187 1.391279 1.425186 1.556887 20 a
john(x3) 10.51349 10.616632 10.689535 10.670808 10.761191 10.918953 20 b
From this you can conclude that the rle-approach has an advantage when applied to vectors that are larger than 100 elements (which is probably nearly always).
You could also use Reduce with the following function:
f <- function(x, y) {
  if (x == 1) {
    if (y == 2) 2 else 1
  } else {
    if (y == 1) 1 else 0
  }
}
Then:
> x <- c(1,1,1,0,0,1,2,0,2,0,1,0,2)
> Reduce(f, x, 0, accumulate = TRUE)[-1]
[1] 1 1 1 1 1 1 2 0 0 0 1 1 2

mutate() with an if/else function

I have an example data frame:
df <- data.frame(cust = sample(1:100, 1000, TRUE),
channel = sample(c("WEB", "POS"), 1000, TRUE))
that I'm trying to mutate with this function:
get_channels <- function(data) {
  d <- data
  if (unique(d) %>% length() == 2) {
    d <- "Both"
  } else {
    if (unique(d) %>% length() < 2 && unique(d) == "WEB") {
      d <- "Web"
    } else {
      d <- "POS"
    }
  }
  return(d)
}
This works without issue, and on small data frames it takes no time at all.
start.time <- Sys.time()
df %>%
group_by(cust) %>%
mutate(chan = get_channels(channel)) %>%
group_by(cust) %>%
slice(1) %>%
group_by(chan) %>%
summarize(count = n()) %>%
mutate(perc = count/sum(count))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.34602 secs
However, when the data frame gets rather large, say on the order of 1,000,000 or more customers, my basic if/else function takes much, much longer.
How can I streamline this function to make it run more quickly?
You should use a data.table for this.
library(data.table)
setDT(df)
t1 = Sys.time()
df = df[ , .(channels = ifelse(uniqueN(channel) == 2, "both", as.character(channel[1]))), by = .(cust)]
> Sys.time() - t1
Time difference of 0.00500083 secs
> head(df)
cust channels
1: 37 both
2: 45 both
3: 74 both
4: 20 both
5: 1 both
6: 68 both
You can do it in base R using something like this:
web_cust <- unique(df$cust[df$channel=="WEB"])
pos_cust <- unique(df$cust[df$channel=="POS"])
both <- length(intersect(web_cust, pos_cust))
web_only <- length(setdiff(web_cust, pos_cust))
pos_only <- length(setdiff(pos_cust, web_cust))
Data:
set.seed(1)
df <- data.frame(cust = sample(2e6, 1e7, TRUE),
channel = sample(c("WEB", "POS"), 1e7, TRUE),
stringsAsFactors = F)
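To also get the shares of customers in each group (the perc column the dplyr version reports), the three counts can be combined; a short follow-up sketch using the objects above:
counts <- c(Both = both, Web = web_only, POS = pos_only)
counts / sum(counts)   # proportion of customers per channel group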
A faster dplyr version that takes about 1/3 the time, but is probably still slower than the data.table version. uniqueN (from data.table) is borrowed from @Kristofersen's answer.
df %>%
group_by(cust) %>%
summarize(chan = if_else(uniqueN(channel) == 2, "Both", as.character(channel[1]))) %>%
group_by(chan) %>%
summarize(n = n() ) %>%
mutate(perc = n /sum(n))
Also, your original can be improved significantly by optimizing your function like this:
get_channels <- function(data) {
  ud <- unique(data)
  udl <- length(ud)
  if (udl == 2) {
    r <- "Both"
  } else {
    if (udl < 2 && ud == "WEB") {
      r <- "Web"
    } else {
      r <- "POS"
    }
  }
  return(r)
}
And some timings...
I tried three different alternatives in both dplyr and data.table: (1) ifelse (see @Kristofersen's answer), (2) if / else (because the test is of length 1), and (3) vector indexing. Unsurprisingly, the main difference is between dplyr and data.table and not among alternatives 1-3.
For 1000 customers, data.table is about 7 times faster. For 10000 customers it's about 30 times faster. For 1e6 customers I only tested data.table; there is not a very large difference between the alternatives.
# 1000 customers, 2*1000 registrations
df <- data.frame(cust = sample(1e3, 2e3, replace = TRUE),
                 channel = sample(c("WEB", "POS"), 2e3, TRUE))

library(microbenchmark)
library(dplyr)
library(data.table)

microbenchmark(dp1 = df %>%
                 group_by(cust) %>%
                 summarise(res = ifelse(n_distinct(channel) == 1, channel[1], "both")),
               dp2 = df %>%
                 group_by(cust) %>%
                 summarise(res = if(n_distinct(channel) == 1) channel[1] else "both"),
               dp3 = df %>%
                 group_by(cust) %>%
                 summarise(res = c("both", channel[1])[(n_distinct(channel) == 1) + 1]),
               dt1 = setDT(df)[ , .(channels = ifelse(uniqueN(channel) == 2, "both", channel[1])), by = cust],
               dt2 = setDT(df)[ , .(channels = if(uniqueN(channel) == 2) "both" else channel[1]), by = cust],
               dt3 = setDT(df)[ , .(res = c("both", channel[1])[(uniqueN(channel) == 1) + 1]), by = cust],
               times = 5, unit = "relative")
# 1e3 customers
# Unit: relative
# expr min lq mean median uq max neval
# dp1 7.8985477 8.176139 7.9355234 7.676534 8.0359975 7.9166933 5
# dp2 7.8882707 8.018000 7.8965098 8.731935 7.8414478 7.3560530 5
# dp3 8.0851402 8.934831 7.7540060 7.653026 6.8305012 7.6887950 5
# dt1 1.1713088 1.180870 1.0350482 1.209861 1.0523597 0.7650059 5
# dt2 0.8272681 1.223387 0.9311628 1.047773 0.9028017 0.7795579 5
# dt3 1.0000000 1.000000 1.0000000 1.000000 1.0000000 1.0000000 5
# 1e4 customers
# Unit: relative
# expr min lq mean median uq max neval
# dp1 40.8725204 39.5297108 29.5755838 38.996075 38.246103 17.2784642 5
# dp2 40.7396141 39.4299918 27.4476811 38.819577 37.886320 12.7265756 5
# dp3 41.0940358 39.7819673 27.5532964 39.260488 38.317899 12.4685386 5
# dt1 1.0905470 1.0661613 0.7422082 1.053786 1.034642 0.3428945 5
# dt2 0.9052739 0.9008761 1.2813458 2.111642 2.356008 0.9005391 5
# dt3 1.0000000 1.0000000 1.0000000 1.000000 1.000000 1.0000000 5
# 1e6 customers, data.table only
# Unit: relative
# expr min lq mean median uq max neval
# dt1 1.146757 1.147152 1.155497 1.164471 1.156244 1.161660 5
# dt2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5
# dt3 1.084442 1.079734 1.253568 1.106833 1.098766 1.799935 5

Sum of subset based on second vector

I have two vectors:
a <- c(1,1,2,3,4,4,4,4,5,6)
b <- c(T,F,T,F,T,T,F,F,F,T)
I would like to have a vector that tells me how many TRUEs there are in b for each unique value in a (the second column in the desired output below):
[,1] [,2]
[1,] 1 1
[2,] 2 1
[3,] 3 0
[4,] 4 2
[5,] 5 0
[6,] 6 1
The best I can come up with here is using sapply:
sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
This is fine, but for larger vectors it is rather slow. (I tried some subset variants.)
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)
microbenchmark::microbenchmark(
subset = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b)
, iN = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a %in% uniqueA & b), a = a, b = b)
, equal = sapply(unique(a), FUN = function(uniqueA, a, b) sum(a == uniqueA & b), a = a, b = b)
, times = 5
)
Unit: milliseconds
expr min lq mean median uq max neval
subset 389.1995 390.6002 413.6969 393.0396 445.6553 449.9897 5
iN 2746.8407 2798.0462 2797.3155 2806.9477 2814.6317 2820.1110 5
equal 1080.3430 1089.2507 1111.0267 1096.8082 1135.1957 1153.5358 5
Does anyone have an idea how to do this faster?
You could use aggregate:
aggregate(b, list(a), sum)
For the fastest performance, I'd suggest a data.table. It will take longer to set up, but the performance should be quite good for larger amounts of data.
library(data.table)
dt <- data.table(a = a, b = b)
dt[,sum(b), by = a]
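A small tweak of the same call names the summary column and returns the groups sorted (keyby groups and also sets the key on a); a sketch:
# name the count column and sort the result by the grouping variable
dt[, .(trues = sum(b)), keyby = a]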
Speed test comparing (1) aggregate, (2) sapply, (3) data.table, (4) tapply:
a <- sample(1:1000, 1e5, replace = TRUE)
b <- sample(c(T,F), 1e5, replace = TRUE)
summarize_dt <- function() {
  dt <- data.table(a = a, b = b)
  dt[, sum(b), by = a]
}
microbenchmark::microbenchmark(
aggregate = aggregate(b, list(a), sum),
sapply = sapply(unique(a), FUN = function(uniqueA, a, b) sum(b[a == uniqueA]), a = a, b = b),
datatable = summarize_dt(),
tapply = tapply(b, a, sum)
)
#expr min lq mean median uq max neval
#aggregate 130.995347 133.672041 141.404597 135.301762 137.199151 213.730345 100
#sapply 335.344866 357.387474 394.432339 411.994214 425.604144 486.548520 100
#datatable 1.540011 1.914712 2.430220 2.027578 2.239999 5.297593 100
#tapply 3.075646 3.627395 4.719595 4.089434 5.934675 8.758332 100
It looks like data.table is the fastest by a lot.
Here is one using table in base R:
t <- table(a[b])
z <- as.numeric(names(t))
rbind(unname(cbind(z, t)), cbind(setdiff(unique(a),z),0))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 1
# [3,] 4 2
# [4,] 6 1
# [5,] 3 0
# [6,] 5 0
If you only want the values of a with a non-zero number of TRUEs, table(a[b]) alone would suffice.
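A variant of the same idea avoids the rbind step and keeps the zero groups by turning a[b] into a factor whose levels are all unique values of a (a sketch):
# zero-count groups are kept because they are levels of the factor
table(factor(a[b], levels = sort(unique(a))))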
Or we can use the tidyverse:
library(tidyverse)
tibble(a, b) %>%
group_by(a) %>%
summarise(b = sum(b))
A base R option would be
rowsum(+b, a)
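rowsum() returns a one-column matrix whose row names are the group values; to reproduce the two-column layout from the question, something along these lines works:
res <- rowsum(+b, a)                            # counts of TRUEs per unique value of a
unname(cbind(as.numeric(rownames(res)), res))   # two-column matrix: value, count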

Repeating calculation based on conditions

What I am trying to do is pretty simple. However, I am new to R, have not learned much about loops and functions, and am not sure what the most efficient way to get the results is. Basically, I want to count the number of rows that meet my conditions and do a division. Here is an example:
df1 <- data.frame(
Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)
df1
# Main B C D
# 1 0.008900 NA 0.018100 -0.012700
# 2 -0.050667 0.034500 0.000000 -0.025995
# 3 -0.030379 -0.068300 -0.056197 -0.042930
# 4 0.066484 -0.052774 0.040794 0.057816
# 5 0.006439 0.014661 0.035160 0.033458
# 6 -0.026076 -0.040537 -0.022662 -0.058382
My criterion for the numerator is to count the number of rows where B/C/D is > 0 and Main is > 0; for the denominator, count the number of rows where B/C/D is != 0 and Main is != 0. I can use length(which(df1$Main > 0 & df1$B > 0)) / length(which(df1$Main != 0 & df1$B != 0)) to get the ratio for each column individually. But my data set has many more columns, and I am wondering if there is a way to get those ratios all at once, so that my result looks like this:
# B C D
# 1 0.2 0.6 0.3
Use apply:
apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0)) / length(which(df1$Main !=0 & x !=0)))
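length(which(cond)) simply counts the TRUEs, so the same apply() call can also be written with sum(), which makes the NA handling explicit (a sketch):
apply(df1[, -1], 2, function(x)
  sum(df1$Main > 0 & x > 0, na.rm = TRUE) /
    sum(df1$Main != 0 & x != 0, na.rm = TRUE))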
You could do this vectorized (no apply or for loop is needed):
tail(colSums(df1[df1$Main > 0, ] > 0, na.rm = T) / colSums(df1[df1$Main != 0, ] != 0, na.rm = T), -1)
# B C D
#0.2000000 0.6000000 0.3333333
One way to do this would be with a for loop that loops over the columns and applies the function that you wrote. Something like this:
ratio1 <- vector()
for (i in 2:ncol(df1)) {
  ratio1[i-1] <- length(which(df1$Main > 0 & df1[, i] > 0)) / length(which(df1$Main != 0 & df1[, i] != 0))
}
Maybe there is a better way to do this with apply or data.table, but this is a simple solution I could come up with. It works on any number of columns. Use round() if you want the answer to one decimal place.
criteria1 <- df1[which(df1$Main > 0), -1] > 0
criteria2 <- df1[which(df1$Main != 0), -1] != 0
colSums(criteria1, na.rm = T)/colSums(criteria2, na.rm = T)
## B C D
## 0.2000000 0.6000000 0.3333333
Edit: It appears Niek's method is quickest for this specific data.
# Unit: microseconds
# expr min lq mean median uq max neval
# Jim(df1) 216.468 230.0585 255.3755 239.8920 263.6870 802.341 300
# emilliman5(df1) 120.109 135.5510 155.9018 142.4615 156.0135 1961.931 300
# Niek(df1) 97.118 107.6045 123.5204 111.1720 119.6155 1966.830 300
# nine89(df1) 211.683 222.6660 257.6510 232.2545 252.6570 2246.225 300
#[[1]]
# [,1] [,2] [,3] [,4]
#median 239.892 142.462 111.172 232.255
#ratio 1.000 0.594 0.463 0.968
#diff 0.000 -97.430 -128.720 -7.637
However, when there are many columns the vectorized approach is quicker.
Nrow <- 1000
Ncol <- 1000
mat <- matrix(runif(Nrow*Ncol),Nrow)
df1 <- data.frame(Main = sample(-2:2,Nrow,T), mat) #1001 columns
#Unit: milliseconds
# expr min lq mean median uq max
# Jim(df1) 46.75627 53.88500 66.93513 56.58143 62.04375 185.0460
#emilliman5(df1) 73.35257 91.87283 151.38991 178.53188 185.06860 292.5571
# Niek(df1) 68.17073 76.68351 89.51625 80.14190 86.45726 200.7119
# nine89(df1) 51.36117 56.79047 74.53088 60.07220 66.34270 191.8294
#[[1]]
# [,1] [,2] [,3] [,4]
#median 56.581 178.532 80.142 60.072
#ratio 1.000 3.155 1.416 1.062
#diff 0.000 121.950 23.560 3.491
The functions used in the benchmarks:
Jim <- function(df1) {
  criteria1 <- df1[which(df1$Main > 0), -1] > 0
  criteria2 <- df1[which(df1$Main != 0), -1] != 0
  colSums(criteria1, na.rm = T) / colSums(criteria2, na.rm = T)
}

emilliman5 <- function(df1) {
  apply(df1[, -1], 2, function(x) length(which(df1$Main > 0 & x > 0)) / length(which(df1$Main != 0 & x != 0)))
}

Niek <- function(df1) {
  ratio1 <- vector()
  for (i in 2:ncol(df1)) {
    ratio1[i-1] <- length(which(df1$Main > 0 & df1[, i] > 0)) / length(which(df1$Main != 0 & df1[, i] != 0))
  }
  ratio1
}

nine89 <- function(df) {
  tail(colSums(df[df$Main > 0, ] > 0, na.rm = T) / colSums(df[df$Main != 0, ] != 0, na.rm = T), -1)
}
