I have a column of a data frame df$c_touch:
c_touch
0
1
3
2
3
4
5
Where each number refers to a duration of time, such that 0 = 2 mins, 1 = 5 mins, 2 = 10 mins, 3 = 15 mins, 4 = 20 mins, 5 = 30 mins.
I'd like to add another column df$c_duration to be like
c_touch c_duration
0 2
1 5
3 15
2 10
3 15
4 20
5 30
So far I've been using a loop, which is a bit ugly/messy, and I'd rather not use it. Is there a loop-free way of adding the extra column in, particularly using dplyr mutate function (as I'm trying to rewrite all my code using dplyr)?
library(dplyr)
df %>%
  mutate(c_duration = case_when(
    c_touch == 0 ~ 2,
    c_touch == 5 ~ 30,
    TRUE ~ c_touch * 5  # codes 1-4 are exactly 5x the code
  ))
Here is a dplyr solution based on a lookup table and a join:
# data.frame containing the mapping
map <- data.frame(
  idx = 0:5,
  val = c(2, 5, 10, 15, 20, 30)
)
# Sample data
df <- read.table(text = "c_touch
0
1
3
2
3
4
5", header = TRUE)
dplyr::left_join(df, map, by = c("c_touch" = "idx"))
#  c_touch val
#1       0   2
#2       1   5
#3       3  15
#4       2  10
#5       3  15
#6       4  20
#7       5  30
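If the joined column should be called c_duration rather than val, one way is to chain on dplyr::rename:
dplyr::left_join(df, map, by = c("c_touch" = "idx")) %>%
  dplyr::rename(c_duration = val)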
You could use dplyr::case_when inside mutate:
df <- df %>%
  mutate(c_duration = case_when(c_touch == 0 ~ 2,
                                c_touch == 1 ~ 5,
                                c_touch == 2 ~ 10,
                                c_touch == 3 ~ 15,
                                c_touch == 4 ~ 20,
                                c_touch == 5 ~ 30))
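A base R alternative that avoids writing one condition per value is a named lookup vector; this is a minimal sketch assuming c_touch only ever takes the coded values 0-5:
durations <- c(`0` = 2, `1` = 5, `2` = 10, `3` = 15, `4` = 20, `5` = 30)
df$c_duration <- unname(durations[as.character(df$c_touch)])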
Related
I have two columns, A and B, each with values from 1 to 7. I need to get the maximum value across both columns EXCLUDING the value 7. Alternatively, when A is 7, keep the value of B instead (this would serve me even better). For instance:
A <- c(1,1,1,3,2,4,2,5,6,7)
B <- c(7,3,6,7,4,1,6,7,3,4)
df <- data.frame(A, B)
Expected results: 1,3,6,3,4,4,6,5,6,4
One option could be:
with(df, pmax(A * (A != 7), B * (B != 7)))
[1] 1 3 6 3 4 4 6 5 6 4
To deal with missing values:
with(df, pmax(A * (A != 7), B * (B != 7), na.rm = TRUE))
Considering also negative values (here (A != 7)^NA evaluates to 1 where A != 7 and to NA where A == 7, since TRUE^NA is 1 but FALSE^NA is NA; the NAs are then skipped by na.rm = TRUE):
with(df, pmax(A * (A != 7)^NA, B * (B != 7)^NA, na.rm = TRUE))
Update: special thanks to dear @tmfmnk.
Maybe there is a better way of handling this problem, but what instantly popped into my mind was to replace every 7 with the lowest possible value, 0, so that it does not affect the comparison. It also works with NA values.
library(dplyr)
library(purrr)
A <- c(1,1,1,3,2,4,2,5,6,7)
B <- c(7,3,6,7,4,1,6,7,3,4)
df <- tibble(A, B)
df %>%
  mutate(across(everything(), ~ replace(., . == 7, 0))) %>%
  mutate(ResultColumn = pmap_dbl(., ~ max(c(...), na.rm = TRUE)))
# A tibble: 10 x 3
       A     B ResultColumn
   <dbl> <dbl>        <dbl>
 1     1     0            1
 2     1     3            3
 3     1     6            6
 4     3     0            3
 5     2     4            4
 6     4     1            4
 7     2     6            6
 8     5     0            5
 9     6     3            6
10     0     4            4
Here is the tidyverse example:
library(tidyverse)
A <- c(1,1,1,3,2,4,2,5,6,7)
B <- c(7,3,6,7,4,1,6,7,3,4)
df <- data.frame(A,B)
df %>%
  mutate(ResultColumn = pmax(replace(A, A == 7, NA),
                             replace(B, B == 7, NA), na.rm = TRUE))
Output:
   A B ResultColumn
1  1 7            1
2  1 3            3
3  1 6            6
4  3 7            3
5  2 4            4
6  4 1            4
7  2 6            6
8  5 7            5
9  6 3            6
10 7 4            4
Another base R option using pmax
> do.call(pmax, c(replace(df, df == 7, NA), na.rm = TRUE))
[1] 1 3 6 3 4 4 6 5 6 4
or
> do.call(pmax, replace(df, df == 7, -Inf))
[1] 1 3 6 3 4 4 6 5 6 4
Another one-liner combines pmax and pmin: if the row maximum is 7, fall back to the minimum (note this would still return 7 if A and B were both 7, which does not occur in this data):
df$max <- ifelse(pmax(df$A, df$B) == 7, pmin(df$A, df$B), pmax(df$A, df$B))
I have a big data frame from a survey. There are some statements where I need to use reverse coding, so I need to change the values in a few columns. I have tried the code below (where x represents the column I want to change):
df$x <- replace( df$x, 1=7, 2=6, 3=5, 5=3, 6=2, 7=1)
But this did not work. Any help is much appreciated.
If your column only has the values 1-7, you can subtract them from 8 to reverse the coding.
set.seed(123)
df <- data.frame(x = sample(7, 10, replace = TRUE))
df$y <- 8 - df$x
#Or maybe more general
#df$y <- max(df$x) + 1 - df$x
df
#    x y
#1   7 1
#2   7 1
#3   3 5
#4   6 2
#5   3 5
#6   2 6
#7   2 6
#8   6 2
#9   3 5
#10  5 3
You could try case_when from package dplyr. The syntax is very clean.
library(dplyr)
df %>%
  mutate(x = case_when(
    x == 1 ~ 7,
    x == 2 ~ 6,
    x == 3 ~ 5,
    x == 5 ~ 3,
    x == 6 ~ 2,
    x == 7 ~ 1,
    TRUE ~ as.numeric(x)
  ))
DATA
set.seed(1)
df <- data.frame(x = sample(7, 10, replace = TRUE))
df
The solution above overwrites the variable x. To compare results, I created a new_x variable with the replaced data:
df %>%
  mutate(new_x = case_when(
    x == 1 ~ 7,
    x == 2 ~ 6,
    x == 3 ~ 5,
    x == 5 ~ 3,
    x == 6 ~ 2,
    x == 7 ~ 1,
    TRUE ~ as.numeric(x)
  ))
   x new_x
1  1     7
2  4     4
3  7     1
4  1     7
5  2     6
6  5     3
7  7     1
8  3     5
9  6     2
10 2     6
One way you can replace values is using which:
df$x[which(df$x == 1)] <- 7 # this replaces 1 with 7
Note that for a full swap (1 with 7, 7 with 1, and so on) the replacements must be based on a copy of the original column, otherwise later steps will overwrite the values set by earlier ones.
Another way is to use ifelse (spelled out in full below):
df$x <- ifelse(df$x == 1, 7, ifelse(df$x == 2, 6, ifelse(...))) # replaces 1 with 7, 2 with 6 and so on
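For completeness, here is the nested ifelse written out in full for the 1=7, 2=6, 3=5, 5=3, 6=2, 7=1 mapping from the question; because the whole vector is evaluated in one expression, the overwrite problem of sequential replacement does not arise:
df$x <- ifelse(df$x == 1, 7,
        ifelse(df$x == 2, 6,
        ifelse(df$x == 3, 5,
        ifelse(df$x == 5, 3,
        ifelse(df$x == 6, 2,
        ifelse(df$x == 7, 1, df$x))))))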
An option with which.max:
library(dplyr)
df %>%
  mutate(y = x[which.max(x)] - x + 1)
If you have a complete data frame (no missing values), it is easy to multiply values based on a logical condition:
df = data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c(1, 2, 3, 2, 1),
  var3 = c(5, 4, 3, 4, 5)
)
> df
  var1 var2 var3
1    1    1    5
2    2    2    4
3    3    3    3
4    4    2    4
5    5    1    5
> df[df > 2] <- df[df > 2] * 10
> df
  var1 var2 var3
1    1    1   50
2    2    2   40
3   30   30   30
4   40    2   40
5   50    1   50
However, if you have NA values in the data frame, the operation fails:
> df_na = data.frame(
    var1 = c(NA, 2, 3, 4, 5),
    var2 = c(1, 2, 3, 1, NA),
    var3 = c(5, NA, 3, 4, 5)
  )
> df_na
  var1 var2 var3
1   NA    1    5
2    2    2   NA
3    3    3    3
4    4    1    4
5    5   NA    5
> df_na[df_na > 2] <- df_na[df_na > 2] * 10
Error in `[<-.data.frame`(`*tmp*`, df_na > 2, value = c(NA, 30, 40, 50, :
  'value' is the wrong length
I tried, for example, some na.omit() tactics but could not make them work. I also could not find an appropriate question here on Stack Overflow.
So, how should I do it?
You can add !is.na() as an additional logical condition to subset by:
df_na[df_na > 2 & !is.na(df_na)] <- df_na[df_na > 2 & !is.na(df_na)] * 10
# > df_na
#   var1 var2 var3
# 1   NA    1   50
# 2    2    2   NA
# 3   30   30   30
# 4   40    1   40
# 5   50   NA   50
Alternatively, a dplyr / tidyverse solution would be:
library(dplyr)
df_na %>%
  mutate_all(.funs = ~ ifelse(!is.na(.x) & .x > 2, .x * 10, .x))
Added based on OP comment:
If you want to subset by values based on the %in% operator, opt for the dplyr solution (the %in% operator won't work the same way here as explained in this post):
df_na %>%
  mutate_all(.funs = ~ ifelse(!is.na(.x) & .x %in% c(3, 4), .x * 10, .x))
#   var1 var2 var3
# 1   NA    1    5
# 2    2    2   NA
# 3   30   30   30
# 4   40    1   40
# 5    5   NA    5
This approach generally lends itself to more complex manipulation tasks. You may, for instance, also define additional conditions with the help of dplyr::case_when() instead of the one-alternative ifelse; a sketch follows.
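As a rough sketch of that idea (the .x > 4 threshold and the * 100 factor are invented here purely for illustration):
library(dplyr)
df_na %>%
  mutate_all(.funs = ~ case_when(
    is.na(.x) ~ .x,       # keep missing values untouched
    .x > 4 ~ .x * 100,    # hypothetical extra condition
    .x > 2 ~ .x * 10,
    TRUE ~ .x
  ))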
Does this work? Using base R:
df_na[] <- lapply(df_na, function(x) ifelse(!is.na(x) & x > 2, x * 10, x))
df_na
  var1 var2 var3
1   NA    1   50
2    2    2   NA
3   30   30   30
4   40    1   40
5   50   NA   50
The problem is not with the multiplication, it is with the logical indexing: df_na > 2 returns NA wherever the data are NA. which() keeps only the TRUE positions, so the NAs drop out. You can collapse the two lines below into one if you like:
inds <- which(df_na > 2, arr.ind = TRUE)
df_na[inds] <- df_na[inds] * 10
None of these operations is particularly hard on its own, but I'm wondering how to combine them.
df <- tibble::tibble(index = seq(1:8),
                     amps = c(7, 6, 7, 0, 7, 6, 0, 6))
As long as there are positive values for amps, I'd like to sum them up. If amps = 0, that's a break in the sequence: return the 0, then start over. I'd also like to return the corresponding index value. The result would look like this:
  index  amps
  <dbl> <dbl>
1     1    20
2     4     0
3     5    13
4     7     0
5     8     6
I can do this in VBA but I'd like to beef up my R skills in functional programming. I would prefer to use functions rather than loops just because they're cleaner. Any help is appreciated.
Another base R solution uses rle + tapply; u labels each run of zero / non-zero values with its run id:
u <- with(rle(df$amps == 0), rep(seq_along(lengths), lengths))
dfout <- data.frame(
  index = which(!duplicated(u)),
  amps = tapply(df$amps, u, sum)
)
which gives
> dfout
  index amps
1     1   20
2     4    0
3     5   13
4     7    0
5     8    6
One dplyr option could be:
df %>%
  group_by(grp = with(rle(amps == 0), rep(seq_along(lengths), lengths))) %>%
  summarise(index = first(index),
            amps = sum(amps))
    grp index  amps
  <int> <int> <dbl>
1     1     1    20
2     2     4     0
3     3     5    13
4     4     7     0
5     5     8     6
We can create a new group wherever amps = 0 or the previous value of amps is 0 (for the sample data the cumulative counter comes out as 0 0 0 1 2 2 3 4), then take the first value of index and the sum of amps for each group.
library(dplyr)
df %>%
  group_by(gr = cumsum(amps == 0 | lag(amps, default = first(amps)) == 0)) %>%
  summarise(index = first(index), amps = sum(amps)) %>%
  select(-gr)
# A tibble: 5 x 2
#  index  amps
#  <int> <dbl>
#1     1    20
#2     4     0
#3     5    13
#4     7     0
#5     8     6
Using the same logic in data.table:
library(data.table)
setDT(df)[, .(index = first(index), amps = sum(amps)),
          by = cumsum(amps == 0 | shift(amps, fill = first(amps)) == 0)]
In base R we could use aggregate based on the rle: ll holds the lengths of the runs of zero / non-zero values, and the rep() call stamps every row with the index that starts its run.
ll <- rle(df$amps != 0)$lengths
rr <- aggregate(amps ~ cbind(index = rep(index[!!c(amps[1] > 0, diff(amps != 0))], ll)), df, sum)
rr
#   index amps
# 1     1   20
# 2     4    0
# 3     5   13
# 4     7    0
# 5     8    6
I have the following data frame:
library(tidyverse)
x <- c(1,2,3,NA,NA,4,5)
y <- c(1,2,3,5,5,4,5)
z <- c(1,1,1,6,7,7,8)
df <- data.frame(x,y,z)
df
   x y z
1  1 1 1
2  2 2 1
3  3 3 1
4 NA 5 6
5 NA 5 7
6  4 4 7
7  5 5 8
I would like to update the data frame according to the following conditions:
If z==1, update to x=1, else leave the current value for x
If z==1, update to y=2, else leave the current value for y
The following code does the job fine
df %>% mutate(x = if_else(z == 1, 1, x), y = if_else(z == 1, 2, y))
   x y z
1  1 2 1
2  1 2 1
3  1 2 1
4 NA 5 6
5 NA 5 7
6  4 4 7
7  5 5 8
However, I have to add an if_else statement to the mutate call for each of x and y. This has the potential to make my code complicated and hard to read. To give you a SQL analogy, consider the following code:
UPDATE df
SET x = 1, y = 2
WHERE z = 1;
I would like to achieve the following:
Specify the update condition ahead of time, so I don't have to repeat it for every mutate function
I would like to avoid using data.table or base R. I am using dplyr, so I would like to stick to it for consistency.
Using mutate_cond, posted at dplyr mutate/replace several columns on a subset of rows, we can do this (a sketch of the helper itself follows the output below):
df %>% mutate_cond(z == 1, x = 1, y = 2)
giving:
   x y z
1  1 2 1
2  1 2 1
3  1 2 1
4 NA 5 6
5 NA 5 7
6  4 4 7
7  5 5 8
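For self-containedness, the mutate_cond helper from the linked answer looks roughly like this (a sketch, not necessarily the canonical definition; see the link for details):
mutate_cond <- function(.data, condition, ...) {
  # evaluate the condition in the context of the data frame,
  # then mutate only the matching rows in place
  condition <- eval(substitute(condition), .data)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}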
sqldf
Of course you can implement it directly in SQL with sqldf -- ignore the warning message that the RSQLite backend issues.
library(sqldf)
sqldf(c("update df set x = 1, y = 2 where z = 1", "select * from df"))
base R
It's straightforward in base R:
df[df$z == 1, c("x", "y")] <- list(1, 2)
library(dplyr)
df %>%
  mutate(x = replace(x, z == 1, 1),
         y = replace(y, z == 1, 2))
#    x y z
#1   1 2 1
#2   1 2 1
#3   1 2 1
#4  NA 5 6
#5  NA 5 7
#6   4 4 7
#7   5 5 8
In base R:
transform(df,
          x = replace(x, z == 1, 1),
          y = replace(y, z == 1, 2))
If you store the condition in a variable, you don't have to type it multiple times:
condn <- (df$z == 1)
transform(df,
          x = replace(x, condn, 1),
          y = replace(y, condn, 2))
Here is one option with map2. Loop through the 'x' and 'y' columns of the dataset along with the values to assign, and apply case_when based on the values of 'z': where z equals 1, return the new value, otherwise return the column unchanged. Finally, bind the result back to the original dataset:
library(dplyr)
library(purrr)
map2_df(df %>%
          select(x, y), c(1, 2), ~ case_when(df$z == 1 ~ .y, TRUE ~ .x)) %>%
  bind_cols(df %>%
              select(z), .) %>%
  select(names(df))
Or, using base R: create a logical vector, use it to subset the rows of columns 'x' and 'y', and update them by assigning a list of values:
i1 <- df$z == 1
df[i1, c('x', 'y')] <- list(1, 2)
df
#    x y z
#1   1 2 1
#2   1 2 1
#3   1 2 1
#4  NA 5 6
#5  NA 5 7
#6   4 4 7
#7   5 5 8
The advantage of both solutions is that we can pass any number of columns with their corresponding replacement values without repeating the code; see the sketch below.
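For instance, the base R form parameterises naturally; cols and vals below are names invented just for illustration:
cols <- c('x', 'y')
vals <- list(1, 2)
df[df$z == 1, cols] <- vals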
If you have an SQL background, you should really check out data.table:
library(data.table)
dt <- as.data.table(df)
set(dt, which(dt$z == 1), c('x', 'y'), list(1, 2))
dt
# or perhaps more classic syntax
dt <- as.data.table(df)
dt
#    x y z
#1:  1 1 1
#2:  2 2 1
#3:  3 3 1
#4: NA 5 6
#5: NA 5 7
#6:  4 4 7
#7:  5 5 8
dt[z == 1, `:=`(x = 1, y = 2)]
dt
#    x y z
#1:  1 2 1
#2:  1 2 1
#3:  1 2 1
#4: NA 5 6
#5: NA 5 7
#6:  4 4 7
#7:  5 5 8
The last option is an update join. This is great if you have the lookup data already done upfront:
# update join:
dt <- as.data.table(df)
dt_lookup <- data.table(x = 1, y = 2, z = 1)
dt[dt_lookup, on = .(z), `:=`(x = i.x, y = i.y)]
dt
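An update join also scales to multi-row lookups; here is a sketch with a hypothetical second group (the z == 7 row and its replacement values are invented for illustration):
dt <- as.data.table(df)
dt_lookup <- data.table(z = c(1, 7), x = c(1, 100), y = c(2, 200))
dt[dt_lookup, on = .(z), `:=`(x = i.x, y = i.y)]
dt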