Condition in rows, modify all columns without a loop - r

what I want to do is to modify all selected columns of an R data table according to the rows conditions i.e
for all 4 columns selected in cols variable, if the value is greater (or equal) than 1.5, i would like to put them to 1 else 0
I tried something like that : iris[(cols) > 1.5 , (cols) := 1, .SDcols = cols]
Thx

One data.table approach:
iris <- as.data.table(iris)
cols <- names(iris)[1:4]
cols
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
iris[, (cols) := lapply(.SD, function(z) fifelse(z > 1.5, 1, z)), .SDcols = cols]
iris
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <num> <num> <num> <num> <fctr>
# 1: 1 1 1.4 0.2 setosa
# 2: 1 1 1.4 0.2 setosa
# 3: 1 1 1.3 0.2 setosa
# 4: 1 1 1.5 0.2 setosa
# 5: 1 1 1.4 0.2 setosa
# 6: 1 1 1.0 0.4 setosa
# 7: 1 1 1.4 0.3 setosa
# 8: 1 1 1.5 0.2 setosa
# 9: 1 1 1.4 0.2 setosa
# 10: 1 1 1.5 0.1 setosa
# ---
# 141: 1 1 1.0 1.0 virginica
# 142: 1 1 1.0 1.0 virginica
# 143: 1 1 1.0 1.0 virginica
# 144: 1 1 1.0 1.0 virginica
# 145: 1 1 1.0 1.0 virginica
# 146: 1 1 1.0 1.0 virginica
# 147: 1 1 1.0 1.0 virginica
# 148: 1 1 1.0 1.0 virginica
# 149: 1 1 1.0 1.0 virginica
# 150: 1 1 1.0 1.0 virginica
An alternative using set:
for (nm in cols) set(iris, which(iris[[nm]] > 1.5), nm, 1)

Another solution:
library(dplyr)
library(data.table)
iris[,1:4] %>% data.table() %>% mutate_all(~ ifelse(.x>=1.5,1,0))

If you just need to check for numeric columns across can be a good fit, it also works with more specific choices like positions and names
library(tidyverse)
iris |>
as_tibble() |>
mutate(across(.cols = where(is.numeric),.fns = ~ if_else(.x > 1.5,1,.x)))
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 1.4 0.2 setosa
#> 2 1 1 1.4 0.2 setosa
#> 3 1 1 1.3 0.2 setosa
#> 4 1 1 1.5 0.2 setosa
#> 5 1 1 1.4 0.2 setosa
#> 6 1 1 1 0.4 setosa
#> 7 1 1 1.4 0.3 setosa
#> 8 1 1 1.5 0.2 setosa
#> 9 1 1 1.4 0.2 setosa
#> 10 1 1 1.5 0.1 setosa
#> # ... with 140 more rows
Created on 2021-10-18 by the reprex package (v2.0.1)

Base R option -
data <- iris
cols <- 1:4
data[cols] <- +(data[cols] > 1.5)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 1 1 0 0 setosa
#2 1 1 0 0 setosa
#3 1 1 0 0 setosa
#4 1 1 0 0 setosa
#5 1 1 0 0 setosa
#6 1 1 1 0 setosa
#...
#...
The + at the beginning is used to change the logical values (TRUE/FALSE) to integers (1/0).

We may do
library(dplyr)
iris %>%
mutate(across(where(is.numeric), ~ +(. > 1.5)))

Related

Create a column in the original dataset to indicate whether the row was drawn in a random stratified sample

I would like to draw a stratified random sample (n = 375) from a dataset. Based on the stratified random sample, I would like to add a column to the original dataset indicating whether the row is in the stratified random sample (1) or not (0).
iris <- iris
# Get a random stratified sample
library(tidyverse)
stratified <- iris %>%
group_by(Species) %>%
sample_n(size=1)
# The final result I would like to get:
iris$sample3 <- 0
iris[21,6] <- 1
iris[65,6] <- 1
iris[106,6] <- 1
After doing that, I would like to repeat the procedure by drawing a second stratified random sample (n = 125) from my first stratified random sample (n = 375) and repeat the creation of a column.
You can add a column to your data frame that has the required number of 1s per group (and 0 otherwise).
set.seed(1)
samples <- 1
sample1 <- iris %>%
group_by(Species) %>%
mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))
sample1
sample1
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0
#> 2 4.9 3 1.4 0.2 setosa 0
#> 3 4.7 3.2 1.3 0.2 setosa 0
#> 4 4.6 3.1 1.5 0.2 setosa 1
#> 5 5 3.6 1.4 0.2 setosa 0
#> 6 5.4 3.9 1.7 0.4 setosa 0
#> 7 4.6 3.4 1.4 0.3 setosa 0
#> 8 5 3.4 1.5 0.2 setosa 0
#> 9 4.4 2.9 1.4 0.2 setosa 0
#> 10 4.9 3.1 1.5 0.1 setosa 0
#> # ... with 140 more rows
To get the sampled values, simply filter to find the 1s:
sample1 %>% filter(sampled == 1)
#> # A tibble: 3 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 1
#> 2 5.6 3 4.1 1.3 versicolor 1
#> 3 6.3 3.3 6 2.5 virginica 1
Created on 2022-05-16 by the reprex package (v2.0.1)

How to use dplyr::if_else to mutate across a tibble dataframe?

I wonder how to combine mutate and if_else to transform a data frame into TRUE and FALSE?
For example, mutate a table into TRUE (value >= 2) and FALSE(value <2):
> iris %>% as_tibble() %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# A tibble: 150 × 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# … with 140 more rows
into
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 T T F F
2 T T F F
3 T T F F
4 T T F F
5 T T F F
6 T T F F
7 T T F F
Thanks a lot!
iris %>%
mutate(across(where(is.numeric), ~ . >= 2))
You don't need if_else when the result you want is TRUE or FALSE. Generally, ifelse(test, TRUE, FALSE) is a long way of writing test.
Or in base R
iris[1:4] >= 2

dummy code collapsed column using data.table in R

It is quite easy to dummy code a collapsed column using the tidyverse. Here is a quick example of how I've done it in the past. First, I'll load the iris data and create a custom collapsed column of randomly sampled letters:
library(tidyverse)
# load practice data
data(iris)
iris <- as_tibble(iris)
# create column of collapsed values
lst <- list()
for(i in 1:150) {
value <- as.list(paste0(sample(letters[1:2], 1), ", ", sample(letters[3:4], 1)))
lst[i] <- value
}
# append custom columns to the iris dataset
iris$Samples <- unlist(lst)
iris$Subject <- c(1:150)
iris <- iris %>% select(Subject, everything())
# preview custom dataset
iris
# A tibble: 150 x 7
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species Samples
<int> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 1 5.1 3.5 1.4 0.2 setosa a, d
2 2 4.9 3 1.4 0.2 setosa a, c
3 3 4.7 3.2 1.3 0.2 setosa a, c
4 4 4.6 3.1 1.5 0.2 setosa b, c
5 5 5 3.6 1.4 0.2 setosa a, c
6 6 5.4 3.9 1.7 0.4 setosa a, d
7 7 4.6 3.4 1.4 0.3 setosa b, c
8 8 5 3.4 1.5 0.2 setosa b, c
9 9 4.4 2.9 1.4 0.2 setosa b, d
10 10 4.9 3.1 1.5 0.1 setosa a, c
# ... with 140 more rows
So, let's say that each letter represented a unique value of interest and I wanted to wrangle this data into a series of dummy coded variables for each letter. Here is how I would do this using tidyverse functions:
iris %>%
separate_rows(Samples, sep = ', ') %>%
mutate(Values = 1) %>%
pivot_wider(names_from = "Samples", values_from = "Values") %>%
mutate_if(is.double, ~replace_na(., 0))
# A tibble: 150 x 10
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species a d c b
<int> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 5.1 3.5 1.4 0.2 setosa 1 1 0 0
2 2 4.9 3 1.4 0.2 setosa 1 0 1 0
3 3 4.7 3.2 1.3 0.2 setosa 1 0 1 0
4 4 4.6 3.1 1.5 0.2 setosa 0 0 1 1
5 5 5 3.6 1.4 0.2 setosa 1 0 1 0
6 6 5.4 3.9 1.7 0.4 setosa 1 1 0 0
7 7 4.6 3.4 1.4 0.3 setosa 0 0 1 1
8 8 5 3.4 1.5 0.2 setosa 0 0 1 1
9 9 4.4 2.9 1.4 0.2 setosa 0 1 0 1
10 10 4.9 3.1 1.5 0.1 setosa 1 0 1 0
# ... with 140 more rows
This is fast and efficient for small datasets. But, I am quickly moving into datasets that have millions of rows. Enter data.table.
How would I accomplish the same process using data.table? Here is my attempt:
library(data.table)
# convert my tibble into a data.table
iris.dt <- as.data.table(iris)
# perform the separate_rows functionality on my data
result <- iris.dt[, list(Samples = unlist(strsplit(Samples, ", "))), by = Subject
][, Values := 1]
print(result)
Subject Samples Values
1: 1 a 1
2: 1 d 1
3: 2 a 1
4: 2 c 1
5: 3 a 1
---
296: 148 d 1
297: 149 a 1
298: 149 d 1
299: 150 b 1
300: 150 c 1
The problem is that I don't know how to (1) keep all other columns and (2) spread out this info in a way similar to dplyr::pivot_wider.
Any help would be much appreciated!
One way is to tstrsplit and then melt+dcast. Seems kind of inefficient but not sure of another way
Example Data:
library(magrittr)
library(data.table)
set.seed(2020)
iris.dt <- as.data.table(iris)
iris.dt[, samples := paste0(sample(letters[1:2], .N, T), ', ', sample(letters[3:4], .N, T))]
Create dummy cols
new_cols <-
iris.dt[, tstrsplit(samples, ', ')][, I := .I] %>%
melt('I') %>%
dcast(I ~ value, fun.agg = length) %>%
.[, I := NULL]
iris.dt[, names(new_cols) := new_cols][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species samples a b c d
# 1: 5.1 3.5 1.4 0.2 setosa b, c 0 1 1 0
# 2: 4.9 3.0 1.4 0.2 setosa a, d 1 0 0 1
# 3: 4.7 3.2 1.3 0.2 setosa b, c 0 1 1 0
# 4: 4.6 3.1 1.5 0.2 setosa a, d 1 0 0 1
# 5: 5.0 3.6 1.4 0.2 setosa a, c 1 0 1 0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica b, d 0 1 0 1
# 147: 6.3 2.5 5.0 1.9 virginica a, d 1 0 0 1
# 148: 6.5 3.0 5.2 2.0 virginica b, c 0 1 1 0
# 149: 6.2 3.4 5.4 2.3 virginica a, c 1 0 1 0
# 150: 5.9 3.0 5.1 1.8 virginica a, d 1 0 0 1
Here is another option using matrix numeric index:
l <- strsplit(DT[["Samples"]], ",")
nl <- lengths(l)
ul <- unlist(l)
cols <- sort(unique(ul))
DT[, (cols) := {
m <- matrix(0L, nrow=.N, ncol=length(cols))
m[cbind(rep(1L:.N, nl), match(ul, cols))] <- 1L
as.data.table(m)
}]
output:
Subject Samples a b c d
1: 1 a,d 1 0 0 1
2: 2 a,c 1 0 1 0
3: 3 a,c 1 0 1 0
4: 4 b,c 0 1 1 0
5: 5 a,c 1 0 1 0
6: 6 a,d 1 0 0 1
7: 7 b,c 0 1 1 0
8: 8 b,c 0 1 1 0
9: 9 b,d 0 1 0 1
10: 10 a,c 1 0 1 0
data:
DT <- fread("Subject Samples
1 a,d
2 a,c
3 a,c
4 b,c
5 a,c
6 a,d
7 b,c
8 b,c
9 b,d
10 a,c", sep=" ")

R - replace variable with a specific values in dplyr

Let's say I want to replace several variables by 1 in a dataset:
data(iris)
put_1 <- function(x){ x = 1}
iris %>%
mutate_at(vars(Petal.Length, Petal.Width), funs(put_1)) %>%
head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1 1 setosa
# 2 4.9 3.0 1 1 setosa
# 3 4.7 3.2 1 1 setosa
# 4 4.6 3.1 1 1 setosa
# 5 5.0 3.6 1 1 setosa
# 6 5.4 3.9 1 1 setosa
Question : Is there a way to do the same without declaring a function before ?
I tried things like :
mutate_at(vars(...), funs(function(x){ x <- 1 }))
mutate_at(vars(...), funs(~ 1 }))
mutate_at(vars(...), funs(~ . = 1 }))
without success.
Thank you in advance.
This is one of the times when = and <- don't work the same
> iris%>%mutate_at(vars(Petal.Length,Petal.Width),funs(.<-1))%>%head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1 1 setosa
2 4.9 3.0 1 1 setosa
3 4.7 3.2 1 1 setosa
4 4.6 3.1 1 1 setosa
5 5.0 3.6 1 1 setosa
6 5.4 3.9 1 1 setosa
And
> iris%>%mutate_at(vars(Petal.Length,Petal.Width),funs(.=1))%>%head()
Error: Can't create call to non-callable object
Call `rlang::last_error()` to see a backtrace
The best answer is from #josemz
iris %>%
mutate_at(vars(Petal.Length, Petal.Width), ~ 1)

dplyr summarize: how to include all table columns in the output table

I have the follow dataset
# Dataset
x<-tbl_df(data.frame(locus=c(1,2,2,3,4,4,5,5,5,6),v=c(1,1,2,1,1,2,1,2,3,1),rpkm=rnorm(10,10)))
If I use the follow command
# Subset
x%>%group_by(locus)%>%summarize(max(rpkm))
I obtained
locus max(rpkm)
1 9.316949
2 10.273270
3 9.879886
4 10.944641
5 10.837681
6 13.450680
While I'd like to obtain
locus v max(rpkm)
1 1 9.316949
2 1 10.273270
3 1 9.879886
4 2 10.944641
5 1 10.837681
6 1 13.450680
So, I'd like to have in the output table the "v" correspondent row.
Is it possible?
Try:
x %>% group_by(locus) %>%
summarize(max(rpkm), v = v[which(rpkm==max(rpkm))])
You can use the top_n function instead
# with set.seed(15)
x %>% group_by(locus) %>% top_n(1, rpkm)
# locus v rpkm
# 1 1 1 10.258823
# 2 2 1 11.831121
# 3 3 1 10.897198
# 4 4 1 10.488016
# 5 5 2 11.090773
# 6 6 1 8.924999
Try this:
x %>% group_by(locus) %>% filter(rpkm==max(rpkm))
I assume you're looking for a way to not type all of the column names by hand, and you achieve that by using across within summarize, like so:
iris %>%
group_by(Species) %>%
dplyr::summarize(
across(everything()),
mean_l = mean(Sepal.Length)
) %>%
head()
# A tibble: 6 × 6
# Groups: Species [1]
Species Sepal.Length Sepal.Width Petal.Length Petal.Width mean_l
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2 5.01
2 setosa 4.9 3 1.4 0.2 5.01
3 setosa 4.7 3.2 1.3 0.2 5.01
4 setosa 4.6 3.1 1.5 0.2 5.01
5 setosa 5 3.6 1.4 0.2 5.01
6 setosa 5.4 3.9 1.7 0.4 5.01

Resources