I'd like to apply case_when to all columns in the data frame.
set.seed(1)
data <- tibble(x = runif(10), y = x * 2)
data
For all values above 0.5, I'd like to replace them with the string ">0.5", and for those above 1, with ">1".
I've tried case_when, but it appears that I have to specify each column, like x and y. I'd like to use case_when on the entire data frame without naming the columns.
A purrr solution:
library(dplyr)  # for case_when() and %>%
library(purrr)
data %>%
  map_df(~case_when(.x > 0.5 & .x < 1 ~ ">0.5",
                    .x >= 1 ~ ">1"))
Output:
x y
<chr> <chr>
1 NA >0.5
2 NA >0.5
3 >0.5 >1
4 >0.5 >1
5 NA NA
6 >0.5 >1
7 >0.5 >1
8 >0.5 >1
9 >0.5 >1
10 NA NA
Here is a potential solution:
library(tidyverse)
set.seed(1)
data <- tibble(x = runif(10), y = x * 2)
data
#> # A tibble: 10 × 2
#> x y
#> <dbl> <dbl>
#> 1 0.266 0.531
#> 2 0.372 0.744
#> 3 0.573 1.15
#> 4 0.908 1.82
#> 5 0.202 0.403
#> 6 0.898 1.80
#> 7 0.945 1.89
#> 8 0.661 1.32
#> 9 0.629 1.26
#> 10 0.0618 0.124
data %>%
mutate(across(everything(),
~case_when(.x > 0.5 & .x < 1.0 ~ ">0.5",
.x >= 1.0 ~ ">1")))
#> # A tibble: 10 × 2
#> x y
#> <chr> <chr>
#> 1 <NA> >0.5
#> 2 <NA> >0.5
#> 3 >0.5 >1
#> 4 >0.5 >1
#> 5 <NA> <NA>
#> 6 >0.5 >1
#> 7 >0.5 >1
#> 8 >0.5 >1
#> 9 >0.5 >1
#> 10 <NA> <NA>
Created on 2021-10-24 by the reprex package (v2.0.1)
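To avoid repeating the conditions, the labelling rule can be pulled into a small helper and applied to the numeric columns only. This is a minimal sketch: label_threshold and the toy df (including its id column) are hypothetical names, not part of the question. Because case_when uses the first condition that matches, testing >= 1 first keeps the second condition simple.

```r
library(dplyr)

# Hypothetical helper: label one numeric vector; values <= 0.5 stay NA
label_threshold <- function(x) {
  case_when(
    x >= 1  ~ ">1",    # checked first, so it wins for large values
    x > 0.5 ~ ">0.5"
  )
}

# Toy data with a non-numeric column that should be left alone
df <- tibble(id = c("a", "b", "c"), x = c(0.2, 0.7, 1.4), y = x * 2)

out <- df %>%
  mutate(across(where(is.numeric), label_threshold))
out
```

Using where(is.numeric) instead of everything() means character or date columns pass through unchanged.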
You can use cut -
library(dplyr)
data %>%
  mutate(across(everything(), ~cut(., c(0.5, 1, Inf), c(">0.5", ">1"))))
# x y
# <fct> <fct>
# 1 NA >0.5
# 2 NA >0.5
# 3 >0.5 >1
# 4 >0.5 >1
# 5 NA NA
# 6 >0.5 >1
# 7 >0.5 >1
# 8 >0.5 >1
# 9 >0.5 >1
#10 NA NA
In base R, with lapply -
data[] <- lapply(data, function(x) cut(x, c(0.5, 1, Inf), c(">0.5", ">1")))
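One detail worth checking with cut is how the interval boundaries are treated. By default the intervals are closed on the right, so a value of exactly 1 is labelled ">0.5" and a value of exactly 0.5 becomes NA, which differs from the case_when answers that use .x >= 1. The sketch below uses a made-up vector x to show both conventions.

```r
x <- c(0.5, 0.75, 1, 1.5)

# Default (right = TRUE): intervals are (0.5, 1] and (1, Inf]
right_closed <- cut(x, breaks = c(0.5, 1, Inf), labels = c(">0.5", ">1"))

# right = FALSE: intervals are [0.5, 1) and [1, Inf)
left_closed <- cut(x, breaks = c(0.5, 1, Inf), labels = c(">0.5", ">1"),
                   right = FALSE)

data.frame(x, right_closed, left_closed)
```

The result is a factor in both cases; wrap it in as.character() if a character column is needed.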
We can use if_all with everything() (to select all the columns) to create the logical vector
library(dplyr)
data %>%
mutate(new = case_when(if_all(everything(), `>`, 1) ~ ">1", if_all(everything(), `>`, 0.5) ~ ">0.5")
)
Output:
# A tibble: 10 × 3
x y new
<dbl> <dbl> <chr>
1 0.266 0.531 <NA>
2 0.372 0.744 <NA>
3 0.573 1.15 >0.5
4 0.908 1.82 >0.5
5 0.202 0.403 <NA>
6 0.898 1.80 >0.5
7 0.945 1.89 >0.5
8 0.661 1.32 >0.5
9 0.629 1.26 >0.5
10 0.0618 0.124 <NA>
NOTE: Because the OP asked about the entire dataset, this creates a single new column from a row-wise test across all columns.
If the OP meant separate columns, use between for the first range and an explicit test for values above 1, so that values below 0.5 stay NA:
data %>%
  mutate(across(everything(),
    ~ case_when(between(.x, 0.5, 1) ~ "> 0.5", .x > 1 ~ "> 1")))
# A tibble: 10 × 2
x y
<chr> <chr>
1 <NA> > 0.5
2 <NA> > 0.5
3 > 0.5 > 1
4 > 0.5 > 1
5 <NA> <NA>
6 > 0.5 > 1
7 > 0.5 > 1
8 > 0.5 > 1
9 > 0.5 > 1
10 <NA> <NA>
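Alternatively, label every cell at once from the logical matrices (the stricter condition must come first, because case_when uses the first match):
out <- as.data.frame(data)
out[] <- case_when(data > 1 ~ "> 1", data > 0.5 ~ "> 0.5")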
Another base R solution:
ff <- function(z) {x <- rep(NA_character_, length(z)); x[z > .5] <- ">.5"; x[z > 1] <- ">1"; x}
sapply(data, ff)
# x y
# [1,] NA ">.5"
# [2,] NA ">.5"
# [3,] ">.5" ">1"
# [4,] ">.5" ">1"
# [5,] NA NA
# [6,] ">.5" ">1"
# [7,] ">.5" ">1"
# [8,] ">.5" ">1"
# [9,] ">.5" ">1"
#[10,] NA NA
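Note that sapply simplifies the result to a character matrix, as the printed row labels show. If a data frame is wanted instead, the same function can be assigned back column by column with lapply. A minimal sketch, assuming a small made-up data frame df and a multi-line copy of the ff function from above:

```r
df <- data.frame(x = c(0.2, 0.7, 1.4), y = c(0.4, 1.4, 2.8))

ff <- function(z) {
  x <- rep(NA_character_, length(z))
  x[z > .5] <- ">.5"
  x[z > 1] <- ">1"   # overwrites ">.5" where z is also above 1
  x
}

m <- sapply(df, ff)            # character matrix, 3 rows by 2 columns
labelled <- df
labelled[] <- lapply(df, ff)   # keeps the data frame structure
labelled
```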
See the code below.
The mutate(across(everything(), scale, .names = "{.col}_z")) call generates columns with [,1] appended to their names.
Two questions:
Why is this happening?
How can I avoid or remove it?
library(dplyr)
# Input
df_test <- tibble(x = c(1, 2, 3, 4), y = c(5, 6, 7, 8))
# My code generating x_z and y_z
df_scaled <- df_test %>%
mutate(across(everything(), scale, .names = "{.col}_z"))
# Output
df_scaled
#> # A tibble: 4 × 4
#> x y x_z[,1] y_z[,1]
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5 -1.16 -1.16
#> 2 2 6 -0.387 -0.387
#> 3 3 7 0.387 0.387
#> 4 4 8 1.16 1.16
Expected output
#> # A tibble: 4 × 4
#> x y x_z y_z
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5 -1.16 -1.16
#> 2 2 6 -0.387 -0.387
#> 3 3 7 0.387 0.387
#> 4 4 8 1.16 1.16
Created on 2022-12-30 with reprex v2.0.2
scale returns a one-column matrix. To drop the dim attribute, we can wrap the result in c() or as.numeric(), or extract the column with [:
library(dplyr)
df_test %>%
mutate(across(everything(),
~ as.numeric(scale(.x)), .names = "{.col}_z"))
Output:
# A tibble: 4 × 4
x y x_z y_z
<dbl> <dbl> <dbl> <dbl>
1 1 5 -1.16 -1.16
2 2 6 -0.387 -0.387
3 3 7 0.387 0.387
4 4 8 1.16 1.16
i.e. check the output on a single column
> scale(df_test[[1]])
[,1]
[1,] -1.1618950
[2,] -0.3872983
[3,] 0.3872983
[4,] 1.1618950
attr(,"scaled:center")
[1] 2.5
attr(,"scaled:scale")
[1] 1.290994
If we check the source code
> scale.default
function (x, center = TRUE, scale = TRUE)
{
x <- as.matrix(x) # it is converting to matrix
...
The matrix conversion is needed so that apply/colMeans/sweep can be used internally; as a result, a vector passed to scale is converted to a single-column matrix:
> as.matrix(df_test$x)
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
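The three ways of dropping the matrix structure mentioned above can be compared directly in base R. This is a small sketch on a made-up vector x; it also checks the result against the usual (x - mean) / sd formula.

```r
x <- c(1, 2, 3, 4)

z <- scale(x)
is.matrix(z)   # TRUE: one column, plus "scaled:center" / "scaled:scale" attributes
dim(z)

# Three ways to get a plain numeric vector back
z_num <- as.numeric(z)
z_c   <- c(z)
z_col <- z[, 1]

# All agree with the manual standardisation
manual <- (x - mean(x)) / sd(x)
all.equal(z_num, manual)
```

Any of the three can be used inside across(); as.numeric() is probably the most self-explanatory.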
We could also write a replacement for scale, my_scale, using caret:
library(caret)
my_scale <- function(x) {
  # preProcess() learns the centering/scaling; predict() applies it
  var <- preProcess(data.frame(x), c("center", "scale"))
  predict(var, data.frame(x))$x
}
df_test %>%
  mutate(across(everything(), ~my_scale(.), .names = "{.col}_z"))
x y x_z y_z
<dbl> <dbl> <dbl> <dbl>
1 1 5 -1.16 -1.16
2 2 6 -0.387 -0.387
3 3 7 0.387 0.387
4 4 8 1.16 1.16
I have a data frame that looks like this:
Date = seq(as.Date("2022/1/1"), by = "day", length.out = 10)
x = c(rnorm(9),NA)
y = c(NA,0,rnorm(4),6,0,0,10)
z = c(rnorm(4),0,0,NA,NA,3,2)
d = tibble(Date,x,y,z);d
# A tibble: 10 x 4
Date x y z
<date> <dbl> <dbl> <dbl>
1 2022-01-01 2.456174 NA 0.2963012
2 2022-01-02 0.3648335 0 0.3981664
3 2022-01-03 0.8283570 -0.1843364 1.194378
4 2022-01-04 1.061199 1.507231 -0.2337116
5 2022-01-05 -0.07824196 -0.6708553 0
6 2022-01-06 -0.2654019 0.3008499 0
7 2022-01-07 1.426953 6 NA
8 2022-01-08 -0.5776817 0 NA
9 2022-01-09 0.8706953 0 3
10 2022-01-10 NA 10 2
How can I replace all the zeros across all columns with NA using the dplyr package?
With dplyr, you could use na_if() on the numeric columns (the Date column cannot hold a 0 and is left untouched):
library(dplyr)
d %>%
  mutate(across(where(is.numeric), ~ na_if(.x, 0)))
or simply
d[d == 0] <- NA
# Pipeline
d %>%
`[<-`(d == 0, value = NA)
I would avoid using dplyr here (there are already 3 base R examples in the comments on the question), but you could use:
library(dplyr)
d |> mutate_all( \(x) replace(x, x == 0, NA))
# x y z
# <dbl> <dbl> <dbl>
# 1 -0.626 NA -2.21
# 2 0.184 NA 1.12
# 3 -0.836 -0.305 -0.0449
# 4 1.60 1.51 -0.0162
# 5 0.330 0.390 NA
# 6 -0.820 -0.621 NA
# 7 0.487 6 NA
# 8 0.738 NA NA
# 9 0.576 NA 3
# 10 NA 10 2
Reproducible data:
set.seed(1)
x = c(rnorm(9),NA)
y = c(NA,0,rnorm(4),6,0,0,10)
z = c(rnorm(4),0,0,NA,NA,3,2)
d = tibble(x,y,z)
Here is my comment to the question as a dplyr pipe.
suppressPackageStartupMessages(library(dplyr))
x = c(rnorm(9),NA)
y = c(NA,0,rnorm(4),6,0,0,10)
z = c(rnorm(4),0,0,NA,NA,3,2)
Date <- seq(as.Date("2022-01-01"), by = "day", length = length(x))
d = tibble(Date,x,y,z)
d %>%
mutate(across(everything(), ~`is.na<-`(., . == 0)))
#> # A tibble: 10 × 4
#> Date x y z
#> <date> <dbl> <dbl> <dbl>
#> 1 2022-01-01 -0.311 NA -0.891
#> 2 2022-01-02 -0.192 NA 0.278
#> 3 2022-01-03 1.24 0.742 -0.331
#> 4 2022-01-04 0.0130 -1.18 0.384
#> 5 2022-01-05 -1.11 -1.17 NA
#> 6 2022-01-06 -0.330 -0.629 NA
#> 7 2022-01-07 -1.25 6 NA
#> 8 2022-01-08 0.0937 NA NA
#> 9 2022-01-09 0.986 NA 3
#> 10 2022-01-10 NA 10 2
Created on 2022-09-14 with reprex v2.0.2
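The `is.na<-` call in the pipe is the functional form of the replacement syntax is.na(x) <- index, which sets the indexed elements to NA. A minimal base R sketch on a made-up vector x shows that both forms give the same result:

```r
x <- c(1, 0, NA, 3, 0)

# Functional form, as used inside across() above
a <- `is.na<-`(x, x == 0)

# Equivalent replacement syntax
b <- x
is.na(b) <- b == 0

a
```

Elements that were already NA stay NA, and only the zeros are changed.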
I am trying to assign quantile groups for stacked data so that, for each category (r1 and r2 in my example), I can classify the values into 5 groups. I can do this using ntile() as follows.
r1<-rnorm(10,0,1)
r2<-rnorm(10,2,4)
df<-cbind(r1,r2)
df<-melt(df)
df<-df%>%group_by(Var2) %>% mutate(group=ntile(value,5))
However, what should I do if I want to exclude the top and bottom 10% when forming the groups? Ideally, I would keep those top and bottom values in the output table, with their group code shown as NA.
Thanks to anyone who can help!
Your question is a little ambiguous. It is not clear whether you wish to exclude the top and bottom 10% from the quintile calculation (so that you are getting equal quintiles of the 10-90th centiles of the original data), or whether you want to do the quintiles first on all the data, then exclude the first and last 10%. Doing it the second way will give you smaller 1st and 5th quintiles, so I assume you mean the first method:
df %>%
group_by(Var2) %>%
mutate(group = ntile(value, 10)) %>%
mutate(group = ntile(ifelse(group %% 9 == 1, NA, value), 5))
#> # A tibble: 20 x 4
#> # Groups: Var2 [2]
#> Var1 Var2 value group
#> <int> <fct> <dbl> <int>
#> 1 1 r1 -0.626 1
#> 2 2 r1 0.184 2
#> 3 3 r1 -0.836 NA
#> 4 4 r1 1.60 NA
#> 5 5 r1 0.330 3
#> 6 6 r1 -0.820 1
#> 7 7 r1 0.487 3
#> 8 8 r1 0.738 5
#> 9 9 r1 0.576 4
#> 10 10 r1 -0.305 2
#> 11 1 r2 8.05 NA
#> 12 2 r2 3.56 2
#> 13 3 r2 -0.485 1
#> 14 4 r2 -6.86 NA
#> 15 5 r2 6.50 5
#> 16 6 r2 1.82 1
#> 17 7 r2 1.94 2
#> 18 8 r2 5.78 4
#> 19 9 r2 5.28 3
#> 20 10 r2 4.38 3
Just in case, the second method you would achieve like this:
df %>%
group_by(Var2) %>%
mutate(group = ntile(value, 5)) %>%
mutate(group = ifelse(ntile(value, 10) %% 9 == 1, NA, group))
#> # A tibble: 20 x 4
#> # Groups: Var2 [2]
#> Var1 Var2 value group
#> <int> <fct> <dbl> <int>
#> 1 1 r1 -0.626 2
#> 2 2 r1 0.184 3
#> 3 3 r1 -0.836 NA
#> 4 4 r1 1.60 NA
#> 5 5 r1 0.330 3
#> 6 6 r1 -0.820 1
#> 7 7 r1 0.487 4
#> 8 8 r1 0.738 5
#> 9 9 r1 0.576 4
#> 10 10 r1 -0.305 2
#> 11 1 r2 8.05 NA
#> 12 2 r2 3.56 3
#> 13 3 r2 -0.485 1
#> 14 4 r2 -6.86 NA
#> 15 5 r2 6.50 5
#> 16 6 r2 1.82 2
#> 17 7 r2 1.94 2
#> 18 8 r2 5.78 4
#> 19 9 r2 5.28 4
#> 20 10 r2 4.38 3
Created on 2022-02-19 by the reprex package (v2.0.1)
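The condition group %% 9 == 1 used in both methods is a compact way of picking out the first and tenth deciles: among the values 1 to 10, only 1 and 10 leave a remainder of 1 when divided by 9. A short base R check (deciles and is_tail are illustrative names):

```r
deciles <- 1:10

is_tail <- deciles %% 9 == 1
deciles[is_tail]

# The same test written more explicitly
identical(is_tail, deciles %in% c(1, 10))
```

Writing the condition as group %in% c(1, 10) may be easier to read, and it is equivalent for decile codes.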
Setup and data used
library(dplyr)
library(reshape2)
set.seed(1)
r1 <- rnorm(10,0,1)
r2 <- rnorm(10,2,4)
df <- cbind(r1,r2)
df <- melt(df)
I am trying to get a model results table into wide format. Since names are not the same on outcomes (dv variables), NA's show up in the table and I can't find a way to have one row per variable.
I need one row per variable/dv. Model 1 and 3 share all variables other than one.
Data:
table <- data.frame(variable=c("intercept", "a", "b", "intercept", "c", "intercept", "a", "e", "intercept", "c"),
b=c(1.2, 0.1, 0.4, 0.3, 0.9, 1.3, 2, .23, .4, .7),
p=(abs(rnorm(10, 0, .3))),
model=c(1,1,1,2,2,3,3,3,4,4),
dv=c(rep("dv1", 5), rep("dv2", 5)))
> table
variable b p model dv
1 intercept 1.20 0.03320481 1 dv1
2 a 0.10 0.16675234 1 dv1
3 b 0.40 0.53607394 1 dv1
4 intercept 0.30 0.14935514 2 dv1
5 c 0.90 0.58998515 2 dv1
6 intercept 1.30 0.21040677 3 dv2
7 a 2.00 0.14183742 3 dv2
8 e 0.23 0.32034711 3 dv2
9 intercept 0.40 0.06539247 4 dv2
10 c 0.70 0.30780133 4 dv2
Code:
table %>%
gather(key, value, b, p) %>% unite("stat_var", dv, key, sep=".") %>%
spread(stat_var, value) %>%
arrange(model, desc(variable))
Output:
variable model dv1.b dv1.p dv2.b dv2.p
1 intercept 1 1.2 0.21866737 NA NA
2 b 1 0.4 0.50600799 NA NA
3 a 1 0.1 0.18751178 NA NA
4 intercept 2 0.3 0.25133611 NA NA
5 c 2 0.9 0.04601194 NA NA
6 intercept 3 NA NA 1.30 0.34144108
7 e 3 NA NA 0.23 0.12793927
8 a 3 NA NA 2.00 0.37614448
9 intercept 4 NA NA 0.40 0.08852144
10 c 4 NA NA 0.70 0.26853770
Looking for:
I will set aside the question of why the models should be treated as "equivalent" (I see no valid reason for it) and simply handle that with mutate(). As far as the table manipulation goes, the basic option below gives your desired output.
It uses pivot_wider(), which puts b and p first in the new column names; to make them suffixes instead, you can use pivot_wider_spec().
library(tidyverse)
table %>%
mutate(model = case_when(model == 3 ~ 1,
model == 4 ~ 2,
TRUE ~ model)) %>%
pivot_wider(names_from = dv, values_from = c("b", "p")) %>%
select(variable,
model,
ends_with("dv1"),
ends_with("dv2"))
# A tibble: 6 x 6
# variable model b_dv1 p_dv1 b_dv2 p_dv2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 intercept 1 1.2 0.318 1.3 0.200
# 2 a 1 0.1 0.120 2 0.419
# 3 b 1 0.4 0.309 NA NA
# 4 intercept 2 0.3 0.350 0.4 0.0148
# 5 c 2 0.9 0.185 0.7 0.530
# 6 e 1 NA NA 0.23 0.174
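If the only goal is to have b and p as suffixes, the names_glue argument of pivot_wider() is another option besides pivot_wider_spec(). This is a sketch on a reduced, made-up table (toy), not the original data; the glue template "{dv}_{.value}" is the assumption that controls the naming.

```r
library(dplyr)
library(tidyr)

# Reduced toy version of the model table
toy <- tibble(
  variable = c("intercept", "a", "intercept", "a"),
  model    = c(1, 1, 1, 1),
  dv       = c("dv1", "dv1", "dv2", "dv2"),
  b        = c(1.2, 0.1, 1.3, 2),
  p        = c(0.03, 0.17, 0.21, 0.14)
)

wide <- toy %>%
  pivot_wider(names_from = dv, values_from = c(b, p),
              names_glue = "{dv}_{.value}")  # dv first, statistic as suffix
names(wide)
```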
As I stated in my comment, it seems that your expected output is wrong.
However, you can reproduce it by tweaking the model variable:
table %>%
select(model, everything()) %>%
mutate(model=ifelse(model>2, model-2, model)) %>%
pivot_longer(c(b, p)) %>%
unite("name", c("dv", "name")) %>%
pivot_wider()
# # A tibble: 6 x 6
# model variable dv1_b dv1_p dv2_b dv2_p
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 intercept 1.2 0.193 1.3 0.160
# 2 1 a 0.1 0.650 2 0.476
# 3 1 b 0.4 0.190 NA NA
# 4 2 intercept 0.3 0.0435 0.4 0.145
# 5 2 c 0.9 0.372 0.7 0.243
# 6 1 e NA NA 0.23 0.297
Of note, gather() and spread() are superseded by the pivoting functions, which offer very nice improvements (although those are not used here).
How can I create a new column whose starting value is 1 and whose following values are the product of the previous value of a column (b) and the previous value of itself (d)?
These data are made up, but they have the structure of my data:
> a <- rep(1:10, 3)
> b <- runif(30)
> c <- tibble(a,b)
> c
# A tibble: 30 x 2
a b
<int> <dbl>
1 1 0.945
2 2 0.280
3 3 0.464
4 4 0.245
5 5 0.917
6 6 0.913
7 7 0.144
8 8 0.481
9 9 0.873
10 10 0.754
# ... with 20 more rows
Then I try to calculate column d:
> c <- c %>%
+ group_by(a) %>%
+ mutate(d = accumulate(lag(b, k = 1), `*`, .init = 1))
and it should look like this
# A tibble: 30 x 3
# Groups: a [10]
a b d
<int> <dbl> <dbl>
1 1 0.945 1 <--- b[1] * d[1] = d[2]
2 2 0.280 0.945
3 3 0.464 0.265
4 4 0.245 0.123
5 5 0.917 0.03
#...
But instead I am getting this error message.
Error: Column `d` must be length 3 (the group size) or one, not 4
The problem is that initializing accumulate with .init = adds an extra first element, so the result is one element longer than the input vector.
You could try this:
library(dplyr)
library(purrr)
c %>%
group_by(a) %>%
  mutate(d = accumulate(head(b, -1), `*`, .init = 1)) %>%
arrange(a)
# a b d
# <int> <dbl> <dbl>
# 1 1 0.266 1
# 2 1 0.206 0.266
# 3 1 0.935 0.0547
# 4 2 0.372 1
# 5 2 0.177 0.372
# … with 25 more rows
Data
library(tibble)
set.seed(1)
a <- rep(1:10, 3)
b <- runif(30)
c <- tibble(a,b)
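The length mismatch is easy to see on a short vector. The sketch below uses a made-up b of length 3: with .init the result has 4 elements, dropping the last input element brings it back to 3, and base R's cumprod gives the same values without purrr.

```r
library(purrr)

b <- c(0.5, 0.2, 0.4)

# .init is prepended, so the result is one element longer than b
too_long <- accumulate(b, `*`, .init = 1)
length(too_long)

# Drop the last element of b first to get a result of the same length as b
d <- accumulate(head(b, -1), `*`, .init = 1)

# Base R equivalent: running product of 1 followed by the lagged values
d_base <- cumprod(c(1, head(b, -1)))
d
```

Inside a grouped mutate(), either expression returns one value per row of the group, which is what the error message was asking for.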
Using dplyr and purrr, I would do this:
c %>%
  mutate(d = accumulate(.x = b[-length(b)],
                        .init = 1,
                        .f = `*`))
# # A tibble: 30 x 3
# a b d
# <int> <dbl> <dbl>
# 1 1 0.562 1
# 2 2 0.668 0.562
# 3 3 0.100 0.375
# 4 4 0.242 0.0376
# 5 5 0.0646 0.00907
# 6 6 0.373 0.000586
# 7 7 0.664 0.000219
# 8 8 0.915 0.000145
# 9 9 0.848 0.000133
# 10 10 0.952 0.000113
# # ... with 20 more rows