dplyr: across + mutate + condition to select the columns - r

I am sure the solution is a one-liner, but I am banging my head against the wall.
See the very short reprex at the end of the post; how do I tell dplyr that I want to double only the columns without NA?
Many thanks
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(x=1:10, y=101:110,
w=c(6,NA,4,NA, 5,0,NA,4,8,17 ),
z=c(2,3,4,NA, 5,10,22,34,58,7 ),
k=rep("A",10))
df
#> # A tibble: 10 x 5
#> x y w z k
#> <int> <int> <dbl> <dbl> <chr>
#> 1 1 101 6 2 A
#> 2 2 102 NA 3 A
#> 3 3 103 4 4 A
#> 4 4 104 NA NA A
#> 5 5 105 5 5 A
#> 6 6 106 0 10 A
#> 7 7 107 NA 22 A
#> 8 8 108 4 34 A
#> 9 9 109 8 58 A
#> 10 10 110 17 7 A
df %>% mutate(across(where(is.numeric), ~.x*2))
#> # A tibble: 10 x 5
#> x y w z k
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2 202 12 4 A
#> 2 4 204 NA 6 A
#> 3 6 206 8 8 A
#> 4 8 208 NA NA A
#> 5 10 210 10 10 A
#> 6 12 212 0 20 A
#> 7 14 214 NA 44 A
#> 8 16 216 8 68 A
#> 9 18 218 16 116 A
#> 10 20 220 34 14 A
##now double the value of all the columns without NA. How to fix this...
df %>% mutate(across(where(sum(is.na(.x))==0), ~.x*2))
#> Error: Problem with `mutate()` input `..1`.
#> ✖ object '.x' not found
#> ℹ Input `..1` is `across(where(sum(is.na(.x)) == 0), ~.x * 2)`.
Created on 2020-10-27 by the reprex package (v0.3.0.9001)

Here is the one-liner you are looking for:
df %>% mutate(across(where(~is.numeric(.) && all(!is.na(.))), ~.x*2))
Output
# A tibble: 10 x 5
x y w z k
<dbl> <dbl> <dbl> <dbl> <chr>
1 2 202 6 2 A
2 4 204 NA 3 A
3 6 206 4 4 A
4 8 208 NA NA A
5 10 210 5 5 A
6 12 212 0 10 A
7 14 214 NA 22 A
8 16 216 4 34 A
9 18 218 8 58 A
10 20 220 17 7 A

Note that the aim is to select columns that don't have any NA and that are numeric. Recall that the input to where must be a function. In your case, just do:
df %>% mutate(across(where(~is.numeric(.x) && sum(is.na(.x))==0), ~.x*2))
Well to give you other ways:
df %>% mutate(across(where(~!anyNA(.) & is.numeric(.)), ~.*2))
# A tibble: 10 x 5
x y w z k
<dbl> <dbl> <dbl> <dbl> <chr>
1 2 202 6 2 A
2 4 204 NA 3 A
3 6 206 4 4 A
4 8 208 NA NA A
5 10 210 5 5 A
6 12 212 0 10 A
7 14 214 NA 22 A
8 16 216 4 34 A
9 18 218 8 58 A
10 20 220 17 7 A
If you know how to use the Negate function:
df %>% mutate(across(where(~Negate(anyNA)(.) & is.numeric(.)), ~.*2))
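On newer tidyselect versions you can also intersect two `where()` predicates with `&` inside `across()`, which some find more readable. This is a sketch, assuming a tidyselect version recent enough to support boolean operators on selections:

```r
library(dplyr)

df <- tibble(x = 1:10, y = 101:110,
             w = c(6, NA, 4, NA, 5, 0, NA, 4, 8, 17),
             z = c(2, 3, 4, NA, 5, 10, 22, 34, 58, 7),
             k = rep("A", 10))

# Intersect two predicates: keep columns that are numeric AND NA-free
doubled <- df %>%
  mutate(across(where(is.numeric) & where(~ !anyNA(.x)), ~ .x * 2))
```

As before, only `x` and `y` are doubled; `w` and `z` are left untouched because they contain NAs.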

Related

How to find average time points difference in longitudinal data

I have longitudinal data of body weights for over 100K participants. The time points of weight measurements differ between participants. What I want to know is the average time difference between the 1st and 2nd measurement, between the 2nd and 3rd measurement, and so on. I also want to know how many people (or what % of people) have 3 body weight measurements, as well as 4, 5, 6, 7, 8, etc. How can I find these answers in R?
Perhaps something like this:
library(dplyr, warn.conflicts = F)
set.seed(1)
# generate some sample data
dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 500)
sample_data <- tibble(
participant_id = sample(1:1000, size = 5000, replace = T),
meas_date = sample(dates, size = 5000, replace = T)) %>%
arrange(participant_id, meas_date)
sample_data
#> # A tibble: 5,000 × 2
#> participant_id meas_date
#> <int> <date>
#> 1 1 2000-01-18
#> 2 1 2000-02-28
#> 3 1 2000-05-15
#> 4 1 2001-02-01
#> 5 2 2000-05-11
#> 6 3 2000-01-22
#> 7 3 2000-03-27
#> 8 3 2000-04-17
#> 9 3 2000-09-23
#> 10 3 2000-12-13
#> # … with 4,990 more rows
# periods between each measurement for each participant
meas_periods <- sample_data %>%
group_by(participant_id) %>%
mutate(meas_n = row_number(),
date_diff = meas_date - lag(meas_date)) %>%
ungroup()
meas_periods
#> # A tibble: 5,000 × 4
#> participant_id meas_date meas_n date_diff
#> <int> <date> <int> <drtn>
#> 1 1 2000-01-18 1 NA days
#> 2 1 2000-02-28 2 41 days
#> 3 1 2000-05-15 3 77 days
#> 4 1 2001-02-01 4 262 days
#> 5 2 2000-05-11 1 NA days
#> 6 3 2000-01-22 1 NA days
#> 7 3 2000-03-27 2 65 days
#> 8 3 2000-04-17 3 21 days
#> 9 3 2000-09-23 4 159 days
#> 10 3 2000-12-13 5 81 days
#> # … with 4,990 more rows
# average period between meas_n-1 and meas_n
meas_periods %>%
group_by(meas_n) %>%
summarise(mean_duration = mean(date_diff))
#> # A tibble: 13 × 2
#> meas_n mean_duration
#> <int> <drtn>
#> 1 1 NA days
#> 2 2 88.54102 days
#> 3 3 86.16762 days
#> 4 4 76.21154 days
#> 5 5 69.11392 days
#> 6 6 67.16798 days
#> 7 7 50.67089 days
#> 8 8 50.91111 days
#> 9 9 49.89873 days
#> 10 10 48.70588 days
#> 11 11 51.00000 days
#> 12 12 26.25000 days
#> 13 13 66.00000 days
# number and percentage of participants gone through meas_n measurements
meas_periods %>%
count(meas_n, name = "participant_n") %>%
mutate(percent = participant_n/max(participant_n))
#> # A tibble: 13 × 3
#> meas_n participant_n percent
#> <int> <int> <dbl>
#> 1 1 996 1
#> 2 2 963 0.967
#> 3 3 877 0.881
#> 4 4 728 0.731
#> 5 5 553 0.555
#> 6 6 381 0.383
#> 7 7 237 0.238
#> 8 8 135 0.136
#> 9 9 79 0.0793
#> 10 10 34 0.0341
#> 11 11 12 0.0120
#> 12 12 4 0.00402
#> 13 13 1 0.00100
Created on 2022-11-02 with reprex v2.0.2
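The `count(meas_n)` above gives participants with *at least* that many measurements. If you instead want the number with *exactly* k measurements, one sketch (same simulated data) is to count measurements per participant first and then tabulate those counts:

```r
library(dplyr)

set.seed(1)
dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 500)
sample_data <- tibble(
  participant_id = sample(1:1000, size = 5000, replace = TRUE),
  meas_date = sample(dates, size = 5000, replace = TRUE))

# measurements per participant, then the distribution of those counts
meas_dist <- sample_data %>%
  count(participant_id, name = "n_meas") %>%
  count(n_meas, name = "participant_n") %>%
  mutate(percent = participant_n / sum(participant_n))
```

Here `percent` sums to 1 across rows, since every participant has exactly one measurement count.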

R, find average length of consecutive time-steps in data.frame

I have the following data.frame with time column sorted in ascending order:
colA=c(1,2,5,6,7,10,13,16,19,20,25,40,43,44,50,51,52,53,68,69,77,79,81,82)
colB=rnorm(24)
df=data.frame(time=colA, x=colB)
How can I count and take the average of the consecutive time-steps observed in the time column?
In detail, I need to group the rows in the time column by consecutive observations, e.g. 1,2 and 5,6,7 and 19,20 and 43,44, etc... and then take the average of the length of each group.
You can group clusters of consecutive observations like this:
df$group <- c(0, cumsum(diff(df$time) != 1)) + 1
Which gives:
df
#> time x group
#> 1 1 0.7443742 1
#> 2 2 0.1289818 1
#> 3 5 1.4882743 2
#> 4 6 -0.6626820 2
#> 5 7 -1.1606550 2
#> 6 10 0.3587742 3
#> 7 13 -0.1948464 4
#> 8 16 -0.2952820 5
#> 9 19 0.4966404 6
#> 10 20 0.4849128 6
#> 11 25 0.0187845 7
#> 12 40 0.6347746 8
#> 13 43 0.7544441 9
#> 14 44 0.8335890 9
#> 15 50 0.9657613 10
#> 16 51 1.2938800 10
#> 17 52 -0.1365510 10
#> 18 53 -0.4401387 10
#> 19 68 -1.2272839 11
#> 20 69 -0.2376531 11
#> 21 77 -0.9268582 12
#> 22 79 0.4112354 13
#> 23 81 -0.1988646 14
#> 24 82 -0.5574496 14
You can get the length of these groups by doing:
rle(df$group)$lengths
#> [1] 2 3 1 1 1 2 1 1 2 4 2 1 1 2
And the average length of the consecutive groups is:
mean(rle(df$group)$lengths)
#> [1] 1.714286
And the average of x within each group using
tapply(df$x, df$group, mean)
#> 1 2 3 4 5 6 7
#> 0.4366780 -0.1116876 0.3587742 -0.1948464 -0.2952820 0.4907766 0.0187845
#> 8 9 10 11 12 13 14
#> 0.6347746 0.7940166 0.4207379 -0.7324685 -0.9268582 0.4112354 -0.3781571
Another way, doing some kind of consecutive length encoding:
l = diff(c(0, which(diff(df$time) != 1), nrow(df)))
l
# [1] 2 3 1 1 1 2 1 1 2 4 2 1 1 2
mean(l)
#[1] 1.714286
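For completeness, the same run-grouping fits naturally in a dplyr pipeline, a sketch using the same `cumsum()` trick inside `mutate()`:

```r
library(dplyr)

df <- data.frame(
  time = c(1, 2, 5, 6, 7, 10, 13, 16, 19, 20, 25, 40,
           43, 44, 50, 51, 52, 53, 68, 69, 77, 79, 81, 82))

# a new group starts wherever the gap to the previous time exceeds 1
run_lengths <- df %>%
  mutate(group = cumsum(c(TRUE, diff(time) != 1))) %>%
  count(group, name = "len")

mean(run_lengths$len)
#> [1] 1.714286
```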

create a "mean rank" for a rank-frequency data.frame by R [duplicate]

This question already has an answer here:
Rank a vector based on order and replace ties with their average
(1 answer)
Closed 1 year ago.
We use tidytext to generate a rank column for a data.frame, as shown below. What we want is an additional "mean rank" column for the data.frame. Is there an easy way to generate this column?
library(tidyverse)
library(tidytext)
dt <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10))
dt %>% arrange(desc(frequency))%>%
mutate(rank = row_number())
Sure, just group by frequency:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dt <-data.frame(frequency=c(64,58,54,32,29,29,25,17,17,15,12,12,10))
dt %>% arrange(desc(frequency))%>%
mutate(rank = row_number()) %>%
group_by(frequency) %>%
mutate(mean_rank = mean(rank)) %>%
ungroup()
#> # A tibble: 13 × 3
#> frequency rank mean_rank
#> <dbl> <int> <dbl>
#> 1 64 1 1
#> 2 58 2 2
#> 3 54 3 3
#> 4 32 4 4
#> 5 29 5 5.5
#> 6 29 6 5.5
#> 7 25 7 7
#> 8 17 8 8.5
#> 9 17 9 8.5
#> 10 15 10 10
#> 11 12 11 11.5
#> 12 12 12 11.5
#> 13 10 13 13
Use the built-in functions for this.
dt <- within(dt, {
rank=order(-frequency)
mean_rank=rank(-frequency)
})
dt
# frequency mean_rank rank
# 1 64 1.0 1
# 2 58 2.0 2
# 3 54 3.0 3
# 4 32 4.0 4
# 5 29 5.5 5
# 6 29 5.5 6
# 7 25 7.0 7
# 8 17 8.5 8
# 9 17 8.5 9
# 10 15 10.0 10
# 11 12 11.5 11
# 12 12 11.5 12
# 13 10 13.0 13
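Worth noting: base `rank()` averages ties by default (`ties.method = "average"`), so it produces the "mean rank" directly, while `ties.method = "first"` gives a positional rank. A small sketch (the `order(-frequency)` call only coincides with a rank here because the data is already sorted descending; `rank()` is the safer general choice):

```r
dt <- data.frame(frequency = c(64, 58, 54, 32, 29, 29, 25, 17, 17, 15, 12, 12, 10))

# ties.method = "average" (the default) averages tied ranks: the "mean rank"
dt$mean_rank <- rank(-dt$frequency)
# ties.method = "first" breaks ties by order of appearance: a plain rank
dt$rank <- rank(-dt$frequency, ties.method = "first")
```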

How to replace NAs in a data-frame with the average of the nearest two available values? [duplicate]

This question already has answers here:
Interpolate NA values in a data frame with na.approx
(3 answers)
Closed 3 years ago.
I have a data-frame that contains NAs. I want to replace each NA with the average of the nearest two available values in each column. My problem is when I have more than one NA in a row.
This is my data frame (data):
Seq Speed Volume
1 50 8
2 70 NA
3 65 10
4 55 15
5 NA 12
6 40 9
7 NA NA
8 NA NA
9 NA NA
10 30 18
11 25 NA
12 NA 22
13 NA 7
14 20 9
for(i in data$Speed){
data$Speed[which(is.na(data$Speed))] <- ((i+1)+(i-1))/2
}
Here is what I expect to get:
Seq Speed Volume
1 50 8
2 70 9
3 65 10
4 55 15
5 47.5 12
6 40 9
7 37.5 11.25
8 35 13.5
9 32.5 15.75
10 30 18
11 25 20
12 22.5 22
13 21.25 7
14 20 9
Try this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(
a=c(1,3,5,NA,9),
b=c(1,NA,NA,NA,13),
)
df
#> # A tibble: 5 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 1
#> 2 3 NA
#> 3 5 NA
#> 4 NA NA
#> 5 9 13
df %>%
mutate_all(function(x){
approx(x,n=length(x))$y
})
#> # A tibble: 5 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 1
#> 2 3 4
#> 3 5 7
#> 4 7 10
#> 5 9 13
Created on 2019-08-08 by the reprex package (v0.3.0)
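`mutate_all()` has since been superseded; with current dplyr the same linear interpolation can be written with `across()`. This is a sketch using base `approx()` with explicit x-coordinates; note that leading or trailing NAs would remain NA, since `approx()` does not extrapolate by default:

```r
library(dplyr)

df <- tibble(a = c(1, 3, 5, NA, 9),
             b = c(1, NA, NA, NA, 13))

# approx() drops the NA points, then interpolates at every original position
filled <- df %>%
  mutate(across(everything(),
                ~ approx(seq_along(.x), .x, xout = seq_along(.x))$y))
```

This reproduces the output above: `a` becomes 1, 3, 5, 7, 9 and `b` becomes 1, 4, 7, 10, 13.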

r - use dplyr::group_by in combination with purrr::pmap

I have the following dataframe:
df <- data.frame(a = c(1:20),
b = c(2:21),
c = as.factor(c(rep(1,5), rep(2,10), rep(3,5))))
and I want to do the following:
df1 <- df %>% group_by(c) %>% mutate(a = lead(b))
but in my real data I have many variables to which I need to apply the lead() function in combination with group_by(). I'm trying purrr::pmap() to achieve this:
df2 <- pmap(list(df[,1],df[,2],df[,3]), function(x,y,z) group_by(z) %>% lead(y))
Unfortunately this results in error:
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('integer', 'numeric')"
You can do this with mutate_at and named arguments to funs(), which creates new columns instead of overwriting them. Note that this does nothing to `a`, but you can rename the columns afterwards as desired.
df <- data.frame(
a = c(1:20),
b = c(2:21),
b2 = 3:22,
b3 = 4:23,
c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5)))
)
library(tidyverse)
df %>%
group_by(c) %>%
mutate_at(vars(starts_with("b")), funs(lead = lead(.)))
#> # A tibble: 20 x 8
#> # Groups: c [3]
#> a b b2 b3 c b_lead b2_lead b3_lead
#> <int> <int> <int> <int> <fct> <int> <int> <int>
#> 1 1 2 3 4 1 3 4 5
#> 2 2 3 4 5 1 4 5 6
#> 3 3 4 5 6 1 5 6 7
#> 4 4 5 6 7 1 6 7 8
#> 5 5 6 7 8 1 NA NA NA
#> 6 6 7 8 9 2 8 9 10
#> 7 7 8 9 10 2 9 10 11
#> 8 8 9 10 11 2 10 11 12
#> 9 9 10 11 12 2 11 12 13
#> 10 10 11 12 13 2 12 13 14
#> 11 11 12 13 14 2 13 14 15
#> 12 12 13 14 15 2 14 15 16
#> 13 13 14 15 16 2 15 16 17
#> 14 14 15 16 17 2 16 17 18
#> 15 15 16 17 18 2 NA NA NA
#> 16 16 17 18 19 3 18 19 20
#> 17 17 18 19 20 3 19 20 21
#> 18 18 19 20 21 3 20 21 22
#> 19 19 20 21 22 3 21 22 23
#> 20 20 21 22 23 3 NA NA NA
Created on 2018-09-07 by the reprex package (v0.2.0).
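`funs()` has since been deprecated; in current dplyr the same result comes from `across()` with a `.names` spec. A sketch with the same data and grouping:

```r
library(dplyr)

df <- data.frame(
  a = 1:20,
  b = 2:21,
  b2 = 3:22,
  b3 = 4:23,
  c = as.factor(c(rep(1, 5), rep(2, 10), rep(3, 5))))

# lead() every b* column within each level of c, keeping the originals
out <- df %>%
  group_by(c) %>%
  mutate(across(starts_with("b"), ~ lead(.x), .names = "{.col}_lead")) %>%
  ungroup()
```

The `"{.col}_lead"` spec reproduces the `b_lead`, `b2_lead`, `b3_lead` naming that `funs(lead = ...)` produced above.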
