Mutate a value across several columns using dplyr selectors only

I want to calculate the row-wise sd across several columns of a data frame without leaving my dplyr pipe. In the past, I have done this by falling back to base R. I haven't been able to find a solution here that works.
It may help to provide some context. This is a process I do to validate survey data. We measure the sd of matrix questions to identify straight-liners. An sd of zero across the columns flags a straight line. In the past, I calculated this in base R as follows:
apply(x, 1, sd)
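To make the context concrete, the full base-R step looks roughly like this (just an illustration; x here stands for a data frame holding only the matrix-question columns):
row_sds <- apply(x, 1, sd)              # one sd per respondent
straight_liners <- which(row_sds == 0)  # rows where every answer is identical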
I know there has to be a way to do this within a dplyr pipe. I've tried several options, including pmap and various approaches with mutate_at. Here's my latest attempt:
library(tidyverse)
set.seed(858465)
scale_points <- c(1:5)
q1 <- sample(scale_points, replace = TRUE, size = 100)
q2 <- sample(scale_points, replace = TRUE, size = 100)
q3 <- sample(scale_points, replace = TRUE, size = 100)
digits = 0:9
createRandString <- function() {
  v = c(sample(LETTERS, 5, replace = TRUE),
        sample(digits, 4, replace = TRUE),
        sample(LETTERS, 1, replace = TRUE))
  return(paste0(v, collapse = ""))
}
s_data <- tibble::tibble(resp_id = character(100))
for (i in c(1:100)) {
  s_data[i, 1] <- createRandString()
}
s_data <- bind_cols(s_data, q1 = q1, q2 = q2, q3 = q3)
s_data %>% mutate(vars(starts_with("q"), ~sd(.)))
In a perfect world, I would keep the resp_id variable in the output so that I could generate a report using filter to identify the respondent IDs with sd == 0.
Any help is greatly appreciated!

If we need a rowwise sd,
library(tidyverse)
s_data %>%
  mutate(sdQs = select(., starts_with("q")) %>%
           pmap_dbl(~ sd(c(...)))) %>%
  filter(sdQs == 0)
# A tibble: 9 x 5
# resp_id q1 q2 q3 sdQs
# <chr> <int> <int> <int> <dbl>
#1 JORTY8990R 3 3 3 0
#2 TFYAF4729I 5 5 5 0
#3 VPUYC0789H 4 4 4 0
#4 LHAPM6293X 1 1 1 0
#5 FZQRQ8530P 3 3 3 0
#6 TKTJU3757T 5 5 5 0
#7 AYVHO1309H 4 4 4 0
#8 BBPTZ4822E 5 5 5 0
#9 NGLXT1705B 3 3 3 0
Or another option is rowSds from matrixStats
library(matrixStats)
s_data %>%
mutate(sdQs = rowSds(as.matrix(.[startsWith(names(.), "q")])))
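In more recent dplyr versions (1.0.0 or later) the same row-wise sd can also be written with rowwise() and c_across(); a minimal sketch, assuming the same s_data as above:
s_data %>%
  rowwise() %>%
  mutate(sdQs = sd(c_across(starts_with("q")))) %>%
  ungroup() %>%
  filter(sdQs == 0)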

Related

findInterval by group with dplyr [duplicate]

This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
(11 answers)
Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific since, in my actual data as well as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
purrr::map_dfr(1:3, function(gr) {
c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
})) %>%
nest(bp = -gr) %>%
mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl2`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
~findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient: with my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a group_by + mutate step:
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
                                    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied for each gr separately.
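If you would rather keep the precomputed df_breakpoints list-column from the question (say, because the breakpoints come from somewhere else), a sketch of the same grouped idea, run after the inner_join above, is:
df %>%
  group_by(gr) %>%
  mutate(lvl = findInterval(x = val, vec = bp[[1]])) %>%
  ungroup()
Within each group, bp[[1]] is that group's breakpoint vector, so findInterval is called once per group instead of once per row.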

Conditional rolling sum loop in R

I'm looking for some kind of conditional rolling sum. I thought a while loop would do what I need, but I'm having trouble implementing it. It should look like PCAR[1]*time[1] + PCAR[2]*time[2] + PCAR[3]*time[3] and so on, where [] references the row of the column, accumulating until the cumulative time value reaches <= 100 years; the loop should then add this value to a column and start again until cumulative time is between 100 and <= 200, and so on to the bottom of the data set. It's going to be applied to datasets of varying sizes spanning tens of thousands of years.
I hope that makes sense. In the example data below the PCAR_BIN column is what I'm aiming for as the outcome.
df <- tibble(cumulative.time = c(20,40,60,80,100, 120,140,160,180,200),
PCAR =1:10,
time = 1:10,
depth.along.core = 1:10,
Age.cal.BP = 1:10,
AFBD = 1:10,
assumed.C = rep(0.5, 10),
PCAR_BIN = c(55,330,NA,NA,NA,NA,NA,NA,NA,NA))
The function looks like
MBA <- function(data) {
require(dplyr)
data %>% mutate(PCAR=((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up,
PCAR_BIN = ifelse(cumulative.time <= 100, sum(PCAR*time+lead(PCAR)*lead(time),NA)
)}
Obviously I had no luck with the ifelse statement, as it would only work for one iteration of time and the sum is wrong. I've tried something similar with while and for loops, but with no luck. Part of the problem is that I'm not sure how to express the sum I need. I've also tried binning the data with case_when and working off that, again with no luck.
Thanks people :)
EDIT
Following Martin's method I now have the function working up to creating the ROLLSUM column. I now need to create a column that will give the maximum value for each century group. Running the code from slice_max onward gives me the error:
Error in eval(lhs, parent, parent) : object 'tmp' not found
I've added the real data too.
dput(head(EMC))
structure(list(depth.along.core = c(0.5, 1.5, 2.5, 3.5, 4.5,
5.5), Age.cal.BP = c(-56.016347625, -55.075825875, -54.201453125,
-53.365755375, -52.541258625, -51.700488875), time = c(0.94052175,
0.87437275, 0.83569775, 0.82449675, 0.84076975, 0.88451675),
cumulative.time = c(0.94052175, 1.8148945, 2.65059225, 3.475089,
4.31585875, 5.2003755), AFBD = c(0.0711, 0.057, 0.0568, 0.0512,
0.0559, 0.0353), assumed.C = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5
)), row.names = c(NA, 6L), class = "data.frame")
MBA <- function(data) {
require(dplyr)
data %>% mutate(PCAR=((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up)%>%
slice(1:(n()-1))%>%
group_by(Century = cut(cumulative.time, breaks = seq(0, max(cumulative.time), 100)))%>%
mutate(ROLLSUM = rev(cumsum(PCAR*time)))%>%
slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)%>%
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))}
You could try this:
# Get cumulative sums by group (assuming per century groups)
df <- df %>%
group_by(Century = cut(cumulative.time,
breaks = seq(0, max(cumulative.time), 100))) %>%
mutate(ROLLSUM = rev(cumsum(PCAR * time)))
# Get maximum of each group
groupMaxima <- df %>%
slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)
# Fill column as desired
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))
We simply create a factor column to group the cumulative time column by centuries and use that factor to sum up the values. Lastly we edit the rolling sum column to contain only the max values and fill the other rows with NA.
# A tibble: 10 x 10
# Groups: Group [2]
cumulative.time PCAR time depth.along.core Age.cal.BP AFBD assumed.C PCAR_BIN Group ROLLSUM
<dbl> <int> <int> <int> <int> <int> <dbl> <dbl> <fct> <int>
1 20 1 1 1 1 1 0.5 55 (0,100] 55
2 40 2 2 2 2 2 0.5 330 (0,100] 330
3 60 3 3 3 3 3 0.5 NA (0,100] NA
4 80 4 4 4 4 4 0.5 NA (0,100] NA
5 100 5 5 5 5 5 0.5 NA (0,100] NA
6 120 6 6 6 6 6 0.5 NA (100,200] NA
7 140 7 7 7 7 7 0.5 NA (100,200] NA
8 160 8 8 8 8 8 0.5 NA (100,200] NA
9 180 9 9 9 9 9 0.5 NA (100,200] NA
10 200 10 10 10 10 10 0.5 NA (100,200] NA
Edit:
For this special case:
MBA <- function(data) {
require(dplyr)
data <- data %>% mutate(PCAR = ((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up)
data <- data %>%
group_by(CTIME = cut(cumsum(cumulative.time),
breaks = seq(0, max(cumsum(cumulative.time), na.rm = T), 100))) %>%
mutate(ROLLSUM = rev(cumsum(PCAR*time)))
groupMaxima <- data %>% slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)
data$ROLLSUM <- c(groupMaxima, rep(NA, nrow(data) - length(groupMaxima)))
data
}
There are a number of ways. If your steps are really steps of 100 years and the values go 0, 20, 40 in constant intervals, you can do this natively:
steps = 100
intervals = 20
ratio = steps / intervals
columns = df[, c("PCAR", "time")]
indices = rep(ratio, nrow(df)) %>% cumsum
PCAR_BIN = lapply(indices, function(x) {
  localRange = (x - ratio + 1):x   # the ratio rows that make up this bin
  sum(columns[localRange, 1] * columns[localRange, 2])
}) %>% unlist
We can now bind PCAR_BIN:
df = cbind(df, PCAR_BIN)
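As a quick sanity check against the PCAR_BIN column defined in the question, the first two bins of the toy df reduce to:
sum(df$PCAR[1:5] * df$time[1:5])    # 55
sum(df$PCAR[6:10] * df$time[6:10])  # 330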

function will not work with dplyr's select wrappers (contains, ends_with) [duplicate]

This question already has answers here:
Performing dplyr mutate on subset of columns
(5 answers)
Closed 3 years ago.
I'm trying to calculate row means on a dataset. I found a helpful function someone made here (dplyr - using mutate() like rowmeans()), and it works when I type out every column but not when I try to use a dplyr select helper.
Why does this work:
#The rowmeans function that works
my_rowmeans = function(..., na.rm = TRUE) {
  x = if (na.rm) {
    lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
  } else {
    list(...)
  }
  d = Reduce(function(x, y) x + !is.na(y), list(...), init = 0)
  Reduce(`+`, x) / d
}
#The data
library(tidyverse)
data <- tibble(id = c(1:4),
turn_intent_1 = c(5, 1, 1, 4),
turn_intent_2 = c(5, 1, 1, 3),
turn_intent_3R = c(5, 5, 1, 3))
#The code that is cumbersome but works
data %>%
mutate(turn_intent_agg = my_rowmeans(turn_intent_1, turn_intent_2, turn_intent_3R))
#The output
# A tibble: 4 x 5
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
But this does not work:
#The code
data %>%
mutate(turn_intent_agg = select(., contains("turn")) %>%
my_rowmeans())
#The output
Error in class1Def#contains[[class2]] : no such index at level 1
Of course, I can type each column, but this dataset has many columns. It'd be much easier to use these wrappers.
I need the output to look like the correct one shown that contains all columns (such as id).
Thank you!
I think that you can simplify it to:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn"))))
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
And you can indeed add also the na.rm = TRUE parameter:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn")), na.rm = TRUE))
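The error in the original attempt most likely comes from the fact that select(., contains("turn")) returns a single tibble, so my_rowmeans() receives one data-frame argument in ... and as(0, class(x)) then fails on the data-frame class. If you want to keep the custom function (for its NA handling, say), one possible workaround, sketched here, is to splice the selected columns in as separate arguments with do.call:
data %>%
  mutate(turn_intent_agg = do.call(my_rowmeans, select(., contains("turn"))))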

sum by group including intermediate groups

I have:
df <- data.frame(group=c(1,1,2,4,4,5), value=c(3,1,5,2,3,6))
aggregate(value ~ group, data = df, FUN = 'sum')
group value
1 1 4
2 2 5
3 4 5
4 5 6
Is there a way to include intermediate groups to return the output below? I realise this could be done by creating a data frame with all the desired groups and matching in the results from aggregate(), but I am hoping there is a cleaner way to do this. It would need to be as fast as aggregate() and use only base R packages, due to restrictions in my workplace.
group value
1 1 4
2 2 5
3 3 0
4 4 5
5 5 6
You can try this:
library(tidyr)
library(dplyr)
df %>%
  mutate(group = factor(group, 1:5)) %>%
  complete(group) %>%
  group_by(group) %>%
  dplyr::summarise(value = sum(value, na.rm = TRUE))
group value
<fctr> <dbl>
1 1 4
2 2 5
3 3 0
4 4 5
5 5 6
You can do this easily with the tidyverse:
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
summarise(valuesum = sum(value)) %>%
full_join(., expand(df, group = 1:5)) %>%
complete(group, fill = list(valuesum = 0))
The result:
# A tibble: 5 x 2
group valuesum
<dbl> <dbl>
1 1 4
2 2 5
3 3 0
4 4 5
5 5 6
Or a bit more difficult to understand with data.table:
library(data.table)
setDT(df)[.(group = 1:5), on = 'group', sum(value, na.rm = TRUE), by = .EACHI]
You can use merge from base R. I've changed the name of your data.frame to dat, since df is the name of an R function.
dat <- read.table(text = "
group value
1 4
2 5
4 5
5 6
", header = TRUE)
str(dat)
res <- aggregate(value ~ group, data = dat, FUN = 'sum')
merge(res, data.frame(group = seq(from = min(res$group), to = max(res$group))), all = TRUE)
Note that the missing group will get NA, not zero. I believe you should solve that by leaving it as a missing value.
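If you do want zeros rather than NA while staying in base R, one sketch (using the dat object above and the default argument of tapply(), available from R 3.4.0) is:
# sum per factor level; the empty level gets 0 instead of NA
sums <- tapply(dat$value, factor(dat$group, levels = 1:5), sum, default = 0)
data.frame(group = 1:5, value = as.vector(sums))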

Remove duplicated rows using dplyr

I have a data.frame like this -
set.seed(123)
df = data.frame(x=sample(0:1,10,replace=T),y=sample(0:1,10,replace=T),z=1:10)
> df
x y z
1 0 1 1
2 1 0 2
3 0 1 3
4 1 1 4
5 1 0 5
6 0 1 6
7 1 0 7
8 1 0 8
9 1 0 9
10 0 1 10
I would like to remove duplicate rows based on first two columns. Expected output -
df[!duplicated(df[,1:2]),]
x y z
1 0 1 1
2 1 0 2
4 1 1 4
I am specifically looking for a solution using dplyr package.
Here is a solution using dplyr >= 0.5.
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
> df %>% distinct(x, y, .keep_all = TRUE)
x y z
1 0 1 1
2 1 0 2
3 1 1 4
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be
able to write row_number() == 1)
I've also been thinking about adding a slice() function that would
work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which
variables to use:
df %>% unique(x, y)
For completeness’ sake, the following also works:
df %>% group_by(x) %>% filter (! duplicated(y))
However, I prefer the solution using distinct, and I suspect it’s faster, too.
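To check that hunch on your own data, a quick timing sketch (hypothetical; assumes dplyr is loaded and microbenchmark is installed, and exact results will vary) could look like:
library(microbenchmark)
big <- data.frame(x = sample(0:1, 1e5, replace = TRUE),
                  y = sample(0:1, 1e5, replace = TRUE),
                  z = seq_len(1e5))
microbenchmark(
  distinct   = big %>% distinct(x, y, .keep_all = TRUE),
  filter_dup = big %>% group_by(x) %>% filter(!duplicated(y)),
  times = 20
)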
Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.
However, here's another approach that uses the slice() function from dplyr.
# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
# In each group of rows formed by combinations of x and y
# retain only the first row
df %>%
group_by(x, y) %>%
slice(1)
Difference from using the distinct() function
The advantage of this solution is that it makes it explicit which rows are retained from the original dataframe, and it can pair nicely with the arrange() function.
Let's say you had customer sales data and you wanted to retain one record per customer, and you want that record to be the one from their latest purchase. Then you could write:
customer_purchase_data %>%
arrange(desc(Purchase_Date)) %>%
group_by(Customer_ID) %>%
slice(1)
When selecting columns in R for a reduced data-set you can often end up with duplicates.
These two lines give the same result. Each outputs a unique data-set with two selected columns only:
distinct(mtcars, cyl, hp);
summarise(group_by(mtcars, cyl, hp));
If you want to find the rows that are duplicated you can use find_duplicates from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 2, 4),
b = c(5, 2, 2, 8))
df %>% find_duplicates()
