Mutate to Create Minimum in Each Row - r

I have a question relating to creating a minimum value in a new column in dplyr using the mutate function based off two other columns.
The following code repeats the same value for each row in the new column. Is there a way to create an independent minimum for each row in the new column? I wish to avoid using loops or the apply family due to speed and would like to stick with dplyr if possible. Here's the code:
a = data.frame(runif(5,0,5))
b = data.frame(runif(5,0,5))
c = data.frame(runif(5,0,5))
y = cbind(a,b,c)
colnames(y) = c("a","b","c")
y = mutate(y, d = min(y$b, y$c))
y
The new column "d" is simply a repeat of the same number. Any suggestions on how to fix it so that it's the minimum of "b" and "c" in each row?
Thank you for your help.

We can use pmin
y$d <- with(y, pmin(b, c))
Or
transform(y, d = pmin(b,c))
Or with dplyr
library(dplyr)
y %>%
mutate(d = pmin(b,c))
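min returns a single value across all of its arguments, which is why the original call repeats one number in every row. If we still want to use min, an option is to evaluate it row by row:
y %>%
  rowwise() %>%
  mutate(d = min(b, c)) %>%
  ungroup()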
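To see why the original call repeats one number, it may help to compare min and pmin on a small fixed data frame. This is a minimal base R sketch; y_demo and its values are made up for illustration and are not from the question.

```r
# Hypothetical data with fixed values so the result is easy to check by eye
y_demo <- data.frame(a = c(1, 2, 3),
                     b = c(4, 0.5, 2),
                     c = c(3, 1, 5))

# min() pools all of its arguments and returns a single number
overall_min <- min(y_demo$b, y_demo$c)

# pmin() compares the vectors element by element, giving one value per row
row_min <- pmin(y_demo$b, y_demo$c)

print(overall_min)  # 0.5
print(row_min)      # 3.0 0.5 2.0
```

Inside mutate, the single number from min is recycled to every row, which is the behaviour described in the question.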

You could make min operate on rows rather than on whole columns by using the apply function with MARGIN = 1. A rowwise min over every column of y would look like this:
apply(y, MARGIN = 1, FUN = function(x) min(x))
Then, in order to make the rowwise min function only apply to columns b and c, you could use the select function within mutate, like this:
y %>% mutate(b.c.min =
y %>%
select(one_of("b", "c")) %>%
apply(MARGIN = 1, FUN = function(x) min(x)))
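Since the question asks to avoid the apply family, a vectorized alternative for columns chosen by name might look like the sketch below. It assumes the columns are numeric; y_demo and cols are hypothetical names introduced only for this example.

```r
# Hypothetical data; the column names mirror the question
y_demo <- data.frame(a = c(1, 2, 3),
                     b = c(4, 0.5, 2),
                     c = c(3, 1, 5))

# Columns to take the rowwise minimum over, given as a character vector
cols <- c("b", "c")

# A data frame is a list of columns, so do.call() hands each selected
# column to pmin() as a separate argument
y_demo$b.c.min <- do.call(pmin, y_demo[cols])

# The apply() version from the answer above, for comparison
apply_min <- apply(y_demo[cols], MARGIN = 1, FUN = min)

print(y_demo$b.c.min)  # 3.0 0.5 2.0
```

Both approaches give the same values here; the pmin version avoids converting the data frame to a matrix and calling min once per row.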

Related

How to use the R pipe operator (%>%) in the following cases

1) I have a data frame named df, how can I include an if statement within the mutate function used within the pipe operator? The following does not work:
df %>%
mutate_if(myvar == "A", newColumn = oldColumn*3, newColumn = oldColumn)
The variable myvar is not included in the data frame and is a "flag" variable with values either "A" or "B". When "A", would like to create a new column named "newColumn" in the data frame that is three times the old column (named "oldColumn"), otherwise it is identical to the old column.
2) Would like to divide the column named "numbers" with the entry of numbers which has the minimum value in another column named "seconds", as follows:
df$newCol <- df$numbers / df[df$seconds== min(df$seconds),]$numbers
How can I do that with mutate command and "%>%", so that it looks more handy? Nothing that I tried works unfortunately.
Thanks for any answers,
J.
If myvar is just a variable in the environment rather than a column, you can use an ordinary if/else expression within mutate:
library(dplyr)
# Generate dataset
df <- tibble(oldColumn = rnorm(100))
# The flag lives outside the data frame; it is set to "A" here for illustration
myvar <- "A"
# Triple the column when the flag is "A", otherwise copy it unchanged
df <- df %>% mutate(newColumn = if (myvar == "A") oldColumn * 3 else oldColumn)
If myvar is included as a column in the data frame, then you can use case_when.
# Generate dataset
df <- tibble(myvar = sample(c("A", "B"), 100, replace = TRUE),
oldColumn = rnorm(100))
# Create a new column which depends on the value of myvar
df <- df %>%
mutate(newColumn = case_when(myvar == "A" ~ oldColumn*3,
myvar == "B" ~ oldColumn))
As for question 2, you can use mutate with the "." placeholder, which refers to the left-hand side of the pipe (i.e. "df") inside the call on the right-hand side. You can then filter down to the row with the minimum value of seconds (top_n with -1 as its argument) and pull out the value of the numbers variable:
# Generate data
df <- tibble(numbers = sample(1:60),
seconds = sample(1:60))
# Do computation
df <- df %>% mutate(newCol = numbers / top_n(.,-1,seconds) %>% pull(numbers))
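For comparison, the same computation can be sketched in base R with which.min. The data frame df_demo and its values are hypothetical, chosen so the result can be checked by hand. Note that which.min returns only the first minimum when seconds has ties, which may differ from how top_n treats ties.

```r
# Hypothetical data: the smallest value of seconds (3) is in row 2
df_demo <- data.frame(numbers = c(10, 20, 30, 40),
                      seconds = c(7, 3, 9, 5))

# Value of numbers in the row where seconds is smallest
ref <- df_demo$numbers[which.min(df_demo$seconds)]

# Divide every entry of numbers by that reference value
df_demo$newCol <- df_demo$numbers / ref

print(ref)             # 20
print(df_demo$newCol)  # 0.5 1.0 1.5 2.0
```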

Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C).
I also have a continuous variable with some missing values on it.
I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A.
I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops.
A <- subset(data, group == "A")
mean(A$variable, na.rm = TRUE)
A$variable[which(is.na(A$variable))] <- mean(A$variable, na.rm = TRUE)
Now, I understand I could do the same for group B and C, but perhaps a for loop (with if and else) might do the trick?
require(dplyr)
data %>% group_by(group) %>%
mutate(variable=ifelse(is.na(variable),mean(variable,na.rm=TRUE),variable))
For a faster, base-R version, you can use ave:
data$variable<-ave(data$variable,data$group,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x))
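A small worked example may make the ave approach easier to verify. The data frame data_demo below is hypothetical; it only borrows the column names group and variable from the question.

```r
# Hypothetical data: one missing value in each of the groups A, B and C
data_demo <- data.frame(
  group    = c("A", "A", "A", "B", "B", "C", "C"),
  variable = c(1, NA, 3, 10, NA, NA, 8)
)

# ave() applies FUN to the values of each group and returns the results
# in the original row order
data_demo$imputed <- ave(
  data_demo$variable, data_demo$group,
  FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)
)

print(data_demo$imputed)  # 1 2 3 10 10 8 8
```

The NA in group A becomes 2 (the mean of 1 and 3), while the NAs in groups B and C take the single observed value in their group.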
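You could use the data.table package to achieve this for several columns at once. Here dat is the data, tomean names the columns to impute, and by = group makes each mean a group mean:
tomean <- c("var1", "var2")
library(data.table)
setDT(dat)
dat[, (tomean) := lapply(.SD, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}), by = group, .SDcols = tomean]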

R: Convert frequency to percentage with only a selected number of columns

I would like to convert a data frame filled with frequencies into a data frame filled with row percentages using dplyr.
My data set also contains other variables, and I only want to calculate the percentages for a set of columns defined by a vector of names.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat()
, z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
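You're almost there:
df2 <- df %>%
  mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
  mutate_at(names_to_transform, ~ . / sum_freq_codpos)
The dot . roughly translates to "the object I am manipulating here". In the first mutate it is the whole data frame passed along the pipe; inside mutate_at it is each column named in names_to_transform in turn. (Older dplyr code wrote the last step as funs(./sum_freq_codpos); funs() has since been deprecated in favour of the formula shorthand.)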
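To check the arithmetic independently of dplyr, here is a minimal base R sketch of the same row-percentage computation. The data frame df_demo and the vector cols are hypothetical stand-ins for df and names_to_transform.

```r
# Hypothetical data: frequency columns b, c, d plus two unrelated columns
df_demo <- data.frame(a = c("x", "y"),
                      b = c(1, 2),
                      c = c(1, 6),
                      d = c(2, 2),
                      z = c("X", "Y"))
cols <- c("b", "c", "d")

# Row totals over the selected columns only
row_totals <- rowSums(df_demo[cols])

# Dividing a data frame by a vector with one entry per row divides each
# column elementwise, so every row is scaled by its own total
df_pct <- df_demo
df_pct[cols] <- df_demo[cols] / row_totals

print(df_pct$b)  # 0.25 0.20
print(df_pct$c)  # 0.25 0.60
print(df_pct$d)  # 0.50 0.20
```

A row whose selected columns sum to zero would produce NaN, so real data may need a guard for that case.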

Removing duplicate rows from a data frame in R, keeping those with a smaller/larger value

I am trying to remove duplicate rows in an R data frame, but I want the condition that the row with a smaller or larger value (not bothered for the purpose of this question) in a certain column should be kept.
I can remove duplicate rows normally (from either side) like this:
df = data.frame( x = c(1,1,2,3,4,5,5,6,1,2,3,3,4,5,6),
y = c(rnorm(4),NA,rnorm(10)),
id = c(rep(1,8), rep(2,7)))
splitID <- split(df , df$id)
lapply(splitID, function(x) x[!duplicated(x$x),] )
How can I condition the removal of duplicate rows?
Thanks!
Use ave() to return a logical index to subset your data.frame
idx = as.logical(ave(df$y, df$x, df$id, FUN=fun))
df[idx,, drop=FALSE]
Some possible choices for fun, each of which keeps the smallest y within a group, include
fun1 = function(x)
!is.na(x) & !duplicated(x) & (x == min(x, na.rm=TRUE))
fun2 = function(x) {
res = logical(length(x))
res[which.min(x)] = TRUE
res
}
The dplyr version of this might be
df %>% group_by(x, id) %>% filter(fun2(y))
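With the split approach, we need to order each piece before applying duplicated, so that the first row in every run of equal x is the one with the smallest y:
lapply(splitID, function(d) { d <- d[order(d$x, d$y), ]; d[!duplicated(d$x), ] })
For the reverse, i.e. keeping the larger values, order with decreasing = TRUE.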
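The order-then-deduplicate idea can also be sketched on the whole data frame at once, without splitting. The data frame df_demo is hypothetical and small enough to check by hand; it assumes y is numeric, so that -y can be used to reverse the sort.

```r
# Hypothetical data: x = 1 and x = 2 are each duplicated within id = 1
df_demo <- data.frame(x  = c(1, 1, 2, 2, 3),
                      y  = c(5, 2, NA, 7, 4),
                      id = c(1, 1, 1, 1, 1))

# Sort so the smallest y comes first within each (id, x) pair;
# order() puts NA last by default, so a missing y is never preferred
sorted_asc <- df_demo[order(df_demo$id, df_demo$x, df_demo$y), ]
smallest <- sorted_asc[!duplicated(sorted_asc[c("id", "x")]), ]

# Same idea with y descending to keep the largest value instead
sorted_desc <- df_demo[order(df_demo$id, df_demo$x, -df_demo$y), ]
largest <- sorted_desc[!duplicated(sorted_desc[c("id", "x")]), ]

print(smallest$y)  # 2 7 4
print(largest$y)   # 5 7 4
```

The key point is that duplicated must be evaluated on the same sorted rows that are being subset; otherwise the logical index does not line up with the data.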

Apply a rolling sum by group in R

I need to calculate a rolling sum by group.
library(zoo)  # rollapply() comes from the zoo package
y <- 1:10
tmp <- data.frame(y)
tmp$roll <- NA
tmp$roll[2:10] <- rollapply(y, 2, sum)
tmp$g <- c("a","a","a","a","a","b","b","b","b","b")
tmp$roll calculates the rolling sum for tmp$y; I need to do this by tmp$g. I think I may need to split the data frame into a list of data frames by group and then bind them back together, but this seems like a long route. The result would be an additional column with the rolling sum within groups a and b (this is a simplified example of the actual data frame):
roll_group
NA
3
5
7
9
NA
13
15
17
19
Here is the data.table way:
library(data.table)
tmp.dt <- data.table(tmp)
tmp.dt <- tmp.dt[, .(y =y, roll = cumsum(y)), by = g]
You can do it with dplyr package as well.
Thanks, but the answers provided in this post use cumsum, whereas I need the rolling sum, with NAs where there aren't enough lagged values. I solved it this way:
library(dplyr)
library(zoo)  # provides rollsum
# Function to calculate a rolling sum; returns a numeric vector
roll <- function(x, lags) {
  if (length(x) < lags) {
    tmp <- rep(NA, length(x))
  } else {
    tmp <- rollsum(x, lags, align = "right", fill = NA)
  }
  as.numeric(tmp)
}
tmp1 <- tmp %>%
  group_by(g) %>%
  mutate(roll_group = roll(y, 2)) %>%
  ungroup()
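For readers without zoo installed, a grouped rolling sum can be sketched in base R using differences of cumulative sums together with ave. The helper roll_sum and the data frame tmp_demo are hypothetical names for this illustration; the sketch assumes the window k is at least 1 and that x has no missing values (an NA in x would propagate to every later window).

```r
# Rolling sum over a window of k values, right-aligned, NA-padded at the start
roll_sum <- function(x, k) {
  n <- length(x)
  if (n < k) return(rep(NA_real_, n))
  cs <- cumsum(c(0, x))
  # Sum of x[(i - k + 1):i] equals cs[i + 1] - cs[i - k + 1]
  c(rep(NA_real_, k - 1), cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)])
}

# Same layout as the question: y = 1..10 in two groups of five
tmp_demo <- data.frame(y = 1:10, g = rep(c("a", "b"), each = 5))

# ave() applies roll_sum within each group and keeps the original row order
tmp_demo$roll_group <- ave(as.numeric(tmp_demo$y), tmp_demo$g,
                           FUN = function(x) roll_sum(x, 2))

print(tmp_demo$roll_group)  # NA 3 5 7 9 NA 13 15 17 19
```

The result matches the roll_group column requested in the question: the window restarts at the first row of each group, so that row is NA.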
How about wrapping it in tapply (or lapply split):
tapply(y, tmp$g, cumsum)
Consider this base solution with sapply(), which for each row sums y over the rows up to and including it that share its group, i.e. a running (cumulative) sum within each group:
tmp$roll <- sapply(1:nrow(tmp),
function(i)
sum((tmp[1:i, c("g")] == tmp$g[i]) * tmp[1:i,]$y)
)
