Mutate with dplyr using multiple conditions

I have a data frame (df) below, and I want to use dplyr to add a column, result, that takes the value 1 where z == "gone" and x is the maximum value within group y, and 0 otherwise.
   y  x    z
1  a  3 gone
2  a  5 gone
3  a  8 gone
4  a  9 gone
5  a 10 gone
6  b  1
7  b  2
8  b  4
9  b  6
10 b  7
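For reproducibility, a minimal construction of this data frame could look like the sketch below (the blank z entries for group b are assumed here to be empty strings, which the question leaves unspecified):
df <- data.frame(y = rep(c("a", "b"), each = 5),
                 x = c(3, 5, 8, 9, 10, 1, 2, 4, 6, 7),
                 z = c(rep("gone", 5), rep("", 5)),
                 stringsAsFactors = FALSE)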
If I were to simply select the maximum for each group it would be:
library(dplyr)
df %>%
  group_by(y) %>%
  slice(which.max(x))
which will return:
  y  x    z
1 a 10 gone
2 b  7
This is not what I want. I need to use the max value of x within each group of y while also checking whether z == "gone", assigning 1 if both conditions are TRUE and 0 otherwise. This would look like:
   y  x    z result
1  a  3 gone      0
2  a  5 gone      0
3  a  8 gone      0
4  a  9 gone      0
5  a 10 gone      1
6  b  1           0
7  b  2           0
8  b  4           0
9  b  6           0
10 b  7           0
I'm assuming I would use a conditional statement within mutate() but I cannot seem to find an example. Please advise.

With dplyr you can use:
df %>% group_by(y) %>% mutate(result = +(x == max(x) & z == 'gone'))
The +(..) notation is shorthand for as.integer, coercing the logical output to 1s and 0s. Some dislike it, so it is a trade-off between shorter code and readability; any efficiency difference depends on the circumstances.
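A quick illustration of the coercion on a toy logical vector:
flags <- c(TRUE, FALSE, TRUE)
as.integer(flags) # 1 0 1
+flags            # 1 0 1 -- unary plus coerces logical to integer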
Also, to appreciate what data.table and dplyr have done for data manipulation in R, let's do the same thing the old-fashioned "split-apply-combine" way:
#split data.frame by group
split.df <- split(df, df$y)
#apply required function to each group
lst <- lapply(split.df, function(dfx) {
  dfx$result <- +(dfx$x == max(dfx$x) & dfx$z == "gone")
  dfx
})
#combine result in new data.frame
newdf <- do.call(rbind, lst)

We can do this with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'y', we create the logical condition for the maximum value of 'x' and the 'gone' element in 'z', coerce it to 'integer' (as.integer), and assign (:=) the output to the new column ('result').
library(data.table)
setDT(df)[, result := as.integer(x == max(x) & z == 'gone'), by = y]
df
# y x z result
# 1: a 3 gone 0
# 2: a 5 gone 0
# 3: a 8 gone 0
# 4: a 9 gone 0
# 5: a 10 gone 1
# 6: b 1 0
# 7: b 2 0
# 8: b 4 0
# 9: b 6 0
#10: b 7 0
Or we can use ave from base R:
df$result <- with(df, +(ave(x, y, FUN = max) == x & z == 'gone'))
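ave() returns each group's maximum broadcast back to every row, so the comparison with x lines up element-wise; a small illustration on toy vectors:
x <- c(3, 5, 10, 1, 7)
g <- c("a", "a", "a", "b", "b")
ave(x, g, FUN = max) # 10 10 10 7 7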

Related

How to select unique point

I am a novice R programmer. I have the following series of points.
library(dplyr) # for mutate() and full_join()
df <- data.frame(x = c(1, 2, 3, 4), y = c(6, 3, 7, 5))
df <- df %>% mutate(k = 1)
df <- df %>% full_join(df, by = 'k')
df <- subset(df, select = c('x.x', 'y.x', 'x.y', 'y.y'))
df
Is there a way to select the "unique" points? (The order of the points does not matter.)
EDIT:
x.x y.x x.y y.y
1 6 2 3
2 3 3 7
.
.
.
(I changed the 2 to 7 to clarify the problem)
With data.table (and working from the OP's initial df):
library(data.table)
setDT(df)
df[, r := .I ]
df[df, on=.(r > r), nomatch=0]
x y r i.x i.y
1: 2 3 1 1 6
2: 3 2 1 1 6
3: 4 5 1 1 6
4: 3 2 2 2 3
5: 4 5 2 2 3
6: 4 5 3 3 2
This is a "non-equi join" on row numbers. In x[i, on=.(r > r)] the left-hand r refers to the row in x and the right-hand one to a row of i. The columns named like i.* are taken from i.
Data.table joins, which are of the form x[i], use i to look up rows of x. The nomatch=0 option drops rows of i that find no matches.
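As a minimal sketch of that lookup behaviour (toy tables of my own, not the OP's data):
library(data.table)
x <- data.table(id = c(1, 2, 3), vx = c("a", "b", "c"))
i <- data.table(id = c(2, 4), vi = c("B", "D"))
x[i, on = "id"]              # id 4 finds no match, so vx is NA in that row
x[i, on = "id", nomatch = 0] # the unmatched row of i is dropped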
In the tidyverse, you can save a bit of work by doing the self-join with tidyr::crossing. If you add row indices pre-join, reducing is a simple filter call:
library(tidyverse)
df %>% mutate(i = row_number()) %>% # add row index column
  crossing(., .) %>%                # Cartesian self-join
  filter(i < i1) %>%                # reduce to lower indices
  select(-i, -i1)                   # remove extraneous columns
## x y x1 y1
## 1 1 6 2 3
## 2 1 6 3 7
## 3 1 6 4 5
## 4 2 3 3 7
## 5 2 3 4 5
## 6 3 7 4 5
or in all base R,
df$m <- 1
df$i <- seq(nrow(df))
df <- merge(df, df, by = 'm')
df[df$i.x < df$i.y, c(-1, -4, -7)]
## x.x y.x x.y y.y
## 2 1 6 2 3
## 3 1 6 3 7
## 4 1 6 4 5
## 7 2 3 3 7
## 8 2 3 4 5
## 12 3 7 4 5
You can use the duplicated() function from base R to find the rows that are not duplicates, which means in fact that they are unique. When you call duplicated(), restrict the comparison to the first two columns; this call checks which lines are unique. In a second step, subset your data frame to those rows, keeping all columns.
unique_lines <- !duplicated(df[, c(1, 2)])
df[unique_lines, ]
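Note that deduplicating on only the first two columns keeps one row per left-hand point and does not treat reversed pairs as equal. Since the OP wants order-independent uniqueness, one hedged refinement (my own addition, not part of the original answer) builds a key that puts each pair of points into a canonical order before deduplicating:
# assumes df holds the joined columns x.x, y.x, x.y, y.y
p1 <- paste(df$x.x, df$y.x)
p2 <- paste(df$x.y, df$y.y)
key <- ifelse(p1 < p2, paste(p1, p2), paste(p2, p1))
df[!duplicated(key), ]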

How do I delete rows with NAs and those that follow the NAs?

I have some data where I want to remove the NAs and the rows that follow the NAs, within each level of a factor.
Removing the NAs is easy:
df <- data.frame(a = c("A","A","A","B","B","B","C","C","C","D","D","D"),
                 b = c(0,1,0,0,0,0,0,1,0,0,0,1),
                 c = c(4,5,3,2,1,5,NA,5,1,6,NA,2))
df
newdf <- df[complete.cases(df), ]; newdf
The final result should remove all of the rows for C and the final two rows of D.
Hope you can help.
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'a', take the cumulative sum of the logical vector of NA elements in 'c' and keep the rows where it is still less than 1:
library(data.table)
setDT(df)[, .SD[cumsum(is.na(c))<1], by= a]
Or, a faster option: use .I to return the row indices that satisfy the logical condition, and subset the rows with them.
setDT(df)[df[, .I[cumsum(is.na(c)) < 1], by = a]$V1]
# a b c
#1: A 0 4
#2: A 1 5
#3: A 0 3
#4: B 0 2
#5: B 0 1
#6: B 0 5
#7: D 0 6
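The inner call returns the qualifying row indices per group (in a column named V1), which the outer subset then uses in one pass; as a sketch, its intermediate output should look like this:
df[, .I[cumsum(is.na(c)) < 1], by = a]
#    a V1
# 1: A  1
# 2: A  2
# 3: A  3
# 4: B  4
# 5: B  5
# 6: B  6
# 7: D 10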
A classic split-apply-combine in base R:
do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
Here it is again, but in several lines:
split_df <- split(df, df$a)
apply_df <- lapply(split_df, function(x)x[cumsum(is.na(x$c))<1,])
combine_df <- do.call(rbind, apply_df)
The result:
> do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
# a b c
#A.1 A 0 4
#A.2 A 1 5
#A.3 A 0 3
#B.4 B 0 2
#B.5 B 0 1
#B.6 B 0 5
#D D 0 6
A similar solution in dplyr would be
library(dplyr)
df %>% group_by(a) %>% filter(!is.na(cumsum(c)))
Output:
Source: local data frame [7 x 3]
Groups: a [3]
a b c
<fctr> <dbl> <dbl>
1 A 0 4
2 A 1 5
3 A 0 3
4 B 0 2
5 B 0 1
6 B 0 5
7 D 0 6
If we take the cumulative sum of variable c, any values after the first NA are converted to NA. Performing this at the group level lets us remove those rows with the filter and get the desired output.
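A quick illustration of the NA propagation this filter relies on:
cumsum(c(4, 5, NA, 1))
# [1]  4  9 NA NA -- everything from the first NA onwards becomes NA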

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like to pick unique elements from column x, keeping the row where column y is at its maximum (in this case, rows 5 to 7 all have x = 3, and I would like to keep the row where y = 3, the maximum; similarly, for x = 8 I would like to keep the y = 4 row).
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution, which I am posting as an answer, but is there a better method to achieve this? My solution only works in this specific case (picking the largest); what is the general solution?
One solution using dplyr
library(dplyr)
df %>%
  group_by(x) %>%
  slice(max(y))
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
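Note that slice(max(y)) works here only because y happens to equal the within-group row position; a more defensive variant (my substitution, producing the same output on this data) indexes the winning row explicitly:
df %>%
  group_by(x) %>%
  slice(which.max(y))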
The base R alternative is to use aggregate:
aggregate(y~x, df, max)
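aggregate(y ~ x, df, max) returns only the x and y columns; if z is also needed, one hedged option (my addition, not part of the original answer) is to merge the aggregated maxima back onto df:
agg <- aggregate(y ~ x, df, max)
merge(agg, df) # joins on the common columns x and y, recovering z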
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use group_by, the rest of the functions in the chain are applied within each group as opposed to the whole data.frame. So here I filter down to the rows where y equals max(y) for each grouping value of x. This can be extended to the min of y or a particular value.
I think it's generally good practice to ungroup the data at the end of a chain that uses group_by, to avoid any unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
  group_by(x) %>%
  filter(y == max(y)) %>%
  ungroup()
To make it more general: say instead you wanted the mean of y for a given x, as opposed to the max. You could then use summarise instead of filter, as shown below.
df %>%
  group_by(x) %>%
  summarise(y = mean(y)) %>%
  ungroup()
Using data.table, we can use df[order(z), .I[which.max(y)], by = x] to get the row numbers of interest, e.g.:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using the dplyr package:
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df, desc(y))
df_out <- df[!duplicated(df$x), ]
df_out
Printing df_out gives:
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the data frame is ordered by df[order(df$x, df$y), ], as it is in the example, you can use the base R functions split, lapply, and do.call/rbind to extract the desired rows using the "split-apply-combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply, which selects the last row of each data.frame and returns these one-row data.frames as a list. That list is then combined into a single data frame with do.call(rbind, ...).

R Split data.frame using a column that represents and on/off switch

I have data that looks like the following:
a <- data.frame(cbind(x = seq(50),
                      y = rnorm(50),
                      z = c(rep(0, 5),
                            rep(1, 8),
                            rep(0, 3),
                            rep(1, 2),
                            rep(0, 12),
                            rep(1, 12),
                            rep(0, 8))))
I would like to split the data.frame a on the column z, but have each group as a separate data.frame that is a member of a list; i.e., in my example the first 5 rows would be the first item in the list, the next 8 rows the next item, the next 3 rows the item after that, and so on.
Simple factors combine all the 1s together and all the 0s together...
I'm sure that there is a simple way to do this, but it has eluded me for the moment.
Thanks
Try the rleid function in data.table (versions above 1.9.5):
library(data.table)
split(a, rleid(a$z))
# $`1`
# x y z
# 1 1 -0.03737561 0
# 2 2 -0.48663043 0
# 3 3 -0.98518106 0
# 4 4 0.09014355 0
# 5 5 -0.07703517 0
#
# $`2`
# x y z
# 6 6 0.3884339 1
# 7 7 1.5962833 1
# 8 8 -1.3750668 1
# 9 9 0.7987056 1
# 10 10 0.3483114 1
# 11 11 -0.1777759 1
# 12 12 1.1239553 1
# 13 13 0.4841117 1
....
Or, also with cumsum:
split(a, c(0, cumsum(diff(a$z) != 0)))
Here are some base R options.
Using rle, a variant of the rleid approach suggested in the comments by @Spacedman:
split(a,inverse.rle(within.list(rle(a$z), values <- seq_along(values))))
Or by using cumsum after creating a logical index based on whether adjacent elements differ:
split(a, cumsum(c(TRUE, a$z[-1]!=a$z[-nrow(a)])))
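Both base variants build a run-id vector that increments whenever z changes; on a short toy vector the grouping index looks like this:
z <- c(0, 0, 1, 1, 1, 0)
cumsum(c(TRUE, z[-1] != z[-length(z)]))
# [1] 1 1 2 2 2 3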

Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
I want to remove the rows where b equals 0 when they occur after a row with b equal to 1, within each set of duplicated a values.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?
This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
which(df$b == 1) returns the indices of df where b == 1; adding one gives the positions immediately after those, and intersecting with the indices where b == 0 picks out the zeros that directly follow a one. (Note that this removes only the single 0 immediately after each 1; later 0s in the same run, such as the last row of the a == 5 group here, are kept, so it does not fully reproduce df.out.)
How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
# a b
# 1 1 0
# 2 1 1
# 4 2 0
# 5 2 0
# 6 5 0
# 7 5 0
# 8 5 1
# 11 6 0
# 12 6 1
Here we use ave to work within each level of a, and cummax to test whether we've seen a 1 yet.
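To see the test on a single group's b vector (toy input):
b <- c(0, 0, 1, 0, 0)
cummax(b)      # 0 0 1 1 1
b >= cummax(b) # TRUE TRUE TRUE FALSE FALSE -- zeros after the first 1 drop out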
