Remove Duplicates from Col X based on condition in Col Y - r

I have a data frame in R that has duplicates in one of its columns. However, I only want to remove a duplicate when it meets a condition in another column.
For example:
DF:
X J Y
1 2 3
2 3 1
1 3 2
I want to remove rows where X is a duplicate and Y = 3.
DF:
X J Y
2 3 1
1 3 2
I have tried reading up on dplyr, but have so far been unable to get the desired result.

We can build the condition with duplicated and the equality operator
subset(df1, !((duplicated(X)|duplicated(X, fromLast = TRUE)) & Y == 3))
# X J Y
#2 2 3 1
#3 1 3 2
If we need to remove the whole group of rows for a value of 'X' when any 'Y' in that group is 3, then
library(dplyr)
df1 %>%
group_by(X) %>%
filter(! 3 %in% Y) #or
# filter(all(Y != 3))
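The two variants above differ in what they drop, which is easy to miss on a three-row example. The following is a minimal base R sketch on the sample data from the question; the helper names is_dup, has_three, res_rows and res_groups are made up for illustration, and ave() is used here as a stand-in for the dplyr group_by/filter step.

```r
# Sample data from the question (the name df1 follows the answer above)
df1 <- data.frame(X = c(1, 2, 1), J = c(2, 3, 3), Y = c(3, 1, 2))

# TRUE for every row whose X value occurs more than once
is_dup <- duplicated(df1$X) | duplicated(df1$X, fromLast = TRUE)

# Variant 1: drop only the duplicated rows where Y == 3
res_rows <- df1[!(is_dup & df1$Y == 3), ]

# Variant 2: drop every row of an X group that contains a 3 in Y
# (base R counterpart of group_by(X) %>% filter(!3 %in% Y))
has_three <- ave(df1$Y == 3, df1$X, FUN = any)
res_groups <- df1[!has_three, ]
```

On this data, res_rows keeps rows 2 and 3, while res_groups keeps only row 2, because the whole X = 1 group is removed.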

Related

Subsetting whole clusters from a dataframe

In my data.frame below, how can I keep every whole study cluster that contains at least one outcome larger than 1?
My desired output is shown below. I tried subset(h, outcome > 1) but that doesn't give my desired output.
h = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3"
h = read.table(text = h,h=T)
DESIRED OUTPUT:
"
study outcome
a 1
a 2
a 1
c 3
c 3"
Modify the subset -
subset the 'study' based on the first logical expression outcome > 1
Use %in% on the 'study' to create the final logical expression in subset
subset(h, study %in% study[outcome > 1])
-output
study outcome
1 a 1
2 a 2
3 a 1
6 c 3
7 c 3
If we want to limit the number of 'study' elements having an 'outcome' value greater than 1, i.e. keep only the first 'n' such 'study' values, then get the unique 'study' from the first expression of subset, use head to get the first 'n' 'study' values and use %in% to create the logical expression
n <- 3
subset(h, study %in% head(unique(study[outcome > 1]), n))
Or it can be done with a group-by approach using any
library(dplyr)
h %>%
group_by(study) %>%
filter(any(outcome > 1)) %>%
ungroup
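To see why the row-level subset from the question falls short, here is a small base R sketch that contrasts it with the cluster-level filter. The names row_level, keep and cluster_level are illustrative only.

```r
# Data from the question
h <- read.table(text = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3", header = TRUE)

# Row-level filter: keeps only the individual rows with outcome > 1
row_level <- subset(h, outcome > 1)

# Cluster-level filter: first find the studies that have any outcome > 1,
# then keep every row belonging to those studies
keep <- unique(h$study[h$outcome > 1])
cluster_level <- h[h$study %in% keep, ]
```

Here row_level has three rows (2, 6 and 7), while cluster_level has the five rows of studies a and c, which is the desired output.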

Is there a way to count values by presence per rows in R?

I want a way to count values in a data frame based on their presence per row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've written this code to count a value whenever its presence in a row is TRUE, but I want it to assign the count automatically for all the values present in the data frame:
#for counting the value 'a' and assigning the count to the b data frame
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
b$count[1] = b$count[1] + 1
}
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from the column, unlist to a vector and get the frequency count with table. If needed convert the table object to a two column data.frame with stack
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
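As a rough sketch of how the manual loop from the question generalizes to every value, the base R snippet below counts, for each distinct value, the number of rows in which it appears at least once. The names vals, row_counts and the column names of b are chosen here for illustration.

```r
# Data from the question; stringsAsFactors = FALSE keeps plain character columns
a <- data.frame(let  = c("a", "b", "c", "d", "f"),
                let2 = c("a", "b", "a", "b", "d"),
                stringsAsFactors = FALSE)

# All distinct values that occur anywhere in the data frame
vals <- sort(unique(unlist(a)))

# For each value: a == v is a logical matrix, rowSums(...) > 0 flags the rows
# that contain the value, and sum() counts those rows
row_counts <- sapply(vals, function(v) sum(rowSums(a == v) > 0))

b <- data.frame(value = vals, count = unname(row_counts))
```

For this data, b lists a, b and d with a count of 2, and c and f with a count of 1, which matches the expected output.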
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d

Create a column with the column name of the max value of the row in R

I got this data:
df = data.frame(x = c(1,2,3), y = c(5,1,4))
> x y
> 1 1 5
> 2 2 1
> 3 3 4
But I want a new column with the column name of the max value in the row,
like this:
> x y max.col
> 1 1 5 y
> 2 2 1 x
> 3 3 4 y
I've tried a lot of code, but without success. Extra points if I can use the solution with %>%
Edit1: I have a lot of NAs and I want to skip them
Edit2: I have 30 different columns in the real df
We can use max.col to return the index of the max value and use that to subset the column name. If there are NAs replace the NA with a negative value
If a row is all NA, then we can identify it with rowSums on logical matrix
i1 <- !rowSums(!is.na(df))
df$max.col <- names(df)[max.col(replace(df, is.na(df), -999), 'first')]
df$max.col[i1] <- NA
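If a sentinel such as -999 feels fragile (it would give wrong answers for values below it), one alternative is which.max(), which skips NA entries on its own. This is a swapped-in technique rather than the max.col approach above, shown on a small made-up data frame that adds NA rows to the question's example; value_cols is a hypothetical helper name.

```r
# Question's data extended with a partly-NA row and an all-NA row (made up)
df <- data.frame(x = c(1, 2, NA, NA), y = c(5, 1, 4, NA))

# Remember the value columns so the new column is not included in the search
value_cols <- names(df)

df$max.col <- unname(apply(df[value_cols], 1, function(r) {
  i <- which.max(r)                  # which.max() ignores NA entries
  if (length(i) == 0) NA_character_  # all-NA row: no maximum exists
  else names(r)[i]                   # first column holding the row maximum
}))
```

This works for any number of numeric columns, though apply() loops over rows in R and will likely be slower than max.col on large data.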
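Here is a solution for the two-column example (it compares only x and y, so it does not extend to 30 columns, and it returns NA when either value is NA):
library(dplyr)
df2 <- df %>%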
mutate(max.col = ifelse(x>y, "x", "y"))
# x y max.col
# 1 1 5 y
# 2 2 1 x
# 3 3 4 y

Vectorize for loop over two rows with condition

I want to perform some operations on my dataframe, but I have some problems with performance, so I was wondering how I could speed up the performance of my code.
My data has several columns and if the column X is 0, I want to do some operations on other columns (adding and max). If X is 1, do nothing (X can only be 1 or 0)
df <- data.frame(X = c(0,0,1,0,1),Y = c(10,0,0,3,7),Z = c(2,2,0,4,5))
df
X Y Z
1 0 10 2
2 0 0 2
3 1 0 0
4 0 3 4
5 1 7 5
Right now my code looks like:
for(i in 1:(nrow(df)-1)){
if(df$X[i] == 0){
df$Y[i+1] <- df$Y[i]+df$Y[i+1]
df$Z[i+1] <- max(df$Z[i],df$Z[i+1])
}
}
The result should look like:
df
X Y Z
1 0 10 2
2 0 10 2
3 1 10 2
4 0 3 4
5 1 10 5
Is there a way to write this more efficiently?
Additionally, a lot of the rows contain only 0's, so I was wondering if there is an efficient way to skip the operations for these rows, as the value won't change.
Edit:
As I was a bit unspecific about the rules, here they are in greater detail:
Y should be summed up until the next 1 in X; the sum, including the value in the row where the 1 is, should replace the value in that row. The same principle should be applied to the Z variable, but this time with the max() function.
Many thanks!
How about something like this? This reproduces your expected output:
library(dplyr)
df <- data.frame(X = c(0,0,1,0,1),Y = c(10,0,0,3,7),Z = c(2,2,0,4,5))
df %>%
mutate(
group = cumsum(c(0, diff(X) == -1))) %>%
group_by(group) %>%
mutate(
n = 1:n(),
Y = cumsum(Y),
Z = ifelse(n > 1, max(Z, lead(Z, default = 0)), Z)) %>%
ungroup() %>%
select(X, Y, Z)
# # A tibble: 5 x 3
# X Y Z
# <dbl> <dbl> <dbl>
#1 0. 10. 2.
#2 0. 10. 2.
#3 1. 10. 2.
#4 0. 3. 4.
#5 1. 10. 5.
Explanation: Group entries into 0-series terminated by a 1; replace Y with the cumsum of Y within each group; from the second row of each group onward (n > 1), replace Z with max(Z, lead(Z, default = 0)), which is a single value per group (the largest Z in the group, or 0 if every Z is negative).
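The same grouping idea can also be sketched in base R with running sums and running maxima, which mirrors the original loop step by step. This is an illustrative alternative, not the answer's code; grp is a hypothetical helper name, and the sketch assumes X only contains 0 and 1 as stated in the question.

```r
df <- data.frame(X = c(0, 0, 1, 0, 1), Y = c(10, 0, 0, 3, 7), Z = c(2, 2, 0, 4, 5))

# A new group starts on the row after each X == 1, because the loop only
# carries values forward out of rows where X == 0
grp <- cumsum(c(0, head(df$X, -1) == 1))

# Running sum of Y and running maximum of Z within each group
df$Y <- ave(df$Y, grp, FUN = cumsum)
df$Z <- ave(df$Z, grp, FUN = cummax)
```

On the example data this gives Y = 10, 10, 10, 3, 10 and Z = 2, 2, 2, 4, 5, the result requested in the question.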

What is the most effective way to sort dataframe and add special id? [duplicate]

I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to @mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. However, as @mnel points out, this is inefficient compared to the other method, as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
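# as.character() works whether x is a factor or a character column;
# as.numeric() on a character column would return NA and break rle()
frame$z <- unlist(lapply(rle(as.character(frame[, "x"]))$lengths, seq_len))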
library(dplyr)
# %>% replaces the old %.% operator, which has been removed from dplyr
frame %>%
group_by(x) %>%
mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this, where x is the column to group by and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
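For the case mentioned above where no numeric column is available, a minimal sketch might look like the following. The one-column data frame is a made-up variation of the question's example.

```r
# Variation of the question's data with no numeric column (made up)
frame <- data.frame(x = c("a", "a", "a", "b", "b"), stringsAsFactors = FALSE)

# seq_along(frame$x) supplies an integer vector for ave() to split by group;
# within each group, seq_along() then numbers the rows 1, 2, 3, ...
frame$z <- ave(seq_along(frame$x), frame$x, FUN = seq_along)
```

The first argument only needs to be a vector of the right length and type; its values are replaced group by group with the within-group sequence.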
