How can I remove the duplicate rows in R [duplicate] - r

This question already has answers here:
Deleting reversed duplicates with R
(3 answers)
Removing duplicate combinations (irrespective of order)
(1 answer)
data.table with two string columns of set elements, extract unique rows with each row unsorted
(4 answers)
Closed 2 years ago.
In my df, I treat c('apple', 'banana') and c('banana', 'apple') as the same, because the fruit types are the same and only the order differs.
How can I remove row 1 and row 2 and keep only the last row (wanted_df)?
df = data.frame(fruit1 = c('apple', 'banana', 'fig'),
fruit2 = c('banana', 'apple', 'cherry'))
df
wanted_df = df[3,]
Any help will be highly appreciated!
============================
Something is wrong with my real data.
frames2 loses the rows where lag = 2.
The data frame I want should look like wanted_frames.
library(dplyr)
pollution1 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
pollution2 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
dis = 'n'
lag = 1:2
frames = expand.grid(pollution1 = pollution1,
                     pollution2 = pollution2,
                     dis = dis,
                     lag = lag) %>%
  mutate(pollution1 = as.character(pollution1),
         pollution2 = as.character(pollution2),
         dis = as.character(dis)) %>%
  as_tibble() %>%
  filter(pollution1 != pollution2)
vec <- with(frames, paste(pmin(pollution1, pollution2), pmax(pollution1, pollution2)))
frames2 = frames[!duplicated(vec), ]
wanted_frames = frames2 %>% mutate(lag = 2) %>% bind_rows(frames2)

Try this.
library(dplyr)
d <- filter(df, !(fruit1 %in% fruit2) | !(fruit2 %in% fruit1))
Which gives
> d
fruit1 fruit2
1 fig cherry
Update
As commented by @JonSpring and @Phil, the updated code should be
# rowwise() makes each %in% test compare only the two values within a row
df %>%
  rowwise() %>%
  filter(!(fruit1 %in% fruit2) | !(fruit2 %in% fruit1)) %>%
  ungroup()

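A base R way:
# Build an order-independent key: smaller value first, larger value second
vec <- with(df, paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2)))
# Drop every row whose key occurs more than once (checked from both ends)
df[!(duplicated(vec) | duplicated(vec, fromLast = TRUE)), ]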
# fruit1 fruit2
#3 fig cherry

Here's a low-tech dplyr approach. Make a sorted key, then keep rows with unique keys.
library(dplyr)
df %>%
  mutate(key = paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2))) %>%
  add_count(key) %>%
  filter(n == 1)
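
For the follow-up about the real data, one plausible reading is that frames2 loses the lag = 2 rows because the key in vec holds only the two pollutant names, so every lag = 2 row looks like a duplicate of a lag = 1 row. Below is a minimal base R sketch, assuming the goal is one row per unordered pair within each lag. The data is a smaller, hypothetical stand-in for frames, and pair_key is an illustrative name.

```r
# Hypothetical, smaller stand-in for the `frames` data in the question
pollution <- c("pm2.5", "pm10", "so2")
frames <- expand.grid(pollution1 = pollution,
                      pollution2 = pollution,
                      lag = 1:2,
                      stringsAsFactors = FALSE)
frames <- frames[frames$pollution1 != frames$pollution2, ]

# Order-independent key for each pair of pollutants
pair_key <- paste(pmin(frames$pollution1, frames$pollution2),
                  pmax(frames$pollution1, frames$pollution2))

# Treat a row as a duplicate only if both the pair AND the lag repeat
frames2 <- frames[!duplicated(data.frame(pair_key, lag = frames$lag)), ]
print(frames2)
```

With three pollutants there are three unordered pairs, so this keeps three rows for each lag value.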

Related

How to find the students who are in third place? [duplicate]

This question already has an answer here:
Subset data frame to only include the nth highest value of a column
(1 answer)
Closed 1 year ago.
I have data showing every student's score, and I have to find out who is in third place. I have to make a list of test scores and a list of students' names.
If two or more people have the same score and share third place, the output must show all of their names. I still have no idea how to solve this problem.
Example:
names = c('Alex', 'Joy', 'Cindy', 'Lily')
score = c(80, 80,100,90)
Output:
'Students in the third place: Alex, Joy'.
We can use slice_max, which has with_ties = TRUE by default, and then filter for the minimum score:
library(dplyr)
df1 %>%
  slice_max(n = 3, order_by = score) %>%
  filter(score == min(score))
Output:
names score
1 Alex 80
2 Joy 80
If we need the output in a specific format:
df1 %>%
  slice_max(n = 3, order_by = score) %>%
  filter(score == min(score)) %>%
  pull(names) %>%
  {glue::glue("Students in the third place: {toString(.)}")}
Students in the third place: Alex, Joy
Data:
df1 <- data.frame(names, score)
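One solution with rank:
# rank() averages ties by default, so Alex and Joy both get rank 3.5
df$names[rank(-df$score) >= 3]
[1] "Alex" "Joy"
If the data also contains ranks greater than 3, add an upper bound:
df$names[rank(-df$score) >= 3 & rank(-df$score) <= 4]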
Data:
df <- data.frame(
names = c('Alex', 'Joy', 'Cindy', 'Lily'),
score = c(80, 80,100,90)
)
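
As a variation on the rank idea, here is a minimal base R sketch that assumes "third place" means competition ranking, where tied scores share the smallest rank (1, 2, 3, 3 for this data). That definition is an assumption; with ties above third place it can give a different answer from the slice_max approach. The names place, third and msg are illustrative.

```r
names <- c("Alex", "Joy", "Cindy", "Lily")
score <- c(80, 80, 100, 90)
df <- data.frame(names, score)

# Rank from highest to lowest score; tied scores share the smallest rank
place <- rank(-df$score, ties.method = "min")

# Everyone whose rank is exactly 3
third <- df$names[place == 3]
msg <- paste0("Students in the third place: ", toString(third))
print(msg)
```

Using an exact comparison (place == 3) avoids having to choose an upper bound when the data contains more students.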

arrange data based on user defined variables order? [duplicate]

This question already has answers here:
Order data frame rows according to vector with specific order
(6 answers)
Closed 1 year ago.
I have the following data.frame and would like to change the order of the rows in such a way that rows with variable == "C" come at the top followed by rows with "A" and then those with "B".
library(tidyverse)
set.seed(123)
D1 <- data.frame(Serial = 1:10, A= runif(10,1,5),
B = runif(10,3,6),
C = runif(10,2,5)) %>%
pivot_longer(-Serial, names_to = "variables", values_to = "Value" ) %>%
arrange(-desc(variables))
D1 %>%
mutate(variables = ordered(variables, c('C', 'A', 'B'))) %>%
arrange(variables)
Perhaps I did not get the question. If you want C then A then B, you could do:
D1 %>%
arrange(Serial, variables)
@Onyambu's answer is probably the most "tidyverse-ish" way to do it, but another is:
D1[order(match(D1$variables,c("C","A","B"))),]
or
D1 %>% slice(order(match(variables,c("C","A","B"))))
or
D1 %>% slice(variables %>% match(c("C","A","B")) %>% order())
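
The same idea can be sketched in base R with a factor whose levels carry the custom order. This is only an illustration on a small hypothetical data frame d (not the D1 from the question), with custom_order as an assumed name.

```r
# Hypothetical small data frame standing in for D1
d <- data.frame(variables = c("A", "B", "C", "A", "B", "C"),
                Value = 1:6)

# The desired order of the groups
custom_order <- c("C", "A", "B")

# A factor sorts by its levels, so order() follows custom_order;
# rows within the same group keep their original relative order
d_sorted <- d[order(factor(d$variables, levels = custom_order)), ]
print(d_sorted)
```

Values that do not appear in custom_order become NA in the factor, and order() places them last by default.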

Rowwise find most frequent term in dataframe column and count occurrences

I am trying to find the most frequent category within every row of a data frame. A category can consist of multiple words separated by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
"apple,NA,apple,chocolate",
"shoes/socks,NA,NA,NA",
"apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
rowwise() %>%
mutate(winner = names(which.max(table(categories %>% str_split(",")))),
winner_count = which.max(table(categories %>% str_split(",")))[[1]])
I also tried to follow this approach, but it does not give me the required results either:
How to find the most repeated word in a vector with R
trial2 <- df %>%
mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.
Split the categories on commas into separate rows, count their occurrences for each id, drop the NA values, and select the top-occurring row for each id:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(categories, sep = ',') %>%
  count(id, categories, name = 'winner_count') %>%
  filter(categories != 'NA') %>%
  group_by(id) %>%
  slice_max(winner_count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(winner = categories) %>%
  left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate
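
The same steps (split on commas, drop the "NA" strings, count, take the top entry) can also be sketched in base R. This is a minimal illustration that assumes every row has at least one non-NA category; top_category is a hypothetical helper name. Ties go to the alphabetically first category, because table() sorts its names and which.max() returns the first maximum.

```r
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate",
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)

# Hypothetical helper: most frequent non-NA category in one string
top_category <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # split on commas only, so "/" stays intact
  parts <- parts[parts != "NA"]                 # drop the literal "NA" strings
  counts <- table(parts)
  top <- which.max(counts)                      # named index of the first maximum
  list(winner = names(top), n = as.integer(counts[top]))
}

res <- lapply(df$categories, top_category)
df$winner <- vapply(res, function(r) r$winner, character(1))
df$winner_count <- vapply(res, function(r) r$n, integer(1))
print(df)
```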

How to group by two columns in R but with an if statement for the second?

I can't find any help on the internet.
I have 3 columns in a .sav file loaded into RStudio.
One is M, with values 1,2,3,4,5,6,7 and the label weight, and another is N, with values 1,2,3 and the label diet.
I want to group by these columns, but for the N column I only want to pick the rows where the value is 1. I also have a last column, A, with age data.
I wrote this:
library(dplyr)
df%>%
group_by(M, N) %>%
summarize(values = mean(A, na.rm = TRUE))
And I got a grouping, but for all values of N.
I tried something like this:
library(dplyr)
df%>%
group_by(M, N == 1) %>%
summarize(values = mean(A, na.rm = TRUE))
but again I got groups for all categories of N, including NA, etc.
Expected: I want to group by M (all values) and by N only where the value is 1.
What should that group_by look like?
We can group by 'M' and summarise the filtered 'A':
library(dplyr)
df %>%
  group_by(M) %>%
  summarise(values = mean(A[N == 1], na.rm = TRUE))
Another option is to put a filter in between, but this would also remove the groups where no 'N' is 1:
df %>%
  filter(N == 1) %>%
  group_by(M) %>%
  summarise(values = mean(A, na.rm = TRUE))
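
To make the difference between the two options concrete, here is a base R sketch on a small hypothetical data frame (the values of M, N and A are made up, and by_filter and by_group are illustrative names). Filtering first drops the groups of M that have no N equal to 1, while subsetting inside each group keeps them and returns NaN for the empty ones.

```r
# Hypothetical data: the M = 3 group has no row with N == 1
df <- data.frame(M = c(1, 1, 2, 2, 3),
                 N = c(1, 2, 1, 1, 3),
                 A = c(20, 30, 40, 50, 60))

# Option 1: filter first, then average per M (the M = 3 group disappears)
by_filter <- aggregate(A ~ M, data = df[df$N == 1, ], FUN = mean)

# Option 2: keep every M group and subset A inside each group
by_group <- sapply(split(df, df$M),
                   function(g) mean(g$A[g$N == 1], na.rm = TRUE))

print(by_filter)
print(by_group)
```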

Add x lagged values to a tbl [duplicate]

This question already has answers here:
Adding multiple lag variables using dplyr and for loops
(2 answers)
Closed 4 years ago.
I have a tibble like this:
df <- tibble(value = rnorm(500))
How can I add x (e.g. x = 10) lagged values to this tbl (ideally in a dplyr pipe)? I want to add these lagged variables as new columns.
I can do it for a single lag:
lag_df <- df %>%
mutate(value_lag = lag(value, n = 1)) %>% # first lag
filter(!is.na(value_lag)) # remove NA
Doing it manually for 3 lags would look like this:
lag_df <- df %>%
  mutate(value_lag1 = lag(value, n = 1)) %>% # first lag
  mutate(value_lag2 = lag(value, n = 2)) %>% # second lag
  mutate(value_lag3 = lag(value, n = 3)) %>% # third lag
  filter(!is.na(value_lag1)) %>% # remove NA
  filter(!is.na(value_lag2)) %>% # remove NA
  filter(!is.na(value_lag3)) # remove NA
Not a complete dplyr solution, but one way is to create a column for each lagged value, cbind them to the original data frame, and remove the rows with NA values using na.omit():
library(dplyr)
cbind(df, sapply(1:10, function(x) lag(df$value, n = x))) %>%
  na.omit()
An ugly attempt to keep it completely in the tidyverse, with my limited skills:
library(tidyverse)
tibble(n = 1:10) %>%
  mutate(output = map2(list(df), n, function(x, y) {
    x %>% mutate(value = lag(value, y))
  })) %>%
  spread(n, output) %>%
  unnest() %>%
  na.omit()
The base R method is much cleaner than this, but there should definitely be a better way to do it.
And a slightly shorter version:
map2(list(df), 1:10, function(x, y) {
  x %>% mutate(value = lag(value, y))
}) %>%
  bind_cols() %>%
  na.omit()
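
For comparison, here is a dependency-free base R sketch that also gives the lagged columns readable names. The shift helper is hypothetical (it mimics a simple lag by padding with NA), x is set to 3 to keep the output short, and the data is a small random stand-in for df.

```r
set.seed(1)
df <- data.frame(value = rnorm(20))
x <- 3  # number of lags to add

# Hypothetical helper: move a vector down by n positions, padding with NA
shift <- function(v, n) c(rep(NA, n), head(v, -n))

# One lagged vector per lag, named value_lag1, value_lag2, ...
lags <- lapply(seq_len(x), function(n) shift(df$value, n))
names(lags) <- paste0("value_lag", seq_len(x))

# Bind the new columns to the original data and drop the rows containing NA
lag_df <- na.omit(cbind(df, as.data.frame(lags)))
head(lag_df)
```

The first x rows are dropped because they have no complete set of lagged values.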
