Remove all duplicates by multiple variables with dplyr [duplicate]

This question already has answers here:
Remove all copies of rows with duplicate values in R [duplicate]
(2 answers)
Closed 3 years ago.
I'm trying to remove all duplicate values based on multiple variables using dplyr. Here's how I do it without dplyr:
dat = data.frame(id=c(1,1,2),date=c(1,1,1))
dat = dat[!(duplicated(dat[c('id','date')]) | duplicated(dat[c('id','date')],fromLast=TRUE)),]
It should only return id number 2.

This can be done with a group_by/filter operation in the tidyverse. Group by the columns of interest and keep only the groups that contain a single row. Here group_by_all is used because every column in the dataset is a grouping column; group_by_at can be used instead when only a selection of columns is needed.
library(dplyr)
dat %>%
  group_by_all() %>%
  filter(n() == 1)
Or simply use group_by:
dat %>%
  group_by(id, date) %>%
  filter(n() == 1)
If the OP intended to use the duplicated function (note that filter_at checks each column separately, not the id/date combination):
dat %>%
  filter_at(vars(id, date),
            any_vars(!(duplicated(.) | duplicated(., fromLast = TRUE))))
# id date
#1 2 1
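When the data has extra columns that are not part of the key, counting rows per key combination is another way to express the same idea. The following is a minimal sketch, not the answerer's method: the value column and the helper column name n_key are hypothetical, added only for illustration.

```r
library(dplyr)

# Hypothetical data: 'value' is an extra column that is not part of the key.
dat <- data.frame(
  id    = c(1, 1, 2, 3),
  date  = c(1, 1, 1, 1),
  value = c("a", "b", "c", "d")
)

# Count rows per (id, date) combination, keep combinations that occur once,
# then drop the helper column again.
unique_only <- dat %>%
  add_count(id, date, name = "n_key") %>%
  filter(n_key == 1) %>%
  select(-n_key)
```

Because the count is taken over the id/date pair as a whole, a row is dropped only when the full combination repeats.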

Related

Tidyverse filter by width of variable [duplicate]

This question already has answers here:
Remove all rows where length of string is more than n
(4 answers)
Closed 1 year ago.
I'm working with an untidy dataset and want to filter out any object with an ID shorter than 6 digits (these rows contain errors).
I created a new column that calculates the number of characters for each ID, and then I filter for all objects with 6 or more digits, like so:
clean_df <- df %>%
  mutate(chars = nchar(id)) %>%
  filter(chars >= 6)
This is working just fine, but I'm wondering if there's an easier way.

Using str_length() from the stringr package (part of the tidyverse):
library(tidyverse)
clean_df <- df %>%
  filter(str_length(id) >= 6)

If the ids are numeric (positive whole numbers), you can use log10: an id has at least 6 digits exactly when log10(id) >= 5.
df %>%
  filter(log10(id) >= 5)

You can skip mutate:
df %>%
  filter(nchar(id) >= 6)
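One thing to watch with the character-count approaches: if id is stored as a number rather than as text, it is converted to a string first, and round values may come out in scientific notation. The sketch below uses a hypothetical ids vector to show two ways of counting digits that do not depend on that conversion.

```r
# Hypothetical numeric ids (not from the question).
ids <- c(99999, 100000, 123456)

# Format explicitly as whole numbers before counting characters.
digits_fixed <- nchar(sprintf("%.0f", ids))

# For positive whole numbers, "at least 6 digits" is the same as ">= 100000".
keep_numeric <- ids >= 1e5

# Both tests agree on which ids to keep.
keep_chars <- digits_fixed >= 6
```

If the ids are already character strings, nchar(id) or str_length(id) can be used directly as in the answers above.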

(R) Alternate way for row sum, multiple columns with similar name [duplicate]

This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
So, I don't know if the title makes this easy to understand, but basically I want to reduce this to the least code possible:
data %>%
  group_by(name) %>%
  mutate(
    plataforma.3DS = sum(plataforma.3DS),
    plataforma.PS3 = sum(plataforma.PS3),
    plataforma.PS4 = sum(plataforma.PS4),
    plataforma.PSP = sum(plataforma.PSP),
    plataforma.PSV = sum(plataforma.PSV),
    plataforma.Wii = sum(plataforma.Wii),
    plataforma.WiiU = sum(plataforma.WiiU),
    plataforma.X360 = sum(plataforma.X360),
    plataforma.XOne = sum(plataforma.XOne)
  )
I have some other columns that need the same treatment, so how can I reduce my code? Thanks in advance.

We can specify the columns with across. Note that mutate replaces each column value with the sum of that column within the group.
library(dplyr)
data %>%
  group_by(name) %>%
  mutate(across(starts_with('plataforma'), sum))
If the intention is to return a single sum for each column per group, change the mutate to summarise:
data %>%
  group_by(name) %>%
  summarise(across(starts_with('plataforma'), sum), .groups = 'drop')
NOTE: The title specifies a row sum, while the code shown in the OP's post computes column sums.
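To make the row sum versus column sum distinction concrete, here is a small sketch on hypothetical data (the values and the total column name are made up for illustration). It assumes dplyr 1.0.0 or later, where across is available.

```r
library(dplyr)

# Hypothetical data with the same column-name pattern as in the question.
data <- data.frame(
  name = c("a", "a", "b"),
  plataforma.3DS = c(1, 0, 1),
  plataforma.PS3 = c(0, 1, 1)
)

# Row sum: one total per row, added up across all 'plataforma' columns.
with_total <- data %>%
  mutate(total = rowSums(across(starts_with("plataforma"))))

# Column sum per group: one row per name, one sum per 'plataforma' column.
per_name <- data %>%
  group_by(name) %>%
  summarise(across(starts_with("plataforma"), sum), .groups = "drop")
```

The first result keeps all three rows and adds a total column; the second collapses the data to one row per name.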

How to duplicate a specific number of row per group level in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
Here is my data:
For each x1 level, I am trying to duplicate the row number.class times, and I would like the length class to go from Lmin..cm. to Lmax..cm., increasing by 1 for each row. I came up with this code:
test<-A.M %>% filter(x1=="Crenimugil crenilabis")
for (i in 1:test$number.class){test<-test %>% add_row()}
for (i in 1:nrow(test)){test[i,]=test[1,]}
for (i in 1:nrow(test)){test$length.class[i]<-print(i+test$Lmin..cm.)}
test$length.class<-test$length.class-1
which basically works and gives me the expected results: 2
However, this script does not allow me to run this for every species.
Thank you.

Here, we could use uncount from tidyr to replicate the rows, then group by 'x1' and build 'length.class' by adding row_number() - 1 to 'Lmin..cm.', so that the first row of each group starts at Lmin..cm. as in the OP's loop.
library(dplyr)
library(tidyr)
A.M %>%
  uncount(number.class) %>%
  group_by(x1) %>%
  mutate(length.class = `Lmin..cm.` + row_number() - 1)
If we need to create a sequence from Lmin..cm. to Lmax..cm., then instead of uncount, we could use map2 to create the sequence and then unnest:
library(purrr)
A.M %>%
  mutate(new = map2(`Lmin..cm.`, `Lmax..cm.`, ~ seq(.x, .y, by = 1))) %>%
  unnest(c(new))
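Since the original data was not shown, the uncount approach can be tried on a small stand-in. In this sketch the A.M values are hypothetical, and it assumes one row per x1 level with number.class equal to Lmax..cm. - Lmin..cm. + 1.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for A.M (the question's data was not included).
A.M <- data.frame(
  x1 = c("Species A", "Species B"),
  Lmin..cm. = c(10, 20),
  Lmax..cm. = c(12, 21),
  number.class = c(3, 2)
)

# Repeat each row number.class times, then number the copies within each
# species so that length.class starts at Lmin..cm. and increases by 1.
expanded <- A.M %>%
  uncount(number.class) %>%
  group_by(x1) %>%
  mutate(length.class = Lmin..cm. + row_number() - 1) %>%
  ungroup()
```

Note that uncount drops the number.class column by default; pass .remove = FALSE to keep it.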

R grouped counter that copes with NAs or conditions [duplicate]

This question already has answers here:
cumsum by group [duplicate]
(2 answers)
Closed 3 years ago.
I have an R dataframe where I need a counter which gives me a fresh new number for a new set of circumstances while also continuing this number (respecting the order of the data).
There are quite a few previous posts on this but none seems to work for my problem. I've tried using combinations of row_counter, ave and rleid and none seems to hit the spot.
id <- c("A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","D","D")
marker_new <- c(1,0,0,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,0,1,0)
counter_result <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,4,1,1,2,2,1,1)
df <- data.frame(id,marker_new, counter_result)
df <- df %>%
group_by(id, marker_new) %>%
mutate(counter =
ifelse(marker_new != 0,
row_number(),
lag(marker_new,lag(marker_new))) %>%
ungroup()
Using the code above I can get a fresh number, but it won't continue that number down the rows (as in the counter_result I've included).
Any help much appreciated!

Since the marker_new column is 1/0, we can use cumsum by group (id) to get the counter.
Base R:
df$result <- with(df, ave(marker_new, id, FUN = cumsum))
dplyr:
library(dplyr)
df %>% group_by(id) %>% mutate(result = cumsum(marker_new))
data.table:
library(data.table)
setDT(df)[, result := cumsum(marker_new), by = id]
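The base R version can be checked against the expected counter_result from the question. The last step is an assumption on my part, not something the answer covers: if marker_new contained NA, cumsum would propagate it, so one option is to treat NA as 0 before summing.

```r
# Data from the question.
id <- c("A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","D","D")
marker_new <- c(1,0,0,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,0,1,0)
counter_result <- c(1,1,1,1,1,1,1,2,2,2,2,2,3,3,4,1,1,2,2,1,1)
df <- data.frame(id, marker_new, counter_result)

# Cumulative sum of the 0/1 marker within each id.
df$result <- with(df, ave(marker_new, id, FUN = cumsum))
matches_expected <- all(df$result == df$counter_result)

# Assumption: an NA marker means "no new set", so it is counted as 0.
marker_na <- marker_new
marker_na[2] <- NA
result_na <- ave(replace(marker_na, is.na(marker_na), 0), id, FUN = cumsum)
```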

dplyr select column when column name is number [duplicate]

This question already has answers here:
Select multiple columns with dplyr::select() with numbers as names
(2 answers)
Closed 6 years ago.
I want to reshape the data and then select a specific column.
data(ChickWeight)
chick <- ChickWeight %>% spread(Time,weight) %>% filter(Diet=="1")
It creates the column names for me, which are numbers. So how could I select the column that named "0"? I know that %>% select(3) may work, but I need the solution to select columns with their names being number.

Use backticks to select a column whose name is a number:
data(ChickWeight)
library(dplyr)
library(tidyr)
chick <- ChickWeight %>%
  spread(Time, weight) %>%
  filter(Diet == "1") %>%
  select(`0`)
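The same selection can be written a few ways. This sketch uses a small hypothetical wide table rather than ChickWeight, and assumes dplyr 1.0.0 or later for all_of.

```r
library(dplyr)

# Hypothetical wide table; check.names = FALSE keeps "0" and "2" as column names.
wide <- data.frame(
  Chick = c(1, 2),
  `0` = c(42, 40),
  `2` = c(51, 49),
  check.names = FALSE
)

# Backticks mark 0 as a column name rather than a position.
by_backtick <- wide %>% select(`0`)

# all_of() takes the name as a string, which helps when it is held in a variable.
by_string <- wide %>% select(all_of("0"))

# Base R equivalent, returning the column as a vector.
by_base <- wide[["0"]]
```

A bare select(0) would be read as a column position, which is why the name has to be quoted in one of these ways.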
