How to identify the number of duplicate rows in R (and remove them)

I have a large dataframe in R (1.3 million rows, 51 columns). I am not sure if there are any duplicate rows but I want to find out. I tried using the duplicated() function but it took too long and ended up freezing my RStudio. I don't need to know which entries are duplicates, I just want to delete the ones that are.
Does anyone know how to do this without it taking 20+ minutes and eventually not loading?
Thanks

I don't know how you used the duplicated function. It seems like this way should be relatively quick even if the dataframe is large (I've tested it on a dataframe with 1.4m rows and 32 columns: it took less than 2 min):
df[!duplicated(df), ]
Note that the negated-logical form is safer than df[-which(duplicated(df)), ]: if there are no duplicates at all, which() returns a zero-length vector and the negative subscript drops every row.
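If that is still too slow, two other options worth trying (a sketch, assuming df is your dataframe): dplyr's distinct() and data.table's unique(), both of which tend to scale well at this size.
# dplyr: keep the first occurrence of each row
library(dplyr)
df_unique <- distinct(df)
# data.table: often the fastest for data of this size
library(data.table)
df_unique <- unique(as.data.table(df))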

The first one extracts every row whose col value occurs more than once (duplicates, triples, and so on).
The second one keeps only the rows whose col value occurs exactly once.
duplication <- df %>% group_by(col) %>% filter(any(row_number() > 1))
unique_df <- df %>% group_by(col) %>% filter(!any(row_number() > 1))
You can use these too; note that in the second line the ! must negate the whole condition, so the parentheses matter:
dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]
uni_df <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]
If you want to do this over the whole dataframe (full-row duplicates), you can use this:
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
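A hedged alternative, assuming the janitor package is available: get_dupes() returns the duplicated rows together with a dupe_count column in one call.
library(janitor)
get_dupes(df)        # duplicates considering all columns
get_dupes(df, col)   # duplicates on a single column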

Related

Summarise multiple but not all columns

I have a dataset with 51 columns and I want to add summary rows for most of these variables. Currently columns 5:48 are various metrics, with each row being one area in one quarter. I am summing each metric across all quarters for each area and ultimately creating a rate. The below code works fine for doing this to one individual metric, but I need to run this for 44 different columns.
example <- test %>%
  group_by(Area) %>%
  summarise(`Metric 1` = (sum(`Metric 1`)) / (mean(Population)) * 10000) %>%
  bind_rows(test) %>%
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), '12 month rolling', 'Quarterly'))
I have tried creating a for loop using the column index values; however, that hasn't worked and just returns various errors. I've been unable to get the above script working with index values either; the below gives an error (Error: unexpected '=' in: " group_by_at(Local_Authority) %>% summarise(u17_12mo[5]=").
example <- test %>%
  group_by_at(Area) %>%
  summarise(test[5] = (sum(test[5])) / (mean(test[4])) * 10000) %>%
  bind_rows(test) %>%
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), '12 month rolling', 'Quarterly'))
Any help on setting up a for loop for this, or another way entirely would be great
Without data it's tough to help, but maybe this would work for you:
library(tidyverse)
example <- test %>%
  group_by(Area) %>%
  summarise(across(5:48, ~ (sum(.)) / (mean(Population)) * 10000))
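To reproduce the rest of the original pipeline, the across() call slots straight in; a sketch, assuming test contains the Area, Quarter and Population columns used above:
example <- test %>%
  group_by(Area) %>%
  summarise(across(5:48, ~ sum(.x) / mean(Population) * 10000)) %>%  # rate per metric
  bind_rows(test) %>%           # append the original quarterly rows
  arrange(Area, -Quarter) %>%   # total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), '12 month rolling', 'Quarterly'))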

Looping a pipe through columns of a tibble

I have a tibble with 20 variables. So far I've been using this pipe to find out which values appear more than once in a single column
as_tibble(iris) %>% group_by(Petal.Length) %>% summarise(n=sum(n())) %>% filter(n>1)
I was wondering if I could write a line that could loop this through all the columns and return 20 different tibbles (or as many as I need in the future), in the same way the pipe above returns one tibble. I have tried writing my own loops but I've had no success; I am quite new to this.
The iris example dataset has 5 columns so feel free to give an answer with 5 columns.
Thank you!
library(dplyr)
col_names <- colnames(iris)
lapply(
  col_names,
  function(col) {
    iris %>%
      group_by_at(col) %>%
      summarise(n = n()) %>%
      filter(n > 1)
  }
)
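group_by_at() still works but is superseded in current dplyr; the same loop with the newer idiom would be (a sketch):
lapply(col_names, function(col) {
  iris %>%
    group_by(across(all_of(col))) %>%   # group by the column named in col
    summarise(n = n(), .groups = "drop") %>%
    filter(n > 1)
})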
In base R 4.1+ we have this one-liner. For each column it applies table and then keeps only those elements whose count exceeds 1. Finally it converts what remains of the table to a data frame. Omit stack if it is OK to return a list of table objects instead of a list of data frames.
lapply(iris, \(x) stack(Filter(function(x) x > 1, table(x))))
A variation of that is to keep only duplicated items and then add 1 giving slightly fewer keystrokes. Again we can omit stack if returning a list of table objects is ok.
lapply(iris, \(x) stack(table(x[duplicated(x)]) + 1))

R for loop to filter and print columns of a data frame

Similar questions to mine don't seem to quite apply to what I am trying to accomplish, and at least one of the provided answers in one of the most similar questions didn't provide a solution that actually works.
So I have a data frame that lets say is similar to the following.
sn <- 1:6
pn <- letters[1:6]
issue1_note <- c("issue", "# - #", NA, "sue", "# - #", "ISSUE")
issue2_note <- c("# - #", "ISS", "# - #", NA, "Issue", "Tissue")
df <- data.frame(sn, pn, issue1_note, issue2_note)
Here is what I want to do. I want to be able to visually inspect each _note column quickly and easily. I know I can do this on each column by using select() and filter() as in
df %>% select(issue1_note) %>%
  filter(!is.na(issue1_note) & issue1_note != "# - #")
However, I have around 30 columns and 300 rows in my real data and don’t want to do this each time.
I’d like to write a for loop that will do this across all of the columns. I also want each of the columns printed individually. I tried the below to remove just the NAs, but it merely selects and prints the columns. It’s as if it skips over the filtering completely.
col_notes <- df %>% select(ends_with("note")) %>% colnames()
for (col in col_notes) {
  df %>% select(col) %>% filter(!is.na(col)) %>% print()
}
Any ideas on how I can get this to also filter?
I was able to figure out a solution through more research, though it doesn’t involve a for loop. I created a custom function and then used lapply. In case anybody is wondering, here is my solution.
my_fn <- function(column) {
  tmp <- df %>% select(column)
  tmp %>% filter(!is.na(.data[[column]]) & .data[[column]] != "# - #")
}
lapply(col_notes, my_fn)
Thanks for the consideration.
This can be done all at once with filter/across or filter/if_any/if_all, depending on the outcome desired.
library(dplyr)
df %>%
  filter(across(ends_with('note'), ~ !is.na(.) & . != "# - #"))
This will return rows where none of the columns whose names end in "note" contain NA or "# - #". If we want rows where at least one of those columns passes the test, use if_any:
df %>%
  filter(if_any(ends_with("note"), ~ !is.na(.) & . != "# - #"))
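Note that using across() inside filter() is deprecated in newer dplyr releases; if_all() is the current equivalent of the first example:
df %>%
  filter(if_all(ends_with("note"), ~ !is.na(.) & . != "# - #"))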

Fastest way to row bind dataframe within for loop in R?

I am trying to find the quickest and most effective way to produce a table using a for loop (or map in purrr) in R.
I have 15,881 values which I am trying to loop over. For this example, assume the values are the numbers 1 to 15,881, stored in this variable:
values <- c(1:15881)
I am then trying to filter an existing dataframe where a column matches a value and then perform some data cleaning process. The output of this is a single dataframe; for clarity, my process is the following:
Assume in this situation that I have chosen a single value from the values object, e.g. value = values[1].
So then for a single value I have the following:
df <- df_to_filter %>%
  filter(code == value) %>%
  group_by(code, country) %>%
  group_split() %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country))
The above code works perfectly fine when I run it for a single value; the output is the desired dataframe, and the process takes around 0.7 seconds per value.
However, I am trying to append the result of this output to an empty dataframe for each and every value found in the values variable.
So far I have tried the following:
For Loop approach
# empty dataframe to append values to
empty_df <- tibble()
for (value in values) {
  df <- df_to_filter %>%
    filter(code == value) %>%
    group_by(code, country) %>%
    group_split() %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
  empty_df <- bind_rows(empty_df, df)
}
However, the above is extremely slow. I did a quick calculation and it would take around 185 minutes (0.7 seconds per table × 15,881 tables ≈ 11,117 seconds ≈ 185 minutes), which is a huge amount of time to process a single dataframe.
Is there a quicker way to speed up the above process than a for loop? I can't think of any way to improve the fundamentals of the above code as it does the job well, and 0.7 seconds to produce a single table seems fast to me, but 15,881 tables is obviously going to take a long time.
I tried using the purrr package along with data.table but the furthest I got was this:
combine_dfs <- function(value) {
  df <- df_to_filter %>%
    filter(code == value) %>%
    group_by(code, country) %>%
    group_split() %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
  df <- data.table(df)
  rbindlist(list(df, empty_df))
}
Then I run it with map_df:
map_df(values, ~combine_dfs(.))
However, even the above is extremely slow and seems to take around the same time!
Any help is appreciated!
Row binding dataframes in a loop is inefficient irrespective of which library you use.
You have not provided any example data, but I think for your case this should work the same:
library(dplyr)
df_to_filter %>%
  group_split(code, country) %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country)) -> result
result
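If a per-value loop really is unavoidable, the usual fix is to accumulate the pieces in a pre-allocated list and bind once at the end, rather than growing a dataframe on every iteration (a sketch using the same objects as the question):
out <- vector("list", length(values))
for (i in seq_along(values)) {
  out[[i]] <- df_to_filter %>%
    filter(code == values[i]) %>%
    group_split(code, country) %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
}
result <- dplyr::bind_rows(out)  # one bind instead of 15,881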
You really need to provide a reproducible example first. Otherwise we can't provide a complete solution and have nothing to compare the result against.
library(data.table)
setDT(df_to_filter)[code %in% values] %>%   # filter once with data.table
  group_split(code, country) %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country))
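A fully data.table version of the same idea might look like this (a sketch; it assumes some_other_function returns a data frame for each group):
library(data.table)
dt <- as.data.table(df_to_filter)[code %in% values]
pieces <- lapply(split(dt, by = c("code", "country")), some_other_function)
result <- rbindlist(pieces, fill = TRUE)[!is.na(country)]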

Splitting a long row into multiple shorter rows in R

I have this really wide data frame (1 obs. of 696 variables) and I want to split this single row into multiple rows, breaking it every 10 columns.
I think it'd be too confusing to post just the final data because it is too wide, so I'm giving the code for how I created it
library(tidyverse)
Vol_cil <- function(r, h) {
  vol <- (pi * (r^2)) * h
  return(vol)
}
vec <- Vol_cil(625, 0:695) / 1000000
df <- data.frame(vec)
stckovrflw <- df %>%
  mutate("mm" = 0:695) %>%
  pivot_wider(names_from = mm, values_from = vec)
I want the columns to go from 0 to 9 and the rows from 0 to 69, with the data from this data frame (stckovrflw). I tried to find any way to do this on the internet but couldn't, and ended up exporting it to Excel and doing it by hand.
I'd appreciate any help
If I wasn't able to make myself understood please feel free to ask me anything
Here is one way to do it. It starts by putting stckovrflw back into long format, so if that is actually the shape your data is in, you can take out that step. It works by creating columns for the row and column number, then spreading by column number.
stckovrflw %>%
  pivot_longer(everything(), names_to = 'mm') %>%
  mutate(row = rep(1:70, each = 10)[1:696], col = rep(1:10, 70)[1:696]) %>%
  select(-mm) %>%
  pivot_wider(names_from = 'col', values_from = 'value')
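Since the underlying task is just reshaping 696 values into rows of 10, base R's matrix() does it too; a sketch, padding the last row with NA because 696 is not a multiple of 10:
vals <- unlist(stckovrflw)                                 # the 696 values in order
m <- matrix(c(vals, rep(NA, 4)), ncol = 10, byrow = TRUE)  # 70 rows x 10 columns
result <- as.data.frame(m)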
