R: Consolidating duplicate observations?

I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations appear only once; others appear multiple times (up to 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID and all the information in columns 2:150 is concatenated. I do not need any calculations run on these observations, just some quick munging.
I've tried:
df.new <- group_by(df,"ID")
and also:
library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]
Unfortunately, neither has worked. Any help is appreciated!

Using base R:
df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
day = c("1","2","3","4","5","6","7","8"),
year = c(2016,2017,2017,2016,2017,2016,2017,2016),
stringsAsFactors = FALSE)
> df
ID day year
1 a 1 2016
2 a 2 2017
3 b 3 2017
4 b 4 2016
5 b 5 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
Do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x, collapse = "/") }
)
Result:
> z
id day year
1 a 1/2 2016/2017
2 b 3/4/5 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016
EDIT
If you want to drop NA values instead of pasting them into the collapsed string, do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x[!is.na(x)],collapse = "/") })
For a data frame like:
> df
ID day year
1 a 1 2016
2 a 2 NA
3 b 3 2017
4 b 4 2016
5 b <NA> 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
The result is:
> z
id day year
1 a 1/2 2016
2 b 3/4 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016
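Since the question also tried data.table, here is a minimal sketch of the same collapse in that package. It assumes the NA-containing example data shown above; the helper name collapse_values is made up for illustration. The attempt in the question likely failed because na.omit() returns vectors of different lengths per column, whereas paste() with collapse always returns a single string per group.

```r
library(data.table)

# Example data from above, including the NA values
df <- data.frame(ID = c("a", "a", "b", "b", "b", "c", "d", "d"),
                 day = c("1", "2", "3", "4", NA, "6", "7", "8"),
                 year = c(2016, NA, 2017, 2016, 2017, 2016, 2017, 2016),
                 stringsAsFactors = FALSE)

dt <- as.data.table(df)

# Hypothetical helper: drop NAs, then paste what is left into one string
collapse_values <- function(x) paste(x[!is.na(x)], collapse = "/")

# .SD holds every column except the grouping column ID
dt_new <- dt[, lapply(.SD, collapse_values), by = ID]
dt_new
```

Every column comes back as character, because the values are pasted together.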

I have had a similar problem in the past, although I wasn't dealing with many copies of the same record: in most cases there were just 2 instances, and in some cases 3. Below was my approach. Hopefully it will help.
# Flag every row whose key occurs more than once. Combining both directions with | also flags the first occurrence.
idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ] # rows with a duplicated key
non_dupes <- df[!idx, ] # rows whose key is unique
library(dplyr)
library(tidyr)
temp <- dupes %>% group_by(key) %>% # roll up the duplicated ones
  fill(everything(), .direction = "downup") %>% # fill_() is deprecated; fill NAs down, then up, within each key
  slice(1) %>% ungroup()
Then it is easy to combine temp and non_dupes again, for example with bind_rows().
EDIT
I would highly recommend filtering df down to only the rows relevant to your end goal before doing this, as the process can take some time on a large data frame.
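The whole roll-up, including the final recombination step, can be sketched in base R. This swaps the fill-and-slice step for aggregate() with a "first non-missing value" helper, which is a different technique from the one above but gives one row per key. The data and the helper name first_non_na are invented for illustration.

```r
# Made-up data: k1 and k3 are duplicated, with gaps in different rows
df <- data.frame(key = c("k1", "k1", "k2", "k3", "k3"),
                 x = c(NA, 5, 1, 7, NA),
                 y = c("u", NA, "v", NA, "w"),
                 stringsAsFactors = FALSE)

idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ]
non_dupes <- df[!idx, ]

# Hypothetical helper: first non-missing value (NA if all are missing)
first_non_na <- function(x) x[!is.na(x)][1]

# One row per duplicated key, taking the first non-missing value per column
temp <- aggregate(dupes[-1], by = dupes["key"], FUN = first_non_na)

# Stack the rolled-up rows with the rows that were unique all along
combined <- rbind(non_dupes, temp)
combined <- combined[order(combined$key), ]
combined
```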

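What about this?
library(dplyr)
# summarise_each() and funs() are deprecated; across() is the current equivalent
df %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))
Or, as a reproducible example:
iris %>%
  group_by(Species) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))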

Related

Delete duplicates with multiple grouping conditions

I want to delete duplicates with multiple grouping conditions, but I always get far fewer rows than expected.
The dataframe compares two companies per year. Like this:
year c1 c2
2000 a b
2000 a c
2000 a d
2001 a b
2001 b d
2001 a c
For every c1, I want to look at c2 and delete rows whose c1/c2 pair already appears in the previous year.
I found a similar problem but with just one c. Here are some of my tries so far:
df<- df%>%
group_by(c1,c2) %>%
mutate(dup = n() > 1) %>%
group_split() %>%
map_dfr(~ if(unique(.x$dup) & (.x$year[2] - .x$year[1]) == 1) {
.x %>% slice_head(n = 1)
} else {
.x
}) %>%
select(-dup) %>%
arrange(year)
df<- sqldf("select a.*
from df a
left join df b on b.c1=a.c1 and b.c2 = a.c2 and b.year = a.year - 1
where b.year is null")
The desired output for the example would be:
year c1 c2
2000 a b
2000 a c
2000 a d
2001 b d

I am assuming you only want to check for duplicates in the immediately preceding year, so I am demonstrating on a modified sample with an extra year:
library(tidyverse)
df <- read.table(header = T, text = 'year c1 c2
2000 a b
2000 a c
2000 a d
2001 a b
2001 b d
2001 a c
2002 a d')
df %>%
filter(map2_lgl(df$year, paste(df$c1, df$c2), ~ !paste(.x -1, .y) %in% paste(df$year, df$c1, df$c2)))
#> year c1 c2
#> 1 2000 a b
#> 2 2000 a c
#> 3 2000 a d
#> 4 2001 b d
#> 5 2002 a d
Created on 2021-07-08 by the reprex package (v2.0.0)

Some of the other solutions won't work because I think they ignore the fact that you will probably have many years and want to eliminate duplicates from only the prior year.
Here is something fairly simple. You could do this in some map function or whatnot, but sometimes a simple loop does just fine. For each year of data, use anti_join() to return only those values from the current year which are not in the prior year. Then just restack the data.
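df_split <- df %>%
  group_split(year)
# Walk backwards so each year is compared with the unfiltered prior year.
# Assumes one list element per consecutive year.
for (this_year in rev(seq_along(df_split)[-1])) {
  df_split[[this_year]] <- df_split[[this_year]] %>%
    anti_join(df_split[[this_year - 1]], by = c("c1", "c2"))
}
bind_rows(df_split)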
# # A tibble: 4 x 3
# year c1 c2
# <int> <chr> <chr>
# 1 2000 a b
# 2 2000 a c
# 3 2000 a d
# 4 2001 b d
Edit
Another approach is to add a dummy column for the prior year and just use an anti_join() with that. This is probably what I would do.
df %>%
mutate(prior_year = year - 1) %>%
anti_join(df, by = c(prior_year = "year", "c1", "c2")) %>%
select(-prior_year)
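The same prior-year check can be sketched without any packages by building a text key per row. This is only an illustration on the modified sample data used in an earlier answer; the names key_now and key_prior are invented, and pasting with a space assumes the company names contain no spaces themselves.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "year c1 c2
2000 a b
2000 a c
2000 a d
2001 a b
2001 b d
2001 a c
2002 a d")

# Key of each row, and the key the same pair would have one year earlier
key_now <- paste(df$year, df$c1, df$c2)
key_prior <- paste(df$year - 1, df$c1, df$c2)

# Keep a row only if its pair does not occur in the preceding year
result <- df[!(key_prior %in% key_now), ]
result
```

Because the lookup always uses the original rows, a pair that occurs in three consecutive years is kept only for the first of them.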

You can also use the following solution.
library(dplyr)
library(purrr)
df %>%
filter(pmap_int(list(df$c1, df$c2, df$year), ~ df %>%
filter(year %in% c(..3, ..3 - 1)) %>%
rowwise() %>%
mutate(output = all(c(..1, ..2) %in% c_across(c1:c2))) %>%
pull(output) %>% sum) < 2)
# AnilGoyal's modified data set
year c1 c2
1 2000 a b
2 2000 a c
3 2000 a d
4 2001 b d
5 2002 a d

This keeps only the first row for each c1/c2 pair, so it assumes the rows are sorted by year.
Here data is your data frame.
data[!duplicated(data[, 2:3]), ]

I think this is pretty simple with base duplicated(), using the fromLast option to keep the last rather than the first entry. (It does assume the data are ordered by year.)
dat[!duplicated(dat[2:3], fromLast=TRUE), ] # negate logical vector in i-position
year c1 c2
3 2000 a d
4 2001 a b
5 2001 b d
6 2001 a c
I do get a different result than you said was expected so maybe I misunderstood the specifications?

Assuming that you indeed wanted to keep the last year, as stated in the question but contrary to your example table, you could simply use slice_tail():
library(dplyr)
df = data.frame(year=c("2000","2000","2000","2001","2001","2001"),
c1 = c("a","a","a","a","b","a"),c2=c("b","c","d","b","d","c"))
df %>% group_by(c1,c2) %>%
  slice_tail() %>% arrange(year, c1, c2)
Use slice_head() if you want the first year instead.
See the dplyr documentation for slice() for details.

How to do calculations on a column of a data frame using values contained in another data frame in R?

I have 2 data frames: one with experimental data and one with values of some constants. Experimental data and constants are separated by categories (a and b). I would like to include a new column in the experimental data frame that is the result of the following calculation:
z = k*y
To do this, I'm using the dplyr package and the mutate() function, but I'm not getting the expected result. Does anyone have any tips or suggestions, even if it is necessary to use another package?
library(dplyr)
Category <- c("a", "b")
k <- c(1, 2)
# Data frame with the constants for each category
Constant <- data.frame(Category, k)
x <- seq(0,5,1)
df <- expand.grid(x = x,
Category = Category)
# Data frame with the experimental results
df$y <- seq(1,12,1)
# Failed attempt to calculate z separated by categories
df %>%
group_by(Category) %>%
mutate(z = Constant*y)

With dplyr you can do the following:
library(dplyr)
left_join(df, Constant, by = c("Category")) %>%
mutate(z = k * y) %>%
select(-k)

I did this:
a = c()
for(i in unique(df$Category)){
a = c(a,df[df$Category==i,"y"]*Constant[Constant$Category==i,'k'])
}
df$z=a
result:
x Category y z
1 0 a 1 1
2 1 a 2 2
3 2 a 3 3
4 3 a 4 4
5 4 a 5 5
6 5 a 6 6
7 0 b 7 14
8 1 b 8 16
9 2 b 9 18
10 3 b 10 20
11 4 b 11 22
12 5 b 12 24
I don't know if this is what you're looking for. Just keep in mind that it only works if df is sorted by the Category column.
If you don't like for loops, here is an lapply version:
df$z = unlist(lapply(unique(df$Category), function(i) { return(df[df$Category == i, "y"] * Constant[Constant$Category == i, "k"]) }))
If the data isn't sorted by category:
df$z=unlist(lapply(1:nrow(df),function(i){ return(df[i,"y"]*Constant[Constant$Category==df[i,"Category"],'k'])}))
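A vectorised alternative to the loops above is a lookup with match(), which does not depend on row order. The sketch below rebuilds the question's data and reverses the rows on purpose to show that; it is an illustration, not the method of either answer.

```r
Constant <- data.frame(Category = c("a", "b"), k = c(1, 2))

df <- expand.grid(x = 0:5, Category = c("a", "b"))
df$y <- 1:12

# Reverse the rows to show that the lookup does not rely on sorting
df <- df[rev(seq_len(nrow(df))), ]

# match() gives, for each row of df, the row of Constant with the same Category
df$z <- Constant$k[match(df$Category, Constant$Category)] * df$y
df
```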

Replace missing data by using another data table for multiple columns

I have many columns in a table where there is missing data. I want to be able to pull in the information from another table if the data is missing for a particular record based on ID. I thought about possibly joining the two tables and writing a for loop where if column X is NA then pull in information from column Y, however, I have many columns and would require writing many of these conditions.
I want to create a function or a loop where I can pass in the data column names with the missing data and be able to pass in the column name from another table to get the information from.
Reproducible Example:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,NA,NA,1968,1992)
Month <- c(1,NA,8,12,NA,5)
Day <- c(3,NA,NA,NA,NA,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
ID <- c(2,3,4,5)
Year <- c(NA,1994,1967,NA)
Month <- c(4,NA,NA,10)
Day <- c(23,12,16,9)
Old_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Expected Output:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,1994,1967,1968,1992)
Month <- c(1,4,8,12,10,5)
Day <- c(3,23,12,16,9,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)

Use rbind to combine the two data frames, then group_by with summarise_all to take the first non-missing value per ID:
library(dplyr)
rbind(New_Data,Old_Data)%>%group_by(ID)%>%dplyr::summarise_all(function(x) x[!is.na(x)][1])
# A tibble: 6 x 4
ID Year Month Day
<dbl> <dbl> <dbl> <dbl>
1 1 1990 1 3
2 2 1987 4 23
3 3 1994 8 12
4 4 1967 12 16
5 5 1968 10 9
6 6 1992 5 30

An option using dplyr::left_join and dplyr::coalesce can be as:
library(dplyr)
New_Data %>% left_join(Old_Data, by="ID") %>%
mutate(Year = coalesce(Year.x, Year.y),
Month = coalesce(Month.x, Month.y),
Day = coalesce(Day.x, Day.y)) %>%
select(ID, Year, Month, Day)
# ID Year Month Day
# 1 1 1990 1 3
# 2 2 1987 4 23
# 3 3 1994 8 12
# 4 4 1967 12 16
# 5 5 1968 10 9
# 6 6 1992 5 30

Here's a solution using only base functions from another SO question
I modified it to your needs (created a function, and made an argument for the key column name):
fill_missing_data = function(df1, df2, keyColumn) {
commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
commonNames <- commonNames[commonNames != keyColumn]
dfmerge <- merge(df1, df2, by = keyColumn, all = TRUE) # use the key argument instead of a hard-coded "ID"
for(i in commonNames){
left <- paste(i, ".x", sep="")
right <- paste(i, ".y", sep="")
dfmerge[is.na(dfmerge[left]),left] <- dfmerge[is.na(dfmerge[left]),right]
dfmerge[right]<- NULL
colnames(dfmerge)[colnames(dfmerge) == left] <- i
}
return(dfmerge)
}
result = fill_missing_data(New_Data, Old_Data, "ID")
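If the output should contain exactly the rows of New_Data, a lookup with match() avoids the merge and the .x/.y suffix handling. The following is a rough sketch using the question's data; the function name fill_from and its cols argument are hypothetical.

```r
New_Data <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                       Year = c(1990, 1987, NA, NA, 1968, 1992),
                       Month = c(1, NA, 8, 12, NA, 5),
                       Day = c(3, NA, NA, NA, NA, 30))
Old_Data <- data.frame(ID = c(2, 3, 4, 5),
                       Year = c(NA, 1994, 1967, NA),
                       Month = c(4, NA, NA, 10),
                       Day = c(23, 12, 16, 9))

# Hypothetical helper: fill NAs in `new` from the row of `old` with the same key.
# By default it fills every column the two data frames share, except the key.
fill_from <- function(new, old, key,
                      cols = setdiff(intersect(names(new), names(old)), key)) {
  idx <- match(new[[key]], old[[key]]) # row of `old` for each row of `new`
  for (col in cols) {
    missing <- is.na(new[[col]])
    # IDs absent from `old` give an NA index, so the value simply stays NA
    new[[col]][missing] <- old[[col]][idx[missing]]
  }
  new
}

filled <- fill_from(New_Data, Old_Data, key = "ID")
filled
```

Passing cols = c("Year", "Month") would restrict the fill to those columns.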

Find the data between two datasets which is within two weeks R

I have two data frames, as follows:
data1
Type date
1 A 2011-10-21
2 A 2011-11-18
3 A 2011-12-16
4 B 2011-10-20
5 B 2011-11-17
6 B 2011-12-15
and
data2
Date Type value
1 2011-10-25 A 1
2 2011-10-15 A 3
3 2011-11-10 A 4
4 2011-10-23 B 12
5 2011-10-27 B 1
6 2011-11-18 B 1
For each row of data1, I want to find all entries in data2 that have the same Type (A or B) and whose Date lies within two weeks of the data1 date, then sum their values and return the sum as the output.
My ideal output would be
Type date Value
1 A 2011-10-21 4 (3+1)
2 A 2011-11-18 4
3 A 2011-12-16 NA ( No values for A within two weeks)
4 B 2011-10-20 13 ( 12+1)
5 B 2011-11-17 1
6 B 2011-12-15 NA ( No values for B within two weeks)
I can think of writing a loop in R and running through it, but that runs for a long time. I guess there should be a better way to do this in dplyr. I have been trying but have not been able to complete it. Can anybody help me with this?
Thanks

How does this look? It assumes data1 is stored as df1 and data2 as df2.
library(dplyr)
library(lubridate)
df3 <- full_join(df1, df2, by = "Type")
df3 <- df3 %>% mutate(date1 = week(date), Date1 = week(Date))
df4 <- df3 %>% mutate(Key = (date1 - Date1) %in% -2:2) # TRUE when the week numbers differ by at most 2
df5 <- df4 %>% filter(Key) %>% group_by(Type, date) %>%
summarise(Value = sum(value))
full_join(df1, df5, by = c("Type", "date"))
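The code above compares week numbers, which only approximates a two-week window. A sketch that compares the dates directly, assuming "within two weeks" means an absolute difference of at most 14 days (that reading is an assumption), could look like this in base R:

```r
data1 <- data.frame(
  Type = c("A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2011-10-21", "2011-11-18", "2011-12-16",
                   "2011-10-20", "2011-11-17", "2011-12-15")))
data2 <- data.frame(
  Date = as.Date(c("2011-10-25", "2011-10-15", "2011-11-10",
                   "2011-10-23", "2011-10-27", "2011-11-18")),
  Type = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 4, 12, 1, 1))

# For each row of data1, sum the data2 values of the same Type that lie
# within 14 days of the data1 date; return NA when nothing matches
data1$Value <- vapply(seq_len(nrow(data1)), function(i) {
  day_gap <- abs(as.numeric(data2$Date - data1$date[i]))
  hit <- data2$Type == data1$Type[i] & day_gap <= 14
  if (any(hit)) sum(data2$value[hit]) else NA_real_
}, numeric(1))
data1
```

This still loops over the rows of data1, so for very large inputs a join-based approach may be faster.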

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that would have sum values for all the combinations of the categories derived from the CatA and CatNum as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with use of simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
- Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time.
- Keep some flexibility in which function is applied; for instance, I may want to apply mean instead of sum.
- Save the "Total for" string in a separate object that I can easily edit when applying a function other than sum.
I was initially thinking of using dplyr, on the lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.

You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464

Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622

Split then use apply
#result
res <- do.call(rbind,
lapply(
c(split(dta,dta$CatA),
split(dta,dta$CatNum),
split(dta,dta[,1:2])),
function(i)sum(i[,"SomeVal"])))
#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
