I have a large data frame that shows the distance between strings and their counts.
For example, row 1 shows the distance between apple and pple, along with the number of times I have counted apple (counts_col1 = 100) and the number of times I have counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas around the string with the highest count in each group, which is apple (100) and banana (200).
I want my data to look something like this:
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
The format of the output does not have to be exactly like this. I am really struggling to break this problem down and cluster the groups.
Any help or comments are really appreciated!
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
data.table::as.data.table(df),
measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
value.name = c("col", "counts")
) %>%
select(col, counts) %>%
unique()
df %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
# igraph::components() %>%
igraph::membership() %>%
split(names(.), .) %>%
map_dfr(
~tibble(col = .x) %>%
semi_join(count_df, ., by = "col") %>%
arrange(desc(counts)) %>%
summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
)
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
This works on this toy example, but I think your example is too simple and probably does not reflect your real problem. It might be even easier if you are only interested in finding connected components (if two words are connected, they are in the same cluster). In that case you would need to replace walktrap.community with components.
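A minimal sketch of the connected-components variant mentioned above, assuming the df from the question. The names count_tbl, g, memb and clusters are made up for illustration, and igraph is called with the igraph:: prefix so that attaching it does not mask dplyr functions.

```r
library(dplyr)

df <- tibble(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# One row per distinct string together with its count
count_tbl <- bind_rows(
  select(df, col = col1, counts = counts_col1),
  select(df, col = col2, counts = counts_col2)
) %>%
  distinct()

# Undirected graph: every row of df is an edge between two strings
g <- igraph::graph_from_data_frame(select(df, col1, col2), directed = FALSE)

# Component id per vertex, named by the vertex (string) it belongs to
memb <- igraph::components(g)$membership

clusters <- tibble(col = names(memb), group = as.integer(memb)) %>%
  inner_join(count_tbl, by = "col") %>%
  group_by(group) %>%
  arrange(desc(counts), .by_group = TRUE) %>%   # most frequent string first
  summarise(
    cluster = first(col),
    elements = list(col),
    sum_counts = sum(counts)
  )

clusters
```

Unlike the random walk clustering, this ignores the distance column entirely: any two strings that appear in the same row end up in the same cluster, so it only makes sense if the pairs were already filtered by a distance threshold.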
Here is one approach. I first add a group identifier for the sets (I presume you have this in your actual data). Then, after reshaping to a longer dataset, I group by this id and identify the "word" that has the largest count. I then do an inner join between the initial df and this set of key rows holding the highest-count word, summarize, and rename. All the variants are pushed into a list column.
df <- df %>% mutate(id=c(1,1,1,2,2,2))
df %>% inner_join(
rbind(
df %>% select(id,distance,col=col1, counts=counts_col1),
df %>% select(id,distance,col=col2, counts=counts_col2)
) %>%
group_by(id) %>%
slice_max(counts) %>%
distinct(col),
by=c("col1"="col")
) %>%
group_by(col1) %>%
summarize(variants = list(c(col2, cur_group()$col1)),
total = min(counts_col1) + sum(counts_col2)) %>%
rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
A similar approach in data.table (also depends on having that id column)
library(data.table)
setDT(df)
df[rbind(
df[,.(id,col=col1,counts=counts_col1)],
df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
, .(elements=list(c(col2,.BY$cluster)),
sum_counts = min(counts_col1) + sum(counts_col2)),
by=.(cluster=col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152
A financial statement is a good illustration of this issue. Here is an example dataframe:
df <- data.frame( date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 10),
category = sample(c('a','b', 'c'), 10, replace=TRUE),
direction = sample(c('credit', 'debit'), 10, replace=TRUE),
value = sample(0:25, 10, replace = TRUE) )
I want to produce a summary table with incoming, outgoing and total columns for each category.
df %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE), outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
In most cases this works perfectly with the example dataframe above.
But there are cases where df$direction could contain a single value e.g., credit, resulting in an error.
Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found
Given that I have no control over the dataframe, what is the best way to handle this?
I've been playing around with a conditional statement in the summarize method to check that the column exists, but have not managed to get this working.
...
summarize( outgoing = case_when(
"debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE),
TRUE ~ 0 ) )
...
Have I made a syntax error, or am I going in completely the wrong direction with this?
The issue happens only when just one of the values is present, i.e. 'credit' and no 'debit' or vice versa. In that case pivot_wider doesn't create the missing column. Instead of pivoting and then summarising, do this directly with summarise and ==, i.e. if 'debit' is absent, sum takes care of it by returning 0 for the empty subset.
library(dplyr)
df %>%
slice(-c(9:10)) %>% # just removed the 'debit' rows completely
group_by(category) %>%
summarise(total = sum(value[direction == 'credit']) -
sum(value[direction == "debit"]))
-output
# A tibble: 3 × 2
category total
<chr> <int>
1 a 15
2 b 30
3 c 63
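The same idea extends to the incoming, outgoing and total columns asked for in the question. This is only a sketch on made-up data (credits_only and summary_tbl are illustrative names, not from the original post); it relies on sum() of an empty vector being 0.

```r
library(dplyr)

# Hypothetical input that contains no 'debit' rows at all
credits_only <- data.frame(
  category = c("a", "a", "b"),
  direction = "credit",
  value = c(5L, 10L, 7L)
)

summary_tbl <- credits_only %>%
  group_by(category) %>%
  summarise(
    incoming = sum(value[direction == "credit"]),
    outgoing = sum(value[direction == "debit"]),  # empty subset, so 0
    total = incoming - outgoing
  )

summary_tbl
```

Because no column named debit is ever referenced, the code does not depend on which values happen to occur in direction.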
With pivot_wider, it is not the case
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value)
# A tibble: 8 × 3
date category credit
<date> <chr> <int>
1 2020-07-25 c 19
2 2020-05-09 b 15
3 2020-08-27 a 15
4 2020-03-27 b 15
5 2020-04-06 c 6
6 2020-07-06 c 11
7 2020-09-22 c 25
8 2020-10-06 c 2
Here it creates only the 'credit' column, so referring to a 'debit' column that was never created throws an error:
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) )
Error: Problem with summarise() column outgoing.
ℹ outgoing = sum(debit, na.rm = TRUE).
✖ object 'debit' not found
ℹ The error occurred in group 1: category = "a".
Run rlang::last_error() to see where the error occurred.
In this case, we can use complete to add rows for 'debit' as well; these rows will have NA in the other columns.
library(tidyr)
df %>%
slice(-c(9:10)) %>%
complete(category, direction = c("credit", "debit")) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
# A tibble: 3 × 4
category incoming outgoing total
<chr> <int> <int> <int>
1 a 15 0 15
2 b 30 0 30
3 c 63 0 63
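Another possible route, sketched here under the assumption that a reasonably recent tidyr is installed (one where pivot_wider has the names_expand argument), is to turn direction into a factor with both levels so that the pivot creates a column for each level. The data and the names credits_only, wide and summary_tbl are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical input that contains no 'debit' rows at all
credits_only <- tibble(
  date = as.Date("2020-01-01") + 0:2,
  category = c("a", "a", "b"),
  direction = "credit",
  value = c(5L, 10L, 7L)
)

wide <- credits_only %>%
  # Declare both possible values, even though only one occurs
  mutate(direction = factor(direction, levels = c("credit", "debit"))) %>%
  # names_expand = TRUE creates a column for every factor level
  pivot_wider(names_from = direction, values_from = value, names_expand = TRUE)

summary_tbl <- wide %>%
  group_by(category) %>%
  summarise(
    incoming = sum(credit, na.rm = TRUE),
    outgoing = sum(debit, na.rm = TRUE)
  ) %>%
  mutate(total = incoming - outgoing)

summary_tbl
```

Compared with complete, this adds no extra rows; the missing column is simply filled with NA, which sum(..., na.rm = TRUE) treats as 0.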
I have the following dataset:
id<-c("1a","1a","1a","1a","1a",
"2a","2a","2a","2a","2a",
"3a","3a","3a","3a","3a")
fch<-c("22/05/2020","12/01/2020","01/01/2019","10/11/2020","01/01/2019",
"10/10/2015","01/01/2015","20/10/2015","08/04/2020","12/12/2019",
"01/05/2020","01/01/2013","10/08/2019","12/01/2020","20/10/2019")
dat<-c(25,35,48,97,112,
65,85,77,89,555,
58,98,25,45,336)
data<-as.data.frame(cbind(id,fch,dat))
My intention is to extract the row corresponding to the earliest date by the factor "id".
So my resulting data frame would look like this:
id<-c("1a","2a","3a")
fch<-c("01/01/2019","01/01/2015","01/01/2013")
dat<-c(48,85,98)
data_result<-as.data.frame(cbind(id,fch,dat))
This was my unsuccessful attempt:
DF1 <- data %>%
mutate(fch = as.Date(as.character(data$fch),format="%d/%m/%Y")) %>%
group_by(id) %>%
mutate(fch = min(fch)) %>%
ungroup
Slightly different method from @akrun's. Note that one of the earliest dates in your data has two entries. Without a time there is no way to know which occurred first (or maybe you want both?).
library(tidyverse)
library(lubridate)
data.frame(id = c(rep("1a",5), rep("2a",5), rep("3a",5)),
fch = c("22/05/2020","12/01/2020","01/01/2019","10/11/2020","01/01/2019",
"10/10/2015","01/01/2015","20/10/2015","08/04/2020","12/12/2019",
"01/05/2020","01/01/2013","10/08/2019","12/01/2020","20/10/2019"),
dat = c(25,35,48,97,112,65,85,77,89,555,58,98,25,45,336)) %>%
group_by(id) %>%
mutate(fch = dmy(fch)) %>%
filter(fch == min(fch)) %>%
ungroup()
# A tibble: 4 x 3
id fch dat
<chr> <chr> <dbl>
1 1a 01/01/2019 48
2 1a 01/01/2019 112
3 2a 01/01/2015 85
4 3a 01/01/2013 98
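If exactly one row per id is wanted even when the earliest date is tied, one option is dplyr's slice_min with with_ties = FALSE. The following is a sketch on the question's data; the name earliest is illustrative, and which of the tied rows survives should be treated as arbitrary.

```r
library(dplyr)

data <- data.frame(
  id = rep(c("1a", "2a", "3a"), each = 5),
  fch = c("22/05/2020", "12/01/2020", "01/01/2019", "10/11/2020", "01/01/2019",
          "10/10/2015", "01/01/2015", "20/10/2015", "08/04/2020", "12/12/2019",
          "01/05/2020", "01/01/2013", "10/08/2019", "12/01/2020", "20/10/2019"),
  dat = c(25, 35, 48, 97, 112, 65, 85, 77, 89, 555, 58, 98, 25, 45, 336)
)

earliest <- data %>%
  mutate(fch = as.Date(fch, format = "%d/%m/%Y")) %>%  # parse day/month/year
  group_by(id) %>%
  slice_min(fch, n = 1, with_ties = FALSE) %>%         # keep a single row per id
  ungroup()

earliest
```

Leaving with_ties at its default of TRUE returns both rows for id 1a, the same as the filter(fch == min(fch)) approach.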
We arrange the data by 'id' and by 'fch' converted to Date, group by 'id', and use slice_head to get the first row of each group
library(dplyr)
library(lubridate)
data %>%
arrange(id, dmy(fch)) %>%
group_by(id) %>%
slice_head(n = 1) %>%
ungroup
-output
# A tibble: 3 x 3
# id fch dat
# <chr> <chr> <dbl>
#1 1a 01/01/2019 48
#2 2a 01/01/2015 85
#3 3a 01/01/2013 98
NOTE: cbind returns a matrix by default, and a matrix can hold only a single type, so the numeric column is coerced. Instead, we can create the data.frame directly
data
data <- data.frame(id, fch, dat)
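A small sketch of the coercion described in the note, using shortened, made-up vectors (via_cbind and direct are illustrative names):

```r
id <- c("1a", "2a")
dat <- c(25, 65)

# cbind() builds a character matrix first, so dat is no longer numeric
via_cbind <- as.data.frame(cbind(id, dat))

# data.frame() keeps each column's own type
direct <- data.frame(id, dat)

is.numeric(via_cbind$dat)
is.numeric(direct$dat)
```

With the cbind version, something like mean(dat) would not work until the column is converted back with as.numeric.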
I am working on a data set which is similar to
data <-tribble(
~id, ~ dates, ~days_prior,
1,20190101, NA,
1,NA, 15,
1,NA, 20,
2, 20190103, NA,
2,NA, 3,
2,NA, 4)
I have the first date for each ID and I am trying to calculate the next date by adding days_prior to the previous date. I am using the lag function to refer to the previous date.
df<- df%>% mutate(dates = as.Date(ymd(dates)), days_prior =as.integer(days_prior))
df<-df %>% mutate(dates =
as.Date(ifelse(is.na(days_prior),dates,days_prior+lag(dates)),
origin="1970-01-01"))
This works, but only for the next row, as you can see in the attached data.
What am I doing wrong? I would like all the dates to be calculated by mutate(). What different approach should I take to calculate this?
I don't really see how lag would help here. Unless I misunderstood, here is an option using tidyr::fill
data %>%
group_by(id) %>%
mutate(dates = as.Date(ymd(dates))) %>%
fill(dates) %>%
mutate(dates = dates + if_else(is.na(days_prior), 0L, as.integer(days_prior))) %>%
ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 NA
#2 1 2019-01-16 15
#3 1 2019-01-21 20
#4 2 2019-01-03 NA
#5 2 2019-01-06 3
#6 2 2019-01-07 4
Or a slight variation, replacing the NA entries in days_prior with 0
data %>%
group_by(id) %>%
mutate(
dates = as.Date(ymd(dates)),
days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
fill(dates) %>%
mutate(dates = dates + as.integer(days_prior)) %>%
ungroup()
Update
In response to your clarifications in the comments, here is what you can do
data %>%
group_by(id) %>%
mutate(
dates = as.Date(ymd(dates)),
days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
fill(dates) %>%
mutate(dates = dates + cumsum(days_prior)) %>%
ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 0
#2 1 2019-01-16 15
#3 1 2019-02-05 20
#4 2 2019-01-03 0
#5 2 2019-01-06 3
#6 2 2019-01-10 4
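The difference between the first version and the update comes down to plain offsets versus cumulative offsets. A base R sketch for id 1 (start and days_prior are illustrative standalone vectors, not the columns of data):

```r
start <- as.Date("2019-01-01")
days_prior <- c(0, 15, 20)

# Each offset counted from the first date
from_first <- start + days_prior

# Each offset counted from the previous row's date
from_previous <- start + cumsum(days_prior)

from_first
from_previous
```

The first vector ends on 2019-01-21 and the second on 2019-02-05, matching the two outputs shown above for id 1.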
You can use the na.locf from the zoo package to fill in the last observed date before adding the prior days.
library("tidyverse")
library("zoo")
data %>%
# Fill in NA dates with the previous non-NA date
# The `locf` stands for "last observation carried forward"
# Fill in NA days_prior with 0
mutate(dates = zoo::na.locf(dates),
days_prior = replace_na(days_prior, 0)) %>%
mutate(dates = lubridate::ymd(dates) + days_prior)
This solution makes two assumptions:
The rows are sorted by id. You can get around this assumption with a group_by(id) followed by an ungroup() statement, as shown in the solution by Maurits Evers.
For each id, the row with the observed date is first in the group. This needs to be true in any case with either na.locf or fill, because both functions fill in NAs using the previous non-NA entry.
If you don't want to make any assumptions about the ordering, you can sort the rows at the start with data %>% arrange(id, dates).
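A sketch of that sort-first variant on deliberately shuffled, made-up rows. It uses tidyr::fill rather than na.locf so that the fill can respect the groups; shuffled and result are illustrative names.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical rows in arbitrary order
shuffled <- tibble::tribble(
  ~id, ~dates,    ~days_prior,
  2,   NA,        3,
  1,   20190101,  NA,
  2,   20190103,  NA,
  1,   NA,        15
)

result <- shuffled %>%
  arrange(id, dates) %>%   # arrange() puts NA dates last within each id
  group_by(id) %>%
  fill(dates) %>%          # carry the observed date forward within the id
  mutate(dates = ymd(dates) + replace_na(days_prior, 0)) %>%
  ungroup()

result
```

Sorting works here because arrange places missing values at the end, so the single observed date of each id always ends up in front of the rows that need filling.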
I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
Starting from the summarised output ('df2'), we can join with the distinct rows of the selected columns of the original data
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic in base R
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = TRUE)
Now there are actually two possibilities: either id and species are essentially the same in your df, with one just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = TRUE).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq. If you need to keep them, you can again replace summarise with mutate and then use distinct with the .keep_all = TRUE argument.
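If an id can carry several species and one row per id is still required, one further option (not part of the answers above, just a sketch on made-up data) is to collapse the distinct species names into a single string inside summarise:

```r
library(dplyr)

# Hypothetical data in which id "b" has two different species
df1 <- data.frame(
  id = c("a", "a", "b", "b"),
  species = c("dog", "dog", "cat", "bird"),
  freq = c(10, 11, 12, 17)
)

df2 <- df1 %>%
  group_by(id) %>%
  summarise(
    species = paste(unique(species), collapse = ", "),  # e.g. "cat, bird"
    meanFreq = mean(freq),
    minFreq = min(freq)
  )

df2
```

This keeps exactly one row per id at the cost of species no longer being a clean categorical column.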