Group two columns and count in R

I am trying to do a simple count with a dataframe.
For example, if the df is as below:
> df
ID label col1 col2
1 Buckinghamshire 1 A A
2 Cornwall and Isles of Scilly 2 B B
3 Devon 1 A A
4 Dunfermline 2 C C
5 Humberside 2 C C
6 Inner London X A A
7 Kent X A A
8 Kirkcaldy 1 C C
9 Lancashire 1 B B
10 Not known/missing 2 C C
Desired output:
> df2
name group 1 2 X
1 col1 A 4647 4858 108
1 col1 B 120456 146864 3502
1 col1 C 258 53 111
2 col2 A 12247 1202 66
2 col2 B 4585 258 1
2 col2 C 32158 15426 477
How can I solve this to get the desired output?

The question is not clear, but here is an attempt based on the sample data:
lines <- "
ID, label, col1, col2
Buckinghamshire , 1 ,A,A
Cornwall and Isles of Scilly , 2 ,B,B
Devon , 1 ,A,A
Dunfermline , 2 ,C,C
Humberside , 2 ,C,C"
con <- textConnection(lines)
data <- read.csv(con)
data$col2 <- trimws(data$col2)

library(tidyr)
library(dplyr)

# reshape to long form: one row per (label, column name, value)
temp <- gather(data, key = "name", value = "group", 3:4)
# count occurrences per column name, value and label
temp2 <- temp[, c(2, 3, 4)] %>%
  group_by(name, group, label) %>%
  summarise(count = n())
# spread the label levels back out as columns
spread(temp2, key = "label", value = "count")
Result:
name group `1` `2`
<chr> <chr> <int> <int>
1 col1 A 2 NA
2 col1 B NA 1
3 col1 C NA 2
4 col2 A 2 NA
5 col2 B NA 1
6 col2 C NA 2
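The same result can be written with the newer pivot functions (assuming tidyr >= 1.0, where gather() and spread() are superseded):
library(tidyr)
library(dplyr)

data %>%
  pivot_longer(c(col1, col2), names_to = "name", values_to = "group") %>%
  count(name, group, label) %>%
  pivot_wider(names_from = label, values_from = n)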


How can I stack my dataset so each observation relates to all other observations but itself?

I would like to stack my dataset so each observation is related to all other observations except itself.
Suppose I have the following dataset:
df <- data.frame(id = c("a", "b", "c", "d"),
                 x1 = c(1, 2, 3, 4))
df
id x1
1 a 1
2 b 2
3 c 3
4 d 4
I would like observation a to be related to b, c, and d, and the same for every other observation. The result should look something like this:
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
So observation a is related to b, c, d; observation b is related to a, c, d; and so on. Any ideas?
Another option:
library(dplyr)
left_join(df, df, by = character()) %>%
  filter(id.x != id.y)
Or
output <- merge(df, df, by = NULL)
output <- output[output$id.x != output$id.y, ]
Thanks @ritchie-sacramento, I didn't know the by = NULL option for merge before, and thanks @zephryl for the by = character() option for dplyr joins.
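Note that newer dplyr versions deprecate by = character() for cross joins; assuming dplyr >= 1.1.0, cross_join() is the dedicated verb:
library(dplyr)

cross_join(df, df) %>%
  filter(id.x != id.y)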
tidyr::expand_grid() accepts data frames, which can then be filtered to remove rows that share the id:
library(tidyr)
library(dplyr)
expand_grid(df, df, .name_repair = make.unique) %>%
  filter(id != id.1)
# A tibble: 12 × 4
id x1 id.1 x1.1
<chr> <dbl> <chr> <dbl>
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
You can use combn() to get all combinations of row indices, then assemble your dataframe from those:
rws <- cbind(combn(nrow(df), 2), combn(nrow(df), 2, rev))
df2 <- cbind(df[rws[1, ], ], df[rws[2, ], ])
# clean up row and column names
rownames(df2) <- 1:nrow(df2)
colnames(df2) <- c("id", "x1", "id2", "x2")
df2
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 c 3
5 b 2 d 4
6 c 3 d 4
7 b 2 a 1
8 c 3 a 1
9 d 4 a 1
10 c 3 b 2
11 d 4 b 2
12 d 4 c 3
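A minor refinement of the same idea: combn() only needs to run once, since the reversed pairs are just a row swap of the first matrix (a sketch, assuming the same df):
cmb <- combn(nrow(df), 2)      # all unordered pairs of row indices
rws <- cbind(cmb, cmb[2:1, ])  # append each pair in reverse order
df2 <- setNames(cbind(df[rws[1, ], ], df[rws[2, ], ]),
                c("id", "x1", "id2", "x2"))
rownames(df2) <- NULL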

How can I compare two columns of different lengths from two dataframes to check for matching values in R?

I have two dataframes that look something like this:
df1 <- data.frame(col1 = c(1, 2, 3, 4, 5, 6, 7),
                  col2 = c("a", "b", "c", "d", "e", "f", "g"))
df2 <- data.frame(col1 = c(1, 2, 3, 4, 10),
                  col2 = c("a", "c", "f", "g", "z"))
df1
col1 col2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
df2
col1 col2
1 1 a
2 2 c
3 3 f
4 4 g
5 10 z
I want to compare the values in col2 of each df and line up the columns of each df by the matches to get this:
col1 col2 col1.1 col2.1
1 1 a 1 a
2 2 b NA <NA>
3 3 c 2 c
4 4 d NA <NA>
5 5 e NA <NA>
6 6 f 3 f
7 7 g 4 g
Where ideally, the missing values from df1 are dropped and the missing values from df2 are filled in with NAs. Ultimately, I want to calculate what percent of the values in col2 of df1 have a match in col2 of df2.
Use left_join from dplyr
library(dplyr)
df1 %>%
  left_join(df2, by = "col2", keep = TRUE)
output:
col1.x col2.x col1.y col2.y
1 1 a 1 a
2 2 b NA <NA>
3 3 c 2 c
4 4 d NA <NA>
5 5 e NA <NA>
6 6 f 3 f
7 7 g 4 g
To get the percentage of matches:
out <- df1 %>%
  left_join(df2, by = "col2", keep = TRUE)
out %>%
  filter(col2.x == col2.y) %>%
  nrow() / nrow(out) * 100
Result:
[1] 57.14286
If you want the % of matches as a new column:
df1 %>%
  left_join(df2, by = "col2", keep = TRUE) %>%
  mutate(percentage_match = sum(col2.x == col2.y, na.rm = TRUE) / nrow(.) * 100)
output:
col1.x col2.x col1.y col2.y percentage_match
1 1 a 1 a 57.14286
2 2 b NA <NA> 57.14286
3 3 c 2 c 57.14286
4 4 d NA <NA> 57.14286
5 5 e NA <NA> 57.14286
6 6 f 3 f 57.14286
7 7 g 4 g 57.14286
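If only the percentage is needed, a base R membership check avoids the join entirely; a minimal sketch using the same df1 and df2:
# share of df1$col2 values that also appear in df2$col2
mean(df1$col2 %in% df2$col2) * 100
# [1] 57.14286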

Keeping all NAs in dplyr distinct function

I have a data.frame (the eBird basic dataset) where many observers may upload a record of the same sighting to the database. In that case, the event is given a "group identifier"; when a record is not from a group session, the field is NA. I'm trying to filter out all the duplicates from group events while keeping all the NAs, and I'm trying to do this without splitting the dataframe in two:
library(dplyr)
set.seed(1)
df <- tibble(
  x = sample(c(1:6, NA), 30, replace = TRUE),
  y = sample(letters[1:4], 30, replace = TRUE)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I don't want the NAs in x to be grouped together, as happened here with the "NA b" and "NA c" combinations; distinct() has no argument for leaving NAs out of the computation. Is splitting the dataframe the only solution?
With distinct(), an option is to create a new column based on the NA elements in 'x':
library(dplyr)
df %>%
  mutate(x1 = row_number() * is.na(x)) %>%  # unique value per NA row, 0 otherwise
  distinct() %>%
  select(-x1)
Or we can use duplicated() with an OR (|) condition in filter to keep every row where 'x' is NA:
df %>%
  filter(is.na(x) | !duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b
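Note that cur_data() was deprecated in dplyr 1.1.0; assuming that version or later, pick() returns the same data frame for the current group:
df %>%
  filter(is.na(x) | !duplicated(pick(everything())))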

Expand dataframe by ID to generate a special column

I have the following dataframe
df <- data.frame(ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
                 A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I wish to create a new dataframe df2 from df that looks as follows:
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new dataframe has a single CFreq column that holds the unique values of A_Frequency and B_Frequency per ID, with the NA values ignored.
I have tried dplyr but am unable to get the required result:
df2 <- df %>%
  group_by(ID) %>%
  select(ID, A_Frequency, B_Frequency) %>%
  mutate(Cfreq = unique(A_Frequency, B_Frequency))
This yields the following which is quite different
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Could someone help me with this?
The gather function from the tidyr package will be helpful here:
library(tidyverse)
df %>%
  gather(x, CFreq, -ID) %>%
  select(-x) %>%
  na.omit() %>%
  unique() %>%
  arrange(ID, CFreq)
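With tidyr >= 1.0, where gather() is superseded, an equivalent sketch would be:
df %>%
  pivot_longer(-ID, values_to = "CFreq") %>%
  select(ID, CFreq) %>%
  drop_na() %>%
  distinct() %>%
  arrange(ID, CFreq)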
A different tidyverse possibility could be:
df %>%
  nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
  mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
  unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7
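The nest() interface changed in tidyr 1.0, so on a current tidyr this would presumably be written as:
df %>%
  nest(C_Frequency = c(A_Frequency, B_Frequency)) %>%
  mutate(C_Frequency = map(C_Frequency, ~ unique(na.omit(unlist(.x))))) %>%
  unnest(C_Frequency)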
A base R approach would be to split the dataframe by ID, count the unique entries in each piece, and create a sequence of that length:
do.call(rbind, lapply(split(df, df$ID), function(x)
  data.frame(ID = x$ID[1],
             CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
This will also work when A_Frequency and B_Frequency contain characters or arbitrary numbers instead of sequential ones, because CFreq is built as a sequence along the number of unique entries.
In the tidyverse we can do:
library(tidyverse)
df %>%
  group_split(ID) %>%
  map_dfr(~ data.frame(ID = .$ID[1],
                       CFreq = seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))
A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
                 .SDcols = cols,
                 by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7

How to filter values that appear after another value by group in R?

I am trying to filter the products that customers bought after buying product "A".
My sample data set:
fk_ConsumerID ProductName Date
1 B 2015.10.12
1 A 2015.10.14
1 C 2015.10.18
1 D 2015.10.19
2 A 2015.10.10
2 B 2015.10.12
2 C 2015.10.14
2 D 2015.10.18
2 E 2015.10.19
3 C 2015.10.14
3 D 2015.10.18
3 A 2015.10.19
4 B 2015.10.10
Result I want to get:
fk_ConsumerID ProductName Date
1 C 2015.10.18
1 D 2015.10.19
2 B 2015.10.12
2 C 2015.10.14
2 D 2015.10.18
2 E 2015.10.19
Code I tried writing:
library(dplyr)
# grouping customers
customers <- group_by(df, fk_ConsumerID)
# filtering the ones that appear after A (doesn't work)
f <- filter(customers, ProductName > "A")
I will try to find a neater solution, but this is a temporary one that does the job.
library(dplyr)
library(purrr)
df <- data.frame(fk_ConsumerID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4),
                 ProductName = c("B", "A", "C", "D", "A", "B", "C", "D", "E", "C", "D", "A", "B"),
                 Date = c(1:13))
df <- df %>%
  group_by(fk_ConsumerID) %>%
  mutate(cc = ProductName == "A",      # flag purchases of A
         ss = seq_along(ProductName))  # row index within each customer
fk_ConsumerID ProductName Date cc ss
<dbl> <fctr> <int> <lgl> <int>
1 1 B 1 FALSE 1
2 1 A 2 TRUE 2
3 1 C 3 FALSE 3
4 1 D 4 FALSE 4
5 2 A 5 TRUE 1
6 2 B 6 FALSE 2
7 2 C 7 FALSE 3
8 2 D 8 FALSE 4
9 2 E 9 FALSE 5
10 3 C 10 FALSE 1
11 3 D 11 FALSE 2
12 3 A 12 TRUE 3
13 4 B 13 FALSE 1
A temporary dataframe listing each fk_ConsumerID and the index of its entry with A:
kk <- df[which(df$cc==TRUE),c(1,5)]
names(kk)[2] <- "idx"
> kk
Source: local data frame [3 x 2]
Groups: fk_ConsumerID [3]
fk_ConsumerID idx
<dbl> <int>
1 1 2
2 2 1
3 3 3
A helper to look up that index for a given consumer:
getIndex <- function(x){
  kk$idx[kk$fk_ConsumerID == x] %>%
    as.integer
}
Add the index as a new column, then filter on it:
df <- df %>%
  mutate(idx = map(fk_ConsumerID, getIndex)) %>%
  filter(ss > idx) %>%
  select(1:3)
Source: local data frame [6 x 3]
Groups: fk_ConsumerID [2]
fk_ConsumerID ProductName Date
<dbl> <fctr> <int>
1 1 C 3
2 1 D 4
3 2 B 6
4 2 C 7
5 2 D 8
6 2 E 9
Create a temporary rank variable first, keep only the groups that contain ProductName == 'A', then keep the rows whose rank is greater than the rank at which ProductName == 'A' occurs.
df %>%
  group_by(fk_ConsumerID) %>%
  mutate(rank = 1:n()) %>%
  filter(sum(ProductName == 'A') > 0) %>%
  filter(rank > rank[ProductName == 'A']) %>%
  select(-rank)
# fk_ConsumerID ProductName Date
<int> <chr> <chr>
1 1 C 2015.10.18
2 1 D 2015.10.19
3 2 B 2015.10.12
4 2 C 2015.10.14
5 2 D 2015.10.18
6 2 E 2015.10.19
The following data.table (version 1.9.7) solution uses non-equi joins:
library(data.table)
# date of first purchase of product A by each customer
# (thereby removing edge case where purchase of A was the last purchase)
fp <- dt[ProductName == "A" & Date < max(Date), .(minDate = min(Date)), by = fk_ConsumerID]
# non-equi join
dt[fp, on = c("fk_ConsumerID", "Date>minDate")]
# fk_ConsumerID ProductName Date
#1: 1 C 2015-10-14
#2: 1 D 2015-10-14
#3: 2 B 2015-10-10
#4: 2 C 2015-10-10
#5: 2 D 2015-10-10
#6: 2 E 2015-10-10
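One caveat: in a non-equi join the join column is printed with the boundary value from fp (minDate above), not the original purchase date. Assuming a data.table version that supports the x. prefix in j (1.9.8+), the original dates can be recovered:
dt[fp, .(fk_ConsumerID, ProductName, Date = x.Date),
   on = c("fk_ConsumerID", "Date>minDate")]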
Data (to make it reproducible):
dt <- fread("fk_ConsumerID ProductName Date
1 B 2015.10.12
1 A 2015.10.14
1 C 2015.10.18
1 D 2015.10.19
2 A 2015.10.10
2 B 2015.10.12
2 C 2015.10.14
2 D 2015.10.18
2 E 2015.10.19
3 C 2015.10.14
3 D 2015.10.18
3 A 2015.10.19
4 B 2015.10.10")
dt[, Date := anytime::anydate(Date)]
Here is a solution in dplyr.
First we find the time at which a customer bought item A and store it in a new column called timeA.
Then it is just a matter of selecting all rows whose date comes after that time.
df %>%
  group_by(fk_ConsumerID) %>%
  filter(ProductName == "A") %>%
  summarise(timeA = min(Date)) %>%
  right_join(df) %>%
  filter(!is.na(timeA), Date > timeA)
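For comparison, a compact dplyr alternative (assuming a dplyr version that provides cumany(), i.e. >= 0.5) keeps, per customer, every row strictly after the first purchase of A:
library(dplyr)

df %>%
  group_by(fk_ConsumerID) %>%
  filter(lag(cumany(ProductName == "A"), default = FALSE)) %>%
  ungroup()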
