Create group variable based on common dates - r

I have a large data set containing animal IDs and dates. There are two groups within this dataset, but there is no grouping variable, so I have to infer who belongs to which group from the dates they appear to have in common.
Dummy data.
mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID   = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
The other issue I have is that every now and then an ID belonging to group 1 might appear with a date associated with group 2, which is what has thrown off every attempt I've made at grouping so far.
What I need is an output with IDs and a new group ID, like this:
ID Group
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
IDs 1:5 all appear together on the 1st and the 3rd, so they are likely to be one group.
IDs 6:10 appear on the 2nd and the 4th and are likely to be the 2nd group.
ID 5 belongs to group 1 because, even though it was observed once on the 2nd with IDs 6:9, it was observed twice (on the 1st and the 3rd) with IDs 1:4, so it most likely belongs to group 1.
All my attempts have fallen flat. Can anyone offer a solution to this?
Thanks in advance.
EDIT:
I thought we had nailed a solution using Jon's kmeans solution (in the comments below):
mydf_wide <- mydf %>%
  select(ID, Date) %>%
  distinct(ID, Date) %>%
  mutate(x = 1) %>%
  spread(Date, x, fill = 0)

mydf_wide$clusters <- mydf_wide %>%
  kmeans(centers = 2) %>%
  pluck("cluster")
but I'm actually finding that the kmeans method doesn't quite get it right every time. See below:
The groups where certain tags (IDs) appear on the same day as each other are fairly easy to spot by eye. There are two groups: one is in the center, and the other appears on either side. The clustering should be vertical, by common dates, as in Jon's answer below, but instead it is clustering across the entire date range. (Apologies for the messy axis labels.)
The k-means method has worked on other groups, but it's not consistently able to group by common dates. I think the clustering approach is sensible, but I was wondering if there are other clustering methods that might cope better than kmeans?
Alternatively, could a filtering method reduce the background noise and make the kmeans approach more reliable?
Again, very grateful for any and all advice.
Cheers.

My thinking here is that you just assign each Date to a group, then take the average group for each ID and round to the nearest whole number. In this case, the average group for ID == 5 would be 1.33, which rounds to 1.
library(dplyr)

mydf %>%
  mutate(group = case_when(
    Date %in% as.Date(c("2012-01-01", "2012-01-03")) ~ 1,
    Date %in% as.Date(c("2012-01-02", "2012-01-04")) ~ 2,
    TRUE ~ NA_real_
  )) %>%
  group_by(ID) %>%
  summarise(likely_group = mean(group) %>% round())
Which gives you the following:
# A tibble: 10 x 2
ID likely_group
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
This works as long as there isn't an even split between groups for a single ID; with only the information provided, there is no way to resolve such a tie.
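If an even split does turn up in real data, one possible refinement (a sketch of my own, not part of the original answer) is to count sightings per candidate group rather than averaging, so a tie stays visible instead of being silently rounded:

library(dplyr)

mydf %>%
  mutate(group = if_else(
    Date %in% as.Date(c("2012-01-01", "2012-01-03")), 1, 2)) %>%
  count(ID, group) %>%      # sightings per ID per candidate group
  group_by(ID) %>%
  filter(n == max(n)) %>%   # keep the best-supported group(s) per ID
  ungroup()

An ID with an even split keeps both of its rows here, which you can then inspect or resolve by some other rule.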

As a general solution, you might consider using k-means as an automatic way to split the data into groups based on similarity between IDs.
First, convert the data into wide format so that each ID gets one row. Then feed that into the base kmeans function to get the clustering output as a list, and use purrr::pluck to extract just the cluster-assignment part of that list.
library(tidyverse)

mydf_wide <- mydf %>%
  mutate(x = 1) %>%
  spread(Date, x, fill = 0)
mydf_wide
# ID 2012-01-01 2012-01-02 2012-01-03 2012-01-04
#1 1 1 0 1 0
#2 2 1 0 1 0
#3 3 1 0 1 0
#4 4 1 0 1 0
#5 5 1 1 1 0
#6 6 0 1 0 1
#7 7 0 1 0 1
#8 8 0 1 0 1
#9 9 0 1 0 1
#10 10 0 0 0 1
clusters <- mydf_wide %>%
  kmeans(centers = 2) %>%
  pluck("cluster")
clusters
# [1] 2 2 2 2 2 1 1 1 1 1
Here's what that looks like if you add those clusters to the original data and plot:
mydf_wide %>%
  mutate(cluster = clusters) %>%
  # ggplot works better with long (tidy) data...
  gather(date, val, -ID, -cluster) %>%
  filter(val != 0) %>%
  arrange(cluster) %>%
  ggplot(aes(date, ID, color = as.factor(cluster))) +
  geom_point(size = 5) +
  scale_y_continuous(breaks = 1:10, minor_breaks = NULL) +
  scale_color_discrete(name = "cluster")
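Following up on the instability described in the question's edit (my own sketches, not part of the original answer): kmeans above also receives the raw ID column as a feature, which can pull the clusters toward ID order rather than shared dates, and its random starts can differ between runs. Dropping ID first, or switching to hierarchical clustering with a binary distance (which has no random initialisation), may behave more consistently:

# Sketch 1: kmeans on the date-indicator columns only
clusters2 <- mydf_wide %>%
  select(-ID) %>%
  kmeans(centers = 2) %>%
  pluck("cluster")

# Sketch 2: hierarchical clustering with a binary (Jaccard-style) distance,
# which compares IDs only by which dates they share
d <- dist(select(mydf_wide, -ID), method = "binary")
clusters3 <- cutree(hclust(d), k = 2)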

Related

How to use the lag function correctly in r dplyr?

I get the incorrect output below for the last cell in column reSeq when running the R/dplyr code immediately beneath it. The code produces a value of 8 in that last cell of column reSeq when, via the lag() function in the code, it should instead produce a 7. What is wrong with my use of the lag() function? Also see the image at the bottom, which better explains what I am trying to do.
Element Group eleCnt reSeq
<chr> <dbl> <int> <int>
1 R 0 1 1
2 R 0 2 2
3 X 0 1 1
4 X 1 2 2
5 X 1 3 2
6 X 0 4 4
7 X 0 5 5
8 X 0 6 6
9 B 0 1 1
10 R 0 3 3
11 R 2 4 4
12 R 2 5 4
13 X 3 7 7
14 X 3 8 7
15 X 3 9 8
library(dplyr)

myDF <- data.frame(
  Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
  Group   = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)

myDF %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  mutate(reSeq = eleCnt) %>%
  mutate(reSeq = ifelse(
    Element == lag(Element) & Group == lag(Group) & Group > 0,
    lag(reSeq),
    eleCnt
  ))
The above is an attempted translation from Excel, as shown in the image below. I am new to R, migrating over from Excel, and am trying to replicate column D ("Target", highlighted in yellow, with the formula to the right). The image shows the correct output, including the desired 7 in cell D17 that I can't replicate with the above R code.
Breaking the derivation of "Target" down into two columns, Step1 and Step2 (highlighted in yellow and blue in the second image; Step2 is the same as Target in the first), is how I got the R code working, as shown in one of the solutions.
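For what it's worth, the root cause (my explanation, not part of the original post) is that dplyr's lag() operates on a whole column at once, so a value updated in one row is never visible to later rows the way a recalculated spreadsheet cell is. A minimal illustration:

library(dplyr)

# Excel-style row-by-row propagation would turn c(1, 2, 4) into c(1, 1, 1);
# dplyr gives c(1, 1, 2) because lag(out) is evaluated once against the full
# vector produced by out = v, so row 3 sees the original row-2 value.
tibble(v = c(1, 2, 4)) %>%
  mutate(out = v, out = coalesce(lag(out), out))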
The below code works. I broke the Excel "Target" calculation into the two steps shown in the second image in the OP in order to reflect the step-wise R solution.
library(dplyr)
library(tidyr)

myDF <- data.frame(
  Element = c("R","R","X","X","X","X","X","X","B","R","R","R","X","X","X"),
  Group   = c(0,0,0,1,1,0,0,0,0,0,2,2,3,3,3)
)

myDF %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt, 0)) %>%
  mutate(reSeq = na_if(reSeq, 0)) %>%
  group_by(Element) %>%
  fill(reSeq) %>%
  ungroup()

How to visualize the data if one participant has multiple entries in different rows?

I am currently working on a dataset consisting of multiple participants. Some participants attended all followups, whereas others skipped some.
For example, in the dataset below, participant 2 only attended the 3rd followup, and participant 3 only attended the 2nd and the 3rd. You can also see that some participants have more than one row of entry because they had several followups.
The original dataset only has the 1st and the 2nd column. Since I am aiming to create a progress chart like the one pictured, I have tried to create extra columns for each visit using the code below:
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

df[, 3] <- as.integer(df$visit == 1)
df[, 4] <- as.integer(df$visit == 2)
df[, 5] <- as.integer(df$visit == 3)
colnames(df)[colnames(df) %in% c("V3", "V4", "V5")] <- c("Visit1", "Visit2", "Visit3")
However, I am still having a hard time combining the rows of the same participant, and hence cannot proceed to making the chart (which I also have no clue about). I have tried the reshape function, but it did not work out; group_by alone also did not work and still showed the original dataset:

df1 <- df[, -2]
df1 %>%
  group_by(participant)
What functions should I use in this case for:
combining rows of the same participant?
how to produce the progress chart?
Thank you in advance for your help!
Based on your df you could produce the chart with
library(ggplot2)
library(dplyr)

df %>%
  ggplot(aes(x = as.factor(visit),
             y = as.factor(participant),
             fill = as.factor(visit))) +
  geom_tile(aes(width = 0.7, height = 0.7), color = "black") +
  scale_fill_grey() +
  xlab("Visit") +
  ylab("Participants") +
  guides(fill = "none")
If you need your data.frame in a wide format (similar to the image shown but with only one row per participant), use
library(tidyr)
library(dplyr)

df %>%
  mutate(value = 1) %>%
  pivot_wider(
    names_from = visit,
    values_from = value,
    names_glue = "Visit{visit}",
    values_fill = 0
  )
to get
# A tibble: 5 x 4
participant Visit1 Visit2 Visit3
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
I think you are looking for a way to dummify a variable.
There are several ways to do that.
I like the fastDummies package. You can use dummy_cols() with remove_selected_columns = TRUE:
df %>%
  fastDummies::dummy_cols(select_columns = 'visit',
                          remove_selected_columns = TRUE)
participant visit_1 visit_2 visit_3
1 1 1 0 0
2 1 0 1 0
3 1 0 0 1
4 2 0 0 1
5 3 0 1 0
6 3 0 0 1
7 4 1 0 0
8 5 1 0 0
9 5 0 1 0
10 5 0 0 1
You may want to pipe in a summarise operation to make the table even cleaner, as in:
df %>%
  fastDummies::dummy_cols(select_columns = 'visit',
                          remove_selected_columns = TRUE) %>%
  group_by(participant) %>%
  summarise(across(starts_with('visit'), max))
# A tibble: 5 x 4
participant visit_1 visit_2 visit_3
<dbl> <int> <int> <int>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
In a certain way, this looks a bit like a pivoting operation too.
You may be interested in using tidyr::pivot_wider here too.
EDIT: @MartinGal had just given a similar answer, so I removed my very similar pivot_wider version.

How to count observations between two rows based on condition in R?

I am trying to create a variable in a data frame that counts the number of observations between two observations meeting a criterion. Here, it counts the number of games since the player's last win.
Say I have a dataframe like this:
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
                 win    = c(1,0,0,0,1,1,0,1,0,0,1))
I want to create a new variable that counts the number of games it has been since the player has won.
Summarized as a vector, the result should be (with NA for the first observation):
c(NA,0,1,2,3,0,0,1,0,1,2)
I want to be able to do this easily, creating it as a variable in the data.frame using dplyr (or any other suitable approach).
I am not quite sure why the first value should be NA: the elapsed time there is 0 games since the last win, not NA.
For purely logical reasons, I would take the following approach:
seq <- with(df, ave(win, cumsum(win == 1), FUN = seq_along) - 1)
This gives the cumulative count of games since the last win:
c(0,1,2,3,0,0,1,0,1,2,0)
But if you still want exactly your described result, a little extra data handling achieves it:
append(NA, seq[1:(length(seq) - 1)])
It is not nice, but it works ;)
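A dplyr version of the same idea (my sketch, not from the original answer), which also produces the NA in the first slot directly:

library(dplyr)

df %>%
  group_by(streak = cumsum(win == 1)) %>%   # a new streak starts at each win
  mutate(counter = row_number() - 1) %>%    # games since that win, 0-based
  ungroup() %>%
  mutate(counter = lag(counter)) %>%        # shift so the series starts with NA
  select(-streak)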
With {tidyverse}, try:
library(tidyverse)

df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
                 win    = c(1,0,0,0,1,1,0,1,0,0,1))

df %>%
  group_by(player, group = cumsum(win != lag(win, default = first(win)))) %>%
  mutate(counter = row_number(),
         counter = if_else(win == 1, true = 0L, false = counter)) %>%
  ungroup() %>%
  group_by(player) %>%
  mutate(counter = if_else(row_number() == 1, true = NA_integer_, false = counter)) %>%
  ungroup() %>%
  select(-group)
player win counter
<dbl> <dbl> <int>
1 10 1 NA
2 10 0 1
3 10 0 2
4 10 0 3
5 10 1 0
6 10 1 0
7 10 0 1
8 10 1 0
9 10 0 1
10 10 0 2
11 10 1 0

How to find the first occurrence of a negative value for each factor

I am working with weather data and trying to find the first time a temperature is negative for each winter season. I have a data frame with a column for the winter season (1,2,3,etc.), the temperature, and the ID.
I can get the first time the temperature is negative with this code:
FirstNegative <- min(which(df$temp <= 0))
but it only returns the first value, and not one for each season.
I know I somehow need to group_by season, but how do I incorporate this?
For example,
season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- data.frame(season, temp, ID)  # data.frame() rather than cbind(), which would return a matrix
Ideally, I want a table that looks like this from the above dummy code:
season id_firstnegative
[1,] 1 2
[2,] 2 4
[3,] 3 8
[4,] 4 10
[5,] 5 13
A base R option using subset and aggregate
aggregate(ID ~ season, subset(df, temp < 0), head, 1)
# season ID
#1 1 2
#2 2 4
#3 3 8
#4 4 10
#5 5 13
library(dplyr)

season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- as.data.frame(cbind(season, temp, ID))

df %>%
  dplyr::filter(temp < 0) %>%
  group_by(season) %>%
  dplyr::filter(row_number() == 1) %>%
  ungroup()
As you said, I believe you could solve this by simply grouping on season and examining the first ID with a temperature below zero within that grouping. However, the ordering of your data is important, so ensure that each season is ordered correctly before using this possible solution.
library(dplyr)
library(tibble)

season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- tibble(season, temp, ID)

df <- df %>%
  group_by(season) %>%
  mutate(firstNeg = ID[which(temp < 0)][1]) %>%
  distinct(season, firstNeg)  # combine only unique values of these columns for reduced output
This will provide output like:
# A tibble: 5 x 2
# Groups: season [5]
season firstNeg
<dbl> <dbl>
1 1 2
2 2 4
3 3 8
4 4 10
5 5 13

tidyr::gather() %>% mutate() %>% spread() returns NA's unexpectedly

My ultimate goal is to run a series of chisq.tests on this data, comparing the values of 'dealer', 'store' and 'transport' by 'gender'. I'm using spread and gather to create a column for 'female' and one for 'male', then planned to use group_by and map to run chisq.test by each group of 'key' (the column created in my gather call). I'm doing something wrong, because I'm getting grouped NAs back.
The code below reproduces my dilemma.

set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'), 100, T),
                  dealer = sample(1:5, 100, T),
                  store = sample(1:5, 100, T),
                  transport = sample(1:5, 100, T))

df_ %>%
  gather(key, value, -gender) %>%
  mutate(id = 1:nrow(.)) %>%
  spread(gender, value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'), 50, T),
           male = sample(1:5, 50, T),
           female = sample(1:5, 50, T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)

df_ %>%
  gather(key, value, -gender) %>%
  group_by(gender) %>%
  mutate(id = row_number()) %>%
  spread(gender, value)
NOTE: substituting row_number() with 1:nrow(.) will fail because of the grouping. That expression takes the sequence of the whole data frame (rather than a sequence for each group) and tries to assign it to each group, hence the length error:
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you instead write ... %>% mutate(id = 1:length(key)), it will be fine.
The result with both (row_number() and 1:length(key)) is:
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
@elliot, while @Sotos has given a great answer to the challenge you were having with the tidyverse, I'm a bit confused about why you're going through all that extra effort. Your stated ultimate goal was to run chisq.test for gender against each of the others (dealer, store and transport), and your original dataset doesn't need any modification to do that!
require(tidyverse)

set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'), 100, T),
                       dealer = sample(1:5, 100, T),
                       store = sample(1:5, 100, T),
                       transport = sample(1:5, 100, T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
It can be used exactly as it stands! You may have other reasons to want to change the data, but it is tidy as-is, representing one case or person per row.
Edited (January 16th): To achieve your stated ultimate goal, you just have to:
require(dplyr)
require(broom)

allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y)))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package, which will do chi-square independence (association) tests and provide much more informative output. Also note that, from a statistical perspective, you are running very many tests and should correct your confidence level appropriately; see for example http://rpubs.com/ibecav/290361
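For completeness, a sketch of the lsr route (hedged: this assumes lsr::associationTest's formula interface, and that the variables are supplied as factors; check ?associationTest before relying on it):

library(lsr)

# Assumption: associationTest(~ x + y, data) runs a chi-square test of
# association and reports an effect size alongside the test statistic.
yd <- as.data.frame(lapply(yourdata, factor))  # convert columns to factors first
associationTest(~ gender + dealer, data = yd)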
