Since computation limits make it hard for me to solve another puzzle posted on this site, I am trying something new as a substitute, which should work. I have two large datasets, one with monthly firm data and one with monthly bank data:
Data_firm <- data.frame(
Firm = c("A", "A", "B"),
time = c("1", "3", "2"),
postal= c("20", "20", "67")
)
Data_bank <- data.frame(
Bank = c("AB", "AC","BD"),
Postal_bank = c("20", "67","20"),
entry = c("1","1","2"),
exit = c("6","5","7"))
I need to add a column with the number of banks present in the same department as the firm at each point in time, accounting for the fact that some banks enter and some exit (the entry and exit variables in Data_bank). In theory, I should get the following column for the example above:
nbbankindepartment = c(1,2,1)
I have tried the following:
for (row in 1:nrow(Data_firm)) {
  dep <- Data_firm[row, "postal"]
  time <- Data_firm[row, "time"]
  count <- sum(Data_bank$Postal_bank == dep & Data_bank$entry <= time & Data_bank$exit > time, na.rm = TRUE)
  Data_firm[row, "nbbankindepartment"] <- count
}
But I only get 0s as a result. Does anybody have a solution? What am I not doing right?
Thanks in advance,
Here is my best guess for your problem:
library(dplyr)
Data_firm %>%
  left_join(Data_bank, by = c("postal" = "Postal_bank")) %>%
  filter(entry <= time, exit > time) %>%
  group_by(Firm, time, postal) %>%
  summarise(nb_bank_in_department = n(), .groups = "drop")
This returns
# A tibble: 3 x 4
Firm time postal nb_bank_in_department
<chr> <chr> <chr> <int>
1 A 1 20 1
2 A 3 20 2
3 B 2 67 1
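A note on why the loop returns only 0s: if the data frames were created with stringsAsFactors = TRUE (the default before R 4.0), then entry, exit and time are factors, and <= / > on factors return NA, so after na.rm = TRUE every count is 0. Even with plain character columns the comparisons are lexicographic ("10" <= "9" is TRUE), which breaks for multi-digit values. Here is a sketch of the loop with an explicit numeric conversion, under those assumptions:
# Assumption: the 0s come from factor/character comparisons in the real data,
# so convert the relevant columns to numeric before comparing.
Data_firm$time  <- as.numeric(as.character(Data_firm$time))
Data_bank$entry <- as.numeric(as.character(Data_bank$entry))
Data_bank$exit  <- as.numeric(as.character(Data_bank$exit))

for (row in 1:nrow(Data_firm)) {
  dep  <- Data_firm[row, "postal"]
  time <- Data_firm[row, "time"]
  Data_firm[row, "nbbankindepartment"] <- sum(
    Data_bank$Postal_bank == dep &
      Data_bank$entry <= time &
      Data_bank$exit > time,
    na.rm = TRUE
  )
}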
I have the following example dataset for a biology project.
I want to compute the rate of growth of Number between 4 January and 2 January,
that is rate = (number_at_0104 - number_at_0102) / number_at_0102
(in tidyverse if possible).
a <- c("Date", "Specie", "Number")
b <- c("2020-01-01", "Dog", "3")
c <- c("2020-01-02", "Dog", "4")
d <- c("2020-01-03", "Dog", "5")
e <- c("2020-01-04", "Dog", "6")
f <- c("2020-01-01", "Cat", "3")
g <- c("2020-01-02", "Cat", "7")
h <- c("2020-01-03", "Cat", "8")
i <- c("2020-01-04", "Cat", "10")
df <- as.data.frame(rbind(b, c, d, e, f, g, h, i))
names(df) <- a
df$Date <- as.Date(df$Date)
df$Number <- as.integer(df$Number)
I want to calculate a rate of growth. I know this has been treated already, but I'm not sure whether I can apply it here.
Usually we use the lag() function, but I have some questions.
Can we tell the lag function which lag to use (e.g. not the previous period but 4 periods before)?
My dataset is much bigger, and for some species (say cat) I want to compute the rate of growth between 20 February and 3 March, and for others (say dog) between 5 May and 4 April. How can I do it?
Thank you in advance,
Using dplyr (plus %between%, which comes from data.table):
library(dplyr)
library(data.table)  # for %between%
start <- as.Date("2020-01-02")
end <- as.Date("2020-01-04")
df %>%
  filter(Date %between% c(start, end)) %>%
  arrange(Date, Species) %>%
  group_by(Species) %>%
  summarise(Growth = (last(Number) - first(Number)) / first(Number))
output
Species Growth
<fct> <dbl>
1 Cat 0.25
2 Dog 0.5
data
Note that my test data already has Date stored as dates and Number as numeric:
df <- data.frame(
Date = rep(seq.Date(as.Date("2020-01-01"), as.Date("2020-01-04"), "days"), 2),
Species = c(rep("Dog", 4), rep("Cat", 4)),
Number = 3:10
)
If you want different lookups for each species you can do something like the following. Define your lookups, and the output will be a tibble with Species, Growth, and the period over which it was computed.
lookups <- list(
c("Species" = "Dog", "start" = "2020-01-01", "end" = "2020-01-04"),
c("Species" = "Cat", "start" = "2020-01-02", "end" = "2020-01-04")
)
bind_rows(lapply(lookups, function(species) {
  df %>%
    filter(Species == species["Species"] &
             Date %between% as.Date(c(species["start"], species["end"]))) %>%
    arrange(Date, Species) %>%
    group_by(Species) %>%
    summarise(
      Growth = (last(Number) - first(Number)) / first(Number),
      Start = species["start"],
      End = species["end"]
    )
}))
# # A tibble: 2 x 4
# Species Growth Start End
# <chr> <dbl> <chr> <chr>
# 1 Dog 1 2020-01-01 2020-01-04
# 2 Cat 0.25 2020-01-02 2020-01-04
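If you prefer to avoid the data.table dependency for %between%, the window filter can be written with plain comparisons instead; a sketch on the same df and start/end dates:
df %>%
  filter(Date >= start, Date <= end) %>%   # same window as %between%
  arrange(Date, Species) %>%
  group_by(Species) %>%
  summarise(Growth = (last(Number) - first(Number)) / first(Number))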
Since you want different input for different species and different start and end dates, I have created a function to do the job.
Function compute_rate_in_species()
x is a data frame
species is Species in x, which can be supplied as a vector (e.g. c("Dog", "Cat"))
start_date and end_date are self-explanatory
DF is a logical value indicating whether you want a data frame as output. If FALSE, a named vector will be the output.
If you would like different start_date and end_date values for different species, you'll have to run the function separately with your desired input.
library(tidyverse)
compute_rate_in_species <- function(x, species, start_date, end_date, DF = TRUE) {
  x <- x %>%
    filter(Species %in% species & Date %in% as.Date(c(start_date, end_date))) %>%
    group_by(Species) %>%
    summarize(Rate = (last(Number) - first(Number)) / first(Number))
  if (DF == TRUE) {
    return(x)
  } else {
    x_vec <- setNames(x$Rate, x$Species)
    return(x_vec)
  }
}
Output
# DF = F
compute_rate_in_species(df, c("Dog", "Cat"), "2020-01-01", "2020-01-03", DF = F)
Cat Dog
1.6666667 0.6666667
# DF = T
compute_rate_in_species(df, c("Dog", "Cat"), "2020-01-01", "2020-01-03", DF = T)
# A tibble: 2 x 2
Species Rate
<chr> <dbl>
1 Cat 1.67
2 Dog 0.667
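Note that this function assumes the species column is named Species; with the data frame built in the question (where the column is Specie), you would rename it first, e.g. df <- df %>% rename(Species = Specie).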
You can use -
library(dplyr)
start_date <- as.Date("2020-01-02")
end_date <- as.Date("2020-01-04")
df %>%
  group_by(Specie) %>%
  summarise(growth_rate = (Number[match(end_date, Date)] -
                             Number[match(start_date, Date)]) /
                            Number[match(start_date, Date)])
# Specie growth_rate
# <chr> <dbl>
#1 Cat 0.429
#2 Dog 0.5
You can replace start_date and end_date with the dates of your choice.
Or a slightly more verbose but clearer version would be -
df %>%
  group_by(Specie) %>%
  summarise(num_end = Number[match(end_date, Date)],
            num_start = Number[match(start_date, Date)],
            growth_rate = (num_end - num_start) / num_start)
# Specie num_end num_start growth_rate
# <chr> <int> <int> <dbl>
#1 Cat 10 7 0.429
#2 Dog 6 4 0.5
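On the first sub-question: dplyr::lag() has an n argument, so you can look back a fixed number of rows rather than only one. A sketch on the data from the question (here 3 rows back, so 4 January is compared with 1 January; set n to whatever offset you need):
library(dplyr)
df %>%
  group_by(Specie) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(growth = (Number - lag(Number, n = 3)) / lag(Number, n = 3))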
I have a question related to filtering on dates in R. I found e.g. this link,
dplyr filter on Date, which answers how to filter a specific date range with dplyr. I would like to select a dynamic range, e.g. calculate the number of critical jobs in a specific window, say the last seven days before each row's date in the dataset. The code I have in mind would look something like this:
my.data %>%
group_by(category) %>%
filter(date > date - days(7) & date <= date) %>%
mutate(ncrit = sum(critical == 'yes'))
This is not working properly. Is there a way to get this running with dplyr?
Edit:
Apologies for the unclear post. To complete the post, first the idea: imagine computers running jobs. If a computer failed to compute jobs in the past x days, it is more likely that it also fails on the current job. A dummy dataset includes the computer category (e.g. A/B), the date, and failure (yes/no).
Using the dataset from Rui Barradas, I would like to add with dplyr the following column, "number of critical jobs in past 3 days" (in this case x = 3):
head(my.data, 7)
category date critical number of critical jobs in past 3 days
1 A 2018-08-14 yes NA
2 A 2018-08-15 no NA
3 A 2018-08-16 yes NA
4 A 2018-08-17 no 2
5 A 2018-08-18 yes 1
6 A 2018-08-19 no 2
7 A 2018-08-20 yes 1
Data (Rui Barradas):
set.seed(3635)
my.data <- data.frame(category = rep(c('A', 'B'), each = 10),
                      date = rep(seq(Sys.Date() - 9, Sys.Date(), by = 'days'), 2),
                      critical = sample(c('no', 'yes'), 20, TRUE))
Without an example dataset it's not very easy to say, but given your description of the problem I believe the following is on the right track.
The code uses the function rollapplyr from package zoo, inspired not by the accepted answer but by the second answer to this question.
library(zoo)
library(dplyr)
sumCrit <- function(DF, crit = "yes", window = 3) {
  DF %>%
    group_by(category) %>%
    # list(-seq(window)) tells rollapplyr to look only at the `window` previous rows
    mutate(ncrit = rollapplyr(critical == crit, list(-seq(window)), sum, fill = NA))
}
result <- sumCrit(my.data)
result <- sumCrit(my.data)
head(result, 7)
## A tibble: 7 x 4
## Groups: category [1]
# category date critical ncrit
# <fct> <date> <fct> <int>
#1 A 2018-08-14 yes NA
#2 A 2018-08-15 no NA
#3 A 2018-08-16 yes NA
#4 A 2018-08-17 no 2
#5 A 2018-08-18 yes 1
#6 A 2018-08-19 no 2
#7 A 2018-08-20 yes 1
Data.
This is a made up dataset meant to test the code above.
set.seed(3635) # Make the results reproducible
my.data <- data.frame(category = rep(c("A", "B"), each = 10),
date = rep(seq(Sys.Date() - 9, Sys.Date(), by = "days"), 2),
critical = sample(c("no", "yes"), 20, TRUE))
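If you would rather avoid zoo, the same kind of rolling count can be written with a grouped sapply over the dates; a sketch on the same my.data (it counts critical jobs in the 3 days before each row's date, and unlike the zoo version it returns a count rather than NA for the first days):
library(dplyr)

my.data %>%
  group_by(category) %>%
  mutate(ncrit = sapply(seq_along(date), function(i)
    # critical jobs of this category in the 3 days before date[i]
    sum(critical[date >= date[i] - 3 & date < date[i]] == "yes"))) %>%
  ungroup()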
Data generation
DATE1 <- as.Date("2018-08-23")
DATE2 <- as.Date("2018-07-23")
# creating a date range with the start and end date:
dates <- seq(DATE2, DATE1, by = "days")
dt <- data.frame(category = sample(1:6, 32, replace = TRUE), deadline = dates)
Filter the dates
library("tidyverse")
dt %>%
  group_by(category) %>%
  filter(deadline %in% seq(Sys.Date() - 7, Sys.Date(), by = "days"))
Using the dataset that Rui Barradas created, here is a lubridate formulation using intervals.
set.seed(3635) # Make the results reproducible
my.data <- data.frame(category = rep(c("A", "B"), each = 10),
date = rep(seq(Sys.Date() - 9, Sys.Date(), by = "days"), 2),
critical = sample(c("no", "yes"), 20, TRUE))
library(lubridate) #use lubridate to create intervals
INT_check<-interval(Sys.Date()-7,Sys.Date()) # 7 days from today
my.data %>%
  filter(date %within% INT_check) %>%
  group_by(category) %>%
  summarise(ncrit = sum(critical == 'yes'))
You can also specify INT_check as
INT_check<-interval("2018-08-16","2018-08-18") # if you want to use absolute dates
INT_check<-interval("2018-08-16",Sys.Date()) # if you want to specify just absolute start date
To give some context, I have a dataframe of eyetracking data from a psychology experiment and I want to count the switches between two Areas Of Interest (AOI), for each participant.
Here's a simplified dataframe of the problem (we assume that AOI2 == !AOI1 so we don't need it):
library(tidyverse)
df <- tibble(Participant = rep(1:7, times = 1, each = 10),
Time = rep(1:10, 7),
AOI1 = rbinom(70, 1, .5))
What I want is to count how many times the value of AOI1 changes over time for each participant. I could do it using for loops like below, but I was wondering if there is a simpler and more R way of doing it?
df.switches <- tibble(Participant = 1:7,
                      Switches = NA)
for (p in 1:7) {
  s <- 0
  for (i in 2:10) {
    if (subset(df, Participant == p & Time == i, select = AOI1) !=
        subset(df, Participant == p & Time == i - 1, select = AOI1)) {
      s <- s + 1
    }
  }
  df.switches <- df.switches %>%
    mutate(Switches = ifelse(Participant == p, s, Switches))
}
One option is to use dplyr::lag to compare each value with the previous row in order to count the number of switches for each participant.
library(tidyverse)
df %>% group_by(Participant) %>%
summarise(count = sum(AOI1 != lag(AOI1, default = -Inf)))
# # A tibble: 7 x 2
# Participant count
# <int> <int>
# 1 1 5
# 2 2 4
# 3 3 5
# 4 4 4
# 5 5 6
# 6 6 6
# 7 7 4
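Note that with default = -Inf the first observation of each participant is always counted as one change (it differs from -Inf), so this is the number of switches plus one per participant; if you only want changes between consecutive observations, use sum(AOI1 != lag(AOI1), na.rm = TRUE) instead.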
Since you are already using the tidyverse, you can use lag available as part of dplyr. This checks whether the value of AOI1 is the same as the previous value, and if not, sets a flag to 1. For the first record of each participant, the value is automatically set to NA. Note that the group_by is required, otherwise the flag won't get "reset" every time a new participant is encountered. Also it is assumed that the data is sorted by Participant and Time; if not, pipe arrange(Participant, Time) before the group_by.
df <- tibble(Participant = rep(1:7, times = 1, each = 10),
Time = rep(1:10, 7),
AOI1 = rbinom(70, 1, .5))
df2 <- df %>%
group_by(Participant) %>%
mutate(switch = ifelse(AOI1 != lag(AOI1), 1, 0)) %>%
summarise(num_switches = sum(switch, na.rm = TRUE))
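For completeness, the same per-participant count can also be done in base R with diff(), assuming rows are already ordered by Time within Participant (as in the generated df):
# diff(AOI1) is non-zero exactly where AOI1 differs from the previous observation
aggregate(AOI1 ~ Participant, data = df, FUN = function(x) sum(diff(x) != 0))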
I have a dataset of unique matches like this. Each row is a match with its result.
date <- c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01')
team1 <- c('A','B','B','C','D','A','B')
team1_score <- c(1,0,4,3,5,6,7)
team2 <- c('B','A','A','B','C','C','A')
team2_score <- c(0,1,5,4,6,9,10)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
I want to create 2 new columns, the forms for team 1 and team 2. The result of a match is determined by which team has the larger score, or a draw. The form would be the team's results in its last 2 matches, so I know the form of team1 and team2 going into a match. For example, for the first 3 rows, the forms of team 1 and team 2 respectively are given below. There will be times when a team does not yet have 2 previous matches, so a result of NULL is fine there.
Form1: W-W, L-W, W-L
Form2: L-L, W-L, L-W
In the actual dataset there are a lot more than just 4 unique teams. I have been thinking about it but can't come up with a good way to create these 2 variables.
Here is my solution:
library(tidyverse)
date <- as.Date(c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01', '2017/05/30'))
team1 <- c('A','B','B','C','D','A','B','A')
team1_score <- c(1,0,4,3,5,6,7,0)
team2 <- c('B','A','A','B','C','C','A','D')
team2_score <- c(0,1,5,4,6,9,10,0)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
## 1. Create a unique identifier for each match. It assumes that teams can only play each other once a day.
matches$UID <- paste(matches$date, matches$team1, matches$team2, sep = "-")
## 2. Create a Score Difference Variable reflecting team1's score
matches <- matches %>% mutate(score_dif_team1 = team1_score - team2_score)
## 3. Create a Result (WDL) reflecting team1's results
matches <- matches %>% mutate(results_team1 = if_else(score_dif_team1 < 0, true = "L", false = if_else(score_dif_team1 > 0, true = "W", false = "D")))
## 4. Cosmetic step: Reorder variables for easier comparison across variables
matches <- matches %>% select(UID, date:results_team1)
## 5. Reshape the table into a long format based on the teams. Each observation will now reflect the results of 1 team within a match. Each game will have two observations.
matches <- matches %>% gather(key = old_team_var, value = team, team1, team2)
## 6. Establishes a common results variable for each observation. It essentially inverts the results_team1 variable for team2, and keeps it identical for team1
matches <- matches %>%
  mutate(results = if_else(old_team_var == "team2",
                           true = if_else(results_team1 == "W",
                                          true = "L",
                                          false = if_else(results_team1 == "L",
                                                          true = "W",
                                                          false = "D")),
                           false = results_team1))
## Final step: filter the matches table by the dates you are interested in, then reshape to get a wide data frame of D/L/W counts per team.
Results_table <- matches %>%
  filter(date <= as.Date("2017-12-01")) %>%
  group_by(team, results) %>%
  summarise(cases = n()) %>%
  spread(key = results, value = cases, fill = 0)
## Results:
# A tibble: 4 x 4
# Groups: team [4]
team D L W
* <chr> <dbl> <dbl> <dbl>
1 A 1 1 4
2 B 0 4 1
3 C 0 1 2
4 D 1 1 0
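If you also want the rolling form the question asks for (the result of each team's previous two matches going into a game), the long-format matches table from steps 5-6 makes that straightforward. A sketch, ordering the most recent match first as in the example forms (NA appears where a team has fewer than two previous matches):
form_tbl <- matches %>%
  arrange(team, date) %>%
  group_by(team) %>%
  mutate(form = paste(lag(results, 1), lag(results, 2), sep = "-")) %>%  # most recent first
  ungroup() %>%
  select(UID, team, form)
You can then join form_tbl (by UID and team) to a copy of the wide table kept from before step 5 to obtain Form1 and Form2 for each match.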
Each day I have a new csv file with ids and some variables. The ids can differ from day to day. I would like to take the ids of one day and follow how a variable evolves over time.
My goal is to create an area plot like this:
For example, I take all the ids on 31 March; each day I join with those ids and count them grouped by the variable "code". If ids are missing (present on 31 March but not on day D), their code becomes "NA", to show how many ids I "lose" over time. I hope I'm clear enough.
Here is how I compute this kind of plot (my real data look like li, not like datas):
library(plyr)
library(dplyr)
datas <- data.frame(id1 = c("x", "y", "x", "y", "z", "x", "z"),
id2 = c("x2", "y2", "x2", "y2", "z2", "x2", "z2"),
code = c("code1", "code2", "code1", "code2", "code2", "code1", "code2"),
var = runif(7),
date = do.call(c, mapply(rep, seq(Sys.Date() - 2, Sys.Date(), by = 1), c(2, 3, 2))))
li <- split(datas, datas$date)
dateStart <- Sys.Date() - 2
dateEnd <- Sys.Date()
# A "filter" if I want to start with another date than the date min or end with another date than the max date
li <- li[as.Date(names(li)) >= dateStart & as.Date(names(li)) <= dateEnd]
dfCounts <- ldply(li, function(x)
  left_join(li[[1]], x, by = c("id1", "id2")) %>%
    group_by(code.y) %>%
    count(code = code.y) %>%
    mutate(freq = n / sum(n),
           code = ifelse(is.na(code), "NA", code)),
  .id = "date")
> dfCounts
date code n freq
1 2015-07-04 1 1 0.5
2 2015-07-04 2 1 0.5
3 2015-07-05 1 1 0.5
4 2015-07-05 2 1 0.5
5 2015-07-06 1 1 0.5
6 2015-07-06 NA 1 0.5
library(ggplot2)
dfCounts %>%
  ggplot(aes(date, freq)) +
  geom_area(aes(fill = code), position = "stack")
# I have no idea why nothing is shown in the plot for this example, but it works on my real data
So it works, but if I want to observe a longer period, I have to join over many days (files) and it can be slow. Do you have any ideas for doing the same thing without joins, using the stacked data (the object datas, not li) with dplyr or data.table?
In your opinion, which approach is better?
Thanks!
(Sorry for the title, I couldn't find a better one...)
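Not a full answer, but here is a sketch of a join-free version on the stacked datas object, under the assumptions that id1 and id2 together identify an id and that the "NA" code should count the start-date ids not seen on a given day:
library(dplyr)

# keys of the ids observed on the start date
key0 <- unique(with(datas[datas$date == dateStart, ], paste(id1, id2)))

per_day <- datas %>%
  filter(date >= dateStart, date <= dateEnd,
         paste(id1, id2) %in% key0) %>%     # keep only the start-date ids
  count(date, code) %>%
  mutate(code = as.character(code))

lost <- per_day %>%                          # start-date ids not seen that day
  group_by(date) %>%
  summarise(code = "NA", n = length(key0) - sum(n), .groups = "drop")

dfCounts2 <- bind_rows(per_day, lost) %>%
  group_by(date) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()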