I have a dataframe that contains information for various countries, days and variables. I have observations for one of those variables only. A simple working example would look like this:
df <- data.frame(country=c("NL","NL","NL","NL","BE","BE","BE","BE"),
day=c("Monday","Monday","Tuesday","Tuesday","Monday","Monday","Tuesday","Tuesday"),
variable=c("A","B","A","B","A","B","A","B"),
value=c(8,NA,13,NA,12,NA,9,NA))
> df
country day variable value
1 NL Monday A 8
2 NL Monday B NA
3 NL Tuesday A 13
4 NL Tuesday B NA
5 BE Monday A 12
6 BE Monday B NA
7 BE Tuesday A 9
8 BE Tuesday B NA
I want to copy those observations over to the other variable, as long as country and day are identical. The end result would look like this:
> df
country day variable value
1 NL Monday A 8
2 NL Monday B 8
3 NL Tuesday A 13
4 NL Tuesday B 13
5 BE Monday A 12
6 BE Monday B 12
7 BE Tuesday A 9
8 BE Tuesday B 9
The actual dataframe is quite large and I would like to avoid having to build loops. A solution using pipes would be preferable.
Perhaps you could just do:
library(dplyr)
df %>%
group_by(country, day) %>%
mutate(value = value[!is.na(value)])
Output:
# A tibble: 8 x 4
# Groups: country, day [4]
country day variable value
<fct> <fct> <fct> <dbl>
1 NL Monday A 8
2 NL Monday B 8
3 NL Tuesday A 13
4 NL Tuesday B 13
5 BE Monday A 12
6 BE Monday B 12
7 BE Tuesday A 9
8 BE Tuesday B 9
Another way would be via fill(), though this is probably unnecessary here. If you do need it, consider using mutate(value = zoo::na.locf(value)) as the last line instead, since fill() itself is quite slow:
library(tidyverse)
df %>%
group_by(country, day) %>%
arrange(country, day, value) %>%
fill(value)
With data.table, we can do
library(data.table)
setDT(df)[, value := na.omit(value), .(country, day)]
Or using na.locf
library(zoo)
setDT(df)[, value := na.locf0(value), .(country, day)]
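For readers who want to avoid extra packages, the same group-wise copy can be sketched in base R with ave(). This is only an illustration, not one of the answers above: the helper name fill_first is hypothetical, and it assumes each country/day group has at most one distinct observed value.

```r
df <- data.frame(
  country  = c("NL", "NL", "NL", "NL", "BE", "BE", "BE", "BE"),
  day      = c("Monday", "Monday", "Tuesday", "Tuesday",
               "Monday", "Monday", "Tuesday", "Tuesday"),
  variable = c("A", "B", "A", "B", "A", "B", "A", "B"),
  value    = c(8, NA, 13, NA, 12, NA, 9, NA)
)

# Hypothetical helper: repeat the first non-NA value across the whole group.
# Groups without any observation are returned unchanged.
fill_first <- function(x) {
  obs <- x[!is.na(x)]
  if (length(obs) == 0) return(x)
  rep(obs[1], length(x))
}

# ave() applies the helper within each country/day combination
df$value <- ave(df$value, df$country, df$day, FUN = fill_first)
df
```

Unlike value[!is.na(value)], this version does not fail when a group has no observation at all; it simply leaves that group as NA.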
I have a dataframe df as follows:
Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10
My output should be:
Date Value mon_diff
1-Jun-12 5 7
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12 8
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20 ...
2-Aug-12 10
For each month, I need to take the first value of the next month and subtract the first value of the current month from it: 12 - 5 = 7 for June, then 20 - 12 = 8 for July. Please note that there is no fixed number of rows per month, as different months have different numbers of days. Please help.
This approach is more robust, so it can be used even when there are entries for multiple years.
library(tidyverse)
library(lubridate)
df %>%
mutate(
Date = dmy(Date),
month_year=paste0(month(Date),'_',year(Date))) %>%
group_by(month_year) %>%
filter(Date==min(Date)) %>%
ungroup() %>%
mutate(mon_diff=lead(Value)-Value) %>%
select(-month_year) %>%
right_join(df %>% mutate(Date=dmy(Date)), by=c("Date", "Value")) %>%
arrange(Date)-> output_df
Output:
Date Value mon_diff
<date> <int> <int>
1 2012-06-01 5 7
2 2012-06-02 10 NA
3 2012-06-03 8 NA
4 2012-06-04 15 NA
5 2012-07-02 12 8
6 2012-07-03 6 NA
7 2012-07-04 14 NA
8 2012-08-01 20 NA
9 2012-08-02 10 NA
Data:
read.table(text='Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10',header=T)-> df
Using the data shown reproducibly in the Note at the end, convert the Date to yearmon (year and month with no day) giving ym and then for each row find the first element with the same year month and the first row with the next year month. Note that yearmon class represents year and month internally as year + fraction where fraction = 0, 1/12, ..., 11/12 so the next month is found by adding 1/12. Evaluate Value for those rows taking the difference. Finally NA out diff for those rows with duplicated year and month values. If you are using tidyverse then use the same code but with mutate replacing transform.
library(zoo)
ym <- as.yearmon(DF$Date, format = "%d-%b-%y")
DF |>
transform(diff = Value[match(ym + 1/12, ym)] - Value[match(ym, ym)]) |>
transform(diff = ifelse(duplicated(ym), NA, diff))
giving:
Date Value diff
1 1-Jun-12 5 7
2 2-Jun-12 10 NA
3 3-Jun-12 8 NA
4 4-Jun-12 15 NA
5 2-Jul-12 12 8
6 3-Jul-12 6 NA
7 4-Jul-12 14 NA
8 1-Aug-12 20 NA
9 2-Aug-12 10 NA
Note
Lines <- "Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10"
DF <- read.table(text = Lines, header = TRUE)
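If neither lubridate nor zoo is available, the idea can also be sketched in base R. This is a minimal sketch under two assumptions that are mine, not the answers': the rows are already sorted chronologically, and "next month" means the next month present in the data (as in the lead() answer), not necessarily the next calendar month. The key ym is built by plain string handling, so no locale-dependent date parsing is involved.

```r
Lines <- "Date Value
1-Jun-12 5
2-Jun-12 10
3-Jun-12 8
4-Jun-12 15
2-Jul-12 12
3-Jul-12 6
4-Jul-12 14
1-Aug-12 20
2-Aug-12 10"
DF <- read.table(text = Lines, header = TRUE)

# Month-year key such as "Jun-12": drop the leading day and hyphen
ym <- sub("^[0-9]+-", "", as.character(DF$Date))

# TRUE for the first row of each month (assumes chronological order)
first <- !duplicated(ym)

# Difference between consecutive first-of-month values; the last month has no successor
DF$mon_diff <- NA_real_
DF$mon_diff[first] <- c(diff(DF$Value[first]), NA)
DF
```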
I'm trying to count the number of occurrences in a single column. Here is a snippet of the df I'm working with:
Here is the code I have so far:
my_df$day <- weekdays(as.Date(my_df$deadline))
most_common_day <- my_df %>%
arrange(day) %>%
filter(day == "Friday") %>%
select(day)
So the main goal is to get which weekday is the most common. Any suggestions?
There are various ways to count the number of occurrences in R. The base R method is table():
table(my_df$day)
# Friday Monday Saturday Sunday Thursday Tuesday Wednesday
# 4 6 8 11 6 5 10
The dplyr approach can be with count():
count(my_df, day)
# day n
#1 Friday 4
#2 Monday 6
#3 Saturday 8
#4 Sunday 11
#5 Thursday 6
#6 Tuesday 5
#7 Wednesday 10
You can also use tally() from dplyr but you will also need group_by():
my_df %>% group_by(day) %>% tally
# day n
#1 Friday 4
#2 Monday 6
#3 Saturday 8
#4 Sunday 11
#5 Thursday 6
#6 Tuesday 5
#7 Wednesday 10
To get the most common day(s), you can do:
# when using table()
names(table(my_df$day))[table(my_df$day) == max(table(my_df$day))]
#[1] "Sunday"
# when using count()
count(my_df, day) %>% slice_max(n)
# day n
#1 Sunday 11
# when using tally()
my_df %>% group_by(day) %>% tally %>% slice_max(n)
## A tibble: 1 x 2
# day n
# <fct> <int>
#1 Sunday 11
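Since the original data is not shown, here is a small self-contained sketch of the table() route. The data frame below is a stand-in constructed only so that its counts match the ones printed above; it is not the asker's data.

```r
# Stand-in data: one row per occurrence, with the same counts as shown above
counts <- c(Friday = 4, Monday = 6, Saturday = 8, Sunday = 11,
            Thursday = 6, Tuesday = 5, Wednesday = 10)
my_df <- data.frame(day = rep(names(counts), times = counts))

# Tabulate once and reuse the result instead of calling table() three times
tab <- table(my_df$day)

# All days that share the maximum count (handles ties)
most_common <- names(tab)[as.vector(tab) == max(tab)]
most_common
```

Storing the table in a variable avoids recomputing it, which matters a little on large data and makes the expression easier to read.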
So here's the data:
DF1
ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday
I would like to join the following dictionary.
DF2
ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54
My goal is the total count of entries and the total hours worked for each ID. However, if a day appears twice for an ID, its hours must not be counted twice. (That's the hard part.)
Here's my attempt in R:
df3 <- df1 %>%
left_join(df2, by = c("DOW" ,"ID"))
df3 %>%
group_by(ID) %>%
summarize(count = n(),
sum = sum(Employee_Hrs)) %>%
mutate(injRate = count/sum)
This does not work: although it correctly counts the total number of entries for each ID, it adds Employee_Hrs once per joined row, so the hours for a day that was entered multiple times are summed multiple times.
End product should be:
ID count sum
1 3 41
2 2 55
3 3 120
Again: count every entry, but sum the hours without double counting.
Here is a base R option using merge + aggregate
u <- merge(df1, df2, by = c("ID", "DOW"))
res <- setNames(
merge(aggregate(DOW ~ ID, u, length),
aggregate(Hours ~ ID, unique(u), sum),
by = "ID"
),
c("ID", "Count", "Sum")
)
which gives
> res
ID Count Sum
1 1 3 41
2 2 2 55
3 3 3 120
An option with data.table
library(data.table)
setDT(df1)[df2, .(Count = .N, Hours), on = .(ID), by = .EACHI][,
.(Sum = sum(Hours)), .(ID, Count)]
# ID Count Sum
#1: 1 3 41
#2: 2 2 55
#3: 3 3 120
Another approach is to summarize the tables prior to joining them.
textFile1 <- "ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday"
textFile2 <- "ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54"
df1 <- read.table(text =textFile1,header=TRUE )
df2 <- read.table(text =textFile2,header=TRUE )
library(dplyr)
df1 %>% group_by(ID) %>%
summarise(count = n()) -> counts
df2 %>%
group_by(ID) %>%
summarize(sum = sum(Hours)) %>%
left_join(counts) %>%
mutate(injRate = count/sum)
...and the output:
# A tibble: 3 x 4
ID sum count injRate
<int> <int> <int> <dbl>
1 1 41 3 0.0732
2 2 55 2 0.0364
3 3 120 3 0.025
Try this solution where you compute the number of counts and then you filter to obtain final summary:
library(tidyverse)
#Data
df3 <- df1 %>%
left_join(df2, by = c("DOW" ,"ID"))
#Code
df3 %>%
group_by(ID) %>%
mutate(count=n()) %>%
filter(!duplicated(DOW)) %>%
summarise(count=unique(count),Sum=sum(Hours))
Output:
# A tibble: 3 x 3
ID count Sum
<int> <int> <int>
1 1 3 41
2 2 2 55
3 3 3 120
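To see why de-duplication is needed at all, the following base R sketch compares the naive sum over the joined rows with the sum over unique ID/DOW rows. The variable names naive and dedup are illustrative only.

```r
df1 <- data.frame(
  ID  = c(1, 1, 1, 2, 2, 3, 3, 3),
  DOW = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday",
          "Friday", "Monday", "Tuesday")
)
df2 <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 3),
  DOW   = c("Monday", "Tuesday", "Tuesday", "Wednesday",
            "Friday", "Monday", "Tuesday"),
  Hours = c(20, 21, 30, 25, 24, 42, 54)
)

joined <- merge(df1, df2, by = c("ID", "DOW"))

# Naive: every joined row contributes, so ID 1's Monday (20 hours) is added twice
naive <- tapply(joined$Hours, joined$ID, sum)

# De-duplicated: each ID/DOW pair contributes its hours only once
u <- unique(joined)
dedup <- tapply(u$Hours, u$ID, sum)

# Counts still come from the full joined table
count <- tapply(joined$DOW, joined$ID, length)

data.frame(ID = names(count), count = as.vector(count),
           naive = as.vector(naive), sum = as.vector(dedup))
```

Only ID 1 differs between the two sums (61 versus 41), because it is the only ID with a repeated day.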
I would like to know how many animals will show up on a specific day. This chart describes how people register their animals in advance.
For instance, at 7 days ahead, someone registered 4 cats to show up on 5/3/2019; at 6 days ahead, another 9 cats were registered for 5/3/2019. So there will be 4+9=13 cats showing up on 5/3/2019.
When days_ahead = 0, it simply means someone registered on the event day. For instance, 4 wolves registered for 5/1/2019 on 5/1/2019 (0 days ahead), and there will be 4 wolves that day.
library(dplyr)
set.seed(0)
animal = c(rep('cat', 5), rep('dog', 6), rep('wolf', 3))
date = sample(seq(as.Date("2019/5/1"), as.Date('2019/5/10'), by='day'), 14, replace=TRUE)
days_ahead = sample(seq(0,14), 14, replace=FALSE)
number = sample.int(10, 14, replace=TRUE)
dt = data.frame(animal, date, days_ahead, number) %>% arrange(animal, date)
The expected outcome should have the same 1-3 columns as the example, but the fourth column should be an accumulated number by each date, accumulating on days_ahead.
I added an expected outcome here. The comments are used to explain the accumulated_number column.
I've considered using a loop, but I'm not entirely sure how to loop over three variables (animal, date, and days_ahead). Any advice is appreciated!!
The accumulated_number is somewhat easy with cumsum(). See this link for your comments field:
Cumulatively paste (concatenate) values grouped by another variable
dt%>%
group_by(animal,date)%>%
mutate(accumulated_number = cumsum(number)
,comments = Reduce(function(x1, x2) paste(x1, x2, sep = '+'), as.character(number), accumulate = T)
)%>%
ungroup()
Also, my dataset is slightly different than yours with the same seed. Still, it seems to work.
# A tibble: 14 x 6
animal date days_ahead number accumulated_number comments
<fct> <date> <int> <int> <int> <chr>
1 cat 2019-05-03 10 9 9 9
2 cat 2019-05-04 6 4 4 4
3 cat 2019-05-06 8 5 5 5
4 cat 2019-05-09 5 4 4 4
5 cat 2019-05-10 13 6 6 6
6 dog 2019-05-01 0 2 2 2
7 dog 2019-05-03 3 5 5 5
8 dog 2019-05-07 1 7 7 7
9 dog 2019-05-07 9 8 15 7+8
10 dog 2019-05-09 12 2 2 2
11 dog 2019-05-10 7 9 9 9
12 wolf 2019-05-02 14 5 5 5
13 wolf 2019-05-03 11 8 8 8
14 wolf 2019-05-07 4 9 9 9
I'm not sure I understand your question. Is this what you want?
I'm adding an "animals_arriving" column and keeping the rest of dt
library(dplyr)
library(lubridate)
dt %>%
mutate(date_arrival = date + days(days_ahead)) %>%
group_by(date = date_arrival) %>%
summarise(animals_arriving = n()) %>%
full_join(dt,by="date")
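For comparison, here is a small base R sketch of the running total described in the question, using only the three registrations spelled out in its text (4 cats at 7 days ahead and 9 cats at 6 days ahead for 5/3/2019, 4 wolves at 0 days ahead for 5/1/2019). The data frame name reg is illustrative, and sorting by descending days_ahead is an assumption about what "accumulating on days_ahead" means, namely that earlier registrations come first.

```r
reg <- data.frame(
  animal     = c("cat", "cat", "wolf"),
  date       = as.Date(c("2019-05-03", "2019-05-03", "2019-05-01")),
  days_ahead = c(7, 6, 0),
  number     = c(4, 9, 4)
)

# Earliest registrations first within each animal/date (largest days_ahead first)
reg <- reg[order(reg$animal, reg$date, -reg$days_ahead), ]

# Running total within each animal/date group; format() turns the Date into a plain key
reg$accumulated_number <- ave(reg$number, reg$animal, format(reg$date), FUN = cumsum)
reg
```

The last row of each animal/date group then holds the total number expected to show up on that day, for example 13 cats on 2019-05-03.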
I have a dataset that contains weekly data. The week starts on a Monday and ends on a Sunday. This dataset is also broken out by group.
I want to detect if there are any missing consecutive dates between the start and finish for each group. Here is an example dataset:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-05-04', '2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A', 'A','B','B','B','B')
Value<- c(2,3,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
df
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-27 B 9
2015-08-03 B 8
For group B, there is missing data between 2015-07-06 and 2015-07-27. There is also missing data in group A between 2015-04-20 and 2015-05-04. I want to add a row for that group and have the value be NA. I have many groups and I want my expected output to be below:
Week Group Value
2015-04-13 A 2
2015-04-20 A 3
2015-04-27 A NA
2015-05-04 A 10
2015-06-29 B 4
2015-07-06 B 11
2015-07-13 B NA
2015-07-20 B NA
2015-07-27 B 9
2015-08-03 B 8
Any help would be great, thanks!
You can use complete from tidyr package, i.e.
library(tidyverse)
df %>%
group_by(Group) %>%
complete(Week = seq(min(Week), max(Week), by = 'week'))
which gives,
# A tibble: 10 x 3
# Groups: Group [2]
Group Week Value
<fct> <date> <dbl>
1 A 2015-04-13 2
2 A 2015-04-20 3
3 A 2015-04-27 NA
4 A 2015-05-04 10
5 B 2015-06-29 4
6 B 2015-07-06 11
7 B 2015-07-13 NA
8 B 2015-07-20 NA
9 B 2015-07-27 9
10 B 2015-08-03 8
The only way I've found to do this is using an inequality join in SQL.
library(tidyverse)
library(sqldf)
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04',
'2015-06-29', '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#what are the start and end weeks for each group?
GroupWeeks <- df %>%
group_by(Group) %>%
summarise(start = min(Week),
end = max(Week))
#What are all the possible weeks?
AllWeeks <- data.frame(Week = seq.Date(min(df$Week), max(df$Week), by = "week"))
#use an inequality join to add rows for every week within the group's range
sqldf("Select AllWeeks.Week, GroupWeeks.[Group], Value
From AllWeeks inner join GroupWeeks on AllWeeks.Week >= start AND AllWeeks.Week <= end
left join df on AllWeeks.Week = df.Week and GroupWeeks.[Group] = df.[Group]")
This can be achieved using the seq function. Here is a code snippet that fills the gaps for a single group (B).
Code:
Week<- as.Date(c('2015-04-13', '2015-04-20', '2015-04-27', '2015-05-04', '2015-06-29','2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A','A', 'A','B','B','B','B')
Value<- c(2,3,2,10,4,11,9,8)
df<-data.frame(Week, Group, Value)
#generate every weekly date between group B's first and last week
alldates = seq(min(df$Week[df$Group == 'B']), max(df$Week[df$Group == 'B']), 7)
#keep only the dates that are not present for group B
dates = alldates[!(alldates %in% df$Week[df$Group == 'B'])]
#add these new dates to a new dataframe and rbind with the old dataframe
new_df = data.frame(Week = dates,Group = 'B', Value = NA)
df = rbind(df, new_df)
df = df[order(df$Week),]
Output:
Week Group Value
1 2015-04-13 A 2
2 2015-04-20 A 3
3 2015-04-27 A 2
4 2015-05-04 A 10
5 2015-06-29 B 4
6 2015-07-06 B 11
9 2015-07-13 B NA
10 2015-07-20 B NA
7 2015-07-27 B 9
8 2015-08-03 B 8
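The snippet above handles one hard-coded group. As a rough sketch of how the same seq idea could be extended to every group at once in base R, one can build the full weekly grid per group and left-join the observed values onto it. The names full_grid and out are illustrative, and the data is the seven-row example from the question.

```r
Week  <- as.Date(c('2015-04-13', '2015-04-20', '2015-05-04', '2015-06-29',
                   '2015-07-06', '2015-07-27', '2015-08-03'))
Group <- c('A', 'A', 'A', 'B', 'B', 'B', 'B')
Value <- c(2, 3, 10, 4, 11, 9, 8)
df <- data.frame(Week, Group, Value)

# For each group, every week between its first and last observed week
full_grid <- do.call(rbind, lapply(split(df, df$Group), function(g) {
  data.frame(Week  = seq(min(g$Week), max(g$Week), by = "week"),
             Group = g$Group[1])
}))

# Left join: weeks without an observation get Value = NA
out <- merge(full_grid, df, by = c("Week", "Group"), all.x = TRUE)
out <- out[order(out$Group, out$Week), ]
rownames(out) <- NULL
out
```

This avoids repeating the filter-and-rbind steps per group, at the cost of one merge over the complete grid.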