Conditionally duplicating rows in a data frame - R

This is a sample of my data set:
  day city count
1   1    A    50
2   2    A   100
3   2    B   110
4   2    C    90
Here is the code for reproducing it:
df <- data.frame(
  day = c(1, 2, 2, 2),
  city = c("A", "A", "B", "C"),
  count = c(50, 100, 110, 90)
)
As you can see, the count is missing for cities B and C on day 1. What I want to do is use city A's count as an estimate for the other two cities. So the desired output would be:
  day city count
1   1    A    50
2   1    B    50
3   1    C    50
4   2    A   100
5   2    B   110
6   2    C    90
I could come up with a for loop to do it, but I feel like there should be an easier way. My idea is to count the number of observations for each day, and then, for the days where the number of observations is less than the number of cities in the data set, replicate a row to complete the data for that day. Any better ideas? Or a more efficient for loop? Thanks.

With dplyr and tidyr, we can do:
library(dplyr)
library(tidyr)
df %>%
  expand(day, city) %>%
  left_join(df) %>%
  group_by(day) %>%
  fill(count, .direction = "up") %>%
  fill(count, .direction = "down")
Alternatively, we can avoid the left_join using thelatemail's solution:
df %>%
  complete(day, city) %>%
  group_by(day) %>%
  fill(count, .direction = "up") %>%
  fill(count, .direction = "down")
Both return:
# A tibble: 6 x 3
    day city  count
  <dbl> <fct> <dbl>
1    1. A       50.
2    1. B       50.
3    1. C       50.
4    2. A      100.
5    2. B      110.
6    2. C       90.
Data (slightly modified to show .direction filling both directions):
df <- data.frame(
  day = c(1, 2, 2, 2),
  city = c("B", "A", "B", "C"),
  count = c(50, 100, 110, 90)
)
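As a side note: in newer tidyr (1.0.0 and later, an assumption about your installed version), the two fill() calls can be collapsed into one by filling in both directions at once:

```r
library(dplyr)
library(tidyr)

# .direction = "downup" fills down first, then up, within each group
df %>%
  complete(day, city) %>%
  group_by(day) %>%
  fill(count, .direction = "downup") %>%
  ungroup()
```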

Related

Create column with a certain week value by group

I would like to create a column, by group, with a certain week's value from another column.
In this example New_column is created with the Number from the 2nd week for each group.
Group Week Number New_column
A     1    19     8
A     2    8      8
A     3    21     8
A     4    5      8
B     1    4      12
B     2    12     12
B     3    18     12
B     4    15     12
C     1    9      4
C     2    4      4
C     3    10     4
C     4    2      4
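For reproducibility, a sketch reconstructing the sample data from the table above (without the desired New_column):

```r
# Reconstruction of the table above; New_column is what we want to compute
df <- data.frame(
  Group = rep(c("A", "B", "C"), each = 4),
  Week = rep(1:4, times = 3),
  Number = c(19, 8, 21, 5, 4, 12, 18, 15, 9, 4, 10, 2)
)
```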
I've used this method, which works, but I feel is a really messy way to do it:
library(dplyr)
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = ifelse(Week == 2, Number, NA))

df <- df %>%
  group_by(Group) %>%
  mutate(New_column = sum(New_column, na.rm = TRUE))
There are several possible solutions, depending on what you need specifically. With your specific sample data, however, all of them give the same result.
1) This one identifies the week from the Week column, even if the data frame is not sorted:
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[Week == 2])
However, if the weeks do not start from 1, this solution will still look only for the row where Week == 2.
2) If df is already sorted by Week inside each group, you could use
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[2])
This solution does not take the Number of the row where Week == 2, but rather that of the second row within each group, regardless of its actual Week value.
3) If df is not sorted by week, you could do it with
df %>%
  group_by(Group) %>%
  arrange(Week, .by_group = TRUE) %>%
  mutate(New_column = Number[2])
which uses the same rationale as solution 2).

Counting highest number occurrences of character and return in separate data table/frame in R

I am looking for a string of code that will count the number of occurrences of a certain variable, sort it in order, and then limit it to the first X results. Example of what I am looking for:
Dataframe:
ID   Group
1000 A
1001 A
100a A
100g D
1004 C
100f B
100z B
1293 B
2412 B
3040 B
3452 C
Result: a table or data frame showing the top 3 results (of 4), ordered from highest to lowest:
Group Count
B     5
A     3
C     2
Thanks in advance!
In dplyr, we can count the Group values, select the top 3, and arrange them in decreasing order:
library(dplyr)
df %>% count(Group) %>% top_n(3, n) %>% arrange(desc(n))
#  Group     n
#  <fct> <int>
#1 B         5
#2 A         3
#3 C         2
We can also use
df %>% count(Group) %>% arrange(desc(n)) %>% head(3)
Or in base R
stack(head(sort(table(df$Group), decreasing = TRUE), 3))
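In more recent dplyr (1.0.0 and later, an assumption about your version), top_n() is superseded by slice_max(); an equivalent sketch:

```r
library(dplyr)

# slice_max() already returns rows ordered by n from highest to lowest
df %>% count(Group) %>% slice_max(n, n = 3)
```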

How to find observations within a certain time range of each other in R

I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date       dayoflife meds
1  2003-11-24 16361     lasiks
1  2003-11-24 16361     vigab
1  2004-01-09 16407     lacos
1  2013-11-25 20015     pheno
1  2013-11-26 20016     vigab
1  2013-11-26 20016     lasiks
2  2008-06-05 24133     pheno
2  2008-04-07 24074     vigab
3  2014-11-25 8458      pheno
3  2014-12-22 8485      pheno
I expect the outcome to be:
ID N
1  3
2  2
3  1
indicating that individual 1 had a maximum of 3 different types of medications administered within 365 days of each other. I am not sure whether it is better to use days of life or the date to get this expected outcome. Any help is appreciated.
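For reproducibility, a sketch reconstructing the sample data (the object is named df1 to match the answer's code; med names copied verbatim from the question):

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  date = c("2003-11-24", "2003-11-24", "2004-01-09", "2013-11-25",
           "2013-11-26", "2013-11-26", "2008-06-05", "2008-04-07",
           "2014-11-25", "2014-12-22"),
  dayoflife = c(16361, 16361, 16407, 20015, 20016, 20016,
                24133, 24074, 8458, 8485),
  meds = c("lasiks", "vigab", "lacos", "pheno", "vigab", "lasiks",
           "pheno", "vigab", "pheno", "pheno")
)
```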
An option would be to convert 'date' to Date class, then, grouped by 'ID', take the absolute difference between 'date' and its lag, check whether it is greater than 365, create a grouping index with cumsum, and get the number of distinct 'meds' values in summarise:
library(dplyr)
df1 %>%
  mutate(date = as.Date(date)) %>%
  group_by(ID) %>%
  mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
                                         units = "days")))) %>%
  group_by(grp = cumsum(diffd > 365), add = TRUE) %>%
  summarise(N = n_distinct(meds)) %>%
  group_by(ID) %>%
  summarise(N = max(N))
# A tibble: 3 x 2
#     ID     N
#  <int> <int>
#1     1     2
#2     2     2
#3     3     1
You can try:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(date = as.Date(date),
         lag_date = abs(date - lag(date)) <= 365,
         lead_date = abs(date - lead(date)) <= 365) %>%
  mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
  filter(coalesce(lag_date, lead_date)) %>%
  summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
     ID     N
  <int> <int>
1     1     2
2     2     2
3     3     1

R dplyr count observations within groups

I have a data frame with yes/no values for different days and hours. For each day, I want to get a total number of hours where I have data, as well as the total number of hours where there is a value of Y.
df <- data.frame(day = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 hour = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
                 YN = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N"))
df %>%
  group_by(day) %>%
  summarise(tot.hour = n(),
            totY = WHAT DO I PUT HERE?)
Use a boolean comparison, then sum it up:
df %>%
  group_by(day) %>%
  dplyr::summarise(tot.hour = n(),
                   totY = sum(YN == "Y"))
# A tibble: 4 x 3
    day tot.hour  totY
  <dbl>    <int> <int>
1     1        3     3
2     2        2     2
3     3        4     1
4     4        1     0
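The same counts can also be obtained in base R; a sketch using aggregate():

```r
# One row per day; the YN column becomes a two-column matrix
# holding the total hours and the count of "Y" values
aggregate(YN ~ day, df,
          function(x) c(tot.hour = length(x), totY = sum(x == "Y")))
```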

How to Count the Number of Values that meet certain conditions in a df in R

Let's say that I have data like the following
date value location
1/1  10    A
1/2  15    A
1/3  20    A
2/1  15    A
2/2  10    A
2/3  5     A
2/4  12    B
2/5  15    B
2/6  5     B
2/7  20    A
I would like the count of all values over 10 after 1/31 aggregated by location. So my output would give me 3 for location A and 2 for location B.
Any ideas how this could be implemented in R?
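For reproducibility, a sketch reconstructing the data shown above (dates kept as month/day strings, as in the question):

```r
df <- data.frame(
  date = c("1/1", "1/2", "1/3", "2/1", "2/2", "2/3",
           "2/4", "2/5", "2/6", "2/7"),
  value = c(10, 15, 20, 15, 10, 5, 12, 15, 5, 20),
  location = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "A")
)
```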
Once you standardize your date field (assuming the year is 2018), you can use the dplyr package to filter the dataset to the conditions you need, then group by location and tally:
library(dplyr)
df <- df %>%
  mutate(date = as.Date(paste0(date, '/', format(Sys.Date(), '%Y')),
                        format = '%m/%d/%Y')) %>%
  filter(date > as.Date('2018-01-31')) %>%
  filter(value >= 10) %>%
  group_by(location) %>%
  tally()
Using base R you can do:
newdat <- subset(transform(dat, date = strptime(date, "%m/%d")),
                 date > as.Date("2018-01-31") & value >= 10)
table(newdat$location)
A B
3 2
or
aggregate(value~location,newdat,length)
location value
1 A 3
2 B 2
Taking into consideration the comment by thelatemail, you can do:
aggregate(value ~ location, dat, length,
          subset = strptime(date, "%m/%d") > as.Date("2018-01-31") & value >= 10)
location value
1 A 3
2 B 2
Adding a bit of lubridate functionality to D.sen's answer:
library(tidyverse)
library(lubridate)
thresh <- 10
date_thresh <- "2018-01-31"

df %>%
  mutate(date = mdy(paste0(date, "/2018"))) %>%
  filter(date > date_thresh, value > thresh) %>%
  group_by(location) %>%
  tally()
# A tibble: 2 x 2
  location     n
  <fct>    <int>
1 A            2
2 B            2
