I have 900 files named like 20120412_bwDD2yYa.txt. The first part up to the _ is in the year-month-day format. Some days have multiple files associated with them.
I'd like to use the dates extracted from the file names as data to compile a timeseries where the dates are the x axis and the number of files are the y axis.
How can I do this?
Here is a solution with Base R. Since the question does not include a reproducible example, we'll simulate the file names, parse out the dates, and create the counts by date.
# use list.files() to list the files in the directory (pattern is a regular expression, not a glob)
files <- list.files(path = "./data", pattern = "\\.txt$", full.names = FALSE)
# simulate result from list.files()
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
# extract dates from file names
date <- as.Date(substr(files,1,8),"%Y%m%d")
df <- data.frame(date,count = rep(1,length(date)))
aggregate(count ~ date,data = df, sum)
...and the output:
date count
1 2012-01-01 2
2 2012-01-02 1
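The question also asks for a plot with dates on the x axis and file counts on the y axis. A minimal base R sketch, reusing the simulated file names from above (the plot type and axis labels are my own choices, not part of the original answer):

```r
# simulated file names, as above
files <- c("20120101_aaa.txt", "20120101_bbb.txt", "20120102_ccc.txt")
date <- as.Date(substr(files, 1, 8), "%Y%m%d")

# one row per date with the number of files for that date
counts <- aggregate(count ~ date, data = data.frame(date, count = 1), sum)

# dates on the x axis, number of files on the y axis
plot(counts$date, counts$count, type = "h", lwd = 3,
     xlab = "Date", ylab = "Number of files",
     ylim = c(0, max(counts$count)))
```

With 900 real files, `type = "l"` may read better than vertical bars.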
dplyr solution
A solution with dplyr::summarise() looks like this:
files <- list.files(path = "./data", pattern = "\\.txt$", full.names = FALSE)
# simulate result from list.files()
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
library(dplyr)
data.frame(date=as.Date(substr(files,1,8),"%Y%m%d")) %>%
group_by(date) %>% summarise(count = n())
# A tibble: 2 x 2
date count
<date> <int>
1 2012-01-01 2
2 2012-01-02 1
Accounting for dates with no files
In response to a comment on my answer, here is a solution that fills in gaps in the file list where there are days with 0 files. We take the minimum and maximum dates from the file list and create a data frame containing the sequence of dates. Then we left_join() this with the previously aggregated data, and recode NA values for count to 0.
# create a gap in dates with files
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt",
"20120104_aaa.txt","20120104_aab.txt","20120104_aac.txt")
library(dplyr)
data.frame(date=as.Date(substr(files,1,8),"%Y%m%d")) %>%
group_by(date) %>% summarise(count = n()) -> fileCounts
# create df with all dates, left_join() and recode NA to 0
data.frame(date = as.Date(min(fileCounts$date):max(fileCounts$date),
origin = "1970-01-01")) %>%
left_join(.,fileCounts) %>%
mutate(count = if_else(is.na(count),0,as.numeric(count)))
...and the output:
Joining, by = "date"
date count
1 2012-01-01 2
2 2012-01-02 1
3 2012-01-03 0
4 2012-01-04 3
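For comparison, the same gap filling can be sketched in base R without a join. This is an alternative I am suggesting, not part of the original answer: converting the dates to a factor whose levels cover every calendar day makes table() report a zero for days without files.

```r
files <- c("20120101_aaa.txt", "20120101_bbb.txt", "20120102_ccc.txt",
           "20120104_aaa.txt", "20120104_aab.txt", "20120104_aac.txt")
date <- as.Date(substr(files, 1, 8), "%Y%m%d")

# every calendar day between the first and last file date
all_days <- seq(min(date), max(date), by = "day")

# a factor with all days as levels makes table() include empty days
counts <- table(factor(format(date), levels = format(all_days)))

filled <- data.frame(date = all_days, count = as.integer(counts))
filled
```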
You can use table to count frequencies and then stack it to get a dataframe.
Using @Len Greski's files.
files <- c("20120101_aaa.txt","20120101_bbb.txt","20120102_ccc.txt")
stack(table(as.Date(sub('_.*', '', files),"%Y%m%d")))[2:1]
# ind values
#1 2012-01-01 2
#2 2012-01-02 1
I have a dataframe containing timestamps of visits to a location by RFID tagged individuals. Here is a simplified example
Time<- c("07:00:48", "11:45:34", "11:46:28","11:46:29", "11:47:17","11:47:18")
ID<- c("00003F9776","01103F9702","01103FA8DD","01103FA8DD","01103F9702","01103F9702")
df<- data.frame(Time, ID)
df
Time ID
1 07:00:48 00003F9776
2 11:45:34 01103F9702
3 11:46:28 01103FA8DD
4 11:46:29 01103FA8DD
5 11:47:17 01103F9702
6 11:47:18 01103F9702
All I want to do is remove the visits that are close to each other in time, for example if they are within 10 seconds of each other. In the above example I'd remove rows 3 and 5.
My actual data sets contain thousands of entries like this. Is there a simple way to achieve this?
Here is one dplyr answer -
library(dplyr)
df %>%
mutate(Timestamp = as.POSIXct(Time, format = '%T')) %>%
filter(difftime(Timestamp, lag(Timestamp, default = first(Timestamp) - 11), units = 'sec') > 10) %>%
select(-Timestamp)
# Time ID
#1 07:00:48 00003F9776
#2 11:45:34 01103F9702
#3 11:46:28 01103FA8DD
#4 11:47:17 01103F9702
To keep the first row in the output, I set the default value of lag() to first(Timestamp) - 11, so that the first row satisfies the condition (difftime > 10) and is selected.
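The dplyr answer compares each row with the previous row regardless of ID. If the 10-second rule is meant to apply per tag (an assumption on my part; the question does not say), a base R sketch could compute the gap within each ID using ave():

```r
Time <- c("07:00:48", "11:45:34", "11:46:28", "11:46:29", "11:47:17", "11:47:18")
ID <- c("00003F9776", "01103F9702", "01103FA8DD", "01103FA8DD", "01103F9702", "01103F9702")
df <- data.frame(Time, ID)

# seconds since midnight: attach a fixed dummy date so only the time matters
secs <- as.numeric(as.POSIXct(paste("1970-01-01", as.character(df$Time)), tz = "UTC"))

# gap to the previous visit of the same ID (Inf for an ID's first visit)
gap <- ave(secs, df$ID, FUN = function(s) c(Inf, diff(s)))

# keep only visits more than 10 seconds after the same tag's previous visit
kept <- df[gap > 10, ]
kept
```

This assumes the rows are already in time order and that all times fall on the same day.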
You could easily do this with the data.table package:
library(data.table)
dt <- as.data.table(data.frame(Time, ID))  ## convert to data.table
dt[, Time := as.ITime(Time)]               ## parse Time as a time of day
Create a diff column to filter on:
dt[, diff := as.integer(Time) - shift(as.integer(Time))]  ## seconds since the previous row
dt[diff > 10 | is.na(diff)]
That's all.
...or calculate and filter in one line:
dt[is.na(shift(Time)) | as.integer(Time) - shift(as.integer(Time)) > 10]
is.na() is added so that the first row, which has no previous row to compare against, is also kept.
For thousands of ids, I want to find the days when each one starts being recorded and the days when it stops, in a simple way.
I currently use a loop, shown below, which works well but takes ages.
An example of my dataset:
id date
1 2017-11-30
1 2017-12-01
1 2017-12-02
1 2017-12-03
1 2017-12-05
1 2017-12-06
1 2017-12-07
1 2017-12-08
1 2017-12-09
1 2017-12-10
I then use this loop to find each date when an individual starts being recorded without a gap between days. In my example it gives '2017-11-30' and '2017-12-05' for the starts, and '2017-12-03' and '2017-12-10' for the ends.
nani <- unique(dat$id)
n <- length(nani)  # number of distinct ids (not the number of rows)
#SET THE NEW OBJECT WHERE TO SAVE RESULTS
NEWDAT <- NULL
for(i in 1 : n)
{
#SELECT ANIMALS I WITHIN THE DATA.FRAME
x <- which(dat$id == nani[i])
#FIND THE POSITIONS IN THE DATA FRAME WHERE THE RECORD IS NOT CONTINUOUS
diffx <- diff(diff(dat$date[x]))
#FIND THE POSITION OF STARTS FOR EACH SESSIONS OF RECORDS
starti <- which(diffx < 0) +1
#FIND THE POSITION OF ENDS FOR EACH SESSIONS OF RECORDS
endi <- which(diffx > 0) +1
#FIND THE DATES OF STARTS FOR EACH SESSIONS OF RECORDS
starts_records <- c(dat$date[x][1], dat$date[x][starti])
#FIND THE DATES OF ENDS FOR EACH SESSIONS OF RECORDS
ends_records <- c(dat$date[x][endi], dat$date[x][length(x)])
#CREATE LABELS
name_start <- rep("START_RECORDS_BY_SENSORS", length(starts_records))
name_end <- rep("END_RECORDS_BY_SENSORS", length(ends_records))
#CREATE THE NEW DATA.FRAME EXPECTED
dat2 <- data.frame( "event_start" = c(starts_records, ends_records),
"name" = c(name_start, name_end))
dat2 <- dat2[order(dat2$event_start),]
#SAVE RESULTS
NEWDAT <- bind_rows(NEWDAT, dat2)
}
So far I have tried things like the line below, but have not found the right way to avoid the loop.
NEWDAT <- dat %>% group_by(id) %>% summarize(diff_days = diff(diff(date)))
I still struggle to understand the syntax of dplyr.
You can try to create a new group at every break and get first and last date in each group.
library(dplyr)
df %>%
group_by(id, grp = cumsum(c(TRUE, diff(date) > 1))) %>%
summarise(start = first(date), stop = last(date))
# id grp start stop
# <int> <int> <date> <date>
#1 1 1 2017-11-30 2017-12-03
#2 1 2 2017-12-05 2017-12-10
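To see why the cumsum() trick creates a new group at every break, here is a small base R sketch for a single id, using the dates from the question (the variable names are mine):

```r
date <- as.Date(c("2017-11-30", "2017-12-01", "2017-12-02", "2017-12-03",
                  "2017-12-05", "2017-12-06", "2017-12-07", "2017-12-08",
                  "2017-12-09", "2017-12-10"))

# TRUE wherever the gap to the previous date exceeds one day
# (the first row always starts a run)
new_run <- c(TRUE, diff(date) > 1)

# the running total of the TRUEs labels each unbroken run
grp <- cumsum(new_run)

# a run starts where new_run is TRUE and ends just before the next start
starts <- date[new_run]
stops  <- date[c(new_run[-1], TRUE)]

runs <- data.frame(grp = seq_along(starts), start = starts, stop = stops)
runs
```

The dates must be sorted within each id for diff() to measure the gap between consecutive records.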
I have a data frame with dates and numbers called 'df'. I have another data frame with start and end dates called 'date_ranges'.
My goal is to filter/subset df so that it only shows the rows that fall between the start and end dates in each row of date_ranges. Here is my code so far:
df_date <- as.Date((as.Date('2010-01-01'):as.Date('2010-04-30')))
df_numbers <- c(1:120)
df <- data.frame(df_date, df_numbers)
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
date_ranges <- data.frame(start_dates, end_dates)
# Attempting to filter df by start and end dates
for (i in range(date_ranges$start_dates)){
for (j in range(date_ranges$end_dates)){
print (
df %>%
filter(between(df_date, i, j)))
}
}
The first and third result of the nested for loop is what I want, but not the second result. The first and third give me the dates and values for df between their respective rows, but the second result is the range from the earliest date to the latest date. How can I fix this loop to exclude the second result?
A tidyverse approach could be to create a sequence between start and end_dates and join with df to keep only the dates which lie in the range.
library(dplyr)
date_ranges %>%
mutate(df_date = purrr::map2(start_dates, end_dates, seq, "day")) %>%
tidyr::unnest(df_date) %>%
select(-start_dates, -end_dates) %>%
left_join(df, by = 'df_date')
# A tibble: 39 x 2
# df_date df_numbers
# <date> <int>
# 1 2010-01-06 6
# 2 2010-01-07 7
# 3 2010-01-08 8
# 4 2010-01-09 9
# 5 2010-01-10 10
# 6 2010-01-11 11
# 7 2010-01-12 12
# 8 2010-01-13 13
# 9 2010-01-14 14
#10 2010-01-15 15
# … with 29 more rows
You can try looping over the row index, so that each start date is paired with its own end date:
for (i in seq_along(date_ranges$start_dates)){
print (
df %>%
filter(between(df_date, date_ranges$start_dates[i], date_ranges$end_dates[i])))
}
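If the subsets should be collected rather than printed, one possible base R variant of the same index loop is sketched below. It rebuilds the question's data with seq() and uses lapply() plus rbind(); the names `subsets` and `result` are mine.

```r
df <- data.frame(
  df_date = seq(as.Date("2010-01-01"), as.Date("2010-04-30"), by = "day"),
  df_numbers = 1:120)
date_ranges <- data.frame(
  start_dates = as.Date(c("2010-01-06", "2010-02-01", "2010-04-15")),
  end_dates   = as.Date(c("2010-01-23", "2010-02-06", "2010-04-29")))

# one subset of df per row of date_ranges
subsets <- lapply(seq_len(nrow(date_ranges)), function(i) {
  df[df$df_date >= date_ranges$start_dates[i] &
     df$df_date <= date_ranges$end_dates[i], ]
})

# stack the pieces into a single data frame
result <- do.call(rbind, subsets)
nrow(result)
```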
Base R solution:
# Your data creation can be simplified:
df <- data.frame(df_date = seq.Date(as.Date('2010-01-01', "%Y-%m-%d"), as.Date('2010-04-30', "%Y-%m-%d"),
by = 1), df_numbers = c(1:120))
# Store start and end date vectors to filter the data.frame:
start_dates <- as.Date(c("2010-01-06", "2010-02-01", '2010-04-15'))
end_dates <- as.Date(c("2010-01-23", "2010-02-06", '2010-04-29'))
# Subset the data to keep records that fall inside any start/end range: df => stdout (console)
in_range <- Reduce(`|`, Map(function(s, e) df$df_date >= s & df$df_date <= e, start_dates, end_dates))
df[in_range, ]
I am struggling a little with the logic for recoding nested data into a long "continuous" format based on dates in R.
Below is a dummy example of my data. I have three sets of dates: the start and stop times for a participant, which are stored in long format, and the start of another incident, which is stored as wide data.
GC_ID  HMIS_Start  HMIS_Stop  CPS Start  CPS Start 2  CPS Start 3
-----  ----------  ---------  ---------  -----------  -----------
1      1/10/14     1/20/14    1/15/14    6/2/14       NA
1      4/10/14     5/30/14    1/15/14    6/2/14       NA
1      12/1/14     12/2/14    1/15/14    6/2/14       NA
1      1/1/15      2/28/15    1/15/14    6/2/14       NA
2      8/13/13     8/17/14    NA         NA           NA
3      5/1/15      5/2/15     1/16/13    6/26/14      7/27/15
3      6/4/16      7/10/16    1/16/13    6/26/14      7/27/15
4      10/15/13    10/25/13   2/18/15    NA           NA
4      12/25/13    1/18/14    2/18/15    NA           NA
4      2/8/15      7/20/15    2/18/15    NA           NA
My goal is to create two long, continuous variables that go along with each month from August 2013 to December 2015. For the first variable, I would like to code a 1 for each month that falls within an HMIS_Start and HMIS_Stop period for a participant AND has at least one CPS Start date within that month. The second variable would be similar, but would be coded 1 if a CPS Start date happened in the month after the HMIS_Stop date.
So participant 1's data could look like this:
I assume I need to create a blank data set with the ID variable and the month/year variable. Then I would use a for loop over each ID to run an "if then" statement checking whether the month is greater than the HMIS start and less than the HMIS stop AND whether the CPS start is within that month too.
I am mostly struggling with how to set up that process and use the for loop logically, given that the file already contains long data, with multiple rows per participant that need to be compared against all possible CPS start dates.
Any thoughts or code tips on how to tackle this?
I am not sure how you came to your answers, and I will update this code once that is provided. But I used library(tidyverse) and library(lubridate) for this:
library(tidyverse)
library(lubridate)
dat <- data.frame(
  GC_ID = c(1,1,1,1,2,3,3,4,4,4),
  HMIS_Start = c("1/10/14", "4/10/14", "12/1/14", "1/1/15", "8/13/13", "5/1/15", "6/4/16", "10/15/13", "12/25/13", "2/8/15"),
  HMIS_Stop = c("1/20/14", "5/30/14", "12/2/14", "2/28/15", "8/17/14", "5/2/15", "7/10/16", "10/25/13", "1/18/14", "7/20/15"),
  CPS_Start = c("1/15/14", "1/15/14", "1/15/14", "1/15/14", NA, "1/16/13", "1/16/13", "2/18/15", "2/18/15", "2/18/15"),
  CPS_Start_2 = c("6/2/15", "6/2/15", "6/2/15", "6/2/15", NA, "6/26/14", "6/26/14", NA, NA, NA),
  CPS_Start_3 = c(NA, NA, NA, NA, NA, "7/27/15", "7/27/15", NA, NA, NA))
dats <- dat %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, ~as.Date(., format = "%m/%d/%y")) %>%
gather(Var, Dates, -GC_ID, -HMIS_Start, -HMIS_Stop) %>%
filter(!is.na(Dates)) %>%
mutate(HMIS_CPS_SAME = if_else(month(HMIS_Start) == month(HMIS_Stop) &
year(HMIS_Start) == year(HMIS_Stop) &
month(HMIS_Start) == month(Dates) &
year(HMIS_Start) == year(Dates), 1, 0 ),
CPS_After = if_else(month(HMIS_Stop) + 1 == month(Dates) &
year(HMIS_Stop) == year(Dates), 1,0 ),
Months = month(HMIS_Start),
Years = year(HMIS_Start)) %>%
arrange(GC_ID, HMIS_Start, Dates) %>%
group_by(GC_ID, Months, Years) %>%
summarise(HMIS_CPS_SAME = max(HMIS_CPS_SAME),
CPS_After = max(CPS_After)) %>%
ungroup()
full_dat <- merge(data.frame(GC_ID = unique(dat$GC_ID)), data.frame(Dates = seq.Date(as.Date("2013-08-01"), as.Date("2015-12-01"), by = "month"))) %>%
mutate(Months = month(Dates), Years = year(Dates)) %>%
left_join(dats, by = c("GC_ID", "Months", "Years")) %>%
mutate_if(is.numeric , replace_na, replace = 0)
First I created the data as an R data frame. Then I converted the 5 date columns you mentioned to Date format. I made the data long to do the comparisons specified, then found the max for each combination of GC_ID, Months and Years. Then I used a Cartesian join of every date with every GC_ID, extracted the months and years from those dates, and joined dats to full_dat by GC_ID, Months and Years. The last mutate_if converts all NA values to 0. No looping needed! :-)
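One caveat with comparing month(HMIS_Stop) + 1 to month(Dates): a stop in December can never match a start in the following January, because 12 + 1 is 13 and the years differ. A sketch of one way around this, in base R, is to count months on a single running index. The helper name `month_index` and the example dates are hypothetical, not from the answer or the question's data.

```r
# hypothetical helper: months counted on one running scale, so consecutive
# calendar months always differ by exactly 1, even across a year boundary
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900) * 12 + lt$mon
}

# illustrative dates only
stop_date  <- as.Date(c("2014-12-02", "2014-01-20", "2014-05-30"))
start_date <- as.Date(c("2015-01-15", "2014-02-03", "2014-05-31"))

# 1 when the start falls in the calendar month right after the stop
after <- as.integer(month_index(start_date) == month_index(stop_date) + 1)
after
```

The same index could replace the separate month and year comparisons in the mutate() step, if that edge case matters for the data.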
I have several data frames and they were named like this
plant1_wd_hrly, plant2_wd_hrly,plant3_wd_hrly......,
Each of them have data like this :
time temp
1 2012-01-01 00:00:00 20
2 2012-01-01 01:00:00 21
3 2012-01-01 02:00:00 22
4 2012-01-01 03:00:00 23
5 2012-01-01 04:00:00 24
I need to aggregate all of them to the daily level and also calculate the daily max and min.
Here is the code to generate such df:
x=seq(
from=as.POSIXct("2012-1-1 0:00", tz="UTC"),
to=as.POSIXct("2012-1-3 23:00", tz="UTC"),
by="hour")
plant1_wd_hrly=data.frame("time"=x,"temp"=seq(20,length.out=length(x)))
plant1_wd_hrly$time=as.POSIXct(substr(plant1_wd_hrly$time,1,10))
plant2_wd_hrly=data.frame("time"=x,"temp"=seq(25,length.out=length(x)))
plant2_wd_hrly$time=as.POSIXct(substr(plant2_wd_hrly$time,1,10))
plant1_wd_hrly$temp[2:3]=NA
plant2_wd_hrly$temp[5:6]=NA
If it is only one df I usually do the aggregation using dplyr package:
plant1_hrly=plant1_wd_hrly %>% group_by(time) %>% summarise(
temp_avg = mean(temp,na.rm=TRUE),
temp_max = max(temp,na.rm=TRUE),
temp_min = min(temp,na.rm=TRUE))
But with multiple df, what is a more efficient way to do this?
My first thought is a for loop. Could I load a dynamically generated variable name in R, so that I could loop through the different data frames, since they all have very similar names? If I want to assign a value to a dynamically generated variable name I can use assign, but how do I load one?
Thank you.
Make a vector of the data frame names, for instance:
df_names <- grep("plant", ls(), value = TRUE)
This works if no other variable names contain "plant"; otherwise you need a stricter regex, or you can pick the names by hand.
Then loop over the names, using get() and assign() in the body.
get() takes a name as a string and returns the value of that variable. assign() takes a name and a value and assigns the value to that name.
for(df_n in df_names){
temp_data = get(df_n) %>% group_by(time) %>% summarise(
temp_avg = mean(temp,na.rm=TRUE),
temp_max = max(temp,na.rm=TRUE),
temp_min = min(temp,na.rm=TRUE))
assign(paste0(df_n, "_agr"), temp_data)
}
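An alternative to get() and assign() is to keep the data frames in a named list, which avoids creating one new variable per plant. The sketch below is one possible base R version, not the answer's method: mget() fetches several variables by name at once, and the helper `daily_summary` is a hypothetical name of my own.

```r
# two small hourly data frames standing in for plant1_wd_hrly and plant2_wd_hrly
x <- seq(as.POSIXct("2012-01-01 00:00", tz = "UTC"),
         as.POSIXct("2012-01-03 23:00", tz = "UTC"), by = "hour")
plant1_wd_hrly <- data.frame(time = x, temp = seq(20, length.out = length(x)))
plant2_wd_hrly <- data.frame(time = x, temp = seq(25, length.out = length(x)))

# collect the data frames into one named list
df_names <- c("plant1_wd_hrly", "plant2_wd_hrly")
plants <- mget(df_names)

# hypothetical helper: daily mean, max and min of temp
daily_summary <- function(d) {
  day <- as.Date(d$time, tz = "UTC")
  data.frame(
    day = sort(unique(day)),
    temp_avg = as.numeric(tapply(d$temp, day, mean, na.rm = TRUE)),
    temp_max = as.numeric(tapply(d$temp, day, max, na.rm = TRUE)),
    temp_min = as.numeric(tapply(d$temp, day, min, na.rm = TRUE)))
}

# apply the helper to every data frame; the result is a list with the same names
daily <- lapply(plants, daily_summary)
daily$plant1_wd_hrly
```

Each daily table is then available as an element such as `daily$plant2_wd_hrly`, and further steps can again use lapply() over the list.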