Populate rows based on the date sequence in R

I have a data frame with a specific date range in each row.
stuID stdID roleStart roleEnd
1 1 7 2010-11-18 2020-06-14
2 2 2 2012-08-13 2014-04-01
3 2 4 2014-04-01 2015-10-01
4 2 3 2015-10-01 2018-10-01
5 2 6 2018-10-01 2020-06-14
6 3 4 2014-03-03 2015-10-01
I need to expand the rows based on the weeks between the two dates. To be precise, for each row I need to generate one row per week between its start and end dates.
I tried to achieve this with the following piece of code:
extendedData <- reshape2::melt(setNames(lapply(1:nrow(df), function(x)
  seq.Date(df[x, "roleStart"], df[x, "roleEnd"], by = "1 week")),
  df$stuID))
But when I execute this, I get the error message
Error in seq.int(0, to0 - from, by) : wrong sign in 'by' argument
This is the structure of the dataframe
'data.frame': 350 obs. of 4 variables:
$ stuID : int 1 2 2 2 2 3 3 3 4 4 ...
$ stdID : int 7 2 4 3 6 4 3 6 1 2 ...
$ roleStart: Date, format: "2010-11-18" "2012-08-13" "2014-04-01" "2015-10-01" ...
$ roleEnd : Date, format: "2020-06-14" "2014-04-01" "2015-10-01" "2018-10-01" ...
Can anyone say what's wrong with the code?
Thanks in advance!!

Here's a way to do this using tidyverse functions:
library(dplyr)
df %>%
  mutate(date = purrr::map2(roleStart, roleEnd, seq, by = 'week')) %>%
  tidyr::unnest(date)
As far as your code is concerned, it works fine up to this step, i.e. generating the weekly dates:
lapply(1:nrow(df), function(x)
  seq.Date(df[x, "roleStart"], df[x, "roleEnd"], by = "1 week"))
I am not sure what you are trying to do with the setNames and melt calls there.
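For reference, here is a self-contained sketch of the map2/unnest approach on a small made-up data frame (the column names come from the question; the dates here are invented):

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical two-row version of the question's data frame
df <- data.frame(
  stuID     = 1:2,
  stdID     = c(7, 2),
  roleStart = as.Date(c("2020-01-01", "2020-02-01")),
  roleEnd   = as.Date(c("2020-01-20", "2020-02-10"))
)

df %>%
  mutate(date = map2(roleStart, roleEnd, seq, by = "week")) %>%
  unnest(date)
# One row per week between roleStart and roleEnd; the other columns repeat
```

map2() builds a list-column of weekly date sequences, and unnest() then expands each list element into its own row.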

Related

read dates from multiple excel files with different formats in R

I am trying to read dates from different Excel files, and each of them has the dates stored in a different format (character or date). As a result, the date column of each file is read either as a character string like "28/02/2020" or as the numeric serial Excel uses for dates, like "452344" (number of days since 1900).
files1 = list.files(pattern="*.xlsx")
df = lapply(files1, read_excel,col_types = "text")
df = do.call(rbind, df)
How can I make R read the character form "28/02/2020" rather than the "452344" numeric form?
For multiple date formats in one column I suggest lubridate::parse_date_time() (or any other date parser that turns an ambiguous format into NA instead of throwing an error).
I assume your df should look something like this:
# A tibble: 6 x 2
id date
<chr> <chr>
1 1 43889
2 2 43889
3 3 43889
4 1 28/02/2020
5 2 28/02/2020
6 3 28/02/2020
Then you should use this code:
library(lubridate)
df <- as.data.frame(df)
df$date2 <- parse_date_time(x = df$date, orders = "d m y")  # converts rows like "28/02/2020" to dates
df[is.na(df$date2), "date2"] <- as.Date(as.numeric(df[is.na(df$date2), "date"]),
                                        origin = "1899-12-30")  # converts rows like "43889"
R output:
id date date2
1 1 43889 2020-02-28
2 2 43889 2020-02-28
3 3 43889 2020-02-28
4 1 28/02/2020 2020-02-28
5 2 28/02/2020 2020-02-28
6 3 28/02/2020 2020-02-28
str(df)
'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "1" ...
$ date : chr "43889" "43889" "43889" "28/02/2020" ...
$ date2: POSIXct, format: "2020-02-28" "2020-02-28" "2020-02-28" "2020-02-28" ...
I know it is not the nicest solution, but it should work for you as well.
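Put together, a self-contained sketch of the two-pass parse (the sample data mirrors the structure assumed above):

```r
library(lubridate)

# Hypothetical data frame mirroring the assumed structure
df <- data.frame(
  id   = c("1", "2", "3"),
  date = c("43889", "43889", "28/02/2020"),
  stringsAsFactors = FALSE
)

# Pass 1: "28/02/2020" parses; the Excel serials become NA
df$date2 <- parse_date_time(df$date, orders = "d m y")

# Pass 2: fill the remaining NAs from the Excel serial numbers
df[is.na(df$date2), "date2"] <- as.Date(as.numeric(df[is.na(df$date2), "date"]),
                                        origin = "1899-12-30")
```

Only the rows where pass 1 produced NA are touched in pass 2, so as.numeric() is never applied to strings like "28/02/2020".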

Error in prepData function in package moveHMM contiguous data

I am trying to use the prepData function in the R package moveHMM. I am getting "Error in prepData(x, coordNames = c("lon", "lat")) : Each animal's observations must be contiguous."
x is a data.frame with column names "ID", "long", "lat". ID column is the name of each animal as a character, and lon/lat are numeric. There are no NA values, no missing rows.
I do not know what this error means nor can I fix it. Help please.
x <- data.frame(dat$ID, dat$lon, dat$lat)
hmmgps <- prepData(x, coordNames=c("lon", "lat"))
The function prepData assumes that the rows for each track (or each animal) are grouped together in the data frame. The error message indicates that it is not the case, and that at least one track is split. For example, the following (artificial) data set would cause this error:
> data
ID lon lat
1 1 54.08658 12.190313
2 1 54.20608 12.101203
3 1 54.18977 12.270896
4 2 55.79217 9.943341
5 2 55.88145 9.986028
6 2 55.91742 9.887342
7 1 54.25305 12.374541
8 1 54.28061 12.190078
This is because the track with ID "1" is split into two parts, separated by the track with ID "2".
The tracks need to be contiguous, i.e. all observations with ID "1" should come first, followed by all observations with ID "2". One possible solution would be to order the data by ID and by date.
Consider the same data set, with a "date" column:
> data
ID lon lat date
1 1 54.08658 12.190313 2019-09-06 14:20:00
2 1 54.20608 12.101203 2019-09-06 15:20:00
3 1 54.18977 12.270896 2019-09-06 16:20:00
4 2 55.79217 9.943341 2019-09-04 07:55:00
5 2 55.88145 9.986028 2019-09-04 08:55:00
6 2 55.91742 9.887342 2019-09-04 09:55:00
7 1 54.25305 12.374541 2019-09-06 17:20:00
8 1 54.28061 12.190078 2019-09-06 18:20:00
You can define the ordered data set with:
> data_ordered <- data[with(data, order(ID, date)),]
> data_ordered
ID lon lat date
1 1 54.08658 12.190313 2019-09-06 14:20:00
2 1 54.20608 12.101203 2019-09-06 15:20:00
3 1 54.18977 12.270896 2019-09-06 16:20:00
7 1 54.25305 12.374541 2019-09-06 17:20:00
8 1 54.28061 12.190078 2019-09-06 18:20:00
4 2 55.79217 9.943341 2019-09-04 07:55:00
5 2 55.88145 9.986028 2019-09-04 08:55:00
6 2 55.91742 9.887342 2019-09-04 09:55:00
Then, the ordered data (excluding the date column) can be passed to prepData:
> hmmgps <- prepData(data_ordered[,1:3], coordNames = c("lon", "lat"))
> hmmgps
ID step angle x y
1 1 16.32042 NA 54.08658 12.190313
2 1 18.85560 2.3133191 54.20608 12.101203
3 1 13.37296 -0.6347523 54.18977 12.270896
4 1 20.62507 -2.4551318 54.25305 12.374541
5 1 NA NA 54.28061 12.190078
6 2 10.86906 NA 55.79217 9.943341
7 2 11.60618 -1.6734604 55.88145 9.986028
8 2 NA NA 55.91742 9.887342
I hope that this helps.
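A quick way to check for this problem before calling prepData is to look for IDs that start more than one run of rows (a sketch, not part of moveHMM; the helper name is made up):

```r
# An ID column is contiguous if no ID starts a new run of rows more than once
is_contiguous <- function(id) {
  runs <- rle(as.character(id))$values  # one entry per consecutive run
  !any(duplicated(runs))
}

is_contiguous(c(1, 1, 2, 2))        # TRUE
is_contiguous(c(1, 1, 2, 2, 1, 1))  # FALSE: ID 1 appears in two separate runs
```

If this returns FALSE, ordering by ID and date as shown above should fix the data.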

R - Find a value based on a criteria

I have a data frame DF with numerous columns; one holds the Date and another the Hour.
My point is that I need to find the PRICE (in the same data frame) from 36 hours before. Not all of my days have 24 hours, so I can't just shift my data set.
My idea was to look for the day before in my data set and 12 hours earlier.
This is what I wrote, but it is not working:
for (i in 38:nrow(DF)){
  RefDay = as.Date(DF$Date[i])
  HourRef = DF$Hour[i]
  DF$P24[i] = DF[which(DF$Date == (RefDay-1)) & which(DF$Hour == (HourRef-36)), "PRICE"]
}
Here is my DF:
'data.frame': 20895 obs. of 45 variables:
$ Hour : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Date : POSIXct, format: "2016-07-01" "2016-07-01" "2016-07-01" "2016-07-01" ...
$ PRICE : num 29.4 24.7 23.4 21.9 20.2 ...
Here is a sample of my data:
DF.Hour DF.Date DF.PRICE
1 0 2016-07-01 29.36
2 1 2016-07-01 24.69
3 2 2016-07-01 23.42
4 3 2016-07-01 21.91
5 4 2016-07-01 20.19
6 5 2016-07-01 22.44
Try filling the data frame out to full days. You can do that with complete() from tidyr; it fills the missing combinations with NA.
Once every day has 24 rows, the value from 36 hours before is simply the value 36 rows before, which you can get with, for example, lag(PRICE, 36).
library(tidyr); library(dplyr)
DF <- complete(DF, Hour, Date) %>% arrange(Date, Hour)
DF$P24 <- lag(DF$PRICE, 36)
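To illustrate what complete() does here, a toy example with one missing hour (the values are made up):

```r
library(tidyr)

# Toy data: hour 1 of the day is missing
toy <- data.frame(
  Date  = as.Date(c("2016-07-01", "2016-07-01")),
  Hour  = c(0, 2),
  PRICE = c(29.36, 23.42)
)

complete(toy, Date, Hour = 0:2)
# Adds the Hour = 1 row with PRICE = NA, so every day has the same row count
```

Once every day has the same number of rows, "36 hours before" and "36 rows before" mean the same thing, which is what makes the lag() trick valid.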

Aggregating hourly data into daily aggregates with missing value in R

I have a data frame "RH" with hourly data, and I want to convert it to daily maximum and minimum values. The code from this question was very useful: Aggregating hourly data into daily aggregates
RH$Date <- strptime(RH$Date, format="%Y/%m/%d")
RH$day <- trunc(RH$Date,"day")
require(plyr)
x <- ddply(RH, .(Date),
           summarize,
           aveRH = mean(RH),
           maxRH = max(RH),
           minRH = min(RH))
But my first 5 years of data are 3-hourly rather than hourly, so I get no results for those years. Any suggestions? Thank you in advance.
'data.frame': 201600 obs. of 3 variables:
$ Date: chr "1985/01/01" "1985/01/01" "1985/01/01" "1985/01/01" ...
$ Hour: int 1 2 3 4 5 6 7 8 9 10 ...
$ RH : int NA NA 93 NA NA NA NA NA 79 NA ...
The link you provided is an old one. The code there is still perfectly good and would work, but here's a more modern version using dplyr and lubridate:
df <- read.table(text='date_time value
"01/01/2000 01:00" 30
"01/01/2000 02:00" 31
"01/01/2000 03:00" 33
"12/31/2000 23:00" 25',header=TRUE,stringsAsFactors=FALSE)
library(dplyr);library(lubridate)
df %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %H:%M")) %>%
  group_by(date(date_time)) %>%
  summarise(mean = mean(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE),
            min = min(value, na.rm = TRUE))
`date(date_time)` mean max min
<date> <dbl> <dbl> <dbl>
1 2000-01-01 31.33333 33 30
2 2000-12-31 25.00000 25 25
EDIT
Since there's already a date column, this should work:
RH %>%
  group_by(Date) %>%
  summarise(mean = mean(RH, na.rm = TRUE),
            max = max(RH, na.rm = TRUE),
            min = min(RH, na.rm = TRUE))
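One caveat worth noting (not from the original answer): when every value in a group is NA, max(..., na.rm = TRUE) returns -Inf with a warning, which can bite in the 3-hourly years where most hours are NA. A guarded helper, as a sketch (the function name is made up):

```r
# max() that returns NA instead of -Inf when a whole group is NA
safe_max <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

safe_max(c(NA, 3, 5))  # 5
safe_max(c(NA, NA))    # NA, where max(..., na.rm = TRUE) would give -Inf
```

You can drop safe_max (and an analogous safe_min) into the summarise() call in place of the bare max and min.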

Creating columns based on factor levels in base R reshape()

I am writing code to generate reports regarding study subjects and the timing of their follow-up visits. I have data that looks like this:
subj_id timepoint date
100 3 month 2013-01-01
101 3 month 2013-01-12
102 3 month 2013-02-01
... ... ...
I would like to turn this into a "wide" data frame that I can then merge with another data frame I have, which shows when each subject should have been seen. Using reshape, I can do this, but I run into the following issue: if I reshape the data frame, I only get as many columns as there are timepoint values actually present in the data, even if the factor has levels that have not yet been encountered in the database.
So, in my example, the variable timepoint is a factor with four levels: 3 month, 6 month, 9 month, and 12 month. However, at this point in the study, we haven't had anyone get past the 3 month visit, so the data is just lines 100, 101, and 102 above.
Using the following commands, you can get what I'm seeing (obviously this isn't how my data is created):
test_df <- data.frame(subj_id=c(100,101,102),
timepoint=c("3 month","3 month","3 month"),
date=c(as.Date("2013-01-01"),
as.Date("2013-01-12"),
as.Date("2013-02-01")))
test_df$timepoint <- factor(x=test_df$timepoint,
levels=c("3 month","6 month",
"9 month","12 month"),
labels=c("3 month","6 month",
"9 month","12 month"),
ordered=TRUE)
print(test_df)
> subj_id timepoint date
> 1 100 3 month 2013-01-01
> 2 101 3 month 2013-01-12
> 3 102 3 month 2013-02-01
levels(test_df$timepoint)
> [1] "3 month" "6 month" "9 month" "12 month"
reshape(data=test_df,v.names="date",
timevar="timepoint",idvar="subj_id",direction="wide")
> subj_id date.3 month
> 1 100 2013-01-01
> 2 101 2013-01-12
> 3 102 2013-02-01
What I would like to get would be something like this:
> subj_id date.3 month date.6 month date.9 month date.12 month
> 1 100 2013-01-01 NA NA NA
> 2 101 2013-01-12 NA NA NA
> 3 102 2013-02-01 NA NA NA
Is there a way to do this in base reshape? My current thought is to put four "fake" records in before I run reshape so that it will see four levels and create the data frame accordingly, but that seems kludgy at best. Is there a better way?
Here's a way to programmatically extend the data frame, adding columns for the unpopulated levels:
> new_df <- reshape(data=test_df,
+ timevar="timepoint",idvar="subj_id",direction="wide" )
> new_df
subj_id date.3 month
1 100 2013-01-01
2 101 2013-01-12
3 102 2013-02-01
> new_df[ , setdiff(levels(test_df$timepoint) ,
factor(test_df$timepoint)) ] <- NA
>
> new_df
subj_id date.3 month 6 month 9 month 12 month
1 100 2013-01-01 NA NA NA
2 101 2013-01-12 NA NA NA
3 102 2013-02-01 NA NA NA
Note: Those column names will always need to be quoted because they have spaces. I never allow column names to stay that way.
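If you'd rather not live with the spaces, one option (a sketch; adapt to taste) is to run the names through make.names(), which replaces characters that are illegal in unquoted names with dots:

```r
# Replace spaces (and other illegal characters) in the column names with dots,
# so the columns can be used without backquotes
names(new_df) <- make.names(names(new_df))
names(new_df)
# e.g. "date.3.month" instead of "date.3 month"
```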
