I wrote a function that creates a column based on a datetime column, taking start and end dates as parameters, but I can't get it to work.
df is a data frame object.
create_gv <- function(df, s_ymd, e_ymd, char) {
df<-get(df)
for (i in (1:nrow(df))) {
ymd <- format(df[i,1],"%y%m%d")
if ((strptime(ymd,format = "%y%m%d") >= strptime(s_ymd,format = "%y%m%d") & strptime(ymd,format = "%y%m%d") <= strptime(e_ymd,format = "%y%m%d")) == TRUE) {
df$group_var[i]<-char
}
}
}
create_gv("example","171224","171224","D")
I get
> example
start_time group_var
1 2017-12-24 10:42:39 NA
2 2017-12-24 10:44:31 NA
3 2018-01-14 12:05:53 NA
4 2018-01-14 12:22:12 NA
Reproducible data frame named example here:
example <- structure(list(start_time = structure(c(1514112159, 1514112271, 1515931553, 1515932532), class = c("POSIXct", "POSIXt"), tzone = ""), group_var = c(NA, NA, NA, NA)), .Names = c("start_time", "group_var"), row.names = c(NA, -4L), class = "data.frame")
Desired output:
start_time group_var
1 2017-12-24 10:42:39 D
2 2017-12-24 10:44:31 D
3 2018-01-14 12:05:53 NA
4 2018-01-14 12:22:12 NA
From your description, my understanding is that you want to check if the date in a row is between the start and end date (which are scalars), and update the value of group_var accordingly.
The lubridate package provides a set of tools that make it easy to work with dates. To compare dates you don't need to format them; format only controls how they are displayed. I have used the dplyr package, which makes data transformations like this straightforward.
To solve the problem, we use dplyr::mutate, which creates or modifies a column as a vectorized function of other columns. In this case, the date column in our dataset (start_time) is compared with the scalar start and end times in order to set the group_var variable.
library(lubridate)
library(magrittr)
df <- example  # the data frame from the question
char <- "D"
# Arbitrary start and end times for the purpose of the example. Any datetime values can be used here.
s_ymd <- df$start_time[1] - 5000
e_ymd <- df$start_time[2] + 5000
df <- df %>%
  dplyr::mutate(group_var = ifelse(start_time > s_ymd & start_time < e_ymd, char, NA))
df
To use a function directly, write:
create_gv <- function(start_time, s_ymd, e_ymd, char){
g_var <- ifelse(start_time > s_ymd & start_time < e_ymd,
char, NA)
return(g_var)
}
df %>% dplyr::mutate(group_var = create_gv(start_time, !!s_ymd, !!e_ymd,
!!char))
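Here s_ymd, e_ymd and char are scalars (i.e. not columns in the data frame). Unquoting them with !! is not strictly required, since mutate falls back to the calling environment for names that are not columns, but it makes explicit that they come from outside the data frame and avoids a clash if a column with the same name exists. Note that mutate expects vectorized functions, which create_gv is.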
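If the start and end dates should still be passed as "yymmdd" strings, as in the question, the comparison can be done on whole days in base R. The following is only a minimal sketch, not the answer's own method. It assumes the datetime column is named start_time, compares calendar days in the column's own time zone, and returns the modified data frame. The function in the question changed only a local copy and returned nothing, which is why example stayed all NA.

```r
# Sketch: the column name start_time and the "yymmdd" input format are
# assumptions taken from the question's example.
create_gv <- function(df, s_ymd, e_ymd, char) {
  # Reduce each datetime to its calendar day (in the column's own time zone)
  day <- as.Date(format(df$start_time, "%Y-%m-%d"))
  s <- as.Date(s_ymd, format = "%y%m%d")
  e <- as.Date(e_ymd, format = "%y%m%d")
  in_range <- day >= s & day <= e
  df$group_var[in_range] <- char  # rows outside the range keep their old value
  df                              # return the modified copy
}

# Same timestamps as the question's data, built with an explicit time zone
example <- data.frame(
  start_time = as.POSIXct(c("2017-12-24 10:42:39", "2017-12-24 10:44:31",
                            "2018-01-14 12:05:53", "2018-01-14 12:22:12"),
                          tz = "UTC"),
  group_var = NA
)
example <- create_gv(example, "171224", "171224", "D")
example
```

The result has to be assigned back to example, because R functions work on copies of their arguments.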
Related
I'm trying to rename two types of variables in R using tidyverse/dplyr. The first type, "var_a_year", I want to rename to "sample_year". The second type, "var_b_7", I want to rename to "index_year".
The second variable, "var_b", starts at the number 7 for the first year, 2004, and increases by 2 for each year. So for 2005, the second type of variable is called "var_b_9", as shown.
I would like to use a loop to make this faster, instead of writing a line for each year.
Many thanks in advance!
df <- df %>%
rename(
sample_2004 = var_a_2004, index_2004 = var_b_7,
sample_2005 = var_a_2005, index_2005 = var_b_9,
sample_2006 = var_a_2006, index_2006 = var_b_11,
sample_2007 = var_a_2007, index_2007 = var_b_13,
...
sample_2020 = var_a_2020, index_2020 = var_b_39)
There's no need to use a loop. rename_with will do the trick:
library(dplyr)
df <- tibble(var_a_2004=NA, var_b_7=NA, var_a_2005=NA, var_b_9=NA)
renameA <- function(x) {
  return(paste0("sample_", stringr::str_sub(x, -4)))
}
df %>% rename_with(renameA, starts_with("var_a"))
Gives
# A tibble: 1 x 4
sample_2004 var_b_7 sample_2005 var_b_9
<lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA
I'll leave you to work out how to code the corresponding function for your var_b_XXXX columns.
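For reference, one way the corresponding var_b function might look is sketched below. The helper name renameB is hypothetical, and the mapping assumes what the question states: suffix 7 corresponds to 2004 and the suffix grows by 2 per year, so year = 2004 + (n - 7) / 2.

```r
library(dplyr)

df <- tibble(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA, var_b_9 = NA)

# Hypothetical helper: turn "var_b_<n>" into "index_<year>",
# assuming n = 7 is 2004 and n increases by 2 each year.
renameB <- function(x) {
  n <- as.integer(sub("^var_b_", "", x))
  paste0("index_", 2004 + (n - 7) / 2)
}

renamed <- df %>%
  rename_with(~ sub("^var_a_", "sample_", .x), starts_with("var_a_")) %>%
  rename_with(renameB, starts_with("var_b_"))

names(renamed)
```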
In addition to the answer of Limey:
#sample data
df <- structure(list(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA,
var_b_9 = NA), row.names = c(NA, -1L), class = "data.frame")
#load data.table package
library(data.table)
#set df to data.table
dt <- as.data.table(df)
#replace var_a_ in the column names with sample_
colnames(dt) <- gsub("var_a_", "sample_", colnames(dt))
#use a loop to rename var_b_<nr> to index_<year>
for(i in 2004:2005){
year <- i
nr <- 2* i -4001
setnames(dt, old = paste0("var_b_", nr), new = paste0("index_", year))
}
This loop covers the years 2004:2005 to match the sample data. You can change the range to 2004:2020 for your dataset.
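Since setnames accepts vectors of old and new names, the loop can also be avoided. A minimal sketch, assuming the same sample columns and the same year-to-number mapping (nr = 2 * year - 4001) as above:

```r
library(data.table)

# Sample data with the same column names as in the answer above
dt <- data.table(var_a_2004 = NA, var_b_7 = NA, var_a_2005 = NA, var_b_9 = NA)

years <- 2004:2005  # widen to 2004:2020 for the full dataset

# Rename all columns of each type in a single call, by reference
setnames(dt, old = paste0("var_a_", years), new = paste0("sample_", years))
setnames(dt, old = paste0("var_b_", 2 * years - 4001), new = paste0("index_", years))

names(dt)
```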
I have 24 files (one for each hour of the day; HR_NBR = Hour Number) and I need to pad the dates in each of the files.
AS-IS data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
03/07/2016 1 10
TO-BE data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
02/07/2016 NA NA
03/07/2016 1 10
I can use the pad function for each file, like this:
chil_bev1_1 = pad (chil_bev1_1, interval= "day") # Hour1
chil_bev1_2 = pad (chil_bev1_2, interval= "day") # Hour2
and so on.
And it works. But I want to use a loop or lapply.
I tried several variations of these two snippets, but none of them worked:
df1 = data.frame (chil_bev1_1)
df2 = data.frame (chil_bev1_2)
dflist = c("df1","df2")
CODE1:
x = function(df) {df %>% pad}
allpad = lapply(dflist,x)
CODE2:
x = function(df) {pad (df)}
allpad = lapply(dflist,x)
The error is
"x must be a data frame".
I'm new to R. Any help would be greatly appreciated.
Thank you.
I managed to figure it out. Here's the answer:
hour_list = list(chil_bev1_1, chil_bev1_2)
chil_bev1n = lapply (hour_list, function (x) {x %>% complete(CLNDR_DT = seq.Date(min(CLNDR_DT), max(CLNDR_DT), by="day"), fill = list(QTY=0))})
Notes:
The fill = list(QTY = 0) argument replaces the NAs in QTY with 0s.
CLNDR_DT is the name of the column that contains dates.
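The "x must be a data frame" error in the question arises because dflist is a character vector of names, so lapply hands the strings "df1" and "df2" to pad instead of the data frames. Putting the data frames themselves in a list avoids this. The sketch below uses tidyr::complete as in the answer; the second data frame holds made-up values, the dates are assumed to be dd/mm/yyyy, and the helper name pad_days is hypothetical. The fill argument is left out here so that the gaps stay NA, as in the TO-BE data.

```r
library(dplyr)
library(tidyr)

# First frame mirrors the AS-IS data; the second one is invented for illustration
chil_bev1_1 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-03")),
                          HR_NBR = 1, QTY = c(6, 10))
chil_bev1_2 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-04")),
                          HR_NBR = 2, QTY = c(3, 5))

# A list of data frames, not a character vector of their names
hour_list <- list(hour1 = chil_bev1_1, hour2 = chil_bev1_2)

# Hypothetical helper: add one row per missing day between the first and last date
pad_days <- function(x) {
  x %>% complete(CLNDR_DT = seq.Date(min(CLNDR_DT), max(CLNDR_DT), by = "day"))
}

padded <- lapply(hour_list, pad_days)
padded$hour1
```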
I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data-frame. However, the values inside df are non-dates. If I perform the date +7 calculation it works (e.g. once it goes to 2010-01-08), but if I assign it to the df it turns into weird numerical values.
Appreciate any help.
Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get in the habit of not giving your variables names that are already used by functions, such as data and date.
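The "weird numerical values" in the question are the underlying representation of a Date, which R stores as the number of days since 1970-01-01. That number is what remains once the Date class is lost. A short sketch of this, together with a loop-free way to build the weekly sequence with plain vector arithmetic (the variable names are illustrative only):

```r
start <- as.Date("2010-01-01")

# The number behind the date: days since 1970-01-01
as.numeric(start + 7)  # 14617

# Weekly dates without a loop: add 0, 7, 14, ... days to the start date
n <- 5
weekly <- start + 7 * (seq_len(n) - 1)
weekly
```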
When aggregating an R data frame, the dates are converted to integers.
For instance, if I want to take the maximum date for every id in the following data frame:
> df1 <- data.frame(id = rep(c(1, 2), 2), b = as.Date(paste("01/01/", 2000:2003, sep=''), format = "%d/%m/%Y"))
> df1
id b
1 1 2000-01-01
2 2 2001-01-01
3 1 2002-01-01
4 2 2003-01-01
> aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
id b
1 1 11688
2 2 12053
Why does R behave this way? (And what's the best way to keep a Date-class column in the returned data frame?)
Thanks for your help.
That works for me in R version 3, so perhaps the behaviour changed in an update; I recommend updating R :)
For your current version of R, have you tried calling as.Date() after aggregating?
In your example, it would look like this:
dtf2 <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
dtf2$b <- as.Date(dtf2$b)
You can also pass the 'origin' argument to as.Date, like
as.Date(dtf2$b, origin = '1970-01-01')
UPD: When R stores dates as numbers, they count days from the origin, January 1, 1970.
Hope that will help.
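Put together, a version-independent sketch might look like the following. It converts back only if the Date class was actually lost, so it should be harmless on R versions where aggregate already keeps the class.

```r
df1 <- data.frame(id = rep(c(1, 2), 2),
                  b = as.Date(paste0(2000:2003, "-01-01")))

agg <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = max)

# Restore the Date class only if aggregate returned plain day counts
if (!inherits(agg$b, "Date")) {
  agg$b <- as.Date(agg$b, origin = "1970-01-01")
}
agg
```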
I'm trying to melt a data frame that has a column of class chron:
library(chron)
x = data.frame(Index = as.chron(c(15657.00,15657.17)), Var1 = c(1,2), Var2 = c(9,8))
x
Index Var1 Var2
1 (11/13/12 00:00:00) 1 9
2 (11/13/12 04:04:48) 2 8
y = melt(x,id.vars="Index")
Error in data.frame(ids, variable, value, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 2, 4
I can work around it with as.numeric() as follows:
x$Index= as.numeric(x$Index)
y = melt(x,id.vars="Index")
y$Index = as.chron(y$Index)
y
Index variable value
1 (11/13/12 00:00:00) Var1 1
2 (11/13/12 04:04:48) Var1 2
3 (11/13/12 00:00:00) Var2 9
4 (11/13/12 04:04:48) Var2 8
But can it be done more simply? (I want to keep the chron class.)
(1) I assume you issued this command before running the code shown:
library(reshape2)
In that case you could use the reshape package instead. It doesn't result in this problem:
library(reshape)
Other solutions are to
(2) use R's reshape function:
reshape(direction = "long", data = x, varying = list(2:3), v.names = "Var")
(3) or convert the chron column to numeric, use melt from the reshape2 package and then convert back:
library(reshape2)
xt <- transform(x, Index = as.numeric(Index))
transform(melt(xt, id = 1), Index = chron(Index))
ADDED additional solutions.
I'm not sure, but I think this might be an "oversight" in chron (or possibly data.frame, but that seems unlikely).
The issue occurs when constructing the data frame in melt.data.frame in reshape2, which typically uses recycling, but that portion of data.frame:
for (j in seq_along(xi)) {
xi1 <- xi[[j]]
if (is.vector(xi1) || is.factor(xi1))
xi[[j]] <- rep(xi1, length.out = nr)
else if (is.character(xi1) && class(xi1) == "AsIs")
xi[[j]] <- structure(rep(xi1, length.out = nr), class = class(xi1))
else if (inherits(xi1, "Date") || inherits(xi1, "POSIXct"))
xi[[j]] <- rep(xi1, length.out = nr)
else {
fixed <- FALSE
break
}
seems to go wrong, as the chron variable doesn't inherit either Date or POSIXct. This removes the error but alters the date times:
x = data.frame(Index = as.chron(c(15657.00,15657.17)), Var1 = c(1,2), Var2 = c(9,8))
class(x$Index) <- c(class(x$Index),'POSIXct')
y = melt(x,id.vars="Index")
Like I said, this sorta smells like a bug somewhere. My money would be on the need for chron to add POSIXct to the class vector, but I could be wrong. The obvious alternative would be to use POSIXct date times instead.
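A rough sketch of that POSIXct alternative, assuming the numbers in the question are days since 1970-01-01 (chron's default origin) and that UTC is an acceptable time zone. Note that it gives up the chron class, which the question wanted to keep:

```r
library(reshape2)

# Assumption: the values are days since 1970-01-01, so multiplying by 86400
# gives seconds; round() guards against floating-point noise in the seconds.
days <- c(15657.00, 15657.17)
x <- data.frame(
  Index = as.POSIXct(round(days * 86400), origin = "1970-01-01", tz = "UTC"),
  Var1 = c(1, 2),
  Var2 = c(9, 8)
)

# POSIXct id columns are recycled by data.frame, so melt runs without the error
y <- melt(x, id.vars = "Index")
y
```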