I have a data frame in R that contains one column named H:
H Index
11
11
11
11
12
12
12
13
13
14
14
15
15
15
16
17
18
19
20
20
20
21
22
23
00
00
00
01
01
02
03
04
04
04
04
05
06
07
07
07
08
09
09
09
10
11
12
How can I create a new column filled with 1 for H in the range 10 to 18 (i.e., 10, 11, 12, 13, 14, 15, 16, 17 and 18) and filled with 0 for H from 19 to 09 (i.e., 19, 20, 21, 22, 23, 00, 01, 02, 03, 04, 05, 06, 07, 08 and 09)?
Thanks a lot.
We could also do
df$Index <- +(df$H<19 & df$H>9)
Or with ifelse
df$Index <- ifelse(df$H < 19 & df$H >9, 1, 0)
If the 'H' column is character, we convert it to numeric
df$H <- as.numeric(df$H)
Or if it is factor
df$H <- as.numeric(as.character(df$H))
and then perform the operations mentioned above
df$Index <- +(df$H < 19 & df$H >9)
This is easy as you need the value based on a range. If df is the dataframe,
df$H<19 & df$H>9
will give you a logical vector of TRUE/FALSE values testing whether each value is in the range from 10 to 18. Using the as.integer function, you can cast this to 1s and 0s.[1]
df$Index <- as.integer(df$H<19 & df$H>9)
If the column is a character vector, we can first cast to a numeric value before doing the test
df$Index <- as.integer( as.integer(df$H)<19 & as.integer(df$H)>9)
If the value is not an integer, we can use as.numeric instead to do the inner casts.
[1] This works because, according to help(logical), TRUE is coerced to 1 and FALSE is coerced to 0 when used in a numerical context, and as.integer follows those coercion rules. We could also have done this coercion manually with the ifelse function, as in ifelse(df$H<19&df$H>9,1,0), which examines each element of the logical vector and returns 1 if it is TRUE or 0 if it is FALSE.
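As a quick check of the approach described above, here is a minimal sketch on a small made-up sample of hours stored as character strings with leading zeros (the sample values are an assumption chosen to cover both ranges, not the full data from the question):

```r
# Hypothetical sample: hours stored as text, including leading zeros
df <- data.frame(H = c("11", "12", "18", "19", "23", "00", "09", "10"),
                 stringsAsFactors = FALSE)

# Convert once to integer ("09" becomes 9, "00" becomes 0)
h_num <- as.integer(df$H)

# TRUE/FALSE is coerced to 1/0 by as.integer
df$Index <- as.integer(h_num >= 10 & h_num <= 18)

print(df)
```

Converting the column once and reusing the result avoids repeating the cast inside the comparison.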
How do I convert this data set into a time series format in R? Let's call the data set Bob. This is what it looks like:
1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24
Are you looking for something like this....?
> dat <- read.table(text = "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24
", header=FALSE) # your data
> ts(dat$V2, start=c(2013, 1), frequency = 12) # time series object
     Jan Feb Mar Apr May Jun
2013  25 865  26  33  74  24
Assuming that your starting point is the data frame DF, defined reproducibly in the Note at the end, this converts it to a zoo series z as well as a ts series tt.
library(zoo)
z <- read.zoo(DF, FUN = as.yearmon, format = "%m/%Y")
tt <- as.ts(z)
z
## Jan 2013 Feb 2013 Mar 2013 Apr 2013 May 2013 Jun 2013
##       25      865       26       33       74       24
tt
##      Jan Feb Mar Apr May Jun
## 2013  25 865  26  33  74  24
Note
Lines <- "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24"
DF <- read.table(text = Lines)
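If you would rather not hard-code the start of the series, one possible base R sketch reads the month and year from the first row. The column names period and value are assumptions introduced here for illustration, and the rows are assumed to be consecutive months in order:

```r
Lines <- "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24"

# Column names are hypothetical labels for this sketch
DF <- read.table(text = Lines, col.names = c("period", "value"),
                 stringsAsFactors = FALSE)

# Split "1/2013" into month and year for the first observation
first <- as.integer(strsplit(DF$period[1], "/", fixed = TRUE)[[1]])

# start = c(year, month); frequency = 12 for monthly data
tt <- ts(DF$value, start = c(first[2], first[1]), frequency = 12)
print(tt)
```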
I have a data.frame with the last 12 months' values for 3 observations. There is a date variable corresponding to month.m0 (the most recent), and then the values go backward in time, subtracting one month each time:
date <- c("2017-01-01", "2016-12-01", "2016-10-01")
month.m0 <- c(1, 2, 3)
month.m1 <- c(4, 5, 6)
month.m2 <- c(7, 8, 9)
month.m3 <- c(10, 11, 12)
month.m4 <- c(13, 14, 15)
month.m5 <- c(16, 17, 18)
month.m6 <- c(19, 20, 21)
month.m7 <- c(22, 23, 24)
month.m8 <- c(25, 26, 27)
month.m9 <- c(28, 29, 30)
month.m10 <- c(31, 32, 33)
month.m11 <- c(34, 35, 36)
df <- data.frame(date, month.m0, month.m1, month.m2, month.m3, month.m4, month.m5, month.m6, month.m7, month.m8, month.m9, month.m10, month.m11)
The input will be:
        date month.m0 month.m1 month.m2 month.m3 month.m4 month.m5 month.m6 month.m7 month.m8 month.m9 month.m10 month.m11
1 2017-01-01        1        4        7       10       13       16       19       22       25       28        31        34
2 2016-12-01        2        5        8       11       14       17       20       23       26       29        32        35
3 2016-10-01        3        6        9       12       15       18       21       24       27       30        33        36
The problem here is that I don't know the real month of each observation, because the numbering is ordinal and depends on the date variable.
For the first row, the initial value (month.m0) corresponds to January, because the date is in January (the day and the year don't matter). For the second row, the date indicates that month.m0 corresponds to December, and for the third row it corresponds to October. Then month.m1 is the (month(Date) - months(1)) value, month.m2 corresponds to (month(Date) - months(2)), and so on, going back in time from the initial value.
EDITED OUTPUT:
I was trying to assign each value to the real month, so the output would be:
        date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 2017-01-01   1  34  31  28  25  22  19  16  13  10   7   4
2 2016-12-01  35  32  29  26  23  20  17  14  11   8   5   2
3 2016-10-01  30  27  24  21  18  15  12   9   6   3  36  33
It's easy to assign the first month for each observation, but it gets complicated when going backwards in time.
Assuming that df is the dataframe you provided...
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  gather(month_num, value, -date) %>%                                       # reshape dataset
  mutate(month_num = as.numeric(gsub("month.m", "", month_num)),            # keep only the number (as your step)
         date = ymd(date),                                                  # transform date to date object
         month_actual = month(date),                                        # keep the number of the actual month (baseline)
         month_now = month_actual + month_num,                              # create the current month (baseline + step)
         month_now_upd = ifelse(month_now > 12, month_now - 12, month_now), # update month number (for numbers > 12)
         month_now_upd_name = month(month_now_upd, label = TRUE)) %>%       # get name of the month
  select(date, month_now_upd_name, value) %>%                               # keep useful columns
  spread(month_now_upd_name, value) %>%                                     # reshape again
  arrange(desc(date))                                                       # start from recent month
#         date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 2017-01-01   1   4   7  10  13  16  19  22  25  28  31  34
# 2 2016-12-01   5   8  11  14  17  20  23  26  29  32  35   2
# 3 2016-10-01  12  15  18  21  24  27  30  33  36   3   6   9
Note that I created various (helpful) variables that you won't need in the end, but they will help you understand the process when you run the chained commands step by step.
You can make the above code shorter by combining some commands within mutate if you want.
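The output above steps forward from the baseline month, whereas the edited output in the question steps backward. A minimal base R sketch of the backward variant, assuming the same column naming (month.m0 to month.m11) and using the built-in month.abb constant for the column names, could look like this:

```r
# Rebuild the example data: column k holds month.m(k), values 1..36
vals <- matrix(1:36, nrow = 3,
               dimnames = list(NULL, paste0("month.m", 0:11)))
df <- data.frame(date = c("2017-01-01", "2016-12-01", "2016-10-01"),
                 vals, stringsAsFactors = FALSE)

# Calendar month (1-12) that month.m0 refers to, per row
start_month <- as.integer(format(as.Date(df$date), "%m"))

# Empty result: one column per calendar month
out <- matrix(NA_real_, nrow = nrow(df), ncol = 12,
              dimnames = list(NULL, month.abb))

for (k in 0:11) {
  # Step k months back from the baseline, wrapping around the year
  target <- (start_month - k - 1) %% 12 + 1
  out[cbind(seq_len(nrow(df)), target)] <- df[[paste0("month.m", k)]]
}

result <- data.frame(date = df$date, out)
print(result)
```

The modulo arithmetic replaces the ifelse wrap-around step, since subtracting can take the month number below 1.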
Your explanation is not very clear to me, so my output is not exactly yours. But this is how I would do it:
library(dplyr)
library(tidyr)
library(stringr) # needed for str_replace
df %>%
  # First create a new variable containing the month as a number between 1 and 12
  mutate(month = strftime(date, "%m")) %>%
  # Make the data tidy, so there is a new column col containing
  # month.m0, month.m1, month.m2, ... and a column val containing
  # the values
  gather(col, val, -date, -month) %>%
  # Remove "month.m" so the col column has numeric values
  mutate_at("col", str_replace, pattern = "month.m", replacement = "") %>%
  mutate_at(c("month", "col"), as.numeric) %>%
  # Compute the difference between the month column and the col column
  mutate(col = abs((col - month + 1) %% 12)) %>%
  # Sort the data frame according to the new col column
  arrange(month, col) %>%
  # Add "month.m" back to the col column to redefine the column names
  mutate(col = paste0("month.m", col), month = NULL) %>%
  # Untidy the data frame
  spread(col, val)
diff(seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by="month"))
Time differences in days
[1] 31 31 28
The above code gives the number of days in the months Dec, Jan and Feb.
However, my requirement is as follows
#Results that I need
#monthly days from date 2016-12-21 to 2017-04-05
11, 31, 28, 31, 5
#i.e 11 days of Dec, 31 of Jan, 28 of Feb, 31 of Mar and 5 days of Apr.
I even tried days_in_month from lubridate but was not able to achieve the result:
library(lubridate)
days_in_month(c(as.Date("2016-12-21"), as.Date("2017-04-05")))
Dec Apr
31 30
Try this:
x = rle(format(seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by=1), '%b'))
setNames(x$lengths, x$values)
# Dec Jan Feb Mar Apr
# 11 31 28 31 5
Although we have seen a clever replacement of table by rle and a pure table solution, I want to add two approaches using grouping. All approaches have in common that they create a sequence of days between the two given dates and aggregate by month but in different ways.
aggregate()
This one uses base R:
# create sequence of days
days <- seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by = 1)
# aggregate by month
aggregate(days, list(month = format(days, "%b")), length)
# month x
#1 Apr 5
#2 Dez 11
#3 Feb 28
#4 Jan 31
#5 Mrz 31
Unfortunately, the months are ordered alphabetically, as happened with the simple table() approach (the abbreviations Dez and Mrz come from a German locale). In these situations, I prefer the ISO 8601 way of unambiguously naming the months:
aggregate(days, list(month = format(days, "%Y-%m")), length)
# month x
#1 2016-12 11
#2 2017-01 31
#3 2017-02 28
#4 2017-03 31
#5 2017-04 5
data.table
Now that I've got used to the data.table syntax, this is my preferred approach:
library(data.table)
data.table(days)[, .N, .(month = format(days, "%b"))]
# month N
#1: Dez 11
#2: Jan 31
#3: Feb 28
#4: Mrz 31
#5: Apr 5
The order of months is kept as they have appeared in the input vector.
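To get both properties at once (input order preserved and labels that do not depend on the locale), a small base R sketch combines the rle idea with the "%Y-%m" format. The variable names runs and counts are just illustrative:

```r
# Sequence of all days in the interval, both ends included
days <- seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by = "day")

# Run-length encode the year-month label of each day;
# runs keep the order in which the months appear
runs <- rle(format(days, "%Y-%m"))

counts <- setNames(runs$lengths, runs$values)
print(counts)
```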
I'm working with SparkR on time series and I have a question.
After some operations I got something like this, where DayHour represents the day and the hour of the ID's Value.
DayHour ID Value
01 00 4704 10
01 01 4705 11
.
.
.
04 23 4705 12
The problem is that I have some gaps, e.g. 01 01 and 01 02 are missing here:
DayHour ID Value
01 00 4704 13
01 03 4704 12
I have to fill the gaps in the whole dataset like this:
DayHour ID Value
01 00 4704 13
01 01 4704 0
01 02 4704 0
01 03 4704 12
For each ID I have to fill the gaps with the missing DayHour, the ID and Value = 0.
A solution in R or SparkR would be useful.
I represented your data in data frame df_r
> df_r <- data.frame(DayHour = c("01 00", "01 01", "01 02", "01 03", "01 06", "01 07"),
+                    ID = c(4704, 4705, 4705, 4706, 4706, 4706), Value = c(10, 11, 12, 13, 14, 15))
> df_r
  DayHour   ID Value
1   01 00 4704    10
2   01 01 4705    11
3   01 02 4705    12
4   01 03 4706    13
5   01 06 4706    14
6   01 07 4706    15
where the missing hours are 01 04 and 01 05
# Remove the space so that "01 00" becomes "0100"
df_r$DayHour <- sub(" ", "", df_r$DayHour)
# Build every day/hour combination in sequence (days 01-04, hours 00-23)
x <- 0:23
y <- 1:4
all_day_hour <- data.frame(Hour = rep(x, 4), Day = rep(y, each = 24))
all_day_hour$Hour <- sprintf("%02d", all_day_hour$Hour)
all_day_hour$Day <- sprintf("%02d", all_day_hour$Day)
all_day_hour_1 <- transform(all_day_hour, DayHour = paste0(Day, Hour))
all_day_hour_1 <- all_day_hour_1["DayHour"]
# Use a for loop to process each ID separately
library(dplyr)
df.new <- data.frame()
factors <- unique(df_r$ID)
for (i in seq_along(factors)) {
  df_r1 <- filter(df_r, ID == factors[i])
  # Merge: all = TRUE keeps the day/hours that are missing for this ID
  df_data1 <- merge(df_r1, all_day_hour_1, by = "DayHour", all = TRUE)
  # Rows added by the merge have NA in ID and Value, so fill both
  df_data1$ID <- factors[i]
  df_data1$Value[is.na(df_data1$Value)] <- 0
  df.new <- rbind(df.new, df_data1)
}
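The same gap filling can also be sketched without a loop in plain R (not SparkR), by building the full DayHour x ID grid with expand.grid and left-joining the data onto it. The two-row input below is the small example from the question, and the grid is limited to hours 00-03 of day 01 purely to keep the illustration short:

```r
# Small example from the question, with the space already removed
df_r <- data.frame(DayHour = c("0100", "0103"),
                   ID = c(4704, 4704),
                   Value = c(13, 12),
                   stringsAsFactors = FALSE)

# Every DayHour/ID combination that should exist (assumed range)
grid <- expand.grid(DayHour = sprintf("01%02d", 0:3),
                    ID = unique(df_r$ID),
                    stringsAsFactors = FALSE)

# Left join: keep all grid rows, missing Values become NA
filled <- merge(grid, df_r, by = c("DayHour", "ID"), all.x = TRUE)
filled$Value[is.na(filled$Value)] <- 0

# Order by ID, then by DayHour
filled <- filled[order(filled$ID, filled$DayHour), ]
print(filled)
```

Because the ID is part of the grid, it is never NA after the merge and only Value needs filling.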
I want to check for seasonality in a time series by the day of the month.
The problem is that the months are not of equal length (or frequency): there are months with 31, 28 and 30 days.
When declaring the ts object I can only specify a fixed frequency, so it won't be correct.
> x <- data.frame(d = as.Date("2013-01-01") + 1:365 , v = runif(365))
> tapply(as.numeric(format(x$d,"%d")) , format(x$d,"%m") , max)
01 02 03 04 05 06 07 08 09 10 11 12
31 28 31 30 31 30 31 31 30 31 30 31
How can I create a time series object in R that I can later decompose and check for seasonality?
Is it possible to create a pivot table and convert it into a ts?