I have this data frame:
Source: local data frame [446,604 x 2]
date pressure
1 2014_01_01_0:01 991
2 2014_01_01_0:02 991
3 2014_01_01_0:03 991
4 2014_01_01_0:04 991
5 2014_01_01_0:05 991
6 2014_01_01_0:06 991
7 2014_01_01_0:07 991
8 2014_01_01_0:08 991
9 2014_01_01_0:09 991
10 2014_01_01_0:10 991
.. ... ...
I want to split the date column using separate() from tidyr:
library(tidyr)
separate(df, date, into = c("year", "month", "day", "time"), sep="_")
But it does not work. I managed to do it using substr() and mutate():
library(dplyr)
df %>%
  mutate(
    year = substr(date, 1, 4),
    month = substr(date, 6, 7),
    day = substr(date, 9, 10),
    time = substr(date, 12, 15))
Update:
It does not work because some rows are malformed. I diagnosed this with my initial substr() method and found unexpected entries in the data frame:
df %>%
  select(date) %>%
  mutate(
    year = substr(date, 1, 4),
    month = substr(date, 6, 7),
    day = substr(date, 9, 10),
    time = substr(date, 12, 15)) %>%
  group_by(year) %>%
  summarise(n = n())
And this is what I get:
Source: local data frame [33 x 2]
year n
1 2014 446293
2 4164 9
3 4165 10
4 4166 10
5 4167 10
6 4168 10
7 4169 10
8 4170 10
9 4171 10
10 4172 10
11 4173 10
12 4174 10
13 4175 10
14 4176 10
15 4177 10
16 4178 10
17 4179 10
18 4180 10
19 4181 10
20 4182 10
21 4183 10
22 4184 10
23 4185 10
24 4186 10
25 4187 10
26 4188 10
27 4189 10
28 4190 10
29 4191 10
30 4192 10
31 4193 11
32 4194 10
33 4195 1
Is there a more efficient way to inspect the structure of the elements of a column and find the malformed rows before calling separate()?
The steps would be:
Try to separate() first (no extra)
Notice there are malformed rows (errors in console)
Use separate() with extra = "drop"
Use group_by() and summarise() to explore the data and determine which rows to filter out
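As a complement to the steps above, here is a minimal base R sketch that flags malformed rows before separate() is called. The sample vector and the names pattern, ok, n_pieces and malformed are hypothetical, and the regular expression assumes the YYYY_MM_DD_H:MM layout shown in the question.

```r
# Hypothetical sample: two well-formed stamps and two malformed ones
dates <- c("2014_01_01_0:01", "2014_01_01_0:02", "41640.0006944", "2014_01_01")

# Assumed layout: 4-digit year, 2-digit month, 2-digit day, H:MM or HH:MM
pattern <- "^[0-9]{4}_[0-9]{2}_[0-9]{2}_[0-9]{1,2}:[0-9]{2}$"

# TRUE where the whole string matches the expected layout
ok <- grepl(pattern, dates)

# Number of pieces each value would yield when split on "_"
n_pieces <- lengths(strsplit(dates, "_", fixed = TRUE))

# Row positions that need a closer look
malformed <- which(!ok)

print(table(n_pieces))
print(dates[malformed])
```

On a real data frame the same check could be applied to df$date, with df[!ok, ] listing the rows to inspect or filter out.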
I am looking to aggregate data (mean) in half-year periods by group.
Here is a snapshot of the data:
Date Score Group Score2
01/01/2015 15 A 11
02/01/2015 34 A 33
03/01/2015 16 A 1
04/01/2015 29 A 36
05/01/2015 4 A 28
06/01/2015 10 B 33
07/01/2015 21 B 19
08/01/2015 6 B 47
09/01/2015 40 B 15
10/01/2015 34 B 13
11/01/2015 16 B 7
12/01/2015 8 B 4
I have dfd$mon <- as.yearmon(dfd$Date), then
r <- as.data.frame(dfd %>%
  mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
  group_by(Group, mon) %>%
  summarise(total = mean(Score), total1 = mean(Score2)))
for monthly aggregation, but how would you do this for every 6 months, grouped by Group?
I sense I am overcomplicating a simple issue here!
Add another mutate after the current one:
mutate(yearhalf = as.integer(as.integer(month) / 7) + 1) %>%
The output is 1 for the first 6 months and 2 for months 7 to 12. You then have to adapt the following functions to the new name (for example, group by yearhalf instead of mon), but that should do the trick.
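A fuller sketch of the half-year grouping, written in base R so it runs without extra packages. It rebuilds the snapshot from the question and assumes the Date column is in month/day/year order; the names yearhalf and half are illustrative only.

```r
# Snapshot from the question; the %m/%d/%Y date order is an assumption
dfd <- data.frame(
  Date   = as.Date(sprintf("%02d/01/2015", 1:12), format = "%m/%d/%Y"),
  Score  = c(15, 34, 16, 29, 4, 10, 21, 6, 40, 34, 16, 8),
  Group  = c(rep("A", 5), rep("B", 7)),
  Score2 = c(11, 33, 1, 36, 28, 33, 19, 47, 15, 13, 7, 4),
  stringsAsFactors = FALSE
)

# Year and half-year (1 = Jan to Jun, 2 = Jul to Dec) for each row
dfd$year     <- as.integer(format(dfd$Date, "%Y"))
dfd$yearhalf <- ifelse(as.integer(format(dfd$Date, "%m")) <= 6, 1L, 2L)

# Mean of both score columns per Group, year and half-year
half <- aggregate(cbind(Score, Score2) ~ Group + year + yearhalf,
                  data = dfd, FUN = mean)
print(half)
```

Grouping by year as well as yearhalf keeps the halves of different years apart once the data span more than one year.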
I have a data.frame with the last 12 monthly values for 3 observations. There is a date variable corresponding to month.m0 (the most recent month), and the remaining values go backward in time, subtracting one month each time:
date <- c("2017-01-01", "2016-12-01", "2016-10-01")
month.m0 <- c(1, 2, 3)
month.m1 <- c(4, 5, 6)
month.m2 <- c(7, 8, 9)
month.m3 <- c(10, 11, 12)
month.m4 <- c(13, 14, 15)
month.m5 <- c(16, 17, 18)
month.m6 <- c(19, 20, 21)
month.m7 <- c(22, 23, 24)
month.m8 <- c(25, 26, 27)
month.m9 <- c(28, 29, 30)
month.m10 <- c(31, 32, 33)
month.m11 <- c(34, 35, 36)
df <- data.frame(date, month.m0, month.m1, month.m2, month.m3, month.m4, month.m5, month.m6, month.m7, month.m8, month.m9, month.m10, month.m11)
The input will be:
date month.m0 month.m1 month.m2 month.m3 month.m4 month.m5 month.m6 month.m7 month.m8 month.m9 month.m10 month.m11
1 2017-01-01 1 4 7 10 13 16 19 22 25 28 31 34
2 2016-12-01 2 5 8 11 14 17 20 23 26 29 32 35
3 2016-10-01 3 6 9 12 15 18 21 24 27 30 33 36
The problem here is that I don't know the real month of each observation, because the numbering is ordinal and depends on the date variable.
For the first row, the initial value (month.m0) corresponds to January, because the date is in January (the day and the year don't matter). For the second row, the date indicates that month.m0 corresponds to December, and for the third row it corresponds to October. Then month.m1 is the value for (month(Date) - months(1)), month.m2 is the value for (month(Date) - months(2)), and so on, going back in time from the initial value.
EDITED OUTPUT:
I was trying to assign each value to the real month, so the output would be:
date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 2017-01-01 1 34 31 28 25 22 19 16 13 10 7 4
2 2016-12-01 35 32 29 26 23 20 17 14 11 8 5 2
3 2016-10-01 30 27 24 21 18 15 12 9 6 3 36 33
It's easy to assign the first month for each observation, but it gets complicated when going backward in time.
Assuming that df is the dataframe you provided...
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
gather(month_num,value,-date) %>% # reshape dataset
mutate(month_num = as.numeric(gsub("month.m","",month_num)), # keep only the number (as your step)
date = ymd(date), # transform date to date object
month_actual = month(date), # keep the number of the actual month (baseline)
month_now = month_actual + month_num, # create the current month (baseline + step)
month_now_upd = ifelse(month_now > 12, month_now-12, month_now), # update month number (for numbers > 12)
month_now_upd_name = month(month_now_upd, label=T)) %>% # get name of the month
select(date, month_now_upd_name, value) %>% # keep useful columns
spread(month_now_upd_name, value) %>% # reshape again
arrange(desc(date)) # start from recent month
# date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 2017-01-01 1 4 7 10 13 16 19 22 25 28 31 34
# 2 2016-12-01 5 8 11 14 17 20 23 26 29 32 35 2
# 3 2016-10-01 12 15 18 21 24 27 30 33 36 3 6 9
Note that I created various (helpful) variables that you won't need in the end, but they will help you understand the process when you run the chained commands step by step.
You can make the above code shorter by combining some commands within mutate if you want.
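The code above counts forward from the base month, while the question describes month.m1, month.m2, ... as going backward in time. Below is a minimal base R sketch of the backward mapping, which appears to reproduce the output shown in the question. The names base_month, vals, out and result are illustrative, and month.abb is R's built-in vector of English month abbreviations.

```r
# Rebuild the example data from the question (month.m0 = 1:3, month.m1 = 4:6, ...)
df <- data.frame(date = c("2017-01-01", "2016-12-01", "2016-10-01"),
                 stringsAsFactors = FALSE)
for (k in 0:11) df[[paste0("month.m", k)]] <- (3 * k + 1):(3 * k + 3)

# Calendar month (1..12) that month.m0 refers to in each row
base_month <- as.integer(format(as.Date(df$date), "%m"))

# Values as a matrix: one row per observation, one column per step back
vals <- as.matrix(df[paste0("month.m", 0:11)])

# Empty result with one column per calendar month
out <- matrix(NA_real_, nrow(df), 12, dimnames = list(NULL, month.abb))

for (k in 0:11) {
  # Month lying k months before the base month, wrapped into 1..12
  m <- (base_month - k - 1) %% 12 + 1
  out[cbind(seq_len(nrow(df)), m)] <- vals[, k + 1]
}

result <- cbind(df["date"], out)
print(result)
```

The modulo step replaces the ifelse() wrap-around: subtracting 1 before %% and adding 1 afterwards keeps the result in 1..12 instead of 0..11.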
Your explanation is not very clear to me, so my output is not exactly yours. But this is how I would do it:
library(dplyr)
library(tidyr)
library(stringr) # provides str_replace(), used below
df %>%
# First create a new variable containing the month as a numeric between 1-12
mutate(month = strftime(date, "%m")) %>%
# Make the data tidy: a new column col contains
# month.m0, month.m1, month.m2, ... and a column val contains
# the values
gather(col, val, -date, -month) %>%
# remove "month.m" so the col column has numeric values
mutate_at("col", str_replace, pattern = "month.m", replacement = "") %>%
mutate_at(c("month", "col"), as.numeric) %>%
# Compute the difference between the month column and the col column
mutate(col = abs((col - month + 1) %% 12)) %>%
# Sort the dataframe according to the new col column
arrange(month, col) %>%
# Add month.m to the col column so we redefine the names of the columns
mutate(col = paste0("month.m", col), month = NULL) %>%
# Untidy the data frame
spread(col, val)
This is my first post, so I apologize if I am not specific enough.
I have a sequence of months and a data frame with approximately 100 rows, each with a unique identifier. Each identifier is associated with a start up date. I am trying to calculate the number of months since start up for each of these unique identifiers at each month in the sequence. I have tried unsuccessfully to write a for loop to accomplish this.
Example Below:
# Build Example Data Frame #
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
# Create Sequence of Months, Format to match Data Frame, Reverse for the For Loop #
base.date <- as.Date(c("2015-11-1"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m")
Months.2 <- rev(Months.1)
# Create For Loop #
require(zoo)
for(i in seq_along(Months.2))
{
for(j in 1:length(summary(as.factor(df_example$ID), maxsum = 100000)))
{
Active.Months <- 12 * as.numeric((as.yearmon(Months.2 - i) - as.yearmon(df_example$StartUp)))
}
}
The idea behind the for loop was that for every record in the Months.2 sequence, there would be a calculation of the number of months to that record (month date) from the Start Up month for each of the unique identifiers. However, this has been kicking back the error:
Error in Months.2 - i : non-numeric argument to binary operator
I am not sure what the solution is, or if I am using the for loop properly for this.
Thanks in advance for any help with solving this problem!
Edit: This is what I am hoping my expected outcome would be (this is just a sample as there are more months in the sequence):
ID Start Up Month 2015-11 2015-12 2016-01 2016-02 2016-03
1 A 2013-10 25 26 27 28 29
2 B 2013-10 25 26 27 28 29
3 C 2014-04 19 20 21 22 23
4 D 2015-06 5 6 7 8 9
5 E 2014-01 22 23 24 25 26
One way to do it is to first use as.yearmon from the zoo package to convert the dates. Then we simply iterate over the months and take the difference from the start-up dates in df_example:
library(zoo)
df_example$StartUp <- as.Date(as.yearmon(df_example$StartUp))
Months.2 <- as.Date(as.yearmon(Months.2))
df <- as.data.frame(sapply(Months.2, function(i)
round(abs(difftime(df_example$StartUp, i, units = 'days')/30))))
names(df) <- Months.2
cbind(df_example, df)
# ID StartUp 2016-07 2016-06 2016-05 2016-04 2016-03 2016-02 2016-01 2015-12 2015-11
#1 A 2013-10 33 32 31 30 29 28 27 26 25
#2 B 2013-10 33 32 31 30 29 28 27 26 25
#3 C 2014-04 27 26 25 24 23 22 21 20 19
#4 D 2015-06 13 12 11 10 9 8 7 6 5
#5 E 2014-01 30 29 28 27 26 25 24 23 22
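Dividing a day difference by 30 is an approximation. As a hedged alternative, here is a base R sketch that counts whole calendar months exactly by turning each "YYYY-MM" string into a running month index. The helper month_index and the objects startup, months_seq and elapsed are hypothetical names, and only the first five months of the sequence are used.

```r
# Start-up months per ID, taken from the question
startup <- c(A = "2013-10", B = "2013-10", C = "2014-04",
             D = "2015-06", E = "2014-01")

# First five months of the sequence from the question
months_seq <- c("2015-11", "2015-12", "2016-01", "2016-02", "2016-03")

# Convert "YYYY-MM" to a running month count: year * 12 + month
month_index <- function(ym) {
  parts <- strsplit(ym, "-", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 12L + as.integer(p[2]),
         integer(1))
}

# Rows = IDs, columns = months; each cell is months elapsed since start-up
elapsed <- outer(month_index(startup), month_index(months_seq),
                 function(s, m) m - s)
dimnames(elapsed) <- list(names(startup), months_seq)
print(elapsed)
```

Because the arithmetic is done on integer month indices, no rounding is involved and months of different lengths do not matter.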
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
y_example <- paste(y_example,"-01",sep = "")
# paste on the "-01" so that the later date functions work.
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
base.date <- as.Date(c("2015-11-01"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m-%d")
Months.2 <- rev(Months.1)
monnb <- function(d) { lt <- as.POSIXlt(as.Date(d, origin="1900-01-01")); lt$year*12 + lt$mon }
mondf <- function(d1, d2) {monnb(d2) - monnb(d1)}
NumofMonths <- abs(mondf(df_example[,2],Sys.Date()))
n = max(NumofMonths)
# sequence along the number of months and get the month count.
monthcount <- (t(sapply(NumofMonths, function(x) pmax(seq((x-n+1),x, +1), 0) )))
monthcount <- data.frame(monthcount[,-(1:24)])
names(monthcount) <- Months.1
finalDataFrame <- cbind.data.frame(df_example,monthcount)
Here is your final data frame which is the desired output you indicated:
ID StartUp 2015-11-01 2015-12-01 2016-01-01 2016-02-01 2016-03-01 2016-04-01 2016-05-01 2016-06-01 2016-07-01
1 A 2013-10-01 25 26 27 28 29 30 31 32 33
2 B 2013-10-01 25 26 27 28 29 30 31 32 33
3 C 2014-04-01 19 20 21 22 23 24 25 26 27
4 D 2015-06-01 5 6 7 8 9 10 11 12 13
5 E 2014-01-01 22 23 24 25 26 27 28 29 30
The overall idea is that we calculate the number of months and use the sequence function to create a counter of the number of months until we get the current month.
I have this data table:
row V1 velocity
1 2009-04-06 95.9230769230769
2 2009-04-11 95.0985074626866
3 2009-04-17 95.8064935064935
4 2009-04-22 94.6357142857143
5 2009-04-27 95.3626865671642
6 2009-05-03 95.9101265822785
7 2009-05-08 95.826582278481
8 2009-05-14 94.5126582278481
9 2009-05-20 95.8371428571429
10 2009-05-25 94.6981481481481
11 2009-05-30 96.397619047619
12 2009-06-05 94.8132530120482
13 2009-06-10 96.4558139534884
14 2009-06-16 94.9627906976744
15 2009-06-21 95.2666666666667
16 2009-06-26 95.2919540229885
17 2009-07-01 95.4333333333333
18 2009-07-07 95.3375
19 2009-07-12 95.0534246575343
20 2009-07-18 96.0277777777778
21 2009-07-24 95.6885057471264
22 2009-07-29 93.9375
23 2009-08-03 95.2776315789474
24 2009-08-08 94.9089285714286
25 2009-08-13 96.8906976744186
26 2009-08-19 95.4487804878049
27 2009-08-24 97.2444444444444
28 2009-08-30 95.1174418604651
I want to write R code to find the mean value of velocity by month (the data cover April, May, June, July, and August).
What could I do?
Or just:
tapply(df$velocity, months(as.Date(df$V1)), mean)
April August Juli Juni Mai
95.36530 95.81465 95.24634 95.35810 95.53038
Here's how I would do it
Use lubridate to create a month variable to group by in dplyr and then get means.
library(lubridate)
library(dplyr)
df %>% group_by(month = month(V1)) %>% summarize(mean = mean(velocity))
month mean
1 4 95.36530
2 5 95.53038
3 6 95.35810
4 7 95.24634
5 8 95.81465
If you add label = TRUE you get this:
df %>% group_by(month = month(V1, label = TRUE)) %>% summarize(mean = mean(velocity))
month mean
1 Apr 95.36530
2 May 95.53038
3 Jun 95.35810
4 Jul 95.24634
5 Aug 95.81465
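If the data later span more than one year, or the session locale is not English, grouping on a "YYYY-MM" key may be more robust than month names. A minimal base R sketch, using only the April and May rows from the question; the column name ym and the object monthly are illustrative.

```r
# April and May rows from the question
df <- data.frame(
  V1 = c("2009-04-06", "2009-04-11", "2009-04-17", "2009-04-22", "2009-04-27",
         "2009-05-03", "2009-05-08", "2009-05-14", "2009-05-20", "2009-05-25",
         "2009-05-30"),
  velocity = c(95.9230769230769, 95.0985074626866, 95.8064935064935,
               94.6357142857143, 95.3626865671642, 95.9101265822785,
               95.826582278481, 94.5126582278481, 95.8371428571429,
               94.6981481481481, 96.397619047619),
  stringsAsFactors = FALSE
)

# Year-month key: numeric format, so it sorts chronologically and
# does not depend on the locale's month names
df$ym <- format(as.Date(df$V1), "%Y-%m")

# Mean velocity per year-month
monthly <- aggregate(velocity ~ ym, data = df, FUN = mean)
print(monthly)
```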
My data looks like this:
date rmean
1/2/2004 6
1/5/2004 30
1/6/2004 27
1/7/2004 20
1/8/2004 10
1/9/2004 22
1/12/2004 21
1/13/2004 18
1/14/2004 19
1/15/2004 7
1/16/2004 9
1/19/2004 11
1/20/2004 18
1/21/2004 26
1/26/2004 8
1/27/2004 16
1/28/2004 19
1/29/2004 4
1/30/2004 1
2/3/2004 11
2/4/2004 9
2/5/2004 26
2/6/2004 16
2/9/2004 25
2/10/2004 2
2/11/2004 6
2/12/2004 2
2/13/2004 25
2/16/2004 17
2/17/2004 21
2/18/2004 26
2/19/2004 6
2/20/2004 14
2/23/2004 4
2/24/2004 7
2/25/2004 19
2/26/2004 10
2/27/2004 23
I want to find the rmean for the date 20 days after the 15th of each month.
Note: if there is no rmean value for that date in my data (some days are skipped), I want it to find the rmean of the closest day of the
Something like this, but for (15th of each month + 20 days) instead of the 15th:
dt <- Dataframe[, list(day15=abs(mday(date)-15) == min(abs(mday(date)-15)),
date, rmean), by=list(year(date), month(date))]
dt[day15==TRUE]
Finale = dt[day15==TRUE , .SD[1,] ,by=list(month, year)]
The expected output for my example above:
date rmean
2/4/2004 9
Here's one way to do it with base R.
First, some dummy data:
d <- data.frame(date=as.Date('1/1/2004', '%d/%m/%Y') + sort(sample(364, 200)),
x=runif(200))
head(d)
# date x
# 1 2004-01-02 0.29818227
# 2 2004-01-03 0.12543617
# 3 2004-01-04 0.78145310
# 4 2004-01-05 0.30456904
# 5 2004-01-06 0.45228066
# 6 2004-01-07 0.07511554
Calculate arrival dates within the date range of the data:
arrival <-
seq(as.Date(sprintf('15/%s', format(min(d$date), '%m/%Y')), '%d/%m/%Y'),
as.Date(sprintf('15/%s', format(max(d$date), '%m/%Y')), '%d/%m/%Y'),
by='month') + 20
arrival
# [1] "2004-02-04" "2004-03-06" "2004-04-04" "2004-05-05" "2004-06-04" "2004-07-05"
# [7] "2004-08-04" "2004-09-04" "2004-10-05" "2004-11-04" "2004-12-05" "2005-01-04"
Find the closest date to each of the arrival dates (taking that with max x value if there are two closest dates), and return a data.frame with the "arrival" dates, the closest dates to each of these arrival dates, and the corresponding values of x.
cbind(arrival, do.call(rbind, lapply(arrival, function(x) {
closest <- which(abs(d$date - x) == min(abs(d$date - x)))
d[closest[which.max(d$x[closest])], ]
})))
# arrival date x
# 25 2004-02-04 2004-02-03 0.78836413
# 45 2004-03-06 2004-03-06 0.61214949
# 63 2004-04-04 2004-04-04 0.49171847
# 79 2004-05-05 2004-05-05 0.02989788
# 93 2004-06-04 2004-06-04 0.25923715
# 109 2004-07-05 2004-07-05 0.90330331
# 120 2004-08-04 2004-08-04 0.48133237
# 139 2004-09-04 2004-09-03 0.12280267
# 151 2004-10-05 2004-10-03 0.46888891
# 169 2004-11-04 2004-11-04 0.40397949
# 186 2004-12-05 2004-12-04 0.18685615
# 200 2005-01-04 2004-12-30 0.97462347
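For comparison, here is the same nearest-date idea applied to a few rows around the target date from the question's own data. This is a sketch under the assumption that the dates are in month/day/year order; the names target and hit are illustrative, and which.min() returns the first match if two dates are equally close.

```r
# A few rows from the question's data around early February 2004
d <- data.frame(
  date  = as.Date(c("1/29/2004", "1/30/2004", "2/3/2004", "2/4/2004",
                    "2/5/2004"), format = "%m/%d/%Y"),
  rmean = c(4, 1, 11, 9, 26)
)

# 15th of January plus 20 days
target <- as.Date("2004-01-15") + 20

# Row whose date is closest to the target (first one on ties)
hit <- d[which.min(abs(as.numeric(d$date - target))), ]
print(hit)
```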