I have a datatable with three date columns x, y and z and I am trying to create a new column (new_col) that is the middle date of the three dates in each row once ranked from earliest to latest, i.e., I want the date between the min and max date – please see table below:
x              y              z              new_col
1st Jan 2005   4th May 1998   2nd Mar 2009   1st Jan 2005
9th May 2010   14th Feb 2003  9th Jan 2008   9th Jan 2008
7th Sept 2002  8th Dec 2010   23rd May 2012  8th Dec 2010
So, for rows 1, 2, and 3 I would like the dates from columns x, z, and y, respectively. How can I go about this in R? I have used pmin and pmax, but I can't isolate the date in the middle.
Thanks in advance!
The approach below coerces the character date strings to type Date (which is stored numerically, as character dates cannot be ordered chronologically or used in arithmetic), finds the position of the "middle" date in each row, and returns the corresponding character string, which eventually becomes new_col.
This can be implemented using apply() on each row using an appropriate function:
df$new_col <- apply(df, 1L, function(x) x[order(lubridate::dmy(x))][2L])
df
x y z new_col
1 1st Jan 2005 4th May 1998 2nd Mar 2009 1st Jan 2005
2 9th May 2010 14th Feb 2003 9th Jan 2008 9th Jan 2008
3 7th Sept 2002 8th Dec 2010 23rd May 2012 8th Dec 2010
Note
This returns the expected result. new_col is a character date string.
However, if the OP intends to continue working with type Date, e.g. to do more date arithmetic, I recommend following Ben's example: coerce the whole data.frame to type Date and stick to it.
First make sure all your dates are "Date" type, you can use dmy from lubridate for this (assumes your data frame is called df):
library(lubridate)
df[] <- lapply(df, dmy)
Next, sort each row in chronological order, and take the middle column (column 2) to be the new_col:
df$new_col <- as.Date(t(apply(df, 1, sort))[,2])
Finally, if you want the result to be displayed in the same text format (e.g., "1st Jan 2005" instead of "2005-01-01"), you can use a custom function based on this answer:
library(dplyr)
date_to_text <- function(dates){
dayy <- day(dates)
suff <- case_when(dayy %in% c(11,12,13) ~ "th",
dayy %% 10 == 1 ~ 'st',
dayy %% 10 == 2 ~ 'nd',
dayy %% 10 == 3 ~'rd',
TRUE ~ "th")
paste0(dayy, suff, " ", format(dates, "%b %Y"))
}
df[] <- lapply(df, date_to_text)
Output
x y z new_col
1 1st Jan 2005 4th May 1998 2nd Mar 2009 1st Jan 2005
2 9th May 2010 14th Feb 2003 9th Jan 2008 9th Jan 2008
3 7th Sep 2002 8th Dec 2010 23rd May 2012 8th Dec 2010
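Since the question mentions pmin and pmax, here is a minimal base R sketch of a vectorised alternative that builds on them. It assumes the three columns are already of class Date; df_dates and its ISO-formatted values are hypothetical stand-ins for the question's data. The idea is that the sum of the three dates minus the earliest and the latest leaves the middle one.

```r
# Hypothetical example data, already of class Date
df_dates <- data.frame(
  x = as.Date(c("2005-01-01", "2010-05-09", "2002-09-07")),
  y = as.Date(c("1998-05-04", "2003-02-14", "2010-12-08")),
  z = as.Date(c("2009-03-02", "2008-01-09", "2012-05-23"))
)

# Row-wise earliest and latest date
lo <- with(df_dates, pmin(x, y, z))
hi <- with(df_dates, pmax(x, y, z))

# Sum of the three day counts; removing min and max leaves the middle value
total <- with(df_dates, as.numeric(x) + as.numeric(y) + as.numeric(z))
df_dates$new_col <- as.Date(total - as.numeric(lo) - as.numeric(hi),
                            origin = "1970-01-01")
df_dates
```

This avoids apply() altogether, so it should scale better to many rows, and it still gives a sensible answer when two of the three dates are equal.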
Data
df <- structure(list(x = c("1st Jan 2005", "9th May 2010", "7th Sept 2002"
), y = c("4th May 1998", "14th Feb 2003", "8th Dec 2010"), z = c("2nd Mar 2009",
"9th Jan 2008", "23rd May 2012")), class = "data.frame", row.names = c(NA,
-3L))
I have a data frame ordered by month and year. I want to select only the integer number of years i.e. if the data start in July 2002 and ends in September 2010 then select only data from July 2002 to June 2010.
And if the data starts in September 1992 and ends in March 2000 then select only data from September 1992 to August 1999. Regardless of the missing months in between.
The data are available for download (link omitted).
The code
mydata <- read.csv("E:/mydata.csv", stringsAsFactors=TRUE)
This is the manual selection:
selected.data <- mydata[1:73,] # July 2002 to June 2010
How can I achieve that in code?
Here is a base R solution that reproduces your manual subsetting:
mydata <- read.csv("D:/mydata.csv", stringsAsFactors=F)
lookup <-
c(
January = 1,
February = 2,
March = 3,
April = 4,
May = 5,
June = 6,
July = 7,
August = 8,
September = 9,
October = 10,
November = 11,
December = 12
)
mydata$Month <- unlist(lapply(mydata$Month, function(x) lookup[match(x, names(lookup))]))
first.month <- mydata$Month[1]
last.year <- max(mydata$Year)
mydata[1:which(mydata$Month==(first.month -1)&mydata$Year==last.year),]
Basically, I convert the month names to numbers and then, within the last year of the data frame, find the month that precedes the first month appearing in the data frame. The subset runs from row 1 up to that row.
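The lookup above relies on the month before the first month being present in the last year, and it has no match when the data start in January (first.month - 1 is then 0). A minimal sketch of a variant that avoids both issues by working with an absolute month index is shown below. The sample data are hypothetical, and the rows are assumed to be in chronological order as in the question.

```r
# Hypothetical sample with gaps: starts July 2002, ends September 2003
mydata <- data.frame(
  Month = c("July", "October", "December", "February", "June", "July", "September"),
  Year  = c(2002, 2002, 2002, 2003, 2003, 2003, 2003),
  var   = c(-91.2, -91.1, -91.0, -91.1, -90.9, -91.0, -90.8)
)

# Absolute month index: one integer per calendar month
idx <- mydata$Year * 12 + match(mydata$Month, month.name)

# Number of complete 12-month blocks between the first row and the latest month
n_years <- (max(idx) - idx[1] + 1) %/% 12

# Keep only rows that fall inside those complete blocks
result <- mydata[idx < idx[1] + 12 * n_years, ]
result
```

Here the data cover 15 calendar months, so one complete year is kept: July 2002 to June 2003.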
Here's a base R one-liner :
result <- mydata[seq_len(with(mydata, which(Month == month.name[match(Month[1],
month.name) - 1] & Year == max(Year)))), ]
head(result)
# Month Year var
#1 July 2002 -91.22997
#2 October 2002 -91.19007
#3 December 2002 -91.05395
#4 February 2003 -91.16958
#5 March 2003 -91.17881
#6 April 2003 -91.15110
tail(result)
# Month Year var
#68 December 2009 -90.92610
#69 January 2010 -91.07379
#70 February 2010 -91.12460
#71 March 2010 -91.10288
#72 April 2010 -91.06040
#73 June 2010 -90.94212
Dears
I am trying to break down quarterly data into monthly data using R. I am not concerned about dates, as I can generate a vector of months corresponding to the values without problems. The problem is with the vector of values and imputing the missing data. See the example below:
Quarter Value
2010-Q1 10
2010-Q2 15
2010-Q3 18
2010-Q4 12
The new data set should look like the following
Month Value
2010-3 10
2010-4 11.67
2010-5 13.33
2010-6 15
2010-7 16
2010-8 17
2010-9 18
2010-10 16
2010-11 14
2010-12 12
Now, the months within each quarter are filled using the following formula
The first month of the quarter[i] = The previous quarter value [i-1] + ((The difference between the quarter [i] and [i-1])/3)
The second month of the quarter[i] = The previous quarter value [i-1] + 2*((The difference between the quarter [i] and [i-1])/3)
For example:
2020-Q1 = 10
2020-Q2 = 15
Difference/3 = 5/3
2020-April = 10 + diff
2020-May = 10 + 2*diff
2020-June = 15 (end of the quarter stays the same) or can be calculated as 10 + 3*diff
I am wondering how to generate a new variable that can break down the values as mentioned above.
Thanks
1) Convert the input to a zoo series z with yearqtr index (which directly represents year and quarter without month or day) and then pad out with NAs and apply na.approx to fill them in linearly giving Value. Assuming that the series is regularly spaced we can just convert the first index value to yearmon (which directly represents a year and month without day) using a frequency of 12 months per year. Finally, either leave it as Value or else use the last line to convert it back to data frame DF2. Another possibility would be to use as.ts(Value) to convert it to a ts series.
Note that yearmon class displays as shown below but represents year and month internally as year plus a fraction equal to 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec so as.integer(time(Value)) will give the year and cycle(time(Value)) will give the month number (Jan = 1, ..., Dec = 12).
library(zoo)
z <- read.zoo(DF, FUN = function(x) as.yearqtr(x, "%Y-Q%q"))
Value <- zooreg(na.approx(c(t(cbind(z, NA, NA)))),
start = as.yearmon(start(z)), freq = 12)
DF2 <- fortify.zoo(Value) # optional
giving:
> DF2
Index Value
1 Jan 2010 10.00000
2 Feb 2010 11.66667
3 Mar 2010 13.33333
4 Apr 2010 15.00000
5 May 2010 16.00000
6 Jun 2010 17.00000
7 Jul 2010 18.00000
8 Aug 2010 16.00000
9 Sep 2010 14.00000
10 Oct 2010 12.00000
Graphically it looks like this:
plot(Value, type = "o")
2) A second method starting with z from (1) is to first create the output yearmon time sequence tt, convert the time index of z to yearmon giving z.ym and then merge them generating NA's and finally apply na.approx to fill them in.
tt <- seq(as.yearmon(start(z)), as.yearmon(end(z)), 1/12)
z.ym <- aggregate(z, as.yearmon, c)
Value <- na.approx(merge(z.ym, zoo(, tt)))
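If only the vector of values is needed, as the question suggests, the same linear fill can be sketched in base R with approx(). This is a minimal sketch rather than a replacement for the zoo solutions: it ignores the time index entirely and assumes the quarterly values are regularly spaced and sorted. The name monthly is introduced here for illustration.

```r
# Input from the question
DF <- data.frame(Quarter = c("2010-Q1", "2010-Q2", "2010-Q3", "2010-Q4"),
                 Value   = c(10, 15, 18, 12))

n <- nrow(DF)

# Put the quarterly values at positions 1, 4, 7, ... and
# interpolate linearly at every integer position in between
monthly <- approx(x    = seq(1, by = 3, length.out = n),
                  y    = DF$Value,
                  xout = seq_len(3 * (n - 1) + 1))$y
round(monthly, 2)
```

Each step between two quarterly values is one third of their difference, which is the formula given in the question.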
Note
The input in reproducible form:
Lines <- "Quarter Value
2010-Q1 10
2010-Q2 15
2010-Q3 18
2010-Q4 12"
DF <- read.table(text = Lines, header = TRUE)
I have a data frame like the one below, spanning 10 years. Because it is Chinese market data and China observes lunar holidays, the holidays fall on different dates of the Western calendar each year.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the least number of trading days, and most importantly, what number is that.
There are no repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are no repeated days, you can count days per month and year by:
library(data.table)
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
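For comparison, the same count can be sketched in base R with table(). The six dates below are a hypothetical stand-in for the real data, and the column name count_day simply mirrors the data.table code above.

```r
# Hypothetical trading dates (one row per trading day, no duplicates)
dt <- data.frame(date = as.Date(c("1995-01-03", "1995-01-04", "1995-01-05",
                                  "1995-02-06", "1995-02-07", "1995-03-01")))

# Count rows per year-month
counts <- as.data.frame(table(ym = format(dt$date, "%Y-%m")),
                        responseName = "count_day",
                        stringsAsFactors = FALSE)

# Month(s) with the fewest trading days, and how many
least <- counts[counts$count_day == min(counts$count_day), ]
least
```

Note that with either approach a first or last month that is only partly covered by the data will look artificially short, so it may be worth dropping incomplete months first.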
The chron and bizdays packages deal with business days but neither actually contains a usable calendar of holidays limiting their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20
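The per-month count can also be sketched without chron. The helper below, count_busdays, is a hypothetical name, and the holidays vector is an assumption with a single illustrative date; a real analysis would need the full exchange holiday calendar.

```r
# Hypothetical holiday calendar (only one date, for illustration)
holidays <- as.Date("2001-01-01")

# Count weekdays in a given month that are not in the holiday vector
count_busdays <- function(year, month, holidays) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # first day of the next month minus one day = last day of this month
  last  <- seq(first, by = "month", length.out = 2)[2] - 1
  days  <- seq(first, last, by = "day")
  # POSIXlt weekday: 0 = Sunday, 6 = Saturday
  is_weekend <- as.POSIXlt(days)$wday %in% c(0, 6)
  sum(!is_weekend & !(days %in% holidays))
}

count_busdays(2001, 1, holidays)
count_busdays(2001, 2, holidays)
```

February 2001 gives 20, in line with the chron output above, because no holidays are defined for that month in this sketch.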
I have a data frame in the following format that represents a large data set that I have:
F.names<-c('M','M','M','A','A')
L.names<-c('Ab','Ab','Ab','Ac','Ac')
year<-c('August 2015','September 2014','September 2016', 'August 2014','September 2013')
grade<-c(NA,'9th Grade','11th Grade',NA,'11th grade')
df.have<-data.frame(F.names,L.names,year,grade)
F.names L.names year grade
1 M Ab August 2015 <NA>
2 M Ab September 2014 9th Grade
3 M Ab September 2016 11th Grade
4 A Ac August 2014 <NA>
5 A Ac September 2013 11th grade
The year column is in factor format in the original data set and there are several missing values for grade. Basically, I want to fill in the missing grade values based on the year column so that it looks like the following.
F.names L.names year grade
1 M Ab August 2015 10th Grade
2 M Ab September 2014 9th Grade
3 M Ab September 2016 11th Grade
4 A Ac August 2014 12th Grade
5 A Ac September 2013 11th grade
I was thinking that my first step would be to convert the year column, which is in factor format, to a date format, then arrange the rows in order and use something like fill from tidyr to fill the missing values. How should I go about doing this, or is there a better way to approach this?
F.names<-c('M','M','M','A','A')
L.names<-c('Ab','Ab','Ab','Ac','Ac')
year<-c('August 2015','September 2014','September 2016', 'August 2014','September 2013')
grade<-c(NA,'9th Grade','11th Grade',NA,'11th grade')
df.have<-data.frame(F.names,L.names,year,grade)
library(tidyverse)
df.have %>%
separate(year, c("m","y"), convert = T, remove = F) %>%
separate(grade, c("num","type"), sep="th", convert = T) %>%
arrange(F.names, y) %>%
group_by(F.names) %>%
mutate(num = ifelse(is.na(num), lag(num) + 1, num),
type = "grade") %>%
ungroup() %>%
unite(grade, num, type, sep="th ") %>%
select(-m, -y)
# F.names L.names year grade
# 1 A Ac September 2013 11th grade
# 2 A Ac August 2014 12th grade
# 3 M Ab September 2014 9th grade
# 4 M Ab August 2015 10th grade
# 5 M Ab September 2016 11th grade
This solution assumes that you won't have 2 or more consecutive NAs for a given F.names value.
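The pipeline sorts on the year number only. If two records for the same person could fall in the same calendar year, the month matters as well. A minimal base R sketch for turning the "Month Year" factor into a proper Date is shown below; the names parts, month_num, year_num and when are introduced here for illustration, and the first of the month is an arbitrary choice of day.

```r
# The year column from the question, stored as a factor
year <- factor(c("August 2015", "September 2014", "September 2016",
                 "August 2014", "September 2013"))

# Split "Month Year" into its two parts
parts <- strsplit(as.character(year), " ", fixed = TRUE)

# Look up the month number from the English month names (locale independent)
month_num <- match(vapply(parts, `[`, character(1), 1), month.name)
year_num  <- as.integer(vapply(parts, `[`, character(1), 2))

# Build a Date on the first day of each month
when <- as.Date(sprintf("%d-%02d-01", year_num, month_num))

# Chronological order of the rows
order(when)
```

The resulting vector can be used as a sort key, for example in arrange() or order(), before filling the missing grades.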
I have not found a solution for this, and I think it should be very simple but now I can't think right.
I have two data frames, monthly traffic volume averages, and yearly traffic volume averages. I need to divide yearly averages by monthly averages.
ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
So the first 12 rows of the TPDM table should divide the first row of the TPDA table and create a new data frame which should contain monthly factors.
Something like:
ano mes dias FA_Au
2012 Ene 31 4271.096/4288.323
2012 Feb 29 4271.096/3268.862
(Don't need to show the computation, just the result)
I am sure that selecting the data by year would do that but haven't found the right way to do it.
Merge by year and find columns to divide by position
As already mentioned by zx8754 this can be done by merging on year and dividing the corresponding columns in base R:
merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")
FA <- cbind(merged[, 1:3], merged[, 10:15]/merged[, 4:9])
# rename columns
names(FA) <- sub("TPDA_", "FA_", names(FA))
FA
ano mes dias FA_Au FA_Bu FA_CU FA_CAI FA_CAII FA_TOTAL
1 2012 Ene 31 0.9959828 0.9951086 1.0779532 1.1044977 1.1757530 1.705872
2 2012 Feb 29 1.3066003 1.0412831 1.0360781 0.9862042 1.1050245 1.663675
3 2012 Mar 31 1.1644517 0.9884285 0.8829349 0.9583809 1.0566337 1.546941
4 2012 Abr 30 0.9148231 0.9540314 0.9073122 1.0383376 1.0328838 1.440892
5 2012 May 31 1.3353096 1.0309085 0.9963600 0.9507334 0.8084003 1.576802
6 2012 Jun 30 1.1834349 1.0191696 0.9103332 0.9642064 0.9149720 1.534471
Caveat:
This approach works as long as the positions, i.e., column numbers, of the corresponding columns are known. With the given datasets, the columns are ordered in the same way. Therefore, only an offset has to be considered to match corresponding columns.
Merge by year and find columns to divide by name
If, for some reason, the positions are not known in advance we can find corresponding columns by matching the column names.
For this, both datasets are reshaped from wide to long format. In long format, the column names (now stored in a column called variable) are treated as data. We can then join monthly and annual values on year and column name, divide annual values by the corresponding monthly values, and finally reshape back to wide format:
library(data.table)
# reshape and prepare monthly data
longM <- melt(setDT(TPDM), id.vars = 1:3)
longM[, variable := stringr::str_replace(variable, "_TPDM", "")]
longM[, mes := forcats::fct_inorder(mes)]
# reshape and prepare annual data
longA <- melt(setDT(TPDA), id.vars = 1)
longA[, variable := stringr::str_replace(variable, "TPDA_", "")]
setnames(longA, "anosDB", "ano")
# join
long_FA <- longA[longM, on = .(ano, variable),
.(ano, mes, dias, variable, FA = value/i.value)]
# reshape back to wide format
dcast(long_FA, ano + mes +dias ~ paste0("FA_", variable), value.var = "FA")
ano mes dias FA_Au FA_Bu FA_CAI FA_CAII FA_CU FA_TOTAL
1: 2012 Ene 31 0.9959828 0.9951086 1.1044977 1.1757530 1.0779532 1.705872
2: 2012 Feb 29 1.3066003 1.0412831 0.9862042 1.1050245 1.0360781 1.663675
3: 2012 Mar 31 1.1644517 0.9884285 0.9583809 1.0566337 0.8829349 1.546941
4: 2012 Abr 30 0.9148231 0.9540314 1.0383376 1.0328838 0.9073122 1.440892
5: 2012 May 31 1.3353096 1.0309085 0.9507334 0.8084003 0.9963600 1.576802
6: 2012 Jun 30 1.1834349 1.0191696 0.9642064 0.9149720 0.9103332 1.534471
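Name-based matching can also be sketched in base R without reshaping, by deriving the annual column names from the monthly ones. The cut-down data frames TPDM_small and TPDA_small below are hypothetical excerpts of the question's data, with the annual columns deliberately in a different order. The sketch assumes every monthly column has an annual counterpart named "TPDA_" plus the monthly name without its "_TPDM" suffix.

```r
# Hypothetical excerpts of the monthly and annual tables
TPDM_small <- data.frame(
  ano = c(2012, 2012), mes = c("Ene", "Feb"), dias = c(31, 29),
  Au_TPDM = c(4288.323, 3268.862), TOTAL = c(6112.258, 6267.286)
)
TPDA_small <- data.frame(
  anosDB = c(2012, 2013),
  TPDA_TOTAL = c(10426.73, 11025.36), TPDA_Au = c(4271.096, 4685.079)
)

id_cols      <- c("ano", "mes", "dias")
monthly_cols <- setdiff(names(TPDM_small), id_cols)
keys         <- sub("_TPDM$", "", monthly_cols)   # "Au", "TOTAL"
annual_cols  <- paste0("TPDA_", keys)             # matching annual columns

merged <- merge(TPDM_small, TPDA_small, by.x = "ano", by.y = "anosDB")

# Divide annual by monthly values, column by column, matched by name
FA <- cbind(merged[, id_cols], merged[, annual_cols] / merged[, monthly_cols])
names(FA) <- c(id_cols, paste0("FA_", keys))
FA
```

Because the columns are selected by name, the result does not depend on the column order in either table.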
Data
TPDM <- read.table(text = "
i ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
", header = TRUE)[, -1L]
TPDA <- read.table(text = "
i anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
", header = TRUE)[, -1L]