I have a data frame ordered by month and year. I want to select only a whole number of years, i.e. if the data start in July 2002 and end in September 2010, then select only the data from July 2002 to June 2010.
And if the data start in September 1992 and end in March 2000, then select only the data from September 1992 to August 1999, regardless of any missing months in between.
The data can be downloaded from the following link:
enter link description here
The code
mydata <- read.csv("E:/mydata.csv", stringsAsFactors=TRUE)
This is the manual selection:
selected.data <- mydata[1:73,] # July 2002 to June 2010
How can I achieve this with code?
Here is a base R solution that reproduces your manual subsetting:
mydata <- read.csv("D:/mydata.csv", stringsAsFactors = FALSE)
lookup <-
c(
January = 1,
February = 2,
March = 3,
April = 4,
May = 5,
June = 6,
July = 7,
August = 8,
September = 9,
October = 10,
November = 11,
December = 12
)
# translate month names to month numbers via the named lookup vector
mydata$Month <- unname(lookup[as.character(mydata$Month)])
first.month <- mydata$Month[1]
last.year <- max(mydata$Year)
mydata[1:which(mydata$Month==(first.month -1)&mydata$Year==last.year),]
Basically, I convert the month names to numbers and then find, in the last year of the data frame, the month preceding the first month that appears in the data frame.
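Note that this lookup relies on the month before the starting month actually being present in the last year, and it has no match when the data start in January (first.month - 1 is then 0). A minimal sketch of a more general variant follows, assuming the columns are named Month (full English month names) and Year as in the question. The function name trim_to_whole_years and the demo data are made up for illustration and are not part of the original answer:

```r
# Hypothetical helper: keep only rows that fall inside a whole number of
# years counted from the first row. Assumes df is ordered by time and has
# columns Month (e.g. "July") and Year (e.g. 2002).
trim_to_whole_years <- function(df) {
  m   <- match(df$Month, month.name)        # month name -> 1..12
  idx <- df$Year * 12 + (m - 1)             # running month index
  span    <- idx[nrow(df)] - idx[1] + 1     # months covered, inclusive
  n_years <- span %/% 12                    # complete years in that span
  df[idx < idx[1] + 12 * n_years, , drop = FALSE]
}

# Made-up demo data with gaps: starts July 2002, ends September 2010
demo <- data.frame(
  Month = c("July", "October", "January", "June", "July", "September"),
  Year  = c(2002, 2002, 2010, 2010, 2010, 2010)
)
res <- trim_to_whole_years(demo)
res
```

Because the cut-off is computed from a running month index rather than from a matching row, this sketch should also work when the final month of the last whole year is itself missing from the data.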
Here's a base R one-liner:
result <- mydata[seq_len(with(mydata, which(Month == month.name[match(Month[1],
month.name) - 1] & Year == max(Year)))), ]
head(result)
# Month Year var
#1 July 2002 -91.22997
#2 October 2002 -91.19007
#3 December 2002 -91.05395
#4 February 2003 -91.16958
#5 March 2003 -91.17881
#6 April 2003 -91.15110
tail(result)
# Month Year var
#68 December 2009 -90.92610
#69 January 2010 -91.07379
#70 February 2010 -91.12460
#71 March 2010 -91.10288
#72 April 2010 -91.06040
#73 June 2010 -90.94212
I have a datatable with three date columns x, y and z and I am trying to create a new column (new_col) that is the middle date of the three dates in each row once ranked from earliest to latest, i.e., I want the date between the min and max date – please see table below:
x              y              z              new_col
1st Jan 2005   4th May 1998   2nd Mar 2009   1st Jan 2005
9th May 2010   14th Feb 2003  9th Jan 2008   9th Jan 2008
7th Sept 2002  8th Dec 2010   23rd May 2012  8th Dec 2010
So, for rows 1, 2, and 3 I would like the dates from columns x, z, and y, respectively. How can I go about this in R? I have used pmin and pmax, but I can't isolate the date in the middle.
Thanks in advance!
The approach below
coerces the character date strings to numeric type Date as there is no arithmetic with character dates,
finds the position of the "middle" date in each row
and returns the corresponding character string
which eventually becomes new_col.
This can be implemented using apply() on each row using an appropriate function:
df$new_col <- apply(df, 1L, function(x) x[order(lubridate::dmy(x))][2L])
df
x y z new_col
1 1st Jan 2005 4th May 1998 2nd Mar 2009 1st Jan 2005
2 9th May 2010 14th Feb 2003 9th Jan 2008 9th Jan 2008
3 7th Sept 2002 8th Dec 2010 23rd May 2012 8th Dec 2010
Note
This returns the expected result. new_col is a character date string.
However, if the OP intends to continue working with type Date, e.g. doing more arithmetic, I recommend following Ben's example: coerce the whole data.frame to type Date and stick with it.
First make sure all your dates are "Date" type, you can use dmy from lubridate for this (assumes your data frame is called df):
library(lubridate)
df[] <- lapply(df, dmy)
Next, sort each row in chronological order, and take the middle column (column 2) to be the new_col:
df$new_col <- as.Date(t(apply(df, 1, sort))[,2])
Finally, if you want the result to be displayed in same text format (e.g., "1st Jan 2005" instead of "2005-01-01") then you can use a custom function based on this answer:
library(dplyr)
date_to_text <- function(dates){
dayy <- day(dates)
suff <- case_when(dayy %in% c(11,12,13) ~ "th",
dayy %% 10 == 1 ~ 'st',
dayy %% 10 == 2 ~ 'nd',
dayy %% 10 == 3 ~'rd',
TRUE ~ "th")
paste0(dayy, suff, " ", format(dates, "%b %Y"))
}
df[] <- lapply(df, date_to_text)
Output
x y z new_col
1 1st Jan 2005 4th May 1998 2nd Mar 2009 1st Jan 2005
2 9th May 2010 14th Feb 2003 9th Jan 2008 9th Jan 2008
3 7th Sep 2002 8th Dec 2010 23rd May 2012 8th Dec 2010
Data
df <- structure(list(x = c("1st Jan 2005", "9th May 2010", "7th Sept 2002"
), y = c("4th May 1998", "14th Feb 2003", "8th Dec 2010"), z = c("2nd Mar 2009",
"9th Jan 2008", "23rd May 2012")), class = "data.frame", row.names = c(NA,
-3L))
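Since the question mentions pmin and pmax, here is a hedged sketch of how those two functions alone can isolate the middle date: for three values, the middle one equals the sum of all three minus the minimum and the maximum. The example assumes the columns have already been converted to class Date (ISO strings are used here so that the block runs without lubridate); the helper name num is made up for illustration.

```r
# Same three rows as in the question, already as Date vectors
x <- as.Date(c("2005-01-01", "2010-05-09", "2002-09-07"))
y <- as.Date(c("1998-05-04", "2003-02-14", "2010-12-08"))
z <- as.Date(c("2009-03-02", "2008-01-09", "2012-05-23"))

# Dates are stored as days since 1970-01-01, so do the arithmetic on numbers
num <- function(d) as.numeric(d)

# middle = sum of all three - smallest - largest (computed row by row)
middle <- as.Date(
  num(x) + num(y) + num(z) - num(pmin(x, y, z)) - num(pmax(x, y, z)),
  origin = "1970-01-01"
)
middle
```

This stays fully vectorised, which may matter for large tables, but it only works for exactly three columns; the sort-based approaches above generalise more easily.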
I am getting data from BLS website using the package blsAPI.
The code is:
library(rjson) # provides fromJSON()
library(blsAPI)
employ <- blsAPI(payload= "CES0500000001")
emp <- fromJSON(employ)
The data set emp is a list... this is where I am stumped. I've been trying all types of variations to convert emp to data.frame from list with no success.
Just set the argument return_data_frame = TRUE in the blsAPI function. A data.frame will then be returned instead of the raw JSON response (the default behaviour).
library(rjson)
library(blsAPI)
response <- blsAPI("CES0500000001", return_data_frame = TRUE)
head(response)
Output:
year period periodName value seriesID
1 2018 M08 August 126939 CES0500000001
2 2018 M07 July 126735 CES0500000001
3 2018 M06 June 126582 CES0500000001
4 2018 M05 May 126390 CES0500000001
5 2018 M04 April 126130 CES0500000001
6 2018 M03 March 125956 CES0500000001
I am looking at foreign powers intervening in civil wars using RStudio. The unit of analysis of my first dataset is the conflict year, while that of the second one is the conflict month. I need both of them in conflict years so that I can merge them.
Is there any command that allows you to do the opposite of expanding rows?
It's hard to give you specifics without a sample of your data so we know what the structure is. I'm assuming your month-level dataset stores the month as a character string that includes a year. You should be able to extract the year with separate from the tidyr package:
library(tidyverse)
month <- c("June 2015", "July 2015", "September 2016", "August 2016", "March 2014")
conflict <- c("A", "B", "C", "D", "E")
my.data <- data.frame(month, conflict)
my.data
month conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
my.data <- my.data %>%
separate(month, c("month", "year"), sep = " ")
my.data
month year conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
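Once the year is available, collapsing the month-level rows to one row per conflict and year is an aggregation step. The following is only a sketch in base R under stated assumptions: the column events and all values are hypothetical, and sum is just one possible way to combine months (max, mean, or any() may suit other variables better).

```r
# Hypothetical month-level data: several months per conflict
monthly <- data.frame(
  conflict = c("A", "A", "A", "B", "B"),
  month    = c("June 2015", "July 2015", "March 2016",
               "August 2016", "September 2016"),
  events   = c(2, 1, 4, 3, 5)               # made-up measure
)

# Keep only the trailing year from strings such as "June 2015"
monthly$year <- as.integer(sub("^.* ", "", monthly$month))

# One row per conflict-year; the monthly values are summed
yearly <- aggregate(events ~ conflict + year, data = monthly, FUN = sum)
yearly
```

The resulting yearly data frame has one row per conflict and year and could then be merged with a conflict-year dataset on those two keys, for example with merge(yearly, other, by = c("conflict", "year")), where other stands for the year-level dataset.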
I am trying to extract the unemployment rate data from this site. In the form, there is a select tag with some options. I can extract the table for the default years 2007 to 2017, but I am having a hard time setting a value for from_year and to_year. Here is the code I have so far:
session = html_session("https://data.bls.gov/timeseries/LNS14000000")
form = read_html("https://data.bls.gov/timeseries/LNS14000000") %>% html_node("table form") %>% html_form()
set_values(form, from_year = 2000, to_year = as.numeric(format(Sys.Date(), "%Y"))) # nothing happened if I set the value for years
submit_form(session, form)
It doesn't work as expected.
Thanks so much @Andrew!
I can use the api to extract the data.
library(rjson)
library(blsAPI)
uer1 <- list(
'seriesid'=c('LNS14000000'),
'startyear'=2000,
'endyear'=2009)
response <- blsAPI(uer1, 2, TRUE)
The response looks like:
year period periodName value seriesID
1 2009 M12 December 9.9 LNS14000000
2 2009 M11 November 9.9 LNS14000000
3 2009 M10 October 10.0 LNS14000000
4 2009 M09 September 9.8 LNS14000000
5 2009 M08 August 9.6 LNS14000000
6 2009 M07 July 9.5 LNS14000000
...
Note that the API has some query limits.
api limits
I have collected some time series data from the web and the timestamp that I got looks like below.
24 Jun
21 Mar
20 Jan
10 Dec
20 Jun
20 Jan
10 Dec
...
The interesting part is that the year is missing in the data, however, all the records are ordered, and you can infer the year from the record and fill in the missing data. So the data after imputing should be like this:
24 Jun 2014
21 Mar 2014
20 Jan 2014
10 Dec 2013
20 Jun 2013
20 Jan 2013
10 Dec 2012
...
Before rolling up my sleeves and writing a for loop with nested logic: is there an easy way that might work out of the box in R to impute the missing year?
Thanks a lot for any suggestion!
Here's one idea
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"))
## Convert each month-day combo to its corresponding "julian date"
datestring <- paste("2012", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 2014 - cumsum(diff(c(julian[1], julian))>0)
## Check that it worked
df
# day month year
# 1 24 Jun 2014
# 2 21 Mar 2014
# 3 20 Jan 2014
# 4 10 Dec 2013
# 5 20 Jun 2013
# 6 20 Jan 2013
# 7 10 Dec 2012
The OP has requested to complete the years in descending order starting in 2014.
Here is an alternative approach which works without date conversion and fake dates. Furthermore, this approach can be modified to work with fiscal years which start on a different month than January.
# create sample dataset
df <- data.frame(
day = c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
"Jan", "Jan", "Jun"))
df$year <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) > 0))
df
day month year
1 24 Jun 2014
2 21 Mar 2014
3 20 Jan 2014
4 10 Dec 2013
5 20 Jun 2013
6 20 Jan 2013
7 21 Jan 2012
8 10 Dec 2011
9 30 Jan 2011
10 10 Jan 2011
11 10 Jan 2011
12 7 Jun 2010
Completion of fiscal years
Let's assume the business has decided to start its fiscal year on February 1. Thus, January lies in a different fiscal year than February or March of the same calendar year.
To handle fiscal years, we only need to shuffle the factor levels accordingly:
df$fy <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb[c(2:12, 1)])) + df$day) > 0))
df
day month year fy
1 24 Jun 2014 2014
2 21 Mar 2014 2014
3 20 Jan 2014 2013
4 10 Dec 2013 2013
5 20 Jun 2013 2013
6 20 Jan 2013 2012
7 21 Jan 2012 2011
8 10 Dec 2011 2011
9 30 Jan 2011 2010
10 10 Jan 2011 2010
11 10 Jan 2011 2010
12 7 Jun 2010 2010
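The calendar-year and fiscal-year variants differ only in the order of the factor levels, so they can be folded into one small helper. The following is a sketch, not part of the original answer: the name impute_year and its arguments start_year and first_month are made up for illustration, and it assumes the records are in descending chronological order with English month abbreviations.

```r
# Hypothetical helper: impute years for day/month records sorted from
# newest to oldest. first_month = 1 gives calendar years; first_month = 2
# gives fiscal years starting on February 1, and so on.
impute_year <- function(day, month, start_year, first_month = 1L) {
  # rotate month.abb so that first_month comes first
  lev <- month.abb[(seq_len(12L) + first_month - 2L) %% 12L + 1L]
  # position within the (fiscal) year: larger means later
  key <- 100L * match(month, lev) + day
  # going back in time, an increase in key means we crossed into the
  # previous year
  start_year - cumsum(c(0L, diff(key) > 0))
}

day   <- c(24, 21, 20, 10, 20, 20, 10)
month <- c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec")

cal <- impute_year(day, month, start_year = 2014)
fy  <- impute_year(day, month, start_year = 2014, first_month = 2L)
cal
fy
```

As with the code above, two consecutive records with the same day and month are treated as belonging to the same year, and a gap of more than one year between records cannot be detected from the data alone.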