Converting numbers to dates in R

I have a large dataset that I'm importing from a txt file. It has multiple date variables that are being read as numeric values such as 20190101. Is there a way to assign a date format as part of the import? There is no header in the file, so I'm assigning names and widths; sample code below.
df <- read_fwf("file name",
fwf_cols(id = 8,
update_date = 8,
name = 35),
skip = 0)
Or is there a way to convert multiple values in one statement vs one at a time?
df$update_date <- as.Date(as.character(df$update_date), "%Y%m%d")

Here is a way to convert multiple columns to Dates in one statement
(assuming a yyyymmdd format). Here we target all columns whose names end with "date".
library(dplyr)
df <- data.frame(update_date = c(20190101, 20190102, 20190103),
end_date = c(20200101, 20200102, 20200103))
df %>% mutate_at(vars(ends_with("date")), ~as.Date(as.character(.x),format="%Y%m%d"))
You might similarly use
mutate_at(vars(starts_with("date")), ~as.Date(as.character(.x), format = "%Y%m%d"))
or
mutate_at(vars(c(update_date, end_date)), ~as.Date(as.character(.x), format = "%Y%m%d"))
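On the first part of the question (setting the date format at import time), readr lets you pass column specifications through col_types. The following is a minimal sketch, not a tested solution for the asker's file: the temporary file and its two rows are made up for illustration, and only the column names and widths come from the question.

```r
library(readr)

# Hypothetical sample file: 8-char id, 8-char date, 35-char name, no header
path <- tempfile(fileext = ".txt")
writeLines(
  sprintf("%08d%s%-35s", 1:2, c("20190101", "20190102"), c("Alice", "Bob")),
  path
)

df <- read_fwf(
  path,
  fwf_cols(id = 8, update_date = 8, name = 35),
  col_types = cols(
    id = col_character(),                       # keep leading zeros
    update_date = col_date(format = "%Y%m%d"),  # parse yyyymmdd while reading
    name = col_character()
  )
)

class(df$update_date)
```

Each additional date column would get its own col_date(format = "%Y%m%d") entry in cols().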

Related

Can't figure out how to change "X5.13.1996" to date class?

I have dates listed as "X5.13.1996", representing May 13th, 1996. The class for the date column is currently a character.
When I use mdy from lubridate, it keeps returning NA. Is there code I can use to strip the "X" so that mdy works? Is there anything else I can do?
You can use substring(date_variable, 2) to drop the first character from the string.
substring("X5.13.1996", 2)
[1] "5.13.1996"
To convert a variable (i.e., column) in your data frame:
library(dplyr)
library(lubridate)
dates <- data.frame(
dt = c("X5.13.1996", "X11.15.2021")
)
dates %>%
mutate(converted = mdy(substring(dt, 2)))
or, without dplyr:
dates$converted <- mdy(substring(dates$dt, 2))
Output:
dt converted
1 X5.13.1996 1996-05-13
2 X11.15.2021 2021-11-15
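If you would rather avoid the lubridate dependency, a base R sketch of the same idea is to strip the leading "X" with sub() and give as.Date an explicit format. The vector name x is only for illustration.

```r
x <- c("X5.13.1996", "X11.15.2021")

# Remove a leading "X", then parse month.day.year
converted <- as.Date(sub("^X", "", x), format = "%m.%d.%Y")
converted
```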

Changing Dates in R from webscraper but not able to convert

I am trying to complete a problem that pulls from two data sets that need to be combined into one data set. To get to this point, I need to rbind both data sets by the year-month information. Unfortunately, the first data set needs to be tallied by year-month info, and I can't seem to figure out how to change the date so I can have month-year info rather than month-day-year info.
This is data on avalanches, and I need to write code to tally the number of avalanches each month for the snow season, defined as Dec-Mar. How do I do that?
I keep trying to convert the date to a month-year format, but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn into NAs. Help!
# write the webscraper
library(XML)
library(RCurl)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep="")
this.webpage<-htmlParse(getURL(this.url))
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T)
avalanche<-rbind(avalanche,thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
avalancheslc$monthyear<-format(as.Date(avalancheslc$Date),"%Y-%m")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date avalanches
2000-1 18
2000-2 4
2000-3 10
2000-12 12
2001-1 52
This should work (I tried it on only 1 page, not all 203). Note the option stringsAsFactors = F in the readHTMLTable call, and the need to set the column names explicitly because one column does not get a name automatically.
library(XML)
library(RCurl)
library(dplyr)
library(lubridate)  # provides year() and month() used below
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
this.url <- paste(avalanche.url, page, sep = "")  # no space between the URL and the page number
this.webpage <- htmlParse(getURL(this.url))
thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T,
stringsAsFactors = F)
names(thispage.avalanche) <- c('Date','Region','Location','Observer')
avalanche <- rbind(avalanche,thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
avalancheslc <- mutate(avalancheslc, Date = as.Date(Date, format = "%m/%d/%Y"),
monthyear = paste(year(Date), month(Date), sep = "-"))
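The tally itself is not shown above, so here is one possible sketch in base R. Two notes: as.Date(x, format = "%y-%m") returns NA because the format has to describe the string as it currently is (month/day/year, per the format used above) and a Date always needs a day; the year-month label is produced afterwards with format(). The small data frame is made up so the example runs without scraping, and the labels are zero-padded ("2001-01" rather than "2001-1").

```r
# Hypothetical stand-in for the scraped Salt Lake data
avalancheslc <- data.frame(
  Date = c("12/15/2000", "12/20/2000", "1/5/2001", "1/9/2001", "7/4/2001"),
  stringsAsFactors = FALSE
)

# Parse using the format the strings are actually in
avalancheslc$Date <- as.Date(avalancheslc$Date, format = "%m/%d/%Y")
avalancheslc$monthyear <- format(avalancheslc$Date, "%Y-%m")

# Keep only the snow season (Dec-Mar), then count rows per year-month
in_season <- as.integer(format(avalancheslc$Date, "%m")) %in% c(12, 1, 2, 3)
season <- avalancheslc[in_season, ]
tally <- as.data.frame(table(date = season$monthyear),
                       responseName = "avalanches",
                       stringsAsFactors = FALSE)
tally
```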

I want to run code on data frame up to a certain date (column 2)

I am trying to run code on a data frame up to a certain date. I have individual game statistics; the second column is Date, in chronological order. I thought this was how to do it, but I get an error:
Error in `[.data.frame`(dfmess, dfmess$Date <= Standingdate) :
undefined columns selected
Here is my code:
read.csv("http://www.football-data.co.uk/mmz4281/1516/E0.csv")
dfmess <- read.csv("http://www.football-data.co.uk/mmz4281/1516/E0.csv", stringsAsFactors = FALSE)
Standingdate <- as.Date("09/14/15", format = "%m/%d/%y")
dfmess[dfmess$Date <= Standingdate] -> dfmess
You probably want to convert dfmess$Date to the Date class before comparing. In addition, per @Roland's comment, you need an extra comma so that the condition selects rows rather than columns:
dfmess <- read.csv("http://www.football-data.co.uk/mmz4281/1516/E0.csv", stringsAsFactors = FALSE)
dfmess$Date <- as.Date(dfmess$Date, "%d/%m/%y")  # football-data.co.uk lists dates as dd/mm/yy; check against your file
Standingdate <- as.Date("09/14/15", format = "%m/%d/%y")
dfmess[dfmess$Date <= Standingdate, ]
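To see the effect of the comma without downloading the file, here is a small self-contained sketch. The games data frame and its values are hypothetical, and the dd/mm/yy date format is an assumption about the source data.

```r
# Hypothetical miniature of the match data
games <- data.frame(
  Date = c("08/08/15", "15/08/15", "12/09/15", "19/09/15"),
  HomeTeam = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Convert the character column to Date before comparing
games$Date <- as.Date(games$Date, format = "%d/%m/%y")
cutoff <- as.Date("2015-09-14")

# With the trailing comma the logical vector selects rows;
# without it, R treats the vector as a column selector
early <- games[games$Date <= cutoff, ]
early
```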

Why does readr store date objects as integer values?

When reading in csv files using the readr package date objects are stored as integer values. When I say stored as integer I don't mean the class of the date column, I mean the underlying date value R stores. This prevents the ability to use the dplyr join functions if one data frame's dates are stored as numeric values and the other's are integer. I've included a reproducible example below. Is there anything I can do to prevent this behavior?
library(readr)
library(dplyr)  # for full_join()
df1 <- data.frame(Date = as.Date(c("2012-11-02", "2012-11-04", "2012-11-07", "2012-11-09", "2012-11-11")), Text = c("Why", "Does", "This", "Happen", "?"), stringsAsFactors = F)
class(df1$Date)
# [1] "Date"
dput(df1$Date[1])
# structure(15646, class = "Date")
# Write to dummy csv
write.csv(df1, file = "dummy_csv.csv", row.names = F)
# Read back in data using both read.csv and read_csv
df2 <- read.csv("dummy_csv.csv", as.is = T, colClasses = c("Date", "character"))
df3 <- read_csv("dummy_csv.csv")
# Examine structure of date values
class(df2$Date)
# [1] "Date"
class(df3$Date)
# [1] "Date"
dput(df2$Date[1])
# structure(15646, class = "Date")
dput(df3$Date[1])
# structure(15646L, class = "Date")
# Try to join using dplyr joins
both <- full_join(df2, df3, by = c("Date"))
Error: cannot join on columns 'Date' x 'Date': Can't join on 'Date' x 'Date' because of incompatible types (Date / Date)
# Base merge works
both2 <- merge(df2, df3, by = "Date")
# a Date converted from a POSIXct object is also stored as numeric
temp_date <- as.Date(as.POSIXct("11OCT2012:19:00:00", format = "%d%b%Y:%H:%M:%S"))
dput(temp_date)
# structure(15624, class = "Date")
Judging by this issue on the dplyr repo, it seems Hadley thinks this is a feature, but whenever date values are stored differently you can't join on them, and I haven't figured out how to convert an integer-backed Date to a numeric one. Is there any way to stop readr from doing this, or to convert a Date stored as an integer to a numeric value?
According to Hadley himself, this is a bug in dplyr, not readr. He says storing dates as integer rather than numeric when reading files is fine, but dplyr should handle the difference the way merge does.
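As a workaround for the second part of the question, one way to turn an integer-backed Date into a double-backed one is to strip it to a plain number and rebuild it from the epoch. This is a sketch in base R; the name d_int is illustrative and stands in for a column such as df3$Date.

```r
# An integer-backed Date, like the one readr produced in the question
d_int <- structure(15646L, class = "Date")
typeof(d_int)   # "integer"

# Drop to a plain double, then rebuild the Date from the Unix epoch
d_num <- as.Date(as.numeric(d_int), origin = "1970-01-01")
typeof(d_num)   # "double"
```

Applied to the example above, that would be df3$Date <- as.Date(as.numeric(df3$Date), origin = "1970-01-01") before the join.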

Subsetting data table in R by date

I need to subset my data by a date range, below is the code.
I read in two .csv files (data2010, data2), change the date format to exclude the timestamp, rename the headers so they match in both files, then combine them (data2011).
The files do seem to combine, but when I subset by the date range, no observations are returned.
However, the dates are grouped like 01/01/10 01/01/11 01/02/10 01/02/11,
i.e. same month, same day, different year.
data2010 <- read.csv(file="2010final.csv")
data2 <- read.csv(file="2011final.csv")
#change format of timestamp to date with mm/dd/yyyy for 2011
data2$newdate <-strptime(as.character(data2$Date), "%m/%d/%y")
data2$Date <- format(data2$newdate, "%m/%d/%y")
data2$newdate <- NULL
#rename and format 2010
names(data2010) <- c("Region", "District", "Age", "Gender", "Marital Status", "Date", "Reason")
data2010$newdate <-strptime(as.character(data2010$Date), "%m/%d/%y %H")
data2010$Date <- format(data2010$newdate, "%m/%d/%y")
data2010$newdate <- NULL
#merge
data2011 <- rbind(data2010, data2)
summary(data2011)
str(data2011)
#I see from the above commands that the files have merged
jan6Before <- subset(data2011, Date >= "12/22/10" & Date <= "01/06/11")
summary(jan6Before)
str(jan6Before)
#But this does not produce any observations
I suspect it's because your Date variable is a character, not a date, and is being compared to another character constant, i.e. "12/22/10".
I suggest you have a look at the lubridate package. You can then easily convert the character values (here month-day-year) for comparison, e.g. mdy(Date) >= mdy("12/22/10").
Merge on your variable newdate, and use that for subsetting as well.
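A short base R sketch of why the character comparison returns nothing and how real dates fix it. The visits data frame is made up for illustration; the mm/dd/yy format and the date range come from the question.

```r
# Hypothetical stand-in for the combined data
visits <- data.frame(
  Date = c("12/21/10", "12/25/10", "01/03/11", "01/10/11"),
  stringsAsFactors = FALSE
)

# As strings, nothing can be >= "12/22/10" and <= "01/06/11" at once,
# because text is compared character by character and "1" sorts after "0"
as_text <- subset(visits, Date >= "12/22/10" & Date <= "01/06/11")
nrow(as_text)

# Convert to Date, then compare against Date values
visits$Date <- as.Date(visits$Date, format = "%m/%d/%y")
jan6Before <- subset(visits,
                     Date >= as.Date("2010-12-22") & Date <= as.Date("2011-01-06"))
jan6Before
```

The same ordering issue explains the 01/01/10, 01/01/11 grouping seen in the question: the values are being sorted as text.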
