Handling dates and other questions in a small project - r

I am working on a small project for my class. I got a data set from a website: http://www.cfr.org/interactives/GH_Vaccine_Map/#map
I successfully imported the data using the code below:
disease <- read.csv('Added Source information for vaccine map-3.csv')
My questions are:
How can I see some specific values? For example, I want to list the rows where the "Category" column is "Measles". I just want to look at the values column by column :)
How can I handle these dates? When I check their mode, it appears to be numeric. How can I convert them to dates? Also, as you can see in the picture, some dates are intervals like "3/2010-9/2010", but others are a single date like "5/2014". What should I do with these dates to make a good visualization or something like that?
I could not add a picture of the table, since I just joined the website and do not have enough reputation to share an image yet.

Maybe this can serve as a start:
loc <- "https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AjivqRkNvfzudElKUDJKdlkya29wS2VUTlVFZlBoVVE&single=true&gid=0&output=csv"
dat <- read.csv(
  loc,
  na.strings = c("Sya", "unknown", "NA"),
  colClasses = c(
    rep("character", 3),
    rep("numeric", 2),
    "character",
    rep("numeric", 2),
    rep("character", 3)
  )
)
str(dat)
Here the colClasses argument explicitly defines the data type of each column; when the data does not fit, an error is raised. The na.strings argument specifies the entries that represent "not available" items, which get an NA value in the data.frame. There may be some problems with the character encoding that I did not look into here.
To see only the "Measles" entries, you could use:
View(subset(dat, Category=="Measles"))
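If you just want to look at the values column by column, you can also pull out single columns directly. A small sketch; Category is the only column name known from the question:
dat$Category            # all values in the Category column
unique(dat$Category)    # the distinct categories, e.g. to spot "Measles"
table(dat$Category)     # how often each category occurs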
For working with the Date column, a first idea is to split start and optional end month into two columns:
start_date <- function(d) strsplit(d, "-")[[1]][[1]]
start_date <- Vectorize(start_date)
end_date <- function(d) {
  spl <- strsplit(d, "-")[[1]]
  spl[[length(spl)]]
}
end_date <- Vectorize(end_date)
dat <- transform(dat,
  StartDate = start_date(Date),
  EndDate = end_date(Date),
  stringsAsFactors = FALSE
)
Now you can filter for entries that started in a given month, e.g.
str(subset(dat, StartDate=="12/2013"))
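If you need real Date objects for plotting, one possible follow-up is to parse the month/year strings with a placeholder day. This is only a sketch; to_month_date is an illustrative helper name, and it assumes the "month/year" format seen above:
to_month_date <- function(x) as.Date(paste0("01/", x), format = "%d/%m/%Y")
dat$StartDate <- to_month_date(dat$StartDate)
dat$EndDate <- to_month_date(dat$EndDate)
# e.g. everything that started in or after 2013
str(subset(dat, StartDate >= as.Date("2013-01-01")))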

Related

Convert character dates in r (weird format)

I have columns that are named "X1.1.21", "X12.31.20" etc.
I can get rid of all the "X"s by using the substring function:
names(df) <- substring(names(df), 2, 8)
I've been trying many different methods to change "1.1.21" into a date format in R, but I'm having no luck so far. How can I go about this?
R doesn't like column names that start with numbers (hence you get an X in front of them). However, you can still force R to allow column names that start with a number by using check.names = FALSE while reading the data.
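For example (a sketch only; "your_file.csv" is a placeholder file name):
df <- read.csv("your_file.csv", check.names = FALSE)
names(df)  # headers such as "1.1.21" are kept verbatim instead of becoming "X1.1.21"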
If you want to include date format as column names, you can use :
df <- data.frame(X1.1.21 = rnorm(5), X12.31.20 = rnorm(5))
names(df) <- as.Date(names(df), 'X%m.%d.%y')
names(df)
#[1] "2021-01-01" "2020-12-31"
However, note that they look like dates but are still of type 'character'
class(names(df))
#[1] "character"
So if you are going to use the column names for date calculations, you need to convert them to Date type first.
as.Date(names(df))

R summaries when dates in main df fall within ranges from small df

Similar to the do.call/lapply approach here, and the data.table approach here, but both have the setup of:
MainDF with data and startdate/enddate ranges
SubDF with a vector of single dates
Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have
MainDF with data and a vector of single dates
SubDF with startdate/enddate ranges
And I am looking to append summaries to SubDF for the multiple rows of MainDF data which fall within each SubDF range. Example:
library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
                                      by = "days",
                                      length.out = 10),
                     DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
                    DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)
Trying the data.table approach from the second link I figure it should be something like:
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]
But it errors with unused arguments for on. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA), but it was 3 repeats of the second value (in my actual data the final row won't run as it doesn't have a value for DateTo). I suspect that using MainDF$ means I've been subverting the grouping.
I suspect I'm close but I'm really struggling to get my head around the data.table mindset for complex use cases. The summary stats I'm looking to do are (for example data):
Mean & Max of DataA
length(which(DataA > 3))
difftime(last(Dates), first(Dates), units = "mins")
Dates[which.max(DataA)]
I added the interval line above because data.table's %between% help suggests one might be able to use a Dates %between% interval format, but it doesn't mention intervals/difftimes specifically in the text or examples, and my attempts are already failing elsewhere, so I'm loath to concentrate on improving my running while I can't walk!
I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr's group_by/group_by_if could be used instead? group_by_if's .predicate seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.
Thanks in advance for any help!
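For reference, a minimal sketch of how the non-equi join could be written once both frames are converted with setDT(); the summary columns are just illustrations of the statistics listed above, not a definitive answer:
library(data.table)
setDT(MainDF)
setDT(SubDF)
# one result row per SubDF range, aggregating the matching MainDF rows;
# x.Dates refers to MainDF's own Dates column (inside j the join columns
# otherwise carry the DateFrom/DateTo values from SubDF)
MainDF[SubDF,
       on = .(Dates >= DateFrom, Dates <= DateTo),
       .(MeanA = mean(DataA),
         MaxA = max(DataA),
         N_gt3 = sum(DataA > 3),
         DateOfMax = x.Dates[which.max(DataA)]),
       by = .EACHI]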

Custom function to find date column in df and standardize name to "date" in R

A lot of my work involves unioning new datasets to old, but often the standardized "date" name I have in the master dataset won't match up to the date name in the new raw data (which may be "Date", "Day", "Time.Period", etc...). To make life easier, I'd like to create a custom function that will:
Detect the date columns in the new and old datasets
Standardize the column name to "date" (oftentimes the raw new data will come in with the date column named "Date" or "Day" or "Time Period", etc.)
Here are a couple datasets to play with:
Dates_A <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
Dates_B <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
Numbers <- rnorm(365)
df_a <- data.frame(Dates_A, Numbers)
df_b <- data.frame(Dates_B, Numbers)
My first inclination is to try a for-loop that checks the class of the columns by index and automatically renames any with class Date to "date", but ideally I'd also like the function to handle the examples below, where the class of the date column may be character or factor.
Dates_C <- as.character(Dates_B)
df_c <- data.frame(Dates_C, Numbers)
df_d <- data.frame(Dates_C, Numbers, stringsAsFactors = FALSE)
If you have any ideas or can point me in the right direction, I'd really appreciate it!
Based on the description, we could check whether a particular column is of class Date, get a logical index, and assign the name 'date' to that column:
is.date <- function(x) inherits(x, 'Date')
names(df_a)[sapply(df_a, is.date)] <- 'date'
This assumes that there is only a single 'date' column in the dataset. If there are multiple 'date' columns, in order to avoid duplicate column names, use make.unique:
names(df_a) <- make.unique(names(df_a))
akrun's solution works for columns of class Date but not for columns of class factor or character as you ask at the end of the question, so maybe the following can be of use to you.
library(lubridate)
checkDates <- function(x) {
  op <- options(warn = -1)  # needed to keep stderr clean
  on.exit(options(op))      # reset to original value
  !all(is.na(ymd(x)))
}
names(df_c)[sapply(df_c, checkDates)] <- 'date'
names(df_d)[sapply(df_d, checkDates)] <- 'date'
Note that you could take inspiration from both solutions and combine them into one function: if inherits() returns TRUE you are done, otherwise try ymd().
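For example, a rough sketch of such a combined helper; the name standardize_date_name is purely illustrative:
library(lubridate)
standardize_date_name <- function(df) {
  looks_like_date <- function(x) {
    if (inherits(x, 'Date')) return(TRUE)                 # akrun's check first
    suppressWarnings(!all(is.na(ymd(as.character(x)))))   # then the ymd() fallback
  }
  names(df)[sapply(df, looks_like_date)] <- 'date'
  names(df) <- make.unique(names(df))                     # guard against duplicates
  df
}
df_c <- standardize_date_name(df_c)   # also works when the dates are character or factor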

How to subtract datetimes and store them in a separate column?

I am working with a CSV file and I have a column named "statistics_lastLocatedTime", as shown in the CSV file image.
I would like to subtract the second row of "statistics_lastLocatedTime" from the first row, the third row from the second row, and so on until the last row, and then store all these differences in a separate column and combine this column with the other related columns, as shown in the code given below:
## select related features
data <- read.csv("D:/smart tech/store/2016-10-11.csv")
(columns <- data[with(data, macAddress == "7c:11:be:ce:df:1d"),
                 c(2, 10, 11, 38, 39, 48, 50)])
write.csv(columns, file = "updated.csv", row.names = FALSE)
## take time difference
date_data <- read.csv("D:/R/data/updated.csv")
(dates <- date_data[1:40, c(2)])
NROW(dates)
for (i in 1:NROW(dates)) {
  j <- i + 1
  r1 <- strptime(paste(dates[i]), "%Y-%m-%d %H:%M:%S")
  r2 <- strptime(paste(dates[j]), "%Y-%m-%d %H:%M:%S")
  diff <- as.numeric(difftime(r1, r2))
  print(diff)
}
## combine time difference with other related columns
combine <- cbind(columns, diff)
combine
Now the problem is that I am able to get the differences between the rows but not able to store these values as a column and then combine that column with the other related columns. Please help me. Thanks in advance.
This is a four-liner:
Define a custom class 'myDate', and a converter function for your custom datetime, as per Specify custom Date format for colClasses argument in read.table/read.csv
Read in the datetimes as actual datetimes; no need to repeatedly convert later.
Simply use the vectorized diff operator on your date column (it sees the type and automatically dispatches a diff method for POSIXct datetimes). No need for for-loops:
setClass('myDate')  # this is not strictly necessary
setAs('character', 'myDate', function(from) {
  as.POSIXct(from, format = '%d-%m-%y %H:%M', tz = 'UTC')  # or whatever timezone
})
data <- read.csv("D:/smart tech/store/2016-10-11.csv",
                 colClasses = c('character', 'myDate', 'myDate', 'numeric', 'numeric', 'integer', 'factor'))
# ...
data$date_diff <- c(NA, diff(data$statistics_lastLocatedTime))
Note that diff() produces a result one element shorter than the vector we diff'ed, hence we have to pad it (e.g. with a leading NA, or whatever you want).
Consider directly assigning the diff variable using vapply. Also, there is no need for the separate date_data data.frame, as all operations can be run on the columns data.frame. Notice too the change in time format to align with the format currently in the dataframe:
columns$diff <- vapply(seq(nrow(columns)), function(i) {
  r1 <- strptime(paste(columns$statistics_lastLocatedTime[i]), "%d-%m-%y %H:%M")
  r2 <- strptime(paste(columns$statistics_lastLocatedTime[i + 1]), "%d-%m-%y %H:%M")
  diff <- difftime(r1, r2)
}, numeric(1))

How to add NA's to the data not available for some dates?

I have data for short-term electricity load forecasting. I have to clean the data, adding NAs for the dates (and blocks) with no data.
For example: 1st case: with some dates missing:
data<-data.frame(date=c("2014-01-01","2014-01-02","2014-01-04"),value=c(1,2,3))
Notice that 2014-01-03 is missing. So I want to add a row with this date and NA's corresponding to the columns for this date.
The required output data is:
out_data<-data.frame(date=c("2014-01-01","2014-01-02","2014-01-03","2014-01-04"),value=c(1,2,NA,3))
2nd case: with some blocks missing from the date:
1,2,3,4,5,7,9,10
Notice that 6,8 blocks are missing. So I want to add a row for these blocks (6,8) and NA's corresponding to the columns for these blocks.
The first problem is how to figure out the missing dates and blocks; once that is figured out, how do I add NAs as described above? I am trying to accomplish this using loops, but if someone has a better approach or knows an efficient package, please help.
Edit- The software I am using is R
Thanks
It is hard to know without a reproducible example, but I gave it a go:
Case1
let's create some dummy data:
days <- c(1,2,4:6,9)
yourDates <- as.Date(paste(2014, 1, days, sep = "-"))
set.seed(111)
data <- data.frame(date= yourDates, col1 = rnorm(6), col2 = sample(letters, 6))
specify the last desired date:
enddate <- max(data$date)
create a new dataframe with NAs for missing dates:
df <- merge(data, data.frame(date = seq(min(yourDates),
                                        as.Date(enddate), 1)), all.y = T)
# if you want to replace the dates where you have no records with NA:
df$date[!df$date %in% yourDates] <- NA
Case2
create full data based on your dates and blocks:
data2 <- expand.grid(block = 1:10, date = yourDates)
data with gaps (blocks 6 and 8 are missing from the first date, and some others are missing as well):
data2.gaps <- data2[-c(6,8, 15, 29),]
# and put NAs where block is missing:
data2$block <- data2.gaps$block[match(interaction(data2$block, data2$date),
                                      interaction(data2.gaps$block, data2.gaps$date))]
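If the real data carries one measurement per block/date combination, the same filling idea can be expressed as a left join onto the full grid. A sketch under that assumption; the value column here is hypothetical and not part of the dummy data above:
obs <- data.frame(date = yourDates[1], block = c(1, 2, 4, 5), value = c(10, 20, 40, 50))
full <- expand.grid(date = yourDates[1], block = 1:5)
merge(full, obs, all.x = TRUE)  # the row for block 3 gets NA in 'value'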
