I have columns that are named "X1.1.21", "X12.31.20" etc.
I can get rid of all the "X"s by using the substring function:
names(df) <- substring(names(df), 2, 8)
I've been trying many different methods to change "1.1.21" into a date format in R, but I'm having no luck so far. How can I go about this?
R doesn't like column names that start with numbers (hence the X in front of them). However, you can force R to allow column names that start with a number by passing check.names = FALSE when reading the data.
If you want date-formatted column names, you can use:
df <- data.frame(X1.1.21 = rnorm(5), X12.31.20 = rnorm(5))
names(df) <- as.Date(names(df), 'X%m.%d.%y')
names(df)
#[1] "2021-01-01" "2020-12-31"
However, note that they look like dates but are still of type 'character'
class(names(df))
#[1] "character"
So if you are going to use the column names for a date calculation, you need to convert them to Date type first.
as.Date(names(df))
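As a minimal sketch of the check.names = FALSE route mentioned above: the inline CSV text and the names csv_text, df_raw, col_dates and df_sorted are made up for illustration. The idea is to keep the original names on the data frame and hold the parsed dates in a separate Date vector that can be used for calculations such as ordering the columns.

```r
# Hypothetical two-column file whose headers start with digits
csv_text <- "1.1.21,12.31.20\n0.5,1.5\n0.7,1.1"

# check.names = FALSE stops R from prepending "X" to the headers
df_raw <- read.csv(text = csv_text, check.names = FALSE)
names(df_raw)

# Parse the headers into a real Date vector (month.day.two-digit-year)
col_dates <- as.Date(names(df_raw), format = "%m.%d.%y")
class(col_dates)

# Use the Date vector for calculations, e.g. put columns in chronological order
df_sorted <- df_raw[order(col_dates)]
names(df_sorted)
```

Keeping the dates in their own vector avoids the round trip through character that happens whenever they are stored as column names.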
Related
I have a dataset in which all of the date variables are messed up. All of the columns are characters. They look like this:
name <- c("Ana", "Maria", "Rachel", "Julia")
date_of_birth <- c("9/8/1997", "22/3/1966", "24/10/1969", "25/6/2019")
data <- as.data.frame(cbind(name, date_of_birth))
I need to turn those dates into dd/mm/yyyy format. They are already in this order, but I need to add a leading zero when dd or mm has only one digit.
For example, “9/8/1997” should be “09/08/1997”.
We can try this
> format(as.Date(date_of_birth, format = "%d/%m/%Y"), "%d/%m/%Y")
[1] "09/08/1997" "22/03/1966" "24/10/1969" "25/06/2019"
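The same conversion can be applied to the column inside the data frame. This is only a sketch: it rebuilds the example data with data.frame() directly and uses an intermediate variable named parsed, which is not in the original post.

```r
name <- c("Ana", "Maria", "Rachel", "Julia")
date_of_birth <- c("9/8/1997", "22/3/1966", "24/10/1969", "25/6/2019")
data <- data.frame(name, date_of_birth, stringsAsFactors = FALSE)

# Parse day/month/year strings; unparseable entries would become NA
parsed <- as.Date(data$date_of_birth, format = "%d/%m/%Y")

# Write the dates back as zero-padded dd/mm/yyyy text
data$date_of_birth <- format(parsed, "%d/%m/%Y")
data
```

Note that the result is character again; keep parsed around if you need to sort or subtract dates.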
I'm trying to make a new date field based on two other columns. If 'R' is present in the Indicator column, I want the date to be the ReportDate. If 'R' is not present, I want the date to be the IncidentDate. A working example:
IncidentDate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
ReportDate <- as.Date(c('2010-11-1','2008-5-25','2007-5-14'))
Indicator <- c('','R','')
incident_data <- data.frame(IncidentDate, ReportDate, Indicator)
typeof(IncidentDate) #double
incident_data$calculatedDate <- ifelse(incident_data$ReportDate=='R',as.Date(incident_data$ReportDate), as.Date(incident_data$IncidentDate))
This gives me an error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
I've also tried:
incident_data$calculatedDate <- ifelse(incident_data$ReportDate=='R',as.Date(as.character(incident_data$ReportDate)), as.Date(as.character(incident_data$IncidentDate)))
Which gives me the same error. Why might this be happening?
In base R, it may be better to use assignment on a logical vector instead of ifelse for Date class as ifelse can coerce and remove the Date attribute.
i1 <- incident_data$Indicator=='R'
incident_data$calculatedDate <- incident_data$IncidentDate
incident_data$calculatedDate[i1] <- incident_data$ReportDate[i1]
The logical should be based on the Indicator column. However, ifelse coerces the Date to its underlying numeric value. So, it may be better to use if_else or case_when. With if_else and case_when, there is a type check on the true and false cases.
library(dplyr)
if_else(incident_data$Indicator=='R',as.Date(incident_data$ReportDate),
as.Date(incident_data$IncidentDate))
#[1] "2010-11-01" "2008-05-25" "2007-03-14"
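To see the coercion described above without dplyr, here is a small base R sketch built from the question's data. The names raw and restored are introduced only for this illustration. The error in the question most likely comes from comparing the Date column ReportDate with 'R', which makes R try to parse 'R' as a date; the sketch compares Indicator instead.

```r
IncidentDate <- as.Date(c("2010-11-01", "2008-03-25", "2007-03-14"))
ReportDate   <- as.Date(c("2010-11-01", "2008-05-25", "2007-05-14"))
Indicator    <- c("", "R", "")

# ifelse() drops the Date class and returns plain numbers (days since 1970-01-01)
raw <- ifelse(Indicator == "R", ReportDate, IncidentDate)
class(raw)

# One way to recover the dates is to convert back with an explicit origin
restored <- as.Date(raw, origin = "1970-01-01")
restored
```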
A lot of my work involves unioning new datasets to old, but often the standardized "date" name I have in the master dataset won't match up to the date name in the new raw data (which may be "Date", "Day", "Time.Period", etc...). To make life easier, I'd like to create a custom function that will:
Detect the date columns in the new and old datasets
Standardize the column name to "date" (oftentimes the raw new data will come in with the date column named "Date" or "Day" or "Time Period", etc..)
Here are a couple datasets to play with:
Dates_A <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
Dates_B <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
Numbers <- rnorm(365)
df_a <- data.frame(Dates_A, Numbers)
df_b <- data.frame(Dates_B, Numbers)
My first inclination is to try a for-loop that searches for the class of the columns by index and automatically renames any with Class = Date to "date", but ideally I'd also like the function to solve for the examples below, where the class of the date column may be character or factor.
Dates_C <- as.character(Dates_B)
df_c <- data.frame(Dates_C, Numbers, stringsAsFactors = TRUE) # factor column; explicit because R >= 4.0 defaults to FALSE
df_d <- data.frame(Dates_C, Numbers, stringsAsFactors = FALSE)
If you have any ideas or can point me in the right direction, I'd really appreciate it!
Based on the description, we could check whether a particular column is Date class, get a logical index and assign the name of that column to 'date'
is.date <- function(x) inherits(x, 'Date')
names(df_a)[sapply(df_a, is.date)] <- 'date'
This assumes that there is only a single 'date' column in the dataset. If there are multiple 'date' columns, use make.unique to avoid duplicate column names:
names(df_a) <- make.unique(names(df_a))
akrun's solution works for columns of class Date but not for columns of classes factor or character like you ask at the end of the question, so maybe the following can be of use to you.
library(lubridate)
checkDates <- function(x) {
op <- options(warn = -1) # needed to keep stderr clean
on.exit(options(op)) # reset to original value
!all(is.na(ymd(x)))
}
names(df_c)[sapply(df_c, checkDates)] <- 'date'
names(df_d)[sapply(df_d, checkDates)] <- 'date'
Note that maybe you can get some inspiration on both solutions and combine them into one function. If inherits returns TRUE all done else try ymd.
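One possible way to combine the two ideas is sketched below in base R only. It swaps lubridate's ymd for as.Date with an explicit "%Y-%m-%d" format, which is an assumption about how the dates are written. The function names looks_like_date and standardize_date_name are hypothetical.

```r
# TRUE if the column is already a Date, or is text/factor that parses as yyyy-mm-dd
looks_like_date <- function(x) {
  if (inherits(x, "Date")) return(TRUE)
  if (is.factor(x)) x <- as.character(x)
  if (!is.character(x)) return(FALSE)
  # With an explicit format, as.Date() returns NA instead of raising an error
  parsed <- as.Date(x, format = "%Y-%m-%d")
  any(!is.na(parsed))
}

# Rename every detected date column to new_name, de-duplicating if needed
standardize_date_name <- function(df, new_name = "date") {
  hits <- vapply(df, looks_like_date, logical(1))
  names(df)[hits] <- new_name
  names(df) <- make.unique(names(df))
  df
}

# Small example with a factor date column (assumed yyyy-mm-dd text)
dates <- seq(as.Date("2017-01-01"), by = "day", length.out = 5)
example_df <- data.frame(Day = as.character(dates), Numbers = 1:5,
                         stringsAsFactors = TRUE)
names(standardize_date_name(example_df))
```

Other date layouts (for example "01/31/2017") would need additional formats tried inside looks_like_date.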
I am working with a CSV file that has a column named "statistics_lastLocatedTime" (see image: csv file).
I would like to subtract the second row of "statistics_lastLocatedTime" from the first row, the third row from the second row, and so on until the last row, then store all these differences in a separate column and combine that column with the other related columns, as shown in the code below:
##select related features
data <- read.csv("D:/smart tech/store/2016-10-11.csv")
(columns <- data[with(data, macAddress == "7c:11:be:ce:df:1d" ),
c(2,10,11,38,39,48,50) ])
write.csv(columns, file = "updated.csv", row.names = FALSE)
## take time difference
date_data <- read.csv("D:/R/data/updated.csv")
(dates <- date_data[1:40, c(2)])
NROW(dates)
for (i in 1:NROW(dates)) {
j <- i+1
r1 <- strptime(paste(dates[i]),"%Y-%m-%d %H:%M:%S")
r2 <- strptime(paste(dates[j]),"%Y-%m-%d %H:%M:%S")
diff <- as.numeric(difftime(r1,r2))
print (diff)
}
## combine time difference with other related columns
combine <- cbind(columns, diff)
combine
Now the problem is that I am able to get the differences between rows, but I am not able to store these values as a column and then combine that column with the other related columns. Please help me. Thanks in advance.
This is a four-liner:
Define a custom class 'myDate', and a converter function for your custom datetime, as per Specify custom Date format for colClasses argument in read.table/read.csv
Read in the datetimes as actual datetimes; no need to repeatedly convert later.
Simply use the vectorized diff operator on your date column (it sees their type, and automatically dispatches a diff function for POSIXct Dates). No need for for-loops:
setClass('myDate') # this is not strictly necessary
setAs('character','myDate', function(from) {
as.POSIXct(from, format='%d-%m-%y %H:%M', tz='UTC') # or whatever timezone
})
data <- read.csv("D:/smart tech/store/2016-10-11.csv",
colClasses=c('character','myDate','myDate','numeric','numeric','integer','factor'))
# ...
data$date_diff <- c(NA, diff(data$statistics_lastLocatedTime))
Note that diff() produces a result one element shorter than the vector that we diff'ed. Hence we have to pad it (e.g. with a leading NA, or whatever you want).
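A self-contained sketch of the padding step, using three made-up timestamps in an assumed "%d-%m-%y %H:%M" layout (the real file's layout may differ). The names stamps, times, gaps and result are illustrative only.

```r
# Hypothetical timestamps; the format string is an assumption about the data
stamps <- c("11-10-16 09:00", "11-10-16 09:05", "11-10-16 09:12")
times  <- as.POSIXct(stamps, format = "%d-%m-%y %H:%M", tz = "UTC")

# diff() returns a difftime of length n - 1; fix the units so rows are comparable
gaps <- as.numeric(diff(times), units = "mins")

# Pad with a leading NA so the new column lines up with the original rows
result <- data.frame(time = times, date_diff = c(NA, gaps))
result
```

Fixing the units explicitly avoids surprises, because difftime otherwise picks seconds, minutes, hours or days automatically depending on the size of the gaps.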
Consider directly assigning the diff variable using vapply. Also, there is no need for the separate date_data df, as all operations can be run on the columns df. Notice too the change in time format to align with the format currently in the dataframe:
columns$diff <- vapply(seq_len(nrow(columns)), function(i) {
r1 <- strptime(paste(columns$statistics_lastLocatedTime[i]), "%d-%m-%y %H:%M")
r2 <- strptime(paste(columns$statistics_lastLocatedTime[i+1]), "%d-%m-%y %H:%M")
# fixed units (minutes, chosen here as an example) keep rows comparable; the last row has no successor and yields NA
as.numeric(difftime(r1, r2, units = "mins"))
}, numeric(1))
I am working on a small project for my class. I got a data set from a website: http://www.cfr.org/interactives/GH_Vaccine_Map/#map
I successfully imported the data using the code below:
disease<- read.csv('Added Source information for vaccine map-3.csv')
My questions are:
How can I see some specific values? For example, I want to list the "Measles" rows in the "Category" column. I just want to see values column by column :)
How can I handle these dates? When I check the column's mode, it shows as numeric. How can I convert the values to dates? Also, as you can see in the picture, some dates are intervals like "3/2010-9/2010", while others are single dates like "5/2014". What should I do with these dates to make a good visualization or something like that?
I could not add a picture of the table, since I just joined the website and do not have enough reputation to share an image yet.
Maybe this can serve as a start:
loc <- "https://docs.google.com/spreadsheet/pub?hl=en_US&hl=en_US&key=0AjivqRkNvfzudElKUDJKdlkya29wS2VUTlVFZlBoVVE&single=true&gid=0&output=csv"
dat <- read.csv(
loc,
na.strings=c("Sya", "unknown", "NA"),
colClasses=c(
rep("character", 3),
rep("numeric", 2),
"character",
rep("numeric", 2),
rep("character", 3)
)
)
str(dat)
Here the colClasses argument explicitly defines the data type of each column. When the data does not fit, read.csv raises an error. The na.strings option specifies the entries that represent "not available" items, which get an NA value in the data.frame. There may be some problems with the character encoding that I did not look into here.
To see only the "Measles" entries, you could use:
View(subset(dat, Category=="Measles"))
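For readers who cannot reach the spreadsheet, the same ideas can be tried on a tiny inline table. The CSV text, its three column names and the object names dat_small and measles are invented for this sketch and do not reflect the real file's layout.

```r
# Made-up miniature of the data: one numeric column contains the text "unknown"
csv_text <- "Category,Cases,Date
Measles,120,5/2014
Mumps,unknown,3/2010-9/2010"

dat_small <- read.csv(
  text = csv_text,
  na.strings = c("unknown", "NA"),                     # read these entries as NA
  colClasses = c("character", "numeric", "character")  # one class per column
)
str(dat_small)

# subset() works in non-interactive sessions where View() is not available
measles <- subset(dat_small, Category == "Measles")
measles
```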
For working with the Date column, a first idea is to split start and optional end month into two columns:
start_date <- function(d) strsplit(d, "-")[[1]][[1]]
start_date <- Vectorize(start_date)
end_date <- function(d) {
spl <- strsplit(d, "-")[[1]]
spl[[length(spl)]]
}
end_date <- Vectorize(end_date)
dat <- transform(dat,
StartDate=start_date(Date),
EndDate=end_date(Date),
stringsAsFactors=FALSE
)
Now you can filter for entries that started in a given month, e.g.
str(subset(dat, StartDate=="12/2013"))
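If real Date values are needed for plotting, one option is to pin each month to its first day. This is a sketch under that assumption; month_to_date, date_strings, parts, start and end are hypothetical names, and the two sample strings are the ones quoted in the question.

```r
# Turn "m/yyyy" into a Date on the first day of that month (an arbitrary choice)
month_to_date <- function(m) as.Date(paste0("1/", m), format = "%d/%m/%Y")

date_strings <- c("3/2010-9/2010", "5/2014")
parts <- strsplit(date_strings, "-", fixed = TRUE)

# First piece is the start month; last piece is the end month (same as start if single)
start <- month_to_date(vapply(parts, function(p) p[1], character(1)))
end   <- month_to_date(vapply(parts, function(p) p[length(p)], character(1)))

data.frame(Date = date_strings, StartDate = start, EndDate = end)
```

With both columns as Date, intervals can be drawn as segments and single months as points.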