I have multiple datasets of physical variables that I want to process with R, and I would like to handle them through a list. Here is my code for one of the data frames:
# Table definition
df.jannuary <- read.table("C:\\...file1.csv", sep=";")
# Subset of the table containing only variables of interest
df.jannuary_sub <- subset(df.jannuary, select=c(2:8, 11:12))
# Column names
colnames(df.jannuary_sub)<-c("year","day","hour","minute","temp_air","temp_eau","humidity_rel","wind_intensity","wind_direction")
# Aggregation of the 4 Year-Day-Hour-Minute columns into a single column and conversion into a POSIXct object through the temporary column "timestamp"
df.jannuary_sub$timestamp <- as.POSIXct(paste(df.jannuary_sub$year, df.jannuary_sub$day, df.jannuary_sub$hour, df.jannuary_sub$minute), format="%Y %j %H %M", tz="GMT")
# Getting the date with a new format from julian day to normal day into a column called "date"
df.jannuary_sub$date <- format(df.jannuary_sub$timestamp,"%d/%m/%Y %H:%M",tz = "GMT")
# Suppression of the 4 Year-Day-Hour-Minute initial columns and of the temporary column "timestamp", and placement of the date column as column 1
df.jannuary_sub <- subset(df.jannuary_sub, select=c(11, 5:9))
This code works. The thing is, I have data for every month of the year, over several years.
So I started to use a list. Here is the example for the year 2011:
# Backslashes in paths must be doubled: a single "\f" is read as a form feed
df.jannuary <- read.table("C:\\...\\file1.dat", sep=",")
#...
df.december <- read.table("C:\\...\\file12.dat", sep=",")
# Creation of a list containing the month datasets, with a subset of the tables containing only variables of interest
list.dataset_2011<-list(
df.jannuary_sub <- subset(df.jannuary, select=c(2:8, 11:12)),
#...
df.december_sub <- subset(df.december, select=c(2:8, 11:12))
)
# Column names for all variables of the list
for (j in 1:12)
{
colnames(list.dataset_2011[[j]])<-c("year","day","hour","minute","temp_air","temp_eau","humidity_rel","wind_intensity","wind_direction")
}
# Conversion of the list into a data.frame called "list.dataset_2011"
for (i in 1:9)
{
list.dataset_2011[[i]]<-as.data.frame(list.dataset_2011[[i]])
}
# Aggregation of the 4 Year-Day-Hour-Minute columns into a single column and conversion into a POSIXct object through the temporary column "timestamp"
list.dataset_2011$timestamp <- as.POSIXct(paste(list.dataset_2011$year, list.dataset_2011$day, list.dataset_2011$hour, list.dataset_2011$minute), format="%Y %j %H %M", tz="GMT")
# Getting the date with a new format from julian day to normal day into a column called "date"
list.dataset_2011$date <- format(list.dataset_2011$timestamp,"%d/%m/%Y %H:%M",tz = "GMT")
# Suppression of the 4 Year-Day-Hour-Minute initial columns and of the temporary column "timestamp", and placement of the date column as column 1
list.dataset_2011 <- subset(list.dataset_2011, select=c(11, 5:9))
I encounter a problem at the end of my code (hoping the rest works!) with the subset command, which doesn't appear to work on an object of class "list".
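One possible way to keep the list is to wrap the single-data-frame steps in a function and apply it to each element with lapply(), since subset() and `$` operate on one data frame and not on a list of them. This is only a sketch: clean_month() and fake_month() are hypothetical helpers, and the two small tables are made-up stand-ins for the real, already-subsetted monthly files.

```r
# Hypothetical helper: the single-data-frame steps from above, for one month
clean_month <- function(df) {
  colnames(df) <- c("year", "day", "hour", "minute", "temp_air", "temp_eau",
                    "humidity_rel", "wind_intensity", "wind_direction")
  # Year + day of year + hour + minute -> POSIXct
  timestamp <- as.POSIXct(paste(df$year, df$day, df$hour, df$minute),
                          format = "%Y %j %H %M", tz = "GMT")
  df$date <- format(timestamp, "%d/%m/%Y %H:%M", tz = "GMT")
  # Keep the date first, then the measured variables
  df[, c("date", "temp_air", "temp_eau", "humidity_rel",
         "wind_intensity", "wind_direction")]
}

# Made-up stand-in for one already-subsetted monthly table (9 columns)
fake_month <- function(day) {
  data.frame(V2 = 2011, V3 = day, V4 = 13, V5 = 30,
             V6 = 5.1, V7 = 7.2, V8 = 80, V11 = 3.4, V12 = 270)
}
list.dataset_2011 <- list(january = fake_month(15), february = fake_month(46))

# Apply the helper to every element of the list
list.dataset_2011 <- lapply(list.dataset_2011, clean_month)

# Optionally stack all months into a single data frame
df.2011 <- do.call(rbind, list.dataset_2011)
```

With the real data, the list would be built from the twelve subsetted tables instead of fake_month().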
In my data frame 'cord', the date and time are stored together in the column 'Time' (class factor). I want to split them into two separate columns, one for the date and one for the time.
The data looks like this:
1 2019-05-26T13:50:56.335288Z
2 2019-05-26T17:55:45.348073Z
3 2019-05-26T18:12:00.882572Z
4 2019-05-26T18:26:49.577310Z
I have successfully extracted the date using: cord$Date <- as.POSIXct(cord$Time)
I have however not been able to find a way to extract the time in format "H:M:S".
The output of dput(head(cord$Time)) returns a long list of timestamps: "2020-04-02T13:34:07.746777Z", "2020-04-02T13:41:11.095014Z",
"2020-04-02T14:08:05.508818Z", "2020-04-02T14:17:10.337101Z", and so on...
Extract H:M:S
library(lubridate)
format(as_datetime(cord$Time), "%H:%M:%S")
#> [1] "13:50:56" "17:55:45" "18:12:00" "18:26:49"
If you need fractional seconds too (%OS6 prints six decimal places):
format(as_datetime(cord$Time), "%H:%M:%OS6")
#> [1] "13:50:56.335288" "17:55:45.348073" "18:12:00.882572" "18:26:49.577310"
where cord is:
cord <- read.table(text = " Time
1 2019-05-26T13:50:56.335288Z
2 2019-05-26T17:55:45.348073Z
3 2019-05-26T18:12:00.882572Z
4 2019-05-26T18:26:49.577310Z ", header = TRUE)
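A base R variant without lubridate, as a rough sketch. It assumes every timestamp has exactly this ISO 8601 layout and is in UTC; the variable names ts and out are only illustrative.

```r
# The four timestamps from the question, as plain strings
Time <- c("2019-05-26T13:50:56.335288Z", "2019-05-26T17:55:45.348073Z",
          "2019-05-26T18:12:00.882572Z", "2019-05-26T18:26:49.577310Z")

# as.character() also covers the case where the column is a factor.
# The trailing "Z" is not part of the format, so the time zone is set explicitly.
ts <- as.POSIXct(as.character(Time), format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# One column for the date, one for the time of day
out <- data.frame(Date = format(ts, "%Y-%m-%d"),
                  Time = format(ts, "%H:%M:%S"))
out
```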
I typically use lubridate and data.table for my date and data manipulation work. This works for me when I copy in some of your raw dates as strings:
library(lubridate)
library(data.table)
x <- c("2019-05-26T13:50:56.335288Z", "2019-05-26T17:55:45.348073Z")
# lubridate to parse to date time
y <- parse_date_time(x, "ymd HMS")
# data.table to split in to dates and time
split_y <- tstrsplit(y, " ")
dt <- as.data.table(split_y)
# name both columns (the split produces V1 and V2)
setnames(dt, c("Date", "Time"))
dt[]
# if you use data.frames instead
df <- as.data.frame(dt)
df
I want to calculate the number of months between two dates, but before that I have a problem when loading the data in R. In the CSV sheet the format is mm/dd/yyyy, but in R the variable is read as character.
I tried
data$issue_d <- format(as.Date(data$issue_d), "%m/%d/%Y")
to convert it to a date first, but it gives the following error:
character string is not in a standard unambiguous format
Any suggestion for this?
Example input:
issue_d <- c("Dec,2011","Nov,2014","Apr,2015")
Try below:
# example data
df1 <- data.frame(
issue_d1 = c("Dec,2011","Nov,2014","Apr,2015"),
issue_d2 = c("Nov,2015","Sep,2017","Apr,2018"))
library(zoo)
df1$Months <-
(as.yearmon(df1$issue_d2, "%b,%Y") -
as.yearmon(df1$issue_d1, "%b,%Y")) * 12
df1
# issue_d1 issue_d2 Months
# 1 Dec,2011 Nov,2015 47
# 2 Nov,2014 Sep,2017 34
# 3 Apr,2015 Apr,2018 36
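If the column really is in mm/dd/yyyy form as described in the question, a base R sketch could look like the following. The dates are made up for illustration, and the result counts calendar months only, ignoring the day of the month.

```r
# Made-up dates in the mm/dd/yyyy layout described in the question
issue_d1 <- c("12/01/2011", "11/15/2014")
issue_d2 <- c("11/01/2015", "09/15/2017")

# Giving as.Date() an explicit format avoids the
# "character string is not in a standard unambiguous format" error
d1 <- as.POSIXlt(as.Date(issue_d1, format = "%m/%d/%Y"))
d2 <- as.POSIXlt(as.Date(issue_d2, format = "%m/%d/%Y"))

# Difference in calendar months: 12 per year plus the month offset
months_between <- 12 * (d2$year - d1$year) + (d2$mon - d1$mon)
months_between
```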
I am working with the Geolife Trajectories 1.3 dataset (https://www.microsoft.com/en-us/download/confirmation.aspx?id=52367).
It contains a bunch of folders, one per user.
Each user has several separate .plt files with GPS coordinates and date-time info.
Some users also have a labels file with time intervals and the transportation type
taken by the user (airplane, car, etc.).
I created two datasets. The first contains all user IDs, date-times and other
info that is irrelevant for now:
ID DATE
20 2007-04-29 08:34:32
... ...
100 2007-04-29 12:35:04
The second contains all user IDs, start times, end times and transportation types:
ID Start.Time End.Time Transportation
1: 21 2007/04/29 12:34:24 2007/04/29 12:53:45 taxi
2: 21 2007/04/29 22:27:11 2007/04/30 04:28:00 car
...
From the Start.Time and End.Time columns of the second dataset I created a dataset of lubridate intervals:
2007-04-29 12:34:24 UTC--2007-04-29 12:53:45 UTC
...
2007-04-29 22:27:11 UTC--2007-04-30 04:28:00 UTC
Then I wrote two functions:
# function for single row label processing
# will search row's DATE in a subset of intervals for current ID
# if TRUE - will search for a label in a subset of labels for current ID
get_label <- function(id, date, labels_subset, interval_subset) {
  # convert date to POSIX time
  single_time <- as.POSIXct(date)
  # search for current time in intervals subset and get label
  result <- labels_subset[single_time %within% interval_subset]$Transportation
  # check for result, if there is none -> return NA
  if (identical(as.vector(result), character(0))) {
    # "is type 'character' but expecting type 'logical'. Column types must be
    # consistent for each group." will raise if `return(NA)` without `as.char`
    return(as.character(NA))
  } else {
    return(as.character(result))
  }
}
and
# function for ID subset label processing
# will create a subset of intervals for current ID
# will create a subset of labels for current ID
get_group <- function(tab) {
  # grep ID
  id <- tab$ID[1]
  # create interval subset for ID
  interval_subset <- intervals[labels_d$ID == id]
  # create label subset for ID
  labels_subset <- labels_d[labels_d$ID == id]
  # pass all data for get_label function -- process `tab` by row
  tab[, get_label(as.integer(ID), as.character(DATE), labels_subset, interval_subset), 1:nrow(tab)]
}
I want to get a vector of labels: the label if DATE falls within one of the
lubridate intervals for the current ID, and NA otherwise.
And tmp <- get_group(dt[ID %in% c(21, 110)]) works:
> unique(tmp$V1)
[1] NA "car" "walk"
But tmp <- dt[, get_group(.SD), by = ID] does not work properly: it outputs only NAs (and dt has only two IDs, 21 and 110):
> unique(tmp$V1)
[1] NA
Even if I create a data.table with only one ID, calling the function on it directly works, while calling it on .SD with by = ID does not:
tmp<- DT[ID==21]
unique(tmp[, get_group(.SD), by = ID]$V1)
>[1] NA
unique(get_group(tmp)$V1)
>[1] NA "car" "walk"
What am I doing wrong?
UPD:
I should have printed .SD earlier.
By default, data.table does not include the by= columns in .SD, so my function could not access ID. Sadly, there is no standard warning about that.
.SDcols did the trick:
tmp[, get_group(.SD), by = ID, .SDcols=c('ID', 'DATE')]
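The behaviour is easy to check on a toy table. In this sketch the table and the column names cols and id_seen are made up; it only shows that the grouping column is absent from .SD by default and that its value is still available per group through .BY.

```r
library(data.table)

# Toy table with the same two IDs as in the question (DATE values are made up)
DT <- data.table(ID = c(21L, 21L, 110L),
                 DATE = c("08:34:32", "12:40:00", "09:00:00"))

# Which columns does .SD contain inside each group? Only DATE, not ID.
sd_cols <- DT[, .(cols = paste(names(.SD), collapse = ",")), by = ID]

# The group's ID is still reachable through .BY
by_id <- DT[, .(id_seen = .BY$ID, n = .N), by = ID]
```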
You can do a data.table non-equi join as follows:
ds2[ds1, on=.(ID, Start.Time <= DATE, End.Time >= DATE)]
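A small runnable sketch of that join, assuming ds1 is the table of user IDs and DATEs and ds2 is the table of labelled intervals. The rows below are made up in the style of the question, and the name res is only illustrative.

```r
library(data.table)

# Points: one row per GPS record (made-up times)
ds1 <- data.table(
  ID   = c(21L, 21L, 21L),
  DATE = as.POSIXct(c("2007-04-29 08:34:32", "2007-04-29 12:40:00",
                      "2007-04-29 23:00:00"), tz = "UTC"))

# Intervals: one row per labelled trip
ds2 <- data.table(
  ID             = c(21L, 21L),
  Start.Time     = as.POSIXct(c("2007-04-29 12:34:24", "2007-04-29 22:27:11"), tz = "UTC"),
  End.Time       = as.POSIXct(c("2007-04-29 12:53:45", "2007-04-30 04:28:00"), tz = "UTC"),
  Transportation = c("taxi", "car"))

# For every row of ds1, find the ds2 interval that contains DATE.
# Rows without a matching interval are kept, with NA as the label.
res <- ds2[ds1, on = .(ID, Start.Time <= DATE, End.Time >= DATE)]
res$Transportation
```

Note that in the result the Start.Time and End.Time columns hold the DATE values from ds1, which is how data.table reports non-equi join columns.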
I have a factor column with values such as April-2017, February-2017, etc. I want to convert it to month and year so that the column can be ordered chronologically, starting from January. I tried the following:
Combined$Month <- as.yearmon(levels(Combined$Month))[Combined$Month] # yields NA
Combined$Month <- as.Date(Combined$Month,'%B-%Y') # yields NA
The "yearmon" class can represent year-month and sorts as expected:
library(zoo)
x <- factor(c('April-2017', 'February-2017')) # test data
ym <- as.yearmon(x, "%B-%Y")
sort(ym)
## [1] "Feb 2017" "Apr 2017"
Because of this you don't really need to convert it to "Date" class, nor do you need the year and month separately. If for some reason not stated in the question you still need separate values, as.integer(ym) and cycle(ym) give the years as 4-digit numbers and the months as numbers between 1 and 12. Also, as.Date(ym) gives "Date" class values.
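If the goal is to keep the original labels but have the factor sort chronologically, one option is to reorder its levels by the "yearmon" values. This is a sketch with made-up test data; x_ordered is an illustrative name, and "%B" assumes English month names in the current locale.

```r
library(zoo)

# Made-up test data with a repeated value and two different years
x <- factor(c("April-2017", "February-2017", "January-2018", "February-2017"))
ym <- as.yearmon(as.character(x), "%B-%Y")

# Reorder the factor levels chronologically, keeping the original labels
x_ordered <- factor(x, levels = unique(as.character(x)[order(as.numeric(ym))]))
levels(x_ordered)

# Separate year and month numbers, if they are needed
years  <- as.integer(ym)
months <- cycle(ym)
```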
A base R way:
# Some sample data
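# stringsAsFactors = TRUE keeps period a factor (needed for levels() below in R >= 4.0)
df <- data.frame(period=sample(c("April-2017","February-2017"),10, replace = TRUE), stringsAsFactors = TRUE)
# turn a single "Month-Year" label into "Year-MM", which sorts chronologically
nicep <- function(x) {
  months <- c('January','February','March','April','May','June','July','August','September','October','November','December')
  l <- strsplit(x, '-')
  return(sprintf("%s-%02d",l[[1]][2], which(months == l[[1]][1])))
}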
# change levels for a nice name
levels(df$period) <- unlist(lapply(as.character(levels(df$period)), FUN=nicep))
I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,124),
Date=c("12/27/2016 15:16","12/27/2016 15:20","12/27/2016 15:24","12/27/2016 15:28","12/27/2016 15:28","12/27/2016 15:42","12/28/2016 7:22","12/28/2016 7:26","12/28/2016 7:35","12/28/2016 11:02","12/28/2016 11:02","12/28/2016 11:28"),
OldValue=c("","Open","In Progress","Open","System Declined","In Progress","System Declined","Open","In Progress","Open","Complete","In Progress"),
NewValue=c("Open","In Progress","System Declined","In Progress","Open","System Declined","Open","In Progress","Complete","In Progress","Open","Complete"))
The data is already ordered by Project, then Date.
However, if there are two rows with the same Date (such as rows 4,5 and 10,11) I want to designate the order based on OldValue. So I'd like row 5 ahead of row 4, and row 11 ahead of row 10.
How can I go about doing this?
#Assign Desired order to the OldValue, CHANGE "y" IF NECESSARY
OldValue_order = data.frame(OldValue = c("","Open","In Progress","System Declined","Complete"), y = c(0,4,2,1,3))
# We'll need lookup command to copy desired order to the "Data"
library(qdapTools)
Data$OV_order = lookup(Data$OldValue, OldValue_order) # Adds new column to "Data"
# Arrange the data.frame in desired order
Data = Data[with(Data, order(Project, as.POSIXct(Date, format = "%m/%d/%Y %H:%M"), OV_order)),]
#Remove the added column
Data = Data[1:4]
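The same tie-break can also be done in base R with match() against a vector listing the values in the desired order, which avoids the extra package and the helper column. This is a sketch on a three-row, made-up excerpt of the data; value_order mirrors the ranking used above and should be adjusted if a different priority is wanted.

```r
# Three-row excerpt in the style of the question's data
Data <- data.frame(
  Project  = c(123, 123, 123),
  Date     = c("12/27/2016 15:24", "12/27/2016 15:28", "12/27/2016 15:28"),
  OldValue = c("In Progress", "Open", "System Declined"),
  stringsAsFactors = FALSE
)

# Desired tie-break order for OldValue, lowest rank first
value_order <- c("", "System Declined", "In Progress", "Complete", "Open")

# Sort by Project, then by parsed Date, then by position in value_order
idx <- order(Data$Project,
             as.POSIXct(Data$Date, format = "%m/%d/%Y %H:%M", tz = "UTC"),
             match(Data$OldValue, value_order))
Data <- Data[idx, ]
Data
```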