I have a large matrix full of times in character format like this
a <- as.matrix(c("18:12:30", "6:15:30", "12:31:40"))
b <- as.matrix(c("1:50:30", "9:50:32", "5:30:43"))
c <- as.matrix(c("7:54:23", "22:45:34", "12:54:23"))
mat <- cbind(a,b,c)
I would like to convert each of the values to a time format. I know I could do it column by column using
a <- strptime(a, "%H:%M:%OS")
b <- strptime(b, "%H:%M:%OS")
c <- strptime(c, "%H:%M:%OS")
But I have a large matrix, so I'm looking for a function that could do this even if I have many more columns and rows.
Beware of how your time and date data is stored. strptime converts to POSIXlt, which always includes a date, so strptime inserts today's date if you don't specify one. That can create huge reproducibility problems.
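For example (a minimal sketch; the date you see depends on the day you run it):
strptime("18:12:30", "%H:%M:%OS")
# returns a POSIXlt value with today's date silently attached, e.g. "2024-05-01 18:12:30"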
Instead, you need a package with a suitable time-only data structure. chron has a nice simple one. To recreate your data as a data.frame of times (a matrix can only hold a single atomic type, so it can't store times objects):
library(chron)
# lapply chron over the columns of your data; collect in data.frame
time_mat <- do.call(data.frame, lapply(list(a, b, c), function(x){chron(times. = x)}))
# make the names prettier
names(time_mat) <- c('a', 'b', 'c')
which gives you
> time_mat
a b c
1 18:12:30 01:50:30 07:54:23
2 06:15:30 09:50:32 22:45:34
3 12:31:40 05:30:43 12:54:23
with a class of times, which will be consistent in any usage.
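Since the columns are genuine times objects, ordering and summary operations should stay within the class (a quick sketch using the time_mat built above):
sort(time_mat$b) # comparisons work: 01:50:30 05:30:43 09:50:32
range(time_mat$a) # min and max: 06:15:30 18:12:30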
Related
I want to use sapply (or something similar) to convert certain columns to POSIXct in an R data.frame but maintain the datetime format of the columns. When I do it currently, it converts the format to numeric. How can I do this? An example is below.
#sample dataframe
df <- data.frame(
var1=c(5, 2),
char1=c('he', 'she'),
timestamp1=c('2019-01-01 20:30:08', '2019-01-02 08:27:34'),
timestamp2=c('2019-01-01 12:24:54', '2019-01-02 10:57:47'),
stringsAsFactors = F
)
#Convert only columns with 'timestamp' in name to POSIXct
df[grep('timestamp', names(df))] <- sapply(df[grep('timestamp', names(df))], function(x) as.POSIXct(x, format='%Y-%m-%d %H:%M:%S'))
df
var1 char1 timestamp1 timestamp2
1 5 he 1546392608 1546363494
2 2 she 1546435654 1546444667
Note: I can use as.POSIXlt instead of as.POSIXct and it works, but I want the data in POSIXct format. I also tried converting to POSIXlt first and then to POSIXct, but that also ended up converting the columns to numeric.
Use lapply rather than sapply. The "s" in sapply stands for simplify: it tries to collapse the result into a matrix, but a matrix cannot hold POSIXct values, so the result gets cast to a plain numeric matrix. If you keep the result as a list, the class is preserved.
df[grep('timestamp', names(df))] <- lapply(df[grep('timestamp', names(df))], function(x) as.POSIXct(x, format='%Y-%m-%d %H:%M:%S'))
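To see the mechanism in isolation, here is a minimal sketch with made-up values (not the question's data):
x <- as.POSIXct(c("2019-01-01 20:30:08", "2019-01-02 08:27:34"))
sapply(list(ts = x), identity) # simplified to a numeric matrix; the POSIXct class is lost
lapply(list(ts = x), identity) # stays a list; the element keeps its POSIXct class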
You could also do this fairly easily with dplyr:
library(dplyr)
df %>% mutate_at(vars(contains("timestamp")), as.POSIXct)
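In dplyr 1.0 and later, mutate_at() is superseded by across(); an equivalent call would be:
df %>% mutate(across(contains("timestamp"), as.POSIXct))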
I am working with a CSV file that has a column named "statistics_lastLocatedTime", as shown in the csv file image.
I would like to subtract the second row of "statistics_lastLocatedTime" from the first row, the third row from the second, and so on through the last row, then store all these differences in a separate column and combine that column with the other related columns, as shown in the code below:
##select related features
data <- read.csv("D:/smart tech/store/2016-10-11.csv")
(columns <- data[with(data, macAddress == "7c:11:be:ce:df:1d" ),
c(2,10,11,38,39,48,50) ])
write.csv(columns, file = "updated.csv", row.names = FALSE)
## take time difference
date_data <- read.csv("D:/R/data/updated.csv")
(dates <- date_data[1:40, c(2)])
NROW(dates)
for (i in 1:NROW(dates)) {
j <- i+1
r1 <- strptime(paste(dates[i]),"%Y-%m-%d %H:%M:%S")
r2 <- strptime(paste(dates[j]),"%Y-%m-%d %H:%M:%S")
diff <- as.numeric(difftime(r1,r2))
print (diff)
}
## combine time difference with other related columns
combine <- cbind(columns, diff)
combine
Now the problem is that I am able to compute the differences between rows but not able to store these values as a column and combine that column with the other related columns. Please help me. Thanks in advance.
This is a four-liner:
Define a custom class 'myDate', and a converter function for your custom datetime, as per Specify custom Date format for colClasses argument in read.table/read.csv
Read in the datetimes as actual datetimes; no need to repeatedly convert later.
Simply use the vectorized diff operator on your date column (it inspects the column's class and automatically dispatches the diff method for POSIXct dates). No need for for-loops:
setClass('myDate') # this is not strictly necessary
setAs('character','myDate', function(from) {
as.POSIXct(from, format='%d-%m-%y %H:%M', tz='UTC') # or whatever timezone; adjust the format to match your file
})
data <- read.csv("D:/smart tech/store/2016-10-11.csv",
colClasses=c('character','myDate','myDate','numeric','numeric','integer','factor'))
# ...
data$date_diff <- c(NA, diff(data$statistics_lastLocatedTime))
Note that diff() produces a result one element shorter than the vector being diffed, so we have to pad it (e.g. with a leading NA, or whatever you prefer).
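A quick illustration of the padding, with hypothetical timestamps (not the OP's file):
x <- as.POSIXct(c("2016-10-11 10:00:00", "2016-10-11 10:05:00", "2016-10-11 10:12:00"))
diff(x) # length 2: time differences of 5 and 7 mins
c(NA, diff(x)) # length 3, so it fits alongside the original column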
Consider directly assigning the diff variable using vapply. Also, there is no need for the separate date_data df, as all operations can be run on the columns df. Notice too the change in time format to align with the format currently in the data frame:
columns$diff <- vapply(seq(nrow(columns)), function(i){
  r1 <- strptime(paste(columns$statistics_lastLocatedTime[i]), "%d-%m-%y %H:%M")
  r2 <- strptime(paste(columns$statistics_lastLocatedTime[i+1]), "%d-%m-%y %H:%M")
  as.numeric(difftime(r1, r2, units = "secs"))  # fix the units so every element is comparable; the last element is NA
}, numeric(1))
I have a data.table with a few customers, a day value, and a pay_day value.
pay_day is a vector of length 5 for each customer and consists of day values.
I want to check each day value against the pay_day vector to see whether the day is part of pay_day.
Here is some dummy data (pardon the messy way of creating it; I could not think of a better way at the moment):
customers <- c("179288" ,"146506" ,"202287","16207","152979","14421","41395","199103","183467","151902")
mdays <- 1:31
set.seed(1)
data <- sort(rep(customers,100))
days <- sample(mdays,1000,replace=T)
xyz <- cbind(data,days)
x <- vector(length=1000L)
j <- 1
for( i in 1:10){
set.seed(i) ## I wanted diff dates to be picked
m <- sample(mdays,5)
while(j <=100*i){
x[j] <- paste(m,collapse = ",")
j <- j+1
}
}
xyz <- cbind(xyz,x)
require(data.table)
my_data <- setDT(as.data.frame(xyz))
setnames(my_data, c("cust","days","pay_days"))
my_data[,pay:=runif(1000,min = 0,max=10000)]
Now I want, for each cust, the vector of pays that fall on pay_days.
I have tried various ways but can't seem to figure it out. My initial thought is to create a flag based on whether days is a subset of pay_days and then take the pays according to that flag:
my_data[,ifelse(grepl(days,pay_days),1,0),cust]
This does not work as I expect it to. I don't want to use a native loop, as the actual data is huge.
Using tidyr to split the pay_days column into separate rows and then checking whether days matches pay_days:
library(tidyr)
library(dplyr)
# creating long-form data
tidier <- my_data %>%
mutate(pay_days = strsplit(as.character(pay_days), ",")) %>%
unnest(pay_days)
setDT(tidier) # unnest returns a plain data frame / tibble, so restore the data.table class for the := steps below
# casting as numeric to make factor & character columns comparable
tidier[, days := as.numeric(days)]
tidier[, pay_days := as.numeric(pay_days)]
tidier[days == pay_days, pay, by=cust]
Not sure how this performs for large data, as you multiply your table length by the number of days in pay_days...
Side note: I can't comment yet, but to replicate your data one needs library(data.table) loaded and x initialized (x <- vector()), which is otherwise not found, as Dee also points out.
Another one-liner approach using data.table:
my_data[,result:=sum(unlist(lapply(strsplit(as.character(pay_days),","),match,days)),na.rm=T)>0,by=1:nrow(my_data)]
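Either way, the flag can then be used to collect the pays per customer; a sketch building on the result column created above:
my_data[(result), .(pays = list(pay)), by = cust]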
I have 2 relatively large data frames in R. I'm attempting to merge / find all combos, as efficiently as possible. The resulting df turns out to be huge (the length is dim(myDF1)[1]*dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to other solutions, such as the bigmemory package, to work around these memory issues. I have virtually no experience with either of these packages.
Working example - assume I'm working with some data frame that looks similar to USArrests:
library('ff')
library('ffbase')
myNames <- USArrests
myNames$States <- rownames(myNames)
rownames(myNames) <- NULL
Now, I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their rownames later.
myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))
# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)
For example, my first set of states in myDF1 is myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey:
# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)
startTime <- Sys.time()
# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])
# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)
# Sort one DF (to later combine with the other):
myVector2 <- ffsorted(myVector2)
# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)
1) Is there a faster way to sort this?
# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)
# Very fast:
print(Sys.time() - startTime)
2) Is there an alternative to this type of combination (without using RAM)?
Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:
# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this
# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]
# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)
3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to most efficiently accomplish this?
Thanks!
It's not entirely clear to me what you intend to do with the rowSum in your point 3), but if you want an efficient and RAM-friendly way to get all combinations of 2 ff vectors, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions 160 million rows x 2 columns in a few seconds.
require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
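You can sanity-check the result without pulling it into RAM; a quick sketch:
dim(x) # nrow(myDF1) * nrow(myDF2) rows by 2 columns
x[1:6, ] # peek at the first few key combinations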
I'm trying to sum a set of POSIXct objects by a factor variable, but am getting an error that sum is not defined for POSIXt objects. However, it works fine if I just calculate the mean. But how can I get the summed times by group using tapply?
Example:
data <- data.frame(time = c("2:50:04", "1:24:10", "3:10:43", "1:44:26", "2:10:19", "3:01:04"),
group = c("A","A","A","B","B","B"))
data$group <- as.factor(data$group)
data$time <- as.POSIXct(paste("1970-01-01", data$time), format="%Y-%m-%d %H:%M:%S", tz="GMT")
# works
tapply(data$time, data$group, mean)
# doesn't work
tapply(data$time, data$group, sum)
Date objects cannot be summed; semantically this makes no sense, and the + operator is not defined between two POSIXct objects.
You probably want to model time differences and sum those instead.
Try:
times <- as.difftime(c("2:50:04", "1:24:10", "3:10:43",
"1:44:26", "2:10:19", "3:01:04"), "%H:%M:%S")
sum(times)
A difftime object is also what you get when you subtract two date objects (which is semantically reasonable).
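For example:
t1 <- as.POSIXct("1970-01-01 03:10:43", tz = "GMT")
t2 <- as.POSIXct("1970-01-01 01:24:10", tz = "GMT")
t1 - t2 # Time difference of ~1.78 hours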
EDIT:
An entire solution to the OP's problem in a semantically neater way (tapply seems to destroy the structure of the difftime class; use group_by from the dplyr package instead):
library(dplyr)
times <- as.difftime(c("2:50:04", "1:24:10", "3:10:43",
"1:44:26", "2:10:19", "3:01:04"), format="%H:%M:%S")
data <- data.frame(time = times, group = c("A","A","A","B","B","B"))
summarise(group_by(data, group), sum(time))
This gives the following output:
Source: local data frame [2 x 2]
group sum(time)
1 A 7.415833 hours
2 B 6.930278 hours
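If you would rather display the group sums as HH:MM:SS, one option is to convert the summed difftime to seconds and format it yourself (a sketch using the times vector defined above):
total <- sum(times[1:3]) # group A
s <- as.numeric(total, units = "secs")
sprintf("%02d:%02d:%02d", s %/% 3600, (s %% 3600) %/% 60, round(s %% 60))
# "07:24:57"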