subtract pairs of rows for long dataset - r

I would like to find a way to subtract, in a long dataset, both the time and meters values every two rows (there are two measures per day), and then create a new table that stores those values
(03:21-09:37 and 3.2-0.9, and so on). Is there a function that can do this automatically, and how do I set it up? I am completely new to R and need to figure these things out with R alone.
time <- c("03:21","09:37","15:41","21:46","03:54","10:12")
day <- c(1,1,1,1,2,2)
meters <- c(3.2,0.9,3.2,0.9,3.2,0.9)
df <- data.frame(day,time,meters)
day time meters
1 1 03:21 3.2
2 1 09:37 0.9
3 1 15:41 3.2
4 1 21:46 0.9
5 2 03:54 3.2
6 2 10:12 0.9

Here are a couple of options that come to mind:
Option 1: Subset with TRUE and FALSE to calculate the difference:
Time <- strptime(df$time, format="%H:%M")
TimeD <- Time[c(TRUE, FALSE)] - Time[c(FALSE, TRUE)]
MetersD <- df$meters[c(TRUE, FALSE)] - df$meters[c(FALSE, TRUE)]
cbind(meters = MetersD, time = TimeD)
# meters time
# [1,] 2.3 -6.266667
# [2,] 2.3 -6.083333
# [3,] 2.3 -6.300000
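The c(TRUE, FALSE) trick works because logical indices are recycled to the length of the vector, so one subset keeps the odd rows and the other the even rows. A minimal illustration:
x <- 1:6
x[c(TRUE, FALSE)]  # odd positions: 1 3 5
x[c(FALSE, TRUE)]  # even positions: 2 4 6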
Option 2: Use %/% to create a grouping variable and use aggregate
df$pairs <- c(0, 1:(nrow(df)-1) %/% 2)
df$time2 <- strptime(df$time, format="%H:%M")
aggregate(list(meters = df$meters, time = df$time2),
          by = list(pairs = df$pairs), FUN = function(y) diff(rev(y)))
# pairs meters time
# 1 0 2.3 -6.266667
# 2 1 2.3 -6.083333
# 3 2 2.3 -6.300000
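As an aside, the grouping variable can be built a little more directly, since integer division of the zero-based row index by 2 already yields the pair IDs:
(seq_len(nrow(df)) - 1) %/% 2
# [1] 0 0 1 1 2 2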
Update
It's not too difficult to extend the idea to get your "day" column back too:
with(df, {
  time <- strptime(time, format="%H:%M")
  time <- time[c(TRUE, FALSE)] - time[c(FALSE, TRUE)]
  meters <- meters[c(TRUE, FALSE)] - meters[c(FALSE, TRUE)]
  day <- day[c(TRUE, FALSE)]
  data.frame(day, time, meters)
})
# day time meters
# 1 1 -6.266667 hours 2.3
# 2 1 -6.083333 hours 2.3
# 3 2 -6.300000 hours 2.3

Using diff
# Create a proper date
df$date <- strptime(paste(df$day,df$time),format="%d %H:%M")
new_df <- data.frame(
  diff_meters = abs(diff(df$meters)),
  diff_time = diff(df$date))
new_df
diff_meters diff_time
1 2.3 6.266667 hours
2 2.3 6.066667 hours
3 2.3 6.083333 hours
4 2.3 6.133333 hours
5 2.3 6.300000 hours
It's pretty easy to get every other row, if that's what you're actually looking for (it's not really clear from the question or your comment):
new_df[seq(1,nrow(new_df),2),]
diff_meters diff_time
1 2.3 6.266667 hours
3 2.3 6.083333 hours
5 2.3 6.300000 hours
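If the dataset grows large, the same pairing idea translates to a grouped summary. Here is a sketch using data.table (my own addition, not part of the answers above); note the difference is taken in row order, so the sign is the opposite of Option 1:
library(data.table)
dt <- as.data.table(df)
dt[, pair := (seq_len(.N) - 1) %/% 2]  # pair IDs: 0 0 1 1 2 2
dt[, .(day    = day[1],
       hours  = as.numeric(diff(as.POSIXct(time, format = "%H:%M")), units = "hours"),
       meters = diff(meters)),
   by = pair]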

Related

data.table applying function by row in R

I have a function that takes arguments, and I want to apply it over each row in a data.table.
My data looks as follows:
Row Temp Humidity Elevation
1 10 0.5 1000
2 25 1.5 2000
3 28 2.0 1500
and I have a function
myfunc <- function(x, n_features=3){
  # Here x represents each row of a data table.
  # Feature names are important for me, as my actual function operates on feature names.
  return(x[,Temp] + x[,Humidity] + (x[,Elevation] * (n_features)))
}
What I want my output to look like is
Row Temp Humidity Elevation myfuncout
1 10 0.5 1000 3010.5
2 25 1.5 2000 6026.5
3 28 2.0 1500 4530
I have tried df[, myfuncout := myfunc(x, n_features=3), by=.I] but this didn't work.
Also, I'm not sure if I have to use .SD here to make this work...
Any inputs here on how I can achieve this?
Thanks!
If it is a data.table, we can use
myfunc <- function(dt, n_features = 3) {
  dt[, out := (Temp + Humidity) + (Elevation * n_features)]
  dt
}
myfunc(df, 3)
myfunc(df, 3)
Output:
df
# Row Temp Humidity Elevation out
#1: 1 10 0.5 1000 3010.5
#2: 2 25 1.5 2000 6026.5
#3: 3 28 2.0 1500 4530.0
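If the function genuinely has to see one row at a time (for instance because it assembles feature names dynamically), a common fallback is to group by row number and pass each one-row subset in via .SD. A sketch of that idea (myfunc_row is a hypothetical helper), with the caveat that it is far slower than the vectorised version above:
myfunc_row <- function(row, n_features = 3) {
  # row arrives as a one-row data.table, so features are addressable by name
  row$Temp + row$Humidity + row$Elevation * n_features
}
df[, myfuncout := myfunc_row(.SD), by = seq_len(nrow(df))]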

Assigning elements of one vector to elements of another with R

I would like to assign elements of one vector to elements of another for every single user.
For example:
Within a data frame with the variables "user", "activities" and "minutes" (see below), I would like to assign the duration of user 1's first activity (4 minutes, "READ") to a new variable READ_duration, the duration of the second activity (5 minutes, "EDIT") to a new variable EDIT_duration, and the duration of the third activity (2 minutes, again "READ") to READ_duration as well.
user <- c(1, 2, 3)
activities <- list(c("READ","EDIT","READ"), c("READ","EDIT","WRITE"), c("WRITE","EDIT"))
minutes <- list(c(4, 5, 2), c(3.5, 1, 2), c(4.5, 3))
The output should be a data frame with the minutes assigned to the activities:
user READ_duration EDIT_duration WRITE_duration
1 6 5 0
2 3.5 1 2
3 0 3 4.5
The tricky thing here is that the algorithm needs to account for the activities not being in the same order for every user. For example, user 3 starts with writing, so the duration 4.5 needs to be assigned to column 4, WRITE_duration.
Also, a loop (or similar) would be needed because of the massive number of users.
Thank you so much for your help!!
This needs a simple reshape to wide format with sum as an aggregation function.
Prepare a long-format data.frame:
user <- c(1,2,3)
activities <- list(c("READ","EDIT","READ"), c("READ","EDIT", "WRITE"), c("WRITE","EDIT"))
minutes <- list(c(4,5,2), c(3.5, 1, 2), c(4.5,3))
DF <- Map(data.frame, user = user, activities = activities, minutes = minutes)
DF <- do.call(rbind, DF)
# user activities minutes
#1 1 READ 4.0
#2 1 EDIT 5.0
#3 1 READ 2.0
#4 2 READ 3.5
#5 2 EDIT 1.0
#6 2 WRITE 2.0
#7 3 WRITE 4.5
#8 3 EDIT 3.0
Reshape:
library(reshape2)
dcast(DF, user ~ activities, value.var = "minutes", fun.aggregate = sum)
# user EDIT READ WRITE
#1 1 5 6.0 0.0
#2 2 1 3.5 2.0
#3 3 3 0.0 4.5
In base R you could do:
xtabs(min~ind+values, cbind(stack(setNames(activities, user)), min = unlist(minutes)))
values
ind EDIT READ WRITE
1 5.0 6.0 0.0
2 1.0 3.5 2.0
3 3.0 0.0 4.5
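For completeness, if you already work with the tidyverse, the same wide reshape can be sketched with tidyr (assuming tidyr >= 1.1 for values_fn and values_fill):
library(tidyr)
pivot_wider(DF, names_from = activities, values_from = minutes,
            values_fn = sum, values_fill = 0)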

time step from 0.1 second to half hour (30 minutes) with UTC decimal hours

I have data at an interval of 0.1 second, i.e. 10 lines per second, so 864000 lines for one day (24*60*60*10).
I want to find the mean of columns (wind speed and other variables not shown here) by aggregating the data from a 0.1 second time step to half an hour, so it will go from 864000 lines to 48 lines (for one day).
Input:
tms Hr Min Sec Wind speed
7/13/2014 0:00 0 0 0 3.45
7/13/2014 0:00 0 0 0.1 52.34
7/13/2014 0:00 0 0 0.2 1.23
7/13/2014 0:00 0 0 0.3 4.3
7/13/2014 0:00 0 0 0.4 1.34
7/13/2014 0:00 0 0 0.5 3.6
Output I want to see:
Year Month Day Hr Wind speed
7/13/2014 7 13 0 21.92
7/13/2014 7 13 0.5 29.38
7/13/2014 7 13 1 24.18
7/13/2014 7 13 1.5 1.70
7/13/2014 7 13 2 1.80
Below is my code for the hourly mean, which I want to change so it aggregates by half hour (not one hour). Here dat is the data without the tms column, so I add a date column:
library(data.table)
library(xts)
dat <- data.table(dat)
tms <- as.POSIXct(seq(0,24*(60*60*10)-1,by=1),origin="2014-07-13",tz="UTC")
xts.ts <- data.frame(xts(dat,tms))
Now I add the tms column to my data:
Aut <- data.frame(tms,xts.ts, check.names=FALSE, row.names=NULL)
mean2 <- aggregate(Aut,
                   list(hour = cut(as.POSIXct(Aut$tms), "hour")),
                   mean)
But this is not correct even for the hourly case. I want the mean of my data by half hour. Any suggestions?
As I mentioned in my comment, you can do this easily with xts::period.apply:
library(xts)
options(digits.secs = 1) # display fractional seconds
# create 1 day of timestamps that are 0.1 seconds apart
tms <- as.POSIXct(seq(0, 86400-1, by=0.1), origin="2014-07-13", tz="UTC")
# create an xts object with some random data and the times created above
set.seed(21)
xts.ts <- xts(runif(length(tms), 0, 50), tms)
# use period.apply and endpoints to calculate the 30-minute means
mean30min <- period.apply(xts.ts, endpoints(xts.ts, "mins", 30), mean)
# round up to next 30-minute period
mean30min <- align.time(mean30min, 60*30)
If you want the result to be a data.table or data.frame with the additional columns added, you can do that easily after aggregating.
library(data.table)
dt.mean30 <- as.data.table(mean30min)
dt.mean30[, Month := .indexmon(mean30min) + 1]
dt.mean30[, Day := .indexmday(mean30min)]
dt.mean30[, Hr := .indexhour(mean30min) + .indexmin(mean30min)/60]
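If you would rather stay close to your original aggregate call, cut.POSIXt also accepts multiples of a unit, so a base R sketch (assuming the wind-speed column in Aut is named Wind.speed) could be:
Aut$halfhour <- cut(as.POSIXct(Aut$tms), breaks = "30 mins")
mean30 <- aggregate(Aut["Wind.speed"], by = list(halfhour = Aut$halfhour), FUN = mean)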

R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
. A B C
A 0 0.5 3
B 0.5 0 2.5
C 3 2.5 0
This is what I have come up with in R:
temp_data  # my data frame with the two columns: location and temperature
temp_dist <- matrix(data = NA, nrow = nrow(temp_data), ncol = nrow(temp_data))
temp_dist <- as.data.frame(temp_dist)
names(temp_dist) <- as.factor(temp_data[, 1])  # the locations are numbers in my data
rownames(temp_dist) <- as.factor(temp_data[, 1])
for (i in 1:2244) {
  for (j in 1:2244) {
    temp_dist[i, j] <- abs(temp_data[i, 2] - temp_data[j, 2])
  }
}
I have tried the code with a small sample using for (i in 1:10) and it works fine.
My problem is that the computer has now been running for two full days and hasn't finished.
I was wondering if there is a way of doing this more quickly. I am aware that nested loops take a long time, and since I am filling a matrix of more than 5 million cells it makes sense that it is slow, but I am hoping there is a function that gets the same result faster, as I have to do the same for precipitation and other variables.
I have also read about dist, but I am unsure whether I can use it with the data frame I have.
I would very much appreciate your collaboration.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
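For a sense of scale: dist and outer run in vectorised compiled code, so even 2244 locations (about 5 million cells) should finish almost instantly. A rough sanity check with made-up temperatures:
set.seed(1)
big <- data.frame(plot = seq_len(2244), temperature = runif(2244, 0, 30))
system.time(m_big <- as.matrix(dist(big$temperature)))
# typically a fraction of a second, versus days for the double for loop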

R: running corr across two matrices by column

I'm tracking how much my cats are pooping, and trying to figure out if that's correlated with how much they're eating.
So if I have the following data:
food <- cbind(fluffy=c(0.9,1.1,1.3,0.7),misterCuddles=c(0.5,1.2,1.4,0.5))
poop <- cbind(fluffy=c(0.9,1.1,1.3,0.7),misterCuddles=c(-0.5,-1.2,-1.4,-0.5))
dates <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")
rownames(food) <- dates
rownames(poop) <- dates
library(abind)  # needed to bind the two matrices into a 3-d array
cube <- abind(food, poop, along=3)
Notes for the curious:
amounts are in deci-pennies: 1.1 means the poop weighs about as much as 11 pennies
negative poop amounts demonstrate that mister cuddles is part unicorn
This gives me the following:
> cube
, , food
fluffy misterCuddles
2013-01-01 0.9 0.5
2013-01-02 1.1 1.2
2013-01-03 1.3 1.4
2013-01-04 0.7 0.5
, , poop
fluffy misterCuddles
2013-01-01 0.9 -0.5
2013-01-02 1.1 -1.2
2013-01-03 1.3 -1.4
2013-01-04 0.7 -0.5
Now if I want to find the correlation for mister cuddles to demonstrate his magic:
> corr(cube[,"misterCuddles",])
[1] -1
What I'd like is a named vector with the correlation number for each cat:
> c(fluffy=1.0,misterCuddles=-1.0)
fluffy misterCuddles
1 -1
Is there a way I can do this in one shot, ideally in parallel? In reality, I have buttloads of cats.
Thanks!
EDIT
Can it be as simple as...
> result <- simplify2array(mclapply(colnames(food), function(x) corr(cube[,x,])))
> names(result) <- colnames(food)
> result
fluffy misterCuddles
1 -1
library(boot) # for corr
sapply(dimnames(cube)[[2]], function(x) corr(cube[ , x, ]))
# fluffy misterCuddles
# 1 -1
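Since you asked about parallelism: the same loop can be handed to parallel::mclapply (a sketch; mclapply forks, so it only parallelises on Unix-alikes, and for a cheap corr call per cat the forking overhead may outweigh any gain):
library(parallel)
library(boot)  # for corr
cats <- dimnames(cube)[[2]]
result <- simplify2array(mclapply(cats, function(x) corr(cube[, x, ]), mc.cores = 2))
names(result) <- cats
result
# fluffy misterCuddles
# 1 -1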
