I'm working on a database of sensors; below is a subset of my data.
I would like to calculate the duration of each 'ON' event for all my sensors,
knowing that the duration of an 'ON' event is the difference between an ON and the first OFF that follows it.
For example, for sensor 'capteur1' in the table, I should get 41 seconds, 30 seconds, 25 seconds, etc.
Thanks
Short code - give it a try:
# important - time must be a date-time (use strptime or as.POSIXct with a format, for example)
# subset only the ON and OFF timestamps
ON <- df$time[df$capteur1 %in% "ON"]
OFF <- df$time[df$capteur1 %in% "OFF"]
# create temporary elements to append to in the for-loop
temp <- NULL # temp stands for temporary
name_temp <- NULL
# loop over the ON elements by index (looping over the values directly
# would strip the POSIXct class from them)
for (i in seq_along(ON)) {
  a <- difftime(OFF, ON[i], units = "secs") # difftime of the i-th ON vs all OFFs
  a <- a[!a < 0] # drop negative differences: those OFFs came before this ON
  # append only the first remaining difference - that is the next OFF
  temp <- c(temp, a[1])
  name_temp <- c(name_temp, as.character(ON[i]))
}
# name each duration with its ON timestamp
names(temp) <- name_temp
# show temp
temp
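If the event log gets long, a vectorized sketch of the same idea is possible (a hedged alternative, assuming df$time is POSIXct and the OFF timestamps are in increasing order):
ON  <- df$time[df$capteur1 == "ON"]
OFF <- df$time[df$capteur1 == "OFF"]
idx <- findInterval(ON, OFF) + 1 # index of the first OFF strictly after each ON
# ONs with no following OFF get idx > length(OFF) and thus an NA duration
durations <- as.numeric(difftime(OFF[idx], ON, units = "secs"))
names(durations) <- as.character(ON)
durations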
Hope this helps
Sebastian
I have a table of ranges (start, stop), which looks something like this:
ID    start    stop
x1    351525   352525
x2    136790   136990
x3    74539    74739
x4    478181   478381
...   ...      ...
I also have a vector of positions.
The data can be simulated with:
s <- round(runif(50, 0, 500000), 0)
# ranges:
# (+200 is arbitrary; the difference may be more or less than that, but stop is always higher than start)
# built as a data.frame so that start/stop stay numeric (cbind would coerce everything to character)
ranges <- data.frame(ID = paste0("x", 1:50), start = s, stop = s + 200)
# positions
pos <- round(runif(5000, 0, 500000), 0)
I want to select all IDs which have at least one position within their range.
I could loop through ranges and pos:
library(dplyr)
selected.IDs <- c()
for (r in 1:nrow(ranges)) {
  for (p in 1:length(pos)) {
    if (between(pos[p], left = ranges[r, 2], right = ranges[r, 3])) {
      selected.IDs <- append(selected.IDs, ranges[r, 1])
      break # one position inside the range is enough
    }
  }
}
That works fine (I think).
However, the real 'ranges' object has 83,000 rows and there are 180,000 positions, so looping through all of them takes a long time.
Does anyone have an idea how to do this without a loop?
Thanks
I usually do this using overlap joins with data.table::foverlaps.
library(data.table)
s <- round(runif(50, 0, 500000), 0)
# ranges:
# (+200 is arbitrary; the difference may be more or less than that, but stop is always higher than start)
ranges <- data.table(ID = paste0("x", 1:50), start = s, stop = s + 200)
# positions, represented as zero-width intervals [pos, pos] so that a
# position matches exactly when start <= pos <= stop
pos <- round(runif(5000, 0, 500000), 0)
pos <- data.table(start = pos, stop = pos)
setkey(pos, start, stop)
setkey(ranges, start, stop)
res <- foverlaps(ranges, pos, nomatch = 0)
selected.IDs <- unique(res$ID) # a range can match several positions
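An equivalent sketch with a data.table non-equi join (a hedged alternative, assuming pos is still the plain numeric vector from the simulation, before the interval trick above):
library(data.table)
positions <- data.table(p = pos)
hits <- positions[ranges, on = .(p >= start, p <= stop), nomatch = 0, .(ID)]
selected.IDs <- unique(hits$ID)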
My dataset is called data and it has a column called time that contains times in mm:ss format. I also wrote a function functime(var1, var2).
Ultimately I would like to use apply or vapply with var2 set to a constant (let's say var2 = 6) and var1 taking each value of the column data$time.
Something like this: if
data$time <- c("10:10", "11:00", "09:30"), then when I call vapply() I would like to get
data$output <- c(functime(data$time[1], 6), functime(data$time[2], 6), functime(data$time[3], 6))
which in this example is the same as
data$output <- c(functime("10:10", 6), functime("11:00", 6), functime("09:30", 6))
My lame attempt at that is something like:
vapply(data$time, functime, var2 = 6, FUN.VALUE = 1)
The documentation for vapply says it should be vapply(x, fun, fun.value, ...).
I am confused about how to tell vapply that it should take every row of the data$time column as its first argument, with a fixed second argument that I define as 6.
Ultimately I would like to add data$output to the original dataset using a mutate.
Edit (including sample data and the function):
data$id <- c(9,6,5763,4)
data$time <- c("5:06","5:06","5:11","5:08")
data$city <-c("Kyle","Oklahoma","Monterey","Austin")
The function is:
calctime <- function(racePace, raceDistance){
# racePace is the per unit pace in mm:ss - character
# raceDistance is the total race distance - numeric
# Pace and race distance must use same units (km or mi or whatever)
# Seconds to character time function
CharMinSec <- function(sec){
outMin <- floor(sec/60)
outSec <- ((sec/60)-outMin)*60
if(outSec==0 | round(outSec)<10){
outChar <- paste0(outMin,":0",round(outSec))
} else {
outChar <- paste(outMin,round(outSec),sep=":")
}
outChar
}
paceMinSec <- as.numeric(strsplit(racePace,':')[[1]])
paceSec <- paceMinSec[1]*60+ paceMinSec[2]
raceMin <- floor(paceSec*raceDistance/60)
raceSec <- ((paceSec*raceDistance/60)-raceMin)*60
raceTime <- CharMinSec(raceMin*60+raceSec)
list(Seconds=raceSec)
}
# Example of 4:15/km for a half-marathon
calctime("4:15",21.097494)
calcTime <- function(pace, distance){
  lubridate::period_to_seconds(lubridate::ms(pace)) * distance
}
pace <- c("10:10", "11:00", "09:30")
vapply(pace, calcTime, 6, FUN.VALUE = 1)
## 10:10 11:00 09:30
##  3660  3960  3420
d <- tibble::as_tibble(list(pace = pace))
dplyr::mutate(d, raceSeconds = calcTime(pace, 6))
## # A tibble: 3 x 2
##   pace  raceSeconds
##   <chr>       <dbl>
## 1 10:10        3660
## 2 11:00        3960
## 3 09:30        3420
I had to change two things, but your vapply call was right.
In the function, I changed the last line so it returns a plain value instead of a list with one element:
calctime <- function(racePace, raceDistance){
# racePace is the per unit pace in mm:ss - character
# raceDistance is the total race distance - numeric
# Pace and race distance must use same units (km or mi or whatever)
# Seconds to character time function
CharMinSec <- function(sec){
outMin <- floor(sec/60)
outSec <- ((sec/60)-outMin)*60
if(outSec==0 | round(outSec)<10){
outChar <- paste0(outMin,":0",round(outSec))
} else {
outChar <- paste(outMin,round(outSec),sep=":")
}
outChar
}
paceMinSec <- as.numeric(strsplit(racePace,':')[[1]])
paceSec <- paceMinSec[1]*60+ paceMinSec[2]
raceMin <- floor(paceSec*raceDistance/60)
raceSec <- ((paceSec*raceDistance/60)-raceMin)*60
raceTime <- CharMinSec(raceMin*60+raceSec)
raceSec
}
Now that the function returns a plain value, the vapply() works; in my case I also had to force the time column to be character, since data.frame had converted it to a factor:
data = data.frame(
id = c(9,6,5763,4),
time = c("5:06","5:06","5:11","5:08"),
city = c("Kyle","Oklahoma","Monterey","Austin")
)
data$time = as.character(data$time)
data$output = vapply(data$time,calctime,raceDistance = 6,FUN.VALUE=1) #works fine
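Since the goal was to add the column with a mutate, a hedged equivalent using the corrected function (assuming the data frame above):
library(dplyr)
data <- mutate(data, output = vapply(time, calctime, raceDistance = 6, FUN.VALUE = numeric(1)))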
I have made an algorithm in R to combine multiple sensor readings together under one timestamp.
Most sensor readings are taken every 500 ms, but some sensors only report changes, so I had to make an algorithm that takes the last known value of a sensor at any given time.
The algorithm works, but it is so slow that running it on the actual 20+ sensors would take ages. My hypothesis is that it is slow because of my use of dataframes, or because of the way I access and move my data.
I have tried to make it faster by walking through every dataframe only once instead of iterating over them for every timestamp, and I have preallocated all the space needed for the data.
Any suggestions would be very welcome. I am very new to the R language, so I don't really know which datatypes are slow and which are fast.
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval){
if (typeof(dataframes) == "list"){
# get the biggest and smallest timestamp across all dataframes;
# this determines the size of the result: (largest time - smallest time)/timeinterval rows
# (with a timeinterval of 1000 ms that means one value every second)
largest_time <- 0
smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
for (i in 1:length(dataframes)){
dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
if (dataframe_max > largest_time) largest_time <- dataframe_max
if (dataframe_min < smallest_time) smallest_time <- dataframe_min
}
# the result dataframe will have this many rows
result.size <- floor((largest_time - smallest_time)/timeinterval)
message(sprintf("Result size: %d", result.size))
# current row index into every dataframe, all starting at 1
dataframe_indexes <- rep(1, length(dataframes))
# data vectors for the result
result.timestamps <- numeric(result.size)
result <- list(result.timestamps)
for (i in 2:(length(dataframes)+1)) result[[i]] <- numeric(result.size) # add an empty vector for every datapoint
# use progressbar
pb <- txtProgressBar(1, result.size, style = 3)
# run through every row of the resulting data frame (creating one row per iteration);
# each iteration, advance every dataframe's index until its timestamp exceeds the
# current row's timestamp, then step one index back
for (i in 1:result.size){
current_timestamp <- smallest_time + timeinterval*(i-1)
result[[1]][i] <- current_timestamp
for (i2 in 1:length(dataframes)){
while (dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] < current_timestamp && dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] != max(dataframes[[i2]]$TIMESTAMP)){
dataframe_indexes[i2] <- dataframe_indexes[i2]+1
}
if (dataframe_indexes[i2] > 1){
dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
}
result[[i2+1]][i] <- dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
}
setTxtProgressBar(pb, i)
}
close(pb)
result.final <- data.frame(result)
return(result.final)
} else {
return(NA)
}
}
I fixed it today by changing every dataframe to a matrix. The code now runs in 9.5 seconds instead of 70 minutes.
Conclusion: repeatedly indexing into dataframes inside tight loops is VERY bad for performance; plain matrices avoid that overhead.
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
library(compiler)
# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval){
time.start <- Sys.time()
if (typeof(dataframes) == "list"){
# store the sizes of all the dataframes
resources.largest_size <- 0
resources.sizes <- numeric(length(dataframes))
for (i in 1:length(dataframes)){
resources.sizes[i] <- length(dataframes[[i]]$VALUE)
if (resources.sizes[i] > resources.largest_size) resources.largest_size <- resources.sizes[i]
}
# generate a matrix that can hold all needed dataframe values
resources <- matrix(nrow = resources.largest_size, ncol = length(dataframes)*2)
for (i in 1:length(dataframes)){
j <- i*2
resources[1:resources.sizes[i],j-1] <- dataframes[[i]]$TIMESTAMP
resources[1:resources.sizes[i],j] <- dataframes[[i]]$VALUE
}
# get the biggest and smallest timestamp across all dataframes;
# this determines the size of the result: (largest time - smallest time)/timeinterval rows
# (with a timeinterval of 1000 ms that means one value every second)
largest_time <- 0
smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
for (i in 1:length(dataframes)){
dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
if (dataframe_max > largest_time) largest_time <- dataframe_max
if (dataframe_min < smallest_time) smallest_time <- dataframe_min
}
# the result dataframe will have this many rows
result.size <- floor((largest_time - smallest_time)/timeinterval)
message(sprintf("Result size: %d", result.size))
# current row index into every sensor's columns, all starting at 1
dataframe_indexes <- rep(1, length(dataframes))
# data matrix for the result
result <- matrix(data = 0, nrow = result.size, ncol = length(dataframes)+1)
# use progressbar
pb <- txtProgressBar(1, result.size, style = 3)
# run through every row of the result (creating one row per iteration);
# each iteration, advance every sensor's index until its timestamp exceeds the
# current row's timestamp, then step one index back
for (i in 1:result.size){
current_timestamp <- smallest_time + timeinterval*(i-1)
result[i,1] <- current_timestamp
for (i2 in 1:length(dataframes)){
j <- i2*2
# advance while this sensor has more readings and the current one is still before the grid timestamp
while (dataframe_indexes[i2] < resources.sizes[i2] && resources[dataframe_indexes[i2],j-1] < current_timestamp){
dataframe_indexes[i2] <- dataframe_indexes[i2]+1
}
# at the moment the last value of the array is never selected, needs to be fixed
if (dataframe_indexes[i2] > 1){
dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
}
result[i,i2+1] <- resources[dataframe_indexes[i2], j] #dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
}
setTxtProgressBar(pb, i)
}
close(pb)
result.final <- data.frame(result)
time.end <- Sys.time()
print(time.end-time.start)
return(result.final)
} else {
return(NA)
}
}
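For comparison, a vectorized sketch of the same resampling (a hedged suggestion of mine, not the code above): for each sensor, a single findInterval() call finds the last reading at or before every grid timestamp, which removes the inner while-loop entirely.
generalise_data_vec <- function(dataframes, timeinterval) {
  stopifnot(is.list(dataframes))
  all_ts <- unlist(lapply(dataframes, function(d) d$TIMESTAMP))
  grid <- seq(min(all_ts), max(all_ts), by = timeinterval)
  cols <- lapply(dataframes, function(d) {
    o <- order(d$TIMESTAMP)
    # index of the last reading at or before each grid timestamp (0 if none yet)
    idx <- findInterval(grid, d$TIMESTAMP[o])
    # grid points before a sensor's first reading get its first value here;
    # NA might be more appropriate, depending on the use case
    d$VALUE[o][pmax(idx, 1L)]
  })
  names(cols) <- paste0("sensor", seq_along(cols))
  data.frame(TIMESTAMP = grid, cols)
}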
I am new to R but trying desperately to learn the ropes. In fact I feel a little stupid asking this question, as I have gone through a number of similar problems but have not been able to get the desired results. My code is shown below:
## Initializing Parameters
fstart <- 960 ## Start frequency in MHz
fstop <- 1240 ## Stop Frequency In MHz
bw <- 5.44 ## IF Bandwidth in MHz
offset <- 100 ## Max. Variation in TOD in milliseconds
f_dwell <- 1 ## Time spent on each search frequency in millisecond
iterations <- 100 ## No. of iterations to run
## No. of possible frequencies
f <- seq((fstart + bw/2), (fstop - bw/2), by=bw)
## Initializing the frequency table
freq_table <- matrix(NA, nrow=(2*offset + 1), ncol=offset)
## Fill frequency table row wise with random values of possible frequencies
for (i in 1:(2*offset + 1)){
row_value <- c(sample(f), sample(f, offset-length(f)))
freq_table[i, ] <- row_value
}
## Assign a row from freq_table to unknown node
unknown_node <- freq_table[sample(1:(2*offset + 1), 1), ]
t = numeric(iterations)
## Calculate number of repetitions of frequencies
for(k in 1:iterations){
for(j in 1:offset){
y <- (sort(table(freq_table[, j]), decreasing=TRUE))
x <- as.vector(y) ## Number of repetitions of each frequency
y <- names(y)
## Search Frequencies
sf1 <- as.numeric(y[1])
sf2 <- as.numeric(y[2])
if (unknown_node[j] == sf1){
t[k] <- ((j-1)*f_dwell)*2 + f_dwell
break
}
else {
if (unknown_node[j] == sf2){
t[k] <- ((j-1)*f_dwell)*2 + 2*f_dwell
break
}
}
## Delete rows from freq_table that have sf1 & sf2
freq_table <- subset(freq_table, freq_table[, 1]!=sf1 & freq_table[, 1]!=sf2 )
}
}
print(t)
If I run this without the k for loop, I get different values of the variable t every time. However, I wanted to run the inner for loop repeatedly and get a vector of t values, one for each run of the inner loop. I do get a t of length 100, but the values repeat: the first few (2, 3, or sometimes 4) are different, and the rest keep repeating. I can't figure out why.
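One likely cause (my reading, not stated in the original post): freq_table is overwritten by subset() inside the loop and never restored, so after a few values of k every iteration searches the same shrunken table and finds the same answer. A sketch that works on a fresh copy each iteration, with the inner logic otherwise unchanged:
for (k in 1:iterations){
  ft <- freq_table # fresh working copy for every iteration
  for (j in 1:offset){
    y <- sort(table(ft[, j]), decreasing=TRUE)
    sf1 <- as.numeric(names(y)[1]) # most common search frequency in column j
    sf2 <- as.numeric(names(y)[2]) # second most common
    if (unknown_node[j] == sf1){
      t[k] <- ((j-1)*f_dwell)*2 + f_dwell
      break
    } else if (unknown_node[j] == sf2){
      t[k] <- ((j-1)*f_dwell)*2 + 2*f_dwell
      break
    }
    ## Delete rows from the working copy only
    ft <- subset(ft, ft[, 1]!=sf1 & ft[, 1]!=sf2)
  }
}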
I have a problem with calculating the doubling time of cancer growth in R. The data contains multiple scans of the same patients taken over 5 years; some patients were scanned multiple times in a year. I want to calculate the doubling time of the mass of nodes for all patients, between the first scan and the last scan.
I have calculated the doubling time of a node for the last patient, but I need to get the doubling time for all patients.
The code I have used:
Nod <- read.table("NoData270513.txt", header = TRUE)
Nod$CoNo <- 10*Nod$StNo + Nod$LeNo
length(Nod$CoNo); length(unique(Nod$CoNo))
Nod$CoNo <- factor(Nod$CoNo)
Nod$CTDato <- as.Date(Nod$CTDato)
NodTyp1 <- rep(NA, length(unique(Nod$CoNo)))
vdt <- rep(NA, 1216) # allocate once, outside the loop, otherwise it is reset on every pass
i <- 0; i1 <- 0; i2 <- 0
for (j in unique(Nod$CoNo)) {
  temp <- Nod[Nod$CoNo == j, ]
  i <- i + 1; i1 <- i2 + 1; i2 <- i2 + length(temp$CoNo)
  if (length(temp$Age) > 1) {
    # doubling time = elapsed days * log(2) / log(diameter ratio)
    vdt[i] <- (as.numeric(temp$CTDato[length(temp$Age)]) - as.numeric(temp$CTDato[1])) * log(2)/log(temp$SDia[length(temp$Age)]/temp$SDia[1])
  }
}
If I got it right, the only thing you need is a function that takes a data filename and returns what you want; then just iterate through all the data files.
This seems to be the pattern:
# declare function for one patient
calculate.doub.time <- function(filename){
Nod <- read.table(filename , header = T)
# ...
# ...
# return what you want
}
# calculate all data files
all.data <- list.files() # assuming your working directory contains all data
result <- sapply(all.data, calculate.doub.time)
Sorry in advance if I misunderstood what you want to achieve.
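If all the patients are in one file rather than one file each, a group-wise sketch with by() might be closer to the goal (a hedged suggestion, reusing the CTDato and SDia columns from the question):
# doubling time per node, first scan vs last scan
vdt_by_node <- by(Nod, Nod$CoNo, function(temp) {
  if (nrow(temp) < 2) return(NA_real_)
  temp <- temp[order(temp$CTDato), ] # first scan first, last scan last
  days <- as.numeric(temp$CTDato[nrow(temp)] - temp$CTDato[1])
  days * log(2) / log(temp$SDia[nrow(temp)] / temp$SDia[1])
})
vdt <- unlist(vdt_by_node)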