I have written an algorithm in R that combines multiple sensor readings under one timestamp.
Most sensor readings are taken every 500 ms, but some sensors only report changes. I therefore had to write an algorithm that takes the last known value of a sensor at any given time.
The algorithm works, but it is so slow that running it on the actual 20+ sensors would take ages to complete. My hypothesis is that the slowness comes from my use of data frames, or from the way I access and move my data.
I have already tried to speed it up by walking through every data frame only once rather than iterating over them for every timestamp, and by preallocating all the space needed for the data.
Any suggestions would be very welcome. I am very new to R, so I don't really know which data types are slow and which are fast.
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
# take a list of data frames as a parameter
generalise_data <- function(dataframes, timeinterval){
  if (typeof(dataframes) == "list"){
    # get the largest and smallest timestamp across all data frames; this
    # determines the size of the result:
    # (largest time - smallest time) / timeinterval = result rows
    largest_time <- 0
    smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
    for (i in 1:length(dataframes)){
      dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
      dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
      if (dataframe_max > largest_time) largest_time <- dataframe_max
      if (dataframe_min < smallest_time) smallest_time <- dataframe_min
    }
    # the result data frame will have this many rows
    result.size <- floor((largest_time - smallest_time)/timeinterval)
    message(sprintf("Result size: %i", result.size))
    # one index per data frame, all starting at 1
    dataframe_indexes <- rep(1, length(dataframes))
    # preallocated data vectors for the result
    result <- list(numeric(result.size)) # column 1: timestamps
    for (i in 2:(length(dataframes)+1)) result[[i]] <- numeric(result.size) # one column per sensor
    pb <- txtProgressBar(1, result.size, style = 3)
    # walk through every row of the result; for each row, advance each data
    # frame's index until its timestamp passes the row's timestamp, then step
    # one index back to take the last known value
    for (i in 1:result.size){
      current_timestamp <- smallest_time + timeinterval*(i-1)
      result[[1]][i] <- current_timestamp
      for (i2 in 1:length(dataframes)){
        while (dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] < current_timestamp &&
               dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] != max(dataframes[[i2]]$TIMESTAMP)){
          dataframe_indexes[i2] <- dataframe_indexes[i2]+1
        }
        if (dataframe_indexes[i2] > 1){
          dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
        }
        result[[i2+1]][i] <- dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
      }
      setTxtProgressBar(pb, i)
    }
    close(pb)
    return(data.frame(result))
  } else {
    return(NA)
  }
}
I fixed it today by changing every data frame to a matrix. The code now runs in 9.5 seconds instead of 70 minutes.
Conclusion: element-by-element access to data frames is VERY bad for performance. Inside a tight loop, every $ and [ lookup on a data frame carries overhead that plain matrix indexing doesn't.
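To see the difference for yourself, here is a quick illustration (timings are machine-dependent, so treat the exact numbers as indicative only) of the same element-wise access pattern on a data frame and on a matrix:
n <- 1e5
df <- data.frame(TIMESTAMP = seq_len(n), VALUE = rnorm(n))
m  <- as.matrix(df)
system.time(for (i in seq_len(n)) df[i, "VALUE"]) # data frame row indexing: slow
system.time(for (i in seq_len(n)) m[i, "VALUE"])  # matrix indexing: orders of magnitude faster
The rewritten function: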
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
library(compiler)
# take a list of data frames as a parameter
generalise_data <- function(dataframes, timeinterval){
  time.start <- Sys.time()
  if (typeof(dataframes) == "list"){
    # store the sizes of all the data frames
    resources.largest_size <- 0
    resources.sizes <- numeric(length(dataframes))
    for (i in 1:length(dataframes)){
      resources.sizes[i] <- length(dataframes[[i]]$VALUE)
      if (resources.sizes[i] > resources.largest_size) resources.largest_size <- resources.sizes[i]
    }
    # copy everything into one matrix, two columns per sensor
    # (odd column = TIMESTAMP, even column = VALUE)
    resources <- matrix(nrow = resources.largest_size, ncol = length(dataframes)*2)
    for (i in 1:length(dataframes)){
      j <- i*2
      resources[1:resources.sizes[i], j-1] <- dataframes[[i]]$TIMESTAMP
      resources[1:resources.sizes[i], j]   <- dataframes[[i]]$VALUE
    }
    # get the largest and smallest timestamp across all data frames;
    # (largest time - smallest time) / timeinterval = result rows
    largest_time <- 0
    smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
    for (i in 1:length(dataframes)){
      dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
      dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
      if (dataframe_max > largest_time) largest_time <- dataframe_max
      if (dataframe_min < smallest_time) smallest_time <- dataframe_min
    }
    # the result will have this many rows
    result.size <- floor((largest_time - smallest_time)/timeinterval)
    message(sprintf("Result size: %i", result.size))
    # one index per data frame, all starting at 1
    dataframe_indexes <- rep(1, length(dataframes))
    # preallocated result matrix: column 1 = timestamp, then one column per sensor
    result <- matrix(data = 0, nrow = result.size, ncol = length(dataframes)+1)
    pb <- txtProgressBar(1, result.size, style = 3)
    # walk through every row of the result; for each row, advance each sensor's
    # index until its timestamp passes the row's timestamp, then step one back
    for (i in 1:result.size){
      current_timestamp <- smallest_time + timeinterval*(i-1)
      result[i, 1] <- current_timestamp
      for (i2 in 1:length(dataframes)){
        j <- i2*2
        while (resources[dataframe_indexes[i2], j-1] < current_timestamp &&
               dataframe_indexes[i2] < resources.sizes[i2]){ # stop at this sensor's last row
          dataframe_indexes[i2] <- dataframe_indexes[i2]+1
        }
        # at the moment the last value of the array is never selected, needs to be fixed
        if (dataframe_indexes[i2] > 1){
          dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
        }
        result[i, i2+1] <- resources[dataframe_indexes[i2], j]
      }
      setTxtProgressBar(pb, i)
    }
    close(pb)
    result.final <- data.frame(result)
    time.end <- Sys.time()
    print(time.end - time.start)
    return(result.final)
  } else {
    return(NA)
  }
}
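If it ever needs to be faster still, the inner while loop can be dropped entirely: base R's findInterval returns, for every result timestamp, the index of the last reading at or before it, which is exactly the last-known-value lookup, done one sensor at a time. A sketch (not benchmarked against the version above, and assuming each data frame is sorted by TIMESTAMP):
# sketch: vectorised last-known-value lookup with findInterval
generalise_data_fast <- function(dataframes, timeinterval) {
  smallest_time <- min(sapply(dataframes, function(d) min(d$TIMESTAMP)))
  largest_time  <- max(sapply(dataframes, function(d) max(d$TIMESTAMP)))
  timestamps <- seq(smallest_time, largest_time, by = timeinterval)
  values <- lapply(dataframes, function(d) {
    # index of the last reading at or before each result timestamp (0 = none yet)
    idx <- findInterval(timestamps, d$TIMESTAMP)
    d$VALUE[pmax(idx, 1)] # clamp to 1 so gaps before the first reading reuse it
  })
  names(values) <- paste0("SENSOR", seq_along(values))
  data.frame(TIMESTAMP = timestamps, values)
}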
Related
I have written the following very simple while loop in R.
i <- 1
while (i <= 5) {
  print(10 * i)
  i <- i + 1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try this (if you want a while loop):
df1 <- c()
i <- 1
while (i <= 5) {
  print(10 * i)
  df1 <- c(df1, 10 * i)
  i <- i + 1
}
as.data.frame(df1)
df1
1 10
2 20
3 30
4 40
5 50
Or
df1 <- data.frame()
i <- 1
while (i <= 5) {
  df1[i, 1] <- 10 * i
  i <- i + 1
}
df1
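One caveat on both variants: growing an object with c() or by indexing past its end copies it on every iteration, which gets slow for long loops. Preallocating the full length avoids that, for example:
df1 <- numeric(5) # preallocate the full length up front
i <- 1
while (i <= 5) {
  df1[i] <- 10 * i
  i <- i + 1
}
as.data.frame(df1)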
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
  result <- 10 * i
  print(result)
  dat$y[i] <- result
  i <- i + 1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out tasks like this faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
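For example, the same fill-by-row loop written as a for loop:
n <- 5
dat <- data.frame(x = 1:n)
dat$y <- NA
for (i in 1:n) {
  dat$y[i] <- 10 * i
}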
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful, and using them well is an essential part of programming in R.
I'm working on a database of sensors
This is a subset of my data
I would like to calculate the duration of each 'ON' event for all my sensors.
Knowing that the duration of an 'ON' event equals the time difference between that ON and the first OFF that follows it.
For example, for the sensor 'capteur1' in the table, I should get 41 seconds, 30 seconds, 25 seconds, etc.
Thanks,
short code - give it a try:
# important - time must be a time-element (strptime or use format for example)
# subset only ON and OFF
ON <- df$time[df$capteur1 %in% "ON"]
OFF <- df$time[df$capteur1 %in% "OFF"]
# Create tempoary element to append to in for-loop
temp <- NULL # temp stands for temporary
name_temp <- NULL
# Loop over the ON elements
for (i in ON) {
  a <- difftime(OFF, i, units = "sec") # difference of this ON vs. all OFFs
  a <- a[!a < 0] # drop negative values, where the OFF came before this ON
  # keep only the first positive difference - that is the next OFF
  temp <- c(temp, a[1])
  name_temp <- c(name_temp, as.character(i))
}
# Give names to elements
names(temp) <- name_temp
# show temp
temp
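If ON and OFF are sorted in time, the same pairing can also be done without a loop: findInterval counts, for each ON, how many OFFs come before it, so the next entry in OFF is the first OFF after that ON. A sketch (it assumes every ON has a later OFF; unmatched trailing ONs would come out as NA):
idx <- findInterval(as.numeric(ON), as.numeric(OFF)) + 1
durations <- difftime(OFF[idx], ON, units = "sec")
names(durations) <- as.character(ON)
durations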
Hope this helps
Sebastian
I want to write the results of a for loop I have designed to a matrix or a data frame at each step (and later turn that matrix or data frame into a CSV file).
My code is as follows
I want the matrix/dataframe to have 3 columns
for (i in 1:100) {
  for (j in 2:100) {
    if (i < j) {
      temp <- pairwiseAlignment(protein_dat[i,], protein_dat[j,],
                                substitutionMatrix = BLOSUM62, type = "local")
      ### save i value into column 1 [or Seq1 in the data frame I made below]
      ### save j value into column 2 [or Seq2]
      ### save temp@score into column 3 [or Score]
      ### go to the next row of the data frame to save the next result
    }
  }
}
How can I do this?
EDIT #1
I have made a data frame as follows:
df <- data.frame(Seq1 = as.numeric(),
                 Seq2 = as.numeric(),
                 Score = as.numeric(),
                 stringsAsFactors = FALSE)
EDIT #2
Is this the right way to do it?
for (i in 1:100) {
  for (j in 2:100) {
    if (i < j) {
      t <- pairwiseAlignment(protein_dat[i,], protein_dat[j,],
                             substitutionMatrix = BLOSUM62, type = "local")
      df <- rbind(df, c(i, j, t@score))
    }
  }
}
Assuming temp stores a "scalar" (vector of length one) numeric value, try this:
# preallocate matrix
dataMat <- matrix(0, (100 * 100 / 2), 3)
dataRow <- 0
for (i in 1:100) {
  for (j in 2:100) {
    if (i < j) {
      # increment data row
      dataRow <- dataRow + 1
      temp <- pairwiseAlignment(protein_dat[i,], protein_dat[j,],
                                substitutionMatrix = BLOSUM62, type = "local")
      dataMat[dataRow, ] <- c(i, j, temp)
    }
  }
}
If the values are accessed differently, you could make these adjustments fairly easily. After you are done, you can convert dataMat into a data.frame:
myDataFrame <- data.frame(dataMat)
# give it some names
names(myDataFrame) <- c("iVal", "jVal", "tempVal")
EDIT: Thanks to @42- for a nice code replacement.
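For illustration only (this is not necessarily the replacement referred to above), the i < j pairs can also be enumerated up front with combn, which removes the nested loops and the if check:
pairs <- t(combn(100, 2)) # 4950 rows of (i, j) with i < j
dataMat <- matrix(0, nrow(pairs), 3)
for (k in seq_len(nrow(pairs))) {
  i <- pairs[k, 1]; j <- pairs[k, 2]
  temp <- pairwiseAlignment(protein_dat[i,], protein_dat[j,],
                            substitutionMatrix = BLOSUM62, type = "local")
  dataMat[k, ] <- c(i, j, temp@score)
}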
I have a problem with calculating the doubling time for cancer growth in R. The data contains multiple scans of the same patient taken over 5 years. There are, however, cases where a patient has been scanned multiple times in a year. I want to calculate the doubling time of the node mass for all patients, between the first scan and the last scan.
I have calculated the doubling time of a node of the last patient, but I need to get the doubling time for all patients.
The code I have used:
Nod <- read.table("NoData270513.txt", header = T)
Nod$CoNo <- 10*Nod$StNo + Nod$LeNo
length(Nod$CoNo); length(unique(Nod$CoNo))
Nod$CoNo <- factor(Nod$CoNo)
Nod$CTDato <- as.Date(Nod$CTDato)
NodTyp1 <- rep(NA, length(unique(Nod$CoNo)))
vdt <- rep(NA, 1216) # preallocate outside the loop so results are kept
i <- 0; i1 <- 0; i2 <- 0
for (j in unique(Nod$CoNo)) {
  temp <- Nod[Nod$CoNo == j, ]
  i <- i + 1; i1 <- i2 + 1; i2 <- i2 + length(temp$CoNo)
  if (length(temp$Age) > 1) {
    vdt[j] <- (as.numeric(temp$CTDato[length(temp$Age)]) - as.numeric(temp$CTDato[1])) *
      log(2) / log(temp$SDia[length(temp$Age)] / temp$SDia[1])
  }
}
If I got it right, the only thing you need is a function that takes a data filename and returns what you need. Then just iterate through all the data files.
It seems this will be the pattern:
# declare function for one patient
calculate.doub.time <- function(filename){
  Nod <- read.table(filename, header = T)
  # ...
  # ...
  # return what you want
}
# calculate all data files
all.data <- list.files() # assuming your working directory contains all data
result <- sapply(all.data, calculate.doub.time)
Sorry in advance if I misunderstood what you want to achieve.
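If, instead, all patients sit in one file and are distinguished by the CoNo grouping (as in your code), a sketch of the same first-vs-last-scan formula applied per group with split, assuming each group's rows are ordered by scan date:
# sketch: doubling time per patient/node group, first vs. last scan
doubling_time <- function(d) {
  if (nrow(d) < 2) return(NA) # need at least two scans
  days <- as.numeric(d$CTDato[nrow(d)]) - as.numeric(d$CTDato[1])
  days * log(2) / log(d$SDia[nrow(d)] / d$SDia[1])
}
vdt <- sapply(split(Nod, Nod$CoNo), doubling_time)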
I currently have the following code, which produces the desired results (Data_Index and Data_Percentages):
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # proportion of all areas' population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for (u in 1:Vars) {
    # for every column value in the selected row, calculate the percentage
    # of that value relative to the row's total population (TotalCount)
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # the index score is that percentage divided by the proportion of the
    # total population in this area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # bind results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # bind the completed row onto the previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem is that the code is very slow. I want to use it on a dataset with 200,000 rows and 200 columns, which with the present code would take around 4 days. I am sure there must be a way of speeding this process up, but I am not sure how.
What the code does is take (in this example) a table of population counts, divided into age bands and areas, and turn it into percentages and index scores. Currently there are two nested loops, so every value in every row and column is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly. Are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
This replaces your entire code. Neither the for loop nor apply is necessary: the division can be applied to a whole matrix at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for loop
oa_per <- total_count / total_sum * 100
# divide every row by its row total; ncol(df) replaces the hard-coded column count
Data_Per_Row <- df / matrix(rep(total_count, each = ncol(df)), ncol = ncol(df), byrow = TRUE) * 100
Data_Index_Row <- Data_Per_Row / oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop and use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x / tcs * 100,
               tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop as well. (And I would appreciate some comments in your code.)
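For instance, a sketch of the "u" loop vectorized with sweep, which divides each row of a table by the matching element of a vector:
VarPer   <- sweep(Input_Data, 1, TotalCount[[1]], "/") * 100 # each value as a percent of its row total
VarIndex <- sweep(VarPer, 1, OAPer, "/") * 100               # index score against the area's population share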