I have a problem calculating the doubling time for cancer growth in R. The data contain multiple scans of the same patient taken over 5 years, and in some cases a patient has been scanned several times in one year. I want to calculate the doubling time of the mass of nodes for all patients, between the first scan and the last scan.
I have calculated the doubling time of a node for the last patient, but I need to get the doubling time for all patients.
The code I have used:
Nod <- read.table("NoData270513.txt", header = TRUE)
Nod$CoNo <- 10 * Nod$StNo + Nod$LeNo
length(Nod$CoNo); length(unique(Nod$CoNo))
Nod$CoNo <- factor(Nod$CoNo)
Nod$CTDato <- as.Date(Nod$CTDato)

NodTyp1 <- rep(NA, length(unique(Nod$CoNo)))
i <- 0; i1 <- 0; i2 <- 0

for (j in unique(Nod$CoNo)) {
  temp <- Nod[Nod$CoNo == j, ]
  i <- i + 1; i1 <- i2 + 1; i2 <- i2 + length(temp$CoNo)
  NodTyp1[1:20]
  vdt <- rep(NA, 1216)
  if (length(temp$Age) > 1) {
    vdt[j] <- (as.numeric(temp$CTDato[length(temp$Age)]) - as.numeric(temp$CTDato[1])) *
      log(2) / log(temp$SDia[length(temp$Age)] / temp$SDia[1])
  }
}
If I understood correctly, all you need is to create a function that takes a data file name and returns what you want, then iterate over all data files.
The pattern would be something like this:
# declare function for one patient
calculate.doub.time <- function(filename){
  Nod <- read.table(filename, header = TRUE)
  # ...
  # ...
  # return what you want
}

# calculate all data files
all.data <- list.files() # assuming your working directory contains all data
result <- sapply(all.data, calculate.doub.time)
Sorry in advance if I misunderstood what you want to achieve.
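If instead all patients and nodes live in a single file, as in the question, a per-group version of the same idea could use split(). The sketch below is only an illustration: it reuses the column names from the question (StNo, LeNo, CTDato, SDia) and assumes SDia is the diameter used in the doubling-time formula.
# Hypothetical sketch: doubling time per node when everything is in one file
Nod <- read.table("NoData270513.txt", header = TRUE)
Nod$CoNo <- factor(10 * Nod$StNo + Nod$LeNo)
Nod$CTDato <- as.Date(Nod$CTDato)

doubling_time <- function(d) {
  if (nrow(d) < 2) return(NA_real_)               # need at least two scans
  days <- as.numeric(d$CTDato[nrow(d)] - d$CTDato[1])
  days * log(2) / log(d$SDia[nrow(d)] / d$SDia[1])
}

vdt <- sapply(split(Nod, Nod$CoNo), doubling_time)  # one value per node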
I am new to R, so I apologize if this is a basic question. I have a data frame with 12k days of river-height data. I've made a filter for when the river is > 28 ft, but now I would like to break this filtered data frame into a list of periods when the river is > 28 ft for x consecutive days. For instance, if the river is > 28 ft for 20 days, I would like those days grouped together.
library(dplyr)
RawData <- read.csv("c:/Users/Anthony/Desktop/R/CSVRiverData.csv")
RiverData <- cbind(data.frame(as.Date(RawData$Row.Labels, format = "%m/%d/%Y")),
                   RawData$Average.of.height)
colnames(RiverData) <- c("Date", "RiverHeight")
Filt_River_Data <- filter(RiverData, RiverData$RiverHeight > 28)
Date_Diff <- data.frame(Filt_River_Data$Date - lag(Filt_River_Data$Date, 1L))
Here is my brute-force approach: first get the number of consecutive days above the threshold, then group the qualifying runs. There may be a more elegant solution out there, but this is a simple way to do it with loops (a vectorized sketch follows at the end).
# get some reasonable pseudo data
library(forecast)
rd <- rnorm(1010,mean=28,sd=10)
rd <- forecast::ma(rd,order=10)
rd <- rd[!(is.na(rd))]
temp.dd <- seq.Date(as.Date("2000-01-01"),by=1,length.out=length(rd))
RiverData <- data.frame(temp.dd,rd)
colnames(RiverData) <- c('Date','RiverHeight')
# add vector to df and calculate number of consecutive days
## assume you want to calculate # consecutive days based on the raw data, not the pre-filtered data
RiverData$numConDays <- rep(NA,nrow(RiverData))
count = 0
h_thresh <- 28
for (i in 1:nrow(RiverData)) {
  if (RiverData$RiverHeight[i] >= h_thresh) {
    count = count + 1
  } else {
    count = 0
  }
  RiverData$numConDays[i] <- count
}
# now you can filter as you wish
## the first condition of > h_thresh shouldn't be necessary
RiverDataSubset <- RiverData[(RiverData$RiverHeight > h_thresh & RiverData$numConDays > 20),]
head(RiverDataSubset)
# get the grouping for each set
## assumes daily data
RiverDataSubset$group <- rep(NA,nrow(RiverDataSubset))
gg <- 1
RiverDataSubset$group[1] <- gg
for (i in 2:nrow(RiverDataSubset)) {
  if (as.numeric(difftime(RiverDataSubset$Date[i], RiverDataSubset$Date[i-1])) != 1) {
    gg <- gg + 1
  }
  RiverDataSubset$group[i] <- gg
}
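For reference, here is a vectorized sketch of the same two steps using rle() and cumsum(). It assumes strictly daily data, as in the pseudo data above, and is meant as an alternative rather than a drop-in replacement.
# Vectorized sketch (assumes one row per day)
above <- RiverData$RiverHeight >= h_thresh
runs  <- rle(above)
RiverData$numConDays <- sequence(runs$lengths) * above   # 0 whenever below threshold

RiverDataSubset <- RiverData[RiverData$RiverHeight > h_thresh & RiverData$numConDays > 20, ]
# group id increments whenever consecutive kept rows are more than one day apart
RiverDataSubset$group <- cumsum(c(1, diff(RiverDataSubset$Date) != 1))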
I have written an algorithm in R to combine multiple sensor readings under one timestamp.
Most sensor readings are taken every 500 ms, but some sensors only report changes, so the algorithm has to take the last known value of a sensor at a given time.
The algorithm works, but it is so slow that running it on the actual 20+ sensors would take ages. My hypothesis is that it is slow because of my use of data frames, or because of the way I access and move my data.
I have tried to make it faster by walking through every data frame only once instead of iterating over them for every timestamp, and I have preallocated all the space needed for the data.
Any suggestions would be very welcome. I am very new to R, so I don't really know which data types are slow and which are fast.
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval){
if (typeof(dataframes) == "list"){
# get the biggest and smallest datetime stamp from every dataframe
# this will be used to calculate the size of the resulting frame ((largest time - smallest time)/1000 = dataframe rows)
# this means one value every second
largest_time <- 0
smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
for (i in 1:length(dataframes)){
dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
if (dataframe_max > largest_time) largest_time <- dataframe_max
if (dataframe_min < smallest_time) smallest_time <- dataframe_min
}
# result dataframe will have ... rows
result.size <- floor((largest_time - smallest_time)/timeinterval)
sprintf("Result size: %i", result.size)
# create a numeric array that contains the indexes of every dataframe, all set to 1
dataframe_indexes <- numeric(length(dataframes))
dataframe_indexes[dataframe_indexes == 0] <- 1
# data vectors for the dataframe
result.timestamps <- numeric(result.size)
result <- list(result.timestamps)
for (i in 2:(length(dataframes)+1)) result[[i]] <- numeric(result.size) # add an empty vector for every datapoint
# use progressbar
pb <- txtProgressBar(1, result.size, style = 3)
# loop over every row of the resulting data frame (building one row per pass)
# on every pass, advance each dataframe's index until its timestamp exceeds the current row's timestamp, then step one index back
#for (i in 1:200){
for (i in 1:result.size){
current_timestamp <- smallest_time + timeinterval*(i-1)
result[[1]][i] <- current_timestamp
for (i2 in 1:length(dataframes)){
while (dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] < current_timestamp && dataframes[[i2]]$TIMESTAMP[dataframe_indexes[i2]] != max(dataframes[[i2]]$TIMESTAMP)){
dataframe_indexes[i2] <- dataframe_indexes[i2]+1
}
if (dataframe_indexes[i2] > 1){
dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
}
result[[i2+1]][i] <- dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
}
setTxtProgressBar(pb, i)
}
close(pb)
result.final <- data.frame(result)
return(result.final)
} else {
return(NA)
}
}
I fixed it today by changing every data frame to a matrix: the code ran in 9.5 seconds instead of 70 minutes.
Conclusion: element-by-element access to data frames inside tight loops is VERY bad for performance; matrices are far faster for this access pattern.
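A minimal sketch that illustrates the per-element indexing cost difference (timings are machine-dependent, and this is made-up data rather than the original sensor readings):
# Rough illustration: per-element access via [.data.frame vs. matrix indexing
df <- data.frame(TIMESTAMP = 1:1e5, VALUE = runif(1e5))
m  <- as.matrix(df)

system.time(for (i in 1:1e5) x <- df[i, "VALUE"])  # data frame single-element access
system.time(for (i in 1:1e5) x <- m[i, "VALUE"])   # matrix single-element access
The full rewritten function, using one matrix to hold all sensor timestamps and values, follows.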
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(dplyr)
library(pracma)
library(compiler)
# take a list of dataframes as a parameter
generalise_data <- function(dataframes, timeinterval){
time.start <- Sys.time()
if (typeof(dataframes) == "list"){
# store the sizes of all the dataframes
resources.largest_size <- 0
resources.sizes <- numeric(length(dataframes))
for (i in 1:length(dataframes)){
resources.sizes[i] <- length(dataframes[[i]]$VALUE)
if (resources.sizes[i] > resources.largest_size) resources.largest_size <- resources.sizes[i]
}
# generate a matrix that can hold all needed dataframe values
resources <- matrix(nrow = resources.largest_size, ncol = length(dataframes)*2)
for (i in 1:length(dataframes)){
j <- i*2
resources[1:resources.sizes[i],j-1] <- dataframes[[i]]$TIMESTAMP
resources[1:resources.sizes[i],j] <- dataframes[[i]]$VALUE
}
# get the biggest and smallest datetime stamp from every dataframe
# this will be used to calculate the size of the resulting frame ((largest time - smallest time)/1000 = dataframe rows)
# this means one value every second
largest_time <- 0
smallest_time <- as.numeric(Sys.time())*1000 # everything will be smaller than the current time
for (i in 1:length(dataframes)){
dataframe_max <- max(dataframes[[i]]$TIMESTAMP)
dataframe_min <- min(dataframes[[i]]$TIMESTAMP)
if (dataframe_max > largest_time) largest_time <- dataframe_max
if (dataframe_min < smallest_time) smallest_time <- dataframe_min
}
# result dataframe will have ... rows
result.size <- floor((largest_time - smallest_time)/timeinterval)
sprintf("Result size: %i", result.size)
# create a numeric array that contains the indexes of every dataframe, all set to 1
dataframe_indexes <- numeric(length(dataframes))
dataframe_indexes[dataframe_indexes == 0] <- 1
# data matrix for the result
result <- matrix(data = 0, nrow = result.size, ncol = length(dataframes)+1)
# use progressbar
pb <- txtProgressBar(1, result.size, style = 3)
# loop over every row of the resulting data frame (building one row per pass)
# on every pass, advance each dataframe's index until its timestamp exceeds the current row's timestamp, then step one index back
#for (i in 1:200){
for (i in 1:result.size){
current_timestamp <- smallest_time + timeinterval*(i-1)
result[i,1] <- current_timestamp
for (i2 in 1:length(dataframes)){
j <- i2*2
while (resources[dataframe_indexes[i2], j-1] < current_timestamp && dataframe_indexes[i2] < resources.sizes[i2]){
dataframe_indexes[i2] <- dataframe_indexes[i2]+1
}
# at the moment the last value of the array is never selected, needs to be fixed
if (dataframe_indexes[i2] > 1){
dataframe_indexes[i2] <- dataframe_indexes[i2]-1 # take the one that's smaller
}
result[i,i2+1] <- resources[dataframe_indexes[i2], j] #dataframes[[i2]]$VALUE[dataframe_indexes[i2]]
}
setTxtProgressBar(pb, i)
}
close(pb)
result.final <- data.frame(result)
time.end <- Sys.time()
print(time.end-time.start)
return(result.final)
} else {
return(NA)
}
}
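A hypothetical call might look like the sketch below; sensor_a, sensor_b and sensor_c are placeholder names for data frames that each have a numeric TIMESTAMP column (milliseconds) and a VALUE column, as the function assumes.
# hypothetical usage: combine three sensors onto a one-second grid
sensors  <- list(sensor_a, sensor_b, sensor_c)             # placeholder data frames
combined <- generalise_data(sensors, timeinterval = 1000)  # 1000 ms between result rows
head(combined)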
I recently asked a question about improving performance in my code (Faster method than "while" loop to find chain of infection in R).
Background:
I'm analyzing large tables (300,000 - 500,000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. For example, as pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.
In my original question, I asked how I could output a data.frame corresponding to animal "d2"'s "chain of infection" (see below, outlined in green, for an illustration of one chain). The suggested solution worked well for one animal.
In reality, I will need to calculate chains for about 400 animals, corresponding to a subset of all animals (allanimals table).
I've included a link to an example dataset that is large enough to play with.
Here is the code for one chain, starting with animal 5497370. Note that I've slightly changed column names from my previous question and updated the code.
The code:
allanimals <- read.csv("https://www.dropbox.com/s/0o6w29lz8yzryau/allanimals.csv?raw=1",
stringsAsFactors = FALSE)
# Here's an example animal
ExampleAnimal <- 5497370
ptm <- proc.time()
allanimals_ID <- setdiff(unique(c(allanimals$ID, allanimals$InfectingAnimal_ID)), -1)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$ID, allanimals_ID)] <-
match(allanimals$InfectingAnimal_ID, allanimals_ID)
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match(ExampleAnimal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
chain <- allanimals[path[seq_len(i - 1)], ]
proc.time() - ptm
# check it out
chain
I'd like to output chains for each animal in "sel.set":
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
If possible, I'd like to store each "chain" data.frame in a list whose length equals the number of chains.
I'll return the indices needed to access the data frame rather than the data frame subsets themselves. To get the subsets you just need lapply(test, function(path) allanimals[path, ]), or a more elaborate function inside the lapply if you want to do other things with each subset.
One could think of just lapply on the solution for one animal:
get_path <- function(animal) {
  curOne <- match(animal, allanimals_ID)
  i <- 1
  while (!is.na(nextOne <- infected[curOne])) {
    path[i] <- curOne
    i <- i + 1
    curOne <- nextOne
  }
  path[seq_len(i - 1)]
}
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
system.time(
test <- lapply(sel.set, get_path)
) # 0.66 seconds
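If you want the actual chains as data frames rather than index paths, here is a quick sketch of the conversion mentioned above:
# turn each index path back into a data frame subset of allanimals
chains <- lapply(test, function(path) allanimals[path, ])
length(chains)   # one chain per animal in sel.set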
We could rewrite this function as a recursive function (this will introduce my third and last solution).
system.time(
sel.set.match <- match(sel.set, allanimals_ID)
) # 0
get_path_rec <- function(animal.match) {
  `if`(is.na(nextOne <- infected[animal.match]),
       NULL,
       c(animal.match, get_path_rec(nextOne)))
}
system.time(
test2 <- lapply(sel.set.match, get_path_rec)
) # 0.06
all.equal(test2, test) # TRUE
This solution is about 10 times as fast. I don't fully understand why, though; one plausible reason is that each call to get_path modifies a copy of the long preallocated path vector, whereas the recursive version only builds short vectors.
Why did I want to write a recursive function? I thought you might have many cases where, for example, you want the path of animalX and of animalY, where animalY infected animalX; when computing the path of animalX you would then recompute the whole path of animalY.
So I wanted to use memoization to store already computed results, and memoization works well with recursive functions. Hence my last solution:
get_path_rec_memo <- memoise::memoize(get_path_rec)
memoise::forget(get_path_rec_memo)
system.time(
test3 <- lapply(sel.set.match, get_path_rec_memo)
) # 0.12
all.equal(test3, test) # TRUE
Unfortunately, this is slower than the second solution here. I hope it will still be useful on the whole dataset.
I have a gridded climate dataset, such as:
# generate time vector
time1 <- seq(14847.5,14974.5, by = 1)
time2 <- seq(14947.5,14974.5, by = 1)
time <- c(time1,time2)
time <- as.POSIXct(time*86400,origin='1970-01-01 00:00')
# generate lat and lon coordinates
lat <- seq(80,90, by = 1)
lon <- seq(20,30, by = 1)
# generate 3-dimensional array
dat <- array(runif(length(lat)*length(lon)*length(time)),
dim = c(length(lon),length(lat),length(time)))
such that
> dim(dat)
[1] 11 11 156
The dimensions of the data describe the variable at different longitudes (dim = 1), latitudes (dim = 2), and times (dim = 3).
The issue I have at the moment is that some of the times are repeated (something to do with overlapping sensors measuring the data). I was therefore wondering whether it is possible to keep only the unique times in dat, but average the data within the grid for the duplicated times, i.e. if a day appears twice, take the average value in each latitude/longitude cell for that time.
I can find the unique times as:
# only select unique times
new_time <- unique(time)
unique_time <- unique(time)
The following code then aims to loop through each grid (lat/lon) and average all of the duplicated days.
# loop through lat/lon coordinates to generate new data
new_dat <- array(dim = c(length(lon), length(lat), length(new_time)))
for (i in 1:length(lon)) {
  for (ii in 1:length(lat)) {
    dat2 <- dat[i, ii, ]
    dat2b <- NA
    for (k in 1:length(unique_time)) {
      idx <- time == unique_time[k]
      dat2b[k] <- mean(dat2[idx], na.rm = TRUE)
    }
    new_dat[i, ii, ] <- dat2b
  }
}
I'm convinced that this gives the correct answer, but I'm certain there is a much cleaner way to achieve it.
I should also note that my data are quite large (i.e. k = 7000), so this last loop is not very efficient, to say the least.
My original answer:
This is a bit more concise and efficient by use of aggregate:
for (i in 1:length(lon)) {
  for (ii in 1:length(lat)) {
    new_dat[i, ii, ] <- as.numeric(aggregate(dat[i, ii, ], by = list(time), mean)$x)
  }
}
It still has two of the three loops, but it bypasses creating dat2, dat2b, and unique_time.
My improved answer:
f <- function(i, ii) as.numeric(aggregate(dat[i, ii, ], by = list(time), mean)$x)
grid <- expand.grid(1:length(lon), 1:length(lat))
for (i in 1:nrow(grid)) {
  new_dat[grid[i, 1], grid[i, 2], ] <- f(grid[i, 1], grid[i, 2])
}
Got it down to just 1 loop. We could probably bypass that loop too with an apply.
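For what it's worth, here is an untested sketch of that apply-based version. It assumes that averaging over duplicated time stamps is all that is needed, and note that tapply orders its output by sorted time.
# apply over the lon/lat margins, averaging duplicated times with tapply
tmp <- apply(dat, c(1, 2), function(x) tapply(x, time, mean))  # time x lon x lat
new_dat <- aperm(tmp, c(2, 3, 1))                              # back to lon x lat x time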
I currently have the following code, which produces the desired results (Data_Index and Data_Percentages):
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # Proportion of all areas' population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for (u in 1:Vars) {
    # For every column value in the selected row,
    # the percentage of that value compared to the
    # total population (TotalCount) for that row is calculated
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # Once the percentage is calculated, the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in that
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # Bind results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # Bind results for the completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem is that the code is very slow. I want to be able to use it on a dataset with 200,000 rows and 200 columns, which with the present code would take around 4 days. I am sure there must be a way of speeding this up, but I am not sure how.
What the code is doing is taking (in this example) a table of population counts, divided into age bands and into different areas, and turning it into percentages and index scores. Currently there are two loops, so that every value in every row and column is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly; are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
Here is your entire code without the for-loops; apply is not necessary either. The division can be done on the whole matrix (or data frame) at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
Data_Per_Row <- df / matrix(rep(total_count, each = ncol(df)), ncol = ncol(df), byrow = TRUE) * 100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop and use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x / tcs * 100,
               tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop as well (a sketch follows). I would also appreciate some comments in your code.
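A minimal sketch of that vectorization using sweep(), assuming the same objects as in the question (Input_Data with the total-count column already removed, TotalCount, and the OAPer vector computed above):
# divide every row by that row's total population, then by OAPer, as in the loops
Data_Percentages <- sweep(Input_Data, 1, TotalCount[, 1], "/") * 100
Data_Index_Row   <- sweep(Data_Percentages, 1, OAPer, "/") * 100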