Build dataframe with function dependent on its own output - r

I've searched the web but haven't found a suitable solution. Any help would be much appreciated!
Example:
# I want this OUTPUT df
Previous <- c(2, 4, 8, 16)
Today <- c(4, 8, 16, 32)
df <- data.frame(Previous, Today)
INPUT conditions:
Today is a function of Previous:
Today_function <- Previous_start * 2
For Previous I only have the starting value:
Previous_start <- 2
The output of Today_function feeds the next row of Previous and is the new input of Today_function.

You can abuse Reduce to iteratively apply a function to its own output:
n <- 4 # Desired number of rows in the data.frame
val0 <- 2 # Starting value
f <- function( valPrev, dummy ) { valPrev * 2 } # Make function "binary"
v <- Reduce( f, rep(NA,n), val0, accumulate=TRUE )
X <- data.frame( Previous=head(v,-1), Today=tail(v,-1) )
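For the starting value and n above, a quick check shows the desired frame:
v
# [1]  2  4  8 16 32
X
#   Previous Today
# 1        2     4
# 2        4     8
# 3        8    16
# 4       16    32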
Side note: I know it's common, but I recommend against using df as a variable name, because it's a built-in name for the density function of the F distribution.
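As an aside, if you already use the tidyverse, purrr::accumulate expresses the same fold-with-history idea; a minimal sketch, assuming purrr is installed:
library(purrr)
v <- accumulate(rep(NA, n), function(prev, dummy) prev * 2, .init = val0)
# identical to the Reduce() result above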

Related

Rfast segmentation fault on independence test

I am having trouble using the G2-test function of the Rfast package in R, since it produces a segmentation fault even though it seems to me that the input parameters are correct.
More specifically, I am able to run the example code from the manual page:
nvalues <- 3
nvars <- 10
nsamples <- 5000
data <- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars )
dc <- rep(nvalues, nvars)
res <- g2Test(data, 1, 2, 3, c(3, 3, 3))
But I'm not able to make it run on my data. The function g2Test takes as input a matrix of numbers, three integers that stand for the columns to test and condition on (in the example we are studying the dependence of the first column on the second, conditioned on the third), and a vector with the number of unique values per column.
My code follows the same principles, reading data from the ALARM csv file:
library(readr)
library(Rfast)
# open the file
path <- "datasets/alarm.csv"
dataset <- read.csv(path)
# search for the indexes of the column I'm interested in and the amount of unique values per column
c1 <- "PVS"
c2 <- "ACO2"
s <- c("VALV", "VLNG", "VTUB", "VMCH")
n <- colnames(dataset)
col_c1 <- match(c1, n)
col_c2 <- match(c2, n)
cols_c3 <- c()
uni <- c(length(unique(dataset[c1])[[1]])[[1]],
         length(unique(dataset[c2])[[1]])[[1]])
if (!s[1] == "()"){
  for (v in s){
    idx <- match(v, n)
    cols_c3 <- append(cols_c3, idx)
    uni <- append(uni, length(unique(dataset[v])[[1]])[[1]])
  }
}
# transforming the string data.frame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}
ds <- as.matrix(dataset)
colnames(ds) <- NULL
# running the G2 test
res <- g2Test(ds, col_c1, col_c2, cols_c3, uni)
But it results in a segmentation fault:
*** caught segfault ***
address 0x1f103f96a, cause 'memory not mapped'
Traceback:
1: g2Test(ds, col_c1, col_c2, cols_c3, uni)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
The same happens if I condition on just one variable and not on multiple ones.
I really don't understand why this happens, since it seems to me that my case is the same as the example in the reference, just with different data. I would really appreciate any help debugging this issue; please tell me if I need to provide further info.
First, I'm sorry that I missed that you had originally included your data!
Alright, I wish I had realized this sooner (as you will, as well...). The columns have to be consecutive and the values must start at zero. So what does that mean? You have to rearrange the columns so that col_c1 is the first column, col_c2 is the second column, and so on. You also have to subtract one from all values (since the lowest value is 1).
This is what I did (and how I checked it):
# there was no PVS, I assume this was PVSAT
c1 <- "PVSAT"
# c1 <- "PVS"
# there was no ACO2, I assume this was ARTCO2
c2 <- "ARTCO2"
# c2 <- "ACO2"
# there are no columns with these names...
# for VALV - VENTALV; for VLNG - VENTLUNG; for VTUB - VENTTUBE; for VMCH - VENTMACH
s <- c("VENTALV", "VENTLUNG", "VENTTUBE", "VENTMACH")
# s <- c("VALV", "VLNG", "VTUB", "VMCH")
This next chunk is exactly as you wrote it:
n <- colnames(dataset)
col_c1 <- match(c1, n)
col_c2 <- match(c2, n)
cols_c3 <- c()
uni <- c(length(unique(dataset[c1])[[1]])[[1]],
         length(unique(dataset[c2])[[1]])[[1]])
if (!s[1] == "()"){
  for (v in s){
    idx <- match(v, n)
    cols_c3 <- append(cols_c3, idx)
    uni <- append(uni, length(unique(dataset[v])[[1]])[[1]])
  }
}
# transforming the string data.frame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}
ds <- as.matrix(dataset)
This is where I made the minimum zero:
# look at the number of unique values before changing, as a means of validation
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))
# look at the minimum, as a means of validation
sapply(1:ncol(ds), function(x) min(ds[,x]))
# the minimum value must be zero
ds <- ds - 1
# check
sapply(1:ncol(ds), function(x) min(ds[,x]))
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))
# looked as expected
Next, I rearranged the columns. I did this before removing the names so I could use the names to ensure the order was correct.
# the data must be consecutive numbers
# catch names before and after
n2 <- dimnames(ds)
# some of the results from this:
# [[2]]
# [1] "HISTORY" "CVP" "PCWP" "HYPOVOLEMIA"
# create the list of column indices other than those getting called in g2Test
tellMe <- c(1:ncol(ds))
tellMe <- tellMe[-c(col_c1, col_c2, sort(cols_c3))]
# rearrange using the indices
ds <- ds[, c(col_c1, col_c2, sort(cols_c3), tellMe)]
# check it
(n3 <- dimnames(ds))
# some of the results from this
# [[2]]
# [1] "PVSAT" "ARTCO2" "VENTMACH" "VENTTUBE"
All that's left is removing the names (just as you did) and then calling the function. Since the indices changed, your objects won't work here, though.
colnames(ds) <- NULL
# running the G2 test
# res <- g2Test(ds, col_c1, col_c2, sort(cols_c3), uni)
res2 <- g2Test(ds, 1, 2, c(3,4,5,6), c(3, 3, 4, 4, 4, 4))
# $statistic
# [1] 19.78506
#
# $df
# [1] 1024
#
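To make the two preconditions concrete (zero-based values, tested columns first), here is a minimal self-contained sketch on synthetic data shaped like the manual example; it is an illustration, not the ALARM analysis:
library(Rfast)
nvalues <- 3
nvars <- 5
nsamples <- 1000
# values are 0-based and consecutive, as g2Test expects
m <- matrix(sample(0:(nvalues - 1), nvars * nsamples, replace = TRUE), nsamples, nvars)
dc <- rep(nvalues, nvars)  # number of distinct values per column
res <- g2Test(m, 1, 2, c(3, 4), dc)  # col 1 vs col 2, conditioned on cols 3 and 4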

Variance of a List of Arrays

Suppose x is a list of arrays.
a <- array(1,c(2,3,4));
b <- array(2,c(2,3,4));
c <- array(3,c(2,3,4));
x <- list(a=a,b=b,c=c)
My desired output is the following array:
array(var(c(1,2,3)),c(2,3,4))
This is almost the same as the following page, but with the sum changed to the variance.
Ref: Sum a list of arrays
Edit, following the comments of @akrun:
Thank you, @akrun. His answer is the following:
apply(
  array(unlist(x), append(dim(x[[1]]), length(x))),
  1:length(dim(x[[1]])),
  var
)
where x is the following list of arrays:
a <- array(1,c(2,3,4));
b <- array(2,c(2,3,4));
c <- array(3,c(2,3,4));
d <- array(4,c(2,3,4));
x <- list(a=a,b=b,c=c,d=d)
One option would be to unlist the list, create an array, and use apply to get the variance:
apply(array(unlist(x), c(2, 3, 4, 3)), 1:3, var)
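As a quick check with the original three-array x, the result matches the desired output:
out <- apply(array(unlist(x), c(2, 3, 4, 3)), 1:3, var)
all.equal(out, array(var(c(1, 2, 3)), c(2, 3, 4)))
# [1] TRUE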

r: function with lag working across rows, not columns

I'm writing a function that takes the most recent observation and adds to it the previous day's value times a designated share. The below is a version that just uses one transformation and works:
library(dplyr)  # lag() with a default argument comes from dplyr
df1 <- data.frame(var1 = rnorm(10, 3, 2), var2 = rnorm(10, 4, 3))
df1$carryover <- lag(df1$var1, 1, default = 0) * 0.5 + df1$var1
> df1
        var1      var2 carryover
1  3.2894474 2.0839128 3.2894474
2  3.6059389 7.8880658 5.2506625
3 -1.4274057 6.2763882 0.3755637
4  3.8531253 3.2653448 3.1394225
My function attempts to do the same but across multiple different shares, see below:
carryover <- function(x){
  result_df <- data.frame(x)
  xnames <- names(x)
  for (i in 1:7){
    result_column <- lag(x, 1, default = 0) * (i/10) + x
    result_column_name <- paste(xnames, i, sep = "_")
    result_df[result_column_name] <- result_column
  }
  return(result_df)
}
When I run carryover(df1), df1$var1 remains the same across all iterations while df1$var2 takes lag values across rows, when I'm aiming for columns. What is structurally wrong with my function that causes it not to lag the column values?
Worked on this a bit using feedback from Stack Overflow and came up with the solution below, defining the carryover function within a larger function, then using apply with MARGIN = 2 to calculate by column:
adStock <- function(x){
  # create data frame to store results in
  result_df <- data.frame(x)
  # assign names to be applied as a column
  xnames <- names(x)
  # create list of carryovers
  carryovers <- seq(.1, .7, .1)
  # create carryover function
  carryover <- function(x){
    x + dplyr::lag(x, 1, default = 0) * i
  }
  # run for loop across all carryover values
  for (i in carryovers){
    result_column <- apply(x, 2, carryover)
    result_column_name <- paste(xnames, i, sep = "_")
    result_df[result_column_name] <- result_column
  }
  return(data.frame(result_df))
}
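For comparison, the same result can be built without apply by lagging each column directly; this is only a sketch (adStock2 is a hypothetical name) and assumes the input is a data frame and dplyr is installed:
library(dplyr)
adStock2 <- function(x, shares = seq(0.1, 0.7, 0.1)) {
  result_df <- data.frame(x)
  for (s in shares) {
    # each column plus a share of its own lagged values
    new_cols <- lapply(x, function(col) col + lag(col, 1, default = 0) * s)
    result_df[paste(names(x), s, sep = "_")] <- new_cols
  }
  result_df
}
adStock2(df1[, c("var1", "var2")])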

Using coefficient of variation in aggregate

I have a data frame with 50000 rows and 200 columns. There are duplicate rows in the data, and I want to aggregate the data by choosing the row with the maximum coefficient of variation among the duplicates, using the aggregate function in R. With aggregate I can use "mean" or "sum" by default, but not the coefficient of variation.
For example
aggregate(data, as.columnname, FUN=mean)
Works fine.
I have a custom function for calculating the coefficient of variation, but I am not sure how to use it with aggregate:
co.var <- function(x)
(
100*sd(x)/mean(x)
)
I have tried
aggregate(data, as.columnname, function (x) max (co.var (x, data[index (x),])
but it gives an error saying object x is not found.
Assuming that I understand your problem, I would suggest using tapply() instead of aggregate() (see ?tapply for more info). However, a minimal working example would be very helpful.
co.var <- function(x) ( 100 * sd(x) / mean(x) )
## Data with multiple repeated measurements.
## There are three things (ID 1, 2, 3) that are
## measured on two occasions, with two values each time (val1 and val2)
myDF <- data.frame(ID = c(1, 2, 3, 1, 2, 3),
                   val1 = c(20, 10, 5, 25, 7, 2),
                   val2 = c(19, 9, 4, 24, 4, 1))
## Calculate coefficient of variation for each measurement set
myDF$coVar <- apply(myDF[, c("val1", "val2")], 1, co.var)
## Use tapply() instead of aggregate
mySel <- tapply(seq_len(nrow(myDF)), myDF$ID, function(x){
  curSub <- myDF[x, ]
  return(x[which(curSub$coVar == max(curSub$coVar))])
})
## The mySel vector is then the vector of rows that correspond to the
## maximum coefficient of variation for each ID
myDF[mySel, ]
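With this toy data the selected rows are 1, 5, and 6, the row with the highest CV per ID (coVar shown rounded):
myDF[mySel, ]
##   ID val1 val2 coVar
## 1  1   20   19  3.63
## 5  2    7    4 38.57
## 6  3    2    1 47.14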
EDIT:
There are faster ways, one of which is below. However, with a 40000 by 100 dataset, the above code only took between 16 and 20 seconds on my machine.
# Create a big dataset
myDF <- data.frame(val1 = c(20, 10, 5, 25, 7, 2),
                   val2 = c(19, 9, 4, 24, 4, 1))
myDF <- myDF[sample(seq_len(nrow(myDF)), 40000, replace = TRUE), ]
myDF <- cbind(myDF, rep(myDF, 49))
myDF$ID <- sample.int(nrow(myDF)/5, nrow(myDF), replace = TRUE)
# Define a new function to work (slightly) better with large datasets
co.var.df <- function(x) ( 100*apply(x,1,sd)/rowMeans(x) )
# Create two datasets to benchmark the two methods
# (A second method proved slower than the third, hence the naming)
myDF.firstMethod <- myDF
myDF.thirdMethod <- myDF
Time the original method
startTime <- Sys.time()
myDF.firstMethod$coVar <- apply(myDF.firstMethod[,
  grep("val", names(myDF.firstMethod))], 1, co.var)
mySel <- tapply(seq_len(nrow(myDF.firstMethod)),
  myDF.firstMethod$ID, function(x) {
    curSub <- myDF.firstMethod[x, ]
    return(x[which(curSub$coVar == max(curSub$coVar))])
  }, simplify = FALSE)
endTime <- Sys.time()
R> endTime-startTime
Time difference of 17.87806 secs
Time the third method
startTime3 <- Sys.time()
coVar3 <- co.var.df(myDF.thirdMethod[,
  grep("val", names(myDF.thirdMethod))])
mySel3 <- tapply(seq_along(coVar3),
  myDF[, "ID"], function(x) {
    return(x[which(coVar3[x] == max(coVar3[x]))])
  }, simplify = FALSE)
endTime3 <- Sys.time()
R> endTime3-startTime3
Time difference of 2.024207 secs
And check to see that we get the same results:
R> all.equal(mySel,mySel3)
[1] TRUE
There is an additional change from the original post, in that the edited code considers that there may be more than one row with the highest CV for a given ID. Therefore, to get the results from the edited code, you must unlist the mySel or mySel3 objects:
myDF.firstMethod[unlist(mySel),]
myDF.thirdMethod[unlist(mySel3),]
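For reference, a dplyr equivalent of the row selection (ties included) is shorter, though it was not benchmarked here; a sketch assuming dplyr is installed, reusing myDF.firstMethod, which already carries the coVar column:
library(dplyr)
myDF.firstMethod %>%
  group_by(ID) %>%
  filter(coVar == max(coVar)) %>%
  ungroup()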

Avoiding Loop with R using Apply (?)

I'm trying to apply a function to each row of a dataset. The function looks up matching rows in a second dataset and computes a similarity score for the product details passed to it.
The function works if I just call it with test numbers but I can't figure out how to run it on all rows of my dataset. I've tried using apply but can't get it working.
I'm going to be iterating over different parameter settings to find those that best fit historical data, so speed is important... meaning that a loop is out. Any help you can provide would be hugely appreciated.
Thanks! Alan
GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
  HeightParam <- 1/5000
  AgeParam <- 1
  Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) &
                                    !is.na(HistoricalStock$Type) &
                                    as.character(HistoricalStock$date) == as.character(SnapshotDate) &
                                    HistoricalStock$Type == Cand_Type, ]
  Stock_SameType$ED <- (HeightParam * (Stock_SameType$Height - Cand_Height))^2 +
                       (AgeParam * (Stock_SameType$Age - Cand_Age))^2
  return(sqrt(sum(Stock_SameType$ED)))
}
HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")
GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number
res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
# returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852, : replacement has 4060 rows, data has 54
# also 4 warnings, one for each variable. e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) : longer object length is not a multiple of shorter object length
res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
# `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261, : replacement has 4060 rows, data has 13
# also same 4 warnings as res1
I took some liberties with your code because I try to vectorize rather than use loops whenever I can... With the merge function, you merge the two data frames and operate on the "columns", which allows you to use the vectorization built into R. I think this will do what you want (in the second line I'm just making sure that A and B don't have the same values for height and age, so that your distance isn't always zero):
A <- B <- data.frame(date=Sys.Date()-9:0, stock=letters[1:10], type=1:10, height=1:10, age=1:10)
B$height <- B$age <- 10:1
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1/5000
age.param <- 1
temp <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
Use mapply, the multivariate version of sapply:
res1 <- mapply(GetDistanceTest, Sales$date, Sales$Type, Sales$Height, Sales$Age)
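As a toy illustration of how mapply walks its arguments in parallel, one element from each per call:
mapply(function(a, b) a + b, 1:3, 4:6)
# [1] 5 7 9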
Code as per above comment:
library(plyr)  # needed for ddply()
A <- data.frame(date = rep(Sys.Date() - 9:0, 100), id = letters[1:10],
                type = floor(runif(1000, 1, 10)),
                height = runif(1000, 1, 100), age = runif(1000, 1, 100))
B <- data.frame(date = rep(Sys.Date() - 9:0, 1000),
                type = floor(runif(10000, 1, 10)),
                height = runif(10000, 1, 10), age = runif(10000, 1, 10))
AB <- merge(x = A, y = B, by = c("date", "type"), suffixes = c(".A", ".B"))
height.param <- 1
age.param <- 1
AB$ClusterScore <- sqrt(height.param * (AB$height.A - AB$height.B)^2 +
                        age.param * (AB$age.A - AB$age.B)^2)
Scores <- ddply(AB, c("id"), function(df) sum(df$ClusterScore))
