Efficiently replacing data frame values with cumulative frequencies in R

I'm trying to write a program that takes a large data frame and replaces each column of values by the cumulative frequency of those values (sorted ascending). For instance, if the column of values is 5, 8, 3, 5, 4, 3, 8, 5, 5, 1, then the relative and cumulative frequencies are:
1: rel_freq=0.1, cum_freq = 0.1
3: rel_freq=0.2, cum_freq = 0.3
4: rel_freq=0.1, cum_freq = 0.4
5: rel_freq=0.4, cum_freq = 0.8
8: rel_freq=0.2, cum_freq = 1.0
Then the original column becomes: 0.8, 1.0, 0.3, 0.8, 0.4, 0.3, 1.0, 0.8, 0.8, 0.1
The following code performs this operation correctly, but it scales poorly, probably due to the nested loops. Any idea how to perform this task more efficiently?
mydata = read.table(.....)
totalcols = ncol(mydata)
totalrows = nrow(mydata)
for (i in 1:totalcols) {
  freqtable = data.frame(table(mydata[, i]) / totalrows)  # create freq table
  freqtable$CumSum = cumsum(freqtable$Freq)               # calc cumulative freq
  hashtable = new.env(hash = TRUE)
  nrows = nrow(freqtable)
  # store cum freq in hash
  for (x in 1:nrows) {
    dummy = toString(freqtable$Var1[x])
    hashtable[[dummy]] = freqtable$CumSum[x]
  }
  # replace original data with cum freq
  for (j in 1:totalrows) {
    dummy = toString(mydata[j, i])
    mydata[j, i] = hashtable[[dummy]]
  }
}

This handles a single column without the for-loop:
R> x <- c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1)
R> y <- cumsum(table(x)/length(x))
R> y[as.character(x)]
5 8 3 5 4 3 8 5 5 1
0.8 1.0 0.3 0.8 0.4 0.3 1.0 0.8 0.8 0.1

Here is one way. Using a data frame with two variables each containing your example data
d <- data.frame(var1 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1),
                var2 = c(5, 8, 3, 5, 4, 3, 8, 5, 5, 1))
use a simple function to:
1. generate the cumsum() of the relative proportions given by table(x) / length(x), then
2. match() the observations in a variable with the names of the table of cumulative sums, then
3. use the id matches to select from the table of cumulative sums (and un-name it).
Such a function is:
f <- function(x) {
  tab <- cumsum(table(x) / length(x))
  ind <- match(x, as.numeric(names(tab)))
  unname(tab[ind])
}
In practice we use lapply() and coerce to a data frame:
out <- data.frame(lapply(d, f))
out
which gives:
R> out
var1 var2
1 0.8 0.8
2 1.0 1.0
3 0.3 0.3
4 0.8 0.8
5 0.4 0.4
6 0.3 0.3
7 1.0 1.0
8 0.8 0.8
9 0.8 0.8
10 0.1 0.1
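Base R's ecdf() computes exactly this cumulative relative frequency (the empirical CDF evaluated at each observation), so an equivalent alternative is possible. A minimal sketch, assuming the d and out objects from above (f2 is just an illustrative name):
# ecdf(x) returns a step function; evaluating it at x gives, for each value,
# the proportion of observations less than or equal to it
f2 <- function(x) ecdf(x)(x)
out2 <- data.frame(lapply(d, f2))
all.equal(out, out2)  # TRUE (up to numerical tolerance) for the example data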

Related

Pooled Mean Matching MICE

For MICE imputations I need to constrain the predictions so that the predicted values have the same mean as a measured value. The situation is that we are dealing with mean blood serum samples (individual blood samples are pooled together), where the measured values are representative of the mean of those individuals. I am trying to predict what the concentration of x was in those individuals based on the measured mean and covariate data. You'll notice in my dummy dataset that there are 3 individuals (Individual_id) for each pool (Pool_id), so when imputing these values to the individuals we need the average of those 3 individuals to equal the Pool_mean.
How can we constrain the mice algorithm so that it still predicts based on covariate data, but the means match exactly (any method can be chosen, "cart" in this circumstance)? Could this conceptually be done through a MICE squeeze constraint with inputs from the mean?
The code is below:
library(mice)
library(dplyr)
#create demo data table as an example
Pool_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
Pool_mean <- c(15, 15, 15, 35, 35, 35, 42, 42, 42)
Individual_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
concentration <- c(10, 20, NA, 30, NA,NA, NA, NA, 70)
co_variate <- c(0.1, 0.2, 0.1, 0.2, 0.3, 0.1, 0.1, 0.2, 0.3)
df <- data.frame(Pool_id, Pool_mean, Individual_id, concentration, co_variate)
#run mice to impute missing data
initial_imputed <- mice(df, m = 5, maxit = 10, meth = "cart", seed = 3985)
completed <- complete(initial_imputed)
I know that we can constrain mice using post-processing and maybe a custom function like the vec_squeeze below. However, I need to constrain the values based on a mean. How could I update this function to do that?
vec_squeeze <- function(x, bounds) {
  stopifnot(length(x) == nrow(bounds))
  pmin(pmax(x, bounds[, 1]), bounds[, 2])
}
Here's an example of how to use passive imputation on the 3rd variable to force the mean of the imputations to be equal to pool_mean from the data. First generate some data in 'wide' format.
set.seed(123)
# Using larger example data to avoid issues with imputation models
n <- 20
pool_id <- rep(1:n, each = 3)
ind_id <- rep(1:3, times = n)
cov_1 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_2 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_3 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
conc_1 <- round(rnorm(n*3, mean = 20 + 5*cov_1, sd = 5))
conc_2 <- round(rnorm(n*3, mean = 20 + 5*cov_2, sd = 5))
conc_3 <- round(rnorm(n*3, mean = 20 + 5*cov_3, sd = 5))
pool_mean <- apply(cbind(conc_1, conc_2, conc_3), FUN = mean, MARGIN = 1)
df <- data.frame(pool_id, ind_id, pool_mean, conc_1, conc_2,
                 conc_3, cov_1, cov_2, cov_3)
df[which(rbinom(n*3, 1, prob = 0.5) == 1), "conc_3"] <- NA
df[which(rbinom(n*3, 1, prob = 0.2) == 1), "conc_2"] <- NA
df[which(is.na(df$conc_2)),"conc_3"] <- NA
head(df)
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 NA NA 0.3 0.1 0.2
#> 4 2 1 25.00000 25 NA NA 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 NA 0.2 0.3 0.3
I forced missing values into the third position to avoid re-arranging. I also have ind_id repeated within each pool_id instead of unique, but that's not important for what follows.
The key part of the passive imputation is meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)". If (A+B+C)/3 = D, then 3D - A - B = C.
library(mice)
ini <- mice(df, maxit = 0, printFlag = FALSE)
# Limit the variables used in prediction, to avoid co-linearity
pred <- ini$predictorMatrix
pred[,] <- 0
pred["conc_1", c("pool_mean","cov_1")] <- 1
pred["conc_2", c("pool_mean","conc_1","cov_2")] <- 1
# Set the imputation methods. Use passive imputation for conc_3
meth <- ini$method
meth["conc_2"] <- "pmm"
meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)"
# Control the visit sequence to ensure that conc_3 is updated
# after conc_2. Add other missing variables if needed.
visit_seq <- c("conc_2", "conc_3")
imps <- mice(df, method = meth,
             predictorMatrix = pred,
             visitSequence = visit_seq,
             printFlag = FALSE)
head(complete(imps, action = 1))
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 18 5 0.3 0.1 0.2
#> 4 2 1 25.00000 25 23 27 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 24 0.2 0.3 0.3
Created on 2022-11-20 with reprex v2.0.2
The imputation procedure has correctly replaced row 6's conc_3 value with 24. The other rows have received a random imputation for conc_2 and then had conc_3 passively imputed. In general conc_3 has no other restrictions on it. In row 3 conc_3 = 5, which may be questionably low. It could even become negative in some situations. Better modelling of conc_2 would help.
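As a quick sanity check (a sketch, assuming the imps object created above), the row-wise mean of the three concentrations in a completed data set should reproduce pool_mean exactly, because conc_3 is always derived from it:
# verify the pooling constraint in the first completed data set
comp <- complete(imps, action = 1)
all.equal(unname(rowMeans(comp[, c("conc_1", "conc_2", "conc_3")])),
          comp$pool_mean)  # expected TRUE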

How to connect directional sub-graphs in R igraph

I have a directional weighted graph that is made of two or more disconnected sub-graphs (with some attributes, in addition to weight):
require(igraph)
df <- data.frame(from = c(1, 2, 4, 5), to = c(2, 3, 5, 6),
                 weight = c(1, 1, 1, 1), attr = c(0.1, 0.1, 0.1, 0.1))
g <- graph_from_data_frame(df, directed = TRUE)
My ultimate goal is to find shortest paths, but this only makes sense for a connected graph.
As a result, I need to connect these two sub-graphs with an edge between 3 and 4 carrying a large weight (perhaps vcount(g)), so that at the end I have just one graph. In general, vertex names are dates that define the direction (from smaller to larger). More than one gap can be present.
You can try the code below if you have more than one gap (i.e., more than two clusters)
e <- c(sapply(decompose(g), function(x) names(V(x))[degree(x) == 1]))
G <- g %>%
  add.edges(e[2:(length(e) - 1)], weight = vcount(g))
such that
> get.data.frame(G)
from to weight attr
1 1 2 1 0.1
2 2 3 1 0.1
3 4 5 1 0.1
4 5 6 1 0.1
5 7 8 1 0.1
6 8 9 1 0.1
7 3 4 9 NA
8 6 7 9 NA
Data
df <-
  data.frame(
    from = c(1, 2, 4, 5, 7, 8),
    to = c(2, 3, 5, 6, 8, 9),
    weight = c(1, 1, 1, 1, 1, 1),
    attr = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
  )
In your case, you could do (thanks to the comment by @ThomasIsCoding):
h <- add.edges(g, c("3", "4"), weight = vcount(g))
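With the bridging edge in place, shortest paths can be computed across the former gap. A minimal check, assuming g is the original six-vertex graph from the question and h is the object created above:
# the path now crosses the artificial 3 -> 4 edge
shortest_paths(h, from = "1", to = "6")$vpath
# the weighted distance reflects the large penalty on the bridge: 1 + 1 + 6 + 1 + 1 = 10
distances(h, v = "1", to = "6", mode = "out")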

Extract dataframes from a large list based on matching values given by another dataframe

I am new to R and despite having researched the site I can't seem to solve this:
I have a very large list of dataframes (of historical climate data from different sites) and I need to make a new list containing only those dataframes whose coordinate values (longitudes and latitudes) match those given by one independent dataframe.
So I have:
df1 <- data.frame("x" = c(1, 1, 1), "y" = c(-2, -2, -2), "a" = c(3, 6, 9), "b" = c(4, 5, 3))
df2 <- data.frame("x" = c(1.2, 1.2, 1.2), "y" = c(-2, -2, -2), "a" = c(3, 4, 78), "b" = c(12, 5, 8))
df3 <- data.frame("x" = c(1.3, 1.3, 1.3), "y" = c(-2.1, -2.1, -2.1), "a" = c(19, 5, 5), "b" = c(7, 7, 20))
my_list <- list(df1, df2, df3)
Each dataframe corresponds to one site, with specific values of lon and lat given in two columns (constant across all rows, since it's the same site):
> df1
x y a b
1 1 -2 3 4
2 1 -2 6 5
3 1 -2 9 3
> df2
x y a b
1 1.2 -2 3 12
2 1.2 -2 4 5
3 1.2 -2 78 8
> df3
x y a b
1 1.3 -2.1 19 7
2 1.3 -2.1 5 7
3 1.3 -2.1 5 20
Then, I have an independent dataframe with the longitudes and latitudes that I would like to obtain the climate data for.
df_xy <- data.frame("x"= c(1, 1.3), "y" = c(-2, -2.1))
> df_xy
x y
1 1.0 -2.0
2 1.3 -2.1
Ideally I would obtain a new list with the dataframes that match the x, y specified.
output_list <- list(df1, df3)
As much as I have tried with lapply and Map(merge, my_list, df_xy) I can't seem to make it work out. Thank you very much for any help!
You could do:
output_list <- lapply(my_list, function(x) merge(x, df_xy))
In output_list, the second element will be an empty data frame, since df2 has no coordinates matching df_xy.
Optionally, based on How do I remove empty data frames from a list?, you could then disregard empty dataframes from output_list using Filter(function(x) dim(x)[1], output_list)
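Putting both steps together (a sketch using the example objects above; note that the kept elements are the merged data frames, so their columns may be reordered):
matched <- lapply(my_list, function(d) merge(d, df_xy))   # empty result where no (x, y) match
output_list <- Filter(function(d) nrow(d) > 0, matched)   # drop the empty ones
length(output_list)  # 2: the data frames corresponding to df1 and df3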

Divide by sum by factor and carry back

I have a data.table that looks like:
library(data.table)
A <- c(1, 3, 5, 20, 21, 21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I", "I", "II", "II", "III", "III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
M <- data.table(A, B, C, D)
My question is similar to R help: divide values by sum produced through factor with a few extra considerations. A specifies a date (I'm simply using integers here). B are individuals. C is a classification that the individual belongs to. D is a value variable.
For each classification c of C, and for each day a of A, divide the value D by the sum of the values for all individuals in c, carrying backward when needed such that 0 < x - a <= N, where x is the date of another individual's observation (meaning that we pick the smallest x - a and use that observation as an approximation for the other individual's value in group c on day a).
Let's say N=5. Here's my expected output.
A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7/(0.7+0.3), 0.3/(0.3), 0.5/(0.5), 0.9/(0.9), 4/(4+7), 7/(4+7))
M <- data.table(A,B,C,D)
Note that the value for individual 4 is not carried backward to individual 3 (both in group II), as the gap (20 - 5) is greater than 5. Is there a nice way of doing this in data.table?
For each value in D, I wish to divide by the sum of all the values of the same group (either I, II, or III) on that day. However, you'll notice that for some groups, observations do not exist on that day. I'll try and walk through the logic on a few observations.
Edit: Let me try and walk through a few cases.
For individual 1 (column B) on day 1 (column A), the individual is of group I (column C). Other individuals of group I are: 2. For each of those others, we see that for individual 2, their nearest observation is on day 3 and 3-1<=5, so we'll use 0.3 in the denominator.
For individual 3 (column B) on day 5 (column A), the individual is of group II (column C). The only other individual of group II is 4. Individual 4's nearest observation is on day 20, and 20 - 5 > 5, so we cannot use their observation in the denominator.
This, I think, will give you your answer:
A <- c(1, 3, 5, 20, 21, 21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I", "I", "II", "II", "III", "III", "I")
V <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
N <- 5
# Put data into a frame
test = data.frame(A, B, C, V)
# Order the data by group, then day (C is character, so order it directly)
test = test[order(test$C, test$A), ]
# Get the 'rollback' possibilities for each value: rows whose day lies in [day, day + N)
Roll = sapply(test$A, FUN = function(x) {
  paste(which(test$A < (x + N) & test$A >= x), collapse = ",")
})
# Get the groupings: rows belonging to the same group
Group = sapply(test$C, FUN = function(x) {
  paste(which(test$C == x), collapse = ",")
})
# Intersect the row indices from the two criteria
ToGet = apply(cbind(Roll, Group), MARGIN = 1, FUN = function(x) {
  intersect(unlist(strsplit(x[1], ",")), unlist(strsplit(x[2], ",")))
})
# Calculate the denominators and the final ratios
test$D = sapply(ToGet, FUN = function(x) sum(test$V[as.numeric(x)]))
test$Calc = test$V / test$D
Output:
> test
A B C V D Calc
1 1 1 I 0.7 1.0 0.7000000
2 3 2 I 0.3 0.4 0.7500000
7 7 7 I 0.1 0.1 1.0000000
3 5 3 II 0.5 0.5 1.0000000
4 20 4 II 0.9 0.9 1.0000000
5 21 5 III 4.0 11.0 0.3636364
6 21 6 III 7.0 11.0 0.6363636
The question is tagged with data.table, so here is a data.table solution which uses non-equi joins to identify individuals within each group and treat them as a cohort if their observations fall within a date window of 5 days.
library(data.table) # CRAN version 1.10.4 used
# set length of date window in days
N <- 5L
# give columns more semantic names according to OP's description
setnames(M, c("day", "id", "grp", "val"))
# prepare data for non-equi join: allowable date range
ranged <- M[, .(start = day, end = day + N, co.id = id, grp)]
# non-equi join to determine cohort
joined <- M[ranged, on = c("grp", "day>=start", "day<=end")]
# compute denominator for each cohort
grouped <- joined[, .(den = sum(val)), by = co.id]
# final update on join and order
result <- M[grouped, on = c("id==co.id"), calc := val / den][order(grp, id)]
result
# day id grp val calc
#1: 1 1 I 0.7 0.7000000
#2: 3 2 I 0.3 0.7500000
#3: 7 7 I 0.1 1.0000000
#4: 5 3 II 0.5 1.0000000
#5: 20 4 II 0.9 1.0000000
#6: 21 5 III 4.0 0.3636364
#7: 21 6 III 7.0 0.6363636
Data
A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
M <- data.table(A,B,C,D)
Compact versions
For those who prefer compact code, here is a more convoluted version:
joined <- M[M[, .(start = day, end = day + N, co.id = id, grp)],
            on = c("grp", "day>=start", "day<=end")]
M[joined[, .(den = sum(val)), by = co.id], on = c("id==co.id"),
  calc := val / den][order(grp, id)]
Or, as a "one-liner":
M[M[M[, .(start = day, end = day + N, co.id = id, grp)],
    on = c("grp", "day>=start", "day<=end")
    ][, .(den = sum(val)), co.id],
  on = c("id==co.id"), calc := val / den][order(grp, id)]

Cumulative Calculations (e.g. cumulative correlation) with data.table in R

In R, I have a data.table with two measurements red and green and would like to calculate their cumulative correlation.
library(data.table)
DT <- data.table(red = c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7),
                 green = c(2, 4, 6, 8, 10, 12, 14, 16),
                 id = 1:8)
How can I get the following output within one data.table command?
...
> DT[1:5, cor(red, green)]
[1] 1 # should go into row 5
> DT[1:6, cor(red, green)]
[1] 0.9970501 # should go into row 6, and so on ...
> DT[1:7, cor(red, green)]
[1] 0.9976889
Edit:
I am aware that it can be solved by looping, but my data.table has about 1 million rows grouped into smaller chunks, so looping is rather slow and I thought there might be some other possibility.
Building on my answer to the similar question here for cumulative variances, you can find cumulative covariances as
library(dplyr) # for cummean
cum_cov <- function(x, y) {
  n <- 1:length(x)
  res <- cumsum(x * y) - cummean(x) * cumsum(y) - cummean(y) * cumsum(x) +
    n * cummean(x) * cummean(y)
  res / (n - 1)
}
cum_var <- function(x) { # copy-pasted from previous answer
  n <- 1:length(x)
  (cumsum(x^2) - n * cummean(x)^2) / (n - 1)
}
The cumulative correlation is then
cum_cor <- function(x, y) cum_cov(x, y) / sqrt(cum_var(x) * cum_var(y))
DT[, cumcor := cum_cor(red, green)]
red green id cumcor
1: 1.0 2 1 NaN
2: 2.0 4 2 1.0000000
3: 3.0 6 3 1.0000000
4: 4.0 8 4 1.0000000
5: 5.0 10 5 1.0000000
6: 6.5 12 6 0.9970501
7: 7.6 14 7 0.9976889
8: 8.7 16 8 0.9983762
I hope it is fast enough
x <- rnorm(1e6)
y <- rnorm(1e6)+x
system.time(cum_cor(x, y))
# user system elapsed
# 0.319 0.020 0.339
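A quick spot-check of the closed-form cumulative correlation against cor() computed directly on growing prefixes (a sketch using the small DT from the question):
k <- 5:8
all.equal(cum_cor(DT$red, DT$green)[k],
          sapply(k, function(i) cor(DT$red[1:i], DT$green[1:i])))  # expected TRUE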
How about creating a cumcor function?
library(data.table)
DT <- data.table(red = c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7),
                 green = c(2, 4, 6, 8, 10, 12, 14, 16),
                 id = 1:8)
cumcor <- function(x, y, start = 5, ...) {
  c(rep(NA, start - 1),
    sapply(start:length(x), function(k) cor(x[1:k], y[1:k]), ...))
}
DT[, list(red, green, cumcor = cumcor(red, green))]
red green cumcor
1: 1.0 2 NA
2: 2.0 4 NA
3: 3.0 6 NA
4: 4.0 8 NA
5: 5.0 10 1.0000000
6: 6.5 12 0.9970501
7: 7.6 14 0.9976889
8: 8.7 16 0.9983762
Please note that the above cumcor function would require more input checking at the beginning (x and y having the same length, start being greater than 0, etc.).
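A minimal hardened variant along those lines (a sketch; cumcor_safe is just an illustrative name, adding only the checks mentioned above):
cumcor_safe <- function(x, y, start = 5) {
  # basic input checking: equal lengths and a valid starting index
  stopifnot(length(x) == length(y), start > 0, start <= length(x))
  c(rep(NA_real_, start - 1),
    sapply(start:length(x), function(k) cor(x[1:k], y[1:k])))
}
cumcor_safe(DT$red, DT$green)  # same values as cumcor() above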
