Replicate rows by different N - r

I have the following data:
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the row for each id according to n. For example, n=2.63 for id=1, so the id=1 row should be replicated three times. If n=0.5, the row should appear only once, so n needs to be rounded up.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t.
Here is how the output should look:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4

Get the ceiling of the 'n' column and use that to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(mydata1)). Grouped by the 'id' column, we replicate (rep) 1 with times specified as the trunc of the first value of 'n' (rep(1, trunc(n[1]))), giving 'tmp'. We then take the difference between the value of 'n' for the group and the sum of 'tmp' (n[1]-sum(tmp)), giving 'tmp2'. If the difference is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we take only 'tmp'. The result, 'tmp3', is placed in a list together with its cumulative sum (cumsum(tmp3)) to create the two columns 't' and 'taccum'.
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata), ceiling(mydata$n)), ]
setDT(mydata1)[, c('t', 'taccum') := {
    tmp <- rep(1, trunc(n[1]))
    tmp2 <- n[1] - sum(tmp)
    tmp3 <- if (tmp2 == 0) tmp else c(tmp, tmp2)
    list(tmp3, cumsum(tmp3))
}, by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00

An alternative that utilizes base R.
mydata <- data.frame(id = c(1, 2, 3, 4, 5), n = c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(
    id = rep(x = mydata$id, times = ceiling(x = mydata$n)),
    n = mydata$n[match(x = rep(x = mydata$id, ceiling(mydata$n)), table = mydata$id)],
    t = rep(x = mydata$n / ceiling(mydata$n), times = ceiling(mydata$n))
)
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t, INDICES = mynewdata$id, FUN = cumsum))
We start by creating a data.frame with three columns: id, n, and t. id is calculated using rep and ceiling to repeat the ID variable the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is the ratio of n to the ceiling of n, repeated the appropriate number of times (in this case, the ceiling of n again). Note that this splits n evenly across the rows of an id (about 0.877 three times for id 1) rather than 1, 1, 0.63 as in the question's example; the per-id sum still equals n.
Then, we use cumsum to get the cumulative sum, called through by to allow by-group processing for each group of IDs. You could probably use tapply() here as well.
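If the 1, 1, 0.63 pattern from the question is required, one possible base R sketch is shown below. It is not part of the answer above; the names reps, k and expanded are made up for illustration, and it relies on pmin to cap each row's share at 1.

```r
mydata <- data.frame(id = c(1, 2, 3, 4, 5), n = c(2.63, 1.5, 0.5, 3.5, 4))

# Number of rows per id: n rounded up
reps <- ceiling(mydata$n)
expanded <- mydata[rep(seq_len(nrow(mydata)), reps), ]
rownames(expanded) <- NULL

# Position of each row within its id: 1, 2, ..., ceiling(n)
k <- sequence(reps)

# Full units get 1; the last row of an id gets whatever is left of n
expanded$t <- pmin(1, expanded$n - (k - 1))

# Running total of t within each id
expanded$accumulated.t <- ave(expanded$t, expanded$id, FUN = cumsum)
expanded
```

Because the remainder is computed by subtraction, values such as 0.63 are subject to the usual floating-point rounding, so compare them with all.equal rather than ==.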

Related

block diagonal covariance matrix by group of variable

Suppose this is my dataset: 3 subjects, measured at two time points (t1, t2), with each subject measured twice at each time point.
ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2
What is the block diagonal covariance matrix for this dataset? I am assuming the matrix would have three blocks, each block being the 2x2 variance-covariance matrix of subject i across t1 and t2. Is this correct?
The code below computes the covariance matrices for each ID, then creates a block matrix out of them.
make_block_matrix <- function(x, y, fill = NA) {
    u <- matrix(fill, nrow = dim(x)[1], ncol = dim(y)[2])
    v <- matrix(fill, nrow = dim(y)[1], ncol = dim(x)[2])
    u <- cbind(x, u)
    v <- cbind(v, y)
    rbind(u, v)
}
# compute covariance matrices for each ID
sp <- split(df1, df1$ID)
cov_list <- lapply(sp, \(x) {
    y <- reshape(x[-1], direction = "wide", idvar = "Time", timevar = "Observation")
    cov(y[-1])
})
# no longer needed, tidy up
rm(sp)
# fill with NA's, simple call
# res <- Reduce(make_block_matrix, cov_list)
# fill with zeros
res <- Reduce(\(x, y) make_block_matrix(x, y, fill = 0), cov_list)
# finally, the results' row and column names: <variable>.<ID>
# (repeat each ID once per row of its block so labels stay aligned with the blocks)
nms <- paste(unlist(lapply(cov_list, rownames)),
             rep(names(cov_list), sapply(cov_list, nrow)), sep = ".")
dimnames(res) <- list(nms, nms)
res
#>        Y.1.1 Y.2.1 Y.1.2 Y.2.2 Y.1.3 Y.2.3
#> Y.1.1  0.605 -1.54  0.00 0.000  0.00 0.000
#> Y.2.1 -1.540  3.92  0.00 0.000  0.00 0.000
#> Y.1.2  0.000  0.00  0.32 0.360  0.00 0.000
#> Y.2.2  0.000  0.00  0.36 0.405  0.00 0.000
#> Y.1.3  0.000  0.00  0.00 0.000  0.18 0.510
#> Y.2.3  0.000  0.00  0.00 0.000  0.51 1.445
Created on 2023-02-20 with reprex v2.0.2
Data
df1 <- "ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2"
df1 <- read.table(text = df1, header = TRUE)
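As a rough alternative to chaining make_block_matrix through Reduce, the result can be preallocated and each block copied onto the diagonal. The helper name block_diag is hypothetical, the sketch assumes every block is square, and the three matrices below are the per-ID covariances from the output above, typed in by hand.

```r
# Hypothetical helper: place square matrices along the diagonal of one matrix
block_diag <- function(blocks, fill = 0) {
    sizes <- vapply(blocks, nrow, integer(1))
    ends <- cumsum(sizes)
    starts <- ends - sizes + 1
    out <- matrix(fill, nrow = sum(sizes), ncol = sum(sizes))
    for (i in seq_along(blocks)) {
        idx <- starts[i]:ends[i]
        out[idx, idx] <- blocks[[i]]
    }
    out
}

# Per-ID covariance matrices (values copied from the answer's output)
cov_list <- list(
    `1` = matrix(c(0.605, -1.54, -1.54, 3.92), nrow = 2),
    `2` = matrix(c(0.32, 0.36, 0.36, 0.405), nrow = 2),
    `3` = matrix(c(0.18, 0.51, 0.51, 1.445), nrow = 2)
)
res <- block_diag(cov_list)
res
```

The Matrix package's bdiag function does the same job and returns a sparse matrix, which may be preferable when there are many subjects.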

How to calculate values in an R dataframe when columns are dependent on each other

First time poster, so apologies if this question is basic or poorly explained. Grateful to anyone who can help or point me in the right direction!
I would like to do the following within an R dataframe if possible:
Existing Data
Column A is a vector of values, say 10 to 20.
New Data/columns
Column B will be Column A multiplied by Column C
Column C will be Column C minus Column B from the previous row of data, i.e. data$C[-1] - data$B[-1], apart from the first row of Column C of course, which I will give a fixed value.
I have tried these as separate steps, but I keep overwriting columns B and C, and have a feeling I have been going about this the wrong way! I could share my code, but I think this would confuse matters!
Thanks in advance!
EDIT TO ADD CODE:
A <- c(0.1,0.2,0.3,0.4,0.5)
df1 <- data.frame(A)
df1$B <- 0
df1$C <- 0
df1$C[1] <- 100
df2 <- df1 %>%
    mutate(B = C * A,
           C = lag(C-B))
RESULT FROM THE ABOVE
    A  B  C
1 0.1 10 NA
2 0.2  0 90
3 0.3  0  0
4 0.4  0  0
5 0.5  0  0
EXPECTED OUTPUT
    A     B      C
1 0.1 10.00 100.00
2 0.2 18.00  90.00
3 0.3 21.60  72.00
4 0.4 20.16  50.40
5 0.5 15.12  30.24
C2 = C1 - B1
B2 = C2 * A2
We can use accumulate from purrr to do the recursive update:
library(dplyr)
library(purrr)
tmp <- with(df1, accumulate(A, ~ .x - (.x * .y), .init = first(C)))
df2 <- df1 %>%
    mutate(C = head(tmp, -1), B = -diff(tmp))
df2
# A B C
#1 0.1 10.00 100.00
#2 0.2 18.00 90.00
#3 0.3 21.60 72.00
#4 0.4 20.16 50.40
#5 0.5 15.12 30.24
Or use base R
tmp <- with(df1, Reduce(function(x, y) x - (x * y), A,
accumulate = TRUE, init = C[1]))
df2 <- transform(df1, C = head(tmp, -1), B = -diff(tmp))
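For readers who find the accumulate/Reduce form hard to follow, the same recursion can be written as an explicit loop. This is only an illustrative sketch using the question's example values; it fills C and B row by row, which is exactly the dependency a single vectorised mutate call cannot express.

```r
A <- c(0.1, 0.2, 0.3, 0.4, 0.5)
n <- length(A)

# Preallocate both dependent columns
B <- numeric(n)
C <- numeric(n)
C[1] <- 100  # fixed starting value from the question

for (i in seq_len(n)) {
    # C depends on the previous row's C and B
    if (i > 1) C[i] <- C[i - 1] - B[i - 1]
    # B depends on the current row's C
    B[i] <- C[i] * A[i]
}

df2 <- data.frame(A, B, C)
df2
```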
If you don't mind using a mathematical approach, you can first derive the general expression for the recursion and then write the R code from it.
Below is one implementation with base R:
transform(
    transform(
        df1,
        C = C[1] * c(1, cumprod(1 - A)[-nrow(df1)])
    ),
    B = A * C
)
which gives
A B C
1 0.1 10.00 100.00
2 0.2 18.00 90.00
3 0.3 21.60 72.00
4 0.4 20.16 50.40
5 0.5 15.12 30.24
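The closed form behind the cumprod call can be reconstructed from the two rules in the question (C2 = C1 - B1, B2 = C2 * A2). The subscript notation below is introduced here for illustration and is not in the original answer.

```latex
\begin{aligned}
C_{k+1} &= C_k - B_k = C_k - A_k C_k = (1 - A_k)\, C_k \\
C_k &= C_1 \prod_{i=1}^{k-1} (1 - A_i), \qquad B_k = A_k\, C_k
\end{aligned}
```

As a check against the table: C_3 = 100 × 0.9 × 0.8 = 72 and B_3 = 0.3 × 72 = 21.6.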
A data.table option in a similar manner is
> setDT(df1)[, C := first(C) * c(1, cumprod(1 - A)[-.N])][, B := A * C][]
A B C
1: 0.1 10.00 100.00
2: 0.2 18.00 90.00
3: 0.3 21.60 72.00
4: 0.4 20.16 50.40
5: 0.5 15.12 30.24

Assign bin value according to vector of thresholds

I have a vector of thresholds that I want to use for creating bins of a column on a data.table
thrshlds <- seq(from = 0, to = 1, by = 0.05)
test <- data.table(
    A = rnorm(1000, 0.7, 1),
    B = rbinom(1000, 3, 0.6)
)
The logic that I'm looking to implement is:
If the value of column A is less than or equal to a threshold, assign it that threshold's value (the first threshold it does not exceed). This is similar to a SQL CASE WHEN, but without manually assigning each threshold value.
Something like:
test[, new_category := fcase(A <= thrshlds[1], thrshlds[1],
                             A <= thrshlds[2], thrshlds[2],
                             .....)]
But I don't know how to do this kind of iteration inside a data.table query.
Thanks!
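You can use cut. The result is a factor labelled with the thresholds, and values above the largest threshold become NA: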
library(data.table)
test[, new_category := cut(A, c(-Inf, thrshlds), thrshlds)]
test
# A B new_category
# 1: 0.220744413 3 0.25
# 2: -0.814886795 3 0
# 3: 1.134536656 2 <NA>
# 4: 0.180463333 1 0.2
# 5: -0.134559033 1 0
# ---
# 996: -0.332559649 1 0
# 997: 0.585641110 0 0.6
# 998: 0.765738832 2 0.8
# 999: 2.167632026 2 <NA>
#1000: 0.008935421 2 0.05
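If a numeric column is wanted rather than a factor, one possible variation is to index the threshold vector with findInterval. This is a sketch, not part of the answer above; the four values of A are made up for illustration.

```r
thrshlds <- seq(from = 0, to = 1, by = 0.05)
A <- c(-0.5, 0.22, 0.58, 1.13)  # made-up values for illustration

# left.open = TRUE makes the intervals (lower, upper], matching "A <= threshold".
# findInterval returns 0 for values at or below the first threshold, hence the + 1.
idx <- findInterval(A, thrshlds, left.open = TRUE) + 1

# Indexing past the end of thrshlds yields NA for values above the last threshold
new_category <- thrshlds[idx]
new_category
```

Inside a data.table the same expression should work as test[, new_category := thrshlds[findInterval(A, thrshlds, left.open = TRUE) + 1]]. Thresholds produced by seq are floating-point values, so a value that sits exactly on a threshold may land on either side of it.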
Not sure if this is an appropriate method or not, but here's a rolling join option that seems to work:
test[, new_category := data.table(thrshlds)[test, on="thrshlds==A", x.thrshlds, roll=-Inf] ]
#test[sample(1000, 12)]
# A B new_category
# 1: -1.1317742 3 0.00
# 2: 0.2926608 2 0.30
# 3: 1.5441214 2 NA
# 4: 0.9249706 1 0.95
# 5: 1.2663975 2 NA
# 6: 0.6472989 0 0.65
# 7: -0.5606153 2 0.00
# 8: 0.4439064 2 0.45
# 9: 0.8182938 1 0.85
#10: 0.8461909 2 0.85
#11: 1.0237554 1 NA
#12: 0.7752323 1 0.80

dataframe column wise subtraction and division.

I need help with column-wise subtraction and division across N columns. Below are the columns of the input dataframe.
input dataframe:
> df
A B C D
1 1 3 6 2
2 3 3 3 4
3 1 2 2 2
4 4 4 4 4
5 5 2 3 2
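Formula: a, (b - a) / (1 - a), i.e. the first column is kept as is and each later column b is combined with the column a before it.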
MY CODE
ABC <- cbind.data.frame(DF[1], (DF[-1] - DF[-ncol(DF)])/(1 - DF[-ncol(DF)]))
Expected output:
A B C D
1 Inf -1.5 0.8
3 0.00 0.0 -0.5
1 Inf 0.0 0.0
4 0.00 0.0 0.0
5 0.75 -1.0 0.5
But I don't want to use ncol here, because the actual dataframe has one more column after column D.
I want to apply this formula only up to the first 4 columns; if I use ncol, it will run through to the last column in the dataframe.
Please help, thanks.
What about trying:
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
df_2 <- matrix((df[,2]-df[,1])/(1-df[,1]),5,1)
df_3 <- matrix((df[,3]-df[,2])/(1-df[,2]),5,1)
df_4 <- matrix((df[,4]-df[,3])/(1-df[,3]),5,1)
cbind(df[,1],df_2,df_3,df_4)
edit: a loop version
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
test_bind <- c()
test_bind <- cbind(test_bind, df[,1])
for (i in 1:3) {
    df_1 <- matrix((df[,i+1]-df[,i])/(1-df[,i]),5,1)
    test_bind <- cbind(test_bind,df_1)
}
test_bind
Here is one option with the tidyverse:
library(dplyr)
library(purrr)
map2_df(DF[2:4], DF[1:3], ~ (.x - .y)/(1- .y)) %>%
    bind_cols(DF[1], .)
# A B C D
#1 1 Inf -1.5 0.8
#2 3 0.00 0.0 -0.5
#3 1 Inf 0.0 0.0
#4 4 0.00 0.0 0.0
#5 5 0.75 -1.0 0.5
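A base R sketch that avoids ncol altogether is to name the columns the formula should cover and leave everything else alone. The extra column E below is a hypothetical stand-in for the "last column" mentioned in the question, and cols, cur and prev are illustrative names.

```r
DF <- data.frame(
    A = c(1, 3, 1, 4, 5),
    B = c(3, 3, 2, 4, 2),
    C = c(6, 3, 2, 4, 3),
    D = c(2, 4, 2, 4, 2),
    E = letters[1:5]  # hypothetical trailing column that must stay untouched
)

cols <- c("A", "B", "C", "D")       # columns the formula applies to
cur <- DF[cols[-1]]                 # B, C, D
prev <- DF[cols[-length(cols)]]     # A, B, C (the column before each of them)

out <- DF
out[cols[-1]] <- (cur - prev) / (1 - prev)
out
```

Division by zero in the first and third rows gives Inf for column B, as in the expected output.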

Summing values in columns based on other values in R

I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is. I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements. I kept getting an error when the price was 'NA' and so I added the is.na():
for(i in fake.list){
    sum=0
    for(j in fakedata$fakeprice){
        if(is.na(fakedata$fakeprice[j])==TRUE){
            NULL
        } else {
            if(fakedata$fakeprice[j]==fake.list[i]){
                sum <- sum+fakedata$fakeconversion[j]
            }}
    }
    fake.sum[i]=sum
}
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
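The na.rm = TRUE argument is passed on to sum, so the NA conversion at fakeprice 4 is ignored. Rows where fakeprice itself is NA are dropped by aggregate.
The aggregate function works by splitting your data by the grouping variable(s) and then running a function, FUN, on each piece.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on. by must be a list of grouping vectors; give it several elements if you wish to split the data by multiple columns.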
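As for why the loops in the question misbehave: for(j in fakedata$fakeprice) walks over the price values (1, 2, 2, 1, NA, ...) and then uses them as row numbers, so rows 1 and 4 are each visited twice for price 1, which is where 0.90 comes from. A minimal corrected sketch of the loop, iterating over positions instead, could look like this (the variable names total, price and conv are illustrative):

```r
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <- c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)

fake.list <- sort(unique(fakedata$fakeprice))  # sort() drops the NA
fake.sum <- numeric(length(fake.list))

for (i in seq_along(fake.list)) {
    total <- 0
    for (j in seq_len(nrow(fakedata))) {
        price <- fakedata$fakeprice[j]
        conv <- fakedata$fakeconversion[j]
        # skip rows where either value is missing
        if (!is.na(price) && !is.na(conv) && price == fake.list[i]) {
            total <- total + conv
        }
    }
    fake.sum[i] <- total
}

sumdata <- data.frame(fake.list, fake.sum)
sumdata
```

The aggregate call gives the same sums without explicit loops and is the more idiomatic choice.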
