Replicate rows by different N


I have the following data:
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the row for each id according to n. For example, n = 2.63 for id = 1, so the id = 1 row should be replicated three times. If n = 0.5, the row should appear only once. In other words, n is rounded up.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t, the running total of t within each id.
Here is how the output should look:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4

Take the ceiling of the 'n' column and use it to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(mydata1)). Grouped by the 'id' column, we replicate 1 as many times as the truncated first value of 'n' (tmp <- rep(1, trunc(n[1]))), then take the difference between the group's value of 'n' and the sum of 'tmp' (tmp2 <- n[1] - sum(tmp)). If the difference is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we keep only 'tmp'. The result, 'tmp3', and its cumulative sum (cumsum(tmp3)) are returned in a list to create the two columns 't' and 'taccum'.
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata), ceiling(mydata$n)), ]
setDT(mydata1)[, c('t', 'taccum') := {
    tmp <- rep(1, trunc(n[1]))    # one full unit per whole part of n
    tmp2 <- n[1] - sum(tmp)       # fractional remainder, if any
    tmp3 <- if (tmp2 == 0) tmp else c(tmp, tmp2)
    list(tmp3, cumsum(tmp3))
}, by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00

An alternative that uses base R:
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(
    id = rep(x = mydata$id, times = ceiling(x = mydata$n)),
    n = mydata$n[match(x = rep(x = mydata$id, ceiling(mydata$n)), table = mydata$id)],
    t = rep(x = mydata$n / ceiling(mydata$n), times = ceiling(mydata$n))
)
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t, INDICES = mynewdata$id, FUN = cumsum))
We start by creating a data.frame with three columns: id, n, and t. id is built with rep and ceiling, which repeat each ID the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is the ratio of n to the ceiling of n, repeated the appropriate number of times (again, the ceiling of n). Note that this splits n evenly across the rows of an id: t still sums to n per id, but id = 1 gets three values of about 0.877 rather than the 1, 1, 0.63 shown in the question.
Then, we use cumsum to get the cumulative sum, called through by so that it is computed separately for each group of IDs. You could probably use tapply() here as well.
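If the rows need to follow the 1, 1, 0.63 pattern from the question's expected output, here is a minimal base R sketch. The names reps, k and out are only illustrative, and it assumes each id appears once in mydata.

```r
mydata <- data.frame(id = c(1, 2, 3, 4, 5), n = c(2.63, 1.5, 0.5, 3.5, 4))

reps <- ceiling(mydata$n)                        # number of rows per id
out <- mydata[rep(seq_len(nrow(mydata)), reps), ]
k <- sequence(reps)                              # 1, 2, 3, 1, 2, 1, ... within each id

# each row takes a full unit, except the last one, which takes what is left of n
out$t <- pmin(1, out$n - (k - 1))
out$accumulated.t <- ave(out$t, out$id, FUN = cumsum)
rownames(out) <- NULL
out
```

pmin caps every piece at 1, so only the last row of an id can carry the fractional remainder.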

Related

block diagonal covariance matrix by group of variable

Suppose this is my dataset: 3 subjects, measured at two time points (t1, t2), with each subject measured twice at each time point.
ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2
What is the block diagonal covariance matrix for this dataset? I am assuming the matrix would have three blocks, each block being the 2x2 variance-covariance matrix of subject i for t1 and t2. Is this correct?

The code below computes the covariance matrices for each ID, then creates a block matrix out of them.
# place x and y on the diagonal of a larger matrix, padding with 'fill'
make_block_matrix <- function(x, y, fill = NA) {
    u <- matrix(fill, nrow = dim(x)[1], ncol = dim(y)[2])
    v <- matrix(fill, nrow = dim(y)[1], ncol = dim(x)[2])
    u <- cbind(x, u)
    v <- cbind(v, y)
    rbind(u, v)
}
# compute covariance matrices for each ID
# (the \(x) lambda syntax needs R >= 4.1)
sp <- split(df1, df1$ID)
cov_list <- lapply(sp, \(x) {
    y <- reshape(x[-1], direction = "wide", idvar = "Time", timevar = "Observation")
    cov(y[-1])
})
# no longer needed, tidy up
rm(sp)
# fill with NA's, simple call
# res <- Reduce(make_block_matrix, cov_list)
# fill with zeros
res <- Reduce(\(x, y) make_block_matrix(x, y, fill = 0), cov_list)
# finally, the results' row and column names: <variable>.<ID>,
# repeating each ID once per row of its block so the labels follow the blocks
nms <- paste(unlist(lapply(cov_list, rownames)),
             rep(names(cov_list), times = sapply(cov_list, nrow)), sep = ".")
dimnames(res) <- list(nms, nms)
res
#> Y.1.1 Y.2.1 Y.1.2 Y.2.2 Y.1.3 Y.2.3
#> Y.1.1 0.605 -1.54 0.00 0.000 0.00 0.000
#> Y.2.1 -1.540 3.92 0.00 0.000 0.00 0.000
#> Y.1.2 0.000 0.00 0.32 0.360 0.00 0.000
#> Y.2.2 0.000 0.00 0.36 0.405 0.00 0.000
#> Y.1.3 0.000 0.00 0.00 0.000 0.18 0.510
#> Y.2.3 0.000 0.00 0.00 0.000 0.51 1.445
Created on 2023-02-20 with reprex v2.0.2
Data
df1 <- "ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2"
df1 <- read.table(text = df1, header = TRUE)
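The question asks for blocks indexed by t1 and t2. If that is what is wanted, one possible variation (an assumption about the intent, not part of the answer above) is to swap the roles of Time and Observation in reshape, so that each subject's block is the covariance between the two time points across its repeated observations. The names blocks, sizes and res_time are illustrative.

```r
df1 <- read.table(header = TRUE, text = "ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2")

# one 2x2 covariance matrix per subject, with columns Y.t1 and Y.t2
blocks <- lapply(split(df1, df1$ID), function(d) {
    wide <- reshape(d[c("Time", "Observation", "Y")], direction = "wide",
                    idvar = "Observation", timevar = "Time")
    cov(wide[-1])
})

# copy each block onto the diagonal of a zero matrix
sizes <- vapply(blocks, nrow, integer(1))
res_time <- matrix(0, sum(sizes), sum(sizes))
end <- cumsum(sizes)
start <- end - sizes + 1
for (i in seq_along(blocks)) {
    idx <- start[i]:end[i]
    res_time[idx, idx] <- blocks[[i]]
}
res_time
```

With only two observations per subject and time point, each block is estimated from two pairs of values, so treat the numbers as an illustration of the layout rather than a stable estimate.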

How to calculate values in an R dataframe when columns are dependent on each other

First time poster, so apologies if this question is basic or poorly explained. Grateful to anyone who can help or point me in the right direction!
I would like to do the following within an R dataframe if possible:
Existing Data
Column A is a vector of values, say 10 to 20.
New Data/columns
Column B will be Column A multiplied by Column C.
Column C will be Column C minus Column B from the previous row of data, i.e. data$C[-1] - data$B[-1], apart from the first row of Column C of course, which I will give a fixed value.
I have tried these as separate steps, but I keep overwriting columns B and C, and I have a feeling I have been going about this the wrong way! I could share my code, but I think this would confuse matters!
Thanks in advance!
EDIT TO ADD CODE:
library(dplyr)
A <- c(0.1,0.2,0.3,0.4,0.5)
df1 <- data.frame(A)
df1$B <- 0
df1$C <- 0
df1$C[1] <- 100
df2 <- df1 %>%
    mutate(B = C * A,
           C = lag(C-B))
RESULT FROM THE ABOVE
    A     B     C
1 0.1    10    NA
2 0.2     0    90
3 0.3     0     0
4 0.4     0     0
5 0.5     0     0
EXPECTED OUTPUT
    A     B     C
1 0.1 10.00 100.00
2 0.2 18.00  90.00
3 0.3 21.60  72.00
4 0.4 20.16  50.40
5 0.5 15.12  30.24
C2 = C1 - B1
B2 = C2 * A2

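We can use accumulate from purrr to do the recursive update. tmp holds the running value of C, starting from the fixed first value; B is then the drop between consecutive values.
library(dplyr)
library(purrr)
tmp <- with(df1, accumulate(A, ~ .x - (.x * .y), .init = first(C)))
df2 <- df1 %>%
    mutate(C = head(tmp, -1), B = -diff(tmp))
df2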
# A B C
#1 0.1 10.00 100.00
#2 0.2 18.00 90.00
#3 0.3 21.60 72.00
#4 0.4 20.16 50.40
#5 0.5 15.12 30.24
Or use base R:
tmp <- with(df1, Reduce(function(x, y) x - (x * y), A,
                        accumulate = TRUE, init = C[1]))
df2 <- transform(df1, C = head(tmp, -1), B = -diff(tmp))

If you don't mind a mathematical approach, you can first derive the general expression for the recursion and then translate it into R. Since C[k+1] = C[k] - A[k] * C[k] = C[k] * (1 - A[k]), each C is the initial value times a cumulative product of (1 - A), and B = A * C.
Below is one implementation with base R:
transform(
    transform(
        df1,
        C = C[1] * c(1, cumprod(1 - A)[-nrow(df1)])
    ),
    B = A * C
)
which gives
A B C
1 0.1 10.00 100.00
2 0.2 18.00 90.00
3 0.3 21.60 72.00
4 0.4 20.16 50.40
5 0.5 15.12 30.24
A data.table option along the same lines is
library(data.table)
setDT(df1)[, C := first(C) * c(1, cumprod(1 - A)[-.N])][, B := A * C][]
A B C
1: 0.1 10.00 100.00
2: 0.2 18.00 90.00
3: 0.3 21.60 72.00
4: 0.4 20.16 50.40
5: 0.5 15.12 30.24
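For comparison, the recursion can also be written as an explicit loop and checked against the cumulative-product form. This is only a sketch: C1, B, C and closed are illustrative names, and the data are the values from the question.

```r
A <- c(0.1, 0.2, 0.3, 0.4, 0.5)
C1 <- 100                      # fixed starting value of column C
n <- length(A)

B <- numeric(n)
C <- numeric(n)
C[1] <- C1
for (k in seq_len(n)) {
    if (k > 1) C[k] <- C[k - 1] - B[k - 1]   # C depends on the previous row
    B[k] <- A[k] * C[k]                      # B depends on the current C
}

# closed form: C[k] = C1 * prod(1 - A[1:(k - 1)])
closed <- C1 * c(1, cumprod(1 - A)[-n])
data.frame(A, B, C, closed)
```

The loop is the most literal reading of the two rules; the vectorised answers above give the same numbers without iterating row by row.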

Assign bin value according to vector of thresholds

I have a vector of thresholds that I want to use for creating bins of a column in a data.table:
library(data.table)
thrshlds <- seq(from = 0, to = 1, by = 0.05)
test <- data.table(
    A = rnorm(1000, 0.7, 1),
    B = rbinom(1000, 3, 0.6)
)
The logic that I'm looking to implement is:
If the value of column A is less than or equal to a threshold, assign it the value of the first threshold for which that holds. Similar to a SQL CASE WHEN, but without manually writing out each threshold value.
Something like:
test[, new_category := fcase(A <= thrshlds[1], thrshlds[1],
A <= thrshlds[2], thrshlds[2],
.....)]
But I don't know how to do this kind of iteration inside a data.table query.
Thanks!

You can use cut, which returns a factor labelled with the thresholds (values above the last threshold become NA):
library(data.table)
test[, new_category := cut(A, c(-Inf, thrshlds), thrshlds)]
test
# A B new_category
# 1: 0.220744413 3 0.25
# 2: -0.814886795 3 0
# 3: 1.134536656 2 <NA>
# 4: 0.180463333 1 0.2
# 5: -0.134559033 1 0
# ---
# 996: -0.332559649 1 0
# 997: 0.585641110 0 0.6
# 998: 0.765738832 2 0.8
# 999: 2.167632026 2 <NA>
#1000: 0.008935421 2 0.05
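If a numeric column is preferred over a factor, a possible variation is to index the threshold vector with findInterval. This is a sketch on a small hand-picked vector; bin_to_threshold is a hypothetical helper name, not something from data.table.

```r
thrshlds <- seq(from = 0, to = 1, by = 0.05)

# hypothetical helper: smallest threshold that is >= x, NA if x exceeds them all
bin_to_threshold <- function(x, thresholds) {
    # with left.open = TRUE, findInterval counts the thresholds strictly below x
    idx <- findInterval(x, thresholds, left.open = TRUE) + 1
    thresholds[idx]            # an index past the end yields NA
}

A <- c(-0.5, 0, 0.03, 0.22, 0.97, 1.3)
new_category <- bin_to_threshold(A, thrshlds)
new_category
```

Inside a data.table this could be used as test[, new_category := bin_to_threshold(A, thrshlds)], assuming the thresholds are sorted in increasing order.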

Not sure if this is an appropriate method or not, but here's a rolling join option that seems to work:
test[, new_category := data.table(thrshlds)[test, on="thrshlds==A", x.thrshlds, roll=-Inf] ]
#test[sample(1000, 12)]
# A B new_category
# 1: -1.1317742 3 0.00
# 2: 0.2926608 2 0.30
# 3: 1.5441214 2 NA
# 4: 0.9249706 1 0.95
# 5: 1.2663975 2 NA
# 6: 0.6472989 0 0.65
# 7: -0.5606153 2 0.00
# 8: 0.4439064 2 0.45
# 9: 0.8182938 1 0.85
#10: 0.8461909 2 0.85
#11: 1.0237554 1 NA
#12: 0.7752323 1 0.80

dataframe column wise subtraction and division.

I need help with column-wise subtraction and division across N columns. Below are the columns of the input dataframe.
input dataframe:
> df
A B C D
1 1 3 6 2
2 3 3 3 4
3 1 2 2 2
4 4 4 4 4
5 5 2 3 2
Formula: keep the first column (a) as is; every later column b becomes (b - a) / (1 - a), where a is the column to its left.
MY CODE
ABC <- cbind.data.frame(DF[1], (DF[-1] - DF[-ncol(DF)])/(1 - DF[-ncol(DF)]))
Expected output:
A B C D
1 Inf -1.5 0.8
3 0.00 0.0 -0.5
1 Inf 0.0 0.0
4 0.00 0.0 0.0
5 0.75 -1.0 0.5
But I don't want to use ncol here, because the actual dataframe has one more column after column D.
I want to apply this formula only to the first 4 columns. If I use ncol, it will run through to the last column of the dataframe.
Please help, thanks.

What about trying:
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
df_2 <- matrix((df[,2]-df[,1])/(1-df[,1]),5,1)
df_3 <- matrix((df[,3]-df[,2])/(1-df[,2]),5,1)
df_4 <- matrix((df[,4]-df[,3])/(1-df[,3]),5,1)
cbind(df[,1],df_2,df_3,df_4)
edit: a loop version
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
test_bind <- c()
test_bind <- cbind(test_bind, df[,1])
for (i in 1:3) {
    df_1 <- matrix((df[,i+1]-df[,i])/(1-df[,i]),5,1)
    test_bind <- cbind(test_bind, df_1)
}
test_bind

Here is one option with the tidyverse:
library(dplyr)
library(purrr)
map2_df(DF[2:4], DF[1:3], ~ (.x - .y)/(1- .y)) %>%
bind_cols(DF[1], .)
# A B C D
#1 1 Inf -1.5 0.8
#2 3 0.00 0.0 -0.5
#3 1 Inf 0.0 0.0
#4 4 0.00 0.0 0.0
#5 5 0.75 -1.0 0.5
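A base R sketch of the same idea that avoids ncol by naming the columns to process. The extra column E stands in for the unnamed trailing column mentioned in the question, and cols, cur, prev and out are illustrative names.

```r
DF <- data.frame(A = c(1, 3, 1, 4, 5),
                 B = c(3, 3, 2, 4, 2),
                 C = c(6, 3, 2, 4, 3),
                 D = c(2, 4, 2, 4, 2),
                 E = letters[1:5])       # assumed extra column, left untouched

cols <- c("A", "B", "C", "D")            # only these columns take part
cur  <- DF[cols[-1]]                     # B, C, D
prev <- DF[cols[-length(cols)]]          # A, B, C (each column's left neighbour)

out <- DF
out[cols[-1]] <- (cur - prev) / (1 - prev)
out
```

Selecting by name (or by fixed positions such as 1:4) keeps the calculation independent of how many other columns the data frame has.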

Summing values in columns based on other values in R

I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is. I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements. I kept getting an error when the price was 'NA' and so I added the is.na():
for(i in fake.list){
    sum=0
    for(j in fakedata$fakeprice){
        if(is.na(fakedata$fakeprice[j])==TRUE){
            NULL
        } else {
            if(fakedata$fakeprice[j]==fake.list[i]){
                sum <- sum+fakedata$fakeconversion[j]
            }
        }
    }
    fake.sum[i]=sum
}
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!

aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
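The na.rm = TRUE argument is passed on to sum, so the NA conversion recorded for fakeprice 4 is ignored instead of turning that group's sum into NA. Rows where fakeprice itself is NA are dropped, because aggregate omits rows with missing values in the grouping variables.
The aggregate function works by splitting your data into subsets by some grouping and then running a function, FUN, on each subset.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on. by is a list of grouping elements; give it several elements if you wish to split the data by multiple columns.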
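As for why the original loops give the wrong totals: for(j in fakedata$fakeprice) walks over the price values (1, 2, 2, 1, NA, ...), not the row numbers, so fakedata$fakeprice[j] only ever looks at the first few rows and counts them repeatedly. A minimal corrected sketch, iterating over positions with seq_along and comparing it with the aggregate call (keep and agg are illustrative names):

```r
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <- c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)

fake.list <- sort(unique(fakedata$fakeprice))   # sort() drops the NA price
fake.sum <- numeric(length(fake.list))

for (i in seq_along(fake.list)) {
    # rows with a known price equal to the current one
    keep <- !is.na(fakedata$fakeprice) & fakedata$fakeprice == fake.list[i]
    fake.sum[i] <- sum(fakedata$fakeconversion[keep], na.rm = TRUE)
}
sumdata <- data.frame(fake.list, fake.sum)

# same totals from aggregate
agg <- aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice),
                 sum, na.rm = TRUE)
sumdata
```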
