This is related to the earlier question R: use the newly generated data in the previous row.
I realized the actual problem I was faced with is a bit more complicated than the example I gave in that thread: it seems I have to pass 3 arguments to the recursive calculation to achieve what I want, so accumulate2 or reduce may not work. I am therefore opening a new question here to avoid possible confusion.
I have the following dataset grouped by ID:
ID <- c(1, 2, 2, 3, 3, 3)
pw <- c(1:6)
add <- c(1, 2, 3, 5, 7, 8)
x <- c(1, 2, NA, 4, NA, NA)
df <- data.frame(ID, pw, add, x)
df
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 NA
4 3 4 5 4
5 3 5 7 NA
6 3 6 8 NA
Within each group, for column x, I want to keep the value of the first row as it is and fill in the remaining rows by raising the lagged (previously computed) value to the power stored in pw and then adding the value stored in add. The lagged value should be updated as I proceed, so I would like to have:
ID pw add x
1 1 1 1 1
2 2 2 2 2
3 2 3 3 2^3 + 3
4 3 4 5 4
5 3 5 7 4^5 + 7
6 3 6 8 (4^5 + 7)^6 + 8
I have to apply this calculation to a large dataset, so it would be perfect if there is a fast way to do this!
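For reference, evaluating those formulas numerically (these are the values a correct solution should reproduce):
2^3 + 3          # 11
4^5 + 7          # 1031
(4^5 + 7)^6 + 8  # approximately 1.201025e+18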
If we want to use accumulate2, then we need to specify the arguments correctly: it takes two input arguments, 'pw' and 'add', and an initialization argument, which is the first value of 'x'. As the data is grouped by 'ID', do the grouping before calling accumulate2, refer to the lambda's default arguments ..1, ..2 and ..3 in that order, and build the recursive function from them.
library(dplyr)
library(purrr)
out <- df %>%
  group_by(ID) %>%
  mutate(x1 = accumulate2(pw[-1], add[-1], ~ ..1^..2 + ..3,
                          .init = first(x)) %>%
           flatten_dbl) %>%
  ungroup
out$x1
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
With more than 3 arguments, a for loop would be better
# // initialize an empty vector
out <- c()
# // loop over the `unique` ID
for(id in unique(df$ID)) {
  # // create a temporary subset of data based on that id
  tmp_df <- subset(df, ID == id)
  # // initialize a temporary storage output
  tmp_out <- numeric(nrow(tmp_df))
  # // initialize first value with the first element of x
  tmp_out[1] <- tmp_df$x[1]
  # // if the number of rows is greater than 1
  if(nrow(tmp_df) > 1) {
    # // loop over the rows
    for(i in 2:nrow(tmp_df)) {
      # // do the recursive calculation and update
      tmp_out[i] <- tmp_out[i - 1]^tmp_df$pw[i] + tmp_df$add[i]
    }
  }
  out <- c(out, tmp_out)
}
out
#[1] 1 2 11
#[4] 4 1031 1201024845477409792
In base R we could use the following solution for more than two arguments.
In this solution I first subset the original data set on ID values.
Then I choose the row ids through seq_len(nrow(tmp))[-1], omitting the first row id since it is provided by init.
In the anonymous function used in Reduce, the b argument represents the accumulated/previous value starting from init, and c represents the new/current value of our vector, which is the row numbers.
So in every iteration our previous value (starting from init) is raised to the power of the new value from pw and summed with the new value from add.
cbind(df[-length(df)], unlist(lapply(unique(df$ID), function(a) {
  tmp <- subset(df, df$ID == a)
  Reduce(function(b, c) {
    b ^ tmp$pw[c] + tmp$add[c]
  }, init = tmp$x[1],
  seq_len(nrow(tmp))[-1], accumulate = TRUE)
}))) |> setNames(names(df))
ID pw add x
1 1 1 1 1.000000e+00
2 2 2 2 2.000000e+00
3 2 3 3 1.100000e+01
4 3 4 5 4.000000e+00
5 3 5 7 1.031000e+03
6 3 6 8 1.201025e+18
Data
structure(list(ID = c(1, 2, 2, 3, 3, 3), pw = 1:6, add = c(1,
2, 3, 5, 7, 8), x = c(1, 2, NA, 4, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
Base R, not using Reduce() but rather a while() loop:
# Split-apply-combine while loop: res => data.frame
res <- do.call(rbind, lapply(with(df, split(df, ID)), function(y){
  # While there are any NAs in x:
  while(any(is.na(y$x))){
    # Store the index of the first NA value: idx => integer scalar
    idx <- with(y, head(which(is.na(x)), 1))
    # Calculate x at that index using the business rule provided:
    # x => numeric vector
    y$x[idx] <- with(y, x[(idx-1)] ** pw[idx] + add[idx])
  }
  # Explicitly define the return object: y => GlobalEnv
  y
}))
OR recursive function:
# Recursive function: estimation_func => function()
estimation_func <- function(value_vec, exponent_vec, add_vec){
  # Specify the termination condition; when all elements
  # of value_vec are no longer NA:
  if(all(!(is.na(value_vec)))){
    # Return value_vec: numeric vector => GlobalEnv
    return(value_vec)
    # Otherwise recursively apply the below:
  }else{
    # Store the index of the first NA value: idx => integer vector
    idx <- Position(is.na, value_vec)
    # Calculate the value of the value_vec at that index;
    # using the provided business logic: value_vec => numeric vector
    value_vec[idx] <- (value_vec[(idx-1)] ** exponent_vec[idx]) + add_vec[idx]
    # Recursively apply function: function => Local Env
    return(estimation_func(value_vec, exponent_vec, add_vec))
  }
}
# Split data.frame into a list on ID;
# Overwrite x values, applying recursive function;
# Combine list into a data.frame
# res => data.frame
res <- data.frame(
  do.call(
    rbind,
    Map(function(y){y$x <- estimation_func(y$x, y$pw, y$add); y}, split(df, df$ID))
  ), row.names = NULL
)
Related
My dataframe is structured like the following:
ID A_L A_R B_L B_R
1 7 5 6 3
2 3 2 3 1
3 6 3 4 5
The goal is to create a new column for each existing column (besides the first column ID) dividing the value of the existing column through its L/R counterpart. So A_L_ratio = A_L/A_R and A_R_ratio = A_R/A_L etc.
I've tried to create a for-loop, using if/else to differentiate between odd and even indices.
for (col in 2:length(df)) {
if( (col%%2) == 0){
a <- df[,col] / df[,col+1]}
else{
a <- df[,col] / df[,col-1]}
df[colnames(df[col])"_ratio"] <- a
}
But I seem to fail at R's syntax when it comes to naming the columns. The name should be the name of the column referenced in each loop iteration, df[,col], plus the string _ratio. At the end I want to append that column to df. Could someone tell me the right syntax to do this? Thanks a lot!
You need to paste the colnames to the string "_ratio". Something like this, maybe:
# Create the data.frame
df <- data.frame(
ID = 1:3,
A_L = c(7, 3, 6),
A_R = c(5, 2, 3),
B_L = c(6, 3, 4),
B_R = c(3, 1, 5)
)
# create the cols with "_ratio" character appended
for (col in 2:length(df)) {
  if ((col %% 2) == 0) {
    a <- df[, col] / df[, col + 1]
  } else {
    a <- df[, col] / df[, col - 1]
  }
  df[paste(colnames(df[col]), "_ratio", sep = "")] <- a
}
There are easier and more efficient ways to do this using the dplyr package, though.
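For instance, here is a rough sketch of one such dplyr approach (not part of the answer above, and assuming every column pair follows the <name>_L / <name>_R naming pattern); cur_column() returns the name of the column currently being processed and get() looks up its counterpart column:
library(dplyr)
df %>%
  mutate(
    across(ends_with("_L"), ~ .x / get(sub("_L$", "_R", cur_column())),
           .names = "{.col}_ratio"),
    across(ends_with("_R"), ~ .x / get(sub("_R$", "_L", cur_column())),
           .names = "{.col}_ratio")
  )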
Don't know how important this is, but I got the impression you have many more columns than what is shown. If so, it's better to take it nice and slow so you don't get errors.
If safety checks are not needed, then disregard this.
df <- read.table( text="
ID A_L A_R B_L B_R
1 7 5 6 3
2 3 2 3 1
3 6 3 4 5
", header=TRUE )
var.names.L <- grep( "_L$", colnames(df) , value=TRUE )
var.names.R <- sub( "_L", "_R", var.names.L )
i.has.R.name <- var.names.R %in% colnames(df)
ok.L.names <- var.names.L[i.has.R.name]
ok.R.names <- var.names.R[i.has.R.name]
new.columns.1 <- df[, ok.L.names ] / df[, ok.R.names ]
colnames(new.columns.1) <- paste0( colnames(new.columns.1), "_ratio" )
new.columns.2 <- df[, ok.R.names ] / df[, ok.L.names ]
colnames(new.columns.2) <- paste0( colnames(new.columns.2), "_ratio" )
cbind.data.frame(
df,
new.columns.1,
new.columns.2
)
The above code nice and neatly checks that for every _L column there is a corresponding _R column, and then it performs the division with only those columns.
Output:
ID A_L A_R B_L B_R A_L_ratio B_L_ratio A_R_ratio B_R_ratio
1 1 7 5 6 3 1.4 2.0 0.7142857 0.5000000
2 2 3 2 3 1 1.5 3.0 0.6666667 0.3333333
3 3 6 3 4 5 2.0 0.8 0.5000000 1.2500000
For example, I have a vector below:
a = c(1,1,1,1,1,2,2,2,3,3,3,3)
Now I want to randomly pick 4 of the elements and change them into different values. For instance,
if the elements I pick are 1, 1, 2, 3, then I need to change them randomly, like 2, 3, 1, 2.
The resulting vector is the following
a' = c(1,2,3,1,1,2,1,2,3,3,3,2)
No idea how to do this.
Maybe this function helps:
# #param vec - input vector
# #param n - number of values to replace
# #param n1 - number of unique value threshold
# #return replaced sampled vector
sample_fn <- function(vec, n, n1) {
  flag <- TRUE
  while(flag) {
    # // sample the positions
    pos <- sample(seq_along(vec), n, replace = FALSE)
    print(pos)
    # // extract the values based on the position index
    as <- vec[pos]
    # // get the unique values
    un1 <- unique(as)
    print(un1)
    if(length(un1) > n1)
      flag <- FALSE
  }
  # // sample the unique values and set them as names of the unique values
  # // use the named vector to match and replace
  # // assign the output back to the same positions in the vector
  vec[pos] <- setNames(sample(un1), un1)[as.character(as)]
  vec
}
sample_fn(a, 4, 2)
#[1] 10 1 12 2
#[1] 3 1
#[1] 1 8 4 3
#[1] 1 2
#[1] 7 11 4 12
#[1] 2 3 1
# [1] 1 1 1 2 1 2 3 2 3 3 1 1
I am not sure if the values for random replacement are also from a. If so, the code below might be an option
replace(a,sample(seq_along(a),4),sample(unique(a),4,replace = TRUE))
This is my first time asking a question and hopefully I can get your help!
I need to remove rows that have values for only one or two genes using R.
Basically, I need to get rid of 50S, ABCC8, and ACAT1 because these have n < 3.
My desired output is
thank you very much!
If this is in a data.frame, you can use the dplyr package to do some manipulation. We can group the data by Genes, count how many instances there are, and then simply set the filter criteria to remove the records.
require(dplyr)
df <- data.frame(
Genes=c('50S' ,'abcb1' ,'abcb1' ,'abcb1' ,'ABCC8' ,'ABL' ,'ABL' ,'ABL' ,'ABL' ,'ACAT1' ,'ACAT1' ),
Values=c(-0.627323448, -0.226358414, 0.347305901 ,0.371632631 ,0.099485307 ,0.078512979 ,-0.426643782, -1.060270668, -2.059157991, 0.608899174 ,-0.048795611)
)
# group, filter and join back to subset the data
df %>%
  group_by(Genes) %>%
  summarize(count = n()) %>%
  filter(count >= 3) %>%
  inner_join(df) %>%
  select(Genes, Values)
As per #Lamia's comments, it is possible to simplify it to just:
df %>% group_by(Genes) %>% filter(n()>=3)
# generating data
x <- c(NA, NA, NA, NA, 2, 3) # has n < 3!
y <- c(1, 2, 3, 4, 5, 6)
z <- c(1 ,2, 3, NA, 5, 6)
df <- data.frame(x,y,z)
colsToKeep <- c() # making an empty vector I will fill with column numbers
for (i in 1:ncol(df)) { # for every column
  if (sum(!is.na(df[,i])) >= 3) { # if that column has 3 or more valid values (i.e., ones that are not NA)...
    colsToKeep <- c(colsToKeep, i) # then save that column number into this vector
  }
}
df[,colsToKeep] # then use that vector to call the columns you want
Note that R treats FALSE as 0 and TRUE as 1, so that is how the sum() function works here.
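A quick illustration of that point:
# !is.na() returns a logical vector; summing it counts the TRUEs, i.e. the non-NA values
sum(!is.na(c(NA, 2, 3))) # 2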
Another possible solution using table:
gene <- c("A","A","A","B","B","C","C","C","C","D")
value <- c(seq(1,10,1))
df<-data.frame(gene,value)
df
gene value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 C 6
7 C 7
8 C 8
9 C 9
10 D 10
su<-data.frame(table(df$gene))
df_keep <-df[which(df$gene %in% su[which(su$Freq>2),1]),]
df_keep
gene value
1 A 1
2 A 2
3 A 3
6 C 6
7 C 7
8 C 8
9 C 9
I have a matrix of integers
m <- rbind(c(1,2),
c(3,6),
c(5,1),
c(2,1),
c(6,3))
and I am looking for a function that takes this matrix as input and outputs a vector flag with length(flag) == nrow(m) that assigns the same unique (let's say integer) value to rows that contain the same set of integers.
For the above example, the desired output would be:
flag <- c(1, 2, 3, 1, 2)
So rows 1 and 4 in m get the same flag 1, because they both contain the same set of integers, in this case {1, 2}. Similarly, rows 2 and 5 get the same flag.
The solution should work for any number of columns.
The only thing I could come up with is the following approach ...
FlagSymmetric <- function(x) {
  vec_sim <- rep(NA, nrow(x)) # object containing flags
  ind_ord <- ncol(x)
  counter <- 1
  for(i in 1:nrow(x)) {
    if(is.na(vec_sim[i])) { # if that row is not flagged yet, proceed ...
      vec_sim[i] <- counter # ... and give the next free flag
      for(j in (i+1):nrow(x)) {
        if( (i+1) > nrow(x) ) next # in case of tiny matrices
        ind <- x[j, ] %in% x[i, ]
        if(sum(ind)==ind_ord) vec_sim[j] <- counter # if the same, assign flag
      }
      counter <- counter + 1
    }
  }
  return(vec_sim)
}
... which does what I want:
> FlagSymmetric(m)
[1] 1 2 3 1 2
If n = nrow(m) this needs 1/2 n^2 operations. Of course, I could make it much quicker by writing this in C++, but this only alleviates my problem to some extent, because I am working with matrices with a potentially huge number of rows.
I guess there must be a smarter way of doing this.
EDIT:
Additional, more general example (sorting row and pasting to character string not possible):
m2 <- rbind(c(1,112),
c(11,12),
c(12,11),
c(112,1),
c(6,3))
flag2 <- c(1, 2, 2, 1, 3) # desired output
FlagSymmetric(m2) # works
[1] 1 2 2 1 3
Assuming you only have numeric data in your matrix.
First, convert the matrix to a data frame:
m <- data.frame(m)
We can sort every row and paste the values together, then convert them to a factor and then to numeric to get a unique number for every combination:
m$flag <- as.numeric(factor(apply(m, 1, function(x) paste0(sort(x), collapse = ""))))
m
# X1 X2 flag
#1 1 2 1
#2 3 6 3
#3 5 1 2
#4 2 1 1
#5 6 3 3
EDIT
The above solution does not work for every combination, as explained in the new example. To differentiate between each number, as #d.b commented, we can use any non-empty collapse argument. For the updated example:
as.numeric(factor(apply(m2, 1, function(x) paste0(sort(x), collapse = "-"))))
#[1] 1 2 2 1 3
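To illustrate why the separator matters, compare the pasted strings for the two colliding rows of m2:
# with an empty separator the digits run together and two different rows collide
paste0(sort(c(1, 112)), collapse = "")  # "1112"
paste0(sort(c(11, 12)), collapse = "")  # "1112"
# a non-empty separator keeps them distinct
paste0(sort(c(1, 112)), collapse = "-") # "1-112"
paste0(sort(c(11, 12)), collapse = "-") # "11-12"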
I have a dataframe and I would like to use a custom function to add multiple new columns to that dataframe. These new columns will be some function of an existing column, but they require the use of a custom function.
I am currently trying to have my custom function return the results in a list, which I then parse into separate columns. This sometimes works by returning a vector of lists, but sometimes this returns a matrix, in which case I get an error like
Error in $<-.data.frame(*tmp*, "z", value = list(1, 2, 2, 3, 3, 4)) : replacement has 2 rows, data has 3
Below is a sample of what I am trying to do.
sample_func <- function(number)
{
  list(w = number + 1, u = number + 2)
}
data = data.frame(x = c(1,2,3), y= c(5,6,7))
data$z = sapply(c(1,2,3),sample_func)
data$w = sapply(data$z,"[[","w")
data$u = sapply(data$z,"[[","u")
The function sapply automatically simplifies the result. In this case, you obtain a matrix. You can avoid this behaviour with the argument simplify = FALSE. But it's easier to use lapply because this function doesn't try to simplify the result.
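For completeness, a minimal sketch of keeping sapply but turning off the simplification (this returns the same list of lists that lapply would):
tmp <- sapply(c(1, 2, 3), sample_func, simplify = FALSE)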
The command
tmp <- lapply(c(1,2,3), sample_func)
returns a list of lists:
[[1]]
[[1]]$w
[1] 2
[[1]]$u
[1] 3
[[2]]
[[2]]$w
[1] 3
[[2]]$u
[1] 4
[[3]]
[[3]]$w
[1] 4
[[3]]$u
[1] 5
You can use the following command to add the new columns to your data frame:
cbind(data, do.call(rbind, tmp))
# x y w u
# 1 1 5 2 3
# 2 2 6 3 4
# 3 3 7 4 5
Update to address comment:
If possible, you can modify the function to return a data frame.
sample_func <- function(number)
{
  data.frame(w = number + 1, u = number + 2)
}
tmp <- lapply(c(1,2,3), sample_func)
cbind(data, do.call(rbind, tmp))
The result will be a data frame with numeric columns.