I have a question regarding a for loop that I would like to replace using R's dplyr. Imagine I have the following data frame:
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]
I want to fill the variable count_y using the adjusted cumulative-count logic in the loop below:
for (idx in 2:nrow(df)){
  if (df[idx, 2] == 1 & df[idx - 1, 2] == 0){
    co <- 1
    df[idx, 4] <- co
  } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
    co <- co + 1
    df[idx, 4] <- co
  } else if (df[idx, 2] == 0){
    df[idx, 4] <- co
  }
}
The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.
library(dplyr)  # needed for %>%, group_by() and mutate()

id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)

df <- df %>%
  group_by(id) %>%
  mutate(count_x = cumsum(check),
         count_y = NA) %>%
  ungroup()

df <- df %>%
  group_by(id) %>%
  mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0, 1)))
count_n <- function(df){
  co <- df$count_y[1]
  for (idx in 2:nrow(df)){
    if (df[idx, 2] == 1 & df[idx - 1, 2] == 0){
      co <- 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
      co <- co + 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 0){
      df[idx, 4] <- co
    }
  }
}
I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas I need to pass a data frame, since the function relies on the columns 'check' (column 2) and 'count_y' (column 4). I have tried multiple options (mutate_at, mutate_all, etc.) but I couldn't make it work. What can I do differently?
df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.))
I think this is the perfect case to use purrr::accumulate2().
purrr::accumulate() is often used to calculate conditional cumulative sums. It takes a function as the second argument. This function should have 2 arguments: the cumulative output co, and the currently evaluated value x.
purrr::accumulate2() allows us to use a second variable to iterate on, and here we use lag(check) as lx. The tricky part is that this second variable has to be one item shorter than the first, because it is not used for the initial value.
Here is the code, matching your expected output.
library(tidyverse)

df <- structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8),
                     check = c(0, 1, 1, 0, 0, 1, 0, 0, 0),
                     count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)),
                row.names = c(NA, -9L), class = "data.frame")

df %>%
  mutate(
    count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
      case_when(
        x == 0           ~ co,
        x == 1 & lx == 0 ~ 1,
        x == 1 & lx == 1 ~ co + 1,
        TRUE             ~ 999 # error value in case of unexpected input
      )
    })
  )
#>   id check count_x count_y
#> 1  8     0       0       0
#> 2  8     1       1       1
#> 3  8     1       2       2
#> 4  8     0       2       2
#> 5  8     0       2       2
#> 6  8     1       3       1
#> 7  8     0       3       1
#> 8  8     0       3       1
#> 9  8     0       3       1
Created on 2021-05-05 by the reprex package (v2.0.0)
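Note (not part of the original answer): depending on the purrr version, accumulate2() may return a list rather than an atomic vector, in which case count_y comes back as a list-column. Wrapping the result in unlist() gives a plain numeric column either way; a minimal sketch:

df %>%
  mutate(
    count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
      case_when(
        x == 0           ~ co,
        x == 1 & lx == 0 ~ 1,
        x == 1 & lx == 1 ~ co + 1,
        TRUE             ~ 999
      )
    }),
    # flatten a possible list-column into an ordinary numeric column
    count_y = unlist(count_y)
  )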
The first issue is that you weren't returning anything from your function. The second issue is that you don't need mutate_at (or even mutate, which would be more appropriate for a single variable) when your function modifies the entire tibble. The simplest way to get it working is to add a return statement and run it inline, like so:
count_n <- function(df){
  co <- df$count_y[1]
  for (idx in 2:nrow(df)){
    if (df[idx, 2] == 1 & df[idx - 1, 2] == 0){
      co <- 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
      co <- co + 1
      df[idx, 4] <- co
    } else if (df[idx, 2] == 0){
      df[idx, 4] <- co
    }
  }
  return(df)
}
df %>% group_by(id) %>% count_n(.)
However, I would use Dan's answer above because it's much cleaner and has the advantage of not running a for loop, which isn't very "R". :)
Related
I have data that looks like this:
   is_severe encoding      sn_id
6          1        1 chr1 17689
7          0        2 chr1 17689
8          1        1 chr1 17689
9          1        2 chr1 69511
10         1        2 chr1 69511
11         1        1 chr1 69511
12         0        1 chr1 69511
I performed a statistical test on every "group" of values based on the sn_id column.
This is the function for the statistical test:
catt <- function(y, x, score = c(0, 1, 2)) {
  miss <- unique(c(which(is.na(y)), which(is.na(x))))
  n.miss <- length(miss)
  if (n.miss > 0) {
    y <- y[-miss]
    x <- x[-miss]
  }
  if (!all((y == 0) | (y == 1)))
    stop("y should be only 0 or 1.")
  if (!all((x == 0) | (x == 1) | (x == 2)))
    stop("x should be only 0, 1 or 2.")
  ca <- x[y == 1]
  co <- x[y == 0]
  htca <- table(ca)
  htco <- table(co)
  A <- matrix(0, 2, 3)
  colnames(A) <- c(0, 1, 2)
  rownames(A) <- c(0, 1)
  A[1, names(htca)] <- htca
  A[2, names(htco)] <- htco
  ptt <- prop.trend.test(A[1, ], colSums(A), score = score)
  p.value <- as.numeric(ptt$p.value)
  res <- p.value
  return(res)
}
and I performed it on the groups of snp_id using the by() function:

send <- by(merged_df_normal, merged_df_normal$snp_id, function(merged_df_normal) {
  catt(merged_df_normal$is_sever_int, merged_df_normal$encoding)
})

and got these results, for example:
merged_df_normal$snp_id: chr11441806
[1] 0.6274769
---------------------------------------------------------------------
merged_df_normal$snp_id: chr1144192891
[1] NA
I wanted to transform this into a data frame that looks like this:
snp_id pvalue
chr11441806 0.6274769
chr1144192891 NA
I tried this:
do.call(rbind, list(send))
and it returned a matrix that looks like this:
chr11441806 chr1144192891
0.6274769 NA
I had to edit the function after accepting an answer:
catt_2 <- function(y, x, score = c(0, 1, 2)) {
  miss <- unique(c(which(is.na(y)), which(is.na(x))))
  n.miss <- length(miss)
  if (n.miss > 0) {
    y <- y[-miss]
    x <- x[-miss]
  }
  if (!all((y == 0) | (y == 1)))
    stop("y should be only 0 or 1.")
  if (!all((x == 0) | (x == 1) | (x == 2)))
    stop("x should be only 0, 1 or 2.")
  ca <- x[y == 1]
  co <- x[y == 0]
  htca <- table(ca)
  htco <- table(co)
  A <- matrix(0, 2, 3)
  colnames(A) <- c(0, 1, 2)
  rownames(A) <- c(0, 1)
  A[1, names(htca)] <- htca
  A[2, names(htco)] <- htco
  ptt <- prop.trend.test(A[1, ], colSums(A), score = score)
  res <- list(
    chisq = as.numeric(ptt$statistic),
    p.value = as.numeric(ptt$p.value)
  )
  return(res)
}
and now, after running:

send <- by(merged_df_normal, merged_df_normal$snp_id, function(merged_df_normal) {
  catt_2(merged_df_normal$is_sever, merged_df_normal$encoding)
})

the results are:
merged_df_normal$snp_id: chr11007252
$chisq
[1] NA
$p.value
[1] NA
------------------------------------------------------------------------
merged_df_normal$snp_id: chr1100731820
$chisq
[1] 0.9111779
$p.value
[1] 0.3398021
and what I would like it to be is:
snp_id pvalue chisq
chr11441806 0.6274769 0.9111779
chr1144192891 NA NA
The answer:
library(data.table)
setDT(merged_df_normal)
merged_df_normal[,.(p.value=catt(is_sever,encoding)),snp_id]
worked really well for getting just the p.value, but is there a way to edit the above answer and also add a chisq column? Thank you for the help with the previous answer.
I believe you can just apply catt() to each group of sn_id. Let's say your original data is called df. Then, you can do the following:
library(data.table)
setDT(df)
df[,.(p.value=catt(is_severe,encoding)),sn_id]
You need to adjust your function so that it handles sn_id groups that don't have sufficient data; in your example data frame, catt() only runs without error for sn_id == "chr1 69511".
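One possible adjustment (a sketch, not from the original answer; the helper name catt_safe is made up here) is to wrap the call so that groups where catt() errors simply return NA instead of stopping the whole grouped computation:

# return NA for groups where catt() fails, e.g. because there is too little data
catt_safe <- function(y, x) {
  tryCatch(catt(y, x), error = function(e) NA_real_)
}
df[, .(p.value = catt_safe(is_severe, encoding)), sn_id]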
In general, however, the output will look like this, with one row for each sn_id value:
sn_id p.value
<char> <num>
1: chr1 69511 0.2482131
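As for the follow-up about also getting a chisq column: the edited catt_2() returns a named list(chisq, p.value), and data.table expands a list returned in j into one column per element, so the same pattern should give both statistics at once. A sketch (not from the original answer):

# catt_2() returns list(chisq = ..., p.value = ...); data.table turns each
# list element into its own column, with one row per sn_id
df[, catt_2(is_severe, encoding), by = sn_id]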
I have a data frame that looks like this:
is_severe encoding sn_id
        1        1     1
        0        2     1
        1        2     2
        0        1     2
        1        1     2
I'm using this function with the by() function:
catt <- function(y, x, score = c(0, 1, 2)) {
  miss <- unique(c(which(is.na(y)), which(is.na(x))))
  n.miss <- length(miss)
  if (n.miss > 0) {
    y <- y[-miss]
    x <- x[-miss]
  }
  if (!all((y == 0) | (y == 1)))
    stop("y should be only 0 or 1.")
  if (!all((x == 0) | (x == 1) | (x == 2)))
    stop("x should be only 0, 1 or 2.")
  ca <- x[y == 1]
  co <- x[y == 0]
  htca <- table(ca)
  htco <- table(co)
  A <- matrix(0, 2, 3)
  colnames(A) <- c(0, 1, 2)
  rownames(A) <- c(0, 1)
  A[1, names(htca)] <- htca
  A[2, names(htco)] <- htco
  ptt <- prop.trend.test(A[1, ], colSums(A), score = score)
  #list(#"2x3-table" = A,
  #     chisq = as.numeric(ptt$statistic),
  #     df = as.numeric(ptt$parameter),
  res = p.value = as.numeric(ptt$p.value)
  #     n.miss = n.miss)
  return(res)
}
When I run it:

by(es_test, es_test$sn_id, function(es_test) {
  catt(es_test$ï..is_severe, es_test$encoding)
})

I get these results:
es_test$sn_id: 1
[1] 0.1572992
------------------------------------------------------------------------
es_test$sn_id: 2
[1] 0.3864762
This is not a very convenient format to work with further. Is there any way to get these results as a list, like [0.157, 0.386]?
I tried this:

result_pv <- c(by(es_test, es_test$sn_id, function(es_test) {
  catt(es_test$ï..is_severe, es_test$encoding)
}))
but it produced a double, and I want it as a vector or list. The double:
Browse[6]> result_pv
1 2
0.1572992 0.3864762
> typeof(result_pv)
[1] "double"
What I want to do with it later is add result_pv to a data frame as a column, and when it is a double I can't do that.
Thank you.
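For what it's worth, here is a sketch (not an answer from the original thread): "double" is simply the storage type of a numeric vector, so result_pv is already a named numeric vector and can be attached to a data frame directly, for example:

# result_pv is a named numeric vector; its names are the sn_id values
pv_df <- data.frame(sn_id = names(result_pv), p.value = unname(result_pv))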
I really need some help to write a recursion in R.
The function that I want changes a certain observation according to a set of comparisons between different rows in a data frame, which I shall call g. One of these comparisons depends on the previous value of this same observation.
Suppose first that I want to update the value of column index, row i in my data df in the following way:
j <- 1:4
g <- (df$dom[i] > 0 &
abs(df$V2009[i] - df$V2009[j]) <= w) |
df$index[i] == df$index[j]
df$index[i] <- ifelse(any(g), which(g)[[1]], df$index[[i]])
The thing is, the object w is actually a list:
w = list(0, 1, 2, df$age[i])
So, as you can see, I want to create a function foo() that updates df$index iteratively: it loops through w, and the comparisons use values of index that have already been updated.
Here is some data:
df <- data.frame(dom = c(0, 0, 6, 6),
V2009 = c(9, 11, 9, 11),
index = c(1, 2, 1, 2),
age = c(2, 2, 2, 2))
I am not sure if a recursive function is actually needed or if something like reduce or map would do it.
Thank you!
The following function uses a double for loop to change the values of column index according to the condition defining g. It accepts a data.frame as input and returns the updated data.frame.
foo <- function(x){
  change_index <- function(x, i, w){
    j <- seq_len(nrow(x))
    (x$dom[i] > 0 & abs(x$V2009[i] - x$V2009[j]) <= w) |
      x$index[i] == x$index[j]
  }
  for(i in seq_len(nrow(x))){
    W <- list(0, 1, 2, x$age[i])
    for(w in W){
      g <- change_index(x, i, w)
      if(any(g)) x$index[i] <- which(g)[1]
    }
  }
  x
}
foo(df)
# dom V2009 index age
#1 0 9 1 2
#2 0 11 2 2
#3 6 9 1 2
#4 6 11 1 2
One can define w inside a function and rely on lexical scoping (a closure).
Following your instructions, the function index_value calculates the index value for any given i.
correct_index_col returns the corrected df.
df <- data.frame(dom = c(0, 0, 6, 6),
                 V2009 = c(9, 11, 9, 11),
                 index = c(1, 2, 1, 2),
                 age = c(2, 2, 2, 2))

index_value <- function(df, i) {
  j <- nrow(df)
  w <- c(0, 1, 2, df$age[i])
  g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w) |
    df$index[i] == df$index[j]
  ifelse(any(g), which(g)[[1]], df$index[[i]])
}

correct_index_col <- function(df) {
  indexes <- Vectorize(function(i) {
    index_value(df, i)
  })
  df$index <- indexes(1:nrow(df))
  df
}
# > correct_index_col(df)
# dom V2009 index age
# 1 0 9 1 2
# 2 0 11 1 2
# 3 6 9 3 2
# 4 6 11 1 2
#
If you want to really update (mutate) your df, then you have to do
df <- correct_index_col(df).
Here is an attempt of my own. I think I figured out a way to use recursion together with mutate:
library(tidyverse)  # for %>%, mutate(), map_dbl() and cur_data()

test <- function(i, df, k){
  j <- 1:nrow(df)
  w <- list(0, 1, 2, df$age[i])
  g <- (df$dom[i] > 0 & abs(df$V2009[i] - df$V2009[j]) <= w[k]) |
    df$index[i] == df$index[j]
  l <- ifelse(any(g), which(g)[1], df$index[i])
  return(l)
}

loop <- function(data, k = 1) {
  data <- data %>%
    mutate(index = map_dbl(seq(n()),
                           ~ test(.x, df = cur_data(), k)))
  if (k == 4) {
    return(data)
  } else {
    return(loop(data, k + 1))
  }
}

df %>% loop()
I welcome any comments in case this is inefficient for large datasets.
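Since the question mentions reduce/map: the same four passes can also be written with purrr::reduce() instead of explicit recursion. This is only a sketch, reusing the test() helper defined above:

# fold over k = 1..4, updating the data frame once per threshold in w
df_reduced <- reduce(1:4, function(data, k) {
  data %>%
    mutate(index = map_dbl(seq(n()), ~ test(.x, df = cur_data(), k)))
}, .init = df)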
I have a data frame "var" and I need to get a vector output that satisfies the following conditions.
Basically, what I am trying to do is the following: if the value of psqi_2_sleepstart1 is 15 or less, comp21score should be 0; between 16 and 30 it should be 1; between 31 and 60 it should be 2; and over 60 it should be 3. For example, if the data frame had the values 16, 40, 6 and 10 for psqi_2_sleepstart1, I would want the output to be 1, 2, 0, 0. I was using the ifelse statement, but I got the error that argument "yes" is missing, with no default.
Here is my attempt:
for (i in 1:nrow(var)) {
  ifelse (psqi_2_sleepstart1 <= 15)
  comp21score[i] <- 0
  ifelse (psqi_2_sleepstart1 > 15 & psqi_2_sleepstart1 <= 30)
  comp21score[i] <- 1
  ifelse (psqi_2_sleepstart1 > 30 & psqi_2_sleepstart1 <= 60)
  comp21score[i] <- 2
  ifelse (psqi_2_sleepstart1 > 60)
  comp21score[i] <- 3
}
print(comp21score)
Does anyone have suggestions on what I might be able to use instead or how to avoid this error?
Thanks!
Just for kicks - here is a case_when dplyr example (as mentioned in the comments):
library(dplyr)

DF1 <- data.frame("score" = 0:20)
DF1 <- DF1 %>% mutate(value = case_when(
  score < 5               ~ 1,
  score >= 5 & score < 10 ~ 2,
  score >= 10             ~ 3
))
> DF1
score value
1 0 1
2 1 1
3 2 1
.....
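As a compact base-R alternative (not from the answers above), the same binning can be done with findInterval(); a sketch assuming the column names and cutoffs from the question:

# findInterval() counts how many breakpoints are <= each value:
# <= 15 -> 0, 16-30 -> 1, 31-60 -> 2, > 60 -> 3
comp21score <- findInterval(var$psqi_2_sleepstart1, c(16, 31, 61))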
Use this layout instead:
x <- 2
if (x == 0) {
  y <- log
} else if (x == 1) {
  y <- identity
} else if (x == 2) {
  y <- function(x) x^2
}
I have a vector like this:
x <- c(0, 0, 0, 0, 4, 5, 0, 0, 3, 2, 7, 0, 0, 0)
I want to keep only the elements from position 5 to 11, i.e. delete the zeroes at the start and the end. For this vector it is quite easy since it is small, but I have very large data and need a general solution that works for any such vector.
Try this:
x[min(which(x != 0)):max(which(x != 0))]
Find the indices of all values that are not zero, and take the first (min) and the last (max) to subset x.
You can try something like:
x <- c(0, 0, 0, 0, 4, 5, 0, 0, 3, 2, 7, 0, 0, 0)
rl <- rle(x)
if (rl$values[1] == 0)
  x <- tail(x, -rl$lengths[1])
if (tail(rl$values, 1) == 0)
  x <- head(x, -tail(rl$lengths, 1))
x
## 4 5 0 0 3 2 7
Hope it helps,
alex
This would also work:
x[cumsum(x) & rev(cumsum(rev(x)))]
# [1] 4 5 0 0 3 2 7
cumsum(x) is zero until the first non-zero element and rev(cumsum(rev(x))) is zero after the last one, so coercing both to logical and taking the AND keeps only the stretch in between.
I would probably define two functions, and compose them:
trim_leading <- function(x, value = 0) {
  w <- which.max(cummax(x != value))
  x[seq.int(w, length(x))]
}

trim_trailing <- function(x, value = 0) {
  w <- which.max(cumsum(x != value))
  x[seq.int(w)]
}
And then pipe your data through (load magrittr or dplyr first for the %>% pipe):
x %>% trim_leading %>% trim_trailing