I have a data frame that looks like this:
is_severe encoding sn_id
1 1 1
0 2 1
1 2 2
0 1 2
1 1 2
I am applying this function with by():
catt <-
function(y, x, score = c(0, 1, 2)) {
miss <- unique(c(which(is.na(y)), which(is.na(x))))
n.miss <- length(miss)
if(n.miss > 0) {
y <- y[-miss]
x <- x[-miss]
}
if(!all((y == 0) | (y == 1)))
stop("y should be only 0 or 1.")
if(!all((x == 0) | (x == 1) |(x == 2)))
stop("x should be only 0, 1 or 2.")
ca <- x [y == 1]
co <- x [y == 0]
htca <- table(ca)
htco <- table(co)
A <- matrix(0, 2, 3)
colnames(A) <- c(0, 1, 2)
rownames(A) <- c(0, 1)
A[1, names(htca)] <- htca
A[2, names(htco)] <- htco
ptt <- prop.trend.test(A[1, ], colSums(A), score = score)
# Other values that could be returned (currently unused):
# list("2x3-table" = A,
#      chisq = as.numeric(ptt$statistic),
#      df = as.numeric(ptt$parameter),
#      n.miss = n.miss)
res <- as.numeric(ptt$p.value)
return(res)
}
When I run it:
by(es_test,es_test$sn_id, function (es_test) {catt(es_test$ï..is_severe,es_test$encoding)})
I get these results:
es_test$sn_id: 1
[1] 0.1572992
------------------------------------------------------------------------
es_test$sn_id: 2
[1] 0.3864762
This is not a convenient format for further work. Is there any way to get these results as a list, e.g. [0.157, 0.386]?
I tried this:
result_pv=c(by(es_test,es_test$sn_id, function (es_test) {catt(es_test$ï..is_severe,es_test$encoding)}))
but it produced a double, and I want a vector or a list.
The double:
Browse[6]> result_pv
1 2
0.1572992 0.3864762
> typeof(result_pv)
[1] "double"
What I want to do later is add result_pv to a data frame as a column, and I can't do that while it is a double.
Thank you.
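A note on the attempt above: typeof() reporting "double" only says how the numbers are stored. A named double is already an ordinary numeric vector, so it can go straight into a data frame. Here is a minimal sketch, assuming toy data and using mean() as a stand-in for catt() (both are placeholders, not taken from the post):

```r
# Toy data; the column `value` and the use of mean() are placeholders
df <- data.frame(sn_id = c(1, 1, 2, 2, 2),
                 value = c(1, 3, 2, 4, 6))

res <- by(df, df$sn_id, function(d) mean(d$value))

pv <- c(res)          # named numeric vector, one element per sn_id
is.numeric(pv)        # a "double" is a numeric vector

# One row per group: the names become the id column
out <- data.frame(sn_id = names(pv), pvalue = as.numeric(pv))
out
```

If the goal is a column in a per-group summary table, building the data frame from names(pv) and the values like this is probably the simplest route.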
I have data that looks like this:
is_severe encoding sn_id
6 1 1 chr1 17689
7 0 2 chr1 17689
8 1 1 chr1 17689
9 1 2 chr1 69511
10 1 2 chr1 69511
11 1 1 chr1 69511
12 0 1 chr1 69511
I performed a statistical test on every "group" of values based on the sn_id column.
This is the function for the statistical test:
catt <-
function(y, x, score = c(0, 1, 2)) {
miss <- unique(c(which(is.na(y)), which(is.na(x))))
n.miss <- length(miss)
if(n.miss > 0) {
y <- y[-miss]
x <- x[-miss]
}
if(!all((y == 0) | (y == 1)))
stop("y should be only 0 or 1.")
if(!all((x == 0) | (x == 1) |(x == 2)))
stop("x should be only 0, 1 or 2.")
ca <- x [y == 1]
co <- x [y == 0]
htca <- table(ca)
htco <- table(co)
A <- matrix(0, 2, 3)
colnames(A) <- c(0, 1, 2)
rownames(A) <- c(0, 1)
A[1, names(htca)] <- htca
A[2, names(htco)] <- htco
ptt <- prop.trend.test(A[1, ], colSums(A), score = score)
p.value = as.numeric(ptt$p.value)
res=p.value
return(res)}
I then ran it on each snp_id group using the by function:
send=by(merged_df_normal,merged_df_normal$snp_id, function (merged_df_normal) {catt(merged_df_normal$is_sever_int,merged_df_normal$encoding)})
and got results such as:
merged_df_normal$snp_id: chr11441806
[1] 0.6274769
---------------------------------------------------------------------
merged_df_normal$snp_id: chr1144192891
[1] NA
I wanted to turn this into a data frame that looks like this:
snp_id pvalue
chr11441806 0.6274769
chr1144192891 NA
I tried this:
do.call(rbind, list(send))
and it returned a matrix that looks like this:
chr11441806 chr1144192891
0.6274769 NA
I had to edit the function after accepting an answer:
catt_2 <-
function(y, x, score = c(0, 1, 2)) {
miss <- unique(c(which(is.na(y)), which(is.na(x))))
n.miss <- length(miss)
if(n.miss > 0) {
y <- y[-miss]
x <- x[-miss]
}
if(!all((y == 0) | (y == 1)))
stop("y should be only 0 or 1.")
if(!all((x == 0) | (x == 1) |(x == 2)))
stop("x should be only 0, 1 or 2.")
ca <- x [y == 1]
co <- x [y == 0]
htca <- table(ca)
htco <- table(co)
A <- matrix(0, 2, 3)
colnames(A) <- c(0, 1, 2)
rownames(A) <- c(0, 1)
A[1, names(htca)] <- htca
A[2, names(htco)] <- htco
ptt <- prop.trend.test(A[1, ], colSums(A), score = score)
res <- list(
chisq = as.numeric(ptt$statistic),
p.value = as.numeric(ptt$p.value)
)
return(res)
}
The results are now:
send=by(merged_df_normal,merged_df_normal$snp_id, function (merged_df_normal) {catt_2(merged_df_normal$is_sever,merged_df_normal$encoding)})
merged_df_normal$snp_id: chr11007252
$chisq
[1] NA
$p.value
[1] NA
------------------------------------------------------------------------
merged_df_normal$snp_id: chr1100731820
$chisq
[1] 0.9111779
$p.value
[1] 0.3398021
What I would like instead is:
snp_id pvalue chisq
chr11441806 0.6274769 0.9111779
chr1144192891 NA NA
The answer:
library(data.table)
setDT(merged_df_normal)
merged_df_normal[,.(p.value=catt(is_sever,encoding)),snp_id]
worked really well for getting just the p.value, but is there a way to edit the above answer and add a new chisq column? Thank you for the help.
The previous answer:
I believe you can just apply catt() to each group of sn_id. Let's say your original data is called df. Then, you can do the following:
library(data.table)
setDT(df)
df[,.(p.value=catt(is_severe,encoding)),sn_id]
You need to adjust your function so that it handles sn_id groups that don't have sufficient data; in your example data frame, catt() only runs without error on sn_id == chr1 69511.
In general, however, the output will look like this, with one row in the frame for each sn_id value:
sn_id p.value
<char> <num>
1: chr1 69511 0.2482131
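To make the "handle groups without sufficient data" point concrete, and to get chisq next to the p-value, here is a minimal base R sketch. The wrapper trend_row() is hypothetical (it is not the catt() from the post and skips its input checks); it builds the same case and total counts, and returns NA for both values if prop.trend.test() fails on a group. The toy data is the five-row example from the first post in this thread.

```r
# Hypothetical wrapper: one-row data frame per group, NA on failure
trend_row <- function(y, x, score = c(0, 1, 2)) {
  cases  <- tabulate(x[y == 1] + 1, nbins = 3)  # cases per encoding 0, 1, 2
  totals <- tabulate(x + 1, nbins = 3)          # all subjects per encoding
  tryCatch({
    ptt <- suppressWarnings(prop.trend.test(cases, totals, score = score))
    data.frame(chisq   = as.numeric(ptt$statistic),
               p.value = as.numeric(ptt$p.value))
  }, error = function(e) data.frame(chisq = NA_real_, p.value = NA_real_))
}

# Toy data copied from the first example in this thread
df <- data.frame(is_severe = c(1, 0, 1, 0, 1),
                 encoding  = c(1, 2, 2, 1, 1),
                 sn_id     = c(1, 1, 2, 2, 2))

# Split by group, test each piece, then stack the one-row results
rows <- lapply(split(df, df$sn_id),
               function(d) trend_row(d$is_severe, d$encoding))
out  <- cbind(sn_id = names(rows), do.call(rbind, rows))
rownames(out) <- NULL
out
```

On this toy data the p-values should come out at about 0.157 and 0.386, the values by() printed in the first post. With data.table, a function that returns a list can be used directly in j to get one column per list element, which is the same idea.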
I am using the lapp function of {terra} in R and I want to update rast_a with values from rast_b or rast_c (and some other math) depending on the value in each cell of rast_a.
sample data
library(terra)
rast_a <- rast(ncol = 2, nrow = 2)
values(rast_a) <- 1:4
rast_b <- rast(ncol = 2, nrow = 2)
values(rast_b) <- c(2,2,2,2)
rast_c <- rast(ncol = 2, nrow = 2)
values(rast_c) <- c(3,3,3,3)
Problem
This is my (wrong) attempt.
my_update_formula <- function(a, b, c) {
a[a == 1] <- b[a == 1] + 10 + 20 - 30
a[a == 2] <- c[a == 2] + 10 + 50 - 50
return(a)
}
result <- lapp(c(rast_a, rast_b, rast_c),
fun = my_update_formula)
values(result)
lyr1
[1,] 13
[2,] 13
[3,] 3
[4,] 4
The expected result is 2, 13, 3, 4. Because of the order of operations inside the function, the first value gets updated twice: it is first changed from 1 to 2 (correctly), but it then also satisfies the condition on the second line and is changed again, which I don't want.
How can I solve this please?
You can change your formula to
f1 <- function(a, b, c) {
d <- a
d[a == 1] <- b[a == 1]
d[a == 2] <- c[a == 2] + 10
d
}
#or
f2 <- function(a, b, c) {
i <- a == 1
j <- a == 2
a[i] <- b[i]
a[j] <- c[j] + 10
return(a)
}
lapp(c(rast_a, rast_b, rast_c), fun = f1) |> values()
# lyr1
#[1,] 2
#[2,] 13
#[3,] 3
#[4,] 4
lapp(c(rast_a, rast_b, rast_c), fun = f2) |> values()
# lyr1
#[1,] 2
#[2,] 13
#[3,] 3
#[4,] 4
You can get the same result with
x <- ifel(rast_a==1, rast_b,
ifel(rast_a == 2, rast_c + 10, rast_a))
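The reason both versions work is that the conditions are evaluated against the original values before anything is overwritten. Here is a small sketch with plain numeric vectors instead of rasters, assuming lapp() hands the function ordinary vectors of cell values; the names naive() and masked() are made up for illustration.

```r
a  <- c(1, 2, 3, 4)
b  <- rep(2, 4)
cc <- rep(3, 4)

# Conditions re-evaluated after each assignment: cell 1 is hit twice
naive <- function(a, b, c) {
  a[a == 1] <- b[a == 1]
  a[a == 2] <- c[a == 2] + 10   # now also TRUE for the cell that was 1
  a
}

# Conditions fixed up front: each cell is updated at most once
masked <- function(a, b, c) {
  i <- a == 1
  j <- a == 2
  a[i] <- b[i]
  a[j] <- c[j] + 10
  a
}

naive(a, b, cc)
#> [1] 13 13  3  4
masked(a, b, cc)
#> [1]  2 13  3  4
```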
I have a question regarding a for-loop within R's dplyr. Imagine I have the following dataframe:
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df$count_x <- cumsum(df$check)
df$count_y <- NA
df$count_y[1] <- ifelse(df$check[1] == 0, 0, 1)
co <- df$count_y[1]
I want to fill the variable count_y based on an adjusted cumulative function below:
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
The output of this for-loop is correct. However, in my current data set, I have many IDs, and using a for loop to iterate over the IDs will take too much time. I'm trying to use the functionality of dplyr to speed up the process.
library(dplyr)
id <- c(rep(8, 9))
check <- c(0,1,1,0,0,1,0,0,0)
df <- data.frame(id, check)
df <- df %>% group_by(id) %>% mutate(count_x = cumsum(check),
count_y = NA) %>% ungroup()
df <- df %>% group_by(id) %>% mutate(count_y = replace(count_y, 1, ifelse(check[1] == 0, 0 , 1)))
count_n <- function(df){
co <- df$count_y[1]
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
}
I want to use mutate to call the function count_n to fill count_y as described above. I'm aware that I'm passing just one variable, whereas I need to pass a data frame, because the function relies on the columns 'check' (column 2) and 'count_y' (column 4). I have tried multiple options (mutate_at, all, etc.) but I couldn't make it work. What can I do differently?
df <- df %>% group_by(id) %>% mutate_at(vars(count_y), ~count_n(.))
I think this is the perfect case to use purrr::accumulate2().
purrr::accumulate() is often used to calculate conditional cumulative sums. It takes a function as the second argument. This function should have 2 arguments: the cumulative output co, and the currently evaluated value x.
purrr::accumulate2() allows us to use a second variable to iterate on, and here we use lag(check) as lx. The tricky part is that this second variable should be one item shorter, as it does not matter for the initial value.
Here is the code, matching your expected output.
library(tidyverse)
df = structure(list(id = c(8, 8, 8, 8, 8, 8, 8, 8, 8),
check = c(0, 1, 1, 0, 0, 1, 0, 0, 0),
count_x = c(0, 1, 2, 2, 2, 3, 3, 3, 3)),
row.names = c(NA, -9L), class = "data.frame")
df %>%
mutate(
count_y = accumulate2(check, lag(check)[-1], function(co, x, lx){
case_when(
x==0 ~ co,
x==1 & lx==0 ~ 1,
x==1 & lx==1 ~ co+1,
TRUE ~ 999 #error value in case of unexpected input
)
})
)
#> id check count_x count_y
#> 1 8 0 0 0
#> 2 8 1 1 1
#> 3 8 1 2 2
#> 4 8 0 2 2
#> 5 8 0 2 2
#> 6 8 1 3 1
#> 7 8 0 3 1
#> 8 8 0 3 1
#> 9 8 0 3 1
Created on 2021-05-05 by the reprex package (v2.0.0)
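For comparison, the same accumulate idea can be sketched in base R with Reduce(accumulate = TRUE). This is not the purrr code above, just an illustration of the fold: the helper count_runs() is a made-up name, and it iterates over positions so that the previous check value can be looked up directly.

```r
# Hypothetical helper: fold over positions 2..n, carrying the counter `co`
count_runs <- function(check) {
  step <- function(co, i) {
    if (check[i] == 0) {
      co                      # carry the counter through zeros
    } else if (check[i - 1] == 0) {
      1                       # a new run of ones starts
    } else {
      co + 1                  # the run of ones continues
    }
  }
  Reduce(step, seq_along(check)[-1], accumulate = TRUE, init = check[1])
}

count_runs(c(0, 1, 1, 0, 0, 1, 0, 0, 0))
#> [1] 0 1 2 2 2 1 1 1 1
```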
The first issue is that you weren't returning anything in your function. The second issue is that you don't need to use a mutate_at (or even a mutate as would be more appropriate for a single variable) when you're writing the function that modifies the entire tibble. The simplest way to get it working is adding a return statement and running it in line like so:
count_n <- function(df){
co <- df$count_y[1]
for (idx in 2:nrow(df)){
if(df[idx, 2] == 1 & df[idx - 1, 2] == 0){
co <- 1
df[idx, 4] <- co
} else if (df[idx, 2] == 1 & df[idx - 1, 2] == 1){
co <- co + 1
df[idx, 4] <- co
} else if (df[idx, 2] == 0){
df[idx, 4] <- co
}
}
return(df)
}
df %>% group_by(id) %>% count_n(.)
However, I would use Dan's answer above because it's much cleaner and has the advantage of not running a for loop, which isn't very "R". :)
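One caveat if there are many ids: piping a grouped data frame into an ordinary function passes the whole table at once, so the counter would not restart for each id. Here is a minimal sketch of one way around that, assuming the loop is rewritten to take only the check vector and is applied per group with base R's ave(). The function name count_y_vec() and the two-id toy data are made up for illustration.

```r
# Hypothetical vector version of the loop: takes `check`, returns count_y
count_y_vec <- function(check) {
  out <- numeric(length(check))
  co  <- if (check[1] == 0) 0 else 1
  out[1] <- co
  for (i in seq_along(check)[-1]) {
    if (check[i] == 1) {
      co <- if (check[i - 1] == 0) 1 else co + 1
    }
    out[i] <- co               # zeros simply carry the current counter
  }
  out
}

# Toy data with two ids, so the restart per group is visible
df <- data.frame(id    = c(8, 8, 8, 9, 9, 9),
                 check = c(0, 1, 1, 1, 1, 0))
df$count_y <- ave(df$check, df$id, FUN = count_y_vec)
df$count_y
#> [1] 0 1 2 1 2 2
```

Because the function works on a plain vector, the same call should also fit inside group_by(id) %>% mutate(count_y = count_y_vec(check)).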
I generated a 600 length sample using:
x <- rnorm(600, mean = 30, sd = 10)
and then made another 600 length list using:
y = ((x-30)/10)
My plan is to use if statements that test whether y is bigger than 1.96 and whether -y is smaller than -1.96. Each time a test passes, a variable a should be incremented by 1; otherwise a variable b should be incremented by 1.
I have tried the following things:
a = 0
b = 0
ifelse(y > 1.96, inc(a) <- 1, inc(b) <- 1)
ifelse(-y < -1.96, inc(a) <- 1, inc(b) <- 1)
Error in inc(a) <- 1 : could not find function "inc<-"
Error in inc(b) <- 1 : could not find function "inc<-"
ifelse(y > 1.96, '+'(a) <- 1, '+'(b) <- 1)
ifelse(-y < -1.96, '+'(a) <- 1, '+'(b) <- 1)
Error in +a <- 1 : could not find function "+<-"
Error in +b <- 1 : could not find function "+<-"
if (y > 1.96) {
a = a + 1
}
if (-y < -1.96) {
a = a + 1
} else{
b = b + 1
}
Warning message:
In if (y > 1.96) { :
the condition has length > 1 and only the first element will be used
ifelse(y > 1.96, a <- a + 1, b <- b + 1)
ifelse(-y < -1.96, a <- a + 1, b <- b + 1)
This almost worked but it gave me an output of all 1's and:
a
# [1] 1
b
# [1] 1
So how else would I increment the variables or what am I doing wrong?
Avoid using ifelse() for this kind of computation. In this case it is better to rely on R's vectorization (i.e. there are no scalars in R, only vectors): build a vector of TRUE/FALSE values and then sum the TRUE values.
# Make the data
set.seed(1337)
x = rnorm(600, mean = 30, sd = 10)
y = ((x-30)/10)
# Get Indicators (T/F)
v1 = y > 1.96
v2 = y < -1.96
# Sum the TRUE cases
a = sum(v1) + sum(v2)
# Handle the false cases without resumming.
b = 2*length(y) - a
Also, here we opt to use y < -1.96 to get an appropriate two-sided count.
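As for why the last attempt in the question gave a = 1 and b = 1: ifelse() evaluates each of its yes and no expressions at most once as a whole, not once per element, so each assignment runs a single time. Here is a small sketch on toy values (not the simulated sample), which also shows a one-count-per-element variant of b as an alternative to the 2*length(y) - a used above:

```r
y <- c(-2.5, 0, 1, 2, 3)   # toy values for illustration

# The assignments inside ifelse() each run once, so both counters end at 1
a <- 0
b <- 0
invisible(ifelse(y > 1.96, a <- a + 1, b <- b + 1))
c(a, b)
#> [1] 1 1

# Vectorized two-sided count: TRUE counts as 1 when summed
a <- sum(abs(y) > 1.96)    # same as sum(y > 1.96) + sum(y < -1.96)
b <- length(y) - a         # elements inside [-1.96, 1.96]
c(a, b)
#> [1] 3 2
```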
Consider two strings of the form below:
101001
010001
How can I do an OR between these two and report the number of ones?
My goal is to just report 4 for the two strings above.
Thanks very much for your help
There's probably a more elegant way, but how about this:
x = "101001"
y = "010001"
dat = c(strsplit(x, split=""), strsplit(y, split=""))
sum(dat[[1]] == 1 | dat[[2]] == 1)
or this:
sum(unlist(strsplit(x, split="")) == 1 | unlist(strsplit(y, split="")) == 1)
or, per @jbaums' comment:
sum(as.numeric(strsplit(x, '')[[1]]) | as.numeric(strsplit(y, '')[[1]]))
If you're only dealing with binary, you can convert the strings to numerics, add them, and count the number of non-zeros. (Edited to incorporate Julius's recommendation)
x = "101001"
y = "010001"
xy <- as.numeric(x) + as.numeric(y)
nchar(gsub("0", "", sprintf("%.0f", xy)))  # sprintf avoids scientific notation such as "1e+05"
You can write this to run over a vector pretty easily too.
#* function to generate sample data
make_binary_string <- function(n = 10, len = 6)
{
vapply(1:n,
function(i, n, len) paste0(sample(0:1, len, replace = TRUE), collapse = ""),
character(1),
n = n,
len = len)
}
set.seed(pi)
x <- make_binary_string(n = 10)
y <- make_binary_string(n = 10)
xy <- as.numeric(x) + as.numeric(y)
nchar(gsub("0", "", sprintf("%.0f", xy)))  # sprintf avoids scientific notation such as "1e+05"
Here is what I tried.
str1 <- "101001"
str2 <- "010001"
df <- data.frame(strsplit(str1,split = ""), strsplit(str2,split = ""))
names(df) <- c('x1', 'x2')
This will convert strings into dataframe like this
x1 x2
1 1 0
2 0 1
3 1 0
4 0 0
5 0 0
6 1 1
And then count the number of rows which have at least one 1
nrow(df[df$x1 == 1 | df$x2 == 1,])
Or
sum(bitwOr(as.numeric(strsplit(str1,split = "")[[1]]) , as.numeric(strsplit(str2,split = "")[[1]])))
We can define a function to.bool() that converts a string to a sequence of boolean values:
to.bool <- function(boolstr) as.logical(as.integer(unlist(strsplit(boolstr,""))))
sum(to.bool("101001") | to.bool("010001"))
#[1] 4
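All of the approaches above assume the two strings have the same length. Here is a minimal sketch that checks this and works on whole vectors of strings without going through numeric conversion; count_or() is a made-up helper name, and comparing raw bytes is just one possible choice.

```r
# Hypothetical helper: count positions where either binary string has a "1"
count_or <- function(x, y) {
  stopifnot(length(x) == length(y), all(nchar(x) == nchar(y)))
  one <- charToRaw("1")
  mapply(function(a, b) sum(charToRaw(a) == one | charToRaw(b) == one),
         x, y, USE.NAMES = FALSE)
}

count_or("101001", "010001")
#> [1] 4
count_or(c("100000", "000000"), c("000000", "000000"))
#> [1] 1 0
```

Since the strings are never converted to numbers, long strings and leading zeros should not cause problems here.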