Compute Multiple Means in Dataframe in R [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I have a dataframe with 2 columns: a quality score and an outcome. Outcomes are either 1 or 0. Quality scores are integers from 1 to 135 (the full data are given below).
For each quality score, I would like to compute the mean. I can do it for one Quality Score at a time as such:
test <- subset(deletion_qs, qs == 10)
sum(test$outcomes)/length(test$outcomes)
[1] 0.4
But this is too slow. I was wondering if there is a way using one of the apply functions?
Here is the data:
quality_score <- c(2, 1 ,18 , 1 , 2 , 1 , 1 , 1 , 2 , 1, 1 , 1 , 1 , 1 ,10 , 10 ,10, 10 , 10 , 10 , 10 , 10, 10 , 10 , 1 ,29 ,1 , 29 ,63 , 1 ,25 , 1 , 1 ,52 ,28 , 1 , 1 ,10 , 3, 28 , 1 , 20, 1, 10, 1 , 10 , 3 , 1 , 3 , 10 ,10 , 56 , 1, 1, 2 , 3 , 2 , 1 , 1, 44 , 1 , 1, 10 , 33 , 67 ,67, 19 , 8 , 39, 10 , 2 , 1 , 42 , 22, 7 , 93 , 1 , 12 , 10 ,135 , 1 , 31 , 6 , 16, 15 , 1 , 35 , 1, 10 , 10)
outcome <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1)

You can use dplyr's group_by and summarise.
First combine the two vectors into a data frame, here called "tot.data". Then:
library(dplyr)
tot.data <- data.frame(quality_score, outcome)
group_by(tot.data, quality_score) %>% summarise(Mean1 = mean(outcome))
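Since the question asks about the apply family, a base R sketch with tapply or aggregate (using the same tot.data as above) gives the same per-score means:
# tapply returns a named vector of means, one per quality score
tapply(tot.data$outcome, tot.data$quality_score, mean)
# aggregate returns a data frame with one row per quality score
aggregate(outcome ~ quality_score, data = tot.data, FUN = mean)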

Related

Create a new variable based on other columns values

I have a panel-data dataframe, something like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
I want to generate a new dummy variable that takes the value 1 if the row contains a 1 in any of the three status columns, and 0 otherwise. It should end up like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
"Final_status" = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0))
Can anyone help me achieve this?
We can use if_any on the columns that start with 'Status' to check whether any value in the row is 1. It returns TRUE if there is one and FALSE otherwise, which is coerced to binary with as.integer (or the unary +).
library(dplyr)
df %>%
mutate(Final_status = +(if_any(starts_with('Status'), ~ . ==1)))
-output
id Status_2014 Status_2015 Status_2016 Final_status
1 1 1 0 0 1
2 1 1 0 0 1
3 1 1 0 0 1
4 1 1 0 0 1
5 2 0 1 0 1
6 2 0 1 0 1
7 2 0 1 0 1
8 2 0 1 0 1
9 3 0 0 0 0
10 3 0 0 0 0
11 3 0 0 0 0
12 3 0 0 0 0
Or using rowSums from base R
df$Final_status <- +(rowSums(df[-1] > 0) > 0)
You write an if condition to define the variable as 1 or 0, and the most straightforward way to do that is inside a dplyr pipe.
I don't have the dplyr syntax in my head (too long since I last used it), but dplyr is what you want.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Best greetings
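For reference, a minimal sketch of the kind of dplyr pipe meant here, spelling the condition out explicitly (df is the example data frame from the question):
library(dplyr)
df %>%
  mutate(Final_status = if_else(Status_2014 == 1 | Status_2015 == 1 | Status_2016 == 1, 1, 0))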

R: expand sequence of binary values from a time column

I have a table of time and binary values,
> head(x,10)
Time binary
1 358.214 1
2 359.240 1
3 360.039 0
4 361.163 0
5 361.164 1
6 362.113 1
7 362.114 0
8 365.038 0
9 365.039 0
10 367.488 0
I want to check, one second later, whether the value in the binary column is 1 or 0, and then create a new column of the new values. The time here is not continuous. For example, the first value here is 358.214 and the binary value is 1; if I add a second it becomes 359.214, and the value is still 1 based on the previous value, because 359.214 is not in the dataset.
I want to add two new columns, one for the seconds increasing and one for the new binary values.
time2 new_binary
1 358.214 1
2 359.214 1
3 360.214 0
4 361.214 1
5 362.214 0
6 363.214 0
7 364.214 0
8 365.214 0
9 366.214 0
10 367.214 0
How can I do this in R?
The dataset,
Time <- c(358.214, 359.240, 360.039, 361.163, 361.164, 362.113, 362.114, 365.038, 365.039, 367.488, 367.489, 368.763, 368.764, 371.538, 371.539, 384.013, 384.014, 386.088, 386.089, 389.463, 389.464, 392.663, 392.664, 414.588, 414.589, 421.463, 421.464, 427.863, 427.864, 431.488, 431.489, 432.074, 432.075, 437.124, 437.125, 439.024, 439.025, 451.724, 451.725, 456.224, 456.225, 457.301, 457.302, 459.526, 459.527, 470.776, 470.777, 471.951, 471.952, 477.651, 477.652, 479.601, 479.602, 480.426, 480.427, 480.950, 480.951, 494.626, 494.627, 516.551, 516.552, 539.901, 539.902, 545.276, 545.277, 546.536, 546.537, 548.436, 548.437, 551.111, 551.112, 556.086, 556.087, 557.561, 557.562, 567.799, 567.800, 580.049, 580.050, 583.249, 583.250, 587.374, 587.375, 588.599, 588.600, 596.199, 596.200, 597.674, 597.675, 601.249, 601.250, 602.499, 602.500, 620.699, 620.700, 631.099, 631.100, 637.249, 637.250, 638.999, 639.000, 650.574, 650.575, 658.199, 658.200, 658.696, 658.697, 668.396, 668.397, 676.021, 676.022, 678.846, 678.847, 688.121, 688.122, 690.371, 690.372, 701.946, 701.947, 704.921, 704.922, 712.346, 712.347, 719.321, 719.322, 721.146, 721.147, 723.496, 723.497, 725.696, 725.697, 727.121, 727.122, 729.871, 729.872, 733.721, 733.722, 739.054, 758.078, 761.321, 761.322, 764.221, 764.222, 768.679, 768.680, 774.529, 774.530, 776.679, 776.680, 778.129, 778.130, 780.779, 780.780, 837.204, 837.205, 842.079, 842.080, 846.329, 846.330, 847.579)
binary <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0 ,0 ,1 ,1, 0, 0, 1, 1, 0, 0, 1, 1 ,0, 0 ,1 ,1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0 ,0 ,1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1 ,0 ,0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
Update: here are my attempts.
First I generated a sequence of the new seconds (which is longer than the original Time):
time2 <- seq(x$Time[1], x$Time[length(x$Time)])
Then I used ifelse to loop through Time and compare it with time2: if the value in time2 is not equal to the value in Time, take the previous binary value from Time; otherwise take the binary value itself. So I want a function that keeps comparing two columns of different lengths.
What I did is this,
View(vec_new <- data.frame(time2))
vec_new <- vec_new %>%
mutate(new_Binary = ifelse((x$Time != vec_new$time2) & (vec_new$time2 %in% x$Time), lag(x$binary), x$binary))
However, I got this warning because the columns have different lengths:
"longer object length is not a multiple of shorter object length"
Also, the results are not quite what I expected. I don't understand how this loops through the values, although I did get binary values all the way to the end of time2.
Any idea how to achieve this in R?
If you use mutate from the dplyr package the solution is relatively easy:
library(dplyr)
df <- data.frame(Time, binary) %>%
mutate(Time=Time-Time[1]) %>%
mutate(binary=as.logical(binary))
Output
head(df)
# Time binary
# 1 0.000 TRUE
# 2 1.026 TRUE
# 3 1.825 FALSE
# 4 2.949 FALSE
# 5 2.950 TRUE
# 6 3.899 TRUE
If you want to create new columns you simply have to give them a new name.
df <- data.frame(Time, binary) %>%
mutate(time2=Time-Time[1]) %>%
mutate(new_binary=as.logical(binary))
Output
head(df)
# Time binary time2 new_binary
# 1 358.214 1 0.000 TRUE
# 2 359.240 1 1.026 TRUE
# 3 360.039 0 1.825 FALSE
# 4 361.163 0 2.949 FALSE
# 5 361.164 1 2.950 TRUE
# 6 362.113 1 3.899 TRUE
And this solution gives you the time according to your desired output (I hope).
df <- data.frame(Time, binary) %>%
mutate(time2 = row_number() + 357.214) %>%   # row_number() avoids referring back to the pre-existing df via rownames(df)
mutate(new_binary=as.logical(binary))
head(df)
Output
head(df)
# Time binary time2 new_binary
# 1 358.214 1 358.214 TRUE
# 2 359.240 1 359.214 TRUE
# 3 360.039 0 360.214 FALSE
# 4 361.163 0 361.214 FALSE
# 5 361.164 1 362.214 TRUE
# 6 362.113 1 363.214 TRUE
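As an additional sketch (assuming, as the desired output suggests, that the goal is to carry the last observed binary value forward on a one-second grid), base R's findInterval can pick, for each grid time, the most recent observation at or before it:
# Time and binary are the vectors given in the question
x <- data.frame(Time, binary)
# one-second grid starting at the first observation
time2 <- seq(x$Time[1], max(x$Time), by = 1)
# index of the most recent observation at or before each grid time
idx <- findInterval(time2, x$Time)
result <- data.frame(time2, new_binary = x$binary[idx])
head(result, 10)
This reproduces the ten example rows shown in the question.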

Is there an R function similar to foreach loops in Stata for creating new variables based on the name (or root) of existing variables?

I have a list of 60 variables (30 pairs, essentially), and I need to combine the information across all the pairs to create new variables based on the data stored in each pair.
To give some context, I am working on a systematic review of prediction model studies, and I extracted data on which variables were considered for inclusion in the prediction model of each study (the first 30 variables) and which variables were included in the model (the second 30 variables)
All variables are binary.
The first 30 variables are written in the form “p_[varname]”
The second 30 are written in the form “p_[varname]_inc”.
I want to create a new variable that is called [varname] and takes the values “Not considered”, “Considered”, and “Included”.
In Stata, I could easily do this like so:


foreach v of [varname1]-[varname30] {
gen `v' = "Not considered" if p_`v' == 0
replace `v' = "Considered" if p_`v' == 1 & p_`v'_inc == 0
replace `v' = "Included" if p_`v'_inc == 1 & p_`v'_inc == 1
}
In R, the only way I can figure out how to do it is by copying and pasting the same ifelse statement for all variables, for example:
predictor_vars %>%
mutate(age = ifelse(p_age==1 & p_age_inc==1, "Included",
ifelse(p_age==1 & p_age_inc==0, "Considered", "Not considered")),
sex = ifelse(p_sex==1 & p_sex_inc==1, "Included",
ifelse(p_sex==1 & p_sex_inc==0, "Considered", "Not considered")),
....
[varname] = ifelse([varname]==1 & [varname]_inc==1, "Included",
ifelse([varname]==1 & [varname]_inc==0, "Considered", "Not considered"))
)
Is there an easier way to do this in R / dplyr?
Edit: Sorry for not providing enough detail before (new here, but really appreciate the fast responses!). Here is a sample of the data
structure(list(p_age = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0), label = "Age", class = c("labelled",
"numeric")), p_age_inc = structure(c(1, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0
), label = "Age", class = c("labelled", "numeric")), p_sex = structure(c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0), label = "Sex", class = c("labelled", "numeric"
)), p_sex_inc = structure(c(1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), label = "Sex", class = c("labelled",
"numeric")), p_nation = structure(c(0, 0, 0, 0, 1, 1, 0, 1, 0,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), label = "Nationality / country", class = c("labelled",
"numeric")), p_nation_inc = structure(c(0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0), label = "Nationality / country", class = c("labelled", "numeric"
)), p_prevtb = structure(c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), label = "Treatment regimen / treatment status (retreatment)", class = c("labelled",
"numeric")), p_prevtb_inc = structure(c(0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), label = "Previous TB / retreated TB", class = c("labelled",
"numeric"))), row.names = c(NA, 50L), class = "data.frame")
The first few rows (with 4 pairs of predictor columns) look like this:
p_age p_age_inc p_sex p_sex_inc p_nation p_nation_inc p_prevtb
1 1 1 1 1 0 0 0
2 1 0 1 0 0 0 0
3 1 0 1 1 0 0 0
4 1 1 1 1 0 0 0
5 1 1 1 0 1 0 1
6 1 1 1 0 1 0 1
p_prevtb_inc
1 0
2 0
3 0
4 0
5 0
6 0
And I'd like to create the new variables like this:
p_age p_age_inc p_sex p_sex_inc p_nation p_nation_inc p_prevtb
1 1 1 1 1 0 0 0
2 1 0 1 0 0 0 0
3 1 0 1 1 0 0 0
4 1 1 1 1 0 0 0
5 1 1 1 0 1 0 1
6 1 1 1 0 1 0 1
p_prevtb_inc age sex nation prevtb
1 0 Included Included Not considered Not considered
2 0 Considered Considered Not considered Not considered
3 0 Considered Included Not considered Not considered
4 0 Included Included Not considered Not considered
5 0 Included Considered Considered Considered
6 0 Included Considered Considered Considered
This solution could be improved upon, but it works. The function creates the variables the question asks for in a standard for loop over the p_* variables and then returns the result.
The Bind argument controls what is returned: setting Bind = FALSE returns just the newly created variables.
library(dplyr)  # for case_when

create_var <- function(X, Bind = TRUE){
  xnames <- names(X)
  # the p_[varname] columns (but not the p_[varname]_inc ones)
  p_only <- grep('p_([^_]+$)', xnames, value = TRUE)
  res <- vector('list', length = length(p_only))
  for(i in seq_along(p_only)){
    x <- X[[ p_only[i] ]]                  # considered (p_[varname])
    y <- X[[ paste0(p_only[i], '_inc') ]]  # included (p_[varname]_inc)
    res[[i]] <- case_when(
      as.logical(x) & as.logical(y) ~ "Included",
      as.logical(x) & !as.logical(y) ~ "Considered",
      !as.logical(x) ~ "Not considered",
      TRUE ~ "Not considered"
    )
  }
  names(res) <- sub('^p_', '', p_only)     # drop the p_ prefix for the new columns
  res <- do.call(cbind.data.frame, res)
  if(Bind) cbind(X, res) else res
}
create_var(df1)
df1 %>% create_var()
df1 %>% create_var(Bind = FALSE)
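If you prefer a single dplyr pipeline, here is a hedged sketch using across with cur_column to look up the matching *_inc column for each p_[varname] column; the rename_with step just strips the p_ prefix and _status suffix from the new columns.
library(dplyr)
out <- df1 %>%
  mutate(across(
    matches("^p_[^_]+$"),   # p_age, p_sex, ... but not the *_inc columns
    ~ case_when(
      . == 1 & get(paste0(cur_column(), "_inc")) == 1 ~ "Included",
      . == 1 ~ "Considered",
      TRUE ~ "Not considered"
    ),
    .names = "{.col}_status"
  )) %>%
  rename_with(~ sub("^p_(.*)_status$", "\\1", .x), ends_with("_status"))
head(out)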

How to find bounding boxes of objects in raster?

I have a binary raster consisting of objects (1) and background (0). How can I find the bounding boxes of the objects? Each object should have its own bounding box.
Input:
library("raster")
mat = matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0,
0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE
)
ras = raster(mat)
I expect this result:
result = raster(matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0,
0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0,
0, 1, 0, 0, 1, 0,
0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE
))
Here is an approach.
Example data
library(raster)
mat = matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0,
0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE )
ras <- raster(mat)
Solution
f <- function(r) {
  x <- reclassify(r, cbind(0, NA))         # background (0) cells become NA
  y <- rasterToPolygons(x, dissolve=TRUE)  # polygonize the objects
  z <- disaggregate(y)                     # one polygon per object
  e <- sapply(1:length(z), function(i) extent(z[i,]))  # bounding extent of each object
  p <- spPolygons(e)                       # turn the extents into polygons
  r <- rasterize(p, r)                     # rasterize the boxes onto the input grid
  d <- boundaries(r)                       # keep only the box outlines
  reclassify(d, cbind(NA, 0))              # NA background back to 0
}
r <- f(ras)
as.matrix(r)
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 0 0 0
#[2,] 0 1 1 1 1 0
#[3,] 0 1 1 1 1 0
#[4,] 0 0 0 0 0 0
#[5,] 0 1 1 1 1 0
#[6,] 0 1 0 0 1 0
#[7,] 0 1 1 1 1 0
#[8,] 0 0 0 0 0 0
It is of course possible for the bounding boxes of different objects to overlap, in which case there is no valid raster representation of this kind, I suppose.
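For comparison, a sketch of a clump-based alternative (this assumes the igraph package is installed, which raster::clump needs): label the connected objects first, then paint each object's box outline from the row/column range of its cells.
library(raster)
cl <- clump(ras, directions = 8)          # unique integer id per object, NA elsewhere
ids <- na.omit(unique(values(cl)))
out <- raster(ras)
values(out) <- 0
for (id in ids) {
  cells <- which(values(cl) == id)
  rows <- rowFromCell(cl, cells)
  cols <- colFromCell(cl, cells)
  # all cells of the bounding box, then keep only its border
  box <- expand.grid(row = min(rows):max(rows), col = min(cols):max(cols))
  edge <- box$row %in% c(min(rows), max(rows)) | box$col %in% c(min(cols), max(cols))
  out[cellFromRowCol(out, box$row[edge], box$col[edge])] <- 1
}
as.matrix(out)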

How to count the frequency of consecutive rows with the same number for several columns

I have a dataset as follows:
structure(list(chr = c(1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
1, 0, 0, 1, 1, 1, 1), leftPos = c(240000, 1080000, 1200000, 1320000,
1440000, 1800000, 2400000, 2520000, 3120000, 3360000, 3480000,
3600000, 3720000, 4200000, 4560000, 4920000, 5040000, 5160000,
5280000, 6e+06), chr.1 = c(1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1), leftPos.1 = c(240000, 1080000, 1200000,
1320000, 1440000, 1800000, 2400000, 2520000, 3120000, 3360000,
3480000, 3600000, 3720000, 4200000, 4560000, 4920000, 5040000,
5160000, 5280000, 6e+06), ASample = c(0,
0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0), Sample1 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1), Sample2 = c(0,
1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1), Sample3 = c(0,
1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1), Sample4 = c(0,
0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1), Sample5 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1), Sample6 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1), Sample7 = c(0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1), Sample8 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1), Sample9 = c(0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1), Sample10 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1), Sample11 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1), Sample12 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1), Sample13 = c(0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0), Sample14 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1), Sample15 = c(0,
1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1)), .Names = c("chr",
"leftPos", "chr.1", "leftPos.1", "Sample1",
"Sample2",
"Sample3", "Sample4",
"Sample5", "Sample6",
"Sample7", "Sample8",
"Sample9", "Sample10",
"Sample11", "Sample12",
"Sample13", "Sample14",
"Sample15"), row.names = c(NA,
20L), class = "data.frame")
I need to count the number of rows per column where the same 1 or -1 occurs more than once in a row.
That is, I would like to count the number of consecutive rows for each column, grouped by chromosome (the chr column), that form runs of consecutive 1s or -1s (e.g. three in a row) within a chromosome.
The ideal output would be something like (not taken from the dput data above)
chr numberOfConsecutive1s FreqSample1 FreqSample2 FreqSample3 etc
1 2 3 2 14
1 3 5 2 2
1 4 5 0 6
1 5 4 3 5
1 6 3 0 3
1 7 7 5 7
1 8 5 0 2
1 9 54 2 6
1 10 34 77 7
2 2 6 4 2
2 3 23 34 34
2 4 5 37 2
2 5 55 24 22
2 6 2 0 11
2 7 3 14 5
2 8 2 5 77
2 9 5 23 34
2 10 5 11 34
3 1 32 0 2
So far I have tried the following, which simply converts non-consecutive 1s into 0s so that I am left with only the consecutive 1s. I don't know how to count them up as in the desired output.
dx<-DAT_list2res
f0 <- function( colNr, dx )
{
col <- dx[,colNr]
n1 <- which( col == 1 ) # The `1`-rows.
d0 <- which( diff(col) == 0 ) # Consecutive entries are equal.
dc0 <- which( diff(dx[,1]) == 0 ) # Same chromosome.
m <- intersect( n1-1, intersect( d0, dc0 ) )
return ( setdiff( 1:nrow(dx), union(m,m+1) ) )
}
g <- function( dx )
{
for ( i in 3:ncol(dx) ) { dx[f0(i,dx),i] <- 0 }
return ( dx )
}
dx<-g(dx)
EDIT
I also tried this as suggested by bramtayl:
result =
consecFreq %>%
select(-chr) %>%
gather(variable, chr, 5:190) %>%
group_by(variable) %>%
mutate(ID =
chr %>%
lag %>%
`!=`(chr) %>%
plyr::mapvalues(NA, FALSE) %>%
cumsum) %>%
count(variable, chr, ID) %>%
rename(numberOfConsecutive1s = n) %>%
count(variable, chr, numberOfConsecutive1s) %>%
spread(variable, n, fill = 0)
but it gives me an 'index out of bounds' error. If I leave out the spread line I get an odd output as well, so I'm not sure this is the answer.
REVISED
Based on clarifications, this approach uses the rle function for each chromosome to find runs of consecutive 1's or -1's, and then table to count the number of runs of each length. This gives NA for samples which have no counts for a particular run length, so the last line of the code converts the NA's to 0's, if that is helpful. Finally, there seems to be a problem with your structure input: Cytospongex10_SLX.9395.FastSeqK.fq.gz.res is missing from the .Names section of structure. This shifts all the column names and makes the last column name NA, which can cause problems in execution.
The code below assigns the correct names to the input data (in data.frame df) and then calculates the frequencies as described above.
colnames(data) <- c("chr",
"leftPos", "chr.1", "leftPos.1", "Cytospongex10_SLX.9395.FastSeqK.fq.gz.res", "Sample1",
"Sample2",
"Sample3", "Sample4",
"Sample5", "Sample6",
"Sample7", "Sample8",
"Sample9", "Sample10",
"Sample11", "Sample12",
"Sample13", "Sample14",
"Sample15")
chr_labels <- sort(unique(data$chr))
sampl_freqs <- data.frame(chr=1, numberOfConsecutive1s=1, count=0)
for( sampl in colnames(data)[-(1:5)]) {
freqs <- data.frame()
for( chr in chr_labels ) {
runs <- rle(data[data$chr == chr,sampl])
freqs_chr <- data.frame(chr=chr, table(runs$length[runs$values %in% c(-1,1)], dnn = "numberOfConsecutive1s") )
freqs <- rbind(freqs, freqs_chr)
}
sampl_freqs <- merge.data.frame(sampl_freqs, freqs, by = c("chr","numberOfConsecutive1s"), all=TRUE)
colnames(sampl_freqs) <- c(head(colnames(sampl_freqs),-1),paste("Freq",sampl,sep=""))
}
# clean up from sampl_freqs definition
sampl_freqs <- sampl_freqs[,-3]
# To convert NA's to 0
sampl_freqs <- data.frame(sampl_freqs[,1:2], sapply(sampl_freqs[,-(1:2)], function(x) ifelse(is.na(x), 0, x)))
Similar to above, but uses dplyr
library(reshape2)
library(dplyr)
df <- melt(data[,-(2:5)], id.vars="chr", variable.name="sample")
sampl_freqs <- df %>% group_by(sample, chr ) %>%
do(data.frame(unclass(rle(.$value))) %>%
filter(values %in% c(-1,1)) ) %>%
group_by(sample, chr, lengths) %>%
summarize(Freq = n() ) %>%
dcast( chr + lengths ~ sample, value.var = "Freq" )
sampl_freqs <- with(sampl_freqs,data.frame( chr, numberOfConsecutive1s = lengths ,
sapply(sampl_freqs[,-(1:2)], function(x) ifelse(is.na(x), 0, x))))
I think you want something like this:
library(dplyr)
library(tidyr)
min_chunk_length = 1
result =
data %>%
rename(chromosome = chr) %>%
select(chromosome, Sample1:Sample15) %>%
gather(sample, value, Sample1:Sample15) %>%
group_by(chromosome, sample) %>%
mutate(non_zero = value %in% c(1, -1),
chunk_ID =
non_zero %>%
lag %>%
`!=`(non_zero) %>%
plyr::mapvalues(NA, FALSE) %>%
cumsum) %>%
filter(non_zero) %>%
group_by(chromosome, sample, chunk_ID) %>%
mutate(length_of_chunk = n()) %>%
filter(length_of_chunk > min_chunk_length) %>%
count(chromosome, sample) %>%
spread(sample, n, fill = 0)
