Identifying pattern over time within observations (R) - r

I have data on transfer payments for thousands of people over several years with monthly entries whether an observation received a payment that month or not. I want to find out whether certain types of transfer receivers proposed by theory can be confirmed by the data. To do so, I plan to first do some descriptive statistics and later use the package TraMineR.
First, however, I want to simply figure out which observation fits which category. One such category, for example, are short time receivers of financial aid who only show up once. Thus, I need to identify all observations who received payments for only three month (or less). In addition, these periods of receiving aid cannot be interrupted, so if someone received aid for two month, the nothing for two, and then one month again, this would already be a different category. Here is a little example for only one year and for 30 observations:
dat <- data.frame(matrix(c(0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0),ncol=12))
In this example, my problem is row 13, otherwise I could simple use rowSums and then pick every row with a result equal or smaller than 3. Which procedure could I use to identify only those observations which received aid only in one connected period? And how would I identify observations such as 13?

You can use this function to identify the number of contiguous periods of payment and the number of months in each period:
aid <- lapply(apply(dat, 1, rle), function(x) unname(x$lengths[x$values==1]))
This will return a list, with one compoent per row of your data. For instance:
> aid[[1]]
integer(0)
> aid[[8]]
[1] 3
> aid[[13]]
[1] 1 1
indicating no period for row 1, one period of 3 months for row 8 and two periods of 1 month for row 13.
To find out how many contiguous periods each row has, you can use this:
cont <- sapply(aid, length)
Result:
> cont
[1] 0 1 1 0 0 0 1 1 0 0 1 1 2 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
> cont[13]
[1] 2
Note that only row 13 has two separate periods.

You can use rle function to filter which rows values equal to 1 at different times.
idx <- apply(dat,1,function(x){
y <- rle(x)
length(y$lengths[y$values ==1])> 1
})
dat[idx,]
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
13 0 0 0 0 0 0 0 0 1 0 0 1
Then you can apply rowSums on the filtred data
rowSums(dat[!idx,]) <=3

Related

R adding 10 to a specific element of a list in a dataframe

I have this dataframe df:
df<-structure(list(tile_type_index = c(9, 15, 20, 5, 20), tile_type = c("Flowers",
"Leather", "Outpost", "Wood 2", "Outpost"), material_on_hex = list(
c(0, 0, 0, 0, 0, 0, 0, 0, 1000, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1000,
0, 0, 0, 0, 0), c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), c(0, 0, 0, 0, 1000, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))), row.names = c(NA,
5L), class = "data.frame")
tile_type_index tile_type
1 9 Flowers
2 15 Leather
3 20 Outpost
4 5 Wood 2
5 20 Outpost
material_on_hex
1 0, 0, 0, 0, 0, 0, 0, 0, 1000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1000, 0, 0, 0, 0, 0
3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4 0, 0, 0, 0, 1000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
5 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
I want to manipulate it in the following way: if the tile_type is "Outpost" then the row of material_on_hex should remain the same but if tile_type!="Outpost" then I want to add 10 to material_on_hex[tile_type_index]. The result should be:
tile_type_index tile_type
1 9 Flowers
2 15 Leather
3 20 Outpost
4 5 Wood 2
5 20 Outpost
material_on_hex
1 0, 0, 0, 0, 0, 0, 0, 0, 1010, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1010, 0, 0, 0, 0, 0
3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4 0, 0, 0, 0, 1010, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
5 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Trivial with a loop:
library(magrittr)
for (i in seq_len(nrow(df))) {
if (df$tile_type[i] == 'Outpost') next
tidx = df$tile_type_index[i]
df$material_on_hex[[i]][tidx] %<>% add(10)
}
# tile_type_index tile_type material_on_hex
# 1 9 Flowers 0, 0, 0, 0, 0, 0, 0, 0, 1010, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
# 2 15 Leather 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1010, 0, 0, 0, 0, 0
# 3 20 Outpost 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
# 4 5 Wood 2 0, 0, 0, 0, 1010, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
# 5 20 Outpost 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Could also use mapply():
df$material_on_hex = mapply(
\(x, y, z) {
if (x!='Outpost') z[y] %<>% add(10)
return(z)
},
x = df$tile_type, y = df$tile_type_index, z = df$material_on_hex,
SIMPLIFY = FALSE
)

How do I compute the average per row of multiple numeric columns

I have 8 age categories as 8 separate columns. Each column has a value between 1 and 3. I want to compute a new column that holds the average age per row.
This is my data:
structure(list(`2.5` = c(0, 0, 0, 1, 1, 2, 1, 2, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,
0, 2, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 2, 0, 0, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2, 2,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 2, 0, 2, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 3,
0, 0, 1), `9` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0,
1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 1,
0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 2, 0,
2, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1
), `15.5` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
`21.5` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `29.5` = c(0,
1, 2, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0), `42` = c(0, 0, 0,
2, 1, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 2,
0, 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 0, 1, 1, 2, 0, 2,
1, 0, 0, 0, 0, 2, 1, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,
2, 0, 1, 0, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0,
2, 1, 1, 0, 0, 2, 0, 0, 0, 2, 1, 1, 1, 1, 0, 1, 2, 2, 0,
0, 0, 0, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 2, 0, 2, 1, 0, 1, 1, 2, 0, 0, 2, 1, 2, 2, 2, 0,
1, 0, 1, 0, 2, 2, 2, 1, 0, 0, 2, 0, 0, 0, 0, 2, 0, 2, 2,
2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2, 1, 0, 0, 0, 2, 2, 0, 2,
0, 0, 2, 2, 0, 0, 0, 0, 2, 1, 2, 0, 0, 1, 2, 0, 0, 0, 1,
1, 2, 2, 1, 0, 0, 0, 2, 1, 1, 2), `57` = c(0, 1, 0, 0, 0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 2, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 2, 0, 0, 2,
0, 0, 2, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 2, 2, 0, 0, 0, 0,
0, 0, 2, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 2, 0, 1,
2, 0, 2, 0, 1, 1, 0, 0, 0, 2, 0, 0, 1, 2, 2, 2, 0, 2, 0,
0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0,
0, 0, 2, 0, 1, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 2, 0, 1, 0, 1, 0, 0), `72` = c(2, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 2,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2, 0, 0, 0, 2,
0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0,
0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0,
0, 2, 1, 0, 0, 0, 0)), row.names = c(NA, -204L), class = c("data.table",
"data.frame"))
First, create a temporary data frame where you calculate the ages from column names. Then, with rowSums calculate average ages. (Supposed your data frame is called d.)
tmp <- do.call(cbind, lapply(seq(d), function(x) d[x] * as.numeric(colnames(d)[x])))
d$mu <- rowSums(tmp) / rowSums(d)
head(d)
# 2.5 9 15.5 21.5 29.5 42 57 72 mu
# 1 0 0 0 0 0 0 0 2 72.00000
# 2 0 0 0 0 1 0 1 0 43.25000
# 3 0 0 0 0 2 0 0 0 29.50000
# 4 1 0 0 0 0 2 0 0 28.83333
# 5 1 0 0 0 1 1 0 1 36.50000
# 6 2 0 0 0 0 2 0 0 22.25000
apply is a useful option, where 1 tells it to compute by row. It also seems to play nicely with data.tables:
df$means <- apply(df, 1, function(r) sum(r * as.double(names(df))) / sum(r))
#### OUTPUT ####
2.5 9 15.5 21.5 29.5 42 57 72 means
1: 0 0 0 0 0 0 0 2 72.00000
2: 0 0 0 0 1 0 1 0 43.25000
3: 0 0 0 0 2 0 0 0 29.50000
4: 1 0 0 0 0 2 0 0 28.83333
5: 1 0 0 0 1 1 0 1 36.50000
---
200: 0 0 0 0 0 0 1 1 64.50000
201: 3 0 0 0 0 2 0 0 18.30000
202: 0 0 1 0 0 1 1 0 38.16667
203: 0 0 0 0 1 1 0 0 35.75000
204: 1 1 0 0 0 2 0 0 23.87500
Here is a base R one-liner where we multiply values in the columns of dataframe by its names, calculate the sum of column values and divide by its rowSums.
df$result <- colSums(t(df) * as.numeric(names(df)))/rowSums(df)
head(df)
# 2.5 9 15.5 21.5 29.5 42 57 72 result
#1 0 0 0 0 0 0 0 2 72.00000
#2 0 0 0 0 1 0 1 0 43.25000
#3 0 0 0 0 2 0 0 0 29.50000
#4 1 0 0 0 0 2 0 0 28.83333
#5 1 0 0 0 1 1 0 1 36.50000
#6 2 0 0 0 0 2 0 0 22.25000

R - Use all non-zero fractions in a data frame as probabilities for replacing themselves with sample()

I have an R tibble called my_data which is composed of either (1) zeros, or (2) numbers between zero and one:
> my_data
# A tibble: 30 x 40
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 0 0.969 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0
# ... with 20 more rows, and 27 more variables: s14 <dbl>, s15 <dbl>, s16 <dbl>,
# s17 <dbl>, s18 <dbl>, s19 <dbl>, s20 <dbl>, s21 <dbl>, s22 <dbl>, s23 <dbl>,
# s24 <dbl>, s25 <dbl>, s26 <dbl>, s27 <dbl>, s28 <dbl>, s29 <dbl>, s30 <dbl>,
# s31 <dbl>, s32 <dbl>, s33 <dbl>, s34 <dbl>, s35 <dbl>, s36 <dbl>, s37 <dbl>,
# s38 <dbl>, s39 <dbl>, s40 <dbl>
I want to replace all the non-zero numbers in my_data (such as 0.969 in column s7) with 1s at a certain probability where the numbers themselves are the probabilities, otherwise they get replaced with 0s. For example, there is a 0.969 probability that the number 0.969 (in the column named s7) will be replaced by 1, and a 0.031 probability that it will be replaced by 0.
I tried this but it doesn't work:
# Doesn't work:
my_data %>%
mutate_all(function(x) {
case_when(x == 0 ~ 0,
x > 0 ~ sample(0:1, size = 1, prob = c(1 - x, x)))
})
How do I do this? Should I be using purrr::map() (how?) or something else? Thank you!
Here is the dput() of my_data:
structure(list(s1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), s2 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0.956159271283707, 0), s3 = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.982878394164842,
0, 0, 0, 0, 0.982878394164842), s4 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0.959674748019852, 0, 0, 0, 0, 0, 0, 0,
0, 0.959674748019852, 0, 0, 0, 0, 0), s5 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.876892497722416,
0, 0, 0, 0, 0, 0.876892497722416, 0, 0), s6 = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0.989641778880238, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0.989641778880238, 0, 0, 0, 0, 0), s7 = c(0.969355168732184,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.969355168732184,
0, 0, 0, 0, 0, 0, 0, 0), s8 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0.991517098892877, 0, 0, 0, 0, 0, 0, 0, 0, 0.991517098892877,
0, 0, 0, 0.991517098892877, 0.991517098892877), s9 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.929770500656618, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.929770500656618, 0, 0), s10 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), s11 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.993637560789263,
0), s12 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0.949340969426271, 0, 0, 0.949340969426271,
0, 0), s13 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.919919910704918, 0), s14 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.937896138681889,
0, 0, 0.937896138681889, 0, 0, 0, 0, 0), s15 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.991517098892877, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.991517098892877), s16 = c(0.956159271283707,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.956159271283707, 0, 0,
0, 0, 0, 0, 0, 0.956159271283707, 0, 0, 0, 0, 0, 0, 0.956159271283707,
0.956159271283707), s17 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0.597187792371775, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0.597187792371775, 0), s18 = c(0.975209130375021, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.975209130375021, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.975209130375021), s19 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), s20 = c(0.937234650859115, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.937234650859115, 0, 0, 0,
0, 0, 0.937234650859115, 0, 0, 0, 0, 0, 0, 0, 0), s21 = c(0.929770500656618,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.929770500656618,
0, 0, 0, 0, 0, 0.929770500656618, 0, 0, 0, 0, 0, 0, 0.929770500656618,
0), s22 = c(0.929770500656618, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.929770500656618, 0, 0, 0, 0,
0, 0, 0, 0.929770500656618), s23 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0.921445826350068), s24 = c(0.919919910704918, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), s25 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.993637560789263, 0),
s26 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.942968974602997, 0.942968974602997
), s27 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.959674748019852,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), s28 = c(0.999498946154851,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0), s29 = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0.988925875658174, 0), s30 = c(0, 0.975209130375021,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0.975209130375021, 0), s31 = c(0.986350500013957,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.986350500013957, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.986350500013957
), s32 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.942968974602997,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.942968974602997),
s33 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.927760110879459,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), s34 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.919919910704918,
0, 0, 0, 0, 0, 0, 0, 0, 0.919919910704918, 0, 0, 0.919919910704918,
0, 0), s35 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.956159271283707,
0, 0, 0, 0, 0, 0, 0, 0.956159271283707, 0, 0, 0, 0, 0, 0,
0.956159271283707, 0), s36 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0.991517098892877, 0, 0.991517098892877, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.991517098892877, 0.991517098892877,
0), s37 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0.919919910704918, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.919919910704918,
0), s38 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), s39 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0.999972102622884, 0, 0, 0, 0, 0), s40 = c(0.942968974602997,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0.942968974602997, 0, 0, 0, 0, 0, 0, 0, 0)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -30L))
You are trying to sample from a binomial distribution. Fortunately rbinom is vectorized regarding its prob parameter and you can avoid any R loops (for, apply, Vectorize, etc.).
m <- as.matrix(DF)
set.seed(42) #for reproducibility
m[m != 0] <- rbinom(sum(m != 0), 1, m[m != 0])
You could try :
library(tidyverse)
as_tibble(apply(df, c(1,2), function(x) sample(c(0,1),1,prob=c(1-x,x))))
It's usually discouraged to convert from matrix to data.frame but here it seems you really have a matrix formatted as a data.frame so I went for it.
To avoid the conversion you could do:
df %>% mutate_all(~ map_dbl(.,~sample(c(0,1),1,prob=c(1-.x,.x))))
The following will test the value before sampling, but I'm not sure if it'll be much faster or any faster:
df %>% mutate_all(~ map_if(.,~. != 0, ~sample(c(0,1),1,prob=c(1-.x,.x))) %>% unlist)
I would use runif:
df %>%
map_df(~ if_else(runif(length(.x)) < .x, 1, 0))
If you really want to use your custom function (with case_when) you can do
df %>%
rowwise() %>%
mutate_all(function(x) {
case_when(x == 0 ~ 0L,
x > 0 ~ sample(0:1, size = 1, prob = c(1 - x, x)))
})
Or
f = function(x) {
case_when(x == 0 ~ 0L,
x > 0 ~ sample(0:1, size = 1, prob = c(1 - x, x))) }
f = Vectorize(f)
df %>% mutate_all(f)
You had 2 issues with your approach.
1) Your function wasn't vectorized and was considering the whole columns of probabilities. The error was Error in mutate_impl(.data, dots) :
Evaluation error: incorrect number of probabilities. Using rowwise or vectorising your function will solve this.
2) case_when didn't return values of the same type. The error was Error in mutate_impl(.data, dots) :
Evaluation error: must be type double, not integer. Using 0L insted of 0 will solve this.

Aggregating data from a combination of selected rows and columns

First, I have checked previous answers without finding the solution to my specific problem.
I have data on scientific papers with year of publication in one column and several columns with yes/no (1/0 binary data), whether the papers assign to various themes (in the data each given a 2-letter abbreviation). Here is a selection of the data set:
structure(list(YR = c(2016, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2014, 2014, 2014, 2014, 2014, 2014, 2013, 2013, 2012, 2012,
2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2010, 2010, 2009,
2009, 2009, 2009, 2009, 2009, 2007, 2007, 2007, 2007, 1993, 1993,
1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1992,
1992, 1992, 1992, 1992, 1991, 1991, 1991, 1991, 1991, 1991, 1991,
1991, 1991, 1991, 1990, 1990, 1988, 1988, 1988, 1988, 1988, 1988,
1988, 1988, 1988, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987,
1987, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1980, 1980, 1980,
1980, 1980, 1979, 1979), ALL = c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
1, 1, 1, 0, 1), AB = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), BB = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
BS = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), CS = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), DS = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), EG = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0), FB = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0), GB = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), KB = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), KS = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), LS = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), LP = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), MC = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), NB = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), NW = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
SB = c(1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1), SH = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), VM = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), WH = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 0)), .Names = c("YR", "ALL", "AB", "BB", "BS", "CS",
"DS", "EG", "FB", "GB", "KB", "KS", "LS", "LP", "MC", "NB", "NW",
"SB", "SH", "VM", "WH"), row.names = c(NA, -99L), class = "data.frame")
The following script:
part <- test[,c(2:21)]
col1 <- names(test[,2:21])
col2 <- data.frame(apply(part,2,sum))
pops.n <- cbind(col1,col2)
colnames(pops.n) <- c('theme','count')
gives me the following output:
theme count
ALL ALL 51
AB AB 2
BB BB 6
BS BS 11
CS CS 11
DS DS 4
EG EG 8
FB FB 10
GB GB 2
KB KB 2
KS KS 4
LS LS 4
LP LP 2
MC MC 3
NB NB 8
NW NW 2
SB SB 19
SH SH 4
VM VM 4
WH WH 24
However, I want an output that counts each theme per year, in a "long version" data table:
YR | theme | count
--------------------------
1975 WH 15
1976 WH 9
1977 WH 22
...
1975 AB 3
1976 AB 9
...
1976 SB 7
1978 SB 19
1979 SB 7
...
Grateful for any help.
Here is a method that "reshapes"the data long and then uses aggregate to sum over year-theme pairs.
newDat <- setNames(aggregate(values ~ YR + ind,
data=cbind(dat[1], stack(dat[-1])), sum),
c("YR", "theme", "count"))
Here, stack produces a two column data.frame with the counts in one column and the themes in the other. The year variable is added with cbind. This data.frame is the fed to aggregate which calculates the group level sums. setNames renames the variables of the output data.frame.
The first 10 lines are
head(newDat, 10)
YR ind values
1 1979 AB 0
2 1980 AB 0
3 1981 AB 0
4 1987 AB 1
5 1988 AB 0
6 1990 AB 0
7 1991 AB 0
8 1992 AB 0
9 1993 AB 0
10 2007 AB 0
You can also use data.table. First melt the data to a long format, and then sum the value by year and variable name.
library(data.table)
melt(setDT(df), id.vars = "YR")[, sum(value), by = .(YR, variable)]
## YR variable V1
## 1: 2016 ALL 0
## 2: 2015 ALL 4
## 3: 2014 ALL 5
## 4: 2013 ALL 0
## 5: 2012 ALL 5
## ---
## 336: 1988 WH 6
## 337: 1987 WH 2
## 338: 1981 WH 2
## 339: 1980 WH 1
## 340: 1979 WH 0
You have to use melt from reshape2 package like this:
melted_test <- melt(test, id.vars = 'YR')
Then, if you want the non-zero rows only, you can get them by:
melted_non_zero <- melted_test[which(melted_test$value != 0), ]
However, for how is your dataset defined, I can guess you're going to obtain a value column full of ones. Not a big deal!
Here is one way with tidyverse
library(tidyverse)
gather(dat, theme, count, -YR) %>%
group_by(YR , theme) %>%
summarise(count = sum(count)) %>%
arrange(theme)
# A tibble: 340 x 3
# Groups: YR [17]
# YR theme count
# <dbl> <chr> <dbl>
# 1 1979 AB 0
# 2 1980 AB 0
# 3 1981 AB 0
# 4 1987 AB 1
# 5 1988 AB 0
# 6 1990 AB 0
# 7 1991 AB 0
# 8 1992 AB 0
# 9 1993 AB 0
#10 2007 AB 0
# ... with 330 more rows
<

How to compare consecutive rows in a matrix column and then change value accordingly

I have a matrix full of 1's and 0's. The columns represent samples and the rows represent chromosomes.
I would like to keep all rows that have consecutive 1's in them (ie at least two consecutive rows with a 1 in it). This has to be restricted per chromosome (so that consecutive 1's between two chromosomes is not counted).
I would like to do this for each column in the matrix.
My data is as follows:
chr leftPos OC_030_ST.res OC_031_WG.res
1 4324 0 1
1 23433 1 1
1 34436 1 0
1 64755 1 1
3 234 1 0
3 354 0 1
4 1666 0 1
4 4565 0 1
5 34777 1 1
7 2345 1 1
7 4567 1 1
and the output should be:
chr leftPos OC_030_ST.res OC_031_WG.res
1 4324 0 1
1 23433 1 1
1 34436 1 0
1 64755 1 0
3 234 0 0
3 354 0 0
4 1666 0 1
4 4565 0 1
5 34777 0 0
7 2345 1 1
7 4567 1 1
I don't know how to compare consecutive rows according to chromosome. I imagine I could group by dplyr and somehow compare rows but the comparison is a bit beyond me.
EDIT
Using dput actual data
structure(list(chr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), leftPos = c(240000,
1080000, 1200000, 1320000, 1440000, 1800000, 2400000, 2520000,
3120000, 3360000, 3480000, 3600000, 3720000, 4200000, 4560000,
4920000, 5040000, 5160000, 5280000, 6000000, 7080000, 7200000,
7320000, 7440000, 7560000, 7680000, 7800000, 8280000, 8400000,
8520000, 8760000, 9120000, 9720000, 9840000, 9960000, 10080000,
10200000, 10320000, 10440000, 10560000, 10800000, 11040000, 11160000,
11280000, 11400000, 11520000, 11760000, 11880000, 12000000, 12120000
), chr.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), leftPos.res = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), OC_AH_026C.res = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), OC_AH_026C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_026C.2.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_084C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_AH_086C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_086C.1.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_086C.2.res = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), OC_AH_086C.3.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), OC_AH_088C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_094C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_AH_094C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_094C.2.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_094C.3.res = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), OC_AH_094C.4.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_094C.5.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_094C.6.res = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), OC_AH_094C.7.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_096C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_100C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_AH_100C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_AH_127C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_AH_133C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_ED_008C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ED_008C.1.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0), OC_ED_008C.2.res = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0), OC_ED_008C.3.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ED_016C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_ED_031C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_ED_036C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_GS_001C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_QE_062C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_RS_010C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_RS_027C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_RS_027C.1.res = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), OC_RS_027C.2.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_SH_051C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_ST_014C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_ST_014C.1.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ST_020C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_ST_024C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_ST_033C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ST_034C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_ST_034C.1.res = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), OC_ST_034C.2.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_ST_035C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_ST_036C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_ST_040C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_WG_002C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), OC_WG_005C.res = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), OC_WG_006C.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), OC_WG_019C.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), Type.res = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ZSSLX.10457.FastSeqA.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), ZSSLX.10457.FastSeqB.BloodDMets_13AF_AHMMH.s_1.r_1.fq.gz.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), ZSSLX.10457.FastSeqC.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0), ZSSLX.10457.FastSeqD.BloodDMets_27AF_AHMMH.s_1.r_1.fq.gz.res = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0), Means.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
sd.res = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), count = 1:50), .Names = c("chr",
"leftPos", "chr.res", "leftPos.res", "OC_AH_026C.res", "OC_AH_026C.1.res",
"OC_AH_026C.2.res", "OC_AH_084C.res", "OC_AH_086C.res", "OC_AH_086C.1.res",
"OC_AH_086C.2.res", "OC_AH_086C.3.res", "OC_AH_088C.res", "OC_AH_094C.res",
"OC_AH_094C.1.res", "OC_AH_094C.2.res", "OC_AH_094C.3.res", "OC_AH_094C.4.res",
"OC_AH_094C.5.res", "OC_AH_094C.6.res", "OC_AH_094C.7.res", "OC_AH_096C.res",
"OC_AH_100C.res", "OC_AH_100C.1.res", "OC_AH_127C.res", "OC_AH_133C.res",
"OC_ED_008C.res", "OC_ED_008C.1.res", "OC_ED_008C.2.res", "OC_ED_008C.3.res",
"OC_ED_016C.res", "OC_ED_031C.res", "OC_ED_036C.res", "OC_GS_001C.res",
"OC_QE_062C.res", "OC_RS_010C.res", "OC_RS_027C.res", "OC_RS_027C.1.res",
"OC_RS_027C.2.res", "OC_SH_051C.res", "OC_ST_014C.res", "OC_ST_014C.1.res",
"OC_ST_020C.res", "OC_ST_024C.res", "OC_ST_033C.res", "OC_ST_034C.res",
"OC_ST_034C.1.res", "OC_ST_034C.2.res", "OC_ST_035C.res", "OC_ST_036C.res",
"OC_ST_040C.res", "OC_WG_002C.res", "OC_WG_005C.res", "OC_WG_006C.res",
"OC_WG_019C.res", "Type.res", "ZSSLX.10457.FastSeqA.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res",
"ZSSLX.10457.FastSeqB.BloodDMets_13AF_AHMMH.s_1.r_1.fq.gz.res",
"ZSSLX.10457.FastSeqC.BloodDMets_16AF_AHMMH.s_1.r_1.fq.gz.res",
"ZSSLX.10457.FastSeqD.BloodDMets_27AF_AHMMH.s_1.r_1.fq.gz.res",
"Means.res", "sd.res", "count"), row.names = c(NA, 50L), class = "data.frame")
Here's a solution applying a function across chr values using the by = argument to data.table. Non-adjacent sequences are located using rle(). Should be fast too.
First, here is the data as I input it:
df <- read.table(textConnection(
"chr leftPos OC_030_ST.res OC_031_WG.res
1 4324 0 1
1 23433 1 1
1 34436 1 0
1 64755 1 1
3 234 1 0
3 354 0 1
4 1666 0 1
4 4565 0 1
5 34777 1 1
7 2345 1 1
7 4567 1 1"), header = TRUE)
Then the code to process the result:
# function to take an integer vector and make non-consecutive 1s into 0s
convertNonRuns <- function(booleanVec) {
rleVals <- rle(booleanVec)
makeZeroIndex1 <- which(rleVals$lengths == 1 & rleVals$values == 1)
makeZeroIndex2 <- sapply(makeZeroIndex1, function(x) cumsum(rleVals$lengths[1:x])[x])
if (length(makeZeroIndex2))
booleanVec[makeZeroIndex2] <- 0L
as.integer(booleanVec)
}
require(data.table)
dt <- data.table(df)
# use data.table's by command to convert runs within chr(omosome)
dt[, c("OC_030_ST.res", "OC_031_WG.res") :=
list(convertNonRuns(OC_030_ST.res), convertNonRuns(OC_031_WG.res)),
by = chr]
dt
## chr leftPos OC_030_ST.res OC_031_WG.res
## 1: 1 4324 0 1
## 2: 1 23433 1 1
## 3: 1 34436 1 0
## 4: 1 64755 1 0
## 5: 3 234 0 0
## 6: 3 354 0 0
## 7: 4 1666 0 1
## 8: 4 4565 0 1
## 9: 5 34777 0 0
## 10: 7 2345 1 1
## 11: 7 4567 1 1
Added
For the newly added dput data, this will work:
# select all variables OC*.res
varnamesToChange <- names(dt)[grep("^OC.*\\.res$", names(dt))]
dt[, varnamesToChange := lapply(varnamesToChange, function(x) dt[[x]]), by = chr]
I am using data.table version 1.9.6.
data.table solution, building on my initial ave solution, which is also below:
library(data.table)
setDT(dat)
for (nam in names(dat)[3:4]) {
dat[,
c(nam) := ((length((get(nam)==1)[get(nam)]) >= 2) & get(nam)==1)+0L,
by=list(chr, cumsum(get(nam)==0))
]
}
# chr leftPos OC_030_ST.res OC_031_WG.res
# 1: 1 4324 0 1
# 2: 1 23433 1 1
# 3: 1 34436 1 0
# 4: 1 64755 1 0
# 5: 3 234 0 0
# 6: 3 354 0 0
# 7: 4 1666 0 1
# 8: 4 4565 0 1
# 9: 5 34777 0 0
#10: 7 2345 1 1
#11: 7 4567 1 1
And my attempt using ave with a custom function:
fun <- function(x,grp,limit=2) {
runs <- ave(
x==1,
list(grp,cumsum(x==0)),
FUN=function(g) length(g[g]) >= limit
)
as.numeric(runs & x==1)
}
lapply(dat[3:4], fun, grp=dat$chr)
#$OC_030_ST.res
# [1] 0 1 1 1 0 0 0 0 0 1 1
#
#$OC_031_WG.res
# [1] 1 1 0 0 0 0 1 1 0 1 1
To overwrite your original data:
dat[3:4] <- lapply(dat[3:4], fun, grp=dat$chr)
f0(colNr,df) contains the row numbers in which the column df[,colNr] should change to 0. g(df) is the converted data frame.
f0 <- function( colNr, df )
{
col <- df[,colNr]
n1 <- which( col == 1 ) # The `1`-rows.
d0 <- which( diff(col) == 0 ) # Consecutive entries are equal.
dc0 <- which( diff(df[,1]) == 0 ) # Same chromosome.
m <- intersect( n1-1, intersect( d0, dc0 ) )
return ( setdiff( 1:nrow(df), union(m,m+1) ) )
}
g <- function( df )
{
for ( i in 3:ncol(df) ) { df[f0(i,df),i] <- 0 }
return ( df )
}
.
Example 1:
> df
chr leftPos OC_030_ST.res OC_031_WG.res
1 1 4324 0 1
2 1 23433 1 1
3 1 34436 1 0
4 1 64755 1 1
5 3 234 1 0
6 3 354 0 1
7 4 1666 0 1
8 4 4565 0 1
9 5 34777 0 1
10 7 2345 1 1
11 7 4567 1 1
> g(df)
chr leftPos OC_030_ST.res OC_031_WG.res
1 1 4324 0 1
2 1 23433 1 1
3 1 34436 1 0
4 1 64755 1 0
5 3 234 0 0
6 3 354 0 0
7 4 1666 0 1
8 4 4565 0 1
9 5 34777 0 0
10 7 2345 1 1
11 7 4567 1 1
>
Example 2:
> df
chr leftPos OC_030_ST.res OC_031_WG.res
1 1 4324 0 1
2 1 23433 1 1
3 1 34436 1 0
4 1 64755 1 1
5 3 234 1 0
6 3 354 1 1
7 4 1666 0 1
8 4 4565 1 1
9 5 34777 0 0
10 5 1234 1 0
11 7 2345 1 1
12 7 4567 1 1
> g(df)
chr leftPos OC_030_ST.res OC_031_WG.res
1 1 4324 0 1
2 1 23433 1 1
3 1 34436 1 0
4 1 64755 1 0
5 3 234 1 0
6 3 354 1 0
7 4 1666 0 1
8 4 4565 0 1
9 5 34777 0 0
10 5 1234 0 0
11 7 2345 1 1
12 7 4567 1 1
>
A simple trick can be to compare the original data set, say df, with its own copy df[-1,], which essentially takes the first row off.
Comparing (columnswise) df$OC_030_ST.res == df[-1,]$OC_030_ST.res (likewise for the others) gives back a true table where each element is being compared with its next one.
Perhaps you can make the next piece into a function and apply that per column per chromosome:
rand <- c(0,0,0,1,1,1,0,0,1,0,1,0,1,1,1,0,0,1,1,0)
first=T
keep <- vector(length=length(rand),'numeric')
for (i in 1:length(rand)){
if (first == T){first=F;if ((rand[i] == 1) & (rand[i+1] == 1)){keep[i] <- 1}} #check if first is 1 and had neigbour 1
else if (rand[i] == 0){keep[i] <- 0} # if 0 than keep = 0
else if (i == length(rand)){if (rand[i-1] == 1){keep[i] <- 1}} #if last than check if 1 and neighbour is 1 than keep = 1
else if ((rand[i-1]==1) | (rand[i+1]==1)){keep[i] <- 1} #if 1 and has neighbour 1 than keep =1
}
Output:
[1] 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0

Resources