How to find three consecutive rows with the same value - r

I have a dataframe as follows:
chr leftPos Sample1 X.DD 3_samples MyStuff
1 324 -1 1 1 1
1 4565 -1 0 0 0
1 6887 -1 1 0 0
1 12098 1 -1 1 1
2 12 -1 1 0 1
2 43 -1 1 1 1
5 1 -1 1 1 0
5 43 0 1 -1 0
5 6554 1 1 1 1
5 7654 -1 0 0 0
5 8765 1 1 1 0
5 9833 1 1 1 -1
6 12 1 1 0 0
6 43 0 0 0 0
6 56 1 0 0 0
6 79 1 0 -1 0
6 767 1 0 -1 0
6 3233 1 0 -1 0
I would like to convert it according to the following rules
For each chromosome:
a. If there are three or more 1's or -1's consecutively in a column then the value stays as it is.
b. If there are fewer than three 1's or -1's consecutively in a column then the value of the 1 or -1 changes to 0
The rows in a column have to have the same sign (+ or -ve) to be called consecutive.
The result of the dataframe above should be:
chr leftPos Sample1 X.DD 3_samples MyStuff
1 324 -1 0 0 0
1 4565 -1 0 0 0
1 6887 -1 0 0 0
1 12098 0 0 0 0
2 12 0 0 0 0
2 43 0 0 0 0
5 1 0 1 0 0
5 43 0 1 0 0
5 6554 0 1 0 0
5 7654 0 0 0 0
5 8765 0 0 0 0
5 9833 0 0 0 0
6 12 0 0 0 0
6 43 0 0 0 0
6 56 1 0 0 0
6 79 1 0 -1 0
6 767 1 0 -1 0
6 3233 1 0 -1 0
I have managed to do this for two consecutive rows but I'm not sure how to change this for three or more rows.
DAT_list2res <-cbind(DAT_list2[1:2],DAT_list2res)
colnames(DAT_list2res)[1:2]<-c("chr","leftPos")
DAT_list2res$chr<-as.numeric(gsub("chr","",DAT_list2res$chr))
DAT_list2res<-as.data.frame(DAT_list2res)
dx<-DAT_list2res
f0 <- function( colNr, dx)
{
col <- dx[,colNr]
n1 <- which(col == 1| col == -1) # The `1`-rows.
d0 <- which( diff(col) == 0) # Consecutive rows in a column are equal.
dc0 <- which( diff(dx[,1]) == 0) # Same chromosome.
m <- intersect( n1-1, intersect( d0, dc0 ) )
return ( setdiff( 1:nrow(dx), union(m,m+1) ) )
}
g <- function( dx )
{
for ( i in 3:ncol(dx) ) { dx[f0(i,dx),i] <- 0 }
return ( dx )
}
dx<-g(dx)

Here is one solution using only base R.
First define a function that replaces any run shorter than 3 with zeros:
replace_f <- function(x){
subs <- rle(x)
subs$values[subs$lengths < 3] <- 0
inverse.rle(subs)
}
Then split your data.frame by chr and then apply the function to all columns that you want to change (in this case columns 3 to 6):
df[,3:6] <- do.call("rbind", lapply(split(df[,3:6], df$chr), function(x) apply(x, 2, replace_f)))
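Notice that we combine the results with rbind before replacing the original data. Because split() returns the groups ordered by chr, this assumes the rows of df are already sorted by chr, as they are here. This will give you the desired result: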
chr leftPos Sample1 X.DD X3_samples MyStuff
1 1 324 -1 0 0 0
2 1 4565 -1 0 0 0
3 1 6887 -1 0 0 0
4 1 12098 0 0 0 0
5 2 12 0 0 0 0
6 2 43 0 0 0 0
7 5 1 0 1 0 0
8 5 43 0 1 0 0
9 5 6554 0 1 0 0
10 5 7654 0 0 0 0
11 5 8765 0 0 0 0
12 5 9833 0 0 0 0
13 6 12 0 0 0 0
14 6 43 0 0 0 0
15 6 56 1 0 0 0
16 6 79 1 0 -1 0
17 6 767 1 0 -1 0
18 6 3233 1 0 -1 0
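To see what the rle step does, here is a minimal sketch that runs replace_f on one column of one chromosome (the X.DD values for chr 5 from the question) and then applies it per chromosome with ave(). The ave() variant is a swapped-in alternative to the split/rbind line, not part of the original answer; it returns values in the original row order, so it does not depend on how df is sorted. The small data frame toy is a hypothetical subset of the question's data.

```r
replace_f <- function(x) {
  runs <- rle(x)                      # run values and run lengths
  runs$values[runs$lengths < 3] <- 0  # zero out runs shorter than 3
  inverse.rle(runs)                   # expand back to a full-length vector
}

# X.DD for chr 5 in the question: a run of three 1's, a 0, then two 1's
x_dd_chr5 <- c(1, 1, 1, 0, 1, 1)
replace_f(x_dd_chr5)
# 1 1 1 0 0 0

# Hypothetical subset of the question's data (chr 1 and chr 2 only)
toy <- data.frame(
  chr     = c(1, 1, 1, 1, 2, 2),
  leftPos = c(324, 4565, 6887, 12098, 12, 43),
  Sample1 = c(-1, -1, -1, 1, -1, -1),
  X.DD    = c(1, 0, 1, -1, 1, 1)
)

# Apply replace_f within each chr, column by column, keeping row order
toy[3:4] <- lapply(toy[3:4], function(col) ave(col, toy$chr, FUN = replace_f))
toy
```

For the chr 1 and chr 2 rows this reproduces the expected output in the question: only the first three Sample1 values stay at -1.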

A data.table solution using rleid would be:
require(data.table)
setDT(dat)
dat[,Sample1 := Sample1 * as.integer(.N>=3), by=.(chr, rleid(Sample1))]
This uses grouping by rleid(Sample1) together with data.table's helpful .N variable, which holds the number of rows in each group.
To do it for all columns, you could use the eval(parse(text=...)) syntax as follows:
for(i in names(dat)[3:6]){
by_string = paste0("list(chr, rleid(", i, "))")
def_string = paste0(i, "* as.integer(.N>=3)")
dat[,(i) := eval(parse(text=def_string)), by=eval(parse(text=by_string))]
}
So it results in:
> dat[]
chr leftPos Sample1 X.DD X3_samples MyStuff
1: 1 324 -1 0 0 0
2: 1 4565 -1 0 0 0
3: 1 6887 -1 0 0 0
4: 1 12098 0 0 0 0
5: 2 12 0 0 0 0
6: 2 43 0 0 0 0
7: 5 1 0 1 0 0
8: 5 43 0 1 0 0
9: 5 6554 0 1 0 0
10: 5 7654 0 0 0 0
11: 5 8765 0 0 0 0
12: 5 9833 0 0 0 0
13: 6 12 0 0 0 0
14: 6 43 0 0 0 0
15: 6 56 1 0 0 0
16: 6 79 1 0 -1 0
17: 6 767 1 0 -1 0
18: 6 3233 1 0 -1 0
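The "multiply by as.integer(.N>=3)" idea can also be sketched in base R, which may help in reading the data.table code. The function name keep_long_runs and its min_len argument are made up for this illustration. It expands each run length to one value per row (the analogue of .N within an rleid group) and multiplies the column by the resulting 0/1 mask.

```r
keep_long_runs <- function(x, min_len = 3) {
  r <- rle(x)
  # Length of the run each element belongs to, one value per element
  run_len <- rep(r$lengths, r$lengths)
  # Elements in short runs are multiplied by 0, the rest by 1
  x * as.integer(run_len >= min_len)
}

# Sample1 for chr 6 in the question
keep_long_runs(c(1, 0, 1, 1, 1, 1))
# 0 0 1 1 1 1
```

Like the data.table version, this would still need to be applied separately within each chr.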

Related

create a loop to get samples in grouped data which meet a condition

I have a dataframe where the data are grouped by ID. I need to know how many cells make up 10% of each group so that I can draw a sample of that size, but the sample should only select cells whose EP is 1.
I've tried a nested for loop: the inner loop works out how many cells make up 10% of each group, and the outer loop samples that number of cells meeting the condition EP==1:
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
x
ID EP
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 2 0
16 2 1
17 2 0
18 2 1
19 2 0
20 2 1
for(j in 1:1000){
for (i in 1:nrow(x)){
d <- x[x$ID==i,]
npix <- 10*nrow(d)/100
}
r <- sample(d[d$EP==1,],npix)
print(r)
}
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
.
.
.
until 1000
I would like to get this dataframe, where each sample is a new column in x and each sampled cell is marked with a 1:
ID EP s1 s2....s1000
1 1 0 0 0 ....
2 1 1 0 1
3 1 0 0 0
4 1 1 0 0
5 1 0 0 0
6 1 1 0 0
7 1 0 0 0
8 1 1 0 0
9 1 0 0 0
10 1 1 1 0
11 2 0 0 0
12 2 1 0 0
13 2 0 0 0
14 2 1 0 1
15 2 0 0 0
16 2 1 0 0
17 2 0 0 0
18 2 1 1 0
19 2 0 0 0
20 2 1 0 0
Note that the 1's in s1 and s2 are the sampled cells; they correspond to 10% of the cells in each group (1, 2) and meet the condition EP==1.
You can try:
set.seed(1231)
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
library(tidyverse)
x %>%
group_by(ID) %>%
mutate(index= ifelse(EP==1, 1:n(),0)) %>%
mutate(s1 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0)) %>%
mutate(s2 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0))
# A tibble: 20 x 5
# Groups: ID [2]
ID EP index s1 s2
<int> <int> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 1 1 2 0 0
3 1 0 0 0 0
4 1 1 4 0 0
5 1 0 0 0 0
6 1 1 6 1 1
7 1 0 0 0 0
8 1 1 8 0 0
9 1 0 0 0 0
10 1 1 10 0 0
11 2 0 0 0 0
12 2 1 2 0 0
13 2 0 0 0 0
14 2 1 4 0 1
15 2 0 0 0 0
16 2 1 6 0 0
17 2 0 0 0 0
18 2 1 8 0 0
19 2 0 0 0 0
20 2 1 10 1 0
We can write a function that, for each ID, marks 10% of the group's rows with a 1, chosen among the rows where EP == 1.
library(dplyr)
rep_func <- function() {
x %>%
group_by(ID) %>%
mutate(s1 = 0,
s1 = replace(s1, sample(which(EP == 1), floor(0.1 * n())), 1)) %>%
pull(s1)
}
Then use replicate to repeat it n times:
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, rep_func())
x
# ID EP s1 s2 s3 s4 s5
#1 1 0 0 0 0 0 0
#2 1 1 0 0 0 0 0
#3 1 0 0 0 0 0 0
#4 1 1 0 0 0 0 0
#5 1 0 0 0 0 0 0
#6 1 1 1 0 0 1 0
#7 1 0 0 0 0 0 0
#8 1 1 0 1 0 0 0
#9 1 0 0 0 0 0 0
#10 1 1 0 0 1 0 1
#11 2 0 0 0 0 0 0
#12 2 1 0 0 1 0 0
#13 2 0 0 0 0 0 0
#14 2 1 1 1 0 0 0
#15 2 0 0 0 0 0 0
#16 2 1 0 0 0 0 1
#17 2 0 0 0 0 0 0
#18 2 1 0 0 0 1 0
#19 2 0 0 0 0 0 0
#20 2 1 0 0 0 0 0
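The same idea can be sketched without dplyr. The helper name sample_flags and its frac argument are invented for this example. It picks positions with sample.int() rather than calling sample() on the eligible row numbers directly, because sample(v, k) treats a length-one numeric v as 1:v, which would silently pick the wrong rows in a group with a single eligible cell.

```r
x <- data.frame(ID = rep(1:2, each = 10), EP = rep(0:1, times = 10))

sample_flags <- function(x, frac = 0.1) {
  flag <- integer(nrow(x))
  # Loop over the row numbers belonging to each ID
  for (rows in split(seq_len(nrow(x)), x$ID)) {
    eligible <- rows[x$EP[rows] == 1]
    # 10% of the group size, capped at the number of eligible rows
    k <- min(floor(frac * length(rows)), length(eligible))
    if (k > 0) {
      picked <- eligible[sample.int(length(eligible), k)]
      flag[picked] <- 1L
    }
  }
  flag
}

set.seed(1)
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, sample_flags(x))
x
```

Each s column then has exactly one 1 per ID (10% of 10 rows), always on a row where EP is 1.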

Merge/match two data frames

I would like to merge two data frames, y$genes and symbol_annotations, by the row names of y and the second column, "hgnc_symbol", of symbol_annotations, and create a column labeled "Symbol" (y$genes$Symbol) listing all of the matches. If there is no match between "hgnc_symbol" and the row name, I would like NA to populate the cell instead of leaving it empty. I keep getting an error because the two data frames don't have the same dimensions and contain NAs, and I'm not sure how to correct it.
>read.counts <- read.table("gene_counts.txt", header=TRUE)
>row.names(read.counts) <- read.counts$Geneid
>treatment <- factor(treatment)
> head(treatment)
[1] T0 IL2 IL2.ZA IL2.OKT3 IL2.OKT3.ZA T0
Levels: T0 IL2 IL2.OKT3 IL2.OKT3.ZA IL2.ZA
>y <- DGEList(read.counts, group=treatment, genes=read.counts)
>head(y$genes)
SM01 SM02 SM03 SM04 SM05 SM06 SM07 SM08 SM09 SM10 SM11 SM12 SM13 SM14 SM15 SM16 SM17 SM18 SM19
ENSG00000223972 0 1 1 1 0 0 1 0 0 3 0 0 1 2 0 0 0 0 1
ENSG00000227232 33 31 13 15 20 43 36 32 43 43 61 42 92 73 80 64 33 25 28
ENSG00000278267 1 0 1 0 0 5 3 1 1 2 1 0 2 4 6 0 2 2 1
ENSG00000243485 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
ENSG00000237613 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSG00000268020 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SM20 SM21 SM22 SM23 SM24 SM25 SM26 SM27 SM28 SM29 SM30
ENSG00000223972 0 0 0 0 1 0 0 0 0 0 0
ENSG00000227232 15 60 13 29 22 28 87 42 61 67 74
ENSG00000278267 2 3 5 1 3 4 4 3 2 4 3
ENSG00000243485 0 0 0 0 0 1 0 0 0 0 1
ENSG00000237613 0 0 0 0 0 0 0 0 0 0 0
ENSG00000268020 0 0 0 0 0 0 0 0 0 0 0
>head(symbol_annotations, n=10)
ensembl_gene_id hgnc_symbol
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000209082 MT-TL1
6 ENSG00000198888 MT-ND1
7 ENSG00000210100 MT-TI
8 ENSG00000223795 <NA>
9 ENSG00000210107 MT-TQ
10 ENSG00000210112 MT-TM
>dim(symbol_annotations)
[1] 58069 2
>dim(y$genes)
[1] 58051 30
>y$genes$Symbol <- merge((rownames(y)), symbol_annotations[,c(2)])
Error in if (n > 0) c(NA_integer_, -n) else integer() :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx : NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d))) :
NAs introduced by coercion to integer range
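The error most likely arises because merge() is given two plain vectors with no column name in common, so it attempts a cross join of 58051 x 58069 rows, which overflows the integer range. A lookup with match() avoids the join altogether. The sketch below uses small stand-in data frames (genes for y$genes, with made-up symbols such as "SYM_B") and assumes the row names should be matched against ensembl_gene_id, since the row names shown in the question are Ensembl IDs rather than HGNC symbols.

```r
# Stand-in for y$genes: row names are Ensembl gene IDs
genes <- data.frame(
  SM01 = c(0, 33, 1),
  row.names = c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267")
)

# Stand-in for symbol_annotations; the symbols are placeholders
symbol_annotations <- data.frame(
  ensembl_gene_id = c("ENSG00000227232", "ENSG00000223972", "ENSG00000210049"),
  hgnc_symbol     = c("SYM_B", NA, "SYM_C"),
  stringsAsFactors = FALSE
)

# Position of each row name in the annotation table (NA if absent)
idx <- match(rownames(genes), symbol_annotations$ensembl_gene_id)

# Indexing with NA yields NA, so unmatched genes get NA automatically
genes$Symbol <- symbol_annotations$hgnc_symbol[idx]
genes
```

The result has one Symbol per row of genes, in the original row order, regardless of how many rows the annotation table has.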

adding data frame of counts to template data frame in R

I have data.frames of counts such as:
a <- data.frame(id=1:10,
"1"=c(rep(1,3),rep(0,7)),
"3"=c(rep(0,4),rep(1,6)))
names(a)[2:3] <- c("1","3")
a
> a
id 1 3
1 1 1 0
2 2 1 0
3 3 1 0
4 4 0 0
5 5 0 1
6 6 0 1
7 7 0 1
8 8 0 1
9 9 0 1
10 10 0 1
and a template data.frame such as
m <- data.frame(id=1:10,
"1"= rep(0,10),
"2"= rep(0,10),
"3"= rep(0,10),
"4"= rep(0,10))
names(m)[-1] <- 1:4
m
> m
id 1 2 3 4
1 1 0 0 0 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 0 0 0 0
5 5 0 0 0 0
6 6 0 0 0 0
7 7 0 0 0 0
8 8 0 0 0 0
9 9 0 0 0 0
10 10 0 0 0 0
and I want to add the values of a into the template m in the appropriate columns, leaving the rest as 0.
This is working, but I would like to know if there is a more elegant way, perhaps using plyr or data.table:
provi <- rbind.fill(a,m)
provi[is.na(provi)] <- 0
mnew <- aggregate(provi[,-1],by=list(provi$id),FUN=sum)
names(mnew)[1] <- "id"
mnew <- mnew[c(1,order(names(mnew)[-1])+1)]
mnew
> mnew
id 1 2 3 4
1 1 1 0 0 0
2 2 1 0 0 0
3 3 1 0 0 0
4 4 0 0 0 0
5 5 0 0 1 0
6 6 0 0 1 0
7 7 0 0 1 0
8 8 0 0 1 0
9 9 0 0 1 0
10 10 0 0 1 0
I guess the most concise option would be:
m[names(a)] <- a
Or we can match the column names ('i1'), use that with max.col to build the column index, and cbind it with the row index ('i2'); a similar step creates 'i3'. We then replace the values in 'm' at the positions in 'i2' with the values of 'a' at the positions in 'i3'.
i1 <- match(names(a)[-1], names(m)[-1])
i2 <- cbind(m$id, i1[max.col(a[-1], 'first')]+1L)
i3 <- cbind(a$id, max.col(a[-1], 'first')+1L)
m[i2] <- a[i3]
m
# id 1 2 3 4
#1 1 1 0 0 0
#2 2 1 0 0 0
#3 3 1 0 0 0
#4 4 0 0 0 0
#5 5 0 0 1 0
#6 6 0 0 1 0
#7 7 0 0 1 0
#8 8 0 0 1 0
#9 9 0 0 1 0
#10 10 0 0 1 0
A data.table option would be melt/dcast
library(data.table)
dcast(melt(setDT(a), id.var='id')[,
variable:= factor(variable, levels=1:4)],
id~variable, value.var='value', drop=FALSE, fill=0)
# id 1 2 3 4
# 1: 1 1 0 0 0
# 2: 2 1 0 0 0
# 3: 3 1 0 0 0
# 4: 4 0 0 0 0
# 5: 5 0 0 1 0
# 6: 6 0 0 1 0
# 7: 7 0 0 1 0
# 8: 8 0 0 1 0
# 9: 9 0 0 1 0
#10: 10 0 0 1 0
A similar dplyr/tidyr option would be
library(dplyr)
library(tidyr)
gather(a, Var, Val, -id) %>%
mutate(Var=factor(Var, levels=1:4)) %>%
spread(Var, Val, drop=FALSE, fill=0)
You could use merge, too:
res <- suppressWarnings(merge(a, m, by="id", suffixes = c("", "")))
(res[, which(!duplicated(names(res)))][, names(m)])
# id 1 2 3 4
# 1 1 1 0 0 0
# 2 2 1 0 0 0
# 3 3 1 0 0 0
# 4 4 0 0 0 0
# 5 5 0 0 1 0
# 6 6 0 0 1 0
# 7 7 0 0 1 0
# 8 8 0 0 1 0
# 9 9 0 0 1 0
# 10 10 0 0 1 0
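The one-line assignment m[names(a)] <- a relies on a and m having their rows in the same order. A minimal sketch of a variant that looks the rows up by id is shown below; here a is deliberately built in reverse id order to illustrate the point, which is an assumption of this example and not how a appears in the question.

```r
# Counts with ids in reverse order (same values per id as in the question)
a <- data.frame(id = 10:1,
                `1` = c(rep(0, 7), rep(1, 3)),
                `3` = c(rep(1, 6), rep(0, 4)),
                check.names = FALSE)

# Template: ids 1..10 and four count columns filled with 0
m <- data.frame(id = 1:10)
m[as.character(1:4)] <- 0

# Write a's count columns into the rows of m that have the same id
m[match(a$id, m$id), names(a)[-1]] <- a[-1]
m
```

Ids 1 to 3 end up with a 1 in column "1" and ids 5 to 10 with a 1 in column "3", as in the desired output.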

R: Algorithm for setting missing values faster

I have a problem with setting missing values in a data frame. The first 3 columns hold the product ID, the store ID, and the week number. Columns 4 to 31 (28 columns) correspond to the last 28 days of sales for the item (the last 7 days are the days of the current week). I want to fill in the missing values by comparing two records that have the same first and second columns but different week numbers.
corrections <- function(x, y) {
  # Changes vector y if the difference between weeks is not greater than 3
  if (x[1] == y[1] && x[2] == y[2] && -(x[3] - y[3]) <= 3) {
    t <- y[3] - x[3]
    t <- as.integer(t)
    a <- x[(4 + (t * 7)):31]
    b <- y[4:(31 - (t * 7))]
    c <- a - b
    for (i in 1:(28 - (t * 7))) {
      if (is.na(c[i])) {
        if (!(is.na(a[i]) && is.na(b[i]))) {
          if (is.na(b[i]))
            b[i] <- a[i]
          else
            a[i] <- b[i]
        }
      }
    }
    y[4:(31 - t * 7)] <- b
  }
  return(y)
}
for (i in 2:nrow(salesTraining)) {
  salesTraining[i, ] <- corrections(salesTraining[i - 1, ], salesTraining[i, ])
}
The loop takes 1 minute per 1000 records, so with 212,000 records it will take about 3.5 hours (if the complexity is linear). Is there an error, or can I do it better and faster?
Example of data frame:
productID storeID weekInData dailySales1 dailySales2 dailySales3 dailySales4 dailySales5
1 1 1 37 0 0 0 0 0
2 1 1 38 0 0 0 0 0
3 1 1 39 0 0 0 0 0
4 1 1 40 0 NA 0 NA 2
5 1 1 41 NA 0 NA 0 0
6 1 1 42 0 0 0 NA 0
7 1 1 43 0 0 NA 0 NA
8 1 1 44 0 2 1 NA 0
9 1 1 45 NA 0 0 NA 0
10 1 1 46 NA 0 0 NA NA
dailySales6 dailySales7 dailySales8 dailySales9 dailySales10 dailySales11 dailySales12 dailySales13
1 NA NA 0 NA 0 0 0 0
2 0 NA NA 0 0 0 0 0
3 0 NA 0 0 0 NA 2 NA
4 0 NA 0 NA 0 NA 0 0
5 0 0 NA 0 0 0 0 0
6 NA 0 NA 0 0 0 0 0
7 0 0 0 2 NA 0 0 0
8 0 NA 0 NA 0 NA 0 1
9 1 0 0 0 0 0 1 0
10 0 0 0 NA 0 NA 0 0
dailySales14 dailySales15 dailySales16 dailySales17 dailySales18 dailySales19 dailySales20
1 0 0 0 0 0 0 0
2 0 0 0 0 5 2 NA
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 2 1 0 0 NA
7 0 0 0 0 0 0 1
8 0 0 0 0 0 1 0
9 0 0 -1 0 0 0 0
10 0 0 0 0 0 0 0
dailySales21 dailySales22 dailySales23 dailySales24 dailySales25 dailySales26 dailySales27
1 NA 0 0 0 5 2 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 NA 1 0 0 0
6 0 0 0 0 0 0 1
7 0 0 0 0 0 1 0
8 0 0 NA 0 0 0 0
9 NA 0 0 0 NA 0 0
10 0 1 0 0 0 0 0
dailySales28 daysStoreClosed_series daysStoreClosed_target dayOfMonth dayOfYear weekOfYear month
1 0 5 2 23 356 51 12
2 0 6 2 30 363 52 12
3 0 6 1 6 5 1 1
4 0 6 1 13 12 2 1
5 0 6 1 19 18 3 1
6 0 5 1 26 25 4 1
7 0 4 1 2 32 5 2
8 0 4 1 9 39 6 2
9 0 4 1 16 46 7 2
10 0 4 1 23 53 8 2
quarter
1 4
2 4
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
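One likely cost in the loop above is that every read and write of salesTraining[i, ] goes through data-frame indexing, which is much slower than indexing a numeric matrix. The sketch below is one possible rewrite under stated assumptions, not a measured fix: the function name fill_from_previous is invented, the input is assumed to be a numeric matrix (for example as.matrix(salesTraining) if all columns are numeric), columns 1 to 3 are assumed to contain no NA, and rows are assumed sorted by product, store and week, so a negative week difference is simply skipped. The inner loop is replaced by one vectorized assignment, since only the NA entries of the current row ever change.

```r
fill_from_previous <- function(sales) {
  for (i in seq_len(nrow(sales))[-1]) {
    prev <- sales[i - 1, ]
    t <- sales[i, 3] - prev[3]               # week difference
    same_item <- prev[1] == sales[i, 1] && prev[2] == sales[i, 2]
    if (same_item && t >= 0 && t <= 3) {
      cols    <- 4:(31 - 7 * t)              # days of row i that overlap
      shifted <- prev[(4 + 7 * t):31]        # the same days in row i - 1
      miss    <- is.na(sales[i, cols])
      sales[i, cols[miss]] <- shifted[miss]  # fill only the missing days
    }
  }
  sales
}

# Toy data: 3 id columns plus 28 daily columns
sales <- rbind(c(1, 1, 1, 1:28),             # product 1, week 1
               c(1, 1, 2, rep(NA, 28)),      # product 1, week 2, all missing
               c(2, 1, 3, rep(NA, 28)))      # different product
filled <- fill_from_previous(sales)
filled[2, 4:31]
```

In the toy data, the second row receives days 8 to 28 of the first row in its first 21 daily columns, its last 7 days stay NA, and the third row is left untouched because it belongs to another product.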

How to sum leading diagonal of table in R

I have a table created using the table() command in R:
y
x 0 1 2 3 4 5 6 7 8 9
0 23 0 0 0 0 1 0 0 0 0
1 0 23 1 0 1 0 1 2 0 2
2 1 1 28 0 0 0 1 0 2 2
3 0 1 0 24 0 1 0 0 0 1
4 1 1 0 0 34 0 3 0 0 0
5 0 0 0 0 0 33 0 0 0 0
6 0 0 0 0 0 2 32 0 0 0
7 0 1 0 1 0 0 0 36 0 1
8 1 1 1 1 0 0 0 1 20 1
9 1 3 0 1 0 1 0 1 0 24
This table shows the results of a classification, and I want to sum the leading diagonal of it (the diagonal with the large numbers - like 23, 23, 28 etc). Is there a sensible/easy way to do this in R?
How about sum(diag(tbl)), where tbl is your table?
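A minimal sketch on a small made-up table shows the idea; the vectors actual and predicted are invented for this example. It assumes both classifications share the same levels in the same order, so that the table is square and the diagonal holds the matching cases.

```r
# Hypothetical true and predicted classes with a shared set of levels
actual    <- factor(c(0, 0, 1, 1, 1, 2), levels = 0:2)
predicted <- factor(c(0, 1, 1, 1, 2, 2), levels = 0:2)

tbl <- table(x = actual, y = predicted)

sum(diag(tbl))             # number of cases on the leading diagonal
# 4
sum(diag(tbl)) / sum(tbl)  # the same count as a proportion of all cases
```

Setting the factor levels explicitly keeps the table square even if one class never appears in one of the two vectors.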
