How to remove columns according to simultaneous conditions - r

I need to delete the columns (from the second onwards) that have any value different from 0 in the rows whose first column has specific values (e.g., sp.3 and sp.5).
My dataset is large, but here is a small sample of the data.
SP id2324 id8283 id3912 id3912 id1231...
sp.1 0 2 4 1 0
sp.2 12 10 2 3 15
sp.3 0 0 23 0 4
sp.4 2 2 11 19 0
sp.5 0 0 0 0 3
sp.6 3 1 7 3 0
sp.7 0 14 1 0 12
sp.8 1 0 2 6 6
In this small example I would expect the id3912 and id1231 variables to disappear.

We can first select the rows where SP is c("sp.3", "sp.5"), then keep only the columns whose values in those rows are all 0 (that is, drop every column with at least one nonzero value there).
cbind(df[1], df[-1][colSums(df[df$SP %in% c("sp.3", "sp.5"), -1] != 0) == 0])
# SP id2324 id8283 id3912.1
#1 sp.1 0 2 1
#2 sp.2 12 10 3
#3 sp.3 0 0 0
#4 sp.4 2 2 19
#5 sp.5 0 0 0
#6 sp.6 3 1 3
#7 sp.7 0 14 0
#8 sp.8 1 0 6
Breaking it down step-by-step
Select rows where SP is c("sp.3", "sp.5")
df[df$SP %in% c("sp.3", "sp.5"), -1]
# id2324 id8283 id3912 id3912.1 id1231
#3 0 0 23 0 4
#5 0 0 0 0 3
Find cells where value is not equal to 0
df[df$SP %in% c("sp.3", "sp.5"), -1] != 0
# id2324 id8283 id3912 id3912.1 id1231
#3 FALSE FALSE TRUE FALSE TRUE
#5 FALSE FALSE FALSE FALSE TRUE
Find columns where all values are 0
colSums(df[df$SP %in% c("sp.3", "sp.5"), -1] != 0) == 0
# id2324 id8283 id3912 id3912.1 id1231
# TRUE TRUE FALSE TRUE FALSE
We then select the columns which are TRUE and cbind them with 1st column.
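For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data and wraps the same logic in a small helper. The helper name drop_nonzero_cols and its arguments are hypothetical, and the duplicated id3912 header is assumed to have been read in as id3912.1, as in the output above.

```r
# Sample data from the question; the second id3912 column is named id3912.1
df <- data.frame(
  SP       = paste0("sp.", 1:8),
  id2324   = c(0, 12, 0, 2, 0, 3, 0, 1),
  id8283   = c(2, 10, 0, 2, 0, 1, 14, 0),
  id3912   = c(4, 2, 23, 11, 0, 7, 1, 2),
  id3912.1 = c(1, 3, 0, 19, 0, 3, 0, 6),
  id1231   = c(0, 15, 4, 0, 3, 0, 12, 6)
)

# Hypothetical helper: keep the key column plus every value column that is
# all zero in the rows whose key is in `targets`
drop_nonzero_cols <- function(data, key, targets) {
  rows <- data[[key]] %in% targets
  value_cols <- setdiff(names(data), key)
  all_zero <- colSums(data[rows, value_cols, drop = FALSE] != 0) == 0
  data[c(key, value_cols[all_zero])]
}

res <- drop_nonzero_cols(df, "SP", c("sp.3", "sp.5"))
names(res)
# [1] "SP"       "id2324"   "id8283"   "id3912.1"
```

Selecting by column name rather than by position means the key column does not have to be the first one.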

Related

Finding variance of columns from 2 dataframes

I have 2 dataframes
DataFrame A and Dataframe B.
A <- data.frame(a=c(1,2,3,4,5),b=c(2,4,6,8,10),c=c(3,6,9,12,15),x=c(4,8,12,16,20),y=c(5,10,15,20,25))
B <- data.frame(a=c(1,2,3,4,5),b=c(2,4,6,8,10),c=c(3,6,9,12,15),x=c(4,8,12,16,20),y=c(5,10,15,20,25))
A
a b c x y
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
B
a b c x y
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
Expected Output:
C
a b c x y
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
Both have a key column which is alpha-numeric.
Both dataframes have 260 columns in all out of which 250 are float.
Is there an easier way to compute the variance of each of the 250 columns and store the variance in another dataframe?
I think you want the difference between the respective columns of the two dataframes
temp = names(A)
data.frame(A["a"], do.call(cbind, lapply(temp[!temp %in% "a"], function(x) A[x] - B[x])))
# a b c x y
#1 1 0 0 0 0
#2 2 0 0 0 0
#3 3 0 0 0 0
#4 4 0 0 0 0
#5 5 0 0 0 0
We can use Map/mapply to find the difference between the corresponding columns of 'A' and 'B'
cbind(A[1], mapply(`-`, A[-1], B[names(A)[-1]]))
# a b c x y
#1 1 0 0 0 0
#2 2 0 0 0 0
#3 3 0 0 0 0
#4 4 0 0 0 0
#5 5 0 0 0 0
Or just
cbind(A[1], A[-1] - B[-1])
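Note that the one-liners above subtract by position, so they assume both dataframes list the same keys in the same row order. If that is not guaranteed, a possible sketch is to align B to A on the key column first. The shuffled B below is hypothetical, and the key column is assumed to be "a" as in the example.

```r
A <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 4, 6, 8, 10), c = c(3, 6, 9, 12, 15),
                x = c(4, 8, 12, 16, 20), y = c(5, 10, 15, 20, 25))
# Hypothetical: B holds the same rows as A, but in a different order
B <- A[c(3, 1, 5, 2, 4), ]

# Numeric columns other than the key column "a"
value_cols <- setdiff(names(A)[sapply(A, is.numeric)], "a")

# Reorder B so that its keys line up with A's keys
B_aligned <- B[match(A$a, B$a), ]

# Column-wise difference, with the key column kept in front
C <- cbind(A["a"], A[value_cols] - B_aligned[value_cols])
C
```

Restricting value_cols to numeric columns also skips any non-numeric columns among the 260.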

How to find Z score of each value in row of Table?

I have a table in R. How do I turn every value that is greater than or equal to a certain number into a 1 and the rest of the values into a 0? For example, if my special number was 4, then every value that is 4 or above in my table would be 1, and the rest would be zero. So this table:
a b c d e
Bill 1 2 3 4 5
Susan 4 1 5 4 2
Malcolm 4 5 6 2 1
Reese 0 0 2 3 8
Would Turn Into
a b c d e
Bill 0 0 0 1 1
Susan 1 0 1 1 0
Malcolm 1 1 1 0 0
Reese 0 0 0 0 1
We can create a logical matrix of TRUE/FALSE and convert to binary format by using +
+(df1>=4)
# a b c d e
#Bill 0 0 0 1 1
#Susan 1 0 1 1 0
#Malcolm 1 1 1 0 0
#Reese 0 0 0 0 1
Just to be clear, when we do the >=, it creates a logical matrix of TRUE/FALSE
df1 >=4
# a b c d e
#Bill FALSE FALSE FALSE TRUE TRUE
#Susan TRUE FALSE TRUE TRUE FALSE
#Malcolm TRUE TRUE TRUE FALSE FALSE
#Reese FALSE FALSE FALSE FALSE TRUE
But the OP wanted this converted to 1/0. There are many ways to do this by coercing TRUE/FALSE to binary form. One option is
(df1>=4) + 0L
Or
(df1>=4)*1L
Or simply putting a + will do the coercion
+(df1>=4)
According to ?TRUE
Logical vectors are coerced to integer vectors in contexts where a
numerical value is required, with ‘TRUE’ being mapped to ‘1L’,
‘FALSE’ to ‘0L’ and ‘NA’ to ‘NA_integer_’.
We could also wrap with as.integer, but the output will be a vector
as.integer(df1>=4)
#[1] 0 1 1 0 0 0 1 0 0 1 1 0 1 1 0 0 1 0 0 1
If we assign the output back to the original dataset, we can change that dataset and keep its structure
df1[] <- as.integer(df1>=4)
df1
# a b c d e
#Bill 0 0 0 1 1
#Susan 1 0 1 1 0
#Malcolm 1 1 1 0 0
#Reese 0 0 0 0 1
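As a self-contained sketch of the last variant, the snippet below rebuilds the example table and puts the threshold into a small function. The function name binarize is hypothetical; it works column by column, so the result stays a data frame with its row names intact.

```r
# Example table from the question
df1 <- data.frame(
  a = c(1, 4, 4, 0), b = c(2, 1, 5, 0), c = c(3, 5, 6, 2),
  d = c(4, 4, 2, 3), e = c(5, 2, 1, 8),
  row.names = c("Bill", "Susan", "Malcolm", "Reese")
)

# Hypothetical helper: 1 where the value is >= threshold, 0 otherwise
binarize <- function(data, threshold) {
  data[] <- lapply(data, function(col) as.integer(col >= threshold))
  data
}

out <- binarize(df1, 4)
out
```

Because the original df1 is not modified, the same table can be binarized again with a different threshold.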

Create counter of consecutive runs of a certain value

I have data where consecutive runs of zero are separated by runs of non-zero values. I want to create a counter for the runs of zero in the column 'SOG'.
For the first sequence of 0 in SOG, set the counter in column Stops to 1. For the second run of zeros, set 'Stops' to 2, and so on.
SOG Stops
--- -----
4 0
4 0
0 1
0 1
0 1
3 0
4 0
5 0
0 2
0 2
1 0
2 0
0 3
0 3
0 3
SOG <- c(4,4,0,0,0,3,4,5,0,0,1,2,0,0,0)
#run length encoding:
tmp <- rle(SOG)
#turn values into logicals
tmp$values <- tmp$values == 0
#cumulative sum of TRUE values
tmp$values[tmp$values] <- cumsum(tmp$values[tmp$values])
#inverse the run length encoding
inverse.rle(tmp)
#[1] 0 0 1 1 1 0 0 0 2 2 0 0 3 3 3
Try
df$stops<- with(df, cumsum(c(0, diff(!SOG))>0)*!SOG)
df$stops
# [1] 0 0 1 1 1 0 0 0 2 2 0 0 3 3 3
Using dplyr:
library(dplyr)
df <- df %>% mutate(Stops = ifelse(SOG == 0, yes = cumsum(c(0, diff(!SOG) > 0)), no = 0))
df$Stops
#[1] 0 1 1 1 0 0 0 2 2 0 0 3 3 3
EDIT: As an aside to those of us who are still beginners, many of the answers to this question make use of logicals (i.e. TRUE, FALSE). ! before a numeric variable like SOG tests whether the value is 0 and assigns TRUE if it is, and FALSE otherwise.
SOG
#[1] 4 0 0 0 3 4 5 0 0 1 2 0 0 0
!SOG
#[1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#[12] TRUE TRUE TRUE
diff() takes the difference between each value and the one before it. Note that the result has one less element than SOG, since the first element has no predecessor with which to compute a difference. When it comes to logicals, diff(!SOG) produces 1 for TRUE - FALSE, -1 for FALSE - TRUE, and 0 otherwise.
diff(SOG)
#[1] -4 0 0 3 1 1 -5 0 1 1 -2 0 0
diff(!SOG)
#[1] 1 0 0 -1 0 0 1 0 -1 0 1 0 0
So cumsum(diff(!SOG) > 0) just focuses on the TRUE - FALSE changes
cumsum(diff(!SOG) > 0)
#[1] 1 1 1 1 1 1 2 2 2 2 3 3 3
But since the vector of differences is one element shorter, we can prepend an element:
cumsum(c(0, diff(!SOG) > 0)) #Or cumsum( c(0, diff(!SOG)) > 0 )
#[1] 0 1 1 1 1 1 1 2 2 2 2 3 3 3
Then either "multiply" that vector by !SOG as in @akrun's answer or use the ifelse() command. If a particular element of SOG == 0, we use the corresponding element from cumsum(c(0, diff(!SOG) > 0)); if it isn't 0, we assign 0.
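One edge case worth checking: because the prepended element is a fixed 0, the diff-based counter does not count a run of zeros that sits at the very start of the series. The sketch below uses a hypothetical short series to show this, and one possible adjustment (seeding with !SOG[1] instead of 0), compared against the rle approach.

```r
# Hypothetical series that starts with a run of zeros
SOG <- c(0, 0, 3, 0)

# diff-based counter as written above: the leading run is not counted
cumsum(c(0, diff(!SOG) > 0)) * !SOG
# [1] 0 0 0 1

# Seeding the counter with !SOG[1] counts a leading run as run 1
stops_diff <- cumsum(c(!SOG[1], diff(!SOG) > 0)) * !SOG
stops_diff
# [1] 1 1 0 2

# rle-based counter for comparison
r <- rle(SOG == 0)
stops_rle <- rep(cumsum(r$values) * r$values, r$lengths)
stops_rle
# [1] 1 1 0 2
```

When the series starts with a nonzero value, !SOG[1] is FALSE and the seeded version gives the same result as prepending 0.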
A one-liner with rle would be -
df <- data.frame(SOG = c(4,4,0,0,0,3,4,5,0,0,1,2,0,0,0))
df <- transform(df, Stops = with(rle(SOG == 0), rep(cumsum(values) * values, lengths)))
df
# SOG Stops
#1 4 0
#2 4 0
#3 0 1
#4 0 1
#5 0 1
#6 3 0
#7 4 0
#8 5 0
#9 0 2
#10 0 2
#11 1 0
#12 2 0
#13 0 3
#14 0 3
#15 0 3

How fill a matrix with 1 and 0 when there is association

I've been trying to make a matrix from a data frame in R, without success. I have the following data frame
Order Object idrA idoA
8001505892 CHR56029398AB 1 1
8001506013 CHR56029398AB 1 2
8001507782 CHR56029398AB 1 3
8001508088 CHR56029398AB 1 4
8001508788 CHR56029398AB 1 5
8001509281 CHR56029398AB 1 6
8001509322 CHR56029398AB 1 7
8001509373 CHR56029398AB 1 8
8001505342 MMRMD343563 2 9
8001506699 MMRMD343563 2 10
8001507102 MMRMD343563 2 11
8001507193 MMRMD343563 2 12
8001508554 MMRMD343563 2 13
8001508654 MMRMD343563 2 14
8001509151 MMRMD343563 2 15
8001509707 MMRMD343563 2 16
8001509712 MMRMD343563 2 17
8001509977 MMRMD343563 2 18
8001510279 MMRMD343563 2 19
8001505342 MMRMD343565 3 9
8001507112 MMRMD343565 3 20
8001507193 MMRMD343565 3 12
8001508554 MMRMD343565 3 13
8001508654 MMRMD343565 3 14
8001509151 MMRMD343565 3 15
8001509707 MMRMD343565 3 16
8001509712 MMRMD343565 3 17
8001509977 MMRMD343565 3 18
8001510279 MMRMD343565 3 19
8001505920 MMRMN146319 4 21
8001506733 MMRMN146319 4 22
8001506929 MMRMN146319 4 23
8001507112 MMRMN146319 4 20
8001507196 MMRMN146319 4 24
8001510302 MMRMN146319 4 25
8001517272 MMRMN146319 4 26
8001506186 MMRMN146320 5 27
8001506733 MMRMN146320 5 22
8001506929 MMRMN146320 5 23
8001507112 MMRMN146320 5 20
8001508638 MMRMN146320 5 28
8001509526 MMRMN146320 5 29
8001505452 SSR664050011 6 30
8001508551 SSR664050011 6 31
8001509229 SSR664050011 6 32
8001510174 SSR664050011 6 33
Here idrA is the ID of each object and idoA is the ID of each purchase order. I want to make a matrix with number of rows = number of orders and number of columns = number of objects, filled with 1s and 0s: a 1 when the object was purchased in that order and a 0 when it wasn't.
Example: the order with idoA = 20 must have a vector like this: (0,0,1,1,1,0).
I hope I explained this clearly, thanks!
You can use xtabs to create a cross table:
Recreate your data:
dat <- read.table(header=TRUE, text="
Order Object idrA idoA
8001505892 CHR56029398AB 1 1
....
8001506013 CHR56029398AB 1 2
8001507782 CHR56029398AB 1 3
8001509229 SSR664050011 6 32
8001510174 SSR664050011 6 33")
Create the cross table:
xtabs(Order ~ idoA + idrA, dat) != 0
idrA
idoA 1 2 3 4 5 6
1 TRUE FALSE FALSE FALSE FALSE FALSE
2 TRUE FALSE FALSE FALSE FALSE FALSE
....
20 FALSE FALSE TRUE TRUE TRUE FALSE
....
32 FALSE FALSE FALSE FALSE FALSE TRUE
33 FALSE FALSE FALSE FALSE FALSE TRUE
To coerce the logical values to numeric values, you can use apply() and as.numeric, but then you have some work left to replace the row names:
apply(xtabs(Order ~ idoA + idrA, dat) != 0, 2, as.numeric)
Or, you can use a little trick by adding 0 to the values. This coerces the logical values to numeric:
(xtabs(Order ~ idoA + idrA, dat) != 0) + 0
idrA
idoA 1 2 3 4 5 6
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
....
Another option is to use acast from reshape2
library(reshape2)
res1 <- (acast(dat, idoA~idrA, value.var='Order', fill=0)!=0)+0
head(res1)
# 1 2 3 4 5 6
#1 1 0 0 0 0 0
#2 1 0 0 0 0 0
#3 1 0 0 0 0 0
#4 1 0 0 0 0 0
#5 1 0 0 0 0 0
#6 1 0 0 0 0 0
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
dat %>%
select(-Object) %>%
spread(idrA, Order, fill=0) %>%
mutate_each(funs((!!.)+0), select=-idoA) %>%
head()
#idoA 1 2 3 4 5 6
#1 1 1 0 0 0 0 0
#2 2 1 0 0 0 0 0
#3 3 1 0 0 0 0 0
#4 4 1 0 0 0 0 0
#5 5 1 0 0 0 0 0
#6 6 1 0 0 0 0 0
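The xtabs call above sums the Order numbers in each cell and then tests for nonzero. As an alternative sketch in base R, table() counts occurrences directly, so the Order column is not needed at all. The data below is only a handful of (idrA, idoA) pairs copied from the question, and the factor levels 1:6 are set explicitly so that every object gets a column even in this subset.

```r
# A few rows taken from the question's data (idrA = object ID, idoA = order ID)
dat <- data.frame(
  idrA = c(1, 1, 2, 3, 3, 4, 5, 6),
  idoA = c(1, 2, 9, 9, 20, 20, 20, 30)
)

# Count how often each (order, object) pair occurs
counts <- table(idoA = dat$idoA, idrA = factor(dat$idrA, levels = 1:6))

# Turn counts into a 0/1 integer matrix
m <- +(unclass(counts) > 0)

m["20", ]
# 1 2 3 4 5 6 
# 0 0 1 1 1 0 
```

Rows are looked up by the order ID as a character string, since the IDs become the row names of the matrix.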

assigning new values based on the location in the sequence

Working in R.
The data tracks changes in brain activity over time. Column "mark" contains information on when a particular treatment begins and ends. For example, the first condition (mark==1) begins in row 3 and ends in row 6. The second experimental condition (mark==2) starts in row 9 and ends in row 12. Another batch of treatment one is repeated between rows 15 and 18.
ob.id <- c(1:20)
mark <- c(0,0,1,0,0,1,0,0,2,0,0,2,0,0,1,0,0,1,0,0)
condition<-c(0,0,1,1,1,1,0,0,2,2,2,2,0,0,1, 1,1,1,0,0)
start <- data.frame(ob.id,mark)
result<-data.frame(ob.id,mark,condition)
print (start)
> print (start)
ob.id mark
1 1 0
2 2 0
3 3 1
4 4 0
5 5 0
6 6 1
7 7 0
8 8 0
9 9 2
10 10 0
11 11 0
12 12 2
13 13 0
14 14 0
15 15 1
16 16 0
17 17 0
18 18 1
19 19 0
20 20 0
I need to create a column that would have a dummy variable indicating the membership of an observation in corresponding experimental condition, like this:
> print(result)
ob.id mark condition
1 1 0 0
2 2 0 0
3 3 1 1
4 4 0 1
5 5 0 1
6 6 1 1
7 7 0 0
8 8 0 0
9 9 2 2
10 10 0 2
11 11 0 2
12 12 2 2
13 13 0 0
14 14 0 0
15 15 1 1
16 16 0 1
17 17 0 1
18 18 1 1
19 19 0 0
20 20 0 0
Thanks for your help!
This is a fun little problem. The trick I use below is to first calculate the rle of the mark vector, which makes the problem simpler: in the resulting values vector every run of zeros collapses to a single 0, so between two nonzero values there is always just one 0 that may or may not need to be replaced (depending on the surrounding values).
# example vector with some edge cases
v = c(0,0,1,0,0,0,1,2,0,0,2,0,0,1,0,0,0,0,1,2,0,2)
v.rle = rle(v)
v.rle
#Run Length Encoding
# lengths: int [1:14] 2 1 3 1 1 2 1 2 1 4 ...
# values : num [1:14] 0 1 0 1 2 0 2 0 1 0 ...
vals = rle(v)$values
# find the 0's that need to be replaced and replace by the previous value
idx = which(tail(head(vals,-1),-1) == 0 & (head(vals,-2) == tail(vals,-2)))
vals[idx + 1] <- vals[idx]
# finally go back to the original vector
v.rle$values = vals
inverse.rle(v.rle)
# [1] 0 0 1 1 1 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 2 2
Probably the least cumbersome thing to do is to put the above in a function and then apply that to your data.frame vector (as opposed to manipulating the vector explicitly).
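A minimal sketch of that function wrapper, applied to the mark vector from the question, could look like the following. The function name fill_between_marks is hypothetical, and the explicit indexing replaces the head/tail calls with the same logic. It inherits the assumption of the code above: a gap between two equal nonzero markers is always treated as being inside a treatment.

```r
# Hypothetical wrapper around the rle-based replacement shown above
fill_between_marks <- function(v) {
  r <- rle(v)
  vals <- r$values
  n <- length(vals)
  if (n >= 3) {
    # zeros whose left and right neighbours (in the rle values) are equal
    idx <- which(vals[2:(n - 1)] == 0 & vals[1:(n - 2)] == vals[3:n])
    vals[idx + 1] <- vals[idx]
  }
  r$values <- vals
  inverse.rle(r)
}

mark <- c(0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0)
start <- data.frame(ob.id = 1:20, mark = mark)
start$condition <- fill_between_marks(start$mark)
start$condition
# [1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0
```

The length guard only matters for very short inputs, where there is no interior run to inspect.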
Another approach, based on @SimonO101's observation, involves constructing the right groups from the starting data (run the by part separately, piece by piece, to see how it works):
library(data.table)
dt = data.table(start)
dt[, result := mark[1],
by = {tmp = rep(0, length(mark));
tmp[which(mark != 0)[c(F,T)]] = 1;
cumsum(mark != 0) - tmp}]
dt
# ob.id mark result
# 1: 1 0 0
# 2: 2 0 0
# 3: 3 1 1
# 4: 4 0 1
# 5: 5 0 1
# 6: 6 1 1
# 7: 7 0 0
# 8: 8 0 0
# 9: 9 2 2
#10: 10 0 2
#11: 11 0 2
#12: 12 2 2
#13: 13 0 0
#14: 14 0 0
#15: 15 1 1
#16: 16 0 1
#17: 17 0 1
#18: 18 1 1
#19: 19 0 0
#20: 20 0 0
The latter approach will probably be more flexible.
Here is one way I could think of doing it:
# Find where experiments stop and start
ind <- which( result$mark != 0 )
# [1] 3 6 9 12 15 18
# Make a matrix of the start and stop indices taking odd and even elements of the vector
idx <- cbind( head(ind , -1)[ 1:length(ind) %% 2 == 1 ] ,tail( ind , -1)[ 1:length(ind) %% 2 == 1 ] )
# [,1] [,2]
#[1,] 3 6
#[2,] 9 12
#[3,] 15 18
edit
I realised making the above index matrix would be easier with just taking odd and even elements:
idx <- cbind( ind[ 1:length(ind) %% 2 == 1 ] , ind[ 1:length(ind) %% 2 != 1 ] )
# Make vector of row indices to turn to 1's
ones <- as.vector( apply( idx , 1 , function(x) c( x[1]:x[2] ) ) )
# Make your new column and turn appropriate rows to 1
result$condition <- 0
result$condition[ ones ] <- 1
result
# ob.id mark condition
#1 1 0 0
#2 2 0 0
#3 3 1 1
#4 4 0 1
#5 5 0 1
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 2 1
#10 10 0 1
#11 11 0 1
#12 12 2 1
#13 13 0 0
#14 14 0 0
#15 15 1 1
#16 16 0 1
#17 17 0 1
#18 18 1 1
#19 19 0 0
#20 20 0 0
Edit
@eddi pointed out I needed to put in the value of the experiment, not just 1. So this is another strategy which uses, gasp(!), a for loop. This will only be really detrimental if you have thousands of experiments (remember to pre-allocate your results vector):
ind <- matrix( which( start$mark != 0 ) , ncol = 2 , byrow = TRUE )
ind <- cbind( ind , start$mark[ ind[ , 1 ] ] )
# [,1] [,2] [,3]
#[1,] 3 6 1
#[2,] 9 12 2
#[3,] 15 18 1
res <- integer( nrow( start ) )
for( i in 1:nrow(ind) ){
res[ ind[i,1]:ind[i,2] ] <- ind[i,3]
}
res
# [1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0
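If the loop is a concern, a vectorised sketch of the same idea is possible. It assumes, like the matrix construction above, that the nonzero markers always come in start/end pairs carrying the same value. The variable names are hypothetical.

```r
mark <- c(0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0)

is_marker <- mark != 0
k <- cumsum(is_marker)                 # number of markers seen so far

# An odd count means we are inside a block; the closing marker itself
# has an even count, so it is added back explicitly
inside <- (k %% 2 == 1) | is_marker

# Value of the most recent marker for every position (pmax avoids index 0)
marker_vals <- mark[is_marker]
res_vec <- ifelse(inside, marker_vals[pmax(k, 1)], 0)
res_vec
# [1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0
```

This avoids building the index matrix, so it does not depend on all blocks having the same length.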
