R: counting and recoding consecutive values in a matrix

I am working with a matrix containing a large number of NA. I would like to record the length of each sequence of NA in a new matrix.
The following example should make this clearer.
#Generating a random 5x5 population matrix with 15 NA
M=matrix(sample(1:9,25,T),5)
M[sample(1:length(M),15,F)]=NA
dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
M
Year1 Year2 Year3 Year4 Year5
City1 2 NA NA NA NA
City2 NA NA NA 6 8
City3 1 NA NA 6 NA
City4 NA 5 NA NA 1
City5 8 NA 1 NA 2
The desired output is the following; e.g., 4 4 4 4 denotes a run of 4 consecutive NAs.
Year1 Year2 Year3 Year4 Year5
City1 0 4 4 4 4
City2 3 3 3 0 0
City3 0 2 2 0 1
City4 1 0 2 2 0
City5 0 1 0 1 0
Do you have an idea of how I could go about that?

Not the most efficient code ever:
r1=c(1,1,NA,1,1)
r2=c(1,NA,NA,1,1)
r3=c(1,NA,NA,NA,1)
r4=c(NA,NA,1,1,1)
r5=c(1,1,1,NA,NA)
M=rbind(r1,r2,r3,r4,r5)
As @Pascal pointed out, your approach will convert the entire matrix to characters, so you can set the 1s to 0s instead and do this:
M[M == 1] <- 0
(xx <- t(apply(M, 1, function(x) {
s <- sum(is.na(x))
if (is.na(x[1])) x[is.na(x)] <- rep(4, s) else
if (is.na(tail(x, 1))) x[is.na(x)] <- rep(5, s) else
x[is.na(x)] <- s
x
})))
# [,1] [,2] [,3] [,4] [,5]
# r1 0 0 1 0 0
# r2 0 2 2 0 0
# r3 0 3 3 3 0
# r4 4 4 0 0 0
# r5 0 0 0 5 5
This is your desired output. If you don't believe me, convert the 0s back to 1s and assign the letters based on the integers
xx[xx > 0] <- letters[xx[xx > 0]]
xx[xx == '0'] <- 1
r1=c(1,1,"a",1,1)
r2=c(1,"b","b",1,1)
r3=c(1,"c","c","c",1)
r4=c("d","d",1,1,1)
r5=c(1,1,1,"e","e")
R=rbind(r1,r2,r3,r4,r5)
identical(R, xx)
# [1] TRUE

This is another basis for a function that would be applied over each row. I tried, but couldn't avoid a for loop:
x = c(1,NA,1,NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA)
#Find the Start and End of each sequence of NA's (Vectorized)
(start <- is.na(x) * c(T,!is.na(x[-length(x)])))
#> [1] 0 1 0 1 0 0 1 0 0 0 1 0 0 0
(end <- is.na(x) * c(!is.na(x[-1]),T))
#> [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
# The difference between the start and end of each sequence, plus 1, is the sequence length
wStart <- which(!!start)
wEnd <- which(!!end)
sequenceLength <- wEnd - wStart + 1
# replace each sequence of NAs with its class
for(i in seq_along(wStart))
x[`:`(wStart[i],wEnd[i])] <- letters[sequenceLength[i]]
x
#> [1] "1" "a" "1" "b" "b" "1" "c" "c" "c" "1" "d" "d" "d" "d"
as in:
(xx <- t(apply(M, 1, function(x) {
wStart <- which(!!(is.na(x) * c(T,!is.na(x[-length(x)]))))
wEnd <- which(!!(is.na(x) * c(!is.na(x[-1]),T)))
for(i in seq_along(wStart))
x[`:`(wStart[i],wEnd[i])] <- letters[wEnd[i] - wStart[i] + 1]
return(x)
})))
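For the numeric output asked for in the question (the length of each NA run, 0 elsewhere), a run-length encoding per row may be simpler than tracking starts and ends. The following is a minimal sketch, not taken from the answers above; the helper name na_run_lengths is made up for illustration, and the matrix reproduces the example from the question.

```r
# Hypothetical helper: length of the NA run each element belongs to, 0 for non-NA
na_run_lengths <- function(x) {
  r <- rle(is.na(x))                            # runs of TRUE (NA) / FALSE (value)
  run_value <- ifelse(r$values, r$lengths, 0L)  # keep the length only for NA runs
  rep(run_value, times = r$lengths)             # expand back to the original length
}

# The example matrix from the question
M <- rbind(
  c( 2, NA, NA, NA, NA),
  c(NA, NA, NA,  6,  8),
  c( 1, NA, NA,  6, NA),
  c(NA,  5, NA, NA,  1),
  c( 8, NA,  1, NA,  2)
)
dimnames(M) <- list(paste0("City", 1:5), paste0("Year", 1:5))

out <- t(apply(M, 1, na_run_lengths))  # apply() returns rows as columns, so transpose
dimnames(out) <- dimnames(M)
out
```

Because rle() works on the logical vector is.na(x), the actual values in the matrix do not matter, only where the NAs are.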

Related

How to construct dummy matrix with a list of data

The sample data is like this:
data1:
x1 x2 x3 x4
1 2 3 4
2 3 -1 -1
NA NA NA NA
0 0 0 0
1 -1 -1 -1
NA NA NA NA
4 3 -1 -1
0 0 0 0
Each row of data1 lists the groups that the datum belongs to; e.g. the first row (1, 2, 3, 4) means that the first datum belongs to groups x1, x2, x3 and x4.
-1 means that there is a blank.
0 means that the data does not belong to the corresponding group(i.e. if 0 is in x1, which means the datum does not belong to group 1.)
NA means missing data, where NA will randomly appear in the dataset.
Edit:
For example, in the 1st row,
[1,2,3,4] means the first, second, third, and fourth columns.
Therefore, the 1st row of data2 will be
[1,1,1,1].
In the 2nd row,
[2,3,-1,-1] means the second and third columns; -1 means that there is a blank.
Therefore, the 2nd row of data2 will be
[0,1,1,0].
My expected outcome is :
data2:
x1 x2 x3 x4
1 1 1 1
0 1 1 0
NA NA NA NA
0 0 0 0
1 0 0 0
NA NA NA NA
0 0 1 1
0 0 0 0
My code is as below:
for (i in 1:8){
if(data1$x1[i] %in% c(0)) {
data1[i,] = as.list(rep(0,4))
}
else if(is.na(data1$x1[i]))
{data1[i,] = as.list(rep(NA,4))
}}
for (i in which(data1$x1 %nin% c(NA,0))){
for (j in 1:4){
if (data1[i,j]<15 & data1[i,j]>0){
data1[i,j] = m
data1[i,m] = 1
}
}
}
#replace -1 to 0
data1[data1== -1] = 0
#This for loop creates dummy matrix
for (i in which(data1$x1%nin%c(NA,0))){
m = data1[i,]
m = m[m>0]
for(j in 1:length(m)){
data1[i,m] = 1
}
}
#replace the number that greater than zero to zero
data1[data1>1] = 0
I wonder if there is any function that can be used to replace the for loops. Please give me some suggestions, thank you!
I am still not entirely sure of logic, but this might be helpful. Using apply you can evaluate each row independently.
First, create a vector of NA. Then, where a value is greater than 0, set the element of the vector at that position (the column number) to 1.
Second, if the vector has at least one 1 value, then change the others missing to 0.
Third, if all elements are zero and no values are missing, then make all values in that row 0.
The end result is a matrix in this example.
t(apply(
data1,
MARGIN = 1,
\(x) {
vec <- rep(NA, length(x))
vec[x[x > 0]] <- 1
if (any(vec == 1, na.rm = T)) vec[is.na(vec)] <- 0
if (any(!is.na(x)) & all(x == 0)) vec <- rep(0, length(x))
vec
}
))
Output
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 0 1 1 0
[3,] NA NA NA NA
[4,] 0 0 0 0
[5,] 1 0 0 0
[6,] NA NA NA NA
[7,] 0 0 1 1
[8,] 0 0 0 0
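If each row is either entirely NA or lists the column numbers it belongs to (padded with -1 or 0), which is an assumption read off the sample data, the same result can be sketched with %in%. The helper name to_dummy_row is hypothetical.

```r
# Sample data from the question
data1 <- data.frame(
  x1 = c(1,  2, NA, 0,  1, NA,  4, 0),
  x2 = c(2,  3, NA, 0, -1, NA,  3, 0),
  x3 = c(3, -1, NA, 0, -1, NA, -1, 0),
  x4 = c(4, -1, NA, 0, -1, NA, -1, 0)
)

# Hypothetical helper: 1 where the column number appears in the row, 0 otherwise
to_dummy_row <- function(x) {
  if (anyNA(x)) return(rep(NA_integer_, length(x)))  # missing rows stay missing
  as.integer(seq_along(x) %in% x)                    # -1 and 0 never match a column number
}

data2 <- t(apply(data1, 1, to_dummy_row))
colnames(data2) <- names(data1)
data2
```

Rows that mix NA with valid group numbers would come out as all NA here; that case does not occur in the sample data, so how it should be handled is left open.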

R: Mutate last sequence of specific values

I have a dataframe containing columns with 0's and 1's. I want to mutate the last sequence of 1's into zeros like this:
# data
a <- c(0,1,1,0,1,1,1)
b <- c(0,1,1,1,0,1,1)
c <- data.frame(cbind(a,b))
head(c,7)
# desired output
a_desired <- c(0,1,1,0,0,0,0)
b_desired <- c(0,1,1,1,0,0,0)
c_desired <- data.frame(cbind(a_desired,b_desired))
head(c_desired,7)
such that I end up with the same sequence except that the last sequence of 1's has been mutated into 0's. I've tried using tail() but haven't found a solution so far
You may try using rle
apply(c, 2, function(x){
y <- max(which(rle(x == 1)$values))
x[(sum(rle(x == 1)$lengths[1:(y-1)]) + 1): sum(rle(x == 1)$lengths[1:y])] <- 0
x
})
a b
[1,] 0 0
[2,] 1 1
[3,] 1 1
[4,] 0 1
[5,] 0 0
[6,] 0 0
[7,] 0 0
purrr::map variant
library(purrr)
map(c, function(x){
last1 <- max(which(x == 1))
last0 <- which(x[1:last1] == 0)
c(x[seq_len(max(last0))], rep(0, length(x) - max(last0)))
})
You can number the runs of 1s with a cumsum over the positions where a run starts (a 0 followed by a 1), and then replace the values where this counter equals its max.
sapply(c, function(x) {
. <- cumsum(diff(c(0,x)==1)==1)
`[<-`(x, . == max(.), 0L)
#replace(x, . == max(.), 0L) #Alternative to [<-
})
# a b
#[1,] 0 0
#[2,] 1 1
#[3,] 1 1
#[4,] 0 1
#[5,] 0 0
#[6,] 0 0
#[7,] 0 0
Or the same but written in a different way (thanks to @thelatemail)
sapply(c, function(x) {
cs <- cumsum(diff(c(0,x)==1)==1)
x[cs == max(cs)] <- 0L
x
})
Or another variant iterating from the last element to the beginning until 0 is found.
sapply(c, function(x) {
n <- length(x)
i <- n
while(x[i] != 1 & i>1L) i <- i-1L
while(x[i] != 0 & i>1L) i <- i-1L
x[i:n] <- 0L
x
})
You can write your own function:
fun <- function(x){
y <- rle(x)
y$values[length(y$values)] <- 0
inverse.rle(y)
}
Now run:
data.frame(sapply(c, fun))
a b
1 0 0
2 1 1
3 1 1
4 0 1
5 0 0
6 0 0
7 0 0
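The function above zeroes the final run whatever its value, so it relies on every column ending in 1s. A sketch that looks for the last run of 1s explicitly could look like this; the name zero_last_ones is made up for illustration, and the input is assumed to contain no NA.

```r
# Hypothetical helper: set the last run of 1s to 0, wherever it sits in the vector
zero_last_ones <- function(x) {
  r <- rle(x)
  ones <- which(r$values == 1)                     # indices of the runs made of 1s
  if (length(ones) > 0) r$values[max(ones)] <- 0   # overwrite only the last such run
  inverse.rle(r)
}

zero_last_ones(c(0, 1, 1, 0, 1, 1, 1))  # column a from the question
zero_last_ones(c(1, 1, 0, 1, 0))        # a vector that does not end in 1s
```

It can be applied column-wise in the same way, e.g. data.frame(sapply(c, zero_last_ones)).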
If your sequences always end with 1s, you can try (given df <- data.frame(a,b))
> df * sapply(df, function(x) rev(cumsum(rev(x != 1)) != 0))
a b
1 0 0
2 1 1
3 1 1
4 0 1
5 0 0
6 0 0
7 0 0

I have a function which scores my data as 1 or 0(right or wrong) for MCQ responses(A,B,C,D). I want NA to be scored as NA not as 0

I have a column called 'KEY' which stores the correct response for each MCQ question. I also have a data set which contains the responses to be scored, such as A, B, C, D. Note that some of the responses are NA or --.
I am using this function to score the data:-
key2binary <- function (fulldata, key, score_missing = FALSE){
if(missing(fulldata)) missingMsg('fulldata')
if(missing(key)) missingMsg('key')
if(is.vector(key)) key <- matrix(key)
if (ncol(fulldata) != nrow(key)) stop("Key is not the correct length.\n", call.=FALSE)
colname <- colnames(fulldata)
X <- matrix(0L, nrow(fulldata), ncol(fulldata))
colnames(X) <- colname
for(i in 1L:ncol(X)){
if(all(is.na(key[i,]))) next
X[,i] <- fulldata[,i] %in% key[i,] + 0L
}
if(!score_missing)
X[is.na(fulldata)] <- NA
X
}
The issue is that it also scores responses which are NA as 0. I want responses which are NA to be scored as NA only.
resp <- as.data.frame(matrix(c(
"B","B","NA","D","E",
"B","A","D","NA","E",
"B","A","D","C","E",
"D","D","D","C","E",
"B","C","A","D","A"), ncol=5, byrow=TRUE))
key <- c("B", "D", "D", "C", "E")
key2binary(resp, key)
This returns NA as 0, but I want NA to be returned as NA.
You have NA values as strings ('NA') and not actual NA's. It should work once you convert it to actual NA values.
resp[resp == 'NA'] <- NA
key2binary(resp, key)
# V1 V2 V3 V4 V5
#[1,] 1 0 NA 0 1
#[2,] 1 0 1 NA 1
#[3,] 1 0 1 1 1
#[4,] 0 1 1 1 1
#[5,] 1 0 0 0 0
We can do this in a single line
key2binary(resp, key) * NA^(resp == 'NA')
Output:
V1 V2 V3 V4 V5
[1,] 1 0 NA 0 1
[2,] 1 0 1 NA 1
[3,] 1 0 1 1 1
[4,] 0 1 1 1 1
[5,] 1 0 0 0 0
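The one-liner works because NA^0 is 1 in R while NA^1 stays NA, so NA raised to a logical mask gives a multiplier that blanks out only the flagged cells. A small sketch with made-up scores, independent of key2binary:

```r
scores     <- c(1, 0, 0, 1)                 # illustrative values only
is_missing <- c(FALSE, TRUE, FALSE, TRUE)   # which responses were missing

mask   <- NA^is_missing   # FALSE -> NA^0 = 1, TRUE -> NA^1 = NA
masked <- scores * mask   # unchanged where mask is 1, NA where mask is NA
masked
```

Converting the 'NA' strings to real NA values first, as in the previous answer, has the same effect and keeps the data itself consistent.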

Changing the conditions to replace elements in a vector

Consider the vector:
use = c(1,1,2,2,5,1,2,1,2,5,1)
I'm trying to replace every number other than 5 with NA until the first 5 shows up in the sequence:
ifelse(use != 5,NA,1)
After that, the condition should be
ifelse(use != 5,0,1)
The output would be:
after = c(NA,NA,NA,NA,1,0,0,0,0,1,0)
Any tips?
You should try:
`is.na<-`(match(use, 5, 0), seq(match(5, use) - 1))
[1] NA NA NA NA 1 0 0 0 0 1 0
Here is a base R solution
after <- replace(v<- ifelse(use !=5,NA,1),
which(head(which(v==1),1)<seq_along(v) & is.na(v)),
0)
such that
> after
[1] NA NA NA NA 1 0 0 0 0 1 0
Weird subsetting:
c(NA[!cumsum(use == 5)], +(use[!!cumsum(use == 5)] == 5))
#[1] NA NA NA NA 1 0 0 0 0 1 0
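The cumsum idea behind this subsetting can also be written out step by step. This is only a sketch: it assumes the vector itself contains no NA, and the function name flag_from_first is made up for illustration.

```r
# Hypothetical helper: NA before the first target, then 1 for the target and 0 otherwise
flag_from_first <- function(x, target) {
  seen <- cumsum(x == target) > 0   # FALSE until the first target appears
  out  <- as.numeric(x == target)   # 1 for the target, 0 for everything else
  out[!seen] <- NA                  # blank everything before the first target
  out
}

use <- c(1, 1, 2, 2, 5, 1, 2, 1, 2, 5, 1)
flag_from_first(use, 5)
```

If the target never occurs, seen is FALSE everywhere and the whole result is NA.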
We can use match
replace(use, seq_len(match(5, use) - 1), NA)
#[1] NA NA NA NA 5 1 2 1 2 5 1
Or as @M-- commented, this can be changed to binary with
+(replace(use, seq_len(match(5, use) - 1), NA)==5)
This will work if there's only one 5 in your vector
use = c(1,1,2,2,5,1,2,2,2)
use <- findInterval(use,5)*5
i <- which(use > 0)
if(i > 1) use[1:(i-1)] <- NA
Here is another variation. I threw in some error handling in case there are no 5's in the vector.
test1 <- c(1,1,1,1,2,3,3)
test2 <- c(5,1,1,2,5,1,2,7,8)
test3 <- c(1,1,3,5,6,7,8,2)
test4 <- c(1,2,3,4,5,5,1,5,5,5,1,1,7,8,1)
find_and_replace <- function(vec, target){
tryCatch(
ifelse( seq_along(vec) < which(vec == target)[[1]], NA, ifelse(vec == target, 1, 0)),
error = function(x) {
warning(paste("Warning: No", target))
vec
}
)
}
find_and_replace(test1, 5)
#> Warning: No 5
#> [1] 1 1 1 1 2 3 3
find_and_replace(test2, 5)
#> [1] 1 0 0 0 1 0 0 0 0
find_and_replace(test3, 5)
#> [1] NA NA NA 1 0 0 0 0
find_and_replace(test4, 5)
#> [1] NA NA NA NA 1 1 0 1 1 1 0 0 0 0 0
The following code solves the problem:
use[1:(which(use == 5)[1]-1)] = NA
use[(which(use == 5)[1]+1):length(use)] = 0
use[which(use == 5)[1]] = 1
use
[1] NA NA NA NA 1 0 0 0 0
You can use which to find the location of the target, and then case_when
use <- c(1,1,2,2,5,1,2,1,2)
first_five <- min(which(use == 5))
dplyr::case_when(
seq_along(use) < first_five ~ NA_real_,
seq_along(use) == first_five ~ 1,
TRUE ~ 0
)
#> [1] NA NA NA NA 1 0 0 0 0
use
#> [1] 1 1 2 2 5 1 2 1 2
Created on 2020-01-14 by the reprex package (v0.3.0)
You could detect the first 5,
first_pos <- which(use==5)
and, if such elements exist, set all entries before the first occurrence to NA, all later entries to 0, and the 5s themselves to 1:
if(length(first_pos)>0) {
after_first <- seq_along(use) > first_pos[1]
use[seq_len(first_pos[1]-1)] <- NA
use[after_first] <- 0
use[first_pos] <- 1
}
Note that first_pos[1] is called in case there are more than one 5.

Aggregating every 10 columns in binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to python and do not know how to do this transformation without making loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = floor(years[1] / 10) * 10, to = floor(tail(years, 1) / 10) * 10 + 10, by = 10)
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to
colSums with something like function(cols) lapply(cols, f), where f is the aggregating
function, e.g., max.
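For plain sums, base R's rowsum() may be a shorter route: it adds up the rows of a matrix by group, so transposing first puts the years in rows. A sketch on a tiny made-up matrix (the values are chosen only for illustration):

```r
# Two rows, four year columns spanning two decades
m <- rbind(A = c(1, 0, 1, 1),
           B = c(0, 0, 0, 1))
colnames(m) <- 1878:1881

decade <- as.numeric(colnames(m)) %/% 10 * 10   # 1870 1870 1880 1880
counts <- t(rowsum(t(m), group = decade))       # sum the year columns within each decade
colnames(counts) <- paste(colnames(counts),
                          as.numeric(colnames(counts)) + 9, sep = "-")
counts

binary <- (counts > 0) + 0                      # 1 if at least one event in the decade
binary
```

Here counts holds the number of events per decade and binary the presence/absence version asked for in the question.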
