The sample data is like this:
data1:
x1  x2  x3  x4
 1   2   3   4
 2   3  -1  -1
NA  NA  NA  NA
 0   0   0   0
 1  -1  -1  -1
NA  NA  NA  NA
 4   3  -1  -1
 0   0   0   0
Each value in a row is the number of a group that the row belongs to. For example, data1[1,] is [1,2,3,4], so row 1 belongs to groups x1, x2, x3 and x4.
-1 means that the cell is blank.
0 means that the row does not belong to the corresponding group (i.e. if 0 is in x1, the datum does not belong to group 1).
NA means missing data; NA appears at random in the dataset.
Edit:
For example, in the 1st row of data1,
[1,2,3,4] means the first, second, third, and fourth columns.
Therefore, the 1st row of data2 will be
[1,1,1,1].
In the 2nd row of data1,
[2,3,-1,-1] means the second and third columns; -1 means that there is a blank.
Therefore, the 2nd row of data2 will be
[0,1,1,0].
My expected outcome is:
data2:
x1  x2  x3  x4
 1   1   1   1
 0   1   1   0
NA  NA  NA  NA
 0   0   0   0
 1   0   0   0
NA  NA  NA  NA
 0   0   1   1
 0   0   0   0
My code is as below:
for (i in 1:8) {
  if (data1$x1[i] %in% c(0)) {
    data1[i, ] = as.list(rep(0, 4))
  } else if (is.na(data1$x1[i])) {
    data1[i, ] = as.list(rep(NA, 4))
  }
}
for (i in which(data1$x1 %nin% c(NA, 0))) {
  for (j in 1:4) {
    if (data1[i, j] < 15 & data1[i, j] > 0) {
      data1[i, j] = m
      data1[i, m] = 1
    }
  }
}
# replace -1 with 0
data1[data1 == -1] = 0
# This for loop creates the dummy matrix
for (i in which(data1$x1 %nin% c(NA, 0))) {
  m = data1[i, ]
  m = m[m > 0]
  for (j in 1:length(m)) {
    data1[i, m] = 1
  }
}
# replace the numbers that are greater than one with zero
data1[data1 > 1] = 0
I wonder if there is a function that can replace the for loops. Any suggestions would be appreciated, thank you!
I am still not entirely sure of the logic, but this might be helpful. Using apply you can evaluate each row independently.
First, create a vector of NA. Then, for each value in the row that is greater than 0, set the element of the vector at that position (the column number given by the value) to 1.
Second, if the vector has at least one 1, change the remaining missing elements to 0.
Third, if all elements of the row are zero and none are missing, make all values in that row 0.
The end result is a matrix in this example.
t(apply(
  data1,
  MARGIN = 1,
  \(x) {
    vec <- rep(NA, length(x))
    vec[x[x > 0]] <- 1
    if (any(vec == 1, na.rm = TRUE)) vec[is.na(vec)] <- 0
    if (any(!is.na(x)) & all(x == 0)) vec <- rep(0, length(x))
    vec
  }
))
Output
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 0 1 1 0
[3,] NA NA NA NA
[4,] 0 0 0 0
[5,] 1 0 0 0
[6,] NA NA NA NA
[7,] 0 0 1 1
[8,] 0 0 0 0
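As a possible alternative to the row-wise apply above, the same result can be sketched with matrix indexing. This is only a sketch: it rebuilds data1 from the question, the names m, out and pos are made up for illustration, and it assumes that a row is either entirely NA or contains no NA at all.

```r
# data1 as given in the question
data1 <- data.frame(
  x1 = c(1, 2, NA, 0, 1, NA, 4, 0),
  x2 = c(2, 3, NA, 0, -1, NA, 3, 0),
  x3 = c(3, -1, NA, 0, -1, NA, -1, 0),
  x4 = c(4, -1, NA, 0, -1, NA, -1, 0)
)
m <- as.matrix(data1)

# start from a matrix of zeros with the same shape and column names
out <- matrix(0, nrow(m), ncol(m), dimnames = dimnames(m))

# (row, col) positions of the positive entries; the value stored at each
# position is the column of `out` that should become 1
pos <- which(!is.na(m) & m > 0, arr.ind = TRUE)
out[cbind(pos[, "row"], m[pos])] <- 1

# assumption: a row is either completely NA or has no NA at all
out[rowSums(is.na(m)) == ncol(m), ] <- NA
out
```

Indexing a matrix with a two-column (row, column) matrix sets all the target cells in one assignment, so no explicit loop over rows is needed.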
Related
I have a column called 'KEY' which stores the correct response for each MCQ question. I also have a data set which contains responses such as A, B, C, D. Note that some of the responses in the data are NA or --.
I am using this function to score the data:
key2binary <- function (fulldata, key, score_missing = FALSE){
    if(missing(fulldata)) missingMsg('fulldata')
    if(missing(key)) missingMsg('key')
    if(is.vector(key)) key <- matrix(key)
    if (ncol(fulldata) != nrow(key)) stop("Key is not the correct length.\n", call.=FALSE)
    colname <- colnames(fulldata)
    X <- matrix(0L, nrow(fulldata), ncol(fulldata))
    colnames(X) <- colname
    for(i in 1L:ncol(X)){
        if(all(is.na(key[i,]))) next
        X[,i] <- fulldata[,i] %in% key[i,] + 0L
    }
    if(!score_missing)
        X[is.na(fulldata)] <- NA
    X
}
The issue is that it also scores NA responses as 0. I want NA responses to be scored as NA.
resp <- as.data.frame(matrix(c(
"B","B","NA","D","E",
"B","A","D","NA","E",
"B","A","D","C","E",
"D","D","D","C","E",
"B","C","A","D","A"), ncol=5, byrow=TRUE))
key <- c("B", "D", "D", "C", "E")
key2binary(resp, key)
This returns 0 for the NA responses, but I want NA to be returned for them.
You have NA values as strings ('NA') and not actual NA's. It should work once you convert it to actual NA values.
resp[resp == 'NA'] <- NA
key2binary(resp, key)
# V1 V2 V3 V4 V5
#[1,] 1 0 NA 0 1
#[2,] 1 0 1 NA 1
#[3,] 1 0 1 1 1
#[4,] 0 1 1 1 1
#[5,] 1 0 0 0 0
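The difference between the string 'NA' and a real NA can be checked on a small vector. This is a minimal sketch; resp_col is a made-up vector for illustration, not part of the question's data.

```r
# one string "NA" and one real missing value
resp_col <- c("B", "NA", NA, "D")

is.na(resp_col)   # only the third element is a real NA
# FALSE FALSE  TRUE FALSE

# %in% never returns NA, so it is safe to use as an index here
resp_col[resp_col %in% "NA"] <- NA

is.na(resp_col)   # now the string has become a real NA as well
# FALSE  TRUE  TRUE FALSE
```

Because is.na() does not treat the string "NA" as missing, the `X[is.na(fulldata)] <- NA` step inside key2binary has nothing to act on until the strings are converted.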
We can do this in a single line
key2binary(resp, key) * NA^(resp == 'NA')
Output
V1 V2 V3 V4 V5
[1,] 1 0 NA 0 1
[2,] 1 0 1 NA 1
[3,] 1 0 1 1 1
[4,] 0 1 1 1 1
[5,] 1 0 0 0 0
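The one-liner relies on how R evaluates powers of NA: NA^0 is 1, while NA^1 stays NA. A small sketch of the idea, with made-up vectors mask and scores standing in for `resp == 'NA'` and the output of key2binary:

```r
mask <- c(FALSE, TRUE, FALSE)   # TRUE where the response was the string "NA"

# FALSE is treated as 0 and TRUE as 1, so this is c(NA^0, NA^1, NA^0)
multiplier <- NA^mask
multiplier
# 1 NA 1

scores <- c(1, 0, 1)
masked_scores <- scores * multiplier   # multiplying by NA blanks the score
masked_scores
# 1 NA 1
```

Multiplying by 1 leaves a score unchanged and multiplying by NA turns it into NA, which is why the product puts NA exactly where the mask is TRUE.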
I have a data frame like this:
Q17a_17 Q17a_18 Q17a_19 Q17a_20 Q17a_21 Q17a_22 Q17a_23
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 1
3 0 0 0 0 0 1 1
4 0 0 0 0 0 0 1
5 1 0 0 0 1 1 0
6 0 0 0 0 0 1 1
7 1 1 0 0 1 0 1
I would like to merge Q17a_17, Q17a_19 and Q17a_23 into a new column with a new name. The old columns Q17a_17, Q17a_19 and Q17a_23 should be deleted.
The new column should contain a single value per row, under the following conditions: NA if there was NA before, 1 if any of the merged columns was 1 before (as in row 3, 4 or 7), and 0 if there were only zeros before.
Maybe this is really simple, but I have been struggling for hours...
The approach I use here is to first compute a vector which is NA when an NA value occurs in at least one of the three columns, and zero otherwise. Also, we compute a vector containing the numerical result you want. What you want can be obtained by logically ORing together the three columns. Then, adding these two computed vectors together produces the desired result.
na.vector <- df$Q17a_17 * df$Q17a_19 * df$Q17a_23
na.vector[!is.na(na.vector)] <- 0
num.vector <- as.numeric(df$Q17a_17 | df$Q17a_19 | df$Q17a_23)
df$new_column <- na.vector + num.vector
df <- df[ , -which(names(df) %in% c("Q17a_17", "Q17a_19", "Q17a_23"))]
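A shorter variant of the same idea can be sketched with rowSums. This is a different technique from the answer above, not a restatement of it: the data frame is rebuilt from the question, merge_cols is a made-up helper name, and NA is returned whenever any of the merged columns is NA.

```r
# the data frame from the question
df <- data.frame(
  Q17a_17 = c(NA, 0, 0, 0, 1, 0, 1),
  Q17a_18 = c(NA, 0, 0, 0, 0, 0, 1),
  Q17a_19 = c(NA, 0, 0, 0, 0, 0, 0),
  Q17a_20 = c(NA, 0, 0, 0, 0, 0, 0),
  Q17a_21 = c(NA, 0, 0, 0, 1, 0, 1),
  Q17a_22 = c(NA, 0, 1, 0, 1, 1, 0),
  Q17a_23 = c(NA, 1, 1, 1, 0, 1, 1)
)

merge_cols <- c("Q17a_17", "Q17a_19", "Q17a_23")

# rowSums() is NA if any merged column is NA, otherwise the count of 1s;
# comparing with 0 turns that into NA / TRUE / FALSE, then into NA / 1 / 0
df$new_column <- as.integer(rowSums(df[merge_cols]) > 0)

# drop the old columns
df <- df[, !(names(df) %in% merge_cols)]
df
```

Selecting columns with a logical vector also behaves sensibly when none of the names match, whereas `-which(...)` with an empty result would drop every column.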
I have a time series (or simply a vector) that is binary, returning 0 or 1's depending on some condition (generated with ifelse). I would like to be able to return the counts (in this case corresponds to time series, so days) in between the 1's.
I can do this very easily in Excel, by simply calling the Column I am trying to calculate and then adding the row above (if working with Ascending data, or calling row below if working with descending). See below
I tried doing something similar in R but I am getting an error.
DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))
Is there an easier way to do this besides writing a function?
Row# Date Condition1 DaysBetweenCondition1
1 5/2/2007 NA NA
2 5/3/2007 NA NA
3 5/4/2007 NA NA
4 5/5/2007 NA NA
5 5/6/2007 0 NA
6 5/7/2007 0 NA
7 5/8/2007 0 NA
8 5/9/2007 0 NA
9 5/10/2007 0 NA
10 5/11/2007 0 NA
11 5/12/2007 0 NA
12 5/13/2007 0 NA
13 5/14/2007 1 0
14 5/15/2007 0 1
15 5/16/2007 0 2
16 5/17/2007 0 3
17 5/18/2007 0 4
18 5/19/2007 0 5
19 5/20/2007 0 6
20 5/21/2007 0 7
21 5/22/2007 1 0
22 5/23/2007 0 1
23 5/24/2007 0 2
24 5/25/2007 0 3
25 5/26/2007 0 4
26 5/27/2007 1 0
27 5/28/2007 0 1
28 5/29/2007 0 2
29 5/30/2007 1 0
(fwiw, the Dates in this example are made up, in the real data I am using business days so a bit different, and I dont want to reference them, just put in for clarity)
This gets the counting done in one line. Borrowing PhiSeu's code and a line from How to reset cumsum at end of consecutive string and modifying it to count zeros:
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
df_date$DaysBetweenCondition1<-sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
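To see what the last line does, it can be taken apart on a short vector. This is only an illustrative sketch; x and the intermediate names are made up and are not part of the answer's code.

```r
x <- c(NA, 0, 1, 0, 0, 1)

# TRUE where the condition is 0, FALSE where it is 1, NA where it is NA
is_zero <- !x

# rle() finds runs of equal values; each NA counts as its own run
run_lengths <- rle(is_zero)$lengths
run_lengths
# 1 1 1 2 1

# sequence() counts 1, 2, ... within each run
position_in_run <- sequence(run_lengths)
position_in_run
# 1 1 1 1 2 1

# multiplying by is_zero sets the 1s to 0 and keeps NA as NA
days_since <- position_in_run * is_zero
days_since
# NA 1 0 1 2 0
```

Note that a 0 that comes before the first 1 is counted starting from 1 here, whereas the table in the question shows NA for those rows.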
R is very good when working with rows that don't depend on each other. Therefore a lot of functions are vectorized. When working with functions that depend on the value of other rows it is not so easy.
At the moment I can only provide you with a solution using a loop. I assume there is a better solution without a loop.
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
# loop over rows
for(i in seq_len(nrow(df_date))){
  if(is.na(df_date$Condition1[i])) {
    df_date$DaysBetweenCondition1[i] <- NA
  } else if(df_date$Condition1[i]==0 && (i == 1 || is.na(df_date$Condition1[i-1]))) {
    # first row, or previous row is NA: there is no earlier 1 to count from
    df_date$DaysBetweenCondition1[i] <- NA
  } else if(df_date$Condition1[i]==0) {
    # NA + 1 stays NA, so the NA carries forward until the first 1
    df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i-1]+1
  } else {
    df_date$DaysBetweenCondition1[i] <- 0
  }
}
Here's a solution that should be relatively fast
f0 = function(x) {
y = x # template for return value
isna = is.na(x) # used a couple of times
grp = cumsum(x[!isna]) # use '1' to mark start of each group
lag = lapply(tabulate(grp + 1), function(len) {
seq(0, length.out=len) # sequence from 0 to len-1
})
split(y[!isna], grp) <- lag # split y, set to lag element, unsplit
data.frame(x, y)
}
A faster version avoids the lapply() loop; it creates a vector along x (seq_along(x)) and an offset vector describing how the vector along x should be corrected based on the start value of the original vector
f1 = function(x0) {
    y0 = x0
    x = x0[!is.na(x0)]
    y = seq_along(x)
    offset = rep(c(1, y[x==1]), tabulate(cumsum(x) + 1))
    y0[!is.na(y0)] = y - offset
    data.frame(x0, y0)    # return y0, which has the same length as x0
}
Walking through the first solution, here's some data
> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
[1] NA NA NA NA NA 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
[26] 1 0 0 1 0 0 0 0 0 0
use cumsum() to figure out the group the non-NA data belong to
> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
[1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8
use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences
> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
finally, create a vector to hold the result, and use split<- to update it with the lag
> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
x y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 0 0
7 0 1
8 0 2
9 1 0
10 1 0
11 0 1
12 0 2
13 1 0
14 0 1
15 0 2
16 1 0
17 0 1
...
The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
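The offset calculation can be followed step by step on a short vector without NAs. This is a sketch for illustration; the intermediate names starts, group_sizes and lag_days are made up and do not appear in f1.

```r
x <- c(0, 0, 1, 0, 0, 1, 0)
y <- seq_along(x)                        # 1 2 3 4 5 6 7

# position at which each group starts: 1 for the leading group,
# then the position of every 1 in x
starts <- c(1, y[x == 1])                # 1 3 6

# number of elements in each group (group ids are cumsum(x) = 0 0 1 1 1 2 2)
group_sizes <- tabulate(cumsum(x) + 1)   # 2 3 2

# repeat each start value across its group: the "fill down" step
offset <- rep(starts, group_sizes)       # 1 1 3 3 3 6 6

# distance from the start of the group
lag_days <- y - offset
lag_days
# 0 1 0 1 2 0 1
```

Every step works on whole vectors, which is why this version avoids the lapply() loop of the first solution.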
I am working with a matrix containing a large number of NA. I would like to record the length of each sequence of NA in a new matrix.
The following example should make this clearer.
#Generating a random 5x5 population matrix with 15 NA
M=matrix(sample(1:9,25,T),5)
M[sample(1:length(M),15,F)]=NA
dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
M
Year1 Year2 Year3 Year4 Year5
City1 2 NA NA NA NA
City2 NA NA NA 6 8
City3 1 NA NA 6 NA
City4 NA 5 NA NA 1
City5 8 NA 1 NA 2
The desired output is the following. e.g. 4 4 4 4 denotes a sequence of 4 consecutive NA.
Year1 Year2 Year3 Year4 Year5
City1 0 4 4 4 4
City2 3 3 3 0 0
City3 0 2 2 0 1
City4 1 0 2 2 0
City5 0 1 0 1 0
Do you have an idea of how I could go about that?
Not the most efficient code ever:
r1=c(1,1,NA,1,1)
r2=c(1,NA,NA,1,1)
r3=c(1,NA,NA,NA,1)
r4=c(NA,NA,1,1,1)
r5=c(1,1,1,NA,NA)
M=rbind(r1,r2,r3,r4,r5)
like #Pascal pointed out, your approach will convert the entire matrix to characters, so you can assign the 1s to 0s instead and do this:
M[M == 1] <- 0
(xx <- t(apply(M, 1, function(x) {
s <- sum(is.na(x))
if (is.na(x[1])) x[is.na(x)] <- rep(4, s) else
if (is.na(tail(x, 1))) x[is.na(x)] <- rep(5, s) else
x[is.na(x)] <- s
x
})))
# [,1] [,2] [,3] [,4] [,5]
# r1 0 0 1 0 0
# r2 0 2 2 0 0
# r3 0 3 3 3 0
# r4 4 4 0 0 0
# r5 0 0 0 5 5
This is your desired output. If you don't believe me, convert the 0s back to 1s and assign the letters based on the integers
xx[xx > 0] <- letters[xx[xx > 0]]
xx[xx == '0'] <- 1
r1=c(1,1,"a",1,1)
r2=c(1,"b","b",1,1)
r3=c(1,"c","c","c",1)
r4=c("d","d",1,1,1)
r5=c(1,1,1,"e","e")
R=rbind(r1,r2,r3,r4,r5)
identical(R, xx)
# [1] TRUE
This is another basis for a function that would be applied over each row. I tried, but couldn't avoid a for loop:
x = c(1,NA,1,NA,NA,1,NA,NA,NA,1,NA,NA,NA,NA)
#Find the Start and End of each sequence of NA's (Vectorized)
(start <- is.na(x) * c(T,!is.na(x[-length(x)])))
#> [1] 0 1 0 1 0 0 1 0 0 0 1 0 0 0
(end <- is.na(x) * c(!is.na(x[-1]),T))
#> [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
# The difference between the start and end of each sequence, plus 1, is the sequence length
wStart <- which(!!start)
wEnd <- which(!!end)
sequenceLength <- wEnd - wStart + 1
# replace each sequence of NAs with its class
for(i in seq_along(wStart))
  x[`:`(wStart[i],wEnd[i])] <- letters[sequenceLength[i]]
x
#> [1] "1" "a" "1" "b" "b" "1" "c" "c" "c" "1" "d" "d" "d" "d"
as in:
(xx <- t(apply(M, 1, function(x) {
  wStart <- which(!!(is.na(x) * c(T, !is.na(x[-length(x)]))))
  wEnd <- which(!!(is.na(x) * c(!is.na(x[-1]), T)))
  for(i in seq_along(wStart))
    x[`:`(wStart[i],wEnd[i])] <- letters[wEnd[i] - wStart[i] + 1]
  return(x)
})))
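For the City/Year matrix in the question, where each NA should be replaced by the length of the NA run it belongs to and every other cell by 0, one possible sketch uses rle() on each row. The matrix below is typed in from the question's printed example (the original was random), and na_run_lengths is a made-up helper name.

```r
# the example matrix as printed in the question
M <- matrix(
  c( 2, NA, NA, NA, NA,
    NA, NA, NA,  6,  8,
     1, NA, NA,  6, NA,
    NA,  5, NA, NA,  1,
     8, NA,  1, NA,  2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("City", 1:5), paste0("Year", 1:5))
)

# for one row: run length for cells inside an NA run, 0 for all other cells
na_run_lengths <- function(x) {
  runs <- rle(is.na(x))
  rep(ifelse(runs$values, runs$lengths, 0L), runs$lengths)
}

out <- t(apply(M, 1, na_run_lengths))
dimnames(out) <- dimnames(M)
out
```

rle() already returns the length of every run, so repeating each length across its own run gives the requested layout without tracking start and end positions by hand.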
I would like to extract every row from the data frame my.data for which the first non-zero element is a 1.
my.data <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
0 2 1 1
2 1 2 1
1 1 1 2
0 0 0 0
0 1 0 0
', header = TRUE)
my.data
desired.result <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
1 1 1 2
0 1 0 0
', header = TRUE)
desired.result
I am not even sure where to begin. Sorry if this is a duplicate. Thank you for any suggestions or advice.
Here's one approach:
# index of rows
idx <- apply(my.data, 1, function(x) any(x) && x[as.logical(x)][1] == 1)
# extract rows
desired.result <- my.data[idx, ]
The result:
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
Probably not the best answer, but:
rows.to.extract <- apply(my.data, 1, function(x) {
    no.zeroes <- x[x!=0]                  # removing 0
    to.return <- no.zeroes[1] == 1        # checking whether the first non-zero number is 1
    # if a row is all 0, then to.return will be NA
    # this fixes that problem
    to.return[is.na(to.return)] <- FALSE  # if row is all 0
    to.return
})
my.data[rows.to.extract, ]
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
1. Use apply to iterate over all rows:
first.element.is.one <- apply(my.data, 1, function(x) x[x != 0][1] == 1)
The function passed to apply takes the first ([1]) non-zero (x != 0) element of x and compares it to 1 (== 1). It is called once for each row; in your example, x is a vector of four elements.
2. Use which to extract the indices of the candidate rows (and remove NA values, too):
desired.rows <- which(first.element.is.one)
3. Select the rows of the matrix; you probably know how to do this.
Bonus question: Where do the NA values mentioned in step 2 come from?
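The steps above can be run end to end as a minimal sketch. The data frame is rebuilt from the question, and all_zero is a made-up vector that shows what happens for a row without any non-zero element.

```r
# my.data from the question
my.data <- data.frame(
  x1 = c(0, 0, 0, 2, 1, 0, 0),
  x2 = c(0, 0, 2, 1, 1, 0, 1),
  x3 = c(1, 0, 1, 2, 1, 0, 0),
  x4 = c(1, 1, 1, 1, 2, 0, 0)
)

# compare the first non-zero element of each row with 1
first.element.is.one <- apply(my.data, 1, function(x) x[x != 0][1] == 1)

# which() keeps the positions that are TRUE and silently drops NA
desired.rows <- which(first.element.is.one)

# select the rows
desired.result <- my.data[desired.rows, ]
desired.result

# taking element [1] of an empty vector gives NA
all_zero <- c(0, 0, 0, 0)
all_zero[all_zero != 0][1]
```

The last two lines show that asking for the first element of an empty selection yields NA, which is what happens for the all-zero row 6 of my.data.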