Imagine an array of numbers called A. At each position in A, you want to find the index of the most recent earlier item with a matching value. You could easily do this with a for loop as follows:
A = c(1, 1, 2, 2, 1, 2, 2)
answer = rep(NA, length(A)) # preallocate the result
for(i in seq_along(A)){
if(i > 1 && sum(A[1:(i-1)] == A[i]) > 0){
answer[i] = max(which(A[1:(i-1)] == A[i]))
}else{
answer[i] = NA
}
}
But I want to vectorize this for loop (because I'll be applying this principle to a very large data set). I tried using sapply:
answer = sapply(A, FUN = function(x){max(which(A == x))})
As you can see, I need some way of reducing the array to only values that come before x. Any advice?
We can use seq_along to loop over the indices, subset A to the elements before each index, and take the max index where the current value last occurred.
c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
#[1] NA 1 -Inf 3 2 4 6
We can change the -Inf to NA if needed:
inds <- c(NA, sapply(seq_along(A)[-1], function(x) max(which(A[1:(x-1)] == A[x]))))
inds[is.infinite(inds)] <- NA
inds
#[1] NA 1 NA 3 2 4 6
The above method gives a warning because max is called on an empty vector. To remove the warning, we can add a check on the length:
c(NA, sapply(seq_along(A)[-1], function(x) {
inds <- which(A[1:(x-1)] == A[x])
if (length(inds) > 0)
max(inds)
else
NA
}))
#[1] NA 1 NA 3 2 4 6
Here's an approach with dplyr which is more verbose, but easier for me to grok. We start with recording the row_number, make a group for each number we encounter, then record the prior matching row.
library(dplyr)
A2 <- A %>%
as_tibble() %>%
mutate(row = row_number()) %>%
group_by(value) %>%
mutate(last_match = lag(row)) %>%
ungroup()
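The same group-then-lag idea can be sketched in base R with ave, which avoids scanning the prefix of A for every element. This is only an illustration under the assumption that A is a plain atomic vector; the name last_match is made up for the example.

```r
A <- c(1, 1, 2, 2, 1, 2, 2)

# Group the positions by value; within each group, shift the positions
# down by one so every element sees the previous position of its value.
last_match <- ave(seq_along(A), A,
                  FUN = function(idx) c(NA_integer_, head(idx, -1)))
last_match
# NA 1 NA 3 2 4 6
```

Because each group is handled once, the work grows with the length of A rather than with its square, which may matter on a very large data set.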
You can do:
sapply(seq_along(A)-1, function(x)ifelse(any(a<-A[x+1]==A[sequence(x)]),max(which(a)),NA))
[1] NA 1 NA 3 2 4 6
Here's a function that I made (based upon Ronak's answer):
lastMatch = function(A){
uniqueItems = unique(A)
firstInstances = sapply(uniqueItems, function(x){min(which(A == x))}) #for NA
notFirstInstances = setdiff(seq_along(A),firstInstances)
lastMatch_notFirstInstances = sapply(notFirstInstances, function(x) max(which(A[1:(x-1)] == A[x])))
X = rep(NA_integer_, length(A)) # result vector, one slot per element of A
X[firstInstances] = NA
X[notFirstInstances] = lastMatch_notFirstInstances
return(X)
}
I have a dataframe with hundreds of variables, grouped into different factors ("Happy_", "Sad_", etc.), and I want to create a set of new variables indicating whether a participant gave a rating of 4 on any of the variables in one factor. However, if any of the variables in that factor is NA, then the new variable should also be NA.
I have tried the following, but it didn't work:
library(tidyverse)
df <- data.frame(Subj = c("A", "B", "C", "D"),
Happy_1_Num = c(4,2,2,NA),
Happy_2_Num = c(4,2,2,1),
Happy_3_Num = c(1,NA,2,4),
Sad_1_Num = c(2,1,4,3),
Sad_2_Num = c(NA,1,2,4),
Sad_3_Num = c(4,2,2,1))
# Doesn't work
df <- df %>% mutate(Happy_Any4 = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), NA,
ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
Sad_Any4 = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), NA,
ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))
I tried a workaround: first generate a set of variables indicating whether each factor has any NA, then check whether the participant gave any rating of 4. It works, but since I have many factors, I was wondering if there is a more elegant way of doing it.
# workaround
df <- df %>% mutate(
NA_Happy = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), 1,0),
NA_Sad = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), 1,0))
df <- df %>% mutate(
Happy_Any4 = ifelse(NA_Happy == 1, NA,
ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
Sad_Any4 = ifelse(NA_Sad == 1, NA,
ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))
Here is a base R option using split.default -
tmp <- df[-1]
cbind(df, sapply(split.default(tmp, sub('_.*', '', names(tmp))),
function(x) as.integer(rowSums(x== 4) > 0)))
# Subj Happy_1_Num Happy_2_Num Happy_3_Num Sad_1_Num Sad_2_Num Sad_3_Num Happy Sad
#1 A 4 4 1 2 NA 4 1 NA
#2 B 2 2 NA 1 1 2 NA 0
#3 C 2 2 2 4 2 2 0 1
#4 D NA 1 4 3 4 1 NA 1
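sub keeps only the "Happy" or "Sad" part of each column name, split.default splits the columns into groups based on that prefix, and sapply computes, for each group, whether any value of 4 is present in a row. Because rowSums returns NA for a row that contains an NA, those rows come out as NA.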
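To see what each step of the split.default approach produces, here is a small sketch on the question's data. The intermediate names groups, parts and any4 are only there for illustration.

```r
df <- data.frame(Subj = c("A", "B", "C", "D"),
                 Happy_1_Num = c(4, 2, 2, NA),
                 Happy_2_Num = c(4, 2, 2, 1),
                 Happy_3_Num = c(1, NA, 2, 4),
                 Sad_1_Num = c(2, 1, 4, 3),
                 Sad_2_Num = c(NA, 1, 2, 4),
                 Sad_3_Num = c(4, 2, 2, 1))

tmp <- df[-1]                          # drop the Subj column
groups <- sub("_.*", "", names(tmp))   # "Happy" "Happy" "Happy" "Sad" "Sad" "Sad"
parts <- split.default(tmp, groups)    # list with one data frame per prefix

# One column per factor: 1 if any rating is 4, 0 if none, NA if the row has an NA
any4 <- sapply(parts, function(x) as.integer(rowSums(x == 4) > 0))
any4
```

Since the grouping is derived from the column names, adding more factors should not require any extra code, as long as the prefix before the first underscore identifies the factor.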
If you can afford to write each and every factor manually you can do -
library(dplyr)
df %>%
mutate(Happy = as.integer(rowSums(select(., starts_with('Happy')) == 4) > 0),
Sad = as.integer(rowSums(select(., starts_with('Sad')) == 4) > 0))
Here is another workaround that transposes the data.frame and uses apply on the columns. I'm not sure it's more elegant, but here it is ^^
tmp <- cbind(sub("^((Happy)|(Sad))(_.*_Num)$", "\\1", colnames(df)), t(df))
Happy_Any4 <- apply(tmp[tmp[,1]== "Happy", -1], 2,
function(x) ifelse(any(is.na(x)), NA, length(grep("4", x))) )
Sad_Any4 <- apply(tmp[tmp[,1]== "Sad", -1], 2,
function(x) ifelse(any(is.na(x)), NA, length(grep("4", x))) )
df <- cbind(df, Happy_Any4 = Happy_Any4, Sad_Any4 = Sad_Any4)
EDIT: The version above was clumsy; here is a cleaner one.
It works because the sum of any vector that contains an NA is NA.
df <- df %>% mutate(Happy_Any4 = apply(df[,grep("^Happy_.*_Num$", colnames(df))],
1, function(x) 1*(sum(x == 4) > 0)),
Sad_Any4 = apply(df[, grep("^Sad_.*_Num$", colnames(df))],
1, function(x) 1*(sum(x == 4) > 0)))
apply goes over every row, restricted to the columns whose names match the pattern (found with grep). Comparing the row to 4 gives a logical vector whose sum is the number of occurrences; an NA in the row makes that sum NA. I then check whether the sum is above 0, and the 1* turns the resulting logical into a numeric.
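The NA behaviour this relies on can be checked on two small vectors. This is just an illustration; row_ok and row_na are made-up names standing in for one row of a factor.

```r
row_ok <- c(4, 4, 1)    # no missing ratings
row_na <- c(4, 2, NA)   # one missing rating

sum(row_ok == 4)            # 2: two ratings equal 4
sum(row_na == 4)            # NA: the NA propagates through sum
1 * (sum(row_ok == 4) > 0)  # 1
1 * (sum(row_na == 4) > 0)  # NA, even though a 4 is present
```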
Suppose you want to subset a data.frame where the rule for keeping rows is based
on a lag between columns 'a' and 'b':
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
#output
a b
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
Essentially, if 'a' = 1 you want to keep that row as well as the subsequent run of rows in
'b' that have a value of 1. This capture continues until the next row with a = 0 & b = 0.
I've tried using nested 'ifelse()' statements, but I am stuck on incorporating logical tests that depend on earlier rows.
Suggestions?
This is how I would do it. There are probably options out there that require maybe 1 or 2 lines less.
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
library(dplyr)
df %>%
mutate(grp = cumsum(a==1|a+b==0)) %>%
group_by(grp) %>%
filter(any(a == 1)) %>%
ungroup() %>%
select(a, b)
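To make the grouping step easier to follow, here is a base R sketch that prints the intermediate group ids. It uses the same cumsum condition as above; the names grp and keep are only for illustration.

```r
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
                 b = c(0,1,1,0,0,1,1,0,0,0,1,1))

# A new group starts at every row where a == 1 or where both a and b are 0
grp <- cumsum(df$a == 1 | df$a + df$b == 0)
grp
# 1 1 1 2 3 3 3 4 5 6 6 6

# Keep only the groups that contain a row with a == 1
keep <- grp %in% grp[df$a == 1]
df[keep, ]
```

Groups 1 and 3 are the only ones that start with a == 1, so rows 1 to 3 and 5 to 7 of the input are kept, which matches the desired output.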
A solution without dplyr. Work with a flag:
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# create new empty df
new_df <- read.table(text = "", col.names = c("a", "b"))
a_okay = FALSE # initialize the flag
for (row_number in seq_len(nrow(df))) { # loop over each row of the original df
# if a is 1, we add the row to the new df and set the flag to TRUE
if (df[row_number, "a"] == 1) {
a_okay = TRUE
new_df[nrow(new_df) + 1, ] = c(df[row_number, "a"], df[row_number, "b"])
}
# now we consider the rows where a is not 1
else {
# if b is 1 and we are still following an a == 1: add the row
if (df[row_number, "b"] == 1 && a_okay) {
new_df[nrow(new_df) + 1, ] = c(df[row_number, "a"], df[row_number, "b"])
}
# if b is 0, we reset the flag
else {
a_okay = FALSE
}
}
}
Another base solution, inspired by @Wietse de Vries's answer and @Ben's comment.
# input
df <- data.frame(a = c(1,0,0,0,1,0,0,0,0,0,0,0),
b = c(0,1,1,0,0,1,1,0,0,0,1,1))
# identify groups
df$grp <- cumsum(df$a == 1 | df$b == 0)
# subset df by groups with first element of a == 1
df <- do.call(rbind, split(df, df$grp)[by(df, df$grp, function(x) {x$a[1] == 1})])
# remove grp
df$grp <- NULL
Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i], paste0 returns the whole column, and assigning that vector to the single element data$A[i] triggers the warning.
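The length mismatch behind the warning can be seen directly. A small sketch, using a standalone vector A in place of data$A:

```r
A <- c("2058600192", "4087600101", "30138182591")

# Without [i]: one padded string per row, so three values for a single slot
whole_column <- paste0(0, A)
length(whole_column)   # 3

# With [i]: exactly one value, which fits the single element being replaced
one_value <- paste0(0, A[1])
one_value              # "02058600192"
```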
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you were interested in using dplyr here's another solution using transmute.
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = "0"), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length detects the length of each string.
str_pad adds the leading zeros; see the stringr documentation for str_pad() for details.
grepl detects the strings in column A that now start with a zero, so the same rows in column B get a leading zero too.
You may use the ifelse vectorized function here:
pad <- nchar(data$A) == 10 # test column A once, before it is modified
data$A <- ifelse(pad, paste0("0", data$A), data$A)
data$B <- ifelse(pad, paste0("0", data$B), data$B)
data
A B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011
I have a dataset with only NA values, and I'm trying to produce a table that shows that this particular dataset is 100% missing.
But the output shows that the NA value is being counted both as "1" and "0." This code works for a different subset of data that doesn't contain missing values. Why is it different for this dataset?
library(dplyr)
t1 <- data.frame(characteristic = rep(NA, 5), year = sample(x = 1990:1995, size = 100, replace = TRUE))
t1 %>%
select(year, characteristic) %>%
group_by(year) %>%
mutate(YES = length(characteristic[characteristic == "1"]),
NO = length(characteristic[characteristic == "0"]),
COUNT = n(),
MISSING = sum(is.na(characteristic))) %>%
summarize(CHARACTERISTIC = paste(round(first(YES / COUNT) * 100, 2), "%"),
NO_CHARACTERISTIC= paste(round(first(NO / COUNT) * 100, 2), "%"),
MISSING = paste(round(first(MISSING / COUNT) * 100, 2), "%"))
Comparing a value with NA using == returns NA, and subsetting a vector with an NA index returns NA, so those NA elements are counted by length.
Check this example :
x <- c(1:3, NA, 2:3, NA)
length(x)
#[1] 7
x == 3
#[1] FALSE FALSE TRUE NA FALSE TRUE NA
x[x == 3]
#[1] 3 NA 3 NA
length(x[x == 3])
#[1] 4
Here, you expected the output to be 2, but it is 4 because of the NA values. You could use:
length(na.omit(x[x == 3]))
#[1] 2
but that is very convoluted. Use sum on the logical values instead:
sum(x == 3, na.rm = TRUE)
#[1] 2
So try :
library(dplyr)
t1 %>%
group_by(year) %>%
mutate(YES = sum(characteristic == "1", na.rm = TRUE),
NO = sum(characteristic == "0", na.rm = TRUE))
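For the percentages in the question's table, the same sum-on-logicals idea can be wrapped in a small helper. This is only a sketch; pct is a hypothetical name, and x stands in for the characteristic values of one year.

```r
# Percentage of TRUE values in a logical vector, ignoring NA comparisons
# in the numerator but counting every element in the denominator.
pct <- function(flag) round(100 * sum(flag, na.rm = TRUE) / length(flag), 2)

x <- rep(NA, 5)   # a group where every value is missing

pct(x == "1")     # 0
pct(x == "0")     # 0
pct(is.na(x))     # 100
```

With this, an all-NA group reports 0 % yes, 0 % no and 100 % missing, instead of counting the NA values in every category.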
This might be slightly silly, but I would appreciate a better way to deal with this problem. I have a dataframe like the following:
a <- matrix(1,5,3)
a[1:2,2] <- NA
a[1,c(1,3)] <- NA
a[3:5,2] <- 2
a[2:5,3] <- 3
a <- data.frame(a)
colnames(a) = c("First", "Second", "Third")
I want to sum only some of the columns, but I would like to keep NA when all elements in the summed columns are NA. In short, if I sum the First and Second columns I want to get something like
mySum <- c(NA, 1, 3, 3, 3)
Neither of the two options below provides what I want
rowSums(a[, c("First", "Second")])
rowSums(a[, c("First", "Second")], na.rm=TRUE)
but on the positive side I have resolved this by using a combination of is.na and all
mySum <- rowSums(a[, c("First", "Second")], na.rm=TRUE)
iNA = apply(a[, c("First", "Second")], 2, is.na)
iAllNA = apply(iNA, 1, all)
mySum[iAllNA] = NA
This feels slightly awkward though so I was wondering if there is a smarter way to handle this.
Using apply with MARGIN = 1, we go over every row: if all the elements in the row are NA we return NA, otherwise we return their sum with the NA values removed.
apply(a[c("First", "Second")], 1, function(x)
ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE)))
#[1] NA 1 3 3 3
mycols = c("First", "Second")
replace(x = rowSums(a[mycols], na.rm = TRUE),
list = rowSums(is.na(a[mycols])) == length(mycols),
values = NA)
#[1] NA 1 3 3 3
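The rowSums approach above can be wrapped in a reusable function. A minimal sketch, assuming the selected columns are numeric; row_sum_keep_na is a hypothetical name, and the data frame is the one from the question written out directly.

```r
a <- data.frame(First  = c(NA, 1, 1, 1, 1),
                Second = c(NA, NA, 2, 2, 2),
                Third  = c(NA, 3, 3, 3, 3))

row_sum_keep_na <- function(data, cols) {
  sums <- rowSums(data[cols], na.rm = TRUE)
  # A row is "all NA" when its NA count equals the number of summed columns
  all_na <- rowSums(is.na(data[cols])) == length(cols)
  sums[all_na] <- NA
  unname(sums)
}

row_sum_keep_na(a, c("First", "Second"))
# NA 1 3 3 3
```

Counting the NA values per row with rowSums(is.na(...)) stays vectorized, so it avoids the per-row function call that apply makes.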