Sort a vector while keeping the NA positions fixed in R

Problem 1 (solved)
How can I sort vector DoB:
DoB <- c(NA, 9, NA, 2, 1, NA)
while keeping the NAs in the same position?
I would like to get:
> DoB
[1] NA 1 NA 2 9 NA
I have tried this (borrowing from this answer)
NAs_index <- which(is.na(DoB))
DoB <- sort(DoB, na.last = NA)
for(i in 0:(length(NAs_index)-1))
DoB <- append(DoB, NA, after=(NAs_index[i+1]+i))
but
> DoB
[1] 1 NA 2 9 NA NA
The answer is
DoB[!is.na(DoB)] <- sort(DoB)
Thanks to @BigDataScientist and @akrun.
Now, Problem 2
Say, I have a vector id
id <- 1:6
I would also like to reorder it by the same principle, so that the values of id follow order(DoB) while the positions where DoB is NA stay fixed:
> id
[1] 1 5 3 4 2 6

You could do:
DoB[!is.na(DoB)] <- sort(DoB)
Edit: Concerning the follow up question in the comments:
You can use order() for that and drop the NAs with the na.last parameter:
data <- data.frame(DoB = c(NA, 9, NA, 2, 1, NA), id = 1:6)
# index id by the ordering so this also works when id is not simply 1:n
data$id[!is.na(data$DoB)] <- data$id[order(data$DoB, na.last = NA)]
data$DoB[!is.na(data$DoB)] <- sort(data$DoB)
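The same idea can be wrapped in a small helper so that it applies to any vector, not only an id that happens to equal 1:n. This is only a minimal sketch; the name reorder_keep_na and the character ids are assumptions made for illustration and are not part of the original question.

```r
# Hypothetical helper: reorder `x` by the sort order of `key`,
# leaving every position where `key` is NA untouched.
reorder_keep_na <- function(x, key) {
  keep <- !is.na(key)
  # order(..., na.last = NA) returns the indices of the non-NA keys, sorted
  x[keep] <- x[order(key, na.last = NA)]
  x
}

DoB <- c(NA, 9, NA, 2, 1, NA)
id  <- c("a", "b", "c", "d", "e", "f")   # illustrative ids, not 1:n

id_sorted  <- reorder_keep_na(id, DoB)
DoB_sorted <- reorder_keep_na(DoB, DoB)
id_sorted
DoB_sorted
```

Calling the helper on id before overwriting DoB matters, because the ordering is computed from the unsorted key.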

We create a logical index and then do the sort
i1 <- is.na(DoB)
DoB[!i1] <- sort(DoB[!i1])
DoB
#[1] NA 1 NA 2 9 NA
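As a hedged illustration of why the lengths line up in this assignment: sort() uses na.last = NA by default, which removes the NAs, so the sorted result has exactly as many elements as there are non-NA slots to fill. The variable name sorted is only introduced for this sketch.

```r
DoB <- c(NA, 9, NA, 2, 1, NA)
i1 <- is.na(DoB)

# sort() drops NAs by default (na.last = NA)
sorted <- sort(DoB)
lengths_match <- length(sorted) == sum(!i1)

# Fill only the non-NA slots; the NA slots are never touched
DoB[!i1] <- sorted
DoB
```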


R function composition for the substitution of values in a data frame

Given the reproducible example below, my objective is to substitute, row-wise, the original values with NA in the adjacent columns of a data frame. I know this problem (in many variants) has already been posted, but I have not yet found a solution with the approach I'm trying to use, i.e. by applying function composition.
In the reproducible example, column a drives the substitution of the original values with NA.
This is what I've done so far.
The very last code snippet is a failing attempt at what I'm actually looking for...
#-----------------------------------------------------------
# ifelse approach, it works but...
# it's error prone: i.e. copy and paste for all columns can introduce a lot of troubles
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b<-ifelse(is.na(df$a), NA, df$b)
df$c<-ifelse(is.na(df$a), NA, df$c)
df
#--------------------------------------------------------
# extraction and substitution approach
# same as above
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b[is.na(df$a)]<-NA
df$c[is.na(df$a)]<-NA
df
#----------------------------------------------------------
# definition of a function
# it's a bit better, but still error prone because of the copy and paste
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix<-function(x,y){
ifelse(is.na(x), NA, y)
}
df$b<-fix(df$a, df$b)
df$c<-fix(df$a, df$c)
df
#------------------------------------------------------------
# this approach is not working as expected!
# the idea behind is of function composition;
# lapply does the fix to some columns of data frame
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix2<-function(x){
x[is.na(x[1])]<-NA
x
}
df[]<-lapply(df, fix2)
df
Any help with this particular approach?
I'm stuck on how to properly write the substitution function passed to lapply.
Thanks
Using lexical closure
With a lexical closure, you first define a function that generates the function you need.
You can then use the generated function as you wish.
# given a column all other columns' values at that row should become NA
# if the driver column's value at that row is NA
# using lexical scoping of R function definitions, one can reach that.
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# whatever vector given, this vector's value should be changed
# according to first column's value
na_accustomizer <- function(df, driver_col) {
## Returns a function which will accustomize any vector/column
## to driver column's NAs
function(vec) {
vec[is.na(df[, driver_col])] <- NA
vec
}
}
df[] <- lapply(df, na_accustomizer(df, "a"))
df
## a b c
## 1 1 3 NA
## 2 2 NA 5
## 3 NA NA NA
#
# na_accustomizer(df, "a") returns
#
# function(vec) {
# vec[is.na(df[, "a"])] <- NA
# vec
# }
#
# which then can be used like you want:
# df[] <- lapply(df, na_accustomizer(df, "a"))
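The reason a closure helps is that lapply hands the function one column at a time, so the function cannot see column a unless it was captured beforehand. A lighter variant of the same idea, shown here only as a sketch, captures just the logical mask instead of the whole data frame (the name mask is an assumption for this example):

```r
df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))

# Compute the driver column's NA mask once, outside lapply
mask <- is.na(df$a)

# The anonymous function closes over `mask` and applies it to each column
df[] <- lapply(df, function(col) {
  col[mask] <- NA
  col
})
df
```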
Using normal functions
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# define it for one column
overtake_NA <- function(df, driver_col, target_col) {
df[, target_col] <- ifelse(is.na(df[, driver_col]), NA, df[, target_col])
df
}
# define it for all columns of df
overtake_driver_col_NAs <- function(df, driver_col) {
for (i in 1:ncol(df)) {
df <- overtake_NA(df, driver_col, i)
}
df
}
overtake_driver_col_NAs(df, "a")
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
Generalize for any predicate function
driver_col_to_other_cols <- function(df, driver_col, pred) {
## overtake any value of the driver column to the other columns of df,
## whenever predicate function (pred) is fulfilled.
# define it for one column
overtake_ <- function(df, driver_col, target_col, pred) {
selectors <- do.call(pred, list(df[, driver_col]))
# NA entries intrude into the selector vector when driver_col has NAs and
# pred does not itself test for NA; treat them as FALSE. For predicates
# such as is.na, which never return NA, this is a no-op.
selectors[is.na(selectors)] <- FALSE
df[, target_col] <- ifelse(selectors, df[, driver_col], df[, target_col])
df
}
for (i in 1:ncol(df)) {
df <- overtake_(df, driver_col, i, pred)
}
df
}
driver_col_to_other_cols(df, "a", function(x) x == 1)
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA 4 6
## if NA entries in the selector vector were not reset to FALSE,
## this would give:
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA NA NA
## hence, when pred doesn't check for NA in 'a',
## those rows have to keep the original columns' values.
driver_col_to_other_cols(df, "a", is.na)
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
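The core of the predicate version can be condensed to a few lines. The following is only a sketch of the same behaviour under the assumption that an NA predicate result should count as "not selected"; the names driver and sel are introduced for illustration.

```r
df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))
pred <- function(x) x == 1

driver <- df$a
sel <- pred(driver)          # TRUE FALSE NA
sel <- !is.na(sel) & sel     # NA results count as FALSE

# Where the predicate holds, copy the driver value into every column
df[] <- lapply(df, function(col) ifelse(sel, driver, col))
df
```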
Try this function: it takes your original dataset as input and returns the cleaned one as output:
Input
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
> df
a b c
1 1 3 NA
2 2 NA 5
3 NA 4 6
Function
fix<-function(df,var_x,list_y)
{
df[is.na(df[,var_x]),list_y]<-NA
return(df)
}
Output
fix(df,"a",c("b","c"))
a b c
1 1 3 NA
2 2 NA 5
3 NA NA NA
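If the target columns should default to "every column except the driver", the list does not have to be typed out. This is a small sketch along the same lines; the function name propagate_na and the default argument are assumptions, not part of the answer above.

```r
# Hypothetical variant: targets default to all columns other than the driver
propagate_na <- function(df, driver, targets = setdiff(names(df), driver)) {
  df[is.na(df[[driver]]), targets] <- NA
  df
}

df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))
cleaned <- propagate_na(df, "a")
cleaned
```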

Loop to assign a value if an observation in one column is equal to other column's name in R

I have a vector and a data set that are similar to:
id_vector <- as.character(c("n01", "n02", "n03"))
df_1 <- data.frame("id" = c("n01", "n02", "n02", "n03"), "n01" = NA, "n02" = NA, "n03" = NA)
df_1$id <- as.character(df_1$id)
And I want the data set to be:
df_2 <- data.frame("id" = c("n01", "n02", "n02", "n03"), "n01" = c(1, NA, NA, NA), "n02" = c(NA, 1, 1, NA), "n03" = c(NA, NA, NA, 1))
The solution should be simple, something like:
for (i in id_vector){
df_1[i][df_1$id == i] <- 1
}
However, I can't use two []s. The error is:
Error in `[<-.data.frame`(`*tmp*`, df_1$id == i, value = 1) :
duplicate subscripts for columns
Any help?
Thanks!
Here, we can extract the column as a vector with [[. df_1[i] is still a data.frame with a single column, which is why chaining a second [ onto it fails.
for (i in id_vector){
df_1[[i]][df_1$id == i] <- 1
}
identical(df_1, df_2)
#[1] TRUE
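The difference between the two extraction operators can be checked directly. A minimal sketch on a cut-down version of the data (the two-row data frame is just for illustration):

```r
df_1 <- data.frame(id = c("n01", "n02"), n01 = NA, stringsAsFactors = FALSE)

single_bracket <- df_1["n01"]     # still a data.frame with one column
double_bracket <- df_1[["n01"]]   # the underlying vector

class(single_bracket)
class(double_bracket)

# Because [[ yields a vector, ordinary logical indexing works on it
df_1[["n01"]][df_1$id == "n01"] <- 1
df_1
```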
You can create a row/column matrix to change value to 1.
df_1[id_vector][cbind(seq_len(nrow(df_1)), match(df_1$id, id_vector))] <- 1
df_1
# id n01 n02 n03
#1 n01 1 NA NA
#2 n02 NA 1 NA
#3 n02 NA 1 NA
#4 n03 NA NA 1
To explain the above: we use match to get the column numbers to replace, whereas seq_len(nrow(df_1)) gives us the row sequence 1:nrow(df_1). Using cbind we combine them into a matrix.
cbind(seq_len(nrow(df_1)), match(df_1$id, id_vector))
# [,1] [,2]
#[1,] 1 1
#[2,] 2 2
#[3,] 3 2
#[4,] 4 3
Now we select only the id_vector columns, index them with the above matrix, and assign 1 to those cells.
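Indexing with a two-column matrix may be easier to see on a plain matrix first. In this sketch (the matrix m and the index idx are made up for illustration), each row of idx names one (row, column) cell:

```r
m <- matrix(NA_real_, nrow = 4, ncol = 3,
            dimnames = list(NULL, c("n01", "n02", "n03")))

# Each row of idx is one (row, column) pair
idx <- cbind(1:4, c(1, 2, 2, 3))

# Assign to exactly those four cells, nothing else
m[idx] <- 1
m
```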

recoding integers in a vector so they register as NA instead?

I hope this isn't a silly question, but I am REALLY struggling to recode a variable in R so that certain values register as NA instead of the placeholder integer that got read in. Respondents who did not answer the question for that column were originally coded as -88, -89 and -99 instead of NA, and I only know how to remove them completely from that column.
I want to keep that row and just have those inputs registered as missing. Recode doesn't seem to work because NA isn't a value.
Thanks!
Maybe you can try replace. With the dummy data
v <- c(1,2,-88,-89,-1,-99,-89)
the call
v <- replace(v,v%in%c(-88,-89,-99),NA)
gives
> v
[1] 1 2 NA NA -1 NA NA
You can use the %in% operator to find all positions in a vector which match with another vector, and then set them to NA as follows:
dat = data.frame(V1 = c(10, 20, 30, -88, -89, -99))
dat$V1[dat$V1 %in% c(-88, -89, -99)] = NA
dat
V1
1 10
2 20
3 30
4 NA
5 NA
6 NA
Here's one way to do it, which will replace all values of -88, -89 and -99 in your data:
for (i in c(-88, -89, -99)){
data.df[data.df == i] <- NA
}
If you need to just replace in one column (e.g. column 'x'):
for (i in c(-88, -89, -99)){
data.df$x[data.df$x == i] <- NA
}
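The loop over the three codes can also be avoided by combining %in% with lapply. This is only a sketch; the contents of data.df and the name na_codes are made up for illustration.

```r
data.df <- data.frame(x = c(1, -88, 3), y = c(-99, 2, -89))
na_codes <- c(-88, -89, -99)

# Replace every placeholder code with NA, column by column
data.df[] <- lapply(data.df, function(col) replace(col, col %in% na_codes, NA))
data.df
```

Unlike ==, %in% never returns NA, so columns that already contain missing values are handled without surprises.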
The most appropriate answer depends on the exact specifics of your data. If you have a numeric variable and all other values are positive, this would work:
somedata <-
tibble::tribble(
~v1, ~v2,
1, 2,
3, 4,
-88, 5,
6, -89,
-99, 1
)
library(tidyverse)
somedata %>%
mutate(v1 = ifelse(v1 < 0, NA, v1))
# A tibble: 5 x 2
v1 v2
<dbl> <dbl>
1 1 2
2 3 4
3 NA 5
4 6 -89
5 NA 1
Thanks so much again to everyone for your help!
I first converted the variable to numeric, then this seemed to work for me:
anesCSV$clinton.withNA <- replace(anesCSV$clintonthermo_numeric,anesCSV$clintonthermo_numeric%in%c(-88,-89,-99),NA)
As someone initially suggested:
v <- replace(v,v%in%c(-88,-89,-99),NA)
I did create a new variable to store the results personally!
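If the data comes from a CSV file, another option is to declare the placeholder codes as missing at read time through the na.strings argument of read.csv. This is a sketch under that assumption; the inline text and the column names x and y are invented for illustration, and in practice a file path would take the place of text =.

```r
csv_text <- "x,y\n1,-88\n-99,2"

# Treat the placeholder codes (and the usual "NA") as missing while reading
dat <- read.csv(text = csv_text,
                na.strings = c("NA", "-88", "-89", "-99"))
dat
```

Note that this applies to every column of the file, which is only appropriate when those codes never occur as real values.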

R data management: Aggregating multiple variables into newly generated variable based on multiple conditions from other variables

I'm a brand-new R user, and this is my first attempt at conditional statements (and my first time posting an online question), so please bear with me.
I'm managing data from a questionnaire survey into something that can be analyzed. As a rookie mistake, I've been using the question form "tick all that apply" frequently - obtaining a matrix with one column per alternative answer. I am trying to combine these by conditional logic into a single column per question.
Using one of these matrices as example, my dataframe BS looks like this:
ID Q29_1 Q29_2 Q29_3 Q29_4
1 1 NA NA 1
2 NA 1 1 1
3 1 NA NA NA
4 NA 1 NA NA
[…]
Using following code
BS1 <- BS %>%
mutate(WaterOutdoors = c("", "Yes")[(Q29_2 %in% c("Outdoors") |
Q29_3 == "Outdoors" |
Q29_4 == "Outdoors")+1])
I get
ID Q29_1 Q29_2 Q29_3 Q29_4 WaterOutdoors
1 1 NA NA 1 Yes
2 NA 1 1 1 Yes
3 1 NA NA NA
4 NA NA NA NA
5 NA 1 NA NA Yes
[…]
I would like to add a "No" option to the "WaterOutdoors" variable for those rows that have ticked Q29_1 (Q29_1 = 1) and did not tick any of the following (Q29_2 = NA, Q29_3 = NA, Q29_4 = NA), and at the same time leave "WaterOutdoors" either empty or NA if none of Q29_1 to Q29_4 have been ticked.
I have tried
BS1$WaterOutdoors <- with(BS1, ifelse(Q29_1 %>% c("InStable") &
is.na(Q29_2) &
is.na(Q29_3) &
is.na(Q29_4),
"No",""))
hoping to get
ID Q29_1 Q29_2 Q29_3 Q29_4 WaterOutdoors
1 1 NA NA 1 Yes
2 NA 1 1 1 Yes
3 1 NA NA NA No
4 NA NA NA NA NA
5 NA 1 NA NA Yes
[…]
but rather got the error-message: "Error in Q29_1 %>% c("InStable") & is.na(Q29_2) :
operations are possible only for numeric, logical or complex types"
My logic is obviously flawed, any input on how I should have gone about doing this would be greatly appreciated!
As is always the case, there are many possible solutions. Here's one using only base code.
## Recreating your dataset
df <- data.frame(
Q29_1 = as.integer(c(1, NA, 1, NA, NA)),
Q29_2 = as.integer(c(NA, 1, NA, NA, 1)),
Q29_3 = as.integer(c(NA, 1, NA, NA, NA)),
Q29_4 = as.integer(c(1, 1, NA, NA, NA)),
stringsAsFactors = FALSE
)
## Convert all NAs to 0
df[is.na(df)] <- 0
## Concatenate all answers into a string
df$WaterOutdoors <- as.character(paste0(df$Q29_1, df$Q29_2, df$Q29_3, df$Q29_4))
## Recode
df$WaterOutdoors <- as.character(
## If first value is 1 and all rest are 0, meaning 1000 then "No"
ifelse(df$WaterOutdoors == "1000", "No",
## If the substring of 2-4 contains a 1 then "Yes"
ifelse(grepl("1", substr(df$WaterOutdoors, 2, 4)), "Yes",
## Else NA (could also have said if else df$WaterOutdoors == "0000")
as.character(NA))
))
print(df)
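The same recoding can be sketched in base R without concatenating the answers into a string, by testing the NAs directly. The helper names outdoors and stable_only are assumptions made for this example.

```r
df <- data.frame(
  Q29_1 = c(1, NA, 1, NA, NA),
  Q29_2 = c(NA, 1, NA, NA, 1),
  Q29_3 = c(NA, 1, NA, NA, NA),
  Q29_4 = c(1, 1, NA, NA, NA)
)

# TRUE if any of Q29_2..Q29_4 was ticked
outdoors <- !is.na(df$Q29_2) | !is.na(df$Q29_3) | !is.na(df$Q29_4)
# TRUE if only Q29_1 was ticked
stable_only <- !is.na(df$Q29_1) & !outdoors

# Rows with nothing ticked fall through to NA
df$WaterOutdoors <- ifelse(outdoors, "Yes",
                           ifelse(stable_only, "No", NA_character_))
df
```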
I could not completely understand the code, but based on what you explained, this can be achieved using case_when from the dplyr package.
case_when returns NA when none of the conditions is met, which here is the case when all variables are NA.
library(dplyr)
Q29_1 <- c(1, NA, 1, NA)
Q29_2 <- c(NA, 1, NA, 1)
Q29_3 <- c(NA, 1, NA, NA)
Q29_4 <- c(1, 1, NA, NA)
BS <- data.frame(Q29_1 = Q29_1,
Q29_2 = Q29_2,
Q29_3 = Q29_3,
Q29_4 = Q29_4)
BS1 <- BS %>%
mutate(WaterOutdoors = case_when(rowSums(.[,c("Q29_2", "Q29_3", "Q29_4")], na.rm = TRUE) > 0 ~ "Yes",
Q29_1 == 1 & rowSums(.[,c("Q29_2", "Q29_3", "Q29_4")], na.rm = TRUE) == 0 ~ "No"))

matching id inside data frame

I made this simple data frame to make my question more clear:
id = c(11, 12, 13, 14, 15)
referenceperson = c("yes", "no", "yes", "no", "yes")
smoke = c(3, 4, 3, NA, 2)
spouseid = c(12, 11, NA, 15, 14)
dataframe = data.frame(id, referenceperson , smoke, spouseid)
I would like to get the amount of smoking of the spouse of a reference person only; in this example, the value 4 for the first observation.
I'm lost here and thanks for any help
Using only the values in your dataframe object, I will step through it and present a compact method of getting the single value you ask for, and then all the values:
> dataframe[ match(dataframe$spouseid[1], dataframe$id) , 'smoke']
[1] 4
That was the method of getting the index of the spouse of the person in the first row and using it to get the 'smoke' value in the referenced row. The next line demonstrates that match will get you all such indices and will return NA where they don't exist.
> match(dataframe$spouseid, dataframe$id)
[1] 2 1 NA 5 4
In R using NA as an index into a dataframe will return an NA, rather than a null value. This preserves sequence information. Therefore, you can get all the smoking values of spouses with this:
> dataframe[ match(dataframe$spouseid, dataframe$id) , 'smoke']
[1] 4 3 NA 2 NA
And then assign those values to a column in the dataframe.
> dataframe$smk_stat_spouse <-
dataframe[ match(dataframe$spouseid, dataframe$id) , 'smoke']
> dataframe
id referenceperson smoke spouseid smk_stat_spouse
1 11 yes 3 12 4
2 12 no 4 11 3
3 13 yes 3 NA NA
4 14 no NA 15 2
5 15 yes 2 14 NA
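Since the question asks for the spouse's value for reference persons only, the matched values can be masked afterwards. A minimal sketch; the column name spouse_smoke_ref is an assumption introduced here.

```r
dataframe <- data.frame(
  id = c(11, 12, 13, 14, 15),
  referenceperson = c("yes", "no", "yes", "no", "yes"),
  smoke = c(3, 4, 3, NA, 2),
  spouseid = c(12, 11, NA, 15, 14)
)

# Row index of each person's spouse (NA where there is none)
spouse_row <- match(dataframe$spouseid, dataframe$id)
spouse_smoke <- dataframe$smoke[spouse_row]

# Keep the spouse's value only for reference persons
dataframe$spouse_smoke_ref <- ifelse(dataframe$referenceperson == "yes",
                                     spouse_smoke, NA_real_)
dataframe
```

The last reference person (id 15) still gets NA, because the smoke value of the spouse (id 14) is itself missing.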
I believe I found a solution, although it is very messy (I'm new to r)
df1 <- cbind(id, referenceperson)
df1 <- as.data.frame(df1)
df2 <- cbind(spouseid, smoke)
df2 <- as.data.frame(df2)
matched <- df2$smoke[match(df1$id, df2$spouseid) ]
refp <- ifelse(referenceperson=="yes", 1, referenceperson)
refp <- ifelse(refp=="no", NA, refp)
refp <- as.numeric(refp)
refp*matched
