Store first non-missing value in a new column - r

Hi, I have several columns that represent scores. For each STUDENT I want to take the first non-NA score and store it in a new column called FIRST.
Here is my reproducible example. This is the data I have now:
df <- data.frame(STUDENT=c(1,2,3,4,5),
CLASS=c(90,91,92,93,95),
SCORE1=c(10,NA,NA,NA,NA),
SCORE2=c(2,NA,8,NA,NA),
SCORE3=c(9,6,6,NA,NA),
SCORE4=c(NA,7,5,1,9),
ROOM=c(01,02, 03, 04, 05))
This is the column I am aiming to add:
df$FIRST <- c(10,6,8,1,9)
This is my attempt:
df$FIRSTGUESS <- max.col(!is.na(df[3:6]), "first")

This is exactly what coalesce from package dplyr does. As described in its documentation:
Given a set of vectors, coalesce() finds the first non-missing value
at each position.
Therefore, you can simply do:
library(dplyr)
df$FIRST <- do.call(coalesce, df[grepl('SCORE', names(df))])
This is the result:
> df
STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRST
1 1 90 10 2 9 NA 1 10
2 2 91 NA NA 6 7 2 6
3 3 92 NA 8 6 5 3 8
4 4 93 NA NA NA 1 4 1
5 5 95 NA NA NA 9 5 9
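If loading dplyr is not an option, the same idea can be sketched in base R. The helper name coalesce_cols below is hypothetical (it is not part of any package); it folds the columns from left to right, keeping a value once it has been found and filling only the positions that are still NA.

```r
# Score columns from the question (CLASS and ROOM omitted for brevity)
df <- data.frame(STUDENT = 1:5,
                 SCORE1 = c(10, NA, NA, NA, NA),
                 SCORE2 = c(2, NA, 8, NA, NA),
                 SCORE3 = c(9, 6, 6, NA, NA),
                 SCORE4 = c(NA, 7, 5, 1, 9))

# Hypothetical helper: first non-NA value at each position across a list of vectors
coalesce_cols <- function(cols) {
  Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), cols)
}

score_cols <- df[grepl("^SCORE", names(df))]
df$FIRST <- coalesce_cols(score_cols)
df$FIRST
```

A data frame is a list of columns, so Reduce visits SCORE1 to SCORE4 in order; the column order therefore decides which score counts as "first".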

You can do this with apply and which.min(is.na(...))
df$FIRSTGUESS <- apply(df[, grep("^SCORE", names(df))], 1, function(x)
x[which.min(is.na(x))])
df
# STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRSTGUESS
#1 1 90 10 2 9 NA 1 10
#2 2 91 NA NA 6 7 2 6
#3 3 92 NA 8 6 5 3 8
#4 4 93 NA NA NA 1 4 1
#5 5 95 NA NA NA 9 5 9
Note that we need is.na instead of !is.na because FALSE corresponds to 0 and we want to return the first (which.min) FALSE value.
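A small sketch of what which.min does on a logical vector may make this clearer. It also shows the edge case of a row with no scores at all, where the result is simply NA. The vectors here are made up for illustration.

```r
x <- c(NA, NA, 6, 7)
is.na(x)                # TRUE TRUE FALSE FALSE
which.min(is.na(x))     # 3, the position of the first FALSE (FALSE counts as 0)
first_value <- x[which.min(is.na(x))]

# Edge case: every value is missing, so the minimum is TRUE at position 1
all_missing <- c(NA_real_, NA_real_)
first_all_missing <- all_missing[which.min(is.na(all_missing))]
```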

Unfortunately, max.col gives the indices of the max values, not the values themselves. However, we can extract the values from the original data frame using an mapply call.
#Select only the columns that have "SCORE" in their name
sub_df <- df[grepl("SCORE", names(df))]
#Get the column index of the first non-NA value in each row
inds <- max.col(!is.na(sub_df), ties.method = "first")
#Extract the value at that index for each row
df$FIRSTGUESS <- mapply(function(x, y) sub_df[x,y], 1:nrow(sub_df), inds)
df
# STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRST FIRSTGUESS
#1 1 90 10 2 9 NA 1 10 10
#2 2 91 NA NA 6 7 2 6 6
#3 3 92 NA 8 6 5 3 8 8
#4 4 93 NA NA NA 1 4 1 1
#5 5 95 NA NA NA 9 5 9 9
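As a possible alternative to the mapply call, the (row, column) pairs can be looked up in one step with matrix indexing. This is only a sketch, and it assumes all score columns are numeric so that as.matrix keeps them numeric.

```r
sub_df <- data.frame(SCORE1 = c(10, NA, NA, NA, NA),
                     SCORE2 = c(2, NA, 8, NA, NA),
                     SCORE3 = c(9, 6, 6, NA, NA),
                     SCORE4 = c(NA, 7, 5, 1, 9))

m <- as.matrix(sub_df)
# Column index of the first non-NA value in each row
inds <- max.col(!is.na(m), ties.method = "first")
# Indexing with a two-column matrix picks one cell per (row, column) pair
first_vals <- m[cbind(seq_len(nrow(m)), inds)]
first_vals
```

For a row that is entirely NA, max.col returns 1 and the looked-up value is NA, which is usually the desired result.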

Using zoo::na.locf, borrowing the setup of sub_df from Ronak's answer:
df['New'] <- zoo::na.locf(t(sub_df), fromLast = TRUE)[1, ]
df
STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM New
1 1 90 10 2 9 NA 1 10
2 2 91 NA NA 6 7 2 6
3 3 92 NA 8 6 5 3 8
4 4 93 NA NA NA 1 4 1
5 5 95 NA NA NA 9 5 9

Related

How can I make some row values NA if other is NA in R?

I have a dataframe with three columns Time, observed value (Obs.Value), and an interpolated value (Interp.Value). If the value of Obs.Value is NA then the value of Interp.Value should also be NA. I can make the whole row NA but I need to keep the Time value.
Here is the reprex:
dat <- data.frame(matrix(ncol = 3, nrow = 10))
x <- c("Time", "Obs.Value", "Interp.Value")
colnames(dat) <- x
dat$Time <- seq(1,10,1)
dat$Obs.Value <- c(5,6,7,NA,NA,5,4,3,NA,2)
interp <- approx(dat$Time,dat$Obs.Value,dat$Time)
dat$Interp.Value <- round(interp$y,1)
Here is the code that makes the whole row NA:
dat[with(dat, is.na(Obs.Value) | is.na(Interp.Value)), ] <- NA
Here is what the output should look like:
Time Obs.Value Interp.Value
1 1 5 5
2 2 6 6
3 3 7 7
4 4 NA NA
5 5 NA NA
6 6 5 5
7 7 4 4
8 8 3 3
9 9 NA NA
10 10 2 2
dat$Interp.Value[is.na(dat$Obs.Value)] <- NA
dat
# Time Obs.Value Interp.Value
# 1 1 5 5
# 2 2 6 6
# 3 3 7 7
# 4 4 NA NA
# 5 5 NA NA
# 6 6 5 5
# 7 7 4 4
# 8 8 3 3
# 9 9 NA NA
# 10 10 2 2
Or if either column being NA is sufficient, then
dat[!complete.cases(dat[,-1]),-1] <- NA
If there is only one column to change, @r2evans' answer is pretty straightforward and the way to go. If there is more than one column that you want to change, you can use across in dplyr.
library(dplyr)
dat %>%
mutate(across(-c(Time,Obs.Value), ~replace(., is.na(Obs.Value), NA)))
# Time Obs.Value Interp.Value
#1 1 5 5
#2 2 6 6
#3 3 7 7
#4 4 NA NA
#5 5 NA NA
#6 6 5 5
#7 7 4 4
#8 8 3 3
#9 9 NA NA
#10 10 2 2
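The multi-column case can also be sketched in base R by selecting the target columns by name. The extra column Smooth.Value is hypothetical and exists only to show that several columns are masked at once.

```r
dat <- data.frame(Time = 1:10,
                  Obs.Value = c(5, 6, 7, NA, NA, 5, 4, 3, NA, 2))
dat$Interp.Value <- round(approx(dat$Time, dat$Obs.Value, dat$Time)$y, 1)
# Hypothetical second derived column, for illustration only
dat$Smooth.Value <- dat$Interp.Value * 2

# Every column except the key columns gets masked where Obs.Value is NA
target_cols <- setdiff(names(dat), c("Time", "Obs.Value"))
dat[is.na(dat$Obs.Value), target_cols] <- NA
dat
```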

Update a variable if dplyr filter conditions are met

With the command df %>% filter(is.na(df)[,2:4]), filter returns a new, subsetted df containing the rows with NAs in columns 2, 3 and 4. What I want is not a new subsetted df, but rather to assign a value, for example 1, to a new variable called "Exclude" in the original df.
This example with mutate was not exactly what I was looking for, but it is close:
Use dplyr's filter and mutate to generate a new variable
I would also need the same to work with other filter conditions.
For example, I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
A B C D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
A B C D Exclude
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Any good ideas on how the filter condition could be used to update the data frame easily? The hard workaround would be to generate this subset, create the new variable for all of its rows and then join it back, but that is not tidy code.
We can do this with base R using vectorized rowSums
df$Exclude <- NA^!rowSums(is.na(df[-1]))
-output
df
# A B C D Exclude
#1 1 11 21 31 NA
#2 2 12 22 32 NA
#3 3 NA NA NA 1
#4 4 14 24 34 NA
#5 5 NA NA NA 1
#6 6 16 26 36 NA
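The NA^ trick is compact but easy to misread, so here is a step-by-step sketch of the same computation. It relies on two facts about R arithmetic: NA^0 is 1 and NA^1 is NA. The intermediate variable names are made up for illustration.

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[c(3, 5), 2:4] <- NA

na_count <- rowSums(is.na(df[-1]))  # number of NAs per row: 0 0 3 0 3 0
no_missing <- !na_count             # TRUE where the row has no NA
flag <- NA^no_missing               # NA^TRUE is NA, NA^FALSE is 1

# A more explicit spelling of the same rule
flag_explicit <- ifelse(na_count > 0, 1, NA)
```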
Does this work:
library(dplyr)
df %>% rowwise() %>%
mutate(Exclude = +any(is.na(c_across(everything()))), Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
A B C D Exclude
<int> <int> <int> <int> <int>
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
# A B C D Exclude
# 1 1 11 21 31 NA
# 2 2 12 22 32 NA
# 3 3 NA NA NA 1
# 4 4 14 24 34 NA
# 5 5 NA NA NA 1
# 6 6 16 26 36 NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution (note that it yields 0, not NA, for rows without missing values):
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, assign 1 or NA in ifelse.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
rowwise() %>%
mutate(Exclude = ifelse(
is.na(sum(c_across(where(is.numeric)))), 1, NA
))
#> # A tibble: 6 x 5
#> # Rowwise:
#> A B C D Exclude
#> <int> <int> <int> <int> <dbl>
#> 1 1 11 21 31 NA
#> 2 2 12 22 32 NA
#> 3 3 NA NA NA 1
#> 4 4 14 24 34 NA
#> 5 5 NA NA NA 1
#> 6 6 16 26 36 NA
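All of the answers above flag a row when any of the columns is NA. In the example data that coincides with all of columns B to D being NA, but the two rules differ once a row is only partly missing. A minimal base R sketch of both rules follows; the partly missing cell in row 2 and the column names ExcludeAny and ExcludeAll are added here purely for illustration.

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[c(3, 5), 2:4] <- NA
df[2, 3] <- NA  # hypothetical partly missing row, not in the original data

na_mat <- is.na(df[2:4])
# Flag rows where at least one of B, C, D is NA
df$ExcludeAny <- ifelse(rowSums(na_mat) > 0, 1, NA)
# Flag rows only where B, C and D are all NA
df$ExcludeAll <- ifelse(rowSums(na_mat) == ncol(na_mat), 1, NA)
df
```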

how to avoid overwriting when merging multiple datasets in r

Suppose I have two datasets df1 and df2 as follows:
df1 <- data.frame(Id = c(1L,2L,3L,4L,5L,6L,7L,8L), pricetag = c("na","na","na","na","na","na","na","na"),stringsAsFactors=F)
df2 <- data.frame(Id=c(1L,2L,3L,4L), price = c(10,20,30,40), stringsAsFactors=F)
> df1
Id pricetag
1 1 na
2 2 na
3 3 na
4 4 na
5 5 na
6 6 na
7 7 na
8 8 na
> df2
Id price
1 1 10
2 2 20
3 3 30
4 4 40
I am trying to insert price values from df2 to df1 by matching the id using this function.
df1$pricetag <- df2$price[match(df1$Id, df2$Id)]
which provides this:
> df1
Id pricetag
1 1 10
2 2 20
3 3 30
4 4 40
5 5 NA
6 6 NA
7 7 NA
8 8 NA
I have a third dataset and am trying to follow the same procedure.
df3 <- data.frame(Id=c(5L,6L,7L,8L), price=c(50,60,70,80),stringsAsFactors=F)
> df3
Id price
1 5 50
2 6 60
3 7 70
4 8 80
df1$pricetag <- df3$price[match(df1$Id, df3$Id)]
> df1
Id pricetag
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 50
6 6 60
7 7 70
8 8 80
However, it overwrites the price information coming from df2 in the df1. Is there any way to turn this option off when I replicate the same procedure?
To make an update join (overwrite df1 with the data in df3), replace
df1$pricetag <- df3$price[match(df1$Id, df3$Id)]
with:
idx <- match(df1$Id, df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxn] <- df3$price[idx[idxn]]
rm(idx, idxn)
df1
# Id pricetag
#1 1 10
#2 2 20
#3 3 30
#4 4 40
#5 5 50
#6 6 60
#7 7 70
#8 8 80
To make a gap-fill join instead (fill only the NAs in df1 with the data in df3), use:
idxg <- which(is.na(df1$pricetag))
idx <- match(df1$Id[idxg], df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxg][idxn] <- df3$price[idx[idxn]]
rm(idxg, idx, idxn)
df1
# Id pricetag
#1 1 10
#2 2 20
#3 3 30
#4 4 40
#5 5 50
#6 6 60
#7 7 70
#8 8 80
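With more than one lookup table, the gap-fill variant can be wrapped in a small function and applied in a loop. This is a sketch under two assumptions: pricetag starts out as a real numeric NA (not the string "na" used in the question), and fill_gaps is a hypothetical helper name.

```r
df1 <- data.frame(Id = 1:8, pricetag = NA_real_)
df2 <- data.frame(Id = 1:4, price = c(10, 20, 30, 40))
df3 <- data.frame(Id = 5:8, price = c(50, 60, 70, 80))

# Hypothetical helper: fill only the rows of target whose pricetag is still NA
fill_gaps <- function(target, source) {
  gaps <- which(is.na(target$pricetag))
  hit <- match(target$Id[gaps], source$Id)
  # Unmatched Ids give NA in hit, so those gaps simply stay NA
  target$pricetag[gaps] <- source$price[hit]
  target
}

for (src in list(df2, df3)) df1 <- fill_gaps(df1, src)
df1
```

Because earlier tables fill their gaps first, the order of the list decides which source wins when an Id appears in more than one table.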
You can use the is.na function to identify rows to look up:
w = which(is.na(df1$pricetag))
df1$pricetag[w] <- df3$price[match(df1$Id[w], df3$Id)]
Id category pricetag
1 1 na 10
2 2 na 20
3 3 na 30
4 4 na 40
5 5 na 50
6 6 na 60
7 7 na 70
8 8 na 80
There's some more convenient syntax for this with the data.table package:
df1 <- data.frame(Id=c(1L,2L,3L,4L,5L,6L,7L,8L), category="na", stringsAsFactors=FALSE)
library(data.table)
setDT(df1); setDT(df2); setDT(df3)
df1[, pricetag := NA_real_]
for (odf in list(df2, df3))
df1[is.na(pricetag),
pricetag := odf[.SD, on=.(Id), x.price]
][]
Id category pricetag
1: 1 na 10
2: 2 na 20
3: 3 na 30
4: 4 na 40
5: 5 na 50
6: 6 na 60
7: 7 na 70
8: 8 na 80
This kind of merge is called an "update join".
We can use {powerjoin} :
library(powerjoin)
library(tidyverse)
df1 %>%
# have all price cols be named the same
rename(price = pricetag) %>%
# make regular numeric NAs from your "na" characters
mutate_at("price", as.numeric) %>%
# fetch Id cols and incorporate them
power_left_join(df2, "Id", conflict = coalesce_xy) %>%
power_left_join(df3, "Id", conflict = coalesce_xy)
# Id price
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 6 60
# 7 7 70
# 8 8 80

Transpose multiple columns as column names and fill with values in R

The sample data as following:
x <- read.table(header=T, text="
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")
I want the values of the two cost type columns (CostType1 and CostType2) to become the names of new columns, each filled with the corresponding cost for that cost type. If there is no match, the cell should be filled with NA. The ideal format is the following:
a b c d
1 10 NA 1 NA
2 NA 2 20 NA
3 1 50 NA NA
4 40 NA 1 NA
5 NA 30 2 NA
6 60 NA 3 NA
7 NA NA 10 1
8 20 NA NA 2
A solution using tidyverse. We can first work out how many groups there are. In this example, there are two groups. We can convert each group, combine them, and then summarize the data frame with the first non-NA value in each column.
library(tidyverse)
# Get the group numbers
g <- (ncol(x) - 1)/2
x2 <- map_dfr(1:g, function(i){
# Transform the data frame one group at a time
x <- x %>%
select(ID, ends_with(as.character(i))) %>%
spread(paste0("CostType", i), paste0("Cost", i))
return(x)
}) %>%
group_by(ID) %>%
# Select the first non-NA value if there are multiple values
summarise_all(~ first(.[!is.na(.)]))
x2
# # A tibble: 8 x 5
# ID a b c d
# <int> <int> <int> <int> <int>
# 1 1 10 NA 1 NA
# 2 2 NA 2 20 NA
# 3 3 1 50 NA NA
# 4 4 40 NA 1 NA
# 5 5 NA 30 2 NA
# 6 6 60 NA 3 NA
# 7 7 NA NA 10 1
# 8 8 20 NA NA 2
A base solution using reshape
x1 <- setNames(x[,c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[,c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
reshape(data=rbind(x1, x2), idvar="ID", timevar="CostType", v.names="Cost", direction="wide")
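The reshape call names its value columns with a Cost. prefix (Cost.a, Cost.b and so on). A short sketch of the same approach, with the prefix stripped and the columns sorted to match the requested layout, might look like this; the variable name wide is made up.

```r
x <- read.table(header = TRUE, text = "
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")

# Stack the two (type, cost) pairs into one long table
x1 <- setNames(x[, c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[, c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
wide <- reshape(rbind(x1, x2), idvar = "ID", timevar = "CostType",
                v.names = "Cost", direction = "wide")

# Drop the "Cost." prefix and order the cost type columns alphabetically
names(wide) <- sub("^Cost\\.", "", names(wide))
wide <- wide[, c("ID", sort(setdiff(names(wide), "ID")))]
rownames(wide) <- NULL
wide
```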

Selecting values in a dataframe based on a priority list

I am new to R, so I am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need to:
1. Find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
2. Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first row the maximum is from column A (15) and the priority value is from column D (9), since E is NA. The answer I want should look like this.
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code will misbehave for any row that is all NA: max with na.rm=T returns -Inf with a warning for such a row. There is no check against that.
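One way to guard against all-NA rows is to check for them before calling max. The sketch below rebuilds the data frame from the question and uses two hypothetical helpers, row_max and row_first; it is one possible arrangement, not the method of either answer.

```r
df <- data.frame(A = c(15, 3, NA, 0, 1),
                 B = c(5, 2, NA, 1, 2),
                 C = c(20, NA, 3, 0, 3),
                 D = c(9, 5, NA, 7, NA),
                 E = c(NA, 1, NA, 8, NA),
                 F = c(6, 3, NA, NA, 1),
                 G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

# Hypothetical helpers: return NA instead of -Inf when a row has no values
row_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
row_first <- function(x) x[!is.na(x)][1]  # first non-NA in priority order

m <- as.matrix(df[pl])  # columns are already in priority order
df$MAX <- unname(apply(m, 1, row_max))
df$MAX_PR <- abs(df$MAX - unname(apply(m, 1, row_first)))
df
```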
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max.
df <- dat[,pl]
cbind(dat, t(apply(df, 1, function(x) {
x <- na.omit(x)
c(max(x),max(x)-x[1])
}
)
)
)
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1

Resources