How to avoid overwriting when merging multiple datasets in R

Suppose I have two datasets df1 and df2 as follows:
df1 <- data.frame(Id = c(1L,2L,3L,4L,5L,6L,7L,8L), pricetag = c("na","na","na","na","na","na","na","na"),stringsAsFactors=F)
df2 <- data.frame(Id=c(1L,2L,3L,4L), price = c(10,20,30,40), stringsAsFactors=F)
> df1
Id pricetag
1 1 na
2 2 na
3 3 na
4 4 na
5 5 na
6 6 na
7 7 na
8 8 na
> df2
Id price
1 1 10
2 2 20
3 3 30
4 4 40
I am trying to insert the price values from df2 into df1 by matching on Id, using this line:
df1$pricetag <- df2$price[match(df1$Id, df2$Id)]
which provides this:
> df1
Id pricetag
1 1 10
2 2 20
3 3 30
4 4 40
5 5 NA
6 6 NA
7 7 NA
8 8 NA
I also have a third dataset, and I am trying to follow the same procedure.
df3 <- data.frame(Id=c(5L,6L,7L,8L), price=c(50,60,70,80),stringsAsFactors=F)
> df3
Id price
1 5 50
2 6 60
3 7 70
4 8 80
df1$pricetag <- df3$price[match(df1$Id, df3$Id)]
> df1
Id pricetag
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 50
6 6 60
7 7 70
8 8 80
However, this overwrites the price information in df1 that came from df2. Is there any way to prevent this when I repeat the same procedure?

Replace
df1$pricetag <- df3$price[match(df1$Id, df3$Id)]
with the following if you want an update join (overwrite df1 with the data in df3):
idx <- match(df1$Id, df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxn] <- df3$price[idx[idxn]]
rm(idx, idxn)
df1
# Id pricetag
#1 1 10
#2 2 20
#3 3 30
#4 4 40
#5 5 50
#6 6 60
#7 7 70
#8 8 80
or with the following if you want a gap-fill join (fill only the NAs in df1 with data from df3):
idxg <- which(is.na(df1$pricetag))
idx <- match(df1$Id[idxg], df3$Id)
idxn <- which(!is.na(idx))
df1$pricetag[idxg][idxn] <- df3$price[idx[idxn]]
rm(idxg, idx, idxn)
df1
# Id pricetag
#1 1 10
#2 2 20
#3 3 30
#4 4 40
#5 5 50
#6 6 60
#7 7 70
#8 8 80
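The gap-fill variant can be wrapped in a small helper so that the same call works for any number of lookup tables. This is a minimal sketch in base R, assuming the pricetag column already holds real NA values (not the string "na"); the name fill_gaps and its arguments are hypothetical and not part of any package.

```r
# Hypothetical helper: fill only the NA entries of target[[col]]
# with lookup[[value]], matching rows on the column named by `key`.
fill_gaps <- function(target, lookup, key = "Id",
                      col = "pricetag", value = "price") {
  gaps  <- which(is.na(target[[col]]))             # rows still missing a value
  hit   <- match(target[[key]][gaps], lookup[[key]])
  found <- !is.na(hit)                             # gaps that have a match
  target[[col]][gaps[found]] <- lookup[[value]][hit[found]]
  target
}

df1 <- data.frame(Id = 1:8, pricetag = NA_real_)
df2 <- data.frame(Id = 1:4, price = c(10, 20, 30, 40))
df3 <- data.frame(Id = 5:8, price = c(50, 60, 70, 80))
df4 <- data.frame(Id = 1L, price = 999)  # must NOT replace the value from df2

# Reduce() applies fill_gaps(df1, df2), then feeds the result on to df3, df4
df1 <- Reduce(fill_gaps, list(df2, df3, df4), df1)
df1$pricetag
# 10 20 30 40 50 60 70 80
```

Because only rows that are still NA are looked up, the order of the list decides which dataset wins when two of them contain the same Id.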

You can use the is.na function to identify rows to look up:
w = which(is.na(df1$pricetag))
df1$pricetag[w] <- df3$price[match(df1$Id[w], df3$Id)]
Id category pricetag
1 1 na 10
2 2 na 20
3 3 na 30
4 4 na 40
5 5 na 50
6 6 na 60
7 7 na 70
8 8 na 80
There's some more convenient syntax for this with the data.table package:
df1 <- data.frame(Id=c(1L,2L,3L,4L,5L,6L,7L,8L), category="na", stringsAsFactors=FALSE)
library(data.table)
setDT(df1); setDT(df2); setDT(df3)
df1[, pricetag := NA_real_]
for (odf in list(df2, df3))
  df1[is.na(pricetag),
      pricetag := odf[.SD, on=.(Id), x.price]
  ][]
Id category pricetag
1: 1 na 10
2: 2 na 20
3: 3 na 30
4: 4 na 40
5: 5 na 50
6: 6 na 60
7: 7 na 70
8: 8 na 80
This kind of merge is called an "update join".

We can use {powerjoin} :
library(powerjoin)
library(tidyverse)
df1 %>%
# have all price cols be named the same
rename(price = pricetag) %>%
# make regular numeric NAs from your "na" characters
mutate_at("price", as.numeric) %>%
# left-join df2 and df3 on Id; coalesce_xy keeps the existing price unless it is NA
power_left_join(df2, "Id", conflict = coalesce_xy) %>%
power_left_join(df3, "Id", conflict = coalesce_xy)
# Id price
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 6 60
# 7 7 70
# 8 8 80
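The same "left join, then coalesce the conflicting columns" idea can be written without extra packages. The following is only a sketch in base R, assuming the price column in df1 starts out as numeric NA; the suffixes .old and .new are arbitrary names chosen for this illustration.

```r
df1 <- data.frame(Id = 1:8, price = NA_real_)
df2 <- data.frame(Id = 1:4, price = c(10, 20, 30, 40))
df3 <- data.frame(Id = 5:8, price = c(50, 60, 70, 80))

for (src in list(df2, df3)) {
  # Left join: every row of df1 is kept; both price columns get a suffix
  m <- merge(df1, src, by = "Id", all.x = TRUE,
             suffixes = c(".old", ".new"))
  # Keep the existing value where present, otherwise take the incoming one
  df1 <- data.frame(
    Id    = m$Id,
    price = ifelse(is.na(m$price.old), m$price.new, m$price.old)
  )
}
df1$price
# 10 20 30 40 50 60 70 80
```

Note that merge() sorts the result by the key column by default, so the row order of df1 is only preserved here because Id is already sorted.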

Related

Is there a way to group values in a column between data gaps in R?

I want to group my data in different chunks when the data is continuous. Trying to get the group column from dummy data like this:
a b group
<dbl> <dbl> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
I tried using
test %>% mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test == TRUE)) %>%
select(group, everything())
But it doesn't work as expected:
group a b test
<int> <dbl> <dbl> <lgl>
1 1 1 1 TRUE
2 2 2 2 TRUE
3 3 3 3 TRUE
4 3 4 NA FALSE
5 3 5 NA FALSE
6 3 6 NA FALSE
7 4 7 12 TRUE
8 5 8 15 TRUE
9 5 9 NA FALSE
10 6 10 25 TRUE
Any advice?
Using rle in base R -
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
# a b group group1
#1 1 1 1 1
#2 2 2 1 1
#3 3 3 1 1
#4 4 NA NA NA
#5 5 NA NA NA
#6 6 NA NA NA
#7 7 12 2 2
#8 8 15 2 2
#9 9 NA NA NA
#10 10 25 3 3
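To see why the rle approach works, it can help to look at the intermediate values one step at a time. A short walk-through on the b column from the question (the variable names runs, run_id and group are only used for this illustration):

```r
b <- c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25)

runs <- rle(!is.na(b))
runs$values    # TRUE FALSE TRUE FALSE TRUE   (data run, gap, data run, ...)
runs$lengths   # 3 3 2 1 1

# cumsum() only increases on TRUE runs, so each data run gets its own number
run_id <- cumsum(runs$values)        # 1 1 2 2 3

# Expand back to one value per row, then blank out the gap rows
group <- rep(run_id, runs$lengths)   # 1 1 1 1 1 1 2 2 2 3
group[is.na(b)] <- NA
group
# 1 1 1 NA NA NA 2 2 NA 3
```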
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transition from non-complete cases (using lag) to complete cases.
library(dplyr)
test %>%
mutate(test = complete.cases(.)) %>%
group_by(group = cumsum(test & !lag(test, default = F))) %>%
mutate(group = replace(group, !test, NA))
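The lag-based idea above does not depend on dplyr. As a rough base R sketch of the same logic (the names ok and starts are made up for this example): a new group starts wherever a row is complete and the previous row was not.

```r
test <- data.frame(a = 1:10,
                   b = c(1, 2, 3, NA, NA, NA, 12, 15, NA, 25))

ok <- complete.cases(test)
# Shift `ok` down by one row; the first row has no predecessor, so use FALSE
prev_ok <- c(FALSE, head(ok, -1))
starts  <- ok & !prev_ok          # TRUE only on the first row of each run

# Count the run starts, and keep the counter only on complete rows
test$group <- ifelse(ok, cumsum(starts), NA)
test$group
# 1 1 1 NA NA NA 2 2 NA 3
```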
Alternatively, you could add row numbers to your data.frame. Then, you could filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to original data.
test$rn <- seq.int(nrow(test))
test %>%
filter(complete.cases(.)) %>%
group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
right_join(test) %>%
arrange(rn) %>%
dplyr::select(-rn)
Output
a b group
<int> <int> <dbl>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 NA NA
5 5 NA NA
6 6 NA NA
7 7 12 2
8 8 15 2
9 9 NA NA
10 10 25 3
Using data.table, get rleid then remove group IDs for NAs, then fix the sequence with factor to integer conversion:
library(data.table)
setDT(test)[, group1 := {
x <- complete.cases(test)
grp <- rleid(x)
grp[ !x ] <- NA
as.integer(factor(grp))
}]
# a b group group1
# 1: 1 1 1 1
# 2: 2 2 1 1
# 3: 3 3 1 1
# 4: 4 NA NA NA
# 5: 5 NA NA NA
# 6: 6 NA NA NA
# 7: 7 12 2 2
# 8: 8 15 2 2
# 9: 9 NA NA NA
# 10: 10 25 3 3

Update a variable if dplyr filter conditions are met

The command df %>% filter(is.na(df)[,2:4]) subsets the rows that have NAs in columns 2, 3 and 4 into a new df. What I want is not a new subsetted df, but rather to assign, for example, "1" to a new variable called "Exclude" in the actual df.
This example with mutate was not exactly what I was looking for, but close:
Use dplyr's filter and mutate to generate a new variable
I would also need the same to work with other filter conditions.
For example, I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
A B C D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
A B C D Exclude
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Any good ideas on how the filter subset could easily be used for such an update? The hard workaround would be to generate this subset, create the new variable for it, and then join it back, but that is not tidy code.
We can do this with base R using vectorized rowSums
df$Exclude <- NA^!rowSums(is.na(df[-1]))
-output
df
# A B C D Exclude
#1 1 11 21 31 NA
#2 2 12 22 32 NA
#3 3 NA NA NA 1
#4 4 14 24 34 NA
#5 5 NA NA NA 1
#6 6 16 26 36 NA
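The NA^! expression is compact but easy to misread. It relies on NA^0 being 1 while NA^1 stays NA. A step-by-step sketch on the same data (the intermediate names na_count, exponent and exclude are only for illustration):

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[c(3, 5), 2:4] <- NA

NA^0   # 1
NA^1   # NA

na_count <- rowSums(is.na(df[-1]))   # 0 0 3 0 3 0
exponent <- !na_count                # TRUE for complete rows, FALSE otherwise
exclude  <- unname(NA^exponent)      # NA^TRUE is NA, NA^FALSE is 1
exclude
# NA NA 1 NA 1 NA

# A more explicit spelling of the same rule
exclude2 <- unname(ifelse(na_count > 0, 1, NA))
```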
Does this work:
library(dplyr)
df %>% rowwise() %>%
mutate(Exclude = +any(is.na(c_across(everything()))), Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
A B C D Exclude
<int> <int> <int> <int> <int>
1 1 11 21 31 NA
2 2 12 22 32 NA
3 3 NA NA NA 1
4 4 14 24 34 NA
5 5 NA NA NA 1
6 6 16 26 36 NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
# A B C D Exclude
# 1 1 11 21 31 NA
# 2 2 12 22 32 NA
# 3 3 NA NA NA 1
# 4 4 14 24 34 NA
# 5 5 NA NA NA 1
# 6 6 16 26 36 NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution (note that it gives 0 rather than NA for rows without missing values):
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, assign 1 or NA in ifelse.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
rowwise() %>%
mutate(Exclude = ifelse(
is.na(sum(c_across(where(is.numeric)))), 1, NA
))
#> # A tibble: 6 x 5
#> # Rowwise:
#> A B C D Exclude
#> <int> <int> <int> <int> <dbl>
#> 1 1 11 21 31 NA
#> 2 2 12 22 32 NA
#> 3 3 NA NA NA 1
#> 4 4 14 24 34 NA
#> 5 5 NA NA NA 1
#> 6 6 16 26 36 NA
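The answers above flag a row when any of the checked columns is NA, which matches the example because rows 3 and 5 are missing in all three columns. If the two cases need to be told apart, a vectorised base R sketch could look like this (the extra NA in row 4 and the column names ExcludeAny and ExcludeAll are assumptions added for the illustration):

```r
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA   # all three checked columns missing
df[4, 3]   <- NA   # only one checked column missing

checked   <- df[2:4]
n_missing <- unname(rowSums(is.na(checked)))   # 0 0 3 1 0 0

df$ExcludeAny <- ifelse(n_missing > 0, 1, NA)              # at least one NA
df$ExcludeAll <- ifelse(n_missing == ncol(checked), 1, NA) # every column NA
df[c("A", "ExcludeAny", "ExcludeAll")]
```

Swapping the condition inside ifelse() is also one way to handle the "other filter conditions" mentioned in the question without subsetting and joining back.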

R: creating multiple new variables based on conditions of selection of other variables with similar names

I have a data frame where each condition (in the example: hope, dream, joy) has 5 variables (in the example, coded with the suffixes x, y, z, a, b; they are the same for each condition).
df <- data.frame(matrix(1:16,5,16))
names(df) <- c('ID','hopex','hopey','hopez','hopea','hopeb','dreamx','dreamy','dreamz','dreama','dreamb','joyx','joyy','joyz','joya','joyb')
df[1,2:6] <- NA
df[3:5,c(7,10,14)] <- NA
This is what the data looks like:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16
I want to create a new variable for each condition (hope, dream, joy) that codes whether all of the variables x...b for that condition are NA (0 if all are NA, 1 if any is non-NA). And I want the new variables to be stored in the data frame. Thus, the output should be this:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope joy dream
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12 0 1 1
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13 1 1 1
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14 1 1 1
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15 1 1 1
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16 1 1 1
The code below does it, but I'm looking for a more elegant solution (e.g., for a case where I have even more conditions). I've tried with various combinations of all(), select(), mutate(), but while they all seem useful, I cannot figure out how to combine them to get what I want. I'm stuck and would be interested in learning to code more efficiently. Thanks in advance!
df$hope <- 0
df[is.na(df$hopex) == FALSE | is.na(df$hopey) == FALSE | is.na(df$hopez) == FALSE | is.na(df$hopea) == FALSE | is.na(df$hopeb) == FALSE, "hope"] <- 1
df$dream <- 0
df[is.na(df$dreamx) == FALSE | is.na(df$dreamy) == FALSE | is.na(df$dreamz) == FALSE | is.na(df$dreama) == FALSE | is.na(df$dreamb) == FALSE, "dream"] <- 1
df$joy<- 0
df[is.na(df$joyx) == FALSE | is.na(df$joyy) == FALSE | is.na(df$joyz) == FALSE | is.na(df$joya) == FALSE | is.na(df$joyb) == FALSE, "joy"] <- 1
Here is an option with tidyverse
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(hope = select(., starts_with('hope')) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer)
# hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope
#1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
#2 1 1 4 3 2 3 5 4 5 2 5 NA 4 3 1 1
#3 2 NA 4 4 4 3 5 NA 5 5 4 NA 4 5 1 1
#4 4 3 NA 1 1 1 5 2 NA 5 1 2 1 1 1 1
#5 1 NA 4 NA NA 2 1 5 1 2 NA 3 1 2 5 1
Or with rowSums
df %>%
mutate(hope = +(rowSums(!is.na(select(., starts_with('hope'))))!= 0))
For multiple columns, we can create a function
f1 <- function(dat, colSubstr) {
dplyr::select(dat, starts_with(colSubstr)) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer
}
df %>%
mutate(hope = f1(., 'hope'),
dream = f1(., 'dream'),
joy = f1(., 'joy'))
Or using base R
cbind(df, sapply(split.default(df, sub(".$", "", names(df))),
function(x) +(rowSums(!is.na(x)) != 0)))
If we want to subset columns
nm1 <- setdiff(names(df), "ID")
cbind(df, sapply(split.default(df[nm1], sub(".$", "", names(df[nm1]))),
function(x) +(rowSums(!is.na(x)) != 0)))
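Applied to the data from the question (which has an ID column), the split.default approach can be sketched as follows. The names value_cols, condition and flags are illustrative, and the sketch assumes every variable name is a condition name plus exactly one suffix character.

```r
df <- data.frame(matrix(1:16, 5, 16))
names(df) <- c("ID", paste0(rep(c("hope", "dream", "joy"), each = 5),
                            c("x", "y", "z", "a", "b")))
df[1, 2:6] <- NA
df[3:5, c(7, 10, 14)] <- NA

value_cols <- setdiff(names(df), "ID")
# Drop the last character of each name to recover the condition
condition <- sub(".$", "", value_cols)

# split.default() splits the data frame by columns; each piece holds the
# five variables of one condition
flags <- sapply(split.default(df[value_cols], condition),
                function(x) as.integer(rowSums(!is.na(x)) > 0))

df <- cbind(df, flags)
df[c("ID", "hope", "dream", "joy")]
```

Adding another condition then only requires more columns that follow the same naming pattern; no code changes are needed.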
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 5 * 15, replace = TRUE),
ncol = 15, dimnames = list(NULL, paste0(rep(c("hope", "dream", "joy"),
each = 5), c('x', 'y', 'z', 'a', 'b')))))
df[1,] <- NA

Store first non-missing value in a new column

Ciao, I have several columns that represent scores. For each STUDENT I want to take the first non-NA score and store it in a new column called TEST.
Here is my replicating example. This is the data I have now:
df <- data.frame(STUDENT=c(1,2,3,4,5),
CLASS=c(90,91,92,93,95),
SCORE1=c(10,NA,NA,NA,NA),
SCORE2=c(2,NA,8,NA,NA),
SCORE3=c(9,6,6,NA,NA),
SCORE4=c(NA,7,5,1,9),
ROOM=c(01,02, 03, 04, 05))
This is the column I am aiming to add:
df$FIRST <- c(10,6,8,1,9)
This is my attempt:
df$FIRSTGUESS <- max.col(!is.na(df[3:6]), "first")
This is exactly what coalesce from package dplyr does. As described in its documentation:
Given a set of vectors, coalesce() finds the first non-missing value
at each position.
Therefore, you can simply do:
library(dplyr)
df$FIRST <- do.call(coalesce, df[grepl('SCORE', names(df))])
This is the result:
> df
STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRST
1 1 90 10 2 9 NA 1 10
2 2 91 NA NA 6 7 2 6
3 3 92 NA 8 6 5 3 8
4 4 93 NA NA NA 1 4 1
5 5 95 NA NA NA 9 5 9
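If adding dplyr is not an option, the same left-to-right "first non-missing value" logic can be approximated in base R with Reduce. This is only a sketch; the helper name first_non_na is made up for this example and it is not a drop-in replacement for coalesce().

```r
df <- data.frame(STUDENT = c(1, 2, 3, 4, 5),
                 CLASS   = c(90, 91, 92, 93, 95),
                 SCORE1  = c(10, NA, NA, NA, NA),
                 SCORE2  = c(2, NA, 8, NA, NA),
                 SCORE3  = c(9, 6, 6, NA, NA),
                 SCORE4  = c(NA, 7, 5, 1, 9),
                 ROOM    = c(1, 2, 3, 4, 5))

# Keep `a` where it is present, otherwise fall back to `b`
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Reduce() walks over the SCORE columns from left to right
score_cols <- df[grepl("SCORE", names(df))]
df$FIRST <- Reduce(first_non_na, score_cols)
df$FIRST
# 10 6 8 1 9
```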
You can do this with apply and which.min(is.na(...))
df$FIRSTGUESS <- apply(df[, grep("^SCORE", names(df))], 1, function(x)
x[which.min(is.na(x))])
df
# STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRSTGUESS
#1 1 90 10 2 9 NA 1 10
#2 2 91 NA NA 6 7 2 6
#3 3 92 NA 8 6 5 3 8
#4 4 93 NA NA NA 1 4 1
#5 5 95 NA NA NA 9 5 9
Note that we need is.na instead of !is.na because FALSE corresponds to 0 and we want to return the first (which.min) FALSE value.
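A small worked example on a single row may make the which.min(is.na(...)) step clearer. The vectors below are illustrative values, not taken from a specific row of the output above.

```r
x <- c(SCORE1 = NA, SCORE2 = NA, SCORE3 = 6, SCORE4 = 7)

is.na(x)
# SCORE1 SCORE2 SCORE3 SCORE4
#   TRUE   TRUE  FALSE  FALSE

# FALSE counts as 0 and TRUE as 1, so the first minimum is the first FALSE
pos <- which.min(is.na(x))   # position 3
first_value <- unname(x[pos])
first_value
# 6

# If every score is missing, which.min() returns position 1 and the result is NA
all_missing <- c(NA_real_, NA_real_)
all_missing[which.min(is.na(all_missing))]
# NA
```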
Unfortunately, max.col gives the indices of the max values and not the values themselves. However, we can extract the values from the original data frame using mapply.
#Select only columns which has "SCORE" in it
sub_df <- df[grepl("SCORE", names(df))]
#Get the first non-NA value by row
inds <- max.col(!is.na(sub_df), ties.method = "first")
#Get the inds value by row
df$FIRSTGUESS <- mapply(function(x, y) sub_df[x,y], 1:nrow(sub_df), inds)
df
# STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM FIRST FIRSTGUESS
#1 1 90 10 2 9 NA 1 10 10
#2 2 91 NA NA 6 7 2 6 6
#3 3 92 NA 8 6 5 3 8 8
#4 4 93 NA NA NA 1 4 1 1
#5 5 95 NA NA NA 9 5 9 9
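As an alternative to the mapply call, the row and column indices can be passed together as a two-column matrix, which base R treats as a list of (row, column) positions. A sketch under the same setup (the name pick is only used here):

```r
df <- data.frame(STUDENT = c(1, 2, 3, 4, 5),
                 SCORE1  = c(10, NA, NA, NA, NA),
                 SCORE2  = c(2, NA, 8, NA, NA),
                 SCORE3  = c(9, 6, 6, NA, NA),
                 SCORE4  = c(NA, 7, 5, 1, 9))

sub_df <- df[grepl("SCORE", names(df))]
# Column index of the first non-NA score in each row
inds <- max.col(!is.na(sub_df), ties.method = "first")

# One (row, column) pair per student
pick <- cbind(seq_len(nrow(sub_df)), inds)
df$FIRSTGUESS <- as.matrix(sub_df)[pick]
df$FIRSTGUESS
# 10 6 8 1 9
```

This avoids one function call per row, which may matter on larger data frames.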
Using zoo::na.locf, borrowing the sub_df setup from Ronak's answer:
df['New']=zoo::na.locf(t(sub_df),fromLast=T)[1,]
df
STUDENT CLASS SCORE1 SCORE2 SCORE3 SCORE4 ROOM New
1 1 90 10 2 9 NA 1 10
2 2 91 NA NA 6 7 2 6
3 3 92 NA 8 6 5 3 8
4 4 93 NA NA NA 1 4 1
5 5 95 NA NA NA 9 5 9

Transpose multiple columns as column names and fill with values in R

The sample data as following:
x <- read.table(header=T, text="
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")
I want the values in the second and fourth columns (CostType1 and CostType2) to become the names of new columns, each filled with the cost that corresponds to that cost type. If there is no match, the cell should be filled with NA. The ideal format is the following:
a b c d
1 10 NA 1 NA
2 NA 2 20 NA
3 1 50 NA NA
4 40 NA 1 NA
5 NA 30 2 NA
6 60 NA 3 NA
7 NA NA 10 1
8 20 NA NA 2
A solution using tidyverse. We can first work out how many groups there are; in this example, there are two. We can then convert each group, combine them, and summarize the data frame with the first non-NA value in each column.
library(tidyverse)
# Get the group numbers
g <- (ncol(x) - 1)/2
x2 <- map_dfr(1:g, function(i){
# Transform the data frame one group at a time
x <- x %>%
select(ID, ends_with(as.character(i))) %>%
spread(paste0("CostType", i), paste0("Cost", i))
return(x)
}) %>%
group_by(ID) %>%
# Select the first non-NA value if there are multiple values
summarise_all(~ first(.[!is.na(.)]))
x2
# # A tibble: 8 x 5
# ID a b c d
# <int> <int> <int> <int> <int>
# 1 1 10 NA 1 NA
# 2 2 NA 2 20 NA
# 3 3 1 50 NA NA
# 4 4 40 NA 1 NA
# 5 5 NA 30 2 NA
# 6 6 60 NA 3 NA
# 7 7 NA NA 10 1
# 8 8 20 NA NA 2
A base solution using reshape
x1 <- setNames(x[,c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[,c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
reshape(data=rbind(x1, x2), idvar="ID", timevar="CostType", v.names="Cost", direction="wide")
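The reshape call returns columns named Cost.a, Cost.b and so on, in order of first appearance. The following sketch repeats the steps end to end and tidies the names to match the requested layout; the names long and wide are chosen for this illustration.

```r
x <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")

# Stack both (type, cost) pairs into one long table
long <- rbind(
  setNames(x[, c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost")),
  setNames(x[, c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
)

wide <- reshape(long, idvar = "ID", timevar = "CostType",
                v.names = "Cost", direction = "wide")

# Strip the "Cost." prefix and put the type columns in alphabetical order
names(wide) <- sub("^Cost\\.", "", names(wide))
wide <- wide[, c("ID", sort(setdiff(names(wide), "ID")))]
rownames(wide) <- NULL
wide
```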
