I would like to replace values of one dataframe with NA of another dataframe that have the same identifier. That is, for all values of df1 that have the same id, assign the "NA" values of df2 at the corresponding id and indices.
I have df1 and df2:
df1 =data.frame(id = c(1,1,2,2,6,6),a = c(2,4,1,7,5,3), b = c(5,3,0,3,2,5),c = c(9,3,10,33,2,5))
df2 =data.frame(id = c(1,2,6),a = c("NA",0,"NA"), b= c("NA", 9, 9),c=c(0,"NA","NA"))
What I would like is df3:
df3 = data.frame(id = c(1,1,2,2,6,6),a = c("NA","NA",1,7,"NA","NA"), b = c("NA","NA",0,3,2,5),c = c(9,3,"NA","NA","NA","NA"))
I have tried the lookup function and the library "data.table", but I could not get the correct df3. Could anyone please help me with this?
We can do a join on 'id' and then replace the values by multiplying each column of df1 by NA^is.na(y), which is NA where the matching df2 value is NA and 1 otherwise.
library(data.table)
nm1 <- names(df1)[-1]
setDT(df1)[df2, (nm1) := Map(function(x, y) x*(NA^is.na(y)), .SD,
mget(paste0('i.', nm1))), on = .(id), .SDcols = nm1]
df1
# id a b c
#1: 1 NA NA 9
#2: 1 NA NA 3
#3: 2 1 0 NA
#4: 2 7 3 NA
#5: 6 NA 2 NA
#6: 6 NA 5 NA
data
df2 =data.frame(id = c(1,2,6),a = c(NA,0,NA), b= c(NA, 9, 9),c=c(0,NA,NA))
NOTE: In the OP's post NA were "NA"
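The multiplication trick relies on how R treats NA in exponentiation: NA^0 is 1, while NA^1 is NA. Here is a minimal base R sketch of just that step, using made-up vectors `x` and `y` that are not part of the original post:

```r
# x: values from df1; y: matching values from df2 (illustrative vectors)
x <- c(2, 4, 1)
y <- c(NA, 0, NA)

# is.na(y) is TRUE/FALSE, which acts as 1/0 in the exponent:
# NA^1 is NA and NA^0 is 1
mult <- NA^is.na(y)

# multiplying keeps x where y is not NA and turns it into NA elsewhere
res <- x * mult
res
```

Because the multiplier is either 1 or NA, this only works for numeric columns.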
Since your NA values are actually the text "NA", you will have to turn all your variables into text (with as.character). You can then join both dataframes by the id column. Since both dataframes have columns a, b, and c, the join will rename them a.x, b.x, and c.x (from df1) and a.y, b.y, and c.y (from df2).
After that you can create new columns a, b, and c. These then have "NA" whenever a.y == "NA" and a.x otherwise (and so on). If your NA values were real NA, you would need to test with is.na(value) instead (see the example at the end of the code).
library(dplyr)
df1 %>%
mutate_all(as.character) %>% # all variables as text
left_join(df2 %>%
mutate_all(as.character) ## all variables as text
, by = "id") %>% ## join tables by 'id'; a.x from df1 and a.y from df2 and so on
mutate(a = case_when(a.y == "NA" ~ "NA", TRUE ~ a.x), ## if a.y == "NA" take this,else a.x
b = case_when(b.y == "NA" ~ "NA", TRUE ~ b.x),
c = case_when(c.y == "NA" ~ "NA", TRUE ~ c.x)) %>%
select(id, a, b, c) ## keep only these initial columns
id a b c
1 1 NA NA 9
2 1 NA NA 3
3 2 1 0 NA
4 2 7 3 NA
5 6 NA 2 NA
6 6 NA 5 NA
## if your dataframe had real NA, this is how you can test:
missing_value <- NA
is.na(missing_value) ## TRUE
missing_value == NA ## NA, not TRUE: comparing with == never detects NA
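If df2 holds real NA values rather than the text "NA", the same result can be sketched in base R with match() and a logical mask. This is an alternative sketch, not the method used above, and it assumes every id in df1 also appears in df2 (the names `idx`, `cols`, and `mask` are illustrative):

```r
df1 <- data.frame(id = c(1, 1, 2, 2, 6, 6),
                  a = c(2, 4, 1, 7, 5, 3),
                  b = c(5, 3, 0, 3, 2, 5),
                  c = c(9, 3, 10, 33, 2, 5))
df2 <- data.frame(id = c(1, 2, 6),
                  a = c(NA, 0, NA),
                  b = c(NA, 9, 9),
                  c = c(0, NA, NA))

# for each row of df1, the row of df2 with the same id
idx <- match(df1$id, df2$id)
cols <- setdiff(names(df1), "id")

# TRUE wherever the matching df2 cell is NA
mask <- is.na(df2[idx, cols])

# copy df1 and blank out the masked cells
df3 <- df1
df3[cols][mask] <- NA
df3
```

Unlike the text-based approach, the columns stay numeric.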
In a dataframe I want to add a new column next to each column whose name matches a certain pattern, for example whose name starts with "ip_" and is followed by a number. The name of each new column should follow the pattern "newCol_" suffixed by that number again. The values of the new columns should be NA.
So this dataframe:
should be transformed to that dataframe:
A tidyverse solution with use of regex is much appreciated!
Sample data:
df <- data.frame(
ID = c("1", "2"),
ip_1 = c(2,3),
ip_9 = c(5,7),
ip_39 = c(11,13),
in_1 = c("B", "D"),
in_2 = c("A", "H"),
in_3 = c("D", "A")
)
To get the columns is easy with across -
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}'))
# ID ip_1 ip_9 ip_39 in_1 in_2 in_3 newCol_1 newCol_9 newCol_39
#1 1 2 5 11 B A D NA NA NA
#2 2 3 7 13 D H A NA NA NA
To get the columns in required order -
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}')) %>%
select(ID, starts_with('in'),
order(suppressWarnings(readr::parse_number(names(.))))) %>%
select(ID, ip_1:newCol_39, everything())
# ID ip_1 newCol_1 ip_9 newCol_9 ip_39 newCol_39 in_1 in_2 in_3
#1 1 2 NA 5 NA 11 NA B A D
#2 2 3 NA 7 NA 13 NA D H A
To add the new NA columns :
df[, sub("^ip", "newCol", grep("^ip", names(df), value = TRUE))] <- NA
To reorder them :
df <- df[, order(c(grep("newCol", names(df), invert = TRUE), grep("^ip", names(df))))]
edit :
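If it's something you (or whoever stumbles here) plan on doing often, you can use this function :
insertCol <- function(x, ind, col.names = ncol(x) + seq_along(ind), data = NA){
  # x: data frame; ind: indices of the columns after which to insert
  out <- x
  out[, col.names] <- data
  # ties in order() keep their original position, so each new column
  # lands right after the column it was paired with
  out[, order(c(seq_len(ncol(x)), ind))]
}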
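To see why the order() call interleaves the columns, here is a minimal sketch on the sample data that builds the intermediate index vector step by step. The names `ip_idx`, `new_names`, and `pos` are illustrative and not part of the answer above.

```r
df <- data.frame(
  ID = c("1", "2"),
  ip_1 = c(2, 3),
  ip_9 = c(5, 7),
  ip_39 = c(11, 13),
  in_1 = c("B", "D"),
  in_2 = c("A", "H"),
  in_3 = c("D", "A")
)

# positions of the ip_<number> columns
ip_idx <- grep("^ip_[0-9]+$", names(df))
new_names <- sub("^ip", "newCol", names(df)[ip_idx])

# append the NA columns at the end first
out <- df
out[new_names] <- NA

# give every appended column the position of its ip_ partner;
# order() keeps ties in their original order, so the partner comes first
pos <- c(seq_along(df), ip_idx)
out <- out[order(pos)]
names(out)
```

Each appended column shares a position value with its `ip_` partner, so sorting by `pos` places it directly after that partner.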
We are looking to rename columns in a dataframe in R, however the columns may be missing and this throws an error:
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
my_df %>% dplyr::rename(aa = a, bb = b, cc = c)
Error: Can't rename columns that don't exist.
x Column `c` doesn't exist.
our desired output is this, which creates a new column with NA values if the original column does not exist:
> my_df
aa bb c
1 1 4 NA
2 2 5 NA
3 3 6 NA
A possible solution:
library(tidyverse)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
cols <- c(a = NA_real_, b = NA_real_, c = NA_real_)
my_df %>% add_column(!!!cols[!names(cols) %in% names(.)]) %>%
rename(aa = a, bb = b, cc = c)
#> aa bb cc
#> 1 1 4 NA
#> 2 2 5 NA
#> 3 3 6 NA
You can pass a named vector to any_of() so that rename() won't error on missing variables. I'm uncertain of a dplyr way to then create the missing vars but it's easy enough in base R.
library(dplyr)
cols <- c(aa = "a", bb = "b", cc = "c")
my_df %>%
rename(any_of(cols)) %>%
`[<-`(., , setdiff(names(cols), names(.)), NA)
aa bb cc
1 1 4 NA
2 2 5 NA
3 3 6 NA
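For comparison, the same "rename if present, otherwise add an NA column" logic can be sketched without any packages. The helper name `rename_or_add` and its `map` argument (a named vector of the form new name = "old name") are hypothetical, chosen only for this illustration.

```r
# hypothetical helper: map is a named vector, new_name = "old_name"
rename_or_add <- function(df, map) {
  # rename only the old names that actually exist in df
  present <- map[map %in% names(df)]
  names(df)[match(present, names(df))] <- names(present)

  # any new name still absent gets an NA column
  missing_new <- setdiff(names(map), names(df))
  if (length(missing_new) > 0) {
    df[missing_new] <- NA
  }
  df
}

my_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
res <- rename_or_add(my_df, c(aa = "a", bb = "b", cc = "c"))
res
```

Here the missing column is created under its new name (`cc`), matching the output of the answers above.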
Here is a solution using the data.table function setnames. I've added a second "missing" column "d" to demonstrate generality.
library(tidyverse)
library(data.table)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
curr <- names(my_df)
cols <- data.frame(new=c("aa","bb","cc","dd"), old = c("a", "b", "c","d")) %>%
mutate(exist = old %in% curr)
foo <- filter(cols, exist)
bar <- filter(cols, !exist)
setnames(my_df, old = foo$old, new = foo$new)
my_df[, bar$old] <- NA
my_df
#> my_df
# aa bb c d
#1 1 4 NA NA
#2 2 5 NA NA
#3 3 6 NA NA
Let's say I have the data frames with the same column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and want to apply some basic calculation on them, for example add each column.
Since I also want the goal data frame to display missing data, I appended such a column to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here now is to create the data frame
for(i in names(DF2)){
DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind this together) but this obviously doesn't work since the order of the columns is mashed up.
SO,
what's the best way to do this pairwise calculation when the order of the columns is not the same, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
>DF3
a b c
1 6+0 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both the dataframe
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
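The loop from the question can also be made to work. `DF1$i` looks for a column literally named `i`, whereas `DF1[[i]]` uses the value stored in `i`. Here is a minimal sketch, assuming DF2 has already been given its `b` column of NA as described in the question:

```r
DF1 <- data.frame(a = c(0, 1), b = c(2, 3), c = c(4, 5))
DF2 <- data.frame(a = c(6, 7), c = c(8, 9))
DF2$b <- NA  # the appended column from the question

# start from DF1 so the result keeps DF1's column order
DF3 <- DF1
for (i in names(DF1)) {
  # [[i]] looks columns up by name, so the column order in DF2 does not matter
  DF3[[i]] <- DF1[[i]] + DF2[[i]]
}
DF3
```

Assigning into an existing data frame inside the loop also avoids overwriting DF3 on every iteration, which the original attempt did.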
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
. ~ rid,
transform(
reshape(
transform(rbind(
stack(DF1),
stack(DF2)
),
rid = ave(seq_along(ind), ind, FUN = seq_along)
),
direction = "wide",
idvar = "rid",
timevar = "ind"
),
rid = 1:nrow(DF1)
),
sum,
na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14
What is the best function to use if I want to replace certain variables with NA based on a conditional?
If status = NA, then score_1:score_3 will be NA
tried:
if(df2$status == NA){
df2$score_2 <- NA
}else{
df2$score_2 <- df$score_2
}
Thanks in advance
One option is to find the NAs in 'Status' and assign NA to the columns whose names contain 'Score', in base R
i1 <- is.na(df2$Status)
df2[i1, grep("^Score_\\d+$", names(df2))] <- NA
Or an option in dplyr
library(dplyr)
df2 %>%
mutate_at(vars(starts_with('Score')), ~ replace(., is.na(Status), NA))
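The attempt in the question fails for two reasons: `== NA` always yields NA, and `if` needs a single TRUE or FALSE rather than one value per row. Here is a small sketch with made-up data; the lower-case column names follow the question, and the values are invented for illustration.

```r
df2 <- data.frame(status = c(1, NA, 1),
                  score_1 = c(10, 20, 30),
                  score_2 = c(5, 6, 7),
                  score_3 = c(1, 2, 3))

# comparing with == NA gives NA for every row, never TRUE
cmp <- df2$status == NA

# so the if() from the question cannot decide and raises an error
attempt <- tryCatch(
  if (df2$status == NA) "yes" else "no",
  error = function(e) "error"
)

# vectorised alternative: pick the rows with is.na() and blank the score columns
rows <- is.na(df2$status)
score_cols <- grep("^score_[0-9]+$", names(df2))
df2[rows, score_cols] <- NA
df2
```

Only the rows where `status` is missing are touched, so no else branch is needed.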
You can do this by finding out which rows in the data frame are NA and then setting the columns in those rows to NA.
df <- data.frame(client_id = 1:4,
Date = 1:4,
Status = c(1, NA, 1, NA),
Score1 = runif(4)*100,
Score2 = runif(4)*100,
Score3 = runif(4)*100)
idx <- is.na(df$Status)
df[idx, 4:6] <- NA
df
#> client_id Date Status Score1 Score2 Score3
#> 1 1 1 1 48.08677 16.62185 91.80062
#> 2 2 2 NA NA NA NA
#> 3 3 3 1 14.04552 64.55724 56.45998
#> 4 4 4 NA NA NA NA
I have a dataframe with several numeric variables along with factors. I wish to go over the numeric variables and replace the negative values with missing values. I couldn't do that.
My alternative idea was to write a function that gets a dataframe and a variable, and does it. It didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
what am I doing wrong ?
Thank you.
Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, ~ replace(., . < 0, NA)) # funs() is deprecated; use a formula
If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
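As for why the function in the question fails: `df$var` looks for a column literally called `var`, and the function never returns the modified data frame. Here is a base R sketch of a fix, assuming the column name is passed as a string (the name `negative_to_missing` is illustrative):

```r
negative_to_missing <- function(df, var) {
  # [[var]] uses the value of var as the column name
  df[[var]][df[[var]] < 0] <- NA
  # return the modified copy; the caller's data frame is not changed in place
  df
}

data <- data.frame(val1 = c(1, -1, 2),
                   val2 = 1:3)
res <- negative_to_missing(data, "val1")
res
```

The result has to be assigned back (for example `data <- negative_to_missing(data, "val1")`) for the change to persist.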