One common way to fill in NA values in a data frame, loan, is as follows:
for (i in 1: ncol(loan))
{
if (is.character(loan[,i]))
{
loan[is.na(loan[ ,i]), i] <- "missing"
}
if (is.numeric(loan[,i]))
{
loan[is.na(loan[ ,i]), i] <- 9999
}
}
But if the loan dataset is a tibble, the above method does not work, because is.character(loan[,i]) and is.numeric(loan[,i]) are both always FALSE. The class of loan is shown below:
> class(loan)
[1] "tbl_df" "tbl" "data.frame"
To use the above for-loop to fill in missing values, I first have to convert 'loan' to a data frame with as.data.frame().
Is it possible to directly manipulate a tibble without first converting it to a data.frame to fill in missing values?
We can use the tidyverse syntax to do this
library(tidyverse)
loan %>%
# funs() is deprecated in current dplyr; a formula lambda does the same job
mutate_if(is.character, ~ replace(., is.na(.), "missing")) %>%
mutate_if(is.numeric, ~ replace(., is.na(.), 9999))
# A tibble: 20 × 3
# Col1 Col2 Col3
# <chr> <dbl> <chr>
#1 a 9999 A
#2 a 2 A
#3 d 3 A
#4 c 9999 missing
#5 c 1 missing
#6 e 3 missing
#7 a 9999 A
#8 d 2 A
#9 d 3 A
#10 a 9999 A
#11 c 1 A
#12 b 1 C
#13 d 1 A
#14 d 9999 B
#15 a 4 B
#16 e 1 C
#17 a 3 A
#18 missing 3 A
#19 c 3 missing
#20 missing 4 missing
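In dplyr 1.0.0 and later, the scoped verbs such as mutate_if() are superseded by across(). The following is a minimal sketch of the same replacement, assuming a recent dplyr is installed. The small tibble loan_demo is made up for illustration and is not the loan data from the question.

```r
library(dplyr)

# Hypothetical example data (an assumption, not the original loan dataset)
loan_demo <- tibble::tibble(
  Col1 = c("a", NA, "c"),
  Col2 = c(1, NA, 3)
)

filled <- loan_demo %>%
  mutate(
    # character columns: NA becomes "missing"
    across(where(is.character), ~ replace(.x, is.na(.x), "missing")),
    # numeric columns: NA becomes 9999
    across(where(is.numeric), ~ replace(.x, is.na(.x), 9999))
  )

filled
```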
Because the dataset is a tibble, extracting a single column with [ returns a one-column tibble rather than a vector. To get the column as a vector, use [[ instead:
for (i in seq_len(ncol(loan))) {
  if (is.character(loan[[i]])) {
    loan[is.na(loan[[i]]), i] <- "missing"
  } else if (is.numeric(loan[[i]])) {
    loan[is.na(loan[[i]]), i] <- 9999
  }
}
To understand the problem, we just need to look at the output of the extraction
head(is.na(loan[,1]))
# Col1
#[1,] FALSE
#[2,] FALSE
#[3,] FALSE
#[4,] FALSE
#[5,] FALSE
#[6,] FALSE
head(is.na(loan[[1]]))
#[1] FALSE FALSE FALSE FALSE FALSE FALSE
In the for loop, the row index in the first case is a logical matrix with one column, whereas in the second case it is a logical vector, and that is what makes the difference.
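The difference between the two extraction operators can be checked directly. This is a small sketch on a made-up one-column tibble (tb and df are illustrative names, not objects from the question), assuming the tibble package is installed.

```r
library(tibble)

tb <- tibble(Col1 = c("a", NA, "c"))   # hypothetical one-column tibble
df <- as.data.frame(tb)                # the same data as a plain data frame

is.character(df[, 1])    # TRUE: a data frame drops to a vector
is.character(tb[, 1])    # FALSE: a tibble stays a one-column tibble
is.character(tb[[1]])    # TRUE: [[ extracts the column vector

dim(is.na(tb[, 1]))      # 3 1, a logical matrix
length(is.na(tb[[1]]))   # 3, a plain logical vector
```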
data
set.seed(24)
loan <- as_tibble(data.frame(Col1 = sample(c(NA, letters[1:5]), 20,
replace = TRUE), Col2 = sample(c(NA, 1:4), 20, replace = TRUE),
Col3 = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
stringsAsFactors=FALSE))
I'm familiar with %in% generally, and I'm looking for a base R solution, if one exists.
Suppose I want to know whether a particular combination of values from multiple fields in a data frame exists in another data frame. As a work-around, sometimes I concatenate all these values into a single field and match on the custom concatenation, but I'm wondering if there's a way to pass the value combinations to %in% directly.
I'm imagining syntax similar to deduplicating on unique combinations of values across multiple columns, whose syntax works like this, by way of a generic example:
df[!duplicated(df[,c("col1","col2","col3")]),]
I was sort of expecting something like this to work, but I see why it doesn't:
df1[df1[,c("col1","col2")] %in% df2[,c("col1","col2")],]
... above, I'm attempting to ask which value pairs in df1 also exist as value pairs in df2.
You can use mapply to create a logical matrix of matches and then use it to subset df1.
Test data.
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4),
col2 = sample(1:10, 4), col3 = 11:14)
Here I start by putting the column names in a vector; it simplifies the code.
cols <- c("col1", "col2")
(i <- mapply(\(x, y) x %in% y, df1[cols], df2[cols]))
# col1 col2
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] TRUE TRUE
# [5,] FALSE FALSE
# [6,] TRUE TRUE
# [7,] TRUE TRUE
# [8,] FALSE FALSE
# [9,] FALSE TRUE
#[10,] FALSE FALSE
Now subset. The question is not very clear on which of the following is asked for.
# at least one column match
j <- rowSums(i) > 0L
df1[j, ]
# col1 col2 col3
#3 c 3 13
#4 d 4 14
#6 f 6 16
#7 g 7 17
#9 i 9 19
# all columns match
k <- rowSums(i) == length(cols)
df1[k, ]
# col1 col2 col3
#4 d 4 14
#6 f 6 16
#7 g 7 17
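Note that the column-wise test checks each column independently, so a row can pass "all columns match" even when its exact pair of values never occurs together in df2. If the goal is to match value pairs, one base R option is to build a key per row and compare keys. The sketch below uses made-up data and a hypothetical helper row_key(); the "\r" separator is an assumption that this character does not occur in the data.

```r
# Hypothetical data: every value of df1 occurs in df2, but only one pair does
df1 <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))
df2 <- data.frame(col1 = c("a", "b", "c"), col2 = c(2, 1, 3))
cols <- c("col1", "col2")

# Column-wise test: each column is checked on its own
colwise <- rowSums(mapply(`%in%`, df1[cols], df2[cols])) == length(cols)

# Row-wise test: paste the selected columns into one key per row
row_key <- function(d) do.call(paste, c(unname(as.list(d[cols])), sep = "\r"))
pairwise <- row_key(df1) %in% row_key(df2)

colwise         # TRUE TRUE TRUE
pairwise        # FALSE FALSE TRUE
df1[pairwise, ] # only the row (c, 3)
```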
I think a merge() on the two columns of interest gets you what you need. You can then subset the merged output to just the columns from the original data.frame. This returns only the rows of your query data.frame where col1 and col2 match their counterparts in the reference data.frame. Please clarify if that's NOT your goal.
# simulate two DFs with some common values in col1 and col2
x <- data.frame(col1 = LETTERS[1:5],
col2 = 1:5,
col3 = runif(5))
y <- data.frame(col1 = LETTERS[4:8],
col2 = 4:8,
col3 = runif(5))
x
#> col1 col2 col3
#> 1 A 1 0.4306611
#> 2 B 2 0.7149893
#> 3 C 3 0.2808990
#> 4 D 4 0.4383580
#> 5 E 5 0.1372991
y
#> col1 col2 col3
#> 1 D 4 0.40191250
#> 2 E 5 0.94833538
#> 3 F 6 0.85608320
#> 4 G 7 0.05758958
#> 5 H 8 0.29011770
# merge without adding .x suffix to col3 from x
# then subset to only keep columns from x
merge(x, y,
by = c("col1", "col2"),
suffixes = c("", ".drop"))[,1:ncol(x)]
#> col1 col2 col3
#> 1 D 4 0.4383580
#> 2 E 5 0.1372991
Created on 2022-01-08 by the reprex package (v2.0.1)
In the example below, I add a new column "equal.to.master" indicating whether any of the columns whose names start with "col" have the same value as "master".
library(dplyr)
df <- data.frame(
master = c(2,4,5,1,5),
col.1 = 1:5,
col.2 = 5:1,
col.3 = c(NA, 4, 4, 4, 4),
irrelevant = 2:-2
)
df = mutate(df, equal.to.master = col.1 == master | col.2 == master | col.3 == master)
df
master col.1 col.2 col.3 irrelevant equal.to.master
1 2 1 5 NA 2 NA
2 4 2 4 4 1 TRUE
3 5 3 3 4 0 FALSE
4 1 4 2 4 -1 FALSE
5 5 5 1 4 -2 TRUE
Two questions:
1) How do I write this concisely without all the "|" symbols? There must be some "any"-like command I can use in conjunction with "starts_with" but I can't seem to format it correctly. Note that I can't simply grab all the columns because I want to ignore the one named "irrelevant."
2) How do I fix the code so that NA's are ignored?
Here's a way using apply() -
df$equal.to.master <- apply(df, 1, function(x) {
  x[1] %in% x[2:4]  # columns 2 to 4 are col.1, col.2 and col.3
})
df
master col.1 col.2 col.3 irrelevant equal.to.master
1 2 1 5 NA 2 FALSE
2 4 2 4 4 1 TRUE
3 5 3 3 4 0 FALSE
4 1 4 2 4 -1 FALSE
5 5 5 1 4 -2 TRUE
We can use a vectorized approach with rowSums. Create a logical index for the column names that start with "col" ('nm1'), subset the dataset, compare it with the 'master' column using ==, take the rowSums and check whether the result is greater than 0.
nm1 <- startsWith(names(df), "col")
df$equal.to.master <- rowSums(df[nm1] == df$master, na.rm = TRUE) > 0
df$equal.to.master
#[1] FALSE TRUE FALSE FALSE TRUE
Also, if a row containing any NA should return NA, drop na.rm = TRUE (the default is FALSE):
rowSums(df[nm1] == df$master, na.rm = FALSE) > 0
#[1] NA TRUE FALSE FALSE TRUE
Or another option is Reduce
Reduce(`|`, lapply(df[nm1], `==`, df$master))
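Since the question asks for a dplyr solution built on starts_with, here is a hedged sketch using if_any(), which I believe is available from dplyr 1.0.4 onward. The !is.na(.x) guard is one way to treat NA as "no match"; the data frame is the one from the question.

```r
library(dplyr)

df <- data.frame(
  master = c(2, 4, 5, 1, 5),
  col.1 = 1:5,
  col.2 = 5:1,
  col.3 = c(NA, 4, 4, 4, 4),
  irrelevant = 2:-2
)

# TRUE when any column whose name starts with "col" equals master;
# NA values count as a non-match because of the !is.na(.x) guard
df <- mutate(
  df,
  equal.to.master = if_any(starts_with("col"), ~ !is.na(.x) & .x == master)
)

df$equal.to.master
```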
I want to regroup a variable into a new one.
If the value is 0, the new one should be 0 too.
If the value is 999, make it missing (NA).
Everything else should be 1.
This is my try:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
df$variable2 <-
if (df$variable == 0) {
df$variable2 = 0
} else if (df$variable == 999){
df$variable2 = NA
} else {
df$variable2 = 1
}
And this is the error message:
In if (df$variable == 0) { : the condition has length > 1 and only
the first element will be used
A pretty basic question but I'm a basic user. Thanks in advance!
Try ifelse
df$variable2 <- ifelse(df$variable == 999, NA, ifelse(df$variable > 0, 1, 0))
df
# id variable variable2
#1 1 0 0
#2 2 0 0
#3 3 0 0
#4 4 1 1
#5 5 2 1
#6 6 3 1
#7 7 4 1
#8 8 5 1
#9 9 999 NA
#10 10 999 NA
When you do df$variable == 0 the output / condition is
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
where it should be a length-one logical vector that is not NA in if(condition), see ?"if".
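The contrast between if and ifelse can be seen on a short vector. This is a small sketch with a made-up vector x. Note that from R 4.2.0 onward a condition of length greater than one is an error rather than a warning, so the tryCatch below handles both cases.

```r
x <- c(0, 2, 999)

# `if` needs a single TRUE/FALSE; whether this signals a warning or an
# error depends on the R version
outcome <- tryCatch(
  if (x == 0) "zero" else "other",
  warning = function(w) "warning",
  error = function(e) "error"
)

# ifelse() evaluates the condition element by element
recoded <- ifelse(x == 999, NA, ifelse(x == 0, 0, 1))
recoded   # 0 1 NA
```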
You can avoid ifelse, for example, like so
df$variable2 <- df$variable
df$variable2[df$variable2 == 999] <- NA
df$variable2[df$variable2 > 0] <- 1
It might be easier to avoid the if/else statement altogether by using logical conditions within subset notation. Note that this overwrites df$variable in place, so the order of the steps matters:
when df$variable is equal to zero, leave it as zero (this step changes nothing and is shown only for completeness)
df$variable[df$variable==0] <- 0
when df$variable is equal to 999, change it to NA
df$variable[df$variable==999] <- NA
when df$variable is greater than 0 and is not NA, change it to 1
df$variable[df$variable>0 & !is.na(df$variable)] <- 1
Looks like you want to recode your variable. You can do this (and other data/variable transformations) with the sjmisc-package, in your case with the rec()-command:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
library(sjmisc)
rec(df, variable, rec = c("0=0;999=NA;else=1"))
#> id variable variable_r
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 0 0
#> 4 4 1 1
#> 5 5 2 1
#> 6 6 3 1
#> 7 7 4 1
#> 8 8 5 1
#> 9 9 999 NA
#> 10 10 999 NA
# or a single vector as input
rec(df$variable, rec = c("0=0;999=NA;else=1"))
#> [1] 0 0 0 1 1 1 1 1 NA NA
There are many examples, also in the help file, and you can find an sjmisc cheatsheet in the RStudio cheatsheet collection.
df$variable2 <- sapply(df$variable,
function(el) if (el == 0) {0} else if (el == 999) {NA} else {1})
This one-liner reflects your:
If value is 0, new one should be 0 too. If value is 999, then make it
missing, NA. Everything else 1
Well, it is slightly slower than @markus's second solution or @SPJ's solution, which are the most idiomatic R solutions.
Why one should keep one's hands off ifelse
tt <- c(TRUE, FALSE, TRUE, FALSE)
a <- c("a", "b", "c", "d")
b <- 1:4
ifelse(tt, a, b) ## [1] "a" "2" "c" "4"
# totally perfect and as expected!
df <- data.frame(a=a, b=b, c=tt)
df$d <- ifelse(df$c, df$a, df$b)
## > df
## a b c d
## 1 a 1 TRUE 1
## 2 b 2 FALSE 2
## 3 c 3 TRUE 3
## 4 d 4 FALSE 4
######### This is wrong!! ##########################
## df$d is not [1] "a" "2" "c" "4"
## the problem is that
## ifelse(df$c, df$a, df$b)
## returns for each TRUE or FALSE the entire
## df$a or df$b instead of treating it like a vector.
## Since the last df$c is FALSE, df$b is returned
## Thus we get df$b for df$d.
## Quite an unintuitive behaviour.
##
## If one uses purely vectors, ifelse is fine.
## But actually df$c, df$a, df$b should be treated each like a vector.
## However, `ifelse` does not.
## No warnings that using `ifelse` with them will lead to a
## totally different behaviour.
## In my view, this is a design mistake of `ifelse`.
## Thus I decided myself to abandon `ifelse` from my set of R commands.
## To avoid that such kind of mistakes can ever happen.
#####################################################
As @Parfait correctly pointed out, this was a misinterpretation.
The problem was that df$a was stored in the data frame as a factor.
df <- data.frame(a=a, b=b, c=tt, stringsAsFactors = FALSE)
df$d <- ifelse(df$c, df$a, df$b)
df
Gives the correct result.
a b c d
1 a 1 TRUE a
2 b 2 FALSE 2
3 c 3 TRUE c
4 d 4 FALSE 4
Thank you @Parfait for pointing that out!
Strange that I didn't notice that in my initial trials.
But yeah, you are absolutely right!
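The factor effect can be reproduced without a data frame. In this small sketch (a_factor is an illustrative name), ifelse picks up the integer codes of the factor rather than its labels. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the original surprise only shows up in older versions or when factors are created explicitly.

```r
tt <- c(TRUE, FALSE, TRUE, FALSE)
a_factor <- factor(c("a", "b", "c", "d"))
b <- 1:4

# With a factor, ifelse() uses the underlying integer codes
from_factor <- ifelse(tt, a_factor, b)
from_factor      # 1 2 3 4

# Converting to character first gives the expected labels
from_character <- ifelse(tt, as.character(a_factor), b)
from_character   # "a" "2" "c" "4"
```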
I'm dealing with a data frame containing several columns that hold only a single value, only NAs, or a mix of the two. I know how to find columns that are one or the other:
df1 <- data.frame(col1 = 1:10, col2 = 0, col3 = seq(1,20,2))
df1[c(1,4,7),'col2'] <- NA
names(df1)[sapply(df1, function(x) sum(is.na(x)) == length(x))]  # all NA
names(df1)[sapply(df1, function(x) length(unique(x)) == 1)]  # a single value
However, I can't think of a way to catch columns that contain a single value mixed with NAs. In the above case col2 should be caught.
Any suggestions?
First you could check for the existence of NA within a column with:
any(is.na(df1$col2))
Then if you want to know if a column has all values set to zero without taking into consideration the NA values, use simply:
all(df1$col2 == 0, na.rm = TRUE)
Using rowSums as alex2006 suggests has the drawback that a set of numbers whose sum happens to be 0 would also be flagged.
If you're looking for columns where the variance is 0, you could try
colvar0<-apply(df1,2,function(x) var(x,na.rm=T)==0)
colvar0
col1 col2 col3
FALSE TRUE FALSE
to get the column names
names(df1)[colvar0]
Edit: if some columns contain only NA, colvar0 is NA for them; you can retrieve all the column names with
names(df1)[colvar0|is.na(colvar0)]
Maybe the following will do it.
sapply(df1, function(x){
na <- is.na(x)
any(na) && length(unique(x[!na])) == 1
})
# col1 col2 col3
#FALSE TRUE FALSE
inx <- sapply(df1, function(x){
na <- is.na(x)
any(na) && length(unique(x[!na])) == 1
})
df1[which(inx)]
# col2
#1 NA
#2 0
#3 0
#4 NA
#5 0
#6 0
#7 NA
#8 0
#9 0
#10 0
df1[which(!inx)]
# col1 col3
#1 1 1
#2 2 3
#3 3 5
#4 4 7
#5 5 9
#6 6 11
#7 7 13
#8 8 15
#9 9 17
#10 10 19
Note: If you just want the column names, names(df1)[inx] returns the flagged ones.
sapply(df1, function(x) length(unique(sort(x))) %in% 0:1) #sort removes NA
# col1 col2 col3
#FALSE TRUE FALSE
OR
sapply(df1, function(x) length(unique(x[!is.na(x)])) %in% 0:1)
# col1 col2 col3
#FALSE TRUE FALSE
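The checks above lump "all NA" and "one distinct value" together. If the cases need to be told apart, a small helper can label each column. This is only a sketch: classify_column() and the extra all-NA column col4 are hypothetical additions for illustration.

```r
# Hypothetical helper: label a column by its distinct non-NA values
classify_column <- function(x) {
  n_distinct <- length(unique(x[!is.na(x)]))
  if (n_distinct == 0) {
    "all NA"
  } else if (n_distinct == 1 && anyNA(x)) {
    "one value plus NA"
  } else if (n_distinct == 1) {
    "one value"
  } else {
    "varies"
  }
}

df1 <- data.frame(col1 = 1:10, col2 = 0, col3 = seq(1, 20, 2), col4 = NA)
df1[c(1, 4, 7), "col2"] <- NA

labels <- vapply(df1, classify_column, character(1))
labels
names(df1)[labels == "one value plus NA"]   # "col2"
```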
If you want to retrieve the actual rows where this is happening, I suggest the following:
which(is.na(rowSums(df1)) | rowSums(df1)==0)
I created two nested for loops to complete the following:
Iterating through each column that is not the first column:
Iterate through each row i that is NOT the last row (the last row is denoted j)
Compare the value in i to the value in j.
If i is NA, i = NA.
If i >= j, i = 0.
If i < j, i = 1.
Store the results of all iterations across all columns and rows in a df.
The code below creates some test data but produces a Value "out" that is NULL (empty). Any recommendations?
# Create df
a <- rnorm(5)
b <- c(rnorm(3),NA,rnorm(1))
c <- rnorm(5)
df <- data.frame(a,b,c)
rows <- nrow(df) # rows
cols <- ncol(df) # cols
out <- for (c in 2:cols){
for (r in 1:(rows - 1)){
ifelse(
is.na(df[r,c]),
NA,
df[r, c] <- df[r, c] < df[rows, c])
}
}
There's no need for looping at all. Use a vectorised function like sweep to compare, via >, all the other rows (df[-nrow(df),]) against your last row (df[nrow(df),]):
df
# a b c
#1 -0.2739735 0.5095727 0.30664838
#2 0.7613023 -0.1509454 -0.08818313
#3 -0.4781940 1.5760307 0.46769601
#4 1.1754130 NA 0.33394212
#5 0.5448537 1.0493805 -0.10528847
sweep(df[-nrow(df),], 2, unlist(df[nrow(df),]), FUN=`>`)
# a b c
#1 FALSE FALSE TRUE
#2 TRUE FALSE TRUE
#3 FALSE TRUE TRUE
#4 TRUE NA TRUE
sweep(df[-nrow(df),], 2, unlist(df[nrow(df),]), FUN=`>`) + 0
# a b c
#1 0 0 1
#2 1 0 1
#3 0 1 1
#4 1 NA 1
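The answer above codes 1 where a value is greater than the last-row value, while the question asks for 1 where it is smaller. Flipping the operator handles that. The sketch below uses fixed, made-up numbers and converts to a matrix first, which is my own simplification rather than part of the answer.

```r
# Hypothetical data with known values so the result is easy to check
df <- data.frame(a = c(1, 5, 3), b = c(2, NA, 4), c = c(9, 1, 5))

m <- as.matrix(df)
last <- m[nrow(m), ]   # last row: a = 3, b = 4, c = 5

# 1 where the value is smaller than the last-row value, 0 otherwise, NA kept
out <- sweep(m[-nrow(m), , drop = FALSE], 2, last, FUN = `<`) + 0
out
```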
Here is another option. We can replicate the last row to make the dimensions of both datasets equal and then do the > to get a logical index, which can be coerced to binary by wrapping with +.
+(df[-nrow(df),] > df[nrow(df),][col(df[-nrow(df),])])
# a b c
#1 0 0 1
#2 1 0 1
#3 0 1 1
#4 1 NA 1
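As for why out is NULL in the question: a for loop is run for its side effects and always evaluates to NULL, so assigning the loop itself to a variable stores nothing useful. A minimal illustration follows, with vapply shown as one way to collect results (the doubling function is just a placeholder).

```r
# A for loop evaluates to NULL regardless of what happens inside it
res <- for (i in 1:3) i * 2
is.null(res)   # TRUE

# Collect values explicitly instead, for example with vapply
doubled <- vapply(1:3, function(i) i * 2, numeric(1))
doubled        # 2 4 6
```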