I'm familiar with %in% generally, and I'm looking for a base R solution, if one exists.
Suppose I want to know whether a particular combination of values from multiple fields in a data frame exists in another data frame. As a work-around, sometimes I concatenate all these values into a single field and match on the custom concatenation, but I'm wondering if there's a way to pass the value combinations to %in% directly.
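For concreteness, the workaround looks roughly like this (a sketch with generic column names; the separator just needs to be a string that cannot occur in the data):
# build a composite key per row in each data frame, then match the keys
key1 <- paste(df1$col1, df1$col2, sep = "|")
key2 <- paste(df2$col1, df2$col2, sep = "|")
df1[key1 %in% key2, ]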
I'm imagining syntax similar to that used for deduplicating on unique combinations of values across multiple columns, which works like this, by way of a generic example:
df[!duplicated(df[,c("col1","col2","col3")]),]
I was sort of expecting something like this to work, but I see why it doesn't:
df1[df1[,c("col1","col2")] %in% df2[,c("col1","col2")],]
... above, I'm attempting to ask which value pairs in df1 also exist as value pairs in df2.
You can use mapply to create a logical matrix of matches and then use it to subset df1.
Test data.
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4),
col2 = sample(1:10, 4), col3 = 11:14)
Here I start by putting the column names in a vector; it simplifies the code.
cols <- c("col1", "col2")
(i <- mapply(\(x, y) x %in% y, df1[cols], df2[cols]))
# col1 col2
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] TRUE TRUE
# [5,] FALSE FALSE
# [6,] TRUE TRUE
# [7,] TRUE TRUE
# [8,] FALSE FALSE
# [9,] FALSE TRUE
#[10,] FALSE FALSE
Now subset. The question is not very clear on which of the following is asked for.
# at least one column match
j <- rowSums(i) > 0L
df1[j, ]
# col1 col2 col3
#3 c 3 13
#4 d 4 14
#6 f 6 16
#7 g 7 17
#9 i 9 19
# all columns match
k <- rowSums(i) == length(cols)
df1[k, ]
# col1 col2 col3
#4 d 4 14
#6 f 6 16
#7 g 7 17
I think just doing a merge() by the two columns of interest gets you what you need. You can then subset the merged output to just the columns from the original data.frame. This would return only the rows of your query data.frame where col1 and col2 match their cognate values in the reference data.frame. Please clarify if that's NOT your goal.
# simulate two DFs with some common values in col1 and col2
x <- data.frame(col1 = LETTERS[1:5],
col2 = 1:5,
col3 = runif(5))
y <- data.frame(col1 = LETTERS[4:8],
col2 = 4:8,
col3 = runif(5))
x
#> col1 col2 col3
#> 1 A 1 0.4306611
#> 2 B 2 0.7149893
#> 3 C 3 0.2808990
#> 4 D 4 0.4383580
#> 5 E 5 0.1372991
y
#> col1 col2 col3
#> 1 D 4 0.40191250
#> 2 E 5 0.94833538
#> 3 F 6 0.85608320
#> 4 G 7 0.05758958
#> 5 H 8 0.29011770
# merge without adding .x suffix to col3 from x
# then subset to only keep columns from x
merge(x, y,
by = c("col1", "col2"),
suffixes = c("", ".drop"))[,1:ncol(x)]
#> col1 col2 col3
#> 1 D 4 0.4383580
#> 2 E 5 0.1372991
Created on 2022-01-08 by the reprex package (v2.0.1)
I have a dataset of 80k rows and 874 columns. Some of these columns are empty. I use sum(is.na()) in a for loop to find the indices of the empty columns: since the first column is not empty, a column whose sum(is.na()) equals the number of rows of the first column must be empty.
for (i in 1:ncol(loans)){
if (sum(is.na(loans[i])) == nrow(loans[1])){
print(i)
}
}
Now that I know the indices of empty columns, I want to drop them from the data. I thought about storing those indices in an array and dropping them in a loop but I don't think it will work since columns with data will replace the empty columns. How can I drop them?
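For reference, here is a minimal sketch of the "store the indices, then drop them" idea: as long as the collected indices are removed in one step with negative indexing, the remaining columns never shift.
# collect the indices of all-NA columns first
empty_idx <- which(colSums(is.na(loans)) == nrow(loans))
# then drop them all at once; drop = FALSE keeps the result a data frame
if (length(empty_idx) > 0) loans <- loans[, -empty_idx, drop = FALSE]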
You should try to provide a toy dataset for your question.
loans <- data.frame(
a = c(NA, NA, NA),
b = c(1,2,3),
c = c(1,2,3),
d = c(1,2,3),
e = c(NA, NA, NA)
)
loans[!sapply(loans, function(col) all(is.na(col)))]
sapply loops over columns of loans and applies the anonymous function checking if all elements are NA. It then coerces the output to a vector, in this case logical.
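For illustration, with the toy loans above, the intermediate logical vector and the resulting subset look like this:
sapply(loans, function(col) all(is.na(col)))
#     a     b     c     d     e
#  TRUE FALSE FALSE FALSE  TRUE
loans[!sapply(loans, function(col) all(is.na(col)))]
#   b c d
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3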
The tidyverse option:
loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]
Does this work:
df <- data.frame(col1 = rep(NA, 5),
col2 = 1:5,
col3 = rep(NA,5),
col4 = 6:10)
df
col1 col2 col3 col4
1 NA 1 NA 6
2 NA 2 NA 7
3 NA 3 NA 8
4 NA 4 NA 9
5 NA 5 NA 10
df[,which(colSums(df, na.rm = TRUE) == 0)] <- NULL
df
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
Another approach:
df[!apply(df, 2, function(x) all(is.na(x)))]
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
A dplyr solution:
df %>%
select_if(!colSums(., na.rm = TRUE) == 0)
You can solve almost all problems like this with fundamental tools such as if/else and for loops, although a drawback is that they will be slower.
# evaluate each column; if a column meets your condition, remove it
# loop backwards so that deleting a column does not shift the indices
# of the columns still to be checked
for (i in rev(seq_along(loans))) {
  if (sum(is.na(loans[, i])) == nrow(loans)) {
    loans[, i] <- NULL
  }
}
I'm dealing with a data frame containing several columns that are a single value or NA's. I know how to find columns that are one or the other:
df1 <- data.frame(col1 = 1:10, col2 = 0, col3 = seq(1,20,2))
df1[c(1,4,7),'col2'] <- NA
names(df1)[sapply(df1, function(x) sum(is.na(x)) == length(x))]
names(df1)[sapply(df1, function(x) length(unique(x)) == 1)]
However I can't think of a way to catch all NA's or a single value. In the above case col2 should be caught.
Any suggestions?
First you could check for the existence of NA within a column with:
any(is.na(df1$col2))
Then, if you want to know whether a column has all its values set to zero, ignoring the NA values, simply use:
all(df1$col2 == 0, na.rm = TRUE)
Using rowSums as alex2006 suggests has the inconvenience that a set of numbers whose sum happens to be 0 would also be flagged.
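A minimal sketch combining the two checks above across all columns of df1 (still hard-coding zero as the single value, as above):
# TRUE for columns that contain at least one NA and are otherwise all zero
sapply(df1, function(x) any(is.na(x)) && all(x == 0, na.rm = TRUE))
#  col1  col2  col3
# FALSE  TRUE FALSE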
If you're looking for columns where the variance is 0, you could try
colvar0 <- apply(df1, 2, function(x) var(x, na.rm = TRUE) == 0)
colvar0
col1 col2 col3
FALSE TRUE FALSE
to get the column names
names(df1)[colvar0]
Edit: if some columns contain only NA, colvar0 is NA for those columns; you can then retrieve all the relevant column names with
names(df1)[colvar0|is.na(colvar0)]
Maybe the following will do it.
sapply(df1, function(x){
na <- is.na(x)
any(na) && length(unique(x[!na])) == 1
})
# col1 col2 col3
#FALSE TRUE FALSE
inx <- sapply(df1, function(x){
na <- is.na(x)
any(na) && length(unique(x[!na])) == 1
})
df1[which(inx)]
# col2
#1 NA
#2 0
#3 0
#4 NA
#5 0
#6 0
#7 NA
#8 0
#9 0
#10 0
df1[which(!inx)]
# col1 col3
#1 1 1
#2 2 3
#3 3 5
#4 4 7
#5 5 9
#6 6 11
#7 7 13
#8 8 15
#9 9 17
#10 10 19
Note: If you just want the column names, names(df1)[inx] gets the flagged ones (some NAs plus a single repeated value).
sapply(df1, function(x) length(unique(sort(x))) %in% 0:1) #sort removes NA
# col1 col2 col3
#FALSE TRUE FALSE
OR
sapply(df1, function(x) length(unique(x[!is.na(x)])) %in% 0:1)
# col1 col2 col3
#FALSE TRUE FALSE
If you want to retrieve the actual rows where this is happening, I suggest the following:
which(is.na(rowSums(df1)) | rowSums(df1)==0)
One normal way to fill in NA values in a data frame, loan, is as follows:
for (i in 1: ncol(loan))
{
if (is.character(loan[,i]))
{
loan[is.na(loan[ ,i]), i] <- "missing"
}
if (is.numeric(loan[,i]))
{
loan[is.na(loan[ ,i]), i] <- 9999
}
}
But if the loan dataset is a tibble, the above method does not work, because is.character(loan[,i]) is always FALSE and is.numeric(loan[,i]) is always FALSE too. Dataset loan's class is as below:
> class(loan)
[1] "tbl_df" "tbl" "data.frame"
To use the above for-loop for filling in missing values, I first have to convert loan to a data frame with as.data.frame() and then run the loop.
Is it possible to directly manipulate a tibble without first converting it to a data.frame to fill in missing values?
We can use the tidyverse syntax to do this
library(tidyverse)
loan %>%
mutate_if(is.character, funs(replace(., is.na(.), "missing"))) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 9999)))
# A tibble: 20 × 3
# Col1 Col2 Col3
# <chr> <dbl> <chr>
#1 a 9999 A
#2 a 2 A
#3 d 3 A
#4 c 9999 missing
#5 c 1 missing
#6 e 3 missing
#7 a 9999 A
#8 d 2 A
#9 d 3 A
#10 a 9999 A
#11 c 1 A
#12 b 1 C
#13 d 1 A
#14 d 9999 B
#15 a 4 B
#16 e 1 C
#17 a 3 A
#18 missing 3 A
#19 c 3 missing
#20 missing 4 missing
As the dataset is a tibble, extracting with [ does not drop it to a vector; instead we need [[.
for (i in 1:ncol(loan)) {
  if (is.character(loan[[i]])) {
    loan[is.na(loan[[i]]), i] <- "missing"
  }
  if (is.numeric(loan[[i]])) {
    loan[is.na(loan[[i]]), i] <- 9999
  }
}
To understand the problem, we just need to look at the output of the extraction
head(is.na(loan[,1]))
# Col1
#[1,] FALSE
#[2,] FALSE
#[3,] FALSE
#[4,] FALSE
#[5,] FALSE
#[6,] FALSE
head(is.na(loan[[1]]))
#[1] FALSE FALSE FALSE FALSE FALSE FALSE
In the for loop, the row index is a one-column logical matrix in the first case and a plain logical vector in the second, which is what makes the difference.
data
set.seed(24)
loan <- as_tibble(data.frame(Col1 = sample(c(NA, letters[1:5]), 20,
replace = TRUE), Col2 = sample(c(NA, 1:4), 20, replace = TRUE),
Col3 = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
stringsAsFactors=FALSE))
EX Data:
Col1 Col2
1 a
2 b
3 null
4 c
How do I change all elements of Col2 that are "null" to some predefined value? My actual data has about 250,000 rows, so a for loop would take too much time. I was thinking about some kind of apply/ddply and ifelse combination, but I can't seem to get it working.
More specifically, how do I change the following for loop to something more efficient?
for(I in 1:n)
{
if(col2(I) == NULL)
col2(I) = x
else...nothing happens
}
You can use ifelse to change the value from null to, say, XXX
> dat <- read.table(h=T, text = "Col1 Col2
1 a
2 b
3 null
4 c")
> dat
# Col1 Col2
# 1 1 a
# 2 2 b
# 3 3 null
# 4 4 c
> dat$Col2 <- ifelse(dat$Col2 == 'null', 'XXX', dat$Col2)
> dat
# Col1 Col2
# 1 1 1
# 2 2 2
# 3 3 XXX
# 4 4 3
An alternative way that may be easier to understand is
dat[,'Col2'] <- with(dat, ifelse(Col2 == 'null', 'XXX', Col2))
Furthermore, if you're dealing with factors and you want to change the name of a level,
> levels(dat$Col2)
## [1] "a" "b" "c" "null"
> levels(dat$Col2)[4] <- "XXX"
> levels(dat$Col2)
## [1] "a" "b" "c" "XXX"
Why not
Col2[Col2=="null"]<-"XXX"
Note - I don't think you can get a true NULL value in a data.frame like that.
Update for factors
In response to #beginneR,
If Col2 is a factor you can do this to change it:
levels(Col2)<-c(levels(Col2),"XXX")
Col2[Col2=="null"]<-"XXX"
Here is an approach using data.table, which should scale pretty well.
The example below for 1 million rows, takes less than 1/10 of a second on my laptop.
# Load package data.table
library(data.table)
# Set up data
Col1 <- rep(c(1,2,3,4), 250000)
Col2 <- rep(c("a", "b", "null", "c"), 250000)
# Define data as data.table
ex <- data.table(Col1, Col2)
# Substitute value "null" by "x" for variable "Col2"
ex[Col2=="null", Col2:="x"]
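A rough way to check the timing claim on your own machine (numbers will vary):
# rebuild the table first, since the line above already replaced the values
ex <- data.table(Col1, Col2)
system.time(ex[Col2 == "null", Col2 := "x"])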
How about using replace()?
R:> Col1 <- c(1,2,3,4)
R:> Col2 <- c("a", "b", "null", "c")
R:> ex <- data.frame(Col1, Col2)
R:> ex
Col1 Col2
1 1 a
2 2 b
3 3 null
4 4 c
R:> typeof(ex$Col2)
[1] "integer"
R:> ex$Col2 <- as.character(ex$Col2)
R:> typeof(ex$Col2)
[1] "character"
R:> ex$Col2 <- replace(ex$Col2, which(ex$Col2 == "null"), "some")
R:> ex
Col1 Col2
1 1 a
2 2 b
3 3 some
4 4 c
R:>
I have a dataset
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
For every id the values are sorted in ascending order.
I want to reduce dtf to include only the first row, for each id, whose value exceeds a specified limit. Only one row per id, and it should be the one where the value first exceeds the limit.
For this example, with a limit of 5, dtf should reduce to:
A 6
B 6
Is there a nice way to do this?
Thanks a lot
It could be done with aggregate:
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
aggregate(value ~ id, dtf, function(x) x[x > limit][1])
The result:
id value
1 A 6
2 B 6
Update: A solution for multiple columns:
An example data frame, dtf2:
dtf2 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
value=c(2,4,6,8,4,6,8,10),
col3 = letters[1:8],
col4 = 1:8)
A solution including ave:
with(dtf2, dtf2[ave(value, id, FUN = function(x) cumsum(x > limit)) == 1, ])
The result:
id value col3 col4
3 A 6 c 3
6 B 6 f 6
Here is a "nice" option using data.table:
library(data.table)
DT <- data.table(dtf, key = "id")
DT[value > 5, head(.SD, 1), by = key(DT)]
# id value
# 1: A 6
# 2: B 6
And, in the spirit of sharing, an option using sqldf which might be nice depending on whether you feel more comfortable with SQL.
sqldf("select id, min(value) as value from dtf where value > 5 group by id")
# id value
# 1 A 6
# 2 B 6
Update: Unordered source data, and a data.frame with multiple columns
Based on your comments on some of the answers, it seems that your "value" column might not be ordered as it is in your example, and that other columns may be present in your data.frame.
Here are two alternatives for those scenarios, one with data.table, which I find easiest to read and is most likely the fastest, and one with a typical "split-apply-combine" approach that is commonly needed for such tasks.
First, some sample data:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
value = c(6,4,2,8,4,10,8,6),
col3 = letters[1:8],
col4 = 1:8)
dtf2 # Notice that the value column is not ordered
# id value col3 col4
# 1 A 6 a 1
# 2 A 4 b 2
# 3 A 2 c 3
# 4 A 8 d 4
# 5 B 4 e 5
# 6 B 10 f 6
# 7 B 8 g 7
# 8 B 6 h 8
Second, the data.table approach:
library(data.table)
DT <- data.table(dtf2)
DT # Verify that the data are not ordered
# id value col3 col4
# 1: A 6 a 1
# 2: A 4 b 2
# 3: A 2 c 3
# 4: A 8 d 4
# 5: B 4 e 5
# 6: B 10 f 6
# 7: B 8 g 7
# 8: B 6 h 8
DT[order(value)][value > 5, head(.SD, 1), by = "id"]
# id value col3 col4
# 1: A 6 a 1
# 2: B 6 h 8
Third, base R's common "split-apply-combine" approach:
do.call(rbind,
lapply(split(dtf2, dtf2$id),
function(x) x[x$value > 5, ][which.min(x$value[x$value > 5]), ]))
# id value col3 col4
# A A 6 a 1
# B B 6 h 8
Another approach with aggregate:
> aggregate(value~id, dtf[dtf[,'value'] > 5,], min)
id value
1 A 6
2 B 6
This does depend on the values being sorted in ascending order: min returns the smallest qualifying value, which is the first one to exceed the limit only when the values ascend.
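To see the dependence, here is a small sketch with unsorted values (hypothetical data, not from the question): for id "B" the first value to exceed 5 is 10, but min returns 6, the smallest qualifying value.
dtf_unsorted <- data.frame(id = c("B", "B", "B"), value = c(10, 8, 6))
aggregate(value ~ id, dtf_unsorted[dtf_unsorted$value > 5, ], min)
#   id value
# 1  B     6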
Might as well add an alternative with plyr and head():
library(plyr)
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
result <- ddply(dtf, "id", function(x) head(x[x$value > limit ,],1) )
> result
id value
1 A 6
2 B 6
This depends on your data.frame being sorted:
threshold <- 5
foo <- dtf[dtf$value>=threshold,]
foo[c(1, which(diff(as.numeric(as.factor(foo$id))) > 0) + 1), ]