Searching a list across a dataframe in R?

Currently, I have a data frame built in R that looks like this:
df <- data.frame(c('ABC','DEF','HIJ'),
c(1,2,5),
c(2,5,9),
c(14,19,12))
And I have an expression which searches for one value across the entire data frame and returns the entire row for each match:
df[which(df == 5,
arr.ind = TRUE)[,"row"],]
This returns the following when executed:
HIJ 5 9 12
DEF 2 5 19
I would like to be able to enter a list of values as a vector and then filter on all of them in one shot, using a loop to return the rows that have a match. However, I have been totally lost trying to combine a loop with my search expression above. Below is an example of what I am trying to achieve: searching for the values of vector v across data frame df should return all rows of df that contain any value of v in any column:
v <- c(1,2,13,19,16,120,2934,1087)
Searching these values across the data frame, I would like to return the rows that contain a match:
ABC 1 2 14
DEF 2 5 19
What would be the best way to write a loop to do this search?

We can use:
df[rowSums(sapply(df, `%in%`, v)) > 0, ]
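If you specifically want a loop, the same idea can be written column by column (a minimal sketch of the vectorised approach above):
# flag rows where any column contains a value from v
keep <- rep(FALSE, nrow(df))
for (col in df) keep <- keep | col %in% v
df[keep, ]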
Or using dplyr:
library(dplyr)
df %>% filter_all(any_vars(. %in% v))
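Note that filter_all()/any_vars() have since been superseded; in recent dplyr (>= 1.0.4, which introduced if_any()) the same filter can be sketched as:
df %>% filter(if_any(everything(), ~ .x %in% v))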

It may be easier to reshape your data first. I'll use data.table::melt:
library(data.table)
df = data.frame(
V1 = c("ABC", "DEF", "HIJ"),
V2 = c(1, 2, 5),
V3 = c(2, 5, 9),
V4 = c(14, 19, 12)
)
setDT(df)
# reshape long
melt_df = melt(df, id.vars = 'V1')
melt_df
# V1 variable value
# 1: ABC V2 1
# 2: DEF V2 2
# 3: HIJ V2 5
# 4: ABC V3 2
# 5: DEF V3 5
# 6: HIJ V3 9
# 7: ABC V4 14
# 8: DEF V4 19
# 9: HIJ V4 12
Now we can look it all up at once:
melt_df[value %in% v]
# V1 variable value
# 1: ABC V2 1
# 2: DEF V2 2
# 3: ABC V3 2
# 4: DEF V4 19
That's the gist of it. To get back your original desired output, we need to do some other steps:
df[.(V1 = melt_df[value %in% v, unique(V1)]), on = 'V1']
# V1 V2 V3 V4
# 1: ABC 1 2 14
# 2: DEF 2 5 19
This pulls the matching values of V1 from melt_df (unique() removes duplicates) and joins them back to df (hence on = 'V1') to get the associated rows of df.
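For comparison, the same reshape-then-lookup idea can be sketched in tidyverse form, with pivot_longer() playing the role of melt() (assuming tidyr >= 1.0):
library(tidyr)
library(dplyr)
# reshape long, keep rows whose value matches, then semi-join back on V1
long_df <- pivot_longer(df, cols = -V1, names_to = 'variable')
df %>% semi_join(filter(long_df, value %in% v), by = 'V1')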

Related

Checking conditions and assigning values by row in R

I have a dataset that has one row per subject, and there is a variable for which I want to reassign values based on a condition. For example, if the value of the variable is 6, I want to change the value to the mean of the other variables in the dataset.
Subject V1 V2 V3 V4
123 2 2 2 3
234 1 5 4 4
345 1 4 3 6
In the above dataset, for each patient, I want to reassign all 6's for V4 with the mean of that patient's V1, V2, V3. Thus, for subject 345, V4 would take on the new value 8/3 or ((1+4+3)/3). I was thinking of using an ifelse statement, but I haven't been able to get it to work. Any help would be greatly appreciated.
Given:
library(dplyr)
library(tibble)
data <- tibble(
Subject = c("123", "234", "345"),
V1 = c(2, 1, 1),
V2 = c(2, 5, 4),
V3 = c(2, 4, 3),
V4 = c(3, 4, 6)
)
You could do this using base R:
data$V4 <- ifelse(data$V4 == 6,(data$V1 + data$V2 + data$V3)/3, data$V4)
Or using a dplyr chain:
data <- data %>%
mutate(V4 = ifelse(V4 == 6,(V1 + V2 + V3)/3, V4))
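If there are many V columns, a dplyr (>= 1.0) sketch avoids listing them all, using rowwise() and c_across(); the V1:V3 column range is an assumption about the layout:
data <- data %>%
  rowwise() %>%
  # take the row mean of V1..V3 wherever V4 is 6
  mutate(V4 = ifelse(V4 == 6, mean(c_across(V1:V3)), V4)) %>%
  ungroup()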
Turn the V4 values of 6 into NA and replace them with rowMeans (here df holds the question's data as a base data.frame):
df$V4[df$V4 == 6] <- NA
df$V4 <- ifelse(is.na(df$V4), rowMeans(df[-1], na.rm = TRUE), df$V4)
df
# Subject V1 V2 V3 V4
#1 123 2 2 2 3.00
#2 234 1 5 4 4.00
#3 345 1 4 3 2.67
You can use either of the formulas below (assuming d contains only the numeric columns V1-V4, without Subject):
d[, 4] <- ifelse(d[, 4] == 6, (d[, 1] + d[, 2] + d[, 3])/3, d[, 4])
d[, 4] <- ifelse(d[, 4] == 6, rowMeans(d[, 1:3]), d[, 4])

In R subtract a vector from each row of a dataframe

I'm searching for a better, more efficient way to subtract a vector from each row of a dataframe (df1). My current solution repeats the vector (Vec) to create a dataframe (Vec_df1) with the same number of rows as df1 and then subtracts the two dataframes. I wonder if there is a more direct way to do this without creating the intermediate Vec_df1 dataframe (preferably in tidyverse). See example data below.
#Example data
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)
df1 <- tibble(V1, V2, V3)
Vec <- c(1, 1, 2)
# Current solution, creates a dataframe with the same nrows by repeating the vector.
Vec_df1 <- tibble::as_tibble(t(Vec)) %>%
dplyr::slice(rep(dplyr::row_number(), nrow(df1)))
# Subtraction.
df2 <- df1-Vec_df1
df2
Thanks in advance
We can use sweep():
sweep(df1, 2, Vec, `-`)
# `-` is default FUN in sweep so you can also use
#sweep(df1, 2, Vec)
# V1 V2 V3
#1 0 3 5
#2 1 4 6
#3 2 5 7
Or an approach similar to yours:
df1 - rep(Vec, each = nrow(df1))
A similar approach using map2_df():
library(purrr)
map2_df(df1, Vec, `-`)
# A tibble: 3 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 0 3 5
2 1 4 6
3 2 5 7
A fast base R way, using a double transpose:
as_tibble(t(t(df1) - Vec))
# A tibble: 3 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 0 3 5
2 1 4 6
3 2 5 7
We can also do:
df1 - Vec[col(df1)]
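The relative speed of these options is easy to check; a benchmark sketch (assuming the microbenchmark package is installed; timings will vary with the dimensions of df1):
library(tibble)
library(microbenchmark)
microbenchmark(
  sweep     = sweep(df1, 2, Vec),
  rep       = df1 - rep(Vec, each = nrow(df1)),
  transpose = as_tibble(t(t(df1) - Vec)),
  col_index = df1 - Vec[col(df1)]
)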

Customise the aggregate function inside dcast based on the max value of a column in data.table?

I've got a data.table that I'd like to dcast based on three columns (V1, V2, V3). There are, however, duplicate V1/V2 combinations, so I need an aggregate function that looks at a fourth column, V4, and picks the value of V3 corresponding to the maximum of V4. I'd like to do this without aggregating DT separately prior to dcasting. Can this aggregation be done inside dcast's fun.aggregate, or do I need to aggregate the table first?
Here is my data.table DT:
> DT <- data.table(V1 = c('a','a','a','b','b','c')
, V2 = c(1,2,1,1,2,1)
, V3 = c('st', 'cc', 'B', 'st','st','cc')
, V4 = c(0,0,1,0,1,1))
> DT
V1 V2 V3 V4
1: a 1 st 0
2: a 2 cc 0
3: a 1 B 1 ## --> i want this row to be picked in dcast when V1 = a and V2 = 1 because V4 is largest
4: b 1 st 0
5: b 2 st 1
6: c 1 cc 1
and the dcast function could look something like this:
> dcast(DT
, V1 ~ V2
, value.var = "V3"
#, fun.aggregate = V3[which.max(V4)] ## ?!?!?!??!
)
My desired output is:
> desired
V1 1 2
1: a B cc
2: b st st
3: c cc <NA>
Please note that aggregating DT before dcasting to get rid of the duplicates will solve the issue. I'm just wondering if dcasting can be done with the duplicates.
Here is one option where you take the relevant subset before dcasting:
DT[order(V4, decreasing = TRUE)
][, dcast(unique(.SD, by = c("V1", "V2")), V1 ~ V2, value.var = "V3")]
# V1 1 2
# 1: a B cc
# 2: b st st
# 3: c cc <NA>
Alternatively, order first and use a custom aggregate function in dcast():
dcast(
DT[order(V4, decreasing = TRUE)],
V1 ~ V2,
value.var = "V3",
fun.aggregate = function(x) x[1]
)
A dplyr/tidyr option would be to group by V1 and V2, select the row with the maximum V4 in each group, and then spread to wide format.
library(dplyr)
library(tidyr)
DT %>%
group_by(V1, V2) %>%
slice(which.max(V4)) %>%
select(-V4) %>%
spread(V2, V3)
# V1 `1` `2`
# <chr> <chr> <chr>
#1 a B cc
#2 b st st
#3 c cc NA
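spread() has since been superseded; with dplyr >= 1.0 and tidyr >= 1.0 the same pipeline can be sketched with slice_max() and pivot_wider():
DT %>%
  group_by(V1, V2) %>%
  slice_max(V4, n = 1, with_ties = FALSE) %>%
  select(-V4) %>%
  pivot_wider(names_from = V2, values_from = V3)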

Remove filter based on table with NA

I'm assigning rows of data to several different groups. The main issue is that there are many groups, and not every group uses the same set of fields. I would like to set up a reference table that I could loop over or push through a function, but I don't know how to drop the unneeded fields from each group's filter.
Below is sample code, I've included a version of my current solution as well as an example table.
library(data.table)
set.seed(1)
n <- 1000
#Sample Data
ExampleData <- data.table(sample(1:3,n,replace = TRUE),
sample(10:12,n,replace = TRUE),
sample(letters[1:3],n,replace = TRUE),
sample(LETTERS[1:3],n,replace = TRUE))
#Current solution
ExampleData[V1 == 1 & V2 == 11 & V4 == "C", Group := "Group1"]
ExampleData[V1 == 2, Group := "Group2"]
ExampleData[V1 == 3 & V3 == "a" & V4 == "B", Group := "Group3"]
#Example reference table
ExampleRefTable <- data.table(Group = c("Group1","Group2","Group3"),
V1 = c(1,2,3),
V2 = c(11,NA,NA),
V3 = c(NA,NA,"a"),
V4 = c("C",NA,"B"))
(Thanks to @eddi:) You could iterate over rows/groups in the ref table with by=:
ExampleRefTable[,
ExampleData[copy(.SD), on = names(.SD)[!is.na(.SD)], grp := .BY$Group]
, by = Group]
For each Group, we are using .SD (the rest of the Subset of the ref table Data) for an update join, ignoring columns of .SD that are NA. .BY contains the per-group values of by=.
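To convince yourself that the join agrees with the manual assignment above, a quick check sketch (the two columns should agree on every row, with NA where no rule applied):
# TRUE if Group and grp match everywhere, counting NA == NA as a match
ExampleData[, all((Group == grp) | (is.na(Group) & is.na(grp)))]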
(My original answer:) You could split up the ref table into subsets with non-NA values:
ExampleRefTable[, gNA := .GRP, by=ExampleRefTable[, !"Group"]]
RefTabs = lapply(
split(ExampleRefTable, by="gNA", keep.by = FALSE),
FUN = Filter, f = function(x) !anyNA(x)
)
which looks like
$`1`
Group V1 V2 V4
1: Group1 1 11 C
$`2`
Group V1
1: Group2 2
$`3`
Group V1 V3 V4
1: Group3 3 a B
Then iterate over these tables with update joins:
ExampleData[, Group := NA_character_]
for (i in seq_along(RefTabs)){
RTi = RefTabs[[i]]
nmi = setdiff(names(RTi), "Group")
ExampleData[is.na(Group), Group :=
RTi[copy(.SD), on=names(.SD), x.Group]
, .SDcols=nmi][]
}
rm(RTi, nmi)
By filtering on is.na(Group), I'm assuming that the rules in the ref table are mutually exclusive.
The copy on .SD is needed due to an open issue.
This might be more efficient than @eddi's way (at the top of this answer) if there are many groups sharing the same missing/nonmissing columns.
If you are manually writing your ref table, I would suggest...
rbindlist(idcol = "Group", fill = TRUE, list(
`NULL` = list(V1 = numeric(), V2 = numeric(), V3 = character(), V4 = character()), # zero-row prototype fixing column types; NULL must be backtick-quoted
Group1 = list(V1 = 1, V2 = 11, V4 = "C"),
Group2 = list(V1 = 2),
Group3 = list(V1 = 3, V3 = "a", V4 = "B")
))
Group V1 V2 V3 V4
1: Group1 1 11 <NA> C
2: Group2 2 NA <NA> <NA>
3: Group3 3 NA a B
for easier reading and editing.
We can loop through the reference table and compare each row to the example data, assigning groups where the conditions match. This scales to any size of reference table and data, although you may want to vectorize parts of it if the data has more than ~100k rows:
lenC <- ncol(ExampleRefTable)
lenT <- nrow(ExampleRefTable)
lenDat <- nrow(ExampleData)
ExampleData$Group <- "NA"  # note: the string "NA", as shown in the output below
for(i in 1:lenT){
  Group_Assign <- ExampleRefTable$Group[i]
  Vals <- ExampleRefTable[i, 2:lenC]
  for(j in 1:lenDat){
    # NA fields in the reference row act as wildcards (dropped by na.rm = TRUE)
    LogicArray <- ExampleData[j, 1:4] == Vals
    if(all(LogicArray, na.rm = TRUE)){
      ExampleData$Group[j] <- Group_Assign
    }
  }
}
> ExampleData
V1 V2 V3 V4 Group
1: 1 11 c C Group1
2: 2 12 c B Group2
3: 2 11 c A Group2
4: 3 12 b B NA
5: 1 10 a C NA
---
996: 3 12 a B Group3
997: 2 10 a C Group2
998: 1 10 a A NA
999: 1 10 a B NA
1000: 1 11 b C Group1
This example assumes that an NA in the reference data matches any value in the example data at that position, e.g.:
#This is assigned Group1 since NA in the ref.table matched c in pos.3
> ExampleRefTable
V1 V2 V3 V4 Group
1: 1 11 NA C Group1
> ExampleData
V1 V2 V3 V4 Group
1: 1 11 c C Group1
If NA is instead supposed to match only NA values (of which there were none in the example data), change the inner loop to:
for(j in 1:lenDat){
  LogicArray <- ExampleData[j, 1:4] == Vals
  # additionally require the NA pattern of the reference row to match the data row exactly
  if(all(is.na(Vals) == is.na(ExampleData[j, 1:4])) && all(LogicArray, na.rm = TRUE)){
    ExampleData$Group[j] <- Group_Assign
  }
}

Create a subsetting function according to one or more couple of values for a data.frame

How can I make a function that uses one or more pairs of values (x1,y1; x2,y2; ... as needed) to subset a data frame, like:
selection <- function(x1, y1, ...){
  dfselected <- subset(df, V1 == x1 & V2 == y1
                       ## MAY OR MAY NOT BE PRESENT ##
                       | V1 == x2 & V2 == y2)
  return(dfselected)
}
I can do it with subset() for a single, hard-coded case. Example:
df <- data.frame(
V1 = c(rep("a",5), rep("b",5)),
V2 = rep(c(1:5),2),
V3 = c(101:110)
)
i.e.
V1 V2 V3
a 1 101
a 2 102
a 3 103
a 4 104
a 5 105
b 1 106
b 2 107
b 3 108
b 4 109
b 5 110
And the subsetting for the pairs ("a", 3) and ("b", 4) looks like:
dfselected <- subset(df, V1 == "a" & V2 == 3 | V1 == "b" & V2 == 4 )
I couldn't find a similar function. I don't know whether I have to pass an unspecified number of parameters to the function (the so-called "three dots") or use if/else. I'm a beginner with functions, so links or examples are welcome too.
I started mostly with that: http://www.ats.ucla.edu/stat/r/library/intro_function.htm
Edit: solution after hadley's answer:
selection <- function (x,y){
match <- data.frame(
V1 = x,
V2 = y,
stringsAsFactors = FALSE
)
return(dplyr::semi_join(df, match))
}
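Usage sketch: the pairs are passed as two parallel vectors, so the couples ('a', 3) and ('b', 4) become:
selection(c('a', 'b'), c(3, 4))
# returns the rows (a, 3, 103) and (b, 4, 109);
# semi_join() will message which columns it joined by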
It sounds like you want a semi-join: find all rows in x that have matching entries in y:
df <- data.frame(
V1 = c(rep("a",5), rep("b",5)),
V2 = rep(c(1:5), 2),
V3 = c(101:110),
stringsAsFactors = FALSE
)
match <- data.frame(
V1 = c("a", "b"),
V2 = c(3L, 4L),
stringsAsFactors = FALSE
)
library(dplyr)
semi_join(df, match)
Unless I'm missing something, you could just use base R's merge().
With the two example data.frames Hadley provided,
merge(df, match)
# V1 V2 V3
# 1 a 3 103
# 2 b 4 109
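Another base R sketch that avoids a join entirely: match on a pasted key (this assumes the chosen separator never occurs inside the values):
key_df    <- paste(df$V1, df$V2, sep = "\r")
key_match <- paste(match$V1, match$V2, sep = "\r")
df[key_df %in% key_match, ]  # same rows as merge(df, match), in df's original order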
