Check whether value in one dataframe is in another (larger) dataframe - r

I'm struggling to come up with a vectorised solution to the following problem. I have two dataframes:
> people <- data.frame(name = c('Fred', 'Bob'), profession = c('Builder', 'Baker'))
> people
name profession
1 Fred Builder
2 Bob Baker
> allowed <- data.frame(name = c('Fred', 'Fred', 'Bob', 'Bob'), profession = c('Builder', 'Baker', 'Barman', 'Biker'))
> allowed
name profession
1 Fred Builder
2 Fred Baker
3 Bob Barman
4 Bob Biker
That is to say, I want to check every person in people has a permitted profession, and return any names which do not.
For instance, Fred can be a Builder or a Baker, and so he is fine. However, Bob can be a Barman or a Biker, but not a Baker (note: there are only ever two permitted professions in my use case).
I would like to return a data frame of those names which do not have a permitted profession, along with the professions they are permitted:
name profession permitted
1 Bob Baker Biker
2 Bob Baker Barman
Thanks for the help

Simple base-only solution. I'm sure someone can come up with something better.
out <- allowed[!allowed$name %in% merge(people, allowed)$name, ]
This gets you the desired people, along with their permitted professions. If you also want their actual professions:
names(out)[2] <- "permitted"
out <- merge(people, out, all.y=TRUE)
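For reference, the two steps above can be run end to end on the example data. This is only a sketch: it assumes the people and allowed data frames from the question, and stringsAsFactors = FALSE is an addition here so that the result holds plain character columns.

```r
# example data from the question (stringsAsFactors = FALSE is an assumption)
people <- data.frame(name = c("Fred", "Bob"),
                     profession = c("Builder", "Baker"),
                     stringsAsFactors = FALSE)
allowed <- data.frame(name = c("Fred", "Fred", "Bob", "Bob"),
                      profession = c("Builder", "Baker", "Barman", "Biker"),
                      stringsAsFactors = FALSE)

# names that have a matching (name, profession) pair in `allowed`
ok_names <- merge(people, allowed)$name

# permitted professions of everyone else
out <- allowed[!allowed$name %in% ok_names, ]
names(out)[2] <- "permitted"

# attach the actual profession by joining on name
out <- merge(people, out, all.y = TRUE)
out
```

The first merge uses both shared columns as the key, so it acts as a filter for permitted pairs; the second merge shares only name, so it attaches the actual profession to each permitted one.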

Here's a slightly more readable data.table solution. You can do the last step on the same line as well to make it a one-liner, if you consider that readable.
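# load library and convert both data frames to data.tables
library(data.table)
people = as.data.table(people)
allowed = as.data.table(allowed)
# compute: the anti-join finds people without a permitted profession,
# then the join on name attaches every permitted profession
result = allowed[people[!allowed, on = c("name", "profession")], on = "name"]
# the column from allowed holds the permitted professions; i.profession is the actual one
setnames(result, c("profession", "i.profession"), c("permitted", "profession"))
result
#    name permitted profession
# 1:  Bob    Barman      Baker
# 2:  Bob     Biker      Baker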

Probably there's another way, but this should work. I added a third person with an unpermitted profession to show you how to apply the function to the entire dataset.
currentprof <-structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(3L,
2L, 1L), .Label = c("Analyst", "Baker", "Builder"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -3L))
allowed <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(4L,
1L, 2L, 3L, 6L, 5L), .Label = c("Baker", "Barman", "Biker", "Builder",
"Driver", "Teacher"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -6L))
checkprof <- function(name) {
  allowedn <- allowed[allowed$name == name, ]
  currentprofn <- currentprof[currentprof$name == name, ]
  if (!currentprofn$profession %in% allowedn$profession) {
    result <- merge(currentprofn, allowedn, by = "name", all.x = TRUE)
  } else {
    result <- data.frame(col1 = character(),
                         col2 = character(),
                         col3 = character(),
                         stringsAsFactors = FALSE)
  }
  colnames(result) <- c("name", "profession", "permitted")
  return(result)
}
do.call(rbind,lapply(levels(allowed$name),checkprof))

This is my take on it. It may need some more testing, and I'd be open to suggestions myself. It works with your example, but I am not sure it would generalize.
# match on name/profession pairs; comparing allowed$name == people$name directly
# would silently recycle the shorter vector
people$check <- ifelse(paste(people$name, people$profession) %in% paste(allowed$name, allowed$profession), TRUE, FALSE)
# keep the people whose profession is not permitted
people_select <- people[people$check == FALSE,]
EDIT: and just for clarification, in case this is holding you back from voting: the ifelse is vectorized and will run very fast.

Related

How to concatenate rows based on group as quickly as possible

I have a dataframe as follows
ClientVisitGUID LineNum TextCol
1 1 This was a great
1 2 report I did
2 3 was performed today
2 1 Another great report
2 2 for this person
3 2 good stuff
3 1 I really write very
3 3 when I put my
3 4 mind to it
I'd like to concatenate the rows based on the ClientVisitGUID and the line number so I can get the following output
ClientVisitGUID TextCol
1 This was a great report I did
2 Another great report for this person was performed today
3 I really write very good stuff when I put my mind to it
I tried dplyr, but it takes a long time and can't deal with the thousands of rows that I have
resultset2<-resultset %>%
group_by(ClientVisitGUID) %>%
arrange(LineNum) %>%
summarize_all(paste, collapse=",")
Is there a faster way? I'm not really familiar with data.table, but is it fast?
A second data.table option, also using stringi for its performance
library(data.table)
library(stringi)
setDT(df)
setkey(df, ClientVisitGUID, LineNum)
df1 <- df[, .(new = stri_c(TextCol, collapse = " ")), by = ClientVisitGUID]
Result
df1
# ClientVisitGUID new
#1: 1 This was a great report I did
#2: 2 Another great report for this person was performed today
#3: 3 I really write very good stuff when I put my mind to it
data (thanks to @ThomasIsCoding)
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))
A base R option is to use aggregate
result <- aggregate(TextCol ~ ClientVisitGUID,
                    df[order(df$ClientVisitGUID, df$LineNum), ],
                    paste0,
                    collapse = " ")
which gives
> result
ClientVisitGUID TextCol
1 1 This was a great report I did
2 2 Another great report for this person was performed today
3 3 I really write very good stuff when I put my mind to it
Data
df <- structure(list(ClientVisitGUID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), LineNum = c(1L, 2L, 3L, 1L, 2L, 2L, 1L, 3L, 4L), TextCol = c("This was a great",
"report I did", "was performed today", "Another great report",
"for this person", "good stuff", "I really write very", "when I put my",
"mind to it")), class = "data.frame", row.names = c(NA, -9L))
If you want speed, data.table is indeed a great candidate:
library(data.table)
setDT(resultset)
# keying by both columns sorts the rows by group, then by line number
setkeyv(resultset, c("ClientVisitGUID", "LineNum"))
resultset[, .(TextCol = paste(TextCol, collapse = " ")), by = "ClientVisitGUID"]
Setting the key takes some time at first, but operations are faster afterwards. Setting the key reorders the rows so that rows belonging to the same group sit in contiguous memory.
Example
data = data.table("a" = c("aaa","ffff","ttt"), "b" = c(1,1,2))
data[, lapply(.SD, paste, collapse = ","), by = "b"]
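To see what keying does, here is a small sketch on made-up data (the table dt and its columns are hypothetical). setkey() sorts the rows by the key columns in place, so a later grouped paste() sees the lines of each group in order.

```r
library(data.table)

# hypothetical toy data, deliberately out of order
dt <- data.table(id   = c(2L, 1L, 2L, 1L),
                 line = c(2L, 2L, 1L, 1L),
                 txt  = c("d", "b", "c", "a"))

# sort by id, then line, by reference, and mark those columns as the key
setkey(dt, id, line)
key(dt)

# one row per id, with the text pieces joined in line order
res <- dt[, .(txt = paste(txt, collapse = " ")), by = id]
res
```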

Filter table by column in R

I would like to filter a table when the column name is stored in a variable. I tried the code below, but it did not work. dat is a data frame, the name of the column is Name, and I would like to filter by "John".
colname <- "Name"
dat[dat$colname %in% "John",]
I saw that it works fine if I do not use a variable for the column name. (The code below works fine.)
dat[dat$"Name" %in% "John",]
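The $ operator takes the name after it literally, so dat$colname looks for a column called colname. You may use the bracket functions [[ or [ instead, which evaluate the variable.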
colname <- "Name"
dat[dat[[colname]] %in% "John", ]
dat[dat[, colname] %in% "John", ] # or
# Name X1 X2
# 8 John 0.8646536 1.2688507
# 9 John -1.7201559 -0.3125515
Data
dat <- structure(list(Name = structure(c(3L, 3L, 2L, 4L, 4L, 2L, 3L,
1L, 1L, 2L), .Label = c("John", "Linda", "Mary", "Olaf"), class = "factor"),
X1 = c(0.758396178001042, -1.3061852590117, -0.802519568703793,
-1.79224083446114, -0.0420324540227439, 2.15004261784474,
-1.77023083820321, 0.864653594565389, -1.72015589816109,
0.134125668141181), X2 = c(-0.0758265646523722, 0.85830054437592,
0.34490034810227, -0.582452690107777, 0.786170375925402,
-0.692099286413293, -1.18304353631275, 1.26885070606311,
-0.31255154601115, 0.0305712590978896)), class = "data.frame", row.names = c(NA,
-10L))
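A small sketch on made-up data showing the difference between $ and [[ when the column name is stored in a variable (the data frame d is hypothetical):

```r
# hypothetical toy data
d <- data.frame(Name = c("Mary", "John", "Olaf", "John"),
                X1 = 1:4,
                stringsAsFactors = FALSE)
colname <- "Name"

# `$` does not evaluate `colname`; there is no column with that literal name
is.null(d$colname)

# `[[` evaluates its argument, so the variable is looked up first
johns <- d[d[[colname]] %in% "John", ]
johns
```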
An approach with dplyr using non-standard evaluation. Using @jay.sf's data
library(dplyr)
dat %>% filter(!!sym(colname) == "John")
# Name X1 X2
#1 John 0.864654 1.268851
#2 John -1.720156 -0.312552
In data.table, we can use get
library(data.table)
setDT(dat)[get(colname) == "John"]
Since we have only one value to compare we can use == here instead of %in%.
With data.table, we can use eval with as.symbol
library(data.table)
setDT(dat)[eval(as.symbol(colname)) == "John"]
# Name X1 X2
#1: John 0.8646536 1.2688507
#2: John -1.7201559 -0.3125515

Can I use %in% to search and match two columns?

I have a large dataframe and a vector that I use to pull out terms of interest. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull out information into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content of the vector and print the matching rows as a new dataframe.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not pick up the third row, which I guess you do want extracted. You would have to change your y to
y <- c("ibp", "ORF1")
From your example I am not sure this is really what you want, but R uses the operator | as "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the ",", e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
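If the case mismatch between y and the data is not a typo, one option is to lower-case both sides before matching. This is a sketch that assumes exact whole-cell matches are wanted, and it uses a cut-down, hypothetical version of the example data:

```r
# cut-down, hypothetical version of the example data
data <- data.frame(Gene = c("ibp", "repA1", "pLeuDn_02"),
                   hit  = c("Ibp protein", "repA1 protein", "ORF1"),
                   stringsAsFactors = FALSE)
y <- c("ibp", "orf1")

# compare in lower case so that "ORF1" matches "orf1"
in_gene <- tolower(data$Gene) %in% tolower(y)
in_hit  <- tolower(data$hit)  %in% tolower(y)

# keep the rows where either column matches
new.data <- data[in_gene | in_hit, ]
new.data
```

Note that %in% only matches whole values, so "Ibp protein" is not picked up by "ibp"; partial matches would need something like grepl instead.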

Merging two rows of data in R based on rules

I have merged two data frames using bind_rows. I have a situation where I have two rows of data as for example below:
Page Path Page Title Byline Pageviews
/facilities/when-lighting-strikes NA NA 668
/facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
When I have this type of duplicate page path, I'd like to merge the identical page paths, eliminate the two NAs in the first row while keeping the page title (When Lighting Strikes) and byline (Tom Jones), and then keep the pageviews result of 668 from the first row. Somehow it seems that I need
to identify the duplicate pages paths
look to see if there are different titles and bylines; remove NAs
keep the row with the pageview result; remove the NA row
Is there a way I can do this in R dplyr? Or is there a better way?
A simple solution:
library(dplyr)
df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
#
# PagePath PageTitle Byline Pageviews
# (fctr) (fctr) (fctr) (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
If your data is more complicated, you may need a more robust approach.
Data
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
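summarise_each() and funs() have since been deprecated in dplyr. A rough equivalent with across() is sketched below, assuming dplyr 1.0.0 or later; it takes the first non-missing value per column in each group (NA if there is none), which is a slightly stricter rule than na.omit. The data frame is rebuilt with character columns here for brevity.

```r
library(dplyr)

# same example rows, rebuilt with character columns (an assumption for brevity)
df <- data.frame(PagePath  = c("/facilities/when-lighting-strikes",
                               "/facilities/when-lighting-strikes"),
                 PageTitle = c(NA, "When Lighting Strikes"),
                 Byline    = c(NA, "Tom Jones"),
                 Pageviews = c(668L, NA),
                 stringsAsFactors = FALSE)

# per group, keep the first non-missing value of every non-grouping column
res <- df %>%
  group_by(PagePath) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))
res
```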
Use the replace function in a for loop
for (i in unique(df$Page_Path)) {
  idx <- df$Page_Path == i
  views <- df$Pageviews[idx]
  # fill NAs with the group's own first non-missing pageview count
  df$Pageviews[idx] <- replace(views, is.na(views), views[!is.na(views)][1])
}
df <- subset(df, !is.na(Page_Title))
print(df)
Page_Path Page_Title Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
Here is an option using data.table and complete.cases. We convert the 'data.frame' to 'data.table' (setDT(df)), group by 'PagePath', loop through the columns of the dataset (lapply(.SD, ..)) and remove the NA elements with complete.cases. complete.cases returns a logical vector that can be used for subsetting. Reportedly, complete.cases is much faster than na.omit, and coupling it with data.table should increase efficiency further.
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
# PagePath PageTitle Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
data
df <- structure(list(PagePath = structure(c(1L, 1L),
.Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
Another way to do this (similar to a previous solution that uses dplyr) would be:
df %>% group_by(PagePath) %>%
  # paste() returns character, so Pageviews comes back as a string
  dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
                   Byline = paste(na.omit(Byline)),
                   Pageviews = paste(na.omit(Pageviews)))
An alternative approach using fill. Using tidyverse 1.3.0+ with dplyr 0.8.5+, you can use fill (from the tidyr package) to fill in missing values.
See this for more information https://tidyr.tidyverse.org/reference/fill.html
DATA Thanks Alistaire
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes NA NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
CODE
I just did this for PageTitle but you can repeat fill to do it for other columns. (dplyr gurus might have a smarter way to do all 3 columns at once). If you have ordered data like dates, then you can set .direction to be just down for example (look at past data).
df.new <- df %>% group_by(PagePath) %>%
  fill(PageTitle, .direction = "updown")
which gives you
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
Once you have all the NAs cleaned up then you can use distinct or rank to get your final summarised dataframe.
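As a sketch of the full sequence: fill() accepts several columns at once, so all three can be filled in one call before collapsing the duplicates with distinct(). This assumes tidyr 1.0.0 or later for .direction = "updown", and the data frame is rebuilt with character columns for brevity.

```r
library(dplyr)
library(tidyr)

# same example rows, rebuilt with character columns (an assumption for brevity)
df <- data.frame(PagePath  = c("/facilities/when-lighting-strikes",
                               "/facilities/when-lighting-strikes"),
                 PageTitle = c(NA, "When Lighting Strikes"),
                 Byline    = c(NA, "Tom Jones"),
                 Pageviews = c(668L, NA),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(PagePath) %>%
  # fill every column in both directions within each page path
  fill(PageTitle, Byline, Pageviews, .direction = "updown") %>%
  ungroup() %>%
  # the filled rows are now identical, so keep one of each
  distinct()
res
```

If a group holds conflicting non-missing values, the rows will not become identical and distinct() will keep more than one of them.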

data.table weird behaviour when used in a function

I have a data.frame as follows.
data <- structure(list(V1 = structure(1:3, .Label = c("S01", "S02", "S03"), class = "factor"), V2 = structure(c(1L, 3L, 2L), .Label = c("Alan", "Bruce", "Jay"), class = "factor"), V3 = structure(c(3L, 1L, 2L), .Label = c("Barry", "Dick", "Hal"), class = "factor"), V4 = structure(c(1L, 3L, 2L), .Label = c("Guy", "Jean-Paul", "Wally"), class = "factor"), V5 = structure(c(3L, 1L, 2L), .Label = c("Bart", "Damien", "John"), class = "factor")), .Names = c("V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -3L))
It is not a data.table
is.data.table(data)
[1] FALSE
I have a function foo for example which utilizes data.table for doing some manipulations in the data.frame as follows.
foo <- function(df) {
  if(!is.data.frame(df)) stop('"df" is not a data.frame')
  setDT(df)
  setkey(df, V1)
  # with = FALSE is no longer accepted together with := in current data.table
  df[, NEW := paste0(V3, V4)]
  setDF(df)
  return(df)
}
However when I run the function with the data.frame data (not a data.table), the output out is a data.frame (because of setDF(df)).
out <- foo(data)
is.data.table(out)
[1] FALSE
But now the original data.frame data is a data.table.
is.data.table(data)
[1] TRUE
I understand this is because data.table works by reference. However, how should I deal with this inside a function? I don't want to inadvertently change any data.frame in the calling environment. Should I always force a copy with copy or <- instead of setDT whenever data.table is used in a function, or is there another way?
With regard to
is there another way?
Instead of setDT() inside the function, you could use as.data.table()
foo <- function(df) {
  if(!is.data.frame(df)) stop('"df" is not a data.frame')
  df <- as.data.table(df)
  setkey(df, V1)
  df[, NEW := paste0(V3, V4)]
  setDF(df)
  return(df)
}
foo(data)
# V1 V2 V3 V4 V5 NEW
# 1 S01 Alan Hal Guy John HalGuy
# 2 S02 Jay Barry Wally Bart BarryWally
# 3 S03 Bruce Dick Jean-Paul Damien DickJean-Paul
is.data.table(data)
# [1] FALSE
For some examples of functions that turn the input data frame into a data table but do not change the original data frame at all, I'd definitely recommend looking at source code for the functions in package splitstackshape.
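Regarding the copy option mentioned in the question: taking an explicit copy() before setDT() also leaves the caller's data frame alone. This is a sketch in which foo_copy and the toy data are hypothetical names:

```r
library(data.table)

foo_copy <- function(df) {
  if (!is.data.frame(df)) stop('"df" is not a data.frame')
  dt <- copy(df)   # deep copy, so the caller's object is never touched
  setDT(dt)        # convert the copy by reference
  setkey(dt, V1)
  dt[, NEW := paste0(V3, V4)]
  setDF(dt)        # convert the copy back to a plain data.frame
  dt
}

# hypothetical toy data
data <- data.frame(V1 = c("S02", "S01"),
                   V3 = c("Barry", "Hal"),
                   V4 = c("Wally", "Guy"),
                   stringsAsFactors = FALSE)
out <- foo_copy(data)
is.data.table(data)
```

The cost is one full copy of the input, which is the same trade-off that as.data.table() makes.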

Resources