How to combine the use of %in% with OR operator? - r

I would like to look up and test whether values from one set ("set A") appear in either set B or set C. I was trying to use the %in% operator for this purpose, but couldn't figure out how to combine it with OR.
A reproducible example follows at the bottom, but just the gist of what I'm trying to get is something like:
set_a %in% (set_b | set_c)
where I want to know which values from set_a exist in either set_b or set_c, or in both.
Example
#Step 1 :: Creating the data
library(data.table) #needed for as.data.table() below
set_a <- unlist(strsplit("Eden Kendall Cali Ari Madden Leo Stacy Emmett Marco Bridger Alissa Elijah Bryant Pierre Sydney Luis", split=" "))
set_b <- as.data.table(unlist(strsplit("Kathy Ryan Brice Rowan Nina Abram Miles Kristina Gabriel Madden Jasper Emmett Marco Bridger Alissa Elijah Bryant Pierre Sydney Luis", split=" ")))
set_c <- as.data.table(unlist(strsplit("Leo Stacy Emmett Marco Moriah Nola Jorden Dalia Kenna Laney Dillon Trystan Elijah Bryant Pierr", split=" ")))
NamesList <- list(set_b, set_c) #set_b and set_c will now become neighboring data.table dataframes in one list.
> NamesList
[[1]]
V1
1: Kathy
2: Ryan
3: Brice
4: Rowan
5: Nina
6: Abram
7: Miles
8: Kristina
9: Gabriel
10: Madden
11: Jasper
12: Emmett
13: Marco
14: Bridger
15: Alissa
16: Elijah
17: Bryant
18: Pierre
19: Sydney
20: Luis
[[2]]
V1
1: Leo
2: Stacy
3: Emmett
4: Marco
5: Moriah
6: Nola
7: Jorden
8: Dalia
9: Kenna
10: Laney
11: Dillon
12: Trystan
13: Elijah
14: Bryant
15: Pierr
#Step 2 :: Checking which values from set_a appear in either set_b or set_c
matches <- set_a %in% (set_b | set_c)
#doesn't work!
Any ideas? By the way, it is important to me to use a data.table format.

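You could test the two conditions separately. Because set_b and set_c are data.tables, compare against the column V1 rather than the whole table:
set_a %in% set_b$V1 | set_a %in% set_c$V1
Or use union or unique:
set_a %in% union(set_b$V1, set_c$V1)
set_a %in% unique(c(set_b$V1, set_c$V1))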

We can use Reduce, which generalizes to any number of sets stored in a list:
Reduce(`|`, lapply(NamesList, function(s) set_a %in% s$V1))
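As a rough illustration of the answers above, here is a minimal base-R sketch. It assumes plain data frames with a column V1 standing in for the data.tables, and the short name vectors are a hypothetical subset of the data in the question.

```r
# Hypothetical, shortened stand-ins for set_a, set_b and set_c
set_a <- c("Eden", "Leo", "Madden", "Pierre")
set_b <- data.frame(V1 = c("Madden", "Pierre", "Kathy"), stringsAsFactors = FALSE)
set_c <- data.frame(V1 = c("Leo", "Pierr"), stringsAsFactors = FALSE)

# Test each set separately, then combine the logical vectors with |
matches <- set_a %in% set_b$V1 | set_a %in% set_c$V1

# Equivalent: test once against the union of both columns
matches_union <- set_a %in% union(set_b$V1, set_c$V1)

# Names from set_a found in set_b, set_c, or both
set_a[matches]
```

The key point is that %in% needs a vector on its right-hand side, so the column is extracted with $V1 before the comparison.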

Related

How to FILL DOWN (autofill) value , eg replace NA with first value in group, using data.table in R?

Very simple and common task:
I need to FILL DOWN in data.table (similar to autofill function in MS Excel) so that
library(data.table)
DT <- fread(
"Paul 32
NA 45
NA 56
John 1
NA 5
George 88
NA 112")
becomes
Paul 32
Paul 45
Paul 56
John 1
John 5
George 88
George 112
Thank you!
Yes, the best way to do this is to use @Rui Barradas's idea of the zoo package. You can do it in one line of code with the na.locf function.
library(zoo)
DT[, V1 := na.locf(V1, na.rm = FALSE)]
Replace V1 with whatever you name your column after reading in the data with fread. The na.rm = FALSE argument keeps any leading NA values, so the result has the same length as the column. Good luck!
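If you would rather not depend on zoo, the same fill-down idea can be sketched in base R. The helper name fill_down is made up for this illustration, and the approach (carrying forward the index of the last non-NA value) is one possible technique, not the method used by na.locf itself.

```r
# fill_down is a hypothetical helper: last observation carried forward
fill_down <- function(x) {
  # Position of the most recent non-NA value at each element (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading NAs have nothing to carry forward, so they stay NA
  idx[idx == 0] <- NA
  x[idx]
}

V1 <- c("Paul", NA, NA, "John", NA, "George", NA)
filled <- fill_down(V1)
filled
```

With a data.table this could be applied as DT[, V1 := fill_down(V1)].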
For example 2, you can consider using stats::spline for extrapolation as follows:
DT2[is.na(V2), V2 :=
as.integer(DT2[, spline(.I[!is.na(V2)], V2[!is.na(V2)], xout=.I[is.na(V2)]), by=.(V1)]$y)]
output:
V1 V2
1: Paul 1
2: Paul 2
3: Paul 3
4: Paul 4
5: John 100
6: John 110
7: John 120
8: John 130
data:
DT2 <- fread(
"Paul, 1
Paul, 2
Paul, NA
Paul, NA
John, 100
John, 110
John, NA
John, NA")

Turn names into numbers in a dataframe based on the row index of the name in another dataframe

I have two dataframes. One is just the names of my Facebook friends and the other is the links, with source and target columns. I want to turn the names in the links dataframe into numbers based on the row index of that name in the friends dataframe.
friends
               name
1    Andrewt Thomas
2     Robbie McCord
3 Mohammad Mojadidi
4       Andrew John
5     Professor Owk
6    Joseph Charles
links
          source          target
1 Andrewt Thomas     Andrew John
2 Andrewt Thomas       James Zou
3  Robbie McCord         Bz Benz
4  Robbie McCord Yousef AL-alawi
5  Robbie McCord  Sherhan Asimov
6  Robbie McCord     Aigerim Aig
Seems trivial, but I cannot figure it out. Thanks for help.
Just use a simple match
links$source <- match(links$source, friends$name)
links
# source target
# 1 1 Andrew John
# 2 1 James Zou
# 3 2 Bz Benz
# 4 2 Yousef AL-alawi
# 5 2 Sherhan Asimov
# 6 2 Aigerim Aig
Something like this?
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
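Full example
# friends is a hypothetical lookup table for this example, not the one from the question
friends <- data.frame(name = c("Bob", "Alice", "John"), stringsAsFactors = FALSE)
links <- data.frame(source = c("John", "John", "Alice"), target = c("Jimmy", "Al", "Chris"), stringsAsFactors = FALSE)
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1), USE.NAMES = FALSE)
links$source
[1] 3 3 2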
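To show how match behaves when a name is missing from the lookup table, here is a small sketch. It uses a shortened, partly hypothetical version of the friends and links data, and the column names source_id and target_id are invented for the illustration.

```r
# Shortened stand-ins for the data frames in the question
friends <- data.frame(
  name = c("Andrewt Thomas", "Robbie McCord", "Andrew John"),
  stringsAsFactors = FALSE
)
links <- data.frame(
  source = c("Andrewt Thomas", "Robbie McCord"),
  target = c("Andrew John", "Bz Benz"),
  stringsAsFactors = FALSE
)

# match() returns the row index of each name in friends$name,
# or NA when the name does not appear there
links$source_id <- match(links$source, friends$name)
links$target_id <- match(links$target, friends$name)
links
```

Unlike the which() approach, match() does not fail on names that are absent from friends; it returns NA for them, which you may need to handle before building a graph.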

How do I find last date in which a value increased in another column?

I have a data frame in R that looks something like this:
person date level
Alex 2007-06-01 3
Alex 2008-12-01 4
Alex 2009-12-01 3
Beth 2008-03-01 6
Beth 2010-10-01 6
Beth 2010-12-01 6
Mary 2009-11-04 9
Mary 2012-04-25 9
Mary 2013-09-10 10
I have sorted it first by "person" and second by "date".
I am trying to find out when the last increase in "level" occurred for each person. Ideally, the output would look something like:
person date
Alex 2008-12-01
Beth NA
Mary 2013-09-10
Using dplyr
library(dplyr)
dat %>% group_by(person) %>%
mutate(inc = c(F, diff(level) > 0)) %>%
summarize(date = last(date[inc], default = NA))
Yielding:
Source: local data frame [3 x 2]
person date
1 Alex 2008-12-01
2 Beth <NA>
3 Mary 2013-09-10
Try data.table version:
library(data.table)
setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
If NA also needs to be included:
dd=setDT(dat)[order(person),diff:=c(NA,diff(level)),by=person][diff>0,tail(.SD,1),by=person][,-c(3,4),with=F]
dd2 =data.frame(unique(dat[!(person %in% dd$person)]$person),NA)
names(dd2) = c('person','date')
rbind(dd, dd2)
person date
1: Alex 2008-12-01
2: Mary 2013-09-10
3: Beth NA
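A base-R version, using a data frame df whose columns are named Person, Date and Level (capitalized, unlike in the question) and where Person is a factor: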
sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
})
produces the named vector
Alex Beth Mary
"2008-12-01" NA "2013-09-10"
Easy to wrap this to produce any output format you need:
last.level.up <- function(df) {
data.frame(Date=sapply(levels(df$Person), function(p) {
s <- df[df$Person==p,]
i <- 1+nrow(s)-match(TRUE,rev(diff(s$Level)>0))
ifelse(is.na(i), NA, as.character(s$Date[i]))
}))
}
last.level.up(df)
Date
Alex 2008-12-01
Beth <NA>
Mary 2013-09-10
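For comparison, here is a minimal base-R sketch that uses the lowercase column names from the question and does not require person to be a factor. The helper name last_increase is made up for this illustration, and the data is assumed to be sorted by date within each person, as stated in the question.

```r
dat <- data.frame(
  person = rep(c("Alex", "Beth", "Mary"), each = 3),
  date = as.Date(c("2007-06-01", "2008-12-01", "2009-12-01",
                   "2008-03-01", "2010-10-01", "2010-12-01",
                   "2009-11-04", "2012-04-25", "2013-09-10")),
  level = c(3, 4, 3, 6, 6, 6, 9, 9, 10),
  stringsAsFactors = FALSE
)

# last_increase is a hypothetical helper: date of the last row whose level
# is higher than the previous row's, or NA if the level never increased
last_increase <- function(d) {
  up <- c(FALSE, diff(d$level) > 0)
  if (any(up)) as.character(tail(d$date[up], 1)) else NA_character_
}

# Apply the helper to each person's rows; the result is a named character vector
res <- vapply(split(dat, dat$person), last_increase, character(1))
res
```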

String tokenization inside R data frame [duplicate]

I have some data that looks a little bit like this:
test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
",header = TRUE,sep = "")
I'd like it to look more like this ...
name amount
JEAN 318.5
JEAN 45
GREGORY 1518.5
GREGORY 67
GREGORY 8
WALTER 518.5
LARRY 518.5
LARRY 55
LARRY 1
HARRY 318.5
HARRY 32
It seems like there should be a straightforward way to break out the "amounts" column, but I'm not coming up with it. Happy to take a "RTFM page for this particular command" answer. What's the command I'm looking for?
(test.frame <- read.table(text = "name amounts
JEAN 318.5,45
GREGORY 1518.5,67,8
WALTER 518.5
LARRY 518.5,55,1
HARRY 318.5,32
",header = TRUE,sep = ""))
# name amounts
# 1 JEAN 318.5,45
# 2 GREGORY 1518.5,67,8
# 3 WALTER 518.5
# 4 LARRY 518.5,55,1
# 5 HARRY 318.5,32
tmp <- setNames(strsplit(as.character(test.frame$amounts),
split = ','), test.frame$name)
data.frame(name = rep(names(tmp), sapply(tmp, length)),
amounts = unlist(tmp), row.names = NULL)
# name amounts
# 1 JEAN 318.5
# 2 JEAN 45
# 3 GREGORY 1518.5
# 4 GREGORY 67
# 5 GREGORY 8
# 6 WALTER 518.5
# 7 LARRY 518.5
# 8 LARRY 55
# 9 LARRY 1
# 10 HARRY 318.5
# 11 HARRY 32
The fastest way (probably) will be data.table
library(data.table)
setDT(test.frame)[, lapply(.SD, function(x) unlist(strsplit(as.character(x), ','))),
.SDcols = "amounts", by = name]
## name amounts
## 1: JEAN 318.5
## 2: JEAN 45
## 3: GREGORY 1518.5
## 4: GREGORY 67
## 5: GREGORY 8
## 6: WALTER 518.5
## 7: LARRY 518.5
## 8: LARRY 55
## 9: LARRY 1
## 10: HARRY 318.5
## 11: HARRY 32
A generalization of David Arenburg's solution would be to use my cSplit function. Get it from the GitHub Gist (https://gist.github.com/mrdwab/11380733) or load it with "devtools":
# library(devtools)
# source_gist(11380733)
The "long" format would be what you are looking for...
cSplit(test.frame, "amounts", ",", "long")
# name amounts
# 1: JEAN 318.5
# 2: JEAN 45
# 3: GREGORY 1518.5
# 4: GREGORY 67
# 5: GREGORY 8
# 6: WALTER 518.5
# 7: LARRY 518.5
# 8: LARRY 55
# 9: LARRY 1
# 10: HARRY 318.5
# 11: HARRY 32
But the function can create wide output formats too:
cSplit(test.frame, "amounts", ",", "wide")
# name amounts_1 amounts_2 amounts_3
# 1: JEAN 318.5 45 NA
# 2: GREGORY 1518.5 67 8
# 3: WALTER 518.5 NA NA
# 4: LARRY 518.5 55 1
# 5: HARRY 318.5 32 NA
One advantage with this function is being able to split multiple columns at once.
This isn't a super standard format, but here is one way you can transform your data. First, I would use stringsAsFactors=FALSE with your read.table to make sure everything is a character variable rather than a factor. Alternatively, you can call as.character() on those columns.
First I split the values in the amounts column on the comma, then I combine the values with the name column
md <- do.call(rbind, Map(cbind, test.frame$name,
strsplit(test.frame$amounts, ",")))
Then I paste everything back together and send it to read.table to do the variable conversion
read.table(text=apply(md,1,paste, collapse="\t"),
sep="\t", col.names=names(test.frame))
Alternatively you could just make a data.frame from the md matrix and do the class conversions yourself
data.frame(names=md[,1], amount=as.numeric(md[,2]))
Here is a plyr solution:
Split.Amounts <- function(x) {
amounts <- unlist(strsplit(as.character(x$amounts), ","))
return(data.frame(name = x$name, amounts = amounts, stringsAsFactors=FALSE))
}
library(plyr)
ddply(test.frame, .(name), Split.Amounts)
Using dplyr:
library(dplyr)
test.frame %>%
group_by(name) %>%
do(Split.Amounts(.))
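The answers above leave the split amounts as character strings. A compact base-R sketch that also converts them to numeric might look like the following. It uses the first three rows of the question's data, and the output column name amount follows the desired output in the question.

```r
test.frame <- data.frame(
  name = c("JEAN", "GREGORY", "WALTER"),
  amounts = c("318.5,45", "1518.5,67,8", "518.5"),
  stringsAsFactors = FALSE
)

# Split each amounts string on commas; the result is a list of character vectors
parts <- strsplit(test.frame$amounts, ",", fixed = TRUE)

# Repeat each name once per piece, and flatten the pieces into one numeric column
long <- data.frame(
  name = rep(test.frame$name, lengths(parts)),
  amount = as.numeric(unlist(parts)),
  stringsAsFactors = FALSE
)
long
```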

Getting "raw" data from frequency table

I've been looking around for some data about naming trends in the USA. I managed to get the top 1000 names for babies born in 2008. The data is formatted in this manner:
male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
I want to get a data.frame with 2 variables: name and gender.
This can be done with looping, but I consider that a rather inefficient way of solving this problem. I reckon that some reshape function will suit my needs.
Let's presuppose that this tab-delimited data is saved into a data.frame named bnames. Looping can be done with function:
tmp <- character()
for (i in 1:nrow(bnames)) {
tmp <- c(tmp, rep(bnames[i,1], bnames[i,2]))
}
But I want to achieve this with vector-based approach. Any suggestions?
So one quick version would be to transform the data.frame and use the rbind() function
to get what you want.
dataNEW <- data.frame(bnames[,1],c("m"), bnames[,c(2,3)], c("f"), bnames[,4])
colnames(dataNEW) <- c("name", "gender", "value", "name", "gender", "value")
This will give you:
name gender value name gender value
1 Jacob m 22272 Emma f 18587
2 Michael m 20298 Isabella f 18377
3 Ethan m 20004 Emily f 17217
4 Joshua m 18924 Madison f 16853
5 Daniel m 18717 Ava f 16850
6 Alexander m 18423 Olivia f 16845
7 Anthony m 18158 Sophia f 15887
8 William m 18149 Abigail f 14901
9 Christopher m 17783 Elizabeth f 11815
10 Matthew m 17337 Chloe f 11699
Now you can use rbind():
dataNGV <- rbind(dataNEW[1:3],dataNEW[4:6])
which leads to:
name gender value
1 Jacob m 22272
2 Michael m 20298
3 Ethan m 20004
4 Joshua m 18924
5 Daniel m 18717
6 Alexander m 18423
7 Anthony m 18158
8 William m 18149
9 Christopher m 17783
10 Matthew m 17337
11 Emma f 18587
12 Isabella f 18377
13 Emily f 17217
14 Madison f 16853
15 Ava f 16850
16 Olivia f 16845
17 Sophia f 15887
18 Abigail f 14901
19 Elizabeth f 11815
20 Chloe f 11699
A direct vector-based solution (replacing the loop) would be:
# your data:
bnames <- read.table(textConnection(
"male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
"), sep=" ", header=TRUE, stringsAsFactors=FALSE)
# how to avoid loop
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ]
It's based on the fact that rep can do in a single call what you do in the loop.
But for the final result you should combine mropa's and gd047's answers.
Or with my solution:
data_final <- data.frame(
name = c(
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ],
bnames$female.name[ rep(1:nrow(bnames), times=bnames$n.female) ]
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
[EDIT] Simplify:
data_final <- data.frame(
name = rep(
c(bnames$male.name, bnames$female.name),
times = c(bnames$n.male, bnames$n.female)
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
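To see what the simplified version produces, here is a toy run. The counts are deliberately tiny and hypothetical (not the real 2008 figures) so that the expanded result is easy to inspect; the result name raw is also made up for this sketch.

```r
# Toy frequency table with made-up small counts
bnames <- data.frame(
  male.name = c("Jacob", "Michael"), n.male = c(3L, 2L),
  female.name = c("Emma", "Isabella"), n.female = c(2L, 1L),
  stringsAsFactors = FALSE
)

# rep() with a vector of times repeats each name by its own count
raw <- data.frame(
  name = rep(c(bnames$male.name, bnames$female.name),
             times = c(bnames$n.male, bnames$n.female)),
  gender = rep(c("m", "f"),
               times = c(sum(bnames$n.male), sum(bnames$n.female))),
  stringsAsFactors = FALSE
)

# Cross-tabulating the raw rows recovers the original counts
table(raw$name, raw$gender)
```

With the real data the result has one row per baby, so expect several hundred thousand rows for the ten names shown in the question.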
I think (if I have understood correctly) that mropa's solution needs one more step to get what you want
library(plyr)
data <- ddply(dataNGV, .(name,gender),
function(x) data.frame(name=rep(x[,1],x[,3]),gender=rep(x[,2],x[,3])))
Alternatively, download the full (cleaned up) baby names dataset from http://github.com/hadley/data-baby-names.

Resources