Check if a string is a subset of another string in R

I have the data below, which includes ID and Code (chr type):
ID <- c(1,1,1,2,2,3,3,3, 4, 4)
Code <- c("0011100000", "0001100000", "1001100000", "1100000000",
"1000000000", "1000000000", "0100000000", "0010000000", "0010000001", "0010000001")
df <- data.frame(ID, Code)
I need to remove records (within each ID) based on the Code value pattern. That is:
For each ID, we look at the values of Code, and we remove the ones that are a subset of another row.
For example, for ID=1, row #2 is a subset of row #1, so we remove row #2. But row #3 is NOT a subset of row #1 or #2, so we keep it.
For ID=2, row #5 is a subset of row #4, so we remove it.
For ID=3, they are all different, so we keep them all.
For ID=4, since the Code for both records is the same, we keep only the first one.
Here is the expected final view of the results:

It's not that pretty, but a bit of checking of every combination with a join will do it.
Convert to a data.table
library(data.table)
setDT(df)
Make a row counter, and identify all the 1 locations in each string and save to a list.
df[, rn := .I]
df[, ones := gregexpr("1", df$Code)]
Join each group to itself, and compare the lists where the row numbers don't match. Then keep the row numbers where the lists are subsets, and drop these rows from the original data. In the case of duplicates, only remove the first occurrence of the duplicate.
df[
funion(
df[df, on=c("ID","rn>rn"), if(all(i.ones[[1]] %in% ones[[1]])) .(Code=i.Code), by=.EACHI][, -"rn"],
df[df, on=c("ID","rn<rn"), if(all(i.ones[[1]] %in% ones[[1]])) .(Code=i.Code), by=.EACHI][, -"rn"]
),
on=c("ID","Code"),
mult="first",
drop := 1
]
df[is.na(drop), -c("rn","ones","drop")]
# ID Code
#1: 1 0011100000
#2: 1 1001100000
#3: 2 1100000000
#4: 3 1000000000
#5: 3 0100000000
#6: 3 0010000000
#7: 4 0010000001
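The subset test at the heart of that join can be looked at in isolation. Below is a minimal base R sketch, not the answer's code: the helper names ones_at and is_subset are hypothetical, and it assumes "subset" means that every "1" in one Code sits at a position where the other Code also has a "1".

```r
# Hypothetical helper: positions of the "1" characters in a code string
ones_at <- function(code) which(strsplit(code, "")[[1]] == "1")

# Hypothetical helper: TRUE when every "1" in `a` is also a "1" in `b`
is_subset <- function(a, b) all(ones_at(a) %in% ones_at(b))

# Rows of ID = 1 from the question
row2_in_row1 <- is_subset("0001100000", "0011100000")  # TRUE: positions 4,5 are within 3,4,5
row3_in_row1 <- is_subset("1001100000", "0011100000")  # FALSE: position 1 is missing from row 1
row3_in_row2 <- is_subset("1001100000", "0001100000")  # FALSE: position 1 is missing from row 2
```

Under this definition two identical codes are subsets of each other, which is why duplicates such as the ID=4 rows need the separate "keep the first one" rule.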

Related

Filtering data.table rows by the presence of a column in a strsplit of another column

I have a data table:
dt <- data.table(col1=c('aa,bb', 'bb,cc,ee', 'dd,ee'), col2=c('aa', 'cc', 'aa'))
> dt
col1 col2
1: aa,bb aa
2: bb,cc,ee cc
3: dd,ee aa
I want to check if column 2 occurs in the strsplit of column one, so for the first row if aa is present in aa,bb split by a comma, which is true. It's also true for the second row, and false for the third. I only want to keep the rows where this occurs, so only row 1 and 2.
My first thought was doing it like this:
dt[col2 %in% strsplit(col1, ',')]
However, that returns an empty data.table.
I can think of multiple solutions to solve this, including making new columns using tstrsplit, or melting the data table, but all of these are a bit tedious for such a seemingly simple task. Any suggestions?
We can use str_detect from stringr
library(stringr)
dt[, flag := str_detect(col1, col2)]
dt
# col1 col2 flag
#1: aa,bb aa TRUE
#2: bb,cc,ee cc TRUE
#3: dd,ee aa FALSE
Also, to avoid any substring matches, we can specify the word boundary (\\b)
dt[, str_detect(col1, str_c("\\b", col2, "\\b"))]
#[1] TRUE TRUE FALSE
Regarding the use of strsplit, the output is a list of vectors. So, we need a function that checks whether each value of 'col2' is in the corresponding element of that list. Map does that:
dt[, unlist(Map(`%in%`, col2, strsplit(col1, ",")))]
To apply the filter in the same step and return the 2 row data.table:
dt[unlist(Map(`%in%`, col2, strsplit(col1, ",")))]
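For comparison, the same element-by-element check can be sketched in base R with mapply on plain vectors. This is only an illustration; the vector names pieces and keep are made up here, and the data is copied from the question.

```r
col1 <- c("aa,bb", "bb,cc,ee", "dd,ee")
col2 <- c("aa", "cc", "aa")

# One character vector of comma-separated pieces per row
pieces <- strsplit(col1, ",")

# Pair each col2 value with the pieces of its own row only
keep <- mapply(function(value, parts) value %in% parts,
               col2, pieces, USE.NAMES = FALSE)
keep
# [1]  TRUE  TRUE FALSE
```

Because the comparison is against whole pieces, a value such as "a" would not match "aa", so no word-boundary pattern is needed with this approach.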

Generate group by condition on row value in column R data.table

I want to split a data.table in R into groups based on a condition on the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to write a for loop across rows).
I have data like this:
library(data.table)
dt1 <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)))
I'd like to group at the large numbers (over a settable value) and come up with the example below:
dt.desired <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)), group=c(rep(1,50),rep(2,46),rep(3,43)))
dt1[ , group := cumsum(t > 200) + 1]
dt1[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
dt.desired[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
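The two answers differ in which group receives the large value itself. The base R sketch below uses a small made-up vector (vals, with a threshold of 200 chosen for illustration) to show both variants side by side.

```r
vals <- c(1, 2, 500, 3, 4, 900, 5)   # made-up data; 500 and 900 are "large"
is_large <- vals > 200

# Variant 1: the large value starts a new group
group_start <- cumsum(is_large) + 1
group_start
# [1] 1 1 2 2 2 3 3

# Variant 2: the large value closes the previous group
# (lag the running count by one row, filling the first row with 0)
group_end <- c(0, head(cumsum(is_large), -1)) + 1
group_end
# [1] 1 1 1 2 2 2 3
```

Variant 1 matches the layout of dt.desired in the question, where row 51 (the first large value) is the first row of group 2.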

How to create ID (by) before merging in R?

I have two dataframes df.o and df.m as defined below. I need to find which observation in df.o (dimension table) corresponds to which observations in df.m (fact table) based on two criteria: 1) df.o$Var1 == df.m$Var1 and 2) df.o$date1 < df.m$date2 < df.o$date3, such that I get the correct value of df.o$oID in df.m$oID (the correct value is manually entered in df.m$CORRECToID). I need the ID to complete a merge afterwards.
df.o <- data.frame(oID=1:4,
Var1=c("a","a","b","c"),
date3=c(2015,2011,2014,2015),
date1=c(2013,2009,2012,2013),
stringsAsFactors=FALSE)
df.m <- data.frame(mID=1:3,
Var1=c("a","a","b"),
date2=c(2014,2010,2013),
oID=NA,
CORRECToID=c(1,2,3),
points=c(5, 10,15),
stringsAsFactors=FALSE)
I have tried various combinations of code like the line below, but without luck:
df.m$oID[df.m$date2 < df.o$date3 & df.m$date2 > df.o$date1 & df.o$Var1==df.m$Var1] <- df.o$oID
I have also tried experimenting with various combinations of ifelse, which and match, but none seem to do the trick.
The problem I keep encountering is that my replacement was a different number of rows than data and that "longer object length is not a multiple of shorter object length".
What you are looking for is called an "overlap join". You could try the data.table::foverlaps function to achieve this.
The idea is simple:
1. Create the columns to overlap on (add an additional column to df.m).
2. Key by these columns.
3. Run foverlaps and select the columns you want back.
library(data.table)
setkey(setDT(df.m)[, date4 := date2], Var1, date2, date4)
setkey(setDT(df.o), Var1, date1, date3)
foverlaps(df.m, df.o)[, names(df.m), with = FALSE]
# mID Var1 date2 oID CORRECToID points date4
# 1: 2 a 2010 2 2 10 2010
# 2: 1 a 2014 1 1 5 2014
# 3: 3 b 2013 3 3 15 2013
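For small tables, the same matching rule can also be sketched in base R by merging on Var1 and then filtering on the date condition. This is an alternative illustration rather than the foverlaps approach; the names candidates and matched are made up, and the empty oID and CORRECToID columns are left out of df.m to avoid a name clash in merge.

```r
df.o <- data.frame(oID = 1:4,
                   Var1 = c("a", "a", "b", "c"),
                   date3 = c(2015, 2011, 2014, 2015),
                   date1 = c(2013, 2009, 2012, 2013),
                   stringsAsFactors = FALSE)
df.m <- data.frame(mID = 1:3,
                   Var1 = c("a", "a", "b"),
                   date2 = c(2014, 2010, 2013),
                   points = c(5, 10, 15),
                   stringsAsFactors = FALSE)

# Every df.m row paired with every df.o row that shares Var1
candidates <- merge(df.m, df.o, by = "Var1")

# Keep only the pairs where date2 lies strictly between date1 and date3
in_range <- candidates$date1 < candidates$date2 &
  candidates$date2 < candidates$date3
matched <- candidates[in_range, c("mID", "Var1", "date2", "oID", "points")]
matched <- matched[order(matched$mID), ]
```

The intermediate table grows with the number of rows per Var1 value, so on large data an overlap join is likely the better choice.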

Categorise multiple rows into one variable

A simple question, but apparently not answered in StO yet.
I've got a long data frame where 3 of the columns are:
person | trip | driver
=======================
1 car
1 bike
1 train
1 walk
2 walk
2 train
2 boat
What I'd like is to populate column 'driver', so that it reads 1 if at least one of the trips is made by car, 0 otherwise:
person | driver
================
1 1
1 1
1 1
1 1
2 0
2 0
2 0
I have a slight preference for doing this without resorting to fancy packages, but I am happy with most of the popular ones (e.g. plyr, data.table, sqldf...), or even new ones that prove helpful in the long term.
Thanks in advance, .p.
We could use data.table, convert 'data.frame' to 'data.table' (setDT(df1)), we check whether there is any 'car' in the 'trip' grouped by 'person', convert the logical output to numeric (+0L or wrapping with as.numeric) and assign (:=) to 'driver' column. If needed, we can remove the 'trip' column by assigning it to NULL or subset by [, c(1,3), with=FALSE]
library(data.table)
setDT(df1)[, driver := any(trip == 'car')+0L, by = person][, trip := NULL]
Or instead of any, we can use max(trip == 'car'), as @Arun mentioned in the comments
setDT(df1)[, driver := max(trip == 'car'), by = person]
Or using a similar logic as above, we group_by 'person' and create a new column with mutate and remove the unwanted columns with select
library(dplyr)
df1 %>%
group_by(person) %>%
mutate(driver= any(trip=='car')+0L) %>%
select(-trip)
Or with base R, we can use ave to create 'driver' and then subset to remove the 'trip' column.
df1$driver <- with(df1, ave(trip=='car', person, FUN=any)+0L)
subset(df1, select=-trip)
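The answers assume a data frame called df1 that the question never defines. Here is a self-contained base R sketch that rebuilds it from the question's table (the exact construction is an assumption) and applies the ave approach:

```r
# Assumed reconstruction of the question's data
df1 <- data.frame(person = c(1, 1, 1, 1, 2, 2, 2),
                  trip = c("car", "bike", "train", "walk",
                           "walk", "train", "boat"),
                  stringsAsFactors = FALSE)

# For each person, is any trip made by car? Repeat the answer on every row.
df1$driver <- as.integer(ave(df1$trip == "car", df1$person, FUN = any))
df1$driver
# [1] 1 1 1 1 0 0 0
```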

R Using a for() loop to fill one dataframe with another

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns 1 and 2. Column 1 (col1) contains characters a to z and col2 has values associated with each character (from a to z)
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1 so: aa ab ac ad etc; where the first letter is in col1 and the second letter is in col2
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do is to transfer values from DF1$col2 to DF2$col3 (values in DF2$col3 should be associated with the values in DF2$col1), but that's where I'm stuck. I currently have
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, then the problem is in your last line, where you assign the value of g to the whole col3 column. You should use the j index there as well, like:
for(j in 1:length(DF2$col1))
{
# look up the row of DF1 that matches the j-th key and fill only element j
DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work, then please also post some sample data so that we can help in more detail (I do not know what "pred" is, though I have a guess).
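As a rough sketch of the merge route mentioned above: the data below is made up, and it assumes that DF1 holds the key in col1 and the value to copy in a column named pred (a guess based on the question's code).

```r
# Made-up lookup table: one value per key (column name "pred" is assumed)
DF1 <- data.frame(col1 = c("a", "b", "c"),
                  pred = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# Every combination of keys, as described in the question
DF2 <- expand.grid(col1 = c("a", "b", "c"),
                   col2 = c("a", "b", "c"),
                   stringsAsFactors = FALSE)

# Attach the value belonging to DF2$col1, then rename it to col3
merged <- merge(DF2, DF1, by = "col1")
names(merged)[names(merged) == "pred"] <- "col3"
```

Note that merge sorts the result by the key column by default, so the row order of merged may differ from that of DF2.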
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17

Resources