R: Constructing dummy columns based on partial string matches from two columns

I have a data frame df1 with information on acquisitions by ID. Every acquirer A and target B has its four-digit SIC codes on one line, separated by "/".
df1 <- data.frame(ID = c(1,2,3,4),
A = c("1230/1344/2334/2334","3322/3344/3443", "1112/9099", "3332/4483"),
B = c("1333/2334","3344/8840", "4454", "9988/2221/4483"))
ID A B
1 1230/1344/2334/2334 1333/2334
2 3322/3344/3443 3344/8840
3 1112/9099 4454
4 3332/4483 9988/2221/4483
I need to classify each transaction ID as follows:
If the primary code (i.e. the first four digits) of either A or B matches any code other than the primary code of B or A, respectively, then the Primary.other.match column takes a value of 1, and 0 otherwise.
If any code other than the primary code of A or B matches any code other than the primary code of B or A, then the Other.other.match column takes a value of 1, and 0 otherwise.
The desired output is shown below in the updated df1.
df1 <- data.frame(ID = c(1,2,3,4),
A = c("1230/1344/2334/2334","3322/3344/3443", "1112/9099", "3332/4483"),
B = c("1333/2334","3344/8840", "4454", "9988/2221/4483"),
Primary.other.match = c(0,1,0,0), # only if primary code of A or B matches any other code of B or A
Other.other.match = c(1,0,0,1)) # only if primary codes do not match primary or any other codes, but any other codes match
ID A B Primary.other.match Other.other.match
1 1230/1344/2334/2334 1333/2334 0 1
2 3322/3344/3443 3344/8840 1 0
3 1112/9099 4454 0 0
4 3332/4483 9988/2221/4483 0 1
Thank you for your help!

Here is a solution within the tidyverse.
You first create a function that checks whether there is a primary match or an other match, and then apply it to each pair of A and B values with purrr::map2:
library(tidyverse)
fun1 <- function(str1, str2){
str1 <- str1 %>% str_split("/") %>% unlist()
str2 <- str2 %>% str_split("/") %>% unlist()
str1p <- str1[1]
str2p <- str2[1]
pom <- ifelse(str1p %in% str2 | str2p %in% str1, 1, 0)
oom <- ifelse(pom == 0 & length(intersect(str1, str2)) > 0, 1, 0)
tibble(pom = pom, oom = oom)
}
df1 %>% as_tibble() %>%
mutate(result = map2(A, B, fun1)) %>%
unnest(result)
# A tibble: 4 x 5
ID A B pom oom
<dbl> <fct> <fct> <dbl> <dbl>
1 1 1230/1344/2334/2334 1333/2334 0 1
2 2 3322/3344/3443 3344/8840 1 0
3 3 1112/9099 4454 0 0
4 4 3332/4483 9988/2221/4483 0 1
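Note that fun1 tests each primary code against all codes on the other side, including the other primary code. If a primary-to-primary match should not count (which is how the question's wording reads), a base R variant could drop the first element before comparing. This is only a sketch under that assumption; the name classify_match is hypothetical and not part of the answer above.

```r
# Hypothetical helper: classify one A/B pair of "/"-separated SIC codes.
classify_match <- function(a, b) {
  a <- strsplit(as.character(a), "/", fixed = TRUE)[[1]]
  b <- strsplit(as.character(b), "/", fixed = TRUE)[[1]]
  # Primary code of one side against the non-primary codes of the other side.
  pom <- as.integer(a[1] %in% b[-1] || b[1] %in% a[-1])
  # Non-primary codes on both sides share a value, and no primary match was found.
  oom <- as.integer(pom == 0 && length(intersect(a[-1], b[-1])) > 0)
  c(Primary.other.match = pom, Other.other.match = oom)
}

df1 <- data.frame(ID = c(1, 2, 3, 4),
                  A = c("1230/1344/2334/2334", "3322/3344/3443", "1112/9099", "3332/4483"),
                  B = c("1333/2334", "3344/8840", "4454", "9988/2221/4483"))

# mapply() returns a 2 x 4 matrix here; transpose so that each row is one ID.
res <- t(mapply(classify_match, df1$A, df1$B, USE.NAMES = FALSE))
df1 <- cbind(df1, res)
df1
```

On the example data this gives the same flags as the desired output; the two versions would differ only for a row where the primary codes of A and B are identical.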

Related

R: Vlookup for a 'for' loop

I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but not having the values to determine whether it works the way I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates, 1 = duplicate, oldest, 2 = duplicate, newest.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate whether a person is duplicated. If a person is not duplicated, the value is 0; if they are duplicated, they should have a 1 for the oldest value and a 2 for the newest value (see IdealResult). NOTE: I have already sorted the data by person and then date, so if a person is duplicated, the first appearance is the oldest.
Previous answers here about VLOOKUP in R are aimed at merging datasets based on identical values in multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID = 0
nextID = 0
for(i in mydata$ID){
currentID = i
nextID = currentID + 1
# CurrentPerson: VLOOKUP-style function that finds currentID in ID and returns the associated value in the Person column at the same position.
# NextPerson: VLOOKUP-style function that finds nextID in ID and returns the associated value in the Person column at the same position.
# If CurrentPerson equals NextPerson, then DuplicateStatus at the ID associated with CurrentPerson should be 1, and DuplicateStatus at the ID associated with NextPerson should be 2.
}
# This should end when current person = total number of people
Thanks!

You really need to spend some time with a simple tutorial on R. Your cbind() function converts all of your data to a character matrix which is probably not what you want. Look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
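To make the point about cbind() concrete, here is a minimal sketch comparing the two constructors. The dates are written as character strings, which is an assumption about what the question intended; unquoted, 1/1/20 is evaluated as a division.

```r
Person <- c("A", "B", "C", "C", "D", "E", "E")
# Assumption: dates are meant as strings; unquoted, 1/1/20 evaluates to 0.05.
Date <- c("1/1/20", "1/1/20", "12/25/19", "1/1/20", "1/1/20", "12/25/19", "1/1/20")
ID <- c(1, 2, 3, 4, 5, 6, 7)

# cbind() on vectors of mixed type builds a matrix, and a matrix holds one
# type only, so the numeric ID column is coerced to character.
m <- cbind(Person, Date, ID)
is.matrix(m)   # TRUE
typeof(m)      # "character"

# data.frame() keeps each column's own type.
d <- data.frame(Person, Date, ID)
class(d$ID)    # "numeric"
```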

Is what you are looking for just to count the number of observations for a person, in one column (like a column ID)? If so, this will work using tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2

You could assign the row number within each group, provided the group has more than one row.
This can be implemented in base R, dplyr, and data.table.
In base R:
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)

I would argue that n() is the ideal function for your problem. Note that it returns the number of rows in each group rather than a position within the group:
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())
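For comparison, the group size that n() reports and the within-group position that the question asks for can be put side by side in base R. This sketch uses ave() in place of dplyr (a substitution made here to keep it dependency-free) and only the Person vector from the question.

```r
Person <- c("A", "B", "C", "C", "D", "E", "E")

# Group size, repeated on every row of the group (what n() gives inside mutate()).
size <- ave(seq_along(Person), Person, FUN = length)
size    # 1 1 2 2 1 2 2

# Position of each row within its group.
index <- ave(seq_along(Person), Person, FUN = seq_along)
index   # 1 1 1 2 1 1 2

# Zero out the groups that have a single row to get the 0/1/2 coding.
index * (size > 1)   # 0 0 1 2 0 1 2
```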

Filtering rows based on two conditions at the ID level

I have long data where a given subject has 4 observations. I want to only include a given id that meets the following conditions:
has at least one 3
has at least one of 1,2 OR NA
My data structure:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3), a=c(NA,1,2,3, NA,3,2,0, NA,NA,1,1))
My unsuccessful attempt (I get an empty data frame):
df %>% dplyr::group_by(id) %>% filter(a==3 & a %in% c(1,2,NA))

One option is to group by 'id' and build a condition that returns a single TRUE/FALSE per group. Based on the OP's post, column 'a' needs to contain the value 3 and at least one of 1, 2, or NA. 3 %in% a returns a logical vector of length 1. For the second condition, wrap any() around a check for the values 1 and 2 combined with a check for NA elements (is.na), and join both logical results with &.
library(dplyr)
df %>%
group_by(id) %>%
filter((3 %in% a) & any(a %in% c(1, 2) | is.na(a)))
# A tibble: 8 x 2
# Groups: id [2]
# id a
# <dbl> <dbl>
#1 1 NA
#2 1 1
#3 1 2
#4 1 3
#5 2 NA
#6 2 3
#7 2 2
#8 2 0
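The attempt in the question returns no rows because filter() evaluates a == 3 & a %in% c(1, 2, NA) element by element, and no single value is both 3 and one of 1, 2, NA. A base R sketch of the difference between the per-row test and the per-group test, with ave() standing in for group_by() (a substitution made here so the example has no dependencies):

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 a  = c(NA, 1, 2, 3, NA, 3, 2, 0, NA, NA, 1, 1))

# Per-row test: a single value can never be 3 and also 1, 2 or NA.
row_test <- df$a == 3 & df$a %in% c(1, 2, NA)
any(row_test, na.rm = TRUE)   # FALSE

# Per-group test: one TRUE/FALSE per id, repeated for every row of that id.
group_test <- ave(df$a, df$id, FUN = function(x) {
  (3 %in% x) && any(x %in% c(1, 2) | is.na(x))
}) == 1

df[group_test, ]   # keeps the rows of ids 1 and 2
```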

I have done this a bit of a long way to show how an idea could work. You can consolidate this a bit.
df %>%
group_by(id) %>%
mutate(has_3 = sum(a == 3, na.rm = TRUE) > 0,
keep_me = has_3 & (sum(is.na(a)) > 0 | sum(a %in% c(1, 2)) > 0)) %>%
filter(keep_me == TRUE) %>%
select(id, a)
id a
<dbl> <dbl>
1 1 NA
2 1 1
3 1 2
4 1 3
5 2 NA
6 2 3
7 2 2
8 2 0

As I read it, the filter should keep ids 1 and 2, so I would use a combination of all/any:
df %>%
group_by(id) %>%
filter(all(3 %in% a) & any(c(1,2,NA) %in% a))
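Putting NA inside the vector works here because %in% can match NA, unlike ==. A small sketch on one group's values (the vector a below is id 2 from the question):

```r
a <- c(NA, 3, 2, 0)

# %in% is built on match(), which treats NA as a matchable value.
NA %in% a                  # TRUE
any(c(1, 2, NA) %in% a)    # TRUE

# == propagates NA instead, so this is NA rather than TRUE.
any(a == NA)               # NA

# 3 %in% a is already a single TRUE/FALSE, so wrapping it in all() changes nothing.
identical(all(3 %in% a), 3 %in% a)   # TRUE
```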

Dummify character column, BUT with unequal number of categories in each row [duplicate]

I have a dataframe with the following structure
test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))
Now I want to create a data frame from this that contains a named column for each unique value in the test data frame. A unique value is a token terminated by the ';' character; the leading space is not part of the value. Then, for each row, I wish to fill the dummy columns with either a 1 or a 0, as shown below:
data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,1), e = c(0,1))
a ff cc rr e
1 1 1 1 1 0
2 1 0 1 1 1
I tried creating a df using for loops and the unique values in the column, but it's getting too messy. I have a vector available containing the unique values of the column. The problem is how to create the ones and zeros. I tried a mutate_all() call with grep(), but this did not work.

I'd use the splitstackshape package and mtabulate from the qdapTools package to get this as a one-liner, i.e.
library(splitstackshape)
library(qdapTools)
mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
# a cc ff rr e
#V1 1 1 1 1 0
#V2 1 1 0 1 1
It can also be done entirely with splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments,
cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)

Here's a possible data.table implementation. First we split the rows into columns, melt them into a single column, and then spread it wide while counting the events for each row.
library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
# variable a cc e ff rr
# 1: 1 1 1 0 1 1
# 2: 2 1 1 1 0 1

Here's a base R approach:
x <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x)) # get unique elements
x <- lapply(x, factor, levels = lvl) # convert to factor
t(sapply(x, table)) # count elements and transpose
# a ff cc rr e
#[1,] 1 1 1 1 0
#[2,] 1 0 1 1 1

We can do this with tidyverse
library(tidyverse)
rownames_to_column(test, 'grp') %>%
separate_rows(col) %>%
filter(col!="") %>%
count( grp, col) %>%
spread(col, n, fill = 0) %>%
ungroup() %>%
select(-grp)
# A tibble: 2 × 5
# a cc e ff rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 0 1 1
#2 1 1 1 0 1

Here is a base R solution. First remove the spaces, then collect all the unique values in cols. Split each row's string and check which of the values in cols are present in it. This gives a logical matrix, which can easily be converted to numeric.
test=as.data.frame(apply(test,2,function(x)gsub('\\s+', '',x)))
cols=unique(unlist(strsplit(as.character(test$col), split = ';')))
yy=strsplit(as.character(test$col), split = ';')
# one logical row per input string: which of the unique values does it contain?
z=as.data.frame(do.call(rbind, lapply(yy, function(x) cols %in% x)))
names(z)=cols
z=as.data.frame(lapply(z, as.integer))

Another approach with tidytext and tidyverse
library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
unnest_tokens(word, col) %>%
rownames_to_column(var="row") %>%
mutate(row = floor(parse_number(row)),
val = 1) %>%
spread(word, val, fill = 0) %>%
select(-row)
df
# a cc e ff rr
#1 1 1 0 1 1
#2 1 1 1 0 1

Another simple solution without any extra packages:
x = c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))
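The base R answers above return a matrix or sort the columns alphabetically. If a data frame with columns in order of first appearance is wanted, as in the question's desired output, the same idea can be wrapped in a small function. This is a sketch only; make_dummies is a hypothetical name, and the sep argument is an assumption about how general it should be.

```r
# Hypothetical helper: one 0/1 column per distinct token in a delimited string column.
make_dummies <- function(x, sep = ";") {
  tokens <- lapply(strsplit(as.character(x), sep, fixed = TRUE), trimws)
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])   # drop empty pieces
  lvl <- unique(unlist(tokens))                           # order of first appearance
  # One integer row per input string: 1 if the token is present, 0 otherwise.
  out <- do.call(rbind, lapply(tokens, function(tk) as.integer(lvl %in% tk)))
  colnames(out) <- lvl
  as.data.frame(out)
}

test <- data.frame(col = c("a; ff; cc; rr;", "rr; a; cc; e;"))
res <- make_dummies(test$col)
res
#   a ff cc rr e
# 1 1  1  1  1 0
# 2 1  0  1  1 1
```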

keeping certain rows in data frame with a condition

I have a data frame in R from which I want to remove certain rows that match certain conditions. How can I do it?
I have tried using dplyr and ifelse, but my code does not give the right answer:
check8 <- distinct(df5,prod,.keep_all = TRUE)
This does not work; it returns the entire data set.
Input is:
check1 <- data.frame(ID = c(1,1,2,2,2,3,4),
prod = c("R","T","R","T",NA,"T","R"),
bad = c(0,0,0,1,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 1 T 0
# 3 2 R 0
# 4 2 T 1
# 5 2 <NA> 0
# 6 3 T 1
# 7 4 R 0
Output expected:
data.frame(ID = c(1,2,3,4),
prod = c("R","R","T","R"),
bad = c(0,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 2 R 0
# 3 3 T 1
# 4 4 R 0
I want the output such that, for IDs with more than one row (several prod values, possibly including NA), only the rows with prod R are kept, but if an ID has only one row, that row is kept regardless of its prod.

Using dplyr we can use filter to select rows where prod == "R" or if there is only one row in the group, select that row.
library(dplyr)
check1 %>%
group_by(ID) %>%
filter(prod == "R" | n() == 1)
# ID prod bad
# <dbl> <fct> <dbl>
#1 1 R 0
#2 2 R 0
#3 3 T 1
#4 4 R 0
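The same rule can be written in base R. The sketch below uses ave() for the per-ID row count and %in% for the comparison, so that a missing prod yields FALSE rather than NA; the variable names n_per_id and keep are made up for illustration.

```r
check1 <- data.frame(ID = c(1, 1, 2, 2, 2, 3, 4),
                     prod = c("R", "T", "R", "T", NA, "T", "R"),
                     bad = c(0, 0, 0, 1, 0, 1, 0))

# Number of rows per ID, repeated on every row of that ID.
n_per_id <- ave(check1$ID, check1$ID, FUN = length)

# prod == "R" is NA where prod is missing; %in% returns FALSE there instead,
# which mirrors how filter() drops rows whose condition is NA.
keep <- check1$prod %in% "R" | n_per_id == 1

out <- check1[keep, ]
out
```

As with the dplyr version, an ID that has several rows but none with prod R would be dropped entirely; the example data contains no such ID.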

Here is a solution using anti_join:
library(dplyr)
check1 <- data.frame(ID = c(1,1,2,2,2,3,4), prod = c("R","T","R","T",NA,"T","R"), bad = c(0,0,0,1,0,1,0))
# First part: select all the IDs which contain 'R' as prod
p1 <- check1 %>%
group_by(ID) %>%
filter(prod == 'R')
# Second part: using anti_join get all the rows from check1 where there are not
# matching values in p1
p2 <- anti_join(check1, p1, by = 'ID')
solution <- bind_rows(
p1,
p2
) %>%
arrange(ID)
