Loops: How can I loop case_when function in R?

Here's the code, where I am trying to create a variable by detecting the words and matching them. Here I use dplyr package and its function mutate in combination with case_when. The problem is I am adding each one of the values manually as you see. How can I automate it by applying some loop functions to match the two?
city <- LETTERS #26 cities
district <- letters[10:20] #11 districts
streets <- paste0(district, district)
streets <- streets[-c(5:26)] #4 streets
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
library(dplyr)
library(stringi)
df2 <- df %>%
mutate(districts = case_when(
stri_detect_fixed(address, "b") ~ "b", #address[1]
#address[2]
stri_detect_fixed(address, "a") ~ "a", #address[3]
#address[4]
stri_detect_fixed(address, "cc") ~ "cc" #address[5]
))
The code scans through address for the value in district vector. I would love to do the same for city and street variables. So I used the modified version of the code from another question in Stack Overflow. It produces an error.
for (j in town_village2) {
trn_house3[,93] <- case_when(
stri_detect_fixed(trn_house3[1:6469, 4], j) ~ j)
}
I seek to produce this result:
x address city district street
1 A, b, cc, A b cc
2 B, dd B NA dd
3 a, dd NA a dd
4 C C NA NA
5 D, a, cc D a cc

If you are going to add a loop, it makes no sense to use case_when(); you don't have to add all options into it if you can loop over them.
You can solve it with a for-loop:
library(stringi)
df2 <- df
for(c in city) df2$city[stri_detect_fixed(df2$address, c)] <- c
for(d in district) df2$district[stri_detect_fixed(df2$address, d)] <- d
for(s in streets) df2$street[stri_detect_fixed(df2$address, s)] <- s
Note that your example code didn't work; the district names are 'a' and 'b' in your example dataset, but you generate names 'j' through 't'. I fixed that in my code above.
Note also that this gives wrong results if names of cities, districts and/or streets overlap. For instance, if a row is in district 'b' and on street 'cc', stri_detect_fixed will also find the 'c' inside 'cc' and conclude that the row is in district 'c'. I propose a completely different method to overcome this:
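The overlap problem can be reproduced with a small sketch. It uses base R's grepl(..., fixed = TRUE) as a stand-in for stri_detect_fixed (an assumption made only to avoid the package dependency), and the variable names are illustrative:

```r
# Addresses from the question
address <- c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")

# Fixed substring search: "c" is found inside the street name "cc",
# so rows 1 and 5 are wrongly reported as being in district "c"
substring_hit <- grepl("c", address, fixed = TRUE)
print(substring_hit)

# Exact matching on the comma-separated parts avoids these false positives
parts <- lapply(strsplit(address, ",", fixed = TRUE), trimws)
exact_hit <- vapply(parts, function(p) "c" %in% p, logical(1))
print(exact_hit)
```

No address actually contains a district called "c", which is what the exact comparison reports.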
Alternative method
Given your example data, it makes most sense to first split each address on commas, then look for exact matches with your reference city/district/street names. We can look for those exact matches with intersect().
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# vectorize address into elements
address_elems = strsplit(df$address, ',') # split by comma
address_elems = lapply(address_elems, trimws) # trim whitespace; lapply always returns a list
Compare df$address and the newly created address_elems:
> df$address
[1] "A, b, cc," "B, dd" "a, dd" "C" "D, a, cc"
> address_elems
[[1]]
[1] "A" "b" "cc"
[[2]]
[1] "B" "dd"
[[3]]
[1] "a" "dd"
[[4]]
[1] "C"
[[5]]
[1] "D" "a" "cc"
We can find the matching cities for just the first vector in address_elems with intersect(cities, address_elems[[1]]).
Because we might get multiple matches, we only take the first element, with intersect(cities, address_elems[[1]])[1]. Using [1] rather than [[1]] returns NA instead of an error when there is no match.
To apply this to every vector in address_elems, we can use sapply() or lapply():
# intersect the respective reference lists with each list of
# address items, taking only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
Putting it all together, we get:
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# create vector of address elements
address_elems = strsplit(df$address, ',') # split by comma
address_elems = lapply(address_elems, trimws) # trim whitespace
# intersect the respective reference lists with each list of
# address items, take only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
# cleanup
rm(address_elems)
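If the three lookups are needed in several places, they could be wrapped in a small helper. This is only a sketch: first_match is a hypothetical name, not part of any package, and vapply is used so that the result is always a character vector with NA where nothing matches.

```r
# Hypothetical helper: first reference value found in each address
first_match <- function(reference, elems) {
  vapply(elems, function(x) intersect(reference, x)[1], character(1))
}

address <- c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
elems <- lapply(strsplit(address, ","), trimws)

city     <- first_match(LETTERS, elems)
district <- first_match(letters[1:11], elems)
street   <- first_match(c("aa", "bb", "cc", "dd"), elems)

data.frame(x = 1:5, address, city, district, street)
```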

This will separate the elements into vectors:
library(tidyverse)
df <- data.frame(
x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
)
df3 <-
df %>%
separate_rows(address, sep = "[, ]+") %>%
filter(nchar(address) > 0) %>%
nest(data = address) %>%
transmute(x, districts = data %>% map(~ .x[[1]]))
df3
#> # A tibble: 5 × 2
#> x districts
#> <int> <list>
#> 1 1 <chr [3]>
#> 2 2 <chr [2]>
#> 3 3 <chr [2]>
#> 4 4 <chr [1]>
#> 5 5 <chr [3]>
df3$districts[[1]]
#> [1] "A" "b" "cc"
Created on 2022-04-14 by the reprex package (v2.0.0)

a data.table approach
library(data.table)
DT <- data.table(city, streets, district)
# create a lookup table with all elements
lookup <- melt(DT, measure.vars = names(DT))
# set df to data.table format
setDT(df)
final <- df[, .(address = unlist(tstrsplit(address, ",[ ]*", perl = TRUE))), by = .(x)]
# now add elements
final[lookup, type := i.variable, on = .(address = value)]
# and dcast to wide
dcast(final, x ~ type, value.var = "address")
# x city streets district
# 1: 1 A cc b
# 2: 2 B dd <NA>
# 3: 3 <NA> dd a
# 4: 4 C <NA> <NA>
# 5: 5 D cc a

Related

Looping in R with dynamic variables as dataframe names

I am trying to loop through dataframes where my search variable is in the name of the dataframe. Here I have multiple dataframes beginning with "person", "place", or "thing" and ending with either "5" or "8." I would like to loop through the many combinations of beginning and ending to create a temporary dataframe. The temporary dataframe will be used to create a plot and save the plot.
When I try my current code, I'm able to get the variable name to loop correctly (in other words, I can get "person_odds5" or "place_odds5"), but I cannot use those variables to access the corresponding column in the dataframe.
My current code is:
person_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=1:4, or_uci95=11:14, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
place_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=5:8, or_uci95=15:18, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
thing_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=9:12, or_uci95=19:22, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
nouns <- list("person", "place", "thing")
for (x in nouns) {
pval <- c(5)
for (p in pval) {
name <- paste(x,"_odds",p, sep="")
odds <- paste(name,"$odds", sep="")
temp_dat <- data.frame(odds=odds, index=1:nrow(name))
}
}
When I run this code, my output for "name" is "person_odds5" as character type; my output for "odds" is "person_odds5$odds" as character type, and I encounter "Error in 1:nrow(name) : argument of length 0." Basically, it appears that I can't parse my name assignment through the original dataframe.
Input:
>person_odds5
odds or_lci95 or_uci95 id.exposure id.outcome
1 a 1 11 f w
2 b 2 12 g x
3 c 3 13 h y
4 d 4 14 i z
>
Desired output:
>temp_dat
odds index
1 a 1
2 b 2
3 c 3
4 d 4
>
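No answer to this question is included here. As a minimal sketch of one common approach (an assumption, not taken from the thread), base R's get() looks up an object from a name held in a string, so nrow() and $ then operate on the data frame rather than on the character string. The results list is a hypothetical container added for illustration, and the data frames are trimmed to the columns used:

```r
person_odds5 <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 1:4)
place_odds5  <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 5:8)

nouns <- c("person", "place")
pval <- 5
results <- list()  # hypothetical container, one entry per data frame

for (x in nouns) {
  for (p in pval) {
    name <- paste0(x, "_odds", p)
    dat <- get(name)  # fetch the data frame whose name is stored in `name`
    results[[name]] <- data.frame(odds = dat$odds,
                                  index = seq_len(nrow(dat)))
  }
}

results$person_odds5
```

Storing the data frames in a named list from the start and indexing it with [[name]] would probably avoid the need for get() altogether.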

Compare two vectors within a data frame with %in% with R

I have the following data
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
Col1 Col2
a a,b,c
b aa,c,d
aa c,d,e
d d,f,g
I want to select the rows that contain a character from this vector c("a", "e", "g"), specifying the column:
library(dplyr)
T1 %>% filter(Col1 %in% c("a", "e", "g"))
It returns
1 a a,b,c
That is correct, but now I want to compare two vectors. With unlist and strsplit, I turn the value in each row into a character vector and try to compare it with the reference vector, to select the rows that contain any of the values:
unlist(strsplit(T1$Col2[1],","))
[1] "a" "b" "c"
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
It gives me an error:
Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run rlang::last_error() to see where the error occurred.
I can do it like this:
T1[grep(c("a|e|g"), T1$Col2),]
1 a a,b,c
2 b aa,c,d
3 aa c,d,e
4 d d,f,g
But that's wrong: row 3 (aa c,d,e) shouldn't be selected, because it's not a, it's aa.
To search for the "a" alone, you would have to do:
T1[grep(c("\\<a\\>"), T1$Col2),]
I think I will end up making a mistake with this form; I would feel safer comparing vector with vector:
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
Edited answer
You can use \\b, the regular-expression syntax for a word boundary. The | works like an "or" operator between the alternatives. You can use the following code:
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>%
filter(grepl("\\b(a|e|g)\\b", Col2))
#> Col1 Col2
#> 1 a a,b,c
#> 2 aa c,d,e
#> 3 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
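Note: in a regular R string the backslash must be doubled, so the word boundary is written "\\b" (a single "\b" would be a backspace character). In R 4.0+ you can alternatively use a raw string such as r"(\b(a|e|g)\b)".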
old answer
Your grep() call returns all rows because the pattern a|e|g matches substrings: row 2 matches because "aa" contains "a", row 3 because it contains "e", and row 4 because it contains "g". You could also use str_detect like this:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(any(str_detect(Col2, paste0(vector, collapse="|"))))
#> Col1 Col2
#> 1 a a,b,c
#> 2 b aa,c,d
#> 3 aa c,d,e
#> 4 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
If you want to check whether one of the strings exists in either of the two columns, you can use the following code:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(paste0(vector, collapse="|"), .x))))
#> Col1 Col2
#> 1 a a,b,c
Created on 2022-07-16 by the reprex package (v2.0.1)
Another way you could achieve this (using your original approach with strsplit) is to do it rowwise() and 'sum' the logical test.
T1 %>%
rowwise() %>%
filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
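The same split-then-compare idea can be sketched without dplyr. This is a base R variant, not the answerer's code; wanted and keep are illustrative names:

```r
T1 <- data.frame(Col1 = c("a", "b", "aa", "d"),
                 Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
                 stringsAsFactors = FALSE)
wanted <- c("a", "e", "g")

# Split each row's Col2 on commas, then test the parts for exact membership
keep <- vapply(strsplit(T1$Col2, ",", fixed = TRUE),
               function(parts) any(parts %in% wanted),
               logical(1))

T1[keep, ]
```

Because %in% compares whole elements, "aa" in row 2 is not mistaken for "a".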

How do I Identify by row id the values in a data frame column not in another data frame column?

How do I identify by row id the values in data frame d2 column c3 that are not in data frame d1 column c1? My which function returns all records when sub-setting as shown. My requirement is to follow this sub set structure and not value$field design which works:
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = F)
d2 <- data.frame(c3, c4, stringsAsFactors = F)
x <- unique(d1["c1"])
y <- d2[,"c3"]
id <- which(!(y %in% x) ) # incorrect, all row ids returned
I am trying to find the ids of the rows in y whose value in the specified column does not appear in x.
I believe setdiff would work here. I see z and F are what you want, right? They are not in d1[,"c1"] but are in d2[,"c3"]
includes <- setdiff(d2[,"c3"], d1[,"c1"])
d2_new <- d2[d2[,"c3"] %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
# or
ids <- rownames(d2[d2[,"c3"] %in% includes,])
output
d2_new
# c3 c4 id
#2 z x 2
#4 z d 4
#6 F f 6
ids
#[1] "2" "4" "6"
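As for why the original which() call returned every row: d1["c1"] with single brackets returns a one-column data frame rather than a vector, so %in% is not comparing against the individual values. A minimal sketch of the difference, keeping the bracket-style subsetting the question asks for (x_df and x_vec are illustrative names):

```r
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = FALSE)
d2 <- data.frame(c3, c4, stringsAsFactors = FALSE)

x_df  <- unique(d1["c1"])    # still a data frame
x_vec <- unique(d1[, "c1"])  # a plain character vector
y <- d2[, "c3"]

is.data.frame(x_df)          # TRUE, which is why the comparison misbehaves
id <- which(!(y %in% x_vec)) # row ids of d2 whose c3 value is not in d1's c1
id
```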
I had the same problem, and this code worked for me. However, the bracket indexing did not work for me; with a slight change to $ indexing it worked perfectly.
includes <- setdiff(d2$c3, d1$c1)
d2_new <- d2[d2$c3 %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
Thank you @jpsmith

Filtering only unique value from multiple column in R

I have data like this:
X <- data.frame(fac_1 = c("A", "B", "C", "X", "Y"), fac_2 = c("B", "X", "P", "Q", "C"), fac_3 = c("C", "P", "Q", "T", "U"))
fac_1 fac_2 fac_3
A B C
B X P
C P Q
X Q T
Y C U
I want only those letters which are common
(1) between fac_1 and fac_2 (like B, C, X) and
(2) among fac_1, fac_2 and fac_3 (like C only)
You can use intersect
intersect(intersect(X$fac_1, X$fac_2), X$fac_3)
#[1] "C"
intersect(X$fac_1, X$fac_2)
#[1] "B" "C" "X"
Alternatively, the function Reduce can be used, as described by @docendo discimus in the comments.
Reduce(intersect, X)
#[1] "C"
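Reduce works here because a data frame is a list of columns, so Reduce(intersect, X) folds intersect over the columns from left to right. A short sketch showing that it agrees with the nested call, and that subsetting the columns first gives the pairwise case (nested, folded and pair are illustrative names):

```r
X <- data.frame(fac_1 = c("A", "B", "C", "X", "Y"),
                fac_2 = c("B", "X", "P", "Q", "C"),
                fac_3 = c("C", "P", "Q", "T", "U"),
                stringsAsFactors = FALSE)

# Nested form and folded form give the same result
nested <- intersect(intersect(X$fac_1, X$fac_2), X$fac_3)
folded <- Reduce(intersect, X)

# Restrict to two columns to get the values common to fac_1 and fac_2 only
pair <- Reduce(intersect, X[c("fac_1", "fac_2")])

folded
pair
```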

Create new variable condition on multiple variables R code

I have a data set named "dat".
TEAM1 TEAM2 WINNER
A P A
I S I
P S S
S I I
S P P
W P W
A E A
A S S
E A E
I want to create variable "LOSER" using R code. I have tried like this
Loser <- NULL
for (i in 1: nrow(dat)){
if(match(dat$Team1[i],dat$Winner)==TRUE){
Loser[i] <- cricket$Team2[i]
}else if(match(dat$Team1[i],dat$Winner)==FALSE ){
Loser[i] <- dat$Team1[i]
}
}
But this does not give the expected result. What is wrong with this code?
Desired output:
TEAM1 TEAM2 WINNER LOSER
A P A P
I S I S
P S S P
S I I S
S P P S
W P W P
A E A E
A S S A
E A E A
We can get the desired output by comparing the 'TEAM1' with the 'WINNER' column. Add 1 to it to coerce 'FALSE/TRUE' to '1/2'. This can be used as a column index. We can then cbind with row number and get the corresponding elements to create the 'LOSER' column
dat$LOSER <- dat[cbind(1:nrow(dat), with(dat, TEAM1 == WINNER) + 1)]
dat$LOSER
#[1] "P" "S" "P" "S" "S" "P" "E" "A" "A"
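To see how the matrix index works, here is a small sketch on three rows of the data. A two-column matrix used inside [ ] picks one cell per row: the first column is the row number, the second the column number. col_idx and idx are illustrative names:

```r
dat <- data.frame(TEAM1  = c("A", "P", "S"),
                  TEAM2  = c("P", "S", "I"),
                  WINNER = c("A", "S", "I"),
                  stringsAsFactors = FALSE)

# TRUE + 1 = 2 (TEAM1 won, so the loser is in column 2, TEAM2)
# FALSE + 1 = 1 (TEAM1 lost, so the loser is in column 1, TEAM1)
col_idx <- (dat$TEAM1 == dat$WINNER) + 1

# One (row, column) pair per row of dat
idx <- cbind(seq_len(nrow(dat)), col_idx)

loser <- dat[idx]
loser
```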
NOTE: Modified based on @David Arenburg's comments. Also, in the dataset, the 1st and 2nd columns are 'TEAM1' and 'TEAM2'. If we have a dataset with many columns and these are not in the 1st and 2nd positions, we can subset the dataset, as I showed in the comments, to have only those two columns:
dat$LOSER <- dat[paste0('TEAM', 1:2)][cbind(1:nrow(dat),
with(dat, TEAM1==WINNER)+1L)]
Another option using data.table. For TRUE values in TEAM1==WINNER, we assign (:=) 'LOSER' as 'TEAM2'. Then, we replace the NA values in 'LOSER' with 'TEAM1'
library(data.table)
setDT(dat)[TEAM1==WINNER, LOSER:= TEAM2][is.na(LOSER), LOSER:= TEAM1]
dat
data
dat <- structure(list(TEAM1 = c("A", "I", "P", "S", "S", "W", "A", "A",
"E"), TEAM2 = c("P", "S", "S", "I", "P", "P", "E", "S", "A"),
WINNER = c("A", "I", "S", "I", "P", "W", "A", "S", "E")),
.Names = c("TEAM1",
"TEAM2", "WINNER"), class = "data.frame", row.names = c(NA, -9L))
I couldn't resist writing a dplyr way.
library(dplyr)
dat %>%
mutate(LOSER = ifelse(TEAM1 == WINNER, TEAM2, TEAM1))
TEAM1 TEAM2 WINNER LOSER
1 A P A P
2 I S I S
3 P S S P
4 S I I S
5 S P P S
6 W P W P
7 A E A E
8 A S S A
9 E A E A
