Create new variable condition on multiple variables R code - r

I have a data set named "dat".
TEAM1 TEAM2 WINNER
A P A
I S I
P S S
S I I
S P P
W P W
A E A
A S S
E A E
I want to create a variable "LOSER" using R. I have tried this:
Loser <- NULL
for (i in 1: nrow(dat)){
if(match(dat$Team1[i],dat$Winner)==TRUE){
Loser[i] <- cricket$Team2[i]
}else if(match(dat$Team1[i],dat$Winner)==FALSE ){
Loser[i] <- dat$Team1[i]
}
}
But this does not give the expected result. What is wrong with this code?
Desired output:
TEAM1 TEAM2 WINNER LOSER
A P A P
I S I S
P S S P
S I I S
S P P S
W P W P
A E A E
A S S A
E A E A

We can get the desired output by comparing the 'TEAM1' column with the 'WINNER' column. Adding 1 to the comparison coerces 'FALSE/TRUE' to '1/2', which can be used as a column index. We then cbind it with the row numbers and extract the corresponding elements to create the 'LOSER' column:
dat$LOSER <- dat[cbind(1:nrow(dat), with(dat, TEAM1 == WINNER) + 1)]
dat$LOSER
#[1] "P" "S" "P" "S" "S" "P" "E" "A" "A"
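To make the matrix-indexing step easier to follow, here is a minimal sketch that breaks it into its intermediate pieces. The two-row data frame `toy` and the names `col_idx`, `idx` and `loser` are made up for illustration and are not part of the original answer.

```r
# Small illustrative data frame (hypothetical, not the asker's full data)
toy <- data.frame(TEAM1  = c("A", "I"),
                  TEAM2  = c("P", "S"),
                  WINNER = c("A", "S"),
                  stringsAsFactors = FALSE)

# TRUE where TEAM1 won, so the loser sits in column 2 (TEAM2);
# FALSE where TEAM2 won, so the loser sits in column 1 (TEAM1)
col_idx <- (toy$TEAM1 == toy$WINNER) + 1L   # 2 1

# One (row, column) pair per row of the data
idx <- cbind(seq_len(nrow(toy)), col_idx)

# Indexing a data frame with a two-column matrix picks one cell per pair
loser <- toy[idx]
loser
# [1] "P" "I"
```

This only works as written because 'TEAM1' and 'TEAM2' are the first two columns, which is the assumption the answer itself relies on.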
NOTE: Modified based on @David Arenburg's comments. In this dataset, 'TEAM1' and 'TEAM2' are the 1st and 2nd columns. If the dataset has many columns and these two are not in the 1st and 2nd positions, we can subset the dataset to just those two columns, as I showed in the comments:
dat$LOSER <- dat[paste0('TEAM', 1:2)][cbind(1:nrow(dat),
with(dat, TEAM1==WINNER)+1L)]
Another option using data.table. For TRUE values in TEAM1==WINNER, we assign (:=) 'LOSER' as 'TEAM2'. Then, we replace the NA values in 'LOSER' with 'TEAM1'
library(data.table)
setDT(dat)[TEAM1==WINNER, LOSER:= TEAM2][is.na(LOSER), LOSER:= TEAM1]
dat
data
dat <- structure(list(TEAM1 = c("A", "I", "P", "S", "S", "W", "A", "A",
"E"), TEAM2 = c("P", "S", "S", "I", "P", "P", "E", "S", "A"),
WINNER = c("A", "I", "S", "I", "P", "W", "A", "S", "E")),
.Names = c("TEAM1",
"TEAM2", "WINNER"), class = "data.frame", row.names = c(NA, -9L))

I couldn't resist writing a dplyr version.
library(dplyr)
dat %>%
mutate(LOSER = ifelse(TEAM1 == WINNER, TEAM2, TEAM1))
TEAM1 TEAM2 WINNER LOSER
1 A P A P
2 I S I S
3 P S S P
4 S I I S
5 S P P S
6 W P W P
7 A E A E
8 A S S A
9 E A E A
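As for what is wrong with the loop in the question: match() returns the position of the first match (or NA), not TRUE/FALSE, so comparing its result with TRUE or FALSE does not test whether TEAM1 won. In addition, column names are case-sensitive (dat$Team1 is not dat$TEAM1), and `cricket` is a different object from `dat`. A minimal corrected sketch of the loop, using only the first three rows of the example data, might look like this:

```r
# First three rows of the example data (subset chosen for brevity)
dat <- data.frame(TEAM1  = c("A", "I", "P"),
                  TEAM2  = c("P", "S", "S"),
                  WINNER = c("A", "I", "S"),
                  stringsAsFactors = FALSE)

# Pre-allocate the result instead of growing it from NULL
loser <- character(nrow(dat))

for (i in seq_len(nrow(dat))) {
  if (dat$TEAM1[i] == dat$WINNER[i]) {
    loser[i] <- dat$TEAM2[i]   # TEAM1 won, so TEAM2 lost
  } else {
    loser[i] <- dat$TEAM1[i]   # otherwise TEAM1 lost
  }
}

dat$LOSER <- loser
dat$LOSER
# [1] "P" "S" "P"
```

The vectorised answers above do the same comparison without an explicit loop and are usually preferable.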

Related

Looping in R with dynamic variables as dataframe names

I am trying to loop through dataframes where my search variable is in the name of the dataframe. Here I have multiple dataframes beginning with "person", "place", or "thing" and ending with either "5" or "8." I would like to loop through the many combinations of beginning and ending to create a temporary dataframe. The temporary dataframe will be used to create a plot and save the plot.
When I try my current code, I'm able to get the variable name to loop correctly (in other words, I can get "person_odds5" or "place_odds5"), but I cannot use those variables to access the corresponding column in the dataframe.
My current code is:
person_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=1:4, or_uci95=11:14, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
place_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=5:8, or_uci95=15:18, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
thing_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=9:12, or_uci95=19:22, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
nouns <- list("person", "place", "thing")
for (x in nouns) {
pval <- c(5)
for (p in pval) {
name <- paste(x,"_odds",p, sep="")
odds <- paste(name,"$odds", sep="")
temp_dat <- data.frame(odds=odds, index=1:nrow(name))
}
}
When I run this code, my output for "name" is "person_odds5" as character type; my output for "odds" is "person_odds5$odds" as character type, and I encounter "Error in 1:nrow(name) : argument of length 0." Basically, it appears that I can't parse my name assignment through the original dataframe.
Input:
>person_odds5
odds or_lci95 or_uci95 id.exposure id.outcome
1 a 1 11 f w
2 b 2 12 g x
3 c 3 13 h y
4 d 4 14 i z
>
Desired output:
>temp_dat
odds index
1 a 1
2 b 2
3 c 3
4 d 4
>
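The error arises because `name` holds a character string, and nrow() of a string is NULL. One possible way around this, sketched below under the assumption that the data frames live in the current environment, is to look each one up by name with get(). The `results` list and the shortened two-column data frames are illustrative only.

```r
# Shortened versions of the example data frames
person_odds5 <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 1:4,
                           stringsAsFactors = FALSE)
place_odds5  <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 5:8,
                           stringsAsFactors = FALSE)

nouns <- c("person", "place")
pval  <- 5

results <- list()   # hypothetical container for the temporary data frames
for (x in nouns) {
  for (p in pval) {
    name <- paste0(x, "_odds", p)
    src  <- get(name)   # fetch the data frame whose name is stored in `name`
    results[[name]] <- data.frame(odds  = src$odds,
                                  index = seq_len(nrow(src)))
  }
}

results$person_odds5
```

Storing the data frames in a named list from the start, and looping over that list, would avoid get() altogether and is often easier to maintain.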

How do I Identify by row id the values in a data frame column not in another data frame column?

How do I identify, by row id, the values in column c3 of data frame d2 that are not in column c1 of data frame d1? My which() call returns all rows when subsetting as shown. I am required to keep this bracket-subsetting style rather than the value$field style, which does work:
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = F)
d2 <- data.frame(c3, c4, stringsAsFactors = F)
x <- unique(d1["c1"])
y <- d2[,"c3"]
id <- which(!(y %in% x) ) # incorrect, all row ids returned
I am trying to find the id's of rows in y where the specified column does not include values of x
I believe setdiff would work here. I see z and F are what you want, right? They are not in d1[,"c1"] but are in d2[,"c3"]
includes <- setdiff(d2[,"c3"], d1[,"c1"])
d2_new <- d2[d2[,"c3"] %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
# or
ids <- rownames(d2[d2[,"c3"] %in% includes,])
output
d2_new
# c3 c4 id
#2 z x 2
#4 z d 4
#6 F f 6
ids
#[1] "2" "4" "6"
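If the which() structure from the question has to be kept, a small change may be enough. d1["c1"] returns a one-column data frame rather than a vector, which is likely why %in% finds no matches; d1[, "c1"] returns the character vector itself. A minimal sketch using the same example data:

```r
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(c3 = c("A", "z", "C", "z", "E", "F"),
                 c4 = c("a", "x", "x", "d", "e", "f"),
                 stringsAsFactors = FALSE)

x <- unique(d1[, "c1"])   # character vector, not a one-column data frame
y <- d2[, "c3"]

id <- which(!(y %in% x))  # row ids of d2 whose c3 value is not in d1's c1
id
# [1] 2 4 6
```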
I had the same problem, and this code worked for me. However, bracket indexing did not work for me. With a slight change it worked perfectly.
includes <- setdiff(d2$c3, d1$c1)
d2_new <- d2[d2$c3 %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
Thank you @jpsmith.

Loops: How can I loop case_when function in R?

Here's the code, where I am trying to create a variable by detecting the words and matching them. Here I use dplyr package and its function mutate in combination with case_when. The problem is I am adding each one of the values manually as you see. How can I automate it by applying some loop functions to match the two?
city <- LETTERS #26 cities
district <- letters[10:20] #11 districts
streets <- paste0(district, district)
streets <- streets[-c(5:26)] #4 streets
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
library(dplyr)
library(stringi)
df2 <- df %>%
mutate(districts = case_when(
stri_detect_fixed(address, "b") ~ "b", #address[1]
#address[2]
stri_detect_fixed(address, "a") ~ "a", #address[3]
#address[4]
stri_detect_fixed(address, "cc") ~ "cc" #address[5]
))
The code scans through address for the value in district vector. I would love to do the same for city and street variables. So I used the modified version of the code from another question in Stack Overflow. It produces an error.
for (j in town_village2) {
trn_house3[,93] <- case_when(
stri_detect_fixed(trn_house3[1:6469, 4], j) ~ j)
}
I seek to produce this result:
x address city district street
1 A, b, cc, A b cc
2 B, dd B NA dd
3 a, dd NA a dd
4 C C NA NA
5 D, a, cc D a cc
If you are going to add a loop, it makes no sense to use case_when(); you don't have to add all options into it if you can loop over them.
You can solve it with a for-loop:
library(stringi)
df2 <- df
for(c in city) df2$city[stri_detect_fixed(df2$address, c)] <- c
for(d in district) df2$district[stri_detect_fixed(df2$address, d)] <- d
for(s in streets) df2$street[stri_detect_fixed(df2$address, s)] <- s
Note that your example code didn't work; the district names are 'a' and 'b' in your example dataset, but you generate names 'j' through 't'. I fixed that in my code above.
It will also give wrong results if the names of cities, districts and/or streets overlap. For instance, if a row is in district 'b' and street 'cc', stri_detect_fixed will also find the 'c' inside 'cc' and conclude that the row is in district 'c'. I propose a completely different method to overcome this:
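The overlap problem can be reproduced in a couple of lines. This sketch uses base R's grepl(fixed = TRUE) as a stand-in for stri_detect_fixed so that it runs without extra packages; the variable names are illustrative.

```r
address <- c("A, b, cc,", "B, dd")

# Substring search: "c" is found inside the street name "cc"
hit_c <- grepl("c", address, fixed = TRUE)
hit_c
# [1]  TRUE FALSE

# Exact matching on the comma-separated parts avoids the false hit
parts   <- trimws(strsplit(address[1], ",", fixed = TRUE)[[1]])
exact_c <- "c" %in% parts
exact_c
# [1] FALSE
```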
Alternative method
Given your example data, it makes most sense to first split the given address on commas, then look for exact matches with your reference city/district/street names. We can look for those exact matches with intersect().
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# vectorize address into elements
address_elems = strsplit(df$address, ',') # split by comma
address_elems = lapply(address_elems, trimws) # trim whitespace; lapply always returns a list
Compare df$address and the newly created address_elems:
> df$address
[1] "A, b, cc," "B, dd" "a, dd" "C" "D, a, cc"
> address_elems
[[1]]
[1] "A" "b" "cc"
[[2]]
[1] "B" "dd"
[[3]]
[1] "a" "dd"
[[4]]
[1] "C"
[[5]]
[1] "D" "a" "cc"
We could find matching cities for just the first vector in address_elems with intersect(cities, address_elems[[1]]).
Because we might get multiple matches, we only take the first element, with intersect(cities, address_elems[[1]])[1]. Using [1] rather than [[1]] returns NA when nothing matches instead of raising an error.
To apply this to every vector in address_elems, we can use sapply() or lapply():
# intersect the respective reference lists with each list of
# address items, taking only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
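To illustrate what the [1] subsetting does in both the matching and the non-matching case, here is a minimal sketch. The shortened `cities` vector and the helper `first_city` are hypothetical names used only for this example.

```r
cities <- c("A", "B", "C")   # shortened reference list for illustration

# Return the first reference city found among the address parts, or NA
first_city <- function(parts) intersect(cities, parts)[1]

first_city(c("A", "b", "cc"))
# [1] "A"

first_city(c("a", "dd"))     # no city present: empty result, so [1] gives NA
# [1] NA
```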
Putting it all together, we get:
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# create vector of address elements
address_elems = strsplit(df$address, ',') # split by comma
address_elems = lapply(address_elems, trimws) # trim whitespace; lapply always returns a list
# intersect the respective reference lists with each list of
# address items, take only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
# cleanup
rm(address_elems)
This will separate the elements into vectors:
library(tidyverse)
df <- data.frame(
x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")
)
df3 <-
df %>%
separate_rows(address, sep = "[, ]+") %>%
filter(nchar(address) > 0) %>%
nest(address) %>%
transmute(x, districts = data %>% map(~ .x[[1]]))
#> Warning: All elements of `...` must be named.
#> Did you want `data = address`?
df3
#> # A tibble: 5 × 2
#> x districts
#> <int> <list>
#> 1 1 <chr [3]>
#> 2 2 <chr [2]>
#> 3 3 <chr [2]>
#> 4 4 <chr [1]>
#> 5 5 <chr [3]>
df3$districts[[1]]
#> [1] "A" "b" "cc"
Created on 2022-04-14 by the reprex package (v2.0.0)
A data.table approach:
library(data.table)
DT <- data.table(city, streets, district)
# create a lookup table with all elements
lookup <- melt(DT, measure.vars = names(DT))
# set df to data.table format
setDT(df)
final <- df[, .(address = unlist(tstrsplit(address, ",[ ]*", perl = TRUE))), by = .(x)]
# now add elements
final[lookup, type := i.variable, on = .(address = value)]
# and dcast to wide
dcast(final, x ~ type, value.var = "address")
# x city streets district
# 1: 1 A cc b
# 2: 2 B dd <NA>
# 3: 3 <NA> dd a
# 4: 4 C <NA> <NA>
# 5: 5 D cc a

search for next closest element not in a list

I am trying to replace 2 letters (repeats) from the 26-letter alphabet.
I already have 13 of the 26 letters in my table (keys), so the replacement letters must not be among those 13 'keys'.
I am trying to write code that replaces C and S with the next letter in the alphabet that is not part of 'keys'.
The following code replaces the repeat C with D and S with T, but both of those letters are in my 'keys'. How can I add a condition so that the loop keeps searching when the candidate replacement is already present in 'keys'?
# alphabets <- toupper(letters)
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")
index_of_repeat_in_26 <- which(repeats %in% alphabets)
# index_of_repeat_in_26 is 3 , 19
# available_keys <- setdiff(alphabets,keys)
available <- alphabets[available_keys]
# available <- c("B", "F", "G", "K", "O", "Q", "U", "V", "W", "Y", "Z")
index_available_keys <- which(alphabets %in% available_keys)
# 2 6 7 11 15 17 21 22 23 25 26
for (i in 1:length(repeat)){
for(j in 1:(26-sort(index_of_repeat_in_26)[1])){
if(index_of_repeat_in_26[i]+j %in% index_available_keys){
char_to_replace_in_key[i] <- alphabets[index_of_capital_repeat_in_26[i]+1]
}
else{
cat("\n keys not available to replace \n")
}
}
}
keys <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")
y = sort(setdiff(LETTERS, keys)) # get the letters not present in 'keys'
y = factor(y, levels = LETTERS) # make them factor so that we can do numeric comparisons with the levels
y1 = as.numeric(y) # keep them numeric to compare
z = factor(repeats, levels = LETTERS)
z1 = as.numeric(z)
func <- function(x) { # x is the index of one element of 'repeats' (here 1 or 2)
xx = y1 - z1[x] # difference between every non-key letter and the current 'repeats' element
xx = which(xx>0)[1] # first positive difference; 'y1' is sorted, so this is the nearest non-key that comes later
r = y[xx] # extract the corresponding non-key letter
y <<- y[-xx] # remove the chosen letter from the global list so that it is not picked again
y1 <<- y1[-xx] # likewise remove it from the equivalent numeric vector
r # return the extracted closest non-key letter
}
# sapply is effectively a loop that passes a single element to func at a time.
# Here 'seq_along' is used to pass the index, i.e. 1 for 'C', 2 for 'S', and so on.
ans = sapply(seq_along(repeats), func)
if (any(is.na(ans))){
cat("\n",paste0("keys not available to replace for ",
paste0(repeats[which(is.na(ans))], collapse = ",")) ,
"\n")
ans <- ans[!is.na(ans)]
}
# example 2 with :
repeats <- c("Y", "Z")
# output :
# keys not available to replace for Z
# ans
# [1] Z
Note: to understand how each iteration of sapply() works, run debug(func) and then run the sapply() call. You can then check in the console how each variable (xx, r) is evaluated. Hope this helps!
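For comparison, here is a sketch of the same idea written without factors or global assignment (`<<-`). It is an alternative formulation, not the answer's original method, and the function name `next_free_letters` is made up for this example.

```r
keys    <- c("I", "C", "P", "X", "H", "J", "S", "E", "T", "D", "A", "R", "L")
repeats <- c("C", "S")

next_free_letters <- function(repeats, keys) {
  free <- setdiff(LETTERS, keys)   # unused letters, still in alphabetical order
  out  <- character(length(repeats))
  for (i in seq_along(repeats)) {
    # unused letters that come after the current repeat in the alphabet
    later  <- free[match(free, LETTERS) > match(repeats[i], LETTERS)]
    out[i] <- if (length(later) > 0) later[1] else NA_character_
    free   <- setdiff(free, out[i])   # do not hand out the same letter twice
  }
  out
}

next_free_letters(repeats, keys)
# [1] "F" "U"

next_free_letters(c("Y", "Z"), keys)   # nothing is left after "Z"
# [1] "Z" NA
```

An NA in the result marks a repeat for which no later unused letter exists, which mirrors the "keys not available" message in the answer above.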

Filtering only unique value from multiple column in R

I have data like this:
X <- data.frame(fac_1 = c("A", "B", "C", "X", "Y"), fac_2 = c("B", "X", "P", "Q", "C"), fac_3 = c("C", "P", "Q", "T", "U"))
fac_1 fac_2 fac_3
A B C
B X P
C P Q
X Q T
Y C U
I want only the letters that are common
(1) between fac_1 and fac_2 (B, C, X) and
(2) among all of fac_1, fac_2 and fac_3 (only C)
You can use intersect
intersect(intersect(X$fac_1, X$fac_2), X$fac_3)
#[1] "C"
intersect(X$fac_1, X$fac_2)
#[1] "B" "C" "X"
Alternatively, the function Reduce can be used, as described by @docendo discimus in the comments.
Reduce(intersect, X)
#[1] "C"
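Reduce() here simply folds intersect() over the columns from left to right, so it should give the same result as the nested call. A minimal sketch that spells out the intermediate step (the names `step1`, `step2` and `all_common` are illustrative, and stringsAsFactors = FALSE is set explicitly so the columns are character vectors on any R version):

```r
X <- data.frame(fac_1 = c("A", "B", "C", "X", "Y"),
                fac_2 = c("B", "X", "P", "Q", "C"),
                fac_3 = c("C", "P", "Q", "T", "U"),
                stringsAsFactors = FALSE)

step1 <- intersect(X$fac_1, X$fac_2)   # common to the first two columns
step1
# [1] "B" "C" "X"

step2 <- intersect(step1, X$fac_3)     # then narrowed by the third column
step2
# [1] "C"

all_common <- Reduce(intersect, X)     # same fold, for any number of columns
all_common
# [1] "C"
```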
