If/else if on a text column in R - r

I'm a new to R and need help with the following.
Have
Need
Male_18_24_pn
18_24
Male_25_39_pn
25_39
Male_40_64_pn
40_64
Male_65_84_pn
65_84
Male_85_plus_pn
85_plus
Female_18_24_pn
18_24
I need to create the "Need" column using the "Have" column, wondering how I can achieve this in R. As a initial effort, I tried the following code to test but got warning message and every cell of "Need" populated with "18_24":
if (str_detect(pe_1P_new$Have,"18_24")) {pe_1P_new$Need= "18_24"}
Warning message:
In if (str_detect(pe_1P_new$Have, "18_24")) { :
the condition has length > 1 and only the first element will be used
Your help is greatly appreciated. Thanks in advance!

You can do:
require(data.table)
dt = data.table(
have = c("Male_18_24_pn", "Male_25_39_pn",
"Male_40_64_pn", "Male_65_84_pn",
"Male_85_plus_pn", "Female_18_24_pn")
)
dt[ , need := gsub('(Male|Female)_(.+)(_pn)', '\\2', have) ]
Base R solution:
dt$need = gsub('(Male|Female)_(.+)(_pn)', '\\2', dt$have)
No need for a loop or any conditional statements. You can extract the needed information with the help of a vectorized function, such as gsub() and a simple regex.
[Volunteer edit] This is an attempt to explain the regular expression logic of that substitution pattern (see ?regex for a very terse but complete description). You need to understand what a capture class can do.
gsub('(Male|Female)#This matches "Male" or "Female", the first "capture classes"
_(.+)# Second capture class, matching anything after an underscore
(_pn)',# ... up to but not including an "_pn"
'\\2', # replace anything matched with only the second capture class
dt$have)
You will not be able to run this version because the carriage retruns and spaces get in the way of the regex engine process.

We could use trimws as well
dt[, need := trimws(have, whitespace = '[[:alpha:]]+_|_[[:alpha:]]+')]

We can try gsub like below
> dt[, need := gsub(".*?_(.*)_.*", "\\1", have)][]
have need
1: Male_18_24_pn 18_24
2: Male_25_39_pn 25_39
3: Male_40_64_pn 40_64
4: Male_65_84_pn 65_84
5: Male_85_plus_pn 85_plus
6: Female_18_24_pn 18_24

You can use the package {unglue}
df <- data.frame(
Have = c("Male_18_24_pn", "Male_25_39_pn",
"Male_40_64_pn", "Male_65_84_pn",
"Male_85_plus_pn", "Female_18_24_pn")
)
library(unglue)
unglue_unnest(df, Have, "{}_{Need}_pn", remove = FALSE)
#> Have Need
#> 1 Male_18_24_pn 18_24
#> 2 Male_25_39_pn 25_39
#> 3 Male_40_64_pn 40_64
#> 4 Male_65_84_pn 65_84
#> 5 Male_85_plus_pn 85_plus
#> 6 Female_18_24_pn 18_24
Created on 2021-07-20 by the reprex package (v0.3.0)

Related

Getting different results from the same function inside and outside a mutate function call

Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2
convertToDisplayTime <- function(timeInSeconds){
## Takes a time in seconds and converts it
## to a xx:xx:xx string format
if(timeInSeconds>86400){ #Not handling time over a day
stop(simpleError("Enter a time below 86400 seconds (1 day)"))
} else if(timeInSeconds>3600){
numberOfMinutes = 0
numberOfHours = timeInSeconds%/%3600
remainingSeconds = timeInSeconds%%3600
if(remainingSeconds>60){
numberOfMinutes = remainingSeconds%/%60
remainingSeconds = remainingSeconds%%60
}
if(numberOfMinutes<10){displayMinutes = paste0("0",numberOfMinutes)}
else{displayMinutes = numberOfMinutes}
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfHours,":",displayMinutes,":", displaySeconds))
} else if(timeInSeconds>60){
numberOfMinutes = timeInSeconds%/%60
remainingSeconds = timeInSeconds%%60
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfMinutes,":", displaySeconds))
} else{
return(paste0("0:",timeInSeconds))
}
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
Like mentioned in the comments, the problem here is that your function is not vectorized: it works with a single value for an input and outputs a single value. However, this does not work when the input is a vector of values, hence the condition has length 1 warning you get:
1: Problem with `mutate()` column `displayTime`.\
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you're technically trying to feed a vector to your function, which is not formatted to process it.
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
rowwise() %>%
mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
ungroup()
dplyr::rowwise() allows dplyr::mutate() to work on each row independently, rather than by columns. I assume this is the behavior you initially expected. dplyr::ungroup() sorta reverts rowwise, eg. go back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))
## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))
# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option, as once it is implemented you can use it either on single variables, vectors or dataframes, without worring about it.
A little documentation to dig a little into the subject.
PS: As an aside, a little tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without your noticing...

R: Replace all Values that are not equal to a set of values

All.
I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 thru Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing...if it werent for the number of obs, I would have loaded it up in Excel, filtered all the bad customers out and copy/pasted the text 'Supported Customer' over what remained.
Ive tried replace and str_replace_all using grepl and paste/paste0 but to no avail. my current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer, paste0("[^", paste(out, collapse =
"|"), "]"), "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all obs that dont match the list of out customers. Something like this:
in <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues though. At this large number of patterns, paste no longer collapses using the "|" operator. instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
customer <- sample(paste0("Customer",1:300),5000,replace = TRUE)
orderData <- data.frame(customer = sample(paste0("Customer",1:300),5000,replace = TRUE),stringsAsFactors = FALSE)
orderData <- cbind(orderData,matrix(runif(0,100,n=5000*30),ncol=30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---

R, getting an invalid argument to unary operator when using order function

I'm essentially doing the exact same thing 3 times, and when adding a new variable I get this error
Error in -emps$EV : invalid argument to unary operator
The code chunk causing this is
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
Works like a charm for the first list, but the identical code thereafter causes the error.
This specific line is causing the error
sort3<-emps[order(-emps$EV),]
How can I fix/workaround this?
Full Code
url <- getURL("https://raw.githubusercontent.com/M-ttM/Basketball/master/class.csv")
shots <- read.csv(text = url)
shots$make<-shots$points>0
shots2<-shots[which(!(shots$player=="Luc Richard Mbah a Moute")),]
fit1<-glm(make~factor(type)+factor(period), data=shots2,family="binomial")
summary(fit1)
shots2$makeodds<-fitted(fit1)
shots2$EV<-shots2$makeodds*ifelse(shots2$type=="3pt",3,2)
shots3<-shots2[which(shots2$y>7),]
locmakes<-data.frame(table(shots3[, c("x", "y")]))
s1k <- shots2[with(shots2, player %in% names(which(table(player)>=1000))), ]
pps<-aggregate(points~player,s1k,mean)
sort<-pps[order(-PPS$points),]
head(sort,10)
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
The error message seems to occur when trying to order columns including chr type data. A possible workaround is to use the reverse function rev() instead of the minus sign, like so:
column_a = c("a","a","b","b","c","c")
column_b = seq(6)
df = data.frame(column_a, column_b)
df$column_a = as.character(df$column_a)
df[with(df, order(-column_a, column_b)),]
> Error in -column_a : invalid argument to unary operator
df[with(df, order(rev(column_a), column_b)),]
column_a column_b
5 c 5
6 c 6
3 b 3
4 b 4
1 a 1
2 a 2
Let me know if it works in your case.
On this line, emps$EV doesn't exist.
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
You probably meant
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EM),]
head(sort3,10)

Convert several levels of factor into 2 in R

I have 197 levels relating to location, I want to simplify this by creating a new variable "INSIDE" which stores 1 when location is a building/home/etc and 0 when location is outside. I have tried grepl() but it gives an error
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE")),1,0)
Warning message:
In grepl(crime_3yr$Premise.Description, pattern = c("BUILDING", :
argument 'pattern' has length > 1 and only the first element will be used
I have tried using lapply() but it did not work too.
I want the output to be like this:
BUILDING 1
SHOP 1
Street 0
grepl takes a regex instead of a list of options, try this:
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = "BUILDING|ROOM|AUTO|BALCONY|BANK|BAR|STORE|CHURCH|COLLEGE|CONDOMINIUM|CENTER|DAY CARE|SCHOOL|HOSPITAL|LIBRARY|PARLOR|OFFICE|MOSQUE|CLUB|PORCH|MALL|WAREHOUSE"),1,0)
If you want to keep the code similar to what you listed you need to look into regular expressions which is what the pattern part of the grepl needs to be.
data$Inside<-ifelse(grepl(data$Premise.Description,pattern = "BUILDING|ROOM|AUTO|BALCONY|BANK|BAR|STORE|CHURCH|COLLEGE|CONDOMINIUM|CENTER|DAY CARE|SCHOOL|HOSPITAL|LIBRARY|PARLOR|OFFICE|MOSQUE|CLUB|PORCH|MALL|WAREHOUSE"),1,0)
Try this code:
Your data.frame:
data<-data.frame(Premise.Description= c("BUILDING 1","MY ROOM","AUTO","BALCONY","OTHER"))
The solution:
toMatch<-c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE")
data$Inside<-grepl(paste(toMatch,collapse="|"), data$Premise.Description)
data
Premise.Description Inside
1 BUILDING 1 TRUE
2 MY ROOM TRUE
3 AUTO TRUE
4 BALCONY TRUE
5 OTHER FALSE
You might be better off using data.table:
library(data.table)
setDT(data)
data[
grepl(c("BUILDING","ROOM","AUTO","BALCONY","BANK","BAR","STORE","CHURCH","COLLEGE","CONDOMINIUM","CENTER","DAY CARE","SCHOOL","HOSPITAL","LIBRARY","PARLOR","OFFICE","MOSQUE","CLUB","PORCH","MALL","WAREHOUSE"), Premise),
Inside := TRUE
]

Why does conditional assignment not occur in data table join using binary search?

I want to assign to a column of a data table if some condition holds true. It occurs when using vector scan but not when using binary search. Could you explain the cause of this?
dt['1291703']$test = 'ali'
# Error in `[<-.data.table`(`*tmp*`, "1291703", value = list(filename =
# "1291703", : i[1] is NA. Can't assign by reference to row 'NA'.
dt[cik=='1291703']$test = 'ali'
dt[cik=='1291703']
## filename cik signatureDate test
## 1: 0000919574-09-007207 1291703 2009-03-12 ali
## 2: 0000919574-09-007310 1291703 2009-03-19 ali
Without additional information, it seems that the problem with your binary search attempt is that you are trying to assign ali to temp column on row number 1291703, which probably doesn't exist.
In binary search, you should first of all remember to key your data set by the column you want to search in. The second thing to understand is, that when you are passing an integer to i, it is searching for that row number by default, thus you should use J() (or .() in version 1.9.4+) instead.
You should also use := instead of $ in order to assign by reference. Thus the solution should look something like this:
setkey(dt, cik)[J(1291703), test := 'ali']
dt
# filename cik signatureDate test
# 1: 0000919574-09-007207 1291703 2009-03-12 ali
# 2: 0000919574-09-007310 1291703 2009-03-19 ali
If cik is of class character (which is unclear from your question), the following should also work
dt[, test := NULL] # Removing `test` for illustration
dt[, cik := as.character(cik)] # Converting `cik` to class `character`, also for illustration
setkey(dt, cik)['1291703', test := 'ali'] # Implementing binary search without using `J()`
dt
# filename cik signatureDate test
# 1: 0000919574-09-007207 1291703 2009-03-12 ali
# 2: 0000919574-09-007310 1291703 2009-03-19 ali

Resources