How to replicate SPSS recode function in R?

The aim is to simulate the SPSS RECODE procedure in R. The copy keyword is hard to translate.
In SPSS I have this code:
RECODE A (1,2 = 1) (3,4 = copy) (8 thru hi = 3) (else = 1) into B.
Applied over A which looks like
A <- c(1,2,3,4,5,NA,7,8,9)
I get the following (SPSS) result:
A = 1,1,3,4,1,1,1,3,3
In R a similar code would look like this:
B <- Recode(A, recodes = ("c(1,2) = 1; c(3,4) = c(3,4); c(8,9) = 3; else = 1"), as.numeric.result = TRUE)
A = 1,1,3,4,1,1,1,3,3
The general problem is how to specify the values in the SPSS copy statement. Here I wrote c(3,4) = c(3,4), which of course doesn't work.
SPSS also lets you write else = copy, which returns the same output as R does.
Does anyone have an R function that works the same way as SPSS?

Use the levels function. Here's an example with a built-in data set:
InsectSprays
levels(InsectSprays$spray)<-list(new1=c("A","C"),YEPS=c("B","D","E"),LASTLY="F")
InsectSprays
Use this to reset the data set:
InsectSprays <- datasets::InsectSprays

You can combine ifelse and car::recode to achieve the result you want.
library(car)
A <- c(1,2,3,4,5,NA,7,8,9)
B <- ifelse(A %in% c(3,4), A, recode(A, "c(1,2) = 1; 8:hi = 3; else = 1"))
cbind(A, B)
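For comparison, here is a minimal base R sketch of the same RECODE without any package. It assumes the conditions do not overlap (SPSS applies the first matching rule) and it treats NA as part of else, which matches the SPSS result quoted in the question. The helper variable copy is introduced here only for illustration.

```r
A <- c(1, 2, 3, 4, 5, NA, 7, 8, 9)

B <- rep(1, length(A))          # (else = 1), also covers 1, 2 and NA
copy <- A %in% c(3, 4)          # rows that SPSS would copy unchanged
B[copy] <- A[copy]              # (3,4 = copy)
B[!is.na(A) & A >= 8] <- 3      # (8 thru hi = 3)

cbind(A, B)
```

Starting from the else value and then overwriting the special cases keeps each SPSS rule on its own line.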

You might want to check out the car package. Unfortunately, there is no "copy" functionality available.
library(car)
?recode
A <- c(1,2,3,4,5,NA,7,8,9)
B <- recode(A, "c(1,2) = 1; 3 = 3; 4 = 4; 8:hi = 3; else = 1")
B
## SPSS result: A = 1,1,3,4,1,1,1,3,3
## > B
## [1] 1 1 3 4 1 1 1 3 3
## >

library(expss)
a = c(1,2,3,4,5,NA,7,8,9)
# '%into%' supports multi-value assignment, eg: ... %into% (a1 %to% a3)
recode(a, 1:2 ~ 1, 3:4 ~ copy, 8 %thru% hi ~ 3, other ~ 1) %into% b
b
# 1 1 3 4 1 1 1 3 3
or, with standard R assignment:
b = recode(a, 1:2 ~ 1, 3:4 ~ copy, 8 %thru% hi ~ 3, other ~ 1)
b
# 1 1 3 4 1 1 1 3 3

Related

How can I map a function to apply to only certain rows in a data frame?

I have a function that I want to iterate over only certain rows of my dataset, and then save the results in a variable in the dataset.
So for example say I have this set up:
library(tidyverse)
add_one <- function(vector, x_id){
return(vector[x_id] + 1)
}
test <- data.frame(x = c(1,2,3,4), y = c(1,2,3,4), run_on = c(TRUE,FALSE,TRUE,FALSE))
test
So the test data frame looks like:
> x y run_on
>1 1 1 TRUE
>2 2 2 FALSE
>3 3 3 TRUE
>4 4 4 FALSE
So what I want to do is iterate over the dataframe and set the y column to be the result of applying the function add_one() to the x column for just the rows where run_on is TRUE. I want the end result to look like this:
> x y run_on
>1 1 2 TRUE
>2 2 2 FALSE
>3 3 4 TRUE
>4 4 4 FALSE
I have been able to iterate the function over all of the rows using apply(). So for example:
test$y <- apply(test,1,add_one,x_id = 1)
test
> x y run_on
>1 1 2 TRUE
>2 2 3 FALSE
>3 3 4 TRUE
>4 4 5 FALSE
But this also applies the function to rows 2 and 4, which I do not want. I suspect there may be some way to do this using one of the map() functions from purrr, which is why I tagged this post as such.
In reality, I am using this kind of procedure to repeatedly iterate over a large dataset multiple times, so I need it to be done automatically and cleanly. Any help or suggestions would be very much appreciated.
UPDATE
I managed to find a solution. Some of the solutions offered here did work in my toy example but did not extend to the more complex function I was actually using. Ultimately what worked was something similar to what tmfmnk suggested. I just wrapped the original function inside another function that included an if statement to determine whether or not to apply the original function. So to extend my toy example, my solution looks like this:
add_one_if <- function(vector, x_id, y_id, run_on_id){
if(vector[run_on_id]){
return(add_one(vector,x_id))}
else{
return(vector[x_id])
}
}
test$y <- apply(test, 1, add_one_if, x_id = 1, y_id = 2, run_on_id = 3)
It seems a little convoluted, but it worked for me and is reproducible and reliable in the way I need it to be.
You can also do:
add_one <- function(data, vector, x_id, n, is.true = c(TRUE, FALSE)) {
if (is.true) {
return(data[[vector]] + (data[[x_id]]) * n)
} else {
return(data[[vector]] + (!data[[x_id]]) * n)
}
}
add_one(test, vector = "y", x_id = "run_on", 1, is.true = TRUE)
[1] 2 2 4 4
add_one(test, vector = "y", x_id = "run_on", 5, is.true = FALSE)
[1] 1 7 3 9
It may be that your real case is more complicated than allowed by this, but why not just use ifelse?
# add_one() here is the question's add_one(vector, x_id)
test$y <- ifelse(test$run_on, add_one(test$x, seq_along(test$x)), test$y)
Or even:
test$y[test$run_on] <- add_one(test$x, which(test$run_on))
You won't need to use purrr until you are applying the same function to multiple columns. Since you want to modify only one column, based on a condition, you can use mutate() + case_when().
mutate(test, y = case_when(run_on ~ add_one(y),
!run_on ~ y))
#> x y run_on
#> 1 1 2 TRUE
#> 2 2 2 FALSE
#> 3 3 4 TRUE
#> 4 4 4 FALSE
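If the real function is not vectorized, a base R sketch along these lines may also work: loop over only the row indices where run_on is TRUE and assign back into those rows. It uses the question's add_one(vector, x_id); the rows variable is introduced here for illustration.

```r
add_one <- function(vector, x_id) {
  return(vector[x_id] + 1)
}
test <- data.frame(x = c(1, 2, 3, 4), y = c(1, 2, 3, 4),
                   run_on = c(TRUE, FALSE, TRUE, FALSE))

rows <- which(test$run_on)   # indices of the rows to process
# Call the function once per selected row; vapply checks each result is one number
test$y[rows] <- vapply(rows, function(i) add_one(test$x, i), numeric(1))
test
```

Rows where run_on is FALSE are never passed to the function, so no wrapper with an if statement is needed.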

Expand grid (or power set) with constraints

This question might be too general, but I feel it comes up again and again in my work and thus is probably of interest to others.
Suppose I want to create a data table (or matrix) that is based off of expanding a grid.
library(data.table)
Vmat1 = data.table(expand.grid(c(list(d = 1:5, w = 1:(3)))))
Suppose however, that if I were to do this, this would result in a memory error for the true power set. However, there are constraints that I want to impose, for example:
If w>1, then it must be that d<3
This then gives a smaller final set that would not result in a memory error:
Vmat1[w>1 & d<3 | w==1]
d w
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 1 2
7: 2 2
8: 1 3
9: 2 3
My question is, is it possible to ex ante impose the restriction when creating the grid? It is too costly to build the full power set and then condition and reduce.
If your concern is memory you can split into sets as @chinsoon suggests.
# Filter Method
V = CJ(d = 1:5, w = 1:3) # same as data.table(expand.grid(c(list(d = 1:5, w = 1:(3))))) except ordering
a <- V[w>1 & d<3 | w==1]
# Sets Method
d <- 1:5
w <- 1:3
b <- rbindlist(list(
CJ(d = d[d < 3], w = w[w > 1])
, CJ(d = d, w = w[w == 1])))
all(a == b[order(d, w)])
# [1] TRUE
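The sets method can be written more generally by building one small grid per value of w and stacking them, so the full grid is never materialized. This is only a sketch for the constraint in the question; allowed_d is a hypothetical helper introduced here.

```r
library(data.table)

d <- 1:5
w <- 1:3

# Hypothetical helper: the d values permitted for a given w
# under the rule "if w > 1 then d < 3"
allowed_d <- function(wi) if (wi > 1) d[d < 3] else d

# One cross join per w value, then stack the pieces
grid <- rbindlist(lapply(w, function(wi) CJ(d = allowed_d(wi), w = wi)))
grid
```

Peak memory is then driven by the largest per-w piece plus the final result, not by the unconstrained grid.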

Nested ifelse: improved syntax

Description
The ifelse() function lets you filter the values in a vector through a series of tests, each of them producing a different action in case of a positive result. For instance, let xx be a data.frame, as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx
a b
1 1
2 2
1 3
3 4
Suppose that you want to create a new column, c, from column b, but depending on the values in column a in the following way:
For each row,
if the value in column a is 1, the value in column c, is the same value in column b.
if the value in column a is 2, the value in column c, is 100 times the value in column b.
in any other case, the value in column c is the negative of the value in column b.
Using ifelse(), a solution could be:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
-xx$b))
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
Problem 1
An aesthetic problem arises when the number of tests increases, say, four tests:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
ifelse(xx$a==3, ...,
ifelse(xx$a==4, ...,
...))))
I found a partial solution to the problem on this page, which consists of defining the functions if.else_(), i_(), e_(), as follows:
library(lazyeval)
i_ <- function(if_stat, then) {
if_stat <- lazyeval::expr_text(if_stat)
then <- lazyeval::expr_text(then)
sprintf("ifelse(%s, %s, ", if_stat, then)
}
e_ <- function(else_ret) {
else_ret <- lazyeval::expr_text(else_ret)
else_ret
}
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string))
}
In this way, the problem given in the Description, can be rewritten as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
e_(-xx$b)
)
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
And the code for the four tests will simply be:
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
i_(xx$a==3, ...), # dots meaning actions for xx$a==3
i_(xx$a==4, ...), # dots meaning actions for xx$a==4
e_(...) # dots meaning actions for any other case
)
Problem 2 & Question
The given code apparently solves the problem. Then, I wrote the following test function:
test.ie <- function() {
dd <- data.frame(a=c(1,2,1,3), b=1:4)
if.else_(
i_(dd$a==1, dd$b),
i_(dd$a==2, dd$b*100),
e_(-dd$b)
) # it should give c(1, 200, 3, -4)
}
When I tried the test:
test.ie()
it spit the following error message:
Error in ifelse(dd$a == 1, dd$b, ifelse(dd$a == 2, dd$b * 100, -dd$b)) :
object 'dd' not found
Question
Since the if.else_() syntactic constructor is not supposed to run only from the console, is there a way for it to 'know' the variables from the function that calls it?
Note
In "Best way to replace a lengthy ifelse structure in R", a similar problem was posted. However, the given solution there focuses on building the table's new column with the given constant output values (the "then" or "else" slots of the ifelse() function), whereas my case addresses a syntactic problem in which the "then" or "else" slots can even be expressions in terms of other data.frame elements or variables.
I think you can use dplyr::case_when inside dplyr::mutate to achieve this.
library(dplyr)
df <- tibble(a=c(1,2,1,3), b=1:4)
df %>%
mutate(
foo = case_when(
.$a == 1 ~ .$b,
.$a == 2 ~ .$b * 100L,
TRUE ~ .$b * -1L
)
)
#> # A tibble: 4 x 3
#> a b foo
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 2 2 200
#> 3 1 3 3
#> 4 3 4 -4
In the upcoming release of dplyr 0.6.0 you won't need to use the awkward work-around of .$, and you can just use:
df %>%
mutate(
foo = case_when(
a == 1 ~ b,
a == 2 ~ b * 100L,
TRUE ~ b * -1L
)
)
Taking into account MrFlick's advice, I re-coded the if.else_() function as follows:
if.else_ <- function(...) {
args <- list(...)
for (i in 1:(length(args) - 1) ) {
if (substr(args[[i]], 1, 6) != "ifelse") {
stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
}
}
if (substr(args[[length(args)]], 1, 6) == "ifelse"){
stop("Last argument needs to be an else_ function.", call. = FALSE)
}
args$final <- paste(rep(')', length(args) - 1), collapse = '')
eval_string <- do.call('paste', args)
eval(parse(text = eval_string), envir = parent.frame())
}
Now the test.ie() function runs properly:
test.ie()
[1] 1 200 3 -4
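The difference that envir = parent.frame() makes can be seen in a stripped-down sketch. All function names below are hypothetical, and it assumes no global variable called local_value exists.

```r
# Evaluates in eval_here()'s own frame, so the caller's local variables are invisible
eval_here <- function(txt) eval(parse(text = txt))

# Evaluates in the frame of whoever called eval_in_caller()
eval_in_caller <- function(txt) eval(parse(text = txt), envir = parent.frame())

caller <- function(f) {
  local_value <- 42   # exists only inside caller()
  f("local_value * 2")
}

# Fails to find local_value; the error message is captured as a string
res_here <- tryCatch(caller(eval_here), error = function(e) conditionMessage(e))
# Finds local_value in caller()'s frame
res_caller <- caller(eval_in_caller)

res_here
res_caller
```

This is the same situation as test.ie(): dd lives in the calling function, so the string has to be evaluated there.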
With full respect to the OP's remarkable effort to improve nested ifelse(), I prefer a different approach which I believe is easy to write, concise, maintainable and fast:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
library(data.table)
# coerce to data.table, and set the default first
setDT(xx)[, c:= -b]
xx[a == 1L, c := b] # 1st special case
xx[a == 2L, c := 100L*b] # 2nd special case, note use of integer 100L
# xx[a == 3L, c := ...] # other cases
# xx[a == 4L, c := ...]
#...
xx
# a b c
#1: 1 1 1
#2: 2 2 200
#3: 1 3 3
#4: 3 4 -4
Note that for the 2nd special case b is multiplied by the integer constant 100L to make sure that the right hand sides are all of type integer in order to avoid type conversion to double.
Edit 2: This can also be written in an even more concise (but still maintainable) way as a one-liner:
setDT(xx)[, c:= -b][a == 1L, c := b][a == 2L, c := 100L*b][]
data.table chaining works here, because c is updated in place so that subsequent expressions are acting on all rows of xx even if the previous expression was a selective update of a subset of rows.
Edit 1: This approach can be implemented with base R as well:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
xx$c <- -xx$b
idx <- xx$a == 1L; xx$c[idx] <- xx$b[idx]
idx <- xx$a == 2L; xx$c[idx] <- 100 * xx$b[idx]
xx
# a b c
#1 1 1 1
#2 2 2 200
#3 1 3 3
#4 3 4 -4

R - How to Create Custom Ifelse function that repeats

I am quite familiar with R's standard ifelse statement, and how to create nested ifelse statements. However, I want to create a "better" version, so that I don't have to copy / paste ifelse so many times.
Take this nested ifelse statement for example:
df <- data.frame(b = 1:5)
df$a <- ifelse(df$b == 1,1,
ifelse(df$b == 2,2,
ifelse(df$b == 3,3,4)))
Instead, what I would like to do is create a function that I could call like this:
df$a <- myFunction(df$b == 1,1,
df$b == 2,2,
df$b == 3,3,4)
I would want the function to be able to pick up how many arguments I have entered, and thus know how many ifelse statements to include and then plug the arguments into the correct position, up to however many I want.
There is still some repetition, but when creating longer nested ifelse statements it would be nice to not have to repeat that piece of code, and then try to keep track of closing parentheses.
We can use Reduce() to build up the required parse tree of nested ifelse() calls and then eval() it:
ifelses <- function(...) {
## validate number of args is at least 3 and odd
stopifnot(nargs()>=3L);
stopifnot(nargs()%%2L==1L);
## precompute the required number of calls and the argument parse tree list
num <- (nargs()-1L)%/%2L;
cl <- match.call();
## build up the parse tree of nested ifelse() calls using Reduce(), then eval() it
## terminology (following docs): ifelse(test,yes,no)
eval(Reduce(
function(i,noArg) call('ifelse',cl[[i]],cl[[i+1L]],noArg),
seq(2L,by=2L,len=num), ## indexes of "test" args
cl[[length(cl)]], ## first (innermost) "no" arg
T ## proceed from right-to-left, IOW inside-out wrt parse tree
));
}; ## end ifelses()
Useful docs:
nargs()
stopifnot()
match.call()
Reduce()
call()
eval()
seq()
ifelse()
Demo:
ifelses(c(F,T,F,F),1:4,c(T,F,F,F),5:8,c(F,T,F,T),9:12,13:16);
## [1] 5 2 15 12
OP's example:
df <- data.frame(b=1:5);
df$a <- ifelses(df$b==1L,1L,df$b==2L,2L,df$b==3L,3L,4L);
df;
## b a
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 4
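A variant of the same idea that avoids match.call() and eval() is to fold over the already evaluated arguments. This is a sketch only: ifelse_chain is a hypothetical name, and unlike the version above all arguments are evaluated eagerly before any ifelse() runs.

```r
# Arguments are test1, yes1, test2, yes2, ..., final_no
ifelse_chain <- function(...) {
  args <- list(...)
  stopifnot(length(args) >= 3L, length(args) %% 2L == 1L)
  test_idx <- seq(1L, length(args) - 1L, by = 2L)   # positions of the tests
  # Fold from the innermost (last) test outwards, starting with the final "no" value
  Reduce(
    function(no, i) ifelse(args[[i]], args[[i + 1L]], no),
    rev(test_idx),
    args[[length(args)]]
  )
}

df <- data.frame(b = 1:5)
df$a <- ifelse_chain(df$b == 1, 1, df$b == 2, 2, df$b == 3, 3, 4)
df$a
```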
This is a job for merging with a lookup table. You can wrap that in a function, but usually I wouldn't bother:
df <- data.frame(b = 1:5)
lookupif <- function(df, x, y, else.val = NA, on.col, res.col = "val") {
lookup <- data.frame(x, y)
names(lookup)[1] <- res.col
df <- merge(df, lookup, by.x = on.col, by.y = "y", all.x = TRUE)
df[is.na(df[[res.col]]), res.col] <- else.val
df
}
lookupif(df, 1:3, 1:3, 4, "b")
# b val
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 4
dplyr::case_when is a cascading alternative to nested ifelses, e.g.
library(dplyr)
df <- data.frame(b = 1:5)
df %>% mutate(a = case_when(b == 1 ~ 1,
b == 2 ~ 2,
b == 3 ~ 3,
TRUE ~ 4))
#> b a
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 4
or just steal it and put it in base syntax:
df$a <- with(df, dplyr::case_when(b == 1 ~ 1,
b == 2 ~ 2,
b == 3 ~ 3,
TRUE ~ 4))
which returns the same thing.
Since it's already about as simple as you can get without sacrificing the versatility of ifelse, it may not need to be put into a function, but it could, if you like. Using the development version's new rlang NSE syntax,
add_cases <- function(.data, .col, ...){
.data %>% mutate(!!.col := case_when(!!!quos(...)))
}
df %>% add_cases(.col = 'a',
b == 1 ~ 1,
b == 2 ~ 2,
b == 3 ~ 3,
TRUE ~ 4)
#> b a
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 4
Sorry for the shameless advertisement, but you can try the if_val function in my package expss:
library(expss)
b = sample(1:7, 10, replace = TRUE)
if_val(b, 1 ~ 1, 2 ~ 2, 3 ~ 3, other ~ 4)
There is also the ifs function: ifs(b==1 ~ 1, b==2 ~ 2, b==3 ~ 3, TRUE ~ 4).

How to use lists?

I am trying to use lists in R as dictionaries in computing winning percentages for basketball teams. Basically, for each win, I'd like to increment the appropriate dictionary amount, and for each game, I'd like to increment the appropriate dictionary amount. Somehow, the answers I'm getting seem reasonable but are incorrect, and I can't figure out why the program logically doesn't give the expected outputs. Any suggestions or tips would be appreciated. The code I'm using is below:
games <- read.csv(game_pathname, header = FALSE)
names(games) <- c("GameDate", "DateCount", "HomeID", "AwayID", "HomePts", "AwayPts", "HomeAbbr", "AwayAbbre", "HomeName", "AwayName")
wins = list()
total = list()
for (team in unique(games$HomeName)) {
wins[team] <- 0
total[team] <- 0
}
for (i in 1:nrow(games)) {
if (games$HomePts[i] > games$AwayPts[i]) {
wins[games$HomeName[i]] <- wins[[games$HomeName[i]]] + 1
} else {
wins[games$AwayName[i]] <- wins[[games$AwayName[i]]] + 1
}
total[games$HomeName[i]] <- total[[games$HomeName[i]]] + 1
total[games$AwayName[i]] <- total[[games$AwayName[i]]] + 1
}
for (team in unique(games$HomeName)) {
print(paste(team, wins[[team]] / total[[team]]))
}
Looking at the code and running it on a toy example, I found no problem in the algorithm itself. In the simulation below I used three teams: one is a complete loser, one breaks even, and the third is a champion.
games <- data.frame(HomeName = c("a", "b", "c"),
HomePts = c(1, 2, 3),
AwayPts = c(3, 1, 2),
AwayName = c("c", "a", "b") )
wins = list()
total = list()
for (team in unique(games$HomeName)) {
wins[team] <- 0
total[team] <- 0
}
for (i in 1:nrow(games)) {
if (games$HomePts[i] > games$AwayPts[i]) {
wins[games$HomeName[i]] <- wins[[games$HomeName[i]]] + 1
} else {
wins[games$AwayName[i]] <- wins[[games$AwayName[i]]] + 1
}
total[games$HomeName[i]] <- total[[games$HomeName[i]]] + 1
total[games$AwayName[i]] <- total[[games$AwayName[i]]] + 1
}
for (team in unique(games$HomeName)) {
print(paste(team, wins[[team]] / total[[team]]))
}
games
wins
total
The output of your algorithm is below:
[1] "a 0"
[1] "b 0.5"
[1] "c 1"
> games
HomeName HomePts AwayPts AwayName
1 a 1 3 c
2 b 2 1 a
3 c 3 2 b
> wins
$`a`
[1] 0
$b
[1] 1
$c
[1] 2
> total
$`a`
[1] 2
$b
[1] 2
$c
[1] 2
However, this is not very "R-style": using for loops and manipulating list indices directly is not considered "comme il faut" :)
You can get similar results with, e.g., the dplyr package, which is part of the tidyverse. The code below compares the scores of each game, splits the result into two data frames (home and away), and binds them row-wise. Finally it groups by team name and calculates the mean win rate:
library(dplyr)
df <- games %>% mutate(hwins = (HomePts > AwayPts), awins = !hwins)
df_home <- df %>% select(HomeName, hwins) %>% rename(name = HomeName, wins = hwins)
df_away <- df %>% select(AwayName, awins) %>% rename(name = AwayName, wins = awins)
df <- bind_rows(df_home, df_away) %>% group_by(name) %>% summarise(mean_wins = mean(wins))
df
Output:
# A tibble: 3 x 2
name mean_wins
<fct> <dbl>
1 a 0
2 b 0.5
3 c 1
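One possible cause of the wrong numbers in the original data, offered here as an assumption since the CSV is not shown: if the name columns are factors (the default for read.csv() before R 4.0, and consistent with the <fct> column above), indexing a list with them uses the integer codes of the factor, not the names. The toy example happens to hide this because its list order matches the level order. A small sketch with hypothetical teams:

```r
# Named list in a different order than the factor levels
wins <- list(b = 0, c = 0, a = 0)
team <- factor("b", levels = c("a", "b", "c"))   # integer code of "b" is 2

# Indexing with the factor uses its integer code, so element 2 ("c") is updated
wins[team] <- wins[[team]] + 1

# Converting to character first indexes by name, as intended
wins_fixed <- list(b = 0, c = 0, a = 0)
key <- as.character(team)
wins_fixed[key] <- wins_fixed[[key]] + 1

unlist(wins)         # the win landed on "c"
unlist(wins_fixed)   # the win landed on "b"
```

If this is what happens, wrapping the names in as.character() or reading the file with stringsAsFactors = FALSE should be enough. Teams that only ever appear in AwayName would also need an initial entry in wins and total.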
