Use of a like operator in dplyr - r

I have basketball data with a bunch of different types of shots, and I want to reduce the number of distinct names. For example, I have 'stepback jumpshot' and 'pull up jumpshot'.
I want to add a new variable that does something like:
df %>% mutate(NewVar== case when(Var1 like jumpshot then Jumpshot))
so all my different jumpshots are renamed as Jumpshot.

To elaborate on @r2evans' comment, what you are looking for is grepl(). This function tells you whether one string occurs within another, returning TRUE or FALSE. You don't actually need mutate() or case_when(); you can do it in base R:
Var1 <- c("Free Throw", "stepback jumpshot", "pull up jumpshot", "hail mary")
df <- data.frame(Var1, stringsAsFactors = FALSE)
df$Var2 <- ifelse(grepl("jumpshot", df$Var1, fixed = TRUE), "Jumpshot", df$Var1)
df
#                Var1       Var2
# 1        Free Throw Free Throw
# 2 stepback jumpshot   Jumpshot
# 3  pull up jumpshot   Jumpshot
# 4         hail mary  hail mary
But if you really want to use dplyr, the case_when() approach @r2evans suggested works too:
library(dplyr)

Var1 <- c("Free Throw", "stepback jumpshot", "pull up jumpshot", "hail mary")
df <- data.frame(Var1, stringsAsFactors = FALSE)
df2 <- df %>%
  mutate(Var2 = case_when(grepl("jumpshot", Var1) ~ "Jumpshot",
                          grepl("block", Var1)    ~ "Block",
                          TRUE                    ~ Var1))
df2
#                Var1       Var2
# 1        Free Throw Free Throw
# 2 stepback jumpshot   Jumpshot
# 3  pull up jumpshot   Jumpshot
# 4         hail mary  hail mary

Don't forget str_detect from stringr...
library(dplyr)
library(stringr)

Var1 <- c("Free Throw", "stepback jumpshot", "pull up jumpshot", "hail mary")
df <- data.frame(Var1, stringsAsFactors = FALSE)
df2 <- df %>%
  mutate(Var2 = case_when(str_detect(Var1, "jumpshot") ~ "Jumpshot",
                          str_detect(Var1, "block")    ~ "Block",
                          TRUE                         ~ Var1))
It's a little faster than grepl() (see What's the difference between the str_detect function in stringr and grepl and grep?).
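If you want to check the speed difference on your own data, here is a minimal benchmark sketch; the shots vector is made up for illustration, and the microbenchmark package is assumed to be installed:
library(microbenchmark)
library(stringr)

shots <- rep(c("Free Throw", "stepback jumpshot", "pull up jumpshot"), 1000)
microbenchmark(
  grepl      = grepl("jumpshot", shots),      # base R
  str_detect = str_detect(shots, "jumpshot")  # stringr
)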

Related

Dplyr _if verbs with predicate function referring to the column names & multiple conditions?

I'm trying to use the mutate_if, select_if, etc. verbs with column names inside the predicate function.
See the example below:
> btest <- data.frame(
+ sjr_first = c('1','2','3',NA, NA, '6'),
+ jcr_first = c('1','2','3',NA, NA, '6'),
+ sjr_second = LETTERS[1:6],
+ jcr_second = LETTERS[1:6],
+ sjr_third = as.character(seq(6)),
+ jcr_fourth = seq(6) + 5,
+ stringsAsFactors = FALSE)
>
> btest %>% select_if(.predicate = ~ str_match(names(.), 'jcr'))
Error in selected[[i]] <- eval_tidy(.p(column, ...)) :
replacement has length zero
I'm aware I could use btest %>% select_at(vars(dplyr::matches('jcr'))), but my goal here is actually to combine the column-name condition with another condition (e.g. is.numeric) using mutate_if() to operate on a subset of my columns. However, I'm not sure how to get the first part with the name matching to work...
You can do:
btest %>%
  select_if(str_detect(names(.), "jcr") & sapply(., is.numeric))

  jcr_fourth
1          6
2          7
3          8
4          9
5         10
6         11
Tidyverse solution:
require(dplyr)

# Return (get):
btest %>%
  select_if(grepl("jcr", names(.)) & sapply(., is.numeric))

# Mutate (set):
btest %>%
  mutate_if(grepl("jcr", names(.)) & sapply(., is.numeric), ~ paste0("whatever", .))
Base R solution:
# Return (get):
btest[, grepl("jcr", names(btest)) & sapply(btest, is.numeric), drop = FALSE]

# Mutate (set):
idx <- grepl("jcr", names(btest)) & sapply(btest, is.numeric)
btest[, idx] <- paste0("whatever", unlist(btest[, idx]))
You could chain two separate select_if calls:
library(dplyr)
library(stringr)

btest %>% select_if(str_detect(names(.), 'jcr')) %>% select_if(is.numeric)
#   jcr_fourth
# 1          6
# 2          7
# 3          8
# 4          9
# 5         10
# 6         11
We cannot combine the two predicates in a single select_if call because the first one operates on the entire data frame at once, whereas the second is applied column by column.
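(For completeness: with dplyr 1.0+ the two predicates can be combined after all, by wrapping the column-wise test in tidyselect's where() helper; a sketch assuming that version:)
btest %>% select(matches("jcr") & where(is.numeric))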

R - using LIKE operator with variable

I want to substitute a variable for a string in the %like% function from the DescTools package. Afterwards I want a loop in which the variable changes value so that I get different results each time.
I've tried a few ways but can't get it working.
Here is my sample code:
library(DescTools)
library(dplyr)
x <- c(1,2,3,4,5,6)
y <- c("a","b","c","a","a","a")
df <- data.frame(x = x, y = y)
df
Here is what I get if I search for "a" in the y column. This is the desired output.
# desired output
> df %>% filter(y %like% "%a%")
  x y
1 1 a
2 4 a
3 5 a
4 6 a
Now I want to create a variable that will hold the value I want to search for:
# a variable holding the value I'm looking for
let <- '"%a%"'
If I use that variable in place of the string, I get either no result or the wrong result. Is there any way for me to use a variable instead of a string?
# not working: returns no rows
> df %>% filter(y %like% let)
[1] x y
<0 rows> (or 0-length row.names)

# not working: cat() just prints the string and returns NULL, so nothing is filtered
> df %>% filter(y %like% cat(let))
"%a%"  x y
1 1 a
2 2 b
3 3 c
4 4 a
5 5 a
6 6 a
Option 1: evaluate the variable.
df %>% filter(y %like% eval(parse(text = let)))
Option 2: use dplyr's filter_ function (note that filter_ is deprecated in current dplyr):
df %>% filter_(paste0("y %like% ", let))
Edit: actually, the comments are better answers because they are less convoluted; the quote level was the problem.
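In other words, the fix from the comments is simply to store the pattern itself, without the extra layer of quotes:
let <- "%a%"   # the pattern itself, no embedded quotes
> df %>% filter(y %like% let)
  x y
1 1 a
2 4 a
3 5 a
4 6 a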

Grouping with numeric variables

I have a data frame like this:
name,value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google
I would like to turn the values in the second column into columns, keep the first column, and fill the new columns with 1 if the value exists for that name and 0 if it does not. Here is an example of the expected output:
name,Google,Yahoo
stockA,1,1
stockB,0,0
stockC,1,0
I tried this:
library(reshape2)
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
and the error it gives me is this:
Using value as value column: use value.var to override.
Error in `[.data.frame`(x, i) : undefined columns selected
Any idea what causes the error?
Here is an example in which the previous code works.
Data (df):
name,nam2,value
stockA,sth1,Yahoo
stockA,sth2,NA
stockB,sth3,Google
and this works:
df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name + nam2 ~ value, length)
The OP has asked for an explanation of the error caused by
dcast(melt(df, 1:2, na.rm = TRUE), df + name ~ value, length)
(I'm quite astonished that no one so far has tried to improve the OP's reshape2 approach to return exactly the expected answer).
There are several issues with the OP's code:
- df appears in the dcast() formula.
- The second parameter to melt() is 1:2, which means that all columns are used as id.vars. It should read 1.
- Most crucially, the data.frame df already is in long format and doesn't need to be reshaped at all.
So, df can be used directly in dcast():
library(reshape2)
dcast(df[!is.na(df$value), ], name ~ value, length, drop = FALSE)
#     name Google Yahoo
# 1 stockA      1     1
# 2 stockB      0     0
# 3 stockC      1     0
In order to avoid a third NA column appearing in the result, the NA rows have to be filtered out of df before reshaping. On the other hand, drop = FALSE is required to ensure stockB is included in the result.
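To see the effect: name is a factor, and once the NA row is filtered out, stockB has no rows left, so with the default drop = TRUE it would disappear from the result (sketch):
dcast(df[!is.na(df$value), ], name ~ value, length)
#     name Google Yahoo
# 1 stockA      1     1
# 2 stockC      1     0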
Data
df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
                 value = c("Google", "Yahoo", NA, "Google"))
df
#     name  value
# 1 stockA Google
# 2 stockA  Yahoo
# 3 stockB   <NA>
# 4 stockC Google
You can do that with spread from the tidyr package.
library(dplyr)
library(tidyr)

df <- data.frame(name = c("stockA", "stockA", "stockB", "stockC"),
                 value = c("Google", "Yahoo", NA, "Google"))
df$row <- 1
df %>%
  spread(value, row, fill = 0) %>%
  select(-`<NA>`)
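Note that spread() has since been superseded; the same reshape with tidyr's pivot_wider() would look like this (a sketch, assuming tidyr >= 1.0, where the NA names become a column literally called NA):
df %>%
  pivot_wider(names_from = value, values_from = row, values_fill = 0) %>%
  select(-`NA`)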
Try df2 <- dcast(melt(df, 1:2, na.rm = TRUE), name ~ value, length)
Just remove df + from the formula.
Though this will give you an extra column for NA values, which makes me think the na.rm argument isn't working properly in your formulation.
You can also do it with base R:
df <- read.table(header = TRUE, sep = ',', text =
'name,value
stockA,Google
stockA,Yahoo
stockB,NA
stockC,Google')
xtabs(~., data = df)
#         value
# name     Google Yahoo
#   stockA      1     1
#   stockB      0     0
#   stockC      1     0
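If a plain data frame is preferred over the contingency-table object, one way is (a sketch):
as.data.frame.matrix(xtabs(~., data = df))
#        Google Yahoo
# stockA      1     1
# stockB      0     0
# stockC      1     0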

R apply correlation function to a list

I have a data frame like this:
set.seed(1)
category <- c(rep('A', 100), rep('B', 100), rep('C', 100))
var1 <- rnorm(300)
var2 <- rnorm(300)
df <- data.frame(category = category, var1 = var1, var2 = var2)
I need to calculate the correlations between var1 and var2 by category. I think I can first split df by category and then apply the cor function to the list, but I am really confused about how to use the lapply function.
Could someone kindly help me out?
This should produce the desired result:
lapply(split(df, category), function(dfs) cor(dfs$var1, dfs$var2))
EDIT:
You can also use by (as suggested by @thelatemail):
by(df, df$category, function(x) cor(x$var1,x$var2))
You can use sapply to get the same result as a vector rather than a list:
sapply(split(df, category), function(dfs) cor(dfs$var1, dfs$var2))
And just for comparison, here's how you'd do it with the dplyr package.
library(dplyr)
df %>% group_by(category) %>% summarize(cor=cor(var1,var2))
#   category         cor
# 1        A -0.05043706
# 2        B  0.13519013
# 3        C -0.04186283

R equivalent of Stata *

In Stata, if I have these variables: var1, var2, var3, var4, var5, and var6, I can select all of them with the command var*. Does R have a similar functionality?
The select function from the "dplyr" package offers several flexible ways to select variables. For instance, using @Marius's sample data, try the following:
library(dplyr)
df %>% select(starts_with("var")) # At the start
df %>% select(num_range("var", 1:3)) # specifying range
df %>% select(num_range("var", c(1, 3))) # gaps are allowed
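select() also accepts regular expressions through the matches() helper:
df %>% select(matches("^var"))       # regular-expression match on the names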
You can use grep() to do this kind of regular-expression matching among the column names:
x <- c(1, 2, 3)
df <- data.frame(var1 = x, var2 = x, var3 = x, other = x)
# note: as a regex, "var*" means "va" followed by zero or more "r"s;
# "^var" anchors the prefix properly
df[, grep("^var", colnames(df))]
Output:
  var1 var2 var3
1    1    1    1
2    2    2    2
3    3    3    3
So, basically just making use of the usual df[rows_to_keep, columns_to_keep]
indexing syntax, and feeding in the results of grep as the columns_to_keep.
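Base R also has startsWith(), which sidesteps the regex subtlety entirely:
# literal prefix test, no regular expression involved
df[, startsWith(colnames(df), "var")]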
