dplyr: how to (concisely) use a mutate conditional on different cases?

Consider the following example
data <- data_frame(name = c('A','B','C','C',NA,'D'))
> data
# A tibble: 6 × 1
name
<chr>
1 A
2 B
3 C
4 C
5 <NA>
6 D
Here, I know that the variable name actually maps to 'A' -> 'one' and 'B' -> 'two'. I would simply like to create a variable that gets the mapping value. Of course, in my original dataset I have many more cases to map.
Something that does not work is the following.
data <- data %>%
mutate(mapping = ifelse(name == 'A', 'one', name),
mapping = ifelse(name == 'B', 'two', name))
> data
# A tibble: 6 × 2
name mapping
<chr> <chr>
1 A A
2 B two
3 C C
4 C C
5 <NA> <NA>
6 D D
What is wrong here? What is the most efficient way to do so in dplyr?
Many thanks!

If you want to avoid nested ifelse, you can simply create a mapping data frame and left join with it.
mapping_df <- data.frame(name = LETTERS, mapping = 1:26)
left_join(data, mapping_df, by = "name")

data %>% mutate(mapping = recode(name, A="one", B="two"))
Recode may be handy when there aren't too many replacements.

For two values you could try something like:
data <- data %>%
mutate(mapping = ifelse(name == 'A', 'one',
ifelse(name == 'B', 'two', 'other')))
However you would be better off creating a separate data frame that contained the map and then using dplyr::left_join() to add it to your main df.
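As for what is wrong in the question's attempt: the second ifelse() reads from name again rather than from mapping, so it overwrites the result of the first one, which is why 'A' comes back unchanged. A minimal sketch of an alternative, assuming a dplyr version that provides case_when() (the name result is only for illustration):

```r
library(dplyr)

data <- tibble(name = c("A", "B", "C", "C", NA, "D"))

# Conditions are checked in order; the first one that is TRUE wins.
# The final TRUE branch keeps the original value (including NA).
result <- data %>%
  mutate(mapping = case_when(
    name == "A" ~ "one",
    name == "B" ~ "two",
    TRUE ~ name
  ))
```

With many cases this is still long to type, so the join against a mapping data frame remains the more scalable option.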

Related

finding the minimum value of multiple variables by group

I would like to find the minimum value of a variable (time) that several other variables are equal to 1 (or any other value). Basically my application is finding the first year that x ==1, for several x. I know how to find this for one x but would like to avoid generating multiple reduced data frames of minima, then merging these together. Is there an efficient way to do this? Here is my example data and solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
ddply(d[d$var1==1,], .(cat), summarise,
start= min(time))
How about this, using dplyr:
d %>%
group_by(cat) %>%
summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")

tidyverse's way to look up values in data frame given a series of keys

Suppose I have a data frame like the following
df=data.frame(x=1:5,y=c("a","b","c","d","e"))
where y is the key column. Sometimes I want to look up values of x corresponding to a series of keys in y. To accomplish this, I can
row.names(df)=df$y
df[c("b","d","c"),c("x")]
and I will get
[1] 2 4 3
Note the order of values returned is the same as that of the series of given keys.
Now I want to achieve the same thing using tidyverse's tibble. But since tibble does not have row.names, I have no idea how to do it.
My question is, what is the "most clever" way (or idiomatic way, to borrow a term from Python) to look up values in a tibble given a series of keys, following the order of the keys?
The non-rownames way of doing this with a data.frame is
df[match(c('b', 'd', 'c'), df$y), 'x']
This works with tibbles as well, although there it returns a one-column tibble rather than a vector. Alternatively, use dplyr verbs:
df %>% slice(match(c('b', 'd', 'c'), y)) %>% pull(x)
I would use filter
library(tidyverse)
df <- tibble(
x = 1:5,
y = c("a","b","c","d","e")
)
df %>%
filter(y %in% c("b","d","c"))
#> # A tibble: 3 x 2
#> x y
#> <int> <chr>
#> 1 2 b
#> 2 3 c
#> 3 4 d
Created on 2018-07-12 by the reprex package (v0.2.0.9000).
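Note that filter() returns the rows in the order of the data, not in the order of the keys, as the output above shows. A minimal sketch of a join-based lookup that keeps the key order, assuming dplyr is available (the names keys and looked_up are only for illustration):

```r
library(dplyr)

df <- tibble(x = 1:5, y = c("a", "b", "c", "d", "e"))
keys <- c("b", "d", "c")

# Put the keys in a one-column tibble and join the data onto it;
# left_join() keeps the rows of the left table in their original order.
looked_up <- tibble(y = keys) %>%
  left_join(df, by = "y") %>%
  pull(x)
```

A key that is missing from df would come back as NA rather than being dropped silently.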

Counting categorical variables within group_by()

I am examining conservation easement data from NCED. I have a data frame of parcels that have some repeated IDs and owners. I want to group the repeated IDs into a single row with a count of the distinct number of owners... but based on this question and answer I'm just returning a count of the number of rows of the ID.
uniqueID <- c(1:10)
parcelID <- c('a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c')
owner <- c('owner1', 'owner1', 'owner1', 'owner2', 'owner3',
'owner2', 'owner2', 'owner2', 'owner3', 'owner1')
mydat1 <- data.frame(uniqueID, parcelID, owner)
numberOwners <- mydat1 %>% group_by(parcelID, owner) %>% tally()
My desired output would be:
parcelID_grouped nOwners
1 a 3
2 b 1
3 c 2
Using dplyr there are a couple of ways to do this:
library(dplyr)
mydat1 %>% distinct(parcelID, owner) %>% count(parcelID)
mydat1 %>% group_by(parcelID) %>% summarise(n = n_distinct(owner))
Both calls result in:
# parcelID n
# 1 a 3
# 2 b 1
# 3 c 2
Using data.table:
library(data.table)
setDT(mydat1)
mydat1[, uniqueID := NULL]
mydat1 <- unique(mydat1)
mydat1[, nOwners := .N, by = parcelID]
mydat1[, owner := NULL]
mydat1 <- unique(mydat1)
setnames(mydat1, "parcelID", "parcelID_grouped")
You'll get the desired output:
parcelID_grouped nOwners
1: a 3
2: b 1
3: c 2
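A shorter data.table sketch of the same count, assuming a data.table version that provides uniqueN(); the names dt and res are only for illustration:

```r
library(data.table)

dt <- data.table(
  uniqueID = 1:10,
  parcelID = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c"),
  owner = c("owner1", "owner1", "owner1", "owner2", "owner3",
            "owner2", "owner2", "owner2", "owner3", "owner1")
)

# Count the distinct owners within each parcel in a single step
res <- dt[, .(nOwners = uniqueN(owner)), by = parcelID]
```

This leaves the original table untouched instead of dropping columns by reference.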

R How to apply function to rows of grouped dataframe?

Suppose I have a dataframe generated like this
dataframe <- data.frame(name = (rep(c('A', 'B', 'C', 'D'), 25)), probe = rep(1:25, each = 4), a = rnorm(100), b = (rnorm(100)+1), c = (rnorm(100)+5))
> head(dataframe)
name probe a b c
1 A 1 0.03394554 2.97384424 4.173368
2 B 1 1.64304498 2.67977648 5.027671
3 C 1 0.35266588 1.62455820 5.664635
4 D 1 -1.24197302 0.29907974 5.243112
5 A 2 -0.20330593 0.45405930 6.603498
6 B 2 -1.06909795 -0.02575508 4.318659
The samples are in the columns. Variables are in the rows.
I need to calculate the ratio (A+B)/(C+D) for every set of samples using the same probe, such as when probe == 1 or probe == 2.
I can group by probe.
But it seems functions can only be applied to the columns. How do I apply a function to the rows of a grouped object?
Thanks for the help!
I'd reshape.
library(dplyr)
library(tidyr)
dataframe %>%
gather(variable, value, -name, -probe) %>%
spread(name, value) %>%
mutate(ratio = (A+B)/(C+D) )
Or we could use recast from reshape2. It is a convenient wrapper for melt/dcast. We add the new column 'ratio' after the reshape.
library(reshape2)
transform(recast(dataframe, measure.var=c('a', 'b', 'c'),
probe+variable~name, value.var='value'), ratio= (A+B)/(C+D))

dcast in R - creating pivot table

Here is my example
Student <- c('A', 'B', 'B')
Assessor <- c('C', 'D', 'D')
Score <- c(1, 5, 7)
df <- data.frame(Student, Assessor, Score)
df <- dcast(df, Student ~ Assessor,fun.aggregate=(function (x) x), value = 'Score')
print(df)
The output:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
While I want to get something like
C D
A 1 NaN
B NaN 5
B NaN 7
What am I missing?
In addition, if I replace Score with
Score <- c('foo', 'bar','bar')
The output will be:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
Any thoughts?
Since dcast spreads over the unique combinations on the left-hand side of the formula, I think you can achieve your goal with a (not so elegant) hack, though I bet there are other ways to do it, maybe with table.
library(reshape2)
dcast(df, Student + Score ~ ...)[-2]
Using Score as value column: use value.var to override.
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
The hack is to keep both Student and Score on the left-hand side so that every row stays distinct, spread the remaining variables (in this case Assessor), and then remove the Score column with [-2] to get the desired output (unless your first column is actually meant to be row names, which is impossible in base R; in that case you need a data.table solution).
Using the dev version of tidyr (0.3.0), which you can get from GitHub:
First we complete the combinations of Student/Assessor, then we nest it all into a list, spread and then unnest the list into new rows.
library(dplyr)
library(tidyr)
df %>% complete(Student, Assessor) %>%
nest(Score) %>%
spread(Assessor, Score) %>%
unnest(C) %>%
unnest(D)
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
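The error in the question appears to come from value = 'Score': dcast has no value argument (it is value.var, as the message hints), so the argument is forwarded to the aggregation function, which rejects it. The same "keep each row distinct" idea can also be sketched with a helper row id, assuming tidyr 1.0.0 or later for pivot_wider(); the column name row and the result name wide are only for illustration:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Student = c("A", "B", "B"),
  Assessor = c("C", "D", "D"),
  Score = c(1, 5, 7)
)

# A temporary row id keeps the two (B, D) rows from colliding,
# so each input row becomes its own output row.
wide <- df %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Assessor, values_from = Score) %>%
  select(-row)
```

Cells with no matching score are filled with NA, as in the outputs above.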
