recoding categorical with no mapping values - r

I have a data frame with a lot of variables (82), many of which are used in further calculations. I've tried converting them to numeric, but it is a lot of work to find the distinct values of every variable and then assign numbers by hand.
I wonder if there's a more automated way of doing it, since I don't care which number is assigned to which value as long as no number is repeated within a variable.
My approach so far (for the sake of clarity, dummy data):
df <- data.frame(original.var1 = c("display","memory","software","display","disk","memory"),
original.var2 = c("skeptic","believer","believer","believer","skeptic","believer"),
original.var3 = c("round","square","triangle","cube","sphere","hexagon"),
original.var4 = c(10,20,30,40,50,60))
Given that this worked fine:
library(dplyr)
library(magrittr)
df$NEW1 <- as.numeric(interaction(df$original.var1, drop=TRUE))
I've tried to adapt it to dplyr and pipes this way:
df %<>% mutate(VAR1= as.numeric(interaction(original.var1, drop=TRUE))) %>%
mutate(VAR2= as.numeric(interaction(original.var2, drop=TRUE))) %>%
mutate(VAR3= as.numeric(interaction(original.var2, drop=TRUE)))
but the results are wrong from the third variable onward:
df %>% dplyr::group_by(original.var1,VAR1) %>% tally()
# A tibble: 4 x 3
# Groups: original.var1 [?]
original.var1 VAR1 n
<fctr> <dbl> <int>
1 disk 1 1
2 display 2 2
3 memory 3 2
4 software 4 1
> df %>% dplyr::group_by(original.var2,VAR2) %>% tally()
# A tibble: 2 x 3
# Groups: original.var2 [?]
original.var2 VAR2 n
<fctr> <dbl> <int>
1 believer 1 4
2 skeptic 2 2
> df %>% dplyr::group_by(original.var3,VAR3) %>% tally()
# A tibble: 6 x 3
# Groups: original.var3 [?]
original.var3 VAR3 n
<fctr> <dbl> <int>
1 cube 1 1
2 hexagon 1 1
3 round 2 1
4 sphere 2 1
5 square 1 1
6 triangle 1 1
Any approach or package to recode not having the mapping declared previously?

You can use mutate_if,
library(dplyr)
mutate_if(df, is.factor, funs(as.numeric(interaction(., drop = TRUE))))
which gives,
original.var1 original.var2 original.var3 original.var4
1 2 2 3 10
2 3 1 5 20
3 4 1 6 30
4 2 1 1 40
5 1 2 4 50
6 3 1 2 60
Alternatively, you can read your data frame with stringsAsFactors = FALSE and use is.character instead of is.factor; the result is the same.
To address your comment: if you also want to keep your original columns, then
mutate_if(df, is.factor, funs(new = as.numeric(interaction(., drop = TRUE))))
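Note that funs() has been deprecated since dplyr 0.8.0, and scoped verbs such as mutate_if() are superseded by across() from dplyr 1.0.0 on. Below is a minimal sketch of the same recoding with across(), assuming dplyr >= 1.0.0. The `_code` suffix is only an illustrative naming choice, and stringsAsFactors = TRUE is set explicitly because R >= 4.0.0 no longer creates factors by default.

```r
library(dplyr)

# Dummy data from the question; factors are requested explicitly
df <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var2 = c("skeptic", "believer", "believer", "believer", "skeptic", "believer"),
  original.var3 = c("round", "square", "triangle", "cube", "sphere", "hexagon"),
  original.var4 = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = TRUE
)

# For every factor column, add a companion column holding its integer level code.
# factor(.x) rebuilds the factor so that unused levels are dropped first.
recoded <- df %>%
  mutate(across(where(is.factor),
                ~ as.integer(factor(.x)),
                .names = "{.col}_code"))

print(recoded)
```

Rebuilding the factor before converting plays the same role as drop = TRUE in the interaction() call: codes are assigned only to values that actually occur.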

Using purrr: keep only the factor columns and operate on them, then merge with the numeric columns at the end.
df %>% purrr::keep(is.factor) %>% mutate_all(funs(as.numeric(interaction(., drop = TRUE))))

Related

Stepwise column sum in data frame based on another column in R

I have a data frame like this:
Team x
A 3
B 5
A 2
A 3
B 1
B 6
Looking for output like this (just an additional column):
Team x avg(x)
A 3 0
B 5 0
A 2 3
A 3 2.5
B 1 5
B 6 3
avg(x) is the average of all previous instances of x where Team is the same. I have the following R code which gets the overall average, however I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?
You want the cummean() function from dplyr, combined with lag() (replace_na() comes from tidyr, so that package must be loaded too):
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups: Team [2]
Team x avg_x
<chr> <dbl> <dbl>
1 A 3 0
2 B 5 0
3 A 2 3
4 A 3 2.5
5 B 1 5
6 B 6 3
As required.
Edit 1:
As @Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
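For anyone who wants to run this end to end, here is a self-contained sketch. It assumes the data from the question has columns named Team and x, and it only needs dplyr.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  Team = c("A", "B", "A", "A", "B", "B"),
  x    = c(3, 5, 2, 3, 1, 6)
)

# cummean() gives the running mean including the current row;
# lag() shifts it down one row within each team, so each row only
# sees earlier rows. The first row of each team gets the default, 0.
res <- df %>%
  group_by(Team) %>%
  mutate(avg_x = lag(cummean(x), default = 0)) %>%
  ungroup()

print(res)
```

The row order of the original data frame is preserved, because mutate() on a grouped data frame does not reorder rows.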

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
year subject grade study_time
1 1 a 30 20
2 2 a 60 60
3 1 b 30 10
4 2 b 90 100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
group_by(subject) %>%
mutate(RN = row_number()) %>%
mutate(study_time = study_time/study_time[RN ==1],
grade = grade/grade[RN==1]) %>%
select(-RN)
I would get the following output
year subject grade study_time
1 1 a 1 1
2 2 a 2 3
3 1 b 1 1
4 2 b 3 10
It's fairly easy to do when I know what the variable names are. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables that I need to mutate; I'll only know the names of the variables not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to change multiple columns, dividing each element by the first element of its group:
library(dplyr)
df %>%
group_by(subject) %>%
mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
# year subject grade study_time
# <int> <chr> <dbl> <dbl>
#1 1 a 1 1
#2 2 a 2 3
#3 1 b 1 1
#4 2 b 3 10
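Since the question only knows which columns not to touch, a version that selects by exclusion may fit better. The following is a rough sketch assuming dplyr >= 1.0.0; the function name normalize_by_first and its arguments group_cols and skip are hypothetical names made up for illustration. It relies on the fact that across() never selects grouping columns.

```r
library(dplyr)

# Hypothetical helper: divide every column by its first value within each group,
# except the grouping columns and the columns named in `skip`.
# Assumes all remaining columns are numeric.
normalize_by_first <- function(data, group_cols, skip) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    mutate(across(-any_of(skip), ~ .x / first(.x))) %>%
    ungroup()
}

df <- data.frame(
  year       = c(1, 2, 1, 2),
  subject    = c("a", "a", "b", "b"),
  grade      = c(30, 60, 30, 90),
  study_time = c(20, 60, 10, 100)
)

res <- normalize_by_first(df, group_cols = "subject", skip = "year")
print(res)
```

Passing the names as character vectors keeps the function usable without knowing anything about the remaining columns in advance.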

Using dplyrs group_by and summarise to find number of intersections with a different vector

I have a situation where I am trying to find the number of intersections with a vector per group in another tibble.
Data example
a <- tibble(EXPERIMENT = rep(c("a","b","c"),each =4),
ECOTYPE = rep(1:12))
b <- tibble(ECOTYPE = c(1,1,5,4,8,7,6,1,4,4,2,5,6,7,1))
I want to find the number of intersections between ECOTYPE in b and ECOTYPE per EXPERIMENT in a.
I wonder if I can use dplyr to solve this, as the group_by function seems to fit this problem, but when I run:
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, .$ECOTYPE)))
I only get the total number of intersections between a and b.
Am I missing something?
Edit:
Sorry for not posting my desired output. I would like something like this:
# A tibble: 3 x 2
EXPERIMENT INTERSECTIONS
<chr> <dbl>
1 a 8
2 b 7
3 c 0
Depending on how you want to count, this will give the number of rows in b matching a:
b %>% mutate(b_flag = 1) %>%
right_join(a) %>%
group_by(EXPERIMENT) %>%
summarize(INTERSECTIONS = sum(b_flag, na.rm = T))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <dbl>
# 1 a 8
# 2 b 7
# 3 c 0
I think the only problem with your code is the unnecessary .$, but it gives the counts of distinct ecotypes in b, ignoring the fact that b has three ECOTYPE = 1 rows, for example.
a %>%
group_by(EXPERIMENT) %>%
summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, ECOTYPE)))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <int>
# 1 a 3
# 2 b 4
# 3 c 0
This is a result of how intersect works:
intersect(c(1, 2, 3), c(1, 1, 1))
# [1] 1
Join the two and count how many are left:
inner_join(a,b, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% count()
# A tibble: 2 x 2
# Groups: EXPERIMENT [2]
EXPERIMENT n
<chr> <int>
1 a 8
2 b 7
Now, if you add an indicator column to b, you can start to count absences as well:
b %>% mutate(present=TRUE) %>% right_join(a, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% summarise(n(), missing=sum(is.na(present)))
# A tibble: 3 x 3
EXPERIMENT `n()` missing
<chr> <int> <int>
1 a 9 1
2 b 7 0
3 c 4 4
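If the goal is just the number of rows of b that fall in each experiment, a join is not strictly needed. As a sketch of an alternative (not taken from the answers above), %in% can be summed per group, which counts duplicates in b and keeps groups with zero matches:

```r
library(dplyr)

# Data from the question
a <- tibble(EXPERIMENT = rep(c("a", "b", "c"), each = 4),
            ECOTYPE = 1:12)
b <- tibble(ECOTYPE = c(1, 1, 5, 4, 8, 7, 6, 1, 4, 4, 2, 5, 6, 7, 1))

# For each experiment, count how many rows of b have an ECOTYPE
# that belongs to that experiment. Inside summarise(), the bare
# ECOTYPE refers to the current group's values only.
res <- a %>%
  group_by(EXPERIMENT) %>%
  summarise(INTERSECTIONS = sum(b$ECOTYPE %in% ECOTYPE))

print(res)
```

Unlike intersect(), %in% does not deduplicate, so the three ECOTYPE = 4 rows in b each count once.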

Find difference between rows by id, but place difference on first row in R

I have read a few different posts about finding the difference between two different rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times, and place that difference between n and n+1 in a new variable, on the same row as n, kind of like the duration between n and n+1. All other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending the NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe shown in the question. So the entire operation, based on the code shown in the question, is
df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag with lead
df %>%
group_by(id) %>%
mutate(duration = lead(time)- time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' are already in order. If they are not, add the ordering step as the OP showed in the post.
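For completeness, the same idea also works without any packages. This is a minimal base R sketch using ave(), which applies a function per group and returns a vector aligned with the original rows; the data is rebuilt with data.frame() rather than read.table() purely for brevity.

```r
# Sample data from the question
df <- data.frame(id   = c(1, 1, 1, 2, 2),
                 time = c(1, 4, 7, 5, 10))

# Make sure rows are ordered by id and time before differencing
df <- df[order(df$id, df$time), ]

# Within each id, diff() gives the gap to the next row;
# appending NA pads the last row of the group.
df$duration <- ave(df$time, df$id, FUN = function(t) c(diff(t), NA))

print(df)
```

A group with a single row simply gets NA, since diff() returns an empty vector there.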

dplyr: access current group variable

After using data.table for quite some time I now thought it's time to try dplyr. It's fun, but I wasn't able to figure out two things: how to access the current grouping variable, and how to return multiple values per group.
The following example works fine with data.table. How would you write this with dplyr?
library(data.table)
foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt <- data.table(a = c(1, 1, 2), b = c(4, 5, 6))
# data.table (expected)
dt[, .(c = foo[, a]), by = a]
a c
1: 1 1
2: 1 2
3: 2 3
4: 2 4
# dplyr (?)
library(dplyr)
dt %>%
group_by(a) %>%
summarize(c = foo[a])
We can use do() from dplyr (no other packages needed). do() is very handy for expanding rows; we only need to wrap the result in data.frame.
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, unique(.$a)]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or instead of unique we can subset by the 1st observation
dt %>%
group_by(a) %>%
do(data.frame(c = foo[, .$a[1]]))
# a c
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
Or with dplyr >= 1.1.0, which introduced reframe() and .by (EDIT: based on @Todd West's comments)
dt %>%
reframe(c = foo[, cur_group()$a], .by = 'a')
a c
1 1 1
2 1 2
3 2 3
4 2 4
This can be also done without using any packages
stack(lapply(split(dt$a, dt$a), function(x) foo[,unique(x)]))[2:1]
# ind values
#1 1 1
#2 1 2
#3 2 3
#4 2 4
You can still access the group variable, but it is a normal vector holding the same value for every row of the group, so wrapping it in unique() works. Also, dplyr does not seem to expand rows automatically the way data.table does; you will need unnest() from the tidyr package:
library(dplyr); library(tidyr)
dt %>%
group_by(a) %>%
summarize(c = list(foo[,unique(a)])) %>%
unnest(c)
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
Or we can use first() to speed things up, since we already know the group variable vector holds the same value within each group:
dt %>%
group_by(a) %>%
summarize(c = list(foo[,first(a)])) %>%
unnest(c)
# Source: local data frame [4 x 2]
# a c
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
To access a grouping variable in a grouped operation (group_map, group_walk, group_modify), we can refer to .y, which is exposed automatically within the evaluation context of the lambda.
Example
> iris %>% group_by(Species) %>% group_walk(~{ print(.y) })
# A tibble: 1 x 1
Species
<fct>
1 setosa
# A tibble: 1 x 1
Species
<fct>
1 versicolor
# A tibble: 1 x 1
Species
<fct>
1 virginica
This is also documented in more detail at https://dplyr.tidyverse.org/reference/group_map.html, which describes it as:
The key, a tibble with exactly one row and columns for each grouping variable, exposed as .y.
Regarding the other proposed solutions: Afaik do is not recommended any longer, and the other solution with unique is imho clumsy (as it requires another reference to the dataframe in question).
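Applied to the example from the question, the same .y key can be used with group_modify() to return several rows per group. This is only a sketch, assuming dplyr >= 0.8.1 (where group_modify() is available); a plain data.frame stands in for the data.table so that the example does not need the data.table package.

```r
library(dplyr)

foo <- matrix(c(1, 2, 3, 4), ncol = 2)
dt  <- data.frame(a = c(1, 1, 2), b = c(4, 5, 6))

# .y is a one-row tibble holding the current group's key, so .y$a is
# the current value of the grouping variable. The returned data frame
# may have any number of rows; the grouping column is added back
# automatically.
res <- dt %>%
  group_by(a) %>%
  group_modify(~ data.frame(c = foo[, .y$a])) %>%
  ungroup()

print(res)
```

The function passed to group_modify() must return a data frame and must not include the grouping column itself.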
