Dynamically Normalize all rows with first element within a group - r

Suppose I have the following data frame:
year subject grade study_time
1 1 a 30 20
2 2 a 60 60
3 1 b 30 10
4 2 b 90 100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
  group_by(subject) %>%
  mutate(RN = row_number()) %>%
  mutate(study_time = study_time / study_time[RN == 1],
         grade = grade / grade[RN == 1]) %>%
  select(-RN)
I would get the following output
year subject grade study_time
1 1 a 1 1
2 2 a 2 3
3 1 b 1 1
4 2 b 3 10
It's fairly easy to do when I know the variable names. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables I need to mutate; I'll only know the names of the variables not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.

We group by 'subject' and use mutate_at to change multiple columns, dividing each element by the first element of its column.
library(dplyr)
df %>%
  group_by(subject) %>%
  mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
# year subject grade study_time
# <int> <chr> <dbl> <dbl>
#1 1 a 1 1
#2 2 a 2 3
#3 1 b 1 1
#4 2 b 3 10
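Since the question asks for a generalized version where only the names of the columns not to mutate are known, a small wrapper along these lines may help. This is a sketch using across(), which supersedes mutate_at()/funs() in current dplyr; normalize_by_first and keep_cols are made-up names for illustration.

library(dplyr)

# divide every column except the ones listed in keep_cols by its first
# value within each group defined by group_col
normalize_by_first <- function(data, group_col, keep_cols) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    mutate(across(-all_of(keep_cols), ~ .x / first(.x))) %>%
    ungroup()
}

# usage with the example data: leave 'year' (and the grouping column) untouched
# normalize_by_first(df, group_col = "subject", keep_cols = "year")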

Related

R - how to sum each column of a df

I have this df
df <- read.table(text="
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header=TRUE)
What I'd like to do is calculate the sum of gas, tickets (and another 50+ columns in my real df) for each month. Usually I would do something like
result <-
  df %>%
  group_by(month) %>%
  summarise(
    gas = sum(gas),
    tickets = sum(tickets)
  ) %>%
  ungroup()
But since I have a really large number of columns in my dataframe, I don't want to repeat myself by writing a sum call for each column. I'm wondering if it's possible to do this more elegantly - a function or something that will sum every column except id and month, grouped by the month column.
You can use summarise_at() to ignore id and sum the rest:
df %>%
  group_by(month) %>%
  summarise_at(vars(-id), list(sum = ~sum))
# A tibble: 3 x 3
month gas_sum tickets_sum
<int> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
You can use aggregate as markus recommends in the comments. If you want to stick to the tidyverse you could try something like this:
df %>%
  select(-id) %>%
  group_by(month) %>%
  summarise_if(is.numeric, sum)
#### OUTPUT ####
# A tibble: 3 x 3
month gas tickets
<fct> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
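In current dplyr, summarise_at() and summarise_if() are superseded; an equivalent sketch with across(), on the same data, might look like this:

library(dplyr)

df %>%
  group_by(month) %>%
  summarise(across(-id, sum))   # sums every non-grouping column except id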

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as.tibble(data.frame(id = rep(1:3, each = 2),
                           item_code = c("A", "A", "B", "B", "B", "Z"),
                           score = rep(1, 6)))
additional_rows <- as.tibble(data.frame(item_code = c("A", "Z"),
                                        score = c(6, 6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>%
  group_by(id) %>%
  do(add_row(additional_rows %>% filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
  join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
         join(additional_rows) %>%
         subset(!duplicated(.)), type = "full") %>%
  arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: to get the scores in the right order, I added the other arrange terms.
Edit 2: alright, as per your comment, there should now be no duplicated rows added from the additional rows.
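If you would rather stay entirely within dplyr (loading plyr after dplyr can mask its verbs), a sketch along these lines gives the same result for the example data; the intermediate name new_rows is only for illustration:

library(dplyr)

# for each id, find which of the additional item_codes occur in that group,
# attach the additional score, and stack those rows on top of the originals
new_rows <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

bind_rows(new_rows, df) %>%
  arrange(id, item_code, desc(score))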

Finding individual values from cumulative mean in R Dataframe

I am having trouble figuring out how to recover the individual values from a running mean in an R dataframe.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
Here Mean is the cumulative mean of the first x measurements for the given ID.
To find the individual value at each x rather than the mean, I was thinking I needed to apply a recursive function to the dataframe, grouped by ID. How could I do this while grouping by one of the columns, when an apply function wouldn't have access to the previous entry in the dataframe?
When completed and appended to the dataframe, I am hoping it to look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7
It's much easier to calculate this by going from cumulative totals back to the individual observations, as below:
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
library(dplyr); library(magrittr)
df %>%
  group_by(ID) %>%
  mutate(
    total = Mean * x,
    ind_value = total - lag(total, default = 0))
# A tibble: 4 x 5
# Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <int>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7
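As a quick sanity check (a sketch, not part of the original answer), dplyr's cummean() applied to the recovered values should reproduce the Mean column within each ID:

df %>%
  group_by(ID) %>%
  mutate(total = Mean * x,
         ind_value = total - lag(total, default = 0),
         check = cummean(ind_value))   # 'check' should equal 'Mean' row by row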

Recoding categorical variables with no mapping values

I've got a data frame with a lot of variables (82), many of which are used for further calculations. So I've tried to convert them to numeric, but it's a huge amount of work to guess the distinct values of every variable and then assign numbers to them.
I wonder if there's a more automated way of doing it, since I don't care which number is assigned to which value, as long as numbers are not repeated within a variable.
My approach so far (for the sake of clarity, dummy data):
df <- data.frame(original.var1 = c("display","memory","software","display","disk","memory"),
original.var2 = c("skeptic","believer","believer","believer","skeptic","believer"),
original.var3 = c("round","square","triangle","cube","sphere","hexagon"),
original.var4 = c(10,20,30,40,50,60))
Taking into account that this worked fine:
library(dplyr)
library(magrittr)
df$NEW1 <- as.numeric(interaction(df$original.var1, drop=TRUE))
I've tried to adapt it to dplyr and pipes this way
df %<>% mutate(VAR1= as.numeric(interaction(original.var1, drop=TRUE))) %>%
mutate(VAR2= as.numeric(interaction(original.var2, drop=TRUE))) %>%
mutate(VAR3= as.numeric(interaction(original.var2, drop=TRUE)))
but the results go wrong from the third VAR onward:
df %>% dplyr::group_by(original.var1,VAR1) %>% tally()
# A tibble: 4 x 3
# Groups: original.var1 [?]
original.var1 VAR1 n
<fctr> <dbl> <int>
1 disk 1 1
2 display 2 2
3 memory 3 2
4 software 4 1
> df %>% dplyr::group_by(original.var2,VAR2) %>% tally()
# A tibble: 2 x 3
# Groups: original.var2 [?]
original.var2 VAR2 n
<fctr> <dbl> <int>
1 believer 1 4
2 skeptic 2 2
> df %>% dplyr::group_by(original.var3,VAR3) %>% tally()
# A tibble: 6 x 3
# Groups: original.var3 [?]
original.var3 VAR3 n
<fctr> <dbl> <int>
1 cube 1 1
2 hexagon 1 1
3 round 2 1
4 sphere 2 1
5 square 1 1
6 triangle 1 1
Is there any approach or package to recode without having the mapping declared beforehand?
You can use mutate_if,
library(dplyr)
mutate_if(df, is.factor, funs(as.numeric(interaction(., drop = TRUE))))
which gives,
original.var1 original.var2 original.var3 original.var4
1 2 2 3 10
2 3 1 5 20
3 4 1 6 30
4 2 1 1 40
5 1 2 4 50
6 3 1 2 60
Alternatively, you can read your data frame with stringsAsFactors = FALSE and use is.character instead, but it's the same thing.
To address your comment: if you want to also keep your original columns, then
mutate_if(df, is.factor, funs(new = as.numeric(interaction(., drop = TRUE))))
Using purrr: keep only the factor columns and operate on them, then merge with the numeric columns at the end.
df %>% purrr::keep(is.factor) %>% mutate_all(funs(as.numeric(interaction(., drop = TRUE))))
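Note that mutate_if(), mutate_all() and funs() are superseded in current dplyr; an equivalent sketch with across(), applying the same interaction() trick from above, might look like this:

library(dplyr)

df %>%
  mutate(across(where(is.factor), ~ as.numeric(interaction(.x, drop = TRUE))))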

Find difference between rows by id, but place difference on first row in R

I have read a few different posts about finding the difference between two different rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times, and place that difference between n and n+1 in a new variable, on the same row as n, kind of like the duration between n and n+1. All other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending the NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe in the code shown in the question. So the entire operation, based on the code shown in the question, would be
df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag with lead
df %>%
  group_by(id) %>%
  mutate(duration = lead(time) - time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' are already in order. If they are not, add an ordering step, as the OP showed in the question.
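If the rows are not already ordered, one possible way to combine the ordering with the shift() step above (a sketch on the same data) is:

library(data.table)

setDT(df)                # convert to a data.table by reference
setorder(df, id, time)   # order by id, then time
df[, duration := shift(time, type = "lead") - time, by = id]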
