R: How to apply a function to rows of a grouped dataframe?

Suppose I have a dataframe generated like this
dataframe <- data.frame(name = rep(c('A', 'B', 'C', 'D'), 25), probe = rep(1:25, each = 4), a = rnorm(100), b = rnorm(100) + 1, c = rnorm(100) + 5)
> head(dataframe)
name probe a b c
1 A 1 0.03394554 2.97384424 4.173368
2 B 1 1.64304498 2.67977648 5.027671
3 C 1 0.35266588 1.62455820 5.664635
4 D 1 -1.24197302 0.29907974 5.243112
5 A 2 -0.20330593 0.45405930 6.603498
6 B 2 -1.06909795 -0.02575508 4.318659
The samples are in the columns. Variables are in the rows.
I need to calculate the ratio (A+B)/(C+D) for every set of samples that share the same probe, such as when probe == 1 or probe == 2.
I can group by probe.
But it seems functions can only be applied to the columns. How do I apply functions to the rows in a grouped object?
Thanks for the help!

I'd reshape.
library(dplyr)
library(tidyr)
dataframe %>%
  gather(variable, value, -name, -probe) %>%
  spread(name, value) %>%
  mutate(ratio = (A + B) / (C + D))
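In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same reshape with those functions, assuming probe takes the values 1 to 25; set.seed() and the result name ratios are only there to make the example reproducible and are not part of the question.

```r
library(dplyr)
library(tidyr)

# Example data shaped like the question's (probe values 1..25 are an assumption)
set.seed(1)
dataframe <- data.frame(
  name  = rep(c("A", "B", "C", "D"), 25),
  probe = rep(1:25, each = 4),
  a = rnorm(100),
  b = rnorm(100) + 1,
  c = rnorm(100) + 5
)

ratios <- dataframe %>%
  # stack the sample columns a, b, c into (variable, value) pairs
  pivot_longer(c(a, b, c), names_to = "variable", values_to = "value") %>%
  # turn the names A..D into columns, one row per probe/variable
  pivot_wider(names_from = name, values_from = value) %>%
  mutate(ratio = (A + B) / (C + D))

head(ratios)
```

Each probe ends up with three rows (one per sample column), so the result has 75 rows.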

Or we could use recast from reshape2. It is a convenient wrapper for melt/dcast. We add the new column 'ratio' after the reshape.
library(reshape2)
transform(recast(dataframe, measure.var = c('a', 'b', 'c'),
                 probe + variable ~ name, value.var = 'value'),
          ratio = (A + B) / (C + D))

Related

Using tidyverse in R how do I arrange my column values in a fixed way?

ID score
a 1
a 2
b 2
b 4
c 4
c 5
I want to reorder the rows so that ID cycles through "a, b, c", like this:
ID score
a 1
b 2
c 4
a 2
b 4
c 5
What I tried
> data <- read_csv(data)
> data <- factor(data$id, levels = c('a', 'b', 'c'))
This works for tables, so I tried it, but it didn't work here. Does anybody know if there is a way?
Rather than assigning the 'ID' column back to data (which replaces the whole data frame with the 'ID' values), use it for ordering. In base R, this can be done with
data1 <- data[order(duplicated(data$ID)),]
row.names(data1) <- NULL
Or with dplyr
library(dplyr)
library(data.table)
data %>%
arrange(rowid(ID))
library(dplyr)
d %>%
group_by(ID) %>%
mutate(r = row_number()) %>%
ungroup() %>%
arrange(r, ID, score) %>%
select(-r)
OR in base R
with(d, d[order(ave(seq(NROW(d)), d$ID, FUN = seq_along), ID, score),])
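To make the idea behind these answers concrete, here is a small self-contained base R sketch on the question's data: number each row within its ID, then sort by that number first and by ID second. The names occurrence and reordered are illustrative only.

```r
# The example data from the question
d <- data.frame(
  ID    = c("a", "a", "b", "b", "c", "c"),
  score = c(1, 2, 2, 4, 4, 5)
)

# 1, 2, ... within each ID: the first "a" gets 1, the second "a" gets 2, etc.
occurrence <- ave(seq_along(d$ID), d$ID, FUN = seq_along)

# Sort by occurrence first, then by ID, so the IDs cycle a, b, c, a, b, c
reordered <- d[order(occurrence, d$ID), ]
row.names(reordered) <- NULL
reordered
```

Unlike the duplicated() approach, this also works when an ID appears more than twice.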

Return column names based on condition

I have a dataset with 18 columns, from which I need to return the column names with the highest value(s) for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
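The question's edit mentions NAs, which this one-liner does not handle: max(x) returns NA for any row that contains one. Below is a hedged sketch of one way to skip them. The helper max_names() and the positions of the NAs are made up for illustration.

```r
# Example data with a few NAs (positions chosen only for illustration)
Df <- data.frame(a = c(4, 3, NA), b = c(4, 5, 6), c = c(3, NA, 5))

# Hypothetical helper: names of the entries equal to the row maximum, NAs ignored
max_names <- function(x) {
  if (all(is.na(x))) return(NA_character_)   # nothing to compare in this row
  # which() drops the NA comparisons, so only real matches are kept
  paste0(names(x)[which(x == max(x, na.rm = TRUE))], collapse = "")
}

# Compute on the value columns only, then attach the result
Df$maxcol <- unname(apply(Df[c("a", "b", "c")], 1, max_names))
Df
```

Rows that consist entirely of NA get NA in maxcol rather than an empty string; that choice is arbitrary and easy to change.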
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
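We first convert the dataframe from wide to long using gather (note that, as named here, the values column holds the column names and the key column holds the numbers), then for each row we paste together the column names whose value equals the row maximum, and finally spread the long dataframe back to wide.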
Here's a solution I found that loops through the column names, in case you find it hard to wrap your head around spread/gather (pivot_wider/pivot_longer):
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)

finding the minimum value of multiple variables by group

I would like to find the minimum value of a variable (time) at which several other variables are equal to 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to find this for one x, but I would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here is my example data and my solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr)
ddply(d[d$var1 == 1, ], .(cat), summarise,
      start = min(time))
How about this, using dplyr:
library(dplyr)
d %>%
  group_by(cat) %>%
  summarise_at(vars(contains("var")), ~ time[which(. == 1)[1]])
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
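Note that this returns a single time per group (the earliest across all 'var' columns). If one value per variable is wanted, as in the dplyr output, a base R sketch along the same lines could look like this; var_cols and first_year are illustrative names.

```r
# The example data from the question
d <- data.frame(
  cat  = c(rep("A", 10), rep("B", 10)),
  time = rep(1:10, 2),
  var1 = c(0,0,0,1,1,1,1,1,1,1, 0,0,0,0,0,0,1,1,1,1),
  var2 = c(0,0,0,0,1,1,1,1,1,1, 0,0,0,0,0,0,0,1,1,1)
)

var_cols <- c("var1", "var2")

# For each group, and for each variable, take the time of the first row equal to 1
first_year <- t(sapply(split(d, d$cat), function(g) {
  sapply(g[var_cols], function(flag) g$time[which(flag == 1)[1]])
}))
first_year
```

The result is a matrix with one row per cat and one column per variable; a variable that never equals 1 within a group yields NA.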
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")

dplyr: how to (concisely) use a mutate conditional on different cases?

Consider the following example
data <- data_frame(name = c('A','B','C','C',NA,'D'))
> data
# A tibble: 6 × 1
name
<chr>
1 A
2 B
3 C
4 C
5 <NA>
6 D
Here, I know that the variable name actually maps to 'A' -> 'one' and 'B' -> 'two'. I would simply like to create a variable that gets the mapping value. Of course, in my original dataset I have many more cases to map.
Something that does not work is the following.
data <- data %>%
mutate(mapping = ifelse(name == 'A', 'one', name),
mapping = ifelse(name == 'B', 'two', name))
> data
# A tibble: 6 × 2
name mapping
<chr> <chr>
1 A A
2 B two
3 C C
4 C C
5 <NA> <NA>
6 D D
What is wrong here? What is the most efficient way to do so in dplyr?
Many thanks!
If you want to avoid nested ifelse, you can simply create a mapping data frame and join with it.
mapping_df <- data.frame(name = LETTERS, mapping = 1:26)
left_join(data, mapping_df, by = "name")
data %>% mutate(mapping = recode(name, A="one", B="two"))
Recode may be handy when there aren't too many replacements.
For two values you could try something like:
data <- data %>%
mutate(mapping = ifelse(name == 'A', 'one',
ifelse(name == 'B', 'two', 'other')))
However you would be better off creating a separate data frame that contained the map and then using dplyr::left_join() to add it to your main df.
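As for what goes wrong in the question: the second ifelse falls back to name rather than to the mapping column built by the first one, so the 'one' assigned to 'A' is overwritten with 'A' again. The separate-map idea can also be sketched in base R with a named vector as the lookup table. The name lookup and the choice to keep unmapped names unchanged are assumptions made for this illustration.

```r
# The example data from the question, as a plain data frame
data <- data.frame(name = c("A", "B", "C", "C", NA, "D"),
                   stringsAsFactors = FALSE)

# Lookup table: names are the original values, elements are the mapped values
lookup <- c(A = "one", B = "two")

# Indexing by name returns NA for anything missing from the lookup (and for NA)
mapped <- unname(lookup[data$name])

# Keep the original name wherever no mapping exists
data$mapping <- ifelse(is.na(mapped), data$name, mapped)
data
```

Adding more cases only means adding entries to lookup; the rest of the code stays the same.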

dcast in R - creating pivot table

Here is my example
Student <- c('A', 'B', 'B')
Assessor <- c('C', 'D', 'D')
Score <- c(1, 5, 7)
df <- data.frame(Student, Assessor, Score)
df <- dcast(df, Student ~ Assessor,fun.aggregate=(function (x) x), value = 'Score')
print(df)
The output:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
While I want to get something like
C D
A 1 NaN
B NaN 5
B NaN 7
What am I missing?
In addition, if I replace Score with
Score <- c('foo', 'bar','bar')
The output will be:
Using Score as value column: use value.var to override.
Error in .fun(.value[0], ...) : unused argument (value = "Score")
Any thoughts?
Since dcast spreads over the unique combinations on the left-hand side of the formula, I think you can achieve your goal with a (not so elegant) hack, though I bet there are other ways to do it, maybe with table.
library(reshape2)
dcast(df, Student + Score ~ ...)[-2]
Using Score as value column: use value.var to override.
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
The hack is to keep both Student and Score on the left-hand side of the formula, so that each row stays distinct, spread the remaining variable (in this case Assessor) across columns, and then remove the Score column with [-2] to get the desired output (unless your first column is actually meant to be row names, which is impossible in base R; in that case you need a data.table solution).
Using the dev version of tidyr (0.3.0), which you can get from GitHub:
First we complete the combinations of Student/Assessor, then we nest it all into a list, spread and then unnest the list into new rows.
library(dplyr)
library(tidyr)
df %>% complete(Student, Assessor) %>%
nest(Score) %>%
spread(Assessor, Score) %>%
unnest(C) %>%
unnest(D)
Student C D
1 A 1 NA
2 B NA 5
3 B NA 7
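The underlying difficulty is that Student B has two scores from Assessor D, so Student alone does not identify a row and any spread needs either an aggregation or an extra key. With current tidyr, one way to sketch this is to add a temporary row identifier before pivot_wider(). The column name row and the result name wide are illustrative, not from the original answers.

```r
library(dplyr)
library(tidyr)

# The example data from the question
df <- data.frame(
  Student  = c("A", "B", "B"),
  Assessor = c("C", "D", "D"),
  Score    = c(1, 5, 7)
)

wide <- df %>%
  mutate(row = row_number()) %>%   # temporary key so duplicate pairs stay apart
  pivot_wider(names_from = Assessor, values_from = Score) %>%
  select(-row)                     # drop the helper key again

wide
```

Missing Student/Assessor combinations are filled with NA, matching the desired layout in the question.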
