Calculate values in "long format" dataframe in R

I'm looking for a way to calculate values in LONG format data frame without switching between long and wide formats. Data frame structure is basically like this:
index <- rep(seq(1:3),2)
category <- c("a","a","a","b","b","b")
value <- c(3,6,8,9,7,4)
df <- data.frame(index, category,value, stringsAsFactors = FALSE)
Say I need to calculate a new category, c, by adding up a and b. That is easy to do by transforming the data frame to "wide" format with category as the key column, adding the new c variable, and switching back to "long" format.
However, I have hundreds of new categories to be calculated from hundreds of source items and it would be a very time-consuming solution. I'm sure there must be a smarter way, but I haven't been able to find it. Any ideas? Thank you!

We can use data.table
library(data.table)
rbind(setDT(df), df[, .(category = 'c', value = sum(value)), index])
# index category value
#1: 1 a 3
#2: 2 a 6
#3: 3 a 8
#4: 1 b 9
#5: 2 b 7
#6: 3 b 4
#7: 1 c 12
#8: 2 c 13
#9: 3 c 12

With dplyr we can group_by index to match the values, sum values for each group and bind the rows to the original dataframe.
library(dplyr)
bind_rows(df, df %>%
  group_by(index) %>%
  summarise(category = 'c',
            value = sum(value)))
# index category value
#1 1 a 3
#2 2 a 6
#3 3 a 8
#4 1 b 9
#5 2 b 7
#6 3 b 4
#7 1 c 12
#8 2 c 13
#9 3 c 12
The same with base R would be using aggregate and rbind
rbind(df, transform(aggregate(value~index, df, sum), category = 'c'))
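Since the question mentions hundreds of derived categories, here is a minimal base R sketch of how the aggregate and rbind idea might be looped over many definitions. The `recipes` list (new category name mapped to the source categories to sum) is a hypothetical structure made up for illustration, not something from the question.

```r
index <- rep(1:3, 2)
category <- c("a", "a", "a", "b", "b", "b")
value <- c(3, 6, 8, 9, 7, 4)
df <- data.frame(index, category, value, stringsAsFactors = FALSE)

# Hypothetical recipe list: new category name -> source categories to add up
recipes <- list(c = c("a", "b"))

new_rows <- lapply(names(recipes), function(nm) {
  # keep only the source categories for this recipe
  src <- df[df$category %in% recipes[[nm]], ]
  # sum the values per index
  out <- aggregate(value ~ index, data = src, FUN = sum)
  out$category <- nm
  out[, c("index", "category", "value")]
})

# append all derived categories to the original long data
result <- rbind(df, do.call(rbind, new_rows))
```

Adding more entries to `recipes` would produce more derived categories without ever reshaping to wide format. This only covers sums; other calculations would need a different function per recipe.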

Related

Turning a data frame and a list into long format with dplyr

Here is a puzzle.
Assume you have a data frame and a list. The list has as many elements as the df has rows:
dd <- data.frame(ID=1:3, Name=LETTERS[1:3])
dl <- map(4:6, rnorm) %>% set_names(letters[1:3])
Is there a simple way (preferably with dplyr / tidyverse) to make a long format, such that the elements of the list are joined with the corresponding rows of the data frame? Here is what I have in mind illustrated with not-so-elegant way:
rows <- map(1:length(dl), ~ rep(., length(dl[[.]]))) %>% unlist()
dd <- dd[rows,]
dd$value <- unlist(dl)
As you can see, for each vector in dl, we replicated the corresponding row as many times as necessary to accommodate each value.
In base R, you can get your result with stack followed by merge:
res <- merge(stack(dl), dd, by.x="ind", by.y="Name")
head(res)
# ind values ID
#1 A -0.79616693 1
#2 A 0.37720953 1
#3 A 1.30273712 1
#4 A 0.19483859 1
#5 B 0.18770716 2
#6 B -0.02226917 2
NB: I assumed the names of dl were meant to be uppercase. If they are indeed lowercase, the following line needs to be run instead:
res <- merge(stack(setNames(dl, toupper(names(dl)))), dd, by.x="ind", by.y="Name")
Since a dplyr solution has already been provided, another option is to subset dl for each Name value in dd using data.table grouping
library(data.table)
setDT(dd)
dd[, .(values = dl[[tolower(Name)]]), by = .(ID, Name)]
# ID Name values
# 1: 1 A -1.09633600
# 2: 1 A -1.26238190
# 3: 1 A 1.15220845
# 4: 1 A -1.45741071
# 5: 2 B -0.49318131
# 6: 2 B 0.59912670
# 7: 2 B -0.73117632
# 8: 2 B -1.09646143
# 9: 2 B -0.79409753
# 10: 3 C -0.08205888
# 11: 3 C 0.21503398
# 12: 3 C -1.17541571
# 13: 3 C -0.10020616
# 14: 3 C -1.01152362
# 15: 3 C -1.03693337
We can create a list column and unnest
library(tidyverse)
dd %>%
  mutate(value = dl) %>%
  unnest(value)  # recent tidyr versions expect the column to unnest to be named
# ID Name value
#1 1 A 1.57984385
#2 1 A 0.66831102
#3 1 A -0.45472145
#4 1 A 2.33807619
#5 2 B 1.56716709
#6 2 B 0.74982763
#7 2 B 0.07025534
#8 2 B 1.31174561
#9 2 B 0.57901536
#10 3 C -1.36629653
#11 3 C -0.66437155
#12 3 C 2.12506187
#13 3 C 1.20220402
#14 3 C 0.10687018
#15 3 C 0.15973401
If compactness of code is the criterion, we can drop the %>% altogether:
unnest(mutate(dd, value = dl))
Or another option is uncount and mutate
dd %>%
uncount(lengths(dl)) %>%
mutate(value = flatten_dbl(unname(dl)))
If it needs a join based on the names of the 'dl'
enframe(dl, name = 'Name') %>%
  mutate(Name = toupper(Name)) %>%
  left_join(dd) %>%
  unnest(value)  # 'value' is the default column name created by enframe
In base R, we can replicate the rows of 'dd' with lengths of 'dl' and transform to create the 'value' as unlisted 'dl'
transform(dd[rep(seq_len(nrow(dd)), lengths(dl)),], value = unlist(dl))
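To make the row replication easier to check by eye, here is a small self-contained sketch of the same base R idea. It uses fixed numbers in place of rnorm() (an assumption for reproducibility only), so the alignment between IDs and values can be verified.

```r
dd <- data.frame(ID = 1:3, Name = LETTERS[1:3])
# Fixed values instead of rnorm() so the result is reproducible
dl <- list(a = c(0.1, 0.2), b = c(1.1, 1.2, 1.3), c = 2.1)

# repeat row i of dd once per element of dl[[i]]
long <- dd[rep(seq_len(nrow(dd)), lengths(dl)), ]
# flatten the list in the same order as the repeated rows
long$value <- unlist(dl, use.names = FALSE)
rownames(long) <- NULL
```

This relies on the list elements being in the same order as the rows of dd; if they are not, a join on the names is the safer route.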

Using dplyr first function but ignoring a particular character

I wish to add the first feature in the following dataset in a new column
mydf <- data.frame (customer= c(1,2,1,2,2,1,1) , feature =c("other", "a", "b", "c", "other","b", "c"))
customer feature
1 1 other
2 2 a
3 1 b
4 2 c
5 2 other
6 1 b
7 1 c
by using dplyr. However, I want my code to ignore the "other" feature and choose the first feature that is not "other",
so the following code is not sufficient:
library (dplyr)
new <- mydf %>%
group_by(customer) %>%
mutate(firstfeature = first(feature))
How can I ignore "other" so that I reach the following ideal output:
customer feature firstfeature
1 1 other b
2 2 a a
3 1 b b
4 2 c a
5 2 other a
6 1 b b
With dplyr we can group by customer, drop the "other" entries, and take the first remaining feature in every group.
library(dplyr)
mydf %>%
group_by(customer) %>%
mutate(firstfeature = feature[feature != "other"][1])
# customer feature firstfeature
# <dbl> <chr> <chr>
#1 1 other b
#2 2 a a
#3 1 b b
#4 2 c a
#5 2 other a
#6 1 b b
#7 1 c b
Similarly we can also do this with base R ave
mydf$firstfeature <- ave(mydf$feature, mydf$customer,
FUN= function(x) x[x!= "other"][1])
Another option is data.table
library(data.table)
setDT(mydf)[, firstfeature := feature[feature != "other"][1], customer]
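One edge case worth knowing about: if a customer has only "other" rows, the subset is empty and `[1]` gives NA. A small base R sketch, with a hypothetical customer 3 added to the example data to show this:

```r
mydf <- data.frame(
  customer = c(1, 2, 1, 2, 2, 1, 1, 3),
  feature = c("other", "a", "b", "c", "other", "b", "c", "other"),
  stringsAsFactors = FALSE
)

# within each customer, drop "other" and take the first remaining feature;
# indexing an empty vector with [1] yields NA
mydf$firstfeature <- ave(mydf$feature, mydf$customer,
                         FUN = function(x) x[x != "other"][1])
```

Whether NA is the right answer for such customers depends on the use case; it could be replaced afterwards if a different default is wanted.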

How to subset data based on combination of criteria in R

I have several million rows of data and I need to create a subset. I have had no success despite trying hard and searching all over the web. The question is:
How to create a subset including only the smallest values of value for all ID & item combinations?
The data structure looks like this:
> df = data.frame(ID = c(1,1,1,1,2,2,2,2),
item = c('A','A','B','B','A','A','B','B'),
value = c(10,5,3,2,7,8,9,10))
> df
ID item value
1 1 A 10
2 1 A 5
3 1 B 3
4 1 B 2
5 2 A 7
6 2 A 8
7 2 B 9
8 2 B 10
The result should look like this:
ID item value
1 A 5
1 B 2
2 A 7
2 B 9
Any hints greatly appreciated. Thank you!
We can use aggregate from base R with grouping variables 'ID' and 'item' to get the min of 'value'
aggregate(value~., df, min)
# ID item value
#1 1 A 5
#2 2 A 7
#3 1 B 2
#4 2 B 9
Or using dplyr
library(dplyr)
df %>%
group_by(ID, item) %>%
summarise(value = min(value))
Or with data.table
library(data.table)
setDT(df)[, .(value = min(value)) , .(ID, item)]
Or another option would be to order and get the first row after grouping
setDT(df)[order(value), head(.SD, 1), .(ID, item)]
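The order-then-take-first idea can also be sketched in base R with order and duplicated. This is a different technique from aggregate: it keeps whole rows, which may help if the real data has more columns than the example.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 2),
                 item = c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
                 value = c(10, 5, 3, 2, 7, 8, 9, 10))

# sort so the smallest value comes first within each ID & item combination
ordered <- df[order(df$ID, df$item, df$value), ]
# keep only the first row of each ID & item combination
smallest <- ordered[!duplicated(ordered[c("ID", "item")]), ]
rownames(smallest) <- NULL
```

With ties on value, this keeps just one of the tied rows per combination.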

Determine subgroup index

I have a large data frame with groups and subgroups. I would like to determine the index of the subgroup in each group, like shown in the OUTPUT column of the following data frame:
df <- data.frame(
Group = factor(c("A","A","A","A","A","B","B","B","B")),
Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
OUTPUT = c(1,1,2,2,2,1,1,2,2)
)
I've tried several possibilities without any success. I'd like to work with dplyr, but I'm not sure how to go about this. The following code returns an unexpected result.
require(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(
OUTPUT_2 = dplyr::id(Subgroup)
)
#df
# Group Subgroup OUTPUT_2
# (fctr) (fctr) (int)
#1 A a 8
#2 A a 8
#3 A b 8
#4 A b 8
#5 A b 8
#6 B a 4
#7 B a 4
#8 B b 4
#9 B b 4
I've the feeling I'm close, but not getting there. Can anybody help?
Here is a solution with data.table without aggregation (dt is assumed to be the data.table version of df, e.g. dt <- as.data.table(df)):
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]
This should be faster than methods based on aggregation.
We can use the factor route with dplyr
library(dplyr)
df %>%
group_by(Group) %>%
mutate(OUTPUT = as.numeric(factor(Subgroup, levels= unique(Subgroup))))
# Group Subgroup OUTPUT
# <fctr> <fctr> <dbl>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
Or another option is match with the unique elements of 'Subgroup' after grouping by 'Group'
df %>%
group_by(Group) %>%
mutate(OUTPUT = match(Subgroup, unique(Subgroup)) )
# Group Subgroup OUTPUT
# <fctr> <fctr> <int>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
# Group Subgroup idx OUTPUT
#1: A a 1 1
#2: A a 1 1
#3: A b 2 2
#4: A b 2 2
#5: A b 2 2
#6: B a 1 1
#7: B a 1 1
#8: B b 2 2
#9: B b 2 2
Translation to dplyr should be straightforward.
Another method, following the idea of using factors from aosmith's comment, is:
dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]
This will create a factor with correct levels per Group which is the indexing you're after.
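For completeness, the match/unique idea can also be sketched in base R with ave, with no extra packages. The column name `idx` is just an illustrative choice.

```r
df <- data.frame(
  Group = factor(c("A", "A", "A", "A", "A", "B", "B", "B", "B")),
  Subgroup = factor(c("a", "a", "b", "b", "b", "a", "a", "b", "b"))
)

# ave() applies FUN within each Group; match() against unique() numbers
# the subgroups in order of first appearance inside that group
df$idx <- ave(as.integer(df$Subgroup), df$Group,
              FUN = function(x) match(x, unique(x)))
```

Because the numbering follows first appearance, a group whose rows start with subgroup "b" would label "b" as 1, which may or may not be what is wanted.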

"Collapse" multiple columns into two columns using column name as ID

In R, I'm trying to collapse multiple columns of a data frame into two columns, with the column names from the first data frame being copied into their own column in the resulting data frame. For instance, I have the following data frame df :
A B C D
1 2 3 4
5 6 7 8
And I'm trying to get this output, which is helpful when performing ANOVA tests:
DV IV
A 1
A 5
B 2
B 6
C 3
C 7
D 4
D 8
I've been going about this manually by declaring new data frame like so:
df2 <- data.frame("DV" = c(rep("A", 2), rep("B", 2), rep("C", 2), rep("D", 2)),
"IV" = c(df$A, df$B, df$C, df$D))
I suspect aggregate() or melt() could do this more efficiently, but I'm lost in the syntax. Thanks in advance!
You can use melt from the reshape2 package
library(reshape2)
melt(df, variable.name = "DV", value.name = "IV")
DV IV
1 A 1
2 A 5
3 B 2
4 B 6
5 C 3
6 C 7
7 D 4
8 D 8
A <- c(1,5)
B <- c(2,6)
C <- c(3,7)
D <- c(4,8)
df <- data.frame(A,B,C,D)
I like gather from the tidyr package, since the code is so short
library(tidyr)
df2 <- gather(df, DV, IV)
You could just use stack(df) or, prettying up the result a bit:
setNames(rev(stack(df)), c("DV", "IV"))
DV IV
1 A 1
2 A 5
3 B 2
4 B 6
5 C 3
6 C 7
7 D 4
8 D 8
