Summarizing a difficult dataset in R

I have a dataset that is essentially formatted backwards from how I need it for a specific analysis. It represents entities and the articles they are found in, identified by ID numbers (see below; the column headings [article 1, 2, 3, etc.] just mean the 1st, 2nd, 3rd article an entity appears in, and the ID in each cell is the substantive part). What I'd like is a count of how many entities appear in each article. I think I could do this with something like dplyr's group_by() and summarise(), but I can't find a way to apply them across a range of columns (there are actually 97 article columns in the dataset).
entity   article 1  article 2  article 3
Bach     51         72         122
Mozart   2          83         95
Two specific transformations that would be useful for me are:
1. The number of entities in each article, calculated as the count of times each unique ID appears across the entity rows, e.g.:
id   count
51   5424
72   1001
122  4000
2. The entities in each article, e.g.:
id  entity 1  entity 2  entity 3
51  Bach      Mozart    etc
72  Mozart    Liszt     etc
All this should be possible from this dataset; I just can't figure out how to get it into a workable format. Thanks for your help!

For number 1, you can pivot to long format, then count each unique id per entity using tally():
library(tidyverse)

df %>%
  pivot_longer(-entity) %>%
  group_by(entity, value) %>%
  tally()
# A tibble: 6 × 3
# Groups:   entity [2]
  entity value     n
  <chr>  <dbl> <int>
1 Bach      51     1
2 Bach      72     2
3 Bach     122     1
4 Mozart     2     1
5 Mozart    83     2
6 Mozart    95     1
It is a little unclear exactly what you are looking for, as this output differs from what you describe. If you just want the total counts for each unique id, you can drop entity from the group_by() statement:
df %>%
  pivot_longer(-entity) %>%
  group_by(value) %>%
  tally()
# A tibble: 6 × 2
  value     n
  <dbl> <int>
1     2     1
2    51     1
3    72     2
4    83     2
5    95     1
6   122     1
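Equivalently, count() collapses the group_by() + tally() pair into a single call:

df %>%
  pivot_longer(-entity) %>%
  count(value)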
For number 2, you could do something like this:
df %>%
  pivot_longer(-entity) %>%
  group_by(value) %>%
  mutate(name = paste0("entity ", 1:n())) %>%
  pivot_wider(names_from = "name", values_from = "entity")
# A tibble: 6 × 3
# Groups:   value [6]
  value `entity 1` `entity 2`
  <dbl> <chr>      <chr>
1    51 Bach       NA
2    72 Bach       Bach
3   122 Bach       NA
4     2 Mozart     NA
5    83 Mozart     Mozart
6    95 Mozart     NA
Data
df <- structure(
  list(
    entity = c("Bach", "Mozart"),
    article.1 = c(51, 2),
    article.2 = c(72, 83),
    article.3 = c(122, 95),
    article.4 = c(72, 83)
  ),
  class = "data.frame",
  row.names = c(NA, -2L)
)
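A side note on the asker's 97 article columns: pivot_longer() can select them by name pattern instead of -entity (a sketch assuming every article column shares the "article" prefix, as in the toy data above):

library(tidyverse)

# Target all article columns by prefix; drop the NAs that appear
# when an entity has fewer articles than there are columns.
df %>%
  pivot_longer(starts_with("article"), values_drop_na = TRUE) %>%
  count(value, name = "count")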

Related

Group all rows that meet a certain condition

I have the following dataframe df1:
  company_location count
  <chr>            <int>
1 DE                  28
2 JP                   6
3 GB                  47
4 HN                   1
5 US                 355
6 HU                   1
I want to get to df2:
  company_location count
  <chr>            <int>
1 DE                  28
2 GB                  47
3 US                 355
4 OTHER                8
df2 is the same as df1 but sums together all the rows with count < 10 and aggregates them into a row called OTHER.
Does something like this exist: a group_by() variant that groups all the rows matching a particular condition into one group and leaves every other row in its own group?
This is what fct_lump_min is for - it's a function from forcats, which is part of the tidyverse.
library(tidyverse)

df1 %>%
  group_by(company_location = fct_lump_min(company_location, 10, count)) %>%
  summarise(count = sum(count))
#> # A tibble: 4 x 2
#>   company_location count
#>   <fct>            <int>
#> 1 DE                  28
#> 2 GB                  47
#> 3 US                 355
#> 4 Other                8
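If you want the label to match the asker's "OTHER" exactly, fct_lump_min() also takes an other_level argument:

df1 %>%
  group_by(company_location = fct_lump_min(company_location, 10, count,
                                           other_level = "OTHER")) %>%
  summarise(count = sum(count))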
Make a temporary variable regrouping company_location based on count, then summarise:
library(dplyr)
df1 %>%
  group_by(company_location = replace(company_location, count < 10, 'OTHER')) %>%
  summarise(count = sum(count))
#  company_location count
#  <chr>            <int>
#1 DE                  28
#2 GB                  47
#3 OTHER                8
#4 US                 355
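For completeness, the same replace-then-aggregate idea works in base R (a sketch on the df1 above; note it modifies df1 in place):

df1$company_location[df1$count < 10] <- "OTHER"
aggregate(count ~ company_location, df1, sum)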

R output BOTH maximum and minimum value by group in dataframe

Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
library(tibble)

set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
  Name  Value
  <chr> <int>
1 A        27
2 A        37
3 A        57
4 B        89
5 B        20
6 B        86
7 C        97
8 C        62
9 C        58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range() to get the min and max values, and use it in summarise() to get two rows for each Name:
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(Value = range(Value), .groups = "drop")
#  Name  Value
#  <chr> <int>
#1 A        27
#2 A        57
#3 B        20
#4 B        89
#5 C        58
#6 C        97
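Note that summarise() returning more than one row per group is deprecated as of dplyr 1.1.0; reframe() was added there for exactly this multi-row case:

df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))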
If you have a large dataset, using data.table might be faster:
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)

set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))

df %>%
  group_by(Name) %>%
  summarise(
    maximum = max(Value),
    minimum = min(Value)
  )
This outputs:
# A tibble: 3 × 3
  Name  maximum minimum
  <chr>   <int>   <int>
1 A          68       1
2 B          87      34
3 C          82      14
What's a little odd is that my original df object looks a little different from yours, in spite of the seed (most likely because R 3.6.0 changed the default sample() algorithm; see the sample.kind argument of ?RNGkind):
# A tibble: 9 × 2
  Name  Value
  <chr> <int>
1 A        68
2 A        39
3 A         1
4 B        34
5 B        87
6 B        43
7 C        14
8 C        82
9 C        59
I'm currently using rbind() together with slice_min() and slice_max(), but I suspect this is not the most efficient approach when the dataframe contains millions of rows.
library(tidyverse)

rbind(df %>% group_by(Name) %>% slice_max(Value),
      df %>% group_by(Name) %>% slice_min(Value)) %>%
  arrange(Name)
# A tibble: 6 x 2
# Groups:   Name [3]
  Name  Value
  <chr> <int>
1 A        57
2 A        27
3 B        89
4 B        20
5 C        97
6 C        58
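A single-pass alternative that avoids binding two slices is to filter on each group's min and max directly (a sketch; ties are kept, as with the default with_ties = TRUE of slice_min()/slice_max()):

df %>%
  group_by(Name) %>%
  filter(Value == max(Value) | Value == min(Value)) %>%
  arrange(Name)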
In base R, the output format can be created with tapply()/stack(): do a grouped tapply() to get a named list of ranges, stack() it into a two-column data.frame, and rename the columns if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
  Name Value
1    A    27
2    A    57
3    B    20
4    B    89
5    C    58
6    C    97
Using aggregate():
aggregate(Value ~ Name, df, range)
#   Name Value.1 Value.2
# 1    A       1      68
# 2    B      34      87
# 3    C      14      82
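One caveat: aggregate() with range() packs both numbers into a single matrix column, which is why they print as Value.1 and Value.2; do.call(data.frame, ...) splits it into ordinary columns if needed:

out <- aggregate(Value ~ Name, df, range)
str(out)                  # Value is a 3 x 2 matrix column
do.call(data.frame, out)  # becomes plain Value.1 / Value.2 columns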

What is this arrange function in the second line doing here?

I'm currently reviewing R for Data Science and encountered this chunk of code. I don't understand the necessity of the arrange() call here. Doesn't arrange() just reorder the rows?
library(tidyverse)
library(nycflights13)

flights %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  mutate(delay_gt1hr = dep_delay > 60) %>%
  mutate(before_delay = cumsum(delay_gt1hr)) %>%
  filter(before_delay < 1) %>%
  count(sort = TRUE)
However, it does output differently with or without the arrange function, as shown below:
# with the arrange function
   tailnum     n
   <chr>   <int>
 1 N954UW    206
 2 N952UW    163
 3 N957UW    142
 4 N5FAAA    117
 5 N38727     99
 6 N3742C     98
 7 N5EWAA     98
 8 N705TW     97
 9 N765US     97
10 N635JB     94
# ... with 3,745 more rows
and
# without the arrange function
   tailnum     n
   <chr>   <int>
 1 N952UW    215
 2 N315NB    161
 3 N705TW    160
 4 N961UW    139
 5 N713TW    128
 6 N765US    122
 7 N721TW    120
 8 N5FAAA    117
 9 N945UW    104
10 N19130    101
# ... with 3,774 more rows
I'd appreciate it if you can help me understand this. Why is it necessary to include the arrange function here?
Yes, arrange() just orders the rows, but the pipeline then filters on cumsum(delay_gt1hr), and a cumulative sum depends on row order, so the filter keeps different rows.
Here is a simplified example to demonstrate how the output differs with and without arrange.
library(dplyr)

df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))

df %>% filter(cumsum(b) < 20)
#  a b
#1 1 7
#2 2 8

df %>% arrange(b) %>% filter(cumsum(b) < 20)
#  a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8
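The underlying point is simply that cumsum() is order-dependent, so the same values pass or fail the threshold depending on where they sit:

cumsum(c(7, 8, 9, 1, 2))  # 7 15 24 25 27 -> two partial sums below 20
cumsum(c(1, 2, 7, 8, 9))  # 1  3 10 18 27 -> four partial sums below 20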

Merge rows containing similar strings using dplyr

I have a table containing the following data:
df <- tibble(
  dose = seq(10, 50, 10),
  date = c("2007-12-15", "2007-10-13", "2007-10-13", "2007-09-30", "2007-09-30"),
  response = c(45, 67, 66, 54, 55),
  name = c("Peter,Martin", "Gale,Rebecca", "Rebecca,Gale", "Jonathan,Smith", "Smith,Jonathan")
)
The table:
# A tibble: 5 x 4
   dose date       response name
  <dbl> <chr>         <dbl> <chr>
1    10 2007-12-15       45 Peter,Martin
2    20 2007-10-13       67 Gale,Rebecca
3    30 2007-10-13       66 Rebecca,Gale
4    40 2007-09-30       54 Jonathan,Smith
5    50 2007-09-30       55 Smith,Jonathan
One of the columns, name, contains either "FirstName,LastName" or "LastName,FirstName". I wish to merge the rows that contain the same names ordered either way. For example, the rows containing Rebecca,Gale and Gale,Rebecca should merge.
While merging, I wish to get the sums of the columns dose and response and want to keep the first of the date and name entries.
Expected outcome:
# A tibble: 3 x 4
   dose date       response name
  <dbl> <chr>         <dbl> <chr>
1    10 2007-12-15       45 Peter,Martin
2    50 2007-10-13      133 Gale,Rebecca
3    90 2007-09-30      109 Jonathan,Smith
Please note that I always want to merge using the name column and not the date column because even if the example contains the same dates, my bigger table has different dates for the same name.
Here is one idea.
library(tidyverse)

df2 <- df %>%
  mutate(date = as.Date(date)) %>%
  mutate(name = map_chr(name, ~ toString(sort(str_split(.x, ",")[[1]])))) %>%
  group_by(name) %>%
  summarize(dose = sum(dose),
            response = sum(response),
            date = first(date)) %>%
  select(all_of(names(df))) %>%
  ungroup()

df2
# # A tibble: 3 x 4
#    dose date       response name
#   <dbl> <date>        <dbl> <chr>
# 1    50 2007-10-13      133 Gale, Rebecca
# 2    90 2007-09-30      109 Jonathan, Smith
# 3    10 2007-12-15       45 Martin, Peter
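If you would rather keep the spelling of the first occurrence (Peter,Martin instead of the sorted Martin, Peter), a variation is to group on a sorted key but keep first(name) as the label (a sketch on the same toy df):

df %>%
  mutate(key = map_chr(str_split(name, ","), ~ paste(sort(.x), collapse = ","))) %>%
  group_by(key) %>%
  summarize(dose = sum(dose), response = sum(response),
            date = first(date), name = first(name), .groups = "drop") %>%
  select(all_of(names(df)))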

Add a column based on an unlike value in another column

I am trying to add a column of data where the value comes from a different row that shares one id (class_id) but differs in the other (student). The data is below.
class_id student score other_score
       1      23    87          93
       1      27    93          87
       2      14    77          90
       2      19    90          77
The other_score column is what I am looking to achieve, given the first three columns. I have already tried:
df$other_score = df[df$class_id == df$class_id & df$student != df$student,]$score
I might be overcomplicating it, but if there are always just two students per class, sum after the group_by and then subtract score:
library(dplyr)
output <- df %>%
  group_by(class_id) %>%
  mutate(other_score = sum(score) - score)
output
# A tibble: 4 x 4
# Groups:   class_id [2]
  class_id student score other_score
     <dbl>   <dbl> <dbl>       <dbl>
1        1      23    87          93
2        1      27    93          87
3        2      14    77          90
4        2      19    90          77
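The same sum-minus-own-score trick works in base R with ave(), with no grouping pipeline needed:

df$other_score <- ave(df$score, df$class_id, FUN = sum) - df$score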
One option would be to use lead() and lag(), retaining whichever of the two is not NA:
library(dplyr)

output <- df %>%
  group_by(class_id) %>%
  mutate(other_score = ifelse(is.na(lead(score, order_by = student)),
                              lag(score, order_by = student),
                              lead(score, order_by = student)))
One option using setdiff() is to ignore the current row's index (row_number()) and pick the score at the remaining index:
library(dplyr)
library(purrr)

df %>%
  group_by(class_id) %>%
  mutate(other_score = score[map_dbl(seq_len(n()), ~ setdiff(seq_len(n()), .x))])
#  class_id student score other_score
#     <int>   <int> <int>       <int>
#1        1      23    87          93
#2        1      27    93          87
#3        2      14    77          90
#4        2      19    90          77
If you have more than two values in each class_id, use

mutate(other_score = score[map_dbl(seq_len(n()), ~ setdiff(seq_len(n()), .x)[1])])

which will select only the first of the remaining values, or

mutate(other_score = score[map_dbl(seq_len(n()), ~ sample(setdiff(seq_len(n()), .x))[1])])

to randomly select one value from the remaining scores.
