add a column based on unlike value in another column - r

I am trying to add a column of data where each value comes from a different row that shares the same class_id but has a different student id. The data is below.
class_id student score other_score
1 23 87 93
1 27 93 87
2 14 77 90
2 19 90 77
The other_score column is what I am looking to achieve, given the first three columns. I have already tried:
df$other_score = df[df$class_id == df$class_id & df$student != df$student,]$score

I might be over-complicating it, but if there are always just two kids per class, you can sum after grouping and then subtract each student's own score:
library(dplyr)
output = df %>%
group_by(class_id) %>%
mutate(other_score = sum(score)-score)
output
# A tibble: 4 x 4
# Groups: class_id [2]
class_id student score other_score
<dbl> <dbl> <dbl> <dbl>
1 1 23 87 93
2 1 27 93 87
3 2 14 77 90
4 2 19 90 77

One option would be to use lead and lag, and to retain the non NA value, whatever that might be:
library(dplyr)
output <- df %>%
group_by(class_id) %>%
mutate(other_score = ifelse(is.na(lead(score, order_by=student)),
lag(score, order_by=student),
lead(score, order_by=student)))
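Since there are exactly two rows per class_id here, reversing the scores within each group is an even shorter sketch of the same idea:

```r
library(dplyr)
df %>%
  group_by(class_id) %>%
  mutate(other_score = rev(score))
```

With two rows per group, rev(score) simply swaps the two values, which is exactly the other student's score. Note this shortcut only works for groups of exactly two.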

One option using setdiff is to ignore the current row index (row_number()) and select the score from the remaining index.
library(dplyr)
library(purrr)
df %>%
group_by(class_id) %>%
mutate(other_score = score[map_dbl(seq_len(n()), ~setdiff(seq_len(n()), .))])
# class_id student score other_score
# <int> <int> <int> <int>
#1 1 23 87 93
#2 1 27 93 87
#3 2 14 77 90
#4 2 19 90 77
If you have more than two values in each class_id, use
setdiff(seq_len(n()), .)[1]
which will select only one value, or you could also do
sample(setdiff(seq_len(n()), .), 1)
to randomly select one value from the remaining scores.
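To make the more-than-two case concrete, here is a small self-contained sketch (the three-student data frame is invented for illustration):

```r
library(dplyr)
library(purrr)

# Hypothetical class with three students
df3 <- data.frame(class_id = c(1, 1, 1),
                  student  = c(23, 27, 31),
                  score    = c(87, 93, 75))

df3 %>%
  group_by(class_id) %>%
  mutate(other_score = score[map_dbl(seq_len(n()),
                                     ~ setdiff(seq_len(n()), .x)[1])])
```

Each row picks the score at the first remaining index, so rows 2 and 3 both take the first student's score (87), while row 1 takes the second student's (93).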

R output BOTH maximum and minimum value by group in dataframe

Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get the min and max values, and use it in summarise to get two rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
group_by(Name) %>%
summarise(
maximum = max(Value),
minimum = min(Value)
)
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
What's a little odd is that my original df object looks a little different from yours, in spite of the seed (likely because the default sample() algorithm changed in R 3.6.0):
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
df %>% group_by(Name) %>% slice_min(Value)) %>%
arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a group-by tapply to get the output as a named list of ranges, stack it into a two-column data.frame, and rename the columns if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate.
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
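One thing to note with the aggregate() approach: when FUN returns more than one value, the result is stored as a matrix column, which is why the output prints Value.1 and Value.2. A sketch of flattening it back to one row per min/max, matching the two-column layout requested above:

```r
res <- aggregate(Value ~ Name, df, range)
# res$Value is a two-column matrix (min, max) per Name;
# interleave it back into one Value per row
data.frame(Name  = rep(res$Name, each = 2),
           Value = as.vector(t(res$Value)))
```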

Summarizing a difficult dataset

I have a dataset that is basically formatted backwards from how I need it to perform a specific analysis. It represents entities and the articles they are found in, represented by id numbers (see below; the column headings [article 1, 2, 3, etc.] just mean the 1st, 2nd, 3rd articles they appear in, and the id in the cell is the substantive part). What I'd like to get is a count of how many entities appear in each article, which I think I could do with something like dplyr's group_by and summarise, but I can't find how to apply that to a range of columns (there are actually 97 article columns in the dataset).
entity  article 1  article 2  article 3
Bach    51         72         122
Mozart  2          83         95
Two specific transformations that would be useful for me are
The number of entities in each article calculated as the count of the times each unique ID appears in an entity row. eg:
id   count
51   5424
72   1001
122  4000
The entities in each article. eg:
id  entity 1  entity 2  entity 3
51  Bach      Mozart    etc
72  Mozart    Liszt     etc
All this should be possible from this dataset, I just can't figure out how to get it into a workable format. Thanks for your help!
For number 1, you can pivot to long format, then get the counts for each unique id for each entity using tally.
library(tidyverse)
df %>%
pivot_longer(-entity) %>%
group_by(entity, value) %>%
tally()
# A tibble: 6 × 3
# Groups: entity [2]
entity value n
<chr> <dbl> <int>
1 Bach 51 1
2 Bach 72 2
3 Bach 122 1
4 Mozart 2 1
5 Mozart 83 2
6 Mozart 95 1
It is a little unclear exactly what you are looking for, as the output seems different than what you describe. So, if you just want the total counts for each unique id, then you could drop entity in the group_by statement.
df %>%
pivot_longer(-entity) %>%
group_by(value) %>%
tally()
# A tibble: 6 × 2
value n
<dbl> <int>
1 2 1
2 51 1
3 72 2
4 83 2
5 95 1
6 122 1
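If the article columns really are everything except entity, the per-id counts (part 1) can also be sketched in base R with table(), with no reshaping at all:

```r
# Tabulate every id across all article columns at once
counts <- as.data.frame(table(unlist(df[-1])))
names(counts) <- c("id", "count")
counts
```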
For number 2, you could do something like this:
df %>%
pivot_longer(-entity) %>%
group_by(value) %>%
mutate(name = paste0("entity " , 1:n())) %>%
pivot_wider(names_from = "name", values_from = "entity")
# A tibble: 6 × 3
# Groups: value [6]
value `entity 1` `entity 2`
<dbl> <chr> <chr>
1 51 Bach NA
2 72 Bach Bach
3 122 Bach NA
4 2 Mozart NA
5 83 Mozart Mozart
6 95 Mozart NA
Data
df <- structure(
list(
entity = c("Bach", "Mozart"),
article.1 = c(51, 2),
article.2 = c(72, 83),
article.3 = c(122, 95),
article.4 = c(72, 83)
),
class = "data.frame",
row.names = c(NA,-2L)
)

Calculating total sum of line segments overlapping on a line

I'm trying to calculate the total sum of overlapping line segments across a single line. With line A, the segments are disjoint, so it's pretty simple to calculate. However, with lines B and C, there are overlapping line segments, so it's more complicated: I need to somehow exclude the parts of previous segments that are already part of the total sum.
data = read.table(text="
line left_line right_line small_line left_small_line right_small_line
A 100 120 101 91 111
A 100 120 129 119 139
B 70 90 63 53 73
B 70 90 70 60 80
B 70 90 75 65 85
C 20 40 11 1 21
C 20 40 34 24 44
C 20 40 45 35 55", header=TRUE)
This should be the expected result.
result = read.table(text="
total_overlapping
A 0.6
B 0.75
C 0.85", header=TRUE)
EDIT: Added a picture to better illustrate what I'm trying to figure out. There's 3 different pictures of lines (solid red line), with line segments (the dashed lines) overlapping. The goal is to figure out how much of the dashed lines are covering/overlapping.
[Images: Line A, Line B, Line C]
If I understand correctly, the small_line variable is irrelevant here. The rest of the columns can be used to get the sum of overlapping segments:
Step 1. Get the start & end point for each segment's overlap with the corresponding line:
library(dplyr)
data1 <- data %>%
rowwise() %>%
mutate(overlap.start = max(left_line, left_small_line),
overlap.end = min(right_line, right_small_line)) %>%
ungroup() %>%
select(line, overlap.start, overlap.end)
> data1
# A tibble: 8 x 3
line overlap.start overlap.end
<fct> <int> <int>
1 A 100 111
2 A 119 120
3 B 70 73
4 B 70 80
5 B 70 85
6 C 20 21
7 C 24 40
8 C 35 40
Step 2. Within the rows corresponding to each line, sort the overlaps in order. Consider it a new overlapping section if it is the first overlap, OR the previous overlap ends before it starts. Label each new overlapping section:
data2 <- data1 %>%
arrange(line, overlap.start, overlap.end) %>%
group_by(line) %>%
mutate(new.section = is.na(lag(overlap.end)) |
lag(overlap.end) <= overlap.start) %>%
mutate(section.number = cumsum(new.section)) %>%
ungroup()
> data2
# A tibble: 8 x 5
line overlap.start overlap.end new.section section.number
<fct> <int> <int> <lgl> <int>
1 A 100 111 TRUE 1
2 A 119 120 TRUE 2
3 B 70 73 TRUE 1
4 B 70 80 FALSE 1
5 B 70 85 FALSE 1
6 C 20 21 TRUE 1
7 C 24 40 TRUE 2
8 C 35 40 FALSE 2
Step 3. Within each overlapping section, take the earliest starting point & the latest ending point. Calculate the length of each overlap:
data3 <- data2 %>%
group_by(line, section.number) %>%
summarise(overlap.start = min(overlap.start),
overlap.end = max(overlap.end)) %>%
ungroup() %>%
mutate(overlap = overlap.end - overlap.start)
> data3
# A tibble: 5 x 5
line section.number overlap.start overlap.end overlap
<fct> <int> <dbl> <dbl> <dbl>
1 A 1 100 111 11
2 A 2 119 120 1
3 B 1 70 85 15
4 C 1 20 21 1
5 C 2 24 40 16
Step 4. Sum the length of overlaps for each line:
data4 <- data3 %>%
group_by(line) %>%
summarise(overlap = sum(overlap)) %>%
ungroup()
> data4
# A tibble: 3 x 2
line overlap
<fct> <dbl>
1 A 12
2 B 15
3 C 17
Now, your expected result shows the expected percentage of overlap on each line, rather than the sum. If that's what you are looking for, you can add the length for each line to data4, & calculate accordingly:
data5 <- data4 %>%
left_join(data %>%
select(line, left_line, right_line) %>%
unique() %>%
mutate(length = right_line - left_line) %>%
select(line, length),
by = "line") %>%
mutate(overlap.percentage = overlap / length)
> data5
# A tibble: 3 x 4
line overlap length overlap.percentage
<fct> <dbl> <int> <dbl>
1 A 12 20 0.6
2 B 15 20 0.75
3 C 17 20 0.85
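Steps 2 through 4 amount to a classic interval-union computation; for reference, here is a compact base-R sketch of the same idea for one line's clipped segments (the function name is my own):

```r
# Total length covered by a set of [start, end] intervals
union_length <- function(start, end) {
  o <- order(start, end)
  start <- start[o]; end <- end[o]
  total <- 0
  cur_start <- start[1]; cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] > cur_end) {
      # Disjoint: close off the current section
      total <- total + (cur_end - cur_start)
      cur_start <- start[i]; cur_end <- end[i]
    } else {
      # Overlapping: extend the current section
      cur_end <- max(cur_end, end[i])
    }
  }
  total + (cur_end - cur_start)
}

union_length(c(70, 70, 70), c(73, 80, 85))  # line B's overlaps: 15
```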

How do I combine row entries for the same patient ID# in R while keeping other columns and NA values?

I need to combine some of the columns for these multiple IDs, and I can just use the values from the first listing of each ID for the others. For example, here I just want to combine the spending column, as well as the heart attack column, so that it just says whether the patient ever had a heart attack. I then want to delete the duplicate ID#s and keep the values from the first listing for the other columns:
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header=TRUE)
What I need:
df2 <- read.table(text =
"ID Age Gender ever_heartattack all_spending
1 24 f 0 140
2 24 m 1 181
3 85 f 1 170
4 45 m na 204", header=TRUE)
I tried group_by with transmute() and sum() as follows:
df$heartattack = as.numeric(as.character(df$heartattack))
df$spending = as.numeric(as.character(df$spending))
library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))
But this removes all the other columns! Also it turns NA values into zeros...for example I still want "NA" to be the value for patient ID#4, I don't want to change the data to say they never had a heart attack!
> print(df) # This doesn't at all match df2 :(
ID ever_heartattack all_spending
1 1 0 140
2 2 1 181
3 2 1 181
4 2 1 181
5 3 1 170
6 4 0 204
Could you do this?
aggregate(
spending ~ ID + Age + Gender,
data = transform(df, spending = as.numeric(as.character(spending))),
FUN = sum)
# ID Age Gender spending
#1 1 24 f 140
#2 3 85 f 170
#3 2 24 m 181
#4 4 45 m 204
Some comments:
The thing is that when aggregating you don't give clear rules how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2 why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0?
Your "na"s are in fact not real NAs. That leads to spending being a factor column instead of a numeric column vector.
To exactly reproduce your expected output one can do
df %>%
mutate(
heartattack = as.numeric(as.character(heartattack)),
spending = as.numeric(as.character(spending))) %>%
group_by(ID, Age, Gender) %>%
summarise(
heartattack = ifelse(
any(heartattack %in% c(0, 1)),
max(heartattack, na.rm = T),
NA),
spending = sum(spending, na.rm = T))
## A tibble: 4 x 5
## Groups: ID, Age [?]
# ID Age Gender heartattack spending
# <int> <int> <fct> <dbl> <dbl>
#1 1 24 f 0 140
#2 2 24 m 1 181
#3 3 85 f 1 170
#4 4 45 m NA 204
This feels a bit "hacky" because the rules are not clear about which heartattack value to keep. In this case we keep the maximum value of heartattack if the group contains either 0 or 1, and return NA if it does not.
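The all-NA case is worth guarding explicitly, since max(x, na.rm = TRUE) on an all-NA vector warns and returns -Inf. A small helper (the name is my own) keeps the summarise call readable:

```r
library(dplyr)

# Max that returns NA instead of -Inf for all-NA groups
max_or_na <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
```

Then `heartattack = max_or_na(heartattack)` inside summarise replaces the ifelse/any construction above.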

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
# Use a tibble and get rid of factors.
library(dplyr)
library(tidyr)
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to find the start of a new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4
