Find the element with the biggest value for every title [duplicate] - r

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 3 years ago.
I have the following data:
library(tidyverse)
df1 <- tibble(
title = c("AA", "AA", "AA", "B", "C", "D", "D"),
rate = c(100, 100, 100, 95, 92, 90, 90),
name = c("G", "N", "E", "T", "O", "W", "L"),
pos = c(10, 1, 2, 2, 3, 5, 4)
)
title rate name pos
<chr> <dbl> <chr> <dbl>
AA 100 G 10
AA 100 N 1
AA 100 E 2
B 95 T 2
C 92 O 3
D 90 W 5
D 90 L 4
For every title, I want to find which name has the biggest pos value.
So, for title AA, it would be G, for title B, it would be T, for title C it would be O and for title D it would be W.

For B it should be "T"?
df1 %>% group_by(title) %>% top_n(1,pos) %>% pull(name)
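For comparison, here is a minimal base R sketch of the same idea that needs no packages. It assumes the data is rebuilt as a plain data.frame; the helper names is_max and best are only illustrative. ave() broadcasts each group's maximum back to every row, so a tie within a title would return more than one row (as top_n() also does).

```r
# Same data as in the question, as a plain data.frame
df1 <- data.frame(
  title = c("AA", "AA", "AA", "B", "C", "D", "D"),
  rate  = c(100, 100, 100, 95, 92, 90, 90),
  name  = c("G", "N", "E", "T", "O", "W", "L"),
  pos   = c(10, 1, 2, 2, 3, 5, 4),
  stringsAsFactors = FALSE
)

# For every row, ave() returns the maximum pos within that row's title
is_max <- df1$pos == ave(df1$pos, df1$title, FUN = max)

# Keep only the rows that hold their group's maximum
best <- df1[is_max, c("title", "name")]
best
```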

Related

Compute the difference between two columns by pair in R

I have the following data:
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
I want to "extend" this data frame to make name pairs for every possible combination of names without repetition like so:
names_1 <- c("a", "a", "a", "b", "b", "c")
names_2 <- c("b", "c", "d", "c", "d", "d")
scores_1 <- c(95, 95, 95, 55, 55, 100)
scores_2 <- c(55, 100, 60, 100, 60, 60)
df_extended <- cbind.data.frame(names_1, names_2, scores_1, scores_2)
In the extended data, scores_1 are the scores for the corresponding name in names_1, and scores_2 are for names_2.
The following bit of code makes the appropriate name pairs. But I do not know how to get the scores in the right place after that.
t(combn(df$names,2))
The final goal is to get the row-wise difference between scores_1 and scores_2.
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)
df_ext <- data.frame(t(combn(df$names, 2,\(x)c(x, df$scores[df$names %in%x]))))
df_ext <- setNames(type.convert(df_ext, as.is =TRUE), c('name_1','name_2', 'type_1', 'type_2'))
df_ext
name_1 name_2 type_1 type_2
1 a b 95 55
2 a c 95 100
3 a d 95 60
4 b c 55 100
5 b d 55 60
6 c d 100 60
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
library(tidyverse)
map(df, ~combn(x = .x, m = 2)%>% t %>% as_tibble) %>%
imap_dfc(~set_names(x = .x, nm = paste(.y, seq(ncol(.x)), sep = "_"))) %>%
mutate(score_diff = scores_1 - scores_2)
#> # A tibble: 6 × 5
#> names_1 names_2 scores_1 scores_2 score_diff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 a b 95 55 40
#> 2 a c 95 100 -5
#> 3 a d 95 60 35
#> 4 b c 55 100 -45
#> 5 b d 55 60 -5
#> 6 c d 100 60 40
Created on 2022-06-06 by the reprex package (v2.0.1)
First, we can create a new data frame with the unique combinations of names. Then, we can merge on the scores to match the names for both names_1 and names_2 to get the final data.frame.
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
new_df <- data.frame(t(combn(df$names,2)))
names(new_df)[1] <- "names_1"; names(new_df)[2] <- "names_2"
new_df <- merge(new_df, df, by.x = 'names_1', by.y = 'names')
new_df <- merge(new_df, df, by.x = 'names_2', by.y = 'names')
names(new_df)[3] <- "scores_1"; names(new_df)[4] <- "scores_2"
> new_df
names_2 names_1 scores_1 scores_2
1 b a 95 55
2 c a 95 100
3 c b 55 100
4 d a 95 60
5 d b 55 60
6 d c 100 60
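Because merge() sorts by the key and moves it to the first column, the result above comes back in a different row and column order than the requested layout. One possible alternative, sketched here in base R, is to build the pairs from row indices rather than from the names, so both the names and the scores can be looked up with the same index matrix (idx is an illustrative name, not from the answers above).

```r
names  <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- data.frame(names, scores, stringsAsFactors = FALSE)

# 2 x 6 matrix: each column holds the row indices of one pair
idx <- combn(nrow(df), 2)

df_extended <- data.frame(
  names_1  = df$names[idx[1, ]],
  names_2  = df$names[idx[2, ]],
  scores_1 = df$scores[idx[1, ]],
  scores_2 = df$scores[idx[2, ]],
  stringsAsFactors = FALSE
)

# Row-wise absolute difference, as asked for in the question
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)
df_extended
```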

Convert information from rows to new columns

Is there a way in R to place every three values in the column "V" (below) into new columns? In other words, I need to reshape the data from long to wide, but only to three columns, where the values are what appears in column V. Below is a demonstration.
Thank you in advance!
data = structure(list(Key = c(200, 200, 200, 200, 200, 200, 300, 300,
300, 300, 300, 300, 400, 400, 400, 400, 400, 400),
V = c("a", "b", "c", "b", "d", "c", "d", "b", "c", "a", "f", "c", "d", "b",
"c", "a", "b", "c")),
row.names = c(NA, 18L),
class = "data.frame")
Here is one option
data %>%
group_by(Key) %>%
mutate(
grp = gl(n() / 3, 3),
col = c("x", "y", "z")[(row_number() + 2) %% 3 + 1]) %>%
group_by(Key, grp) %>%
spread(col, V) %>%
ungroup() %>%
select(-grp)
## A tibble: 6 x 4
# Key x y z
# <dbl> <chr> <chr> <chr>
#1 200 a b c
#2 200 b d c
#3 300 d b c
#4 300 a f c
#5 400 d b c
#6 400 a b c
Note: This assumes that the number of entries per Key is divisible by 3.
Instead of grp = gl(n() / 3, 3) you can also use grp = rep(1:(n() / 3), each = 3).
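As a quick check of that equivalence, here is a small sketch for a group of six rows (n stands in for n() and is only an illustrative value). gl() returns a factor and rep() an integer vector, but both label the rows 1, 1, 1, 2, 2, 2.

```r
n <- 6  # stand-in for n(), the number of rows in one Key group

grp_gl  <- gl(n / 3, 3)              # factor with levels "1" and "2"
grp_rep <- rep(1:(n / 3), each = 3)  # integer vector

as.integer(grp_gl)
grp_rep
```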
Update
In response to your comments, let's create sample data by removing some rows from data such that for Key = 200 and Key = 300 we don't have a multiple of 3 V entries.
data2 <- data %>% slice(-c(1, 8))
Then we can do
data2 %>%
group_by(Key) %>%
mutate(grp = gl(ceiling(n() / 3), 3)[1:n()]) %>%
group_by(Key, grp) %>%
mutate(col = c("x", "y", "z")[1:n()]) %>%
spread(col, V) %>%
ungroup() %>%
select(-grp)
## A tibble: 6 x 4
# Key x y z
# <dbl> <chr> <chr> <chr>
#1 200 b c b
#2 200 d c NA
#3 300 d c a
#4 300 f c NA
#5 400 d b c
#6 400 a b c
Note how "missing" values are filled with NA.
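The same padding idea can also be sketched in base R, without dplyr or tidyr: split V by Key, pad each piece with NA up to a multiple of three, and fill a three-column matrix row by row. The helper to_three() and the column names x, y, z are assumptions chosen to mirror the output above.

```r
data <- data.frame(
  Key = rep(c(200, 300, 400), each = 6),
  V   = c("a", "b", "c", "b", "d", "c", "d", "b", "c",
          "a", "f", "c", "d", "b", "c", "a", "b", "c"),
  stringsAsFactors = FALSE
)
data2 <- data[-c(1, 8), ]  # same rows removed as in the update above

# Pad a vector with NA to a multiple of 3, then fill a 3-column matrix by row
to_three <- function(v) {
  length(v) <- 3 * ceiling(length(v) / 3)
  matrix(v, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("x", "y", "z")))
}

parts <- lapply(split(data2$V, data2$Key), to_three)

wide <- data.frame(
  Key = rep(as.numeric(names(parts)), vapply(parts, nrow, integer(1))),
  do.call(rbind, parts),
  stringsAsFactors = FALSE
)
wide
```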

Find rows in a data frame where certain columns are duplicated, then combine the elements in other columns [duplicate]

This question already has answers here:
Collapse text by group in data frame [duplicate]
(2 answers)
Aggregating by unique identifier and concatenating related values into a string [duplicate]
(4 answers)
Closed 3 years ago.
I have a data frame. I want to find the rows where both columns A and B are duplicated, and then combine those rows by pasting the elements in column C together.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot
Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]
This is a tidyverse-based solution: group by A and B, then use paste() with collapse to combine the values of C within each group.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = ";"))
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M;N
Created on 2019-03-19 by the reprex package (v0.2.1)

Create column identifying minimum character from within a group and label ties

I have paired data for 10 subjects (with some missing and some ties). My goal is to select the eye with the best disc_grade (A > B > C) and label ties accordingly from the data frame below.
I'm stuck on how to use R code to select the rows with the best disc_grade for each subject.
df <- structure(list(patientID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 7, 7, 8, 8, 9, 9, 10, 10), eye = c("R", "L", "R", "L", "R",
"L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L",
"R", "L"), disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C",
"B", "A", "B", "B", "C", "B", NA, NA, "B", "C", "B", "C")), .Names = c("patientID", "eye", "disc_grade"), class = c("tbl_df", "data.frame"), row.names = c(NA, -20L))
The desired output is:
patientID eye disc_grade
2 1 L B
4 2 L B
5 3 R B
7 4 R B
10 5 L A
11 6 Tie B
14 7 L B
17 9 R B
19 10 R B
This seems to work:
df %>%
group_by(patientID) %>%
filter(disc_grade == min(disc_grade, na.rm=TRUE)) %>%
summarise(eye = if (n()==1) eye else "Tie", disc_grade = first(disc_grade))
patientID eye disc_grade
(dbl) (chr) (chr)
1 1 L B
2 2 L B
3 3 R B
4 4 R B
5 5 L A
6 6 Tie B
7 7 L B
8 9 R B
9 10 R B
There is a warning for group 8, but we get the desired result thanks to how filter works on NAs.
With data.table:
setDT(df)[,
.SD[ disc_grade == min(disc_grade, na.rm=TRUE) ][,
.( eye = if (.N==1) eye else "Tie", disc_grade = disc_grade[1] )
]
, by=patientID]
Again, there's a warning, but now we do get a row for group 8, since [ does not ignore NAs. To get around this, you could filter the NAs before or after the operation (as in other answers). My best idea for doing it during the main operation is pretty convoluted:
setDT(df)[,
.SD[ which(disc_grade == min(disc_grade, na.rm=TRUE)) ][,
if (.N >= 1) list( eye = if (.N==1) eye else "Tie", disc_grade = disc_grade[1] )
]
, by=patientID]
One option with data.table
library(data.table)
na.omit(setDT(df))[, eye:=if(uniqueN(disc_grade)==1 &
.N >1) 'Tie' else eye, patientID
][order(factor(disc_grade, levels=c('A', 'B', 'C'))),
.SD[1L] ,patientID][order(patientID)]
# patientID eye disc_grade
#1: 1 L B
#2: 2 L B
#3: 3 R B
#4: 4 R B
#5: 5 L A
#6: 6 Tie B
#7: 7 L B
#8: 9 R B
#9: 10 R B
library(dplyr)
df <- structure(list(patientID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 7, 7, 8, 8, 9, 9, 10, 10), eye = c("R", "L", "R", "L", "R",
"L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L",
"R", "L"), disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C",
"B", "A", "B", "B", "C", "B", NA, NA, "B", "C", "B", "C")), .Names = c("patientID", "eye", "disc_grade"), class = c("tbl_df", "data.frame"), row.names = c(NA, -20L))
df %>%
filter(!is.na(disc_grade)) %>% ## remove rows with NAs
group_by(patientID) %>% ## for each patient
filter(disc_grade == min(disc_grade)) %>% ## keep the row (his eye) that has the best score
mutate(eye_upd = ifelse(n() > 1, "tie", eye)) %>% ## if you kept both eyes you have a tie
select(patientID,eye_upd,disc_grade) %>%
distinct()
# patientID eye_upd disc_grade
# (dbl) (chr) (fctr)
# 1 1 L B
# 2 2 L B
# 3 3 R B
# 4 4 R B
# 5 5 L A
# 6 6 tie B
# 7 7 L B
# 8 9 R B
# 9 10 R B
There's certainly a better way to do this, but this gets the job done...need more coffee...
df_orig <- df
library(dplyr)
df %>%
filter(!is.na(disc_grade)) %>%
group_by(patientID) %>%
summarise(best = min(disc_grade)) %>%
left_join(., df_orig, by = c("patientID" = "patientID",
"best" = "disc_grade")) %>%
group_by(patientID) %>%
mutate(eye = ifelse(n() > 1, "tie", eye)) %>%
distinct(patientID, .keep_all = TRUE) %>% ## keep eye and best; current dplyr drops other columns by default
select(patientID, eye, best)
Note: I am able to get away with min(disc_grade) because of type conversion. Consider looking at as.numeric(as.factor(df$disc_grade)).
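If relying on how min() treats the grade letters feels fragile, the ranking can be made explicit. The following base R sketch assumes the grade order A > B > C from the question and maps each grade to a numeric rank with match(); the column name rank and the intermediate objects are illustrative only. The data is rebuilt as a plain data.frame with the same values as above.

```r
df <- data.frame(
  patientID  = rep(1:10, each = 2),
  eye        = rep(c("R", "L"), times = 10),
  disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C", "B", "A",
                 "B", "B", "C", "B", NA, NA, "B", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# Explicit ranking: 1 is best; missing grades stay NA
df$rank <- match(df$disc_grade, c("A", "B", "C"))
ok <- df[!is.na(df$rank), ]

# Keep the rows that hold the best (lowest) rank within each patient
best <- ok[ok$rank == ave(ok$rank, ok$patientID, FUN = min), ]

# More than one best row for a patient means both eyes tie
n_best <- ave(best$rank, best$patientID, FUN = length)
best$eye[n_best > 1] <- "Tie"

result <- unique(best[, c("patientID", "eye", "disc_grade")])
result
```

Patient 8 has no graded eye, so it drops out when the NA ranks are removed.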

in R: reorder the rows of a dataframe based on those in another table [duplicate]

This question already has answers here:
Order data frame rows according to vector with specific order
(6 answers)
Closed 8 years ago.
I have a table as follows:
tab1 <- as.table(matrix(c(8,6,9,0,8,4,0,12,7,10), ncol = 2, byrow = FALSE,
dimnames = list(site = c("beta", "alpha", "gamma", "theta", "delta"),
count = c("low", "high"))))
> tab1
count
site low high
beta 8 4
alpha 6 0
gamma 9 12
theta 0 7
delta 8 10
and a data.frame that maps the site names to siteID's:
data.frame(site = c("alpha", "beta", "gamma", "delta", "theta"), siteId = c(1102, 3154, 9000, 1101, 1103))
site siteId
1 alpha 1102
2 beta 3154
3 gamma 9000
4 delta 1101
5 theta 1103
Finally, I have a data.frame that contains these siteID's and some other variables:
data.frame(siteId = c(1101, 1102, 1103, 3154, 9000), treatment = c("A", "B", "C", "E", "F"))
siteId treatment
1 1101 A
2 1102 B
3 1103 C
4 3154 E
5 9000 F
What I need to be able to do is to order the rows in the last data frame in the same way that the rows in tab1 are ordered, so it should yield:
siteId treatment
1 3154 E
2 1102 B
3 9000 F
4 1103 C
5 1101 A
How can I do that, without engaging in elaborate looping? The actual dataset is quite large, so looping would take much more time than I would want to.
You can do this by matching the IDs from the different data frames. By the way I changed epsilon in data frame a to theta, as there's no epsilon in tab1.
tab1 <- as.table(matrix(c(8,6,9,0,8,4,0,12,7,10), ncol = 2, byrow = FALSE,
dimnames = list(site = c("beta", "alpha", "gamma", "theta", "delta"),
count = c("low", "high"))))
a = data.frame(site = c("alpha", "beta", "gamma", "delta", "theta"), siteId = c(1102, 3154, 9000, 1101, 1103))
b = data.frame(siteId = c(1101, 1102, 1103, 3154, 9000), treatment = c("A", "B", "C", "E", "F"))
# put a in the order of tab1
a = a[match(rownames(tab1), a$site),] # match(target order, current values) gives the row positions to pick
# put b in order of a
b = b[match(a$siteId, b$siteId),]
> b
# siteId treatment
#4 3154 E
#2 1102 B
#5 9000 F
#3 1103 C
#1 1101 A
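The two lookups can also be chained without reordering a first. This is a small sketch under the assumption that every site in the target order appears exactly once in a, and every siteId exactly once in b; target, ids and b_ordered are illustrative names, and target stands in for rownames(tab1).

```r
target <- c("beta", "alpha", "gamma", "theta", "delta")  # rownames(tab1)

a <- data.frame(
  site   = c("alpha", "beta", "gamma", "delta", "theta"),
  siteId = c(1102, 3154, 9000, 1101, 1103),
  stringsAsFactors = FALSE
)
b <- data.frame(
  siteId    = c(1101, 1102, 1103, 3154, 9000),
  treatment = c("A", "B", "C", "E", "F"),
  stringsAsFactors = FALSE
)

# match(wanted, available) returns positions in 'available', in 'wanted' order
ids <- a$siteId[match(target, a$site)]   # site names -> siteIds, in tab1 order
b_ordered <- b[match(ids, b$siteId), ]   # siteIds -> rows of b, same order
b_ordered
```

Since match() is vectorised, this avoids explicit loops even for large data.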
