I have the table below (columns: ID, CAUSE, WORK):
ID | CAUSE | WORK
A | C1 | W1
B | C1 | W1
C | C1 | W1
D | C1 | W1
E | C1 | W2
F | C1 | W2
G | C1 | W2
H | C1 | W3
I | C1 | W3
FF | C2 | W4
FG | C2 | W4
FG | C2 | W1
FG | C2 | W1
FG | C2 | W6
I want the two highest work counts per cause. That is, with a simple count(work) grouped by cause and work, the result would be:
cause | work | count(work)
c1 | w1 | 4
c1 | w2 | 3
c1 | w3 | 2
c2 | w4 | 2
c2 | w1 | 2
c2 | w6 | 1
I want to keep only the two works with the highest counts per cause:
c1 | w1 | 4
c1 | w2 | 3
c2 | w4 | 2
c2 | w1 | 2
This should work:
select cause,
       work,
       cnt as "COUNT"
from (
    select cause,
           work,
           count(work) as cnt,
           row_number() over (partition by cause order by count(work) desc, work desc) as rown
    from your_table
    group by cause, work
) t
where rown <= 2;
Another approach, though note two issues with it as originally written: max(count(work)) is not a valid nested aggregate, and filtering on the maximum count returns only the single most frequent work per cause (plus ties), not the top two:
select cause, work, cnt as "COUNT"
from (
    select cause, work, count(work) as cnt
    from table_name
    group by cause, work
) t
where cnt = (select max(cnt)
             from (select cause, count(work) as cnt
                   from table_name
                   group by cause, work) m
             where m.cause = t.cause);
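For readers working in pandas rather than SQL, the same top-2-per-cause selection can be sketched as follows. This is a minimal sketch using the sample data from the question; the column names are illustrative:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "id":    ["A", "B", "C", "D", "E", "F", "G", "H", "I",
              "FF", "FG", "FG", "FG", "FG"],
    "cause": ["C1"] * 9 + ["C2"] * 5,
    "work":  ["W1", "W1", "W1", "W1", "W2", "W2", "W2", "W3", "W3",
              "W4", "W4", "W1", "W1", "W6"],
})

# count works per (cause, work); sorting mirrors the SQL
# "order by count(work) desc, work desc"
counts = (df.groupby(["cause", "work"])
            .size()
            .reset_index(name="count")
            .sort_values(["cause", "count", "work"],
                         ascending=[True, False, False]))

# keep the two highest counts per cause
top2 = counts.groupby("cause").head(2)
```

Like the row_number() query, ties are broken by work in descending order, so the result is deterministic.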
I would like to assign each unique combination of variables a value and list those values in a new column called ID, as shown below. For example, I would like patients who are TA cancer, N0 lymph, and 1 immunotherapy ID'd as 1; patients who are TA, N0, and 2 as ID 2; and so on. Below is a table of what the data looks like before and what I would like it to look like after. The data was loaded from a .csv file.
So to summarize:
Patients TA, N0, 1 ID = 1
Patients TA, N0, 2 ID = 2
Patients TA, Nx, 0 ID = 3
Patients TA, Nx, 1 ID = 4
Patients TA, N0, 0 ID = 5
Patients TA, Nx, 2 ID = 6
Before:
| Cancer | Lymph | Immunotherapy |
| ------ | ----- | ------------- |
| TA     | N0    | 1             |
| TA     | N0    | 2             |
| TA     | N0    | 1             |
| TA     | Nx    | 0             |
| TA     | Nx    | 1             |
| TA     | N0    | 0             |
| TA     | Nx    | 1             |
| TA     | Nx    | 2             |
After:
| Cancer | Lymph | Immunotherapy | ID |
| ------ | ----- | ------------- | -- |
| TA     | N0    | 1             | 1  |
| TA     | N0    | 2             | 2  |
| TA     | N0    | 1             | 1  |
| TA     | Nx    | 0             | 3  |
| TA     | Nx    | 1             | 4  |
| TA     | N0    | 0             | 5  |
| TA     | Nx    | 1             | 4  |
| TA     | Nx    | 2             | 6  |
I attempted to use dplyr's group_by() and mutate() with no luck. Any help would be much appreciated. Thanks!
In base R:
d <- do.call(paste, df)                            # one key string per row
cbind(df, id = as.numeric(factor(d, unique(d))))   # number keys by first appearance
Cancer Lymph Immunotherapy id
1 TA N0 1 1
2 TA N0 2 2
3 TA N0 1 1
4 TA Nx 0 3
5 TA Nx 1 4
6 TA N0 0 5
7 TA Nx 1 4
8 TA Nx 2 6
library(dplyr)
df %>%
  group_by(Cancer, Lymph, Immunotherapy) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()
Note that cur_group_id() numbers groups in sorted key order, so the IDs may differ from the first-appearance numbering shown in the expected output.
Alternatively, preserving order of first appearance:
df %>%
  left_join(df %>%
              distinct(Cancer, Lymph, Immunotherapy) %>%
              mutate(ID = row_number()))
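If the same first-appearance numbering is needed in pandas rather than R, groupby(..., sort=False).ngroup() gives it directly. A sketch on the question's data:

```python
import pandas as pd

# the "before" data from the question
df = pd.DataFrame({
    "Cancer": ["TA"] * 8,
    "Lymph": ["N0", "N0", "N0", "Nx", "Nx", "N0", "Nx", "Nx"],
    "Immunotherapy": [1, 2, 1, 0, 1, 0, 1, 2],
})

# sort=False numbers groups in order of first appearance,
# matching the expected IDs; ngroup() is 0-based, hence the + 1
df["ID"] = df.groupby(["Cancer", "Lymph", "Immunotherapy"], sort=False).ngroup() + 1
```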
This question already has answers here:
Cartesian product with dplyr
(7 answers)
Closed 1 year ago.
I have two tables:
table 1:
| | a | b |
|---|----|----|
| 1 | a1 | b1 |
| 2 | a2 | b2 |
and
table 2:
| | c | d |
|---|----|----|
| 1 | c1 | d1 |
| 2 | c2 | d2 |
I want to join them so that each row of table 1 is combined with every row of table 2, to get this result:
| | a | b | c | d |
|---|----|----|----|----|
| 1 | a1 | b1 | c1 | d1 |
| 2 | a1 | b1 | c2 | d2 |
| 3 | a2 | b2 | c1 | d1 |
| 4 | a2 | b2 | c2 | d2 |
I feel like this is a duplicate question, but I could not find the right wording and search terms to find the answer.
There is no need to join, we can use tidyr::expand_grid:
library(dplyr)
library(tidyr)
table1 <- tibble(a = c("a1", "a2"),
b = c("b1", "b2"))
table2 <- tibble(c = c("c1","c2"),
d = c("d1", "d2"))
expand_grid(table1, table2)
#> # A tibble: 4 x 4
#> a b c d
#> <chr> <chr> <chr> <chr>
#> 1 a1 b1 c1 d1
#> 2 a1 b1 c2 d2
#> 3 a2 b2 c1 d1
#> 4 a2 b2 c2 d2
Created on 2021-09-17 by the reprex package (v2.0.1)
I found a crude answer:
table1$key <- 1
table2$key <- 1
result <- left_join(table1, table2, by = "key") %>%
  select(-key)
Any better answer is much appreciated.
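For what it's worth, the same dummy-key trick was the usual workaround in pandas too; since pandas 1.2, merge(how="cross") does the Cartesian product directly. A sketch, not specific to the R answers above:

```python
import pandas as pd

table1 = pd.DataFrame({"a": ["a1", "a2"], "b": ["b1", "b2"]})
table2 = pd.DataFrame({"c": ["c1", "c2"], "d": ["d1", "d2"]})

# Cartesian product: every row of table1 paired with every row of table2
result = table1.merge(table2, how="cross")
```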
I wish to calculate a running count of unique values per row within each group in R. The running count should not include NA values.
For example:
df<-data.frame(
Group=c("A1","A1","A1","A1","A1","B1","B1","B1"),
Segment=c("A",NA,"A","B","A",NA,"A","B")
)
INPUT:
+---------+--------+
| Group |Segment |
+---------+--------+
| A1 |A |
| A1 |NA |
| A1 |A |
| A1 |B |
| A1 |A |
| B1 |NA |
| B1 |A |
| B1 |B |
+---------+--------+
I have used a for loop to solve this, but on a big dataset it takes a long time to get the result.
Expected output (the distinct column):
+---------+--------+----------+
| Group |Segment | distinct |
+---------+--------+----------+
| A1 |A | 1 |
| A1 |NA | 1 |
| A1 |A | 1 |
| A1 |B | 2 |
| A1 |A | 2 |
| B1 |NA | 0 |
| B1 |A | 1 |
| B1      |B       | 2        |
+---------+--------+----------+
duplicated is useful for this, although the NAs make it a bit tricky:
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(distinct = cumsum(!duplicated(Segment) & !is.na(Segment)))
# A tibble: 8 x 3
# Groups: Group [2]
Group Segment distinct
<fct> <fct> <int>
1 A1 A 1
2 A1 NA 1
3 A1 A 1
4 A1 B 2
5 A1 A 2
6 B1 NA 0
7 B1 A 1
8 B1 B 2
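The same duplicated-plus-cumsum idea translates to pandas, where duplicated() on the (Group, Segment) pair marks repeats. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["A1"] * 5 + ["B1"] * 3,
    "Segment": ["A", None, "A", "B", "A", None, "A", "B"],
})

# a row starts a new distinct value if its (Group, Segment) pair
# has not been seen before and Segment is not NA
is_new = ~df.duplicated(["Group", "Segment"]) & df["Segment"].notna()

# running count of new distinct values within each group
df["distinct"] = is_new.groupby(df["Group"]).cumsum().astype(int)
```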
I have two dataframes: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max Sales value and bring its information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
i <- 1
for (groupID in df_groupIDs$GroupID) {
  groupEvents <- subset(df_workingFile, GroupID == groupID)
  index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
  df_groupIDs$ID[i] <- groupEvents$ID[index]
  df_groupIDs$Date[i] <- groupEvents$Date[index]
  i <- i + 1
}
Using dplyr:
library(dplyr)
df_workingFile %>%
  group_by(GroupID) %>%          # for each group id
  arrange(desc(Sales)) %>%       # sort by Sales (descending)
  slice(1) %>%                   # keep the top row per group
  inner_join(df_groupIDs) %>%    # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)  # keep the columns you want, in order
Another simpler method, if the Sales are integers (and can thus be relied on for equality testing with the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
This makes use of a SQLite feature: when a bare max() is used in an aggregate query, the other selected columns are automatically taken from the row that the maximum came from.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: The two input data frames shown reproducibly are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)
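A vectorized pandas equivalent of the approaches above uses idxmax() to locate the top-sales row per group, then joins it back. A sketch on the question's data:

```python
import pandas as pd

df_workingFile = pd.DataFrame({
    "ID": ["v", "w", "x", "y", "z"],
    "GroupID": ["a1", "a1", "b1", "b1", "c3"],
    "Sales": [1, 3, 8, 3, 2],
    "Date": [2011, 2010, 2007, 2006, 2006],
})
df_groupIDs = pd.DataFrame({
    "GroupID": ["a1", "b1", "c3"],
    "numIDs": [2, 2, 1],
    "MaxSales": [3, 8, 2],
})

# row index of the maximum Sales within each GroupID
top_idx = df_workingFile.groupby("GroupID")["Sales"].idxmax()

# bring the ID and Date of those rows into df_groupIDs
top = df_workingFile.loc[top_idx, ["GroupID", "ID", "Date"]]
result = df_groupIDs.merge(top, on="GroupID")
```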
I want to spread the name column.
d <- data.frame(ID = c(1,1,2,2,2,3,3),
name = c("a", "b", "a", "c", "d","c","d"))
| ID | name |
|-----|------|
| 1 | a |
| 1 | b |
| 2 | a |
| 2 | c |
| 2 | d |
| 3 | c |
| 3 | d |
Using tidyr::spread() I can get the data frame below:
d %>% tidyr::spread(name,name)
| ID| a | b | c | d |
| 1 | a | b | NA| NA|
| 2 | a | NA| c | d |
| 3 | NA| NA| c | d |
But I want to get this data frame:
| ID | name1 | name2 | name3 |
|-----|-------|-------|-------|
| 1 | a | b | NA |
| 2 | a | c | d |
| 3 | c | d | NA |
We can create a new column and spread
library(tidyverse)
d %>%
  group_by(ID) %>%
  mutate(new = paste0("name", row_number())) %>%
  spread(new, name)
# ID name1 name2 name3
#* <dbl> <fctr> <fctr> <fctr>
#1 1 a b NA
#2 2 a c d
#3 3 c d NA
It is relatively concise with dcast
library(data.table)
dcast(setDT(d), ID~paste0("name", rowid(ID)), value.var = "name")
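The same rowid/row_number trick carries over to pandas with groupby().cumcount() plus pivot(). A sketch on the question's data:

```python
import pandas as pd

d = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3],
                  "name": ["a", "b", "a", "c", "d", "c", "d"]})

# number occurrences within each ID: name1, name2, ...
d["col"] = "name" + (d.groupby("ID").cumcount() + 1).astype(str)

# one row per ID, occurrence numbers become columns
wide = d.pivot(index="ID", columns="col", values="name").reset_index()
```

Missing combinations (e.g. a third name for ID 1) come out as NaN, matching the NA cells in the desired output.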