This question already has answers here:
Cartesian product with dplyr
(7 answers)
Closed 1 year ago.
I have two tables:
table 1:
| | a | b |
|---|----|----|
| 1 | a1 | b1 |
| 2 | a2 | b2 |
and
table 2:
| | c | d |
|---|----|----|
| 1 | c1 | d1 |
| 2 | c2 | d2 |
I want to join them so that each row of table 1 is bound column-wise to every row of table 2, giving this result:
| | a | b | c | d |
|---|----|----|----|----|
| 1 | a1 | b1 | c1 | d1 |
| 2 | a1 | b1 | c2 | d2 |
| 3 | a2 | b2 | c1 | d1 |
| 4 | a2 | b2 | c2 | d2 |
I feel like this is a duplicate question, but I could not find the right wording or search terms to locate the answer.
There is no need to join; we can use tidyr::expand_grid:
library(dplyr)
library(tidyr)
table1 <- tibble(a = c("a1", "a2"),
b = c("b1", "b2"))
table2 <- tibble(c = c("c1","c2"),
d = c("d1", "d2"))
expand_grid(table1, table2)
#> # A tibble: 4 x 4
#> a b c d
#> <chr> <chr> <chr> <chr>
#> 1 a1 b1 c1 d1
#> 2 a1 b1 c2 d2
#> 3 a2 b2 c1 d1
#> 4 a2 b2 c2 d2
Created on 2021-09-17 by the reprex package (v2.0.1)
I found a crude answer: add a constant key column to both tables and join on it.
library(dplyr)
table1$key <- 1
table2$key <- 1
result <- left_join(table1, table2, by = "key") %>%
  select(-key)
Any better answer would be much appreciated.
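For comparison, here is a minimal base R sketch of the same Cartesian product. It assumes the two example tables from the question, rebuilt here as plain data frames, and relies on merge() returning the Cartesian product when by has length zero.

```r
# Example tables from the question, rebuilt as base data frames.
table1 <- data.frame(a = c("a1", "a2"), b = c("b1", "b2"))
table2 <- data.frame(c = c("c1", "c2"), d = c("d1", "d2"))

# With by = NULL there are no key columns, so every row of table1
# is paired with every row of table2.
result <- merge(table1, table2, by = NULL)

# merge() varies table1 fastest; re-sort to match the layout asked for.
result <- result[order(result$a, result$c), ]
rownames(result) <- NULL
print(result)
```

No helper key column is needed, so nothing has to be dropped afterwards.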
I have a df like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
In R, how do I keep only the VisitIDs that contain both Item A and Item B?
Expected Outcome:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
I tried df %>% group_by(VisitID) %>% filter(any(Item == 'A' & Item == 'B')), but it doesn't work.
df <- read_delim("ID | Item
1 | A
1 | B
2 | A
3 | B
1 | C
4 | C
5 | B
3 | A
4 | A
5 | D", delim = "|", trim_ws = TRUE)
Since you want both "A" and "B", you can use all:
library(dplyr)
df %>% group_by(VisitID) %>% filter(all(c("A", "B") %in% Item))
# VisitID Item
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 1 D
#5 2 A
#6 2 D
#7 2 B
Or, if you want to use any, test each item separately:
df %>% group_by(VisitID) %>% filter(any(Item == 'A') && any(Item == 'B'))
An option with data.table:
library(data.table)
setDT(df)[, .SD[all(c("A", "B") %in% Item)], VisitID]
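To see why the original attempt fails and the answers above work, here is a small base R sketch. It assumes the VisitID/Item data from the question, rebuilt by hand; has_both and keep_ids are illustrative names only.

```r
# Data from the question, rebuilt by hand.
df <- data.frame(
  VisitID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  Item    = c("A", "B", "C", "D", "A", "D", "B", "B", "C", "D", "C")
)

# Items of a single visit that does contain both A and B.
items <- df$Item[df$VisitID == 1]

# Element-wise test: no single element can equal "A" and "B" at once.
attempt <- any(items == "A" & items == "B")
# Set-membership test: are both values present somewhere in the group?
fixed <- all(c("A", "B") %in% items)

# The same membership test applied per VisitID, without dplyr.
has_both <- tapply(df$Item, df$VisitID, function(x) all(c("A", "B") %in% x))
keep_ids <- names(has_both)[has_both]
result <- df[as.character(df$VisitID) %in% keep_ids, ]
print(result)
```

The element-wise comparison is always FALSE for every row, which is why filter(any(Item == 'A' & Item == 'B')) drops every group.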
I wish to calculate a running count of unique values, row by row, within each group in R. The count should not include blank (NA) cells.
For example:
df<-data.frame(
Group=c("A1","A1","A1","A1","A1","B1","B1","B1"),
Segment=c("A",NA,"A","B","A",NA,"A","B")
)
INPUT:
+---------+--------+
| Group |Segment |
+---------+--------+
| A1 |A |
| A1 |NA |
| A1 |A |
| A1 |B |
| A1 |A |
| B1 |NA |
| B1 |A |
| B1 |B |
+---------+--------+
I solved the problem with a for loop, but on a big dataset it takes too long to produce the result.
Expected output in the distinct column:
+---------+--------+----------+
| Group |Segment | distinct |
+---------+--------+----------+
| A1 |A | 1 |
| A1 |NA | 1 |
| A1 |A | 1 |
| A1 |B | 2 |
| A1 |A | 2 |
| B1 |NA | 0 |
| B1 |A | 1 |
| B1 |B | 2 |
+---------+--------+----------+
duplicated is useful for this, although the NAs make it a bit tricky:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(distinct = cumsum(!duplicated(Segment) & !is.na(Segment)))
# A tibble: 8 x 3
# Groups: Group [2]
Group Segment distinct
<fct> <fct> <int>
1 A1 A 1
2 A1 NA 1
3 A1 A 1
4 A1 B 2
5 A1 A 2
6 B1 NA 0
7 B1 A 1
8 B1 B 2
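The same running count can be sketched in base R with ave(), for readers who are not using dplyr. This is only an illustration on the example data. Passing row indices to ave() is a choice made here so that the result stays integer.

```r
df <- data.frame(
  Group   = c("A1", "A1", "A1", "A1", "A1", "B1", "B1", "B1"),
  Segment = c("A", NA, "A", "B", "A", NA, "A", "B")
)

# ave() applies FUN to the row indices of each Group separately.
# A row adds 1 to the running count only if its Segment is non-NA
# and has not been seen earlier in the same group.
df$distinct <- ave(
  seq_along(df$Segment), df$Group,
  FUN = function(i) {
    seg <- df$Segment[i]
    cumsum(!duplicated(seg) & !is.na(seg))
  }
)
print(df)
```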
I have two dataframes: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max Sales value and bring its information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
for (i in seq_along(df_groupIDs$GroupID)) {
  # all events belonging to the current group
  groupEvents <- subset(df_workingFile, GroupID == df_groupIDs$GroupID[i])
  # first event whose Sales equals the group's MaxSales
  index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
  df_groupIDs$ID[i] <- groupEvents$ID[index]
  df_groupIDs$Date[i] <- groupEvents$Date[index]
}
Using dplyr:
library(dplyr)
df_workingFile %>%
  group_by(GroupID) %>% # for each group id
  arrange(desc(Sales)) %>% # sort by Sales (descending)
  slice(1) %>% # keep the top row
  inner_join(df_groupIDs, by = "GroupID") %>% # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)
  # keep the columns you want in the order you want
Another simpler method, if the Sales are integers (and can thus be relied on for equality testing with the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
This relies on an SQLite feature: when an aggregate query uses max, the other (non-aggregated) columns in the select list take their values from the row in which the maximum was found.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: The two input data frames shown reproducibly are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)
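A loop-free base R variant can be sketched as follows. It assumes the two example data frames, rebuilt here directly. The name top is illustrative, and ties on Sales are resolved by keeping whichever row sorts first.

```r
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006)
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2)
)

# Sort by group, then by Sales descending, so the best event comes first.
top <- df_workingFile[order(df_workingFile$GroupID, -df_workingFile$Sales), ]
# Keep only the first (highest-Sales) row of each group.
top <- top[!duplicated(top$GroupID), c("GroupID", "ID", "Date")]

# Attach ID and Date to the per-group summary.
result <- merge(df_groupIDs, top, by = "GroupID")
print(result)
```

Sorting once and dropping duplicates avoids subsetting the whole data frame for every group, which is what makes the original loop slow.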
This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 6 years ago.
EDIT:
Upon further examination, this dataset is way more insane than I previously believed.
Values have been encapsulated in the column names!
My dataframe looks like this:
| ID | Year1_A | Year1_B | Year2_A | Year2_B |
|----|---------|---------|---------|---------|
| 1 | a | b | 2a | 2b |
| 2 | c | d | 2c | 2d |
I am searching for a way to reformat it as such:
| ID | Year | _A | _B |
|----|------|-----|-----|
| 1 | 1 | a | b |
| 1 | 2 | 2a | 2b |
| 2 | 1 | c | d |
| 2 | 2 | 2c | 2d |
The answer below is great and works perfectly, but the data frame needs more work: it somehow has to be spread back out so that each ID and Year pair has its _A and _B values in separate columns.
My best idea was to do merge(df, df, by="ID") and then filter out the unwanted rows but this is quickly becoming unwieldy.
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
library(tidyr)
# your example data
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
# the solution
df <- gather(df, Year, value, -ID)
# cleaning up
df$Year <- gsub("Year", "", df$Year)
Result:
> df
ID Year value
1 1 1_A a
2 2 1_A c
3 1 1_B b
4 2 1_B d
5 1 2_A 2a
6 2 2_A 2c
7 1 2_B 2b
8 2 2_B 2d
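To get from this long result to the layout asked for in the question, one possible sketch uses tidyr::pivot_longer with the special ".value" name. It assumes tidyr 1.0.0 or later. The regular expression is an assumption about the column naming scheme (Year, digits, then a suffix starting with an underscore).

```r
library(tidyr)

df <- data.frame(ID = 1:2,
                 Year1_A = c("a", "c"), Year1_B = c("b", "d"),
                 Year2_A = c("2a", "2c"), Year2_B = c("2b", "2d"))

# Split each column name into a Year part and a suffix part.
# ".value" tells pivot_longer that the suffix ("_A", "_B") names the
# output columns rather than being stored as data.
long <- pivot_longer(
  df,
  cols = -ID,
  names_to = c("Year", ".value"),
  names_pattern = "Year(\\d+)(_.+)"
)
print(long)
```

This does the gather, the clean-up of the Year values, and the spread back out in a single step; Year comes back as a character column.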
Given the table below (columns: ID, CAUSE, WORK):
ID | CAUSE | WORK
A | C1 | W1
B | C1 | W1
C | C1 | W1
D | C1 | W1
E | C1 | W2
F | C1 | W2
G | C1 | W2
H | C1 | W3
I | C1 | W3
FF | C2 | W4
FG | C2 | W4
FG | C2 | W1
FG | C2 | W1
FG | C2 | W6
I want the two highest work counts per cause. That is, with a simple count(work) grouped by cause and work, the result would be:
cause | work| count(work)
c1 | w1 | 4
c1 | w2 | 3
c1 | w3 | 2
c2 | w4 | 2
c2 | w1 | 2
c2 | w6 | 1
I want to keep only the two works with the highest counts per cause:
c1 | w1 | 4
c1 | w2 | 3
c2 | w4 | 2
c2 | w1 | 2
This should work:
select cause,
work,
cnt as "COUNT"
from (
select cause,
work,
count(work) as cnt,
row_number() over (partition by cause order by count(work) desc, work desc) as rown
from your_table group by cause, work
) where rown <= 2;
select cause,work,Count from
(
select cause,work,Count(Work) as Count
from table_name
group by cause,work
)
where Count = (select Max(Count(Work)) as Count
from table_name
group by cause,work)