R programming: ifelse on multiple tables [duplicate] - r

I am just venturing into R programming and finding my way around.
Let's say I have a table as below:
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
Y | A | 1
Y | B | 2
Y | C | 5
Z | A | 3
Z | B | 6
Z | C | 2
I need to change the sales values of certain products based on another table. Please find below:
Product | Sales
A | 10
B | 7
C | 15
My final table should be:
Store | Product | Sales
X | A | 10
X | B | 7
X | C | 15
Y | A | 10
Y | B | 7
Y | C | 15
Z | A | 10
Z | B | 7
Z | C | 15
I have 2 methods of doing this now:
1) Using joins
2) Using an if-else statement inside a for loop to subset the
Is there any other way to do this more effectively and in fewer steps?
Thanks in advance!
EDIT: I forgot to mention an exception earlier. What if my dataset is like below?
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
X | D | 4
Y | A | 1
Y | B | 2
Y | C | 5
Y | D | 2
Z | A | 3
Z | B | 6
Z | C | 2
Z | D | 3
There's an extra product (D) with sales. I want to retain the sales value for that product when it is not present in the second table, which is:
Product | Sales
A | 10
B | 7
C | 15

How about this join?
Since you want to change the Sales values of certain Products only, I have included just two products in lookup_df to illustrate this. Products missing from lookup_df keep their original Sales value.
library(dplyr)
df %>%
left_join(lookup_df, by = "Product", suffix = c("_Original", "_New")) %>%
mutate(Sales_New = coalesce(Sales_New, Sales_Original))
Output is:
Store Product Sales_Original Sales_New
1 X A 2 10
2 X B 1 1
3 X C 3 15
4 Y A 1 10
5 Y B 2 2
6 Y C 5 15
7 Z A 3 10
8 Z B 6 6
9 Z C 2 15
Sample data:
df <- structure(list(Store = c("X", "X", "X", "Y", "Y", "Y", "Z", "Z",
"Z"), Product = c("A", "B", "C", "A", "B", "C", "A", "B", "C"
), Sales = c(2L, 1L, 3L, 1L, 2L, 5L, 3L, 6L, 2L)), .Names = c("Store",
"Product", "Sales"), class = "data.frame", row.names = c(NA,
-9L))
lookup_df <- structure(list(Product = c("A", "C"), Sales = c(10L, 15L)), .Names = c("Product", "Sales"), class = "data.frame", row.names = c(NA,
-2L))
# Product Sales
#1 A 10
#2 C 15

If you use a lookup-vector, it is relatively short:
d <- read.table(text = "
Store | Product | Sales
X | A | 2
X | B | 1
X | C | 3
Y | A | 1
Y | B | 2
Y | C | 5
Z | A | 3
Z | B | 6
Z | C | 2", sep = "|", header = TRUE, stringsAsFactors = FALSE)
lookup <- read.table(text = "Product | Sales
A | 10
B | 7
C | 15", sep = "|", header = TRUE, stringsAsFactors = FALSE)
lookup$Product <- gsub("^\\s+|\\s+$", "", lookup$Product) # remove spaces
lookup <- setNames(lookup$Sales, lookup$Product) # convert to vector
d$Product <- gsub("^\\s+|\\s+$", "", d$Product) # remove spaces
d$Sales <- lookup[d$Product] # main part
d
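The lookup-vector approach returns NA for any product that is missing from the lookup, so it does not cover the exception from the question's EDIT (product D) on its own. A minimal sketch of one way to keep the original value in that case, assuming the data from the EDIT; the name new_sales is only illustrative:

```r
# Data from the question's EDIT, including product D
d <- data.frame(Store = rep(c("X", "Y", "Z"), each = 4),
                Product = rep(c("A", "B", "C", "D"), times = 3),
                Sales = c(2, 1, 3, 4, 1, 2, 5, 2, 3, 6, 2, 3),
                stringsAsFactors = FALSE)

# Named lookup vector: names are products, values are the new sales
lookup <- c(A = 10, B = 7, C = 15)

# Indexing by name gives NA for products that are not in the lookup
new_sales <- unname(lookup[d$Product])

# Keep the original value wherever the lookup has no entry
d$Sales <- ifelse(is.na(new_sales), d$Sales, new_sales)
d
```

Rows for product D keep their original Sales, while A, B and C take the values from the lookup.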

Related

Expand data frame and add a new variable

I have a data frame structured like this:
+----------+------+--------+-------+
| Location | year | group1 | Value |
+----------+------+--------+-------+
| a | 2020 | 1 | x |
| a | 2020 | 2 | y |
| a | 2020 | 3 | z |
| a | 2021 | 1 | x |
| a | 2021 | 2 | y |
| a | 2021 | 3 | z |
| b | 2020 | 1 | x |
| b | 2020 | 2 | y |
| b | 2020 | 3 | z |
+----------+------+--------+-------+
I would like to expand the data frame to include 3 rows for every location, year, and group1 combination and generate a group2 variable that identifies these new combinations (1-3). Ideally, the data frame will look like this:
+----------+------+--------+-------+--------+
| Location | year | group1 | Value | group2 |
+----------+------+--------+-------+--------+
| a | 2020 | 1 | x | 1 |
| a | 2020 | 1 | x | 2 |
| a | 2020 | 1 | x | 3 |
| a | 2020 | 2 | y | 1 |
| a | 2020 | 2 | y | 2 |
| a | 2020 | 2 | y | 3 |
| ... | ... |... |... |... |
+----------+------+--------+-------+--------+
I was able to expand the dataframe to the correct number of total rows using the following code:
df[rep(seq_len(nrow(df)),3), 1:4]
But couldn't figure out how to add the group2 variable shown above.
With tidyr you can use expand - this will expand your data frame to all combinations of values with your sequence of 1 to 3:
library(tidyverse)
df %>%
group_by(Location, year, group1, Value) %>%
expand(group2 = 1:3)
Output
Location year group1 Value group2
<fct> <dbl> <int> <fct> <int>
1 a 2020 1 x 1
2 a 2020 1 x 2
3 a 2020 1 x 3
4 a 2020 2 y 1
5 a 2020 2 y 2
6 a 2020 2 y 3
...
Your approach looks close, and I suppose you could just add on group2 like this:
cbind(df[rep(seq_len(nrow(df)), each = 3), ], group2 = 1:3)
Here is one way to do it, building on your own code:
library(dplyr)
# 1. Data set (a plain data.frame, so no extra package is needed)
df <- data.frame(
location = c("a","a","a","a","a","a","b","b","b"),
year = c(2020,2020,2020,2021,2021,2021,2020,2020,2020),
group1 = c(1,2,3,1,2,3,1,2,3),
value = c("x","y","z","x","y","z","x","y","z"),
stringsAsFactors = FALSE)
# 2. Your code to expand data frame
df <- df[rep(seq_len(nrow(df)), 3), 1:4]
# 3. Arrange
df <- df %>% arrange(location, year, group1, value)
# 4. Add 'group2' as a running index (1 to 3) within each combination
df <- df %>%
group_by(location, year, group1, value) %>%
mutate(group2 = row_number()) %>%
arrange(location, year, group1, value, group2)
Hope it works
We can use crossing from tidyr
library(tidyr)
library(dplyr)
crossing(df1, group2 = 1:3)
# A tibble: 27 x 5
# Location year group1 Value group2
# <chr> <int> <int> <chr> <int>
# 1 a 2020 1 x 1
# 2 a 2020 1 x 2
# 3 a 2020 1 x 3
# 4 a 2020 2 y 1
# 5 a 2020 2 y 2
# 6 a 2020 2 y 3
# 7 a 2020 3 z 1
# 8 a 2020 3 z 2
# 9 a 2020 3 z 3
#10 a 2021 1 x 1
# … with 17 more rows
Or create a list column and then unnest
df1 %>%
mutate(group2 = list(1:3)) %>%
unnest(c(group2))
data
df1 <- structure(list(Location = c("a", "a", "a", "a", "a", "a", "b",
"b", "b"), year = c(2020L, 2020L, 2020L, 2021L, 2021L, 2021L,
2020L, 2020L, 2020L), group1 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), Value = c("x", "y", "z", "x", "y", "z", "x", "y", "z"
)), class = "data.frame", row.names = c(NA, -9L))
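For completeness, the same expansion can be sketched in base R as a cross join: merge() with by = NULL returns the Cartesian product of its two inputs. This assumes the df1 sample data above; the name expanded is only illustrative.

```r
df1 <- data.frame(Location = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                  year = c(2020L, 2020L, 2020L, 2021L, 2021L, 2021L,
                           2020L, 2020L, 2020L),
                  group1 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
                  Value = c("x", "y", "z", "x", "y", "z", "x", "y", "z"),
                  stringsAsFactors = FALSE)

# by = NULL makes merge() pair every row of df1 with every group2 value
expanded <- merge(df1, data.frame(group2 = 1:3), by = NULL)

# Sort so the three group2 rows of each combination sit together
expanded <- expanded[order(expanded$Location, expanded$year,
                           expanded$group1, expanded$group2), ]
rownames(expanded) <- NULL
head(expanded)
```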

R how to group column values by set theory

I have dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I would like to create a classification column for items according to the set theory conditions:
VisitID contains A only, B only, C only, A&B, A&C, B&C, A&B&C, Others (Neither A,B,C exists)
The results should look like this:
VisitID | Item | Classification |
1 | A | A&B&C |
1 | B | A&B&C |
1 | C | A&B&C |
1 | D | A&B&C |
2 | A | A&B |
2 | D | A&B |
2 | B | A&B |
3 | B | B&C |
3 | C | B&C |
4 | D | C only |
4 | C | C only |
How can I do this in R, especially with dplyr?
You can left_join the data with a grouped, filtered and summarised copy of itself.
library(dplyr)
data %>% left_join(
group_by(data, VisitID) %>%
distinct(VisitID, Item) %>%
filter(Item %in% c("A","B","C")) %>%
summarise(set=paste0(Item, collapse="&")),
by="VisitID")
Output:
VisitID Item set
1 1 A A&B&C
2 1 B A&B&C
3 1 C A&B&C
4 1 D A&B&C
5 2 A A&B
6 2 D A&B
7 2 B A&B
8 3 B B&C
9 3 C B&C
10 4 D C
11 4 C C
12 5 D <NA>
13 5 E <NA>
Data:
dput(data)
structure(list(VisitID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L), Item = c("A", "B", "C", "D", "A", "D", "B",
"B", "C", "D", "C", "D", "E")), class = "data.frame", row.names = c(NA,
-13L))
We can write a custom function :
paste_values <- function(x) {
x1 <- x[x %in% c("A", "B", "C")]
if (n_distinct(x1) == 1)
#If want to keep in base R
#if (length(unique(x1)) == 1)
paste0(unique(x1), " only")
else
paste0(unique(x1), collapse = " & ")
}
and apply it for each group.
library(dplyr)
df %>% group_by(VisitID) %>% mutate(Item = paste_values(Item))
# VisitID Item
# <int> <chr>
# 1 1 A & B & C
# 2 1 A & B & C
# 3 1 A & B & C
# 4 1 A & B & C
# 5 2 A & B
# 6 2 A & B
# 7 2 A & B
# 8 3 B & C
# 9 3 B & C
#10 4 C only
#11 4 C only
We can also use the same function in base R :
df$Item <- with(df, ave(Item, VisitID, FUN = paste_values))
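Neither answer above produces every label from the question at once ("X only" for a single item, "Others" when none of A, B, C is present). A minimal sketch that covers all three cases, assuming the 13-row sample data from the first answer; classify_items is a hypothetical helper name, not part of dplyr:

```r
library(dplyr)

data <- data.frame(
  VisitID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L),
  Item = c("A", "B", "C", "D", "A", "D", "B", "B", "C", "D", "C", "D", "E"),
  stringsAsFactors = FALSE)

# Hypothetical helper: returns one label for all the items of a visit
classify_items <- function(x, targets = c("A", "B", "C")) {
  # intersect() keeps the order of 'targets' and drops duplicates
  found <- intersect(targets, x)
  if (length(found) == 0) {
    "Others"
  } else if (length(found) == 1) {
    paste(found, "only")
  } else {
    paste(found, collapse = "&")
  }
}

result <- data %>%
  group_by(VisitID) %>%
  mutate(Classification = classify_items(Item)) %>%
  ungroup()
result
```

Because the helper returns a single string, mutate() recycles it to every row of the visit.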

R : How to tag a subject if one of their columns has a certain value

This is what my data looks like:
+---------+--+----------+--+
| Subj_ID | | Location | |
+---------+--+----------+--+
| 1 | | 1 | |
| 1 | | 2 | |
| 1 | | 3 | |
| 2 | | 1 | |
| 2 | | 4 | |
| 2 | | 2 | |
| 3 | | 1 | |
| 3 | | 2 | |
| 3 | | 5 | |
+---------+--+----------+--+
In this dataset, only subject 1 has a location value of 3, so I want to label subject 1 as YES for intervention. Since subjects 2 and 3 didn't have a location value of 3, they need to be labeled as NO.
This is what I want the data to look like.
+---------+--+----------+--------------+
| Subj_ID | | Location | Intervention |
+---------+--+----------+--------------+
| 1 | | 1 | YES |
| 1 | | 2 | YES |
| 1 | | 3 | YES |
| 2 | | 1 | NO |
| 2 | | 4 | NO |
| 2 | | 2 | NO |
| 3 | | 1 | NO |
| 3 | | 2 | NO |
| 3 | | 5 | NO |
+---------+--+----------+--------------+
Thanks in advance for the help! Dplyr preferred if possible.
An option with dplyr: after grouping by 'Subj_ID', check whether 3 is %in% Location, which returns a single TRUE/FALSE, then convert that to a numeric index to pick "NO" or "YES".
library(dplyr)
df1 %>%
group_by(Subj_ID) %>%
mutate(Intervention = c("NO", "YES")[(3 %in% Location)+1])
# A tibble: 9 x 3
# Groups: Subj_ID [3]
# Subj_ID Location Intervention
# <int> <dbl> <chr>
#1 1 1 YES
#2 1 2 YES
#3 1 3 YES
#4 2 1 NO
#5 2 4 NO
#6 2 2 NO
#7 3 1 NO
#8 3 2 NO
#9 3 5 NO
Or use any
df1 %>%
group_by(Subj_ID) %>%
mutate(Intervention = case_when(any(Location == 3) ~ "YES", TRUE ~ "NO"))
Or using base R
df1$Intervention <- with(df1, c("NO", "YES")[1 + (Subj_ID %in%
Subj_ID[Location == 3])])
data
df1 <- data.frame(Subj_ID = rep(1:3, each = 3),
Location = c(1:3, 1, 4, 2, 1, 2, 5))
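The logical-to-index trick in the first and third snippets may be easier to follow in isolation. A small sketch (the names labels and has_three are only illustrative): adding 1 to a logical turns FALSE into index 1 and TRUE into index 2.

```r
labels <- c("NO", "YES")

# A single TRUE/FALSE: is the value 3 present in the vector?
has_three <- 3 %in% c(1, 2, 3)
no_three <- 3 %in% c(1, 4, 2)

# FALSE + 1 is 1 and TRUE + 1 is 2, so the logical selects a label
label_with <- labels[has_three + 1]
label_without <- labels[no_three + 1]
c(label_with, label_without)
```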
We can use match for each Subj_ID to check if 3 is present in any Location.
library(dplyr)
df %>%
group_by(Subj_ID) %>%
mutate(Intervention = c('Yes', 'No')[is.na(match(3,Location)) + 1])
#Can also use
#mutate(Intervention = c('No', 'Yes')[(match(3,Location, nomatch = 0L) > 0) + 1])
# Subj_ID Location Intervention
# <int> <dbl> <chr>
#1 1 1 Yes
#2 1 2 Yes
#3 1 3 Yes
#4 2 1 No
#5 2 4 No
#6 2 2 No
#7 3 1 No
#8 3 2 No
#9 3 5 No
data
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))

Quick way of matching data between two dataframes [R]

I have two dataframes: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
For df_groupIDs, I want to get the ID and Date of the event with the max sales in that group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max Sales value and bring its information into df_groupIDs. The final output should look like this:
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
Now here's the problem. I wrote code that does this, but it's very inefficient and takes forever to process when I deal with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here's what I currently have:
i = 1
for (groupID in df_groupIDs$GroupID) {
groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
df_groupIDs$ID[i] = groupEvents$ID[index]
df_groupIDs$Date[i] = groupEvents$Date[index]
i = i+1
}
Using dplyr:
library(dplyr)
df_workingFile %>%
group_by(GroupID) %>% # for each group id
arrange(desc(Sales)) %>% # sort by Sales (descending)
slice(1) %>% # keep the top row of each group
ungroup() %>%
inner_join(df_groupIDs, by = "GroupID") %>% # join to df_groupIDs
select(GroupID, numIDs, MaxSales, ID, Date) # keep the columns you want in the order you want
Another simpler method, if the Sales are integers (and can thus be relied on for equality testing with the MaxSales column):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
This uses a SQLite feature: when max is used in a select, the other selected columns are automatically taken from the row that the maximum came from.
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: The two input data frames shown reproducibly are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)
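A base R alternative can be sketched without any packages: sort so that the highest sale of each group comes first, keep the first row per group, then merge. This assumes the two data frames from the question, rebuilt here with data.frame(); the names ordered, top_rows and result are only illustrative.

```r
df_workingFile <- data.frame(ID = c("v", "w", "x", "y", "z"),
                             GroupID = c("a1", "a1", "b1", "b1", "c3"),
                             Sales = c(1, 3, 8, 3, 2),
                             Date = c(2011, 2010, 2007, 2006, 2006),
                             stringsAsFactors = FALSE)
df_groupIDs <- data.frame(GroupID = c("a1", "b1", "c3"),
                          numIDs = c(2, 2, 1),
                          MaxSales = c(3, 8, 2),
                          stringsAsFactors = FALSE)

# Sort by group, then by Sales descending, so the top sale leads each group
ordered <- df_workingFile[order(df_workingFile$GroupID,
                                -df_workingFile$Sales), ]

# duplicated() flags every row after the first one of each group
top_rows <- ordered[!duplicated(ordered$GroupID), c("GroupID", "ID", "Date")]

# Attach ID and Date of the top event to the group table
result <- merge(df_groupIDs, top_rows, by = "GroupID")
result
```

Like slice(1) in the dplyr answer, this keeps one row per group when there is a tie for the maximum.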

How to reshape a dataframe in R, when values and variables are in the column names? [duplicate]

EDIT:
Upon further examination, this dataset is way more insane than I previously believed.
Values have been encapsulated in the column names!
My dataframe looks like this:
| ID | Year1_A | Year1_B | Year2_A | Year2_B |
|----|---------|---------|---------|---------|
| 1 | a | b | 2a | 2b |
| 2 | c | d | 2c | 2d |
I am searching for a way to reformat it as such:
| ID | Year | _A | _B |
|----|------|-----|-----|
| 1 | 1 | a | b |
| 1 | 2 | 2a | 2b |
| 2 | 1 | c | d |
| 2 | 2 | 2c | 2d |
The answer below is great, and works perfectly, but the issue is that the dataframe needs more work -- somehow possibly be spread back out, so that each row has 3 columns.
My best idea was to do merge(df, df, by="ID") and then filter out the unwanted rows but this is quickly becoming unwieldy.
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
library(tidyr)
# your example data
df <- data.frame(ID = 1:2, Year1_A = c('a', 'c'), Year1_B = c('b','d' ), Year2_A = c('2a', '2c'), Year2_B = c('2b', '2d'))
# the solution
df <- gather(df, Year, value, -ID)
# cleaning up
df$Year <- gsub("Year", "", df$Year)
Result:
> df
ID Year value
1 1 1_A a
2 2 1_A c
3 1 1_B b
4 2 1_B d
5 1 2_A 2a
6 2 2_A 2c
7 1 2_B 2b
8 2 2_B 2d
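The gather() result still has the year and the A/B suffix fused in one column, which is the remaining work the question's EDIT mentions. One possible way to reach the requested layout in a single step is tidyr's pivot_longer() with the special ".value" name; this is a sketch that assumes tidyr 1.0.0 or later and that every column follows the Year<number>_<suffix> pattern.

```r
library(tidyr)

df <- data.frame(ID = 1:2,
                 Year1_A = c("a", "c"), Year1_B = c("b", "d"),
                 Year2_A = c("2a", "2c"), Year2_B = c("2b", "2d"),
                 stringsAsFactors = FALSE)

# The first capture group becomes the Year column; ".value" tells
# pivot_longer() to use the second group (A or B) as output column names
long <- pivot_longer(df,
                     cols = -ID,
                     names_to = c("Year", ".value"),
                     names_pattern = "Year(\\d+)_(.*)")
long <- as.data.frame(long)
long
```

The result has one row per ID and Year, with columns A and B; Year is a character column and can be converted with as.integer() if needed.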
