Join Tables in R or Python [duplicate] - r

I have two tables, Price_list and order_list. The Price_list table gives me all the prices that were active on each date, from all stores, by product_id, while order_list gives me the list of orders placed, i.e. who placed each order and from which store.
Price_list - date, product_id, store_id, selling_price
order_list - date, product_id, store_id, selling_price, order_id, email, product_order_id (unique key - a concatenation of product_id and order_id, as there can be more than one product in an order)
I want to combine the two tables in such a way that for each product_order_id I get a list of all prices that were available for the product. Basically, I want to see what prices were available and what the customer chose. The table below illustrates my query.
product_order_id  Date        product_id  store_id  selling_price  Placed
134323_3545       2016/03/11  134323      6433      2560.00        Yes
134323_3545       2016/03/11  134323      6343      2534.00        No
134323_3545       2016/03/11  134323      1243      2313.00        No
134323_3545       2016/03/11  134323      2424      2354.00        No
145565_9965       2016/03/11  145565      9887      5432.00        No
145565_9965       2016/03/11  145565      7645      5321.00        Yes
I have not been able to solve this in R. Although I prefer R for this, I am open to a solution in MySQL or Python. The steps to get this done are: (a) select a product_order_id; (b) for each product_id in that product_order_id, search price_list for all entries on that date; (c) append the result to a table and add a column specifying which product_order_id the list applies to; (d) repeat for the next product_order_id. Once this data frame is prepared, I can left join the order_list table on product_order_id to get the final data frame. I have not yet been able to work out how to do this in R.
After reading about loops and getting some help, I was able to create a loop that searches for all price entries for each product_id on a given day (product_date is a concatenation of date and product_id):
library(dplyr)

datalist <- list()
for (i in orderlist_test$product_date) {
  dat <- filter(pricelist, product_date == i)
  datalist[[i]] <- dat
}
big_data <- do.call("rbind", datalist)
However, I also want to add another column specifying the order_id or product_order_id for each iteration. If anyone could show me how to write the loop and add that column at the same time, that would help me a lot.
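For reference, a minimal sketch of how the loop could carry the order identifier along (it assumes orderlist_test also has a product_order_id column, which is not shown in the question):
library(dplyr)

datalist <- list()
# loop over rows so product_date and product_order_id are available together
for (r in seq_len(nrow(orderlist_test))) {
  datalist[[r]] <- pricelist %>%
    filter(product_date == orderlist_test$product_date[r]) %>%
    # tag every matching price row with the order it applies to
    mutate(product_order_id = orderlist_test$product_order_id[r])
}
big_data <- bind_rows(datalist)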

This will retain all the rows for every product_id
library(dplyr)
order_list_joined <- full_join(Price_list, order_list, by = "product_id")
Then, if there is no order_id for a given product_id, we assume no order was placed.
order_list_joined <- order_list_joined %>% mutate(Placed = ifelse(is.na(order_id), "No", "Yes"))
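As a complement, here is a rough sketch of a join that also takes the order date and the chosen store into account, which is closer to the table shown in the question. It assumes the column names listed in the question (date, product_id, store_id, selling_price, product_order_id):
library(dplyr)

result <- order_list %>%
  select(product_order_id, date, product_id, ordered_store = store_id) %>%
  # bring in every price row for the same product on the same date
  left_join(Price_list, by = c("date", "product_id")) %>%
  # flag the store the customer actually ordered from
  mutate(Placed = ifelse(store_id == ordered_store, "Yes", "No")) %>%
  select(product_order_id, date, product_id, store_id, selling_price, Placed)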

Related

How does R interpret joins? From left to right or right to left? [duplicate]

I have a disagreement with a colleague over the two answers below, so I need a third opinion.
Suppose you have 2 data frames: Salary and Employee.
Question: Which command would you use to join Employee and Salary by matching the rows from Salary to Employee?
Employee %>% left_join(Salary, by=c("F_NAME"="NAME"))
or
Employee %>% right_join(Salary, by=c("F_NAME"="NAME"))
Both of these commands will work, assuming that Employee$F_NAME and Salary$NAME contain matching items. The difference is in how rows that do not have matches are handled.
left_join will retain all rows in Employee. For rows that are in Employee but not Salary, any columns unique to Salary will be filled with NA.
right_join will retain all rows in Salary. For rows that are in Salary but not Employee, any columns unique to Employee will be filled with NA.
inner_join will retain only rows that are matched in both Salary and Employee. All others are dropped.
full_join will retain all rows from both data frames. Any rows that are not matched will have their missing left- or right-side columns filled with NA.
See also: some very nice illustrations about join types.
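A tiny worked example with made-up Employee and Salary frames (the names and values here are purely illustrative) makes the difference concrete:
library(dplyr)

Employee <- data.frame(F_NAME = c("Ann", "Bob", "Cid"), DEPT = c("HR", "IT", "IT"))
Salary   <- data.frame(NAME = c("Ann", "Bob", "Dee"), SALARY = c(50, 60, 55))

Employee %>% left_join(Salary, by = c("F_NAME" = "NAME"))
# keeps Ann, Bob and Cid; Cid has no match, so SALARY is NA

Employee %>% right_join(Salary, by = c("F_NAME" = "NAME"))
# keeps Ann, Bob and Dee; Dee has no match, so DEPT is NA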
This is specific to dplyr, as opposed to base R's merge. When you use
Employee %>% left_join(Salary, by=c("F_NAME"="NAME"))
you keep every row of Employee together with all columns from both Employee and Salary; missing values are filled with NA. Similarly,
Employee %>% right_join(Salary, by=c("F_NAME"="NAME"))
will yield all rows in Salary with all columns from both data frames.
I think your question may be more related to a full_join, but here is a good place to get familiar with the methods.

Redshift join with metadata table and select columns

I have created a subset of the pg_table_def table with table_name, col_name and data_type. I have also added a column active, with 'Y' as the value for some of the rows. Let us call this table config. The config table looks like this:
table_name           column_name
interaction_summary  name_id
tag_transaction      name_id
interaction_summary  direct_preference
bulk_sent            email_image_click
crm_dm               web_le_click
Now I want to map the table names from this table to the actual tables and fetch values for the corresponding columns. name_id will be the key here, and it is available in all tables. My output should look like this:
name_id  direct_preference  email_image_click  web_le_click
1        Y                  1                  2
2        N                  1                  2
The solution needs to be dynamic, so that if the table list grows tomorrow the new tables are accommodated. Since I am new to Redshift, any help is appreciated. I am also considering doing the same via R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull any new table that comes in and use their columns for regression analysis in R.
I got this working by using the listagg function and string concatenation, and then wrote the output to a data frame in R. This data frame has n select queries as different rows.
Below is the format:
df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
    select distinct table_name,
           listagg(col_name, ',') within group (order by col_name)
             over (partition by table_name) as col_names
    from attribute_config
    where active = 'Y'
    order by table_name
  )
  group by 1
")))
Once done, I assigned each row of this data frame to a new data frame and fetched the output like this:
df1 <- tbl(conn, sql(df[1, ]))
I know this is a roundabout solution, but it works! It fetches about 17M records in under a second.
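To run every generated query instead of fetching them one row at a time, something along these lines could work (a sketch; it assumes the queries in df$q1 are valid and that every result includes name_id so the pieces can be joined back together):
library(dplyr)

# pull each per-table result from Redshift into a local data frame
results <- lapply(df$q1, function(q) collect(tbl(conn, sql(q))))

# stitch the results together on name_id, which is present in every table
combined <- Reduce(function(x, y) full_join(x, y, by = "name_id"), results)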

How do I intersect two data.frames in R? [duplicate]

I have two tables that are in the data.frame structure. Table 1 contains a column of 200 gene IDs (letters and numbers) and Table 2 contains a list of 4,000 gene IDs (in rows) as well as 20 additional columns. I want to intersect these two tables and generate a new Table 3 that contains the 200 gene IDs as well as the associated information in the 20 columns.
table3 <- table1%n%table2
You want something like
table3 <- merge(table1, table2, by.x="id", by.y="id", all.x=T, all.y=F)
You might also be able to do subsetting with something like this:
table3 <- table2[table2$id %in% table1$id,]
A reprex would have made this post more likely to get a good response, but you should have been able to find something helpful with a little searching. If these don't work because you have a unique problem no one has asked before, give us a reprex and we can try to offer alternative solutions.
edit: for a little more context, here's a similar question I replied to last week and here's a great post on understanding merges.
I recommend the dplyr package. It works more intuitively than merge in my opinion.
You can just type:
table3 <- left_join(table1, table2, by = "unique_id")
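For the gene tables described above, that might look like this (gene_id is a placeholder for whatever the shared ID column is actually called in both frames):
library(dplyr)

# table1: 200 gene IDs; table2: 4,000 gene IDs plus 20 annotation columns
table3 <- left_join(table1, table2, by = "gene_id")

# semi_join is the dplyr analogue of the %in% subsetting shown above:
# it keeps only the rows of table2 whose gene_id appears in table1
table3_alt <- semi_join(table2, table1, by = "gene_id")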

How to pull the required columns from the csv file? [duplicate]

I have grocery sales data with 11 columns, such as store name, item name, price, etc. For my analysis I do not need all of the columns; I only need a few of them to generate a report.
What is the R code for this?
Example: below are the column names of the sales data. I need only 6 of these columns. I tried some code I found, but it threw an error, and I didn't understand the answers I came across.
STORE_NAME STORE_ID DEVICE_SERIAL_NUMBER BILL_NUMBER BARCODE ITEM_NAME VARIANT_NAME BASEPACK CATEGORY BRAND MANUFACTURER QUANTITY_SOLD PRICE PURCHASE_PRICE SELLING_PRICE SALES_VAT USER_NAME COUNTER CUSTOMER_NAME CUSTOMER_PHONE BILL_DATE CREATED_DATE
Read all the data with read.table or read.csv and then extract only the columns you need. That's what square brackets are for in R. You can select either by column number or by column name:
lots.of.cols <- data.frame(a=1:20, b=2:21, c=3:22, d=runif(20), e=runif(20))
only.first.two.cols <- lots.of.cols[, c(1, 2)]  # extract only columns 1 and 2
str(only.first.two.cols)
only.a.and.b <- lots.of.cols[,c("a", "b")]
str(only.a.and.b)
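Applied to the sales file from the question, that could look like the sketch below; the file name and the six chosen columns are placeholders, so swap in the ones you actually need:
sales <- read.csv("sales_data.csv", stringsAsFactors = FALSE)

# keep only the columns needed for the report, selected by name
report_cols <- c("STORE_NAME", "ITEM_NAME", "QUANTITY_SOLD",
                 "SELLING_PRICE", "CUSTOMER_NAME", "BILL_DATE")
report <- sales[, report_cols]
str(report)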

Writing data from one dataframe to a new column of another in R [duplicate]

I have two data frames in R: data, a frame with monthly sales per department in a store, and averages, a frame with the average sales over all months per department.
What I'd like to do is add a column to data containing the average sales (column 3 of averages) for each department. So whereas now I have an avg column with all zeroes, I'd like it to contain the overall average sales for whatever department is listed in that row. This is the code I have now:
for (j in 1:nrow(averages)) {
  for (i in 1:nrow(data)) {
    if (identical(data[i, 4], averages[j, 1])) {
      data[i, 10] <- averages[j, 3]
    }
  }
}
After running the loop, the avg column in data is still all zeroes, which makes me think that if(identical(data[i,4], averages[j,1])) is always evaluating to FALSE... But why would this be? How can I troubleshoot this issue / is there a better way to do this?
Are you looking for the merge function?
merge(x = data, y = avgs, by = "departmentName", all.x=TRUE)
I would use dplyr by doing:
dplyr::full_join(data, averages, by = "departmentName")
The great thing about dplyr (besides being fast) is that it has a very simple syntax. Moreover, if your two tables have key variables with different names, that can also be specified. Imagine you have data_departmentName in the table data and averages_departmentName in the table averages:
dplyr::full_join(data, averages, by = c("data_departmentName" = "averages_departmentName"))
Then I would select just the columns you want from the second data frame. If you know your data is ordered and has the same length, you could also add the column directly:
data$avgs <- averages$avgs
But I'd rather join first and then select.
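If the rows are not guaranteed to line up, a base-R match() lookup avoids that assumption (a sketch; departmentName and avgSales are assumed column names):
# for each row of data, look up its department in averages and copy the average over
data$avg <- averages$avgSales[match(data$departmentName, averages$departmentName)]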

Resources