How do I merge 2 dataframes without a corresponding column to match by? - r

I'm trying to use the Merge() function in RStudio. Basically I have two tables with 5000+ rows. They both have the same amount of rows. Although there is no corresponding Columns to merge by. However the rows are in order and correspond. E.g. The first row of dataframe1 should merge with first row dataframe2...2nd row dataframe1 should merge with 2nd row dataframe2 and so on...
Here's an example of what they could look like:
Dataframe1(df1):
+-------------------------------------+
| Name | Sales | Location |
+-------------------------------------+
| Rod | 123 | USA |
| Kelly | 142 | CAN |
| Sam | 183 | USA |
| Joyce | 99 | NED |
+-------------------------------------+
Dataframe2(df2):
+---------------------+
| Sex | Age |
+---------------------+
| M | 23 |
| M | 33 |
| M | 31 |
| F | 45 |
+---------------------+
NOTE: this is a downsized example only.
I've tried to use the merge function in RStudio, here's what I've done:
DFMerged <- merge(df1, df2)
This however increases both the rows and columns. It returns 16 rows and 5 columns for this example.
What am I missing from this function, I know there is a merge(x,y, by=) argument but I'm unable to use a column to match them.
The output I would like is:
+----------------------------------------------------------+
| Name | Sales | Location | Sex | Age |
+----------------------------------------------------------+
| Rod | 123 | USA | M | 23 |
| Kelly | 142 | CAN | M | 33 |
| Sam | 183 | USA | M | 31 |
| Joyce | 99 | NED | F | 45 |
+-------------------------------------+--------------------+
I've considering making extra columns in each dataframes, says row# and match them by that.

You could use cbind:
cbind(df1, df2)
If you want to use merge you could use:
merge(df1, df2, by=0)

You could use:
cbind(df1,df2)
This will necessarily work with same number of rows in two data frames

Related

Is there a way in R to create a column based on order of multiple values in one another column in dataframe? [duplicate]

This question already has answers here:
Aggregating all unique values of each column of data frame
(2 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 1 year ago.
I would like to create a column in my R data frame based on the order in which multiple values occur in one column.
For example, my data frame has an id column and an item type column, and the values of the order column is what I would like to add. Is there a way to tell R to look at the order of values in the item column so that it can spit out "ABCD" or "ADCB" (any other order) as the cell value under the 3rd column?
| id | item | order |
| 11 | A | ABCD |
| 11 | A | ABCD |
| 11 | B | ABCD |
| 11 | B | ABCD |
| 11 | C | ABCD |
| 11 | C | ABCD |
| 11 | D | ABCD |
| 11 | D | ABCD |
| 12 | A | ADCB |
| 12 | A | ADCB |
| 12 | D | ADCB |
| 12 | D | ADCB |
| 12 | C | ADCB |
| 12 | C | ADCB |
| 12 | B | ADCB |
| 12 | B | ADCB |
...

R - Join two dataframes based on date difference

Let's consider two dataframes df1 and df2. I would like to join dataframes based on the date difference only. For Example;
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
-------------------------------------------
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2: (df2)
| validfrom | pricelist| pricelist_id |
------------------------------------------
|02-07-2020 | 10 | 101 |
|01-07-2020 | 20 | 102 |
|29-06-2020 | 30 | 103 |
|28-07-2020 | 10 | 104 |
|25-07-2020 | 5 | 105 |
I need to map the pricelist_id and the pricelist based on the the validfrom column present in df2. Say that, based on the least difference between the date_invoiced (df1) and validfrom (df2), the row should be mapped.
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
I need to map purely based on the difference and the difference should be the least. Always, the date_invoiced (df1), should have closest difference comparing to validfrom (df2). Thanks
Perhaps you might want to try using date.table and nearest roll. Here, the join is made on DATE which would be DATEINVOICED from df1 and VALIDFROM in df2.
library(data.table)
setDT(df1)
setDT(df2)
df1$DATEINVOICED <- as.Date(df1$DATEINVOICED, format = "%d-%m-%y")
df2$VALIDFROM <- as.Date(df2$VALIDFROM, format = "%d-%m-%y")
setkey(df1, DATEINVOICED)[, DATE := DATEINVOICED]
setkey(df2, VALIDFROM)[, DATE := VALIDFROM]
df2[df1, on = "DATE", roll='nearest']

How to reshape the data in stacked format by merging two dataframes in r?

Is there any function in R which can reshape the data in stacked format by merging two dataframes?
Table1
Vegetables|Onion|Potato|Tomato|
Cheap | 20 | 30 | 40 |
Table2
Vegetables|Cabbage|Carrot|Cauliflower|Eggplant
Mid | 20 | 30 | 40 | 30
Required Output
Vegetables | Type |Quantity|
Cheap | Onion | 20 |
Cheap | Potato | 30 |
Cheap | Tomato | 40 |
Mid | Cabbage | 20 |
Mid | Carrot | 30 |
Mid |Cauliflower| 40 |
Mid | Eggplant | 30 |
Thanks in advance for your help! you are awesome.
The name of the parameter you're trying to use is id.vars instead of id. Try melt(Table1,id.vars="Vegetable")
library(reshape2)
df1 <- data.frame(Vegetables="Cheap", Onion=20,Potato=30,Tomato=40)
df2 <- data.frame(Vegetables="Mid", Cabbage=20, Carrot=30,Cauliflower=40, Eggplant=30)
do.call("rbind", lapply(list(df1,df2),function(d){melt(d,id.vars="Vegetables", variable.name="Type", value.name="Quantity")}))
You can then insert all your tables into a list and simply use lapply. (The do.call then combines melted datasets into one)

Copy column data when function unaggregates a single row into multiple in R

I need help in taking an annual total (for each of many initiatives) and breaking that down to each month using a simple division formula. I need to do this for each distinct combination of a few columns while copying down the columns that are broken from annual to each monthly total. The loop will apply the formula to two columns and loop through each distinct group in a vector. I tried to explain in an example below as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I'm going to loop a function for the first row to take the "Total Savings" and "Total Costs" and divide by 12 where Date = 2015 and 9 where Date = 2016 (YTD to Sept) and create an individual row for each. I'm essentially breaking out an annual total in a row and creating a row for each month of the year. I need help in running that loop to copy also columns "Init", "Name", until "Init", "Name" combination are not distinct. Also, note the formula for the division based on the year will be different as well. I suppose I could separate the datasets for 2015 and 2016 and use two different functions and merge if that would be easier. Below should be the output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | TotalE |
| A | Mike | 02-01-2016 | TotalE | TotalE |
| A | Mike | 03-01-2016 | TotalE | TotalE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthD |
| B | John | 02-01-2015 | MonthG | MonthD |
| B | John | 03-01-2015 | MonthG | MonthD |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
data.frame(
Init = rep(LETTERS[1:3], each = 4)
, Name = rep(c("John", "Mike"), each = 2)
, Date = 2015:2016
, Savings = (1:12)*1200
, Cost = (1:12)*2400
)
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then, I am using the magrittr pipe as I split the data up into single rows, then looping through them with lapply to expand each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
myData %>%
split(1:nrow(.)) %>%
lapply(function(x){
temp <- data.frame(
Init = x$Init
, Name = x$Name
, Date = as.Date(paste(x$Date
, formatC(1:toDivide[as.character(x$Date)]
, width = 2, flag = "0")
, "01"
, sep = "-"))
, Savings = x$Savings / toDivide[as.character(x$Date)]
, Cost = x$Cost / toDivide[as.character(x$Date)]
)
}) %>%
bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.

Hmisc Table Creation

Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the package gmodels gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like the Hmisc table is very good at creating various summaries and exporting to LaTex (ultimately what I need to do).
My questions are:1)can you create the table below easily in the Hmsic page? 2) if so, can I interact variables (2 in the the column)? and finally 3) can I access p-values of significance tests (chi square).
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should provide you with an output just like the one you displayed on your question, very SPSS-y. :)
If you are coming from SPSS, you may be interested in the package Deducer ( http://www.deducer.org ). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can get the counts and test to latex or html using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)

Resources