How to repeat the rows in a table - r

I have two tables.
Table1:
+-----+--------+-------------+
| Key | region | region_name |
+-----+--------+-------------+
| ABC | NT | NORTH |
| ABC | ST | SOUTH |
| XYZ | NT | NORTH |
| XYZ | ST | SOUTH |
| DEF | ST | SOUTH |
+-----+--------+-------------+
Table2:
+-----+-------+------+--------+
| KEY | Sales | cost | profit |
+-----+-------+------+--------+
| ABC | 130 | 100 | 30 |
| XYZ | 120 | 95 | 25 |
| DEF | 110 | 90 | 20 |
+-----+-------+------+--------+
I want the final output to look like this:
+-----+-------+------+--------+--------+-------------+
| KEY | Sales | cost | profit | region | region_name |
+-----+-------+------+--------+--------+-------------+
| ABC | 130 | 100 | 30 | NT | NORTH |
| ABC | 130 | 100 | 30 | ST | SOUTH |
| XYZ | 120 | 95 | 25 | NT | NORTH |
| XYZ | 120 | 95 | 25 | ST | SOUTH |
| DEF | 110 | 90 | 20 | ST | SOUTH |
+-----+-------+------+--------+--------+-------------+
Thanks in advance..!

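We can use left_join (here df1 is Table1 and df2 is Table2). Note that the key column is named Key in Table1 and KEY in Table2, so the join has to map one name onto the other:
library(dplyr)
left_join(df1, df2, by = c('Key' = 'KEY'))
Or with merge in base R:
merge(df1, df2, by.x = 'Key', by.y = 'KEY', all.x = TRUE)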
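For reference, here is a minimal base R sketch that rebuilds the two tables from the question and joins them. The object names df1, df2 and result are illustrative. Putting Table2 first in the merge call is only a choice made here, so that the columns come out in the order of the desired output. Note that merge sorts the rows by key, so the row order can differ from the order shown in the question.

```r
# Table1 from the question: one row per key/region pair
df1 <- data.frame(
  Key         = c("ABC", "ABC", "XYZ", "XYZ", "DEF"),
  region      = c("NT", "ST", "NT", "ST", "ST"),
  region_name = c("NORTH", "SOUTH", "NORTH", "SOUTH", "SOUTH"),
  stringsAsFactors = FALSE
)

# Table2 from the question: one row per key
df2 <- data.frame(
  KEY    = c("ABC", "XYZ", "DEF"),
  Sales  = c(130, 120, 110),
  cost   = c(100, 95, 90),
  profit = c(30, 25, 20),
  stringsAsFactors = FALSE
)

# The key columns are spelled differently, so name them explicitly.
# Each row of df2 is repeated once per matching row in df1.
result <- merge(df2, df1, by.x = "KEY", by.y = "Key")
print(result)
```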

change column type and convert the existing values from string to integer in mariadb

I have a table named employees:
MariaDB [company]> describe employees;
+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| employee_id | char(10) | NO | | NULL | |
| first_name | varchar(20) | NO | | NULL | |
| last_name | varchar(20) | NO | | NULL | |
| email | varchar(60) | NO | | NULL | |
| phone_number | char(14) | NO | | NULL | |
| hire_date | date | NO | | NULL | |
| job_id | int(11) | NO | | NULL | |
| salary | varchar(30) | NO | | NULL | |
| commission_pct | char(10) | NO | | NULL | |
| manager_id | char(10) | NO | | NULL | |
| department_id | char(10) | NO | | NULL | |
+----------------+-------------+------+-----+---------+-------+
MariaDB [company]> select * from employees;
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| employee_id | first_name | last_name | email | phone_number | hire_date | job_id | salary | commission_pct | manager_id | department_id |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| 100 | Steven | King | sking#gmail.com | 515.123.4567 | 2003-06-17 | 1 | 24000.00 | 0.00 | 0 | 90 |
| 101 | Neena | Kochhar | nkochhar#gmail.com | 515.123.4568 | 2005-09-21 | 2 | 17000.00 | 0.00 | 100 | 90 |
| 102 | Lex | Wow | Lwow#gmail.com | 515.123.4569 | 2001-01-13 | 2 | 17000.00 | 0.00 | 100 | 9 |
| 103 | Alexander | Hunold | ahunold#gmail.com | 590.423.4567 | 2006-01-03 | 3 | 9000.00 | 0.00 | 102 | 60 |
| 104 | Bruce | Ernst | bernst#gmail.com | 590.423.4568 | 2007-05-21 | 3 | 6000.00 | 0.00 | 103 | 60 |
| 105 | David | Austin | daustin#gmail.com | 590.423.4569 | 2005-06-25 | 3 | 4800.00 | 0.00 | 103 | 60 |
| 106 | Valli | Pataballa | vpatabal#gmail.com | 590.423.4560 | 2006-02-05 | 3 | 4800.00 | 0.00 | 103 | 60 |
| 107 | Diana | Lorentz | dlorentz#gmail.com | 590.423.5567 | 2007-02-07 | 3 | 4200.00 | 0.00 | 103 | 60 |
| 108 | Nancy | Greenberg | ngreenbe#gmail.com | 515.124.4569 | 2002-08-17 | 4 | 12008.00 | 0.00 | 101 | 100 |
| 109 | Daniel | Faviet | dfaviet#gmail.com | 515.124.4169 | 2002-08-16 | 5 | 9000.00 | 0.00 | 108 | 100 |
| 110 | John | Chen | jchen#gmail.com | 515.124.4269 | 2005-09-28 | 5 | 8200.00 | 0.00 | 108 | 100 |
| 111 | Ismael | Sciarra | isciarra#gmail.com | 515.124.4369 | 2005-09-30 | 5 | 7700.00 | 0.00 | 108 | 100 |
| 112 | Jose | Urman | jurman#gmail.com | 515.124.4469 | 2006-03-07 | 5 | 7800.00 | 0.00 | 108 | 100 |
| 113 | Luis | Popp | lpopp#gmail.com | 515.124.4567 | 2007-12-07 | 5 | 6900.00 | 0.00 | 108 | 100 |
| 114 | Den | Raphaely | drapheal#gmail.com | 515.127.4561 | 2002-12-07 | 6 | 11000.00 | 0.00 | 100 | 30 |
| 115 | Alexander | Khoo | akhoo#gmail.com | 515.127.4562 | 2003-05-18 | 7 | 3100.00 | 0.00 | 114 | 30 |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
I wanted to change the salary column from string to integer. So, I ran this command
MariaDB [company]> alter table employees modify column salary int;
ERROR 1292 (22007): Truncated incorrect INTEGER value: '24000.00'
As you can see, it gave me a truncation error. I found some previous questions that showed how to use convert() and trim(), but those didn't actually answer my question.
sql code and data can be found here https://0x0.st/oYoB.com_5zfu
I tested this on MySQL and it worked fine. So it is apparently an issue only with MariaDB.
The problem is that a string like '24000.00' is not an integer. Integers don't have a decimal place. So in strict mode, the implicit type conversion fails.
I was able to work around this by running this update first:
update employees set salary = round(salary);
The column is still a string, but '24000.00' has been changed to '24000' (with no decimal point character or following digits).
Then you can alter the data type, and implicit type conversion to integer works:
alter table employees modify column salary int;
See demonstration using MariaDB 10.6:
https://dbfiddle.uk/V6LrEMKt
P.S.: You misspelled the column name "commission_pct" as "comission_pct" in your sample DDL, and I had to edit that to test. In the future, please use one of the db fiddle sites to share samples, because they will test your code.

How to join dataframes by ID column? [duplicate]

I have 3 dataframes like the ones below, with IDs that may not necessarily occur in all of them:
DF1:
ID | Name | Phone# | Country | State | Amount_month1
0210 | John K. | 8942829725 | USA | PA | 1300
0215 | Peter | 8711234566 | USA | KS | 50
2312 | Steven | 9012341221 | USA | TX | 1000
0005 | Haris | 9167456363 | USA | NY | 1200
DF2:
ID | Name | Phone# | Country | State | Amount_month2
0210 | John K. | 8942829725 | USA | PA | 200
2312 | Steven | 9012341221 | USA | TX | 350
1212 | Jerry | 9817273794 | USA | CA | 100
DF3:
ID | Name | Phone# | Country | State | Amount_month3
0210 | John K. | 8942829725 | USA | PA | 300
0005 | Haris | 9167456363 | USA | NY | 1250
1212 | Jerry | 9817273794 | USA | CA | 1200
1210 | Drew | 8012341234 | USA | TX | 1400
I would like to join these 3 dataframes by ID and keep the differing amount columns as separate columns. The missing amount values can be either 0 or NA, such as:
ID | Name | Phone# | Country |State| Amount_month1 | Amount_month2 | Amount_month3
0210 | John K. | 8942829725 | USA | PA | 1300 | 200 | 300
0215 | Peter | 8711234566 | USA | KS | 50 | 0 | 0
2312 | Steven | 9012341221 | USA | TX | 1000 | 350 | 0
0005 | Haris | 9167456363 | USA | NY | 1200 | 0 | 1250
1212 | Jerry | 9817273794 | USA | CA | 0 | 100 | 1200
1210 | Drew | 8012341234 | USA | TX | 0 | 0 | 1400
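It can be done in a single line using Reduce and merge. With all = TRUE this is a full outer join, and by default merge joins on every column the data frames share (here ID, Name, Phone#, Country and State):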
Reduce(function(x, y) merge(x, y, all=TRUE), list(DF1, DF2, DF3))
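As a small illustration of the Reduce and merge approach, here is a sketch on made-up data frames (d1, d2, d3 and their values are illustrative, not the asker's data). It also shows one way to turn the resulting NA amounts into 0, which the question allows as an alternative.

```r
# Three small data frames that share only the ID column
d1 <- data.frame(ID = c("0210", "0215"), Amount_month1 = c(1300, 50),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c("0210", "2312"), Amount_month2 = c(200, 350),
                 stringsAsFactors = FALSE)
d3 <- data.frame(ID = c("0215", "1210"), Amount_month3 = c(300, 1400),
                 stringsAsFactors = FALSE)

# Full outer join of all three: every ID that occurs anywhere is kept
merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE),
                 list(d1, d2, d3))

# Replace NA by 0 in the amount columns only
amount_cols <- grep("^Amount_month", names(merged), value = TRUE)
merged[amount_cols] <- lapply(merged[amount_cols], function(col) {
  col[is.na(col)] <- 0
  col
})
print(merged)
```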
You can use full_join from the package dplyr, first joining the first two data frames, then joining that result with the third. (A left_join would drop IDs that do not occur in df1, which the question wants to keep.)
library(dplyr)
df_12 <- full_join(df1, df2, by = "ID")
df_123 <- full_join(df_12, df3, by = "ID")
Result:
df_123
ID Amount_month1 Amount_month2 Amount_month3
1 1 100 NA NA
2 2 200 50 NA
3 3 300 177 666
4 4 400 NA 77
Mock data:
df1 <- data.frame(
ID = as.character(1:4),
Amount_month1 = c(100,200,300,400)
)
df2 <- data.frame(
ID = as.character(2:3),
Amount_month2 = c(50,177)
)
df3 <- data.frame(
ID = as.character(3:4),
Amount_month3 = c(666,77)
)

QBO3 Matrix Filter does not sort as expected

I have two QBO matrices:
Matrix A has 1 input and 1 output,
Matrix B has 3 inputs and 1 output.
When I navigate to Matrix A, I see something like this:
| Client | Price |
| ------ | ------ |
| | 120.00 |
| A,B,C | 100.00 |
| D,E | 90.00 |
| F | 95.00 |
When I enter E into the Client filter, the Matrix sorts as I would expect, with the first row 'valid' and the others crossed out:
| Client | Price |
| ------ | ------ |
| D,E | 90.00 |
| | 120.00 |
| A,B,C | 100.00 |
| F | 95.00 |
Matrix B looks like this:
| Client | State | Investor | Price |
| --------- | ----- | -------- | ------ |
| | CA | NOT(2,3) | 100.00 |
| San Diego | CA | | 110.00 |
| | FL | | 95.00 |
| Miami | FL | 3 | 105.00 |
When I enter 'Miami' into the Client column filter, the rows are sorted like this, with all rows crossed out:
| Client | State | Investor | Price |
| --------- | ----- | -------- | ------ |
| | CA | NOT(2,3) | 100.00 |
| | FL | | 95.00 |
| San Diego | CA | | 110.00 |
| Miami | FL | 3 | 105.00 |
Why don't I see the row with Miami at the top?
In your second example, the row with Miami also requires State = 'FL' and Investor = 3, so that row is not a match. To get the Miami row to match, you must enter Client = 'Miami', State = 'FL' and Investor = 3.
When the Matrix is evaluated, it calculates the following:
for each row of the matrix
    for each input of the matrix
        if the row has a value that equals your input, that's a 'match'
        if the row has a value and you provided no input, that's a 'mismatch'
        if the row has a NOT(value) and you provided no input, that's a 'match'
sort by
    weight descending,
    then by number of mismatches,
    then by the number of matches
For the Matrix B output:
| Client | State | Investor | Price |
| --------- | ----- | -------- | ------ |
| | CA | NOT(2,3) | 100.00 | Mismatch = 1, Match=1
| | FL | | 95.00 | Mismatch = 1, Match=0
| San Diego | CA | | 110.00 | Mismatch = 2, Match=1
| Miami | FL | 3 | 105.00 | Mismatch = 2, Match=1
In short:
The 'Miami' line requires 3 inputs to match, and 2 did not, so it's at the bottom.
The first line (CA) actually matches your 'empty' Investor input, because it only requires that the Investor is NOT 2 or 3, and providing no investor satisfies that requirement!

change column value only if two other columns are duplicates

I am having a hard time figuring this out in R.
This is what I would like to do.
In a data frame like the one below, if both Name and Class are duplicated, I would like to add the scores of those rows together; if not, leave the row as it is.
+------------------+-----------+-------+
| Name | Class | Score |
+------------------+-----------+-------+
| Sara | Sophomore | 10 |
| John | Freshman | 20 |
| Taylor | Sophomore | 30 |
| Tyler | Junior | 10 |
| Keith | Junior | 20 |
| Andrew | Senior | 30 |
| Victor | Senior | 10 |
| Nancy | Sophomore | 20 |
| Taylor | Junior | 30 |
| John | Senior | 10 |
| Victor | Freshman | 20 |
| Sara | Sophomore | 30 |
| John | Freshman | 10 |
| Taylor | Sophomore | 20 |
| John | Senior | 30 |
+------------------+-----------+-------+
So basically, the end result should look like:
+--------+-----------+-------+
| Name | Class | Score |
+--------+-----------+-------+
| Sara | Sophomore | 40 |
| John | Freshman | 30 |
| Taylor | Sophomore | 50 |
| Tyler | Junior | 10 |
| Keith | Junior | 20 |
| Andrew | Senior | 30 |
| Victor | Senior | 10 |
| Nancy | Sophomore | 20 |
| Taylor | Junior | 30 |
| John | Senior | 40 |
| Victor | Freshman | 20 |
+--------+-----------+-------+
As you can see, if Name is the only duplicated value, the rows are not combined (for example, John Freshman and John Senior stay separate). If Class is the only duplicated value, they are not combined either. Both columns have to be duplicated for the scores to be added.
My attempt is below, but it does not work and I get this error message:
'Error in if ((experiment[i, 1] == experiment[j, 1]) & (experiment[i, 2] == : missing value where TRUE/FALSE needed'
My code:
# creating an empty data frame
experiment1 <- data.frame(matrix(ncol = 3, nrow = 15))
for (i in 1:nrow(experiment)) {
  for (j in i+1: nrow(experiment)) {
    if ((experiment[i,1] == experiment[j,1]) & (experiment[i,2] == experiment[j,2])) {
      experiment1[i,1] <- experiment[i,1]
      experiment1[i,2] <- experiment[i,2]
      experiment1[i,3] <- experiment[i,3] + experiment[j,3]
    } else {
      experiment1[i,1] <- experiment[i,1]
      experiment1[i,2] <- experiment[i,2]
      experiment1[i,3] <- experiment[i,3]
    }
  }
}
Could anyone help fix my code or figure out "nobler" code?
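Aggregation is one of the first things explained in any basic R tutorial; I suggest you go and follow some. As for the error: i+1: nrow(experiment) is parsed as i + (1:nrow(experiment)), so j runs past the last row, the comparison returns NA, and if() fails.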
base R
aggregate(Score ~ Name + Class, data = mydf, FUN = sum)
dplyr
library(dplyr)
mydf %>% group_by(Name, Class) %>% summarize(scoreSum = sum(Score))
data.table
library(data.table)
setDT(mydf)[ , .(scoreSum = sum(Score)), by = .(Name, Class)]
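To see the base R variant run end to end, here is a short sketch on a subset of the question's rows (mydf is rebuilt here for illustration; totals is an illustrative name). Rows are summed only when both Name and Class agree. Note that aggregate returns the groups sorted by the grouping columns, not in the original row order.

```r
# A subset of the question's data: Sara/Sophomore, John/Freshman and
# John/Senior each appear twice, Taylor/Junior appears once
mydf <- data.frame(
  Name  = c("Sara", "John", "Taylor", "John", "Sara", "John", "John"),
  Class = c("Sophomore", "Freshman", "Junior", "Senior",
            "Sophomore", "Freshman", "Senior"),
  Score = c(10, 20, 30, 10, 30, 10, 30),
  stringsAsFactors = FALSE
)

# Sum Score within each distinct (Name, Class) combination
totals <- aggregate(Score ~ Name + Class, data = mydf, FUN = sum)
print(totals)
```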

How to make multiple corpora in R

This is car review data with more than 40,000 rows, and each review has more than 500 characters. This is sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge the review column by brand, like this:
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So, I tried to use aggregate().
temp <- aggregate(data$review ~ data$brand , data, as.list )
But it takes a very long time.
Is there any simple way to merge that?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
# one row per brand; use = (not <-) so the columns get the intended names
newDf <- data.frame(brand = names(brand.split),
                    text = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.
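The same split-and-paste idea can be tried on a few made-up rows without the downloaded file. In this sketch the data frame reviews, its column names and its values are all illustrative assumptions, not the asker's data.

```r
# Made-up stand-in for the review data: several reviews per brand
reviews <- data.frame(
  brand  = c("brand1", "brand2", "brand2", "brand3"),
  review = c("great car", "smooth ride", "noisy cabin", "good value"),
  stringsAsFactors = FALSE
)

# Split the review texts by brand, then paste each group into one string
review.split   <- split(reviews$review, reviews$brand)
review.grouped <- sapply(review.split, paste, collapse = " ")

# One row per brand, with all of its reviews combined into a single document
corpus <- data.frame(brand  = names(review.grouped),
                     review = unname(review.grouped),
                     stringsAsFactors = FALSE)
print(corpus)
```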
