change column value only if two other columns are duplicates - r

I am having a hard time figuring this out in R.
This is what I would like to do.
In a data frame like the one below, when two rows have the same Name and the same Class, I would like to add their scores together into one row; otherwise the row should stay as it is.
+------------------+-----------+-------+
| Name | Class | Score |
+------------------+-----------+-------+
| Sara | Sophomore | 10 |
| John | Freshman | 20 |
| Taylor | Sophomore | 30 |
| Tyler | Junior | 10 |
| Keith | Junior | 20 |
| Andrew | Senior | 30 |
| Victor | Senior | 10 |
| Nancy | Sophomore | 20 |
| Taylor | Junior | 30 |
| John | Senior | 10 |
| Victor | Freshman | 20 |
| Sara | Sophomore | 30 |
| John | Freshman | 10 |
| Taylor | Sophomore | 20 |
| John | Senior | 30 |
+------------------+-----------+-------+
So basically, the end result should look like:
+--------+-----------+-------+
| Name | Class | Score |
+--------+-----------+-------+
| Sara | Sophomore | 40 |
| John | Freshman | 30 |
| Taylor | Sophomore | 50 |
| Tyler | Junior | 10 |
| Keith | Junior | 20 |
| Andrew | Senior | 30 |
| Victor | Senior | 10 |
| Nancy | Sophomore | 20 |
| Taylor | Junior | 30 |
| John | Senior | 40 |
| Victor | Freshman | 20 |
+--------+-----------+-------+
As you can see, if Name is the only duplicated value, nothing changes (for example, John Freshman and John Senior). If Class is the only duplicated value, nothing changes either. Both columns have to be duplicated across rows for the scores to be added.
My attempt is below, but it does not work and I get this error message:
'Error in if ((experiment[i, 1] == experiment[j, 1]) & (experiment[i, 2] == : missing value where TRUE/FALSE needed'
My code:
# creating an empty data frame
experiment1<-data.frame(matrix(ncol=3, nrow=15))
for(i in 1: nrow(experiment)){
for(j in i+1: nrow(experiment)){
if((experiment[i,1] == experiment[j,1]) & (experiment[i,2] == experiment[j,2])){
experiment1[i,1] <- experiment[i,1]
experiment1[i,2] <- experiment[i,2]
experiment1[i,3] <- experiment[i,3] + experiment[j,3]}
else{
experiment1[i,1] <- experiment[i,1]
experiment1[i,2] <- experiment[i,2]
experiment1[i,3] <- experiment[i,3]}}}
Could anyone help me fix my code or suggest "nobler" code?

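Aggregation is one of the first things explained in any basic R tutorial; I suggest you go and follow some. As for the error: in R, : binds more tightly than +, so i+1: nrow(experiment) is evaluated as i + (1:nrow(experiment)). j then runs past the last row, experiment[j,1] is NA, and if() fails on the NA condition. The loop is not needed, though; each of the following sums Score per Name and Class.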
base R
aggregate(Score ~ Name + Class, data = mydf, FUN = sum)
dplyr
mydf %>% group_by(Name, Class) %>% summarize(scoreSum = sum(Score))
data.table
setDT(mydf)[ , .(scoreSum = sum(Score)), by = .(Name, Class)]
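For readers who want to see what these one-liners do under the hood, here is a minimal sketch of the same group-and-sum idea in plain Python. It is only an illustration, not part of the R answer: the rows are copied from the question, and the function name sum_scores is made up for this example.

```python
from collections import defaultdict

# Sample rows copied from the question: (Name, Class, Score)
rows = [
    ("Sara", "Sophomore", 10), ("John", "Freshman", 20),
    ("Taylor", "Sophomore", 30), ("Tyler", "Junior", 10),
    ("Keith", "Junior", 20), ("Andrew", "Senior", 30),
    ("Victor", "Senior", 10), ("Nancy", "Sophomore", 20),
    ("Taylor", "Junior", 30), ("John", "Senior", 10),
    ("Victor", "Freshman", 20), ("Sara", "Sophomore", 30),
    ("John", "Freshman", 10), ("Taylor", "Sophomore", 20),
    ("John", "Senior", 30),
]


def sum_scores(rows):
    """Sum Score for every distinct (Name, Class) pair (hypothetical helper)."""
    totals = defaultdict(int)
    for name, cls, score in rows:
        # The pair (name, cls) is the grouping key, so a repeated name
        # with a different class lands in a different group.
        totals[(name, cls)] += score
    return dict(totals)


totals = sum_scores(rows)
for (name, cls), score in totals.items():
    print(name, cls, score)
```

Because the key is the pair of both columns, rows that share only a name or only a class are never merged, which is the behaviour asked for in the question.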

Related

change column type and convert the existing values from string to integer in mariadb

I have a table named employees:
MariaDB [company]> describe employees;
+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| employee_id | char(10) | NO | | NULL | |
| first_name | varchar(20) | NO | | NULL | |
| last_name | varchar(20) | NO | | NULL | |
| email | varchar(60) | NO | | NULL | |
| phone_number | char(14) | NO | | NULL | |
| hire_date | date | NO | | NULL | |
| job_id | int(11) | NO | | NULL | |
| salary | varchar(30) | NO | | NULL | |
| commission_pct | char(10) | NO | | NULL | |
| manager_id | char(10) | NO | | NULL | |
| department_id | char(10) | NO | | NULL | |
+----------------+-------------+------+-----+---------+-------+
MariaDB [company]> select * from employees;
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| employee_id | first_name | last_name | email | phone_number | hire_date | job_id | salary | commission_pct | manager_id | department_id |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
| 100 | Steven | King | sking#gmail.com | 515.123.4567 | 2003-06-17 | 1 | 24000.00 | 0.00 | 0 | 90 |
| 101 | Neena | Kochhar | nkochhar#gmail.com | 515.123.4568 | 2005-09-21 | 2 | 17000.00 | 0.00 | 100 | 90 |
| 102 | Lex | Wow | Lwow#gmail.com | 515.123.4569 | 2001-01-13 | 2 | 17000.00 | 0.00 | 100 | 9 |
| 103 | Alexander | Hunold | ahunold#gmail.com | 590.423.4567 | 2006-01-03 | 3 | 9000.00 | 0.00 | 102 | 60 |
| 104 | Bruce | Ernst | bernst#gmail.com | 590.423.4568 | 2007-05-21 | 3 | 6000.00 | 0.00 | 103 | 60 |
| 105 | David | Austin | daustin#gmail.com | 590.423.4569 | 2005-06-25 | 3 | 4800.00 | 0.00 | 103 | 60 |
| 106 | Valli | Pataballa | vpatabal#gmail.com | 590.423.4560 | 2006-02-05 | 3 | 4800.00 | 0.00 | 103 | 60 |
| 107 | Diana | Lorentz | dlorentz#gmail.com | 590.423.5567 | 2007-02-07 | 3 | 4200.00 | 0.00 | 103 | 60 |
| 108 | Nancy | Greenberg | ngreenbe#gmail.com | 515.124.4569 | 2002-08-17 | 4 | 12008.00 | 0.00 | 101 | 100 |
| 109 | Daniel | Faviet | dfaviet#gmail.com | 515.124.4169 | 2002-08-16 | 5 | 9000.00 | 0.00 | 108 | 100 |
| 110 | John | Chen | jchen#gmail.com | 515.124.4269 | 2005-09-28 | 5 | 8200.00 | 0.00 | 108 | 100 |
| 111 | Ismael | Sciarra | isciarra#gmail.com | 515.124.4369 | 2005-09-30 | 5 | 7700.00 | 0.00 | 108 | 100 |
| 112 | Jose | Urman | jurman#gmail.com | 515.124.4469 | 2006-03-07 | 5 | 7800.00 | 0.00 | 108 | 100 |
| 113 | Luis | Popp | lpopp#gmail.com | 515.124.4567 | 2007-12-07 | 5 | 6900.00 | 0.00 | 108 | 100 |
| 114 | Den | Raphaely | drapheal#gmail.com | 515.127.4561 | 2002-12-07 | 6 | 11000.00 | 0.00 | 100 | 30 |
| 115 | Alexander | Khoo | akhoo#gmail.com | 515.127.4562 | 2003-05-18 | 7 | 3100.00 | 0.00 | 114 | 30 |
+-------------+-------------+-------------+--------------------+---------------+------------+--------+----------+----------------+------------+---------------+
I wanted to change the salary column from string to integer. So, I ran this command
MariaDB [company]> alter table employees modify column salary int;
ERROR 1292 (22007): Truncated incorrect INTEGER value: '24000.00'
As you can see, it gave me a truncation error. I found some previous questions showing how to use convert() and trim(), but they didn't answer my question.
The SQL code and data can be found here: https://0x0.st/oYoB.com_5zfu
I tested this on MySQL and it worked fine. So it is apparently an issue only with MariaDB.
The problem is that a string like '24000.00' is not an integer. Integers don't have a decimal place. So in strict mode, the implicit type conversion fails.
I was able to work around this by running this update first:
update employees set salary = round(salary);
The column is still a string, but '24000.00' has been changed to '24000' (with no decimal point character or following digits).
Then you can alter the data type, and implicit type conversion to integer works:
alter table employees modify column salary int;
See demonstration using MariaDB 10.6:
https://dbfiddle.uk/V6LrEMKt
P.S.: You misspelled the column name "commission_pct" as "comission_pct" in your sample DDL, and I had to edit that to test. In the future, please use one of the db fiddle sites to share samples, because they will test your code.

how to solve problem running code for MySQL8 on MySQL 5.7?

I have the following data:
+---------+--------+----------+------+-------+--------+-----------+
| xType | xAccID | xAccName | xCat | xYear | xMonth | xRaseed |
+---------+--------+----------+------+-------+--------+-----------+
| Amounts | 52 | Acc1 | Rs | 2020 | 11 | 3144.83 |
| Amounts | 52 | Acc1 | Rs | 2020 | 12 | -15199.64 |
| Amounts | 53 | Acc2 | Cus | 2020 | 12 | 5306.04 |
| Amounts | 53 | Acc2 | Cus | 2020 | 11 | 1090.64 |
+---------+--------+----------+------+-------+--------+-----------+
I want to add the xRaseed in the current row to the xRaseed in the previous row, for each xAccID separately.
The result that I want:
+---------+--------+----------+------+-------+--------+--------------------------------+
| xType | xAccID | xAccName | xCat | xYear | xMonth | xRaseed |
+---------+--------+----------+------+-------+--------+--------------------------------+
| Amounts | 52 | Acc1 | Rs | 2020 | 11 | 3144.83 |
| Amounts | 52 | Acc1 | Rs | 2020 | 12 | Not -15199.64 But (-12,054.81) |
| Amounts | 53 | Acc2 | Cus | 2020 | 12 | 5306.04 |
| Amounts | 53 | Acc2 | Cus | 2020 | 11 | Not 1090.64 But (6,396.68) |
+---------+--------+----------+------+-------+--------+--------------------------------+
I applied the following solution that I got from somebody here:
select t.*,
sum(xRaseed) over (partition by xAccID order by xYear, xMonth) as running_xRaseed
from t;
Everything worked on my local server, but when I applied the solution on my hosting it didn't work. Locally I use XAMPP (10.4.17-MariaDB), and on my hosting I use MySQL 5.7.23-23. What is the problem?
Here is a db<>fiddle
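Window functions such as SUM() OVER (...) were added in MySQL 8.0, so they are not available on 5.7. On versions of MySQL earlier than 8.0, we can use a correlated subquery to compute the rolling sum: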
SELECT xType, xAccID, xAccName, xCat, xYear, xMonth,
       (SELECT SUM(t2.xRaseed)
        FROM yourTable t2
        WHERE t2.xAccID = t1.xAccID AND
              (t2.xYear < t1.xYear OR
               (t2.xYear = t1.xYear AND t2.xMonth <= t1.xMonth))) AS xRaseed
FROM yourTable t1
ORDER BY
    xAccID,
    xYear,
    xMonth;
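To check the logic of the correlated subquery without a MySQL 5.7 server at hand, the sketch below runs the same query against an in-memory SQLite database from Python. SQLite is only a stand-in here, so this says nothing about MySQL-specific behaviour. The table name yourTable comes from the answer, the data from the question, and the alias running_xRaseed is borrowed from the window-function query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE yourTable (
           xType TEXT, xAccID INTEGER, xAccName TEXT, xCat TEXT,
           xYear INTEGER, xMonth INTEGER, xRaseed REAL)"""
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Amounts", 52, "Acc1", "Rs", 2020, 11, 3144.83),
        ("Amounts", 52, "Acc1", "Rs", 2020, 12, -15199.64),
        ("Amounts", 53, "Acc2", "Cus", 2020, 12, 5306.04),
        ("Amounts", 53, "Acc2", "Cus", 2020, 11, 1090.64),
    ],
)

# For each row t1, sum every row of the same account that is dated
# on or before t1 (earlier year, or same year and month <= t1's month).
query = """
SELECT t1.xAccID, t1.xYear, t1.xMonth,
       (SELECT SUM(t2.xRaseed)
        FROM yourTable t2
        WHERE t2.xAccID = t1.xAccID
          AND (t2.xYear < t1.xYear
               OR (t2.xYear = t1.xYear AND t2.xMonth <= t1.xMonth))
       ) AS running_xRaseed
FROM yourTable t1
ORDER BY t1.xAccID, t1.xYear, t1.xMonth
"""

running = [
    (acc, year, month, round(total, 2))
    for acc, year, month, total in conn.execute(query)
]
for row in running:
    print(row)
```

The subquery is evaluated once per outer row, so on large tables an index on (xAccID, xYear, xMonth) is likely to matter.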

How to repeat the rows in a table

I have two tables.
Table1:
+-----+--------+-------------+
| Key | region | region_name |
+-----+--------+-------------+
| ABC | NT | NORTH |
| ABC | ST | SOUTH |
| XYZ | NT | NORTH |
| XYZ | ST | SOUTH |
| DEF | ST | SOUTH |
+-----+--------+-------------+
Table2:
+-----+-------+------+--------+
| KEY | Sales | cost | profit |
+-----+-------+------+--------+
| ABC | 130 | 100 | 30 |
| XYZ | 120 | 95 | 25 |
| DEF | 110 | 90 | 20 |
+-----+-------+------+--------+
I want the final output to be like below.
+-----+-------+------+--------+--------+-------------+
| KEY | Sales | cost | profit | region | region_name |
+-----+-------+------+--------+--------+-------------+
| ABC | 130 | 100 | 30 | NT | NORTH |
| ABC | 130 | 100 | 30 | ST | SOUTH |
| XYZ | 120 | 95 | 25 | NT | NORTH |
| XYZ | 120 | 95 | 25 | ST | SOUTH |
| DEF | 110 | 90 | 20 | ST | SOUTH |
+-----+-------+------+--------+--------+-------------+
Thanks in advance..!
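We can use left_join from dplyr:
library(dplyr)
left_join(df1, df2, by = 'Key')
Or with merge in base R:
merge(df1, df2, by = 'Key', all.x = TRUE)
Both calls assume the key column is named Key in both data frames. If the names differ in case as in the printed tables (Key vs. KEY), pass by = c('Key' = 'KEY') to left_join, or by.x = 'Key', by.y = 'KEY' to merge.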
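As a rough illustration of what a left join does here, the following plain-Python sketch looks up each Table1 row's key in Table2 and emits one combined row per Table1 row. The data is copied from the question; the function name left_join_rows is hypothetical and stands in for the R calls, not for any library API.

```python
# Table1 from the question: (Key, region, region_name)
table1 = [
    ("ABC", "NT", "NORTH"),
    ("ABC", "ST", "SOUTH"),
    ("XYZ", "NT", "NORTH"),
    ("XYZ", "ST", "SOUTH"),
    ("DEF", "ST", "SOUTH"),
]

# Table2 from the question, indexed by KEY: (Sales, cost, profit)
table2 = {
    "ABC": (130, 100, 30),
    "XYZ": (120, 95, 25),
    "DEF": (110, 90, 20),
}


def left_join_rows(left, right):
    """Keep every left row; attach the matching right values, or None if absent."""
    joined = []
    for key, region, region_name in left:
        sales, cost, profit = right.get(key, (None, None, None))
        joined.append((key, sales, cost, profit, region, region_name))
    return joined


joined = left_join_rows(table1, table2)
for row in joined:
    print(row)
```

Each key in Table2 is repeated once per matching row in Table1, which is why ABC and XYZ appear twice in the output and DEF once.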

QBO3 Matrix Filter does not sort as expected

I have two QBO matrices:
Matrix A has 1 input and 1 output,
Matrix B has 3 inputs and 1 output.
When I navigate to Matrix A, I see something like this:
| Client | Price |
| ------ | ------ |
| | 120.00 |
| A,B,C | 100.00 |
| D,E | 90.00 |
| F | 95.00 |
When I enter E into the Client filter, the Matrix sorts as I would expect, with the first row 'valid' and the others crossed out:
| Client | Price |
| ------ | ------ |
| D,E | 90.00 |
| | 120.00 |
| A,B,C | 100.00 |
| F | 95.00 |
Matrix B looks like this:
| Client | State | Investor | Price |
| --------- | ----- | -------- | ------ |
| | CA | NOT(2,3) | 100.00 |
| San Diego | CA | | 110.00 |
| | FL | | 95.00 |
| Miami | FL | 3 | 105.00 |
When I enter 'Miami' into the Client column filter, the sorting appears like this, with all rows crossed out:
| Client | State | Investor | Price |
| --------- | ----- | -------- | ------ |
| | CA | NOT(2,3) | 100.00 |
| | FL | | 95.00 |
| San Diego | CA | | 110.00 |
| Miami | FL | 3 | 105.00 |
Why don't I see the row with Miami at the top?
In your second example, the row with Miami also requires State = 'FL', so that row is not a match. To get the Miami row to match, you must enter Client = 'Miami' and State = 'FL'.
When the Matrix is evaluated, it calculates the following:
for each row of the matrix
    for each input of the matrix
        if the row has a value that equals your input, that's a 'match'
        if the row has a value and you provided no input, that's a 'mismatch'
        if the row has a NOT(value) and you provided no input, that's a 'match'
sort by
    weight descending,
    then by number of mismatches,
    then by the number of matches
For the Matrix B output:
| Client | State | Investor | Price |
| --------- | ----- | -------- | ------ |
| | CA | NOT(2,3) | 100.00 | Mismatch = 1, Match=1
| | FL | | 95.00 | Mismatch = 1, Match=0
| San Diego | CA | | 110.00 | Mismatch = 2, Match=1
| Miami | FL | 3 | 105.00 | Mismatch = 2, Match=1
In short:
The 'Miami' line requires 3 inputs to match, and 2 did not, so it's at the bottom.
The first line (CA) actually matches your 'empty' input, because it only cares about the Investor NOT being Investor 2 or 3, and an empty investor satisfies this requirement!

How to make a multiple corpora in R

This is car review data with more than 40,000 rows, and each review has more than 500 characters. Here is sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge review column by brands like this :
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So, I tried to use aggregate().
temp <- aggregate(data$review ~ data$brand , data, as.list )
But it takes a very long time.
Is there any simple way to merge that?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate text by brand
brand.split <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
# combine into one row per brand (use = for column names, not <-)
newDf <- data.frame(brand = names(brand.split),
                    text = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.
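The split-then-paste idea can also be sketched outside R. The Python below is a minimal illustration under the assumption that each row is a (brand, review, favorite) tuple shaped like the placeholder table in the question; the function name collapse_by_brand is invented for this example.

```python
from collections import defaultdict

# Placeholder rows shaped like the question's table: (brand, review, favorite)
rows = [
    ("brand1", "500 characters1", "100 characters1"),
    ("brand2", "500 characters2", "100 Characters2"),
    ("brand2", "500 characters3", "100 Characters3"),
    ("brand2", "500 characters4", "100 Characters4"),
    ("brand3", "500 characters5", "100 Characters5"),
    ("brand3", "500 characters6", "100 characters6"),
]


def collapse_by_brand(rows, sep=" "):
    """Split the text columns by brand, then paste each group into one string."""
    review_parts = defaultdict(list)
    favorite_parts = defaultdict(list)
    for brand, review, favorite in rows:
        review_parts[brand].append(review)      # the "split" step
        favorite_parts[brand].append(favorite)
    # The "paste" step: one joined string per brand and column.
    return {
        brand: (sep.join(review_parts[brand]), sep.join(favorite_parts[brand]))
        for brand in review_parts
    }


merged = collapse_by_brand(rows)
for brand, (review, favorite) in merged.items():
    print(brand, "|", review, "|", favorite)
```

Collecting the pieces in lists and joining once per brand avoids repeated string concatenation, which matters when each group holds thousands of long reviews.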
