Replacing multiple observations in multiple columns - r

I have two data frames: one with the original information and a second one with corrections to some of those observations. I would like to create a function, or find some other way, to replace values in multiple columns of the first data frame with the new information I received. I have an ID to identify the observations that need to be replaced, but since so many columns change for certain IDs, I don't know what the appropriate way of changing them is.
My first data frame has 500 columns and 1,000 observations; my second data frame has 100 columns and 800 observations that should overwrite values in the original data frame. I don't know how to efficiently replace those values according to the ID.
Here is an example of what the two data frames look like. I need to replace just some values in multiple columns, and a merge is not the most efficient option since at least 100 columns will need changes in some of the observations.
I just need to bring in the new info and keep the old values everywhere else (see the sketch after the example tables).
Data frame 1

| ID | X1 | X2 | X3 | X4 | XN |
|----|----|----|----|----|----|
| a1 | 1  | 1  | 1  | 1  | 1  |
| a2 | 2  | 2  | 2  | 2  | 2  |
| a3 | 3  | 3  | 3  | 3  | 3  |
| a4 | 4  | 4  | 4  | 4  | 4  |
| a5 | 5  | 5  | 5  | 5  | 5  |
| an | 6  | 6  | 6  | 6  | 6  |
Data frame 2

| ID | X1 | X2 | X4 |
|----|----|----|----|
| a1 | 8  |    | 4  |
| a3 |    |    | 2  |
| a4 | 2  | 9  |    |
| an | 1  |    | 3  |
The outcome should keep the old values of data frame 1, with just the replacements coming from data frame 2.
Outcome

| ID | X1 | X2 | X3 | X4 | XN |
|----|----|----|----|----|----|
| a1 | 8  | 1  | 1  | 4  | 1  |
| a2 | 2  | 2  | 2  | 2  | 2  |
| a3 | 3  | 3  | 3  | 2  | 3  |
| a4 | 2  | 9  | 4  | 4  | 4  |
| a5 | 5  | 5  | 5  | 5  | 5  |
| an | 1  | 6  | 6  | 3  | 6  |
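A minimal base-R sketch, assuming the data frames are called df1 and df2 (hypothetical names), both are keyed by ID, and the empty cells of df2 are read in as NA (meaning "keep the old value"):

idx <- match(df2$ID, df1$ID)             # row of df1 that each correction targets
for (col in setdiff(names(df2), "ID")) {
  new  <- df2[[col]]
  keep <- !is.na(new)                    # only overwrite where a correction exists
  df1[idx[keep], col] <- new[keep]
}

This loops over only the roughly 100 corrected columns and leaves every other value of df1 untouched, so no merge of all 500 columns is needed.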


r data.table groupby join in pyspark 1.6

I have the following data.tables (R code):
accounts <- fread("ACC_ID | DATE | RATIO | VALUE
1 | 2017-12-31 | 2.00 | 8
2 | 2017-12-31 | 2.00 | 12
3 | 2017-12-31 | 6.00 | 20
4 | 2017-12-31 | 1.00 | 5 ", sep='|')
timeline <- fread(" DATE
2017-12-31
2018-12-31
2019-12-31
2020-12-31", sep="|")
In R, I know I can join on DATE, by ACC_ID, RATIO and VALUE:
accounts[, .SD[timeline, on='DATE'], by=c('ACC_ID', 'RATIO', 'VALUE')]
This way, I can "project" ACC_ID, RATIO and VALUE values over timeline dates, getting the following data table:
ACC_ID | RATIO | VALUE | DATE
1 | 2 | 8 |2017-12-31
2 | 2 | 12 |2017-12-31
3 | 6 | 20 |2017-12-31
4 | 1 | 5 |2017-12-31
1 | 2 | 8 |2018-12-31
2 | 2 | 12 |2018-12-31
3 | 6 | 20 |2018-12-31
4 | 1 | 5 |2018-12-31
1 | 2 | 8 |2019-12-31
2 | 2 | 12 |2019-12-31
3 | 6 | 20 |2019-12-31
4 | 1 | 5 |2019-12-31
1 | 2 | 8 |2020-12-31
2 | 2 | 12 |2020-12-31
3 | 6 | 20 |2020-12-31
4 | 1 | 5 |2020-12-31
I've been trying hard to find something similar in PySpark, but I haven't been able to. What would be the appropriate way to solve this?
Thanks very much for your time. I greatly appreciate any help you can give me, this one is important for me.
It looks like you're trying to do a cross join?
spark.sql('''
select ACC_ID, RATIO, VALUE, timeline.DATE
from accounts, timeline
''')
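Note that the SQL version assumes both DataFrames were registered as temp tables first (on Spark 1.6: accounts.registerTempTable('accounts'), and likewise for timeline). A sketch of the same cartesian product with the DataFrame API; crossJoin() only exists from Spark 2.1, so on 1.6 a join() with no condition does it:

result = accounts.join(timeline)           # Spark 1.6: no condition = cartesian join
# result = accounts.crossJoin(timeline)    # Spark >= 2.1 equivalent
result.select('ACC_ID', 'RATIO', 'VALUE', timeline['DATE']).show()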

Subtract column values using coalesce

I want to subtract each row's "place" value from the previous row's "place" within each "race", "bib" group (rows ordered by "split"), so that a "diff" column appears like so.
Desired Output:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 0
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 0
10 | 17 | 2 | 12 | -4
10 | 17 | 3 | 15 | -3
I'm new to the coalesce statement, and the closest I have come to the desired output is the following:
select a.race,a.bib,a.split, a.place,
coalesce(a.place -
(select b.place from ranking b where b.split < a.split), a.place) as diff
from ranking a
group by race,bib, split
which produces:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 5
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 8
10 | 17 | 2 | 12 | 11
10 | 17 | 3 | 15 | 14
Thanks for looking!
To compute the difference, you have to look up the place in the row that has the same race and bib values and the next-smaller split value:
SELECT race, bib, split, place,
       coalesce((SELECT r2.place
                 FROM ranking AS r2
                 WHERE r2.race = ranking.race
                   AND r2.bib = ranking.bib
                   AND r2.split < ranking.split
                 ORDER BY r2.split DESC
                 LIMIT 1
                ) - place,
                0) AS diff
FROM ranking;
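Since SQLite 3.25 you can get the same result with the lag() window function, which avoids the correlated subquery:

SELECT race, bib, split, place,
       coalesce(lag(place) OVER (PARTITION BY race, bib
                                 ORDER BY split) - place,
                0) AS diff
FROM ranking;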

r increment column value based on another column value

I have a datatable x like this
+----+---------------+-------+
| id | arg | value |
+----+---------------+-------+
| 1 | New Day | NA |
| 2 | Eat breakfast | 3 |
| 3 | Bike | 45 |
| 4 | New Day | 0 |
| 5 | Get coffee | 1 |
| 6 | Exercise | 15 |
| 7 | Get beer | NA |
| 8 | New Day | |
| 9 | Pet cat | |
+----+---------------+-------+
I would like to add an incrementing column for every day to get something like this
+----+---------------+-------+-----+
| id | arg | value | day |
+----+---------------+-------+-----+
| 1 | New Day | NA | 1 |
| 2 | Eat breakfast | 3 | 1 |
| 3 | Bike | 45 | 1 |
| 4 | New Day | 0 | 2 |
| 5 | Get coffee | 1 | 2 |
| 6 | Exercise | 15 | 2 |
| 7 | Get beer | NA | 2 |
| 8 | New Day | | 3 |
| 9 | Pet cat | | 3 |
+----+---------------+-------+-----+
I have tried this without much success:
x$day <- 0
x <- within(x, day <- ifelse(arg == "New day", day + 1, day))
As pointed out by @A.Webb, a cumulative sum over the marker rows is what you need (note the capitalisation, which must match the "New Day" in the data):
cumsum(arg == "New Day")

SQLite query select best option depending on a max value

I have a probably pretty hard question/situation:
I have a database for dividing several tasks among some workers.
In the next example there are two tasks (Task 1 and Task 2) and four employees (1, 2, 3 and 4).
At most three employees can work on one task, so there are three employee columns to cover all possible options (not every option is shown in this example!). The last column is a value indicating how good the option is (the higher the number, the better).
The goal is to find the optimal situation, which means:
Every employee has to do exactly one task (and cannot do two tasks)
The sum of the values is the highest possible
+------------+------------+------------+------+--------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+------------+------------+------+--------+
| 1 | | | 1 | 5.0 |
| 2 | | | 1 | -2.5 |
| 3 | | | 1 | 1.0 |
| 4 | | | 1 | 0.5 |
| 1 | 2 | | 1 | 0.5 |
| 1 | 4 | | 1 | 5.0 |
| 1 | 2 | 3 | 1 | 0.33 |
| 2 | 3 | | 1 | -4.5 |
| 2 | 3 | 4 | 1 | -6.5 |
| 3 | 4 | | 1 | 3.0 |
| 1 | | | 2 | 1.0 |
| 2 | | | 2 | 2.0 |
| 3 | | | 2 | -5.0 |
| 4 | | | 2 | 3.0 |
| 1 | 2 | | 2 | -2.0 |
| 1 | 2 | 3 | 2 | -3.5 |
| 2 | 3 | | 2 | 5.0 |
| 2 | 3 | 4 | 2 | 0.5 |
| 3 | 4 | | 2 | 2.0 |
+------------+------------+------------+------+--------+
As you can see, sometimes working separately is better for productivity:
Employee 1 gets a value of 5.0 on task 1
Employee 4 gets a value of 0.5 on task 1
Employees 1 and 4 together get a value of 5.0 on task 1
In this situation it is better that employees 1 and 4 work separately (5.0 + 0.5 > 5.0), so the query should give both lines:
+------------+-------------+------------+-------+---------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+-------------+------------+-------+---------+
| 1 | | | 1 | 5.0 |
| 4 | | | 1 | 0.5 |
+------------+-------------+------------+-------+---------+
The real solution for this example should be:
+------------+-------------+------------+-------+---------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+-------------+------------+-------+---------+
| 1 | | | 1 | 5.0 |
| 2 | 3 | | 2 | 5.0 |
| 4 | | | 2 | 3.0 |
+------------+-------------+------------+-------+---------+
Employee 1 has a very high value on his own on task 1.
Employee 3 is really bad on his own, but together with employee 2 they do great on task 2.
Employee 4 is the only one left, and he is pretty good at task 2.
The problem is how to write the query that produces this result.
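This is a set-partitioning problem, so plain SQL can only brute-force it. A sketch for combinations of exactly three option rows (which is what the example solution needs), assuming a hypothetical table options(id, e1, e2, e3, task, value) with NULL in the unused employee slots; partitions of two or four rows would need analogous branches, and anything beyond a handful of employees belongs in application code:

WITH emp(opt_id, emp) AS (               -- one row per (option, employee) pair
  SELECT id, e1 FROM options WHERE e1 IS NOT NULL
  UNION ALL SELECT id, e2 FROM options WHERE e2 IS NOT NULL
  UNION ALL SELECT id, e3 FROM options WHERE e3 IS NOT NULL
)
SELECT a.id, b.id, c.id, a.value + b.value + c.value AS total
FROM options a
JOIN options b ON b.id > a.id
JOIN options c ON c.id > b.id
WHERE NOT EXISTS (                       -- no employee appears in two chosen rows
        SELECT emp FROM emp
        WHERE opt_id IN (a.id, b.id, c.id)
        GROUP BY emp HAVING count(*) > 1)
  AND (SELECT count(DISTINCT emp) FROM emp
       WHERE opt_id IN (a.id, b.id, c.id)) = 4   -- all four employees covered
ORDER BY total DESC
LIMIT 1;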

order grouping variable in R

I have a database like this:
ID   | familysize | age | gender
-----+------------+-----+-------
1001 | 4          | 26  | 1
1001 | 4          | 38  | 2
1001 | 4          | 30  | 2
1001 | 4          | 7   | 1
1002 | 3          | 25  | 2
1002 | 3          | 39  | 1
1002 | 3          | 10  | 2
1003 | 5          | 60  | 1
1003 | 5          | 50  | 2
1003 | 5          | 26  | 2
1003 | 5          | 23  | 1
1003 | 5          | 20  | 1
1004 | ....
I want to order this data frame by the age of the people within each ID, so I use this command:
library(plyr)
b2 <- ddply(b, "ID", function(x) head(x[order(x$age, decreasing = TRUE), ]))
but when I use it I lose some of the observations. What should I do to order this data frame?
b2 <- b[order(b$ID, -b$age), ]
should do the trick. (Your ddply call dropped observations because head() keeps only the first six rows of each group by default.)
The arrange function in plyr also does a great job here: order by ID, then by age in descending order.
arrange(b, ID, desc(age))
