Modifying a function for an expss summary table - R

Hi, I am trying to create a function that builds an expss table. Sample data is below:
dput( df<-data.frame(
aa = c("q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c","q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c"),
col1=c(1,2,3,2,1,2,3,4,4,4,5,3,4,2,1,2,5,3,2,1,2,4,2,1,3,2,1,2,3,1,2,3,4,4,4,1,2,5,3,5),
col2=c(2,1,1,7,4,1,2,7,5,7,2,6,2,2,6,3,4,3,2,5,7,5,6,4,4,6,5,6,4,1,7,7,2,7,7,2,3,7,2,4)
)
)
The function I created is:
sum1 <- function(df1, df2) {
  cro_cpct(df1[[1]], df2[[2]])
}
Now I want to add a criterion on the total row of this function: if a column's total is one of (3, 4, 5), then every value in that column should be replaced by "--".

Something like this (as written, it masks the total cell of every column whose total is below 8):
library(expss)
dataa<-data.frame(
aa = c("q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c","q","r","y","v","g","y","d","s","n","k","y","d","s","t","n","u","l","h","x","c"),
col1=c(1,2,3,2,1,2,3,4,4,4,5,3,4,2,1,2,5,3,2,1,2,4,2,1,3,2,1,2,3,1,2,3,4,4,4,1,2,5,3,5),
col2=c(2,1,1,7,4,1,2,7,5,7,2,6,2,2,6,3,4,3,2,5,7,5,6,4,4,6,5,6,4,1,7,7,2,7,7,2,3,7,2,4)
)
tab1 <- cro_cpct(dataa$aa,dataa$col1)
total_row = grep("#", tab1[[1]])
tab1[total_row, -1] = ifelse(tab1[total_row, -1]<8, "--", tab1[total_row, -1])
tab1
# | | | dataa$col1 | | | | |
# | | | 1 | 2 | 3 | 4 | 5 |
# | -------- | ------------ | ---------- | ---- | ---- | ---- | -- |
# | dataa$aa | c | 12.5 | | | | 25 |
# | | d | 12.5 | | 37.5 | | |
# | | g | 12.5 | | 12.5 | | |
# | | h | | | 12.5 | | 25 |
# | | k | 12.5 | | | 12.5 | |
# | | l | | 8.3 | | | 25 |
# | | n | 12.5 | | 12.5 | 25.0 | |
# | | q | 12.5 | 8.3 | | | |
# | | r | | 8.3 | | 12.5 | |
# | | s | | 8.3 | | 37.5 | |
# | | t | | 8.3 | | 12.5 | |
# | | u | 12.5 | 8.3 | | | |
# | | v | 12.5 | 8.3 | | | |
# | | x | | 8.3 | 12.5 | | |
# | | y | | 33.3 | 12.5 | | 25 |
# | | #Total cases | 8.0 | 12.0 | 8.0 | 8.0 | -- |
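
The question asks for the whole column to be replaced, not only the total cell. A minimal sketch of that variant follows. The helper name mask_columns and the toy table are hypothetical, and the sketch assumes the table behaves like a data.frame whose first column holds the row labels and whose total row contains "#", as in the output above.

```r
# Replace every cell of a column with "--" when that column's total
# (the row whose label contains "#") is one of `bad_totals`.
# Hypothetical helper; assumes column 1 holds the row labels.
mask_columns <- function(tab, bad_totals = c(3, 4, 5), mask = "--") {
  total_row <- grep("#", tab[[1]])[1]
  for (j in seq_along(tab)[-1]) {
    total <- suppressWarnings(as.numeric(tab[[j]][total_row]))
    if (!is.na(total) && total %in% bad_totals) {
      tab[[j]] <- rep(mask, nrow(tab))
    }
  }
  tab
}

# Toy stand-in for a cro_cpct() result (made-up values)
toy <- data.frame(
  row_labels = c("a", "b", "#Total cases"),
  `1` = c(50, 50, 8),
  `5` = c(25, 75, 4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
masked <- mask_columns(toy)
print(masked)
```

With the real table this would be called as mask_columns(tab1). A masked column becomes character, so it is probably best applied as the last step before printing.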

Related

R combine 3 dataframes and perform operations

I have 3 dataframes with different numbers of rows. I want to perform some operations on 2 of the dataframes based on the row values in the third dataframe.
dataframe 1:
+--------------------------+
| V1 Particlei Particlej |
+--------------------------+
| <chr> <dbl> <dbl> |
| 1 conf10 6 1829 |
| 2 conf10 6 13928 |
| 3 conf10 8 2875 |
| 4 conf10 8 13765 |
| 5 conf10 9 3184 |
| 6 conf10 9 11139 |
+--------------------------+
dataframe 2
+----------+----------+------------+-------------+
| V1 | cluster | position.x | position.y |
+----------+----------+------------+-------------+
| <chr> | <dbl> | <dbl> | <dbl> |
| 1 conf10 | 6 | 0.000659 | 0.00932 |
| 2 conf10 | 8 | 0.0291 | 0.00922 |
| 3 conf10 | 10 | 0.0101 | 0.00380 |
| 4 conf10 | 12 | -0.0103 | 0.00379 |
| 5 conf10 | 14 | 0.0165 | 0.000900 |
| 6 conf10 | 16 | -0.000554 | 0.0112 |
+----------+----------+------------+-------------+
and dataframe 3
+----------+----------+--------------------+------------+
| V1 | cluster | position.x | position.y |
+----------+----------+--------------------+------------+
| <chr> | <dbl> | <dbl> | <dbl> |
| 1 conf9 | 7 | -0.0104 | 0.000920 |
| 2 conf9 | 9 | -0.00426 | 0.0139 |
| 3 conf9 | 11 | 0.0249 | 0.0164 |
| 4 conf9 | 13 | -0.0146 | 0.00242 |
| 5 conf9 | 15 | -0.0176 | 0.00220 |
| 6 conf9 | 17 | -0.0183 | 0.00620 |
+----------+----------+--------------------+------------+
I want to do a row-wise operation driven by the values in data1. For each row in data1, I want to check whether the values in columns Particlei and Particlej are present in the cluster column of data2 and data3. If they are, I want to perform some operations on the matching rows of data2 and data3.
For example, row 1 of data1 has 6 and 1829, so I want to select the rows of data2 and data3 whose cluster is 6 or 1829. For the selected rows, I want to subtract position.x of data3 from position.x of data2, and likewise subtract position.y of data3 from position.y of data2. All of this should be done row by row. What I have tried so far:
for(i in row_number(data3)){
y <- data1 %>% filter(any(data3[,1:2]==data2$cluster))
if(any(data2$cluster==data3[,1:2])){
while(any(data2$cluster==data3[,1])){
delta_x = data2$position.x-data1$position.x
delta_y = data2$position.y-data1$position.y
}
}
}
Expected output:
+---------------+------------+-------------------+-------------------+------------------+------------------+-----------+-------------------------------------------------+-----------+-----------+
| V1 | cluster | position.x_data3 | position.y_data3 | position.x_data2 | position.y_data2 | delta.x | delta.y | particlei | particlej |
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | | |
| 1 conf9,10 | 6 | -0.0104 | 0.000920 | 0.000659 | 0.00932 | -0.011059 | -0.0084 | 6 | 1829 |
| 2 conf9,10 | 1829 | -0.00426 | 0.0139 | 0.000659 | 0.000659 | 0.000659 | 0.000575 | 6 | 1829 |
| 3 conf9,10 | 7 | 0.0249 | 0.0164 | ... | ,... | ... | some values subtracted between position columns | 7 | 13928 |
| 4 conf9,10 | 13928 | -0.0146 | 0.00242 | some values | some values | ... | ... | 7 | 13928 |
+---------------+------------+-------------------+-------------------+------------------+------------------+-----------+-------------------------------------------------+-----------+-----------+
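
One way to avoid the loop is to reshape data1 so that each particle id becomes a cluster key, then join the two position tables on cluster. The sketch below is only an illustration under assumptions: the toy inputs are hypothetical (the cluster ids are chosen so that they match across tables, which the sample data above does not), and the sign of the deltas follows the expected output, where delta.x equals position.x_data3 minus position.x_data2.

```r
# Toy inputs (hypothetical: clusters chosen so they match across tables)
data1 <- data.frame(V1 = "conf10", Particlei = c(6, 8), Particlej = c(1829, 2875))
data2 <- data.frame(cluster = c(6, 1829, 8),
                    position.x = c(0.000659, 0.0291, 0.0101),
                    position.y = c(0.00932, 0.00922, 0.00380))
data3 <- data.frame(cluster = c(6, 1829, 8),
                    position.x = c(-0.0104, -0.00426, 0.0249),
                    position.y = c(0.000920, 0.0139, 0.0164))

# One row per (pair, particle): each particle id becomes a cluster key
pairs_long <- rbind(
  data.frame(particlei = data1$Particlei, particlej = data1$Particlej,
             cluster = data1$Particlei),
  data.frame(particlei = data1$Particlei, particlej = data1$Particlej,
             cluster = data1$Particlej)
)

# Inner joins: clusters missing from data2 or data3 are dropped
positions <- merge(data2, data3, by = "cluster", suffixes = c("_data2", "_data3"))
out <- merge(pairs_long, positions, by = "cluster")

# Sign convention taken from the expected output (data3 minus data2)
out$delta.x <- out$position.x_data3 - out$position.x_data2
out$delta.y <- out$position.y_data3 - out$position.y_data2
print(out)
```

Because both merges are inner joins, a particle id with no matching cluster (2875 in the toy data) simply does not appear in the result. Swap the operands if the difference is wanted the other way round.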

R studio: How to extend time series and fill in 0 for missing values?

I have two tables:
one is the "visits" table:
+--------+---------+--------+
| date | user_id | visits |
+--------+---------+--------+
| 1/1/18 | A | 2 |
+--------+---------+--------+
| 1/2/18 | A | 4 |
+--------+---------+--------+
| 1/3/18 | A | 10 |
+--------+---------+--------+
| 1/4/18 | A | 34 |
+--------+---------+--------+
| 1/5/18 | A | 23 |
+--------+---------+--------+
| 1/1/18 | B | 15 |
+--------+---------+--------+
| 1/2/18 | B | 12 |
+--------+---------+--------+
| 1/1/18 | C | 10 |
+--------+---------+--------+
| 1/1/18 | D | 5 |
+--------+---------+--------+
| 1/2/18 | D | 12 |
+--------+---------+--------+
| 1/3/18 | D | 15 |
+--------+---------+--------+
| 1/4/18 | D | 25 |
+--------+---------+--------+
| 1/1/18 | E | 18 |
+--------+---------+--------+
| 1/1/18 | G | 21 |
+--------+---------+--------+
| 1/2/18 | G | 10 |
+--------+---------+--------+
Another one is the "location" table:
+---------+----------+
| user_id | location |
+---------+----------+
| A | 1 |
+---------+----------+
| B | 1 |
+---------+----------+
| C | 1 |
+---------+----------+
| D | 2 |
+---------+----------+
| E | 3 |
+---------+----------+
| F | 3 |
+---------+----------+
| G | 3 |
+---------+----------+
Note:
If a user does not visit, he/she will not show up in the "visits" table. His/her visit is 0 that day.
The "location" table has the complete population of users.
Question:
I would like to extend the "visits" table, such that it looks like this:
+--------+---------+--------+
| date | user_id | visits |
+--------+---------+--------+
| 1/1/18 | A | 2 |
+--------+---------+--------+
| 1/2/18 | A | 4 |
+--------+---------+--------+
| 1/3/18 | A | 10 |
+--------+---------+--------+
| 1/4/18 | A | 34 |
+--------+---------+--------+
| 1/5/18 | A | 23 |
+--------+---------+--------+
| 1/1/18 | B | 15 |
+--------+---------+--------+
| 1/2/18 | B | 12 |
+--------+---------+--------+
| 1/3/18 | B | 0 |
+--------+---------+--------+
| 1/4/18 | B | 0 |
+--------+---------+--------+
| 1/5/18 | B | 0 |
+--------+---------+--------+
| 1/1/18 | C | 10 |
+--------+---------+--------+
| 1/2/18 | C | 0 |
+--------+---------+--------+
| 1/3/18 | C | 0 |
+--------+---------+--------+
| 1/4/18 | C | 0 |
+--------+---------+--------+
| 1/5/18 | C | 0 |
+--------+---------+--------+
| 1/1/18 | D | 5 |
+--------+---------+--------+
| 1/2/18 | D | 12 |
+--------+---------+--------+
| 1/3/18 | D | 15 |
+--------+---------+--------+
| 1/4/18 | D | 25 |
+--------+---------+--------+
| 1/5/18 | D | 0 |
+--------+---------+--------+
| 1/1/18 | E | 18 |
+--------+---------+--------+
| 1/2/18 | E | 0 |
+--------+---------+--------+
| 1/3/18 | E | 0 |
+--------+---------+--------+
| 1/4/18 | E | 0 |
+--------+---------+--------+
| 1/5/18 | E | 0 |
+--------+---------+--------+
| 1/1/18 | F | 0 |
+--------+---------+--------+
| 1/2/18 | F | 0 |
+--------+---------+--------+
| 1/3/18 | F | 0 |
+--------+---------+--------+
| 1/4/18 | F | 0 |
+--------+---------+--------+
| 1/5/18 | F | 0 |
+--------+---------+--------+
| 1/1/18 | G | 21 |
+--------+---------+--------+
| 1/2/18 | G | 10 |
+--------+---------+--------+
| 1/3/18 | G | 0 |
+--------+---------+--------+
| 1/4/18 | G | 0 |
+--------+---------+--------+
| 1/5/18 | G | 0 |
+--------+---------+--------+
A table in this format is easier for me to do further analysis with the whole population in one table.
I would like to code this in R, ideally using tidyverse.
I can't wrap my head around how to achieve this. Appreciate any insights and help into this. Thanks so much!
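We can use complete() from tidyr here. It expands the table to every combination of date and user_id (taking the full list of users from location) and fills the missing visits with 0: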
library(dplyr)
library(tidyr)
visits %>%
complete(date, user_id = location$user_id, fill = list(visits = 0))
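
For readers without tidyr, the same idea can be sketched in base R: build the full date by user grid, left-join the observed visits, and replace the resulting NA values with 0. The data below is a small subset of the question's tables, and the dates are kept as plain strings, which is an assumption made for brevity.

```r
# Small subset of the question's tables (dates kept as strings)
visits <- data.frame(
  date = c("1/1/18", "1/2/18", "1/3/18", "1/1/18"),
  user_id = c("A", "A", "A", "B"),
  visits = c(2, 4, 10, 15),
  stringsAsFactors = FALSE
)
location <- data.frame(
  user_id = c("A", "B", "F"),
  location = c(1, 1, 3),
  stringsAsFactors = FALSE
)

# Every date paired with every user from the full population
grid <- expand.grid(
  date = unique(visits$date),
  user_id = location$user_id,
  stringsAsFactors = FALSE
)

# Left join keeps all grid rows; unobserved combinations get NA, then 0
full <- merge(grid, visits, by = c("date", "user_id"), all.x = TRUE)
full$visits[is.na(full$visits)] <- 0
full <- full[order(full$user_id, full$date), ]
print(full)
```

User F, who never appears in visits, ends up with one zero row per date. With real data it is probably worth converting date to a Date first so that sorting does not depend on string order.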

r increment column value based on another column value

I have a datatable x like this
+----+---------------+-------+
| id | arg | value |
+----+---------------+-------+
| 1 | New Day | NA |
| 2 | Eat breakfast | 3 |
| 3 | Bike | 45 |
| 4 | New Day | 0 |
| 5 | Get coffee | 1 |
| 6 | Exercise | 15 |
| 7 | Get beer | NA |
| 8 | New Day | |
| 9 | Pet cat | |
+----+---------------+-------+
I would like to add an incrementing column for every day to get something like this
+----+---------------+-------+-----+
| id | arg | value | day |
+----+---------------+-------+-----+
| 1 | New Day | NA | 1 |
| 2 | Eat breakfast | 3 | 1 |
| 3 | Bike | 45 | 1 |
| 4 | New Day | 0 | 2 |
| 5 | Get coffee | 1 | 2 |
| 6 | Exercise | 15 | 2 |
| 7 | Get beer | NA | 2 |
| 8 | New Day | | 3 |
| 9 | Pet cat | | 3 |
+----+---------------+-------+-----+
I have tried this without much success
x$day <-0
x<-within(x, day<-ifelse(arg == "New day", day+1, day))
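As pointed out by @A.Webb, a cumulative sum over the "New Day" rows gives the counter (note that the label must match the data's capitalisation exactly):
x$day <- cumsum(x$arg == "New Day")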
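
A small runnable check of the cumsum approach on the question's table. The blank value cells are entered as NA here, which is an assumption; the comparison string has to match the label in arg exactly, including case.

```r
# The question's table; blank values are assumed to be NA
x <- data.frame(
  id = 1:9,
  arg = c("New Day", "Eat breakfast", "Bike", "New Day", "Get coffee",
          "Exercise", "Get beer", "New Day", "Pet cat"),
  value = c(NA, 3, 45, 0, 1, 15, NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE counts as 1, so the running sum increases at every "New Day" row
x$day <- cumsum(x$arg == "New Day")
print(x)
```

The ifelse attempt cannot work because it evaluates each row independently, whereas cumsum carries the count forward from one row to the next.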

Passing R script variables into a batch script

In an R script, I'm executing a batch file inside of a for loop.
for (i in 1:2){
shell(shQuote("\\\\NETWORK\\PATH\\TO\\THE\\FILE.BAT", "cmd"))
}
The batch script creates data and moves it to a SQL table that looks like this:
| Name | Version | Category | Value | Number | Replication |
|:-----:|:-------:|:--------:|:-----:|:------:|:-----------:|
| File1 | 1.0 | Time | 123 | 1 | 1 |
| File1 | 1.0 | Size | 456 | 1 | 1 |
| File2 | 1.0 | Time | 312 | 1 | 1 |
| File2 | 1.0 | Size | 645 | 1 | 1 |
| File1 | 1.0 | Time | 369 | 1 | 2 |
| File1 | 1.0 | Size | 258 | 1 | 2 |
| File2 | 1.0 | Time | 741 | 1 | 2 |
| File2 | 1.0 | Size | 734 | 1 | 2 |
| File1 | 1.1 | Time | 997 | 2 | 1 |
| File1 | 1.1 | Size | 997 | 2 | 1 |
| File2 | 1.1 | Time | 438 | 2 | 1 |
| File2 | 1.1 | Size | 735 | 2 | 1 |
| File1 | 1.1 | Time | 786 | 2 | 2 |
| File1 | 1.1 | Size | 486 | 2 | 2 |
| File2 | 1.1 | Time | 379 | 2 | 2 |
| File2 | 1.1 | Size | 943 | 2 | 2 |
| File1 | 1.2 | Time | 123 | 3 | 1 |
| File1 | 1.2 | Size | 456 | 3 | 1 |
| File2 | 1.2 | Time | 312 | 3 | 1 |
| File2 | 1.2 | Size | 645 | 3 | 1 |
| File1 | 1.2 | Time | 369 | 3 | 2 |
| File1 | 1.2 | Size | 258 | 3 | 2 |
| File2 | 1.2 | Time | 741 | 3 | 2 |
| File2 | 1.2 | Size | 734 | 3 | 2 |
| File1 | 1.3 | Time | 997 | 4 | 1 |
| File1 | 1.3 | Size | 997 | 4 | 1 |
| File2 | 1.3 | Time | 438 | 4 | 1 |
| File2 | 1.3 | Size | 735 | 4 | 1 |
However, I'd like the Number and Replication columns to be set in the R script, not in the batch file.
I know I can do that like this:
Replication <- i
Number <- as.integer(sqlQuery(dbhandle, "select max(Number) from Table"))
Number<-ifelse(is.na(Number), 1, Number + 1)
My question though is how can I pass these variables into the batch script? Can I pass parameters into the batch script?
So that in my batch script, I could have something similar to this:
set Rep=[Replication variable from R]
set Num=[Number variable from R]
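
Batch files receive command-line arguments as positional parameters (%1, %2, ...), so one option is to append the values to the command string that R hands to shell(). The sketch below only builds and prints that string so it can be inspected anywhere; build_batch_call is a hypothetical helper name, the values 2 and 5 are placeholders, and the commented-out shell() call is Windows-only.

```r
# Hypothetical helper: quote the script path for cmd.exe and append
# the arguments, separated by spaces.
build_batch_call <- function(bat_path, ...) {
  args <- vapply(list(...), as.character, character(1))
  paste(c(shQuote(bat_path, type = "cmd"), args), collapse = " ")
}

Replication <- 2  # placeholder; in the loop this would be i
Number <- 5       # placeholder; in the script this comes from sqlQuery()

cmd <- build_batch_call("\\\\NETWORK\\PATH\\TO\\THE\\FILE.BAT", Replication, Number)
print(cmd)

# On Windows the call would then be:
#   shell(cmd)
# and inside FILE.BAT the values arrive as positional parameters:
#   set Rep=%1
#   set Num=%2
```

Arguments that contain spaces would need quoting of their own; plain integers like these do not.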

SQLite query select best option depending on a max value

I have a probably pretty hard question/situation:
I have a database that divides several tasks among some workers.
In the following example I have two tasks (Task 1 and Task 2) and 4 employees (1, 2, 3 and 4).
At most three employees can work on one task. Therefore I have 3 columns to hold all possible options (in this example, not every option is shown!). The last column is a value that indicates how good the option is (the higher the number, the better).
The goal is to find the optimal situation, which means:
Every employee has to do exactly one task (and cannot do 2 tasks)
The sum of the values is the highest possible
+------------+------------+------------+------+--------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+------------+------------+------+--------+
| 1 | | | 1 | 5.0 |
| 2 | | | 1 | -2.5 |
| 3 | | | 1 | 1.0 |
| 4 | | | 1 | 0.5 |
| 1 | 2 | | 1 | 0.5 |
| 1 | 4 | | 1 | 5.0 |
| 1 | 2 | 3 | 1 | 0.33 |
| 2 | 3 | | 1 | -4.5 |
| 2 | 3 | 4 | 1 | -6.5 |
| 3 | 4 | | 1 | 3.0 |
| 1 | | | 2 | 1.0 |
| 2 | | | 2 | 2.0 |
| 3 | | | 2 | -5.0 |
| 4 | | | 2 | 3.0 |
| 1 | 2 | | 2 | -2.0 |
| 1 | 2 | 3 | 2 | -3.5 |
| 2 | 3 | | 2 | 5.0 |
| 2 | 3 | 4 | 2 | 0.5 |
| 3 | 4 | | 2 | 2.0 |
+------------+------------+------------+------+--------+
As you can see, sometimes it is better for productivity to split employees up:
Employee 1 gets a value of 5.0 on task 1
Employee 4 gets a value of 0.5 on task 1
Employees 1 and 4 together get a value of 5.0 on task 1
In this situation it is better that employees 1 and 4 work separately (5.0 + 0.5 = 5.5, which is more than 5.0), and the query should give both lines:
+------------+-------------+------------+-------+---------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+-------------+------------+-------+---------+
| 1 | | | 1 | 5.0 |
| 4 | | | 1 | 0.5 |
+------------+-------------+------------+-------+---------+
The real solution for this example should be:
+------------+-------------+------------+-------+---------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+-------------+------------+-------+---------+
| 1 | | | 1 | 5.0 |
| 2 | 3 | | 2 | 5.0 |
| 4 | | | 2 | 3.0 |
+------------+-------------+------------+-------+---------+
This is because employee 1 has a very high value on his own on task 1.
Employee 3 is really bad on his own, but together with employee 2 they do great on task 2.
Employee 4 is the only one left, and this employee is pretty good at task 2.
The problem is how to write the query that gives this result.
