Cross-table for subset in R

I have the following data frame (simplified):
IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2
How can I create a cross table (using the CrossTable function in gmodels, because I need to do a chi-square test), but only for the rows where Type equals 1?

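You probably want this: subset the rows where Type is 1 first, then pass the IPET and Task columns of that subset to CrossTable. (Add chisq=TRUE to the CrossTable call if you also want the chi-square test itself printed.)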
library(gmodels)
with(df.1[df.1$Type==1, ], CrossTable(IPET, Task))
Yielding
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  3

             | Task
        IPET |         1 |         3 | Row Total |
-------------|-----------|-----------|-----------|
           1 |         1 |         1 |         2 |
             |     0.083 |     0.167 |           |
             |     0.500 |     0.500 |     0.667 |
             |     0.500 |     1.000 |           |
             |     0.333 |     0.333 |           |
-------------|-----------|-----------|-----------|
           2 |         1 |         0 |         1 |
             |     0.167 |     0.333 |           |
             |     1.000 |     0.000 |     0.333 |
             |     0.500 |     0.000 |           |
             |     0.333 |     0.000 |           |
-------------|-----------|-----------|-----------|
Column Total |         2 |         1 |         3 |
             |     0.667 |     0.333 |           |
-------------|-----------|-----------|-----------|
Data
df.1 <- read.table(header=TRUE, text="IPET Task Type
1 1 1
1 2 2
1 3 1
2 1 1
2 1 2")
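As a cross-check on the "Chi-square contribution" rows in the output above, here is a minimal sketch that recomputes them by hand for the Type == 1 subset. It is written in plain Python rather than R only to make the arithmetic explicit; the variable names are illustrative and not part of gmodels. Each contribution is (observed - expected)^2 / expected, where expected = row total * column total / N.

```python
from collections import Counter

# Rows of df.1 as (IPET, Task, Type), copied from the question.
rows = [(1, 1, 1), (1, 2, 2), (1, 3, 1), (2, 1, 1), (2, 1, 2)]

# Keep only Type == 1, mirroring df.1[df.1$Type == 1, ].
subset = [(ipet, task) for ipet, task, typ in rows if typ == 1]

counts = Counter(subset)                    # observed cell counts (0 if absent)
ipets = sorted({i for i, _ in subset})      # row levels
tasks = sorted({t for _, t in subset})      # column levels
n = len(subset)
row_tot = {i: sum(counts[(i, t)] for t in tasks) for i in ipets}
col_tot = {t: sum(counts[(i, t)] for i in ipets) for t in tasks}

# Per-cell chi-square contribution: (observed - expected)^2 / expected.
contrib = {}
for i in ipets:
    for t in tasks:
        expected = row_tot[i] * col_tot[t] / n
        contrib[(i, t)] = (counts[(i, t)] - expected) ** 2 / expected

chi_square = sum(contrib.values())
for cell, value in sorted(contrib.items()):
    print(cell, round(value, 3))
print("chi-square:", round(chi_square, 3))
```

The four contributions come out as 0.083, 0.167, 0.167 and 0.333, matching the table, and they sum to 0.75.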

Related

R combine 3 dataframes and perform operations

I have 3 dataframes with different numbers of rows. I want to perform some operations on 2 of the dataframes based on the row values in the third dataframe.
dataframe 1:
+----------+-----------+-----------+
| V1 | Particlei | Particlej |
+----------+-----------+-----------+
| <chr> | <dbl> | <dbl> |
| 1 conf10 | 6 | 1829 |
| 2 conf10 | 6 | 13928 |
| 3 conf10 | 8 | 2875 |
| 4 conf10 | 8 | 13765 |
| 5 conf10 | 9 | 3184 |
| 6 conf10 | 9 | 11139 |
+----------+-----------+-----------+
dataframe 2
+----------+----------+------------+-------------+
| V1 | cluster | position.x | position.y |
+----------+----------+------------+-------------+
| <chr> | <dbl> | <dbl> | <dbl> |
| 1 conf10 | 6 | 0.000659 | 0.00932 |
| 2 conf10 | 8 | 0.0291 | 0.00922 |
| 3 conf10 | 10 | 0.0101 | 0.00380 |
| 4 conf10 | 12 | -0.0103 | 0.00379 |
| 5 conf10 | 14 | 0.0165 | 0.000900 |
| 6 conf10 | 16 | -0.000554 | 0.0112 |
+----------+----------+------------+-------------+
and dataframe 3
+----------+----------+--------------------+------------+
| V1 | cluster | position.x | position.y |
+----------+----------+--------------------+------------+
| <chr> | <dbl> | <dbl> | <dbl> |
| 1 conf9 | 7 | -0.0104 | 0.000920 |
| 2 conf9 | 9 | -0.00426 | 0.0139 |
| 3 conf9 | 11 | 0.0249 | 0.0164 |
| 4 conf9 | 13 | -0.0146 | 0.00242 |
| 5 conf9 | 15 | -0.0176 | 0.00220 |
| 6 conf9 | 17 | -0.0183 | 0.00620 |
+----------+----------+--------------------+------------+
I want to do a row-wise operation driven by the values in data1. For each row in data1, I want to check whether the values in columns Particlei and Particlej are present in the cluster column of data2 and data3 and, if they are, perform some operations on the matching rows of data2 and data3.
For example, row 1 of data1 contains 6 and 1829, so I want to select the rows of data2 and data3 whose cluster value is 6 or 1829. For those selected rows I then want to subtract position.x of data3 from position.x of data2, and likewise subtract position.y of data3 from position.y of data2. All of this should be repeated for every row of data1.
What I have tried so far:
for(i in row_number(data3)){
  y <- data1 %>% filter(any(data3[,1:2]==data2$cluster))
  if(any(data2$cluster==data3[,1:2])){
    while(any(data2$cluster==data3[,1])){
      delta_x = data2$position.x-data1$position.x
      delta_y = data2$position.y-data1$position.y
    }
  }
}
expected output
+---------------+------------+-------------------+-------------------+------------------+------------------+-----------+-------------------------------------------------+-----------+-----------+
| V1 | cluster | position.x_data3 | position.y_data3 | position.x_data2 | position.y_data2 | delta.x | delta.y | particlei | particlej |
+---------------+------------+-------------------+-------------------+------------------+------------------+-----------+-------------------------------------------------+-----------+-----------+
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | | |
| 1 conf9,10 | 6 | -0.0104 | 0.000920 | 0.000659 | 0.00932 | -0.011059 | -0.0084 | 6 | 1829 |
| 2 conf9,10 | 1829 | -0.00426 | 0.0139 | 0.000659 | 0.000659 | 0.000659 | 0.000575 | 6 | 1829 |
| 3 conf9,10 | 7 | 0.0249 | 0.0164 | ... | ... | ... | some values subtracted between position columns | 7 | 13928 |
| 4 conf9,10 | 13928 | -0.0146 | 0.00242 | some values | some values | ... | ... | 7 | 13928 |
+---------------+------------+-------------------+-------------------+------------------+------------------+-----------+-------------------------------------------------+-----------+-----------+

Replacing multiple observations in multiple columns

I have two dataframes: one with the original information and a second one with corrections to some of those observations. I would like to write a function, or find another way, to replace the values in multiple columns of the first dataframe with the new information I received. I have an ID that identifies the observations to be replaced, but because so many columns change for certain IDs, I don't know what the appropriate way of changing them is.
My first dataframe has 500 columns and 1000 observations; the second has 100 columns and 800 observations that will change the original dataframe. I don't know how to replace those values efficiently by ID.
Here is an example of what the 2 dataframes look like. Only some values need replacing, across multiple columns, and a merge is not the most efficient option because at least 100 columns need changes in some of the observations.
I just need to replace with the new info where it exists and keep the old values everywhere else.
Dataframe 1
|ID | X1 | X2 | X3 | X4 | XN |
|a1 | 1 | 1 | 1 | 1 | 1 |
|a2 | 2 | 2 | 2 | 2 | 2 |
|a3 | 3 | 3 | 3 | 3 | 3 |
|a4 | 4 | 4 | 4 | 4 | 4 |
|a5 | 5 | 5 | 5 | 5 | 5 |
|an | 6 | 6 | 6 | 6 | 6 |
dataframe 2
|ID | X1 | X2 | X4 |
|a1 | 8 | | 4 |
|a3 | | | 2 |
|a4 | 2 | 9 | |
|an | 1 | | 3 |
The outcome should keep the old values of dataframe 1, with only the replacements from dataframe 2 applied (blank cells in dataframe 2 mean "keep the old value").
outcome
|ID | X1 | X2 | X3 | X4 | XN |
|a1 | 8 | 1 | 1 | 4 | 1 |
|a2 | 2 | 2 | 2 | 2 | 2 |
|a3 | 3 | 3 | 3 | 2 | 3 |
|a4 | 2 | 9 | 4 | 4 | 4 |
|a5 | 5 | 5 | 5 | 5 | 5 |
|an | 1 | 6 | 6 | 3 | 6 |
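The replacement rule described above (overwrite a cell only when the corrections table has a non-blank value for the same ID and column) can be sketched as follows. This is a language-neutral illustration in Python using the small example tables, not an R solution; the dictionary layout and the `patch` helper are assumptions made for the sketch, with `None` standing in for a blank cell.

```python
# Original table from the example: ID -> {column: value}.
columns = ["X1", "X2", "X3", "X4", "XN"]
ids = ["a1", "a2", "a3", "a4", "a5", "an"]
original = {rid: {c: v for c in columns} for rid, v in zip(ids, range(1, 7))}

# Corrections table; None marks a blank cell (no correction supplied).
corrections = {
    "a1": {"X1": 8, "X2": None, "X4": 4},
    "a3": {"X1": None, "X2": None, "X4": 2},
    "a4": {"X1": 2, "X2": 9, "X4": None},
    "an": {"X1": 1, "X2": None, "X4": 3},
}


def patch(original, corrections):
    """Return a copy of `original` with non-blank corrections applied by ID."""
    patched = {rid: dict(row) for rid, row in original.items()}
    for rid, fixes in corrections.items():
        if rid not in patched:
            continue  # correction for an unknown ID: ignored in this sketch
        for column, value in fixes.items():
            if value is not None:
                patched[rid][column] = value
    return patched


patched = patch(original, corrections)
for rid in ids:
    print(rid, [patched[rid][c] for c in columns])
```

Only the cells named in the corrections are touched, so the cost grows with the size of the corrections table rather than with the 500 columns of the original.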

Subtract column values using coalesce

Within each "race", "bib" group, ordered by "split", I want to subtract each record's "place" from the "place" of the previous split, so that a "diff" column appears like so.
Desired Output:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 0
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 0
10 | 17 | 2 | 12 | -4
10 | 17 | 3 | 15 | -3
I'm new to using the coalesce function, and the closest I have come to the desired output is the following:
select a.race,a.bib,a.split, a.place,
coalesce(a.place -
(select b.place from ranking b where b.split < a.split), a.place) as diff
from ranking a
group by race,bib, split
which produces:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 5
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 8
10 | 17 | 2 | 12 | 11
10 | 17 | 3 | 15 | 14
Thanks for looking!
To compute the difference, you have to look up the value in the row that has the same race and bib values, and the next-smaller split value:
SELECT race, bib, split, place,
       coalesce((SELECT r2.place
                 FROM ranking AS r2
                 WHERE r2.race = ranking.race
                   AND r2.bib = ranking.bib
                   AND r2.split < ranking.split
                 ORDER BY r2.split DESC
                 LIMIT 1
                ) - place,
                0) AS diff
FROM ranking;
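To try the correlated-subquery approach against the sample rows from the question, here is a minimal sketch that loads them into an in-memory SQLite database through Python's sqlite3 module. The table layout (four integer columns) is an assumption based on the question's output. As an alternative, it also runs a lag() window-function version, guarded by a version check because window functions need SQLite 3.25 or newer; the final ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranking (race INTEGER, bib INTEGER, split INTEGER, place INTEGER)"
)
conn.executemany(
    "INSERT INTO ranking VALUES (?, ?, ?, ?)",
    [(10, 514, 1, 5), (10, 514, 2, 3), (10, 514, 3, 2),
     (10, 17, 1, 8), (10, 17, 2, 12), (10, 17, 3, 15)],
)

# Previous split's place minus this row's place; 0 when there is no previous split.
subquery_sql = """
SELECT race, bib, split, place,
       coalesce((SELECT r2.place
                 FROM ranking AS r2
                 WHERE r2.race = ranking.race
                   AND r2.bib = ranking.bib
                   AND r2.split < ranking.split
                 ORDER BY r2.split DESC
                 LIMIT 1) - place,
                0) AS diff
FROM ranking
ORDER BY race, bib DESC, split
"""
subquery_rows = conn.execute(subquery_sql).fetchall()

# Same result with a window function (requires SQLite >= 3.25).
lag_sql = """
SELECT race, bib, split, place,
       coalesce(lag(place) OVER (PARTITION BY race, bib ORDER BY split) - place,
                0) AS diff
FROM ranking
ORDER BY race, bib DESC, split
"""
lag_rows = None
if sqlite3.sqlite_version_info >= (3, 25, 0):
    lag_rows = conn.execute(lag_sql).fetchall()

for row in subquery_rows:
    print(row)
```

The diff column comes out as 0, 2, 1 for bib 514 and 0, -4, -3 for bib 17, which is the desired output from the question.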

SQLite query select best option depending on a max value

I have a probably pretty hard question/situation:
I have a database for dividing several tasks among some workers.
In the next example I have two tasks (Task 1 and Task 2) and 4 employees (1, 2, 3 and 4).
At most three employees can work on one task, so I have 3 employee columns to list all possible options (in this example, not every option is shown!). The last column is a value which indicates how good the option is (the higher the number, the better).
The goal is to find the optimal situation, which means:
Every employee has to do exactly one task (and cannot do 2 tasks)
The sum of the values is the highest possible value
+------------+------------+------------+------+--------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+------------+------------+------+--------+
| 1 | | | 1 | 5.0 |
| 2 | | | 1 | -2.5 |
| 3 | | | 1 | 1.0 |
| 4 | | | 1 | 0.5 |
| 1 | 2 | | 1 | 0.5 |
| 1 | 4 | | 1 | 5.0 |
| 1 | 2 | 3 | 1 | 0.33 |
| 2 | 3 | | 1 | -4.5 |
| 2 | 3 | 4 | 1 | -6.5 |
| 3 | 4 | | 1 | 3.0 |
| 1 | | | 2 | 1.0 |
| 2 | | | 2 | 2.0 |
| 3 | | | 2 | -5.0 |
| 4 | | | 2 | 3.0 |
| 1 | 2 | | 2 | -2.0 |
| 1 | 2 | 3 | 2 | -3.5 |
| 2 | 3 | | 2 | 5.0 |
| 2 | 3 | 4 | 2 | 0.5 |
| 3 | 4 | | 2 | 2.0 |
+------------+------------+------------+------+--------+
As you can see, it is sometimes better for productivity to let employees work separately:
Employee 1 gets a value of 5 on task 1
Employee 4 gets a value of 0.5 on task 1
Employees 1 and 4 together get a value of 5.0 on task 1
In this situation it is better that employees 1 and 4 work separately, and the query should give both lines:
+------------+-------------+------------+-------+---------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+-------------+------------+-------+---------+
| 1 | | | 1 | 5.0 |
| 4 | | | 1 | 0.5 |
+------------+-------------+------------+-------+---------+
The real solution for this example should be:
+------------+-------------+------------+-------+---------+
| Employee_1 | Employee_2 | Employee_3 | Task | Value |
+------------+-------------+------------+-------+---------+
| 1 | | | 1 | 5.0 |
| 2 | 3 | | 2 | 5.0 |
| 4 | | | 2 | 3.0 |
+------------+-------------+------------+-------+---------+
Employee 1 has a very high value on its own on task 1.
Employee 3 is really bad on his own, but together with employee 2 they do great on task 2.
Employee 4 is the only one left, and this employee is pretty good at task 2.
The problem is how to write the query that gets this result.
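To make the selection rule concrete (every employee covered exactly once, maximum total value), here is a minimal sketch of an exhaustive search over the options in the table. Note that this is a plain Python search, not an SQLite query, so it illustrates the rule rather than answering the query question; the `options` list and the `best_assignment` helper are names made up for the sketch, and the "5,0" entry is assumed to mean 5.0.

```python
# Options copied from the table: (employees, task, value).
options = [
    ((1,), 1, 5.0), ((2,), 1, -2.5), ((3,), 1, 1.0), ((4,), 1, 0.5),
    ((1, 2), 1, 0.5), ((1, 4), 1, 5.0), ((1, 2, 3), 1, 0.33),
    ((2, 3), 1, -4.5), ((2, 3, 4), 1, -6.5), ((3, 4), 1, 3.0),
    ((1,), 2, 1.0), ((2,), 2, 2.0), ((3,), 2, -5.0), ((4,), 2, 3.0),
    ((1, 2), 2, -2.0), ((1, 2, 3), 2, -3.5), ((2, 3), 2, 5.0),
    ((2, 3, 4), 2, 0.5), ((3, 4), 2, 2.0),
]


def best_assignment(unassigned):
    """Return (total, chosen options) covering each employee exactly once,
    or None if the listed options cannot cover `unassigned`."""
    if not unassigned:
        return 0.0, []
    first = min(unassigned)
    best = None
    for option in options:
        members = set(option[0])
        # Branch only on options containing `first`, so that every
        # partition of the employees is visited exactly once.
        if first in members and members <= unassigned:
            rest = best_assignment(unassigned - members)
            if rest is None:
                continue
            total = option[2] + rest[0]
            if best is None or total > best[0]:
                best = (total, [option] + rest[1])
    return best


total, chosen = best_assignment(frozenset({1, 2, 3, 4}))
print(total)
for employees, task, value in chosen:
    print(employees, task, value)
```

On the listed options this picks employee 1 alone on task 1, employees 2 and 3 together on task 2, and employee 4 alone on task 2, for a total of 13.0. The search is exponential in the number of employees, so it is only a reference for small inputs.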

3-way tabulation in R

I have a dataset that looks like
| ID | Category | Failure |
|----+----------+---------|
| 1 | a | 0 |
| 1 | b | 0 |
| 1 | b | 0 |
| 1 | a | 0 |
| 1 | c | 0 |
| 1 | d | 0 |
| 1 | c | 0 |
| 1 | failure | 1 |
| 2 | c | 0 |
| 2 | d | 0 |
| 2 | d | 0 |
| 2 | b | 0 |
This is data where each ID potentially ends in a failure event, through an intermediate sequence of events {a, b, c, d}. I want to count the number of IDs in which each of those intermediate events occurs, split by whether or not the ID ended in a failure event.
So, I would like a table of the form
| | a | b | c | d |
|------------+---+---+---+---|
| Failure | 4 | 5 | 6 | 2 |
| No failure | 9 | 8 | 6 | 9 |
where, for example, the number 4 indicates that 4 of the IDs in which a occurred ended in failure.
How would I go about doing this in R?
You can use table for example:
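# Simulated data: 20 events with a random category and a random failure flag
dat <- data.frame(categ = sample(letters[1:4], 20, replace = TRUE),
                  failure = sample(c(0, 1), 20, replace = TRUE))
res <- table(dat$failure, dat$categ)
# table() sorts its levels, so row 1 is failure == 0 and row 2 is failure == 1
rownames(res) <- c('No failure', 'Failure')
res
which prints something like this (the counts vary from run to run because the data are sampled):
             a b c d
  No failure 3 2 2 1
  Failure    1 2 4 5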
you can plot it using barplot:
barplot(res)
EDIT to get this by ID, you can use by for example:
dat <- data.frame(ID = c(rep(1, 9), rep(2, 11)),
                  categ = sample(letters[1:4], 20, replace = TRUE),
                  failure = sample(c(0, 1), 20, replace = TRUE))
by(dat, dat$ID, function(x) table(x$failure, x$categ))
dat$ID: 1
    a b c d
  0 1 2 1 3
  1 1 1 0 0
---------------------------------------------------------------------------------------
dat$ID: 2
    a b c d
  0 1 2 3 0
  1 1 3 1 0
EDIT using tapply
Another way to get this is using tapply
with(dat,tapply(categ,list(failure,categ,ID),length))
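The table() and tapply() calls above count rows, whereas the question asks for the number of distinct IDs in which each intermediate event occurs, split by whether that ID ended in failure. A minimal sketch of that counting logic on the question's sample rows follows; it is written in Python purely to spell out the steps and is not an R solution, and it assumes an ID counts as failed when any of its rows has Failure equal to 1.

```python
from collections import defaultdict

# Rows copied from the question: (ID, Category, Failure).
events = [
    (1, "a", 0), (1, "b", 0), (1, "b", 0), (1, "a", 0), (1, "c", 0),
    (1, "d", 0), (1, "c", 0), (1, "failure", 1),
    (2, "c", 0), (2, "d", 0), (2, "d", 0), (2, "b", 0),
]

seen = defaultdict(set)     # ID -> distinct intermediate categories
failed = defaultdict(bool)  # ID -> True if any row has Failure == 1
for id_, category, failure in events:
    failed[id_] = failed[id_] or failure == 1
    if category != "failure":
        seen[id_].add(category)

categories = sorted({c for cats in seen.values() for c in cats})
counts = {
    "Failure": dict.fromkeys(categories, 0),
    "No failure": dict.fromkeys(categories, 0),
}
# Each ID contributes at most 1 to each category column.
for id_, cats in seen.items():
    label = "Failure" if failed[id_] else "No failure"
    for c in cats:
        counts[label][c] += 1

for label, row in counts.items():
    print(label, row)
```

With only the two sample IDs, ID 1 (failed) contributes one to every column of the Failure row, and ID 2 contributes one to b, c and d of the No failure row.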

Resources