I want to take a set of observations & find out how much overlap different columns have based on the indicators. I have the following data:
uniquevalue | X | Y | Z |
Obs 1 | 1 | 0 | 1 |
Obs 2 | 1 | 1 | 0 |
Obs 3 | 1 | 0 | 1 |
Obs 4 | 0 | 1 | 0 |
Obs 5 | 0 | 0 | 1 |
Obs 6 | 0 | 1 | 0 |
Obs 7 | 0 | 0 | 1 |
I want to create the following data overlap matrix:
Label | X | Y | Z |
X | 100% | 33% | 50% |
Y | 33% | 100% | 0% |
Z | 66% | 0% | 100% |
So, for example, Z has a total of 4 observations. 2 of its 4 observations are also present in X, so its overlap is 50%. However, because different columns have different numbers of observations, the reverse is not necessarily true. As you can see, 2 of the 3 observations in X are shared with Z, so that is a 66% overlap.
You can use crossprod:
mat <- crossprod(as.matrix(df[2:4])) # calculate the overlap
floor(t(mat * 100 / diag(mat))) # calculate the percentage
# X Y Z
#X 100 33 50
#Y 33 100 0
#Z 66 0 100
I started out using Firth's logistic (logistf) to deal with my small sample size (n=80), but wanted to try out exact logistic regression using the elrm package. However, I'm having trouble figuring out how to create the "collapsed" data required for elrm to run. I have a csv that I import into R as a dataframe that has the following variables/columns. Here is some example data (real data has a few more columns and 80 rows):
+------------+-----------+-----+--------+----------------+
| patien_num | asymmetry | age | female | field_strength |
+------------+-----------+-----+--------+----------------+
| 1 | 1 | 25 | 1 | 1.5 |
| 2 | 0 | 50 | 0 | 3 |
| 3 | 0 | 75 | 1 | 1.5 |
| 4 | 0 | 33 | 1 | 3 |
| 5 | 0 | 66 | 1 | 3 |
| 6 | 0 | 99 | 0 | 3 |
| 7 | 1 | 20 | 0 | 1.5 |
| 8 | 1 | 40 | 1 | 3 |
| 9 | 0 | 60 | 1 | 3 |
| 10 | 0 | 80 | 0 | 1.5 |
+------------+-----------+-----+--------+----------------+
Basically my data is one line per patient (not a frequency table). I'm trying to run a regression with asymmetry as the dependent variable and age (continuous), female (binary), and field_strength (factor) as independent variables. I'm trying to understand how to collapse this into the appropriate format so I can get that "ntrials" part required for the elrm formula.
I've looked at https://stats.idre.ucla.edu/r/dae/exact-logistic-regression/ but they start with data in a different format than mine, and I'm having trouble adapting it. Any help appreciated!
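One possible way to build the collapsed data, offered as a sketch rather than a tested elrm workflow: give every patient row a trial count of 1, then sum successes and trials within each unique covariate pattern using base R's `aggregate()`. The data frame below is typed in from the example table, and `ntrials` and `collapsed` are names chosen here for illustration.

```r
# Example data copied from the question (one row per patient)
df <- data.frame(
  patien_num     = 1:10,
  asymmetry      = c(1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
  age            = c(25, 50, 75, 33, 66, 99, 20, 40, 60, 80),
  female         = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0),
  field_strength = c(1.5, 3, 1.5, 3, 3, 3, 1.5, 3, 3, 1.5)
)

# Each patient contributes exactly one trial
df$ntrials <- 1

# Sum events and trials within each unique combination of covariates
collapsed <- aggregate(
  cbind(asymmetry, ntrials) ~ age + female + field_strength,
  data = df,
  FUN  = sum
)

collapsed
```

In `collapsed`, `asymmetry` is the number of events and `ntrials` the number of patients for each covariate pattern. With a continuous covariate such as age, most patterns are unique, so the collapsed table may have nearly as many rows as the original data.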
In R, I've created a 3-dimensional table from a dataset. The three variables are all factors and are labelled H, O, and S. This is the code I used to simply create the table:
attach(df)
test <- table(H, O, S)
Outputting the flattened table produces this table below. The two values of S were split up, so these are labelled S1 and S2:
ftable(test)
+-----------+-----------+-----+-----+
| H | O | S1 | S2 |
+-----------+-----------+-----+-----+
| Isolation | Dead | 2 | 15 |
| | Sick | 64 | 20 |
| | Recovered | 153 | 379 |
| ICU | Dead | 0 | 15 |
| | Sick | 0 | 2 |
| | Recovered | 1 | 9 |
| Other | Dead | 7 | 133 |
| | Sick | 4 | 20 |
| | Recovered | 17 | 261 |
+-----------+-----------+-----+-----+
The goal is to use this table object, subset it, and produce a second table. Essentially, I want only "Isolation" and "ICU" from H, "Sick" and "Recovered" from O, and only S1, so it basically becomes the 2-dimensional table below:
+-----------+------+-----------+
| | Sick | Recovered |
+-----------+------+-----------+
| Isolation | 64 | 153 |
| ICU | 0 | 1 |
+-----------+------+-----------+
S = S1
I know I could first subset the dataframe and then create the new table, but the goal is to subset the table object itself. I'm not sure how to retrieve certain values from each dimension and produce the reduced table.
Edit: ANSWER
I now found a much simpler method. All I needed to do was reference the specific columns in their respective directions. So a much simpler solution is below:
> test[1:2,2:3,1]
O
H Sick Recovered
Isolation 64 153
ICU 0 1
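The same kind of subsetting can also be done by level name instead of position, which keeps working if the order of the levels changes. The sketch below uses the built-in `mtcars` data as a stand-in, since the original `df` is not available; for the table in the question the analogous call would presumably be `test[c("Isolation", "ICU"), c("Sick", "Recovered"), "S1"]`, assuming those are the actual level names.

```r
# Three-way table built from a built-in dataset (stand-in for table(H, O, S))
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear, vs = mtcars$vs)

# Pick levels by name in each dimension; fixing a single level of the
# third dimension drops it, leaving a 2-dimensional table
sub <- tab[c("4", "6"), c("4", "5"), "0"]

sub
```

Adding `drop = FALSE` inside the brackets keeps the third dimension if a 3-dimensional result is needed.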
Subset the data before running table, example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 3 0 1
# 4 0 8
# 5 1 1
# 6 3 0 2
# 4 2 2
# 5 1 0
# 8 3 12 0
# 4 0 0
# 5 2 0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 4 0 8
# 6 4 2 2
I have a simple database table with three columns: id, x, y. x and y are just the coordinates of points on a line. I want to use SQLite window functions to partition the table using a sliding window of three rows, and then get the y value that is the furthest from the y value of the first coordinate (row) in the window.
An example:
| id | x | y |
|----|---|---|
| 1 | 1 | .5|
| 2 | 2 | .9|
| 3 | 3 | .7|
| 4 | 4 |1.1|
| 5 | 5 | 1 |
So the first partition would consist of:
| id | x | y |
|----|---|---|
| 1 | 1 | .5|
| 2 | 2 | .9|
| 3 | 3 | .7|
And the desired result would be:
| id | x | y | d |
|----|---|---|---|
| 1 | 1 | .5| .4|
| 2 | 2 | .9|   |
| 3 | 3 | .7|   |
This is because the window with id = 1 as the CURRENT ROW has a maximum variation of .4: the largest distance from the y value of the first row in the window, .5, is to .9, which is .4.
The final expected result:
| id | x | y | d |
|----|---|---|---|
| 1 | 1 | .5| .4|
| 2 | 2 | .9| .2|
| 3 | 3 | .7| .4|
| 4 | 4 |1.1| .1|
| 5 | 5 | 1 | |
I've tried using a window definition like: WINDOW win1 AS (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING), which gives me the correct window.
With the window defined, I tried doing something like:
SELECT
max(abs(y - first_value(y) OVER win1)) AS d
FROM t
WINDOW win1 AS (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
But I get an error for misuse of first_value.
I think the problem I have is this is not the proper approach to calculate over each row of a partition, but I could not find another solution or approach that matches what I am trying to do here.
For each row of your table you define a window starting from the current row up to the next 2 rows.
In your code y is the value in the current row and first_value() is the 1st value of y of the current window which is also the value of y of the current row.
So even if your code was syntactically correct the difference you calculate would always return 0.
It's easier to solve your problem with LEAD() window function:
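WITH cte AS (
  SELECT *,
         -- ORDER BY id makes "next row" well defined
         LEAD(y, 1) OVER (ORDER BY id) AS y1,
         LEAD(y, 2) OVER (ORDER BY id) AS y2
  FROM tablename
)
SELECT
  id, x, y,
  -- scalar MAX() returns NULL when y1 is NULL, i.e. for the last row
  MAX(ABS(y - y1), COALESCE(ABS(y - y2), 0)) AS d
FROM cte
ORDER BY id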
Results:
id x y d
1 1 0.5 0.4
2 2 0.9 0.2
3 3 0.7 0.4
4 4 1.1 0.1
5 5 1.0
I've got data looking like this:
A | B | C
--------------
f | 1 | 1420h
f | 1 | 1540h
f | 3 | 600h
g | 2 | 900h
g | 2 | 930h
h | 1 | 700h
h | 3 | 400h
Now I want to create a new column which counts other rows in the data frame that meet certain conditions.
In this case I would like to know, for each row, how often the same combination of A and B occurred in other rows within a range of 100 around C.
So the result with this data would be:
A | B | C | D
------------------
f | 1 | 1420 | 0
f | 1 | 1540 | 0
f | 3 | 600 | 0
g | 2 | 900 | 1
g | 2 | 930 | 1
h | 1 | 700 | 0
h | 3 | 400 | 0
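I actually came to a solution using nested for loops, but the time R needs to compute the results is far too long.
# For each row i, count the OTHER rows p with the same A and B
# whose C lies within 100 of C in row i (C is assumed to be numeric)
for (i in seq_len(nrow(df))) {
  n <- 0
  for (p in seq_len(nrow(df))) {
    if (p != i && df$A[p] == df$A[i] && df$B[p] == df$B[i] &&
        abs(df$C[p] - df$C[i]) < 100) n <- n + 1
  }
  df$D[i] <- n
}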
Is there a better way?
Thanks a lot!
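One possible faster approach, sketched under the assumption that C is stored as a plain number (without the trailing "h"): let `ave()` split C by the A/B combination and count the neighbours within each group, so rows are only compared with rows of the same group. The helper name `count_near` and its `width` argument are made up for this example.

```r
# Example data from the question, with C as numeric
df <- data.frame(
  A = c("f", "f", "f", "g", "g", "h", "h"),
  B = c(1, 1, 3, 2, 2, 1, 3),
  C = c(1420, 1540, 600, 900, 930, 700, 400)
)

# For each value in x, count how many values of x lie strictly within
# `width` of it; subtract 1 so that a value does not count itself
count_near <- function(x, width = 100) {
  vapply(x, function(xi) sum(abs(x - xi) < width) - 1, numeric(1))
}

# ave() applies count_near separately to each A/B group and returns
# the results in the original row order
df$D <- ave(df$C, df$A, df$B, FUN = count_near)

df
```

The comparison is still pairwise inside each group, so the gain depends on the groups being much smaller than the whole data frame.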
I have a data frame relative to accesses to a website. Several accesses per day, with different possible actions and descriptions of the actions
People | Date | Time | Action | Descr |
| | | | |
j | 01/01/2010 | 10:13 | X | A |
j | 01/01/2010 | 10:15 | Y | B |
j | 02/01/2010 | 14:15 | Z | C |
j | 03/01/2010 | 11:45 | X | D |
j | 03/01/2010 | 13:56 | X | E |
j | 03/01/2010 | 18:43 | Z | F |
j | 03/01/2010 | 18:44 | X | A |
After reducing the data frame to a balanced daily panel data, I need to create variables such that:
-the value of the first variable (FirstX) must be equal to the description (Descr) of the first Action = X of the day (if available) and zero otherwise
-the value of the second variable must be equal to the description of the second Action = X of the day and zero otherwise
-and so on for the third, fourth, etc.
Once I transformed it into a balanced daily panel (which I can do) I need to have a final result which looks like this:
People | Date |Accesses| First X|Second X| Third X| Fourth X |
| | | | | | |
j | 01/01/2010 | 2 | A | 0 | 0 | 0 |
j | 02/01/2010 | 1 | 0 | 0 | 0 | 0 |
j | 03/01/2010 | 4 | D | E | A | 0 |
You can do it using the dplyr package:
library(dplyr)
df %>%
group_by(People,Date) %>%
summarise(Accesses = n(),
FirstX = ifelse(sum(Action=="X")>=1,Descr[Action=="X"][1],"0"),
SecondX = ifelse(sum(Action=="X")>=2,Descr[Action=="X"][2],"0"),
ThirdX = ifelse(sum(Action=="X")>=3,Descr[Action=="X"][3],"0"),
FourthX = ifelse(sum(Action=="X")>=4,Descr[Action=="X"][4],"0"))
This returns:
People Date Accesses FirstX SecondX ThirdX FourthX
<chr> <chr> <int> <chr> <chr> <chr> <chr>
1 j 01/01/2010 2 A 0 0 0
2 j 02/01/2010 1 0 0 0 0
3 j 03/01/2010 4 D E A 0
Note that you cannot have numeric 0s and characters in the same vector, so I put character 0s in the FirstX, SecondX, .. columns.
I found a solution myself. I post it here in case this is useful to somebody.
# create a temp variable to be used for the count (just a vector of all the
# numbers from 1 to N, where N is the number of rows)
subset$temp_var1 <- seq_len(nrow(subset))
# generate a variable which starts counting from one and starts again
# every time "date" or "people" changes
subset$count <- ave(subset$temp_var1, subset$date,
                    subset$people, FUN = seq_along)
# keep only the needed columns (this drops "Action")
subset <- subset[, c("people", "date", "descr", "count")]
# reshape to wide format: one "descr" column per value of "count"
subset_wide <- reshape(subset, idvar = c("people", "date"),
                       timevar = "count", direction = "wide")
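For reference, here is a self-contained base R sketch of the same `ave()` plus `reshape()` idea applied to the example data from the question. It differs from the snippet above in two assumed details: it first keeps only the rows with Action equal to "X", and it merges the result back onto the full list of days so that days without an X are retained. The object names (`x_only`, `wide`, `panel`) are illustrative.

```r
# Example data copied from the question (Time omitted, rows already ordered)
df <- data.frame(
  People = "j",
  Date   = c("01/01/2010", "01/01/2010", "02/01/2010", "03/01/2010",
             "03/01/2010", "03/01/2010", "03/01/2010"),
  Action = c("X", "Y", "Z", "X", "X", "Z", "X"),
  Descr  = c("A", "B", "C", "D", "E", "F", "A"),
  stringsAsFactors = FALSE
)

# Keep only the X actions and number them within each person and day
x_only <- df[df$Action == "X", c("People", "Date", "Descr")]
x_only$count <- ave(seq_len(nrow(x_only)), x_only$People, x_only$Date,
                    FUN = seq_along)

# One row per person and day, one Descr column per X occurrence
wide <- reshape(x_only, idvar = c("People", "Date"),
                timevar = "count", direction = "wide")

# Add back the days that had no X at all (their Descr columns become NA)
panel <- merge(unique(df[c("People", "Date")]), wide, all.x = TRUE)

panel
```

Missing entries come out as `NA` rather than 0; they can be replaced afterwards if a literal "0" is required.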