Add a group to one df based on values from another df - r

I have two df's:
df_1 <-
tribble(
~time, ~v1, ~v2,
-3, 213, 1,
-2, 124, 4,
-1, 532, 2,
0, 423, 5,
-3, 123, 3,
-2, 523, 2,
-1, 125, 5,
0, 515, 2,
-3, 321, 5
)
df_2 <-
tribble(
~trial, ~v4,
2, 12,
4, 23,
5, 34,
6, 53
)
The 'time' column of df_1 resets to -3 at various points. All rows from one reset up to the next belong to a single group, and that group's label is given in the 'trial' column of df_2: the rows of df_1 between two resets correspond to a single row of df_2. I want to take the value from df_2 and copy it into all corresponding df_1 rows. The number of resets in df_1 matches the number of rows in df_2.
My target df would look like this:
df_final <-
tribble(
~time, ~v1, ~v2, ~trial,
-3, 213, 1, 2,
-2, 124, 4, 2,
-1, 532, 2, 2,
0, 423, 5, 2,
-3, 123, 3, 4,
-2, 523, 2, 4,
-1, 125, 5, 4,
0, 515, 2, 4,
-3, 321, 5, 5
)
Note that 'trial' is not simply an enumeration: it jumps from 2 to 4 in this example. This would be easy with a left/right join, but there is no common key in this case. I have a general idea of how to do this with a for loop and if statements, but as my df's are huge this wouldn't be optimal. Any ideas for a more typical R solution, preferably but not necessarily using dplyr? I tried something with the 'lag' and 'which' functions but got no results worth showing.
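One possible approach, sketched here in base R: number the blocks with a running count of resets and use that count to index df_2$trial. This assumes a new group starts wherever time drops below the previous row's value (as it does at each -3 in the example) and that the groups appear in the same order as the rows of df_2. The data frames are rebuilt with data.frame() so the block runs without extra packages; is_reset and group_index are illustrative names, not from the question.

```r
df_1 <- data.frame(
  time = c(-3, -2, -1, 0, -3, -2, -1, 0, -3),
  v1   = c(213, 124, 532, 423, 123, 523, 125, 515, 321),
  v2   = c(1, 4, 2, 5, 3, 2, 5, 2, 5)
)
df_2 <- data.frame(trial = c(2, 4, 5, 6), v4 = c(12, 23, 34, 53))

# Assumption: a reset is any row where time is lower than in the previous row;
# the first row always starts a group.
is_reset <- c(TRUE, diff(df_1$time) < 0)

# Running count of resets: 1 for the first block, 2 for the second, and so on.
group_index <- cumsum(is_reset)

# Look up the trial label of the matching df_2 row for every df_1 row.
df_final <- df_1
df_final$trial <- df_2$trial[group_index]
print(df_final)
```

Because this is a vectorised lookup rather than a loop, it should scale to large data frames. Inside a dplyr pipeline the same idea could be written as mutate(trial = df_2$trial[cumsum(time == -3)]), under the same assumptions.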

Related

How to make the expected value of the difference in the values in paired data using ggplot2

I have paired data as shown below, and I want to estimate the expected value of the difference in the value (the column called value) within pairs. In every pair, one sibling has the disease and the other does not, as you can see from the data. In other words, I want the expected value of the difference between one sibling's value and that of his/her sibling.
The variables in the data are:
id = individual ID
familyID = family ID showing their dependency
status = 1 means disease and status = 0 means no-disease
Any guidance is appreciated.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
I'm not certain if this is what you are looking for, but I used pivot_wider from tidyr to spread the values into two columns, those with status 0 and those with status 1. Then I used mutate to take the difference between the two columns, and plotted the newly created difference against familyID with ggplot. Note that I removed the id column for pivot_wider to work.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
library(dplyr)
library(tidyr)
library(ggplot2)
d %>%
  select(-id) %>%
  pivot_wider(values_from = value, names_from = status) %>%
  mutate("Diff" = (`0` - `1`)) %>%
  ggplot() +
  aes(as.character(familyID), Diff) +
  geom_point()
You can group by familyID, then use summarize() from the dplyr package to find the differences.
Also note the conversion of id, familyID, and status to factors, which may make life easier so they aren't confused with being integers.
library(dplyr)
library(forcats)
library(ggplot2)
d <- structure(list(id = as.factor(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
familyID = as.factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10)),
status = as.factor(c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1)),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
diffs <- group_by(d, familyID) %>%
  summarize(., diff = (value[status == 0] - value[status == 1]))
Reordering the families by difference can help get a sense of the distribution of differences:
diffs$familyID <- fct_reorder(diffs$familyID, diffs$diff, .desc = TRUE)
ggplot(diffs, aes(x = familyID, y = diff)) +
  geom_bar(stat = "identity")
If you really have a lot of families you may want to display a summary of the differences.
One option is with a histogram (modifying binwidth can control how fine the bins are):
ggplot(diffs, aes(x = diff)) +
geom_histogram(binwidth = 3)
Similar to a histogram is a density plot:
ggplot(diffs, aes(x = diff)) +
geom_density()
Finally, a boxplot is also a familiar summary. Boxplots are mostly meant for comparing multiple groups, but they work okay with just one. I've added the individual points using the geom_jitter() function.
ggplot(diffs, aes(y = diff)) + # If using multiple groups, add x = group inside the aes() function.
  geom_boxplot() +
  geom_jitter(aes(x = 0))
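The plots show the individual differences but not a single number for the expected difference. Here is a minimal sketch in base R, assuming the expected value is estimated by the sample mean of the within-family differences (status 0 minus status 1, the same direction as in the answers). The id column is left out because it is not needed here; paired and mean_diff are illustrative names.

```r
d <- data.frame(
  familyID = rep(1:10, each = 2),
  status   = c(0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
  value    = c(29, 26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21,
               29, 24, 26, 29, 15.2, 11, 35, 15.4, 16, 13.4)
)

# Put each family's two siblings side by side: one row per family.
paired <- merge(d[d$status == 0, ], d[d$status == 1, ],
                by = "familyID", suffixes = c("_0", "_1"))

# Within-family difference: sibling without disease minus sibling with disease.
paired$diff <- paired$value_0 - paired$value_1

# Sample mean of the differences as an estimate of the expected difference.
mean_diff <- mean(paired$diff)

# A paired t-test reports the same mean together with a confidence interval.
tt <- t.test(paired$value_0, paired$value_1, paired = TRUE)
print(mean_diff)
print(tt$conf.int)
```

The mean could then be drawn on any of the plots, for example as a horizontal reference line with geom_hline(yintercept = mean_diff).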

Zelen Exact Test - Trying to use a k 2x2 in the function zelen.test()

I am trying to use the zelen.test function from the NSM3 package. I am having difficulty getting my data into the form the function expects.
You can recreate my data using
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)
The documentation on CRAN gives the usage as zelen.test(z, example = F, r = 3), where z is an array of k 2 x 2 matrices, example is left as FALSE because TRUE returns the p-value for an example I cannot access, and r is the number of decimal places the user wants in the returned p-value.
I've tried:
zelen.test(events, r = 4)
I thought it may want the study number and the trial data, so I tried this:
studies <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7)
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(cbind(studies, events), ncol = 3)
zelen.test(events, r = 4)
but it continues to return an error stating
"Error in z[1, 1, ] : incorrect number of dimensions" in both of the cases I tried above.
Any help would be greatly appreciated!
If we check the source code by typing zelen.test at the console, we see that when example = TRUE it constructs a 3D array
...
if (example)
z <- array(c(2, 1, 2, 5, 1, 5, 4, 1), dim = c(2, 2, 2))
...
The input z dim is also specified in the documentation of ?zelen.test
z - data as an array of k 2x2 matrices. Small data sets only!
So, we may need to construct a three-dimensional array
library(NSM3)
z1 <- array(c(4, 2, 3, 3, 8, 3, 4, 7), c(2, 2, 2))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 1
Or with 3rd dimension of length 3
z1 <- array( c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1), c(2, 2, 3))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 0.1238
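The arrays above are filled from the first few values of the data vector. If instead each consecutive pair of rows of the original 14 x 2 events matrix is meant to be one study's 2 x 2 table, the full 2 x 2 x 7 array could be built as sketched below in base R. This layout is an assumption: the question does not say what the rows and columns represent. The call to zelen.test is left out because the documentation warns that it is for small data sets only.

```r
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
          74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)

# Assumption: rows 2k-1 and 2k of `events` form the 2x2 table for study k.
k <- nrow(events) / 2
z <- array(NA_real_, dim = c(2, 2, k))
for (i in seq_len(k)) {
  z[, , i] <- events[(2 * i - 1):(2 * i), ]
}

dim(z)     # a three-dimensional array, so z[1, 1, ] is now a valid index
z[, , 1]   # the table for the first study
```

An array built this way has the three dimensions that the z[1, 1, ] indexing in the error message requires.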

Finding differences between populations

I have equivalent data from 2019 and 2020. The proportions of diagnoses in 2020 look like they differ from 2019, but I'd like to ...
a) statistically test the populations are different.
b) determine which categories are the most different.
I've worked out I can do 'a' using:
chisq.test(test$count.2020, test$count.2019)
I don't know how to find out which categories are the ones that are the most different between 2020 and 2019. Any help would be amazing, thanks!
diagnosis <- data.frame(mf_label = c("Audiovestibular", "Autonomic", "Cardiovascular",
"Cerebral palsy", "Cerebrovascular", "COVID", "Cranial nerves",
"CSF disorders", "Developmental", "Epilepsy and consciousness",
"Functional", "Head injury", "Headache", "Hearing loss", "Infection",
"Maxillofacial", "Movement disorders", "Muscle and NMJ", "Musculoskeletal",
"Myelopathy", "Neurodegenerative", "Neuroinflammatory", "Peripheral nerve",
"Plexopathy", "Psychiatric", "Radiculopathy", "Spinal", "Syncope",
"Toxic and nutritional", "Tumour", "Visual system"),
count.2019 = c(5, 0, 1, 1, 2, 0, 4, 3, 0, 7, 4, 0, 24, 0, 0, 2, 22, 3, 3, 0, 3, 18, 12, 0, 0, 2, 2, 0, 1, 4, 0),
count.2020 = c(5, 1, 1, 3, 28, 9, 11, 13, 1, 13, 30, 5, 68, 1, 1, 2, 57, 14, 5, 8, 16, 37, 27, 3, 13, 17, 3, 1, 8, 13, 11))
Your chi-square test is not set up correctly. You need to provide the counts as a table or matrix, not as two separate vectors. Because you have very small expected values for half of the cells, you need to use simulation to estimate the p-value:
results <- chisq.test(diagnosis[, 2:3], simulate.p.value=TRUE)
The overall table is barely significant at .05. The chisq.test function returns a list including the original data, the expected values, residuals, and standardized residuals. The manual page describes these (?chisq.test) and provides some citations for more details.
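To see which categories contribute most, one option is to rank them by the absolute standardized residual returned in the stdres component of the result. This is a sketch under that assumption, not the only way to define "most different". The counts matrix, the seed, and the name ranked are illustrative additions.

```r
diagnosis <- data.frame(
  mf_label = c("Audiovestibular", "Autonomic", "Cardiovascular",
    "Cerebral palsy", "Cerebrovascular", "COVID", "Cranial nerves",
    "CSF disorders", "Developmental", "Epilepsy and consciousness",
    "Functional", "Head injury", "Headache", "Hearing loss", "Infection",
    "Maxillofacial", "Movement disorders", "Muscle and NMJ", "Musculoskeletal",
    "Myelopathy", "Neurodegenerative", "Neuroinflammatory", "Peripheral nerve",
    "Plexopathy", "Psychiatric", "Radiculopathy", "Spinal", "Syncope",
    "Toxic and nutritional", "Tumour", "Visual system"),
  count.2019 = c(5, 0, 1, 1, 2, 0, 4, 3, 0, 7, 4, 0, 24, 0, 0, 2, 22, 3, 3,
                 0, 3, 18, 12, 0, 0, 2, 2, 0, 1, 4, 0),
  count.2020 = c(5, 1, 1, 3, 28, 9, 11, 13, 1, 13, 30, 5, 68, 1, 1, 2, 57, 14,
                 5, 8, 16, 37, 27, 3, 13, 17, 3, 1, 8, 13, 11)
)

# Counts as a labelled matrix so the residuals keep the diagnosis names.
counts <- as.matrix(diagnosis[, c("count.2019", "count.2020")])
rownames(counts) <- diagnosis$mf_label

set.seed(42)  # arbitrary seed, only to make the simulated p-value repeatable
results <- chisq.test(counts, simulate.p.value = TRUE)

# Standardized residuals for 2020; with two columns the 2019 column is the
# mirror image, so one column is enough for ranking.
ranked <- sort(abs(results$stdres[, "count.2020"]), decreasing = TRUE)
head(ranked, 5)
```

A positive residual in the count.2020 column means the category was more common in 2020 than the overall proportions would predict, so results$stdres can also be read with its sign to see the direction of each difference.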

Interpolating three columns

I have a set of data in ranges like:
x|y|z
-4|1|45
-4|2|68
-4|3|96
-2|1|56
-2|2|65
-2|3|89
0|1|45
0|2|56
0|3|75
2|1|23
2|2|56
2|3|75
4|1|42
4|2|65
4|3|78
Here I need to interpolate between x and y using the z values.
I tried interpolating separately for x and y against z using the code below:
interpol<-approx(x,z,method="linear")
interpol_1<-approx(y,z,method="linear")
Now I'm trying to use all three columns, but the values come out wrong.
In your script you forgot to reference your data.frame. Note the use of $ in the approx function.
interpol <- approx(df$x,df$z,method="linear")
interpol_1 <- approx(df$y,df$z,method="linear")
Data:
df <- data.frame(
x = c(-4, -4, -4, -2, -2, -2, 0, 0, 0, 2, 2, 2, 4, 4, 4),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
z = c(45, 68, 96, 56, 65, 89, 45, 56, 75, 23, 56, 75, 42, 65, 78)
)
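If the goal is to use all three columns at once, one reading of the question is that z should be interpolated over the (x, y) grid. Below is a sketch of bilinear interpolation built from two passes of approx() in base R, assuming the data form a complete regular grid as in the example. The names zmat and interp_xy are hypothetical helpers, not part of any package.

```r
df <- data.frame(
  x = c(-4, -4, -4, -2, -2, -2, 0, 0, 0, 2, 2, 2, 4, 4, 4),
  y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  z = c(45, 68, 96, 56, 65, 89, 45, 56, 75, 23, 56, 75, 42, 65, 78)
)

xs <- sort(unique(df$x))
ys <- sort(unique(df$y))

# z arranged as a grid: one row per x value, one column per y value.
zmat <- tapply(df$z, list(df$x, df$y), mean)

# Hypothetical helper: interpolate along y within each x row, then along x.
interp_xy <- function(x0, y0) {
  z_at_y0 <- apply(zmat, 1, function(z_row) approx(ys, z_row, xout = y0)$y)
  approx(xs, z_at_y0, xout = x0)$y
}

interp_xy(-4, 1)     # a grid point, returns the observed z
interp_xy(-3, 1.5)   # a point between grid nodes
```

Points outside the range of x or y return NA, which is the default behaviour of approx().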

Complicated filtering of data frame without loops

I have a big data frame with positions, timestamps, trip ids, etc.
I would like a simple way, avoiding double loops, to filter out and save only some of the rows.
For all the rows that have the same combination of trip_id and stop_id, I want to save the row where the speed was first equal to zero, either by taking the minimum timestamp where the speed is zero or simply by taking the first row where the speed is zero, since the frame is ordered by timestamp.
So in the example below, I would like to find the top three rows (in the real data frame there are many more) and save just the second row, where the speed was first zero.
Is there a way to do this without any loops?
trip_id.x stop_id latitude.x longitude.x bearing speed timestamp vehicle id
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 30 0.2 1541399400 9031005990005424
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 0 0 1541399401 9031005990005424
55700000048910944 9022005000050006 58.416879999999999 15.624510000000001 0 0 1541399402 9031005990005424
55700000048910300 9022005000050006 58.416879999999999 15.624510000000001 30 0.5 1541400000 9031005990005424
Edit:
Here is the dput() of a longer example with a simpler format of the data I have:
structure(list(trip_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3), stop_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1,
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
3, 3), speed = c(5, 0, 0, 5, 2, 0, 0, 2, 4, 0, 0, 4, 5, 0, 0,
5, 2, 0, 0, 2, 4, 0, 0, 4, 5, 0, 0, 5, 2, 0, 0, 2, 4, 0, 0, 4
), timestamp = c(1, 2, 3, 4, 101, 102, 103, 104, 201, 202, 203,
204, 301, 302, 303, 304, 401, 402, 403, 404, 501, 502, 503, 504,
601, 602, 603, 604, 701, 702, 703, 704, 801, 802, 803, 804)), row.names = c(NA,
-36L), class = c("tbl_df", "tbl", "data.frame"))
And the wanted output:
structure(list(trip_id = c(1, 1, 2, 2, 2, 3, 3, 3), stop_id = c(1,
3, 1, 2, 3, 1, 2, 3), speed = c(0, 0, 0, 0, 0, 0, 0, 0), timestamp = c(2,
202, 302, 402, 502, 602, 702, 802)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Edit: I am trying to change the code to include conditions. I tried case_when and if but can't get it to work:
df_arrival_z <- df %>%
group_by(trip_id, stop_id) %>%
filter(speed == 0)
# Check if there is any rows where speed is zero
if (nrow(filter(speed == 0)) > 0){
# Take the first row if there is rows with zero
filter(speed == 0) %>% slice(1)
}
if (nrow(filter(speed == 0)) == 0){
# Take the middle point if there is no rows with speed = 0
slice(nrow%/%2)
}
Without desired output I can't be sure what you expect, but try this and let me know:
library(dplyr)
df %>%
group_by(trip_id, stop_id) %>%
filter(speed == 0) %>%
slice(1)
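For the conditional variant in the question's second edit (take the first zero-speed row if the group has one, otherwise a row from the middle of the group), here is a sketch in base R. The small data frame adds a hypothetical group with no zero speeds, pick_row is an illustrative helper, and using ceiling(n / 2) as the middle row is an assumption. Rows are assumed to be ordered by timestamp within each group, as the question states.

```r
df <- data.frame(
  trip_id   = c(1, 1, 1, 1, 1, 1, 1),
  stop_id   = c(1, 1, 1, 1, 2, 2, 2),
  speed     = c(5, 0, 0, 5, 2, 3, 2),        # stop 2 is hypothetical: no zero speed
  timestamp = c(1, 2, 3, 4, 101, 102, 103)
)

# Hypothetical helper: choose one row from a single trip_id/stop_id group.
pick_row <- function(g) {
  zero <- which(g$speed == 0)
  # First zero-speed row if there is one, otherwise the middle row.
  i <- if (length(zero) > 0) zero[1] else ceiling(nrow(g) / 2)
  g[i, ]
}

# Split into groups, pick a row from each, and stack the results.
groups <- split(df, list(df$trip_id, df$stop_id), drop = TRUE)
result <- do.call(rbind, lapply(groups, pick_row))
rownames(result) <- NULL
print(result)
```

The lapply() here iterates over groups rather than rows, so there is no explicit double loop. The same per-group if/else could probably be placed inside slice() after group_by() in the dplyr pipeline above.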

Resources