Random stratified sampling with different proportions in R

I am trying to split a dataset 80/20 into training and testing sets. I am splitting by location, which is a factor with 4 levels; however, the levels have not been sampled equally. Out of 1892 samples:
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I want the whole dataset split 80/20, as mentioned above, but I also want each location split 80/20, so that I get an even proportion from each location in the training and testing sets. I've seen one post about doing this with the stratified() function from the splitstackshape package, but it doesn't seem to want to split my factors up.
Here is a simplified reproducible example -
library(splitstackshape)

x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
# my attempt; note that stratified() returns the sampled rows themselves,
# not row indices, so the subsetting below fails
validIndex <- stratified(df, "xx", size = 16/nrow(df))
valid <- df[-validIndex, ]
train <- df[validIndex, ]
where A, B, C, D correspond to the factor levels in roughly the same proportions as the actual dataset (~10, 32, 32, and 26%, respectively).

Using bothSets = TRUE should return a list containing the split of the original data frame into a validation and a training set (whose union is the original data frame):
splt <- stratified(df, "xx", size = 16/nrow(df), replace = FALSE, bothSets = TRUE)
train <- splt[[1]]  # the stratified 80% sample
valid <- splt[[2]]  # the remaining 20%
## check that the two parts recombine to the original data
df2 <- as.data.frame(do.call("rbind", splt))
all.equal(df[with(df, order(xx, x)), ],
          df2[with(df2, order(xx, x)), ],
          check.names = FALSE)
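To confirm that each level was split as intended, a quick sanity check with base R's table() (not part of the original answer) should show roughly 0.8 for every level:

# each level of xx should contribute ~80% of its rows to train
table(train$xx) / table(df$xx)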

Related

gtsummary R package: pre-post summary table with paired 2-sample tests?

Is it possible to use the gtsummary R package to make a pre-post summary table with 2 columns that summarize multiple variables at 2 different time points?
I know the arsenal R package supports this, but I would prefer to use gtsummary if possible since it supports the tidyverse.
For example, is it possible to make a pre-post summary table using gtsummary that is similar to the table in this example? Here is a simpler version of the dataset from their example:
dat <- data.frame(
  tp = paste0("Time Point ", c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2)),
  id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  Cat = c("A", "A", "A", "B", "B", "B", "B", "A", NA, "B"),
  Fac = factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A")),
  Num = c(1, 2, 3, 4, 4, 3, 3, 4, 0, NA),
  stringsAsFactors = FALSE
)
Note the dataset is in "long format": tp holds the 2 pre-post time points, and id is the subject ID for the 2 repeated measures. In the table, Cat and Fac are categorical variables that would be summarized as count (%) at each time point, with McNemar's test to compare whether they change over time. Num is a numeric variable that would be summarized as mean (standard deviation) at each time point, with a paired t-test to assess change over time.
Yes, as of gtsummary v1.3.6 there is a function called add_difference() for this express purpose. It supports both paired data (e.g. pre- and post-responses) and unpaired data; the method is specified in the test= argument.
Worked example here: http://www.danieldsjoberg.com/gtsummary/articles/gallery.html#paired-test
Here's an unpaired example:
library(gtsummary)
library(dplyr)

trial %>%
  select(trt, age, marker, response, death) %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_dichotomous() ~ "{p}%"
    ),
    missing = "no"
  ) %>%
  add_n() %>%
  add_difference()
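For the paired setup in the question itself, a minimal sketch along the lines of the linked gallery example might look like this (assuming gtsummary's built-in "paired.t.test" and "mcnemar.test" methods; paired tests need complete pairs, so subjects with only one time point are dropped first):

library(gtsummary)
library(dplyr)

dat %>%
  group_by(id) %>%
  filter(n() == 2) %>%   # keep only subjects observed at both time points
  ungroup() %>%
  tbl_summary(
    by = tp,
    include = c(Cat, Fac, Num),
    statistic = all_continuous() ~ "{mean} ({sd})"
  ) %>%
  add_p(
    test = list(
      all_continuous() ~ "paired.t.test",
      all_categorical() ~ "mcnemar.test"
    ),
    group = id  # the pairing variable
  )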

How to find the similarity in R?

I have a data set, shown below. It indicates which book is sold by which shop.
library(tibble)

df <- tribble(
  ~shop, ~book_id,
  "A", 1,
  "B", 1,
  "C", 2,
  "D", 3,
  "E", 3,
  "A", 3,
  "B", 4,
  "C", 5,
  "D", 1
)
In the data set,
shop A sells 1, 3
shop B sells 1, 4
shop C sells 2, 5
shop D sells 3, 1
shop E sells only 3
So now I want to calculate the Jaccard index. For instance, take shops A and B. There are three different books sold by A and B together (books 1, 3, and 4), but only one book is sold by both shops (book 1). So the Jaccard index here should be 33.3% (1/3).
Here is the sample of the desired data:
df <- tribble(
  ~shop_1, ~shop_2, ~similarity,
  "A", "B", 33.33,
  "B", "A", 33.33,
  "A", "C", 0,
  "C", "A", 0,
  "A", "D", 100,
  "D", "A", 100,
  "A", "E", 50,
  "E", "A", 50
)
Any comments/assistance really appreciated! Thanks in advance.
I don't know about a package but you can write your own function. I guess by similarity you mean something like this:
similarity <- function(x, y) {
  k <- length(intersect(x, y))
  n <- length(union(x, y))
  k / n
}
Then you can use tidyr::crossing() to cross the data frame with itself:
library(tidyverse)

dfg <- df %>%
  group_by(shop) %>%
  summarise(books = list(book_id))

crossing(dfg %>% set_names(paste0, "_A"),
         dfg %>% set_names(paste0, "_B")) %>%
  filter(shop_A != shop_B) %>%
  mutate(similarity = map2_dbl(books_A, books_B, similarity))
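As a quick sanity check against the A/B example worked out above (multiply by 100 to match the percentage column in the desired output):

similarity(c(1, 3), c(1, 4))  # book sets for shops A and B -> 0.333...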

R: Create every possible combination of given columns

I have data that corresponds to df below. df shows sources and destinations together with their longitudes and latitudes.
I want to use df to generate df1. df1 gives all possible combinations of source and destination and, while doing so, carries along the appropriate source and destination longitudes and latitudes.
Source <- c("A", "B", "C", "D")
Destination <- c("A", "B", "C", "D")
Source_Latitude <- c(1, 2, 3, 4)
Source_Longitude <- c(-1, -2, -3, -4)
Dest_Latitude <- c(1, 2, 3, 4)
Dest_Longitude <- c(-1, -2, -3, -4)
df <- data.frame(Source, Source_Latitude, Source_Longitude,
                 Destination, Dest_Latitude, Dest_Longitude)

Source <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "D", "D", "D", "D")
Destination <- c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D")
Source_Latitude <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4)
Source_Longitude <- c(-1, -1, -1, -1, -2, -2, -2, -2, -3, -3, -3, -3, -4, -4, -4, -4)
Dest_Latitude <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
Dest_Longitude <- c(-1, -2, -3, -4, -1, -2, -3, -4, -1, -2, -3, -4, -1, -2, -3, -4)
df1 <- data.frame(Source, Source_Latitude, Source_Longitude,
                  Destination, Dest_Latitude, Dest_Longitude)
I tried using crossing() and expand.grid() without any success
library(dplyr)

expand.grid(Source = Source, Destination = Destination) %>%
  inner_join(select(df, contains("Source")), by = "Source") %>%
  inner_join(select(df, contains("Dest")), by = "Destination") %>%
  select(contains("Source"), contains("Dest")) %>%
  View()
As an additional observation: although the code works, I don't think it is best to keep sources and destinations in the same data frame, because the numbers of sources and destinations may differ. It would probably be better to have one data frame for each and adapt the code accordingly (a sketch along those lines follows below).
all_combination <- expand.grid(Source = df$Source, Destination = df$Destination) %>%
  inner_join(select(df, contains("Source")), by = "Source") %>%
  inner_join(select(df, contains("Dest")), by = "Destination") %>%
  distinct()
This worked for me. It took a while to figure out how to use the expand.grid() function.
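Following the observation above about separate data frames, here is a minimal sketch of the same cross join using tidyr::crossing(); the sources and destinations frames are hypothetical stand-ins, not from the question:

library(tidyr)

# hypothetical separate lookup tables for sources and destinations
sources <- data.frame(Source = c("A", "B", "C", "D"),
                      Source_Latitude = c(1, 2, 3, 4),
                      Source_Longitude = c(-1, -2, -3, -4))
destinations <- data.frame(Destination = c("A", "B", "C", "D"),
                           Dest_Latitude = c(1, 2, 3, 4),
                           Dest_Longitude = c(-1, -2, -3, -4))

# crossing() pairs every row of sources with every row of destinations
df1 <- crossing(sources, destinations)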

Kruskal-Wallis test: create lapply function to subset data.frame?

I have a data set of values (val) grouped by multiple categories (distance and phase). I would like to run a Kruskal-Wallis test within each category, where val is the dependent variable, distance is a factor, and phase splits my data into 3 groups.
As such, I need to specify a subset of the data within the Kruskal-Wallis test and then apply the test to each of the groups. But I cannot get my subsetting to work!
The R help says that subset is an optional vector specifying a subset of observations to be used. But how do I pass this correctly to my lapply function?
My dummy data:
# create data
val <- runif(60, min = 0, max = 100)
distance <- floor(runif(60, min = 1, max = 3))
phase <- rep(c("a", "b", "c"), 20)
df <- data.frame(val, distance, phase)

# get unique groups
ii <- unique(df$phase)

# get basic statistics per group
aggregate(val ~ distance + phase, df, mean)

# run Kruskal test, specifying the subset
kruskal.test(df$val ~ df$distance,
             subset = phase == "c")
This works well, so my subset seems to be correctly specified as a vector.
But how do I use this in an lapply function?
# DOES not work!!
lapply(ii, kruskal.test(df$val ~ df$distance,
                        subset = df$phase == as.character(ii)))
My overall goal is to create a function from kruskal.test, and save all statistics for each group into one table.
All help is highly appreciated.
Usually you would start by splitting, and then lapplying.
Something like
lapply(split(df, df$phase), function(d) { kruskal.test(val ~ distance, data=d) })
would yield a list, indexed by the phase, of the results of kruskal.test.
Your final expression does not work because lapply expects a function; calling kruskal.test directly does not yield a function, it yields the result of running that test once. If you wrap it in a function of the index, it works, just a little less idiomatically:
lapply(ii, function(i) { kruskal.test(df$val ~ df$distance, subset = df$phase == i) })
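Since the stated goal is one table of statistics for all groups, here is a sketch (plain base R, not from the original answers) that flattens the list of htest results into a single data frame:

res <- lapply(split(df, df$phase), function(d) kruskal.test(val ~ distance, data = d))

# pull the statistic, degrees of freedom, and p-value out of each htest object
do.call(rbind, lapply(names(res), function(ph) {
  data.frame(phase     = ph,
             statistic = unname(res[[ph]]$statistic),
             df        = unname(res[[ph]]$parameter),
             p.value   = res[[ph]]$p.value)
}))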
Though it is late, this might help someone with the same problem, so I am adding an answer implemented with the tidyverse and rstatix packages. The rstatix package "provides a simple and intuitive pipe friendly framework, coherent with the 'tidyverse' design philosophy, for performing basic statistical tests".
library(rstatix)
library(tidyverse)

df %>%
  group_by(phase) %>%
  kruskal_test(val ~ distance)
Output
# A tibble: 3 x 7
  phase .y.       n statistic    df     p method
* <chr> <chr> <int>     <dbl> <int> <dbl> <chr>
1 a     val      20    0.230      1 0.631 Kruskal-Wallis
2 b     val      20    0.0229     1 0.88  Kruskal-Wallis
3 c     val      20    0.322      1 0.570 Kruskal-Wallis
which is the same result as provided by @user295691.
Data
df = structure(list(val = c(93.8056977232918, 31.0681172646582, 40.5262873973697,
47.6368983509019, 65.23181500379, 64.4571609096602, 10.3301600087434,
90.4661140637472, 41.2359046051279, 28.3357713604346, 49.8977075796574,
10.8744730940089, 5.31001624185592, 71.9248640118167, 99.0267782937735,
73.7928744405508, 3.31214582547545, 40.2693636715412, 27.6980920461938,
79.501334275119, 60.5167196830735, 89.9171086261049, 87.4633299885318,
43.1893823202699, 91.1248738644645, 99.755659350194, 7.25280269980431,
96.957387868315, 75.0860505970195, 52.3794749286026, 26.6221587313339,
52.5518182432279, 24.1361060412601, 49.5364486705512, 65.5214034719393,
38.9469220302999, 0.687191751785576, 19.3090825574473, 19.6511475136504,
25.5966754630208, 7.33999472577125, 33.9820940745994, 50.3751677693799,
10.811762069352, 17.2359711956233, 53.958406439051, 64.2723652534187,
92.7404976682737, 26.824192632921, 30.0975760444999, 52.0105463219807,
74.4495407678187, 56.0636054025963, 91.891074879095, 14.0827904455364,
59.3607738381252, 66.5170294465497, 24.1726311156526, 83.0881901318207,
35.5380675755441), distance = c(2, 1, 1, 1, 1, 2, 1, 2, 2, 1,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1,
1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
1, 2, 1, 1, 2, 2, 2, 2), phase = c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c")), class = "data.frame", row.names = c(NA, -60L))

Using matplot in R whenever a certain column changes

Sorry in advance; I am new to asking questions here and don't know how to input this table properly.
Say I have a data frame in R constructed like:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind(team, value)  # note: cbind() on character + numeric yields a character matrix
I want to create a plot that will give me 3 lines graphing the values for teams A, B, and C. I believe I can do this inputting the matrix m into matplot somehow, but I'm not sure how.
EDIT: I've gotten a lot closer to solving my problem. However, I've realized that, with the code I have, "Value" is a list of length 745, which matches the number of rows in my data frame m. Yet when I unlist(Value) it turns into a numeric vector of length 894. Any ideas on why this would happen?
You can try something like this:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind.data.frame(team, value)

library(ggplot2)
ggplot(m, aes(x = as.factor(1:nrow(m)), y = value, group = team, col = team)) +
  geom_line(lwd = 2) +
  xlab('index')
If you have the same number of ordered values for each team, you can use matplot to visualize them, but the data should be converted to a matrix first:
m = cbind.data.frame(team, value, index = rep(1:3, 3))
m <- reshape(m, v.names = 'value', idvar = 'team', direction = 'wide', timevar = 'index')
# rows of m are now teams; transpose so each team becomes a column (one line each)
matplot(t(m[, 2:4]), type = 'l', lty = 1)
legend('top', legend = m[, 1], lty = 1, col = 1:3)
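Since value is already ordered by team here, a shorter route to the same picture (a sketch that relies on that ordering assumption, not part of the original answer) is to fold the vector straight into a matrix:

# one column per team; matplot draws one line per column
matplot(matrix(value, nrow = 3, dimnames = list(NULL, unique(team))),
        type = 'l', lty = 1)
legend('top', legend = unique(team), lty = 1, col = 1:3)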
