I have two data sets, let's call them
A: 1, 3, 5, 6, 3
and
B: 2, 4, 7, 9, 8
where the numbers in A and B represent time.
I want to label all of the numbers in A, row by row, as "positive"
and all of the numbers in B as "negative".
I then want to merge A and B into a single data frame with one column called "time", with each number keeping its "positive"/"negative" label so that I can plot both groups on a survival plot.
My guess is that you are looking for a data frame like this:
A <- data.frame(time = c(1, 3, 5, 6, 3), status= "positive")
B <- data.frame(time = c(2, 4, 7, 9, 8), status= "negative")
rbind(A, B)
Output:
   time   status
1     1 positive
2     3 positive
3     5 positive
4     6 positive
5     3 positive
6     2 negative
7     4 negative
8     7 negative
9     9 negative
10    8 negative
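If there are more than two groups, the same data frame can be built from a named list, so that the labels come from the list names. This is only a sketch; the names groups and combined are made up for illustration.

```r
# Hypothetical named list: one element per group, the names become the labels
groups <- list(positive = c(1, 3, 5, 6, 3), negative = c(2, 4, 7, 9, 8))

combined <- data.frame(
  # flatten all values into one "time" column
  time = unlist(groups, use.names = FALSE),
  # repeat each group name once per value in that group
  status = rep(names(groups), times = lengths(groups))
)
combined
```

This should give the same ten rows as rbind(A, B) above, and adding a third group only requires adding another element to the list.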
I have a dataset with 54285 observations. I need to randomly assign 50% of the rows to one data frame, 30% to a second, and the remaining 20% to a third, with no row appearing in more than one of them.
This is an example:
data<-data.frame(numbers=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data
1
2
3
4
5
6
7
8
9
10
What I expect would be:
df1
5
3
8
1
7
df2
2
4
9
df3
6
10
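Multiply each ratio by the number of rows in the dataset to get the group sizes, then split the data into separate data frames. Note that this assumes each product is a whole number: rep() truncates fractional counts, so for a row count such as 54285 the group sizes no longer add up to nrow(data).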
set.seed(123)
result <- split(data, sample(rep(1:3, nrow(data) * c(0.5, 0.3, 0.2))))
names(result) <- paste0('df', seq_along(result))
list2env(result, .GlobalEnv)
df1
#   numbers
#1        1
#3        3
#7        7
#9        9
#10      10
df2
#  numbers
#4       4
#5       5
#8       8
df3
#  numbers
#2       2
#6       6
For large data frames, using sample with the prob argument should work as well. Note, however, that unlike the rep approach above, this only gives each group its expected proportion on average, so you might not get exactly the number of rows you expect.
result <- split(data, sample(1:3, nrow(data), replace = TRUE, prob = c(0.5, 0.3, 0.2)))
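When the ratios do not divide the row count evenly (54285 * 0.5 is 27142.5), one way to keep the rep approach exact is to round the group sizes down and hand the leftover rows to the first groups. A minimal sketch, assuming the ratios sum to 1; split_by_ratio and big are hypothetical names, not part of base R or of the answer above:

```r
split_by_ratio <- function(data, ratios = c(0.5, 0.3, 0.2)) {
  # Assumption: the ratios describe a complete partition of the rows
  stopifnot(abs(sum(ratios) - 1) < 1e-8)
  n <- nrow(data)
  # Round every group size down, then count the rows not yet assigned
  sizes <- floor(n * ratios)
  leftover <- n - sum(sizes)
  if (leftover > 0) {
    # Give one extra row to each of the first `leftover` groups
    idx <- seq_len(leftover)
    sizes[idx] <- sizes[idx] + 1
  }
  # One shuffled group label per row, so no row can land in two groups
  groups <- sample(rep(seq_along(ratios), times = sizes))
  result <- split(data, factor(groups, levels = seq_along(ratios)))
  names(result) <- paste0("df", seq_along(result))
  result
}

set.seed(123)
big <- data.frame(numbers = seq_len(54285))
parts <- split_by_ratio(big)
sapply(parts, nrow)
```

Because every row receives exactly one label, the three pieces together contain each row once, and their sizes differ from the exact ratios by at most one row.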
I have two data frames: one is a very large, wide dataset with hundreds of parameters, and the other has three columns, one naming the parameters in the larger data frame that have specification limits and two holding the lower and upper limits. I want to reduce the wide data frame to just the columns that are listed in the limits data frame. I feel like this is incredibly basic, but I cannot get it to work.
See below for an example and output that I would like.
df
df <- data.frame("par.1" = c(1, 1, 2, 3, 5), "par.2" = c(10, 11, 12, 11, 15),"par.3" = c(8, 8, 12, 8, 9),"par.4" = c(8, 8, 12, 8, 9))
limits
limits <- data.frame("parameter" = c("par.2", "par.4"), "lsl" = c(8,5), "usl" = c(16,15))
Here is the output I am looking for:
df.reduced
  par.2 par.4
1    10     8
2    11     8
3    12    12
4    11     8
5    15     9
Just subset the columns of df to those whose names are %in% the parameter column of limits:
df[names(df) %in% limits$parameter]
  par.2 par.4
1    10     8
2    11     8
3    12    12
4    11     8
5    15     9
Alternatively, use match, which returns the columns in the order they appear in limits:
df[match(limits$parameter, names(df))]
Another option is intersect:
df[intersect(names(df), as.character(limits$parameter))]
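The three approaches differ in column order and in how they react to a parameter that is missing from df. The sketch below uses a modified limits table (limits2, a hypothetical variant with the rows reordered and a made-up name "par.9" that is not a column of df) to show the difference:

```r
df <- data.frame(par.1 = c(1, 1, 2, 3, 5), par.2 = c(10, 11, 12, 11, 15),
                 par.3 = c(8, 8, 12, 8, 9), par.4 = c(8, 8, 12, 8, 9))
# Hypothetical variant of limits: reordered, plus one name absent from df
limits2 <- data.frame(parameter = c("par.4", "par.2", "par.9"),
                      lsl = c(5, 8, 0), usl = c(15, 16, 1),
                      stringsAsFactors = FALSE)

# %in% keeps the column order of df and silently ignores "par.9"
by_in <- df[names(df) %in% limits2$parameter]

# intersect follows the order of its first argument and also ignores "par.9"
by_intersect <- df[intersect(limits2$parameter, names(df))]

# match follows the order of limits2; drop the NA produced by "par.9"
# before indexing, since an NA column index is not a valid selection
idx <- match(limits2$parameter, names(df))
by_match <- df[idx[!is.na(idx)]]

names(by_in)
names(by_intersect)
names(by_match)
```

If the column order should follow the limits table, match or intersect with limits2$parameter as the first argument is the natural choice; if it should follow df, %in% is simpler.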
So far I have the following data.frame, with an initial column filled with set values:
df <- data.frame(N=seq(10, 100, by=10))
Now, I want to have a second column here, which would be a list (or c()) of integers, such that the output of calling df would be as follows:
N I
1 10 2, 8, 1
2 20 4, 0, 99
.. .. ..
I tried df <- data.frame(N=seq(10, 100, by=10), I=logical(10)), which puts a FALSE in each row of column I. But testing what I want with df$I[df$N == 10] <- list(2, 8, 1) throws the error:
number of items to replace is not a multiple of replacement length
Edit: I also tried using I(list(...)) to keep the list interpreted as is, but the same error was thrown.
We can create the list column by wrapping it with I() in data.frame, and then assign to the list element selected by the logical vector:
df <- data.frame(N=seq(10, 100, by=10), I= I(vector('list', 10)))
df$I[df$N == 10][[1]] <- list(2, 8, 1)
df
#     N       I
#1   10 2, 8, 1
#2   20
#3   30
#4   40
#5   50
#6   60
#7   70
#8   80
#9   90
#10 100
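The message in the question arises because the replacement list(2, 8, 1) has three elements while df$N == 10 selects only one position. If the goal is one integer vector per row rather than a nested list, a possible variation is to store c(2L, 8L, 1L) as a single list element with [[<-. This is only a sketch and assumes each value of N occurs exactly once:

```r
df <- data.frame(N = seq(10, 100, by = 10))
# Start with an empty list column: one NULL slot per row
df$I <- vector("list", nrow(df))

# which() returns a single row index here, so [[<- fills exactly one slot
df$I[[which(df$N == 10)]] <- c(2L, 8L, 1L)
df$I[[which(df$N == 20)]] <- c(4L, 0L, 99L)

# Each cell is an ordinary integer vector and can be used directly
sum(df$I[[1]])
lengths(df$I)
```

Rows that have not been assigned yet keep a NULL entry, which lengths() reports as 0.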
I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have,2)) will sum two randomly selected positions (or sum(sample(ex.have,rpois(1,2))) will add some randomness in the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: the proportion of events, a number greater than 0 and at most 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc, rounded to the nearest integer.
sample_n: the possible merge sizes, an integer vector in which every number is at least 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 || event_perc <= 0){
    stop("event_perc should be greater than 0 and at most 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("sample_n should be at least 2")
  }
  # Determine the number of events
  n <- round(length(x) * event_perc)
  # Determine the sample size of each event
  # (index into sample_n so a length-one sample_n is not treated as 1:sample_n)
  sample_vec <- sample_n[sample.int(length(sample_n), size = n, replace = TRUE)]
  names(sample_vec) <- paste0("S", seq_len(n))
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n")
  }
  # Determine how many numbers will not be merged
  n2 <- length(x) - sum(sample_vec)
  # Create a vector of n2 ones, one per unmerged number
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- paste0("N", seq_len(n2))
  # Combine sample_vec and non_merge_vec, and then randomly shuffle the vector
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- combine_vec[sample.int(length(combine_vec))]
  # Expand the vector: repeat each group name by its group size
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x,
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  # Sum the numbers and count the positions within each group
  dat2 <- aggregate(dat[c("number", "index")],
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, remove the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example 4 and 5 are merged, and 8 and 9 are also merged. Note that this function only merges adjacent positions.
set.seed(123)
vec_merge(1:10)
#      number index
# [1,]      1     1
# [2,]      2     1
# [3,]      3     1
# [4,]      9     2
# [5,]      6     1
# [6,]      7     1
# [7,]     17     2
# [8,]     10     1
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events)
  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)
  # Combine both groups and shuffle them so that the
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values),
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
The count of values that were included in a particular sum (your ex.indices).
The total after summing relevant values (your ex.want).
The positions of the original values from the input vector.
The original values themselves, in case you want to verify it later.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
##    count total      pos   original
## 1:     1     4        1          4
## 2:     1     7        2          7
## 3:     2     2     3, 9       1, 1
## 4:     1     2        4          2
## 5:     3    40 5, 8, 10 11, 19, 10
## 6:     1    14        6         14
## 7:     1    18        7         18
fun(events = c(3, 4))
##    count total        pos     original
## 1:     4    39 1, 4, 6, 8 4, 2, 14, 19
## 2:     3    36    2, 5, 7    7, 11, 18
## 3:     1     1          3            1
## 4:     1     1          9            1
## 5:     1    10         10           10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
##    count total                    pos                original
## 1:     7    92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2:     1    10                      2                      10
## 3:     3    71               6, 9, 14              20, 26, 25
## 4:     4    69          7, 12, 13, 16           28, 23, 6, 12
## 5:     2    45                  8, 17                  20, 25
## 6:     1    12                     10                      12
## 7:     1     8                     15                       8
## 8:     1    23                     18                      23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92
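For the two-column matrix the question asks for (summed values in one column, number of merged positions in the other), the same grouping idea can also be written in base R. This is a minimal sketch under the same assumptions as fun above; merge_positions is a hypothetical name chosen for illustration:

```r
merge_positions <- function(vec, events) {
  stopifnot(sum(events) <= length(vec))
  n_rest <- length(vec) - sum(events)
  # One group id per merge event, repeated by the event size,
  # followed by a distinct id for every position that stays unmerged
  ids <- c(rep(seq_along(events), times = events),
           length(events) + seq_len(n_rest))
  # Shuffle the ids so merged positions are random, not necessarily adjacent
  ids <- ids[sample.int(length(ids))]
  # Keep groups in order of first appearance in the vector
  grp <- factor(ids, levels = unique(ids))
  cbind(value = as.vector(tapply(vec, grp, sum)),
        index = as.vector(table(grp)))
}

set.seed(42)
ex.have <- c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9)
res <- merge_positions(ex.have, events = c(2, 3))
res
```

With events = c(2, 3) on a vector of length 10, the result has 7 rows: one row with index 2, one with index 3, and five unmerged rows with index 1.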