I have a dataset with 54285 observations. I need to randomly assign 50% of the rows to one data frame, 30% to a second, and the remaining 20% to a third, with no row appearing in more than one of them.
This is an example:
data<-data.frame(numbers=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data
1
2
3
4
5
6
7
8
9
10
What I expect would be:
df1
5
3
8
1
7
df2
2
4
9
df3
6
10
Multiply each ratio by the number of rows in the dataset to get the group sizes, build a shuffled vector of group labels from them, and split the data on that vector to get separate data frames.
set.seed(123)
result <- split(data, sample(rep(1:3, nrow(data) * c(0.5, 0.3, 0.2))))
names(result) <- paste0('df', seq_along(result))
list2env(result, .GlobalEnv)
df1
# numbers
#1 1
#3 3
#7 7
#9 9
#10 10
df2
# numbers
#4 4
#5 5
#8 8
df3
# numbers
#2 2
#6 6
For large data frames, using sample with the prob argument should work as well. Note, however, that this might not give you the exact number of rows you expect, unlike the rep answer above.
result <- split(data, sample(1:3, nrow(data), replace = TRUE, prob = c(0.5, 0.3, 0.2)))
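With 54285 rows, products such as 54285 * 0.5 are not whole numbers, and rep() truncates fractional counts, so the label vector from the rep approach can end up shorter than the data. Here is a minimal sketch of one way around that, assuming it is acceptable to round all but the last group size and let the last group take the remainder. The helper name split_by_ratio is hypothetical.

```r
# Hypothetical helper: split a data frame into groups of (nearly) exact sizes.
split_by_ratio <- function(data, ratios = c(0.5, 0.3, 0.2)) {
  n <- nrow(data)
  # Round every group size except the last, which takes whatever is left
  sizes <- round(n * ratios[-length(ratios)])
  sizes <- c(sizes, n - sum(sizes))
  # One label per row, shuffled so the assignment is random
  labels <- sample(rep(seq_along(ratios), times = sizes))
  result <- split(data, labels)
  names(result) <- paste0("df", seq_along(result))
  result
}

set.seed(123)
big <- data.frame(id = seq_len(54285))
parts <- split_by_ratio(big)
sapply(parts, nrow)  # group sizes always add up to nrow(big)
```

Because every row gets exactly one label, no row can land in two groups.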
I have two large data frames (50+ columns, many of them long character variables) and I need to identify the "link" variable that I should use to merge them. The problem is that the variable names don't match up. That is, I need to identify variables in the two datasets whose values largely coincide.
As an example:
dta1 = data.frame(A = c(1, 2, 3, 4), B = c(23, 45, 6, 8), C = c("001", "028", "076", "039"))
dta2 = data.frame(first = c(5, 6, 7, 8), second = c(58, 32, 33, 45), third = c("008", "028", "076", "039"))
I would like the code to tell me that columns C and third have a very high correlation (they are not complete duplicates though!).
I have tried combining the two data frames and running the cor() function, but this doesn't work with character variables.
I also tried union_all(x, y, ...) from dplyr, but that requires the same column names.
At this point I am out of ideas.
Thanks very much.
To identify the most similar columns, try the following. It compares each column in dta1 with each column in dta2, element by element, and returns a matrix of match counts.
sapply(dta1, function(x) sapply(dta2, function(y) sum(x == y)))
       A B C
first  0 1 0
second 0 0 0
third  0 0 3
From here we can see that third and C have the most matches. Now you can join your two data.frames. To keep all rows and columns, you will want a full_join from the dplyr package.
library(dplyr)
full_join(dta1, dta2, by = c("C" = "third"))
   A  B   C first second
1  1 23 001    NA     NA
2  2 45 028     6     32
3  3  6 076     7     33
4  4  8 039     8     45
5 NA NA 008     5     58
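Note that x == y compares values position by position, so it only finds matches when both data frames list their rows in the same order. If the row order may differ, a sketch along these lines counts the distinct values each pair of columns has in common instead. The names shared and best are only illustrative.

```r
dta1 <- data.frame(A = c(1, 2, 3, 4), B = c(23, 45, 6, 8),
                   C = c("001", "028", "076", "039"))
dta2 <- data.frame(first = c(5, 6, 7, 8), second = c(58, 32, 33, 45),
                   third = c("008", "028", "076", "039"))

# For every pair of columns, count the distinct values they have in common.
# as.character() makes numeric and character columns comparable.
shared <- sapply(dta1, function(x)
  sapply(dta2, function(y)
    length(intersect(as.character(x), as.character(y)))))
shared

# Locate the best-matching pair (rows are dta2 columns, columns are dta1 columns)
best <- which(shared == max(shared), arr.ind = TRUE)
rownames(shared)[best[1, "row"]]
colnames(shared)[best[1, "col"]]
```

On the example data this again points to third and C as the pair to join on.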
I am trying to learn how to use for loops in R, in particular to take the difference between each number in a column and the number above it.
I know I can do this with b <- diff(df$a) or with:
library(dplyr)
df %>%
  mutate(b = a - lag(a))
But I am trying to understand how I can also get the same result with something like:
for(i in 1:nrow(df)){
  result = df[2,] - df[i,]
  print(result)
}
How do I set this for loop so the df[2,] takes every following row, and not just the 2nd row, and subtracts from the row above?
For example I have data like this:
column a
1
11
20
and I want to eventually create a column with the subtractions:
column a   column b
1          10
11         9
20         ...
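You can use a for loop like this:
df$columnB <- NA
# seq_len() also copes with a one-row data frame, where 1:(nrow(df) - 1) would count down from 1 to 0
for (i in seq_len(nrow(df) - 1)) {
  df$columnB[i] <- df$columnA[i + 1] - df$columnA[i]
}
df
#  columnA columnB
#1       1      10
#2      11       9
#3      20       5
#4      25       9
#5      34      NA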
data
Sample data used:
df <- data.frame(columnA = c(1, 11, 20, 25, 34))
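As a quick check, the loop result can be compared with the vectorised diff() mentioned in the question. This is a small sketch on the same sample data. The extra column name columnC is only for illustration.

```r
df <- data.frame(columnA = c(1, 11, 20, 25, 34))

# Loop version: each row gets the next value minus its own value
df$columnB <- NA_real_
for (i in seq_len(nrow(df) - 1)) {
  df$columnB[i] <- df$columnA[i + 1] - df$columnA[i]
}

# Vectorised version: diff() returns n - 1 differences, so pad with NA
df$columnC <- c(diff(df$columnA), NA)

identical(df$columnB, df$columnC)
```

The loop is useful for learning, but on large columns diff() is usually the simpler and faster choice.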
I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have, 2)) will sum two randomly selected positions (and sum(sample(ex.have, rpois(1, 2))) will add some randomness to the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: The proportion of events. This is a number between 0 and 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc, rounded to the nearest integer.
sample_n: The number of positions merged per event. This is an integer vector whose values must all be at least 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 || event_perc <= 0){
    stop("event_perc should be greater than 0 and at most 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("sample_n should be at least 2")
  }
  # Determine the number of events
  n <- round(length(x) * event_perc)
  # Determine the sample size of each event
  sample_vec <- sample(sample_n, size = n, replace = TRUE)
  # sprintf() returns character(0) for a zero-length input, unlike paste0()
  names(sample_vec) <- sprintf("S%d", seq_len(n))
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n")
  }
  # Determine how many positions will not be merged
  n2 <- length(x) - sum(sample_vec)
  # Create a vector of n2 ones
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- sprintf("N%d", seq_len(n2))
  # Combine sample_vec and non_merge_vec, and then randomly sort the vector
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- sample(combine_vec, size = length(combine_vec))
  # Expand the vector
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x,
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  # Name the columns inside cbind() so the result keeps "number" and "index"
  dat2 <- aggregate(cbind(number = dat$number, index = dat$index),
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, remove the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example, 4 and 5 are merged, and 8 and 9 are also merged.
set.seed(123)
vec_merge(1:10)
#      number index
# [1,]      1     1
# [2,]      2     1
# [3,]      3     1
# [4,]      9     2
# [5,]      6     1
# [6,]      7     1
# [7,]     17     2
# [8,]     10     1
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events)
  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)
  # Combine both groups and shuffle them so that the
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values),
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
The count of values that were included in a particular sum (your ex.indices).
The total after summing relevant values (your ex.want).
The positions of the original values from the input vector.
The original values themselves, in case you want to verify it later.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
##    count total      pos   original
## 1:     1     4        1          4
## 2:     1     7        2          7
## 3:     2     2     3, 9       1, 1
## 4:     1     2        4          2
## 5:     3    40 5, 8, 10 11, 19, 10
## 6:     1    14        6         14
## 7:     1    18        7         18
fun(events = c(3, 4))
##    count total        pos     original
## 1:     4    39 1, 4, 6, 8 4, 2, 14, 19
## 2:     3    36    2, 5, 7    7, 11, 18
## 3:     1     1          3            1
## 4:     1     1          9            1
## 5:     1    10         10           10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
##    count total                    pos                original
## 1:     7    92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2:     1    10                      2                      10
## 3:     3    71               6, 9, 14              20, 26, 25
## 4:     4    69          7, 12, 13, 16           28, 23, 6, 12
## 5:     2    45                  8, 17                  20, 25
## 6:     1    12                     10                      12
## 7:     1     8                     15                       8
## 8:     1    23                     18                      23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92
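If you would rather avoid the data.table dependency, the same grouping idea can be sketched in base R with tapply(). This is only an illustration under the same assumptions as the function above. The name merge_positions is hypothetical, and it returns the two-column matrix asked for in the question.

```r
# Hypothetical base-R helper: merge randomly chosen positions of `vec`.
# `events` gives how many positions each merging event combines.
merge_positions <- function(vec, events) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Group ids: one id per event (repeated), then one id per leftover position
  merged <- rep(seq_along(events), times = events)
  singles <- seq_len(length(vec) - sum(events)) + length(events)
  # Shuffle the ids so merged positions need not be consecutive
  groups <- sample(c(merged, singles))
  # Two-column matrix: summed values and how many positions went into each
  cbind(total = as.vector(tapply(vec, groups, sum)),
        count = as.vector(tapply(vec, groups, length)))
}

set.seed(1)
ex.have <- c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9)
res <- merge_positions(ex.have, events = c(2, 3))
res
```

Unlike the data.table version, the rows here come out ordered by group id (merged events first, unmerged positions after), not by first position in the input.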
I want to use R to look up a value as efficiently as possible in tables stored as a data.frame like this:
  x01 x02 x03 x04 x05 x06 x07
1  NA 100 200 300 400 500 600
2  10   1   4   3   6   7   1
3  20   2   5   2   5   8   2
4  30   3   6   1   4   9   8
The values in the first row and in the first column are in increasing order. For example, I need to find the value at the intersection of the column containing 300 in the first row and the row containing 20 in the first column. That value is 2. Code for this:
coefficient_table_1 <- data.frame(
  x01=c(NA, 10, 20, 30),
  x02=c(100, 1, 2, 3),
  x03=c(200, 4, 5, 6),
  x04=c(300, 3, 2, 1),
  x05=c(400, 6, 5, 4),
  x06=c(500, 7, 8, 9),
  x07=c(600, 1, 2, 8)
)
col_value <- 300
row_value <- 20
col <- 0
for(i in 2:ncol(coefficient_table_1)){
  if(coefficient_table_1[1,i]==col_value ){
    col <- i
  }
}
row <- which(coefficient_table_1$x01==row_value)
value <- coefficient_table_1[row, col]
The table can be large, and the lookup may run inside a loop. What is the most efficient way to search a data.frame?
Your data is all numeric, so your best course of action is probably to use a matrix rather than a data frame.
Since a matrix holds data of only a single type (e.g. numeric), many operations are much faster when your data is in that format.
Try this:
x <- as.matrix(coefficient_table_1)
x[which(x[, 1]==row_value), which(x[1, ]==col_value)]
x04
2
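If the same table is queried many times, one option is to move the header row and the first column into the dimnames once and then index by name. This is a sketch under the assumption that those header values are unique. The name lookup is only illustrative.

```r
coefficient_table_1 <- data.frame(
  x01 = c(NA, 10, 20, 30),
  x02 = c(100, 1, 2, 3),
  x03 = c(200, 4, 5, 6),
  x04 = c(300, 3, 2, 1),
  x05 = c(400, 6, 5, 4),
  x06 = c(500, 7, 8, 9),
  x07 = c(600, 1, 2, 8)
)

x <- as.matrix(coefficient_table_1)
# Keep only the body of the table; use the first column and the
# header row as row and column names
lookup <- x[-1, -1, drop = FALSE]
dimnames(lookup) <- list(as.character(x[-1, 1]), as.character(x[1, -1]))

# Index by name; no loop or which() call is needed per lookup
lookup["20", "300"]
```

When the keys come from numeric variables, convert them with as.character() in the same way, for example lookup[as.character(row_value), as.character(col_value)], so that they are formatted exactly like the dimnames.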