Assign a specific number of random rows into datasets in R

I have a dataset with 54285 observations. I need to randomly assign 50% of the rows to one data frame, 30% to another, and the remaining 20% to a third, with no row appearing in more than one of them.
This is an example:
data<-data.frame(numbers=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data
1
2
3
4
5
6
7
8
9
10
What I expect would be:
df1
5
3
8
1
7
df2
2
4
9
df3
6
10

Multiply each ratio by the number of rows in the dataset to get the group sizes, then use split to divide the data into separate data frames.
set.seed(123)
result <- split(data, sample(rep(1:3, nrow(data) * c(0.5, 0.3, 0.2))))
names(result) <- paste0('df', seq_along(result))
list2env(result, .GlobalEnv)
df1
# numbers
#1 1
#3 3
#7 7
#9 9
#10 10
df2
# numbers
#4 4
#5 5
#8 8
df3
# numbers
#2 2
#6 6
For large data frames, using sample with the prob argument should work as well. Note, however, that it may not give you the exact group sizes that the rep approach above does.
result <- split(data, sample(1:3, nrow(data), replace = TRUE, prob = c(0.5, 0.3, 0.2)))
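Since 54285 rows do not split evenly into 50/30/20, the rep-based sizes need rounding. Here is a minimal sketch (the big, sizes, and groups names are just illustrative) of one way to keep the group sizes summing exactly to the row count:
big <- data.frame(numbers = seq_len(54285))
sizes <- round(nrow(big) * c(0.5, 0.3, 0.2))
# adjust the first group so the sizes add up to nrow(big) exactly
sizes[1] <- nrow(big) - sum(sizes[-1])
groups <- sample(rep(1:3, sizes))
big_result <- split(big, groups)
sapply(big_result, nrow)   # check the resulting group sizes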

Related

Automate replacement of missing data on a sequence of variables using mutate_all

I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values in the variables ending in "x1" with their corresponding values in the variables ending in "x2". That is, if ax1 is missing it is replaced by ax2, any missing bx1 is replaced by bx2, and so on. Since there are many more variables than in the scenario presented here, I am looking for a way to automate this process. I have tried the following code
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace the NA values in the x1 columns with the corresponding values from the x2 columns using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
The data were modified slightly to make sure the NAs are actual NA values and not the string "NA", which was turning the columns into factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))
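If you prefer to stay in dplyr, here is a hedged sketch of the same replacement using across() and coalesce() (this assumes dplyr >= 1.0.0 and derives each x2 name from the current x1 column via cur_column()):
library(dplyr)
# coalesce() keeps the first non-NA value, so each x1 column falls back on its x2 counterpart
df <- df %>%
  mutate(across(ends_with("x1"),
                ~ coalesce(.x, get(sub("x1$", "x2", cur_column())))))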

Combining/summing two positions in a vector of integers in R

I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
In addition, I also need to record how many vector positions were "merged" into each resulting position (using the value 1 if a position was not merged), which I will term the indices. Since the first two and the last three positions were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have, 2)) will sum two randomly selected positions (or sum(sample(ex.have, rpois(1, 2))) will add some randomness to the number of positions), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions within vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: The event percentage, a number between 0 and 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc.
sample_n: The possible merge sizes. This is an integer vector whose values must all be at least 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check that event_perc makes sense
  if (event_perc > 1 | event_perc <= 0){
    stop("event_perc should be between 0 and 1.")
  }
  # Check that sample_n makes sense
  if (any(sample_n < 2)){
    stop("All values in sample_n should be at least 2")
  }
  # Determine the number of events
  n <- round(length(x) * event_perc)
  # Determine the sample number of each event
  sample_vec <- sample(sample_n, size = n, replace = TRUE)
  names(sample_vec) <- paste0("S", 1:n)
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n")
  }
  # Determine how many positions will not be merged
  n2 <- length(x) - sum(sample_vec)
  # Create a vector of 1s, one per unmerged position
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- paste0("N", 1:n2)
  # Combine sample_vec and non_merge_vec, then randomly shuffle the vector
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- sample(combine_vec, size = length(combine_vec))
  # Expand the vector
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x,
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  dat2 <- aggregate(cbind(number = dat$number, index = dat$index),
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix after removing the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test of the function. I applied it to the sequence from 1 to 10. As you can see, in this example 4 and 5 are merged, and 8 and 9 are also merged.
set.seed(123)
vec_merge(1:10)
# number index
# [1,] 1 1
# [2,] 2 1
# [3,] 3 1
# [4,] 9 2
# [5,] 6 1
# [6,] 7 1
# [7,] 17 2
# [8,] 10 1
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events)
  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)
  # Combine both groups and shuffle them so that the
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values),
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
The count of values that were included in a particular sum (your ex.indices).
The total after summing relevant values (your ex.want).
The positions of the original values from the input vector.
The original values themselves, in case you want to verify it later.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
## count total pos original
## 1: 1 4 1 4
## 2: 1 7 2 7
## 3: 2 2 3, 9 1, 1
## 4: 1 2 4 2
## 5: 3 40 5, 8, 10 11, 19, 10
## 6: 1 14 6 14
## 7: 1 18 7 18
fun(events = c(3, 4))
## count total pos original
## 1: 4 39 1, 4, 6, 8 4, 2, 14, 19
## 2: 3 36 2, 5, 7 7, 11, 18
## 3: 1 1 3 1
## 4: 1 1 9 1
## 5: 1 10 10 10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
## count total pos original
## 1: 7 92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2: 1 10 2 10
## 3: 3 71 6, 9, 14 20, 26, 25
## 4: 4 69 7, 12, 13, 16 28, 23, 6, 12
## 5: 2 45 8, 17 20, 25
## 6: 1 12 10 12
## 7: 1 8 15 8
## 8: 1 23 18 23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92

Alternative to a nested for do loop

I have a data frame named df with 200+ variables and 300,000+ observations (200+ columns, 300,000+ rows).
The end goal of my R code is to find the outliers in each column and replace them with a certain value, say NA. If the value is already NA, skip it and proceed to the next iteration.
for (j in 1:ncol(df)){
  outnumtext <- paste0('out_value <- boxplot.stats(df$', colnames(df[j]), ')$out')
  eval(parse(text = outnumtext))
  for (k in 1:nrow(df)){
    replacetext <- paste0('
      if ((df[', k, ',', j, '] %in% out_value) & !(is.na(df[', k, ',', j, ']))) {
        df[', k, ',', j, '] <- NA
      } else if (is.na(df[', k, ',', j, '])) {
        next
      } else {
        next
      }')
    eval(parse(text = replacetext))
  }
}
I discovered that using for loops in R and iterating over every row of every column slows things down considerably. Are there any alternatives to this?
Thank you very much in advance!
Edit P/S: The real code is not just replacing outliers with NA; it handles several cases based on several conditions (where the if and else if branches are executed accordingly). However, my goal is to find an alternative that reduces the running time, so I simplified my original code as much as possible to get to the main point.
You don't want to use loops for this. You could try dplyr::mutate_all().
It will still be slow over 300K+ rows, but should be better than the loop.
library(dplyr)
df <- df %>%
mutate_all(funs(ifelse(. %in% boxplot.stats(.)$out, NA, .)))
Example:
exdata <- structure(list(x = c(200, 6, 8, 2, 7, 1, 4, 9, 3, 5, 1000),
y = c(300, 1, 18, 3, 2, 16, 14, 9, 11, 6, 100)),
row.names = c(NA, -11L),
class = "data.frame")
exdata
x y
1 200 300
2 6 1
3 8 18
4 2 3
5 7 2
6 1 16
7 4 14
8 9 9
9 3 11
10 5 6
11 1000 100
exdata %>%
mutate_all(funs(ifelse(. %in% boxplot.stats(.)$out, NA, .)))
x y
1 NA NA
2 6 1
3 8 18
4 2 3
5 7 2
6 1 16
7 4 14
8 9 9
9 3 11
10 5 6
11 NA NA
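funs() has since been deprecated in dplyr; here is a sketch of the same idea with across() (assuming dplyr >= 1.0.0 and numeric columns, since boxplot.stats() needs numbers; the exdata_clean name is just illustrative):
library(dplyr)
# replace() swaps every value flagged as an outlier by boxplot.stats() for NA
exdata_clean <- exdata %>%
  mutate(across(everything(),
                ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)))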

Subsetting Longitudinal Dataframe by Randomly Selected Participant ID

I'd like to subset a longitudinal dataset by a randomly sampled number of participants. In this example there are three entries per participant and I want to sample 4 participants.
id <- rep(c(1:6), each = 3)
score <- rnorm(18, 10, 3)
group <- rep(c("a", "b"), each = 3, times = 3)
df <- data.frame(id, group, score)
I tried with this command...
dfSub <- df[df$id %in% sample(df$id, 4, replace = FALSE),]
But it only returns the entries for three participants, not the four I stipulated. Can anyone tell me why this didn't work and how to do it better?
We can use unique
df[df$id %in% sample(unique(df$id), 4, replace = FALSE),]
# id group score
#7 3 a 8.123872
#8 3 a 12.685344
#9 3 a 12.824781
#10 4 b 11.868296
#11 4 b 13.000660
#12 4 b 9.541258
#13 5 a 9.722255
#14 5 a 3.889751
#15 5 a 10.851232
#16 6 b 10.945997
#17 6 b 11.632380
#18 6 b 3.289507
The OP's command didn't work out because of the following
sample(c(1, 1, 4,3), 3, replace=FALSE)
#[1] 3 4 1
sample(c(1, 1, 4,3), 3, replace=FALSE)
#[1] 1 3 1
If there are duplicate values, sample can still return duplicates rather than the unique values for the size specified. The replace argument only controls whether the sampling is done with replacement or not. In the dummy example there are two 1s, so even with replace = FALSE the sample can contain up to two 1s.
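For completeness, here is a hedged dplyr sketch of the same idea (assuming dplyr >= 1.0.0 for slice_sample(); sampled_ids is just an illustrative name):
library(dplyr)
# sample 4 distinct ids, then keep every row belonging to those ids
sampled_ids <- df %>% distinct(id) %>% slice_sample(n = 4)
dfSub <- df %>% semi_join(sampled_ids, by = "id")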

Reverse Scoring Items

I have a survey of about 80 items. Most items are valenced positively (higher scores indicate a better outcome), but about 20 of them are negatively valenced, and I need a way to reverse score the negatively valenced ones in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in a direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting #HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data, this package has lots of good functions. Building on #eipi10's data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the pysch package
# mini is the minimum value and maxi is the maximum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro to this approach is that you can recode your entire survey in one function. The con to this is you need a library. The stock R approach is more elegant as well.
Just converting #eipi10's answer using tidyverse:
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5,10, replace=TRUE),
Q2 = sample(1:5,10, replace=TRUE),
Q3 = sample(1:5,10, replace=TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
library(dplyr)
dat <- dat %>%
  mutate(Q2Reversed = 6 - Q2,
         Q3Reversed = 6 - Q3)
Another example is to use recode in library(car).
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by #eipi10 and the same goal of reversing q1 and q3:
set.seed(1)
dat <- data.frame(q1 =sample(1:5,10,replace=TRUE),
q2=sample(1:5,10,replace=TRUE),
q3 =sample(1:5,10,replace=TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments as NULL and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
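If you want the original column names back afterwards, a small sketch (this just strips the trailing dash that reverse.code() appends; dat1 is the result from above):
# remove the trailing "-" from reversed column names
colnames(dat1) <- sub("-$", "", colnames(dat1))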
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(dplyr)
library(magrittr)
max <- 5
df <- data.frame(score = sample(1:max, 20, replace = TRUE))
df <- mutate(df, question = rownames(df))
df
df[c(4,13,17),] %<>% mutate(score = max + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
  # get maximum value per desired col: lapply(data[rev], max)
  # subtract values in cols to reverse-code from max value plus 1
  data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
  return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by a response from @catastrophic-failure on subtracting the max of a column from all entries in that column in R.
