I have the following database (as an example)
participants <- c(1:60)
codes <- c(1:60)
database <-cbind(participants, codes)
The second variable, codes, contains the emails linked to each participant ID, although for example purposes I just filled it with numbers.
I have 60 participants, each with a participant ID and an email tied to that ID (here a number from 1 to 60). As such, in the example, row 1 is 1, 1 and so on.
I need to divide the list into 3 groups of equal size, e.g. 20 participants per group.
The way I am doing it now is
#generating list of participants
participants <- c(1:60)
codes <- c(1:60)
database <-cbind(participants, codes)
#randomizing the order of the list
randomized_list <- sample(participants)
#Extracting the three groups
group1 <- randomized_list[c(1:20)]
group2 <- randomized_list[c(21:40)]
group3 <- randomized_list[c(41:60)]
This leaves me to get the email addresses and divide the lists more or less by hand (comparing groups 1, 2 and 3 against database and making the link).
Is there a more elegant and compact solution for achieving the results I seek?
First assign rows randomly to groups and then use that to access the groups.
# generate random group labels
r_labels <- sample(1:3, nrow(database), replace = TRUE)
# group 1
database[r_labels == 1, ]
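Note that drawing labels with replace = TRUE gives only approximately equal group sizes. If you need exactly 20 per group, a minimal sketch that shuffles a balanced label vector and uses split() (building the database as a data.frame so each group keeps its codes column):

```r
participants <- 1:60
codes <- 1:60
# a data.frame (rather than cbind) so split() returns data frames
database <- data.frame(participants, codes)

# exactly 20 of each label 1, 2, 3, in random order
r_labels <- sample(rep(1:3, length.out = nrow(database)))

# a list of three 20-row data frames, emails (codes) included
groups <- split(database, r_labels)
sapply(groups, nrow)
```

Each element of groups is a ready-made data frame, so there is no manual matching back against database.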
Related
I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Values from the whole dataset.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
Create a data frame that defines the hierarchy, and assigns each possibility a numeric score.
hi <- data.frame(Poss = unique(df$Values), Nums = c(105, 104))
In this case, the value 1 gets a numeric score of 105 and the value 2 gets a score of 104 (so 1 would be preferred over 2 in the case of a tie).
Join the hierarchy to the original data frame.
library(dplyr)
matched <- left_join(df, hi, by = c("Values" = "Poss"))
Then, add a frequency column to the joined data frame that records the number of times each unique Set-Values combination occurs (this uses data.table).
library(data.table)
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row for each Set-Values combination, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set (ascending) and then mult (descending), and use distinct() to keep the row with the highest score for each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of the lower-ranked value (numeric score = 104) are added together (3 instances would give it a total score in the mult column of 312), but whenever the two values occur at the same frequency, the higher-ranked value wins out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure they are spaced out enough for the dataset at hand. For example, using 2 and 1 doesn't work, because it would take 3 instances of the lower-ranked value to trump the higher-ranked one, instead of only 2. Likewise, if you anticipate large differences in the frequencies, use 1005 and 1004, since with the scores above the higher score eventually overwhelms a genuine frequency lead (200 * 104 is less than 199 * 105).
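Putting the whole pipeline together, a minimal end-to-end sketch (assuming dplyr and data.table are installed, and the Values-based hierarchy described above):

```r
library(dplyr)
library(data.table)

df <- data.frame(Set = c("A", "A", "A", "B", "B", "B", "B"),
                 Values = c(1, 1, 2, 1, 1, 2, 2))

# Hierarchy: value 1 scores 105, value 2 scores 104 (1 wins ties)
hi <- data.frame(Poss = c(1, 2), Nums = c(105, 104))

matched <- left_join(df, hi, by = c("Values" = "Poss"))
setDT(matched)[, freq := .N, by = c("Set", "Values")]

# one row per Set-Values combo, then weight frequency by score
multiplied <- distinct(as.data.frame(matched), Set, Values, .keep_all = TRUE)
multiplied$mult <- multiplied$Nums * multiplied$freq

# highest mult wins within each Set
check <- multiplied[order(multiplied$Set, -multiplied$mult), ]
final <- distinct(check, Set, .keep_all = TRUE)
final[, c("Set", "Values")]
```

For Set A the plain mode (1) wins, and for Set B the 2-vs-2 tie is broken in favor of value 1 (210 > 208).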
Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations with each individual given a population name and a corresponding number assigned to that population as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals), with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe, repeat this 100 times, and save each result as a separate dataframe (i.e. df1, df2, df3, etc.). The end result is 100 dfs, each with one population removed at random. The next step is to repeat this 100 times removing two random populations, then 3 random populations, and so on.
Any help would be greatly appreciated!!
You can write a function which takes a dataframe as input along with n, i.e. the number of Popnum values to remove.
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
To pass different n in remove_n_Popnum function you can use lapply
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along generates a sequence which is 1 less than the number of unique values.
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
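A runnable sketch with the sample rows above (just the 8 individuals shown, so the data here is a stand-in for the real 12-population file):

```r
df <- data.frame(
  Individual = c("ALM16-014", "ALM16-024", "ALM16-026", "ALM16-003",
                 "ALM16-022", "ALM16-075", "ALM16-076", "ALM16-089"),
  Population = c("AimeesMdw", "AimeesMdw", "AimeesMdw", "AMKRanch",
                 "AMKRanch", "BearPawLake", "BearPawLake", "BearPawLake"),
  Popnum = c(1, 1, 1, 2, 2, 3, 3, 3)
)

remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}

set.seed(42)
# 100 replicates, each dropping one randomly chosen population
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
length(list_data)
```

Every element of list_data keeps exactly two of the three populations, with all their rows intact.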
I have a data set of 12.5 million records and I need to randomly select about 2.5 million. However, these individuals are in 55284 groups and I want to keep groups intact.
So basically I want to remove groups until I've got 2.5 million records left OR select groups until I have about 2.5 million individuals.
If this is my data:
data <- data.frame(
  id = c(1, 2, 3, 4, 5),
  group = c(1, 1, 2, 2, 3)
)
I wouldn't want to remove id 1 and keep id 2; I'd like to either keep them both or discard them both, because they are in the same group (1).
So ideally, this function randomly selects a group, counts its individuals, puts them in a data set, then does the same thing again, keeping count of the individuals until it has about 2.5 million (it is okay to stop adding groups once n exceeds 2.5 million).
I haven't been able to find a function and I am not yet skilled enough to put something together myself, unfortunately.
Hope someone can help me out!
Thanks
Too long for a comment hence answering. Do you need something like this ?
#Order data by group so rows with same groups are together
data1 <- data[order(data$group), ]
#Get all the groups in first 2.5M entries
selected_group <- unique(data1$group[1:2500000])
#Subset those groups so you have all groups intact
final_data <- data1[data1$group %in% selected_group, ]
For a random approach, we can use a while loop:
#Get all the groups in the data
all_groups <- unique(data$group)
#Variable to hold row indices
rows_to_sample <- integer()
#While the number of rows to subset is less than 2.5M
while (length(rows_to_sample) <= 2500000) {
  #Select one random group
  select_group <- sample(all_groups, 1)
  #Get row indices of that group
  rows_to_sample <- c(rows_to_sample, which(data$group == select_group))
  #Remove that group from all_groups
  all_groups <- setdiff(all_groups, select_group)
}
data[rows_to_sample, ]
Here is a possibility. I demonstrate it using toy data and a threshold of 33 (instead of 2.5 million). First I create the toy group vector:
threshold <- 33
set.seed(111)
mygroups <- rep(1:10, rpois(10, 10))
In this toy example group 1 has 10 individuals, group 2 has 8 individuals and so on.
Now I put the groups in random order and use cumsum to determine when the threshold is exceeded:
x <- cumsum(table(mygroups)[sample(1:10)])
randomgroups <- as.integer(names(x[x <= threshold]))
randomgroups
[1] 1 7 5
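The selected group labels can then be used to subset the original data, so the cumsum idea end to end looks like this (same toy sizes and threshold of 33):

```r
set.seed(111)
threshold <- 33
mygroups <- rep(1:10, rpois(10, 10))
data <- data.frame(id = seq_along(mygroups), group = mygroups)

# shuffle the group sizes, then keep groups while the running total stays under the threshold
x <- cumsum(table(mygroups)[sample(1:10)])
randomgroups <- as.integer(names(x[x <= threshold]))

# keep every row of the selected groups, so groups stay intact
final_data <- data[data$group %in% randomgroups, ]
nrow(final_data)
```

Because whole groups are kept or dropped, the total lands just under the threshold rather than exactly on it.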
I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and, for each of the 2 million patients, create a new binary variable column for each unique drug mention, setting its value to 1 for that patient. So a value of 1 for the binary variable Heroin indicates that heroin is mentioned somewhere in that patient's DRUGID_1...16.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
  for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
    if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
      if (!is.element(DRUGID.df$allDAWN.DRUGID_1, respDRUGID)){
        Count <- Count + 1
        respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
        assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)
      } else {
        assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)
      }
    }
  }
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID), as well as a counter (Count) for the total number of unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations; if the value is not NA and DRUGID_1 is not already in respDRUGID, it should create a new column variable r.DRUGID, set its value to 1, and increase the unique count by 1. Otherwise, if the value of DRUGID_1 is already in respDRUGID, it should just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format, using the tidyverse package:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, value = "DRUGID", ... = -patient, na.rm = TRUE) %>%
  arrange(patient, DRUGID) %>%
  group_by(patient) %>%
  summarize(DRUGIDs = paste(DRUGID, collapse = ","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
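If what you ultimately need is one binary column per drug (keeping the one-row-per-patient shape for the survey-weight merge), the same reshaped data can be spread back out. A sketch using the guessed data above; the r. column prefix is just an assumed naming convention:

```r
library(dplyr)
library(tidyr)

drug_df <- read.csv(text = '
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')

wide <- gather(drug_df, key = "slot", value = "DRUGID", -patient, na.rm = TRUE) %>%
  distinct(patient, DRUGID) %>%                      # one row per patient-drug mention
  mutate(flag = 1, DRUGID = paste0("r.", DRUGID)) %>%
  spread(DRUGID, flag, fill = 0) %>%                 # one binary column per drug
  arrange(patient)
wide
```

fill = 0 supplies the zeros for patients who never mention a drug, so the result merges cleanly on patient.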
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
I have numbers starting from 1 to 6000 and I want it to be separated in the manner listed below.
1-10 as "Range1"
11-20 as "Range2"
21-30 as "Range3"
.
.
.
5991-6000 as "Range600".
I want to bin the numbers into equal intervals of width 10 and then find which range is repeated the most.
How can we solve this in R?
You should use the cut function and then table can determine the counts in each category and sort in order of the most prevalent.
x <- 1:6000
x2 <- cut(x, breaks = seq(0, 6000, by = 10), labels = paste0('Range', 1:600))
sort(table(x2), decreasing = TRUE)
There is a maths trick to your question. If you want categories of length 10, round(x/10) will create a category in which 0-5 become 0, 6 to 14 become 1, 15 to 24 become 2, etc. If you want the categories 1-10, 11-20, etc., you can use round((x+4.1)/10).
(R rounds exact halves to the nearest even number, so round(0.5) = 0 but round(1.5) = 2; the 4.1 offset keeps the values away from exact halves.)
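For positive integers the 4.1 offset trick gives the same bins as ceiling(x/10), which may read more clearly; the equivalence check below is my addition, not part of the original answer:

```r
x <- 1:6000

# category via the rounding trick
cat_round <- round((x + 4.1) / 10)

# the same binning written directly
cat_ceiling <- ceiling(x / 10)

all(cat_round == cat_ceiling)  # TRUE
```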
Not the most elegant code but maybe the easiest to understand, here is an example:
# Create randomly 50 numbers between 1 and 60
x = sample(1:60, 50)
# Regroup in a data.frame and add a column count containing the value one for each row
df <- data.frame(x, count=1)
df
# create a new column with the category
df$cat <- round((df$x+4.1)/10)
# If you want it as text:
df$cat2 <- paste("Range",round((df$x+4.1)/10), sep="")
str(df)
# Calculate the number of values in each category
freq <- aggregate(count~cat2, data=df, FUN=sum)
# Get the maximum number of values in the most frequent category(ies)
max(freq$count)
# Get the category(ies) name(s)
freq[freq$count == max(freq$count), "cat2"]