I need to divide the data in DF1 into groups based on their class. In some cases everything in a class will be in the same group. A class needs to be divided into groups at random, but not in equal shares. DF2 holds the data that gives the shares by which the data needs to be divided.
DF2 is imported by me from Excel. This file is maintained by me, and if needed you can make changes to the structure of the data. This is the file I will use to divide the classes into groups. The Share column tells me how much of the class must go into each group. For example, 50% of the rows in DF1 with class 1 must be divided into Apples, 25% into Hammers and 25% into Car.
NB! It needs to be random; it can't be that the first 50% of rows are Apples, the next 25% Hammers, etc.
My solution is to give every row in DF1 a random number that I save every time I generate it, so I can go back and use the seed I got before.
NB! It's important to me that I can go back to the previous random numbers if a colleague or I runs the code by mistake and creates a new random seed. I have that part covered in the case of the random number.
DF1 (base data)
ID Class Random
1 1 0.65
2 1 0.23
3 2 0.45
4 1 0.11
5 2 0.89
6 3 0.12
7 1 0.9
My solution is to make a Share_2 column where I divide the 0-1 range into intervals based on the Share column. In Excel logic I would do the following:
IF Class = 1 then
IF Random < 0.5; Apples; if not then
IF Random < 0.75; Hammer; if not then
IF Random < 1; Car
DF2 (Classification file made by me)
Class Group Share Share_2
1 Apples 50%* 0,5
1 Hammer 25% 0,75
1 Car 25% 1
2 Building 100%** 1
3 Computer 50% 0,5
3 Hammer 50% 1
*This means that 50% of class 1 needs to be "Apples". The shares within a class sum to 100% in total.
**This means that all of class 2 goes into a single group.
I need
DF3
ID Class Random Group
1 1 0.65 Hammer
2 1 0.23 Apples
3 2 0.45 Building
4 1 0.11 Apples
5 2 0.89 Building
6 3 0.12 Computer
7 1 0.9 Car
My problem is that I don't know how to write this in R. Can you please help me?
NB! Please feel free to offer other methods of solving my problem, as long as the division of a class stays random and I can save the randomness to replicate it.
One way to go about this that does not use the random numbers you have already generated, but is otherwise fairly short, is to use the sample() function to do the random assignment directly for you:
DF1 <- data.frame(
  ID = 1:7,
  Class = c(1, 1, 2, 1, 2, 3, 1),
  Random = c(0.65, 0.23, 0.45, 0.11, 0.89, 0.12, 0.9)
)
DF1 <- DF1[order(DF1$Class), ] #EDIT: need this for the code to behave properly!
DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5),
  Share_2 = c(0.5, 0.75, 1, 1, 0.5, 1)
)
set.seed(12345) # this is for reproducibility; you can choose any number here
DF3 <- DF1
DF3$Group <- unlist(sapply(unique(DF1$Class), function(x) {
  with(DF2[DF2$Class == x, ],
       sample(Group, size = sum(DF3$Class == x),
              prob = Share, replace = TRUE))
}))
Working from the outside in: the sapply() call serves essentially the role of a for loop. It begins by looking at all the unique entries in DF1$Class. For each of those (called x), it carves out the chunk of DF2 that has Class equal to x, and then focuses only on that chunk of DF2 -- this is what the with() function is doing here.
The core idea is to use sample(). We draw the things to sample from the Group column of DF2, draw an appropriate number of samples (marked by the size parameter), set the probabilities according to the Share column of DF2, and draw with replacement. All of this makes sense because we are inside the with() function; we have already restricted our attention to not only DF2, but just the chunk of DF2 corresponding to Class == x.
The unlist() function is used because the output of the sapply() function is a list in this case, and we want it just to be a vector; then, we just glue that vector directly onto the DF3 data frame, which is otherwise an identical copy of DF1.
EDIT: I added a line sorting DF1, which is necessary for this solution.
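As a quick sanity check (my own addition, not part of the answer), you can compare the realized shares within each class against DF2$Share:

# realized group shares per class; each row should be close to DF2$Share
prop.table(table(DF3$Class, DF3$Group), margin = 1)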
Actually I don't like this solution, since I pipe two filter() calls and don't know how to do it in one statement.
Using dplyr and @Aaron Montgomery's data:
library(dplyr)

merge(DF1, DF2, by = "Class") %>%
  group_by(Class, ID) %>%
  filter(Random <= Share_2) %>%
  filter(Share_2 == min(Share_2)) %>%
  select(-c(Share, Share_2)) %>%
  arrange(ID)
gives
# A tibble: 7 x 4
# Groups: Class, ID [7]
Class ID Random Group
<dbl> <int> <dbl> <chr>
1 1 1 0.65 Hammer
2 1 2 0.23 Apples
3 2 3 0.45 Building
4 1 4 0.11 Apples
5 2 5 0.89 Building
6 3 6 0.12 Computer
7 1 7 0.9 Car
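For what it's worth, the two filter() calls can plausibly be collapsed into a single statement along these lines (a sketch on the same data: keep, per ID, the row whose Share_2 is the smallest breakpoint at or above Random):

library(dplyr)

merge(DF1, DF2, by = "Class") %>%
  group_by(Class, ID) %>%
  filter(Share_2 == min(Share_2[Random <= Share_2])) %>%
  select(-c(Share, Share_2)) %>%
  arrange(ID)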
I have seen some similar questions, but none of them was exactly the same as the thing I want to do - which is why I am asking.
I have a dataframe (dummy_data) which contains the indices of some observations (obs) for given subjects (ID). The dataframe contains only the meaningful data (in other words: the desired conditions are met). The last column in this example data holds the total number of observations (total_obs).
ID <-c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
obs <- c(1,2,3,5,6,3,4,5,7,8,9,12,16,1,2,4,5,6,7,8,2,4,6,7,8,10,13,14,15,3,4,6,7,11)
total_obs <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
dummy_data <- data.frame(ID, obs, total_obs)
I would like to create a new column (interval) with 3 possible values: "start", "center", "end" based on following condition(s):
it should split the total number of observations (total_obs) into 3 groups (based on the indices, from the 1st to the last, where the last is the value stored in the total_obs column) and assign the interval value according to the indices stored in the obs column.
Here is the expected output:
ID <- c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
segment <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_segments <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
interval <- c("start","start","center","end","end","start","start","start","center","center","center","end","end","start","start","center","center","center","end","end","start","start","start","center","center","center","end","end","end", "start","start","center","center","end")
wanted_data <- data.frame(ID, segment, total_segments, interval)
I would like to use dplyr::ntile() with dplyr::mutate() and dplyr::case_when(), but I could not make my code function properly. Any solutions?
You just need dplyr::mutate() and dplyr::case_when().
The following should give you something to work off of.
dummy_data %>%
  mutate(interval = case_when(obs < (total_obs / 3) ~ "start",
                              obs < 2 * (total_obs / 3) ~ "center",
                              TRUE ~ "end"))
# TRUE ~ "end" is the 'else' case, used when everything else is FALSE
Which gives slightly different results.
I think more careful deliberation should be made regarding where the endpoints are for each interval, but if you know what you are doing, using a combination of <=, %/%, and ceiling() should give you the result you desire.
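For instance, here is a hedged sketch of those ceiling()-based cutoffs (my own illustration, assuming the "start" bucket closes at ceiling(total_obs/3) and the "center" bucket at ceiling(2*total_obs/3)); on the sample data this reproduces the expected interval column shown in the question:

library(dplyr)

dummy_data %>%
  mutate(interval = case_when(obs <= ceiling(total_obs / 3) ~ "start",
                              obs <= ceiling(2 * total_obs / 3) ~ "center",
                              TRUE ~ "end"))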
First, because dummy_data$obs is identical with wanted_data$segment, and dummy_data$total_obs is identical with wanted_data$total_segments, you just need to rename these columns.
For the interval column, here is one approach to creating it:
group the data based on the total_segments column
create a column, say tile, and fill it with ntile(segment, 3) results.
create the interval column, and use case_when() to fill it with the category labels derived from tile. That is, fill interval with "start" when tile is 1, "center" when 2, and "end" when 3.
drop the tile column.
library(dplyr)

wanted_data <- dummy_data %>%
  rename(segment = obs, total_segments = total_obs) %>%
  group_by(total_segments) %>%
  mutate(tile = ntile(segment, 3)) %>%
  mutate(interval = case_when(tile == 1 ~ "start",
                              tile == 2 ~ "center",
                              tile == 3 ~ "end")) %>%
  select(-tile)
wanted_data
# A tibble: 34 × 4
# Groups: total_segments [5]
ID segment total_segments interval
<chr> <dbl> <dbl> <chr>
1 item_001 1 6 start
2 item_001 2 6 start
3 item_001 3 6 center
4 item_001 5 6 center
5 item_001 6 6 end
6 item_452 3 16 start
7 item_452 4 16 start
8 item_452 5 16 start
9 item_452 7 16 center
10 item_452 8 16 center
# … with 24 more rows
It's slightly different from the wanted_data$interval you showed, but based on your comment, the division into categories should work just as dplyr::ntile() does it.
I'm using the epiR package, as it does nice 2-by-2 contingency tables with odds ratios and population attributable fractions.
As is common, my data is coded
0 = No
1 = Yes
So when I do
table(var_1, var_2)
The output comes out as a table with the 0 (No) row and column first. For its input, though, epiR wants the top-left cell to be Exposed +ve / Outcome +ve - i.e. the top-left cell should be var_1 == 1 and var_2 == 1.
Currently I do this by recoding the zeroes to 2, or alternatively by setting the variables as factors and using relevel(). Both of these are slightly annoying for other analyses, as in general I want Outcome +ve to come after Outcome -ve.
So I wondered if there is an easy way (within table()?) to flip the orientation of the table so that it essentially inverts the ordering of the rows/columns?
Hope the above makes sense - happy to provide clarification if not.
Edit: Thanks for the suggestions below; just for clarification, I want to be able to do this when calling table() on existing dataframe variables - i.e. when what I am doing is table(data$var_1, data$var_2) - ideally without having to create a whole new object.
A table is a simple matrix, so you can just subscript it with the indices in reverse order.

xy <- table(data.frame(value = rbinom(100, size = 1, prob = 0.5),
                       variable = letters[1:2]))
xy
#      variable
# value  a  b
#     0 20 22
#     1 30 28

xy[2:1, 2:1]
#      variable
# value  b  a
#     1 28 30
#     0 22 20
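Applied directly to the situation in the question (data$var_1 and data$var_2 are the names used there), the subscripting can be chained onto the call itself, with no intermediate object:

table(data$var_1, data$var_2)[2:1, 2:1]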
Using factor levels:
# reproducible example (adapted from Roman's answer)
df1 <- data.frame(value = rbinom(100, size = 1, prob = 0.5),
                  variable = letters[1:2])
table(df1)
#      variable
# value  a  b
#     0 32 23
#     1 18 27

# convert to factor, specify the level order
df1$value <- factor(df1$value, levels = c("1", "0"))
df1$variable <- factor(df1$variable, levels = c("b", "a"))
table(df1)
#      variable
# value  b  a
#     1 27 18
#     0 23 32
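If you'd rather not change the factors stored in the data frame, one variant (a sketch on the same df1) is to relevel inline, inside the call itself:

table(value = factor(df1$value, levels = c("1", "0")),
      variable = factor(df1$variable, levels = c("b", "a")))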
I'm trying to transfer some work previously done in Excel into R. All I need to do is transform two basic count_if formulae into readable R script. In Excel, I would use three tables and calculate across them using 'point-and-click' methods, but now I'm lost as to how I should address this in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
  Operator = c("A", "B", "C"),
  Locations = c(850, 575, 2175)
)

AreaData <- data.frame(
  Area = c("Torbay", "Torquay", "Tooting", "Torrington", "Taunton", "Torpley"),
  SumLocations = c(1000, 500, 500, 250, 600, 750)
)

OperatorAreaData <- data.frame(
  Operator = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Area = c("Torbay", "Tooting", "Taunton",
           "Torbay", "Taunton", "Torrington",
           "Tooting", "Torpley", "Torquay", "Torbay", "Torrington"),
  Locations = c(250, 400, 200,
                100, 400, 75,
                100, 750, 500, 650, 175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one giving the count of areas the operator operates in, and another giving the number of areas in which that operator operates and owns more than 50% of the locations.
So the new resulting dataframe would look like this
Operator Locations AreaCount Own_GE_50percent
A 850 3 1
B 575 3 1
C 2175 5 4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$"AreaCount" <- cbind(OpAreaCount$AreaCount)
This is fairly straightforward, but I'm stuck on how to calculate the second column, with its 50% condition.
library(dplyr)
OperatorAreaData %>%
  inner_join(AreaData, by = "Area") %>%
  group_by(Operator) %>%
  summarise(AreaCount = n_distinct(Area),
            Own_GE_50percent = sum(Locations > (SumLocations / 2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
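For comparison, a base-R sketch of the same aggregation (my own addition, using the sample data frames above):

m <- merge(OperatorAreaData, AreaData, by = "Area")
data.frame(Operator = sort(unique(m$Operator)),
           AreaCount = as.vector(table(m$Operator)),
           Own_GE_50percent = as.vector(tapply(m$Locations > m$SumLocations / 2,
                                               m$Operator, sum)))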
I've got a numeric variable ranging from 1 (min) to 5 (max), recorded across 8 different variables. Therefore, the first row would look like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
4 4 1 4 5 4 4 1
I've computed (row-wise) a median value for each row across the 8 variables. Occasionally, the median will be a midpoint value, for example 4.5 (since there is an even number of variables). Therefore the resulting row might look like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Median
1 2 3 4 5 5 5 5 4.5
When I call table() on the median values stored in the Median variable, I get this:
table(df$Median)
1 1.5 2 2.5 3 3.5 4 4.5 5
2 3 10 5 25 17 75 53 87
The issue I am trying to overcome is that I wish to "get rid" of the midpoint/decimal values by folding them into the nearest whole values; however, if I simply use round(), then I end up biasing the values (as by definition 4.5 sits exactly in between), like this:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
What I was thinking of doing is to round the midpoint values based on the proportions of the whole (non-decimal) values in the table (excluding the midpoint values).
So I would get the proportions of the whole values using dplyr's filter():
df %>% filter(Median %% 1 == 0) %>%
  select(Median) %>% table() %>% prop.table()
To get:
1 2 3 4 5
0.01005025 0.05025126 0.12562814 0.37688442 0.43718593
The next step requires constructing a function that takes all midpoint values in the Median variable and rounds them to their nearest whole values while keeping the proportions of the whole values intact, or at least close to the original ones. For example, the nearest values to 4.5 are 4 and 5, so it would have a chance of becoming 4 proportional to 0.37688442 and of becoming 5 proportional to 0.43718593. This way I would transform the midpoint values into whole values, but without the bias of simply using round().
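Here is a sketch of such a function (my own illustration, not tested beyond the idea; it assumes every midpoint's two neighbouring whole values actually occur in the data):

# shares of the whole values, as computed above
props <- prop.table(table(df$Median[df$Median %% 1 == 0]))
round_by_share <- function(m) {
  if (m %% 1 == 0) return(m)           # whole values pass through unchanged
  lo <- floor(m); hi <- ceiling(m)
  p <- props[as.character(c(lo, hi))]  # observed shares of the two neighbours
  sample(c(lo, hi), size = 1, prob = p)
}
set.seed(12345)  # store the seed so the draw can be reproduced
df$Rounded <- vapply(df$Median, round_by_share, numeric(1))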
An alternative approach is to split the values equally between 4 and 5, so that 50% of the observations with value 4.5 go to 4 and 50% go to 5.
I am thankful for any suggestions that would help me to solve this problem or get to the point I can start developing the function.
Edit1. Provided my own attempt to answer this question.
Edit2. Provided data.
dput(head(df, 15))
structure(list(uniqueID = c("R_AtXpiwxKPvILFv3", "R_2xwP4iz6UAu1fTj",
"R_b8IXGRKHP58x7GR", "R_ZelynHN8PCxxYyt", "R_PNjIc7h4dHebRgR",
"R_2bTZvYLUuKNC22D", "R_3iLqwuDs493HstB", "R_291dITimLKjYXeL",
"R_YWWGleFLxlIYzrX", "R_3st91vjNWNXlTHt", "R_3Mm8P52gaaxIpwD",
"R_3MxHXTnrncpgnB8", "R_1LqDx1uxReOQHvO", "R_vJEGJDmbqdfO7qF",
"R_3q8Wl8qys6nqxBH"), Median = c(4, 4.5,
1, 4, 5, 4.5, 4, 1.5, 4.5, 4, 3.5, 2, 4.5, 4.5, 3.5)), .Names = c("uniqueID",
"Median"), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
I'd implement it like this:
round_randomly = function(x, tolerance = 1e-6) {
  round(x + sample(c(-tolerance, tolerance), size = length(x), replace = TRUE))
}
Calling your sample data dd,
table(round_randomly(dd$Median))
# 1 2 4 5
# 1 2 8 4
Any tolerance value less than 0.5 will behave the same if your data contains only integers and 0.5 midpoints. If you have more continuous data, a smaller tolerance is better (to prevent, say, 4.4 from being jittered up to 4.51 and rounded to 5). I set the default to 1e-6, which seems reasonable; only a value greater than 4.499999 might get rounded up to 5.
Your answer goes to quite a bit of trouble to add a random value only to the midpoints - this isn't necessary because of the rounding. If the original value is 4, then 4.000001 still rounds to 4 (even if you set the tolerance to 0.4, 4.4 still rounds to 4).
My method makes no guarantees about rounding exactly 50% of midpoints up and 50% down, but each midpoint is rounded up and down with equal probability. Unless you have very little data and an unusually skewed random draw, that should be close enough.
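As a usage note tying this back to the reproducibility requirement in the question: calling set.seed() with a stored seed immediately before the jitter makes the draw repeatable (the seed value and column name here are just for illustration):

set.seed(12345)  # reuse the stored seed to reproduce the same rounding
df$Median_rounded <- round_randomly(df$Median)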
Following a suggestion from the comments, I've attempted to create a function that randomly adds 0.1 to, or subtracts 0.1 from, all midpoint Median values. It's not exactly the most elegant function ever, but it does the job. One caveat is that the randomization works by randomly sampling a fraction of the dataset and adding 0.1 to it, so the remaining unsampled fraction automatically gets 0.1 subtracted. It would be more elegant to do this for every value individually, but I would have to explore that option.
The function:
library(dplyr)

randomize_midpoint <- function(dataset, new_random_median) {
  # Prepare the new column name for mutate()
  new_random_median <- enquo(new_random_median)
  # Get sample A: a random half of the midpoint rows
  sample_A <- dataset %>%
    filter(Median %% 1 != 0) %>%          # get midpoint values
    sample_frac(0.5, replace = FALSE) %>% # randomly sample 50% of them
    select(uniqueID, Median)              # anti_join needs a unique identifier
  # Get sample B: the remaining midpoint rows, via anti_join
  sample_B <- dataset %>%
    filter(Median %% 1 != 0) %>%
    anti_join(sample_A) %>%               # anti_join automatically uses uniqueID
    select(uniqueID, Median)
  # Create the opposite of %in%
  "%w/o%" <- Negate("%in%")
  # Mutate Median according to the conditions in case_when()
  dataset %>% mutate(
    !!quo_name(new_random_median) := case_when(
      uniqueID %in% sample_A$uniqueID ~ round(Median + 0.1),
      uniqueID %in% sample_B$uniqueID ~ round(Median - 0.1),
      uniqueID %w/o% c(sample_A$uniqueID, sample_B$uniqueID) ~ Median
    )
  )
}
The output of the function to compare with previous table():
randomize_midpoint(dataset = df, new_random_median = random_med) %>%
  select(random_med) %>%
  table()
Will return:
Joining, by = c("uniqueID", "Median")
1 2 3 4 5
2 16 36 110 113
Previous table:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier and is the same in each data frame; columns 2 and 3 hold different information. I'm trying to merge these two data frames into one new data frame that has columns 1, 2 and 3, and in which the values in columns 2 and 3 are combined: i.e. column 2 of the new data frame contains [data frame A column 2 + data frame B column 2].
Example:
dfA <- data.frame(Name = c("John", "James", "Peter"),
                  Score = c(2, 4, 0),
                  Response = c("1,0,0,1", "1,1,1,1", "0,0,0,0"))

dfB <- data.frame(Name = c("John", "James", "Peter"),
                  Score = c(3, 1, 4),
                  Response = c("0,1,1,1", "0,1,0,0", "1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should result in:
dfNew <- data.frame(Name = c("John", "James", "Peter"),
                    Score = c(5, 5, 4),
                    Response = c("1,0,0,1,0,1,1,1", "1,1,1,1,0,1,0,0", "0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge(), but that simply appends the columns (much like cbind).
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that some columns need addition, whereas others require concatenation separated by a comma (the columns requiring addition are numeric, the others text, which might make it easier?).
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if the data are ordered properly)
df <- merge(x = dfA[, c("Name", "Response")], y = dfB[, c("Name", "Response")],
            by = "Name")
# rename
names(df) <- c("Name", "A", "B")
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
                             seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[, -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
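As an aside, newer versions of splitstackshape deprecate concat.split.multiple in favour of cSplit; the equivalent call should look roughly like this (untested sketch):

library(splitstackshape)
df2 <- cSplit(df, splitCols = c("A", "B"), sep = ",", direction = "wide")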
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[, "Name", drop = FALSE]
for (n in cols) {
  switch(class(dfA[[n]]),
         "numeric" = { dfNew[[n]] <- dfA[[n]] + dfB[[n]] },
         "factor" = ,
         "character" = { dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep = ",") })
}
This solution is basically what you had as your idea, but with a loop. The columns are inspected to see whether they are numeric (add them numerically) or a string/factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric columns and one for the character ones, but this approach is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption:
dfNew <- merge(dfA, dfB, by = "Name")
for (n in cols) {
  switch(class(dfA[[n]]),
         "numeric" = { dfNew[[n]] <- dfNew[[paste0(n, ".x")]] + dfNew[[paste0(n, ".y")]] },
         "factor" = ,
         "character" = { dfNew[[n]] <- paste(dfNew[[paste0(n, ".x")]],
                                             dfNew[[paste0(n, ".y")]], sep = ",") })
  dfNew[[paste0(n, ".x")]] <- NULL
  dfNew[[paste0(n, ".y")]] <- NULL
}
}
Same general idea as previous, but uses merge to make sure that the data is correctly aligned, and then works on columns (whose names are postfixed with ".x" and ".y") with dfNew. Additional steps are included to get rid of the separate columns after joining. Also has the bonus feature of carrying along any other columns not specified for joining together in cols.