I have a numeric variable ranging from 1 (min) to 5 (max), measured across 8 different variables. The first row therefore looks like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
4 4 1 4 5 4 4 1
I've computed a row-wise median for each row across the 8 variables. Occasionally, the median is a midpoint value, for example 4.5 (since there is an even number of variables). The resulting row might therefore look like this:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Median
1 2 3 4 5 5 5 5 4.5
When I call table() on the Median variable, I get this:
table(df$Median)
1 1.5 2 2.5 3 3.5 4 4.5 5
2 3 10 5 25 17 75 53 87
The issue I am trying to overcome is that I wish to "get rid" of the midpoint/decimal values by folding them into the nearest whole values; however, if I simply use round(), I end up biasing the values (since by definition 4.5 sits exactly in between), like this:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
What I was thinking of doing is to round the midpoints based on the proportions of the non-decimal (whole-number) values in the table, excluding the midpoint values.
So I would get the proportions of the non-decimal values using dplyr's filter():
df %>% filter(Median %% 1 == 0) %>%
  select(Median) %>% table() %>% prop.table()
To get:
1 2 3 4 5
0.01005025 0.05025126 0.12562814 0.37688442 0.43718593
The next step requires constructing a function that takes all midpoint values in the Median variable and rounds them to the nearest whole number while keeping the proportions of the whole-number values intact, or close to the original ones. For example, the nearest values to 4.5 are 4 and 5, so it would have a chance of becoming 4 based on the proportion 0.37688442 and of becoming 5 based on the proportion 0.43718593. This way I would transform midpoint values to whole numbers without the bias of simply using round().
An alternative approach is to split the midpoints equally between the neighbouring values: 50% of the rows with value 4.5 would go to 4, and 50% would go to 5.
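For what it's worth, a minimal sketch of the proportional idea (the helper name round_by_proportion() is made up for illustration and assumes x is the Median column): each midpoint is rounded down or up at random, weighted by the observed shares of its two neighbouring whole numbers among the whole-number medians, falling back to a fair coin when neither neighbour occurs.
round_by_proportion <- function(x) {
  # shares of the whole-number medians (e.g. 1, 2, 3, 4, 5)
  props <- prop.table(table(x[x %% 1 == 0]))
  vapply(x, function(m) {
    if (m %% 1 == 0) return(m)                   # leave whole numbers untouched
    lo <- floor(m); hi <- ceiling(m)
    p_lo <- props[as.character(lo)]
    p_hi <- props[as.character(hi)]
    if (is.na(p_lo)) p_lo <- 0
    if (is.na(p_hi)) p_hi <- 0
    if (p_lo + p_hi == 0) p_lo <- p_hi <- 0.5    # fall back to a fair coin
    sample(c(lo, hi), 1, prob = c(p_lo, p_hi))   # sample() normalises the weights
  }, numeric(1))
}
# e.g. table(round_by_proportion(df$Median))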
I am thankful for any suggestions that would help me to solve this problem or get to the point I can start developing the function.
Edit1. Provided my own attempt to answer this question.
Edit2. Provided data.
dput(head(df, 15))
structure(list(uniqueID = c("R_AtXpiwxKPvILFv3", "R_2xwP4iz6UAu1fTj",
"R_b8IXGRKHP58x7GR", "R_ZelynHN8PCxxYyt", "R_PNjIc7h4dHebRgR",
"R_2bTZvYLUuKNC22D", "R_3iLqwuDs493HstB", "R_291dITimLKjYXeL",
"R_YWWGleFLxlIYzrX", "R_3st91vjNWNXlTHt", "R_3Mm8P52gaaxIpwD",
"R_3MxHXTnrncpgnB8", "R_1LqDx1uxReOQHvO", "R_vJEGJDmbqdfO7qF",
"R_3q8Wl8qys6nqxBH"), Median = c(4, 4.5,
1, 4, 5, 4.5, 4, 1.5, 4.5, 4, 3.5, 2, 4.5, 4.5, 3.5)), .Names = c("uniqueID",
"Median"), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
I'd implement it like this:
round_randomly = function(x, tolerance = 1e-6) {
  # nudge each value up or down by a tiny random amount so that exact midpoints
  # break ties at random, then round as usual
  round(x + sample(c(-tolerance, tolerance), size = length(x), replace = TRUE))
}
Calling your sample data dd,
table(round_randomly(dd$Median))
# 1 2 4 5
# 1 2 8 4
Any tolerance value less than 0.5 will work the same if your data contain only integers and 0.5s. If you have more continuous data, a smaller tolerance is better (to prevent, say, 4.4 from being jittered up to 4.51 and rounded to 5). I set the default to 1e-6, which seems reasonable; with it, only a value greater than 4.499999 might get rounded up to 5.
Your answer goes to quite a bit of trouble to add a random value only to the midpoints - this isn't necessary because of the rounding. If the original value is 4, then 4.000001 will still round to 4 (even if you set the tolerance to 0.4, 4.4 will still round to 4).
My method makes no guarantees about rounding exactly 50% of midpoints up and 50% down, but each midpoint is rounded up and down with equal probability. Unless you have very little data and an unusually skewed random draw, that should be close enough.
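A quick sanity check of that claim (a sketch, with an arbitrary seed): jitter a large number of midpoints and tabulate the results.
set.seed(1)                                # arbitrary seed for reproducibility
table(round_randomly(rep(4.5, 1e4)))
# roughly half the values end up as 4 and half as 5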
Following a suggestion from the comments, I've attempted to create a function that randomly adds 0.1 to, or subtracts 0.1 from, all midpoint median values. It's not the most elegant function ever, but it does the job. One issue with the approach is that the randomization works by randomly sampling half of the midpoint rows and adding 0.1 to them; the remaining, unsampled half automatically has 0.1 subtracted. It would be more elegant to randomize every value individually, but I would have to explore that option.
The function:
randomize_midpoint <- function(dataset, new_random_median) {
# Prepare variable for mutate
new_random_median <- enquo(new_random_median)
# Get Sample A
sample_A <- dataset %>%
filter(Median %% 1 != 0) %>% # get midpoint values
sample_frac(0.5, replace = F) %>% # randomly sample 50% of them
select(uniqueID, Median) # anti_join will need some unique identifier
# Get Sample B by anti_join
sample_B <- dataset %>%
filter(Median %% 1 != 0) %>%
anti_join(sample_A) %>% # anti_join automatically uses uniqueID
select(uniqueID, Median)
# Create opposite of %in%
"%w/o%" <- Negate("%in%")
# Mutate median according to conditions in case_when()
dataset %>% mutate(
!!quo_name(new_random_median) := case_when(
uniqueID %in% sample_A$uniqueID ~ round(Median + 0.1),
uniqueID %in% sample_B$uniqueID ~ round(Median - 0.1),
uniqueID %w/o% c(sample_A$uniqueID , sample_B$uniqueID) ~ Median
)
)
}
The output of the function to compare with previous table():
randomize_midpoint(dataset = df, new_random_median = random_med) %>%
select(random_med) %>%
table()
Will return:
Joining, by = c("uniqueID", "Median")
1 2 3 4 5
2 16 36 110 113
Previous table:
table(round(df$Median))
1 2 3 4 5
2 18 25 145 87
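As a per-value alternative (a sketch, not part of the original answer): the +/- 0.1 sign can be drawn independently for each midpoint row, which avoids splitting the data into two sampled halves. It assumes the same df with a Median column; the column name random_med is illustrative.
library(dplyr)
df %>%
  mutate(random_med = if_else(Median %% 1 != 0,
                              round(Median + sample(c(-0.1, 0.1), n(), replace = TRUE)),
                              Median))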
Related
Apologies for the unclear title; I couldn't think of a better way to describe this problem.
Here is a sample dataset I am working with
test = data.frame(
Value = c(1:5, 5:1),
Index = c(1:5, 1:5),
GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
I want to create a new column (called "Value_Standardized") whose values are calculated by grouping the data by GroupNum and then dividing each Value observation by the Value observation of the group when the Index is 1.
Here's what I've come up with so far.
test2 = test %>%
group_by(GroupNum) %>%
mutate(Value_Standardized = Value / special_function(Value))
The special_function would represent a way to get the Value when Index == 1.
That is also precisely the problem - I cannot figure out a way to make the denominator the Value where Index == 1 within that group. Unfortunately, the value at Index == 1 is not necessarily the max or the min of the group.
Thanks in advance.
Edit: Emphasis added for clarity.
There is a super simple tidyverse way of doing this with cur_data(): it pulls the tibble for the current subset (group) of data and acts on it.
test2 <- test %>%
group_by(GroupNum) %>%
mutate(output=Value/cur_data()$Value[1])
cur_data() grabs the tibble for the current group; you then extract the Value column as you normally would with $Value, and because the denominator is always the first row in that group, you just specify that index with [1].
Nice and neat. There is a whole bunch of cur_...() helpers you can use; see dplyr's documentation on context-dependent expressions.
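If you prefer not to rely on the Index == 1 row being first within each group, a slightly more defensive variant (sketch) indexes on the condition directly:
library(dplyr)
test2 <- test %>%
  group_by(GroupNum) %>%
  mutate(Value_Standardized = Value / Value[Index == 1])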
Not sure if this is what you meant, nor if it's the best way to do this but...
Instead of using a group_by I used a nested pipe, filtering and then left_joining the table to itself.
test = data.frame(
Value = c(1:5, 5:1),
Index = c(1:5, 1:5),
GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
test %>%
left_join(test %>%
filter(Index == 1) %>%
select(Value,GroupNum),
by = "GroupNum",
suffix = c('','_Index_1')) %>%
mutate(Value = Value/Value_Index_1)
output:
Value Index GroupNum Value_Index_1
1 1.0 1 1 1
2 2.0 2 1 1
3 3.0 3 1 1
4 4.0 4 1 1
5 5.0 5 1 1
6 1.0 1 2 5
7 0.8 2 2 5
8 0.6 3 2 5
9 0.4 4 2 5
10 0.2 5 2 5
A quick base R solution:
test = data.frame(
Value = c(1:5, 5:1),
Index = c(1:5, 1:5),
GroupNum = c(rep.int(1, 5), rep.int(2, 5)),
Value_Standardized = NA
)
groups <- levels(factor(test$GroupNum))
for(currentGroup in groups) {
test$Value_Standardized[test$GroupNum == currentGroup] <- test$Value[test$GroupNum == currentGroup] / test$Value[test$GroupNum == currentGroup & test$Index == 1]
}
This only works under the assumption that each group will have only one observation with a "1" index though. It's easy to run into trouble...
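Under the same assumption (exactly one Index == 1 row per group), a vectorized base-R sketch with match() avoids the explicit loop:
idx1 <- test[test$Index == 1, c("GroupNum", "Value")]          # one reference row per group
test$Value_Standardized <- test$Value /
  idx1$Value[match(test$GroupNum, idx1$GroupNum)]              # look up each row's group reference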
I have the following sample data set
Time <- c(1,2,3,4,5,6,7,8,9,10,11,12)
Value <- c(0,1,2,3,2,1,2,3,2,1,2,3)
Data <- data.frame(Time, Value)
I would like to automatically find each maximum for the Value column and create a new data frame with only the Value and associated Time. In this example, maximum values occur every fourth time interval. I would like to group the data into bins and find the associated max value.
I kept my example simple for illustrative purposes, however, keep in mind:
Each max value in my data set will be different
Each max value is not guaranteed to occur at equal intervals; rather, I can only guarantee that each max value will occur within a known range (i.e. a bin) of time values.
Thank you for any help with this process!
You could find the local maxima by finding the points where the diff of the sign of the diff of the Value column is negative.
Data[which(diff(sign(diff(Data$Value))) < 0) + 1,]
#> Time Value
#> 4 4 3
#> 8 8 3
We can see that this works in a more general case too:
Time <- seq(0, 10, 0.1)
Value <- sin(Time)
Data <- data.frame(Time, Value)
plot(Data$Time, Data$Value)
Data2 <- Data[which(diff(sign(diff(Data$Value))) < 0) + 1,]
abline(v = Data2$Time, col = 'red')
Edit
Following more info from the OP, it seems we are looking for the maxima within a 120-second window. This being the case, we can get the solution more easily like this:
library(dplyr)
bin_size <- 4 # Used for example only, will be 120 in real use case
Data %>%
mutate(Bin = floor((Time - 1) / bin_size)) %>%
group_by(Bin) %>%
filter(Value == max(Value))
#> # A tibble: 3 x 3
#> # Groups: Bin [3]
#> Time Value Bin
#> <dbl> <dbl> <dbl>
#> 1 4 3 0
#> 2 8 3 1
#> 3 12 3 2
Obviously in the real data, change bin_size to 120.
Maybe this one?
library(dplyr)
Data %>%
slice_max(Value)
Time Value
1 4 3
2 8 3
3 12 3
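One caveat (and a sketch, not part of the original answer): without grouping, slice_max() keeps only the ties for the single global maximum, which works here because every local peak equals 3. If the peaks differ between bins, you would group first, reusing the bin construction from the earlier answer (bin_size = 4 is just for this toy example):
library(dplyr)
bin_size <- 4
Data %>%
  mutate(Bin = floor((Time - 1) / bin_size)) %>%
  group_by(Bin) %>%
  slice_max(Value, n = 1, with_ties = FALSE)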
This is a follow-up to a question I previously asked (Replace only certain values in column based on multiple conditions). For context I'm including some of the same information.
I have a large dataframe that contains many columns, but the relevant ones are: ID (this is number assigned to subject), Time (time at which this subject's measurement was taken) and Concentration. A very simplified example would be:
df <- data.frame( ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
Concentration=c("XXX",0.3,0.7,0.6,"XXX","XXX",0.8,0.3,"XXX","XXX",
"XXX",0.6,0.1,0.1,"XXX"),
Time=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5))
I would like to replace only the "XXX" values in column Concentration based on the following conditions:
when the value in column Time is less than or equal to timeX, "XXX" should be replaced with 0;
when the value in column Time is greater than timeX, "XXX" should be replaced with the word "Missing" - unless two consecutive "XXX" values appear for a single subject (ID) for Time > timeX, in which case the first of the consecutive "XXX" values should be replaced with 0.05 and the second (and any following "XXX" values) should be replaced with "Missing".
It's very important that the IDs are kept separate here, because there could be an "XXX" as the final Concentration of one ID and as the first Concentration of the next ID, and I do not want that to be read as two consecutive "XXX" values for a single ID.
The solution I have, for when we assume timeX=3 is:
require(tidyverse)
df <- tibble(df) %>%
mutate(Concentration = as.character(Concentration),
Concentration_Original = Concentration) %>%
mutate(Concentration = ifelse(Concentration == 'XXX' & Time <= 3, "0", Concentration)) %>%
group_by(ID) %>%
mutate(Concentration = ifelse(Concentration == 'XXX' & Concentration == lead(Concentration),
"0.05", ifelse(Concentration == 'XXX',
"Missing", Concentration))) %>%
replace_na(list(Concentration = "Missing")) %>% ungroup()
To make the code more flexible and more importantly so that it doesn't require the user to manually check what the time cut off point should be and then manually insert it, I've been trying to make the code more automatic.
I would like to replace Time <= 3 with the following condition for timeX:
timeX is the value in column Time for that specific subject ID at which the value in column Concentration is the highest. So basically the condition should be that timeX is the Time at which the Concentration achieves its maximum value.
For example: For ID 1 in my df, the highest concentration would be 0.7 and that concentration is achieved at Time = 3 so the value 3 should be inserted as timeX value.
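For what it's worth, a minimal sketch of just the timeX part (assuming "XXX" is first converted to NA so Concentration can be treated as numeric):
library(dplyr)
df %>%
  mutate(Concentration = as.numeric(ifelse(Concentration == "XXX", NA,
                                           as.character(Concentration)))) %>%
  group_by(ID) %>%
  summarise(timeX = Time[which.max(Concentration)])  # Time of each ID's maximum concentration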
Here are some thoughts/suggestions that might be helpful.
First, if you wish to look at maximum value for Concentration, I would not have this column be of character type. Instead, would make it numeric, and use NA for missing values. The first mutate sets that up.
After grouping, you can use mutate and case_when for your various situations. You can access the Time of maximum concentration through:
Time[which(Concentration == max(Concentration, na.rm = TRUE))]
(removing the missing values).
If the Concentration is missing and Time is less than the Time of maximum concentration, then change it to 0.
In the second case, if the lead (i.e. the subsequent row) is also missing, then change it to 0.05.
Otherwise, do not change Concentration.
Depending on further analyses and presentation, you can use "Missing" as a text label for missing data.
Edit: Based on the OP's comment, it appears that only the first "XXX" after the max time should be replaced with 0.05, and all the "XXX" values after that should stay missing. To achieve this, add:
!is.na(lag(Concentration, default = 0))
as a condition for determining whether the value should be 0.05. The logic is: after the max time, if the previous row's value is not NA but the following value is NA, then change it to 0.05.
Here is the modified code:
library(tidyverse)
df %>%
mutate(Concentration = ifelse(Concentration == "XXX", NA_character_, Concentration),
Concentration = as.numeric(Concentration)) %>%
group_by(ID) %>%
mutate(Concentration_New = case_when(
is.na(Concentration) & Time < first(Time[which(Concentration == max(Concentration, na.rm = TRUE))]) ~ 0,
is.na(Concentration) & Time > last(Time[which(Concentration == max(Concentration, na.rm = TRUE))]) &
is.na(lead(Concentration, default = 0)) & !is.na(lag(Concentration, default = 0)) ~ .05,
TRUE ~ Concentration
))
Output
ID Concentration Time Concentration_New
<dbl> <dbl> <dbl> <dbl>
1 1 NA 1 0
2 1 0.3 2 0.3
3 1 0.7 3 0.7
4 1 0.6 4 0.6
5 1 NA 5 NA
6 2 NA 1 0
7 2 0.8 2 0.8
8 2 0.3 3 0.3
9 2 NA 4 0.05
10 2 NA 5 NA
11 3 NA 1 0
12 3 0.6 2 0.6
13 3 0.1 3 0.1
14 3 0.1 4 0.1
15 3 NA 5 NA
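If a text label is wanted for presentation, as suggested above, one follow-up (a sketch; the name result stands in for the data frame produced by the pipeline above) is:
result %>%
  mutate(Concentration_New = ifelse(is.na(Concentration_New), "Missing",
                                    as.character(Concentration_New)))
# note this turns the column into character, so do it only for display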
I need to divide the data in DF1 into groups based on their Class. In some cases everything in a class will be in the same group. A class needs to be divided into groups at random, but not in equal shares. DF2 holds the data that gives the shares by which the data needs to be divided.
DF2 is imported by me from Excel. I maintain this file, and if needed you can make changes to the structure of the data. This is the file I will use to divide the classes into groups. The Share column tells me how much of a class must go into each group. For example, 50% of the rows in DF1 with Class 1 must be divided into Apples, 25% into Hammers and 25% into Car.
NB! It needs to be random; it can't be that the first 50% of rows are Apples, the next 25% Hammers, etc.
My solution is to give every row in DF1 a random number that I save every time I generate it, so I can go back and use the seed I got before.
NB! It's important to me that I can go back to the previous random numbers if a colleague or I runs the code by mistake and creates a new random seed. I have that part covered in the case of the random number.
DF1 (base data)
ID Class Random
1 1 0.65
2 1 0.23
3 2 0.45
4 1 0.11
5 2 0.89
6 3 0.12
7 1 0.9
My solution is to make a Share_2 column where I divide the 0-1 range into intervals based on the Share column. In Excel logic I would do the following:
IF Class = 1 then
  IF Random < 0.5 then Apples, else
  IF Random < 0.75 then Hammer, else
  IF Random < 1 then Car
DF2 (classification file made by me)
Class Group Share Share_2
1 Apples 50%* 0.5
1 Hammer 25% 0.75
1 Car 25% 1
2 Building 100%** 1
3 Computer 50% 0.5
3 Hammer 50% 1
*This means that 50% of Class 1 rows need to be "Apples". The shares within a class add up to 100%.
I need
DF3
ID Class Random Group
1 1 0.65 Hammer
2 1 0.23 Apples
3 2 0.45 Building
4 1 0.11 Apples
5 2 0.89 Building
6 3 0.12 Computer
7 1 0.9 Car
My problem is that I don't know how to write this in R. Can you please help me?
NB! Please feel free to offer other methods of solving my problem as well, as long as the division of classes is random and I can save the randomness to replicate it.
One way to go about this that does not use the random numbers you have already generated, but is otherwise fairly short, is to use the sample() function to do the random assignment directly for you:
DF1 <- data.frame(
ID = 1:7,
Class = c(1, 1, 2, 1, 2, 3, 1),
Random = c(0.65, 0.23, 0.45, 0.11, 0.89, 0.12, 0.9)
)
DF1 <- DF1[order(DF1$Class), ] #EDIT: need this for the code to behave properly!
DF2 <- data.frame(
Class = c(1, 1, 1, 2, 3, 3),
Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5),
Share_2 = c(0.5, 0.75, 1, 1, 0.5, 1)
)
set.seed(12345) # this is for reproducibility; you can choose any number here
DF3 <- DF1
DF3$Group <- unlist(sapply(unique(DF1$Class), function(x) {
with(DF2[DF2$Class == x, ],
sample(Group, size = sum(DF3$Class == x),
prob = Share, replace = TRUE))
}))
Working from the outside in: the sapply() call serves essentially the role of a for loop. It begins by looking at all the unique entries in DF1$Class. For each of those (called x), it carves out the chunk of DF2 whose Class equals x and then focuses only on that chunk of DF2 -- this is what the with() function is doing here.
The core idea is to use sample(). We draw the things to sample from the Group column of DF2, draw an appropriate number of samples (marked by the size parameter), set the probabilities according to the Share column of DF2, and draw with replacement. All of this makes sense because we are inside the with() function; we have already restricted our attention to not only DF2, but just the chunk of DF2 corresponding to Class == x.
The unlist() function is used because the output of the sapply() function is a list in this case, and we want it just to be a vector; then, we just glue that vector directly onto the DF3 data frame, which is otherwise an identical copy of DF1.
EDIT: I added a line sorting DF1, which is necessary for this solution.
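A quick way to check the result (a sketch): compare the realised group shares within each class against the target Share values in DF2; with more rows than this toy example, they should line up closely.
prop.table(table(DF3$Class, DF3$Group), margin = 1)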
Actually I don't like this solution much, since I pipe two filter() calls and don't know how to do it in one statement (a single-filter variant is sketched after the output below).
Using dplyr and #Aaron Montgomery's data:
merge(DF1, DF2, by="Class") %>%
group_by(Class, ID) %>%
filter(Random <= Share_2) %>%
filter(Share_2 == min(Share_2)) %>%
select(-c(Share, Share_2)) %>%
arrange(ID)
gives
# A tibble: 7 x 4
# Groups: Class, ID [7]
Class ID Random Group
<dbl> <int> <dbl> <chr>
1 1 1 0.65 Hammer
2 1 2 0.23 Apples
3 2 3 0.45 Building
4 1 4 0.11 Apples
5 2 5 0.89 Building
6 3 6 0.12 Computer
7 1 7 0.9 Car
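If a single filter() is wanted (a sketch, building on the same merged data): keep, within each (Class, ID) pair, the row whose Share_2 is the smallest cumulative bound that Random falls under.
library(dplyr)
merge(DF1, DF2, by = "Class") %>%
  group_by(Class, ID) %>%
  filter(Share_2 == min(Share_2[Random <= Share_2])) %>%
  select(-c(Share, Share_2)) %>%
  arrange(ID)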
I need to test the value of 'peso' (see replication code below) for each factor. If a factor reaches 50% of the overall sum of 'peso', the values for each factor should be pasted into a new object 'results'; otherwise, R should evaluate which factor has the lowest aggregated value of 'peso' and consider the factor in the next column when aggregating 'peso' again. Basically, this process replaces the lowest-scoring factor with the next factor. The process should repeat until a factor crosses the 50% threshold. So my question is, where do I start?
set.seed(51)
Data <- sapply(1:100, function(x) sample(1:10, size=5))
Data <- data.frame(t(Data))
names(Data) <- letters[1:5]
Data$peso <- sample(0:3.5, 100, rep=TRUE)
It should be like
If your first two rows are:
a b c d e peso
8 2 3 7 9 1
8 3 4 5 7 3
9 7 4 10 1 2
10 3 4 5 7 3
What would you like for the total?
Totals_08 = 4
Totals_09 = 2
Totals_10 = 3
etc?
So factor 8 got the greatest share, 4/(4+2+3) = 0.4444444, but did not reach the 50% threshold in the first round using column 'a'. Therefore I need something more: repeat the aggregation, but now considering factor 7 in column 'b' instead of factor 9 in column 'a', since factor 9 got the lowest aggregated value in the first round.
It's unclear if you have your list of factors already or not. If you do not have it, and are taking it from the data set, you can grab it in a few different ways:
# Get a list of all the factors
myFactors <- levels(Data[[1]]) # If actual factors.
myFactors <- sort(unique(unlist(Data))) # Otherwise use similar to this line
Then to calculate the Totals per factor, you can do the following
Totals <-
colSums(sapply(myFactors, function(fctr)
# calculate totals per fctr
as.integer(Data$peso) * rowSums(fctr == subset(Data, select= -peso))
))
names(Totals) <- myFactors
Which gives
Totals
# 1 2 3 4 5 6 7 8 9 10
# 132 153 142 122 103 135 118 144 148 128
Next:
I'm not sure if afterwards, you want to compare to the sum of peso or the sum of the totals. Here are both options, broken down into steps:
# Calculate the total of all the Totals:
TotalSum <- sum(Totals)
# See percentage for each:
Totals / TotalSum
Totals / sum(as.integer(Data$peso))
# See which, if any, is greater than 50%
Totals / TotalSum > 0.50
Totals / sum(as.integer(Data$peso)) > 0.50
# Using Which to identify the ones you are looking for
which(Totals / TotalSum > 0.50)
which(Totals / sum(as.integer(Data$peso)) > 0.50)
Note on your sampling for Peso
You took a sample of 0:3.5; however, the x:y sequence only gives integers.
If you want fractions, you can either use seq() or you can take a larger sequence and then divide appropriately:
option1 <- (0:7) / 2
option2 <- seq(from=0, to=3.5, by=0.5)
If you want whole integers from 0:3 and also the value 3.5, then use c()
option3 <- c(0:3, 3.5)