I have a variable called "exposed", and I already know the total number of exposed people at each point in time. "index" is how many people the individual meets in a week. Here is the data:
i exposed index
1 y 22
2 y 12
3 y 6
4 n 54
5 n 3
So I have 3 exposed individuals and 2 are not.
t <- 5
# I know the number of exposed individuals at each time step:
n_exposed <- c(3, 4, 1, 4, 5)
I created this line of code to capture the change in data:
evol <- list()
for (i in 1:t) {
  evol[[i]] <- df
}
for (i in 2:t) {
  # condition
}
If the number of exposed at [t] is higher than the number of exposed at [t-1], the individuals with exposed == n and the highest index change their exposed variable from n to y.
If the number of exposed at [t] is lower than the number of exposed at [t-1], the individuals with exposed == y and the lowest index change their exposed variable from y to n.
My question is: what condition do I have to write so that a data set that looks like this:
evol[[1]]
i exposed index
1 y 22
2 y 12
3 y 6
4 n 54
5 n 3
changes to a data set that looks like this:
evol[[2]]
i exposed index
1 y 22
2 y 12
3 y 6
4 y 54
5 n 3
and then to a data set that looks like this:
evol[[3]]
i exposed index
1 n 22
2 n 12
3 n 6
4 y 54
5 n 3
I hope I made it clear;
Any ideas please;
kind regards.
Here is a function to change the vector exposed.
change_exposed <- function(exposed, index) {
  stopifnot(length(exposed) == length(index))
  for (i in seq_along(index)[-1L]) {
    if (index[i] > index[i - 1L]) {
      exposed[i] <- if (exposed[i] == "y") "n" else "y"
    }
  }
  exposed
}
change_exposed(evol[[1]]$exposed, evol[[1]]$index)
#[1] "y" "y" "y" "y" "n"
Assign the result to exposed to actually change the data set.
evol[[1]]$exposed <- change_exposed(evol[[1]]$exposed, evol[[1]]$index)
identical(evol[[1]], evol[[2]])
# [1] TRUE
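The rule stated in the question (switch the highest-index "n" rows when the total goes up, the lowest-index "y" rows when it goes down) can also be driven directly by the known totals. This is only a minimal sketch under that reading of the question, not the function above: update_exposed and n_exposed are hypothetical names, and the data frame is rebuilt here so the block runs on its own.

```r
# Hypothetical helper: adjust `exposed` so that exactly `target` rows are "y".
# Assumption: when the total rises, non-exposed rows with the highest index
# switch first; when it falls, exposed rows with the lowest index switch first.
update_exposed <- function(df, target) {
  current <- sum(df$exposed == "y")
  if (target > current) {
    cand <- which(df$exposed == "n")
    cand <- cand[order(df$index[cand], decreasing = TRUE)]
    df$exposed[head(cand, target - current)] <- "y"
  } else if (target < current) {
    cand <- which(df$exposed == "y")
    cand <- cand[order(df$index[cand])]
    df$exposed[head(cand, current - target)] <- "n"
  }
  df
}

df <- data.frame(i = 1:5,
                 exposed = c("y", "y", "y", "n", "n"),
                 index = c(22, 12, 6, 54, 3),
                 stringsAsFactors = FALSE)
n_exposed <- c(3, 4, 1, 4, 5)  # assumed known totals per time step

evol <- vector("list", length(n_exposed))
evol[[1]] <- df
for (t in 2:length(n_exposed)) {
  evol[[t]] <- update_exposed(evol[[t - 1]], n_exposed[t])
}

evol[[2]]$exposed
# [1] "y" "y" "y" "y" "n"
evol[[3]]$exposed
# [1] "n" "n" "n" "y" "n"
```

Each step starts from the previous data frame, so the changes accumulate over time instead of always being applied to the initial state.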
Data
evol <- list()
x <- 'i exposed index
1 y 22
2 y 12
3 y 6
4 n 54
5 n 3'
evol[[1]] <- read.table(textConnection(x), header = TRUE)
x <- 'i exposed index
1 y 22
2 y 12
3 y 6
4 y 54
5 n 3'
evol[[2]] <- read.table(textConnection(x), header = TRUE)
Related
I have a variable called "exposed", and I already know the total number of exposed people at each point in time. Here is the data:
i exposed
1 y
2 y
3 y
4 n
5 n
So I have 3 exposed individuals and 2 are not.
t <- 5
# I know the number of exposed individuals at each time step:
n_exposed <- c(3, 4, 1, 4, 5)
I created this line of code to capture the change in data:
evol <- list()
for (i in 1:t) {
  evol[[i]] <- df
}
for (i in 2:t) {
  # condition
}
My question is: what condition do I have to write so that evol[[1]] looks like this:
i exposed
1 y
2 y
3 y
4 n
5 n
evol[[2]] looks like this:
i exposed
1 y
2 y
3 y
4 y
5 n
and evol[[3]] looks like this:
i exposed
1 y
2 n
3 n
4 n
5 n
I hope I made it clear;
Any ideas please;
kind regards.
If I'm understanding you correctly, you want a list of dataframes based on the exposed sums.
Using lapply you can do
exposed <- c(3,4,1,4,5)
evol <- lapply(exposed, \(x) data.frame(
  i = seq_along(exposed),
  exposed = c(rep("y", x), rep("n", length(exposed) - x))
))
evol[[1]]
i exposed
1 1 y
2 2 y
3 3 y
4 4 n
5 5 n
Ahoy,
below is a df similar to the one I have to work with but way smaller:
(I left out a lot of rows to make it easier on the eyes.)
x y variable values
1 1 5 a 9
2 2 5 a 2
3 3 5 a 9
4 4 5 a 8
5 5 5 a 4
...
22 2 1 a 7
23 3 1 a 9
24 4 1 a 7
25 5 1 a 10
26 1 5 b 7
27 2 5 b 8
...
48 3 1 b 8
49 4 1 b 7
50 5 1 b 2
The df above is created by a fluorescence plate reader, which scans light intensity within an area by dividing it into 25 sectors (5x5) and measuring each sector individually, giving one value each. The measurements run from the upper-left sector (first) to the lower-right sector (last). To make it more graphical:
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Filling in the values from the df given above (+coordinates), it would look like this:
(5) ___ 9 - 2 - 9 - 8 - 4
(4) ___ 7 - 7 - 2 - 5 - 3
(3) ___ 5 - 4 - 7 - 8 - 9
(2) ___ 6 - 6 - 3 - 5 - 9
(1) ___ 4 - 7 - 9 - 7 - 10
(y^,x>) (1) (2) (3) (4) (5)
What I need is to read out the max value for each variable and calculate the mean of this value and the (up to) 8 fields surrounding it. In the area ("variable") "a" above, the highest value of a sector is 10 in the lower right corner, which is surrounded by the values 5, 9 and 7. Hence the result I am looking for for variable "a" is 7.75 ((5+9+7+10)/4).
I imagine the code to resemble something like this (I am aware that this is not how you write R, but I don't know better):
mean(max value within variable,value at x(of max value within variable)-1,y(of max value within variable)),value at x(of max value within variable)-1,y(of max value within variable)+1) .....
The next challenge is that the instrument will perform scans of 96 areas (= "variables"). Ideally I need a solution that automatically gives me this special mean value for all variables without me having to write almost identical code 96 times.
I know this is asking a bit much, but I have been working on it for a while and I just can't come up with a solution or even a good way of googling it.
Thank you very much for any help!
Tim,
PS: Running this R code creates a random version of the df I present above:
df <- data.frame(x = c(1:5), y = rep(c(5:1), each=5),variable = rep(c("a", "b"), each=25 ), values = floor(runif(50, min=1, max=10)))
This updated answer provides the mean of the maximum value and its surrounding values (the 3x3 block around it, clipped at the grid edges), within each variable group.
library(dplyr)
# Create the function
get.means <- function(df){
  # Get a data frame of rows with the maximum value
  max.rows <- df[df$values == max(df$values), ]
  # Create an empty data frame
  means.df <- data.frame(variable = character(), x = integer(), y = integer(),
                         value = numeric(), mean = numeric(), stringsAsFactors = FALSE)
  # Create an iterator for the data frame
  iterator <- 1
  # Loop through each row of the maximum value data frame
  for(i in seq_len(nrow(max.rows))){
    # Get the x value for the current row
    x <- max.rows$x[i]
    # Get the y value for the current row
    y <- max.rows$y[i]
    # Set the range of x values to process based on the x coordinate
    if(x == 1){
      x.range <- c(1, 2)
    } else if(x == 5){
      x.range <- c(4, 5)
    } else{
      x.range <- c(x-1, x, x+1)
    }
    # Set the range of y values to process based on the y coordinate
    if(y == 1){
      y.range <- c(1, 2)
    } else if(y == 5){
      y.range <- c(4, 5)
    } else{
      y.range <- c(y-1, y, y+1)
    }
    # Get a matrix of the values from the original data frame, which are in both the current y and x ranges
    vals <- as.matrix(df[(df$y %in% y.range) & (df$x %in% x.range), 'values'])
    # Get the mean of the values
    mean.val <- mean(vals)
    # Insert the current variable value to the data frame for the new row
    means.df[iterator, 'variable'] <- as.character(max.rows$variable[i])
    # Insert the current x, y, value, and mean values for the new row
    means.df[iterator, c('x','y','value', 'mean')] <- c(x, y, max.rows$values[i], mean.val)
    # Increment the iterator
    iterator <- iterator + 1
  }
  # Return the final data frame
  return(means.df)
}
# Create a test data frame
df <- data.frame(x = c(1:5), y = rep(c(5:1), each=5),variable = rep(c("a", "b"), each=25 ), values = floor(runif(50, min=1, max=10)))
# Get the means for each max value within the variable grouping
df1 <- df %>%
  group_by(variable) %>%
  do(get.means(.))
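For comparison, the same neighbourhood mean can be sketched in base R with a distance test instead of explicit edge cases. This is an illustrative alternative, not the function above: neighbourhood_mean and grid_a are hypothetical names, grid_a reproduces the grid for variable "a" from the question, and unlike get.means this sketch only uses the first maximum when there are ties.

```r
# Hypothetical helper: mean of the maximum and its (up to 8) neighbours.
# A sector counts as a neighbour when it is at most one step away in x and in y.
neighbourhood_mean <- function(d) {
  k <- which.max(d$values)  # first maximum only; ties are ignored
  near <- abs(d$x - d$x[k]) <= 1 & abs(d$y - d$y[k]) <= 1
  mean(d$values[near])
}

# The 5x5 grid for variable "a" from the question (y = 5 is the top row)
grid_a <- data.frame(
  x = rep(1:5, times = 5),
  y = rep(5:1, each = 5),
  variable = "a",
  values = c(9, 2, 9, 8, 4,
             7, 7, 2, 5, 3,
             5, 4, 7, 8, 9,
             6, 6, 3, 5, 9,
             4, 7, 9, 7, 10)
)

# One result per variable, however many variables there are
sapply(split(grid_a, grid_a$variable), neighbourhood_mean)
#    a
# 7.75
```

Because the test is based on coordinate distance, the grid size does not have to be hard-coded, and split() handles all 96 variables in one call.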
I am new to R.
I have a list of repeating codes (numeric/categorical) from an Excel file. I need to add another column of values (they can be random) such that every occurrence of the same code gets the same value.
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment: it does not assume that code is sorted or that it takes consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
  n <- length(k <- rle(code)$lengths)
  value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
Further note: if code is categorical rather than numerical, the match and factor methods will still work.
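As a quick illustration with categorical codes, here is a small sketch of the match approach together with a check that every code ends up with exactly one value. The codes and the seed are made up for the example.

```r
code <- c("b", "b", "a", "c", "c", "a")  # made-up categorical codes, unsorted
code0 <- unique(code)

set.seed(42)
# one distinct random integer per unique code
lookup <- sample(4 * length(code0), length(code0))
value <- lookup[match(code, code0)]

data.frame(Codes = code, Value = value)

# sanity check: each code maps to exactly one value
stopifnot(all(tapply(value, code, function(v) length(unique(v))) == 1))
```

Since sample() draws without replacement here, different codes also get different values.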
You could also do the following, which is perhaps more intuitive for a beginner:
data <- data.frame('a' = c(122,122,155,155,155,101,101,251,251))
duplicates <- unique(data)
duplicates[, 'b'] <- rnorm(nrow(duplicates))
data <- merge(data, duplicates, by='a')
I'm creating a group assignment for a college class (~180 students) I'm instructing. It's important that these groups be as heterogeneous as possible across three variables: field of study (FOS), sex, and division (i.e., newer/older students).
FOS has 5 levels, sex has 2, division has 2. Given the project, I'd like to create about 8-9 groups. In other words, I'd like groups of approximately 6 with a "good" balance of different fields of study, males/females, and new and older students. I'd then simply post the names with the automated assignments.
The instructor before did it all by hand, but I've tried playing around with R to see if there's a more systematic way of doing this, but only came up with repeated (and clunky) sorting. I expect the 5 FOS levels to vary in size, so I recognize that it will not be a perfect solution. Interested in people's clever solutions. Here's a reproducible sample:
dat <- data.frame(
  student = 1:180,
  gender = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.52, .48)),
                  labels = c("female", "male")),
  division = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.6, .4)),
                    labels = c("lower", "upper")),
  field = factor(sample(LETTERS[1:5], 180, replace = TRUE,
                        prob = c(.26, .21, .35, .07, .11)),
                 labels = c("humanities", "natural science",
                            "social science", "engineer", "other")))
This was what I was playing with, but it's really increasing the randomness in assignment and not so much the balance as can be seen:
library(dplyr)
dat$rand <- sample(1:180,180)
dat1 <- arrange(dat, field, division, gender, rand)
dat1$grp <- 1:(nrow(dat1)/6) #issue if not divisible
Which does not result in adequate balance:
with(dat1, table(gender, grp)) #as a check
with(dat1, table(field, grp))
with(dat1, table(division, grp))
I know this is an old question, but I had a similar problem today and here's the solution I came up with. Basically you assign groups randomly then use either chi square test for categorical variables or ANOVA for continuous variables to test for group differences for each variable. You set a threshold for the p-value that you do not want to drop below. The code will reshuffle the groups until all p values are above that threshold. If it goes through 10,000 iterations without reaching a grouping solution, it will stop and suggest that you lower the threshold.
set.seed(905)
#let's say you have a continuous variable you would also like to keep steady across groups
dat$age <- sample(18:35, nrow(dat), replace = TRUE)
dat$group <- rep_len(1:20, length.out = nrow(dat)) #if you wanted to make 20 groups
dat$group <- as.factor(dat$group)
a <- 0.1; b <- 0.1; c <- 0.1; d <- 0.1
thresh <- 0.85 #Minimum threshold for p value
z <- 1
while (a < thresh | b < thresh | c < thresh | d < thresh) {
  dat <- transform(dat, group = sample(group)) #shuffles the groups
  x <- summary(aov(age ~ group, dat)) #ANOVA for continuous variables
  a <- x[[1]]['group','Pr(>F)']
  x <- summary(table(dat$group, dat$gender)) #Chi Sq for categorical variables
  b <- x[['p.value']]
  x <- summary(table(dat$group, dat$division))
  c <- x[['p.value']]
  x <- summary(table(dat$group, dat$field))
  d <- x[['p.value']]
  z <- z + 1
  if (z > 10000) {
    print('10,000 tries, no solution, reduce threshold')
    break
  }
}
With enough datapoints per combination of the variables, you should be able to do this:
dat <- groupdata2::fold(dat, k = 8,
                        cat_col = c("gender", "division", "field"))
with(dat, table(gender, .folds))
## .folds
## gender 1 2 3 4 5 6 7 8
## female 11 12 11 12 12 11 12 12
## male 10 11 11 11 11 11 11 11
with(dat, table(field, .folds))
## .folds
## field 1 2 3 4 5 6 7 8
## humanities 5 8 9 7 9 6 6 5
## natural science 2 3 4 6 3 9 2 4
## social science 9 7 6 8 5 6 9 6
## engineer 3 3 2 1 3 0 2 4
## other 2 2 1 1 3 1 4 4
with(dat, table(division, .folds))
## .folds
## division 1 2 3 4 5 6 7 8
## lower 11 15 13 14 10 13 11 15
## upper 10 8 9 9 13 9 12 8
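If you would rather avoid an extra package, a rough base-R sketch is to shuffle the students, sort them by the three variables, and then deal group numbers out in turn like cards. This is a simple alternative under my own assumptions, not how groupdata2 works internally; the number of groups k and the seed are arbitrary, and the sample data is regenerated so the block runs on its own.

```r
set.seed(1)
n <- 180
dat <- data.frame(
  student = seq_len(n),
  gender = sample(c("female", "male"), n, replace = TRUE, prob = c(.52, .48)),
  division = sample(c("lower", "upper"), n, replace = TRUE, prob = c(.6, .4)),
  field = sample(c("humanities", "natural science", "social science",
                   "engineer", "other"),
                 n, replace = TRUE, prob = c(.26, .21, .35, .07, .11)),
  stringsAsFactors = FALSE
)

k <- 9  # assumed number of groups

# Shuffle first so that the order within each stratum is random,
# then sort by the stratifying variables (order() keeps ties in place).
dat <- dat[sample(n), ]
dat <- dat[order(dat$field, dat$division, dat$gender), ]

# Deal group numbers 1..k cyclically down the sorted rows
dat$grp <- rep_len(seq_len(k), n)

with(dat, table(field, grp))
with(dat, table(gender, grp))
with(dat, table(division, grp))
```

The first sort key (field here) ends up balanced to within one student per group, because each field forms one contiguous run. The inner keys are only approximately balanced, so put the variable you care about most first.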
I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: is there a better (more efficient) approach than the code above? The second, more important question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately 100 as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (the 18th) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of the 18th row to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Install IRanges from Bioconductor (biocLite() has been replaced by BiocManager)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on @geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on a relatively big data set:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
# FN1 is not defined in this post; presumably it wraps the code above in a function of x
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for both parts of a "split" v,
i.e. in your example, the value of p for the 2 units of the 18th row that go into the first bucket is 9.954638.
With that assumption, you can do the following for data sets that are not very large:
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector with each value of p repeated v times for each row; the results are combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
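Under the same assumption (p is constant across the units of a row), the exact buckets can also be sketched without expanding the data, by computing how many units of each row fall inside each bucket. This is an illustrative sketch only: exact_buckets is a hypothetical name, the toy input is made up, and the row-by-bucket loop is meant for clarity rather than speed.

```r
# Hypothetical helper: weighted mean of p per bucket of exactly n units of v
# (the last bucket may hold fewer units).
exact_buckets <- function(p, v, n) {
  end <- cumsum(v)     # last unit covered by each row
  start <- end - v     # units covered before each row
  nb <- ceiling(sum(v) / n)
  t(sapply(seq_len(nb), function(b) {
    # units of each row that overlap bucket b, i.e. the range ((b-1)*n, b*n]
    w <- pmax(0, pmin(end, b * n) - pmax(start, (b - 1) * n))
    c(bucket = b - 1, p = sum(w * p) / sum(w), v = sum(w))
  }))
}

# Toy example: the middle row (4 units) is split 2 + 2 between the buckets
res <- exact_buckets(p = c(10, 20, 30), v = c(3, 4, 3), n = 5)
res
#      bucket  p v
# [1,]      0 14 5
# [2,]      1 26 5

# Same result as the expand-and-aggregate approach
z <- rep(c(10, 20, 30), c(3, 4, 3))
tapply(z, (seq_along(z) - 1) %/% 5, mean)
```

The first bucket takes 3 units at p = 10 and 2 units at p = 20, giving (30 + 40) / 5 = 14; the second takes the remaining 2 units at p = 20 and 3 units at p = 30, giving 26.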