Divide multiple columns by a value based on each condition - r

I have a dataset that has 3 different conditions. Data within condition 1 will need to be divided by 15, data within conditions 2 and 3 will need to be divided by 10. I tried to do for() in order to create separate datasets for each condition and then merge the two groups (group 1 is composed of condition 1, group 2 is composed of conditions 2 and 3). This is what I have so far for condition 1. Is there an easier way to do this that does not require creating subgroups?
Group1 <- NULL
for (val in ParticipantID) {
  ParticipantID_subset_Group1 <- subset(PronounData, ParticipantID == val & Condition == "1")
  # divide by the number 15 (dividing by the string "15" raises an error)
  I_Words_PPM <- (ParticipantID_subset_Group1$I_Words/15)
  YOU_Words_PPM <- (ParticipantID_subset_Group1$YOU_Words/15)
  WE_Words_PPM <- (ParticipantID_subset_Group1$WE_Words/15)
  df <- data.frame(val, Group, I_Words_PPM, YOU_Words_PPM, WE_Words_PPM)
  Group1 <- rbind(Group1, df)
}
colnames(Group1) <- c("ParticipantID", "Condition", "I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")

Couldn't fully test this solution without example data, but this should do what you want:
# make some fake data
PronounData <- data.frame(
  ParticipantID = 1:9,
  Condition = rep(1:3, 3),
  I_Words = sample(0:20, 9, replace = TRUE),
  YOU_Words = sample(0:40, 9, replace = TRUE),
  WE_Words = sample(0:10, 9, replace = TRUE)
)
# if Condition 1, divide by 15
PronounData[PronounData$Condition == 1, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
PronounData[PronounData$Condition == 1, c("I_Words", "YOU_Words", "WE_Words")] / 15
# if Condition 2 or 3, divide by 10
PronounData[PronounData$Condition %in% 2:3, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
PronounData[PronounData$Condition %in% 2:3, c("I_Words", "YOU_Words", "WE_Words")] / 10
# result
#   ParticipantID Condition I_Words YOU_Words WE_Words I_Words_PPM YOU_Words_PPM WE_Words_PPM
# 1             1         1      17        40        6      1.1333        2.6667       0.4000
# 2             2         2      14         1        6      1.4000        0.1000       0.6000
# 3             3         3       2        34        8      0.2000        3.4000       0.8000
# 4             4         1       0        33        1      0.0000        2.2000       0.0667
# 5             5         2       4        15        0      0.4000        1.5000       0.0000
# 6             6         3       1         7        6      0.1000        0.7000       0.6000
# 7             7         1       6        10        1      0.4000        0.6667       0.0667
# 8             8         2       1        33        9      0.1000        3.3000       0.9000
# 9             9         3       9        40        0      0.9000        4.0000       0.0000
NB, R is built on vectorized operations, so looping through each row is rarely the best solution. Instead, you generally want to find a way of modifying whole vectors/columns at once, or at least subsets of them. This will usually be faster and simpler.
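As a minimal sketch of that vectorized style, the per-row divisor can be computed once with ifelse() and then applied to all three columns in a single step. The toy values below are made up for illustration, and the helper names divisor and word_cols are assumptions; only the column names follow the question.

```r
# toy data (values invented for illustration)
PronounData <- data.frame(
  ParticipantID = 1:6,
  Condition = rep(1:3, 2),
  I_Words = c(15, 10, 20, 30, 0, 5),
  YOU_Words = c(30, 20, 10, 0, 40, 15),
  WE_Words = c(3, 4, 5, 6, 7, 8)
)

# one divisor per row: 15 for Condition 1, 10 for Conditions 2 and 3
divisor <- ifelse(PronounData$Condition == 1, 15, 10)

# divide every word column by the row-wise divisor in one step
word_cols <- c("I_Words", "YOU_Words", "WE_Words")
PronounData[paste0(word_cols, "_PPM")] <- PronounData[word_cols] / divisor
```

Because the divisor vector has one entry per row, it is recycled down each column, so no subsetting by condition is needed.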


Create a new column per group based on condition in a data frame

Although I searched at length for solutions, e.g.
Assign value to group based on condition in column
I am not able to solve the following problem and would appreciate greatly any help!
I have the following data frame (in reality, many more with thousands of rows):
df <- data.frame(ID1 = c(1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,7,7),
ID2 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
Percentage = c(0,10,NA,65,79,81,52,0,0,11,12,35,0,24,89,76,0,NA,59,16),
Group_expected_result = c(6,6,6,7,7,7,7,1,1,3,3,3,4,4,4,5,5,5,2,2))
What I want to do is to assign a group type from 1 to 7 to each group as indicated by ID1. Which group type should be assigned is dependent on the conditions of column 3, Percentage (can have values from 0-100) and is split into seven types:
Type 1 has a percentage of 0, i.e.
Type 1 = 0
Type 2 > 0 & < 10
Type 3 > 9 & < 20
Type 4 > 19 & < 30
Type 5 > 29 & < 40
Type 6 > 39 & < 50
Type 7 > 49
The combination of these types (above) defines the group type (G1-G7) below:
G1 = only T1
G2 = only T7 & T2-T6
G3 = only T2-T6
G4 = at least one T1, & one T2-T6, & one T7 (= all)
G5 = only T7 & T1
G6 = only T2-T6 & T1
G7 = only T7
The expected result is in the last column of the sample data frame, e.g.
the first group consists of types T1 and T2, therefore should be group type G6.
So, the question is how to get the expected result in the last column? I hope I made the problem clear! Thanks in advance!
Try this:
myType <- function(x) {
    # type codes: 1 = zero or NA, 2 = above 0 and below 50, 3 = 50 or more
    if (is.na(x) || x==0) {
        1L
    } else if (x < 50) {
        2L
    } else {
        3L
    }
}
myGroup <- function(myDf) {
    myIds <- unique(myDf$ID1)
    myGs <- list(G1=1L, G2=2:3, G3=2L, G4=1:3, G5=c(1L,3L), G6=1:2, G7=3L)
    assignG <- vector(mode = "integer", length=nrow(myDf))
    vT <- vapply(myDf[,"Percentage"], function(x) myType(x), 1L)
    for (i in myIds) {
        myV <- which(myDf[,1L]==i)
        testV <- sort(unique(vT[myV]))
        assignG[myV] <- which(vapply(myGs, function(x) identical(x,testV), TRUE, USE.NAMES = FALSE))
    }
    myDf$myResult <- assignG
    myDf
}
Calling it, we obtain:
myGroup(df)
ID1 ID2 Percentage Group_expected_result myResult
1 1 1 0 6 6
2 1 2 10 6 6
3 1 3 NA 6 6
4 2 4 65 7 7
5 2 5 79 7 7
6 2 6 81 7 7
7 2 7 52 7 7
8 3 8 0 1 1
9 3 9 0 1 1
10 4 10 11 3 3
11 4 11 12 3 3
12 4 12 35 3 3
13 5 13 0 4 4
14 5 14 24 4 4
15 5 15 89 4 4
16 6 16 76 5 5
17 6 17 0 5 5
18 6 18 NA 5 5
19 7 19 59 2 2
20 7 20 16 2 2
Here is a less intuitive, but more efficient solution.
myGroup2 <- function(myDf) {
    myIds <- unique(myDf$ID1)
    AltGs <- c(G1=2L, G2=7L, G3=3L, G4=9L, G5=6L, G6=5L, G7=4L)
    assignG <- vector(mode = "integer", length=nrow(myDf))
    vT <- vapply(myDf[,"Percentage"], function(x) myType(x), 1L)
    for (i in myIds) {
        myV <- which(myDf[,1L]==i)
        testV <- unique(vT[myV])
        assignG[myV] <- which(AltGs==(length(testV)+sum(testV)))
    }
    myDf$myResult <- assignG
    myDf
}
It is about twice as fast.
library(microbenchmark)
microbenchmark(t1=myGroup(df), t2=myGroup2(df))
Unit: microseconds
expr min lq mean median uq max neval
t1 692.117 728.4470 779.6459 748.562 819.170 1018.060 100
t2 320.608 340.3115 390.7098 351.395 414.203 1781.195 100
You can obtain AltGs above by running the following:
myGs <- list(G1=1L, G2=2:3, G3=2L, G4=1:3, G5=c(1L,3L), G6=1:2, G7=3L)
AltGs <- vapply(myGs, function(x) length(x)+sum(x), 2L, USE.NAMES = FALSE)
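For comparison, here is a hedged, loop-free sketch of the same idea. It assumes the type coding implied by myGs above (1 for zero or NA, 2 for values above 0 and below 50, 3 for 50 or more) and reuses the length-plus-sum key from AltGs. The names type_code and key are made up for this example.

```r
df <- data.frame(ID1 = c(1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,7,7),
                 ID2 = 1:20,
                 Percentage = c(0,10,NA,65,79,81,52,0,0,11,12,35,0,24,89,76,0,NA,59,16),
                 Group_expected_result = c(6,6,6,7,7,7,7,1,1,3,3,3,4,4,4,5,5,5,2,2))

# vectorized type coding (assumed: NA is treated like 0)
type_code <- ifelse(is.na(df$Percentage) | df$Percentage == 0, 1L,
                    ifelse(df$Percentage < 50, 2L, 3L))

# per-ID key: number of distinct types plus their sum, broadcast to every row
key <- ave(type_code, df$ID1,
           FUN = function(v) length(unique(v)) + sum(unique(v)))

# position of each key in AltGs is the group number G1..G7
AltGs <- c(2L, 7L, 3L, 9L, 6L, 5L, 4L)
df$myResult <- match(key, AltGs)
```

The key works only because every combination of types in myGs yields a distinct length-plus-sum value; if more types were added, that uniqueness would need to be rechecked.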

Create N random integers with no gaps

For a clustering algorithm that I'm implementing, I would like to initialize the cluster assignments at random. However, the labels must have no gaps. That is, this is not ok:
K <- 10 # initial number of clusters
N <- 20 # number of data points
z_init <- sample(K,N, replace=TRUE) # initial assignments
# [1] 2 8 6 2 10 10 2 9 5 6 6 3 8 2 5 9 10 3 5 1
# [1] 1 2 3 5 6 8 9 10
where labels 4 and 7 have not been used.
Instead, I would like this vector to be:
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
where the label 5 has become 4 and so forth to fill the lower empty labels.
More examples:
The vector 1 2 3 5 6 8 should be 1 2 3 4 5 6
The vector 15,5,7,7,10 should be 1 2 3 3 4
Can it be done avoiding for loops? I don't need it to be fast, I prefer it to be elegant and short, since I'm doing it only once in the code (for label initialization).
My solution using a for loop
z_init <- c(3,2,1,3,3,7,9)
idx <- order(z_init)
for (i in 2:length(z_init)){
  if(z_init[idx[i]] > z_init[idx[i-1]]){
    z_init[idx[i]] <- z_init[idx[i-1]]+1
  } else {
    z_init[idx[i]] <- z_init[idx[i-1]]
  }
}
# 3 2 1 3 3 4 5
Edit: @GregSnow came up with the current shortest answer. I'm 100% convinced that this is the shortest possible way.
For fun, I decided to golf the code, i.e. write it as short as possible:
z <- c(3, 8, 4, 4, 8, 2, 3, 9, 5, 1, 4)
# solution by hand: 1 2 3 3 4 4 4 5 6 6 7
sort(c(factor(z))) # 18 bytes, as proposed by @GregSnow in the comments
# [1] 1 2 3 3 4 4 4 5 6 6 7
Some other (functioning) attempts:
y=table(z);rep(seq(y),y) # 24 bytes
sort(unclass(factor(z))) # 24 bytes, based on @GregSnow's answer
diffinv(diff(sort(z))>0)+1 # 26 bytes
sort(as.numeric(factor(z))) # 27 bytes, @GregSnow's original answer
rep(seq(unique(z)),table(z)) # 28 bytes
cumsum(c(1,diff(sort(z))>0)) # 28 bytes
y=rle(sort(z))$l;rep(seq(y),y) # 30 bytes
Edit2: Just to show that byte count isn't everything:
z <- sample(1:10,10000,replace=T)
Unit: microseconds
expr min lq mean median uq max neval
sort(c(factor(z))) 2550.128 2572.2340 2681.4950 2646.6460 2729.7425 3140.288 100
{ y = table(z) rep(seq(y), y) } 2436.438 2485.3885 2580.9861 2556.4440 2618.4215 3070.812 100
sort(unclass(factor(z))) 2535.127 2578.9450 2654.7463 2623.9470 2708.6230 3167.922 100
diffinv(diff(sort(z)) > 0) + 1 551.871 572.2000 628.6268 626.0845 666.3495 940.311 100
sort(as.numeric(factor(z))) 2603.814 2672.3050 2762.2030 2717.5050 2790.7320 3558.336 100
rep(seq(unique(z)), table(z)) 2541.049 2586.0505 2733.5200 2674.0815 2760.7305 5765.815 100
cumsum(c(1, diff(sort(z)) > 0)) 530.159 545.5545 602.1348 592.3325 632.0060 844.385 100
{ y = rle(sort(z))$l rep(seq(y), y) } 661.218 684.3115 727.4502 724.1820 758.3280 857.412 100
z <- sample(1:100000,replace=T)
Unit: milliseconds
expr min lq mean median uq max neval
sort(c(factor(z))) 84.501189 87.227377 92.13182 89.733291 94.16700 150.08327 100
{ y = table(z) rep(seq(y), y) } 78.951701 82.102845 85.54975 83.935108 87.70365 106.05766 100
sort(unclass(factor(z))) 84.958711 87.273366 90.84612 89.317415 91.85155 121.99082 100
diffinv(diff(sort(z)) > 0) + 1 9.784041 9.963853 10.37807 10.090965 10.34381 17.26034 100
sort(as.numeric(factor(z))) 85.917969 88.660145 93.42664 91.542263 95.53720 118.44512 100
rep(seq(unique(z)), table(z)) 86.568528 88.300325 93.01369 90.577281 94.74137 118.03852 100
cumsum(c(1, diff(sort(z)) > 0)) 9.680615 9.834175 10.11518 9.963261 10.16735 14.40427 100
{ y = rle(sort(z))$l rep(seq(y), y) } 12.842614 13.033085 14.73063 13.294019 13.66371 133.16243 100
It seems to me that you are trying to randomly assign elements of a set (the numbers 1 to 20) to clusters, subject to the requirement that each cluster be assigned at least one element.
One approach that I could think of would be to select a random reward r_ij for assigning element i to cluster j. Then I would define binary decision variables x_ij that indicate whether element i is assigned to cluster j. Finally, I would use mixed integer optimization to select the assignment from elements to clusters that maximizes the collected reward subject to the following conditions:
Every element is assigned to exactly one cluster
Every cluster has at least one element assigned to it
This is equivalent to randomly selecting an assignment, keeping it if all clusters have at least one element, and otherwise discarding it and trying again until you get a valid random assignment.
In terms of implementation, this is pretty easy to accomplish in R using the lpSolve package:
library(lpSolve)
N <- 20
K <- 10
r <- matrix(rnorm(N*K), N, K)
mod <- lp(direction = "max",
          objective.in = as.vector(r),
          const.mat = rbind(t(sapply(1:K, function(j) rep((1:K == j) * 1, each=N))),
                            t(sapply(1:N, function(i) rep((1:N == i) * 1, K)))),
          const.dir = c(rep(">=", K), rep("=", N)),
          const.rhs = rep(1, N+K),
          all.bin = TRUE)
(assignments <- apply(matrix(mod$solution, nrow=N), 1, function(x) which(x > 0.999)))
# [1] 6 5 3 3 5 6 6 9 2 1 3 4 7 6 10 2 10 6 6 8
# [1] 1 2 3 4 5 6 7 8 9 10
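The rejection-sampling formulation mentioned above (draw an assignment, keep it if every cluster is used, otherwise redraw) can also be sketched directly in base R, without a solver. This is only an illustration of that description and not the lpSolve approach itself; the seed is an arbitrary choice.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
N <- 20      # number of data points
K <- 10      # number of clusters

repeat {
  # draw a candidate assignment of points to clusters
  z <- sample(K, N, replace = TRUE)
  # accept it only if every one of the K labels was used
  if (length(unique(z)) == K) break
}
```

With N = 20 and K = 10 a valid draw typically turns up after a handful of attempts, but the expected number of redraws grows quickly as K approaches N.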
You could do like this:
un <- sort(unique(z_init))
(z <- unname(setNames(1:length(un), un)[as.character(z_init)]))
# [1] 2 6 5 2 8 8 2 7 4 5 5 3 6 2 4 7 8 3 4 1
# [1] 1 2 3 4 5 6 7 8
Here I replace elements of un in z_init with corresponding elements of 1:length(un).
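The same lookup can be written a little more compactly with match(), which returns the position of each value in the sorted vector of unique labels. A minimal sketch using the vector from the question:

```r
z_init <- c(2, 8, 6, 2, 10, 10, 2, 9, 5, 6, 6, 3, 8, 2, 5, 9, 10, 3, 5, 1)

# position of each label among the sorted unique labels = gap-free label
z <- match(z_init, sort(unique(z_init)))
```

Unlike the sort()-based one-liners, this keeps the elements in their original order.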
A simple (but possibly inefficient) approach is to convert to a factor then back to numeric. Creating the factor will code the information as integers from 1 to the number of unique values, then add labels with the original values. Converting to numeric then drops the labels and leaves the numbers:
> x <- c(1,2,3,5,6,8)
> (x2 <- as.numeric(factor(x)))
[1] 1 2 3 4 5 6
> xx <- c(15,5,7,7,10)
> (xx2 <- as.numeric(factor(xx)))
[1] 4 1 2 2 3
> (xx3 <- as.numeric(factor(xx, levels=unique(xx))))
[1] 1 2 3 3 4
The levels = portion in the last example sets the numbers to match the order in which they appear in the original vector.

How to squeeze in missing values into a vector

Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but for which observations 7, 11 and 15 are missing, and for which the other observations are identical to b but in the order of a?
Or to put it somewhat differently: how could I squeeze these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have been failing to get this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# lets work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
2) Reduce Here is a loop-free version:
f <- function(j, a) j + (a == B[j])
# start the index j at 1 explicitly; without init, Reduce() would seed j with A[1]
r <- Reduce(f, A, 1, accumulate = TRUE)
ifelse(duplicated(r)[-1], NA, A)
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in (seq_along(a[,1]))) {
  if (a[i,1]==b[j,1]) {
    j <- j+1
  } else
    c[i,1] <- NA
}
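If the dropped positions are already known, as with 7, 11 and 15 in the question, no matching is needed at all. A minimal sketch under that assumption, working on plain vectors (the name dropped is made up for this example):

```r
set.seed(123)  # same seed as the answers above
A <- as.integer(runif(20, min = 0, max = 10))
dropped <- c(7, 11, 15)   # assumed to be known in advance
B <- A[-dropped]

# start from an all-NA vector of the full length, then fill the kept slots
C <- rep(NA_integer_, length(B) + length(dropped))
C[-dropped] <- B
```

The matching-based solutions above are only needed when the positions have to be inferred from the values themselves.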

Data on one row by ID

I have a data frame with one id column and several other columns grouped in pairs, and I'm trying to put all the data for the same id on one row. The ids do not all appear the same number of times.
My data looks like this :
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12, hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
## id vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3
## 1 1 1 0.04632267 1 -0.37404379 1 0.90711353
## 2 4 2 0.50383152 2 0.06075954 2 0.30690284
## 3 1 3 1.52450117 3 -1.21539925 3 -1.12411614
## 4 1 4 -0.50624871 4 -0.75988364 4 -0.47970608
## 5 3 5 1.64610863 5 0.03445275 5 -0.18895338
## 6 1 6 0.22019099 6 -0.32101883 6 1.29375822
## 7 2 7 -0.10041807 7 -0.17351799 7 -0.03767921
## 8 2 8 0.81683565 8 0.62449158 8 0.50474787
## 9 2 9 -0.46891269 9 1.07743469 9 -0.55539149
## 10 1 10 0.69736549 10 -0.08573679 10 0.28025325
## 11 3 11 0.73354215 11 0.80676315 11 -1.12561358
## 12 2 12 -0.40903143 12 1.94155313 12 0.64231119
For the moment I came up with this:
library(plyr)  # provides rbind.fill
align2 <- function(df) {
  result <- lapply(1:nrow(df), function(j) lapply(1:3, function(i) {
    x <- df[j, paste0(c("vpcc", "hpcc"), i)]
    names(x) <- paste0(c("vpcc", "hpcc"), (i + (j-1)*4))
    x
  }))
  result2 <- lapply(result, function(x) do.call(cbind, x))
  result3 <- do.call(cbind, result2)
  result3
}
testX <- lapply(1:4, function(k) align2(as.data.frame(split(df, f=df$id)[[k]])))
testX2 <- do.call(rbind.fill, testX)
## vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3 vpcc4 hpcc4 vpcc5 hpcc5 vpcc6 hpcc6 vpcc7 hpcc7 vpcc8 hpcc8 ...
## 1 1 0.04632267 1 -0.37404379 1 0.90711353 3 1.5245012 3 -1.2153992 3 -1.1241161 4 -0.5062487 4 -0.7598836 ...
## 2 7 -0.10041807 7 -0.17351799 7 -0.03767921 8 0.8168356 8 0.6244916 8 0.5047479 9 -0.4689127 9 1.0774347 ...
## 3 5 1.64610863 5 0.03445275 5 -0.18895338 11 0.7335422 11 0.8067632 11 -1.1256136 NA NA NA NA ...
## 4 2 0.50383152 2 0.06075954 2 0.30690284 NA NA NA NA NA NA NA NA NA NA ...
It's a partial solution since it doesn't keep the id.
But I can't imagine there isn't an easier way...
Thank you for any suggestions.
PS: maybe there's already a solution on SO, but I didn't find it...
In your example the variables vpcc1 vpcc2 etc. are redundant, since they have all the same value. So you can transform the dataset into a more economical structure:
df <- data.frame(id=sample(1:4, 12, T), vpcc=1:12, hpcc1=rnorm(12),
                 hpcc2=rnorm(12), hpcc3=rnorm(12))
Then use reshape() and you'll have all the values for each id in a single row, with the columns corresponding to the vpcc value, so that "hpcc3.5" means hpcc3 when vpcc is 5.
reshape(df, idvar = "id", direction = "wide", timevar = "vpcc")
If vpccX varies, then maybe this will give you what you need:
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12,
hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df$time = ave(df$id, df$id, FUN = function(x) 1:length(x))
reshape(df, idvar = "id", direction = "wide", timevar = "time")
Of course, you can rename your variables if needed.
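To see what the time-index step produces, here is a small sketch on made-up data (five rows and a single vpcc/hpcc pair; the values are invented for illustration). Each id gets a running counter, and reshape() then spreads the rows of one id across columns suffixed with that counter.

```r
# toy data: id 1 appears three times, id 2 twice
df <- data.frame(id = c(1, 2, 1, 2, 1),
                 vpcc1 = 1:5,
                 hpcc1 = c(0.1, 0.2, 0.3, 0.4, 0.5))

# running counter within each id: 1, 1, 2, 2, 3
df$time <- ave(df$id, df$id, FUN = seq_along)

# one row per id; columns become vpcc1.1, hpcc1.1, vpcc1.2, ...
wide <- reshape(df, idvar = "id", direction = "wide", timevar = "time")
```

Ids with fewer rows are padded with NA in the trailing columns, and the id column is kept.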
When you say "same row", is it necessary that the output is like it is in your attempt or would you be happy with something like:
x <- aggregate(df[2:ncol(df)],list(df$id),list)
which allows you to view output on one row as:
# Group.1 vpcc1 hpcc1 vpcc2 hpcc2 vpcc3
#1 1 9, 10 1.4651392, 0.8581344 9, 10 -1.621135, 1.391945 9, 10
#2 2 1, 3, 7 2.784998, 1.667367, -1.329005 1, 3, 7 0.2115051, 0.7871399, -0.4835389 1, 3, 7
#3 3 5, 6 -0.5024987, 0.2822224 5, 6 0.155844, 1.336449 5, 6
#4 4 2, 4, 8, 11, 12 -0.48563550, -0.92684024, -0.04016263, -0.41861021, 0.02309864 2, 4, 8, 11, 12 -0.17304058, 0.25428404, -0.49897995, 0.03101927, -0.13529866 2, 4, 8, 11, 12
# hpcc3
#1 -0.05182822, 0.28365514
#2 -0.06189895, -0.83640652, 0.19425789
#3 -0.006440312, 1.378218706
#4 0.09412386, 0.16733125, -1.15198965, -1.00839015, -0.16114475
and reference different values of vpcc and hpcc using list notation:
#[1] 9 10
#[1] 1 3 7
#[1] 5 6
#[1] 2 4 8 11 12
#[1] 9 10

bin range and form a data frame using the boundary.

Suppose I want to generate bins for the range 1 to 20
the output is
1 6 10 15 20
I want to form a data.frame as
[,1] [,2]
[1,] 1 6
[2,] 7 10
[3,] 11 15
[4,] 16 20
so the start will be 1,7, 11, 16, and ends are 6, 10, 15, 20, respectively.
Any solution for this?
x = round(seq(1,20,length.out=5))
df = data.frame(a = c(x[1], head(x[-1],-1) + 1), b = x[-1])
# a b
#1 1 6
#2 7 10
#3 11 15
#4 16 20
I am not sure if you are looking for the following solution. If you are, you can use cut and sub function as in my earlier post:
names(mydata)<-"V" #name the column as V
mydata$V1<-cut(mydata$V,5) #break the data into five intervals and name that as col V1
mydata$lower<-with(mydata,as.numeric( sub("\\((.+),.*", "\\1", V1))) #extract lower value
mydata$upper<-with(mydata,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",V1))) # extract upper value
myfinaldata<-mydata[,c("lower","upper")] #create data frame of lower and upper values
> myfinaldata
lower upper
1 0.981 4.79
2 4.790 8.60
3 8.600 12.40
4 12.400 16.20
5 16.200 20.00
Note: Although these look like overlapping intervals, they are not. cut() builds right-closed intervals by default, so the first row covers all data > 0.981 and <= 4.79, whereas the second row covers data > 4.79 and <= 8.60.
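A quick way to check which side of each interval is closed is to cut a value that sits exactly on a boundary. A minimal sketch with made-up breaks (the values 0, 5 and 10 are chosen only for illustration):

```r
x <- c(5, 10)

# default right = TRUE: intervals are (0,5] and (5,10]
bins <- cut(x, breaks = c(0, 5, 10))

# the boundary value 5 lands in the first interval, not the second
as.character(bins)
```

Passing right = FALSE to cut() flips this, giving left-closed intervals instead.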
