I am trying to split up data evenly in R. For example, I am using the dataset cars that is built into R Studio with 50 lines. If I want to split the data into two sections, I would do something along the lines of this:
cars$split <- rep(1:2, each=25) where I would create a column called split and assign the first 25 values to a 1, and the next 25 values to a 2. However, if I wanted to split my data into, lets say, 8 sections (based on user discretion), I would not be able to divide 50 / 8 evenly as it equals to 6.25. In this case, I would simply assign the last two rows (since 50 / 8 = 6.25, and 6 * 8 = 48 so we would have 2 rows left over) to the number 8 in this case using the function above. However, I am unable to do this since the rep function needs to divide properly so I tried to write out some logic as so, but I get an issue saying:
Error in `$<-.data.frame`(`*tmp*`, "split", value = c(1L, 1L, 1L, 1L, : replacement has 48 rows, data has 50
Any ideas on how to fix this? My attempt is shown below:
numDataPerSection <- floor(nrow(cars) / userInputNum)
if(nrow(cars) %% userInputNum != 0){
#If not divisible, assign last few data points to the last number
cars$split <- rep(1:ncls, each=numDataPerSection, len = nrow(cars) - (nrow(cars) %% userInputNum))
for(i in nrow(cars) %% userInputNum){
cars$split[nrow(cars) - i] <- userInputNum
}
}
#Everything divides correctly
else{
cars$split <- rep(1:ncls, each=numDataPerSection)
}
How about using a function such as this one to create your indices?
create.indices <- function(nrows, sections) {
indices <- rep(1:sections,each=floor(nrows/sections))
indices <- append(indices, rep(tail(indices, 1), nrows%%sections))
return(indices)
}
create.indices(50,8)
# [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 8 8
You could use something like
split1 <- function(n,s){ c( rep(1:s, each=n%/%s), rep(s, n%%s) ) }
cars$split <- split1(nrow(cars,userInputNum))
but this is not very balanced as in your example category 8 is two larger than any other, and would be worse with 55 rows and 8 sections:
> split1(50,8)
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7
[39] 7 7 7 7 8 8 8 8 8 8 8 8
> table(split1(50,8))
1 2 3 4 5 6 7 8
6 6 6 6 6 6 6 8
> table(split1(55,8))
1 2 3 4 5 6 7 8
6 6 6 6 6 6 6 13
You could do better with something like
split2 <- function(n,s){ ((1:n)*s+n-s) %/% n }
which produces
> split2(50,8)
[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 5 6 6 6 6 6 6
[39] 7 7 7 7 7 7 8 8 8 8 8 8
> table(split2(50,8))
1 2 3 4 5 6 7 8
7 6 6 6 7 6 6 6
> table(split2(55,8))
1 2 3 4 5 6 7 8
7 7 7 7 7 7 7 6
You can use the length.out argument of rep() to create your split column: rep(1:8, length.out = 50, each = round(50/8)). Using the round() function works reasonably well to achieve a uniform distribution of group sizes:
> table(rep(1:8, length.out = 50, each = round(50/8)))
1 2 3 4 5 6 7 8
8 6 6 6 6 6 6 6
Related
I have a factor variable with 6 levels, which simplified looks like:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 2... 1 1 1 2 2... (with n = 78)
Note, that each number is repeated mostly but not always three times.
I need to transform this variable into the following pattern:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8...
where each repetition of the 6 levels continuous counting ascending.
Is there any way / any function that lets me do that?
Sorry for my bad description!
Assuming that you have a numerical vector that represents your simplified version you posted. i.e. x = c(1,1,1,2,2,3,3,3,1,1,2,2), you can use this:
library(dplyr)
cumsum(x != lag(x, default = 0))
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
which compares each value to its previous one and if they are different it adds 1 (starting from 1).
Maybe you can try rle, i.e.,
v <- rep(seq_along((v<-rle(x))$values),v$lengths)
Example with dummy data
x = c(1,1,1,2,2,3,3,3,4,4,5,6,1,1,2,2,3,3,3,4,4)
then we can get
> v
[1] 1 1 1 2 2 3 3 3 4 4 5 6 7 7 8 8 9 9
[19] 9 10 10
In base you can use diff and cumsum.
c(1, cumsum(diff(x)!=0)+1)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
Data:
x <- c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2)
I want to replicate a vector with one value within this vector is missing (sequentially).
For example, my vector is
value <- 1:7
First, the series is without 1, second without 2, and so on. In the end, the series is in one vector.
The intended output looks like
2 3 4 5 6 7 1 3 4 5 6 7 1 2 4 5 6 7 1 2 3 5 6 7 1 2 3 4 6 7 1 2 3 4 5 6
Is there any smart way to do this?
You could use the diagonal matrix to set up a logical vector, using it to remove the appropriate values.
n <- 7
rep(1:n, n)[!diag(n)]
# [1] 2 3 4 5 6 7 1 3 4 5 6 7 1 2 4 5 6 7 1 2 3 5 6 7 1 2 3 4 6 7 1 2 3 4 5
# [36] 7 1 2 3 4 5 6
Well, you can certainly do it as a one-liner but I am not sure it qualifies as smart. For example:
x <- 1:7
do.call("c", lapply(as.list(-1:-length(x)), function(a)x[a]))
This simple uses lapply to create a list of copies of x with each of its entries deleted, and then concatenates them using c. The do.call function applies its first argument (a function) to its second argument (a list of arguments to the function).
For fun, it's also possible to just use rep:
> n <- 7
> rep(1:n, n)[rep(c(FALSE, rep(TRUE, n)), length.out=n^2)]
[1] 2 3 4 5 6 7 1 3 4 5 6 7 1 2 4 5 6 7 1 2 3 5 6 7 1 2 3 4 6 7 1 2 3 4 5 7 1 2
[39] 3 4 5 6
But lapply is cleaner, I think.
You could also do:
n <- 7
rep(seq(n), n)[-seq(1,n*n,n+1)]
#[1] 2 3 4 5 6 7 1 3 4 5 6 7 1 2 4 5 6 7 1 2 3 5 6 7 1 2 3 4 6 7 1 2 3 4 5 7 1 2 3 4 5 6
Eliminate in an increasing order rows in a data frame
x<-c(4,5,6,23,5,6,7,8,0,3)
y<-c(2,4,5,6,23,5,6,7,8,0)
z<-c(1,2,4,5,6,23,5,6,7,8)
df<-data.frame(x,y,z)
df
x y z
1 4 2 1
2 5 4 2
3 6 5 4
4 23 6 5
5 5 23 6
6 6 5 23
7 7 6 5
8 8 7 6
9 0 8 7
10 3 0 8
I would like to eliminate number 23 in the df from all columns by instructing to sequentially increasingly remove a row per column (not by matching the value 23, but by its initial x location).
df
x y z
1 4 2 1
2 5 4 2
3 6 5 4
4 5 6 5
5 6 5 6
6 7 6 5
7 8 7 6
8 0 8 7
9 3 0 8
Thank you
You can iterate through the columns and remove the element from each, then reassemble as a data frame:
result <- as.data.frame(lapply(1:ncol(df), function(x) df[-(x+3),x]))
names(result) <- names(df)
result
## x y z
## 1 4 2 1
## 2 5 4 2
## 3 6 5 4
## 4 5 6 5
## 5 6 5 6
## 6 7 6 5
## 7 8 7 6
## 8 0 8 7
## 9 3 0 8
df[-(x+3),x] is the column with the value removed, by location. To start with row N in column x you would use df[-(x+N-1),x].
You could also try:
n <- 4
df1 <- df[-n,]
df1[] <- unlist(df,use.names=FALSE)[-seq(n, prod(dim(df)), by=nrow(df)+1)]
df1
# x y z
#1 4 2 1
#2 5 4 2
#3 6 5 4
#5 5 6 5
#6 6 5 6
#7 7 6 5
#8 8 7 6
#9 0 8 7
#10 3 0 8
I have a simple data frame as follows
x = data.frame(id = seq(1,10),val = seq(1,10))
x
id val
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
I want to add 4 more columns. The first 2 are the previous two rows and the next two are the next two rows. For the first two rows and last two rows it needs to write out as NA.
How do I accomplish this using cast in the reshape package?
The final output would look like
1 1 NA NA 2 3
2 2 NA 1 3 4
3 3 1 2 4 5
4 4 2 3 5 6
... and so on...
Thanks much in advance
After your give the example , I change the solution
mat <- cbind(dat,
c(c(NA,NA),head(dat$id,-2)),
c(c(NA),head(dat$val,-1)),
c(tail(dat$id,-1),c(NA)),
c(tail(dat$val,-2),c(NA,NA)))
colnames(mat) <- c('id','val','idp','valp','idn','valn')
id val idp valp idn valn
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
Here is a soluting with sapply. First, choose the relative change for the new columns:
lags <- c(-2, -1, 1, 2)
Create the new columns:
newcols <- sapply(lags,
function(l) {
tmp <- seq.int(nrow(x)) + l;
x[replace(tmp, tmp < 1 | tmp > nrow(x), NA), "val"]})
Bind together:
cbind(x, newcols)
The result:
id val 1 2 3 4
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
to become more specific, here is an example:
> expand.grid(5, 5, c(1:4,6),c(1:4,6))
Var1 Var2 Var3 Var4
1 5 5 1 1
2 5 5 2 1
3 5 5 3 1
4 5 5 4 1
5 5 5 6 1
6 5 5 1 2
7 5 5 2 2
8 5 5 3 2
9 5 5 4 2
10 5 5 6 2
11 5 5 1 3
12 5 5 2 3
13 5 5 3 3
14 5 5 4 3
15 5 5 6 3
16 5 5 1 4
17 5 5 2 4
18 5 5 3 4
19 5 5 4 4
20 5 5 6 4
21 5 5 1 6
22 5 5 2 6
23 5 5 3 6
24 5 5 4 6
25 5 5 6 6
This data frame was created from all combinations of the supplied vectors. I would like to create a similar data frame from all permutations of the supplied vectors. Notice that each row must contain exactly 2 fives, yet not necessarily the fist two in line.
Thank you.
The code below works. (relies on permutations from gtools)
comb <- t(as.matrix(expand.grid(5, 5, c(1:4,6),c(1:4,6))))
perms <- t(permutations(4,4))
ans <- apply(comb,2,function(x) x[perms])
ans <- unique(matrix(as.vector(ans), ncol = 4, byrow = TRUE))
Try ?allPerms in the vegan package.