Combinations of Variables in R - r

I'm trying to create a fake data frame to examine the effects from a multinomial logit model in R. I have code that does precisely what I want, which is to create one row for every combination of levels of the different variables.
var1 <- seq(1, 10, 1)
var2 <- seq(1, 20, 5)
FakeData <- as.data.frame(matrix(NA, nrow = length(var1) * length(var2),
                                 ncol = 2))
row <- 1
for (i in 1:length(var1)) {
  for (j in 1:length(var2)) {
    FakeData[row, 1] <- var1[i]
    FakeData[row, 2] <- var2[j]
    row <- row + 1
  }
}
> head(FakeData)
V1 V2
1 1 1
2 1 6
3 1 11
4 1 16
5 2 1
6 2 6
My problem is that this code is very inefficient when applied to my problem with four variables of around ten levels each. Any tips on functions that might make it quicker?

You may be looking for expand.grid:
R> expand.grid(var1, var2)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 1 6
12 2 6
13 3 6
14 4 6
15 5 6
16 6 6
17 7 6
18 8 6
19 9 6
20 10 6
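Note that expand.grid varies its first argument fastest, whereas the loop in the question varies var2 fastest. A minimal sketch of how to recover the loop's row order, and how the same call scales to four variables, might look like this (the names grid, loop_order, four_way and the columns a to d are made up for illustration):

```r
var1 <- seq(1, 10, 1)
var2 <- seq(1, 20, 5)

# One row per combination; naming the arguments names the columns.
grid <- expand.grid(var1 = var1, var2 = var2)

# Sort so that var2 varies fastest, as in the nested for loop.
loop_order <- grid[order(grid$var1, grid$var2), ]
rownames(loop_order) <- NULL
head(loop_order)

# Four hypothetical variables with ten levels each: 10^4 rows.
four_way <- expand.grid(a = 1:10, b = 1:10, c = 1:10, d = 1:10)
nrow(four_way)
```

Because expand.grid builds whole columns at once rather than filling the data frame cell by cell, it avoids the repeated row assignments that make the loop slow.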

Related

anti-join not working - giving 0 rows, why?

I am trying to use anti-join exactly as I have done many times to establish which rows across two datasets do not have matches for two specific columns. For some reason I keep getting 0 rows in the result and I can't understand why.
Below are two dummy data frames containing the two columns I am trying to compare. One of them is missing an entry (df1, SITE 2, PLOT 8), so when I use anti-join to compare the two data frames, this entry should be returned, but I am just getting 0 rows.
a<- seq(1:3)
SITE <- rep(a, times = c(16,15,1))
PLOT <- c(1:16,1:7,9:16,1)
df1 <- data.frame(SITE,PLOT)
SITE <- rep(a, times = c(16,16,1))
PLOT <- c(rep(1:16,2),1)
df2 <- data.frame(SITE,PLOT)
df1 df2
SITE PLOT SITE PLOT
1 1 1 1
1 2 1 2
1 3 1 3
1 4 1 4
1 5 1 5
1 6 1 6
1 7 1 7
1 9 1 8
1 10 1 9
1 11 1 10
1 12 1 11
1 13 1 12
1 14 1 13
1 15 1 14
1 16 1 15
1 1 1 16
2 2 2 1
2 3 2 2
2 4 2 3
2 5 2 4
2 6 2 5
2 7 2 6
2 8 2 7
2 9 2 8
2 10 2 9
2 11 2 10
2 12 2 11
2 13 2 12
2 14 2 13
2 15 2 14
2 16 2 15
3 1 2 16
3 1
a <- anti_join(df1, df2, by=c('SITE', 'PLOT'))
a
<0 rows> (or 0-length row.names)
I'm sure the answer is obvious but I can't see it.
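The answer can be found in the help file:
anti_join(x, y) returns all rows from x without a match in y.
Here df1 is the data frame with the missing entry, so every row of df1 has a match in df2 and nothing is returned. Reversing the order of df1 and df2 will give you what you expect.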
anti_join(df2, df1, by=c('SITE', 'PLOT'))
# SITE PLOT
# 1 2 8
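To make the asymmetry concrete, here is a small sketch that rebuilds the two data frames from the question and runs anti_join in both directions. It assumes the dplyr package is installed; the names in_df1_only and in_df2_only are made up for illustration.

```r
library(dplyr)

# Same data as in the question: df1 lacks SITE 2, PLOT 8.
df1 <- data.frame(SITE = rep(1:3, times = c(16, 15, 1)),
                  PLOT = c(1:16, 1:7, 9:16, 1))
df2 <- data.frame(SITE = rep(1:3, times = c(16, 16, 1)),
                  PLOT = c(rep(1:16, 2), 1))

# Rows of df1 with no match in df2: none, since df1 is a subset of df2.
in_df1_only <- anti_join(df1, df2, by = c("SITE", "PLOT"))

# Rows of df2 with no match in df1: the entry that df1 is missing.
in_df2_only <- anti_join(df2, df1, by = c("SITE", "PLOT"))

nrow(in_df1_only)
in_df2_only
```

If it is not known in advance which data frame is incomplete, running both directions as above is a cheap way to check.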

Compare 2 values of the same row of a matrix with the row and column index of another matrix in R

I have a matrix, matrix1, with 11217 rows and 2 columns, and a second matrix, matrix2, with 10 rows and 10 columns. For each row of matrix1, I want to treat its two values as a row index and a column index into matrix2, and increase the value of the corresponding cell of matrix2 (currently 0) by 1.
c1 <- x[2:11218] #these values go from 1 to 10
#second column from index 3 to N
c2 <- x[3:11219] #these values also go from 1 to 10
#matrix with column c1 and c2
m1 <- as.matrix(cbind(c1 = c1, c2 = c2))
#empty matrix which will count the frequencies
m2 <- matrix(0, nrow = 10, ncol = 10)
#change row and column names of m2 to the numbers of 1 to 10
dimnames(m2) <-list(c(1:10), c(1:10))
#go through every row of the matrix m1 and look which rotation appears, add 1 to m2 if the rotation
#equals the corresponding index
r <- c(1:10)
c <- c(1:10)
for (i in 1:nrow(m1)) {
if(m1[i,1] == r & m1[i,2] == c)
m2[r,c]+1
}
No frequencies were calculated, and I don't understand why.
It appears that you are trying to replicate the behavior of table. I'd recommend just using it instead.
Simpler data (it appears you did not include variable x):
m1 <- matrix(round(runif(20, 1, 10)), ncol = 2)
Then, use table. Here, I am setting the values of each column to be a factor to ensure that the right columns are generated:
table(factor(m1[, 1], 1:10), factor(m1[, 2], 1:10))
gives:
1 2 3 4 5 6 7 8 9 10
1 3 4 0 4 2 0 5 3 2 0
2 3 7 9 7 4 5 3 4 5 2
3 4 6 3 10 8 9 4 2 7 3
4 5 2 14 3 7 13 8 11 3 3
5 2 13 2 5 8 5 7 7 8 6
6 1 10 7 4 5 6 8 5 8 5
7 3 3 6 5 4 5 4 8 7 7
8 5 5 8 7 6 10 5 4 3 4
9 2 5 8 4 7 4 4 6 4 2
10 3 1 2 3 3 5 3 5 1 0
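As for why the original loop counted nothing: the expression m2[r,c]+1 computes a value but never assigns it back to m2, and the if condition compares a single value against the whole vectors r and c instead of using the row's values as indices. A sketch of a corrected loop, checked against table, could look like the following. The data here are simulated with sample because x was not provided, so m1 is an assumption.

```r
set.seed(42)

# Simulated stand-in for the question's m1 (values from 1 to 10).
m1 <- cbind(c1 = sample(1:10, 200, replace = TRUE),
            c2 = sample(1:10, 200, replace = TRUE))

m2 <- matrix(0, nrow = 10, ncol = 10, dimnames = list(1:10, 1:10))

# Use each row's two values directly as the row and column index,
# and assign the incremented count back into m2.
for (i in seq_len(nrow(m1))) {
  r <- m1[i, 1]
  k <- m1[i, 2]
  m2[r, k] <- m2[r, k] + 1
}

# The same counts via table().
counts <- table(factor(m1[, 1], 1:10), factor(m1[, 2], 1:10))
all(as.vector(m2) == as.vector(counts))
```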

Assigning test / control group vector using split-apply-combine strategy [duplicate]

This should be simple, but it's got me pulling my hair out!
Here is some data:
Clicks <- c(1,2,3,4,5,6,5,4,3,2)
Cost <- c(10,11,12,13,14,15,14,13,12,11)
Cluster <- c(1,1,1,2,2,1,1,1,1,1)
df <- data.frame(Clicks,Cost,Cluster)
I want to filter my df by cluster, add a new column that assigns each row to a "test" or "control" group at random, then recombine the result into the original data frame.
Step 1: Filter (by cluster 1)
Clicks Cost Cluster
1 1 10 1
2 2 11 1
3 3 12 1
4 6 15 1
5 5 14 1
6 4 13 1
7 3 12 1
8 2 11 1
Step 2: Assign test and control group at random
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 6 15 1 Test
5 5 14 1 Control
6 4 13 1 Control
7 3 12 1 Test
8 2 11 1 Control
Step 3: Get back to the original data frame
Clicks Cost Cluster group
1 1 10 1 Test
2 2 11 1 Control
3 3 12 1 Control
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Test
7 5 14 1 Control
8 4 13 1 Control
9 3 12 1 Test
10 2 11 1 Control
Step 4: do the same for cluster 2
Thanks :)
How about:
df$Group <- 'NULL'
df1 <- df
df1[df1$Cluster==1, ]$Group <- ifelse(runif(sum(df1$Cluster==1)) > 0.5, 'Control', 'Test')
df1
Clicks Cost Cluster Group
1 1 10 1 Test
2 2 11 1 Test
3 3 12 1 Test
4 4 13 2 NULL
5 5 14 2 NULL
6 6 15 1 Control
7 5 14 1 Test
8 4 13 1 Test
9 3 12 1 Control
10 2 11 1 Control
df2 <- df
df2[df2$Cluster==2, ]$Group <- ifelse(runif(sum(df2$Cluster==2)) > 0.5, 'Control', 'Test')
df2
Clicks Cost Cluster Group
1 1 10 1 NULL
2 2 11 1 NULL
3 3 12 1 NULL
4 4 13 2 Test
5 5 14 2 Control
6 6 15 1 NULL
7 5 14 1 NULL
8 4 13 1 NULL
9 3 12 1 NULL
10 2 11 1 NULL
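The two steps above leave the assignments in separate data frames. If a single Group column covering every cluster is wanted, one possible sketch is to loop over the clusters and fill the column in place. Note that this swaps in a different randomisation from the answer: instead of an independent coin flip per row, it shuffles an evenly split vector of labels within each cluster.

```r
set.seed(1)

Clicks  <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2)
Cost    <- c(10, 11, 12, 13, 14, 15, 14, 13, 12, 11)
Cluster <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)
df <- data.frame(Clicks, Cost, Cluster)

df$Group <- NA_character_
for (cl in unique(df$Cluster)) {
  idx <- which(df$Cluster == cl)
  # Alternate Test/Control to get an even split, then shuffle the labels.
  labels <- rep(c("Test", "Control"), length.out = length(idx))
  df$Group[idx] <- sample(labels)
}
df
```

With this approach each cluster ends up with as close to a 50/50 split as its size allows, which the coin-flip version does not guarantee.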

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to be continuously increasing and jumps by 2 after each x group.
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment: the groupings can be arbitrary, you simply need to recode them as consecutive integers in order of appearance. There are a few ways to do this. @akrun has shown that it can be accomplished using the match function, or you can use factor together with as.numeric if that is easier for you to understand.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments: start from the row number and add 2 for every group that has already been completed.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
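As a quick check, the same computation can be written on plain vectors and compared with the desired output from the question (the names group_index, new_number and desired are made up for this sketch):

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)

# Consecutive group number in order of first appearance: 1, 1, 2, 2, 2, 3, ...
group_index <- match(x, unique(x))

# Row number plus a jump of 2 for every earlier group.
new_number <- seq_along(x) + (group_index - 1) * 2

desired <- c(1, 2, 5, 6, 7, 10, 11, 14, 17, 20, 21, 22, 25, 26)
all(new_number == desired)
```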
To create the sequence 0,0,4,4,4,9,..., basically what you're doing is taking the minimum of each group and subtracting 1. The easiest way to do this is with the dplyr package.
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24
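If you would rather not load dplyr, a base R sketch of the same per-group minimum uses ave, which applies a function within each group and returns a vector aligned with the original rows:

```r
df <- data.frame(x = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9),
                 y = c(1, 2, 3, 4, 6, 3, 7, 8, 6, 4, 3, 7, 3, 2))
df$newx <- match(df$x, unique(df$x))
df$newNumber <- seq_len(nrow(df)) + (df$newx - 1) * 2

# Minimum of newNumber within each x group, minus 1.
df$newNumber2 <- ave(df$newNumber, df$x, FUN = min) - 1
df$newNumber2
```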

remove i+1th term if reoccuring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears again in the (i+1)th position, the recurring row is removed?
Therefore the output should look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for statement, but I have no experience with these.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
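# Keep the first row, and every row whose A differs from the previous one.
# (Using lag(A, default=FALSE) instead would drop a first row where A is 0.)
dat1 %>% filter(row_number() == 1 | A != lag(A))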
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
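Using rle, which gives the length of each run of identical values, so the cumulative sum of the run lengths locates the first row of each run: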
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
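Since the question asked for a function, here is one way the first approach could be wrapped up. The function name drop_consecutive_repeats and its arguments are made up for this sketch.

```r
# Keep only the first row of each run of identical consecutive values
# in the column named by `col`.
drop_consecutive_repeats <- function(df, col) {
  v <- df[[col]]
  # TRUE for the first row, then TRUE wherever the value changes.
  keep <- c(TRUE, v[-1] != v[-length(v)])
  df[keep, , drop = FALSE]
}

A <- c(1, 2, 2, 2, 3, 4, 8, 6, 6, 1, 2, 3, 4)
B <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3)
data <- data.frame(A, B)

result <- drop_consecutive_repeats(data, "A")
result
```

The original row names (1, 2, 5, 6, ...) are preserved in the result; set rownames(result) <- NULL if consecutive numbering is preferred.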
