Return rows of data frame that meet multiple criteria in R (panel data random sample)

I am hoping to create a random sample from panel data based on the unique id.
For instance, if you start with:
e = data.frame(id=c(1,1,1,2,2,3,3,3,4,4,4,4), data=c(23,34,45,1,23,45,6,2,9,39,21,1))
And you want a random sample of 2 unique ids:
out = data.frame(id=c(1,1,1,3,3,3), data=c(23,34,45,45,6,2))
Although sample gives me random unique ids:
sample(e$id, 2)  # gives c(1, 3)
I can't figure out how to use logical calls to return all the desired data.
I have tried a number of things including:
e[ e$id == sample(e$id, 2) ]  # only returns half the data
Any ideas? It's killing me.

I'm not entirely sure what your expected result should be, but does this work for what you're trying to do?
> e[e$id %in% sample(e$id, 2), ]
id data
6 3 45
7 3 6
8 3 2
9 4 9
10 4 39
11 4 21
12 4 1
Or maybe you want this:
> e[e$id %in% sample(unique(e$id), 2), ]
id data
1 1 23
2 1 34
3 1 45
9 4 9
10 4 39
11 4 21
12 4 1
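As a small aside, a reusable helper built around the second approach might look like the sketch below; the function name sample_ids is purely illustrative and not part of the original answer.
# Illustrative helper: draw n distinct ids and return all of their rows
sample_ids <- function(df, id_col = "id", n = 2) {
  ids <- sample(unique(df[[id_col]]), n)
  df[df[[id_col]] %in% ids, ]
}
set.seed(1)
sample_ids(e, n = 2)  # all rows for two randomly chosen ids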

Related

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <-
  tibble::tibble(StudentID = 1:10,
                 Condition = sample(conditions,
                                    size = 10,
                                    replace = TRUE),
                 XP = stats::rpois(n = 10,
                                   lambda = 15))
This creates the following tibble.
StudentID Condition XP
1         2          8
2         3         11
3         3         16
4         3         12
5         1         22
6         3         16
7         1         18
8         3          8
9         2         14
10        1         17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe; in other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
1         2          8  4
2         3         11  1
3         3         16  2
4         3         12  1
5         1         22  3
6         3         16  NA
7         1         18  3
8         3          8  2
9         2         14  4
10        1         17  NA
Note that two students did not receive a pairing, because there was an odd number in condition 1 and condition 3. If there is an odd number, the DyadID can be NA.
Thank you for your help with this!
Using match to get a unique id per Condition, and sample for randomness.
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition,sample(unique(Condition))))
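If the goal is to randomly pair students within each Condition (as in the example output above), here is one possible tidyverse sketch; the helper columns pos, pair and key are illustrative names only, and students left over in an odd-sized condition get NA.
library(dplyr)
set.seed(111)
df_dyads <- df_sim %>%
  group_by(Condition) %>%
  mutate(pos  = sample(n()),                          # shuffle students within each condition
         pair = if_else(pos <= 2 * (n() %/% 2),       # the odd student out gets NA
                        ceiling(pos / 2), NA_real_)) %>%
  ungroup() %>%
  mutate(key    = if_else(is.na(pair), NA_character_,
                          paste(Condition, pair)),    # condition-specific pair label
         DyadID = match(key, unique(na.omit(key)))) %>%  # globally unique dyad number
  select(-pos, -pair, -key)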

For Loop Adding Extra Rows to The Data Frame

Hello, I am very new to the programming world and to data science, and I am trying to work my way through it.
I am trying to assign values to a column in a data frame using a for loop, so that the data frame is divided into ten groups and every row in each group is assigned a rank: rows 1 to 10 get rank 1, rows 11 to 20 get rank 2, and so on. The original dimension of the subset data set is 100 x 6.
My data frame looks like this (the original post showed a screenshot of the data frame).
The code I have written is:
x <- round(nrow(subset) / 10)
a = 1
for (j in 1:10) {
  for (i in a:x) {
    subset[i, "rank"] = j
  }
  j = j + 1
  a = x + 1
  x = x * j
}
However, the loop seems to run indefinitely and keeps adding rows to the data frame. I had to stop it manually, and the resulting dimension of the subset data frame was 17926 x 6.
Please help me understand where I am going wrong in writing the loop.
P.S. subset is a data frame name and not the subset function in R
Thanks in Advance !!
It might be better for you to start working with vectorized calculations instead of loops. This will help you in the future.
For example:
df <- data.frame(x = 1:100)
df$rank <- (df$x - 1) %/% 10 + 1  # integer-divide the zero-based position by 10, then shift to 1-based ranks
df
results in:
x rank
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 3
22 22 3
23 23 3
24 24 3
25 25 3
How about something like this:
subset$Rank <- ceiling(as.numeric(rownames(subset))/10)
as.numeric converts the row name into a number; dividing it by 10 and rounding up should give you what you need. Let me know if I've misunderstood.
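For completeness, here is a possible corrected version of the original loop (a sketch assuming subset has exactly 100 rows split into 10 equal groups); the main fix is computing fixed start and end indices for each group instead of modifying the counters inside the loop.
n_groups   <- 10
group_size <- nrow(subset) / n_groups   # 10 rows per group for a 100-row data frame
for (j in seq_len(n_groups)) {
  rows <- ((j - 1) * group_size + 1):(j * group_size)  # rows 1-10, 11-20, ...
  subset[rows, "rank"] <- j
}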

Subset specific row and last row from data frame

I have a data frame which contains data relating to the score of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
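For example, a final ordering step on the combined result could be (just a usage sketch):
Data2 <- Data2[order(Data2$ID, Data2$Time), ]  # sort back into ID / Time order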
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
                  FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
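As a side note, in newer dplyr versions (1.0.0 and later) top_n() has been superseded; a possible equivalent for the last-row step uses slice_max():
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)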

R - Subset rows of a data frame on a condition in all the columns

I want to subset the rows of a data frame based on a single condition applied to all the columns, avoiding the use of subset().
I understand how to subset on a single column, but I cannot generalize it to all columns (without naming every column explicitly).
Initial data frame :
V1 V2 V3
1 1 8 15
2 2 0 16
3 3 10 17
4 4 11 18
5 5 0 19
6 0 13 20
7 7 14 21
In this example, I want to subset the rows without zeros.
Expected output :
V1 V2 V3
1 1 8 15
2 3 10 17
3 4 11 18
4 7 14 21
Thanks
# create your data (cbind returns a matrix here; the same subsetting works on a data frame)
a <- c(1, 2, 3, 4, 5, 0, 7)
b <- c(8, 0, 10, 11, 0, 14, 14)
c <- c(15, 16, 17, 18, 19, 20, 21)
data <- cbind(a, b, c)
# keep rows whose minimum is positive, i.e. filter out rows containing at least one 0
data[apply(data, 1, min) > 0, ]
A solution using the rowSums function after comparing to 0.
# creating your data
data <- data.frame(a = c(1, 2, 3, 4, 5, 0, 7),
                   b = c(8, 0, 10, 11, 0, 14, 14),
                   c = c(15, 16, 17, 18, 19, 20, 21))
# Selecting rows containing no 0
data[which(rowSums(as.matrix(data) == 0) == 0), ]
Another way
data[-unique(row(data)[grep("^0$", unlist(data))]),]
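A possible tidyverse variant (a sketch assuming data is a data frame, as in the second snippet above, and dplyr >= 1.0.4, where if_all() is available):
library(dplyr)
# keep rows where every column is non-zero
data %>% filter(if_all(everything(), ~ .x != 0))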

Randomly choose value between 1 and 10 with equal number of instances [duplicate]

Possible Duplicate:
Select and insert value unique number of times in R
I would like to generate 2000 random numbers between 1 and 10 such that for each random number I have the same number of instances.
In this case 200 for each number.
What should be random is the order in which they are generated.
I have the following problem:
I have an array with 2000 entries whose values are not all unique; for example, it starts like this:
11112233333333344445667777777777
and consists of 2000 entries.
I would like to generate random numbers and assign each UNIQUE value a separate random number, while still having an entry for each element.
So my intended result would look like this:
original array: 11112233333333344445667777777777
random numbers: 33334466666666699991778888888888
You could do this in a few steps:
my_numbers <- rep(1:10, each=200)
my_randomizer <- sample(seq_along(my_numbers), length(my_numbers))
my_random_numbers <- my_numbers[my_randomizer]
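A quick sanity check (usage sketch): every value should now appear exactly 200 times.
table(my_random_numbers)
#   1   2   3   4   5   6   7   8   9  10
# 200 200 200 200 200 200 200 200 200 200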
Based on the edits:
I would use rle. It sounds like you don't have an array, but instead a vector:
my_array_rled <- rle(my_array)
my_random_numbers <- sample(1:10, length(unique(my_array)))
my_array_rled$values <- factor(my_array_rled$values)
levels(my_array_rled$values) <- my_random_numbers
my_array_randomized <- inverse.rle(my_array_rled)
If I understand you correctly you can use "rep" to replicate your random numbers 200 times and "sample" to randomize the resulting vector.
x <- sample(rep(runif(10, 1, 10), 200))  # 10 random values, each repeated 200 times, then shuffled
Non-vectorized code:
# using a seed for a reproducible example
set.seed(2)
original_array <- c(1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)
random_numbers <- numeric(length = length(original_array))
rdnum <- sample(unique(original_array), length(unique(original_array)))
for (i in 1:length(unique(original_array))) {
  random_numbers[original_array == i] <- rdnum[i]
}
random_numbers
2 2 2 2 5 5 3 3 3 3 3 3 3 3 3 1 1 1 1 6 7 7 4 4 4 4 4 4 4 4 4 4
The table function with sample comes in quite handy for this scenario:
set.seed(1)
## ASSUMING ORIGINAL IS A VECTOR
original <- c(1, 1, 1, 1, 2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)
## CREATE A TABLE OF ALL THE VALUES
tabl <- table(original)
## RNG is the sample range to select from. Assuming 1:10 in this example
RNG <- 1:10
## PICK VALUES RANDOMLY FROM RNG
tabl[] <- sample(RNG, length(tabl), replace=FALSE)
# note that the `names` of `tabl` will contain the values from `original`
# whereas the values of `tabl` will contain the new random value.
## ASSIGN NEW VALUES
randomNums <- original
for (i in seq(length(tabl))) {
  randomNums[original == as.numeric(names(tabl))[[i]]] <- tabl[[i]]
}
Results:
rbind(orig=original, rand=randomNums)
orig: 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 5 6 6 7 7 7 7 7 7 7 7 7 7
rand: 3 3 3 3 4 4 5 5 5 5 5 5 5 5 5 7 7 7 7 2 8 8 9 9 9 9 9 9 9 9 9 9
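As a quick check (usage sketch), the relabelling preserves the group sizes; table(randomNums) shows the same counts keyed by the new random values.
table(original)    # counts per original value
table(randomNums)  # same group sizes, now keyed by the new random labels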
