Creating duplicated data frames with different IDs in R

I have a question for the community and am hoping for some help.
I am trying to duplicate a data frame like the one below:
ID Time Solve
1 0 1
1 2 2
1 4 3
1 6 1
I am trying to duplicate the above data frame 100 times so that it reads as below:
ID Time Solve
1 0 1
1 2 2
1 4 3
1 6 1
2 0 1
2 2 2
2 4 3
2 6 1
3 0 1
3 2 2
3 4 3
3 6 1
4 0 1
4 2 2
4 4 3
4 6 1
.....
100 0 1
100 2 2
100 4 3
100 6 1
Does anyone have a good solution for this or a resource to read up on this?
Thanks!

We can use replicate
out <- do.call(rbind, replicate(100, df1, simplify = FALSE))
out$ID <- as.integer(gl(nrow(out), nrow(df1), nrow(out)))
Or another option is rep
out <- df1[rep(seq_len(nrow(df1)), 100),]
out$ID <- as.integer(gl(nrow(out), nrow(df1), nrow(out)))
Or make use of uncount
library(tidyr)
library(dplyr)
uncount(df1, 100) %>%
  mutate(ID = rep(seq_len(100), times = nrow(df1))) %>%
  arrange(ID) # uncount() repeats each row consecutively, so reassign IDs and reorder
Or another option is
df1 %>%
  nest_by(ID) %>%
  uncount(100) %>%
  ungroup() %>%
  mutate(ID = row_number()) %>%
  unnest(c(data))
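One more option (my own sketch, not part of the answer above, assuming the df1 defined in the data block below): bind_rows() can stamp the copy number directly through its .id argument.
library(dplyr)
# drop the existing ID column first so the new .id column does not collide with it
out <- bind_rows(replicate(100, df1[-1], simplify = FALSE), .id = "ID")
out$ID <- as.integer(out$ID)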
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L), Time = c(0L, 2L, 4L, 6L
), Solve = c(1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-4L))

Related

How to remove rows if values from a specified column in data set 1 do not match the values of the same column in data set 2 using dplyr

I have 2 data sets, and both include ID columns with the same IDs. I have already removed rows from the first data set. For the second data set, I would like to remove any rows associated with IDs that do not match the first data set, using dplyr.
Meaning whatever is in DF2 must be in DF1; if it is not, it must be removed from DF2.
For example:
DF1
ID X Y Z
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
DF2
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
DF2 once rows have been removed
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
I used anti_join(), which shows me the rows that differ, but I cannot figure out how to remove the rows whose IDs do not match the first data set using dplyr.
Try with paste
i1 <- do.call(paste, DF2) %in% do.call(paste, DF1)
# if it is only to compare the 'ID' columns
i1 <- DF2$ID %in% DF1$ID
DF3 <- DF2[i1,]
DF3
ID A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 5 5 5 5
5 6 6 6 6
DF4 <- DF2[!i1,]
DF4
ID A B C
4 4 4 4 4
7 7 7 7 7
data
DF1 <- structure(list(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L,
5L, 6L), Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
DF2 <- structure(list(ID = 1:7, A = 1:7, B = 1:7, C = 1:7), class = "data.frame", row.names = c(NA,
-7L))
# Load package
library(dplyr)
# Load dataframes
df1 <- data.frame(
  ID = 1:6,
  X = 1:6,
  Y = 1:6,
  Z = 1:6
)
df2 <- data.frame(
  ID = 1:7,
  X = 1:7,
  Y = 1:7,
  Z = 1:7
)
# Include all rows in df1
df1 %>%
  left_join(df2)
Joining, by = c("ID", "X", "Y", "Z")
ID X Y Z
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
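For the stated goal of keeping only the DF2 rows whose ID appears in DF1, dplyr's filtering joins are also worth noting. A minimal sketch (my own addition, using the DF1/DF2 objects from the data block above):
library(dplyr)
# semi_join() keeps the DF2 rows with a matching ID in DF1, without adding DF1's columns
DF3 <- DF2 %>% semi_join(DF1, by = "ID")
# anti_join() gives the complement: the DF2 rows with no matching ID in DF1
DF4 <- DF2 %>% anti_join(DF1, by = "ID")
# an equivalent filter: DF2 %>% filter(ID %in% DF1$ID)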

Identify unique values within a multivariable subset

I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
  group_by(Subject, Site) %>%
  mutate(Want = match(Date, unique(Date))) %>%
  ungroup()
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
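On the ordering concern in the question: ave() returns its result aligned to the original row order, so the order of the grouping columns does not matter. A quick check (my own, with throwaway names w1/w2 and the val object from the data block below):
w1 <- with(val, ave(as.integer(as.Date(Date)), Subject, Site, FUN = \(x) match(x, unique(x))))
w2 <- with(val, ave(as.integer(as.Date(Date)), Site, Subject, FUN = \(x) match(x, unique(x))))
identical(w1, w2) # TRUE: same values, in the original row order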
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))

Merging using index as key in R

I have two data frames
df1:
01.2020 02.2020 03.2020
11190 4 1 2
12345 3 3 1
11323 1 2 2
df2
08.2020 04.2020 09.2020
11190 1 2 2
12345 1 2 3
11324 1 2 2
Dummy Data -
df1 <- structure(list(`01.2020` = c(4L, 3L, 1L), `02.2020` = c(1L, 3L, 2L), `03.2020` = c(2L, 1L, 2L)), class = "data.frame", row.names = c("11190","12345", "11323"))
df2 <- structure(list(`08.2020` = c(1L, 1L, 1L), `04.2020` = c(2L, 2L, 2L), `09.2020` = c(2L, 3L, 2L)), class = "data.frame", row.names = c("11190", "12345", "11324"))
I want to "outer merge" these two dataframes by key = index
How can we do that? what should be there in the place of by=
merge(x = sheet1_UN, y = sheet2_UN, by = "" , all = TRUE)
I want my final dataframe to look something like this
01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
11190 4 1 2 1 2 2
12345 3 3 1 1 2 3
11323 1 2 2 - - -
11324 - - - 1 2 2
Thanks in advance.
Another method:
df3 <- merge(df1, df2, by = "row.names", all = TRUE)
output:
Row.names 01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
1 11190 4 1 2 1 2 2
2 11323 1 2 2 NA NA NA
3 11324 NA NA NA 1 2 2
4 12345 3 3 1 1 2 3
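If actual row names rather than a Row.names column are wanted (as in the desired output above), they can be restored afterwards. A small follow-up sketch, assuming the df3 created by the merge() call:
rownames(df3) <- df3$Row.names
df3$Row.names <- NULL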
This should do:
library(dplyr)
library(tibble) # rownames_to_column() comes from tibble
df1 %>% rownames_to_column('id') %>%
  full_join(df2 %>% rownames_to_column('id'), by = 'id')
output:
id 01.2020 02.2020 03.2020 08.2020 04.2020 09.2020
1 11190 4 1 2 1 2 2
2 12345 3 3 1 1 2 3
3 11323 1 2 2 NA NA NA
4 11324 NA NA NA 1 2 2
You might use replace_na('-') if you want no NA values, like this:
library(tidyr) # replace_na() comes from tidyr
df1 %>% rownames_to_column('id') %>%
  full_join(df2 %>% rownames_to_column('id'), by = 'id') %>%
  mutate(across(everything(), ~ .x %>% as.character %>% replace_na('-')))

Using tidyverse to clean up rank-choice survey

I have survey data in R that looks like this, where I've presented people with two groups of actions - High and Low - and asked them to rank each action. Each group contains unique actions, marked by the letter (6 actions in total).
id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4
I need a new df that looks like the one below, where each High action is assigned a new numeric rank (between 0 and 3) corresponding to the number of Low action items that were ranked below that High action.
For example, a person with id 001 ranked A_High at number 5, B_High at 2, and C_High at 1. A_High's new rank would be 1 (since only 1 Low action, D_Low is ranked below A_High), B_High's new rank would be 3 (since all 3 Low actions were ranked below B_High), and C_High's new rank would be 3 (since all 3 Low actions were ranked below C_High).
id A_High_rank B_High_rank C_High_rank
001 1 3 3
002 0 1 1
003 2 3 0
004 0 0 2
I have a sense that this can be done with if/else statements but suspect that there should be a far more efficient way of achieving this with tidyverse. In the real dataset, I have 1000+ rows and 12 actions (6 High and 6 Low). I would appreciate any help on this.
Thanks!
Data:
"id A_High B_High C_High D_Low E_Low F_Low
001 5 2 1 6 4 3
002 6 4 3 5 2 1
003 3 1 6 2 4 5
004 6 5 2 1 3 4"
A base R option would be to loop over the 'High' columns, get the rowSums of the logical matrix created by checking whether each 'High' column is less than the 'Low' columns, and rename the output columns by appending _rank as a suffix.
out <- cbind(df1[1], sapply(df1[2:4],
  function(x) rowSums(x < df1[endsWith(names(df1), 'Low')])))
names(out)[-1] <- paste0(names(out)[-1], "_rank")
-output
out
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
Or using dplyr
library(dplyr)
df1 %>%
  transmute(id, across(ends_with('High'),
    ~ rowSums(. < select(df1, ends_with('Low'))), .names = '{.col}_rank'))
# id A_High_rank B_High_rank C_High_rank
#1 1 1 3 3
#2 2 0 1 1
#3 3 2 3 0
#4 4 0 0 2
data
df1 <- structure(list(id = 1:4, A_High = c(5L, 6L, 3L, 6L), B_High = c(2L,
4L, 1L, 5L), C_High = c(1L, 3L, 6L, 2L), D_Low = c(6L, 5L, 2L,
1L), E_Low = c(4L, 2L, 4L, 3L), F_Low = c(3L, 1L, 5L, 4L)),
class = "data.frame", row.names = c(NA,
-4L))
After much suffering, this is the tidyverse solution I came up with. This was fun!
library(tidyverse)
data %>%
  pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
  pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
  filter(High - Low < 0) %>%
  group_by(`High Variables`, `id`) %>%
  summarise(Count = n()) %>%
  pivot_wider(names_from = `High Variables`, values_from = Count) %>%
  arrange(id)
Translation:
The first two lines create two pairs of columns and leave id untouched. Each pair has two columns, one with the original column names and the other with the values. Each pair of columns represents either High or Low.
Then I filtered the rows, keeping only those where Low was greater than High. Then I counted how many rows were left for each id and reversed the format back.
Now I just have to figure out how to turn those NAs into 0s (one option using values_fill is sketched after the output below).
Here's the output:
> data %>%
+ pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
+ pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
+ filter(High < Low) %>%
+ group_by(`High Variables`, `id`) %>%
+ summarise(Count = n()) %>%
+ pivot_wider(names_from = `High Variables`, values_from = Count) %>%
+ arrange(id)
`summarise()` regrouping output by 'High Variables' (override with `.groups` argument)
# A tibble: 4 x 4
id A_High B_High C_High
<int> <int> <int> <int>
1 1 1 3 3
2 2 NA 1 1
3 3 2 3 NA
4 4 NA NA 2
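On the remaining NA issue: pivot_wider() has a values_fill argument that fills the missing High/Low combinations directly, so no separate replacement step is needed. A sketch of the same pipeline with that tweak (my own addition, not part of the original answer):
library(tidyverse)
data %>%
  pivot_longer(cols = ends_with("_High"), names_to = "High Variables", values_to = "High") %>%
  pivot_longer(cols = ends_with("_Low"), names_to = "Low Variables", values_to = "Low") %>%
  filter(High < Low) %>%
  count(id, `High Variables`, name = "Count") %>%
  pivot_wider(names_from = `High Variables`, values_from = Count, values_fill = 0) %>%
  arrange(id)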

Sorting and calculating sum and rank with new columns in R

I have 200 columns and want to calculate row sums and a rank, and then generate new columns. Here is an example of the data:
df<-read.table(text="Q1a Q2a Q3b Q4c Q5a Q6c Q7b
1 2 4 2 2 0 1
3 2 1 2 2 1 1
4 3 2 1 1 1 1",h=T)
I want to sum the a, b and c columns separately for each row, and then add those sums together for a total. Next I want to calculate the rank of each row by that total. I want to generate the following table:
Q1a Q2a Q3b Q4c Q5a Q6c Q7b a b c Total Rank
1 2 4 2 2 0 1 5 5 2 12 2
3 2 1 2 2 1 1 7 2 3 12 2
4 3 2 1 1 1 1 8 3 2 13 1
library(dplyr)
df %>%
  cbind(sapply(c('a', 'b', 'c'), function(x) rowSums(.[, grep(x, names(.)), drop = FALSE]))) %>%
  mutate(Total = a + b + c,
         Rank = match(Total, sort(Total, decreasing = T)))
Output is:
Q1a Q2a Q3b Q4c Q5a Q6c Q7b a b c Total Rank
1 1 2 4 2 2 0 1 5 5 2 12 2
2 3 2 1 2 2 1 1 7 2 3 12 2
3 4 3 2 1 1 1 1 8 3 2 13 1
Sample data:
df <- structure(list(Q1a = c(1L, 3L, 4L), Q2a = c(2L, 2L, 3L), Q3b = c(4L,
1L, 2L), Q4c = c(2L, 2L, 1L), Q5a = c(2L, 2L, 1L), Q6c = c(0L,
1L, 1L), Q7b = c(1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-3L))
You can also go with the tidyverse approach. However, it is longer.
library(tidyverse)
df %>%
  rownames_to_column(var = "ID") %>%
  gather(question, value, -ID) %>%
  mutate(type = substr(question, 3, 3)) %>%
  group_by(ID, type) %>%
  summarise(sumType = sum(value, na.rm = TRUE)) %>%
  as.data.frame() %>%
  spread(type, sumType) %>%
  mutate(Total = a + b + c,
         Rank = match(Total, sort(Total, decreasing = T)))
Results:
ID a b c Total Rank
1 1 5 5 2 12 2
2 2 7 2 3 12 2
3 3 8 3 2 13 1
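gather() and spread() are superseded in current tidyr, so here is the same logic rewritten with pivot_longer() and pivot_wider() (my own sketch, same data and assumptions as the answer above):
library(tidyverse)
df %>%
  rownames_to_column(var = "ID") %>%
  pivot_longer(-ID, names_to = "question", values_to = "value") %>%
  mutate(type = substr(question, 3, 3)) %>% # the third character is the a/b/c group letter
  group_by(ID, type) %>%
  summarise(sumType = sum(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = sumType) %>%
  mutate(Total = a + b + c,
         Rank = match(Total, sort(Total, decreasing = TRUE)))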
