Reorder/standardize and create rows in R

I am new to R and have been looking for a way to restructure an existing dataframe I have been given. I have a set of variables, each of which contains some of a set of subcategories. Assume it looks something like this:
Michael Physics 1 2
Michael Math 2 4
Michael Science 3 4
Michael PE 2 1
James Art 0 9
James PE 1 2
James Physics -1 2
James Science 1 2
Simon PE 1 2
Simon Art 1 3
Simon Music 1 4
Simon Science 1 4
Notably, the second column has a "standard" set of variables, so that each student shares most but not necessarily all of the variables, and the ordering of these variables is scrambled. I want to convert this dataframe to a "standard format": each student should have ALL of the variables, in the same order. So if I define a list of all the subjects, say Physics, Math, Science, Art, PE, Music, I would like there to be 18 rows in my modified dataframe (6 for each student, in the order defined for the subjects). If the student and subject are contained in the original dataset, the row should have the data from that row; if the student and subject don't exist in the original dataframe, the other data columns would just be NA.

Update on OP's comment:
To keep the original order you could factor Student and define level:
df <- df %>%
mutate(Student = factor(Student, levels = c("Michael", "James", "Simon")))
df1 <- df %>%
expand(Student, Course)
df %>%
right_join(df1) %>%
arrange(Student, Course)
Output:
Student Course V1 V2
<fct> <chr> <dbl> <dbl>
1 Michael Art NA NA
2 Michael Math 2 4
3 Michael Music NA NA
4 Michael PE 2 1
5 Michael Physics 1 2
6 Michael Science 3 4
7 James Art 0 9
8 James Math NA NA
9 James Music NA NA
10 James PE 1 2
11 James Physics -1 2
12 James Science 1 2
13 Simon Art 1 3
14 Simon Math NA NA
15 Simon Music 1 4
16 Simon PE 1 2
17 Simon Physics NA NA
18 Simon Science 1 4
We could combine expand and right_join
library(dplyr)
library(tidyr)
df1 <- df %>%
expand(Student, Course)
df %>%
right_join(df1) %>%
arrange(Student, Course)
Output:
Student Course V1 V2
<chr> <chr> <dbl> <dbl>
1 James Art 0 9
2 James Math NA NA
3 James Music NA NA
4 James PE 1 2
5 James Physics -1 2
6 James Science 1 2
7 Michael Art NA NA
8 Michael Math 2 4
9 Michael Music NA NA
10 Michael PE 2 1
11 Michael Physics 1 2
12 Michael Science 3 4
13 Simon Art 1 3
14 Simon Math NA NA
15 Simon Music 1 4
16 Simon PE 1 2
17 Simon Physics NA NA
18 Simon Science 1 4

In the below, we repeatedly use pivot_ to get the desired result. The output is sorted by student name and subject.
library(tidyverse)
df <- read_delim("Michael Physics 1 2
Michael Math 2 4
Michael Science 3 4
Michael PE 2 1
James Art 0 9
James PE 1 2
James Physics -1 2
James Science 1 2
Simon PE 1 2
Simon Art 1 3
Simon Music 1 4
Simon Science 1 4", delim = " ", col_names = c("student", "subject", "v1", "v2"))
df %>%
pivot_wider(names_from = "subject", values_from = c("v1", "v2")) %>%
pivot_longer(cols = starts_with("v"), names_to = "name", values_to = "value") %>%
separate(name, into = c("var", "subject"), sep = "_") %>%
pivot_wider(names_from = var, values_from = value) %>%
arrange(student, subject)
#> # A tibble: 18 x 4
#> student subject v1 v2
#> <chr> <chr> <dbl> <dbl>
#> 1 James Art 0 9
#> 2 James Math NA NA
#> 3 James Music NA NA
#> 4 James PE 1 2
#> 5 James Physics -1 2
#> 6 James Science 1 2
#> 7 Michael Art NA NA
#> 8 Michael Math 2 4
#> 9 Michael Music NA NA
#> 10 Michael PE 2 1
#> 11 Michael Physics 1 2
#> 12 Michael Science 3 4
#> 13 Simon Art 1 3
#> 14 Simon Math NA NA
#> 15 Simon Music 1 4
#> 16 Simon PE 1 2
#> 17 Simon Physics NA NA
#> 18 Simon Science 1 4
Created on 2021-07-18 by the reprex package (v2.0.0)

You can use complete. To preserve the original ordering of the data you can save the name of the students in a variable and use match and arrange.
library(dplyr)
library(tidyr)
original_order <- unique(df$V1)
df %>% complete(V1, V2) %>% arrange(match(V1, original_order))
# V1 V2 V3 V4
# <chr> <chr> <int> <int>
# 1 Michael Art NA NA
# 2 Michael Math 2 4
# 3 Michael Music NA NA
# 4 Michael PE 2 1
# 5 Michael Physics 1 2
# 6 Michael Science 3 4
# 7 James Art 0 9
# 8 James Math NA NA
# 9 James Music NA NA
#10 James PE 1 2
#11 James Physics -1 2
#12 James Science 1 2
#13 Simon Art 1 3
#14 Simon Math NA NA
#15 Simon Music 1 4
#16 Simon PE 1 2
#17 Simon Physics NA NA
#18 Simon Science 1 4
data
df <- structure(list(V1 = c("Michael", "Michael", "Michael", "Michael",
"James", "James", "James", "James", "Simon", "Simon", "Simon",
"Simon"), V2 = c("Physics", "Math", "Science", "PE", "Art", "PE",
"Physics", "Science", "PE", "Art", "Music", "Science"), V3 = c(1L,
2L, 3L, 2L, 0L, 1L, -1L, 1L, 1L, 1L, 1L, 1L), V4 = c(2L, 4L,
4L, 1L, 9L, 2L, 2L, 2L, 2L, 3L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -12L))
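The outputs above sort the subjects alphabetically, whereas the question asks for a specific subject order (Physics, Math, Science, Art, PE, Music). A minimal sketch of one way to enforce it, assuming the V1..V4 column names from the data block above: convert both key columns to factors with explicit levels before calling complete. The names subject_order, student_order and out are illustrative only.

```r
library(dplyr)
library(tidyr)

# Same data as above, rebuilt here so the sketch is self-contained
df <- data.frame(
  V1 = rep(c("Michael", "James", "Simon"), each = 4),
  V2 = c("Physics", "Math", "Science", "PE",
         "Art", "PE", "Physics", "Science",
         "PE", "Art", "Music", "Science"),
  V3 = c(1L, 2L, 3L, 2L, 0L, 1L, -1L, 1L, 1L, 1L, 1L, 1L),
  V4 = c(2L, 4L, 4L, 1L, 9L, 2L, 2L, 2L, 2L, 3L, 4L, 4L),
  stringsAsFactors = FALSE
)

# Desired orders: subjects as listed in the question, students as they first appear
subject_order <- c("Physics", "Math", "Science", "Art", "PE", "Music")
student_order <- unique(df$V1)

out <- df %>%
  mutate(V1 = factor(V1, levels = student_order),
         V2 = factor(V2, levels = subject_order)) %>%
  complete(V1, V2) %>%   # one row per student/subject pair, NA where missing
  arrange(V1, V2)        # factors sort by level order, not alphabetically
```

Because complete expands over all factor levels, a subject that no student takes would still appear as long as it is listed in subject_order.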

Related

Replacing NAs with existing data when merging two dataframes in R

I would like to merge two dataframes. There are some shared variables and some different variables, and there are different numbers of rows in each dataframe. The dataframes share some rows, but not all. And both dataframes have missing data that the other may have.
DF1:
name  age weight height
Tim   7   54     112
Dave  5   50     NA
Larry NA  42     73
Rob   1   30     43
DF2:
name  age weight height grade
Tim   7   NA     112    2
Dave  NA  50     103    1
Larry 3   NA     73     NA
Rob   1   30     NA     NA
John  6   60     NA     1
Tom   8   61     112    2
I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden, such that if one of the two dataframes has a value where the other has NA, I want the value to be carried through into the third dataframe. Ideally, the last dataframe should only have NAs when both DF1 and DF2 had NAs in that same location.
Ideal Data Frame
name  age weight height grade
Tim   7   54     112    2
Dave  5   50     103    1
Larry 3   42     73     NA
Rob   1   30     43     NA
John  6   60     NA     1
Tom   8   61     112    2
I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?
This is a typical case that rows_patch() from dplyr can treat.
library(dplyr)
rows_patch(df2, df1, by = "name")
name age weight height grade
1 Tim 7 54 112 2
2 Dave 5 50 103 1
3 Larry 3 42 73 NA
4 Rob 1 30 43 NA
5 John 6 60 NA 1
6 Tom 8 61 112 2
Data
df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"), age = c(7L,
5L, NA, 1L), weight = c(54L, 50L, 42L, 30L), height = c(112L,
NA, 73L, 43L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John",
"Tom"), age = c(7L, NA, 3L, 1L, 6L, 8L), weight = c(NA, 50L,
NA, 30L, 60L, 61L), height = c(112L, 103L, 73L, NA, NA, 112L),
grade = c(2L, 1L, NA, NA, 1L, 2L)), class = "data.frame", row.names = c(NA, -6L))
I like the powerjoin package suggested as an answer to the question in the first comment, which I had never heard of before.
However, if you want to avoid using extra packages, you can do it in base R. This approach also avoids having to explicitly name each column - the dplyr approaches suggested in the comments do not do that, although perhaps could be modified.
# Load data
df1 <- read.table(text = "name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43", header=TRUE)
df2 <- read.table(text = "name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2", header=TRUE)
df3 <- merge(df1, df2, by = "name", all = TRUE, sort=FALSE)
# Coalesce the common columns
common_cols <- names(df1)[names(df1)!="name"]
df3[common_cols] <- lapply(common_cols, function(col) {
x <- df3[[paste0(col, ".x")]]
y <- df3[[paste0(col, ".y")]]
# ifelse() keeps this base R only; coalesce() would require dplyr
ifelse(is.na(x), y, x)
})
# Select desired columns
df3[names(df2)]
# name age weight height grade
# 1 Tim 7 54 112 2
# 2 Dave 5 50 103 1
# 3 Larry 3 42 73 NA
# 4 Rob 1 30 43 NA
# 5 John 6 60 NA 1
# 6 Tom 8 61 112 2
There are advantages to using base R, but powerjoin looks like an interesting package too.
Another possible solution:
library(tidyverse)
df2 %>%
bind_rows(df1) %>%
group_by(name) %>%
fill(age:grade, .direction = "updown") %>%
ungroup %>%
distinct
#> # A tibble: 6 x 5
#> name age weight height grade
#> <chr> <int> <int> <int> <int>
#> 1 Tim 7 54 112 2
#> 2 Dave 5 50 103 1
#> 3 Larry 3 42 73 NA
#> 4 Rob 1 30 43 NA
#> 5 John 6 60 NA 1
#> 6 Tom 8 61 112 2
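Since the question mentions full_join, here is a minimal sketch of that route, assuming the df1 and df2 definitions shown earlier: join on name, then coalesce each pair of suffixed columns. The helper names shared, joined and result are illustrative, and this is one possible arrangement rather than the approach of any answer above.

```r
library(dplyr)

df1 <- data.frame(
  name = c("Tim", "Dave", "Larry", "Rob"),
  age = c(7L, 5L, NA, 1L),
  weight = c(54L, 50L, 42L, 30L),
  height = c(112L, NA, 73L, 43L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("Tim", "Dave", "Larry", "Rob", "John", "Tom"),
  age = c(7L, NA, 3L, 1L, 6L, 8L),
  weight = c(NA, 50L, NA, 30L, 60L, 61L),
  height = c(112L, 103L, 73L, NA, NA, 112L),
  grade = c(2L, 1L, NA, NA, 1L, 2L),
  stringsAsFactors = FALSE
)

# Columns present in both frames, excluding the join key
shared <- setdiff(intersect(names(df1), names(df2)), "name")

# Shared non-key columns come back as <col>.x (from df2) and <col>.y (from df1)
joined <- full_join(df2, df1, by = "name", suffix = c(".x", ".y"))

# Take the df2 value when present, otherwise fall back to the df1 value
for (col in shared) {
  joined[[col]] <- coalesce(joined[[paste0(col, ".x")]],
                            joined[[paste0(col, ".y")]])
}

result <- joined[names(df2)]
```

Unlike rows_patch, this does not require every name in df1 to exist in df2; names found only in df1 would simply be appended as extra rows.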

How can I apply a custom function that adds new columns to a dataframe to a subset of existing columns?

I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1 with the data entered by one person, and column.2 where the same data was entered by a different person. I want to create a "master" column called simply column that first draws from column.1 and then, if column.1 is NA, draws from column.2.
Here is an example of what I am trying to do with made-up data:
mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
cheese.1 = c(1,4,NA,6,NA),
cheese.2 = c(1,4,3,5,NA),
milk.1 = c(NA,2,0,4,NA),
milk.2 = c(1,2,1,4,2),
tofu.1 = c("yum","yum",NA,"gross", NA),
tofu.2 = c("gross", "yum", "yum", NA, "gross"))
For example, the code below shows an example of what I want to do for a single pair of columns.
mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))
#OUTPUT:
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list) of the column pairs for which I want to create new "master" columns:
col.list = c("cheese","milk","tofu")
lapply(col.list, FUN = function(x) {
v <- as.name({{x}})
v.1 <- as.name(paste0({{x}}, ".1"))
v.2 <- as.name(paste0(({{x}}), ".2"))
mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})
#OUTPUT:
[[1]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
[[2]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 2
3 Carmen NA 3 0 1 <NA> yum 0
4 Dinah 6 5 4 4 gross <NA> 4
5 Billie NA NA NA 2 <NA> gross 2
[[3]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross yum
2 Ella 4 4 2 2 yum yum yum
3 Carmen NA 3 0 1 <NA> yum yum
4 Dinah 6 5 4 4 gross <NA> gross
5 Billie NA NA NA 2 <NA> gross gross
The problems with this attempt are:
the new columns are not correctly named (they should be named cheese, milk and tofu rather than all be called v)
the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in col.list).
(1) You have to wrap v into the curly-curly operator and use :=:
library(dplyr)
col.list <- c("cheese","milk","tofu")
lapply(col.list, FUN = function(x) {
v <- as.name({{x}})
v.1 <- as.name(paste0({{x}}, ".1"))
v.2 <- as.name(paste0(({{x}}), ".2"))
mydata %>% mutate({{ v }} := ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})
returns
[[1]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
[...]
which is one step closer to your desired output.
(2) But to get your desired output, I suggest using purrr:
library(purrr)
library(dplyr)
col.list %>%
map(~mydata %>%
select(name, starts_with(.x)) %>%
mutate({{ .x }} := ifelse(
is.na(!!sym(paste0(.x, ".1"))),
!!sym(paste0(.x, ".2")),
!!sym(paste0(.x, ".1"))
)
)
) %>%
reduce(left_join, by = "name")
This returns
name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2 tofu
1 Sarah 1 1 1 NA 1 1 yum gross yum
2 Ella 4 4 4 2 2 2 yum yum yum
3 Carmen NA 3 3 0 1 0 <NA> yum yum
4 Dinah 6 5 6 4 4 4 gross <NA> gross
5 Billie NA NA NA NA 2 2 <NA> gross gross
Here is a pretty simple and dynamic option. Since it uses tidyselect, if there are more than just two columns (eg cheese.1, cheese.2, and cheese.3) this will still work. This will also work if the column groups are unbalanced (eg 3 cheese columns, but only 2 milk columns):
library(purrr)
library(stringr)
library(rlang)
library(dplyr)
col.list <- c("cheese","milk","tofu")
express <- map(set_names(col.list), ~
str_glue("coalesce(!!!across(starts_with(\"{.x}\")))") %>%
parse_expr())
mydata %>%
mutate(!!! express, .keep = "unused")
Output
The other columns were removed by .keep = "unused". If you want to keep all the columns then delete that argument.
name cheese milk tofu
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
How it works
The use of map and set_names matters because together they create a named list of expressions, which the big-bang !!! operator needs later.
The use of across and coalesce allows the dynamic tidy-selection of columns.
The !!! operator force-splices the list of objects and the names for the columns are from the list names set up using map and set_names.
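The splicing step described above can be seen in isolation with a smaller sketch. It assumes exactly two columns per prefix (.1 and .2) and builds the calls with rlang::call2 instead of parsing strings; exprs_list and out are illustrative names, and this is a simplified variant, not the answer's exact code.

```r
library(dplyr)
library(rlang)

mydata <- data.frame(
  name = c("Sarah", "Ella", "Carmen", "Dinah", "Billie"),
  cheese.1 = c(1, 4, NA, 6, NA),
  cheese.2 = c(1, 4, 3, 5, NA),
  milk.1 = c(NA, 2, 0, 4, NA),
  milk.2 = c(1, 2, 1, 4, 2),
  tofu.1 = c("yum", "yum", NA, "gross", NA),
  tofu.2 = c("gross", "yum", "yum", NA, "gross"),
  stringsAsFactors = FALSE
)

col.list <- c("cheese", "milk", "tofu")

# Named list of unevaluated calls, e.g. cheese = dplyr::coalesce(cheese.1, cheese.2)
exprs_list <- lapply(set_names(col.list), function(prefix) {
  call2("coalesce",
        sym(paste0(prefix, ".1")),
        sym(paste0(prefix, ".2")),
        .ns = "dplyr")
})

# !!! splices the list into mutate(); the list names become the new column names
out <- mutate(mydata, !!!exprs_list)
```

Printing exprs_list before the mutate call is a convenient way to check what will be evaluated.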
Here is another way of doing this amid all the great answers you got:
library(dplyr)
library(purrr)
col.list %>%
reduce(~ .x %>%
bind_cols(mydata %>%
select(starts_with(.y)) %>%
mutate(!!gsub("(\\D+)\\.\\d+", "\\1", .y) := invoke(coalesce, cur_data()))),
.init = NULL)
cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2 tofu
1 1 1 1 NA 1 1 yum gross yum
2 4 4 4 2 2 2 yum yum yum
3 NA 3 3 0 1 0 <NA> yum yum
4 6 5 6 4 4 4 gross <NA> gross
5 NA NA NA NA 2 2 <NA> gross gross
Here is one way I would do it. First convert to long format, then reshape back to a wide format that has only two value columns, 1 and 2.
library(dplyr)
library(tidyr)
mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
cheese.1 = c(1,4,NA,6,NA),
cheese.2 = c(1,4,3,5,NA),
milk.1 = c(NA,2,0,4,NA),
milk.2 = c(1,2,1,4,2),
tofu.1 = c("yum","yum",NA,"gross", NA),
tofu.2 = c("gross", "yum", "yum", NA, "gross"))
mydata_long <- mydata %>%
mutate(across(where(is.numeric), as.character)) %>%
pivot_longer(-name,
names_to = c("food", "nr"),
names_sep = "\\.")
mydata_long
#> # A tibble: 30 x 4
#> name food nr value
#> <chr> <chr> <chr> <chr>
#> 1 Sarah cheese 1 1
#> 2 Sarah cheese 2 1
#> 3 Sarah milk 1 <NA>
#> 4 Sarah milk 2 1
#> 5 Sarah tofu 1 yum
#> 6 Sarah tofu 2 gross
#> 7 Ella cheese 1 4
#> 8 Ella cheese 2 4
#> 9 Ella milk 1 2
#> 10 Ella milk 2 2
#> # ... with 20 more rows
Apply ifelse() after transforming back to a different wide format:
mydata_wide <- mydata_long %>%
pivot_wider(names_from = nr,
values_from = value) %>%
mutate(final_val = ifelse(is.na(`1`), `2`, `1`)) %>%
arrange(food)
mydata_wide
#> # A tibble: 15 x 5
#> name food `1` `2` final_val
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Sarah cheese 1 1 1
#> 2 Ella cheese 4 4 4
#> 3 Carmen cheese <NA> 3 3
#> 4 Dinah cheese 6 5 6
#> 5 Billie cheese <NA> <NA> <NA>
#> 6 Sarah milk <NA> 1 1
#> 7 Ella milk 2 2 2
#> 8 Carmen milk 0 1 0
#> 9 Dinah milk 4 4 4
#> 10 Billie milk <NA> 2 2
#> 11 Sarah tofu yum gross yum
#> 12 Ella tofu yum yum yum
#> 13 Carmen tofu <NA> yum yum
#> 14 Dinah tofu gross <NA> gross
#> 15 Billie tofu <NA> gross gross
mydata_wide2 <- mydata_wide %>%
pivot_wider(id_cols = name,
names_from = food,
values_from = final_val)
mydata_wide2
#> # A tibble: 5 x 4
#> name cheese milk tofu
#> <chr> <chr> <chr> <chr>
#> 1 Sarah 1 1 yum
#> 2 Ella 4 2 yum
#> 3 Carmen 3 0 yum
#> 4 Dinah 6 4 gross
#> 5 Billie <NA> 2 gross
Created on 2021-10-29 by the reprex package (v2.0.1)
I would use purrr::map2_dfc and coalesce here. Looks pretty straightforward.
library(purrr)
library(dplyr)
library(stringr)
mydata %>% mutate(map2_dfc(select(., ends_with('1')),
select(., ends_with('2')),
~coalesce(.x, .y)))%>%
select(-ends_with('2'))%>%
rename_with(~str_remove(.x, '\\.\\d+$'))
name cheese milk tofu
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
Here is how you can achieve your task:
define your pairs (in case you have hundreds of columns, this could be automated)
use imap_dfc to apply coalesce to the defined pairs
bind to the original dataframe
library(dplyr)
library(purrr)
pairs <- list(cheese = c(2, 3), milk = c(4, 5), tofu = c(6, 7))
imap_dfc(pairs, ~mydata[, .x] %>% transmute(!!.y := coalesce(!!!syms(names(mydata)[.x])))) %>%
bind_cols(mydata)
cheese milk tofu name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2
1 1 1 yum Sarah 1 1 NA 1 yum gross
2 4 2 yum Ella 4 4 2 2 yum yum
3 3 0 yum Carmen NA 3 0 1 <NA> yum
4 6 4 gross Dinah 6 5 4 4 gross <NA>
5 NA 2 gross Billie NA NA NA 2 <NA> gross
Another tidyverse option. The advantage here is that it keeps the original data types and doesn't convert everything to character values.
library(tidyverse)
mydata %>%
pivot_longer(cols = -name,
names_pattern = '(.*)(\\..)',
names_to = c('.value', 'number')) %>%
group_by(name) %>%
mutate(across(-number, ~if_else(is.na(.[1]), .[2], .[1]))) %>%
ungroup() %>%
filter(number == '.1') %>%
select(-number)
Which gives
# A tibble: 5 x 4
name cheese milk tofu
<chr> <dbl> <dbl> <chr>
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
Alternative solution with coalesce:
mydata %>%
pivot_longer(cols = -name,
names_pattern = '(.*)(\\..)',
names_to = c('.value', 'number')) %>%
group_by(name) %>%
mutate(across(-number, ~coalesce(.[1], .[2]))) %>%
ungroup() %>%
filter(number == '.1') %>%
select(-number)
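For comparison, both problems listed in the question (wrong column names, columns not added to the original data frame) can also be handled with a plain loop. This is a minimal base R sketch, assuming every entry of col.list has exactly a .1 and a .2 column; first and second are illustrative names.

```r
mydata <- data.frame(
  name = c("Sarah", "Ella", "Carmen", "Dinah", "Billie"),
  cheese.1 = c(1, 4, NA, 6, NA),
  cheese.2 = c(1, 4, 3, 5, NA),
  milk.1 = c(NA, 2, 0, 4, NA),
  milk.2 = c(1, 2, 1, 4, 2),
  tofu.1 = c("yum", "yum", NA, "gross", NA),
  tofu.2 = c("gross", "yum", "yum", NA, "gross"),
  stringsAsFactors = FALSE
)

col.list <- c("cheese", "milk", "tofu")

for (col in col.list) {
  first <- mydata[[paste0(col, ".1")]]
  second <- mydata[[paste0(col, ".2")]]
  # Assigning with [[<- names the new column after the string in `col`
  # and adds it to mydata itself, so the results accumulate across iterations
  mydata[[col]] <- ifelse(is.na(first), second, first)
}
```

The original paired columns stay in place; dropping them afterwards is a separate step if only the master columns are wanted.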

My question is about R: How to number each repetition in a table in R?

In my data set, there is a column of full names (e.g. below) and I want to add another column next to it indicating whether a name has appeared one, two, three, four... times, using R. My output should look like the column below: Number of repetition.
Eg: Data set name: People
**Full name** **Number of repetition**
Peter 1
Peter 2
Alison
Warren
Jack 1
Jack 2
Jack 3
Jack 4
Susan 1
Susan 2
Henry 1
Walison
Tinder 1
Peter 3
Henry 2
Tinder 2
Thanks
Teena
Here is an alternative way solved with help from akrun: sum() condition in ifelse statement
library(dplyr)
df1 %>%
group_by(Fullname) %>%
mutate(newcol = row_number(),
newcol = if(sum(newcol)> 1) newcol else NA) %>%
ungroup
Fullname newcol
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
Here is one way. Do a group by 'Fullname', and create the sequence with row_number() if the number of rows is greater than 1. By default, case_when returns the other case as NA
library(dplyr)
df1 <- df1 %>%
group_by(Fullname) %>%
mutate(Number_of_repetition = case_when(n() > 1 ~ row_number())) %>%
ungroup
-output
df1
# A tibble: 16 × 2
Fullname Number_of_repetition
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
If we need to add a third column, use unite on the updated data from previous step
library(tidyr)
df1 %>%
unite(FullNameRep, Fullname, Number_of_repetition, sep="", na.rm = TRUE, remove = FALSE)
-output
# A tibble: 16 × 3
FullNameRep Fullname Number_of_repetition
<chr> <chr> <int>
1 Peter1 Peter 1
2 Peter2 Peter 2
3 Alison Alison NA
4 Warren Warren NA
5 Jack1 Jack 1
6 Jack2 Jack 2
7 Jack3 Jack 3
8 Jack4 Jack 4
9 Susan1 Susan 1
10 Susan2 Susan 2
11 Henry1 Henry 1
12 Walison Walison NA
13 Tinder1 Tinder 1
14 Peter3 Peter 3
15 Henry2 Henry 2
16 Tinder2 Tinder 2
data
df1 <- structure(list(Fullname = c("Peter", "Peter", "Alison", "Warren",
"Jack", "Jack", "Jack", "Jack", "Susan", "Susan", "Henry", "Walison",
"Tinder", "Peter", "Henry", "Tinder")), row.names = c(NA, -16L
), class = "data.frame")
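The grouped row_number() logic above can also be written without dplyr. A minimal base R sketch, assuming the df1 from the data block above and using ave() for the per-name computations; counts and sizes are illustrative names.

```r
df1 <- data.frame(
  Fullname = c("Peter", "Peter", "Alison", "Warren", "Jack", "Jack", "Jack",
               "Jack", "Susan", "Susan", "Henry", "Walison", "Tinder",
               "Peter", "Henry", "Tinder"),
  stringsAsFactors = FALSE
)

idx <- seq_along(df1$Fullname)

# Running count of each name, in order of appearance
counts <- ave(idx, df1$Fullname, FUN = seq_along)

# Total number of rows per name, repeated on every row of that name
sizes <- ave(idx, df1$Fullname, FUN = length)

# Keep the running count only for names that occur more than once
df1$Number_of_repetition <- ifelse(sizes > 1, counts, NA_integer_)
```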

Randomly assigning columns to other columns in R

I have a column of student names and a column containing the group number for each of those students. How could I randomly assign each student to be a judge of another group's work? Could anyone let me know how to build a function to solve that? Students cannot be a judge of their own group.
Bob Ross 1
Kanye West 1
Chris Evans 1
Robert Jr 1
Bruce Wayne 2
Peter Parker 2
Steven Strange 2
Danny rand 2
Daniel Fisher 2
Rob Son 3
Son Bob 3
Chun Li 3
Ching Do 3
Ping Pong 3
Michael Jackson 4
Rich Brian 4
Ryan Gosling 4
Nathan Nguyen 4
Justin Bieber 4
Here's one way, using tidyverse methods. Basically this says for each value (map_int) in group, take a sample from the groups that aren't the current one.
library(tidyverse)
df <- structure(list(name = c("Kanye West", "Chris Evans", "Robert Jr", "Bruce Wayne", "Peter Parker", "Steven Strange", "Danny rand", "Daniel Fisher", "Rob Son", "Son Bob", "Chun Li", "Ching Do", "Ping Pong", "Michael Jackson", "Rich Brian", "Ryan Gosling", "Nathan Nguyen", "Justin Bieber"), group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L))
df %>%
mutate(
to_judge = map_int(
.x = group,
.f = ~ sample(
x = unique(group)[unique(group) != .x],
size = 1
)
)
)
#> # A tibble: 18 x 3
#> name group to_judge
#> <chr> <int> <int>
#> 1 Kanye West 1 4
#> 2 Chris Evans 1 2
#> 3 Robert Jr 1 3
#> 4 Bruce Wayne 2 1
#> 5 Peter Parker 2 3
#> 6 Steven Strange 2 3
#> 7 Danny rand 2 4
#> 8 Daniel Fisher 2 1
#> 9 Rob Son 3 1
#> 10 Son Bob 3 2
#> 11 Chun Li 3 4
#> 12 Ching Do 3 4
#> 13 Ping Pong 3 4
#> 14 Michael Jackson 4 2
#> 15 Rich Brian 4 3
#> 16 Ryan Gosling 4 1
#> 17 Nathan Nguyen 4 2
#> 18 Justin Bieber 4 1
Created on 2018-09-20 by the reprex package (v0.2.0).
Another option with tidyverse would be to group_by the group column, define the sample vector with setdiff and draw a sample of the size of the group:
df <- data.frame(Student = LETTERS[1:20],
Group = gl(4, 5))
library(tidyverse)
df %>%
group_by(Group) %>%
mutate(Judge = sample(setdiff(unique(df$Group), Group), n(), replace = TRUE))
# A tibble: 20 x 3
# Groups: Group [4]
Student Group Judge
<fct> <fct> <chr>
1 A 1 4
2 B 1 2
3 C 1 3
4 D 1 3
5 E 1 4
6 F 2 4
7 G 2 4
8 H 2 1
9 I 2 1
10 J 2 4
11 K 3 4
12 L 3 2
13 M 3 1
14 N 3 2
15 O 3 2
16 P 4 2
17 Q 4 1
18 R 4 2
19 S 4 1
20 T 4 3
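One detail worth guarding against: when only one other group is available (for example with two groups in total) and the group labels are numeric, sample(x, 1) treats a single number x as 1:x rather than as the only candidate. A minimal sketch of a helper that avoids this by sampling an index instead, assuming group is an integer or character vector; assign_judges is a hypothetical name, not a function from any package.

```r
assign_judges <- function(group) {
  groups <- unique(group)
  stopifnot(length(groups) >= 2)  # someone else's group must exist
  vapply(group, function(g) {
    candidates <- groups[groups != g]
    # Sample a position rather than the values, so a single numeric
    # candidate is not expanded to 1:candidate by sample()
    candidates[sample.int(length(candidates), 1)]
  }, FUN.VALUE = groups[1], USE.NAMES = FALSE)
}

set.seed(42)  # only to make the example reproducible
group <- c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L)
judges <- assign_judges(group)
```

Like the answers above, this draws each judge assignment independently, so it does not balance how many judges each group receives.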

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected, because the conversion to numeric (as.numeric(as.character(df$likes))) hits non-numeric strings.
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
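If the rows are not guaranteed to be ordered, the grouping idea from the earlier answers can also be expressed in base R with ave(). A minimal sketch, assuming the df from the question and that each participant/day group contains at least one numeric entry in likes; numeric_likes is an illustrative name.

```r
df <- data.frame(
  participant = c(rep("John", 6), rep("Mary", 6)),
  day = c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3)),
  question = rep(c(1, 1, 0), 4),
  likes = c("apples", "apples", "18", "apples", "apples", "7",
            "bananas", "bananas", "24", "bananas", "bananas", "3"),
  stringsAsFactors = FALSE
)

# Non-numeric entries such as "apples" become NA; the coercion warning is expected
numeric_likes <- suppressWarnings(as.numeric(as.character(df$likes)))

# Within each participant/day group, spread the first non-missing number to all rows
df$number <- ave(numeric_likes, df$participant, df$day,
                 FUN = function(v) v[!is.na(v)][1])
```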
