How to number each repetition in a table in R?

In my data set, there is a column of full names (example below), and I want to add another column next to it stating whether a name has appeared one, two, three, four... times, using R. My output should look like the column below: Number of repetition.
Eg: Data set name: People
**Full name** **Number of repetition**
Peter 1
Peter 2
Alison
Warren
Jack 1
Jack 2
Jack 3
Jack 4
Susan 1
Susan 2
Henry 1
Walison
Tinder 1
Peter 3
Henry 2
Tinder 2
Thanks
Teena

Here is an alternative way, solved with help from akrun (see: sum() condition in ifelse statement):
library(dplyr)
df1 %>%
group_by(Fullname) %>%
mutate(newcol = row_number(),
newcol = if(sum(newcol)> 1) newcol else NA) %>%
ungroup
Fullname newcol
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2

Here is one way. Do a group by 'Fullname', and create the sequence with row_number() if the number of rows is greater than 1. By default, case_when returns the other case as NA
library(dplyr)
df1 <- df1 %>%
group_by(Fullname) %>%
mutate(Number_of_repetition = case_when(n() > 1 ~ row_number())) %>%
ungroup
-output
df1
# A tibble: 16 × 2
Fullname Number_of_repetition
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
If we need to add a third column, use unite on the updated data from previous step
library(tidyr)
df1 %>%
unite(FullNameRep, Fullname, Number_of_repetition, sep="", na.rm = TRUE, remove = FALSE)
-output
# A tibble: 16 × 3
FullNameRep Fullname Number_of_repetition
<chr> <chr> <int>
1 Peter1 Peter 1
2 Peter2 Peter 2
3 Alison Alison NA
4 Warren Warren NA
5 Jack1 Jack 1
6 Jack2 Jack 2
7 Jack3 Jack 3
8 Jack4 Jack 4
9 Susan1 Susan 1
10 Susan2 Susan 2
11 Henry1 Henry 1
12 Walison Walison NA
13 Tinder1 Tinder 1
14 Peter3 Peter 3
15 Henry2 Henry 2
16 Tinder2 Tinder 2
data
df1 <- structure(list(Fullname = c("Peter", "Peter", "Alison", "Warren",
"Jack", "Jack", "Jack", "Jack", "Susan", "Susan", "Henry", "Walison",
"Tinder", "Peter", "Henry", "Tinder")), row.names = c(NA, -16L
), class = "data.frame")
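The group-then-number idea described above can also be sketched in base R, without dplyr. This is only an illustrative alternative, not part of the original answer: it assumes the same df1 data, and the helper names running and total are made up for the example. ave() applies a function within each group and returns a vector aligned with the original rows.

```r
# Example data copied from the answer above
df1 <- data.frame(Fullname = c("Peter", "Peter", "Alison", "Warren",
  "Jack", "Jack", "Jack", "Jack", "Susan", "Susan", "Henry", "Walison",
  "Tinder", "Peter", "Henry", "Tinder"))

idx <- seq_along(df1$Fullname)

# Running count of each name, in order of appearance
running <- ave(idx, df1$Fullname, FUN = seq_along)

# Total number of rows per name, repeated on every row of that name
total <- ave(idx, df1$Fullname, FUN = length)

# Keep the running count only for names that occur more than once
df1$Number_of_repetition <- ifelse(total > 1, running, NA_integer_)

df1
```

Names that occur only once (Alison, Warren, Walison) get NA, matching the dplyr output shown earlier.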

Related

reorder/standardize and create rows in R

I am new to R and I have been looking for a solution for an existing dataframe I have been given. I have a set of variables, each of which contains some other subcategories. Assume it looks something like this:
Michael Physics 1 2
Michael Math 2 4
Michael Science 3 4
Michael PE 2 1
James Art 0 9
James PE 1 2
James Physics -1 2
James Science 1 2
Simon PE 1 2
Simon Art 1 3
Simon Music 1 4
Simon Science 1 4
Notably, the second column has a "standard" set of variables, so that each student shares most but not necessarily all of the variables, and the ordering of these variables is scrambled. My issue is that I want to convert this dataframe to a "standard format". That is, I want each of the students to have ALL of the variables, in the same order. So if I define a list of all the subjects, say Physics, Math, Science, Art, PE, Music, I would like there to be 18 rows in my modified dataframe (6 for each student, with the ordering defined for the subject). If the student and subject are contained in the original dataset, the row should have the data from that row, and if the student and subject don't exist in the original dataframe, then the other data columns would just be NA.
Update on OP's comment:
To keep the original order you could factor Student and define level:
df <- df %>%
mutate(Student = factor(Student, levels = c("Michael", "James", "Simon")))
df1 <- df %>%
expand(Student, Course)
df %>%
right_join(df1) %>%
arrange(Student, Course)
Output:
Student Course V1 V2
<fct> <chr> <dbl> <dbl>
1 Michael Art NA NA
2 Michael Math 2 4
3 Michael Music NA NA
4 Michael PE 2 1
5 Michael Physics 1 2
6 Michael Science 3 4
7 James Art 0 9
8 James Math NA NA
9 James Music NA NA
10 James PE 1 2
11 James Physics -1 2
12 James Science 1 2
13 Simon Art 1 3
14 Simon Math NA NA
15 Simon Music 1 4
16 Simon PE 1 2
17 Simon Physics NA NA
18 Simon Science 1 4
We could combine expand and right_join
library(dplyr)
library(tidyr)
df1 <- df %>%
expand(Student, Course)
df %>%
right_join(df1) %>%
arrange(Student, Course)
Output:
Student Course V1 V2
<chr> <chr> <dbl> <dbl>
1 James Art 0 9
2 James Math NA NA
3 James Music NA NA
4 James PE 1 2
5 James Physics -1 2
6 James Science 1 2
7 Michael Art NA NA
8 Michael Math 2 4
9 Michael Music NA NA
10 Michael PE 2 1
11 Michael Physics 1 2
12 Michael Science 3 4
13 Simon Art 1 3
14 Simon Math NA NA
15 Simon Music 1 4
16 Simon PE 1 2
17 Simon Physics NA NA
18 Simon Science 1 4
In the below, we repeatedly use pivot_ to get the desired result. The output is sorted by student name and subject.
library(tidyverse)
df <- read_delim("Michael Physics 1 2
Michael Math 2 4
Michael Science 3 4
Michael PE 2 1
James Art 0 9
James PE 1 2
James Physics -1 2
James Science 1 2
Simon PE 1 2
Simon Art 1 3
Simon Music 1 4
Simon Science 1 4", delim = " ", col_names = c("student", "subject", "v1", "v2"))
df %>%
pivot_wider(names_from = "subject", values_from = c("v1", "v2")) %>%
pivot_longer(cols = starts_with("v"), names_to = "name", values_to = "value") %>%
separate(name, into = c("var", "subject"), sep = "_") %>%
pivot_wider(names_from = var, values_from = value) %>%
arrange(student, subject)
#> # A tibble: 18 x 4
#> student subject v1 v2
#> <chr> <chr> <dbl> <dbl>
#> 1 James Art 0 9
#> 2 James Math NA NA
#> 3 James Music NA NA
#> 4 James PE 1 2
#> 5 James Physics -1 2
#> 6 James Science 1 2
#> 7 Michael Art NA NA
#> 8 Michael Math 2 4
#> 9 Michael Music NA NA
#> 10 Michael PE 2 1
#> 11 Michael Physics 1 2
#> 12 Michael Science 3 4
#> 13 Simon Art 1 3
#> 14 Simon Math NA NA
#> 15 Simon Music 1 4
#> 16 Simon PE 1 2
#> 17 Simon Physics NA NA
#> 18 Simon Science 1 4
Created on 2021-07-18 by the reprex package (v2.0.0)
You can use complete. To preserve the original ordering of the data you can save the name of the students in a variable and use match and arrange.
library(dplyr)
library(tidyr)
original_order <- unique(df$V1)
df %>% complete(V1, V2) %>% arrange(match(V1, original_order))
# V1 V2 V3 V4
# <chr> <chr> <int> <int>
# 1 Michael Art NA NA
# 2 Michael Math 2 4
# 3 Michael Music NA NA
# 4 Michael PE 2 1
# 5 Michael Physics 1 2
# 6 Michael Science 3 4
# 7 James Art 0 9
# 8 James Math NA NA
# 9 James Music NA NA
#10 James PE 1 2
#11 James Physics -1 2
#12 James Science 1 2
#13 Simon Art 1 3
#14 Simon Math NA NA
#15 Simon Music 1 4
#16 Simon PE 1 2
#17 Simon Physics NA NA
#18 Simon Science 1 4
data
df <- structure(list(V1 = c("Michael", "Michael", "Michael", "Michael",
"James", "James", "James", "James", "Simon", "Simon", "Simon",
"Simon"), V2 = c("Physics", "Math", "Science", "PE", "Art", "PE",
"Physics", "Science", "PE", "Art", "Music", "Science"), V3 = c(1L,
2L, 3L, 2L, 0L, 1L, -1L, 1L, 1L, 1L, 1L, 1L), V4 = c(2L, 4L,
4L, 1L, 9L, 2L, 2L, 2L, 2L, 3L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -12L))
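The answers above sort the subjects alphabetically, while the question asks for a user-defined subject order. A minimal base R sketch of that variant follows, assuming the V1..V4 column names from the data block above; the names subjects, students, grid and full are illustrative only. It builds every student/subject combination with expand.grid(), left-joins the original rows with merge(), and orders by position in the two reference vectors.

```r
# Example data, same values as the data block above
df <- data.frame(
  V1 = rep(c("Michael", "James", "Simon"), each = 4),
  V2 = c("Physics", "Math", "Science", "PE", "Art", "PE",
         "Physics", "Science", "PE", "Art", "Music", "Science"),
  V3 = c(1L, 2L, 3L, 2L, 0L, 1L, -1L, 1L, 1L, 1L, 1L, 1L),
  V4 = c(2L, 4L, 4L, 1L, 9L, 2L, 2L, 2L, 2L, 3L, 4L, 4L)
)

# Desired subject order (taken from the question) and original student order
subjects <- c("Physics", "Math", "Science", "Art", "PE", "Music")
students <- unique(df$V1)

# All student/subject combinations
grid <- expand.grid(V2 = subjects, V1 = students, stringsAsFactors = FALSE)

# Left join: combinations missing from df get NA in V3 and V4
full <- merge(grid, df, by = c("V1", "V2"), all.x = TRUE)

# Order by position in the reference vectors rather than alphabetically
full <- full[order(match(full$V1, students), match(full$V2, subjects)), ]
rownames(full) <- NULL

full
```

Using match() against a reference vector for both keys avoids converting the columns to factors, which may or may not be preferable depending on what happens to the data afterwards.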

How can I create a Variable for experience in R? [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I have a dataset that has observations for different case files. And I would like to create a variable that indicates the number of cases that have been dealt with of that kind before a specific case is looked into.
Here is a test code and dataset to specify what I am asking.
df <- data.frame( ID= c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
name = c("Jon", "Jon", "Maria","Jon", "Jon", "Maria","Jon", "Jon", "Maria","Prince", "Jon", "Maria","Prince", "Jon", "Maria","Prince"),
date = c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25", "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13", "2007-08-22",
"2007-05-10", "2007-04-18", "2007-07-09","2007-06-10", "2008-02-13","2007-09-22", "2007-05-15"))
I would like to group the observations into categories and for each observation check the date and give a count of the number of observations in that category before the stated observation.
df$date <- as.Date(df$date, '%Y-%m-%d')
df$exp = NA
for(i in 1:nrow(df)){
temp = df %>% filter(!is.na(date))
temp = temp %>% filter(name == name[i])
df$exp[i]= nrow( filter(temp,date[i]>date))
}
I tried running the code above, but it doesn't give the results I am looking for. It gives me the following results:
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 5
4 7 Jon 2007-03-22 4
5 11 Jon 2007-04-18 0
6 5 Jon 2007-04-22 3
7 8 Jon 2007-07-13 7
8 14 Jon 2008-02-13 0
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 3
11 12 Maria 2007-07-09 0
12 9 Maria 2007-08-22 0
13 15 Maria 2007-09-22 0
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 0
16 13 Prince 2007-06-10 0
instead of
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
How can I efficiently get this done?
You can sort by name and date, make groups by name and use the row_number to get the result
library(tidyverse)
df %>%
arrange(name, as.Date(date)) %>%
group_by(name) %>%
mutate(n = row_number() - 1)
# A tibble: 16 x 4
# Groups: name [3]
ID name date n
<dbl> <chr> <chr> <dbl>
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
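The answer above reorders the rows. If the original row order has to be kept, one possible base R sketch is to rank the dates within each name instead of sorting. This is an assumption about what may be wanted, not part of the original answer. With ties.method = "min", rank() minus 1 equals the number of strictly earlier dates in the same group.

```r
# Example data from the question
df <- data.frame(
  ID = 1:16,
  name = c("Jon", "Jon", "Maria", "Jon", "Jon", "Maria", "Jon", "Jon",
           "Maria", "Prince", "Jon", "Maria", "Prince", "Jon", "Maria", "Prince"),
  date = c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25",
           "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13",
           "2007-08-22", "2007-05-10", "2007-04-18", "2007-07-09",
           "2007-06-10", "2008-02-13", "2007-09-22", "2007-05-15")
)

df$date <- as.Date(df$date)

# For each row: how many cases with the same name have an earlier date
df$exp <- ave(as.numeric(df$date), df$name,
              FUN = function(d) rank(d, ties.method = "min") - 1)

df
```

Rows sharing the same name and date would receive the same count here, which differs from row_number(), where ties are broken by position.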

How can I create a data.frame in R of how many times players start together in games

I am just starting looking at network analysis and wanted to begin by creating a data.frame of how often basketball players on a team have started together
Ideally, I would like to incorporate map functions from purrr
So with this as input
game_1 <- c("Andy","Bob","Chris","Doug","Evan")
game_2 <- c("Andy","Chris","Evan","Fred","George")
I would want a result like this
n_1 n_2 games
Andy Bob 1
Andy Chris 2
Andy Doug 1
Andy Evan 2
Andy Fred 1
Andy George 1
Bob Chris 1
Bob Doug 1
Bob Evan 1
Chris Doug 1
Chris Evan 2
Chris Fred 1
Chris George 1
Doug Evan 1
Evan Fred 1
Evan George 1
Fred George 1
My solution does not use purrr, but it should work
game_1 <- c("Andy","Bob","Chris","Doug","Evan")
game_2 <- c("Andy","Chris","Evan","Fred","George")
# Combine all games into a single list for use with lapply
all_games <- list(game_1, game_2)
library(dplyr)
# Find combinations, sorted to ensure the earlier alphabets are in the first column
df <- do.call(rbind, lapply(all_games, function(x) { data.frame(t(combn(sort(x), 2))) }))
# Calculate the number of instances where 2 players appear with each other
df %>% group_by(X1, X2) %>% summarise(count = n())
# A tibble: 17 x 3
# Groups: X1 [?]
# X1 X2 count
# <fctr> <fctr> <int>
# 1 Andy Bob 1
# 2 Andy Chris 2
# 3 Andy Doug 1
# 4 Andy Evan 2
# 5 Andy Fred 1
# 6 Andy George 1
# 7 Bob Chris 1
# 8 Bob Doug 1
# 9 Bob Evan 1
# 10 Chris Doug 1
# 11 Chris Evan 2
# 12 Chris Fred 1
# 13 Chris George 1
# 14 Doug Evan 1
# 15 Evan Fred 1
# 16 Evan George 1
# 17 Fred George 1
library(dplyr)
# get combinations from game_1
g1 <- combn(game_1, 2) %>% t
# get combinations from game_2
g2 <- combn(game_2, 2) %>% t
# bind both in a dataframe and count pairs
g1 %>%
rbind.data.frame(g2) %>%
group_by(V1, V2) %>%
summarise(games = n())
# A tibble: 17 x 3
# Groups: V1 [?]
V1 V2 games
<fctr> <fctr> <int>
1 Andy Bob 1
2 Andy Chris 2
3 Andy Doug 1
4 Andy Evan 2
5 Andy Fred 1
6 Andy George 1
7 Bob Chris 1
8 Bob Doug 1
9 Bob Evan 1
10 Chris Doug 1
11 Chris Evan 2
12 Chris Fred 1
13 Chris George 1
14 Doug Evan 1
15 Evan Fred 1
16 Evan George 1
17 Fred George 1
Building on whalea's answer:
game_1 <- c("Andy","Bob","Chris","Doug","Evan")
game_2 <- c("Andy","Chris","Evan","Fred","George")
all_games <- list(game_1, game_2)
library(dplyr)
df <- do.call(rbind, lapply(all_games, function(x) { expand.grid(x, x) %>% filter(Var1 != Var2) })) %>% apply(1,sort) %>% t %>% data.frame
df %>% group_by(X1, X2) %>% summarise(count = n()/2)
result:
1 Andy Bob 1.
2 Andy Chris 2.
3 Andy Doug 1.
4 Andy Evan 2.
5 Andy Fred 1.
6 Andy George 1.
7 Bob Chris 1.
8 Bob Doug 1.
9 Bob Evan 1.
10 Chris Doug 1.
11 Chris Evan 2.
12 Chris Fred 1.
13 Chris George 1.
14 Doug Evan 1.
15 Evan Fred 1.
16 Evan George 1.
17 Fred George 1.
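Since the question asks for purrr, here is a hedged sketch of the same combn() idea written with purrr::map(). It assumes purrr and a reasonably recent dplyr are installed; the column names n_1, n_2 and games follow the desired output in the question, and the object names pairs and result are illustrative.

```r
library(purrr)
library(dplyr)

game_1 <- c("Andy", "Bob", "Chris", "Doug", "Evan")
game_2 <- c("Andy", "Chris", "Evan", "Fred", "George")
all_games <- list(game_1, game_2)

# One data frame of sorted player pairs per game
pairs <- map(all_games, function(players) {
  m <- t(combn(sort(players), 2))   # each row is one pair, alphabetical within the pair
  data.frame(n_1 = m[, 1], n_2 = m[, 2], stringsAsFactors = FALSE)
})

# Stack the games and count how often each pair occurs
result <- bind_rows(pairs) %>%
  count(n_1, n_2, name = "games")

result
```

Sorting inside each game keeps a pair such as Chris/Andy from being counted separately from Andy/Chris.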

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected: it comes from converting the non-numeric strings to numeric in as.numeric(as.character(df$likes)).
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
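A base R variant that does not depend on row order within a group can be sketched with ave(). This is an illustration under the assumption that each participant/day group contains exactly one numeric entry in likes; the intermediate name num is made up for the example.

```r
# Example data from the question
participant <- c(rep("John", 6), rep("Mary", 6))
day <- c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3))
likes <- c("apples", "apples", "18", "apples", "apples", "7",
           "bananas", "bananas", "24", "bananas", "bananas", "3")
question <- rep(c(1, 1, 0), 4)
df <- data.frame(participant, day, question, likes)

# Numeric value where likes holds a number, NA where it holds a fruit
num <- suppressWarnings(as.numeric(as.character(df$likes)))

# Spread the single non-NA value to every row of its participant/day group
df$number <- ave(num, df$participant, df$day,
                 FUN = function(x) x[!is.na(x)][1])

df
```

Unlike the na.locf() approach, this does not require the numeric row to come last in each group.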

remove individuals based on their range of values

I have a df with two variables, one with IDs and one with a variable called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding if the person has this indicator. However, there must be a simpler more elegant way to do this?
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add binary indicator
Data$all<-ifelse(Data$numbers==1, 1,0)
Data$allperson<-ave(Data$all, Data$names, FUN=cumsum)
Step 2 - get rid of people who do not have 1 as their start number
Data[!Data$allperson==0,]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
group_by(names) %>%
filter(min(numbers) != 1)
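It means just what it appears to mean: keep only the records of groups (defined by names) whose minimum numbers value is not equal to 1, which is the output shown below. To keep the people whose sequence does start at 1, as the question asks, flip the condition to min(numbers) == 1.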
names numbers
1 mary 4
2 mary 5
3 mary 6
4 mary 7
5 mary 8
6 mary 9
7 mary 10
8 mary 11
9 mary 12
10 sue 2
11 sue 3
You may also try:
# Data is the data frame read in the question (zz there is only the raw text)
zz1 <- Data[with(Data, names %in% unique(names)[!!table(Data)[,1]]),]
head(zz1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
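For completeness, a one-step base R sketch that keeps only the people whose first recorded number is 1, with no helper columns. It is an illustration rather than part of the answers above: the data frame is rebuilt compactly with the same values as the question's sample, and the result name kept is made up.

```r
# Same values as the sample data in the question, built compactly
Data <- data.frame(
  names = rep(c("john", "mary", "pat", "sue", "tom"), times = c(8, 9, 10, 8, 7)),
  numbers = c(1:8, 4:12, 1:10, 2:9, 5:11)
)

# First value of numbers within each person, repeated on every row of that person
first_number <- ave(Data$numbers, Data$names, FUN = function(x) x[1])

# Keep only people whose sequence starts with 1
kept <- Data[first_number == 1, ]

unique(kept$names)
```

This tests the first value in row order; if the rows might not be sorted within a person, min(x) in place of x[1] would test the smallest value instead.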
