Group columns in Rmarkdown - r

I am trying to produce a grouped table, something like this. Only can not figure out how it works if my groups are months which are consisted of weather variables, while rows are years.
This is how the data looks like:
year_var month temp rain rhum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2007 4 11 16.4 73.5
2 2007 5 11.6 38.3 74.3
3 2007 6 14 108. 83.9
4 2007 7 14.4 129. 83.6
5 2007 8 14.9 104. 83.2
6 2007 9 13.5 36.8 82.7
7 2008 4 7.8 31.8 76.4
8 2008 5 12.7 37.1 76.6
9 2008 6 13.3 86.8 77.7
10 2008 7 15.2 137. 80.4
11 2008 8 15.3 142. 83.1
12 2008 9 12.4 81.2 84.9
Is there a way to turn the month column into grouping variable?
A sample of my data:
structure(list(year_var = c(2007, 2007, 2007, 2007, 2007, 2007,
2008, 2008, 2008, 2008, 2008, 2008), month = c(4, 5, 6, 7, 8,
9, 4, 5, 6, 7, 8, 9), temp = c(11, 11.6, 14, 14.4, 14.9, 13.5,
7.8, 12.7, 13.3, 15.2, 15.3, 12.4), rain = c(16.4, 38.3, 107.7,
129.3, 103.8, 36.8, 31.8, 37.1, 86.8, 136.7, 142.4, 81.2), rhum = c(73.5,
74.3, 83.9, 83.6, 83.2, 82.7, 76.4, 76.6, 77.7, 80.4, 83.1, 84.9
)), row.names = c(NA, -12L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = "year_var", drop = TRUE, indices = list(
0:5, 6:11), group_sizes = c(6L, 6L), biggest_group_size = 6L, labels = structure(list(
year_var = c(2007, 2008)), row.names = c(NA, -2L), class = "data.frame", vars = "year_var", drop = TRUE))

Related

Calculate a Weighted Rolling Average by rows by group in r?

I have a dataframe games_h. This is just a snippet of the table but it has many teams and is sorted by date, team, game number. I am trying to create a weighted rolling average grouped by the team. I would like the most recent game to be weighted more than two games ago. So the weights would be (Game_1 * 1+ Game_2 *2)/3 or weights equal to 1 with same ratio so weights = c(1-.667, .667).
dput(games_h)
structure(list(GameId = c(16, 16, 37, 37, 57, 57), GameDate = structure(c(17905,
17905, 17916, 17916, 17926, 17926), class = "Date"), NeutralSite = c(0,
0, 0, 0, 0, 0), AwayTeam = c("Virginia Cavaliers", "Virginia Cavaliers",
"Florida State Seminoles", "Florida State Seminoles", "Syracuse Orange",
"Syracuse Orange"), HomeTeam = c("Boston College Eagles", "Boston College Eagles",
"Boston College Eagles", "Boston College Eagles", "Boston College Eagles",
"Boston College Eagles"), Team = c("Virginia Cavaliers", "Boston College Eagles",
"Florida State Seminoles", "Boston College Eagles", "Syracuse Orange",
"Boston College Eagles"), Home = c(0, 1, 0, 1, 0, 1), Score = c(83,
56, 82, 87, 77, 71), AST = c(17, 6, 12, 16, 11, 13), TOV = c(10,
8, 9, 13, 11, 11), STL = c(5, 4, 4, 6, 6, 5), BLK = c(6, 0, 4,
4, 1, 0), Rebounds = c(38, 18, 36, 33, 23, 23), ORB = c(7, 4,
16, 10, 7, 6), DRB = c(31, 14, 20, 23, 16, 17), FGA = c(55, 57,
67, 55, 52, 45), FGM = c(33, 22, 28, 27, 29, 21), X3FGM = c(8,
7, 8, 13, 11, 9), X3FGA = c(19, 25, 25, 21, 26, 22), FTA = c(14,
9, 24, 28, 15, 23), FTM = c(9, 5, 18, 20, 8, 20), Fouls = c(16,
12, 25, 20, 19, 19), Game_Number = 1:6, Count = c(1, 1, 1, 1,
1, 1)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), groups = structure(list(HomeTeam = "Boston College Eagles",
.rows = structure(list(1:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))
Here is an example output of the score column.
Here is my failed attempt. The function work correctly but I cannot apply it to the entire dataframe by group.
weighted_avg<-function(x, wt1, wt2) {
rs1 = rollsum(x,1,align = "right")
rs2 = rollsum(x,2,align = "right")
rs1=rs1[-1]
rs3 = rs2 - rs1
weighted_avg= ((rs3 * wt2)+ (rs1*wt1))/(wt1+wt2)
return(weighted_avg)
}
weighted_avg(csum$Score_Y, 2, 1)
apply(csum$Score_Y , 2, weighted_avg, wt1 = 2, wt2=1)
test<-csum %>%
group_by(Team)%>%
group_map(across(c(Score:Fouls), weighted_avg(.x$Team, 2, 1) ))
test<-csum %>%
group_by(Team)%>%
group_walk(across(c(Score:Fouls),weighted_avg(.~,2,1) ))
Here are some notes about the code:
I used slider::slide_dbl function. First we specify the vector for which we would like to compute the moving average Score.
As we need a sliding window of length 2, I used .before argument in slide_dbl to use the previous value and a current value to be used for calculating moving average.
Also I set .complete argument to TRUE to makes sure to only calculate moving average when we have a previous value. In other word we don't have any moveing average in first row.
For more info check the documentation for slider package.
library(tidyverse)
library(slider)
df %>%
group_by(HomeTeam) %>%
summarise(Example = c(NA, slide_dbl(Score, .before = 1, .complete = TRUE,
.f = ~ (.x[1] * 1 + .x[2] * 2) / 3)))
`summarise()` has grouped output by 'HomeTeam'. You can override using the `.groups` argument.
# A tibble: 7 × 2
# Groups: HomeTeam [1]
HomeTeam Example
<chr> <dbl>
1 Boston College Eagles NA
2 Boston College Eagles NA
3 Boston College Eagles 65
4 Boston College Eagles 73.3
5 Boston College Eagles 85.3
6 Boston College Eagles 80.3
7 Boston College Eagles 73
If it is going to calculate moving average for all numeric columns you could try:
df %>%
group_by(HomeTeam) %>%
summarise(across(where(is.numeric), ~ c(NA, slide_dbl(., .before = 1, .complete = TRUE,
.f = ~ (.x[1] * 1 + .x[2] * 2) / 3)))) %>%
ungroup()
`summarise()` has grouped output by 'HomeTeam'. You can override using the `.groups` argument.
# A tibble: 7 × 21
HomeTeam GameId NeutralSite Home Score AST TOV STL BLK Rebounds ORB DRB FGA FGM
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Boston C… NA NA NA NA NA NA NA NA NA NA NA NA NA
2 Boston C… NA NA NA NA NA NA NA NA NA NA NA NA NA
3 Boston C… 16 0 0.667 65 9.67 8.67 4.33 2 24.7 5 19.7 56.3 25.7
4 Boston C… 30 0 0.333 73.3 10 8.67 4 2.67 30 12 18 63.7 26
5 Boston C… 37 0 0.667 85.3 14.7 11.7 5.33 4 34 12 22 59 27.3
6 Boston C… 50.3 0 0.333 80.3 12.7 11.7 6 2 26.3 8 18.3 53 28.3
7 Boston C… 57 0 0.667 73 12.3 11 5.33 0.333 23 6.33 16.7 47.3 23.7
# … with 7 more variables: X3FGM <dbl>, X3FGA <dbl>, FTA <dbl>, FTM <dbl>, Fouls <dbl>,
# Game_Number <dbl>, Count <dbl>

Finding distance between a row and the row two above it in R

I would like to efficiently compute distances between every row in a matrix and the row two rows above it in R...
My attempts at finding a dplyr rowwise solution with lag(., n = 2) have failed, and I'm sure there's a better solution than this for loop.
Thoughts are much appreciated!
library(rdist)
library(tidyverse)
structure(list(sodium = c(140, 152.6, 138, 152.4, 140, 152.6,
141, 152.7, 141, 152.7), chloride = c(103, 148.9, 104, 149, 102,
148.8, 103, 148.9, 104, 149), potassium_plas = c(3.4, 0.34, 4.1,
0.41, 3.7, 0.37, 4, 0.4, 3.7, 0.37), co2_totl = c(31, 3.1, 22,
2.2, 23, 2.3, 27, 2.7, 20, 2), bun = c(11, 1.1, 5, 0.5, 8, 0.8,
21, 2.1, 10, 1), creatinine = c(0.84, 0.084, 0.53, 0.053, 0.69,
0.069, 1.04, 0.104, 1.86, 0.186), calcium = c(9.3, 0.93, 9.8,
0.98, 9.4, 0.94, 9.4, 0.94, 9.1, 0.91), glucose = c(102, 10.2,
99, 9.9, 115, 11.5, 94, 9.4, 122, 12.2), anion_gap = c(6, 0.599999999999989,
12, 1.20000000000001, 15, 1.50000000000001, 11, 1.09999999999998,
17, 1.69999999999999)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
dist_prior <- rep(NA, n = nrow(input_labs))
for(i in 3:nrow(input_labs)){
dist_prior[i] <- cdist(input_labs[i,], input_labs[i-2,])
}
We could loop over the sequence of rows in map and apply the function, append NAs at the beginning to make the length correct
library(dplyr)
library(rdist)
library(purrr)
input_labs %>%
mutate(dist_prior = c(NA_real_, NA_real_,
map_dbl(3:n(), ~ cdist(cur_data()[.x,], cur_data()[.x-2, ]))))
-output
# A tibble: 10 × 10
sodium chloride potassium_plas co2_totl bun creatinine calcium glucose anion_gap dist_prior
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 140 103 3.4 31 11 0.84 9.3 102 6 NA
2 153. 149. 0.34 3.1 1.1 0.084 0.93 10.2 0.600 NA
3 138 104 4.1 22 5 0.53 9.8 99 12 13.0
4 152. 149 0.41 2.2 0.5 0.053 0.98 9.9 1.20 1.30
5 140 102 3.7 23 8 0.69 9.4 115 15 16.8
6 153. 149. 0.37 2.3 0.8 0.069 0.94 11.5 1.50 1.68
7 141 103 4 27 21 1.04 9.4 94 11 25.4
8 153. 149. 0.4 2.7 2.1 0.104 0.94 9.4 1.10 2.54
9 141 104 3.7 20 10 1.86 9.1 122 17 31.5
10 153. 149 0.37 2 1 0.186 0.91 12.2 1.70 3.15
Or may split by row on the original data and the laged one and use map2 to loop over the list and apply
input_labs$dist_prior <- map2_dbl(
asplit(lag(input_labs, n = 2), 1),
asplit(input_labs, 1),
~ cdist(as.data.frame.list(.x), as.data.frame.list(.y))[,1])
in Base R you can use diff and rowSums as shown below:
c(NA, NA, sqrt(rowSums(diff(as.matrix(input_labs), 2)^2)))
[1] NA NA 12.955157 1.295516 16.832873 1.683287 25.381342 2.538134 31.493688 3.149369
You can cbind the results to the original dataframe.

Removing blank rows in the two data frames and combine them

I have a dataset which contains two groups. First 3 columns are 1st group and next 3 columns are 2nd group. They contains missing values at random manner.
I have to delete the rows containing complete missing values in any one group. And both group contains at least one value in the row.
At last I have to combine both the groups.
I have tried many R codes. Please suggest some useful R function for this issue.
example data structure
If your data is properly named, this can be done using starts_with and if_any (otherwise you might rename your columns first as you see fit)
library(tidyverse)
df <- tribble(
~x1, ~x2, ~x3, ~y1, ~y2, ~y3,
26.4, 26.5, 26.6, 26.7, 26.4, 26.5,
NA, NA, NA, 23.7, NA, NA,
27.2, 28.0, 27.9, 27.6, 27.8, 27.7,
NA, 24.2, 24.9, 23.9, 24.9, 24.0,
24.3, NA, 24.3, 24.0, 24.1, 24.5,
26.9, 26.7, 27.0, 26.9, 26.8, 26.8,
24.4, 24.4, 24.5, 24.8, 24.3, 24.3,
NA, NA, NA, 23.9, NA, NA,
NA, NA, NA, 23.9, NA, NA,
24.9, NA, NA, 24.9, NA, NA,
NA, NA, NA, 24.5, NA, NA,
28.3, 28.2, 28.3, 28.2, 28.4, 28.3,
28.3, 28.4, 28.1, 28.3, 28.3, 28.2
)
df %>% filter(!if_all(starts_with("x"), is.na) & !if_all(starts_with("y"), is.na))
#> # A tibble: 9 × 6
#> x1 x2 x3 y1 y2 y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26.4 26.5 26.6 26.7 26.4 26.5
#> 2 27.2 28 27.9 27.6 27.8 27.7
#> 3 NA 24.2 24.9 23.9 24.9 24
#> 4 24.3 NA 24.3 24 24.1 24.5
#> 5 26.9 26.7 27 26.9 26.8 26.8
#> 6 24.4 24.4 24.5 24.8 24.3 24.3
#> 7 24.9 NA NA 24.9 NA NA
#> 8 28.3 28.2 28.3 28.2 28.4 28.3
#> 9 28.3 28.4 28.1 28.3 28.3 28.2
Created on 2022-06-18 by the reprex package (v2.0.1)
I am not sure I understand your question, but here is a demonstration using dplyr::if_all() , dplyr::if_any()
library(tidyverse)
# Example data
# have to delete the rows containing complete missing values in any one group.
# And both group contains at least one value in the row. (Not sure what that means)
# At last I have to combine both the groups.
d <- tibble::tribble(
~gr1_col1, ~gr1_col2, ~gr1_col3, ~gr2_col1, ~gr2_col2, ~gr2_col3,
1, 2, NA, 1, 1, 1,
NA, NA, NA, NA, 1, 1,
NA, 1, 1, NA, NA, 1,
1, NA, 2, NA, NA, NA,
)
d
#> # A tibble: 4 x 6
#> gr1_col1 gr1_col2 gr1_col3 gr2_col1 gr2_col2 gr2_col3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 NA 1 1 1
#> 2 NA NA NA NA 1 1
#> 3 NA 1 1 NA NA 1
#> 4 1 NA 2 NA NA NA
d %>%
dplyr::filter(
# First group
!dplyr::if_all(.cols = c(1, 2, 3), .fns = is.na), # removing rows if all columns 1, 2 and 3 are NA
# second group
!if_all(.cols = c(4, 5, 6), .fns = is.na) # removing rows if all columns 1, 2 and 3 are NA
)
#> # A tibble: 2 x 6
#> gr1_col1 gr1_col2 gr1_col3 gr2_col1 gr2_col2 gr2_col3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 NA 1 1 1
#> 2 NA 1 1 NA NA 1
# Not sure what you mean with how you want to combine groups
Created on 2022-06-17 by the reprex package (v2.0.1)

Replacing values based on conditions

I have a dataframe one of the cols is id and some of the values have been messed up during the recording of the data.
here's an example of the type of data
dput(df)
structure(list(Id = c("'110171786'", "'1103fbfd5'", "'0700edf6dc'",
"'1103fad09'", "'01103fc9bb'", "''", "''", "0000fba2b'", "'01103fb169'",
"'01103fd723'", "'01103f9c34'", "''", "''", "''", "'01103fc088'",
"'01103fa6d8'", "'01103fb374'", "'01103fce8c'", "'01103f955d'",
"'011016e633'", "'01103fa0da'", "''", "''", "''", "'01103fa4bd'",
"'01103fb5c4'", "'01103fd0d7'", "'01103f9e2e'", "'01103fc657'",
"'01103fd4d1'", "'011016e78e'", "'01103fbda2'", "'01103fbae7'",
"'011016ee23'", "'01103fc847'", "'01103fbfbb'", "''", "'01103fb8bb'",
"'01103fc853'", "''", "'01103fbcd5'", "'011016e690'", "'01103fb253'",
"'01103fcb19'", "'01103fb446'", "'01103fa4fa'", "'011016cfbd'",
"'01103fd250'", "'01103fac7d'", "'011016a86e'"), Weight = c(11.5,
11.3, 11.3, 10.6, 10.6, 8.9, 18.7, 10.9, 11.3, 18.9, 18.9, 8.6,
8.8, 8.4, 11, 10.4, 10.4, 10.8, 11.2, 11, 10.3, 9.5, 8.1, 9.3,
10.2, 10.5, 11.2, 21.9, 18, 17.8, 11.3, 11.5, 10.8, 10.5, 12.8,
10.9, 8.9, 10.3, 10.8, 8.9, 10.9, 9.9, 19, 11.6, 11.3, 11.7,
10.9, 12.1, 11.3, 10.6)), class = "data.frame", row.names = c(NA,
-50L))
>
What I would like to do is search through the id column and replace the following mistakes
some of the values have a zero missing off the front, all of these would start with a 1 now instead which makes finding them easily. So basically anything that has a character length of 9 and starts with a 1 needs a 0 as the first character.
some of the values are less than 10 characters long, these need to be removed.
some have more than one leading 0 and these need to be removed.
df$Id <- gsub("^('?)(1.{8}')$", "\\10\\2", df$Id)
df[ !grepl("^'?(00|'$)", df$Id),]
# Id Weight
# 1 '0110171786' 11.5
# 2 '01103fbfd5' 11.3
# 3 '0700edf6dc' 11.3
# 4 '01103fad09' 10.6
# 5 '01103fc9bb' 10.6
# 9 '01103fb169' 11.3
# 10 '01103fd723' 18.9
# 11 '01103f9c34' 18.9
# 15 '01103fc088' 11.0
# 16 '01103fa6d8' 10.4
# 17 '01103fb374' 10.4
# 18 '01103fce8c' 10.8
# 19 '01103f955d' 11.2
# 20 '011016e633' 11.0
# 21 '01103fa0da' 10.3
# 25 '01103fa4bd' 10.2
# 26 '01103fb5c4' 10.5
# 27 '01103fd0d7' 11.2
# 28 '01103f9e2e' 21.9
# 29 '01103fc657' 18.0
# 30 '01103fd4d1' 17.8
# 31 '011016e78e' 11.3
# 32 '01103fbda2' 11.5
# 33 '01103fbae7' 10.8
# 34 '011016ee23' 10.5
# 35 '01103fc847' 12.8
# 36 '01103fbfbb' 10.9
# 38 '01103fb8bb' 10.3
# 39 '01103fc853' 10.8
# 41 '01103fbcd5' 10.9
# 42 '011016e690' 9.9
# 43 '01103fb253' 19.0
# 44 '01103fcb19' 11.6
# 45 '01103fb446' 11.3
# 46 '01103fa4fa' 11.7
# 47 '011016cfbd' 10.9
# 48 '01103fd250' 12.1
# 49 '01103fac7d' 11.3
# 50 '011016a86e' 10.6

Splitting two messy vectors in a data frame into one common column

Sample of dataset:
library(dplyr)
sample <- structure(list(Rank = c(15, 17, 20, 2, 16, 8, 21, 5, 13, 31, 22, 18, 2, 19, 11, 11, 8, 7, 12, 9, 5, 23, 17, 16, 15, 14, 4, 20, 13, 2), Athlete = c("François Gourmet(BEL)", "Agustín Félix(ESP)", "Keisuke Ushiro", "Michael Schrader", "Pieter Braun", "Laurent Hernu(FRA)", "Dmitriy Karpov", "Laurent Hernu(FRA)", "Thomas van der Plaetsen", "Attila Szabó", "Nadir El Fassi", "Eduard Mikhan", "Leonel Suárez", "Janek Õiglane", "Hans van Alphen(BEL)", "Roman Šebrle", "André Niklaus(GER)", "Pascal Behrenbruch", "Pieter Braun", "Oleksandr Yurkov(UKR)", "Eelco Sintnicolaas", "Brent Newdick", "Kim Kun-woo", "Akihiko Nakamura", "Bastien Auzeil", "Frédéric Xhonneux", "Janek Õiglane", "Keisuke Ushiro", "Roman Šebrle", "Rico Freimuth"), Total = c(7974, 7749, 7498, 8670, 7890, 8280, 7550, 8218, 8069, 7610, 7922, 7968, 8640, 7581, 8034, 8266, 8020, 8211, 8114, 8264, 8298, 7915, 7860, 7745, 7922, 7616, 8371, 7532, 8069, 8564), `100m` = c(10.67, 11.17, 11.53, 10.73, 11.22, 10.97, 11.24, 11.2, 11.2, 11.15, 11.12, 10.97, 11.13, 11.51, 11.11, 11.16, 11.19, 11.08, 11.11, 10.93, 10.76, 11.11, 11.11, 10.86, 11.35, 11.28, 11.08, 11.51, 11.25, 10.53), LJ = c(7.15, 7.12, 6.64, 7.85, 7.17, 7.31, 6.86, 7.22, 7.79, 7.09, 7.26, 7.42, 7.24, 6.78, 7.35, 7.8, 7.21, 6.8, 7.29, 7.37, 7.29, 7.42, 7.24, 7.26, 6.87, 7.21, 7.33, 6.73, 7.3, 7.48), SP = c(13.74, 13.29, 13.43, 14.56, 14.48, 14.43, 15.69, 13.99, 12.76, 13.92, 13.62, 14.15, 15.2, 14.43, 14.67, 14.98, 13.87, 16.01, 13.9, 15.15, 14.13, 14.35, 12.96, 11.67, 15.23, 12.92, 15.13, 14.93, 15.2, 14.85), HJ = c(1.85, 2.03, 1.96, 1.99, 1.93, 2.03, 1.93, 2.03, 2.17, 1.84, 1.99, 1.96, 2.11, 1.92, 1.88, 2.11, 1.97, 1.93, 2.04, 1.97, 1.93, 1.99, 1.96, 1.95, 1.96, 2.03, 2.05, 1.89, 2.05, 1.99), `400m` = c(47.98, 52.08, 51.43, 47.66, 48.54, 49.31, 52.01, 48.95, 49.46, 49.79, 51.35, 48.8, 48, 50.95, 48.52, 50.42, 49.95, 49.9, 48.24, 49.45, 48.35, 50.1, 49.24, 47.81, 50.36, 49.04, 49.58, 50.85, 51.18, 48.41), `110mh` = c(15.02, 14.75, 15.35, 14.29, 14.67, 14.01, 14.64, 14.15, 14.79, 14.65, 14.9, 14.82, 14.45, 15.33, 14.77, 14.44, 14.5, 14.33, 14.37, 14.41, 14.42, 14.82, 14.95, 14.72, 14.59, 15.75, 14.56, 15.43, 14.75, 13.68), DT = c(39.87, 43.67, 47.64, 46.44, 42.59, 43.93, 47.1, 46.13, 37.2, 43.75, 42.25, 48, 44.71, 40.94, 44.3, 46.3, 42.68, 48.56, 42.09, 48.1, 42.23, 43.6, 39.53, 33.48, 46.86, 38.62, 42.11, 46.85, 46.93, 51.17), PV = c(5, 5, 4.6, 5, 4.7, 5.1, 4.8, 4.9, 5.1, 4.4, 4.8, 4.6, 5, 4.6, 4.3, 4.6, 5.1, 4.9, 4.9, 5, 5.2, 4.8, 4.9, 4.7, 4.8, 4.7, 5.1, 4.7, 4.8, 4.8), JT = c(57.73, 56.69, 63.28, 65.67, 59.26, 59.9, 46.91, 59.63, 58.91, 59.56, 57.65, 50.74, 75.19, 68.51, 65.71, 65.61, 57.55, 66.5, 56.95, 58.63, 61.07, 51.52, 53.33, 53.57, 60.8, 50.18, 71.73, 56.52, 67.28, 62.34), `1500m` = c(265.51, 288.27, 291.9, 265.38, 278.4, 277.41, 298.41, 268.4, 285.86, 285.64, 256.51, 273.71, 267.25, 283.06, 262.5, 290.33, 268.8, 276.64, 272.46, 278.43, 265.4, 270.57, 255.63, 256.36, 279.8, 262.71, 279.24, 283.51, 296.5, 281.57), Year = structure(c(4L, 4L, 9L, 7L, 9L, 1L, 6L, 2L, 6L, 5L, 5L, 7L, 5L, 8L, 4L, 5L, 2L, 6L, 8L, 1L, 6L, 5L, 6L, 8L, 9L, 3L, 9L, 8L, 6L, 9L), .Label = c("2001", "2003", "2005", "2007", "2009", "2011", "2013", "2015", "2017"), class = "factor"), Nationality = c(NA, NA, "Japan(JPN)", "Germany(GER)", "Netherlands(NED)", NA, "Kazakhstan(KAZ)", NA, "Belgium(BEL)", "Hungary", "France", "Belarus(BLR)", "Cuba", "Estonia(EST)", NA, "Czech Republic", NA, "Germany(GER)", "Netherlands(NED)", NA, "Netherlands(NED)", "New Zealand", "South Korea(KOR)", "Japan(JPN)", "France(FRA)", NA, "Estonia(EST)", "Japan(JPN)", "Czech Republic(CZE)", "Germany(GER)"), Notes = c(NA, NA, NA, "PB", NA, NA, NA, NA, NA, NA, "SB", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "PB", "NR", NA, "SB", NA, "PB", NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality", "Notes"), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 30 x 16
Rank Athlete Total `100m` LJ SP HJ `400m` `110mh` DT PV JT `1500m` Year Nationality Notes
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <chr> <chr>
1 15 François Gourmet(BEL) 7974 10.67 7.15 13.74 1.85 47.98 15.02 39.87 5.0 57.73 265.51 2007 <NA> <NA>
2 17 Agustín Félix(ESP) 7749 11.17 7.12 13.29 2.03 52.08 14.75 43.67 5.0 56.69 288.27 2007 <NA> <NA>
3 20 Keisuke Ushiro 7498 11.53 6.64 13.43 1.96 51.43 15.35 47.64 4.6 63.28 291.90 2017 Japan(JPN) <NA>
4 2 Michael Schrader 8670 10.73 7.85 14.56 1.99 47.66 14.29 46.44 5.0 65.67 265.38 2013 Germany(GER) PB
5 16 Pieter Braun 7890 11.22 7.17 14.48 1.93 48.54 14.67 42.59 4.7 59.26 278.40 2017 Netherlands(NED) <NA>
6 8 Laurent Hernu(FRA) 8280 10.97 7.31 14.43 2.03 49.31 14.01 43.93 5.1 59.90 277.41 2001 <NA> <NA>
7 21 Dmitriy Karpov 7550 11.24 6.86 15.69 1.93 52.01 14.64 47.10 4.8 46.91 298.41 2011 Kazakhstan(KAZ) <NA>
8 5 Laurent Hernu(FRA) 8218 11.20 7.22 13.99 2.03 48.95 14.15 46.13 4.9 59.63 268.40 2003 <NA> <NA>
9 13 Thomas van der Plaetsen 8069 11.20 7.79 12.76 2.17 49.46 14.79 37.20 5.1 58.91 285.86 2011 Belgium(BEL) <NA>
10 31 Attila Szabó 7610 11.15 7.09 13.92 1.84 49.79 14.65 43.75 4.4 59.56 285.64 2009 Hungary <NA>
# ... with 20 more rows
I have two character vectors, "Athlete and "Nationality", in my dataset where some entries have country codes in brackets attached at the end. I want to be able to split only the country codes from these two vectors into a new variable, say "countrycode", while getting rid of the brackets at the same time. I'm not sure what the best way or syntax to go about splitting would be though - dplyr::separate possibly? Though I'm uncertain how to incorporate the combinations of characters in the country codes within the brackets during the split, and the fact that some entries don't need splitting.
I would then do something like this after to remove the brackets from the new variable.
sample$countrycode<- gsub(pattern="\\(",replacement="",x=sample$countrycode)
sample$countrycode<- gsub(pattern="\\)",replacement="",x=sample$countrycode)
Thanks
Hope this works for you:
library(dplyr)
res <- sample %>% mutate(
countrycode = case_when(
is.na(Nationality) & grepl('\\(', Athlete) ~ gsub('.*?\\((.*)\\)', '\\1', Athlete),
grepl('\\(', Nationality) ~ gsub('.*?\\((.*)\\)', '\\1', Nationality),
TRUE ~ Nationality
)
)
sample output:
res %>% select(Athlete, Nationality, countrycode)
# # A tibble: 30 x 3
# Athlete Nationality countrycode
# <chr> <chr> <chr>
# 1 François Gourmet(BEL) NA BEL
# 2 Agustín Félix(ESP) NA ESP
# 3 Keisuke Ushiro Japan(JPN) JPN
# 4 Michael Schrader Germany(GER) GER
# 5 Pieter Braun Netherlands(NED) NED
# 6 Laurent Hernu(FRA) NA FRA
# 7 Dmitriy Karpov Kazakhstan(KAZ) KAZ
# 8 Laurent Hernu(FRA) NA FRA
# 9 Thomas van der Plaetsen Belgium(BEL) BEL
# 10 Attila Szabó Hungary Hungary
# # ... with 20 more rows
Remove the TRUE ~ Nationality to extract only country code as commented by Frank:
sample %>% mutate(
countrycode = case_when(
is.na(Nationality) & grepl('\\(', Athlete) ~ gsub('.*?\\((.*)\\)', '\\1', Athlete),
grepl('\\(', Nationality) ~ gsub('.*?\\((.*)\\)', '\\1', Nationality)
))
An ugly approach would be to use sub:
library(data.table)
DT = data.table(sample)
patt = "^.*\\((.{3})\\).*$"; rp = "\\1"
DT[Athlete %like% patt, cc := sub(patt, rp, Athlete)]
DT[Nationality %like% patt, cc := sub(patt, rp, Nationality)]
Something like str_extract from the stringr package would probably be cleaner if you're already working with tidyverse packages. Also, for the dplyr analogue to the code above, maybe look at the case_when function. (I am not familiar enough with these tools to know the exact syntax.)
The result looks like...
> DT[, .(Athlete, Nationality, cc)]
Athlete Nationality cc
1: François Gourmet(BEL) NA BEL
2: Agustín Félix(ESP) NA ESP
3: Keisuke Ushiro Japan(JPN) JPN
4: Michael Schrader Germany(GER) GER
5: Pieter Braun Netherlands(NED) NED
6: Laurent Hernu(FRA) NA FRA
7: Dmitriy Karpov Kazakhstan(KAZ) KAZ
8: Laurent Hernu(FRA) NA FRA
9: Thomas van der Plaetsen Belgium(BEL) BEL
10: Attila Szabó Hungary NA
11: Nadir El Fassi France NA
12: Eduard Mikhan Belarus(BLR) BLR
13: Leonel Suárez Cuba NA
14: Janek Õiglane Estonia(EST) EST
15: Hans van Alphen(BEL) NA BEL
16: Roman Šebrle Czech Republic NA
17: André Niklaus(GER) NA GER
18: Pascal Behrenbruch Germany(GER) GER
19: Pieter Braun Netherlands(NED) NED
20: Oleksandr Yurkov(UKR) NA UKR
21: Eelco Sintnicolaas Netherlands(NED) NED
22: Brent Newdick New Zealand NA
23: Kim Kun-woo South Korea(KOR) KOR
24: Akihiko Nakamura Japan(JPN) JPN
25: Bastien Auzeil France(FRA) FRA
26: Frédéric Xhonneux NA NA
27: Janek Õiglane Estonia(EST) EST
28: Keisuke Ushiro Japan(JPN) JPN
29: Roman Šebrle Czech Republic(CZE) CZE
30: Rico Freimuth Germany(GER) GER
Athlete Nationality cc
This simple solution works too.
library(stringr)
data1$country_code <- sapply(data1$Nationality, function(x) unlist(stri_extract_all(str = x, regex = '([A-Z]+)'))[2])
Nationality country_code
1: NA NA
2: NA NA
3: Japan(JPN) JPN
4: Germany(GER) GER
5: Netherlands(NED) NED
6: NA NA

Resources