Using separate() to split numbers that are stuck together (example: 201612) in R

I want to separate the month_date_yyyymm column from this tibble:
month_date_yyyymm postal_code zip_name nielsen_hh_rank hotness_rank hotness_score
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 201612 80230 denver, co 8459 3420 74.0
2 201612 80503 longmont, co 2233 6088 60.7
3 201612 38221 big sandy, tn 15014 12539 25.5
4 201612 13691 theresa, ny 15586 14796 11.6
5 201612 19076 prospect park, pa 11777 1661 84.4
6 201612 18036 coopersburg, pa 8235 7870 51.5
I want the tibble to look like this
year month postal_code zip_name nielsen_hh_rank hotness_rank hotness_score
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 2016 12 80230 denver, co 8459 3420 74.0
2 2016 12 80503 longmont, co 2233 6088 60.7
3 2016 12 38221 big sandy, tn 15014 12539 25.5
4 2016 12 13691 theresa, ny 15586 14796 11.6
5 2016 12 19076 prospect park, pa 11777 1661 84.4
6 2016 12 18036 coopersburg, pa 8235 7870 51.5
I can't figure out how to separate numbers that are stuck together, such as the month_date_yyyymm column. I know it has something to do with sep = in the separate function. Here is my code:
hotness_cleaned <- hotness %>% separate(month_date_yyyymm, into = c("year", "month"), sep = "2016", remove = T)
However, it's showing up like this:
year month postal_code zip_name nielsen_hh_rank hotness_rank hotness_score
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 "" 12 80230 denver, co 8459 3420 74.0
2 "" 12 80503 longmont, co 2233 6088 60.7
3 "" 12 38221 big sandy, tn 15014 12539 25.5
4 "" 12 13691 theresa, ny 15586 14796 11.6
5 "" 12 19076 prospect park, pa 11777 1661 84.4
6 "" 12 18036 coopersburg, pa 8235 7870 51.5
What is the correct syntax for separating numbers that are stuck together using "sep = "?
Thank you.

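separate() also accepts a numeric sep, which is read as a character position to split at. With sep = 4 the value is split after the fourth character: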
library(dplyr)
library(tidyr)
hotness %>%
  separate(month_date_yyyymm, into = c("year", "month"),
           sep = 4, remove = TRUE, convert = TRUE)
-output
# year month postal_code zip_name nielsen_hh_rank hotness_rank hotness_score
#1 2016 12 80230 denver, co 8459 3420 74.0
#2 2016 12 80503 longmont, co 2233 6088 60.7
#3 2016 12 38221 big sandy, tn 15014 12539 25.5
#4 2016 12 13691 theresa, ny 15586 14796 11.6
#5 2016 12 19076 prospect park, pa 11777 1661 84.4
#6 2016 12 18036 coopersburg, pa 8235 7870 51.5
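To see why sep = "2016" left the year empty, here is a minimal base R sketch. It assumes that base strsplit() is a fair stand-in for how a character sep behaves (tidyr documents a character sep as a regular expression): the matched text is consumed by the split, so nothing remains for the first piece. The variable names are made up for illustration.

```r
x <- as.character(201612)

# A string/regex separator is removed from the result, so splitting on
# "2016" leaves an empty first piece, as in the question's output.
pieces <- strsplit(x, "2016")[[1]]
pieces

# Position-based split: characters 1-4 are the year, 5-6 the month.
year_chr  <- substr(x, 1, 4)
month_chr <- substr(x, 5, 6)

# Because the column is numeric, integer arithmetic is another option.
year_int  <- 201612L %/% 100L
month_int <- 201612L %% 100L
```

The arithmetic version returns integers, so a month such as January comes back as 1 rather than "01".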
data
hotness <- structure(list(month_date_yyyymm = c(201612L, 201612L, 201612L,
201612L, 201612L, 201612L), postal_code = c(80230L, 80503L, 38221L,
13691L, 19076L, 18036L), zip_name = c("denver, co", "longmont, co",
"big sandy, tn", "theresa, ny", "prospect park, pa", "coopersburg, pa"
), nielsen_hh_rank = c(8459L, 2233L, 15014L, 15586L, 11777L,
8235L), hotness_rank = c(3420L, 6088L, 12539L, 14796L, 1661L,
7870L), hotness_score = c(74, 60.7, 25.5, 11.6, 84.4, 51.5)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Related

Rename/recode variable value in R based on condition using dplyr

I have a dataset dataExtended with the variables CountryOther and n, where n is the count of wines from that country. CountryOther is of type character and n is an integer. I want to rename the values in CountryOther to Other whenever n <= 20. I would like to do this with the dplyr package, but I am not sure how, or whether to use mutate or mutate_at.
Since I wasn't able to write the condition as stated above, I tried to do it manually as follows, but it didn't work:
dataExtended$CountryOther <- dataExtended$Country
dataExtended %>%
mutate(CountryOther = recode(CountryOther,
China = "Other",
Mexico = "Other",
Slovakia = "Other",
Bulgaria = "Other",
Canada = "Other",
Croatia = "Other",
Uruguay = "Other",
Georgia = "Other",
Turkey = "Other",
Moldova = "Other",
Slovenia = "Other",
Hungary = "Other",
Switzerland = "Other",
Greece = "Other",
Israel = "Other",
Lebanon= "Other"))
Using the Red.csv from your link imported with readr::read_csv() creates a data.frame / tibble
#> data
# A tibble: 8,666 × 8
Name Country Region Winery Rating NumberOf…¹ Price Year
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 Pomerol 2011 France Pomerol Château La Providence 4.2 100 95 2011
2 Lirac 2017 France Lirac Château Mont-Redon 4.3 100 15.5 2017
3 Erta e China Rosso di Toscana 2015 Italy Toscana Renzo Masi 3.9 100 7.45 2015
4 Bardolino 2019 Italy Bardolino Cavalchina 3.5 100 8.72 2019
5 Ried Scheibner Pinot Noir 2016 Austria Carnuntum Markowitsch 3.9 100 29.2 2016
6 Gigondas (Nobles Terrasses) 2017 France Gigondas Vieux Clocher 3.7 100 19.9 2017
7 Marion's Vineyard Pinot Noir 2016 New Zealand Wairarapa Schubert 4 100 43.9 2016
8 Red Blend 2014 Chile Itata Valley Viña La Causa 3.9 100 17.5 2014
9 Chianti 2015 Italy Chianti Castello Montaùto 3.6 100 10.8 2015
10 Tradition 2014 France Minervois Domaine des Aires Hautes 3.5 100 6.9 2014
# … with 8,656 more rows, and abbreviated variable name ¹​NumberOfRatings
Now, with dplyr's help,
library(dplyr)
data %>%
add_count(Country, name = "WineCount") %>%
mutate(CountryOther = ifelse(WineCount <= 20, "Other", Country))
we get
# A tibble: 8,666 × 10
Name Country Region Winery Rating Numbe…¹ Price Year WineC…² Count…³
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int> <chr>
1 Pomerol 2011 France Pomerol Château La… 4.2 100 95 2011 2256 France
2 Lirac 2017 France Lirac Château Mo… 4.3 100 15.5 2017 2256 France
3 Erta e China Rosso di Toscana 2015 Italy Toscana Renzo Masi 3.9 100 7.45 2015 2650 Italy
4 Bardolino 2019 Italy Bardolino Cavalchina 3.5 100 8.72 2019 2650 Italy
5 Ried Scheibner Pinot Noir 2016 Austria Carnuntum Markowitsch 3.9 100 29.2 2016 220 Austria
6 Gigondas (Nobles Terrasses) 2017 France Gigondas Vieux Cloc… 3.7 100 19.9 2017 2256 France
7 Marion's Vineyard Pinot Noir 2016 New Zealand Wairarapa Schubert 4 100 43.9 2016 63 New Ze…
8 Red Blend 2014 Chile Itata Valley Viña La Ca… 3.9 100 17.5 2014 326 Chile
9 Chianti 2015 Italy Chianti Castello M… 3.6 100 10.8 2015 2650 Italy
10 Tradition 2014 France Minervois Domaine de… 3.5 100 6.9 2014 2256 France
# … with 8,656 more rows, and abbreviated variable names ¹​NumberOfRatings, ²​WineCount, ³​CountryOther
Filtering for WineCount <= 30 lets us check the result:
# A tibble: 125 × 10
Name Country Region Winery Rating Numbe…¹ Price Year WineC…² Count…³
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int> <chr>
1 Steiner 2013 Hungary Sopron Wenin… 3.7 100 24.5 2013 9 Other
2 Viile Metamorfosis Merlot 2015 Romania Dealu Mare Vitis… 3.5 102 7.5 2015 23 Romania
3 Halkidiki Limnio - Merlot 2013 Greece Chalkidiki Tsant… 3.2 105 12.5 2013 13 Other
4 Cabernet Sauvignon 2013 Mexico Valle de Guad… L. A.… 3.4 1066 8.65 2013 1 Other
5 Driopi Classic Agiorgitiko Nemea 2017 Greece Nemea Κτημα… 3.7 107 11.5 2017 13 Other
6 Malbec de Purcari 2018 Moldova South Eastern Châte… 4.1 107 12.0 2018 8 Other
7 Cabernet Sauvignon de Purcari 2017 Moldova South Eastern Châte… 4.1 1082 13.0 2017 8 Other
8 Cabernet Sauvignon 2016 Romania Samburesti Caste… 3.3 112 7.9 2016 23 Romania
9 Aigle Les Murailles Rouge 2015 Switzerland Aigle Henri… 3.7 112 23.2 2015 12 Other
10 Γουμένισσα (Goumenissa) 2015 Greece Goumenissa Chatz… 3.7 115 20 2015 13 Other
Several rows now have "Other" in the CountryOther column, as desired.
In the end I created this code, which works:
#New table with wine count
wineCount <- data %>% count(Country)
#Joining two tables together
dataExtended <- inner_join(wineCount, data, by = "Country")
# Creating new variable CountryOther
dataExtended$CountryOther <- dataExtended$Country
# Renaming count from n to WineCount
dataExtended <- rename(dataExtended, WineCount = n)
# Replacement of countries with WineCount<=20 to Other
dataExtended <- dataExtended %>%
mutate(CountryOther = ifelse(WineCount<=20, "Other", CountryOther))
# Final check
unique(dataExtended$CountryOther)
The problem was that I needed to store the changes back into the data frame, which I hadn't done before (as you can see in my last comment):
dataExtended <- rename(dataExtended, WineCount = n)
and
dataExtended <- dataExtended %>%
mutate(CountryOther = ifelse(WineCount<=20, "Other", CountryOther))
I also tested your code and it works as well and additionally it looks neater. So thank you very much for your help.
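The point about storing the changes can be shown with a small sketch. The data below is a hypothetical toy table, not the Red.csv file, and the threshold of 2 is chosen only to suit it. A dplyr pipeline returns a new data frame and leaves the original untouched unless the result is assigned.

```r
library(dplyr)

# Hypothetical toy data standing in for the wine table
wines <- tibble(
  Country = c("France", "France", "France", "Italy", "Italy", "Italy",
              "Mexico", "Greece", "Greece")
)

# Without assignment the result is only printed; `wines` is unchanged.
wines %>% add_count(Country, name = "WineCount")
names(wines)

# Assigning the result keeps the new columns.
wines_other <- wines %>%
  add_count(Country, name = "WineCount") %>%
  mutate(CountryOther = if_else(WineCount <= 2, "Other", Country))

unique(wines_other$CountryOther)
```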

How to Import data from external website into R?

For my project, I need to load data (an Excel sheet) from a website directly into R. How can this be done?
For the time being, this URL can serve as an example: "https://www.contextures.com/tablesamples/sampledatahockey.zip"
You can try:
library(readxl)
download.file("https://www.contextures.com/tablesamples/sampledatahockey.zip",
destfile = "sampledatahockey.zip")
unzip("sampledatahockey.zip")
read_excel("sampledatahockey.xlsx", sheet = "PlayerData", skip = 2)
Output is:
# A tibble: 96 × 15
ID Team Country NameF NameL Weight Height DOB Hometown Prov Pos Age HeightFt HtIn BMI
<dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Women Canada Meghan Agosta 148 5'7 1987-02-12 00:00:00 Ruthven Ont. Forward 34 5.58 67 23
2 2 Women Canada Rebecca Johnston 148 5'9 1989-09-24 00:00:00 Sudbury Ont. Forward 32 5.75 69 22
3 3 Women Canada Laura Stacey 156 5'10 1994-05-05 00:00:00 Kleinburg Ont. Forward 27 5.83 70 22
4 4 Women Canada Jennifer Wakefield 172 5'10 1989-06-15 00:00:00 Pickering Ont. Forward 32 5.83 70 25
5 5 Women Canada Jillian Saulnier 144 5'5 1992-03-07 00:00:00 Halifax N.S. Forward 29 5.42 65 24
6 6 Women Canada Mélodie Daoust 159 5'6 1992-01-07 00:00:00 Valleyfield Que. Forward 29 5.5 66 26
7 7 Women Canada Bailey Bram 150 5'8 1990-09-05 00:00:00 St. Anne Man. Forward 31 5.67 68 23
8 8 Women Canada Brianne Jenner 156 5'9 1991-05-04 00:00:00 Oakville Ont. Forward 30 5.75 69 23
9 9 Women Canada Sarah Nurse 140 5'8 1995-01-04 00:00:00 Hamilton Ont. Forward 26 5.67 68 21
10 10 Women Canada Haley Irwin 170 5'7 1988-06-06 00:00:00 Thunder Bay Ont. Forward 33 5.58 67 27
# … with 86 more rows
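If you would rather not leave the zip and its contents in the working directory, a variant using temporary files might look like the sketch below. The helper name fetch_zip_contents is made up for illustration, and the explicit mode = "wb" is a precaution for binary downloads rather than something the answer above requires.

```r
# Hypothetical helper: download a zip archive to a temporary file,
# extract it into a temporary directory, and return the extracted paths.
fetch_zip_contents <- function(url, exdir = tempfile("unzipped_")) {
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path), add = TRUE)  # remove the archive afterwards
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = exdir)         # character vector of file paths
}

# Usage (needs network access and the readxl package):
if (interactive()) {
  files <- fetch_zip_contents(
    "https://www.contextures.com/tablesamples/sampledatahockey.zip"
  )
  xlsx <- files[grepl("\\.xlsx$", files)][1]
  readxl::read_excel(xlsx, sheet = "PlayerData", skip = 2)
}
```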

calculate 5 year average of panel data with factors kept

I have a panel data set that may look like
set.seed(123)
df <- data.frame(
year = rep(2011:2020,5),
county = rep(c("a","b",'c','d','e'), each=10),
state = rep(c("A","B",'C','D','E'), each=10),
country = rep(c("AA","BB",'CC','DD','EE'), each=10),
var1 = runif(50, 0, 50),
var2 = runif(50, 50, 100)
)
I want to transform the panel data set to 5 year averages of the counties by
df <- df %>%
mutate(period = cut(df$year, seq(2011, 2021, by = 5),right = F)) %>%
group_by(county, period) %>%
summarise_all(mean)
The data set looks like
county period year state country var1 var2
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a [2011,2016) 2013 NA NA 33.1 69.7
2 a [2016,2021) 2018 NA NA 24.7 73.6
3 b [2011,2016) 2013 NA NA 27.6 72.3
4 b [2016,2021) 2018 NA NA 24.7 83.1
5 c [2011,2016) 2013 NA NA 38.7 75.7
6 c [2016,2021) 2018 NA NA 22.8 66.8
7 d [2011,2016) 2013 NA NA 33.8 72.2
8 d [2016,2021) 2018 NA NA 20.0 83.7
9 e [2011,2016) 2013 NA NA 14.9 71.0
10 e [2016,2021) 2018 NA NA 19.6 70.4
The warning messages are, for example
In mean.default(state) :
argument is not numeric or logical: returning NA
Is there a smart way (not by merging, as I actually have a lot of character columns) to keep the time-invariant character columns of each county after the transformation?
What I desire is
county period year state country var1 var2
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a [2011,2016) 2013 A AA 33.1 69.7
2 a [2016,2021) 2018 A AA 24.7 73.6
3 b [2011,2016) 2013 B BB 27.6 72.3
4 b [2016,2021) 2018 B BB 24.7 83.1
5 c [2011,2016) 2013 C CC 38.7 75.7
6 c [2016,2021) 2018 C CC 22.8 66.8
7 d [2011,2016) 2013 D DD 33.8 72.2
8 d [2016,2021) 2018 D DD 20.0 83.7
9 e [2011,2016) 2013 E EE 14.9 71.0
10 e [2016,2021) 2018 E EE 19.6 70.4
Thank you in advance!
The warning arises because summarise_all(mean) computes the mean not only of var1 and var2 but also of the character columns state and country. If you want to keep state and country, add them to group_by() as grouping columns:
library(dplyr)
df %>%
group_by(county, state, country,
period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
summarise_all(mean) %>%
ungroup()
# # A tibble: 10 × 7
# county state country period year var1 var2
# <chr> <chr> <chr> <fct> <dbl> <dbl> <dbl>
# 1 a A AA [2011,2016) 2013 33.1 69.7
# 2 a A AA [2016,2021) 2018 24.7 73.6
# 3 b B BB [2011,2016) 2013 27.6 72.3
# 4 b B BB [2016,2021) 2018 24.7 83.1
# 5 c C CC [2011,2016) 2013 38.7 75.7
# 6 c C CC [2016,2021) 2018 22.8 66.8
# 7 d D DD [2011,2016) 2013 33.8 72.2
# 8 d D DD [2016,2021) 2018 20.0 83.7
# 9 e E EE [2011,2016) 2013 14.9 71.0
# 10 e E EE [2016,2021) 2018 19.6 70.4
If the grouping columns are just county and period, and the other categorical variables take a single value within each group, you can keep them by taking their first value with first() inside summarise():
df %>%
group_by(county,
period = cut(year, seq(2011, 2021, by = 5), right = FALSE)) %>%
summarise(across(!where(is.numeric), first),
across( where(is.numeric), mean)) %>%
ungroup()
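For reference, this is how the period column in both versions is formed. It is a minimal base R sketch of the cut() call on the years 2011 to 2020; right = FALSE makes each interval closed on the left and open on the right, so 2016 falls into the second period.

```r
years <- 2011:2020

# Breaks at 2011, 2016 and 2021 give two half-open five-year intervals.
period <- cut(years, breaks = seq(2011, 2021, by = 5), right = FALSE)

levels(period)   # "[2011,2016)" "[2016,2021)"
table(period)    # five years fall into each period
```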

Pivot longer with multiple variables in columns

My data looks like this:
# A tibble: 120 x 5
age death_rate_male life_exp_male death_rate_fem life_exp_fem
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.00630 76.0 0.00523 81.0
2 1 0.000426 75.4 0.000342 80.4
3 2 0.00029 74.5 0.000209 79.4
4 3 0.000229 73.5 0.000162 78.4
5 4 0.000162 72.5 0.000143 77.4
6 5 0.000146 71.5 0.000125 76.5
7 6 0.000136 70.5 0.000113 75.5
8 7 0.000127 69.6 0.000104 74.5
9 8 0.000115 68.6 0.000097 73.5
10 9 0.000103 67.6 0.000093 72.5
# ... with 110 more rows
I'm trying to create a tidy table where the variables are age, gender, life expectancy, and death rate.
I managed to do this by splitting the data frame into two (one containing life expectancy, the other death rate), tidying both with pivot_longer(), and then appending the two tables.
Is there a way to do this more elegantly, with a single pivot_longer() command? Thank you in advance.
We can use names_pattern (where we capture as a group based on the pattern)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -age, names_to = c( '.value', 'grp'),
names_pattern = "^(\\w+_\\w+)_(\\w+)")
# A tibble: 20 x 4
# age grp death_rate life_exp
# <int> <chr> <dbl> <dbl>
# 1 0 male 0.0063 76
# 2 0 fem 0.00523 81
# 3 1 male 0.000426 75.4
# 4 1 fem 0.000342 80.4
# 5 2 male 0.00029 74.5
# 6 2 fem 0.000209 79.4
# 7 3 male 0.000229 73.5
# 8 3 fem 0.000162 78.4
# 9 4 male 0.000162 72.5
#10 4 fem 0.000143 77.4
#11 5 male 0.000146 71.5
#12 5 fem 0.000125 76.5
#13 6 male 0.000136 70.5
#14 6 fem 0.000113 75.5
#15 7 male 0.000127 69.6
#16 7 fem 0.000104 74.5
#17 8 male 0.000115 68.6
#18 8 fem 0.000097 73.5
#19 9 male 0.000103 67.6
#20 9 fem 0.000093 72.5
Alternatively, use names_sep with a regex that matches only the last underscore (an underscore followed by one or more non-underscore characters up to the end of the name):
df1 %>%
pivot_longer(cols = -age, names_to = c( '.value', 'grp'),
names_sep = "_(?=[^_]+$)")
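To check what the two regular expressions do to the column names, they can be tried outside pivot_longer(). The sketch below uses base R with perl = TRUE as a stand-in; that tidyr's own regex engine behaves the same way on these patterns is an assumption, although the output above is consistent with it.

```r
cols <- c("death_rate_male", "life_exp_male", "death_rate_fem", "life_exp_fem")

# names_pattern: the first capture group becomes the measure (.value),
# the second becomes the group label.
pattern <- "^(\\w+_\\w+)_(\\w+)"
measure <- sub(pattern, "\\1", cols, perl = TRUE)
grp     <- sub(pattern, "\\2", cols, perl = TRUE)

# names_sep: split only at the last underscore, using a lookahead that
# requires no further underscore before the end of the name.
parts <- strsplit(cols, "_(?=[^_]+$)", perl = TRUE)

measure
grp
parts[[1]]
```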
data
df1 <- structure(list(age = 0:9, death_rate_male = c(0.0063, 0.000426,
0.00029, 0.000229, 0.000162, 0.000146, 0.000136, 0.000127, 0.000115,
0.000103), life_exp_male = c(76, 75.4, 74.5, 73.5, 72.5, 71.5,
70.5, 69.6, 68.6, 67.6), death_rate_fem = c(0.00523, 0.000342,
0.000209, 0.000162, 0.000143, 0.000125, 0.000113, 0.000104, 9.7e-05,
9.3e-05), life_exp_fem = c(81, 80.4, 79.4, 78.4, 77.4, 76.5,
75.5, 74.5, 73.5, 72.5)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Borrowing the data from akrun, here is a base R option using reshape
reshape(
setNames(df1, gsub("(.*)_(\\w+)", "\\1\\.\\2", names(df1))),
direction = "long",
varying = -1
)
such that
age time death_rate life_exp id
1.male 0 male 0.006300 76.0 1
2.male 1 male 0.000426 75.4 2
3.male 2 male 0.000290 74.5 3
4.male 3 male 0.000229 73.5 4
5.male 4 male 0.000162 72.5 5
6.male 5 male 0.000146 71.5 6
7.male 6 male 0.000136 70.5 7
8.male 7 male 0.000127 69.6 8
9.male 8 male 0.000115 68.6 9
10.male 9 male 0.000103 67.6 10
1.fem 0 fem 0.005230 81.0 1
2.fem 1 fem 0.000342 80.4 2
3.fem 2 fem 0.000209 79.4 3
4.fem 3 fem 0.000162 78.4 4
5.fem 4 fem 0.000143 77.4 5
6.fem 5 fem 0.000125 76.5 6
7.fem 6 fem 0.000113 75.5 7
8.fem 7 fem 0.000104 74.5 8
9.fem 8 fem 0.000097 73.5 9
10.fem 9 fem 0.000093 72.5 10

Group by DF and then Filter using dplyr

This might be relatively easy in dplyr. Sample question uses the Lahman package data.
What player managed both the NYA and NYN under teamID?
# get master player table
players <- Lahman::People
# get manager table
managers <- Lahman::Managers
# merge players to managers
manager_tbl <-
managers %>%
left_join(players)
I want to get the results for the players under playerID that have a row for both NYA and NYN under teamID.
How would I go about doing this? I'm guessing that I would need to group at playerID. berrayo01 is one of the answers.
After grouping by playerID, use filter() to keep only the groups whose teamID values include both "NYA" and "NYN":
library(dplyr)
manager_tbl %>%
group_by(playerID) %>%
filter(all(c("NYA", "NYN") %in% teamID))
# A tibble: 69 x 35
# Groups: playerID [4]
# playerID yearID teamID lgID inseason G W L rank plyrMgr birthYear birthMonth birthDay birthCountry birthState birthCity deathYear deathMonth deathDay deathCountry deathState
# <chr> <int> <fct> <fct> <int> <int> <int> <int> <int> <fct> <int> <int> <int> <chr> <chr> <chr> <int> <int> <int> <chr> <chr>
# 1 stengca… 1934 BRO NL 1 153 71 81 6 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 2 stengca… 1935 BRO NL 1 154 70 83 5 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 3 stengca… 1936 BRO NL 1 156 67 87 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 4 stengca… 1938 BSN NL 1 153 77 75 5 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 5 stengca… 1939 BSN NL 1 152 63 88 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 6 stengca… 1940 BSN NL 1 152 65 87 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 7 stengca… 1941 BSN NL 1 156 62 92 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 8 stengca… 1942 BSN NL 1 150 59 89 7 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# 9 stengca… 1943 BSN NL 2 107 47 60 6 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
#10 stengca… 1949 NYA AL 1 155 97 57 1 N 1890 7 30 USA MO Kansas C… 1975 9 29 USA CA
# … with 59 more rows, and 14 more variables: deathCity <chr>, nameFirst <chr>, nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <fct>, throws <fct>, debut <chr>,
# finalGame <chr>, retroID <chr>, bbrefID <chr>, deathDate <date>, birthDate <date>
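The same idea on a small made-up table, so the logic is easy to check without the Lahman package (the player IDs here are hypothetical). Within each group, c("NYA", "NYN") %in% teamID yields two logical values, and all() is TRUE only when both teams appear.

```r
library(dplyr)

# Hypothetical toy data: one row per manager stint
mgr <- tibble(
  playerID = c("a", "a", "b", "c", "c", "c"),
  teamID   = c("NYA", "NYN", "NYA", "NYN", "BOS", "NYA")
)

both <- mgr %>%
  group_by(playerID) %>%
  filter(all(c("NYA", "NYN") %in% teamID)) %>%
  ungroup()

# Every row of a qualifying player is kept, including other teams.
both

# Just the matching IDs:
unique(both$playerID)
```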
