I lose the constant variables (including id) when using pivot_longer with multiple variables - r

I am trying to reshape the following
country   region      abc2001   abc2002   xyz2001   xyz2002
Japan     East Asia   1         2         4.5       5.5

to the following

country   region      year   abc   xyz
Japan     East Asia   2001   1     4.5
Japan     East Asia   2002   2     5.5
Actually, there are five more variables structured in the same way.
I use the following code:
long <- data %>% pivot_longer(cols = c(-country, -region), names_to = c(".value", "year"), names_pattern = "([^\\.]*)\\.*(\\d{4})")
The result is the long version of the data, except that I lose the country and region variables. What am I doing wrong? Or how else can I do this better?
Thank you in advance.

We may change the regex pattern to match one or more non-digits (\\D+) as the first capture group and one or more digits (\\d+) as the second one:
library(tidyr)
pivot_longer(data, cols = c(-country, -region),
names_to = c(".value", "year"), names_pattern = "(\\D+)(\\d+)")
-output
# A tibble: 2 × 5
  country region    year    abc   xyz
  <chr>   <chr>     <chr> <int> <dbl>
1 Japan   East Asia 2001      1   4.5
2 Japan   East Asia 2002      2   5.5
data
data <- structure(list(country = "Japan", region = "East Asia", abc2001 = 1L,
abc2002 = 2L, xyz2001 = 4.5, xyz2002 = 5.5),
class = "data.frame", row.names = c(NA,
-1L))
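As a quick sanity check (my own addition, not part of the answer), the long result can be pivoted back to wide and should reproduce the original columns; names_glue here just drops the default underscore separator:

library(tidyr)
long <- pivot_longer(data, cols = c(-country, -region),
                     names_to = c(".value", "year"), names_pattern = "(\\D+)(\\d+)")
pivot_wider(long, names_from = year, values_from = c(abc, xyz),
            names_glue = "{.value}{year}")
# country, region, abc2001, abc2002, xyz2001, xyz2002 -- the original wide layout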

Update: as @akrun noted in the comments, here is a better regex with a lookaround:
library(stringr)
rename_with(data, ~ str_replace(.x, "(?<=\\D)(?=\\d)", "_"))
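Putting the pieces together, a sketch of the full pipeline with the lookaround rename followed by the pivot (my own addition, assuming the same data as below):

library(dplyr)
library(tidyr)
library(stringr)
data %>%
  rename_with(~ str_replace(.x, "(?<=\\D)(?=\\d)", "_")) %>%   # abc2001 -> abc_2001, etc.
  pivot_longer(-c(country, region),
               names_to = c(".value", "year"),
               names_sep = "_")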
First answer:
Here is a version with names_sep. The challenge was to insert an underscore into the column names. The preferred answer is @akrun's:
(.*) - Group 1: zero or more characters, as many as possible
(\\d{4}$) - Group 2: four digits at the end
library(dplyr)
library(tidyr)
data %>%
  rename_with(~ sub("(.*)(\\d{4}$)", "\\1_\\2", .x)) %>%
  pivot_longer(-c(country, region),
               names_to = c(".value", "Year"),
               names_sep = "_")
  country region    Year    abc   xyz
  <chr>   <chr>     <chr> <int> <dbl>
1 Japan   East Asia 2001      1   4.5
2 Japan   East Asia 2002      2   5.5

Related

Create several columns from a complex column in R

Imagine dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
I would like to:
Split the City column into three columns: City, Country, State (if available, NA otherwise).
Make sure Milwaukee ends up with data in both State and Population (the NA population for Milwaukee should take the value 400000 after the split).
Could you please suggest the easiest way to do so? :)
Here's another solution with extract, which extracts Country, City, and State in a single go, with State captured by an optional group (the rest of the task is done as in @Allen's code):
library(tidyr)
library(dplyr)
df1 %>%
  extract(City,
          into = c("Country", "City", "State"),
          regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?") %>%
  # as by @Allen Cameron:
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
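A quick illustration (my own addition) of how the optional third capture group behaves on two of the City strings; stringr::str_match is used here only to show the groups:

library(stringr)
str_match(c("Canada > Nanaimo, BC", "Canada > Montreal"),
          "([^>]+) > ([^,]+),? ?([A-Z]+)?")[, -1]
#      [,1]     [,2]       [,3]
# [1,] "Canada" "Nanaimo"  "BC"
# [2,] "Canada" "Montreal" NA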
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
  separate(City, sep = " > ", into = c("Country", "City")) %>%
  separate(City, sep = ", ", into = c("City", "State")) %>%
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000
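As a side note (my own addition, assuming tidyr >= 1.3), separate() is superseded there by separate_wider_delim(), which handles the missing state via too_few = "align_start"; the group_by/summarize step above stays the same:

library(tidyr)
df1 %>%
  separate_wider_delim(City, delim = " > ", names = c("Country", "City")) %>%
  separate_wider_delim(City, delim = ", ", names = c("City", "State"),
                       too_few = "align_start")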

In R, pivot duplicate row values into column values

My problem is similar to this one, but I am having trouble making the code work for me:
Pivot dataframe to keep column headings and sub-headings in R
My data looks like this:
prod1<-c(1000,2000,1400,1340)
prod2<-c(5000,5400,3400,5400)
partner<-c("World","World","Turkey","Turkey")
year<-c("2017","2018","2017","2018")
type<-c("credit","credit","debit","debit")
s <- as.data.frame(rbind(partner, year, type, prod1, prod2))
But I need to convert all the rows into individual variables so that my columns are:
column.names<-c("products","partner","year","type","value")
I've been trying the code below:
#fix partners
colnames(s)[seq(2, 7, 1)] <- colnames(s)[2] #seq(start,end,increment)
colnames(s)[seq(9, ncol(s), 1)] <- colnames(s)[8]
colnames(s) <-
c(s[1, 1], paste(sep = '_', colnames(s)[2:ncol(s)], as.character(unlist(s[1, 2:ncol(s)]))))
test<-s[-1,]
s <- rename(s, category=1)
test<- s %>%
slice(-1) %>%
pivot_longer(-1,
names_to = c("partner", ".value"),
names_sep = "_") %>%
arrange(partner, `Service item`) %>%
mutate(partner = as.character(partner))
But it keeps saying I can't have duplicate column names. Can someone please help? The initial data is submitted in this format so I need to get it in the right shape.
library(tidyverse)

s <- rownames_to_column(s)
s %>%
  pivot_longer(starts_with("V")) %>%
  pivot_wider(names_from = rowname, values_from = value) %>%
  select(-name) %>%
  pivot_longer(starts_with("prod"), names_to = "product", values_to = "value")
# A tibble: 8 × 5
partner year type product value
<chr> <chr> <chr> <chr> <chr>
1 World 2017 credit prod1 1000
2 World 2017 credit prod2 5000
3 World 2018 credit prod1 2000
4 World 2018 credit prod2 5400
5 Turkey 2017 debit prod1 1400
6 Turkey 2017 debit prod2 3400
7 Turkey 2018 debit prod1 1340
8 Turkey 2018 debit prod2 5400
Sorry, I misread the question at the beginning. Is this what you are looking for?
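For what it's worth, a rough alternative sketch (my own addition, not from the original answer): since s is just a transposed matrix of the input vectors, transposing it back first makes the reshape a single pivot_longer call:

library(tidyverse)
s2 <- as.data.frame(t(rbind(partner, year, type, prod1, prod2)))
s2 %>%
  pivot_longer(starts_with("prod"), names_to = "product", values_to = "value")
# same 8 rows as above; note that value is character and may need as.numeric()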

mutate with gsub to clean out commas from a number - table scraped with rvest

I'm practising scraping and data cleaning, and have a table which I've scraped from Wikipedia. I'm trying to mutate the table to create a column which cleans out the commas from an existing column and returns the number. All I'm getting is a column of NAs.
This is my output:
> library(dplyr)
> library(rvest)
>
> pg <- read_html("https://en.wikipedia.org/wiki/Rugby_World_Cup")
> rugby <- pg %>% html_table(., fill = T)
>
> rugby_table <- rugby[[3]]
>
> rugby_table
# A tibble: 9 x 8
Year `Host(s)` `Total attend­ance` Matches `Avg attend­ance` `% change in avg att.` `Stadium capacity` `Attend­ance as % o~
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60%
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79%
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77%
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83%
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83%
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92%
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85%
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95%
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90%
>
> rugby_table2 <- rugby %>%
+ .[[3]] %>%
+ tbl_df %>%
+ mutate(Attendance=as.numeric(gsub("[^0-9.-]+","",'Total attendance')))
>
> rugby_table2
# A tibble: 9 x 9
Year `Host(s)` `Total attend­ance` Matches `Avg attend­ance` `% change in avg~ `Stadium capaci~ `Attend­ance as~ Attendance
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1987 Australia New Zealand 604,500 32 20,156 — 1,006,350 60% NA
2 1991 England France Ireland Scotland Wales 1,007,760 32 31,493 +56% 1,212,800 79% NA
3 1995 South Africa 1,100,000 32 34,375 +9% 1,423,850 77% NA
4 1999 Wales 1,750,000 41 42,683 +24% 2,104,500 83% NA
5 2003 Australia 1,837,547 48 38,282 –10% 2,208,529 83% NA
6 2007 France 2,263,223 48 47,150 +23% 2,470,660 92% NA
7 2011 New Zealand 1,477,294 48 30,777 –35% 1,732,000 85% NA
8 2015 England 2,477,805 48 51,621 +68% 2,600,741 95% NA
9 2019 Japan 1,698,528 45† 37,745 –27% 1,811,866 90% NA
Any ideas?
The difficulty here is that gsub is interpreting 'Total attendance' as a character string, not a column name. My natural reaction was to use backticks instead of single quotes, but then I get a message that this object could not be found. I'm not sure what the problem is here, but you can resolve it using across:
rugby_table2 <- rugby_table %>%
  mutate(Attendance = across(contains("Total"),
                             function(x) as.numeric(gsub(",", "", x))),
         Attendance = Attendance[[1]])
rugby_table2$Attendance
#> [1] 604500 1007760 1100000 1750000 1837547 2263223 1477294 2477805 1698528
EDIT
Ronak Shah has identified the problem, which is that there is an invisible character in the name brought across from the web page, which means the column isn't recognised. So an alternative solution would be:
names(rugby_table)[3] <- "Total attendance"
rugby_table2 <- rugby_table %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))
rugby_table2$Attendance
#> [1] 604500 1007760 1100000 1750000 1837547 2263223 1477294 2477805 1698528
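More generally (my own suggestion, not from the answers), the hidden character, most likely a soft hyphen from the Wikipedia markup, can be stripped from all column names up front so that ordinary backtick references work:

# remove the invisible soft hyphen (U+00AD) from every column name
names(rugby_table) <- gsub("\u00ad", "", names(rugby_table), fixed = TRUE)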
The gsub function performs replacements on all matches of the provided pattern. If you are going to remove all the commas with gsub, the correct syntax would be:
rugby_table2 <- rugby %>%
  .[[3]] %>%
  tbl_df %>%
  mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`)))
Edit:
rugby_table <- structure(list(Year = c(1987L, 1991L, 1995L, 1999L, 2003L, 2007L,
2011L, 2015L, 2019L), `Host(s)` = c("AustraliaNewZealand", "EnglandFranceIrelandScotlandWales",
"SouthAfrica", "Wales", "Australia", "France", "NewZealand",
"England", "Japan"), `Total attendance` = c("604,500", "1,007,760",
"1,100,000", "1,750,000", "1,837,547", "2,263,223", "1,477,294",
"2,477,805", "1,698,528"), Matches = c("32", "32", "32", "41",
"48", "48", "48", "48", "45+"), `Avg attendance` = c("20,156",
"31,493", "34,375", "42,683", "38,282", "47,150", "30,777", "51,621",
"37,745"), `% change in avg att` = c("—", "56%", "9%", "24%",
"–10%", "23%", "–35%", "68%", "–27%"), `Stadium capacity` = c("1,006,350",
"1,212,800", "1,423,850", "2,104,500", "2,208,529", "2,470,660",
"1,732,000", "2,600,741", "1,811,866"), `Attendance as % o~` = c("60%",
"79%", "77%", "83%", "83%", "92%", "85%", "95%", "90%")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
rugby_table %>%
mutate(Attendance = as.numeric(gsub(",", "", `Total attendance`))) %>%
select(Attendance)
#> # A tibble: 9 x 1
#> Attendance
#> <dbl>
#> 1 604500
#> 2 1007760
#> 3 1100000
#> 4 1750000
#> 5 1837547
#> 6 2263223
#> 7 1477294
#> 8 2477805
#> 9 1698528
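As an aside (my own addition), readr::parse_number() does this kind of cleaning directly, dropping the grouping commas without an explicit regex:

library(readr)
library(dplyr)
rugby_table %>%
  mutate(Attendance = parse_number(`Total attendance`)) %>%
  select(Attendance)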

How do I collapse rows on key id but keep unique measurements in a new variable?

I have a dataset that looks like this:
Federal.Area State Total_Miles
1 Allentown, PA--NJ NJ 1094508
2 Allentown, PA--NJ PA 9957805
3 Augusta-Richmond County, GA--SC GA 6221747
4 Augusta-Richmond County, GA--SC SC 2101823
5 Beloit, WI--IL IL 324238
6 Beloit, WI--IL WI 542491
I'd like to collapse the rows by Federal.Area but create and keep new variables which contain the unique State and unique Total_Miles such that it looks like this:
Federal.Area State Total_Miles State1 State2 Total_Miles_state1 Total_Miles_state2
<fct> <fct> <dbl> <fct> <fct> <dbl> <dbl>
1 Allentown, PA--NJ NJ 1094508 NJ PA 1094508 9957805
2 Augusta-Richmond Cou… GA 6221747 GA SC 6221747 2101823
3 Beloit, WI--IL IL 324238 IL WI 324238 542491
I don't know how to collapse the variables State and Total_Miles into the same row, but as new variables keyed on Federal.Area.
Perhaps you could use pivot_wider from tidyverse to put your data into a wide format.
First, number the rows within each Federal.Area as 1 and 2, then call pivot_wider, which will append the row number to the new column names.
library(tidyverse)
df %>%
  group_by(Federal.Area) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = Federal.Area,
              values_from = c(State, Total_Miles),
              names_from = rn)
Output
# A tibble: 3 x 5
# Groups: Federal.Area [3]
Federal.Area State_1 State_2 Total_Miles_1 Total_Miles_2
<chr> <chr> <chr> <int> <int>
1 Allentown,PA--NJ NJ PA 1094508 9957805
2 Augusta-RichmondCounty,GA--SC GA SC 6221747 2101823
3 Beloit,WI--IL IL WI 324238 542491
Data
df <- structure(list(Federal.Area = c("Allentown,PA--NJ", "Allentown,PA--NJ",
"Augusta-RichmondCounty,GA--SC", "Augusta-RichmondCounty,GA--SC",
"Beloit,WI--IL", "Beloit,WI--IL"), State = c("NJ", "PA", "GA",
"SC", "IL", "WI"), Total_Miles = c(1094508L, 9957805L, 6221747L,
2101823L, 324238L, 542491L)), class = "data.frame", row.names = c(NA,
-6L))
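If the original State and Total_Miles columns should also be kept, as in the desired output shown in the question, one option (my own sketch, not part of the answer) is to join the wide result back onto the first row of each Federal.Area; the column names come out as State_1 rather than State1, but are otherwise equivalent:

library(tidyverse)
wide <- df %>%
  group_by(Federal.Area) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = Federal.Area,
              values_from = c(State, Total_Miles), names_from = rn)

df %>%
  group_by(Federal.Area) %>%
  slice(1) %>%                      # first State / Total_Miles per area
  left_join(wide, by = "Federal.Area")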

finding shared column information - a least common ancestor question

I have a data.frame object consisting of columns of information that is tree-like. For instance, I have performed a search of a set of features (query_name) and returned a set of potential matches (match_name). Every match has an associated location that is split into continent, country, region, and town.
The problem I'd like to resolve is finding, for a given query_name, the location information that all potential matches have in common.
For example, with this bit of example data:
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", seq(1:9))
continent <- c(rep("NorthAmerica", 3), rep("NorthAmerica", 2), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA, "Munich", "Nuremberg", "Berlin", "Frankfurt")
data <- data.frame(query_name, match_name, continent, country, region, town)
We'd generate this data.frame object:
query_name match_name continent country region town
1 feature1 match1 NorthAmerica UnitedStates NewYork Manhattan
2 feature1 match2 NorthAmerica UnitedStates NewYork Albany
3 feature1 match3 NorthAmerica UnitedStates NewYork Buffalo
4 feature2 match4 NorthAmerica Canada Ontario Toronto
5 feature2 match5 NorthAmerica Canada <NA> <NA>
6 feature3 match6 Europe Germany Bayern Munich
7 feature3 match7 Europe Germany Bayern Nuremberg
8 feature3 match8 Europe Germany Berlin Berlin
9 feature3 match9 Europe Germany Berlin Frankfurt
I'm hoping to get advice on how to construct a function that will produce the result below. Note that shared location information is now concatenated and separated with a ; delimiter.
Feature1 differs only at the town information, thus the returned string contains the continent through region information.
Feature2 doesn't differ at region or town in the two matches here because one of the two matches contains no information. Nevertheless, lack of information is considered distinct from values with information, so the only thing shared in common for feature2 matches are continent and country.
Feature3 contains shared continent and country information, but distinct region and town, so just continent and country are retained.
Hoping for an output file that looks like this:
query_name location_output
feature1 NorthAmerica;UnitedStates;NewYork;
feature2 NorthAmerica;Canada;;
feature3 Europe;Germany;;
Thanks for any advice you can spare.
Cheers!
Here is an option
library(tidyverse)
data %>%
  gather(key, val, -query_name, -match_name) %>%
  select(-match_name, -key) %>%
  group_by(query_name, val) %>%
  add_count() %>%
  group_by(query_name) %>%
  filter(n == max(n)) %>%
  summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))
## A tibble: 3 x 2
# query_name location_output
# <fct> <chr>
#1 feature1 NorthAmerica;UnitedStates;NewYork
#2 feature2 NorthAmerica;Canada
#3 feature3 Europe;Germany
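Since gather() is retired in current tidyr, here is roughly the same idea with pivot_longer() (my own translation of the answer above; values_transform guards against factor columns):

library(dplyr)
library(tidyr)
data %>%
  pivot_longer(-c(query_name, match_name), names_to = "key", values_to = "val",
               values_transform = list(val = as.character)) %>%
  select(-match_name, -key) %>%
  group_by(query_name, val) %>%
  add_count() %>%
  group_by(query_name) %>%
  filter(n == max(n)) %>%
  summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))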
This is less elegant than @MauritsEvers' solution (it doesn't automatically take care of an arbitrary number of levels), but it ensures that every location_output always contains all four fields separated by ; delimiters.
library(dplyr)
data %>%
  group_by(query_name) %>%
  summarize(continent = ifelse(n_distinct(continent) == 1, first(continent), ""),
            country = ifelse(n_distinct(country) == 1, first(country), ""),
            region = ifelse(n_distinct(region) == 1, first(region), ""),
            town = ifelse(n_distinct(town) == 1, first(town), "")) %>%
  mutate(location_output = paste(continent, country, region, town, sep = ";")) %>%
  select(query_name, location_output)
lapply(split(data, data$query_name), function(x) {
  x <- x[, -(1:2)]
  r <- rle(sapply(x, function(d) length(unique(d))))
  x[1, seq(r$lengths[1])]
})
#$feature1
# continent country region
#1 NorthAmerica UnitedStates NewYork
#$feature2
# continent country
#4 NorthAmerica Canada
#$feature3
# continent country
#6 Europe Germany
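To turn that list into the ;-delimited strings the question asks for, always padded to four fields, a small post-processing sketch (my own addition):

res <- lapply(split(data, data$query_name), function(x) {
  x <- x[, -(1:2)]
  r <- rle(sapply(x, function(d) length(unique(d))))
  x[1, seq(r$lengths[1])]
})
sapply(res, function(d) {
  vals <- vapply(d, as.character, character(1))   # works for factor or character columns
  paste(c(vals, rep("", 4 - length(vals))), collapse = ";")
})
# feature1: "NorthAmerica;UnitedStates;NewYork;"
# feature2: "NorthAmerica;Canada;;"
# feature3: "Europe;Germany;;"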
