Gather or transpose data with multiple rows as 'key' argument - r

I want to use tidyr::gather() to gather not only on the column names but also on rows 1 and 2. What I want to achieve is a data frame with 5 columns and 4 rows.
This is a little piece of the dataset I'm working with:
library(tidyverse)
# A tibble: 4 x 3
Aanduiding `Coolsingel 40 links` `Goudseweg 15 links`
<chr> <chr> <chr>
1 Gebiedsnummer 1 2
2 Postcode 3011 AD 3031 XH
3 Leefbaar Rotterdam 124 110
4 Partij van de Arbeid (P.v.d.A.) 58 65
and its reproducible dput(df) to work with:
df <- structure(list(Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
"Partij van de Arbeid (P.v.d.A.)"), `Coolsingel 40 links` = c("1",
"3011 AD", "124", "58"), `Goudseweg 15 links` = c("2", "3031 XH",
"110", "65")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"), .Names = c("Aanduiding", "Coolsingel 40 links",
"Goudseweg 15 links"))
So the wanted output looks like this:
Aanduiding Gebiedsnummer Postcode adres value
<chr> <dbl> <chr> <chr> <dbl>
1 Leefbaar Rotterdam 1.00 3011 AD Coolsingel 40 links 124
2 Leefbaar Rotterdam 2.00 3031 XH Goudseweg 15 links 110
3 Partij van de Arbeid (P.v.d.A.) 1.00 3011 AD Coolsingel 40 links 58.0
4 Partij van de Arbeid (P.v.d.A.) 2.00 3031 XH Goudseweg 15 links 65.0
I use the gather() function from the tidyr package a lot, but always in cases where I only want to gather the column names with a certain value. Now I want to gather the column names and also the observations in rows 1 and 2.
Can I gather on multiple keys? Or should I paste the values of rows 1 and 2 onto the column names, then gather() and then separate()?
What's the best tactic here, if possible in a tidyr way?
Much appreciated.

There are two things that need to be done here, and you'll have to figure out how to break down your dataset accordingly.
data.frame(t(df[1:2,]))
will give you:
X1 X2
Aanduiding Gebiedsnummer Postcode
Coolsingel 40 links 1 3011 AD
Goudseweg 15 links 2 3031 XH
And
tidyr::gather(df[3:4,],key="adres",value="value", `Coolsingel 40 links`, `Goudseweg 15 links`)
will give you:
Aanduiding adres value
<chr> <chr> <chr>
1 Leefbaar Rotterdam Coolsingel 40 links 124
2 Partij van de Arbeid (P.v.d.A.) Coolsingel 40 links 58
3 Leefbaar Rotterdam Goudseweg 15 links 110
4 Partij van de Arbeid (P.v.d.A.) Goudseweg 15 links 65
How you go on from there is another problem, possibly a left_join based on adres, but that really depends on how the rest of the data is structured.
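One way the two pieces above could be joined is sketched below. It assumes that rows 1 and 2 always hold Gebiedsnummer and Postcode and that each address column has exactly one of each; the names meta, votes and result are only illustrative and do not come from the original answer.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
                 "Partij van de Arbeid (P.v.d.A.)"),
  `Coolsingel 40 links` = c("1", "3011 AD", "124", "58"),
  `Goudseweg 15 links` = c("2", "3031 XH", "110", "65")
)

# Per-address metadata taken from rows 1 and 2 (assumed positions).
meta <- tibble(
  adres = names(df)[-1],
  Gebiedsnummer = as.numeric(unlist(df[1, -1], use.names = FALSE)),
  Postcode = unlist(df[2, -1], use.names = FALSE)
)

# Vote counts from the remaining rows, in long form.
votes <- gather(df[3:4, ], key = "adres", value = "value", -Aanduiding)

# Attach the metadata to every vote row via the address.
result <- votes %>%
  left_join(meta, by = "adres") %>%
  mutate(value = as.numeric(value)) %>%
  select(Aanduiding, Gebiedsnummer, Postcode, adres, value)
```

If the real dataset has more metadata rows, the row indices 1:2 and 3:4 would need to change accordingly.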

You can do this with a combination of gather and spread a few times. I do this often when I need to move a value out to serve as a denominator for a calculation.
library(tidyverse)
...
The goal is to move Gebiedsnummer and Postcode out of Aanduiding, and to gather the other two columns into one column of values. The first gather gets you this:
df %>%
gather(key = address, value = value, -Aanduiding)
#> # A tibble: 8 x 3
#> Aanduiding address value
#> <chr> <chr> <chr>
#> 1 Gebiedsnummer Coolsingel 40 links 1
#> 2 Postcode Coolsingel 40 links 3011 AD
#> 3 Leefbaar Rotterdam Coolsingel 40 links 124
#> 4 Partij van de Arbeid (P.v.d.A.) Coolsingel 40 links 58
#> 5 Gebiedsnummer Goudseweg 15 links 2
#> 6 Postcode Goudseweg 15 links 3031 XH
#> 7 Leefbaar Rotterdam Goudseweg 15 links 110
#> 8 Partij van de Arbeid (P.v.d.A.) Goudseweg 15 links 65
Using a spread after that gets:
df %>%
gather(key = address, value = value, -Aanduiding) %>%
spread(key = Aanduiding, value = value)
#> # A tibble: 2 x 5
#> address Gebiedsnummer `Leefbaar Rotter… `Partij van de Arbe… Postcode
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Coolsinge… 1 124 58 3011 AD
#> 2 Goudseweg… 2 110 65 3031 XH
Then you want to gather again, but to keep address, Gebiedsnummer, and Postcode as their own columns. The select is just there to get the columns in order. So all together:
df %>%
gather(key = address, value = value, -Aanduiding) %>%
spread(key = Aanduiding, value = value) %>%
gather(key = Aanduiding, value = value, -Gebiedsnummer, -address, -Postcode) %>%
select(Aanduiding, Gebiedsnummer, Postcode, address, value) %>%
mutate_at(vars(Gebiedsnummer, value), as.numeric)
#> # A tibble: 4 x 5
#> Aanduiding Gebiedsnummer Postcode address value
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Leefbaar Rotterdam 1 3011 AD Coolsingel 40 l… 124
#> 2 Leefbaar Rotterdam 2 3031 XH Goudseweg 15 li… 110
#> 3 Partij van de Arbeid (P.v… 1 3011 AD Coolsingel 40 l… 58
#> 4 Partij van de Arbeid (P.v… 2 3031 XH Goudseweg 15 li… 65
Created on 2018-08-24 by the reprex package (v0.2.0).
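In current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same three-step reshape with the newer functions, not part of the original answer; the row order of the result may differ from the gather() version, so no output is shown.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
                 "Partij van de Arbeid (P.v.d.A.)"),
  `Coolsingel 40 links` = c("1", "3011 AD", "124", "58"),
  `Goudseweg 15 links` = c("2", "3031 XH", "110", "65")
)

result <- df %>%
  # Step 1: one row per (Aanduiding, address) pair.
  pivot_longer(-Aanduiding, names_to = "address", values_to = "value") %>%
  # Step 2: one row per address, one column per Aanduiding entry.
  pivot_wider(names_from = Aanduiding, values_from = value) %>%
  # Step 3: gather the party columns again, keeping the metadata columns.
  pivot_longer(-c(address, Gebiedsnummer, Postcode),
               names_to = "Aanduiding", values_to = "value") %>%
  mutate(across(c(Gebiedsnummer, value), as.numeric)) %>%
  select(Aanduiding, Gebiedsnummer, Postcode, address, value)
```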

Related

dplyr arrange is not working while order is fine

I am trying to obtain the 10 largest investors in a country, but I get confusing results using arrange in dplyr versus order in base R.
head(fdi_partner)
gives the following result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what I'm doing wrong?
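arrange() (like most dplyr verbs) expects unquoted column names. A quoted string such as "Number of projects" is treated as a constant, so every row gets the same sort key and the order does not change. If your variable name has a space in it, you must wrap it in backticks: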
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
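Applied to the question's data, a sketch of a "top investors" query might look like the following. It assumes the columns have already been converted to numeric and that the TOTAL row should be excluded from the ranking; the six rows are copied from the head() output in the question, and the name top_investors is illustrative.

```r
library(dplyr)

fdi_numeric <- tibble(
  `Main counterparts` = c("TOTAL", "Singapore", "Korea Rep.of",
                          "Japan", "Netherlands", "China, PR"),
  `Number of projects` = c(1818, 231, 377, 204, 24, 216),
  `Registered capital` = c(38854.3, 11358.66, 7679.9, 4325.79, 4209.64, 3001.79)
)

top_investors <- fdi_numeric %>%
  filter(`Main counterparts` != "TOTAL") %>%   # drop the aggregate row (assumption)
  arrange(desc(`Number of projects`)) %>%      # backticks, and desc() for largest first
  head(10)
```

Note that arrange() sorts ascending by default, so desc() is needed to match order(..., decreasing = TRUE).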

rvest html_table() use second row as header

I am trying to scrape data from a table on fbref. However, the tables contain two header rows, and the subheader ends up as the first row of data. Does anyone know how to skip the first line and use the second line as the table header so that data types can be maintained? Here is my code below.
library(rvest)
library(dplyr)
team_link = "https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats-All-Competitions"
team_page = read_html(team_link)
shooting_table = team_page %>% html_nodes("#all_stats_shooting") %>%
html_table()
shooting_table = shooting_table[[1]]
You can use the janitor package
library(janitor)
shooting_table %>%
row_to_names(1)
Which gives us:
# A tibble: 28 × 23
Player Nation Pos Age `90s` Gls Sh SoT `SoT%` `Sh/90` `SoT/90` `G/Sh` `G/SoT` Dist FK PK
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Edouard M… sn SEN GK 29 34.0 0 0 0 "" 0.00 0.00 "" "" "" 0 0
2 Antonio R… de GER DF 28 33.7 3 48 13 "27.1" 1.42 0.39 "0.06" "0.23" "19.… 0 0
3 Thiago Si… br BRA DF 36 29.4 3 18 5 "27.8" 0.61 0.17 "0.17" "0.60" "10.… 0 0
4 Mason Mou… eng E… MF,FW 22 26.3 11 75 27 "36.0" 2.86 1.03 "0.13" "0.37" "17.… 6 1
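In the output above every column is still character, while the question asks for data types to be maintained. A base R sketch of promoting the first row to the header and then re-guessing the column types is shown below. It uses a small made-up table (raw, with hypothetical player names) in place of the scraped one, and treats empty strings as missing, which is an assumption about what the blanks mean.

```r
# Stand-in for the scraped table: the real header sits in the first data row.
raw <- data.frame(
  X1 = c("Player", "Player A", "Player B"),
  X2 = c("Gls", "3", "11"),
  X3 = c("SoT%", "27.1", ""),
  stringsAsFactors = FALSE
)

# Use the first row as column names, then drop it.
names(raw) <- as.character(unlist(raw[1, ]))
clean <- raw[-1, , drop = FALSE]
rownames(clean) <- NULL

# Re-guess each column's type; "" is treated as NA (assumption).
clean[] <- lapply(clean, type.convert, as.is = TRUE, na.strings = c("NA", ""))
```

The same type.convert() step could be applied to the result of row_to_names(1).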

Add multiple columns with the same group and sum

I've got this dataframe and I want to add the last two columns to another dataframe by summing them and grouping them by "Full.Name"
# A tibble: 6 x 5
# Groups: authority_dic, Full.Name [6]
authority_dic Full.Name Entity `2019` `2020`
<chr> <chr> <chr> <int> <int>
1 accomplished Derek J. Leathers WERNER ENTERPRISES INC 1 0
2 accomplished Dirk Van de Put MONDELEZ INTERNATIONAL INC 0 1
3 accomplished Eileen P. Drake AEROJET ROCKETDYNE HOLDINGS 1 0
4 accomplished G. Michael Sievert T-MOBILE US INC 0 3
5 accomplished Gary C. Kelly SOUTHWEST AIRLINES 0 1
6 accomplished James C. Fish, Jr. WASTE MANAGEMENT INC 1 0
This is the dataframe I want to add the two columns to. As you can see, the "Full.Name" column acts as the grouping column.
# A tibble: 6 x 3
# Groups: Full.Name [6]
Full.Name `2019` `2020`
<chr> <int> <int>
1 A. Patrick Beharelle 5541 3269
2 Aaron P. Graft 165 200
3 Aaron P. Jagdfeld 4 5
4 Adam H. Schechter 147 421
5 Adam P. Symson 1031 752
6 Adena T. Friedman 1400 1655
I can add one column using the following piece of code, but if I want to do it with the second one, it overwrites my existing one and I am only left with one instead of two columns added.
narc_auth_total <- narc_auth %>% group_by(Full.Name) %>% summarise(`2019_words` = sum(`2019`)) %>% left_join(totaltweetsyear, ., by = "Full.Name")
The output for this command looks like this:
# A tibble: 6 x 4
# Groups: Full.Name [6]
Full.Name `2019` `2020` `2019_words`
<chr> <int> <int> <int>
1 A. Patrick Beharelle 5541 3269 88
2 Aaron P. Graft 165 200 2
3 Aaron P. Jagdfeld 4 5 0
4 Adam H. Schechter 147 421 2
5 Adam P. Symson 1031 752 15
6 Adena T. Friedman 1400 1655 21
I want to do the same thing and add the 2020_words column to the same dataframe. I just cannot do it, but it cannot be that hard to do so. It should be summarized as well, just like the 2019_words column. When I add "2020" to my command, it says object "2020" not found.
Thanks in advance.
If I have understood you correctly, this will solve your problem:
narc_auth_total <-
narc_auth %>%
group_by(Full.Name) %>%
summarise(
`2019_words` = sum(`2019`),
`2020_words` = sum(`2020`)
) %>%
left_join(totaltweetsyear, ., by = "Full.Name")
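If more year columns are added later, the same pattern could be written with across() (available in dplyr 1.0.0 and later) so that each column does not have to be listed by hand. This is a sketch on made-up data; the names "Person A", "Person B" and word_totals are hypothetical.

```r
library(dplyr)

# Hypothetical stand-ins for the two data frames in the question.
narc_auth <- tibble(
  Full.Name = c("Person A", "Person A", "Person B"),
  `2019` = c(1L, 2L, 0L),
  `2020` = c(0L, 3L, 1L)
)
totaltweetsyear <- tibble(
  Full.Name = c("Person A", "Person B"),
  `2019` = c(10L, 20L),
  `2020` = c(30L, 40L)
)

# Sum every year column per person and suffix the new names with "_words".
word_totals <- narc_auth %>%
  group_by(Full.Name) %>%
  summarise(across(c(`2019`, `2020`), sum, .names = "{.col}_words"),
            .groups = "drop")

narc_auth_total <- left_join(totaltweetsyear, word_totals, by = "Full.Name")
```

The backticks matter here as well: a bare 2020 is parsed as a number, not as a column name.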

Match and replace row values across two dataframes

I am trying to replace entrezgene_accession names with entrezgene_id, but I am not able to figure it out.
The idea is to replace the gene names such as cd37, and catb in df1, with their entrezgene_id that is in df2.
I have been trying to combine datasets using dplyr, but that has not worked.
# df1: 2,002 × 1
id
<chr>
1 106590043
2 cd37
3 106577144
4 106561987
5 106569503
6 106571198
7 106573872
8 106601676
9 106612275
10 catb
# … with 1,992 more rows
# df2: 426 × 2
entrezgene_accession entrezgene_id
<chr> <chr>
37 catb 100195493
38 catk 100195370
39 catl1 100286607
40 cats 100196462
41 cav2 106573118
42 cav2 100196537
43 cb055 100306867
44 cbx6 106591466
45 ccdc178 106569132
46 ccdc84 106603745
47 ccm2 106571003
48 ccnb1 106563318
49 ccnd1 100306852
50 ccr3 100380477
51 ccr6 100194943
52 cd164 106607963
53 cd37 100195746
# … with 416 more rows
left_join from dplyr can help do the trick
library(dplyr)
df1<-tibble::tribble(
~id,
"106590043",
"cd37",
"catb"
)
df2<-tibble::tribble(
~entrezgene_accession, ~entrezgene_id,
"catb", "100195493",
"catk", "100195370",
"catl1", "100286607",
"cd37", "100195746"
)
df_combined<-df1 %>%
left_join(df2, by=c("id"="entrezgene_accession")) %>%
mutate(complete_id=if_else(is.na(entrezgene_id),id,entrezgene_id))
df_combined
#> # A tibble: 3 × 3
#> id entrezgene_id complete_id
#> <chr> <chr> <chr>
#> 1 106590043 <NA> 106590043
#> 2 cd37 100195746 100195746
#> 3 catb 100195493 100195493
Created on 2022-01-09 by the reprex package (v2.0.1)
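One thing to watch for: in the question's df2 the accession cav2 appears twice with different ids, and a left_join would return two rows for any df1 entry that matches it. A base R sketch that replaces the names in place and always takes the first match is shown below; whether the first match is the right one to keep is an assumption that depends on the data.

```r
df1 <- data.frame(id = c("106590043", "cd37", "catb"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  entrezgene_accession = c("catb", "cav2", "cav2", "cd37"),
  entrezgene_id = c("100195493", "106573118", "100196537", "100195746"),
  stringsAsFactors = FALSE
)

# match() returns the position of the first matching accession, or NA.
idx <- match(df1$id, df2$entrezgene_accession)

# Keep the original value where no accession matched.
df1$id <- ifelse(is.na(idx), df1$id, df2$entrezgene_id[idx])
```

Because match() never returns more than one position per element, the number of rows in df1 stays the same.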

Merge rows with same value in a column to single row in a dataframe [duplicate]

Marked as a duplicate of: Transpose / reshape dataframe without "timevar" from long to wide format
I have the dataframe below
post<-c("BAL","DEN","ARI","ATL")
home<-c("DEN","DEN","ARI","ARI")
away<-c("BAL","BAL","ATL","ATL")
ID<-c("2015_01_BAL_DEN","2015_01_BAL_DEN","2016_01_ARI_ATL","2016_01_ARI_ATL")
NUM<-c(58,69,45,67)
PO<-c(55,65,46,65)
P1<-data.frame(post,home,away,ID,NUM,PO)
post home away ID NUM PO
1 BAL DEN BAL 2015_01_BAL_DEN 58 55
2 DEN DEN BAL 2015_01_BAL_DEN 69 65
3 ARI ARI ATL 2016_01_ARI_ATL 45 46
4 ATL ARI ATL 2016_01_ARI_ATL 67 65
What I want is to find rows with the same ID and convert them to one row while adding new columns for post, NUM and PO:
ID post home away NUM PO post2 NUM2 PO2
1 2015_01_BAL_DEN BAL DEN BAL 58 55 DEN 69 65
2 2016_01_ARI_ATL ARI ARI ATL 45 46 ATL 67 65
In base R you could do:
reshape(transform(P1, time = ave(NUM, ID, FUN = seq_along)), idvar = "ID", direction = "wide", sep = "")
ID post1 home1 away1 NUM1 PO1 post2 home2 away2 NUM2 PO2
1 2015_01_BAL_DEN BAL DEN BAL 58 55 DEN DEN BAL 69 65
3 2016_01_ARI_ATL ARI ARI ATL 45 46 ATL ARI ATL 67 65
We could create a sequence column and specify that as the names_from in pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
P1 %>%
mutate(rn = rowid(ID)) %>%
pivot_wider(names_from = rn,
values_from = c(post, home, away, NUM, PO), names_sep="")
# A tibble: 2 x 11
# ID post1 post2 home1 home2 away1 away2 NUM1 NUM2 PO1 PO2
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 2015_01_BAL_DEN BAL DEN DEN DEN BAL BAL 58 69 55 65
#2 2016_01_ARI_ATL ARI ATL ARI ARI ATL ATL 45 67 46 65
Or using dcast
library(data.table)
dcast(setDT(P1), ID ~ rowid(ID),
value.var = c('post', 'home', 'away', 'NUM', 'PO'), sep="")
