Selecting a column with a dot in R (nested object) - r

I'm new to R and I'm not sure how to rephrase the question, but basically, I have this dataset coming from the following code:
data_url <- 'https://prod-scores-api.ausopen.com/year/2021/stats'
dat <- jsonlite::fromJSON(data_url)
men_aces <- bind_rows(dat$statistics$rankings[[1]]$players[1])
men_aces_table <- dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>% select(full_name, nationality)
Which resulted in this data frame:
full_name nationality.uuid nationality.name nationality.code
1 Novak Djokovic 99da9b29-eade-4ac3-a7b0-b0b8c2192df7 Serbia SRB
2 Alexander Zverev 99d83e85-3173-4ccc-9d91-8368720f4a47 Germany GER
3 Milos Raonic 07779acb-6740-4b26-a664-f01c0b54b390 Canada CAN
4 Daniil Medvedev fa925d2d-337f-4074-a0bd-afddb38d66e1 Russia RUS
5 Nick Kyrgios 9b11f78c-47c1-43c4-97d0-ba3381eb9f07 Australia AUS
nationality is a nested object inside the player object (see the JSON at the URL above) and contains the properties uuid, name and code. If I select full_name instead, I get the value (of type character) right back.
I'm not sure how to select just name from that nested data frame (nationality) and rename it to country.
My expected outcome is:
full_name country
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
I would appreciate some help. Sorry I was unclear.

Use purrr::pmap_chr
library(tidyverse)
dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>%
select(full_name, nationality) %>%
mutate(nationality = pmap_chr(nationality, ~ ..2))
full_name nationality
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
11 Aslan Karatsev Russia
12 Taylor Fritz United States of America
13 Matteo Berrettini Italy
14 Grigor Dimitrov Bulgaria
15 Feliciano Lopez Spain
16 Stefanos Tsitsipas Greece
17 Felix Auger-Aliassime Canada
18 Thanasi Kokkinakis Australia
19 Ugo Humbert France
20 Borna Coric Croatia

You could do:
bind_cols(full_name = dat$players$full_name, country = dat$players$nationality$name)
# A tibble: 169 x 2
full_name country
<chr> <chr>
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa

Just add this line at the end:
newdf <- data.frame(full_name = men_aces_table$full_name, country = men_aces_table$nationality$name)
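To see why `$nationality$name` works: jsonlite turns a nested JSON object into a data frame column that is itself a data frame. The sketch below builds such a structure by hand and pulls out the inner column with base R. The `players` object is a hypothetical stand-in for `men_aces_table`, not the real API response.

```r
# Hand-built stand-in for the joined table: `nationality` is a data frame
# stored as a single column, as jsonlite does for nested JSON objects.
players <- data.frame(
  full_name = c("Novak Djokovic", "Alexander Zverev", "Milos Raonic"),
  stringsAsFactors = FALSE
)
players$nationality <- data.frame(
  name = c("Serbia", "Germany", "Canada"),
  code = c("SRB", "GER", "CAN"),
  stringsAsFactors = FALSE
)

# Reach into the nested data frame with a second `$` and rename on the way out.
result <- data.frame(
  full_name = players$full_name,
  country   = players$nationality$name,
  stringsAsFactors = FALSE
)
print(result)
```

Another option is jsonlite::flatten(), which turns the nested columns into ordinary ones with names like nationality.name; those can then be selected and renamed as usual.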

Related

How to use If function in R to create a column using multiple conditions

I am not familiar with R and need your help with this issue.
I have a data frame named df with 25 variables (25 columns). A simplified version:
name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack
I would like to add a column named "country" to this data frame. The new column depends on the club:
Juve Italy
FCB Spain
Real Spain
Chelsea England
Liverpool England
Bayern Germany
So, for example, if the club is FCB or Real, the value in country is Spain.
The output of df$Country should be as follows:
Country
Spain
Spain
Spain
Spain
England
England
Italy
Germany
Spain
England
England
Italy
Germany
The code I started with is the following:
df$country=ifelse(df$Club=="FCB","spain", df$Club=="Real","Spain" ......)
But it doesn't work.
My real data set has more than 250 different values in the "Club" column and more than 30 in "Country", so doing that manually would take too long.
Could you help me with this, please?
Do you know how to use if-else statements inside for loops? This would be the simplest way out.
Something like this:
df <- data.frame(name = c("a", "b", "c"),
Club = c("FCB", "Real", "Liverpool"),
stringsAsFactors = FALSE)
df$country <- NA_character_  # create the column before filling it row by row
for(i in seq_len(nrow(df))){
if(df$Club[i] %in% c("FCB", "Real")){
df$country[i] <- "Spain"
} else if(df$Club[i] == "Liverpool"){
df$country[i] <- "England"
} else{
df$country[i] <- NA
}
}
df
# name Club country
# 1 a FCB Spain
# 2 b Real Spain
# 3 c Liverpool England
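With 250+ clubs, a lookup table avoids writing one branch per club. Here is a minimal sketch using a named character vector built from the club/country pairs listed in the question. The "Unknown FC" row is a made-up example added to show what happens to clubs missing from the lookup.

```r
# Lookup: names are clubs, values are countries (pairs taken from the question).
club_country <- c(
  Juve = "Italy", FCB = "Spain", Real = "Spain",
  Chelsea = "England", Liverpool = "England", Bayern = "Germany"
)

df <- data.frame(
  name = c("luc", "jean", "messi", "inzaghi", "kwfni", "someone"),
  Club = c("FCB", "Real", "Liverpool", "Juve", "Bayern", "Unknown FC"),
  stringsAsFactors = FALSE
)

# Index the lookup by club name; clubs not in the lookup become NA.
df$country <- unname(club_country[df$Club])
print(df)
```

If the club/country pairs already live in a second data frame, merge() or dplyr::left_join() on the Club column should give the same result.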

Swap Misplaced cells in R?

I have a huge database (more than 65M rows) and I noticed that some cells are misplaced. As an example, let's say I have this:
library("tidyverse")
DATA <- tribble(
~SURNAME,~NAME,~STATE,~COUNTRY,
'Smith','Emma','California','USA',
'Johnson','Oliia','Texas','USA',
'Williams','James','USA','California',
'Jones','Noah','Pennsylvania','USA',
'Williams','Liam','Illinois','USA',
'Brown','Sophia','USA','Louisiana',
'Daves','Evelyn','USA','Oregon',
'Miller','Jacob','New Mexico','USA',
'Williams','Lucas','Connecticut','USA',
'Daves','John','California','USA',
'Jones','Carl','USA','Illinois'
)
=====
> DATA
# A tibble: 11 x 4
SURNAME NAME STATE COUNTRY
<chr> <chr> <chr> <chr>
1 Smith Emma California USA
2 Johnson Oliia Texas USA
3 Williams James USA California
4 Jones Noah Pennsylvania USA
5 Williams Liam Illinois USA
6 Brown Sophia USA Louisiana
7 Daves Evelyn USA Oregon
8 Miller Jacob New Mexico USA
9 Williams Lucas Connecticut USA
10 Daves John California USA
11 Jones Carl USA Illinois
As you can see, COUNTRY and STATE are swapped in some rows. How can I efficiently swap them back?
Kind Regards,
Luiz.
Using data.table and the in-built state.name vector:
library(data.table)
setDT(DATA)  # convert to a data.table by reference
DATA[COUNTRY %in% state.name, `:=`(COUNTRY = STATE, STATE = COUNTRY)]
DATA
# SURNAME NAME STATE COUNTRY
# 1: Smith Emma California USA
# 2: Johnson Oliia Texas USA
# 3: Williams James California USA
# 4: Jones Noah Pennsylvania USA
# 5: Williams Liam Illinois USA
# 6: Brown Sophia Louisiana USA
# 7: Daves Evelyn Oregon USA
# 8: Miller Jacob New Mexico USA
# 9: Williams Lucas Connecticut USA
# 10: Daves John California USA
# 11: Jones Carl Illinois USA
Check this solution (it assumes that the COUNTRY column is in ISO3 format, e.g. MEX, CAN):
DATA %>%
mutate(
COUNTRY_TMP = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), COUNTRY, STATE),
STATE = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), STATE, COUNTRY),
COUNTRY = COUNTRY_TMP
) %>%
select(-COUNTRY_TMP)
Assuming all country names follow the ISO3 format, we can first install the countrycode package. It contains a data frame called codelist with a column iso3c that holds the ISO3 country codes. We can use that to swap the columns as follows:
library(tidyverse)
library(countrycode)
DATA2 <- DATA %>%
mutate(STATE2 = ifelse(STATE %in% codelist$iso3c &
!COUNTRY %in% codelist$iso3c, COUNTRY, STATE),
COUNTRY2 = ifelse(!STATE %in% codelist$iso3c &
COUNTRY %in% codelist$iso3c, COUNTRY, STATE)) %>%
select(-STATE, -COUNTRY) %>%
rename(STATE = STATE2, COUNTRY = COUNTRY2)
DATA2
# # A tibble: 11 x 4
# SURNAME NAME STATE COUNTRY
# <chr> <chr> <chr> <chr>
# 1 Smith Emma California USA
# 2 Johnson Oliia Texas USA
# 3 Williams James California USA
# 4 Jones Noah Pennsylvania USA
# 5 Williams Liam Illinois USA
# 6 Brown Sophia Louisiana USA
# 7 Daves Evelyn Oregon USA
# 8 Miller Jacob New Mexico USA
# 9 Williams Lucas Connecticut USA
# 10 Daves John California USA
# 11 Jones Carl Illinois USA
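The same swap can be sketched in base R with a logical index, again relying on the built-in state.name vector. This is only an illustration on a three-row subset of the example data; for 65M rows the data.table version above is likely the better fit.

```r
# Three rows from the example, two of them with STATE and COUNTRY swapped.
DATA <- data.frame(
  SURNAME = c("Smith", "Williams", "Brown"),
  NAME    = c("Emma", "James", "Sophia"),
  STATE   = c("California", "USA", "USA"),
  COUNTRY = c("USA", "California", "Louisiana"),
  stringsAsFactors = FALSE
)

# Rows where COUNTRY actually holds a US state name.
swapped <- DATA$COUNTRY %in% state.name

# Swap the two columns for those rows only, via a temporary copy.
tmp <- DATA$STATE[swapped]
DATA$STATE[swapped] <- DATA$COUNTRY[swapped]
DATA$COUNTRY[swapped] <- tmp
print(DATA)
```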

html_table doubles value of columns

I'm trying to scrape wiki table with this code:
library(tidyverse)
library(rvest)
my_url <- "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions"
mytable <- read_html(my_url) %>% html_nodes("table") %>% .[[4]]
mytable <- mytable %>% html_table()
The problem is that the values in both name columns (Champion and Runner-up) come back doubled. Not exactly doubled: each name appears twice, in two different orders, once with a comma (e.g. "Laver, RodRod Laver"). The original wiki page shows only "name surname". Why does this happen, and how do I get rid of it? I need those columns to contain "name surname" only.
head(mytable)
Year[f] Country Champion Country Runner-up Score in the final[4][14]
1 1969 AUS Laver, RodRod Laver[b] ESP Gimeno, AndrésAndrés Gimeno 6–3, 6–4, 7–5
2 1970 USA Ashe, ArthurArthur Ashe AUS Crealy, DickDick Crealy 6–4, 9–7, 6–2
3 1971 AUS Rosewall, KenKen Rosewall USA Ashe, ArthurArthur Ashe 6–1, 7–5, 6–3
4 1972 AUS Rosewall, KenKen Rosewall AUS Anderson, MalcolmMalcolm Anderson 7–6(7–2), 6–3, 7–5
5 1973 AUS Newcombe, JohnJohn Newcombe NZL Parun, OnnyOnny Parun 6–3, 6–7, 7–5, 6–1
6 1974 USA Connors, JimmyJimmy Connors AUS Dent, PhilPhil Dent 7–6(9–7), 6–4, 4–6, 6–3
htmltab can be used to scrape these Wiki tables.
library(htmltab)
#data cleaning steps
bFun <- function(node) {
x <- XML::xmlValue(node)
gsub("\\s[<†‡].*$", "", iconv(x, from = 'UTF-8', to = "Windows-1252", sub="byte"))
}
df1 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Australian_Open_men%27s_singles_champions",
which = 4,
rm_superscript = F,
bodyFun = bFun) #this function is not required if you are executing the code from Mac
head(df1)
which gives
# Year[f] Country Champion Country Runner-up Score in the final[4][14]
#2 1969 AUS Rod Laver[b] ESP Andrés Gimeno 6–3, 6–4, 7–5
#3 1970 USA Arthur Ashe AUS Dick Crealy 6–4, 9–7, 6–2
#4 1971 AUS Ken Rosewall USA Arthur Ashe 6–1, 7–5, 6–3
#5 1972 AUS Ken Rosewall AUS Malcolm Anderson 7–6(7–2), 6–3, 7–5
#6 1973 AUS John Newcombe NZL Onny Parun 6–3, 6–7, 7–5, 6–1
#7 1974 USA Jimmy Connors AUS Phil Dent 7–6(9–7), 6–4, 4–6, 6–3
and
df2 <- htmltab(doc = "https://en.wikipedia.org/wiki/List_of_Wimbledon_gentlemen%27s_singles_champions",
which = 3,
rm_superscript = F,
bodyFun = bFun) #this function is not required if you are executing the code from Mac
head(df2)
gives
# Year[d] Country Champion Country Runner-up Score in the final[4]
#2 1877 BRI[e] Spencer Gore BRI William Marshall 6–1, 6–2, 6–4
#3 1878 BRI Frank Hadow BRI Spencer Gore 7–5, 6–1, 9–7
#4 1879 BRI John Hartley BRI Vere St. Leger Goold 6–2, 6–4, 6–2
#5 1880 BRI John Hartley BRI Herbert Lawford 6–3, 6–2, 2–6, 6–3
#6 1881 BRI William Renshaw BRI John Hartley 6–0, 6–1, 6–1
#7 1882 BRI William Renshaw BRI Ernest Renshaw 6–1, 2–6, 4–6, 6–2, 6–2
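If the table has already been read with html_table(), the doubled names can also be cleaned afterwards. The doubling presumably comes from a sort key ("Surname, First") that the page keeps hidden but the scraper picks up as text. The sketch below assumes every affected value has the shape "Surname, FirstFirst Surname" seen in head(mytable); `undouble_name` is a hypothetical helper name, not part of rvest.

```r
# Collapse "Surname, FirstFirst Surname" to "First Surname" with backreferences:
# \\1 = surname, \\2 = first name. Anything after the name (e.g. "[b]") is kept.
undouble_name <- function(x) {
  sub("^(.+?), (.+?)\\2 \\1", "\\2 \\1", x, perl = TRUE)
}

champions <- c("Laver, RodRod Laver[b]", "Ashe, ArthurArthur Ashe", "Rosewall, KenKen Rosewall")
cleaned <- undouble_name(champions)
print(cleaned)
```

Values that do not match the pattern are returned unchanged, so the function can be applied to a whole column, e.g. to the Champion and Runner-up columns of mytable.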

Subtract two strings in dplyr row wise for R dataframe

I have two columns and need a third that subtracts one from the other using dplyr.
A very simple example for the sake of clarity. A split/separate approach is not valid in my case.
x <- c("FRANCE","GERMANY","RUSSIA")
y <- c("Paris FRANCE", "Berlin GERMANY", "Moscow RUSSIA")
cities <- data.frame(x,y)
cities
x y
1 FRANCE Paris FRANCE
2 GERMANY Berlin GERMANY
3 RUSSIA Moscow RUSSIA
Expected results:
x y new
1 FRANCE Paris FRANCE Paris
2 GERMANY Berlin GERMANY Berlin
3 RUSSIA Moscow RUSSIA Moscow
What I've tried so far (to no avail):
This returns the same df but keeps the country instead of the city (the opposite of what I want):
cities %>% mutate(new = setdiff(x,y))
x y new
1 FRANCE Paris FRANCE FRANCE
2 GERMANY Berlin GERMANY GERMANY
3 RUSSIA Moscow RUSSIA RUSSIA
Conversely, setdiff in reverse order returns the initial data unchanged:
cities %>% mutate(new = setdiff(y,x))
x y new
1 FRANCE Paris FRANCE Paris FRANCE
2 GERMANY Berlin GERMANY Berlin GERMANY
3 RUSSIA Moscow RUSSIA Moscow RUSSIA
Using gsub worked only for the first row and issued a warning:
cities %>% mutate(new = gsub(x,"",y))
Warning message:
In gsub(x, "", y) :
argument 'pattern' has length > 1 and only the first element will be used
x y new
1 FRANCE Paris FRANCE Paris
2 GERMANY Berlin GERMANY Berlin GERMANY
3 RUSSIA Moscow RUSSIA Moscow RUSSIA
We can use stringr::str_replace:
library(tidyverse)
cities %>%
mutate_if(is.factor, as.character) %>%
mutate(new = trimws(str_replace(y, fixed(x), "")))  # fixed(): treat x as literal text, not a regex
# x y new
#1 FRANCE Paris FRANCE Paris
#2 GERMANY Berlin GERMANY Berlin
#3 RUSSIA Moscow RUSSIA Moscow
Here is a solution with base R:
x <- c("FRANCE","GERMANY","RUSSIA")
y <- c("Paris FRANCE", "Berlin GERMANY", "Moscow RUSSIA")
cities <- data.frame(x,y,stringsAsFactors = F)
cities$new = mapply(function(a,b)
{setdiff(strsplit(a,' ')[[1]],strsplit(b,' ')[[1]])}, cities$y, cities$x)
Output:
x y new
1 FRANCE Paris FRANCE Paris
2 GERMANY Berlin GERMANY Berlin
3 RUSSIA Moscow RUSSIA Moscow
Hope this helps!
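The gsub attempt in the question fails because the pattern argument is not vectorised: only the first element of x is used. A minimal base R sketch that pairs each pattern with its own string via mapply (fixed = TRUE treats the country as literal text rather than a regular expression):

```r
x <- c("FRANCE", "GERMANY", "RUSSIA")
y <- c("Paris FRANCE", "Berlin GERMANY", "Moscow RUSSIA")
cities <- data.frame(x, y, stringsAsFactors = FALSE)

# Remove each row's own x from its y, then trim the leftover space.
cities$new <- trimws(mapply(
  function(pattern, text) sub(pattern, "", text, fixed = TRUE),
  cities$x, cities$y,
  USE.NAMES = FALSE
))
print(cities)
```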

Summarize data using doBy package at region level

I have a dataset Data as below,
Region Country Market Price
EUROPE France France 30.4502
EUROPE Israel Israel 5.14110965
EUROPE France France 8.99665
APAC CHINA CHINA 2.6877232
APAC INDIA INDIA 60.9004
AFME SL SL 54.1729685
LA BRAZIL BRAZIL 56.8606917
EUROPE RUSSIA RUSSIA 11.6843732
APAC BURMA BURMA 63.5881232
AFME SA SA 115.0733685
I would like to summarize the data at the Region level and get the sum of Price for every Region.
I want the output to look like this:
Data Output
Region Country Price
EUROPE France 30.4502
EUROPE Israel 5.14110965
EUROPE France 8.99665
EUROPE RUSSIA 11.6843732
Europe 56.27233285
APAC BURMA 63.5881232
APAC CHINA 2.6877232
APAC INDIA 60.9004
Apac 127.1762464
AFME BAHARAIN 54.1729685
AFME SA 115.0733685
AFME 169.246337
LA BRAZIL 56.8606917
LA 56.8606917
I have used the summaryBy function of the doBy package and tried the code below:
myfun1 <- function(x){c(s=Sum(x)}
DB= summaryBy(Data$Price ~Region + Country , data=Data, FUN=myfun1)
Any help in this regard is very much appreciated.
You can do this by using dplyr to generate a summary table:
library(dplyr)
totals <- Data %>% group_by(Region) %>% summarise(Country="",Price=sum(Price))
And then merging the summary with the rest of the data (dropping the third column, Market):
summary <- rbind(Data[-3], totals)
Then you can sort by Region to put the summary with the region:
summary <- summary %>% arrange(Region)
Output:
Region Country Price
1 AFME SL 54.1730
2 AFME SA 115.0734
3 AFME 169.2463
4 APAC CHINA 2.6877
5 APAC INDIA 60.9004
6 APAC BURMA 63.5881
7 APAC 127.1762
8 EUROPE France 30.4502
9 EUROPE Israel 5.1411
10 EUROPE France 8.9967
11 EUROPE RUSSIA 11.6844
12 EUROPE 56.2723
13 LA BRAZIL 56.8607
14 LA 56.8607
You have to split the data by Region and sum Price within each group:
lapply(split(Data, Data$Region), function(x) sum(x$Price))
Or, if you need to present the result as you have shown:
totals = lapply(split(Data, Data$Region), function(x) rbind(x,data.frame(Region=unique(x$Region), Country="", Market="", Price=sum(x$Price))))
do.call(rbind, totals)
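If only the per-region totals are needed, base R's aggregate() with a formula is a compact alternative. The sketch below rebuilds the question's data by hand (without the Market column) so it can run on its own.

```r
Data <- data.frame(
  Region  = c("EUROPE", "EUROPE", "EUROPE", "APAC", "APAC",
              "AFME", "LA", "EUROPE", "APAC", "AFME"),
  Country = c("France", "Israel", "France", "CHINA", "INDIA",
              "SL", "BRAZIL", "RUSSIA", "BURMA", "SA"),
  Price   = c(30.4502, 5.14110965, 8.99665, 2.6877232, 60.9004,
              54.1729685, 56.8606917, 11.6843732, 63.5881232, 115.0733685),
  stringsAsFactors = FALSE
)

# One row per Region with the summed Price; rows come back sorted by Region.
region_totals <- aggregate(Price ~ Region, data = Data, FUN = sum)
print(region_totals)
```

These totals can then be appended to the detail rows with rbind, as in the answers above.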
