Sort data frame rows in the right columns - r

I am currently cleaning my data, which has a row count of six digits or more. Some values have ended up in the wrong columns, and I cannot come up with a way to move them into the right ones. Can you give me a hint, please?
Thanks in advance.
df <- data.frame(price= c("['380€']", "3hr 15 min", "4hr", "3hr 55min", "2h", "20€"),
airlines = c("['Icelandir']", "€1,142", "16€", "17€", "19€", "Iberia"),
duration = c("['3h']","Turkish airlines", "KLM", "easyJet", "2 hr 1min", "Finnair"),
depart = c("LGW", "AMS", "NUE", "ZRH", "LHR", "VAR"))
My desired output is
price        airline         duration          price_right  airline_right     duration_right  depart
['380€']     ['Icelandair']  ['3h']            ['380€']     ['Icelandair']    ['3h']          LGW
3 hr 15 min  €1,142          Turkish airlines  €1,142       Turkish airlines  3 hr 15 min     AMS
4hr          €16             KLM               €16          KLM               4hr             NUE
3hr 55min    €17             easyJet           €17          easyJet           3hr 55min       ZRH
2h           €19             2hr 1min          €19          Iberia            2h              LHR
2hr min      "Iberia"        Finnair           €20          Finnair           2hr 1min        VAR

For this example we could do something like this:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
arrange(value) %>%
group_by(group =as.integer(gl(n(),3,n()))) %>%
mutate(id = row_number()) %>%
mutate(name = case_when(id == 1 ~ "price",
id == 2 ~ "duration",
id == 3 ~ "airlines",
TRUE ~ NA_character_)) %>%
ungroup() %>%
select(-group, -id) %>%
group_by(name) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-id)
price duration airlines
<chr> <chr> <chr>
1 ['380€'] ['3h'] ['Icelandir']
2 €1,142 3hr 15 min Turkish airlines
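Sorting the values, as above, depends on how the strings happen to collate. A possibly more robust alternative is to classify each cell by its content and move it to the matching column. The sketch below uses base R only and rests on assumptions that are not stated in the question: a price always contains €, a duration is a digit followed by h, and anything else is an airline. The helpers classify() and pick() and the *_right column names are illustrative.

```r
df <- data.frame(
  price    = c("['380€']", "3hr 15 min", "4hr", "3hr 55min", "2h", "20€"),
  airlines = c("['Icelandir']", "€1,142", "16€", "17€", "19€", "Iberia"),
  duration = c("['3h']", "Turkish airlines", "KLM", "easyJet", "2 hr 1min", "Finnair"),
  depart   = c("LGW", "AMS", "NUE", "ZRH", "LHR", "VAR")
)

# Assumed rules: "€" marks a price, a digit followed by "h" marks a duration,
# everything else is treated as an airline name.
classify <- function(x) {
  ifelse(grepl("€", x, fixed = TRUE), "price",
         ifelse(grepl("[0-9] *h", x), "duration", "airline"))
}

cells <- as.matrix(df[c("price", "airlines", "duration")])
kind  <- matrix(classify(cells), nrow = nrow(cells))

# For each row, return the single cell of the requested kind, or NA when the
# row has none or more than one (such rows need manual review).
pick <- function(k) {
  vapply(seq_len(nrow(cells)), function(i) {
    hit <- unname(cells[i, kind[i, ] == k])
    if (length(hit) == 1) hit else NA_character_
  }, character(1))
}

df$price_right    <- pick("price")
df$airline_right  <- pick("airline")
df$duration_right <- pick("duration")
df
```

In the sample data, row 5 holds two duration-like values and row 6 holds two airline-like values, so the ambiguous columns are left as NA there rather than guessed.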

Related

Replace Missing Values with NA in Web Scraping with R

I am trying web scraping with R (rvest) for the first time. I am trying to replace missing values with 'NA' but it doesn't seem to work at all. Can you guys check the code below and please help me?
library(rvest)
library('purrr')
link= "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt"
page=read_html(link)
movies<-data.frame(name = page %>% html_nodes(".lister-item-header a") %>% html_text,
year = page %>% html_nodes(".text-muted.unbold") %>% html_text(),
certificate = page %>% html_nodes(".certificate") %>% html_text(),
runtime = page %>% html_nodes(".runtime") %>% html_text(),
genre = page %>% html_nodes(".genre") %>% html_text(),
imdb_rating = page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text(),
director = page %>% html_nodes(".text-muted+ p a:nth-child(1)") %>% html_text(),
number_of_votes = page %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>% html_text(),
gross = page %>% html_nodes(".ghost~ .text-muted+ span") %>% html_text())
The certificate and gross values are missing for certain movies. I tried the following methods to replace missing values with N/A
certificate = page %>%
html_nodes(".certificate") %>% html_text() %>% gsub('\\s+', ' ', .)
gross = page %>% html_nodes(".ghost~ .text-muted+ span") %>% html_text() %>% replace(!nzchar(.),NA)
certificate = page %>% html_nodes(".certificate") %>%
html_text(trim = TRUE) %>% {if(length(.) == "") NA else .}
None of them works for me. The commands run without error but do not replace the missing values with NA, so I still get fewer entries for those columns.
Without replacing the missing values, I cannot make the movies data frame because I get the error as:
error in data.frame(name = page %>% html_nodes(".lister-item-header a") %>% :
arguments imply differing number of rows: 50, 49, 37
I recommend narrowing your scraping to a parent element, here the card that wraps each movie (.lister-item-content), and then iterating over those cards to extract the child elements of interest. Because html_element() (singular) returns exactly one result per card, a card with no matching child yields NA instead of being dropped, so every column ends up with the same length.
library(tidyverse)
library(rvest)
movies <-
"https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt" %>%
read_html()
movies %>%
html_elements(".lister-item-content") %>% # the cards
map_dfr(~ tibble( # iterate through the list and grab the elements:
title = .x %>%
html_element(".lister-item-header a") %>%
html_text2(),
year = .x %>%
html_element(".text-muted.unbold") %>%
html_text2(),
certificate = .x %>%
html_element(".certificate") %>%
html_text2(),
runtime = .x %>%
html_element(".runtime") %>%
html_text2(),
genre = .x %>%
html_element(".genre") %>%
html_text2(),
rating = .x %>%
html_element(".ratings-imdb-rating strong") %>%
html_text2(),
director = .x %>%
html_element(".text-muted+ p a:nth-child(1)") %>%
html_text2(),
votes = .x %>%
html_element(".sort-num_votes-visible span:nth-child(2)") %>%
html_text2(),
gross = .x %>%
html_element(".ghost~ .text-muted+ span") %>%
html_text2()
))
Results
# A tibble: 50 × 9
title year certi…¹ runtime genre rating direc…² votes gross
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "The Dark Knight" (200… 15 152 min Acti… 9.0 Christ… 2,66… $534…
2 "Ringenes herre: Atter en kong… (200… 12 201 min Acti… 9.0 Peter … 1,85… $377…
3 "Inception" (201… 15 148 min Acti… 8.8 Christ… 2,36… $292…
4 "Ringenes herre: Ringens brors… (200… 12 178 min Acti… 8.8 Peter … 1,88… $315…
5 "Ringenes herre: To t\u00e5rn" (200… 12 179 min Acti… 8.8 Peter … 1,67… $342…
6 "The Matrix" (199… 15 136 min Acti… 8.7 Lana W… 1,92… $171…
7 "Star Wars: Episode V - Imperi… (198… 9 124 min Acti… 8.7 Irvin … 1,29… $290…
8 "Soorarai Pottru" (202… NA 153 min Acti… 8.7 Sudha … 117,… NA
9 "Stjernekrigen" (197… 11 121 min Acti… 8.6 George… 1,37… $322…
10 "Terminator 2 - Dommens dag" (199… 15 137 min Acti… 8.6 James … 1,10… $204…
# … with 40 more rows, and abbreviated variable names ¹​certificate, ²​director
# ℹ Use `print(n = ...)` to see more rows
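The difference between selecting on the whole page and selecting per card can be reproduced without hitting IMDb. The sketch below builds a tiny page with rvest's minimal_html(); the markup, the class names card and certificate, and the movie names are made up for illustration.

```r
library(rvest)

# Hypothetical page: the second card has no certificate
page <- minimal_html('
  <div class="card"><h3>Movie A</h3><span class="certificate">15</span></div>
  <div class="card"><h3>Movie B</h3></div>
  <div class="card"><h3>Movie C</h3><span class="certificate">12</span></div>
')

# Page-level, plural selector: the missing value is silently dropped
all_certs <- page %>%
  html_elements(".certificate") %>%
  html_text2()

# Card-level, singular selector: one result per card, NA where nothing matches
cards <- page %>% html_elements(".card")
per_card <- cards %>%
  html_element(".certificate") %>%
  html_text2()

length(all_certs)
per_card
```

The first vector has two entries and cannot be lined up with three titles, which is the "differing number of rows" situation from the question; the second has three entries with NA in the middle.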

Sum of column based on a condition in R

I would like to compute the total amount for each date, so that my new data frame has a date column and a total amount column.
My data frame looks like this
permitnum  amount
6/1/2022   na
ascas      30.00
olic       40.41
6/2/2022   na
avrey      17.32
fev        32.18
grey       12.20
Any advice on how to go about this will be appreciated.
Here is another tidyverse option. I convert permitnum to a date and format it back to text; rows that are not dates become NA in that step. That lets fill() carry each date down over the rows below it, so the date can be used for grouping. Then I take the sum for each date.
library(tidyverse)
df %>%
mutate(permitnum = format(as.Date(permitnum, "%m/%d/%Y"), "%m/%d/%Y")) %>%
fill(permitnum, .direction = "down") %>%
group_by(permitnum) %>%
summarise(total_amount = sum(as.numeric(amount), na.rm = TRUE))
Output
permitnum total_amount
<chr> <dbl>
1 06/01/2022 70.4
2 06/02/2022 61.7
Data
df <- structure(list(permitnum = c("6/1/2022", "ascas", "olic", "6/2/2022",
"avrey", "fev", "grey"), amount = c("na", "30.00", "40.41", "na",
"17.32", "32.18", "12.20")), class = "data.frame", row.names = c(NA,
-7L))
Here is an option. Split the data into one block per date, where a new block starts at each row whose permitnum contains a digit. Then, for each block, drop the date row, sum the amounts, and combine the date and total into one row.
library(tidyverse)
dat <- read_table("permitnum amount
6/1/2022 na
ascas 30.00
olic 40.41
6/2/2022 na
avrey 17.32
fev 32.18
grey 12.20")
dat |>
group_split(id = cumsum(grepl("\\d", permitnum))) |>
map_dfr(\(x){
date <- x$permitnum[[1]]
x |>
slice(-1) |>
summarise(date = date,
total_amount = sum(as.numeric(amount)))
})
#> # A tibble: 2 x 2
#> date total_amount
#> <chr> <dbl>
#> 1 6/1/2022 70.4
#> 2 6/2/2022 61.7
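Both answers rest on the same idea: mark the rows that hold a date, carry that date down to the rows below it, then sum per date. A minimal base R sketch of that idea follows. It assumes the first row is always a date row and that dates look like m/d/yyyy; the column names date and total_amount are illustrative.

```r
df <- data.frame(
  permitnum = c("6/1/2022", "ascas", "olic", "6/2/2022", "avrey", "fev", "grey"),
  amount    = c("na", "30.00", "40.41", "na", "17.32", "32.18", "12.20")
)

# TRUE for rows that hold a date rather than a permit id
is_date <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", df$permitnum)

# cumsum() numbers the blocks 1, 1, 1, 2, 2, ...; indexing the vector of
# dates with it carries each date down over its block
df$date <- df$permitnum[is_date][cumsum(is_date)]

# "na" becomes NA here; the formula interface of aggregate() drops NA rows
df$total_amount <- suppressWarnings(as.numeric(df$amount))
totals <- aggregate(total_amount ~ date, data = df, FUN = sum)
totals
```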

How best to do this pivot operation in R

Below is the sample data and the desired outcome. This is a much simplified version of the actual data set, which has 20 years with 4 quarters each. I am looking to have each unique company listed once, with the employment data series running from beginning to end, left to right. If there is no data for Vision Inc in 2019 quarter 3, I want it to return a 0 and not an NA.
library(tidyverse)
library(dplyr)
legalname <- c("Vision Inc.","Expedia","Strong Enterprise","Vision Inc.","Expedia","Strong Enterprise")
year <- c(2019,2019,2019,2019,2019,2019)
quarter <- c(1,1,1,2,2,2)
cnty <- c(031,029,027,031,029,027)
naics <- c(345110,356110,362110,345110,356110,345110)
mnth1emp <- c (11,13,15,15,17,20)
mnth2emp <- c(12,14,15,16,18,22)
mnth3emp <-c(13,15,15,17,21,29)
employers <- data.frame(legalname,year,quarter,naics,mnth1emp,mnth2emp,mnth3emp)
Desired Outcome
legalname   cnty  naics   2019m1  2019m2  2019m3  2019m4  2019m5  2019m6
Vision Inc  031   345110  11      12      13      15      16      17
Expedia     029   356110  13      14      15      17      18      21
I first pivot to long form, then arrange by legalname and year (just to make sure they are in order). Next, I create a running month series within each company, year and naics group. Then I drop quarter, pivot back to wide form with year and month name combined into the column names, and finally replace NA with 0. Here, I'm assuming that you want each unique naics on its own row.
library(tidyverse)
employers %>%
pivot_longer(starts_with("mnth")) %>%
arrange(legalname, year) %>%
group_by(legalname, year, naics) %>%
mutate(name = paste0("m", 1:n())) %>%
select(-quarter) %>%
pivot_wider(names_from = c("year", "name"), names_sep = "", values_from = "value") %>%
mutate(across(everything(), ~replace_na(.,0)))
Output
legalname naics `2019m1` `2019m2` `2019m3` `2019m4` `2019m5` `2019m6`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Expedia 356110 13 14 15 17 18 21
2 Strong Enterprise 362110 15 15 15 0 0 0
3 Strong Enterprise 345110 0 0 0 20 22 29
4 Vision Inc. 345110 11 12 13 15 16 17
Does this work for you?
First pivot longer to get the months and values in a quarter; and then pivot wider to get the wide format you want.
employers %>%
filter(legalname != "Strong Enterprise") %>%
pivot_longer(mnth1emp:mnth3emp, names_to = "mnth", values_to = "value") %>%
mutate(month_in_quarter = as.numeric(str_extract(mnth, "\\d")),
month =str_c("m", month_in_quarter + 3*(quarter - 1))) %>%
select(-c(month_in_quarter, mnth)) %>%
pivot_wider(id_cols = c(legalname, cnty, naics), names_from = c(year, month),
values_from = value,
values_fill = 0)
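values_fill = 0 fills the missing combinations with 0 instead of NA. Note that the data.frame() call in the question leaves cnty out of employers, so it has to be added there before cnty can be used as an id column.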
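The month arithmetic in this answer, month_in_quarter + 3 * (quarter - 1), turns a month within a quarter into a month of the year. A base R sketch of the same reshaping on the question's data is given here as a cross-check rather than a replacement; the id and period names are illustrative, and xtabs() is swapped in for pivot_wider() because it leaves 0 in cells with no data.

```r
employers <- data.frame(
  legalname = c("Vision Inc.", "Expedia", "Strong Enterprise",
                "Vision Inc.", "Expedia", "Strong Enterprise"),
  year     = c(2019, 2019, 2019, 2019, 2019, 2019),
  quarter  = c(1, 1, 1, 2, 2, 2),
  naics    = c(345110, 356110, 362110, 345110, 356110, 345110),
  mnth1emp = c(11, 13, 15, 15, 17, 20),
  mnth2emp = c(12, 14, 15, 16, 18, 22),
  mnth3emp = c(13, 15, 15, 17, 21, 29)
)

# Stack the three month columns, converting (quarter, month in quarter)
# into a month of the year: m + 3 * (quarter - 1)
long <- do.call(rbind, lapply(1:3, function(m) {
  data.frame(
    id    = paste(employers$legalname, employers$naics, sep = " | "),
    year  = employers$year,
    month = m + 3 * (employers$quarter - 1),
    emp   = employers[[paste0("mnth", m, "emp")]]
  )
}))

# Order the column labels chronologically (plain sorting would put m10 before m2)
long <- long[order(long$year, long$month), ]
labels <- paste0(long$year, "m", long$month)
long$period <- factor(labels, levels = unique(labels))

# xtabs() sums emp per cell and leaves 0 where a combination is absent
wide <- xtabs(emp ~ id + period, data = long)
wide
```

One caveat of this design: a cell that is genuinely 0 and a cell with no data look the same in the result, which is what the question asks for but may not suit every analysis.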
Perhaps try this.
I found a way to get the pivot right in R: I used the pivottabler package with its sample data frame bhmtrains, and this now works.
library(pivottabler)
qhpvt(bhmtrains, c("=","TOC"), "TrainCategory",
c("Mean Speed"="mean(SchedSpeedMPH, na.rm=TRUE)",
"Std Dev Speed"="sd(SchedSpeedMPH, na.rm=TRUE)"),
formats=list("%.0f", "%.1f"), totals=list("", "TrainCategory"="All",
"Categories"))

Conditionally counting number of occurrences in a dataframe - performance improvement

I need to detect (among other things) the first occurrence of a non-"F" code in a patient's list after the first "F" code occurrence. The code below seems to do this, but it is too slow on the server when run on a data set of one million observations.
The final data set should contain the number of non-F codes (nhosp) and the first non-F code found after the first F code in the DAIGNOSTICO variable (ficd), with no duplicate IDs.
How can I improve both in terms of complexity and speed? Tidyverse pipe preferred.
This is what the result should look like:
# A tibble: 7 × 6
# Groups: ID [7]
ID DAIGNOSTICO data_entrada data_saida nhosp ficd
<dbl> <chr> <date> <date> <dbl> <chr>
1 1555 F180 1930-04-05 2005-03-15 1 T124
2 1234 F100 1980-04-01 2005-03-02 2 O155
3 16666 F120 1990-06-05 2005-03-18 0 <NA>
4 123456 F145 2001-03-07 2005-03-11 2 T123
5 177778 F155 2001-04-13 2005-03-22 2 G123
6 166666 F125 2002-03-12 2005-03-19 2 W345
7 12345 F150 2002-06-03 2005-03-07 4 K709
This is what my code currently looks like:
library(readr)
library(dplyr)
library(tidyr)
simulation <- read_csv("SIMULADO.txt", col_types = cols(
data_entrada = col_date("%d/%m/%Y"),
data_saida = col_date("%d/%m/%Y")
)
)
simulation <- as.data.frame(simulation)
simulation[, "nhosp"] <- 0
oldpos <- 1
for (i in 1:nrow(simulation)) {
if (grepl("F", simulation[i, "DAIGNOSTICO"], )) { # Has F?
oldpos <- i
clin <- 0
simulation[i, "hasF"] <- T
} else {
simulation[i, "hasF"] <-F
}
if (simulation[i, "ID"] == simulation[oldpos, "ID"]) { # same person?
if (simulation[oldpos, "hasF"] == T) { # Did she/he have an F code?
simulation[i, "hasF"] <- T
if (simulation[i, "data_entrada"] > simulation[oldpos, "data_entrada"]) { # is it subsequent?
if (!grepl("F", simulation[i, "DAIGNOSTICO"], )) { # not-F?
simulation[i,"hasC"] <- T
clin <- 1
simulation[i, "ficd"] <- simulation[i, "DAIGNOSTICO"]
simulation[i, "nhosp"] <- clin
first_cc <- simulation[i, "DAIGNOSTICO"]
}
}
}
}
}
dt1 <- simulation %>%
arrange(data_entrada) %>%
group_by(ID) %>%
select(ficd) %>%
drop_na() %>%
slice(1)
dt2 <- simulation %>%
arrange(data_entrada) %>%
group_by(ID) %>%
filter(hasF == T) %>%
mutate(nhosp = cumsum(nhosp),
nhosp = max(nhosp)) %>%
select(-ficd,-hasF, -hasC) %>%
distinct(ID, .keep_all = TRUE) %>%
full_join(dt1, by = "ID")
dt2
And this is an example data set, with some errors to check robustness of the code:
ID, DAIGNOSTICO, data_entrada, data_saida
123490, O100, 01/04/1980, 02/03/2005
123490, O100, 01/04/1981, 02/03/2005
123491, O101, 01/04/1980, 02/03/2005
123491, O101, 01/04/1981, 02/03/2005
1234, F100, 01/04/1980, 02/03/2005
1234, O155, 02/04/1980, 03/03/2005
1234, G123, 05/05/1982, 04/03/2005
12345, T124, 01/06/2002, 05/03/2005
12345, Y124, 02/06/2002, 06/03/2005
12345, F150, 03/06/2002, 07/03/2005
12345, K709, 04/06/2002, 08/03/2005
12345, Y709, 05/06/2002, 09/03/2005
12345, F150, 03/06/2002, 07/03/2005
12345, K710, 06/06/2002, 08/03/2005
12345, K711, 07/06/2002, 10/03/2005
12345, F150, 08/06/2002, 07/03/2005
123456, F145, 07/03/2001, 11/03/2005
123456, T123, 08/03/2001, 12/03/2005
123456, P123, 09/03/2001, 13/03/2005
1555 ,R155, 04/04/1930, 14/03/2005
1555 ,F180, 05/04/1930, 15/03/2005
1555 ,T124, 06/04/1930, 16/03/2005
1555 ,F708, 07/04/1930, 17/03/2005
16666 ,F120, 05/06/1990, 18/03/2005
166666, F125, 12/03/2002, 19/03/2005
166666, W345, 13/03/2002, 20/03/2005
166666, L123, 14/03/2002, 21/03/2005
177778, F155, 13/04/2001, 22/03/2005
177778, G123, 14/04/2001, 23/03/2005
177778, F190, 15/04/2001, 24/03/2005
177778, E124, 16/04/2001, 25/03/2005
177779, G155, 13/04/2001, 22/03/2005
177779, G123, 14/04/2001, 23/03/2005
177779, G190, 15/04/2001, 24/03/2005
177779, E124, 16/04/2001, 25/03/2005
You could use
library(dplyr)
library(stringr)
df %>%
group_by(ID) %>%
filter(cumsum(str_detect(DAIGNOSTICO, "^F")) > 0) %>%
mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
ficd = lead(DAIGNOSTICO)) %>%
filter(str_detect(DAIGNOSTICO, "^F")) %>%
slice(1) %>%
ungroup()
This returns
# A tibble: 7 x 6
ID DAIGNOSTICO data_entrada data_saida nhosp ficd
<dbl> <chr> <chr> <chr> <int> <chr>
1 1234 F100 01/04/1980 02/03/2005 2 O155
2 1555 F180 05/04/1930 15/03/2005 1 T124
3 12345 F150 03/06/2002 07/03/2005 4 K709
4 16666 F120 05/06/1990 18/03/2005 0 NA
5 123456 F145 07/03/2001 11/03/2005 2 T123
6 166666 F125 12/03/2002 19/03/2005 2 W345
7 177778 F155 13/04/2001 22/03/2005 2 G123
Edit
I think there might be a flaw: if the row right after the first F code is itself an F code, lead() returns that F code as ficd. Perhaps
library(dplyr)
library(stringr)
df %>%
group_by(ID) %>%
filter(
cumsum(str_detect(DAIGNOSTICO, "^F")) == 1 |
!str_detect(DAIGNOSTICO, "^F") & cumsum(str_detect(DAIGNOSTICO, "^F")) > 0
) %>%
mutate(nhosp = sum(str_detect(DAIGNOSTICO, "^[^F]")),
ficd = lead(DAIGNOSTICO)) %>%
filter(str_detect(DAIGNOSTICO, "^F")) %>%
slice(1) %>%
ungroup()
is a better solution.
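To check the intended logic on a small sample, independently of dplyr, the rule can be written out per patient in base R. This is only a sketch for verifying results, not a claim about speed on a million rows. It uses a subset of the example data, relies on the rows being in the order given (as the answer does), and the helper name first_non_f is hypothetical.

```r
df <- data.frame(
  ID = c(123490, 123490, 1234, 1234, 1234,
         12345, 12345, 12345, 12345, 12345, 12345, 12345, 12345, 12345,
         16666),
  DAIGNOSTICO = c("O100", "O100", "F100", "O155", "G123",
                  "T124", "Y124", "F150", "K709", "Y709", "F150", "K710", "K711", "F150",
                  "F120")
)

# For one patient: find the first F code, then count the non-F codes after it
# and keep the first of them. Patients without any F code return NULL.
first_non_f <- function(d) {
  is_f <- startsWith(d$DAIGNOSTICO, "F")
  if (!any(is_f)) return(NULL)
  first_f <- which(is_f)[1]
  after <- d$DAIGNOSTICO[seq_len(nrow(d)) > first_f & !is_f]
  data.frame(
    ID          = d$ID[1],
    DAIGNOSTICO = d$DAIGNOSTICO[first_f],
    nhosp       = length(after),
    ficd        = if (length(after) > 0) after[1] else NA_character_
  )
}

per_id <- lapply(split(df, df$ID), first_non_f)
res <- do.call(rbind, Filter(Negate(is.null), per_id))
res
```

Later F codes are ignored when looking for ficd, which is the case the edited answer handles by filtering them out before calling lead().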

Adding overall mean when using group_by

I am using the dplyr package to generate some tables, together with the adorn_totals("row") function from the janitor package.
This works fine when I want to sum values within the groups, however in some cases I want an overall mean instead of a sum. Is there an adorn_means function?
Sample code:
Regions2 <- Data %>%
filter(!is.na(REGION))%>%
group_by(REGION) %>%
summarise(Numberofpeople=length(Names))%>%
adorn_totals("row")
Here my "total" row is simply the sum of all people within the regions. This gives me
REGION NumberofPeople
East Midlands 578,943
East of England 682,917
London 1,247,540
North East 245,830
North West 742,886
South East 963,040
South West 623,684
West Midlands 653,335
Yorkshire 553,853
TOTAL 6,292,028
My next piece of code generates an average salary for each region, but I want to add an overall average for the total
Regions3 <- Data %>%
filter(!is.na(REGION))%>%
filter(!is.na(AVGSalary))%>%
group_by(REGION) %>%
summarise(AverageSalary=mean(AVGSalary))
If I use adorn_totals("row") as before, I simply get the sum of the averages, not the overall average for the dataset.
How do I get the overall average?
UPDATE with some noddy data:
Data
people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150
The group averages are therefore:
London 1000
South West 983.33
South East 900
Scotland 1020
East 600
I want to add the overall total to the bottom
Total 941.11
1) Because the overall average is the weighted average of the averages (not the plain average of the averages), i.e. it is 941 and not 901, we maintain an n column so that in the end we can correctly compute the overall average. Although the data shown does not have any NAs we use drop_na in order to also use it with such data. This will remove any row containing an NA.
library(dplyr)
library(tidyr)
Region %>%
drop_na %>%
group_by(region) %>%
summarize(avg = mean(salary), n = n()) %>%
ungroup %>%
bind_rows(summarize(., region = "Overall Avg",
avg = sum(avg * n) / sum(n),
n = sum(n))) %>%
select(-n)
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
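The arithmetic behind 1) can be checked directly in base R on the sample data. The sketch below compares the plain mean of the regional means with the mean weighted by group size; the variable names are illustrative.

```r
salary <- c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150)
region <- c("London", "South West", "South East", "London", "Scotland",
            "South West", "East", "London", "South West")

avg <- tapply(salary, region, mean)    # mean salary per region
n   <- tapply(salary, region, length)  # number of people per region

plain_avg    <- mean(avg)              # treats every region equally
weighted_avg <- weighted.mean(avg, n)  # same as sum(avg * n) / sum(n)
overall_avg  <- mean(salary)           # computed from the raw data

c(plain = plain_avg, weighted = weighted_avg, overall = overall_avg)
```

The weighted and overall values agree (about 941.11), while the plain mean of the means is about 900.67, because London and South West each contribute three people but count only once in the plain mean.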
2) Another approach would be to construct the Overall Avg line by going back to the original data:
Region %>%
drop_na %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
2a) If you object to referring to Region twice then try this.
Region_ <- Region %>%
drop_na
Region_ %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
2b) or as a single pipeline where now Region_ is local to the pipeline and will automatically be removed after the pipeline completes:
Region %>%
drop_na %>%
{ Region_ <- .
Region_ %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
}
Note
We used this as the input:
Lines <- "people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150"
library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$",
as.is = TRUE, skip = 1, strip.white = TRUE,
col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))
One option is to add a row with bind_rows
library(dplyr)
Data %>%
group_by(region) %>%
summarise(Avgsalary = mean(salary)) %>%
bind_rows(tibble(region = 'Total',
Avgsalary = mean(.$Avgsalary, na.rm = TRUE)))
Or another option is add_row from tibble
Data %>%
group_by(region) %>%
summarise(Avgsalary = mean(salary)) %>%
add_row(region = 'Total', Avgsalary = mean(.$Avgsalary))
The two options above report the mean of the regional means. If the total should instead be the overall mean of all salaries, we need to calculate it before summarising
Data %>%
mutate(Total = mean(salary)) %>%
group_by(region) %>%
summarise(Avgsummary = mean(salary), Total = first(Total)) %>%
add_row(region = 'Total', Avgsummary = .$Total[1]) %>%
select(-Total)
