I need help merging rows based on a certain string in the complaint column - R

I am trying to calculate the fraction of construction noise complaints per zip code across New York City. The data is from NYC 311.
I am using dplyr and have grouped the data per zip.
However, I am having difficulty merging the rows for the complaint column: I need to merge rows whenever the string "construction" appears anywhere in the value, whether at the front, in the middle, or at the end.
My solution so far (this is just the beginning):
comp_types <- df %>%
  select(complaint_type, descriptor, incident_zip) %>%
  group_by(incident_zip)
Can you help me merge the rows if a unique value in descriptor contains any construction-related string?

Can you clarify what you mean by "merging"? I don't think you actually want to merge because you only have one dataframe. The term "merging" is used to describe the joining of two dataframes.
See ?base::merge:
Merge two data frames by common columns or row names, or do other versions of database join operations.
If I understand correctly, you want to look into the descriptor variable and see if it contains the string "construction" anywhere in the cell, so you can determine if the person's complaint was construction-related; same for "music". I don't believe you need to use complaint_type since complaint_type never contains the string "construction" or "music"; only descriptor does.
You can use a combination of ifelse and grepl to create a new variable that indicates whether the complaint was construction-related, music-related, or other.
library(tidyverse)
library(janitor)
url <- "https://data.cityofnewyork.us/api/views/p5f6-bkga/rows.csv"
df <- read.csv(url, nrows = 10000) %>%
  clean_names() %>%
  select(complaint_type, descriptor, incident_zip)
comp_types <- df %>%
  select(complaint_type, descriptor, incident_zip) %>%
  group_by(incident_zip)
head(comp_types)
#> # A tibble: 6 × 3
#> # Groups: incident_zip [6]
#> complaint_type descriptor incident_zip
#> <chr> <chr> <int>
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
table(df$complaint_type)
#>
#> Noise Noise - Commercial Noise - Helicopter
#> 555 591 145
#> Noise - House of Worship Noise - Park Noise - Residential
#> 20 72 5675
#> Noise - Street/Sidewalk Noise - Vehicle
#> 2040 902
df <- df %>%
  mutate(descriptor_misc = ifelse(grepl("Construction", descriptor), "Construction",
                                  ifelse(grepl("Music", descriptor), "Music", "Other")))
df %>%
  group_by(descriptor_misc) %>%
  count()
#> # A tibble: 3 × 2
#> # Groups: descriptor_misc [3]
#> descriptor_misc n
#> <chr> <int>
#> 1 Construction 328
#> 2 Music 6354
#> 3 Other 3318
head(df)
#> complaint_type descriptor incident_zip
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
#> descriptor_misc
#> 1 Other
#> 2 Music
#> 3 Other
#> 4 Music
#> 5 Music
#> 6 Construction
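From here, the fraction of construction-related complaints per zip code (the original goal) is one more grouped summary. A minimal sketch, assuming the df built above:
df %>%
  group_by(incident_zip) %>%
  summarise(n = n(),
            frac_construction = mean(descriptor_misc == "Construction"))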

Related

Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a PDF. The function gets the job done, but unfortunately the table contains a column for notes which is only partially filled. As a result, the data in the resulting table is shifted by one column in the rows that contain a note.
Here's the table.
And here's the code:
# load libraries
library(pdftools)
library(stringr) # needed for str_split_fixed()
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftools
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
                       table_number,
                       number_columns,
                       column_names,
                       first_row,
                       last_row) {
  data <- list_of_tables[table_number]
  data <- trimws(data)
  data <- strsplit(data, "\n")
  data <- data[[1]]
  data <- data[min(grep(first_row, data)):
               max(grep(last_row, data))]
  data <- str_split_fixed(data, " {2,}", number_columns)
  data <- data.frame(data)
  names(data) <- column_names
  return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data, 3, 5, names, "Care fees", "Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
While readr::read_fwf() is designed for fixed-width files, it performs pretty well on text from pdftools too once header/footer rows are removed, even when it has to guess column widths (they can also be specified explicitly).
library(pdftools)
library(dplyr, warn.conflicts = F)
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
  lines <- unlist(strsplit(pdf_text_item, "\n"))
  # remove 0-length lines
  lines <- lines[nchar(lines) > 0]
  lines <- lines[min(grep(first_row_str, lines)):
                 max(grep(last_row_str, lines))]
  # paste lines back into a single string for read_fwf()
  paste(lines, collapse = "\n") %>%
    readr::read_fwf() %>%
    # re-connect strings in the first column if values were split between rows
    mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
    filter(!is.na(X3))
}
output <- scrape_pdf(data[3], "Care fees", "Basic and diluted")
Result:
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
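As noted above, read_fwf() can also take explicit column positions instead of guessing them, which helps if a column ever gets mis-split. A minimal sketch using fwf_widths(); the widths and column names here are hypothetical and would need to be measured against the actual page text:
# inside scrape_pdf(), the read_fwf() call could become:
paste(lines, collapse = "\n") %>%
  readr::read_fwf(col_positions = readr::fwf_widths(
    c(45, 3, 14, 14, 14), # hypothetical column widths, in characters
    c("item", "note", "sep_2022", "sep_2021", "mar_2022") # hypothetical names
  ))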

R combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City's numbers added to Wayne County, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
Code chunk output picture
You could do:
cases_deaths %>%
  filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
  mutate(COUNTY = "Wayne") %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize_all(sum) %>%
  bind_rows(cases_deaths %>%
              filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
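A side note: summarize_all() has since been superseded in dplyr by across(). Under the same data, an equivalent sketch:
cases_deaths %>%
  filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
  mutate(COUNTY = "Wayne") %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize(across(everything(), sum), .groups = "drop") %>%
  bind_rows(cases_deaths %>%
              filter(!COUNTY %in% c("Wayne", "Detroit City")))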

Joining tables and applying functions to columns with the same name in R and tidyverse

I am looking to join tables with customer id (easy enough) but then I want to multiply the columns to get updated values.
Customer_Week_1 <- data.frame(First_name = c("John","Mary","David","Paul"),
                              Last_name = c("Jackson","Smith","Williams", "Zimmerman"),
                              Factor_1 = c(2,5,8,9),
                              Factor_2 = c(.5,.5,.75,.75),
                              Factor_3 = c(0,1,2,3))
Customer_Week_2 <- data.frame(First_name = c("John","Mary","David","Paul"),
                              Last_name = c("Jackson","Smith","Williams", "Zimmerman"),
                              Factor_1 = c(3,7,1,7),
                              Factor_2 = c(.51,.65,.72,.4),
                              Factor_3 = c(1,2,3,4))
Customer_week3 <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name","Last_name"))
The expected results can be computed directly with just:
Customer_week3_expected<-Customer_Week_1[,3:5]*Customer_Week_2[,3:5]
And I know I can just manually type out every column. But I have dozens of columns and need to make this code as easy to follow as possible.
I also know that I can just bind the results back onto the name columns:
Customer_week3 <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name","Last_name")) %>%
  select(1:2)
But that does not look like best practice to me, and I would rather this be done with a join in some way, to ensure everything lines up when I am iterating over the customers (tables).
Assuming I understand the output you're trying to get, I can think of two methods. If you know that the names are in the first two columns and are the same in both data frames (this might not be the case in real life), you can use the same multiplication operation you tried above, bound to the first two columns of either of the data frames.
cbind(Customer_Week_1[1:2], Customer_Week_1[-1:-2] * Customer_Week_2[-1:-2])
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> 1 John Jackson 6 0.255 0
#> 2 Mary Smith 35 0.325 2
#> 3 David Williams 8 0.540 6
#> 4 Paul Zimmerman 63 0.300 12
Or you can be more verbose but maybe more flexible, and reshape to a long data frame, then do a grouped operation to summarize products for each person and factor. Starting from the join you have above:
library(dplyr)
library(tidyr)
Customer_week3 <- Customer_Week_1 %>%
  left_join(Customer_Week_2, by = c("First_name", "Last_name"))
Make long-shaped data, separate the Factor_1.x into Factor_1 and x, and make products as your summary calculation.
products <- Customer_week3 %>%
  gather(key = factor, value = value, -First_name, -Last_name) %>%
  separate(factor, into = c("factor", "week"), sep = "\\.") %>%
  group_by(First_name, Last_name, factor) %>%
  summarise(value = prod(value))
head(products)
#> # A tibble: 6 x 4
#> # Groups: First_name, Last_name [2]
#> First_name Last_name factor value
#> <fct> <fct> <chr> <dbl>
#> 1 David Williams Factor_1 8
#> 2 David Williams Factor_2 0.54
#> 3 David Williams Factor_3 6
#> 4 John Jackson Factor_1 6
#> 5 John Jackson Factor_2 0.255
#> 6 John Jackson Factor_3 0
If you need to get back to a wide format, spread back.
products %>%
  spread(key = factor, value = value)
#> # A tibble: 4 x 5
#> # Groups: First_name, Last_name [16]
#> First_name Last_name Factor_1 Factor_2 Factor_3
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 David Williams 8 0.54 6
#> 2 John Jackson 6 0.255 0
#> 3 Mary Smith 35 0.325 2
#> 4 Paul Zimmerman 63 0.3 12
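Since tidyr 1.0, gather() and spread() have been superseded by pivot_longer() and pivot_wider(); the same reshape could be sketched as follows, starting from the joined Customer_week3 above:
Customer_week3 %>%
  pivot_longer(-c(First_name, Last_name),
               names_to = c("factor", "week"),
               names_sep = "\\.") %>%
  group_by(First_name, Last_name, factor) %>%
  summarise(value = prod(value), .groups = "drop") %>%
  pivot_wider(names_from = factor, values_from = value)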
Similar to @camille's reshaping, but in data.table (and disregarding Customer_week3):
library(data.table)
# long format
long = rbindlist(list(Customer_Week_1, Customer_Week_2), idcol = TRUE)
# aggregate
long[, lapply(.SD, prod), by = .(First_name, Last_name), .SDcols = patterns("^Factor")]
First_name Last_name Factor_1 Factor_2 Factor_3
1: John Jackson 6 0.255 0
2: Mary Smith 35 0.325 2
3: David Williams 8 0.540 6
4: Paul Zimmerman 63 0.300 12
Going longer (again as seen in @camille's answer) might also make sense, so as to avoid repeatedly fiddling with the names of the Factor_* columns:
longer = melt(long, measure.vars = patterns("^Factor")) # analogous to gather
longer[, .(value = prod(value)), by = .(First_name, Last_name, variable)]

Tableau LOD R Equivalent

I'm using a Tableau Fixed LOD function in a report, and was looking for ways to mimic this functionality in R.
Data set looks like:
Soldto<-c("123456","122456","123456","122456","124560","125560")
Shipto<-c("123456","122555","122456","124560","122560","122456")
IssueDate<-as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01"))
Method<-c("Ground","Ground","Ground","Air","Ground","Ground")
Delivery<-c("000123","000456","000123","000345","000456","000555")
df1<-data.frame(Soldto,Shipto,IssueDate,Method,Delivery)
What I'm looking to do is "For each Sold-to/Ship-to/Method count the number of unique delivery IDs".
The intent is to find the number of unique deliveries that could potentially be "aggregated."
In Tableau that function looks like:
{FIXED [Soldto],[Shipto],[IssueDate],[Method] : COUNTD([Delivery])}
Could this be done with aggregate or summarise, as in the example below?
df.new <- ddply(df1, c("Soldto","Shipto","Method"), summarise,
                Deliveries = n_distinct(Delivery))
This is fairly easy with dplyr. You are looking for the number of unique deliveries for each combination of soldto, shipto, and method, which is just group_by() and then summarise():
library(tidyverse)
tbl <- tibble(
  soldto = c("123456","122456","123456","122456","124560","125560"),
  shipto = c("123456","122555","122456","124560","122560","122456"),
  issuedate = as.Date(c("2017-01-01","2017-01-02","2017-01-01","2017-01-02","2017-01-01","2017-01-01")),
  method = c("Ground","Ground","Ground","Air","Ground","Ground"),
  delivery = c("000123","000456","000123","000345","000456","000555")
)
tbl %>%
  group_by(soldto, shipto, method) %>%
  summarise(uniques = n_distinct(delivery))
#> # A tibble: 6 x 4
#> # Groups: soldto, shipto [?]
#> soldto shipto method uniques
#> <chr> <chr> <chr> <int>
#> 1 122456 122555 Ground 1
#> 2 122456 124560 Air 1
#> 3 123456 122456 Ground 1
#> 4 123456 123456 Ground 1
#> 5 124560 122560 Ground 1
#> 6 125560 122456 Ground 1
Created on 2018-03-02 by the reprex package (v0.2.0).
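One caveat: the Tableau LOD above also fixes on [IssueDate], while the dplyr code groups only by soldto, shipto, and method (as the question's prose asks). To mirror the LOD exactly, add issuedate to the grouping, e.g.:
tbl %>%
  group_by(soldto, shipto, issuedate, method) %>%
  summarise(uniques = n_distinct(delivery))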

r rvest webscraping hltv

Yes, this is just another "how-to-scrape" question. Sorry for that, but I've read the previous answers and the manual for rvest as well.
I'm doing web scraping for my homework (so I do not plan to use the data for any commercial purpose). The idea is to show that a team's average skill affects individual skill. I'm trying to use CS:GO data from HLTV.org for it.
The information is available at http://www.hltv.org/?pageid=173&playerid=9216
I need two tables: Key stats (data only) and Teammates (data and URLs). I tried CSS selectors generated by SelectorGadget and also analyzed the source code of the web page. I've failed. I'm doing the following:
library(rvest)
library(dplyr)
url <- 'http://www.hltv.org/?pageid=173&playerid=9216'
info <- html_session(url) %>% read_html()
info %>% html_node('.covSmallHeadline') %>% html_text()
Can you please tell me the right CSS selector to use?
If you look at the source, those tables aren't HTML tables, but just piles of divs with inconsistent nesting and inline CSS for alignment. Thus, it's easiest to just grab all the text and fix the strings afterwards, since each field is either entirely numeric or not numeric at all.
library(rvest)
library(tidyverse)
h <- 'http://www.hltv.org/?pageid=173&playerid=9216' %>% read_html()
h %>% html_nodes('.covGroupBoxContent') %>% .[-1] %>%
  html_text(trim = TRUE) %>%
  strsplit('\\s*\\n\\s*') %>%
  setNames(map_chr(., ~.x[1])) %>% map(~.x[-1]) %>%
  map(~data_frame(variable = gsub('[.0-9]+', '', .x),
                  value = parse_number(.x)))
#> $`Key stats`
#> # A tibble: 9 × 2
#> variable value
#> <chr> <dbl>
#> 1 Total kills 9199.00
#> 2 Headshot %% 46.00
#> 3 Total deaths 6910.00
#> 4 K/D Ratio 1.33
#> 5 Maps played 438.00
#> 6 Rounds played 11242.00
#> 7 Average kills per round 0.82
#> 8 Average deaths per round 0.61
#> 9 Rating (?) 1.21
#>
#> $TeammatesRating
#> # A tibble: 4 × 2
#> variable value
#> <chr> <dbl>
#> 1 Gabriel 'FalleN' Toledo 1.11
#> 2 Fernando 'fer' Alvarenga 1.11
#> 3 Joao 'felps' Vasconcellos 1.09
#> 4 Epitacio 'TACO' de Melo 0.98
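The question also asks for the teammates' URLs, which the text-only scrape above discards. Those could be pulled separately from the anchor tags; a sketch, where the 'a' selector inside .covGroupBoxContent is an assumption about the page structure:
h %>%
  html_nodes('.covGroupBoxContent a') %>% # assumed: teammate links sit inside these boxes
  html_attr('href')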
