How to remove dupes and replace column variables - r

I'm working with a data set named CCCrn on candidates in a local election with some duplicate values. Here's a sample:
Adam Hill 4100 New Texas Rd. Pittsburgh 15239 School Director PLUM Democratic 4 5
Adam Hill 4100 New Texas Rd. Pittsburgh 15239 School Director PLUM Republican 4 5
As you can see, this candidate cross-listed and was on both parties' ballots. I'd like to remove one of the rows, and then edit the Party variable to say "Cross Listed".
Obviously, unique and distinct haven't been much help. I tried
test <- CCCrn[!duplicated(CCCrn$Name), ] which succeeded in removing the duplicate candidates, but now I'm not sure how I would go back and edit the "Party" variable.

Create a flag for duplicate records, drop the duplicates, then update the party:
library(dplyr)
df <- df %>% mutate(dup = ifelse(duplicated(name) | duplicated(name, fromLast = TRUE), 1, 0)) # flag every row whose name occurs more than once
df <- df[!duplicated(df$name), ] # keep the first row for each name
df <- df %>% mutate(party = ifelse(dup == 1, "Cross Listed", party)) # update party
df <- df %>% select(-dup) # remove the flag

One way, using dplyr, would be to group_by all fields other than the party, and then summarise to "CrossListed" if the number of rows in the group is bigger than 1, i.e. if n() > 1.
Something like this...
library(dplyr)
df2 <- df %>% group_by(across(-Party)) %>%
summarise(Party = ifelse(n() > 1, "CrossListed", first(Party)))
Or, as an alternative to the last line, you could paste all the party names together so that you can see where they are cross-listed (which might be useful if there are lots of parties, less so if there are only two!): summarise(Party = paste(sort(Party), collapse = ", "))
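For comparison, here is a minimal base R sketch of the same idea (flag every row that shares all non-Party fields with another row, relabel it, then keep one row per candidate). The column names Name, Office and Party and the extra "Jane Doe" row are assumptions made up for illustration; they are not taken from the real CCCrn data.

```r
# Hypothetical sample modelled on the question; column names are assumptions.
CCCrn <- data.frame(
  Name   = c("Adam Hill", "Adam Hill", "Jane Doe"),
  Office = c("School Director", "School Director", "Mayor"),
  Party  = c("Democratic", "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# Build one key per row from every column except Party.
key_cols <- setdiff(names(CCCrn), "Party")
key <- do.call(paste, c(CCCrn[key_cols], sep = "\r"))

# Number of rows sharing each key, aligned with the original rows.
n_rows <- ave(seq_along(key), key, FUN = length)

# Relabel cross-listed candidates, then keep the first row per key.
CCCrn$Party[n_rows > 1] <- "Cross Listed"
result <- CCCrn[!duplicated(key), ]
rownames(result) <- NULL
```

Relabelling before dropping rows means it does not matter which of the duplicate rows survives.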

Related

How to skip empty values with rbind.fill or replace them with another value?

First time asking on Stack, so apologies for any mistakes in this question.
I am trying to scrape the suspension rates for all California high schools off of https://dq.cde.ca.gov/dataquest/, the public data site for the California Department of Education.
In case my code isn't very clear, let me describe my scraping process. The data I'm interested in is on a different webpage for each school, with the only difference between the URLs being the school CDS code. So, using another dataframe composed of school CDS codes, I substitute each school's CDS code into the URL and pull the data from the tables on that school's webpage. If there isn't data for a school in a specific year, no table is pulled up and the scraper will pull in empty values.
The problem I am running into is that when the scraper pulls in empty values (for when no data is found for a school in that year), I'm unable to continue binding scraped data into my scrape dataframe.
I have two possible ways I think might solve the problem, but haven't been able to figure out the code for either of them.
First, is there a way to have my scraper either skip these school ID codes when the data is not found (and the html_text is then empty), or replace those empty values with NAs?
Secondly, is there a way to use the rbind.fill command so that, if empty values are found, they are turned into NAs or some other symbol that will represent missing data?
Any help would be appreciated, thanks.
Code
#Initial Dataframe
CDS.code = c("01611190130229", "12626870111922", "19643031935618")
school = c("Alameda High", "American Indian Academy", "Mayfair High")
source = data.frame(CDS.code, school)
for (page_result in source$CDS.code) {
link = paste0("https://dq.cde.ca.gov/dataquest/dqCensus/DisSuspCount.aspx?year=2020-21&agglevel=School&cds=", page_result," ")
page = read_html(link)
school_id = page_result
#Columns for Data
hs_name = page %>%
html_nodes("tr:nth-child(2) a") %>%
html_text()
total_suspensions = page %>%
html_nodes("#ContentPlaceHolder1_grdTotals tr:nth-child(2) td:nth-child(3)")%>%
html_text()
df_schools = rbind.fill(df_schools, data.frame(
school_id,
total_suspensions,
stringsAsFactors = FALSE))
}
I expected the missing values to be populated with NAs. I've tried replacing empty values with NA and a few other values.
I've also tried to figure out how to make the web scraping portion skip when no value is found, but it has broken each time.
You can skip over schools with empty tables by using purrr::possibly(). If the function encounters an empty table, it will return NA instead of summing the third column.
library(rvest)
library(httr2)
library(tidyverse)
library(magrittr)
Sample data
source = tibble(
CDS.code = c("01611190130229", "12626870111922", "19643031935618"),
school = c("Alameda High", "American Indian Academy", "Mayfair High")
)
# A tibble: 3 × 2
CDS.code school
<chr> <chr>
1 01611190130229 Alameda High
2 12626870111922 American Indian Academy
3 19643031935618 Mayfair High
Scraper function
get_susp <- function(cds_code) {
cat("Scraping CDS:", cds_code, "\n")
str_c(
"https://dq.cde.ca.gov/dataquest/dqCensus/DisSuspCount.aspx?year=2020-21&agglevel=School&cds=",
cds_code
) %>%
request() %>%
req_perform() %>%
resp_body_html() %>%
html_table() %>%
nth(8) %>% # Pluck the 8th table
mutate(across(3, as.numeric)) %$% # Convert it to numeric
sum(TotalSuspensions, na.rm = TRUE) # Sum of total suspension
}
Create a new column with the sum of total suspension for that High School
source %>%
mutate(total_susp = map_dbl(
CDS.code, possibly(get_susp, otherwise = NA_integer_)
))
# A tibble: 3 × 3
CDS.code school total_susp
<chr> <chr> <dbl>
1 01611190130229 Alameda High 5
2 12626870111922 American Indian Academy NA
3 19643031935618 Mayfair High 0
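If you'd rather not depend on purrr, the same "return a default on failure" behaviour can be sketched in base R with tryCatch. This is only an illustration that runs without network access: safely_or and parse_total are hypothetical helper names, and the scraped values below are made up to mimic the three schools (the second one has no table, so nothing is scraped).

```r
# Base R stand-in for purrr::possibly(): wrap f so that any error returns `otherwise`.
safely_or <- function(f, otherwise) {
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

# Hypothetical parser: errors when nothing was scraped, as for schools with no table.
parse_total <- function(scraped) {
  if (length(scraped) == 0) stop("no table found")
  sum(as.numeric(scraped), na.rm = TRUE)
}

# Made-up scraped text per CDS code; character(0) mimics an empty html_text() result.
scraped_pages <- list(
  "01611190130229" = c("3", "2"),
  "12626870111922" = character(0),
  "19643031935618" = c("0")
)

# One value per school; failures become NA instead of stopping the loop.
totals <- vapply(scraped_pages, safely_or(parse_total, NA_real_), numeric(1))
```

Because every school yields exactly one value, the results can be bound into a data frame without rbind.fill having to cope with zero-length columns.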

Add value in one column based on multiple key words in another column in r

I want to do the following things: if key words "GARAGE", "PARKING", "LOT" exist in column "Name" then I would add value "Parking&Garage" into column "Type".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neater way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
grepl("PARKING", df$Name) |
grepl("LOT", df$Name)]<-"Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a list of keywords to change, create a pattern dynamically and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
# Name Type
#1 GARAGE 1 Parking&Garage
#2 GARAGE 2 Parking&Garage
#3 101 GARAGE Parking&Garage
#4 PARKING LOT Parking&Garage
#5 CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7 CITY HALL <NA>
This would be helpful if you need to add more keywords to your list later.
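One thing to watch with a plain "or" pattern is that a short keyword such as LOT also matches inside longer words. A hedged sketch of one way around that, assuming you only want whole-word matches, is to wrap the alternation in word boundaries. The "PILOT SCHOOL" entry is a made-up name added purely to show the difference.

```r
keywords <- c("GARAGE", "PARKING", "LOT")

# Wrap the alternation in \b so that only whole words match.
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# "PILOT SCHOOL" is a hypothetical name that contains "LOT" inside another word.
names_vec <- c("GARAGE 1", "PARKING LOT", "PILOT SCHOOL", "CITY HALL")

type <- ifelse(grepl(pattern, names_vec, perl = TRUE),
               "Parking&Garage", NA_character_)
```

Without the boundaries, "PILOT SCHOOL" would be tagged as Parking&Garage as well.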
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))

Is there a more elegant way to collapse a variable with 88 levels to one with 5 levels?

I have a categorical variable with 88 levels (counties) and I want to aggregate those into five larger geographical regions. Is there a more elegant way to do this than a huge amount of ifelse statements (like below)?
survey.responses$admin<-ifelse(survey.responses$CNTY=="Lake","Northeast",
ifelse(survey.responses$CNTY=="Traverse","Northwest",
ifelse(survey.responses$CNTY=="Ramsey","Central",
ifelse(survey.responses$CNTY=="Cottonwood","South","out of state"))))
Except imagine that CNTY has 88 levels! Any thoughts?
Two quick methods, I recommend the merge one for larger sets.
Data
dat <- data.frame(cnty = c("Lake", "Traverse", "Ramsey", "Cottonwood"),
stringsAsFactors = FALSE)
Merge/join. I prefer this for several reasons, most of all that it is quite easy to maintain a CSV of the matches and read.csv the CSV into the ref lookup table. I'll intentionally leave "Lake" out to show what happens with non-matches.
ref <- data.frame(cnty = c("Cottonwood", "Ramsey", "Traverse", "SomeOther"),
admin = c("South", "Central", "Northwest", "NeverNeverLand"),
stringsAsFactors = FALSE)
out <- merge(dat, ref, by = "cnty", all.x = TRUE)
out
# cnty admin
# 1 Cottonwood South
# 2 Lake <NA>
# 3 Ramsey Central
# 4 Traverse Northwest
The default value is assigned in this way:
out$admin[is.na(out$admin)] <- "out of state"
out
# cnty admin
# 1 Cottonwood South
# 2 Lake out of state
# 3 Ramsey Central
# 4 Traverse Northwest
If you're using other components of tidyverse, this can be done with
library(dplyr)
left_join(dat, ref, by = "cnty") %>%
mutate(admin = if_else(is.na(admin), "out of state", admin))
Lookup. This works fine for small sets, though it is perhaps not the best fit for your case. (Again, I've commented "Lake" out to show the non-match.)
c(Cottonwood="South", # Lake="Northeast",
Ramsey="Central", Traverse="Northwest")[dat$cnty]
# <NA> Traverse Ramsey Cottonwood
# NA "Northwest" "Central" "South"
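A variation that sits between the two methods, sketched here under the assumption that the lookup lives in a CSV: read the table, then use match() to pull the region for each row. Unlike merge, this keeps the rows of dat in their original order. The inline CSV text stands in for a real file (in practice you would pass a file path to read.csv), and "Lake" is again left out to show the default.

```r
dat <- data.frame(cnty = c("Lake", "Traverse", "Ramsey", "Cottonwood"),
                  stringsAsFactors = FALSE)

# Inline CSV used as a stand-in for a maintained lookup file.
ref <- read.csv(text = "cnty,admin
Cottonwood,South
Ramsey,Central
Traverse,Northwest", stringsAsFactors = FALSE)

# match() gives the row of ref for each county, or NA when there is none.
dat$admin <- ref$admin[match(dat$cnty, ref$cnty)]

# Counties missing from the lookup get the default label.
dat$admin[is.na(dat$admin)] <- "out of state"
```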
Unless CNTY follows some pattern that you can build logic on, you need to list those levels manually. One way would be to use case_when from dplyr:
library(dplyr)
survey.responses %>%
mutate(admin = case_when(CNTY %in% c("Lake","Northeast") ~ "GR1",
CNTY %in% c("Traverse","Northwest") ~ "GR2",
CNTY %in% c("Ramsey","Central") ~ "GR3",
TRUE ~ NA_character_))

R - Creating New Column Based off of a Partial String

I have a large dataset (Dataset "A") with a column Description which contains something along the lines
"1952 Rolls Royce Silver Wraith" or "1966 Holden".
I also have a separate dataset (Dataset "B") with a list of every Car Brand that I need (eg "Holden", "Rolls Royce", "Porsche").
How can I create a new column in dataset "A" that assigns the Partial strings of the Description with the correct Car Brand?
(This column would only hold the correct Car Brand with the appropriate matching cell).
Thank you.
Description New Column
1971 Austin 1300 Austin
A solution from the tidyverse
A <- data.frame (Description = c("1970 Austin"),
stringsAsFactors = FALSE)
B <- data.frame (Car_Brand = c("Austin"),
stringsAsFactors = FALSE)
library(tidyverse)
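# Combine all brands into one "or" pattern so this also works when B has many rows
A %>% mutate(New_Column = str_extract(Description, str_c(B$Car_Brand, collapse = "|")))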
# Description New_Column
# 1 1970 Austin Austin
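To see how this behaves with several brands at once, here is a small base R sketch using the descriptions and brands mentioned in the question. The "1980 Unknown" row is a made-up example of a description with no known brand. It assumes the brand names contain no regex metacharacters; if they might, they would need escaping first.

```r
A <- data.frame(
  Description = c("1952 Rolls Royce Silver Wraith", "1966 Holden",
                  "1971 Austin 1300", "1980 Unknown"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Car_Brand = c("Holden", "Rolls Royce", "Porsche", "Austin"),
  stringsAsFactors = FALSE
)

# One pattern that matches any of the brands.
pattern <- paste(B$Car_Brand, collapse = "|")

# regexpr() returns -1 where nothing matched; regmatches() returns only the hits.
m <- regexpr(pattern, A$Description)
A$New_Column <- NA_character_
A$New_Column[m > 0] <- regmatches(A$Description, m)
```

Descriptions without a recognised brand are left as NA rather than dropped.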

Deleting duplicates in R, changing remainder

I have a fairly straightforward question, but I'm very new to R and struggling a little. Basically, I need to delete duplicate rows and then change the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with directors appearing as a new row for each company. I want to have each director appear only once, but with column that lists the number of their board seats (so 1 + the number of duplicates that were removed) and a column that lists the names of all companies on which they sit.
So I want to go from this:
To this
Bonus if I can also get the code to list the director's "home company" as the company on which she/he is an executive rather than an outsider.
Thanks so very much in advance!
N
You could use the ddply function from plyr package
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into one data frame, using Name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Sum the number of boards for each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result with df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this
Name Position Companies NoBoards
1 Abbot, F Executive Director ARM 1
2 Abdool-Samad, T Outsider R 1
3 Abedian, I Outsider FREIT 1
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3
To list each director's "home company" (the company on which she/he is an executive rather than an outsider), you could use the following code:
ExecutiveDirector <- df[df$Position == 'Executive Director', c(1,3)]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the following data frame:
Name Position Companies.x NoBoards Companies.y
1 Abbot, F Executive Director ARM 1 ARM
2 Abdool-Samad, T Outsider R 1 <NA>
3 Abedian, I Outsider FREIT 1 <NA>
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3 FG
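If you would rather avoid plyr, roughly the same result can be sketched in base R with tapply. This uses the sample data from above and assumes each director is an executive at no more than one company. HomeCompany is a column name chosen here for illustration.

```r
df <- data.frame(
  Name = c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I',
           'Abrahams, F', 'Abrahams, F', 'Abrahams, F'),
  Position = c('Executive Director', 'Outsider', 'Outsider',
               'Executive Director', 'Outsider', 'Outsider'),
  Companies = c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG'),
  stringsAsFactors = FALSE
)

# One entry per director: all companies pasted together, and a count of seats.
companies <- tapply(df$Companies, df$Name, paste, collapse = ", ")
n_boards  <- tapply(df$Companies, df$Name, length)

# Named lookup from director to the company where they are an executive.
execs <- df[df$Position == 'Executive Director', ]
home  <- setNames(execs$Companies, execs$Name)

out <- data.frame(
  Name        = names(companies),
  Companies   = as.vector(companies),
  NoBoards    = as.vector(n_boards),
  HomeCompany = unname(home[names(companies)]),  # NA for directors with no executive role
  stringsAsFactors = FALSE
)
```

Counting rows per director replaces the NoBoards column of ones, so that column is not needed in the input.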
