Subsetting to Output only certain CHAR names in a single column - r

I want to be able to keep all rows where the "conm" column does contain certain bank names. you can tell from the code I am trying to use subset to do this but to no avail.
I have tried using subset to do this.
CMPSTPRFT12 <- subset(CMPSPRFT11, conm = MORGUARD CORP | conm = LEHMAN BROTHERS HOLDINGS INC)
I expect the output in rstudio to just show all rows where the column containing the names of banks includes certain banks, not all banks. I want SUnTrust, Lehman Brothers, Morgan Stanley, Goldman Sachs, PennyMac, Bank of America, and Fannie Mae.

Please see other posts on how to phrase your questions more helpfully for others. How to make a great R reproducible example
You can use dplyr and filter.
df <- data.frame(bank=letters[1:10],
value=10:19)
df %>% filter(bank=='a' | bank=='b')
bank value
1 a 10
2 b 11
banks <- c('d','g','j')
df %>% filter(bank %in% banks)
bank value
1 d 13
2 g 16
3 j 19

Related

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(address=c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd','1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'), city=c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'), state=c('GA','OR','OR','OR','OR','OR','PA','PA'), zip5=c('30308','97404','97404','97404','97201','97201','15223','15223'), zip9=c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717','97201-5000','15223-2105','15223-2105'), stringsAsFactors = FALSE)
`address city state zip5 zip9
1 654 Peachtree St Atlanta GA 30308 30308-1929
2 8910 River Rd Eugene OR 97404 97404-3253
3 8910 River Rd Eugene OR 97404 97404-3253
4 8910 River Rd Eugene OR 97404 97404-3253
5 1234 Main St Portland OR 97201 97201-5717
6 1234 Main St Portland OR 97201 97201-5000
7 567 1st Ave Pittsburgh PA 15223 15223-2105
8 567 1st Ave Etna PA 15223 15223-2105`
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
`address city state zip5 zip9 type
1 8910 River Rd Eugene OR 97404 97404-3253 Exact Match
2 8910 River Rd Eugene OR 97404 97404-3253 Exact Match
3 8910 River Rd Eugene OR 97404 97404-3253 Exact Match
4 1234 Main St Portland OR 97201 97201-5717 Different Zip9
5 1234 Main St Portland OR 97201 97201-5000 Different Zip9
6 567 1st Ave Pittsburgh PA 15223 15223-2105 Different City
7 567 1st Ave Etna PA 15223 15223-2105 Different City`
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by "reference" variable--here address makes sense--and then you can count the number of unique items in that vector. It's not a perfect solution since my use of case_when will prioritize earlier options (i.e. if there are two different cities attributed to one address AND two different zip codes, you'll only see that there are two different cities--you will need to address this if it matters with additional case_when statements). However, getting the length of unique items is a reasonable heuristic in this case if you don't need a perfectly granular solution.
df %>%
group_by(address) %>%
mutate(
match_type = case_when(
all(
length(unique(city)) == 1,
length(unique(state)) == 1,
length(unique(zip5)) == 1,
length(unique(zip9)) == 1) ~ "Exact Match",
length(unique(city)) > 1 ~ "Different City",
length(unique(state)) > 1 ~ "Different State",
length(unique(zip5)) > 1 ~ "Different Zip5",
length(unique(zip9)) > 1 ~ "Different Zip9"
))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach I just thought of if you need a more granular solution is to utilize the addition of an id column (df %>% rowid_to_column("ID")) and then a full join of the table to itself by address with suffixes (e.g. suffix = c("a","b")), filtering out same IDs and calling distinct (since each comparison is there twice), and then you can make Boolean columns with mutate for the pairwise comparisons. It may be too computationally intensive, depending on the size of your dataset, but it should work on the scale of a few thousand if you have a reasonable amount of RAM.

How do I replace values in an R dataframe column with a corresponding value?

Ok, so I have a dataframe that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (word document). I'm new to R and while I've been searching for a few hours for the best way to go about this, I can't seem to find a method that would work (or I just don't understand the explanation). Can anybody suggest a method to me?
If you have a vector of the state names as strings called statevec whose ith element corresponds to cregion i, and your data frame is named dat, just do
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
# cregion stuff
#1 25 0.665843896
#2 11 0.144631131
#3 13 0.691616240
#4 28 0.507454243
#5 9 0.416535139
#6 30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
# cregion stuff
#1 Missouri 0.665843896
#2 Hawaii 0.144631131
#3 Illinois 0.691616240
#4 Nevada 0.507454243
#5 Florida 0.416535139
#6 New Jersey 0.004196311

How to write a for-loop that searches names from data.frame in a character vector?

I have a data.frame with names of football players, for example:
names <- data.frame(id=c(1,2,3,4,5,6,7),
year=c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'))
> names
id year
1 1 Maradona
2 2 Cruyff
3 3 Messi
4 4 Ronaldo
5 5 Pele
6 6 Van Basten
7 7 Diego
I also have a 6,000 scraped text files, containing stories about these football players. These stories are stored as 6,000 elements in a large vector called stories.
Is there a way a loop (or an apply function) can be written that searches for the names of each of the football players. If a match or multiple matches occur, I would like to record the element number and the name(s) of the football player.
For example, consider the following text in stories[1]:
Diego Armando Maradona (born 30 October 1960) is a retired Argentine
professional footballer. He has served as a manager and coach at other
clubs as well as the national team of Argentina. Many in the sport,
including football writers, former players, current players and
football fans, regard Maradona as the greatest football player of all
time. He was joint FIFA Player of the 20th Century
with Pele.
The ideal data.frame would have the following structure:
> outcome
element name1 name2
1 1 Maradona Pele
Does somebody know a way to write such a code that results in one data.frame for with information on all football players?
I just did it with a loop, but maybe you can do it with an apply function
#Make sure you include stringsAsFactors = F or my code won't work
football_names <- data.frame(id=c(1:7),
year=c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'),stringsAsFactors = F)
outcome <- data.frame(element=football_names$id)
for (i in 1:nrow(football_names)){
names_in_story <- football_names$year[football_names$year %in% unlist(strsplit(stories[i],split=" "))]
for (j in 1:length(names_in_story)){
outcome[i,j+1] <- names_in_story[j]
}
}
names(outcome) <- c("element",paste0("name",1:(ncol(outcome)-1)))
I don't undertsand your question exactly. But you can try to use a string match using astringr function and lapply.
I assumed that your data stories is a list.
The function finds all names you provide into the function as a vector and counts their occurence. The output is again a list.
foo <- function(x,y) table(unlist(str_match_all(x,paste0(y,collapse = "|"))))
The result
res <- lapply(series, foo,names$year)
Then you can merge and sum up the data (rowSums()) for example like this:
Reduce(function(...) merge(..., all=T, by="Var1"), res)

Averaging rows based upon a known, irregular relationship using R

I have data on energy companies whose jurisdiction overlaps in places. I want to be able to compute an average of sales for the places where these companies overlap. These companies will always overlap - so how can I use this information to calculate the averages just for those pairs? There are about 20 pairs of companies.
data <- data.frame(Company = c("Energy USA","Good Energy",
"Hydropower 4 U",
"Coal Town",
"Energy USA/Good Energy",
"Good Energy/Coal Town"),
Sales = c(100, 2500, 550, 6000, "?", "?"))
Company Sales
1 Energy USA 100
2 Good Energy 2500
3 Hydropower 4 U 550
4 Coal Town 6000
5 Energy USA/Good Energy ? (Answer: 1300)
6 Good Energy/Coal Town ? (Answer: 4250)
We use 'grep' to get index of 'Company' elements that have more than one entries i.e. separated by '/'. Then, split those elements by the delimiter (output will be a list), loop through the list with sapply, match the elements with the 'Company' column to get the position, use that to get the corresponding 'Sales' elements. As the 'Sales' column was factor, we need to convert it to numeric to get the mean. When we convert factor to numeric class, all non-numeric elements i.e. ? will be converted to NA. Replace those NA elements with the mean values.
i1 <- grepl('/', data$Company)
v1 <- sapply(strsplit(as.character(data$Company[i1]), '/'),
function(x) mean(as.numeric(as.character(data$Sales[match(x,
data$Company)]))))
data$Sales <- as.numeric(as.character(data$Sales))
data$Sales[is.na(data$Sales)] <- v1
data
# Company Sales
#1 Energy USA 100
#2 Good Energy 2500
#3 Hydropower 4 U 550
#4 Coal Town 6000
#5 Energy USA/Good Energy 1300
#6 Good Energy/Coal Town 4250
Without knowing how your original data is, it is hard to give a working answer. However, assuming your data has Company and Sales columns with multiple rows for each company, you can do something like this:
mean(data$Sales[data$Company %in% c('Energy USA', 'Good Energy')]])
mean(data$Sales[data$Company %in% c('Good Energy', 'Coal Town')]])
you could create a new column "jurisdiction" in "data", if your dataset is rather small..
MeansByJurisdiction <- tapply(data$sales, data$jurisdiction, mean)
then you could convert the vector to dataframe
MeansByJurisdiction <- data.frame(MeansByJurisdiction)
the rownames in the MeansByJurisdiction dataframe will be populated with the jurisdictions and you can extract them with a simple line of code:
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)

Consolidate data table factor levels in R

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to try and consider similar names as one factor level. I have learned the syntax to manually do this, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect as I can't anticipate every type of permutation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)
# 1 2 3 4 5
# 2 1
# 3 9 10
# 4 6 7 15
# 5 16 16 16 18
# 6 15 15 15 17 1
For more help, see ?stringdistmatrix or do searches on StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.

Resources