Summing over rows containing particular strings in R - r

I have a dataframe where the first column contains names of campaigns. I need to sum up all rows where the campaign name contains a certain string (the string can appear in different places within the name, e.g. sometimes at the beginning, sometimes at the end). The dataframe looks something like this:
Campaign Impressions
1 Local display 1661246
2 Local text 1029724
3 National display 325832
4 National Audio 498900
5 Audio local 597339
6 TV Regional 597339
...
So in this case I want to sum up all rows containing "local" into one row, "national" into one, "regional" into one etc., like this:
Campaign Impressions
1 Local 939293929
2 National 9232423423
3 Regional 1123123123
How can this be achieved? I've been trying with ddply without success....

You could loop over the categories ('Local', 'National', 'Regional') with lapply and use grep to find the rows whose 'Campaign' matches each one. Subset the dataset ('df') with the grep result, sum the 'Impressions' column, and rbind the list elements.
res1 <- do.call(rbind,lapply(c('Local', 'National', 'Regional'),
function(x) {
x1 <- df[grep(x, df$Campaign, ignore.case=TRUE),]
data.frame(Campaign= x, Impressions=sum(x1$Impressions))}))
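Or use data.table. Use sub to reduce each 'Campaign' value to the keyword it contains ('Local', 'National', 'Region'), converted to upper case by \\U so that 'local' and 'Local' end up in the same group, and use the result ('Category') as the grouping variable when summing 'Impressions'.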
library(data.table)
setDT(df)[, list(Impressions=sum(Impressions)),by=
list(Category=sub('.*?(Local|National|Region).*','\\U\\1', Campaign,
ignore.case=TRUE, perl=TRUE))]
data
df <- structure(list(Campaign = c("Local display", "Local text",
"National display",
"National Audio", "Audio local", "TV Regional"), Impressions =
c(1661246L, 1029724L, 325832L, 498900L, 597339L, 597339L)), .Names =
c("Campaign", "Impressions"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

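I guess you should use the grep function. Say your data.frame is called mydata; then (with ignore.case = TRUE, so that "Audio local" is also counted under "Local"):
Local = grep(mydata$Campaign, pattern = "Local", ignore.case = TRUE)
National = grep(mydata$Campaign, pattern = "National", ignore.case = TRUE)
Regional = grep(mydata$Campaign, pattern = "Regional", ignore.case = TRUE)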
mydata_sum = data.frame(Campaign = c("Local", "National", "Regional"), Impressions = c(sum(mydata$Impressions[Local]), sum(mydata$Impressions[National]), sum(mydata$Impressions[Regional])))
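The same idea extends to any number of keywords without writing one grep call per category. A minimal base R sketch, assuming the example data from the question (rebuilt here under this answer's name mydata) and an illustrative keywords vector that is not part of the original post:

```r
# Example data copied from the question; 'mydata' is the name used in this answer.
mydata <- data.frame(
  Campaign = c("Local display", "Local text", "National display",
               "National Audio", "Audio local", "TV Regional"),
  Impressions = c(1661246, 1029724, 325832, 498900, 597339, 597339),
  stringsAsFactors = FALSE
)

# Keywords to total up (illustrative; extend as needed).
keywords <- c("Local", "National", "Regional")

# For each keyword, sum Impressions over the rows whose Campaign contains it,
# ignoring case so that "Audio local" counts towards "Local".
totals <- vapply(keywords, function(k) {
  sum(mydata$Impressions[grepl(k, mydata$Campaign, ignore.case = TRUE)])
}, numeric(1))

mydata_sum <- data.frame(Campaign = names(totals),
                         Impressions = unname(totals),
                         stringsAsFactors = FALSE)
print(mydata_sum)
```

Note that a campaign name containing two keywords would be counted in both totals with this approach.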

Here's my approach using dplyr:
library(dplyr)
library(stringr)
categories <- "Local|National|Regional"
mydf %>%
  # regex(..., ignore_case = TRUE) replaces the older ignore.case() helper
  mutate(Campaign = tolower(str_extract(Campaign, regex(categories, ignore_case = TRUE)))) %>%
  group_by(Campaign) %>%
  summarise(Impressions = sum(Impressions))
I needed to add the tolower, after extracting the strings, to make sure the group_by groups "local" together with "Local".
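The extract-then-group idea, including the lower-casing step, can also be sketched in base R. This is only an illustration under assumptions: the data is the example from the question, and the names pattern, category and res are made up for the sketch.

```r
# Example data copied from the question.
df <- data.frame(
  Campaign = c("Local display", "Local text", "National display",
               "National Audio", "Audio local", "TV Regional"),
  Impressions = c(1661246, 1029724, 325832, 498900, 597339, 597339),
  stringsAsFactors = FALSE
)

pattern <- "local|national|regional"

# Locate the first keyword in each name, ignoring case.
m <- regexpr(pattern, df$Campaign, ignore.case = TRUE)

# Extract the keyword and lower-case it so "Local" and "local" share a group;
# names without any keyword keep NA and are dropped by aggregate().
category <- rep(NA_character_, nrow(df))
category[m > 0] <- tolower(regmatches(df$Campaign, m))
df$Category <- category

res <- aggregate(Impressions ~ Category, data = df, FUN = sum)
print(res)
```

Without the tolower call, "Audio local" would form its own "local" group next to "Local".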

Related

Can't remove empty `character(0)` or `list()` values from R data frame

I have an R data frame that has character(0) and list() values inside the cells. I want to replace these with NA values.
In the following example, the field "teaser" has this issue, but it can be anywhere in the data frame.
df <- structure(list(body = "BAKER TO VEGAS 2022The Office fielded two squads this year in the 36th Annual Baker to Vegas (“B2V”) Challenge Cup Relay on April 9-10. Members of our 2022 B2V Team include many staff and AUSAs who were joined by office alums and a cadre of friends and family who helped out during some rather brutal conditions this year with temperatures around 100 degrees for much of the days. Most importantly, everyone had fun… and nobody got hurt! It was a great opportunity to meet (and run past) various members of our law enforcement community and to see the amazing logistics of the yearly event. Congratulations to all the participants.",
changed = structure(19156, class = "Date"), created = structure(19156, class = "Date"),
date = structure(19090, class = "Date"), teaser = "character(0)",
title = "Baker to Vegas 2022", url = "https://www.justice.gov/usao-cdca/blog/baker-vegas-2022",
uuid = "cd7e1023-c3ed-4234-b8af-56d342493810", vuuid = "8971702d-6f96-4bbd-ba8c-418f9d32a486",
name = "USAO - California, Central,"), row.names = 33L, class = "data.frame")
I've tried numerous things that don't work, including the following:
df <- na_if(df, "character(0)")
Error in charToDate(x) :
character string is not in a standard unambiguous format
Thanks for your help.
We could use
library(dplyr)
df %>%
mutate(across(where(is.character), ~ na_if(.x, "character(0)")))
Here is a base R way.
create a logical index taking the value TRUE when the columns are of class "character";
create an index list on those columns with lapply;
with Map change the bad values to NA.
i_chr <- sapply(df, is.character)
inx_list <- lapply(df[i_chr], \(x) x == "character(0)")
df[i_chr] <- Map(\(x, i) {is.na(x) <- i; x}, df[i_chr], inx_list)
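Both answers handle only "character(0)", while the title also mentions list(). A small base R sketch that covers several bad values at once, assuming (as in the dput output above) that they are stored as literal strings. The toy data frame and the name bad_values are invented for illustration. Restricting the replacement to character columns also keeps the comparison away from the Date columns, which is presumably what triggered the charToDate error.

```r
# Toy data frame (made up): two character columns and one Date column.
df <- data.frame(
  teaser = c("character(0)", "A short teaser"),
  title  = c("Baker to Vegas 2022", "list()"),
  date   = as.Date(c("2022-04-08", "2022-04-09")),
  stringsAsFactors = FALSE
)

# Strings that should be treated as missing (hypothetical name).
bad_values <- c("character(0)", "list()")

# Only touch character columns; Date and numeric columns are left as they are.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) {
  replace(x, x %in% bad_values, NA_character_)
})
print(df)
```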

Using tidyr::separate_rows on multiple connected rows

I have some data that I scrubbed from an offline source using some text recognition software. It looks something like the data below, but less Elvish.
elvish_ring_holders_unclean <- tibble(
name=c("Gandalf", "Galadriel", "Elrond", "Cirdan\n\nGil-Galad"),
city = c("Undying Lands","Lothlorien","Rivendell", "Mithlond\n\nLindon"),
race = c("Maiar", "Elf", "Elf", "Elf\n\nElf"))
The problem is that certain rows have been concatenated together, separated by blank lines ("\n\n"). What I would prefer is something like the data below, with each observation having its own row:
elvish_ring_holders <- tibble(
name=c("Gandalf", "Galadriel", "Elrond", "Cirdan","Gil-Galad"),
city = c("Undying Lands","Lothlorien","Rivendell", "Mithlond", "Lindon"),
race = c("Maiar", "Elf", "Elf", "Elf", "Elf"))
So far, I have tried a tidyr::separate_rows approach
elvish_ring_holders_unclean %>%
separate_rows(name, sep = "\n\n") %>%
separate_rows(city, sep = "\n\n") %>%
separate_rows(race, sep = "\n\n") %>%
distinct()
But, I end up with a dataset where Gil-Galad and Cirdan both have two observations with two different cities with one true city and one false city.
In my real data, the race variable can also be duplicated in this way, and there are more observations. What I am looking for is a method that separates the rows once, across multiple columns.
Instead of separating each column on its own, do them all in one go.
elvish_ring_holders_unclean %>%
separate_rows(everything(), sep = "\n\n")
  name      city          race
1 Gandalf   Undying Lands Maiar
2 Galadriel Lothlorien    Elf
3 Elrond    Rivendell     Elf
4 Cirdan    Mithlond      Elf
5 Gil-Galad Lindon        Elf
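The reason this works is that all columns are split in parallel, so the pieces stay aligned row by row. A base R sketch of the same idea, assuming every column of a concatenated row holds the same number of pieces (the names unclean and clean are made up for the sketch):

```r
# Same example data as in the question, as a plain data frame.
unclean <- data.frame(
  name = c("Gandalf", "Galadriel", "Elrond", "Cirdan\n\nGil-Galad"),
  city = c("Undying Lands", "Lothlorien", "Rivendell", "Mithlond\n\nLindon"),
  race = c("Maiar", "Elf", "Elf", "Elf\n\nElf"),
  stringsAsFactors = FALSE
)

# Split every column on the blank-line separator and flatten the pieces.
# Because each column is split at the same places, the pieces line up again.
clean <- as.data.frame(
  lapply(unclean, function(col) unlist(strsplit(col, "\n\n", fixed = TRUE))),
  stringsAsFactors = FALSE
)
print(clean)
```

If one column of a row had fewer pieces than the others, the column lengths would differ and as.data.frame would stop with an error, which at least makes the misalignment visible.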

Grouping two data frames using stringdist_join

I am currently working on a project and have reached a problem... I am trying to match two data frames based on a candidate's name. I have managed to do this, however with anything more than a max_dist of 2 I start to get duplicate entries. However, these would be easily avoided if I could 'group' the candidates by race (state and district) before running stringdist_join as there are only a few candidates in each race with very little chance of having two candidates with similar names.
The goal is to obtain a table called tmpJoin where I can have both the candidateID and the canVotes, along with the name, state, district.
Any suggestions would be greatly appreciated!
Below is my code as well as a replication of the two datasets
state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Bevill','Faye Baggiano','Thomas Bevill',
'Don Sledge','William Turner', 'Bill Turner','Ed Smith','Tom Bevill',
'Edward Smith')
canVotes <-c('234','589','9234','729','149','245','879','385','8712','7099')
yearHouseResult <- data.frame(state, district, FullName,canVotes)
state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Beville','Faye Baggiano','Thom Bevill','Donald Sledge','Bill Turner', 'Bill Turner','Ed Smith','Tom Bevill','Ed Smith')
candidateID <- c('1','2','3','4','5','6','7','8','9','10')
congrCands <- data.frame(state, district, FullName, candidateID)
tmpJoin <- stringdist_join(congrCands, yearHouseResult,
by = "FullName",
max_dist=2,
method = "osa",
ignore_case = FALSE,
distance_col = "matchingDistance")
You can test all three conditions with fuzzy_inner_join, also from the fuzzyjoin package.
First I had to change the factors into numerics and characters, because different factor levels will mess with the function.
Some notes on fuzzy_inner_join: the match_fun argument describes the three conditions, and the by argument specifies the columns they apply to, in the same order:
stringdist < 4 for FullName
district must be equal
state must be equal (district is a numeric, state is a character, therefore two different functions are needed to compare these columns)
The table includes more columns than you need, so you might want to select only the needed ones. I just thought it would be easier to check the matches this way.
library(dplyr)
library(fuzzyjoin)
library(stringdist)
library(stringr)
yearHouseResult <- data.frame(state, district, FullName,canVotes) %>%
mutate(state = as.character(state),
district = as.numeric(district),
FullName = as.character(FullName))
congrCands <- data.frame(state, district, FullName, candidateID) %>%
mutate(state = as.character(state),
district = as.numeric(district),
FullName = as.character(FullName))
t <- fuzzy_inner_join(congrCands, yearHouseResult,
match_fun = list(function(x,y) stringdist(x,y,
method="osa") < 4,
`==`,
function(x,y) str_detect(x,y)),
by = c( "FullName", "district", "state"))
If you raise the stringdist threshold from 4 to 5, you will correctly match Ed/Edward Smith but incorrectly match William/Bill Turner. So you need to decide what is more important: clean matches or more matches.
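An alternative way to "group by race first", sketched here without fuzzyjoin: join exactly on state and district, then keep only the pairs whose names are close. This is a sketch under assumptions, not the answer's method. It uses utils::adist (Levenshtein distance, not the "osa" method from the question), a small subset of the question's data, and an illustrative threshold of 2.

```r
# Small subset of the two data frames from the question.
congrCands <- data.frame(
  state = c("AL", "AL", "NY"),
  district = c("01", "02", "02"),
  FullName = c("Sonny Callahan", "Tom Beville", "Tom Bevill"),
  candidateID = c("1", "2", "9"),
  stringsAsFactors = FALSE
)
yearHouseResult <- data.frame(
  state = c("AL", "AL", "AL", "NY"),
  district = c("01", "02", "02", "02"),
  FullName = c("Sonny Callahan", "Tom Bevill", "Faye Baggiano", "Tom Bevill"),
  canVotes = c("234", "589", "9234", "8712"),
  stringsAsFactors = FALSE
)

# Exact join on the race: only candidates from the same state and district pair up.
pairs <- merge(congrCands, yearHouseResult,
               by = c("state", "district"),
               suffixes = c(".cand", ".result"))

# Edit distance between the two names of each candidate pair.
pairs$matchingDistance <- mapply(
  function(a, b) utils::adist(a, b)[1, 1],
  pairs$FullName.cand, pairs$FullName.result,
  USE.NAMES = FALSE
)

# Keep close matches only (threshold chosen for illustration).
tmpJoin <- pairs[pairs$matchingDistance <= 2, ]
print(tmpJoin)
```

Because the Tom Bevill in NY district 02 is never compared with the one in AL, the threshold can be raised without creating cross-race duplicates.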

Change column values depending on other column in R

I have a problem with my data frame.
I have a dataframe with 2 columns, 'word' and 'word_categories'. I created different variables which include the different words, e.g. 'noun' which includes all the nouns of the word column. I now want to change the labels in the word_categories column to the corresponding variable. So if the word in the word column is included in the object 'noun', I want the word_categories column to display 'noun'.
df <- read.csv("palm.csv")
noun <- c("house", ...)
adj <- c("hard", ...)
...
The data frame looks like the following. It includes other columns but they are fine.
word word_categories
house
car
hard
...
I now want to look, if the words are in any of the created objects and if so, I want the corresponding label printed in the word_categories column. So for 'house' the column should show noun, for 'hard' it should show adjective. If the word is in none of the objects, it should show nothing or 'NA'.
I tried it with the following:
palm$word_categories <- ifelse(palm$word == noun, "noun",
ifelse(palm$word == adj, "adjective", ""))
This, however, doesn't work at all and I have 7 Objects in total so the statement becomes ridiculously long. How do I do it properly?
If the dataframe is called palm (you first call it df but later you use palm) and noun and adj are vectors as you define above, I would do:
library(dplyr)
palm <- palm %>%
mutate(word_categories = case_when(word %in% noun ~ "noun",
word %in% adj ~ "adjective",
TRUE ~ NA_character_))
One way would be to create a named vector of your noun/adjective dictionaries to select each element. The name would be the word and the corresponding data would be noun, adjective etc. You didn't really supply any data so I made some up.
df <- data.frame(
stringsAsFactors = FALSE,
word = c("dog", "short", "bird", "cat", "short", "man")
)
nounName <- c('dog', 'cat', 'bird')
adjName <- c('quick', 'brown', 'short')
noun <- rep('noun', length(nounName))
adj <- rep('adjective', length(adjName))
names(noun) <- nounName
names(adj) <- adjName
partsofspeech <- c(noun, adj)
df$word_categories <- partsofspeech[df$word]
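With seven word lists, the named vector can be built from a single named list instead of one rep/names pair per category. A minimal sketch, assuming the word lists are collected in a list called dictionaries (a hypothetical name) and using made-up words beyond those shown in the question:

```r
# One entry per category; the list names become the labels.
# (Words are illustrative; "house" and "hard" come from the question.)
dictionaries <- list(
  noun = c("house", "car"),
  adjective = c("hard")
)

# Named lookup vector: names are the words, values are the category labels.
lookup <- setNames(
  rep(names(dictionaries), lengths(dictionaries)),
  unlist(dictionaries, use.names = FALSE)
)

palm <- data.frame(word = c("house", "car", "hard", "quickly"),
                   stringsAsFactors = FALSE)

# Words that appear in no list get NA.
palm$word_categories <- unname(lookup[palm$word])
print(palm)
```

Adding an eighth category then only means adding one more element to the list.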

Unlist column to create unique row in dataframe

I am faced with the following R transformation issue.
I have the following dataframe:
test_df <- structure(list(word = c("list of XYZ schools",
"list of basketball", "list of usa"), results = c("58", "151", "29"), key_list = structure(list(`coRq,coG,coQ,co7E,coV98` = c("coRq", "coG", "coQ", "co7E", "coV98"), `coV98,coUD,coHF,cobK,con7` = c("coV98","coUD", "coHF", "cobK", "con7"), `coV98,coX7,couC,coD3,copW` = c("coV98", "coX7", "couC", "coD3", "copW")), .Names = c("coRq,coG,coQ,co7E,coV98", "coV98,coUD,coHF,cobK,con7", "coV98,coX7,couC,coD3,copW"))), .Names = c("word", "results", "key_list"), row.names = c(116L, 150L, 277L), class = "data.frame")
In short there are three columns, unique on "word" and then a corresponding "key_list" that has a list of keys comma separated. I am interested in creating a new data frame where each key is unique and the word information is duplicated as well as the result information.
So a dataframe that looks as follows:
key word results
coV98 "list of XYZ schools" 58
coRq "list of XYZ schools" 58
coV98 "list of basketball" 151
coV98 "list of usa" 29
And so on for all the keys, so I would like to expand the keys unlist them and then reshape into a dataframe with repeating words and other columns.
I have tried a bunch of the following:
Created a unique list of keys and then attempted to grep for each of those keys in the column and loop through to create a new smaller dataframe and then rbind those together, the resulting dataframe however does not contain the key column:
keys <- as.data.frame(table(unname(unlist(test_df$key_list))))
ttt <- lapply(keys, function(xx){
idx <- grep(xx, test_df$key_list)
df <- all_data_sub[idx,]})
final_df <- do.call(rbind, ttt)
I have also played around with unlisting and reshaping, but I am not getting the right combination.
Any advice would be great!
thanks
Maybe we can use listCol_l from splitstackshape:
library(splitstackshape)
listCol_l(test_df, 'key_list')[]
In case a base R solution is helpful for someone:
do.call(rbind, lapply(seq_along(test_df$key_list), function(i) {
merge(test_df$key_list[[i]], test_df[i,-3], by=NULL)
}))
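Another base R variant avoids the per-row merge by repeating each row's values once per key. A sketch on a shortened version of the question's data (the key lists are truncated for brevity, and the names n_keys and long_df are made up):

```r
# Shortened example: two words, each with a list of keys.
test_df <- data.frame(
  word = c("list of XYZ schools", "list of usa"),
  results = c("58", "29"),
  stringsAsFactors = FALSE
)
test_df$key_list <- list(c("coRq", "coV98"), c("coV98", "coX7", "couC"))

# Number of keys in each row.
n_keys <- lengths(test_df$key_list)

# One output row per key; word and results are repeated to match.
long_df <- data.frame(
  key = unlist(test_df$key_list),
  word = rep(test_df$word, times = n_keys),
  results = rep(test_df$results, times = n_keys),
  stringsAsFactors = FALSE
)
print(long_df)
```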
