I have some data that I scraped from an offline source using text recognition software. It looks something like the data below, but less Elvish.
elvish_ring_holders_unclean <- tibble(
  name = c("Gandalf", "Galadriel", "Elrond", "Cirdan\n\nGil-Galad"),
  city = c("Undying Lands", "Lothlorien", "Rivendell", "Mithlond\n\nLindon"),
  race = c("Maiar", "Elf", "Elf", "Elf\n\nElf"))
The problem, for both datasets, is that certain rows have been concatenated together, separated by double newlines (\n\n). What I would prefer is something like the data below, with each observation in its own row:
elvish_ring_holders <- tibble(
  name = c("Gandalf", "Galadriel", "Elrond", "Cirdan", "Gil-Galad"),
  city = c("Undying Lands", "Lothlorien", "Rivendell", "Mithlond", "Lindon"),
  race = c("Maiar", "Elf", "Elf", "Elf", "Elf"))
So far, I have tried a tidyr::separate_rows approach:
elvish_ring_holders_unclean %>%
  separate_rows(name, sep = "\n\n") %>%
  separate_rows(city, sep = "\n\n") %>%
  separate_rows(race, sep = "\n\n") %>%
  distinct()
But I end up with a dataset where Cirdan and Gil-Galad each have two observations with two different cities: one true city and one false city.
In my actual data, the race variable can also duplicate in this way, and the data has more observations. What I am looking for is some method of separating rows that separates all the columns in a single pass.
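For reference, here is the attempt above as a complete reprex (package loads added), showing the cross-join problem:

```r
library(tibble)
library(dplyr)
library(tidyr)

elvish_ring_holders_unclean <- tibble(
  name = c("Gandalf", "Galadriel", "Elrond", "Cirdan\n\nGil-Galad"),
  city = c("Undying Lands", "Lothlorien", "Rivendell", "Mithlond\n\nLindon"),
  race = c("Maiar", "Elf", "Elf", "Elf\n\nElf"))

# Separating one column at a time cross-joins the split values:
elvish_ring_holders_unclean %>%
  separate_rows(name, sep = "\n\n") %>%
  separate_rows(city, sep = "\n\n") %>%
  separate_rows(race, sep = "\n\n") %>%
  distinct()
# 7 rows: Cirdan and Gil-Galad each appear with both Mithlond and Lindon
```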
Instead of separating each column on its own, do them all in one go:
elvish_ring_holders_unclean %>%
  separate_rows(everything(), sep = "\n\n")
# A tibble: 5 x 3
  name      city          race
  <chr>     <chr>         <chr>
1 Gandalf   Undying Lands Maiar
2 Galadriel Lothlorien    Elf
3 Elrond    Rivendell     Elf
4 Cirdan    Mithlond      Elf
5 Gil-Galad Lindon        Elf
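If your tidyr is 1.3.0 or newer, I believe the same single-pass separation can also be written with separate_longer_delim, which supersedes separate_rows there (a sketch assuming that version is available):

```r
library(tibble)
library(tidyr)

elvish_ring_holders_unclean <- tibble(
  name = c("Gandalf", "Galadriel", "Elrond", "Cirdan\n\nGil-Galad"),
  city = c("Undying Lands", "Lothlorien", "Rivendell", "Mithlond\n\nLindon"),
  race = c("Maiar", "Elf", "Elf", "Elf\n\nElf"))

# Split all three columns in lockstep on the same delimiter
elvish_ring_holders_unclean |>
  separate_longer_delim(c(name, city, race), delim = "\n\n")
```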
I am currently working on a project and have run into a problem. I am trying to match two data frames based on a candidate's name. I have managed to do this; however, with anything more than a max_dist of 2 I start to get duplicate entries. These would be easily avoided if I could group the candidates by race (state and district) before running stringdist_join, as there are only a few candidates in each race and very little chance of two candidates having similar names.
The goal is to obtain a table called tmpJoin that has both the candidateID and the canVotes, along with the name, state, and district.
Any suggestions would be greatly appreciated!
Below is my code as well as a replication of the two datasets
library(fuzzyjoin)

state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <- c('01','02','02','03','01','01','02','01','02','02')
FullName <- c('Sonny Callahan','Tom Bevill','Faye Baggiano','Thomas Bevill',
              'Don Sledge','William Turner','Bill Turner','Ed Smith',
              'Tom Bevill','Edward Smith')
canVotes <- c('234','589','9234','729','149','245','879','385','8712','7099')
yearHouseResult <- data.frame(state, district, FullName, canVotes)

state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <- c('01','02','02','03','01','01','02','01','02','02')
FullName <- c('Sonny Callahan','Tom Beville','Faye Baggiano','Thom Bevill',
              'Donald Sledge','Bill Turner','Bill Turner','Ed Smith',
              'Tom Bevill','Ed Smith')
candidateID <- c('1','2','3','4','5','6','7','8','9','10')
congrCands <- data.frame(state, district, FullName, candidateID)
tmpJoin <- stringdist_join(congrCands, yearHouseResult,
                           by = "FullName",
                           max_dist = 2,
                           method = "osa",
                           ignore_case = FALSE,
                           distance_col = "matchingDistance")
You can test all three conditions with fuzzy_inner_join, also from the fuzzyjoin package.
First I had to change the factors into numerics and characters, because differing factor levels will interfere with the function.
Some information on the fuzzy_join call: the match_fun argument holds the three matching conditions, and by specifies the columns those conditions apply to.
stringdist < 4 for FullName
district must be equal
state must be equal (district is numeric and state is character, so two different comparison functions are needed for these columns)
The resulting table includes more columns than you need, so you might select only the ones required. I just thought it would be easier to control the matches this way.
library(dplyr)
library(stringr)
library(stringdist)
library(fuzzyjoin)

yearHouseResult <- data.frame(state, district, FullName, canVotes) %>%
  mutate(state = as.character(state),
         district = as.numeric(district),
         FullName = as.character(FullName))

congrCands <- data.frame(state, district, FullName, candidateID) %>%
  mutate(state = as.character(state),
         district = as.numeric(district),
         FullName = as.character(FullName))
t <- fuzzy_inner_join(congrCands, yearHouseResult,
                      match_fun = list(function(x, y) stringdist(x, y, method = "osa") < 4,
                                       `==`,
                                       function(x, y) str_detect(x, y)),
                      by = c("FullName", "district", "state"))
If you increase the stringdist threshold from 4 to 5, you will correctly match Ed/Edward Smith but incorrectly match William/Bill Turner. So you need to decide what's more important: a clean match or more matches.
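A quick way to see that tradeoff is to compute the distances directly with the stringdist package:

```r
library(stringdist)

# "Ed Smith" -> "Edward Smith": four insertions ("ward")
stringdist("Ed Smith", "Edward Smith", method = "osa")       # 4

# "Bill Turner" -> "William Turner": one substitution plus three insertions
stringdist("Bill Turner", "William Turner", method = "osa")  # 4
```

Both pairs sit at distance 4, so a threshold of < 5 admits both, while < 4 admits neither.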
I have materials lists from vendors, and I would like to expand the description into other columns so I can use Excel's filter function to find products by their description more easily. Here's an example of a description I receive from a vendor:
2 SS 150LB 304L SLIP ON FLANGE
I would like to take this description and have R identify certain bits of text, and based on that text, add data to another column. For instance: if the string "SS" is in this cell, then put the word "STAINLESS" in a Materials column. If the string "BLK" is found in this cell, then put the word "BLACK" in the Materials column. If the string "FLANGE" is found in this cell, then put the word "FLANGE" in another column called Part_Type.
Here is one simple approach which looks for certain character sequences to use as a trigger to add strings to other columns.
library(tidyverse)

df <- tibble(x = c('2 SS 150LB 304L SLIP ON FLANGE',
                   '3 BLK ON FLANGE'))

# add new columns filled with NA
df <- df %>%
  add_column(Materials = NA_character_) %>%
  add_column(Part_Type = NA_character_)
df %>%
  mutate(Materials = if_else(str_detect(x, 'SS'), 'STAINLESS', Materials)) %>%
  mutate(Materials = if_else(str_detect(x, 'BLK'), 'BLACK', Materials)) %>%
  mutate(Part_Type = if_else(str_detect(x, 'FLANGE'), 'FLANGE', Part_Type))
Can an item be both 'stainless steel' and 'black'? That is, do we want to add multiple strings to one column? In that case it is necessary to append rather than overwrite. Here's one approach to that problem.
my_nrow <- 2

df <- tibble(x = c('2 SS 150LB 304L SLIP ON FLANGE',
                   '3 BLK SS ON FLANGE'),
             Materials = vector('character', my_nrow),
             Part_type = vector('character', my_nrow))
df

df %>%
  mutate(Materials = if_else(str_detect(x, 'SS'), str_c(Materials, 'STAINLESS '), Materials)) %>%
  mutate(Materials = if_else(str_detect(x, 'BLK'), str_c(Materials, 'BLACK '), Materials)) %>%
  mutate(Part_type = if_else(str_detect(x, 'FLANGE'), str_c(Part_type, 'FLANGE', sep = ' '), Part_type))
If I manually create the two data frames, the code does what it was intended to do:
df1 <- structure(list(CompanyName = c("Google", "Tesco")),
                 .Names = "CompanyName", class = "data.frame",
                 row.names = c(NA, -2L))
df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank",
                                                 "tesco insurance", "google finance",
                                                 "google play")),
                 .Names = "CompanyVariationsNames", class = "data.frame",
                 row.names = c(NA, -5L))
test <- df2 %>%
  rowwise() %>%
  mutate(CompanyName = as.character(Filter(length,
    lapply(df1$CompanyName,
           function(x) x[grepl(x, CompanyVariationsNames, ignore.case = TRUE)])))) %>%
  group_by(CompanyName) %>%
  summarise(Variation = paste(CompanyVariationsNames, collapse = ",")) %>%
  cSplit("Variation", ",")
this produces the following result:
CompanyName Variation_1 Variation_2 Variation_3
1: Google google plc google finance google play
2: Tesco tesco bank tesco insurance NA
But if I import a data set (using read.csv), I get the following error: Error in mutate_impl(.data, dots) : Column CompanyName must be length 1 (the group size), not 0. My data sets are rather large: df1 has 1,000 rows and df2 has 54k rows.
Is there a specific reason why the code works when the data set is created manually but not when the data is imported?
df1 contains company names and df2 contains variations of those names.
help please!
Importing from CSV can be tricky. Check whether the default separator (comma) actually applies to your file. If not, you can change it by setting the sep argument to a character that works (e.g. read.csv(file_path, sep = ";")). This is a common problem in my country due to our local conventions.
In fact, if your standard is semicolons, read.csv2(file_path) will suffice.
Also, to avoid further trouble: it is very common for CSVs to mess up columns with decimal values, because here we use commas as decimal separators rather than dots. It is worth checking whether this is a problem in any of the other columns of your file too.
If that is your case, you can set the appropriate parameter in either read.csv or read.csv2 with dec = "," (e.g. read.csv(file_path, sep = ";", dec = ",")).
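As a quick illustration of those two arguments together (the inline txt string is a hypothetical stand-in for a file on disk):

```r
# Hypothetical semicolon-separated data with comma decimal marks
txt <- "name;score\nAna;3,14\nBruno;2,71"

# sep and dec tell read.csv about the non-default conventions
df <- read.csv(text = txt, sep = ";", dec = ",")
df$score  # parsed as numeric: 3.14 2.71
```

With the defaults (sep = ",", dec = ".") the same input would collapse into a single malformed column.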
I am trying to create a large list of file URLs by concatenating various pieces together. (Say, ~40 file URLs which represent multiple data types for each of the 50 states.) Eventually, I will download and then unzip/unrar these files. (I have working code for that portion of it.)
I'm very much an R noob, so please bear with me, here.
I have a set of data frames:
states - a list of 50 state abbreviations
partial_url - a partial URL for the 50 states
url_parts - a list of each of the remaining URL pieces (5 file types to download)
year
filetype
I need a URL that looks like this:
http://partial_url/state_urlpart_2017_file.csv.gz
I was able to build the partial_url data frame with the following:
for (i in seq_along(states)) {
  url_part1 <- as.data.frame(paste0(url, states[[i]], "/", dir, "/"))
}
I was hoping that some kind of nested loop might work to do the rest, like so:
for (i in 1:partial_url) {
  for (j in 1:url_parts) {
    for (k in 1:states) {
      url_part2 <- as.data.frame(paste0(partial_url[[i]], "/", url_parts[[j]], states[[k]], year, filetype))
    }
  }
}
Can anyone suggest how to proceed with the final step?
In my understanding, everything the OP needs can be handled by the paste0 function itself. paste0 is vectorised, so the for-loop shown by the OP is not needed. The data in my example is stored in vectors, but it could equally be columns of a data.frame.
For example:
states <- c("Alabama", "Colorado", "Georgia")
partial_url <- c("URL_1", "URL_2", "URL_3")
url_parts <- c("PART_1", "PART_2", "PART_3")
year <- 2017
fileType <- "xls"
# Now use paste0 to list out all the URLs
paste0(partial_url,"/",url_parts,states,year,fileType)
#[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls"
#[3] "URL_3/PART_3Georgia2017xls"
EDIT: multiple fileType values, based on feedback from @Onyambu.
We can use rep(fileType, each = length(states)) to support multiple file types.
The solution looks like this:
fileType <- c("xls", "doc")
paste0(partial_url,"/",url_parts,states,year,rep(fileType,each = length(states)))
[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls" "URL_3/PART_3Georgia2017xls"
[4] "URL_1/PART_1Alabama2017doc" "URL_2/PART_2Colorado2017doc" "URL_3/PART_3Georgia2017doc"
Here is a tidyverse solution with some simple example data. The approach is to use complete to give yourself a data frame with all possible combinations of your variables. This works because if you make each variable a factor, complete will include all possible factor levels even if they don't appear. This makes it easy to combine your five url parts even though they appear to have different nrow (e.g. 50 states but only 5 file types). unite allows you to join together columns as strings, so we call it three times to include the right separators, and then finally add the http:// with mutate.
Re: your for loop, I find nested for-loop logic hard to work through in the first place. But there are at least two issues as written: you have 1:partial_url instead of 1:length(partial_url) (and similarly for the others), and you are simply overwriting the same object on every pass of the loop. I prefer to avoid loops unless they're absolutely necessary.
library(tidyverse)
states <- tibble(state = c("AK", "AZ", "AR", "CA", "NY"))
partial_url <- tibble(part = c("part1", "part2"))
url_parts <- tibble(urlpart = c("urlpart1", "urlpart2"))
year <- tibble(year = 2007:2010)
filetype <- tibble(filetype = c("csv", "txt", "tar"))
urls <- bind_cols(
  states = states[[1]] %>% factor() %>% head(2),
  partial_url = partial_url[[1]] %>% factor() %>% head(2),
  url_parts = url_parts[[1]] %>% factor() %>% head(2),
  year = year[[1]] %>% factor() %>% head(2),
  filetype = filetype[[1]] %>% factor() %>% head(2)
) %>%
  complete(states, partial_url, url_parts, year, filetype) %>%
  unite("middle", states, url_parts, year, sep = "_") %>%
  unite("end", middle, filetype, sep = ".") %>%
  unite("url", partial_url, end, sep = "/") %>%
  mutate(url = str_c("http://", url))
print(urls)
# A tibble: 160 x 1
url
<chr>
1 http://part1/AK_urlpart1_2007.csv
2 http://part1/AK_urlpart1_2007.txt
3 http://part1/AK_urlpart1_2008.csv
4 http://part1/AK_urlpart1_2008.txt
5 http://part1/AK_urlpart1_2009.csv
6 http://part1/AK_urlpart1_2009.txt
7 http://part1/AK_urlpart1_2010.csv
8 http://part1/AK_urlpart1_2010.txt
9 http://part1/AK_urlpart2_2007.csv
10 http://part1/AK_urlpart2_2007.txt
# ... with 150 more rows
Created on 2018-02-22 by the reprex package (v0.2.0).
I have a csv file, 'Campaignname.csv':
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
Desired Output in R
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
The code for the solution was given in a previous Stack Overflow answer as:
## read the csv file - modify next line as needed
xx <- read.csv("Campaignname.csv", header = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- xx$AdvertiserName
ss <- stack(s)
DF <- with(ss, data.frame(AdvertiserName = ind,
                          City = sub(" ..$", "", values),
                          State = sub(".* ", "", values)))
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
But now another column, 'Identity', is included, where the input is:
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
And the desired result is
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
The number of columns may not be limited to just 3; it may keep increasing.
How can I do this in R? I'm new to R. Any help is appreciated.
I'm not sure that I fully understand your question, and you didn't provide a reproducible example (so I can't run your code and try to get to the end point you want). But I'll still try to help.
Generally speaking, in R you can add a new column to a data.frame simply by using it.
df = data.frame(advertiser = c("co1", "co2", "co3"),
                campaign = c("camp1", "camp2", "camp3"))
df
  advertiser campaign
1        co1    camp1
2        co2    camp2
3        co3    camp3
At this point, if I wanted to add an identity column I would simply create it with the $ operator like this:
df$identity = c(1, 2, 3)
df
  advertiser campaign identity
1        co1    camp1        1
2        co2    camp2        2
3        co3    camp3        3
Note that there are other ways to accomplish this - see the transform (?transform) and rbind (?rbind) functions.
The caveat when adding a column to a data.frame is that (I believe) you must add a vector with the same number of elements as there are rows in the data.frame. You can see the number of rows in the data.frame by typing nrow(df).
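For completeness, here is what the transform route mentioned above looks like (a minimal sketch; cbind is shown alongside it):

```r
df <- data.frame(advertiser = c("co1", "co2", "co3"),
                 campaign = c("camp1", "camp2", "camp3"))

# transform() adds (or recomputes) columns in a single call
df2 <- transform(df, identity = c(1, 2, 3))

# cbind() appends columns too; rbind() is the row-wise counterpart
df3 <- cbind(df, identity = c(1, 2, 3))
```

Both give the same three-column result as the $ assignment above.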