I have a list of indicators with periods in their names, and I want to replace those periods with spaces. I know that the gsub() function can replace punctuation, but every time I try to replace the dots with spaces the list returns NULL.
list_AM = list(list(geo = "EU", sales="West.Europe.Sales",
indicator = list("SA","NSA","composites_industry_value","DUCS","WUCS","T30","Rovings",
"Mats","WE.Construction.Gross.output..sales...Real.USD","WE.Construction.Production.index","WE.Glass.Gross.operating.surplus..profits...Nominal.USD",
"WE.Glass.Gross.output..sales...Nominal.USD","WE.Glass.Investment..Nominal.USD","WE.Glass.Production.index","WE.Glass.Value.added.output..As.a.percent.of.GDP",
"WE.Glass.Value.added.output..As.a.percent.of.manufacturing","WE.Glass.Value.added.output..As.a.percent.of.world.total","WE.Industrial.Production.Gross.operating.surplus..profits...Nominal.USD",
"WE.Industrial.Production.Gross.output..sales...Nominal.USD","WE.Glass.Investment..Nominal.USD","WE.Glass.Production.index","WE.Glass.Value.added.output..As.a.percent.of.GDP","WE.Glass.Value.added.output..As.a.percent.of.manufacturing",
"WE.Glass.Value.added.output..As.a.percent.of.world.total","WE.Industrial.Production.Gross.operating.surplus..profits...Nominal.USD","WE.Industrial.Production.Gross.output..sales...Nominal.USD","WE.Industrial.Production.Production.index",
"WE.Industrial.Production.Value.added.output..As.a.percent.of.GDP","WE.Industrial.Production.Value.added.output..As.a.percent.of.world.total","WE.Manufacturing.Gross.operating.surplus..profits...Nominal.USD","WE.Manufacturing.Gross.output..sales...Nominal.USD",
"WE.Manufacturing.Investment..Nominal.USD","WE.Manufacturing.Production.index","WE.Manufacturing.Value.added.output..As.a.percent.of.GDP","WE.Manufacturing.Value.added.output..As.a.percent.of.world.total","WE.Current.account.of.balance.of.payments.in.US...share.of.GDP",
"WE.Employment..total.1","WE.External.debt..total..US.","WE.Foreign.direct.investment..US.","WE.GDP.per.capita..nominal..US.","WE.GDP..nominal..US.","WE.Government.balance..share.of.GDP","WE.Population..total","WE.Reserves..foreign.exchange..US.",
"WE.Reserves..months.of.import.cover","WE.Stockbuilding..real..share.of.GDP","WE.Visible.trade.balance..share.of.GDP","WE.Consumer.price.index","WE.Gross.government.debt..as.a...of.GDP.","WE.Industrial.production.index","WE.Interest.rate..short.term",
"WE.Interest.rate..Yield.on.10.year.Government.Debt.Securities....per.annum.","WE.Services.balance..as...of.GDP","WE.Share.price.index","WE.Unemployment.rate","WE.Capacity.utilisation","WE.Consumption..government..PPP.exchange.rate..nominal..US.","WE.Consumption..government..nominal..US.",
"WE.Consumption..government..nominal..share.of.GDP.1","WE.Consumption..private..PPP.exchange.rate..nominal..US.","WE.Exports..goods...services..constant.prices.and.exchange.rate..US.....of.World","WE.GDP..industry..real","WE.GVA.Agriculture.share.of.GVA","WE.GVA.Industry.share.of.GVA",
"WE.GVA.Manufacturing.of.GVA","WE.GVA.Services..share.of.GVA","WE.Gross.value.added.in.construction..real","WE.Gross.value.added.in.services..real","WE.Imports..goods...services..constant.prices.and.exchange.rate..US.....of.World","WE.Imports..goods..PPP.exchange.rate..nominal..US.",
"WE.Industrial.production.index.1","WE.Investment..government..nominal","WE.Investment..machinery...equipment..nominal","WE.Investment..private..non.residential.structures..nominal","WE.Investment..total.fixed.investment..nominal..US.",
"WE.Investment..total.fixed..nominal..share.of.GDP","WE.Net.investment..nominal..US.","WE.Output.gap","WE.Productivity..trend","WE.Stockbuilding..nominal..US.",
"WE.Stockbuilding..nominal..share.of.GDP","WE.Stockbuilding..real..annual.contribution.to.growth","WE.Trend.productivity.target","WE.World.trade.index","WE.House.price.index","WE.Housing.starts","WE.Interest.rate.on.building.society.mortgages","WE.Market.value.of.housing.stock..LCU",
"WE.Residential.property.transactions","WE.Stock.of.owner.occupied.houses","WE.Consumers..expenditure..durables..nominal","WE.Financial.liabilities..household.sector..as.a...of.disposable.income","WE.Liabilities..debt.other.than.loans..households","WE.Personal.consumer.credit",
"WE.Retail.sales..value.index","WE.Retail.sales..volume.index","WE.Savings..personal.sector.ratio")))
For example, instead of "WE.Residential.property.transactions" I want the list to return
"WE Residential property transactions"
Based on its structure, this is a nested (recursive) list, so a function that walks nested lists recursively, such as rapply or rrapply, can apply gsub to every element, matching the . and replacing it with a space (' ').
Note that . is a regex metacharacter that matches any character (regex mode is the default), so to match a literal dot either use fixed = TRUE (which should also be faster), escape it (\\.), or place it inside square brackets ([.]).
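To make the difference concrete, here is a small base-R sketch on a single string taken from the list. The variable names are only for illustration.

```r
x <- "WE.Residential.property.transactions"

# Three equivalent ways to match a literal dot
lit_fixed   <- gsub(".", " ", x, fixed = TRUE)  # no regex interpretation
lit_escaped <- gsub("\\.", " ", x)              # escaped metacharacter
lit_bracket <- gsub("[.]", " ", x)              # dot inside a character class

# An unescaped "." in regex mode matches every character,
# so the whole string is turned into spaces
all_blank <- gsub(".", " ", x)

lit_fixed
all_blank
```

The last call is worth running once: it shows why an unescaped dot wipes out the text instead of just the periods.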
library(rrapply)
list_AM2 <- rrapply(list_AM, f = function(x) gsub(".", " ", x, fixed = TRUE))
-output
> list_AM2
[[1]]
[[1]]$geo
[1] "EU"
[[1]]$sales
[1] "West Europe Sales"
[[1]]$indicator
[[1]]$indicator[[1]]
[1] "SA"
[[1]]$indicator[[2]]
[1] "NSA"
[[1]]$indicator[[3]]
[1] "composites_industry_value"
[[1]]$indicator[[4]]
[1] "DUCS"
[[1]]$indicator[[5]]
[1] "WUCS"
[[1]]$indicator[[6]]
[1] "T30"
[[1]]$indicator[[7]]
[1] "Rovings"
[[1]]$indicator[[8]]
[1] "Mats"
[[1]]$indicator[[9]]
[1] "WE Construction Gross output sales Real USD"
[[1]]$indicator[[10]]
[1] "WE Construction Production index"
[[1]]$indicator[[11]]
[1] "WE Glass Gross operating surplus profits Nominal USD"
[[1]]$indicator[[12]]
[1] "WE Glass Gross output sales Nominal USD"
[[1]]$indicator[[13]]
[1] "WE Glass Investment Nominal USD"
[[1]]$indicator[[14]]
[1] "WE Glass Production index"
[[1]]$indicator[[15]]
[1] "WE Glass Value added output As a percent of GDP"
[[1]]$indicator[[16]]
[1] "WE Glass Value added output As a percent of manufacturing"
[[1]]$indicator[[17]]
[1] "WE Glass Value added output As a percent of world total"
[[1]]$indicator[[18]]
[1] "WE Industrial Production Gross operating surplus profits Nominal USD"
[[1]]$indicator[[19]]
[1] "WE Industrial Production Gross output sales Nominal USD"
[[1]]$indicator[[20]]
[1] "WE Glass Investment Nominal USD"
[[1]]$indicator[[21]]
[1] "WE Glass Production index"
[[1]]$indicator[[22]]
[1] "WE Glass Value added output As a percent of GDP"
[[1]]$indicator[[23]]
[1] "WE Glass Value added output As a percent of manufacturing"
[[1]]$indicator[[24]]
[1] "WE Glass Value added output As a percent of world total"
[[1]]$indicator[[25]]
[1] "WE Industrial Production Gross operating surplus profits Nominal USD"
[[1]]$indicator[[26]]
[1] "WE Industrial Production Gross output sales Nominal USD"
[[1]]$indicator[[27]]
[1] "WE Industrial Production Production index"
[[1]]$indicator[[28]]
[1] "WE Industrial Production Value added output As a percent of GDP"
[[1]]$indicator[[29]]
[1] "WE Industrial Production Value added output As a percent of world total"
[[1]]$indicator[[30]]
[1] "WE Manufacturing Gross operating surplus profits Nominal USD"
[[1]]$indicator[[31]]
[1] "WE Manufacturing Gross output sales Nominal USD"
[[1]]$indicator[[32]]
[1] "WE Manufacturing Investment Nominal USD"
[[1]]$indicator[[33]]
[1] "WE Manufacturing Production index"
[[1]]$indicator[[34]]
[1] "WE Manufacturing Value added output As a percent of GDP"
[[1]]$indicator[[35]]
[1] "WE Manufacturing Value added output As a percent of world total"
[[1]]$indicator[[36]]
[1] "WE Current account of balance of payments in US share of GDP"
[[1]]$indicator[[37]]
[1] "WE Employment total 1"
[[1]]$indicator[[38]]
[1] "WE External debt total US "
[[1]]$indicator[[39]]
[1] "WE Foreign direct investment US "
[[1]]$indicator[[40]]
[1] "WE GDP per capita nominal US "
[[1]]$indicator[[41]]
[1] "WE GDP nominal US "
[[1]]$indicator[[42]]
[1] "WE Government balance share of GDP"
[[1]]$indicator[[43]]
[1] "WE Population total"
[[1]]$indicator[[44]]
[1] "WE Reserves foreign exchange US "
[[1]]$indicator[[45]]
[1] "WE Reserves months of import cover"
[[1]]$indicator[[46]]
[1] "WE Stockbuilding real share of GDP"
[[1]]$indicator[[47]]
[1] "WE Visible trade balance share of GDP"
[[1]]$indicator[[48]]
[1] "WE Consumer price index"
[[1]]$indicator[[49]]
[1] "WE Gross government debt as a of GDP "
[[1]]$indicator[[50]]
[1] "WE Industrial production index"
[[1]]$indicator[[51]]
[1] "WE Interest rate short term"
[[1]]$indicator[[52]]
[1] "WE Interest rate Yield on 10 year Government Debt Securities per annum "
[[1]]$indicator[[53]]
[1] "WE Services balance as of GDP"
[[1]]$indicator[[54]]
[1] "WE Share price index"
[[1]]$indicator[[55]]
[1] "WE Unemployment rate"
[[1]]$indicator[[56]]
[1] "WE Capacity utilisation"
[[1]]$indicator[[57]]
[1] "WE Consumption government PPP exchange rate nominal US "
[[1]]$indicator[[58]]
[1] "WE Consumption government nominal US "
[[1]]$indicator[[59]]
[1] "WE Consumption government nominal share of GDP 1"
[[1]]$indicator[[60]]
[1] "WE Consumption private PPP exchange rate nominal US "
[[1]]$indicator[[61]]
[1] "WE Exports goods services constant prices and exchange rate US of World"
[[1]]$indicator[[62]]
[1] "WE GDP industry real"
[[1]]$indicator[[63]]
[1] "WE GVA Agriculture share of GVA"
[[1]]$indicator[[64]]
[1] "WE GVA Industry share of GVA"
[[1]]$indicator[[65]]
[1] "WE GVA Manufacturing of GVA"
[[1]]$indicator[[66]]
[1] "WE GVA Services share of GVA"
[[1]]$indicator[[67]]
[1] "WE Gross value added in construction real"
[[1]]$indicator[[68]]
[1] "WE Gross value added in services real"
[[1]]$indicator[[69]]
[1] "WE Imports goods services constant prices and exchange rate US of World"
[[1]]$indicator[[70]]
[1] "WE Imports goods PPP exchange rate nominal US "
[[1]]$indicator[[71]]
[1] "WE Industrial production index 1"
[[1]]$indicator[[72]]
[1] "WE Investment government nominal"
[[1]]$indicator[[73]]
[1] "WE Investment machinery equipment nominal"
[[1]]$indicator[[74]]
[1] "WE Investment private non residential structures nominal"
[[1]]$indicator[[75]]
[1] "WE Investment total fixed investment nominal US "
[[1]]$indicator[[76]]
[1] "WE Investment total fixed nominal share of GDP"
[[1]]$indicator[[77]]
[1] "WE Net investment nominal US "
[[1]]$indicator[[78]]
[1] "WE Output gap"
[[1]]$indicator[[79]]
[1] "WE Productivity trend"
[[1]]$indicator[[80]]
[1] "WE Stockbuilding nominal US "
[[1]]$indicator[[81]]
[1] "WE Stockbuilding nominal share of GDP"
[[1]]$indicator[[82]]
[1] "WE Stockbuilding real annual contribution to growth"
[[1]]$indicator[[83]]
[1] "WE Trend productivity target"
[[1]]$indicator[[84]]
[1] "WE World trade index"
[[1]]$indicator[[85]]
[1] "WE House price index"
[[1]]$indicator[[86]]
[1] "WE Housing starts"
[[1]]$indicator[[87]]
[1] "WE Interest rate on building society mortgages"
[[1]]$indicator[[88]]
[1] "WE Market value of housing stock LCU"
[[1]]$indicator[[89]]
[1] "WE Residential property transactions"
[[1]]$indicator[[90]]
[1] "WE Stock of owner occupied houses"
[[1]]$indicator[[91]]
[1] "WE Consumers expenditure durables nominal"
[[1]]$indicator[[92]]
[1] "WE Financial liabilities household sector as a of disposable income"
[[1]]$indicator[[93]]
[1] "WE Liabilities debt other than loans households"
[[1]]$indicator[[94]]
[1] "WE Personal consumer credit"
[[1]]$indicator[[95]]
[1] "WE Retail sales value index"
[[1]]$indicator[[96]]
[1] "WE Retail sales volume index"
[[1]]$indicator[[97]]
[1] "WE Savings personal sector ratio"
If there are runs of several .s, use \\.+ (one or more dots) so that each run is replaced by a single ' ':
list_AM2 <- rrapply(list_AM, f = function(x) gsub("\\.+", " ", x))
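If installing rrapply is not an option, base R's rapply should do the same job. The sketch below assumes a cut-down stand-in for list_AM (the name small and the helper clean_dots are hypothetical) and adds trimws() to drop the trailing space that names ending in a dot would otherwise keep.

```r
# Small stand-in for list_AM with the same nesting
small <- list(list(geo = "EU", sales = "West.Europe.Sales",
                   indicator = list("SA",
                                    "WE.Glass.Investment..Nominal.USD",
                                    "WE.GDP..nominal..US.")))

# Collapse each run of dots to one space, then trim the ends
clean_dots <- function(x) trimws(gsub("\\.+", " ", x))

# how = "replace" keeps the list structure and names intact
small2 <- rapply(small, clean_dots, classes = "character", how = "replace")

small2[[1]]$sales
small2[[1]]$indicator[[3]]
```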
As the title says, I should split a string at every . ! and ?
That doesn't work:
strsplit(x, "/ (\\?|\\.|!) /")
$`352`
[1] "Saudi Arabian Oil Minister Hisham (...)
the\n... accord and it will never sell its oil at prices below the\npronounced prices under any circumstance.\"\n Saudi Arabia was a main architect of December pact under\nwhich OPEC agreed to cut its total oil output ceiling by 7.25\npct and return to fixed prices of around 18 dollars a barrel.\n Reuter"
$`353`
[1] "Kuwait's oil minister said (...)
daily (bpd).\n Crude oil prices fell sharply last week as international\noil traders and analysts estimated the 13-nation OPEC was\npumping up to one million bpd over its self-imposed limits.\n Reuter"
$`368`
[1] "The port of Philadelphia (...)
the ship on the high tide.\n After delivering oil to a refinery in Paulsboro, New\nJersey, the ship apparently lost its steering and hit the power\ntransmission line carrying power from the nuclear plant to the\nstate of Delaware.\n Reuter"
I shortened the text with "(...)" here, so that's obviously not part of the data.
There should be far more splits: there are many sentence endings where it doesn't split.
Jonathan V. Solórzano is right:
x <- "Ceci.est!un?pipe. . ."
strsplit(x, "\\?|\\.|!")
[[1]]
[1] "Ceci" "est" "un" "pipe" " " " "
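The original pattern "/ (\\?|\\.|!) /" fails because R does not use slash delimiters: the slashes and spaces are matched literally, so a split only happens at text like "/ . /". As a sketch of a slightly tidier variant (the sample sentence is made up), a character class plus optional trailing whitespace gives trimmed pieces:

```r
x <- "Prices fell sharply. Who expected that? Nobody did!"

# "[.!?]" matches any one of the three marks; "\\s*" also consumes
# the whitespace that follows, so the pieces come out trimmed
parts <- strsplit(x, "[.!?]\\s*")[[1]]
parts
```

Note that this still splits inside numbers such as 7.25, so real news text may need a stricter pattern.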
I am cleaning a huge dataset made up of tens of thousands of texts using R. I know regular expressions will do the job conveniently, but I am not good at using them. I have combed Stack Overflow but could not find a solution. This is my dummy data:
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
"04/02/2016 Health is a priority: WAI000553",
"09/ 08/2016 Economy is bad: 2031CE8D",
": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")
I want to remove all the dates, punctuation and IDs, and want my result to be this:
[1] "Education is good"
[2] "Health is a priority"
[3] "Economy is bad"
[4] "Vehicle license is needed"
Any help in R will be appreciated.
I think specificity is in order here:
First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two can be 1-2 digits, and the third is always 4 digits. If this is variable, the regex can be changed to be a little more permissive:
foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982" " Health is a priority: WO0002021" " Economy is bad: WO001999" " Vehicle license is needed: WO001050"
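As one possible way to make the date pattern more permissive, the year part can accept two to four digits. This is only a sketch; the two sample strings are illustrative.

```r
x <- c("21 / 05 / 13: Vehicle license is needed",
       "04/02/2016 Health is a priority")

# \\d{2,4} accepts both two-digit and four-digit years
no_dates <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{2,4}", "", x)
no_dates
```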
From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified if the abbreviation is hard-coded to be anything after a colon, numbers prepended with "WO", or just some one-word combination of letters and numbers. Those could be:
gsub(":.*", "", foo_data2)
# [1] " Education is good" " Health is a priority" " Economy is bad" " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
Removing the : should be straightforward, and trimws(.) will remove the leading/trailing spaces.
This can obviously be combined into a single regex (using the alternation operator | with pattern grouping) or a single R call (nested gsub) without complication; I kept the steps separate for discussion.
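A sketch of both combinations, assuming the ID is simply everything after the colon (the first of the three options above) and using the WO-style sample data:

```r
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
              "04/02/2016 Health is a priority: WO0002021",
              "09/ 08/2016 Economy is bad: WO001999",
              "09/08/ 2016 Vehicle license is needed: WO001050")

date_rx <- "\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}"

# Option 1: nested gsub calls, one pattern per call
nested <- trimws(gsub(":.*", "", gsub(date_rx, "", foo_data)))

# Option 2: a single regex joining the two patterns with |
single <- trimws(gsub(paste0(date_rx, "|:.*"), "", foo_data))

nested
single
```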
I think https://stackoverflow.com/a/22944075/3358272 is a good reference for regex in general. Note that while that page shows many regex constructs with single backslashes, R requires double backslashes in ordinary string literals (e.g., \d in regex needs to be \\d in R). The exception is the raw strings introduced in R 4.0, with which these two are identical:
"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
Using stringr try this:
library(stringr)
library(magrittr)
str_remove_all(foo_data, "\\/|\\d+|\\: WO") %>%
str_squish()
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
data
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
library(rvest)
jobbank <- read_html("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=")
Error in open.connection(x, "rb") :
Timeout was reached: Connection timed out after 10015 milliseconds
jobbank %>%
html_node(".lmiBox") %>%
html_text()
Error in eval(lhs, parent, parent) : object 'jobbank' not found
I'm trying to find keywords in the news section of the website, but I get these two error messages.
Seems to be working fine on my side.
library(rvest)
#> Loading required package: xml2
library(stringr)
jobbank <- read_html("https://www.jobbank.gc.ca/LMI_bulletin.do?cid=3373&AREA=0007&INDUSTRYCD=&EVENTCD=")
jobbank %>%
html_node(".lmiBox") %>%
html_text() %>%
str_split("(\r\\n+\\s+)|(\\n\\s+)")
#> [[1]]
#> [1] ""
#> [2] "Week of Jan 14 - Jan 18, 2019Lowe's Canada is looking to hire about 2,650 full-time, part-time and seasonal staff at its stores in Ontario. The company will hold a National Hiring Day on February 23."
#> [3] "The Ministry of Innovation, Science, and Economic Development announced $5M in funding to support automotive innovation at APAG Elektronik Corp. and Service Mold + Aerospace Inc. in Windsor, creating 160 jobs"
#> [4] "A $1M investment by the provincial government into Kenora's Downtown Revitalization Project for a plaza and infrastructure upgrades will create 75 new jobs"
#> [5] "Redfin Corp., an American real estate brokerage, is expanding into Canada and hiring in Toronto"
#> [6] "The construction of townhomes at Walkerville Stones in Windsor is expected to begin this spring "
#> [7] "The Ontario Emerging Jobs Institute (OEJI) at the Nav Centre in Cornwall opened. The OEJI provides skills training in areas with worker shortages."
#> [8] "The Chartwell Meadowbrook Retirement Residence in Lively broke ground on their expansion project, which includes 41 new suites and 14 town homes"
#> [9] "Lambton College created an Information Technology and Communication Research Centre using a five-year, $2M grant from the Natural Sciences and Engineering Research Council of Canada. They hope to use part of the funding to employ students."
#> [10] "SnapCab, a workspace pod manufacturer in Kingston, has grown from 20 to 25 employees with more hiring expected to occur in 2019"
#> [11] "Niagara Pallet & Recyclers Ltd., a manufacturer of pallets and shipping materials in Smithville, is hiring general labour workers, AZ and DZ drivers, production staff, forklift drivers and saw operators"
#> [12] "A1 Demolition will begin demolition of the former Maliboo Club in Simcoe. The plan is to rebuild the structure with residential and commercial space."
#> [13] "MidiCi: The Neapolitan Pizza Co., Sweet Jesus, La Carnita and The Pie Commission will be among several restaurants opening in the 34,000-sq.-ft. Food District in Mississauga this spring "
#> [14] "Menkes Developments Ltd., in partnership with TD Greystone Asset Management, will renovate the former Canada Permanent Trust Building in Toronto. Work on the 270,000-sq.-ft. space is expected to take between 12 and 18 months."
#> [15] "Westmount Signs & Printing in Waterloo is hiring experienced installers after doubling the size of its workforce to 24 employees in the last year and a half"
#> [16] "Microbrewery, Heral Haus Brewing Co. opened in Stratford at the end of December"
#> [17] "Demolition is expected to start this month on Windsor's old City Hall and is expected to be complete by August"
#> [18] "Urban Planet, a clothing store, will open as early as February 2019 at the Cornwall Square mall in Cornwall"
#> [19] "The federal government committed $3.5M towards the construction of a new art gallery in Thunder Bay, bringing total government funding for the project to $27.5M"
#> [20] "The Rec Room, a 44,000-sq.-ft. entertainment complex by Cineplex Entertainment LP, is scheduled to open in Mississauga in March "
#> [21] "Yang Teashop opened a second location in Toronto with plans to open two more locations in the Greater Toronto Area"
#> [22] "Spacecraft Brewery opened in Sudbury"
#> [23] "The Town of Lakeshore will be accepting applications for 11 summer student positions until March 1"
#> [24] "Virtual reality arcade Cntrl V opened in Lindsay"
#> [25] "A new restaurant, Presqu'ile Café and Burger, opened in Brighton"
#> [26] "Beauty brand Morphe LLC opened a store in Mississauga"
#> [27] "Footwear retailer Brown Shoe Company of Canada Ltd. Inc. will open an outlet store in Halton Hills in April"
#> [28] "The Westdale Theatre in Hamilton is scheduled to reopen in February "
#> [29] "Early ON/Family Grouping will open a child care centre in Monkton"
#> [30] "The De Novo addiction treatment centre opened in Huntsville "
#> [31] "French Revolution Bakery & Crêperie opened in Dundas"
#> [32] "A Williams Fresh Cafe is slated to open in Stoney Creek, one of three new locations opening this year in southwestern Ontario"
#> [33] "Monigram Coffee Midtown cafe will open in Kitchener this winter "
#> [34] "My Roti Place opened a fourth restaurant in Toronto"
#> [35] "A Gangster Cheese restaurant opened in Whitby"
#> [36] "A Copper Branch restaurant opened in Mississauga "
#> [37] "Hallmark Canada will exit about 20 company-owned stores across Canada in 2019 by either transitioning them to independent ownership or closing them. The loacations of the affected stores have not been identified."
#> [38] "Lush Cosmetics at the Intercity Shopping Centre in Thunder Bay will close at the end of January"
#> [39] ""
Created on 2019-01-28 by the reprex package (v0.2.1)
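If the timeout comes back, note that the second error is only a consequence of the first: read_html() failed, so jobbank was never created. Below is a hedged sketch of guarding the download with tryCatch(). fetch_or_null is a hypothetical helper name, and the two stand-in functions simulate the network call so the example runs offline.

```r
# Hypothetical helper: run a download function, return NULL on error
fetch_or_null <- function(fetch) {
  tryCatch(
    fetch(),
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-ins for the network call
failing <- function() stop("Timeout was reached")
working <- function() "<html></html>"

page_bad  <- fetch_or_null(failing)   # NULL, with a message
page_good <- fetch_or_null(working)

# Only continue when the download succeeded
if (!is.null(page_good)) nchar(page_good)
```

With rvest, the call would look like fetch_or_null(function() read_html(url)), followed by an is.null() check before html_node().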
complete dataset link : https://drive.google.com/open?id=12u0Ql1z5T2lzCXRVjp75i9ke9mNYrCWv
In this data you can see that the General Motors entries are not counted together because they appear under different names. There are many more manufacturers like this. I want to group them together, as with General Motors. How can I group them using NLP in R?
Try this way to achieve your goal:
Your Input data.frame:
Vehicle_Manufacturer<-c("GENERAL MOTORS CORP.","FORD MOTOR COMPANY","CHRYSLER CORPORATION","PACCAR INCORPORATED","MACK TRUCKS, INCORPORATED","FOREST RIVER, INC.","BLUE BIRD BODY COMPANY","DAIMLER TRUCKS NORTH AMERICA","GENERAL MOTORS LLC","HONEYWELL INTERNATIONAL, INC.","WINNEBAGO INDUSTRIES, INC.","BMW OF NORTH AMERICA, LLC","NISSAN NORTH AMERICA, INC.","NAVISTAR INTL CORP.","INTERNATIONAL TRUCK AND ENGINE","FREIGHTLINER LLC","HONDA (AMERICAN HONDA MOTOR CO.)","NEWMAR CORPORATION","NAVISTAR, INC","INTERNATIONAL TRUCK & ENGINE CORPORATION","PIERCE MANUFACTURING","GULF STREAM COACH, INC.","FLEETWOOD ENTERPRISES, INC.","FREIGHTLINER CORPORATION","DAIMLER TRUCKS NORTH AMERICA LLC","PACCAR, INCORPORATED","WHITE MOTOR CORPORATION","BAYERISCHE MOTOREN WERKE","THOMAS BUILT BUSES, INC.","DAIMLERCHRYSLER CORPORATION","VOLKSWAGEN OF AMERICA,INC","SPARTAN MOTORS, INC.","VOLVO TRUCKS NORTH AMERICA INC","TOYOTA MOTOR ENGINEERING & MANUFACTURING","PREVOST CAR, INCORPORATED","CHAMPION BUS, INC.","ALTEC INDUSTRIES INC.","SABERSPORT","MERCEDES-BENZ USA, LLC.","HARLEY-DAVIDSON MOTOR COMPANY","COOPER TIRE & RUBBER CO.","KEYSTONE RV COMPANY","SUBARU OF AMERICA, INC.","CHRYSLER (FCA US LLC)","MONACO COACH CORPORATION","CHRYSLER GROUP LLC","JAYCO, INC.","MITSUBISHI FUSO TRUCK OF AMERICA, INC.","COLLINS BUS CORPORATION","PRO-A MOTORS, INC.","NAVISTAR, INC.")
Recalls<-c(6228,5403,2787,2317,1988,1903,1898,1737,1620,1558,1353,1297,1174,1130,1055,987,985,980,955,950,925,922,918,896,835,824,818,801,797,794,749,731,724,709,694,669,641,623,616,613,599,586,582,578,578,572,569,568,559,549,511)
df<-data.frame(Vehicle_Manufacturer,Recalls)
Using the stringdist package, compute the pairwise string distances between the Vehicle_Manufacturer names; this example uses the Jaro-Winkler distance:
library(stringdist)
dist_matrix<-stringdistmatrix(as.character(df[,1]),as.character(df[,1]),method="jw")
Choose a threshold below which two strings are treated as similar and grouped together, like this:
thr<-quantile(dist_matrix,probs=0.025) #2.5% quantile
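Find the strings to merge (a for loop is used in this example, but with a lot of data an lapply solution is better):
to_merge <- vector("list", nrow(df))  # preallocate one slot per manufacturer
for (i in seq_len(nrow(df)))
{
  # names whose distance to row i is below the threshold (includes row i itself)
  to_merge[[i]] <- Vehicle_Manufacturer[dist_matrix[i, ] < thr]
}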
The output is stored in the to_merge list.
To see only the candidate merges (groups with more than one name):
to_merge[sapply(to_merge, length) > 1]
[[1]]
[1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"
[[2]]
[1] "PACCAR INCORPORATED" "PACCAR, INCORPORATED"
[[3]]
[1] "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"
[[4]]
[1] "DAIMLER TRUCKS NORTH AMERICA" "DAIMLER TRUCKS NORTH AMERICA LLC"
[[5]]
[1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"
[[6]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
[[7]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
[[8]]
[1] "DAIMLER TRUCKS NORTH AMERICA" "DAIMLER TRUCKS NORTH AMERICA LLC"
[[9]]
[1] "PACCAR INCORPORATED" "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"
[[10]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
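The for loop above can also be written with lapply, as the answer suggests. To keep this sketch runnable without extra packages, it uses a tiny hand-made distance matrix in place of the stringdistmatrix() output; the distances and the threshold are made-up values, not real Jaro-Winkler results.

```r
names_vec <- c("GENERAL MOTORS CORP.", "FORD MOTOR COMPANY", "GENERAL MOTORS LLC")

# Made-up symmetric distances standing in for stringdistmatrix() output
d <- matrix(c(0.00, 0.40, 0.10,
              0.40, 0.00, 0.45,
              0.10, 0.45, 0.00), nrow = 3, byrow = TRUE)
thr <- 0.2  # assumed threshold

# One list element per row: every name closer than thr (including itself)
to_merge <- lapply(seq_len(nrow(d)), function(i) names_vec[d[i, ] < thr])

# Keep only real groups, then drop the repeats that arise because each
# member of a group produces the same group
groups <- to_merge[lengths(to_merge) > 1]
unique(groups)
```

The unique() step addresses the repeated groups visible in the output above, where the same set of names is listed once per member.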
I have raw bibliographic data as follows:
bib =
c("Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte",
"Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in",
"Republican China*, Cambridge: Harvard University Press, 1976.",
"", "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing",
"History* and the Crisis of Traditional Chinese Historiography,\"",
"*Historiography East & West*2.2 (Sept. 2004): 173-204", "",
"Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:",
"Yale University Press, 1988.", "")
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte"
[2] "Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in"
[3] "Republican China*, Cambridge: Harvard University Press, 1976."
[4] ""
[5] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing"
[6] "History* and the Crisis of Traditional Chinese Historiography,\""
[7] "*Historiography East & West*2.2 (Sept. 2004): 173-204"
[8] ""
[9] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:"
[10] "Yale University Press, 1988."
[11] ""
I would like to collapse the elements between the "" entries into one line each, so that:
clean_bib[1]=paste(bib[1], bib[2], bib[3])
clean_bib[2]=paste(bib[5], bib[6], bib[7])
clean_bib[3]=paste(bib[9], bib[10])
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
Is there a one-liner that does this automatically?
You can use tapply, grouping by the running count of "" elements, and then paste each group together:
unname(tapply(bib,cumsum(bib==""),paste,collapse=" "))
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] " Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] " Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
[4] ""
you can also do:
unname(c(by(bib,cumsum(bib==""),paste,collapse=" ")))
or
unname(tapply(bib,cumsum(grepl("^$",bib)),paste,collapse=" "))
etc
This is similar to the other answer, but uses split and sapply. The second line just removes any elements that contain only "".
vec <- unname(sapply(split(bib, f = cumsum(bib %in% "")), paste0, collapse = " "))
vec[!vec %in% ""]
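Both approaches leave a leading space on every entry after the first, because each "" separator is pasted into the group that follows it. A possible variant, sketched here on a shortened stand-in for bib, drops the blank lines before grouping so that neither the leading spaces nor the trailing empty element appear:

```r
bib <- c("Bernal, Martin,", "Harvard University Press, 1976.", "",
         "Chen,Hsi-yuan,", "173-204", "",
         "Dennerline, Jerry,", "Yale University Press, 1988.", "")

grp  <- cumsum(bib == "")   # group id increases at every blank line
keep <- bib != ""           # drop the blank separators themselves

# as.vector() strips the names and dim attribute that tapply returns
clean_bib <- as.vector(tapply(bib[keep], grp[keep], paste, collapse = " "))
clean_bib
```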