I have a dataframe containing a column of broker names, handwritten by customers, which I would like to go through in order to replace the handwritten broker names with the unique broker names I have in a list.
A snippet of my data looks like this:
Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
My list of unique broker names looks like this:
Unique_brokers <- list("REALESTATE", "ESTATE")
Based on some sort of pattern-recognition, I would like to replace the brokernames in my Data dataframe with the unique brokernames in my Unique_brokers list.
I've partially managed to do this somewhat manually using a combination of case_when and str_detect from dplyr and stringr respectively.
Data <- Data %>%
  mutate(UniqueBroker = case_when(
    str_detect(Broker, regex("realestate", ignore_case = TRUE)) ~ "REALESTATE",
    str_detect(Broker, regex("estate", ignore_case = TRUE)) ~ "ESTATE",
    TRUE ~ "OTHER"
  ))
However, this is fairly time-consuming with >100 unique brokers and more than 12,500 combinations of handwritten broker names in ~80,000 records.
I was wondering whether it would be possible to make this replacement using mapply, I haven't been able to so far, however.
Many thanks in advance!
EDIT
Data$Broker consists of all kinds of combinations in terms of spelling, information included, etc.
E.g.
Data$Broker <- c("Real-estate", "Real estate", "Real estate department 788", "Michael / REAL Estate")
You can use the i flag for case-insensitive regex and use str_replace_all -
library(stringr)
Unique_brokers <- c("REALESTATE", "ESTATE")
Data$Unique_brokers <- str_replace_all(Data$Broker,
setNames(Unique_brokers, str_c('(?i)', Unique_brokers)))
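As an end-to-end check with the sample data from the question (a sketch, re-creating the data so it runs on its own). One practical caveat for a list of >100 brokers: a name that is a substring of another (here ESTATE inside REALESTATE) can fire on the longer name's text, so sorting the vector by decreasing name length before building the patterns is a sensible precaution:

```r
library(stringr)

Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
                   Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
Unique_brokers <- c("REALESTATE", "ESTATE")

# Named vector: names are the case-insensitive patterns, values the replacements
Data$Unique_brokers <- str_replace_all(
  Data$Broker,
  setNames(Unique_brokers, str_c("(?i)", Unique_brokers))
)
Data$Unique_brokers
# [1] "REALESTATE" "REALESTATE" "ESTATE"     "ESTATE"
```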
A dplyr and stringr solution:
Data %>%
mutate(Broker_unique = if_else(str_detect(Broker, "(?i)real(-|\\s)?estate"),
"REALESTATE",
"ESTATE"))
Date Broker Broker_unique
1 01-10-2020 Real-estate REALESTATE
2 01-10-2020 estate ESTATE
3 01-11-2020 Real estate department 788 REALESTATE
4 01-11-2020 Michael / REAL Estate REALESTATE
The pattern works like this:
(?i): make match case-insensitive
real: match literal real
(-|\\s)?: optionally match a - OR a \\s (i.e., one whitespace)
estate: match estate literally
Test data:
Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
Data$Broker <- c("Real-estate", "estate", "Real estate department 788", "Michael / REAL Estate")
Related
I have been given an oddly structured dataset that I need to prepare for visualisation in GIS. The data is from historic newspapers from different locations in China, published between 1921 and 1937. The excel table is structured as follows:
1. there is a sheet for each location, 2. each sheet has a column for every year, and 3. the variables for each newspaper are organised in blocks of 7 rows, separated by a blank row. Here's a sample from one of the sheets:
,1921年,1922年,1923年
,,,
Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,
,,,
Title of Newspaper 2,ノーウォスチ・ジーズニ(Nōuosuchi Jīzuni),ノォウォスチ・ジーズニ,ノウウスチジーズニ
(Language),露文,露文,露文
(Ideology),政治、社会、文学、商工新聞,社会民主,社会民主
(Owner),タワリシエスウオ・ペチャヤチャ,ぺチヤチ合名会社,ぺチヤチ合名会社
(Editior),イ・エフ・ブロクミユレル,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー
(Publication Frequency),日刊,日刊,日刊
(Circulation),"約3,000","約3,000","3,000"
(Others),1909年創刊、猶太人会より補助を受く、「エス・エル」党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓せるものにして記事多く比較的正確なり、日本お対露干渉排斥排日記事を掲載す,1909年創刊、エス・エル党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓し記事多く比較的正確にして最も有力なるものなり、日本お対露干渉排斥及一般排日記事を掲載し「チタ」政府を擁護す,1909年創刊、過激派に益々接近し長春会議以後は「ダリタ」通信と相待って過激派系の両翼たりし感あり、紙面整頓し記事比較的正確且つ金力に於て猶太人系の後援を有し最も有力なる新聞たり、一般排日記事を掲載し支那側に媚を呈す、「チタ」政権の擁護をなし当地に於ける機関紙たりと自任す
,,,
Title of Newspaper 3,北満洲(Kita Manshū),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun)
(Language),邦文,邦文,邦文
(Ideology),,,
(Owner),合資組織,社長 児玉右二,株式組織 (社長)児玉右二
(Editior),木下猛、遠藤規矩郎,編集長代理 阿武信一,(副社長)磯部検三 (主筆)阿武信一
(Publication Frequency),日刊,日刊,日刊
(Circulation),"1,000内外",,
(Others),大正3年7月創刊,大正11年1月創刊の予定、西比利亜新聞、北満洲及哈爾賓新聞の合同せるものなり,大正11年1月創刊
Yes, it's also in numerous non-latin languages, which makes it a little bit more challenging.
I want to create a new matrix for every year, then rotate the table to turn the 7 rows for each newspaper into columns so that I end up with each row corresponding to one newspaper. Finally, I need to generate a new column that gives me the location of the newspaper. I would also like to add a unique identifier for each newspaper and another column that states the year, in case I decide to merge the entire dataset into a single matrix. I did the transformation manually in Excel, but the entire dataset contains data from several thousand newspapers, so I need to automate the process. Here is what I want to achieve (sans unique identifier and year column):
Title of Newspaper,Language,Ideology,Owner,Editor,Publication Frequency,Circulation,Others,Location
直隷公報(Zhi Li Gong Bao),漢文,直隷省公署の公布機関,直隷省,,日刊,2500,光緒22年創刊、官報の改称,Tientsin
大公報(Da Gong Bao),漢文,稍親日的,合資組織,樊敏鋆,日刊,,光緒28年創刊、倪嗣仲の機関にて現に王祝山其の全権を握り居れり、9年夏該派の没落と共に打撃を受け少しく幹部を変更して再発行せり、但し資金は依然王より供給し居れり,Tientsin
天津日々新聞(Tianjin Ri Ri Xin Wen),漢文,日支親善,方若,郭心培,日刊,2000,光緒27年創刊、親日主義を以て一貫す、國聞報の後身なり民国9年安直戦争中直隷派の圧迫を受けたるも遂に屈せさりし,Tientsin
時聞報(Shi Wen Bao),漢文,中立,李大義,王石甫,,1000,光緒30年創刊、紙面相当価値あり,Tientsin
Is there a way of doing this in R? How will I go about it?
I've outlined a plan in a comment above. This is untested code that makes it more concrete. I'll keep testing till it works
inps <- readLines("~/Documents/R_code/Tientsin unformatted.txt")
inp2 <- inps[ -(1:2) ]
# identify groupings with cumsum(grepl(",,,", inp2)) as the second arg to split
inp.df <- lapply( split(inp2, cumsum(grepl(",,,", inp2))),
                  function(chunk) read.csv(text = chunk, header = FALSE) )
library(data.table) # only needed if you use rbindlist not needed for do.call(rbind , ..
# make a list of one-line dataframes as below
# finally run rbindlist or do.call(rbind, ...)
in.dt <- do.call( rbind, (inp.df)) # rbind checks for ordering of columns
This is the step that makes a one line dataframe from a set of text lines:
txt <- 'Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,'
temp=read.table(text=txt, sep="," , colClasses=c(rep("character", 2), NA, NA))
in1 <- setNames( data.frame(as.list(temp$V2)), temp$V1)
in1
#-------------------------
Title of Newspaper 1 (Language) (Ideology) (Owner) (Editior) (Publication Frequency) (Circulation)
1 遠東報(Yuan Dong Bao) 漢文 東支鉄道機関紙 (総経理)史秉臣 張福臣 日刊 1,000
(Others)
1 1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す
So it looks like the column names of the individually constructed items would need further processing to make them capable of being successfully "bindable" by data.table::rbindlist or plyr::rbind.fill.
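One way to do that name processing (a sketch, assuming the row labels always look like "(Language)", "(Owner)", etc., and that the title row starts with "Title of Newspaper"): strip the parentheses, collapse the numbered title labels to one common name, then bind with fill = TRUE so blocks with missing fields still line up. The two toy one-row data frames below just stand in for the list built above:

```r
library(data.table)

normalise_names <- function(df) {
  nm <- gsub("^\\((.*)\\)$", "\\1", names(df))            # "(Language)" -> "Language"
  nm <- sub("^Title of Newspaper.*", "Title of Newspaper", nm)
  setNames(df, nm)
}

# stand-ins for two of the one-line data frames constructed above
a <- setNames(data.frame("X", "Y"), c("Title of Newspaper 1", "(Language)"))
b <- setNames(data.frame("Z", "W"), c("Title of Newspaper 2", "(Owner)"))

papers <- rbindlist(lapply(list(a, b), normalise_names), fill = TRUE)
# one row per newspaper; columns aligned by cleaned field name, gaps filled with NA
```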
I have a dataset that looks simplified similar to this:
call_id<- c("001","002","003","004","005","012","024")
transcript <- c("All the best and happy birthday",
"万事如意,生日快乐",
"See you tomorrow",
"Nice hearing from you",
"再相见",
"玩",
"恭喜你 ")
df <- data.frame(call_id, transcript)
I need code that gives me the call_id or row numbers of the observations where the transcript column includes Chinese text. My final goal is to exclude the rows where the transcript contains Chinese. As I have a data set with 250,000 observations, it obviously must be code that does this automatically, not by hand as for this small data set. I have already done some analysis with Quanteda. Is there any possibility in Quanteda for this? Thanks in advance.
How about using the Unicode character class for Chinese characters?
> txt <- c("All the best and happy birthday", "万事如意,生日快乐")
> stringi::stri_detect_regex(txt, "\\p{Han}")
[1] FALSE TRUE
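To get from the detector to the asker's goal (dropping those rows), the logical vector can index the data frame directly; a minimal sketch with a cut-down version of the sample data:

```r
df <- data.frame(call_id = c("001", "002", "003"),
                 transcript = c("All the best and happy birthday",
                                "万事如意,生日快乐",
                                "See you tomorrow"))

# TRUE wherever the transcript contains at least one Han character
has_chinese <- stringi::stri_detect_regex(df$transcript, "\\p{Han}")

df$call_id[has_chinese]      # ids of the Chinese-language rows
df_clean <- df[!has_chinese, ]   # data with those rows excluded
```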
You can use the textcat package in R to detect multiple languages. It can detect up to 74 languages and uses a reduced n-gram approach designed to remove redundancies of the original approach.
Here's an example to remove rows having Chinese language-
library("textcat")
out_df <- df[textcat(df$transcript) != "chinese",]
So I've got a .txt file that uses commas to separate fields, but it also uses pipes ("|") as text delimiters. I would like to read this .txt file using R (though I could use other programs if this is impossible with R), and I would like that all values would be in the right columns.
A sample of data:
15,|0370A01D-DC1E-4534-8176-A08A1E2F82E4|,|EDU|,|Education|,|Appropriations and authorization regarding higher education issues.|,|2008|
16,|03A8F7BB-9716-4494-BF41-013C27B5ECA6|,|GOV|,|Government Issues|,|issues affecting local government including appropriations|,|2003|
17,|04696109-082B-4EF6-9AA8-A6DB1013D15D|,|TEC|,|Telecommunications|,|RUS Broadband Applikcation|,|2008|
18,|04FA0BA7-E9D2-4F1E-8193-45F023065C89|,|DOC|,|District of Columbia|,|HUD Appropriations FY2009, CDBG
Financial Services Appropriations FY2009, District of Columbia
Commerce, Justice, Science Appropriations, Juvenile Justice, Byrne Grant|,|2008|
19,|04FA0BA7-E9D2-4F1E-8193-45F023065C89|,|HOU|,|Housing|,|HUD Appropriations FY2009, CDBG
Financial Services Appropriations FY2009, District of Columbia
Commerce, Justice, Science Appropriations, Juvenile Justice, Byrne Grant|,|2008|
So each row contains a row number (15, 16, ..., 19), a |uniqueID|, an |IssueID| of three letters, a longer version of |Issue|, a |SpecificIssue|, and a |Year|.
The closest I got to reading this file is by using the following code (I know that I identify pipe as a separator in it and it is incorrect, but this gives the best result thus far):
lob_issues2 <- fread("file.txt", sep = "|", fill = TRUE)
This results in the following table.
As you can see, the SpecificIssue column in rows 18 and 19 is causing trouble. Perhaps these values are too long or something, and this makes R assign parts of them to new columns. I would like R to keep these values in the SpecificIssue column. Any suggestions on what code to use in order to achieve that?
Thanks in advance. Also, if you think another program is better for this, please let me know.
Use the quote= argument to let it know that | is being used as the quote character:
lob_issues2 <- read.table("file.txt", quote = "|", sep = ",")
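A quick way to check this is to feed a two-record sample (including one record whose piped field spans several lines) through the text= argument: the embedded commas and even the line break stay inside one column. The column names below are just for illustration:

```r
sample_txt <- '15,|0370A01D-DC1E-4534-8176-A08A1E2F82E4|,|EDU|,|Education|,|Appropriations and authorization regarding higher education issues.|,|2008|
18,|04FA0BA7-E9D2-4F1E-8193-45F023065C89|,|DOC|,|District of Columbia|,|HUD Appropriations FY2009, CDBG
Financial Services Appropriations FY2009|,|2008|'

# quote = "|" makes read.table treat the pipes as text delimiters,
# so commas and newlines inside |...| do not split fields
lob <- read.table(text = sample_txt, quote = "|", sep = ",",
                  col.names = c("RowNum", "UniqueID", "IssueID",
                                "Issue", "SpecificIssue", "Year"))
nrow(lob)  # 2 -- the multi-line field is kept in SpecificIssue
```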
I am cleaning a company file that contains the addresses and postal codes of the companies.
Some companies are added multiple times, but with differing postal codes. This is probably caused by human error, but it makes working with the dataset very difficult.
The dataset would look something like this:
Company | Adress | Postal Code
Company1 | Limestreet | 4444ER
Company1 | Limestreet | 4445ER
Company2 | Applestreet | 3745BB
I would like to check which companies have different postal codes. Since the company names are often spelled differently too (also human error), it would be best to check this based on matching addresses.
I've tried to solve this with the tidyverse, but it's not working. My plan was to find all the faulty postal codes and correct them manually. However, if there are too many, I might have to find a more efficient way. So not only would I like advice on how to detect the errors, I'd also like advice on how to correct them in R. Maybe point me towards some good packages or pages describing how to fix this?
df2 <- df1 %>%
  select(Adress, PostalCode) %>%
  group_by(Adress) %>%
  summarise(n = n())
To create a mock example of the dataset:
company <- c("company1", "company1", "company2", "company2", "company3")
Address <- c("Limestreet", "Limestreet", "Applestreet", "Applestreet",
"Pearstreet")
Postal_code <- c("4444ER", "4445ER", "3745BB", "3745BC", "8743IJ")
trail_data <- data.frame(company, Address, Postal_code)
I think you were close with your code, but I would just show the ones that have different lines. This will show you the ones to focus on.
trail_data %>%
select(Address, Postal_code) %>%
group_by(Address) %>%
unique() %>%
filter(n() > 1)
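With the mock data from the question, the same idea written with distinct() (a stylistic variant, not the only way) keeps just the addresses that occur with more than one postal code; Pearstreet drops out:

```r
library(dplyr)

company <- c("company1", "company1", "company2", "company2", "company3")
Address <- c("Limestreet", "Limestreet", "Applestreet", "Applestreet", "Pearstreet")
Postal_code <- c("4444ER", "4445ER", "3745BB", "3745BC", "8743IJ")
trail_data <- data.frame(company, Address, Postal_code)

conflicts <- trail_data %>%
  distinct(Address, Postal_code) %>%   # drop exact duplicates first
  group_by(Address) %>%
  filter(n() > 1) %>%                  # keep addresses with >1 distinct postal code
  ungroup()
# 4 rows: Limestreet and Applestreet each twice, Pearstreet excluded
```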
I think we need a little more information from your database to get the final answer, BUT you can start by writing a little code that identifies if, when sorted, there is a discrepancy in the postal codes. Note that I added one more row of data (company 3) that serves as a "non-discrepant" instance.
I created a new variable called same which is equal to 1 if company name and address match for any pair of rows, but 0 otherwise. You can use this information with other data (which we don't have) to determine which value might be the correct one.
company <- c("company1", "company1", "company2", "company2", "company3","company3")
Address <- c("Limestreet", "Limestreet", "Applestreet", "Applestreet",
"Pearstreet","Pearstreet")
Postal_code <- c("4444ER", "4445ER", "3745BB", "3745BC", "8743IJ","8743IJ")
trail_data <- data.frame(company, Address, Postal_code)
library(dplyr)  # for lag()
trail_data$same <- ifelse(trail_data$company == lag(trail_data$company, 1) &
                            trail_data$Address == lag(trail_data$Address, 1) &
                            trail_data$Postal_code != lag(trail_data$Postal_code, 1),
                          0, 1)
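As a lag-free variant of the same idea: grouping by company and address and counting distinct postal codes with dplyr also flags discrepancies that don't sit on adjacent rows. This is a sketch that re-creates the six-row mock data so it runs on its own:

```r
library(dplyr)

company <- c("company1", "company1", "company2", "company2", "company3", "company3")
Address <- c("Limestreet", "Limestreet", "Applestreet", "Applestreet",
             "Pearstreet", "Pearstreet")
Postal_code <- c("4444ER", "4445ER", "3745BB", "3745BC", "8743IJ", "8743IJ")
trail_data <- data.frame(company, Address, Postal_code)

trail_data <- trail_data %>%
  group_by(company, Address) %>%
  mutate(same = as.integer(n_distinct(Postal_code) == 1)) %>%  # 1 = consistent
  ungroup()
# company3's rows get same = 1; the discrepant pairs get same = 0
```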
I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist){
  # levslist of the form:
  #   list(animal = c("cow", "pig"),
  #        bird   = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
  levels(nfac) <- c(levslist, all_others = othrlevs)
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There is no one-size-fits-all solution for these types of problems. From what I understand, you have Branch Names that are labelled inconsistently.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename "a" through "g" as just "A":
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process above eliminating or group the unique names as appropriate.
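For the Branch Name example from the question specifically, a named lookup vector keeps the whole mapping in one place. This is a sketch; the spellings listed are only the ones the question mentions, and anything unmapped passes through unchanged:

```r
branch <- c("Computer Science and Engineering", "C.S.E", "C.S",
            "Computer Science", "Physics")

# one entry per known variant; the value is the standard notation
lookup <- c("Computer Science and Engineering" = "CSE",
            "C.S.E"            = "CSE",
            "C.S"              = "CSE",
            "Computer Science" = "CSE")

std <- unname(ifelse(branch %in% names(lookup), lookup[branch], branch))
std
# [1] "CSE"     "CSE"     "CSE"     "CSE"     "Physics"
```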