I have a list of UK postcodes in my dataset, and I would like to convert them to their deprivation index. This website does it (http://imd-by-postcode.opendatacommunities.org/imd/2019), but I need it done in R rather than manually entering thousands of postcodes individually.
Does anyone have any experience/idea of a package that does this?
Many thanks
The Office for National Statistics has lookup tables that match postcodes to various scales of output area, for example: Postcode to Output Area.
Hopefully you can find a common field to merge by.
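A minimal sketch of that merge, assuming you've downloaded the ONS postcode-to-LSOA lookup and the IMD 2019 scores as CSVs. All file and column names below are placeholders (check the actual headers in your downloads), and my_data stands for your data frame with a postcode column:

library(dplyr)
library(readr)

# Placeholder file and column names -- adjust to the actual downloads
lookup <- read_csv("postcode_to_lsoa_lookup.csv")  # e.g. columns pcds, lsoa11cd
imd    <- read_csv("imd2019_by_lsoa.csv")          # e.g. columns lsoa11cd, imd_decile

# Postcodes usually need normalising (upper case, trimmed) before the join works reliably
my_data %>%
  mutate(postcode = toupper(trimws(postcode))) %>%
  left_join(lookup, by = c("postcode" = "pcds")) %>%
  left_join(imd, by = "lsoa11cd")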
I have this problem in R where I have a list of Spanish communities and inside each community there is a list of towns/municipalities.
For example, this is a list of municipalities inside the community of Catalonia.
https://en.wikipedia.org/wiki/Municipalities_of_Catalonia
So Catalonia is one community, and within it there is a list of towns/cities to which I would like to assign the new value 'Catalonia'.
I have a list of all the municipalities/towns/cities in my dataset, and I would like to group them into communities such as Andalusia, Catalonia, Basque Country, Madrid, etc.
Firstly, how can I go about grouping these rows into the list of communities?
For example, el prat de llobregat is a municipality within Catalonia, so I would like to assign it to the region of Catalonia. Getafe is a municipality of Madrid, so I would like to assign it the value Madrid. Alicante is a municipality of Valencia, so I would like to assign it the value Valencia. And so on.
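In other words, I am after something like this toy sketch (the real lookup would cover every municipality in Spain):

library(dplyr)

my_data <- data.frame(municipality = c("getafe", "alicante", "el prat de llobregat"))

# small lookup table of municipality -> community
lookup <- data.frame(
  municipality = c("el prat de llobregat", "getafe", "alicante"),
  community    = c("Catalonia", "Madrid", "Valencia")
)

left_join(my_data, lookup, by = "municipality")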
That was my first question, and if you are able to help with just that, I would be very thankful.
However, my dataset is not that clean. I did my best to remove Spanish accents and unnecessary code identifiers in the municipality names, but it still contains some small errors. For example, castellbisbal is a municipality of Catalonia, but some entries have very small spelling mistakes, e.g. one 'l' instead of two, spelling it 'castelbisbal'.
These are small human errors; is there a way I can work around this?
I was thinking of a vector of all correctly spelt names, and then renaming the incorrectly spelt names based on a percentage of incorrectness. Could this work? For instance, castellbisbal is 13 characters long and has an error of 1 character, which is less than an 8% error rate. Can I rename values based on an error rate?
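Something like this sketch is what I have in mind (base R's adist() gives edit distances; the 15% threshold is just a guess):

correct <- c("castellbisbal", "getafe", "alicante")  # correctly spelt names
messy   <- c("castelbisbal", "getafe", "alacante")   # names as they appear in my data

d    <- adist(messy, correct)             # edit-distance matrix, messy x correct
rate <- sweep(d, 2, nchar(correct), "/")  # error rate relative to the correct name's length

best  <- apply(rate, 1, which.min)
fixed <- ifelse(apply(rate, 1, min) <= 0.15, correct[best], messy)
fixed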
Do you have any suggestions on how I can proceed with the second part?
Any tips/suggestions would be great.
As for the spelling errors, have you tried the soundex algorithm? It was meant for that and at least two R packages implement it.
library(stringdist)
phonetic("barradas")
[1] "B632"
phonetic("baradas")
[1] "B632"
And with package phonics, the soundex codes for the same words are the same.
library(phonics)
soundex("barradas")
[1] "B632"
soundex("baradas")
[1] "B632"
All you would have to do is compare soundex codes, not the words themselves. Note that soundex was designed for the English language, so it can only handle English-language characters, not accents. But you say you are already taking care of those, so it might work with the words you have to process.
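For example, a sketch of that comparison with stringdist (the municipality names are just for illustration):

library(stringdist)

correct <- c("castellbisbal", "getafe", "alicante")  # reference spellings
messy   <- c("castelbisbal", "getafe", "alacante")   # observed spellings

# match on the soundex codes instead of the raw words;
# soundex is coarse, so check for code collisions among your reference names
idx <- match(phonetic(messy), phonetic(correct))
data.frame(messy, matched = correct[idx])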
Sometimes I need to get data from the web and organize it into a data frame, and I waste a lot of time doing it manually. I've been trying to figure out how to optimize this process. I've tried some R scraping approaches but couldn't get them to work, and I suspect there is an easier way. Can anyone help me out with this?
Fictional exercise:
Here's a webpage with countries listed by continents: https://simple.wikipedia.org/wiki/List_of_countries_by_continents
Each country name is also a link that leads to another webpage (specific to each country, e.g. https://simple.wikipedia.org/wiki/Angola).
I would like the final result to be a data frame with one row per country listed and 4 variables (columns): ID = country name, Continent = the continent it belongs to, Language = official language (from the country's own page), and Population = the most recent population count (also from the country's own page).
Which steps should I follow in R to reach this final data frame?
Something like the sketch below will probably get you most of the way. You'll want to play around with the different nodes and probably do some string manipulation (clean-up) after you download what you need.
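A sketch with rvest; every CSS selector here is a guess and will need checking against the live pages:

library(rvest)
library(dplyr)

url  <- "https://simple.wikipedia.org/wiki/List_of_countries_by_continents"
page <- read_html(url)

# Country links -- the selector is a guess; inspect the page and adjust
links <- page %>%
  html_nodes("li a") %>%
  html_attr("href")

# For one country page, pull the infobox text and extract what you need
get_country <- function(path) {
  country <- read_html(paste0("https://simple.wikipedia.org", path))
  country %>%
    html_nodes(".infobox tr") %>%
    html_text()    # then string-manipulate out Language and Population
}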
I use the tm package to analyse 5 docs. Initially the data is in CSV format and contains a few columns. I searched for the most common words in the 1st column, which represents the titles of some books (I created a separate txt doc for each needed column).
Now I want to analyse the column that contains the first and last names of the authors (e.g. John Smith). I want to determine the number (frequency) of books for each author.
Please tell me: how can I analyse both words together, rather than separately as in the first case?
Convert the variable name_author into a factor; then you just have to count the frequency of each level (author).
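For instance, with a made-up name_author column:

books <- data.frame(
  name_author = c("John Smith", "Jane Doe", "John Smith")
)

books$name_author <- factor(books$name_author)

# number of books per author
table(books$name_author)

# or as a sorted data frame
as.data.frame(sort(table(books$name_author), decreasing = TRUE))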
I'm trying to merge two large datasets. The common variable, first and last name, varies in spelling between the datasets, and there are many duplicates, even among similarly spelled names. I've included download links for the files and some R code below. I'll walk through what I've tried and what went wrong.
A few R tutorials have tried to tackle the (common) problem of record linkage, but none have dealt with large datasets. I'm hoping the SO community can help me solve this problem.
The first dataset is a large file (several hundred thousand rows) of Federal Elections Commission political contributions. The second is a custom dataset of the names and companies of every Internet company founder (~5,000 rows):
https://www.dropbox.com/s/lfbr9lmurv791il/010614%20CB%20Founders%20%20-%20CB%20Founders.csv?dl=0
--Attempted code matching with regular expressions--
My first attempt, thanks to the help of previous SO suggestions, was to use agrep and regular string matching. This narrowed down the names but resulted in too many duplicates.
#Load files#
library(data.table)

expends12      <- fread("file path for FEC", sep = "|", header = FALSE)
crunchbase.raw <- fread("file path for internet founders")
exp <- expends12
cr  <- crunchbase.raw

#use regular string matching#
# FEC names are "LAST, FIRST ..."; keep the first 7 characters of the
# first name and flip to "first last" order
exp$xsub <- gsub("^([^,]+)\\, (.{7})(.+)", "\\2 \\1", tolower(expends12$V8))
# founder names are "FIRST LAST"; likewise keep only the first 7
# characters of the first name (both patterns assume names longer than 7 chars)
cr$ysub <- gsub("^(.{7})([^ ]+) (.+)", "\\1 \\3", tolower(cr$name))

#merge files#
fec.merge <- merge(exp, cr, by.x = "xsub", by.y = "ysub")
The result is 6,900 rows, so there are a lot of duplicates. Many rows are people with the same names as Internet founders, such as Alexander Black, but who are from different states and have different job titles. So now it's a question of finding the real Internet founder.
One option to narrow the results would be to filter them by state. So, I might only take the Alexander Black from California or New York, because that is where most startups are founded. I might also only take certain job titles, such as CEO or founder; but many founders had jobs before and after their companies, so I wouldn't want to narrow by job title too much.
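Something along these lines is what I mean (the state and occupation column names are invented; I haven't confirmed which FEC fields hold them):

# Hypothetical column names -- the FEC file actually uses V1, V2, ... codes
library(data.table)

narrowed <- fec.merge[state %in% c("CA", "NY") &
                      grepl("founder|ceo", tolower(occupation))]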
Alternatively, there is an R package, RecordLinkage, but as far as I can tell it needs the datasets to have similar rows and columns, which seems like a nonstarter for this task.
I'm familiar with R but have somewhat limited statistical knowledge and programming ability. Any step-by-step help is very much appreciated. Thank you, and please let me know if there's any trouble downloading the data.
Why not select just the columns you need from both datasets and rename them identically? In the result object you get the row indices of the matches back, and as long as you don't reorder anything you can use those indices to link the two original datasets.
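A sketch of that with RecordLinkage, reusing the cleaned-up name columns from the question. With datasets this size you will also want blocking (the blockfld argument) to keep the number of comparisons manageable:

library(RecordLinkage)

# identically named columns in both data frames
fec      <- data.frame(name = exp$xsub)
founders <- data.frame(name = cr$ysub)

# strcmp = TRUE scores fields by string similarity instead of exact equality
pairs <- compare.linkage(fec, founders, strcmp = TRUE)

# pairs$pairs holds id1/id2 -- the row indices into each dataset --
# plus a similarity value for each compared column
head(pairs$pairs)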
I need to know how to map unstructured data to structured data.
I have a variable that holds customers' addresses, including their cities. The name of a city, for example DELHI, can appear as "DELHI", "DEHLI", "DILLI", or "DELI", and I need to detect the city name in these addresses and map it to the correct name, "DELHI".
I am trying to implement a solution in SAS or R.
If you want to try to automate the process of matching your numerous incorrect values to correct values, you could put together something based on Hamming distance or Levenshtein distance, perhaps via the COMPGED function. Calculate a score for each manually input row against each possible matching structured value, then keep the one with the lowest score as your best guess. This will probably not be 100% accurate, but it ought to do a fairly good job, and far faster than a human could.
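Since you mention R as well, here is the same minimum-score idea there (adist() computes Levenshtein distance; COMPGED's generalized edit costs have no exact base-R equivalent):

cities   <- c("DELHI", "MUMBAI", "CHENNAI")        # correct structured values
observed <- c("DEHLI", "DILLI", "MUMBAY", "DELI")  # manually input values

d    <- adist(observed, cities)          # Levenshtein distance matrix
best <- cities[apply(d, 1, which.min)]   # lowest score wins
data.frame(observed, best, score = apply(d, 1, min))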
I doubt it is practical to code this completely automatically, but I would suggest a two-step approach.
First, identify possible matches. You can use a number of potential solutions; this is far more complex than a typical StackOverflow answer, but you have some suggestions already, and you can look at papers on the internet, such as this paper, which explains many of the relevant SAS functions and call routines (COMPGED, SPEDIS, COMPLEV, COMPCOST, SOUNDEX, COMPARE).
Use this approach with a fairly broad stroke, i.e., prefer false positives to false negatives. Focus on identifying words one to one, and build a dataset of (original, translation) pairs, such as:
Delli, Delhi
Deli, Delhi
Dalhi, Delhi
etc.
Then visually inspect the file and make corrections as needed (i.e., remove false positives).
Once you have this dataset, you have a few options for using the results. If you already have the city name as a separate field, or you can put it in one (or easily pull the city out with scan), you can use a format-based solution.
data for_fmt;
    set translations;
    start = original;
    label = translation;
    fmtname = '$CITYF';
    * no HLO='O' record, as we want to preserve nonmatches as-is;
run;

proc format cntlin=for_fmt;
quit;

data want;
    set have;
    city_fixed = put(city, $CITYF.);
run;
If you cannot easily identify the city in the address (i.e., your address field is something like "10532 NELSON DRIVE DELHI", with no commas or such), then the TRANWRD solution is probably best. You can code a hash-based or array-based solution to implement it (rather than a lot of if statements); if your data does have this problem, post a comment and I'll add to the solution later.
This might not be the easiest way in SAS, but if the city name is inside the address string, one way of doing this is the TRANWRD function, which can replace a string inside your address variable. The syntax is:
tranwrd(variable, original_str, new_str);
For example using your city DELHI:
data city;
    input address $1-30;
    datalines;
1 Ocean drive, DEHLI
2 Peak road, DELI
45 Buck street DILLI
;
run;

data change;
    set city;
    address = tranwrd(address, ' DEHLI ', ' DELHI ');
    address = tranwrd(address, ' DELI ',  ' DELHI ');
    address = tranwrd(address, ' DILLI ', ' DELHI ');
run;
I put a space before and after both the original and new strings so that a correct string inside a longer word won't be replaced (e.g., without the spaces, DELICIOUS Road would be changed to DELHICIOUS Road). This still works when the city sits at the very end of the address, because SAS pads character variables with trailing blanks.
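And since the question mentions R too, here is the same idea there using word boundaries, which sidesteps the spacing issue entirely (a sketch with made-up data):

fixes <- c(DEHLI = "DELHI", DELI = "DELHI", DILLI = "DELHI")

address <- c("1 Ocean drive, DEHLI",
             "2 Peak road, DELI",
             "45 Buck street DILLI",
             "10 DELICIOUS Road")      # must not become DELHICIOUS

for (bad in names(fixes)) {
  # \\b word boundaries only match whole words
  address <- gsub(paste0("\\b", bad, "\\b"), fixes[[bad]], address)
}
address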