I'm running the following query with SQLite against a geo database:
SELECT name, max(inhabitants) FROM countries GROUP BY continent
I'm consistently getting the names and inhabitants of the countries with the largest population, per continent. Is this by chance or some kind of expected (documented?) behavior that I can rely on?
In this statement:
SELECT name, max(inhabitants) FROM countries GROUP BY continent
the column name appears in the SELECT list, but it is neither listed in the GROUP BY clause nor aggregated.
For SQLite this is a "bare" column, and because it is used alongside the MAX() aggregate function, the value returned for name comes from the row that contains the maximum value of inhabitants for each continent.
And yes, this behavior is documented here: Simple Select Processing (at the end of the section).
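A minimal sketch of this behavior, using Python's built-in sqlite3 module and an in-memory table. The country names and population figures are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

# In-memory database with hypothetical rows (names and numbers are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries (name TEXT, continent TEXT, inhabitants INTEGER)"
)
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [
        ("Alphaland", "North", 500),
        ("Betaland", "North", 900),
        ("Gammaland", "South", 300),
        ("Deltaland", "South", 100),
    ],
)

# "name" is a bare column: neither aggregated nor listed in GROUP BY.
# With a single max() in the query, SQLite takes it from the row holding the maximum.
rows = conn.execute(
    "SELECT name, max(inhabitants) FROM countries "
    "GROUP BY continent ORDER BY continent"
).fetchall()

for name, inhabitants in rows:
    print(name, inhabitants)
```

Note that this is SQLite-specific. Other database engines typically reject such a query or return an arbitrary name, so a portable version would need a join or a window function.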
I have a list of UK postcodes in my dataset, and I would like to convert them to their deprivation index. This website does it: http://imd-by-postcode.opendatacommunities.org/imd/2019, but I need it to be done in R rather than by manually entering thousands of postcodes individually.
Does anyone have any experience/idea of a package that does this?
Many thanks
The Office for National Statistics has some lookup tables that match postcodes to output areas at various scales, for example: Postcode to Output Area
Hopefully you can find a common field to merge by.
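The merge itself can be sketched as two lookups, shown here in Python for brevity (in R the same idea would be two joins). All postcodes, area codes and deciles below are invented placeholders. The assumption that the deprivation index is keyed by an area code that also appears in the lookup table should be checked against the files you actually download.

```python
def normalise(postcode):
    """Upper-case and strip all whitespace so 'ab1 2cd' and 'AB12CD' compare equal."""
    return "".join(postcode.split()).upper()

# Hypothetical lookup tables; real ones would be read from the downloaded CSV files.
postcode_to_area = {"AB12CD": "AREA001", "EF34GH": "AREA002"}
area_to_decile = {"AREA001": 3, "AREA002": 9}

def deprivation_decile(postcode):
    """Return the decile for a postcode, or None when either lookup fails."""
    area = postcode_to_area.get(normalise(postcode))
    return area_to_decile.get(area)

my_postcodes = ["ab1 2cd", "EF3 4GH", "ZZ9 9ZZ"]
deciles = [deprivation_decile(p) for p in my_postcodes]
print(deciles)
```

Normalising the postcode format on both sides before merging matters in practice, since spacing and case often differ between datasets.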
I've extracted a dataset of customers from our system and loaded it. To send out customer letters, I need the dataset to contain only the street addresses and the postal code with city. The rows containing names, NA values, and empty lines need to be removed.
Therefore I need to delete all rows that don't contain a number.
I've found the answer.
df[which(str_detect(df$..., "\\d")), ]
Sometimes I need to get data from the web and organize it into a data frame, and I waste a lot of time doing it manually. I've been trying to figure out how to optimize this process. I've tried some R scraping approaches but couldn't get them to work, and I thought there might be an easier way to do this. Can anyone help me out?
Fictional exercise:
Here's a webpage with countries listed by continents: https://simple.wikipedia.org/wiki/List_of_countries_by_continents
Each country name is also a link that leads to another webpage (specific to each country, e.g. https://simple.wikipedia.org/wiki/Angola).
As a final result, I would like a data frame where the number of observations (rows) equals the number of countries listed, with 4 variables (columns): ID = country name, Continent = the continent it belongs to, Language = official language (from the country's own webpage), and Population = most recent population count (from the country's own webpage).
Which steps should I follow in R to arrive at the final data frame?
This will probably get you most of the way. You'll want to play around with the different nodes and probably do some string manipulation (clean up) after you download what you need.
I need to know how to map unstructured data to structured data.
I have a variable that holds customers' addresses, including their cities. The name of a city, for example DELHI, can appear as "DELHI", "DEHLI", "DILLI" or "DELI", and I need to detect the city name in these addresses and map it to the correct name, which is "DELHI".
I am trying to implement a solution in SAS or R.
If you want to try to automate the process of matching your numerous incorrect values to correct values, you could put together something based on an edit distance such as Hamming distance or Levenshtein distance, perhaps via the COMPGED or COMPLEV functions. You can score each manually entered value against every possible structured value, then keep the candidate with the lowest score as your best guess. This will probably not be 100% accurate, but it ought to do a fairly good job far faster than a human could.
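The scoring idea can be sketched as follows. This is Python rather than SAS or R, and it uses a plain Levenshtein distance instead of COMPGED's generalized edit distance, so the scores will differ from what SAS reports. The list of valid cities is an assumption for illustration.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def best_match(value, candidates):
    """Return (candidate, score) for the candidate with the lowest edit distance."""
    scored = [(levenshtein(value.upper(), c), c) for c in candidates]
    score, city = min(scored)
    return city, score

valid_cities = ["DELHI", "MUMBAI", "PUNE"]  # assumed reference list
for raw in ["DEHLI", "DILLI", "DELI"]:
    print(raw, "->", best_match(raw, valid_cities))
```

A maximum acceptable score would usually be added as well, so that values far from every candidate are flagged for review instead of being forced onto the nearest city.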
I doubt it is practical to code this in a completely automated fashion, but I would suggest a two-step approach.
First, identify possible matches. You can use a number of potential solutions; this is far more complex than a StackOverflow solution, but you have some suggestions already, and you can look at papers on the internet, such as this paper which explains many of the SAS functions and call routines (COMPGED, SPEDIS, COMPLEV, COMPCOST, SOUNDEX, COMPARE).
Use this approach with a fairly broad stroke, i.e., prefer false positives to false negatives. Simply focus on matching words one to one, and build a dataset of (original, translation) pairs, such as
Delli, Delhi
Deli, Delhi
Dalhi, Delhi
etc.
Then visually inspect the file and make corrections as needed (i.e., remove false positives).
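The first step can be sketched as below, in Python rather than SAS and using the standard library's difflib similarity ratio as a stand-in for functions like COMPGED or SPEDIS. The reference list, the raw values and the 0.5 cutoff are all assumptions chosen for illustration.

```python
import csv
import difflib
import io

valid_cities = ["DELHI", "MUMBAI"]               # assumed reference list
observed = ["Delli", "Deli", "Dalhi", "Nelson"]  # assumed raw values

# A low cutoff deliberately admits false positives for the manual review step.
pairs = []
for raw in observed:
    for city in difflib.get_close_matches(raw.upper(), valid_cities, n=3, cutoff=0.5):
        pairs.append((raw, city.title()))

# Write the candidates as CSV text; in practice this would go to a file for review.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["original", "translation"])
writer.writerows(pairs)
print(buffer.getvalue())
```

The reviewed file then plays the role of the translations dataset used in the second step.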
Once you have this dataset, you have a few options for using the results. If you already have the city name in a separate field, or if you can easily isolate just the city (for example with the SCAN function), you can use a format solution.
data for_fmt;
  set translations;
  start = original;
  label = translation;
  fmtname = '$CITYF';
  *no hlo=o record as we want to preserve nonmatches as is;
run;
proc format cntlin=for_Fmt;
quit;
data want;
  set have;
  city_fixed = put(city, $CITYF.);
run;
If you cannot easily identify the city in the address (i.e., your address field is something like "10532 NELSON DRIVE DELHI" with no commas or other delimiters), then the TRANWRD solution is probably best. You can code a hash-based or array-based solution to implement it (rather than a lot of if statements); if your data does have this problem, post a comment and I'll add to the solution later.
In SAS this might not be the easiest way, but one way of doing this if your city name is inside the address string is to use the TRANWRD function. This can replace a string inside your address variable. The syntax is:
tranwrd(variable, original_str, new_str);
For example using your city DELHI:
data city;
  input address $1-30;
  datalines;
1 Ocean drive, DEHLI
2 Peak road, DELI
45 Buck street DILLI
;
run;
data change;
  set city;
  address = tranwrd(address, ' DEHLI ', ' DELHI ');
  address = tranwrd(address, ' DELI ', ' DELHI ');
  address = tranwrd(address, ' DILLI ', ' DELHI ');
run;
I put a space before and after both the original and new strings so that it won't replace a matching string that sits inside a longer word (otherwise, e.g., DELICIOUS Road would be changed to DELHICIOUS Road).
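The same whole-word idea can be sketched with a regular-expression word boundary, shown here in Python as an illustration rather than a SAS solution. Unlike the padded-space approach, it also matches a city that is directly followed by a comma or other punctuation. The DELICIOUS address is a made-up test case.

```python
import re

# Misspellings from the question, mapped to the correct name.
MISSPELLINGS = ["DEHLI", "DELI", "DILLI"]
pattern = re.compile(r"\b(?:" + "|".join(MISSPELLINGS) + r")\b")

def fix_city(address):
    """Replace whole-word misspellings only; longer words containing them are untouched."""
    return pattern.sub("DELHI", address)

for address in ["1 Ocean drive, DEHLI", "45 Buck street DILLI", "7 DELICIOUS Road, DELI"]:
    print(fix_city(address))
```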