Select people with a given surname from database - sqlite

I do the following to get the population in a set of districts for a given year:
SELECT Year, County, District, Count(*) FROM census_data where Year = ? group by Year, County, District;
Then I do the following many thousands of times to get the population in each district for each surname I am interested in:
SELECT Year, County, District, COUNT(*) FROM census_data where Year = ? and Surname = ? group by Year, County, District;
There are 8 million rows in my db covering two specific years. There are roughly 40 counties and a county typically has a few hundred districts.
Should I add an index on my table to speed up the above queries as follows:
CREATE INDEX surname_index ON census_data (surname);
My thinking is that since generally speaking there are not many people with a given surname then it should be enough just to index it. Or would you recommend something else? I could also change the query to:
SELECT Year, County, District, COUNT(*) FROM census_data where Surname = ? group by Year, County, District;
for I am usually interested in both years anyway. When doing queries, how do I see if my index is being used?

Yes, I would use an index on the columns you're grouping by. Like I mentioned in the comments, I'd also use one query that produces all the desired rows rather than 1000 queries that each produce a fragment of the total. Make the database do all that work only once. Since you mentioned the names you're interested in are the 1000 most common ones, not random names, that actually makes it a bit easier.
The following demonstrates two slightly different approaches to getting the count per (year, county, district, surname) of the most common surnames overall:
First, populate a table with some sample data:
CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
INSERT INTO census VALUES
(2012, 'Lake', 'West', 'Smith'),
(2012, 'Lake', 'West', 'Jones'),
(2012, 'Lake', 'West', 'Smith'),
(2012, 'Lake', 'West', 'Washington'),
(2012, 'Lake', 'West', 'Washington'),
(2012, 'Lake', 'East', 'Smith'),
(2012, 'Lake', 'East', 'Jackson'),
(2012, 'Williams', 'Downtown', 'Jones'),
(2012, 'Williams', 'Downtown', 'McMaster'),
(2012, 'Williams', 'West Side', 'Jones'),
(2012, 'Williams', 'West Side', 'Jones');
CREATE INDEX census_idx ON census(year, county, district, surname);
(Your real data will, of course, have a lot more rows, and presumably more columns. Depending on space constraints, you might want to drop surname from the index, at the cost of a slower query. With all four columns in the index, it's a covering index for the queries below and the actual table rows never get accessed. With just the first three (or two, or one), it'll need temporary b-trees for the grouping, and more table accesses.)
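To see whether an index is actually being used, SQLite's EXPLAIN QUERY PLAN can be put in front of a query. The following is a minimal sketch using Python's sqlite3 module against the sample schema above; the helper name query_plan is made up for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

# In-memory database with the same schema and index as the sample above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
    CREATE INDEX census_idx ON census(year, county, district, surname);
""")

def query_plan(sql, params=()):
    """Hypothetical helper: return the description column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the readable detail

plan = query_plan(
    "SELECT year, county, district, surname, count(*) FROM census "
    "GROUP BY year, county, district, surname"
)
for line in plan:
    print(line)  # look for the index name (and the word COVERING) in this text
```

If the plan mentions a temporary b-tree for GROUP BY, or scans the table without naming the index, the index is not helping that query.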
Approach one: Populate a temporary table with the 1000 most common names overall, and use that table in a join to restrict the results to just those names:
CREATE TEMP TABLE names(name TEXT PRIMARY KEY) WITHOUT ROWID;
INSERT INTO names
SELECT surname FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000;
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN names AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;
Approach two: Do the same thing, but with a subquery instead of a table for the most common names:
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN (SELECT surname AS name FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000) AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;
Both produce:
year        county      district    surname     number
----------  ----------  ----------  ----------  ----------
2012        Lake        East        Jackson     1
2012        Lake        East        Smith       1
2012        Lake        West        Smith       2
2012        Lake        West        Washington  2
2012        Lake        West        Jones       1
2012        Williams    Downtown    Jones       1
2012        Williams    Downtown    McMaster    1
2012        Williams    West Side   Jones       2
If you're going to run this query a lot in a session, the first approach will be faster - it only has to build the list of most common names once, while the second one has to do it every time the query is run. It is, however, more involved because it takes multiple SQL statements. For a single run, benchmarking the two on a decent sized dataset is the best guide, of course.

Matching strings across columns of two dataframes in R

My apologies, in my haste to post my question I forgot to follow the basic rules of posting. I have edited my post in line with these rules:
R experts,
I appreciate that a similar question has been raised before but I am unable to adapt the solutions suggested before to my specific data problem. I basically have a dataframe (call it df1), where one of the columns is a string of sentences, part of which contains a city name and a country name. As an example, dataframe df1 has a column called bus_desc with the following data:
bus_desc
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares .....
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to....
In another dataframe (call it df2), I have two columns of data (named city and country) where each row contains a city name and the corresponding country as follows:
city    country
MOBILE  US
DELHI   INDIA
LONDON  UK
I want R to search the string of sentences of each row for that column in dataframe df1, and match the city (and the same for country) against the city name (and also the country name) from the relevant column in dataframe df2. If there is a match, I want to create a column called city in dataframe df1 and extract the city name from dataframe df2 and assign it to the row in dataframe df1. My final output should look like this:
bus_desc | city | country
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares ..... | MOBILE | US
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to.... | DELHI | INDIA
Can anyone please suggest a straightforward solution for this if it exists? I tried the below but it does not work
df1 <- df1 %>% rowwise() %>%
mutate(city=ifelse(grepl(toupper(bus_desc),df2$city),df2$city,df1))
Many thanks for your solutions and help on this.
Regards,
Dev
You can use str_extract to extract the city and country values listed in df2.
library(dplyr)
library(stringr)
df1 %>%
mutate(city = str_extract(toupper(bus_desc), str_c(df2$city, collapse = '|')),
country = str_extract(toupper(bus_desc), str_c(df2$country, collapse = '|')))
# bus_desc
#1 Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares
#2 Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to
# city country
#1 MOBILE US
#2 DELHI INDIA
data
It is easier to help if you provide data in a reproducible format -
df1 <- data.frame(bus_desc = c('Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares',
'Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to'))
df2 <- data.frame(city = c('MOBILE', 'DELHI', 'LONDON'),
country = c('US', 'INDIA', 'UK'))
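One caveat with a plain alternation pattern like the one above: a short value such as US also matches inside a longer word like USD, so the right answer can come out for the wrong reason. The sketch below is a Python port of the same idea (not R) with word boundaries added; the function name extract_first is made up for illustration. In stringr, wrapping the collapsed pattern in \\b( ... )\\b should have the same effect.

```python
import re

cities = ["MOBILE", "DELHI", "LONDON"]
countries = ["US", "INDIA", "UK"]

def extract_first(text, candidates):
    """Hypothetical helper: first candidate found in text as a whole word, else None."""
    # re.escape guards against names containing regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(c) for c in candidates) + r")\b"
    match = re.search(pattern, text.upper())
    return match.group(0) if match else None

bus_desc = [
    "Company ABCD has a base capital of USD 5 million. Company ABCD is based "
    "in Mobile, Alabama, US. It is also known to have issued 10 million shares",
    "Company XYZ has a history of producing bolts. Company XYZ operates out "
    "of Delhi, India. Its directors decided to",
]

# One (city, country) pair per description.
result = [(extract_first(d, cities), extract_first(d, countries)) for d in bus_desc]
print(result)
```

Like str_extract, this returns only the first match in each description.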

Getting a distinct count by year from an access database

I'm very rusty with my SQL so this might be not so complicated but I just can't seem to be able to crack it.
I have a database with two tables: one containing details of patients and one containing the visits each patient has had. Patient_ID is the unique identifier for a patient and is also used in the Visits table. I'm trying to pull the number of distinct patients and the total number of visits they've had (e.g. Patient A visited 3 times in 2018).
I'm trying to get a Total count of the Distinct individual patients who have visited a centre per YEAR (field in Visits table), and also see information about the patient from the Patients table (gender, country, etc).
I've tried several count and distinct functions but can't get anything to work. The below is one of the last attempts but the distinct function doesn't actually show distinct values (am I doing something wrong with it?) in this scenario. It does work in other queries... Any help would be greatly appreciated.
SELECT DISTINCT Visits.Patient_ID, Patient.Gender, Patient.Village, Visits.Months_Of_Visit, Visits.Year
FROM Visits
INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID
WHERE Year='2018';
Expected result:
Unique Patient Id, Patient Gender, Patient Village PER month and PER Year.
If you want the number of times each patient visited each village/month/year:
SELECT Count(*) AS CountVisits, Visits.Patient_ID, Gender, Village, Months_Of_Visit, [Year]
FROM Visits
INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID
GROUP BY Visits.Patient_ID, Gender, Village, Months_Of_Visit, [Year];
If you want the number of DISTINCT patients per village/month/year:
Query1:
SELECT DISTINCT Visits.Patient_ID, Gender, Village, Months_Of_Visit, [Year]
FROM Visits
INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID;
Query2:
SELECT Count(*) AS CountPerVillage, Village, Months_Of_Visit, [Year]
FROM Query1 GROUP BY Village, Months_Of_Visit, [Year];
All in one:
SELECT Count(*) AS CountPerVillage, Village, Months_Of_Visit, [Year]
FROM (SELECT DISTINCT Visits.Patient_ID, Village, Months_Of_Visit, [Year]
FROM Visits INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID) AS Query1
GROUP BY Village, Months_Of_Visit, [Year];
Since Year is a reserved word (it is an intrinsic function), enclose in [ ] or include the table name prefix in the field reference.
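As a rough check of the "all in one" query, here is a minimal sketch that runs it with Python's sqlite3 module. SQLite stands in for Access here (it also accepts [Year] in square brackets), and the sample rows and column types are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up sample data; column types are an assumption.
    CREATE TABLE Patient(Patient_ID INTEGER PRIMARY KEY, Gender TEXT, Village TEXT);
    CREATE TABLE Visits(Patient_ID INTEGER, Months_Of_Visit TEXT, [Year] TEXT);
    INSERT INTO Patient VALUES (1, 'F', 'North'), (2, 'M', 'North'), (3, 'F', 'South');
    INSERT INTO Visits VALUES
        (1, 'Jan', '2018'), (1, 'Jan', '2018'), (1, 'Feb', '2018'),
        (2, 'Jan', '2018'), (3, 'Jan', '2018'), (3, 'Jan', '2018');
""")

# The inner DISTINCT collapses repeat visits by the same patient in the same
# village/month/year, so the outer Count(*) counts patients, not visits.
rows = conn.execute("""
    SELECT Count(*) AS CountPerVillage, Village, Months_Of_Visit, [Year]
    FROM (SELECT DISTINCT Visits.Patient_ID, Village, Months_Of_Visit, [Year]
          FROM Visits INNER JOIN Patient ON Patient.Patient_ID = Visits.Patient_ID) AS Query1
    GROUP BY Village, Months_Of_Visit, [Year]
    ORDER BY Village, Months_Of_Visit
""").fetchall()
print(rows)
```

Patient 1 visited twice in January but is counted once for North/Jan/2018, alongside patient 2.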

ERROR: recursive reference in a subquery

I have a table 'visits' with the columns 'train', 'country' and 'city'.
Country and city are the country and the station that the train visits.
Country names are unique, but city names can repeat.
I want to find the minimum set of trains that visits all the stations in a given set of countries.
My approach:
First I select the train that visits the most stations in the given countries; then I select the train that visits the most of the still-unvisited stations in those countries, and so on.
I am trying the following CTE but can't get rid of the error 'recursive reference in a subquery'.
WITH RECURSIVE
selectedTrains(name) AS(
select train
from visits
where country in (select country from countries)
group by train
order by count(city) DESC
LIMIT 1
UNION
select train
from visits
where country in (select country from countries)
and city not in (
select city
from visits
where train in (select name from selectedTrains)
and country in (select country from countries)
)
group by train
order by count(city) DESC
LIMIT 1
),
countries(country) AS (
select country_name
from country_data
where country_name in ("USA","China","India")
)
SELECT * FROM train_data WHERE train_no IN selectedTrains;
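SQLite does not allow the recursive table (selectedTrains here) to be referenced from inside a subquery, which is what the error is complaining about. One way around it is to run the greedy loop described above in application code instead. The following is a minimal sketch in Python, assuming the rows of 'visits' have already been fetched as (train, country, city) tuples; the function name and sample data are made up. Note that a greedy choice gives a reasonable answer but is not guaranteed to find the true minimum.

```python
from collections import defaultdict

def greedy_trains(visits, wanted_countries):
    """Hypothetical helper: pick trains greedily until every station is covered."""
    stations_by_train = defaultdict(set)
    for train, country, city in visits:
        if country in wanted_countries:
            # Key stations by (country, city) because city names can repeat.
            stations_by_train[train].add((country, city))

    unvisited = set().union(*stations_by_train.values())
    chosen = []
    while unvisited:
        # Train covering the most unvisited stations; ties go to the smallest name.
        best = max(sorted(stations_by_train),
                   key=lambda t: len(stations_by_train[t] & unvisited))
        chosen.append(best)
        unvisited -= stations_by_train[best]
    return chosen

# Made-up sample rows for illustration.
visits = [
    ("T1", "USA", "Boston"), ("T1", "USA", "Chicago"), ("T1", "India", "Delhi"),
    ("T2", "India", "Delhi"), ("T2", "India", "Mumbai"),
    ("T3", "USA", "Chicago"),
    ("T4", "France", "Paris"),
]
chosen = greedy_trains(visits, {"USA", "India"})
print(chosen)
```

The resulting train names can then be passed to a plain query such as SELECT * FROM train_data WHERE train_no IN (...).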

Is there a way to partition a query that has a "group by" clause?

Say I have a query that displays population by country, with the country as its first column and the total population of that country as its second column.
To achieve this I have the following query:
select
i.country,
count(1) population
from
individual i
group by
i.country
Now I want to introduce two more columns to that query to display the population of males and females for each country.
What I want to achieve might look something similar to this:
select
i.country,
count(1) population total_population,
count(1) over (partition by 1 where i.gender='male') male_population,
count(1) over (partition by 1 where i.gender='female') female_population,
from
individual i
group by
i.country
The problems with this are:
"partition by clause" is not allowed in a "group by" query
"where clause" is not allowed in "partition by" clause
I hope you get the point. Please excuse my grammar and the way I titled this (couldn't know any better description).
You don't need analytic functions here:
select
i.country
,count(1) population
,count(case when gender = 'male' then 1 end) male
,count(case when gender = 'female' then 1 end) female
from
individual i
group by
i.country
;
see http://www.sqlfiddle.com/#!4/7dfa5/4
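The conditional-count query can also be tried locally. Here is a minimal sketch using Python's sqlite3 module; SQLite is an assumption standing in for the original database, and the sample rows are made up. It works because count() ignores NULLs, and a CASE without ELSE yields NULL for rows that do not match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up sample data for illustration.
    CREATE TABLE individual(country TEXT, gender TEXT);
    INSERT INTO individual VALUES
        ('A', 'male'), ('A', 'female'), ('A', 'female'),
        ('B', 'male');
""")

rows = conn.execute("""
    SELECT i.country,
           count(1) AS population,
           count(CASE WHEN gender = 'male' THEN 1 END) AS male,
           count(CASE WHEN gender = 'female' THEN 1 END) AS female
    FROM individual i
    GROUP BY i.country
    ORDER BY i.country
""").fetchall()
print(rows)
```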

Fusion Tables distinct

I have a merged Fusion Table that displays data from 2 other tables (county & city), which are merged on countyId. The merged table has the columns countyid, countyName, cityName.
I am trying to write a query that will list the countyName once and then list each cityName within that countyName before it moves on to the next countyName.
County 1
  City 1
  City 2
County 2
  City 3
  City 4
etc.
I have the following query which returns the unique countyName just fine but I don't know how to get it to pull the cityName for each countyName.
'SELECT countyName, count() FROM table_id GROUP BY countyName'
Any help much appreciated. Thanks
The SELECT clause lists the columns you're going to get back in the response. So try adding cityName. (You don't have to ask for the count() column if you don't need it.)
SELECT countyName, cityName FROM .....
(Note: if you have multiple records for a city, you'll want to add cityName to the GROUP BY list too.)
This should give you an answer structured like:
County 1, City 1
County 1, City 2
County 2, City 3
County 2, City 4
-Rebecca
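The query returns flat (county, city) rows, so the nested listing from the question has to be built on the client side. A minimal sketch in Python, assuming the rows have already been fetched as tuples; the function name nested_listing and the sample rows are made up.

```python
from itertools import groupby

# Made-up rows in the shape of the query result above.
rows = [
    ("County 1", "City 1"), ("County 1", "City 2"),
    ("County 2", "City 3"), ("County 2", "City 4"),
]

def nested_listing(rows):
    """Hypothetical helper: one line per county, followed by its indented cities."""
    lines = []
    # groupby only merges adjacent items, so sort by county first.
    for county, group in groupby(sorted(rows), key=lambda r: r[0]):
        lines.append(county)
        lines.extend("  " + city for _, city in group)
    return lines

listing = nested_listing(rows)
print("\n".join(listing))
```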
