ERROR: recursive reference in a subquery - sqlite

I have a table 'visits' with the columns 'train', 'country' and 'city'.
Country and city are the country and the station that a train visits.
Country names are unique, but city names can repeat.
I want to find the minimum set of trains that visits all the stations in a given set of countries.
My approach:
First I select the train that visits the most stations in the given countries; then I repeatedly select the train that visits the most of the still-unvisited stations in those countries.
I am trying the following CTE but can't get rid of the error: 'recursive reference in a subquery'.
WITH RECURSIVE
selectedTrains(name) AS(
select train
from visits
where country in (select country from countries)
group by train
order by count(city) DESC
LIMIT 1
UNION
select train
from visits
where country in (select country from countries)
and city not in (
select city
from visits
where train in (select name from selectedTrains)
and country in (select country from countries)
)
group by train
order by count(city) DESC
LIMIT 1
),
countries(country) AS (
select country_name
from country_data
where country_name in ("USA","China","India")
)
SELECT * FROM train_data WHERE train_no IN selectedTrains;
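SQLite requires the recursive table to appear exactly once in the FROM clause of the recursive SELECT and nowhere else, so the `select name from selectedTrains` subquery is what triggers this error. One possible workaround, sketched below, is to run the greedy selection in application code instead of in a CTE. This is only a sketch under assumptions: the helper name `greedy_trains` and the sample rows are made up for illustration, and the greedy strategy described in the question is a heuristic that is not guaranteed to return the true minimum number of trains.

```python
import sqlite3

def greedy_trains(conn, countries):
    """Greedily pick trains until every (country, city) station is covered.

    Hypothetical helper for illustration; not part of the original question.
    """
    marks = ",".join("?" * len(countries))
    rows = conn.execute(
        f"SELECT train, country, city FROM visits WHERE country IN ({marks})",
        countries,
    ).fetchall()
    # Map each train to the set of stations it serves. Stations are keyed on
    # (country, city) because city names may repeat across countries.
    stations = {}
    for train, country, city in rows:
        stations.setdefault(train, set()).add((country, city))
    unvisited = set().union(*stations.values()) if stations else set()
    chosen = []
    while unvisited:
        # Pick the train covering the most still-unvisited stations;
        # ties are broken by train name so the result is deterministic.
        best = max(sorted(stations), key=lambda t: len(stations[t] & unvisited))
        chosen.append(best)
        unvisited -= stations[best]
    return chosen

# Made-up sample data, only to show the function running.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits(train TEXT, country TEXT, city TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    ("T1", "USA", "Springfield"), ("T1", "USA", "Boston"), ("T1", "India", "Delhi"),
    ("T2", "India", "Delhi"), ("T2", "India", "Pune"),
    ("T3", "USA", "Boston"), ("T3", "China", "Beijing"),
])
result = greedy_trains(conn, ["USA", "China", "India"])
print(result)
```

The returned train names could then be fed back into a query such as `SELECT * FROM train_data WHERE train_no IN (...)` using bound parameters.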

Related

How to create a new variable in a specific table from multiple conditions in another table

I have 2 tables from 2 different samples, TABLE 1 and TABLE 2 (with different numbers of rows, i.e. individuals). The 2 tables contain the same variables: AGE (quantitative), PROVINCE (qualitative with 11 choices), PLACE (urban or rural), AGE AT MARRIAGE, and PARITY or number of children (quantitative).
Using TABLE 1, I want to create a new variable (PAR_REF) in TABLE 2. For each individual in TABLE 2, PAR_REF must be the average PARITY of all individuals in TABLE 1 who have the same characteristics (the same AGE, PROVINCE, PLACE and AGE AT MARRIAGE).
This is what I want to do
What is the process?
On TABLE1, group_by those variables and summarize, then join the result with TABLE2.
However, for this to work, you need in TABLE1 at least one individual of the same province/place/age/age of marriage as every one found in TABLE2, otherwise the join will leave missing values.
library(tidyverse)
TABLE_REF <- TABLE1 %>%
group_by(AGE, PROVINCE, PLACE, AGE_AT_MARRIAGE) %>%
summarize(PAR_REF = mean(PARITY)) %>%
ungroup()
TABLE2 %>% left_join(TABLE_REF, by = c("AGE", "PROVINCE", "PLACE", "AGE_AT_MARRIAGE"))
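The missing-value caveat is easy to see on a tiny example. The sketch below repeats the same group-then-left-join logic in SQL, run through Python's `sqlite3` rather than R. The lowercase table names, the `id` column and all the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1(age INTEGER, province TEXT, place TEXT,
                    age_at_marriage INTEGER, parity INTEGER);
CREATE TABLE table2(id INTEGER, age INTEGER, province TEXT, place TEXT,
                    age_at_marriage INTEGER);
INSERT INTO table1 VALUES
  (30, 'A', 'urban', 20, 2),
  (30, 'A', 'urban', 20, 4),
  (25, 'B', 'rural', 18, 3);
INSERT INTO table2 VALUES
  (1, 30, 'A', 'urban', 20),
  (2, 25, 'B', 'rural', 18),
  (3, 40, 'C', 'urban', 22);  -- no matching group in table1
""")

# Average parity per group in table1, then left join onto table2.
rows = conn.execute("""
SELECT t2.id, ref.par_ref
FROM table2 AS t2
LEFT JOIN (
    SELECT age, province, place, age_at_marriage, AVG(parity) AS par_ref
    FROM table1
    GROUP BY age, province, place, age_at_marriage
) AS ref USING (age, province, place, age_at_marriage)
ORDER BY t2.id
""").fetchall()
print(rows)  # [(1, 3.0), (2, 3.0), (3, None)]
```

Individual 3 has no counterpart in the reference table, so the join leaves its PAR_REF missing (`None` here, `NA` in R).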

Select people with a given surname from database

I do the following to get the population in a set of districts for a given year:
SELECT Year, County, District, COUNT(*) FROM census_data WHERE Year = ? GROUP BY Year, County, District;
Then I do the following many thousands of times to get the population in each district for each surname I am interested in:
SELECT Year, County, District, COUNT(*) FROM census_data where Year = ? and Surname = ? group by Year, County, District;
There are 8 million rows in my db covering two specific years. There are roughly 40 counties and a county typically has a few hundred districts.
Should I add an index on my table to speed up the above queries as follows:
CREATE INDEX surname_index ON census_data (surname);
My thinking is that since generally speaking there are not many people with a given surname then it should be enough just to index it. Or would you recommend something else? I could also change the query to:
SELECT Year, County, District, COUNT(*) FROM census_data where Surname = ? group by Year, County, District;
for I am usually interested in both years anyway. When doing queries, how do I see if my index is being used?
Yes, I would use an index on the columns you're grouping by. As I mentioned in the comments, I'd also use one query that produces all the desired rows rather than 1000 queries that each produce a fragment of the total. Make the database do all that work only once. Since you mentioned that the names you're interested in are the 1000 most common ones, not random names, that actually makes it a bit easier.
The following demonstrates two slightly different approaches to getting the count per (year, county, district, surname) of the most common surnames overall:
First, populate a table with some sample data:
CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
INSERT INTO census VALUES
(2012, 'Lake', 'West', 'Smith'),
(2012, 'Lake', 'West', 'Jones'),
(2012, 'Lake', 'West', 'Smith'),
(2012, 'Lake', 'West', 'Washington'),
(2012, 'Lake', 'West', 'Washington'),
(2012, 'Lake', 'East', 'Smith'),
(2012, 'Lake', 'East', 'Jackson'),
(2012, 'Williams', 'Downtown', 'Jones'),
(2012, 'Williams', 'Downtown', 'McMaster'),
(2012, 'Williams', 'West Side', 'Jones'),
(2012, 'Williams', 'West Side', 'Jones');
CREATE INDEX census_idx ON census(year, county, district, surname);
(Your real data will, of course, have a lot more rows, and presumably more columns. Depending on space constraints, you might want to drop surname from the index, at the cost of a slower query. With all four columns in the index, it's a covering index for the queries below and the actual table rows never get accessed. With just the first three (or two, or one), it'll need temporary b-trees for the grouping, and more table accesses.)
Approach one: Populate a temporary table with the 1000 most common names overall, and use that table in a join to restrict the results to just those names:
CREATE TEMP TABLE names(name TEXT PRIMARY KEY) WITHOUT ROWID;
INSERT INTO names
SELECT surname FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000;
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN names AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;
Approach two: Do the same thing, but with a subquery instead of a table for the most common names:
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN (SELECT surname AS name FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000) AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;
Both produce:
year        county      district    surname     number
----------  ----------  ----------  ----------  ----------
2012        Lake        East        Jackson     1
2012        Lake        East        Smith       1
2012        Lake        West        Smith       2
2012        Lake        West        Washington  2
2012        Lake        West        Jones       1
2012        Williams    Downtown    Jones       1
2012        Williams    Downtown    McMaster    1
2012        Williams    West Side   Jones       2
If you're going to run this query a lot in a session, the first approach will be faster - it only has to build the list of most common names once, while the second one has to do it every time the query is run. It is, however, more involved because it takes multiple SQL statements. For a single run, benchmarking the two on a decent sized dataset is the best guide, of course.
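As for the question of how to see whether an index is being used: SQLite's `EXPLAIN QUERY PLAN` reports, for each step of a query, whether it scans the table or searches an index. The following is a minimal sketch using Python's `sqlite3` on empty copies of the two schemas discussed above. The helper name `plan` is made up, and the exact wording of the plan text differs between SQLite versions, so no expected output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE census_data(Year INTEGER, County TEXT, District TEXT, Surname TEXT);
CREATE INDEX surname_index ON census_data(Surname);
CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
CREATE INDEX census_idx ON census(year, county, district, surname);
""")

def plan(sql, params=()):
    """Return the human-readable detail column of each query plan row."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

# The per-surname query from the question, against the single-column index.
by_surname = plan(
    "SELECT Year, County, District, COUNT(*) FROM census_data "
    "WHERE Surname = ? GROUP BY Year, County, District", ("Smith",))

# A grouped query against the four-column index from the answer.
grouped = plan(
    "SELECT year, county, district, surname, count(*) FROM census "
    "GROUP BY year, county, district, surname")

print(by_surname)
print(grouped)
```

A plan line that names `surname_index` means the index is used for the lookup, and a line containing `COVERING INDEX` means the table rows themselves are never read. The same statement can be typed directly into the sqlite3 shell by prefixing any query with `EXPLAIN QUERY PLAN`.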

Getting a distinct count by year from an access database

I'm very rusty with my SQL so this might be not so complicated but I just can't seem to be able to crack it.
I have a database with two tables - one containing details of patients and one of visits each patient has had. Patient_ID is the unique identifier for a patient and is used in the Visits table and I'm trying to pull the number of distinct patients and the total number of visits they've had (i.e. Patient A has visited 3 times in 2018)
I'm trying to get a Total count of the Distinct individual patients who have visited a centre per YEAR (field in Visits table), and also see information about the patient from the Patients table (gender, country, etc).
I've tried several count and distinct functions but can't get anything to work. The below is one of the last attempts but the distinct function doesn't actually show distinct values (am I doing something wrong with it?) in this scenario. It does work in other queries... Any help would be greatly appreciated.
SELECT DISTINCT Visits.Patient_ID, Patient.Gender, Patient.Village, Visits.Months_Of_Visit, Visits.Year
FROM Visits
INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID
WHERE Year='2018';
Expected result:
Unique Patient Id, Patient Gender, Patient Village PER month and PER Year.
If you want the number of times each patient visited each village/month/year:
SELECT Count(*) AS CountVisits, Visits.Patient_ID, Gender, Village, Months_Of_Visit, [Year]
FROM Visits
INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID
GROUP BY Visits.Patient_ID, Gender, Village, Months_Of_Visit, [Year];
If you want the number of DISTINCT patients per village/month/year:
Query1:
SELECT DISTINCT Visits.Patient_ID, Gender, Village, Months_Of_Visit, [Year]
FROM Visits
INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID;
Query2:
SELECT Count(*) AS CountPerVillage, Village, Months_Of_Visit, [Year]
FROM Query1 GROUP BY Village, Months_Of_Visit, [Year];
All in one:
SELECT Count(*) AS CountPerVillage, Village, Months_Of_Visit, [Year]
FROM (SELECT DISTINCT Visits.Patient_ID, Village, Months_Of_Visit, [Year]
FROM Visits INNER JOIN Patient ON Patient.Patient_ID=Visits.Patient_ID) AS Query1
GROUP BY Village, Months_Of_Visit, [Year];
Since Year is a reserved word (it is an intrinsic function), enclose it in [ ] or prefix it with the table name in the field reference.
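For the per-year totals the question also asks about (distinct patients and total visits in each year), the same inner-query-then-count pattern can be applied with the year as the only outer grouping column. The sketch below runs on SQLite through Python's `sqlite3` purely so that it is runnable; it is not Access SQL, the sample rows are invented, and in Access the `Year` column would need the bracket quoting described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient(Patient_ID INTEGER PRIMARY KEY, Gender TEXT, Village TEXT);
CREATE TABLE Visits(Patient_ID INTEGER, Months_Of_Visit TEXT, Year INTEGER);
INSERT INTO Patient VALUES (1, 'F', 'North'), (2, 'M', 'North'), (3, 'F', 'South');
INSERT INTO Visits VALUES
  (1, 'Jan', 2018), (1, 'Feb', 2018), (1, 'Mar', 2018),
  (2, 'Jan', 2018),
  (3, 'May', 2019);
""")

# Inner query: one row per patient per year with that patient's visit count.
# Outer query: count those rows (distinct patients) and sum the visits.
rows = conn.execute("""
SELECT Year, COUNT(*) AS DistinctPatients, SUM(VisitCount) AS TotalVisits
FROM (SELECT Patient_ID, Year, COUNT(*) AS VisitCount
      FROM Visits
      GROUP BY Patient_ID, Year) AS PerPatient
GROUP BY Year
ORDER BY Year
""").fetchall()
print(rows)  # [(2018, 2, 4), (2019, 1, 1)]
```

Patient 1 visited three times in 2018 but is counted once as a patient, while all three visits still contribute to the yearly total.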

Is there a way to partition a query that has a "group by" clause?

Say I have a query that displays population grouped by country, with the country as its first column and the total population of that country as its second column.
To achieve this I have the following query:
select
i.country,
count(1) population
from
individual i
group by
i.country
Now I want to introduce two more columns to that query to display the population of males and females for each country.
What I want to achieve might look something similar to this:
select
i.country,
count(1) population total_population,
count(1) over (partition by 1 where i.gender='male') male_population,
count(1) over (partition by 1 where i.gender='female') female_population,
from
individual i
group by
i.country
The problem with this is that
"partition by clause" is not allowed in a "group by" query
"where clause" is not allowed in "partition by" clause
I hope you get the point. Please excuse my grammar and the way I titled this (I couldn't think of a better description).
You don't need analytic functions here:
select
i.country
,count(1) population
,count(case when gender = 'male' then 1 end) male
,count(case when gender = 'female' then 1 end) female
from
individual i
group by
i.country
;
see http://www.sqlfiddle.com/#!4/7dfa5/4
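The conditional counts work because `COUNT(expr)` ignores NULLs, and a `CASE` with no `ELSE` yields NULL for rows that do not match. A small runnable sketch of the same query, using Python's `sqlite3` with invented sample rows (the original fiddle targets a different database engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE individual(country TEXT, gender TEXT);
INSERT INTO individual VALUES
  ('FR', 'male'), ('FR', 'female'), ('FR', 'female'),
  ('DE', 'male'), ('DE', NULL);
""")

rows = conn.execute("""
SELECT i.country,
       COUNT(1) AS population,
       COUNT(CASE WHEN gender = 'male' THEN 1 END) AS male,
       COUNT(CASE WHEN gender = 'female' THEN 1 END) AS female
FROM individual i
GROUP BY i.country
ORDER BY i.country
""").fetchall()
print(rows)  # [('DE', 2, 1, 0), ('FR', 3, 1, 2)]
```

The row with an unknown gender still counts toward the total population but toward neither of the two conditional columns.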

Fusion Tables distinct

I have a merged Fusion Table that displays data from 2 other tables (county & city) that are merged on countyId. The merged table has the columns countyId, countyName, cityName.
I am trying to write a query that will list the countyName once and then list each cityName within that countyName before it moves on to the next countyName.
County 1
City 1
City 2
County 2
City 3
City 4
etc.
I have the following query which returns the unique countyName just fine but I don't know how to get it to pull the cityName for each countyName.
'SELECT countyName, count() FROM table_id GROUP BY countyName'
Any help much appreciated. Thanks
The SELECT clause lists the columns you're going to get back in the response. So try adding cityName. (You don't have to ask for the count() column if you don't need it).
SELECT countyName, cityName FROM .....
(Note, if you have multiple records for a city, you'll want to add that to the GROUP BY list too)
This should give you an answer structured like:
County 1, City 1
County 1, City 2
County 2, City 3
County 2, City 4
-Rebecca
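The query returns flat (county, city) pairs, so the nested listing asked for in the question has to be built on the client side. One way to do that, sketched here in Python with hard-coded rows standing in for the query response (the function name `nested_listing` is made up for illustration):

```python
from itertools import groupby

# Stand-in for the rows returned by the query: (countyName, cityName) pairs.
rows = [
    ("County 1", "City 1"),
    ("County 1", "City 2"),
    ("County 2", "City 3"),
    ("County 2", "City 4"),
]

def nested_listing(rows):
    """Emit each county name once, followed by the cities in that county."""
    lines = []
    # groupby only merges adjacent rows, so sort by county first.
    for county, group in groupby(sorted(rows), key=lambda r: r[0]):
        lines.append(county)
        lines.extend(city for _, city in group)
    return lines

listing = nested_listing(rows)
print("\n".join(listing))
```

Sorting before grouping matters: without it, a county whose rows are not adjacent in the response would be listed more than once.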
