Fusion Tables distinct - google-maps-api-3

I have a merged Fusion Table that displays data from two other tables (county and city), merged on countyId. The merged table has the columns countyid, countyName, and cityName.
I am trying to write a query that lists each countyName once and then lists each cityName within that countyName before moving on to the next countyName.
County 1
City 1
City 2
County 2
City 3
City 4
etc.
I have the following query, which returns the unique countyName values just fine, but I don't know how to pull the cityName values for each countyName.
'SELECT countyName, count() FROM table_id GROUP BY countyName'
Any help much appreciated. Thanks

The SELECT clause lists the columns you're going to get back in the response. So try adding cityName. (You don't have to ask for the count() column if you don't need it).
SELECT countyName, cityName FROM .....
(Note: if you have multiple records for a city, you'll want to add cityName to the GROUP BY list too.)
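Putting that together, a sketch of the full query could look like this (assuming table_id is the merged table's id, as in the question):
SELECT countyName, cityName FROM table_id GROUP BY countyName, cityName ORDER BY countyName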
This should give you an answer structured like:
County 1, City 1
County 1, City 2
County 2, City 3
County 2, City 4
-Rebecca

Related

Create a data frame in R referring to 2 different columns, one after another

I need to create a data frame with a new column 'Variable1' in R; the requirements are below. I have one column named 'Country' and another named 'City'. I need to check first whether data is available in the City column; if there is no data, then move to the Country column, based on the week. Example:
Country  Count
A        5
B        6
C        7

City  Count
A     3
B     5
When I create the new column, it should first check the count in the City column for each value; if a value is not available there, it should move to Country and take that count; and if it's not available in Country either, it should just fill forward the last value.
NEW Variable:
A - 3
B - 5
C - 7
Can someone please assist with this? Is there any way to do it?
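One way to approach this (a minimal sketch with dplyr and tidyr, assuming the two tables above live in hypothetical data frames country_df and city_df keyed by a shared label column):
library(dplyr)
library(tidyr)

# Hypothetical inputs matching the tables above
country_df <- data.frame(label = c("A", "B", "C"), country_count = c(5, 6, 7))
city_df    <- data.frame(label = c("A", "B"), city_count = c(3, 5))

country_df %>%
  left_join(city_df, by = "label") %>%
  # prefer the City count, fall back to the Country count
  mutate(Variable1 = coalesce(city_count, country_count)) %>%
  # if neither is available, fill forward the last value
  fill(Variable1, .direction = "down") %>%
  select(label, Variable1)
#   label Variable1
# 1     A         3
# 2     B         5
# 3     C         7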

Matching strings across columns of two dataframes in R

My apologies, in my haste to post my question I forgot to follow the basic rules of posting. I have edited my post in line with these rules:
R experts,
I appreciate that a similar question has been raised before but I am unable to adapt the solutions suggested before to my specific data problem. I basically have a dataframe (call it df1), where one of the columns is a string of sentences, part of which contains a city name and a country name. As an example, dataframe df1 has a column called bus_desc with the following data:
bus_desc
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares .....
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to....
In another dataframe (call it df2), I have two columns of data (named city and country) where each row contains a city name and the corresponding country as follows:
city     country
MOBILE   US
DELHI    INDIA
LONDON   UK
I want R to search the string of sentences in each row of that column in dataframe df1 and match it against the city names (and likewise the country names) from the relevant column in dataframe df2. If there is a match, I want to create a column called city in df1, extract the matched city name from df2, and assign it to that row of df1. My final output should look like this:
bus_desc | city | country
Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares ..... | MOBILE | US
Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to.... | DELHI | INDIA
Can anyone please suggest a straightforward solution for this, if one exists? I tried the below, but it does not work:
df1 <- df1 %>%
  rowwise() %>%
  mutate(city = ifelse(grepl(toupper(bus_desc), df2$city), df2$city, df1))
Many thanks for your solutions and help on this.
Regards,
Dev
You can use str_extract to extract the city and country values listed in df2.
library(dplyr)
library(stringr)
df1 %>%
  mutate(city = str_extract(toupper(bus_desc), str_c(df2$city, collapse = '|')),
         country = str_extract(toupper(bus_desc), str_c(df2$country, collapse = '|')))
bus_desc
#1 Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares
#2 Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to
# city country
#1 MOBILE US
#2 DELHI INDIA
data
It is easier to help if you provide data in a reproducible format -
df1 <- data.frame(bus_desc = c('Company ABCD has a base capital of USD 5 million. Company ABCD is based in Mobile, Alabama, US. It is also known to have issued 10 million shares',
                               'Company XYZ has a history of producing bolts. Company XYZ operates out of Delhi, India. Its directors decided to'))
df2 <- data.frame(city = c('MOBILE', 'DELHI', 'LONDON'),
                  country = c('US', 'INDIA', 'UK'))
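One caveat worth noting: str_extract matches substrings, so a short pattern like 'US' will also match inside 'USD'. If that becomes a problem, wrapping the alternation in word boundaries is a possible refinement (a sketch, not part of the original answer):
df1 %>%
  mutate(city = str_extract(toupper(bus_desc),
                            str_c('\\b(', str_c(df2$city, collapse = '|'), ')\\b')),
         country = str_extract(toupper(bus_desc),
                               str_c('\\b(', str_c(df2$country, collapse = '|'), ')\\b')))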

Issues with copying row data and Paste -> R

I have an ASCII file that contains one week of data. The data is a text file and does not have header names. I have nearly completed a smaller task using R, and have made some attempts with Python as well. Being a pro at neither, it's been a steep learning curve. Here are my data and code for pasting rows together based on a specific sequence of characters in R; what I created is not working.
Each column holds different data, but the row data is what matters most. For example:
        column 1     column 2       column 3        column 4
Row 1   Name         Age            YR              Birth Date
Row 2   Middle Name  School name    siblings        # of siblings
Row 3   Last Name    street number  street address
Row 4   Name         Age            YR              Birth Date
Row 5   Middle Name  School name    siblings        # of siblings
Row 6   Last Name    street number  street address
Row 7   Name         Age            YR              Birth Date
Row 8   Middle Name  School name    siblings        # of siblings
Row 9   Last Name    street number  street address
I have a folder to iterate over; some files hold hundreds of rows, others thousands. I have code that drops all the rows I don't need and writes to a new .csv; however, the pasting and/or merging isn't producing the desired results.
What I need is code that selects only the Name and Last Name rows (and their adjacent data) from the entire file and pastes the Last Name row onto the end of the Name row. Each file has the same number of columns but a different number of rows.
I have read the file into a data frame, and have tried merging/pasting/binding the rows and columns (rbind and cbind), and the result is still just shy of what I need. rbind works best so far, but instead of producing the data with the rows pasted one after another on the same line, they are pasted beside each other in columns like this:
i.e.:
Name Last Name Name Last Name Name Last Name
Age Street Num Age Street Num Age Street Num
YR Street address YR Street address YR Street address
Birth NA Birth NA Birth NA
Date NA Date NA Date NA
I have tried to rbind them with family[c(Name, Age, YR Birth...)] and have not been successful. I have looked at how many columns I have and tried to add more columns to account for the paste, and instead it populates with the data from row 1.
I'm really at a loss here, and if anyone can provide some insight I'd really appreciate it. I'm newer than some, but not as new as others. The results I am achieving look like:
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Name Age YR Birth date Last Name Street Num Street Address NA NA
Code tried:
rowData <- rbind(name$Name, name$Age, name$YRBirth, name$Date)
colData <- cbind(name$V1 == "Name", name$V1 == "Last Name")
merge and paste also do not work. I have tried to create each variable as a new data frame and am still not achieving the results I am looking for. Does anyone have any insight?
OK, so if I understand your situation correctly, you want to first slice your data and pull out every third row starting with the 1st row, and then pull out every third row starting with the 3rd row. I'd do it like this (assume your data is in df):
# every third row starting at row 1: rows 1, 4, 7, ... (the "Name" rows)
df1 <- df[3*(1:(nrow(df)/3)) - 2,]
# every third row starting at row 3: rows 3, 6, 9, ... (the "Last Name" rows)
df2 <- df[3*(1:(nrow(df)/3)),]
Once you have these, you can just slap them together, but instead of using rbind you want to use cbind. Then you can drop the NA columns and rename them.
df3 <- cbind(df1, df2)
# keep only the first 7 columns, dropping the NA columns
df3 <- df3[1:7]
colnames(df3) <- c("Name", "Age", "YR", "Birth date", "Last Name", "Street Num", "Street Address")
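As an aside, the same slicing can be written with seq(), which some find easier to read (assuming, as in the example, that the row count is a multiple of 3):
# rows 1, 4, 7, ... (the "Name" rows)
df1 <- df[seq(1, nrow(df), by = 3), ]
# rows 3, 6, 9, ... (the "Last Name" rows)
df2 <- df[seq(3, nrow(df), by = 3), ]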

Select people with a given surname from database

I do the following to get the population in a set of districts for a given year:
SELECT Year, County, District, COUNT(*) FROM census_data WHERE Year = ? GROUP BY Year, County, District;
Then I do the following many thousands of times to get the population in each district for each surname I am interested in:
SELECT Year, County, District, COUNT(*) FROM census_data WHERE Year = ? AND Surname = ? GROUP BY Year, County, District;
There are 8 million rows in my db covering two specific years. There are roughly 40 counties and a county typically has a few hundred districts.
Should I add an index on my table to speed up the above queries as follows:
CREATE INDEX surname_index ON census_data (surname);
My thinking is that since, generally speaking, there are not many people with a given surname, it should be enough just to index that column. Or would you recommend something else? I could also change the query to:
SELECT Year, County, District, COUNT(*) FROM census_data WHERE Surname = ? GROUP BY Year, County, District;
since I am usually interested in both years anyway. When running queries, how do I see whether my index is being used?
Yes, I would use an index on the columns you're grouping by. Like I mentioned in the comments, I'd also use one query that produces all the desired rows over 1000 queries that produce a fragment of the total apiece. Make the database do all that work only once. Since you mentioned the names you're interested in are the 1000 most common ones, not random names, that actually makes it a bit easier.
The following demonstrates two slightly different approaches to getting the count per (year, county, district, surname) of the most common surnames overall:
First, populate a table with some sample data:
CREATE TABLE census(year INTEGER, county TEXT, district TEXT, surname TEXT);
INSERT INTO census VALUES
(2012, 'Lake', 'West', 'Smith'),
(2012, 'Lake', 'West', 'Jones'),
(2012, 'Lake', 'West', 'Smith'),
(2012, 'Lake', 'West', 'Washington'),
(2012, 'Lake', 'West', 'Washington'),
(2012, 'Lake', 'East', 'Smith'),
(2012, 'Lake', 'East', 'Jackson'),
(2012, 'Williams', 'Downtown', 'Jones'),
(2012, 'Williams', 'Downtown', 'McMaster'),
(2012, 'Williams', 'West Side', 'Jones'),
(2012, 'Williams', 'West Side', 'Jones');
CREATE INDEX census_idx ON census(year, county, district, surname);
(Your real data will, of course, have a lot more rows, and presumably more columns. Depending on space constraints, you might want to drop surname from the index, at the cost of a slower query. With all four columns in the index, it's a covering index for the queries below, and the actual table rows never get accessed. With just the first three (or two, or one), it'll need temporary b-trees for the grouping, and more table accesses.)
Approach one: Populate a temporary table with the 1000 most common names overall, and use that table in a join to restrict the results to just those names:
CREATE TEMP TABLE names(name TEXT PRIMARY KEY) WITHOUT ROWID;
INSERT INTO names
SELECT surname FROM census GROUP BY surname ORDER BY count(*) DESC LIMIT 1000;
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN names AS n ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;
Approach two: Do the same thing, but a subquery instead of a table for the most common names:
SELECT year, county, district, surname, count(*) as number
FROM census AS c
JOIN (SELECT surname AS name FROM census
      GROUP BY surname ORDER BY count(*) DESC LIMIT 1000) AS n
  ON c.surname = n.name
GROUP BY year, county, district, surname
ORDER BY year, county, district, count(*) DESC, surname;
Both produce:
year county district surname number
---------- ---------- ---------- ---------- ----------
2012 Lake East Jackson 1
2012 Lake East Smith 1
2012 Lake West Smith 2
2012 Lake West Washington 2
2012 Lake West Jones 1
2012 Williams Downtown Jones 1
2012 Williams Downtown McMaster 1
2012 Williams West Side Jones 2
If you're going to run this query a lot in a session, the first approach will be faster - it only has to build the list of most common names once, while the second one has to do it every time the query is run. It is, however, more involved because it takes multiple SQL statements. For a single run, benchmarking the two on a decent sized dataset is the best guide, of course.
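On the remaining question of how to tell whether an index is being used: in SQLite (which the WITHOUT ROWID syntax above suggests this is), you can prefix a statement with EXPLAIN QUERY PLAN and inspect the plan it prints. For example:
EXPLAIN QUERY PLAN
SELECT year, county, district, surname, count(*)
FROM census
WHERE surname = 'Smith'
GROUP BY year, county, district, surname;
A plan line mentioning USING INDEX or USING COVERING INDEX (e.g. census_idx) means an index is being consulted; a plain SCAN of the table means it is not.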

ERROR: recursive reference in a subquery

I have a table 'visits' containing names of 'train', 'country' and 'city'.
country and city are the countries and stations that a train visits.
Country names are unique; however, city names can be the same.
I want to find the minimum number of trains for the given country names, so as to visit all the stations in those countries.
My approach:
First I select the train that visits the most stations in the given country/countries; then I need to select the train that visits the most of the still-unvisited stations in the given country/countries.
I am trying the following CTE but couldn't get rid of the error 'recursive reference in a subquery':
WITH RECURSIVE
selectedTrains(name) AS (
    SELECT train
    FROM visits
    WHERE country IN (SELECT country FROM countries)
    GROUP BY train
    ORDER BY count(city) DESC
    LIMIT 1
    UNION
    SELECT train
    FROM visits
    WHERE country IN (SELECT country FROM countries)
      AND city NOT IN (
          SELECT city
          FROM visits
          WHERE train IN (SELECT name FROM selectedTrains)
            AND country IN (SELECT country FROM countries)
      )
    GROUP BY train
    ORDER BY count(city) DESC
    LIMIT 1
),
countries(country) AS (
    SELECT country_name
    FROM country_data
    WHERE country_name IN ("USA","China","India")
)
SELECT * FROM train_data WHERE train_no IN selectedTrains;
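For context: SQLite permits the recursive table to be referenced exactly once, as a plain table in the FROM clause of the recursive SELECT. A reference that sits inside a subquery, like SELECT name FROM selectedTrains above, raises exactly this error; additionally, the recursive SELECT may not use aggregate functions, so the count()/GROUP BY would be a second hurdle. A minimal sketch of the allowed shape, using a hypothetical counter:
WITH RECURSIVE counter(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1
    FROM counter        -- allowed: a single, direct reference in FROM
    WHERE n < 3
)
SELECT n FROM counter;  -- returns 1, 2, 3
A greedy "pick the train covering the most unvisited stations" loop is therefore hard to express as a recursive CTE, and may be easier to run iteratively from application code.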
