Big data structure - bigdata

Explain the code below in a step-by-step manner. Also explain what the two join statements intend to achieve.
emp = [(1,"Smith",-1,"2018","10","M",3000), \
(2,"Rose",1,"2010","20","M",4000), \
(3,"Williams",1,"2010","10","M",1000), \
(4,"Jones",2,"2005","10","F",2000), \
]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
"emp_dept_id","gender","salary"]
empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
empDF.show(truncate=False)
dept = [("Finance",10), \
("Marketing",20), \
("Sales",30), \
("IT",40) \
]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)
# Join statement 1
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"outer") \
.show(truncate=False)
# Join statement 2
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"right") \
.show(truncate=False)

The above code performs the following steps:
We create a list named emp containing tuples of employee data.
We then create a list named empColumns holding the column names for the employee data.
Next, we create a Spark DataFrame from the emp list, applying empColumns as the schema.
We then print the schema of the DataFrame and display its contents.
The same process is repeated for the department data.
In Join Statement 1, we perform a full outer join of the employee and department DataFrames. A full outer join returns all records from both DataFrames, pairing rows where emp_dept_id equals dept_id and filling the unmatched side with nulls; here, Sales (30) and IT (40) have no employees, so their rows appear with null employee columns.
In Join Statement 2, we perform a right outer join of the employee DataFrame with the department DataFrame. A right join returns all records from the right DataFrame (department) and only the matching records from the left (employee); again, departments without employees appear with null employee columns.
Note: truncate=False in the DataFrame show function ensures that column values are not truncated to 20 characters, which is the default limit in PySpark's show method.
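For comparison, here is a minimal sketch of the same two joins expressed through Spark SQL rather than the DataFrame API, assuming the same spark session and the empDF/deptDF DataFrames created above:
# Register the DataFrames as temporary views so they can be queried with SQL
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
# Full outer join: all rows from both sides, nulls where there is no match
spark.sql("SELECT * FROM EMP e FULL OUTER JOIN DEPT d ON e.emp_dept_id = d.dept_id") \
.show(truncate=False)
# Right outer join: every department, plus matching employees (or nulls)
spark.sql("SELECT * FROM EMP e RIGHT JOIN DEPT d ON e.emp_dept_id = d.dept_id") \
.show(truncate=False)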

Related

Using clinicaltrials.gov database in R

I am trying to use R to access the clinicaltrials.gov AACT database to create a list of facility_investigators for a specific topic.
The following code is an example of how to get a list of all clinical trials on the topic TP53
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'studies')
x = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
Similarly, if I want a list of principal investigators,
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'facility_investigators')
I am unable to make a list of only the facility_investigators for TP53 studies (something like TP53 & facility_investigators). Any help would be appreciated.
This link provides some explanation, but my problem is not resolved: http://www.cancerdatasci.org/post/2017/03/approaches-to-accessing-clinicaltrials.gov-data/
Is this what you're asking? You're pulling from two different tables in the same database: the first is 'studies' and the second is 'facility_investigators'. Run head() (or glimpse(), or str()) on each table after loading them into R and see if the two tables have a common variable you can merge on. If they do, you would do something like this:
library(dplyr)
merged_table <- inner_join(x, study_table, by = "common column")
If the columns have different names, it would look like:
library(dplyr)
merged_table <- inner_join(x, study_table, by = c("x_column_name" = "study_table_column_name"))
From there you can limit your dataset to just facility investigators that have the characteristics you want.
If you want to do it in one PostgreSQL query, you can do it like so. For more information about this syntax in particular, see page 18:
con <- dbConnect() # use the same parameters you use above to connect
# PostgreSQL string literals take single quotes ('%TP53%');
# double quotes are reserved for identifiers
query <- dbSendQuery(con,
  "select s.*, fi.*
   from (select * from studies
         where official_title like '%TP53%') as s
   inner join facility_investigators as fi
   on s.\"joining column\" = fi.\"joining column\""
)
r_dataset <- fetch(query)
# I would just close the connection in RStudio using the connection tab.
The above query has an inner join in the main query and a subquery in the FROM clause. The subquery performs the filtering you were trying to do in R; it essentially lets you select only from a result set that is already filtered. An inner join then combines the records the two tables have in common into one table. If you need to join on more than one column, add an 'and' between the conditions in the ON clause (e.g. on s.col_a = fi.col_a and s.col_b = fi.col_b, where col_a and col_b stand in for your actual column names).
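For anyone who prefers to drive the same query from Python, here is a minimal sketch using psycopg2 with the connection details from the question. Note that nct_id as the join key is an assumption on my part (it is the common key across most AACT tables), so verify it against the schema:
import psycopg2

# Same AACT connection details as in the R code above
con = psycopg2.connect(dbname="aact",
                       host="aact-db.ctti-clinicaltrials.org",
                       user="aact", password="aact")
cur = con.cursor()
# Filter studies first, then join to facility_investigators.
# ASSUMPTION: nct_id is the shared key; check the AACT schema to confirm.
cur.execute("""
    select s.*, fi.*
    from (select * from studies
          where official_title like %s) as s
    inner join facility_investigators as fi
        on s.nct_id = fi.nct_id
""", ("%TP53%",))
rows = cur.fetchall()
con.close()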

RSQLite couldn't read column names with "."

I am trying to use RSQLite to read in tables from my database. All the tables have column names with ".".
For example: my test table has 2 columns: index, first.name
How do I write a query to filter the test table on the first.name column?
My code is:
dbGetQuery(con,"SELECT * FROM test WHERE 'first.name' = 'Joe'")
and it gave me an error:
Error: no such column: first.name
The below should work, wrapping the column name in square brackets:
dbGetQuery(con,"SELECT * FROM test WHERE [first.name] = 'Joe'")
See the below thread:
How to write a column name with dot (".") in the SELECT clause?
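If you want to sanity-check the quoting rules outside R, here is a small self-contained sketch using Python's built-in sqlite3 module (SQLite applies the same identifier quoting regardless of client, and standard SQL double quotes work as well as the brackets):
import sqlite3

con = sqlite3.connect(":memory:")
# Create a table whose column name contains a dot
con.execute('CREATE TABLE test ("index" INTEGER, "first.name" TEXT)')
con.execute("INSERT INTO test VALUES (1, 'Joe')")
# [first.name] (bracket style) and "first.name" (standard SQL) both work;
# an unquoted first.name raises "no such column"
print(con.execute("SELECT * FROM test WHERE [first.name] = 'Joe'").fetchall())
print(con.execute('SELECT * FROM test WHERE "first.name" = \'Joe\'').fetchall())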

creating a looped SQL QUERY using RODBC in R

First and foremost - thank you for taking the time to view my question, regardless of whether you answer or not!
I am trying to create a function that loops through my df and queries the necessary data from SQL using the RODBC package in R. However, I am having trouble setting up the query, since the parameters change on each iteration (example below).
So my df looks like this:
ID  Start_Date  End_Date
1   2/2/2008    2/9/2008
2   1/1/2006    1/1/2007
1   5/7/2010    5/15/2010
5   9/9/2009    10/1/2009
How would I go about specifying the start date and end date in my SQL query?
Here's what I have so far:
data_pull <- function(df) {
a <- data.frame()
b <- data.frame()
for (i in df$id)
{
dbconnection <- odbcDriverConnect(".....")
query <- paste("Select ID, Date, Account_Balance from Table where ID = (",i,") and Date > (",df$Start_Date,") and Date <= (",df$End_Date,")")
a <- sqlQuery(dbconnection, paste(query))
b <- rbind(b,a)
}
return(b)
}
However, this doesn't return anything. I believe it has something to do with how I am specifying the start and end dates for each iteration.
If anyone can help on this it would be greatly appreciated. If you need further explanation, please don't hesitate to ask!
A couple of syntax issues arise from the current setup:
LOOP: You do not iterate through the rows of the data frame but only through the atomic ID values of the single column, df$ID. In that same loop you are passing the entire df$Start_Date and df$End_Date vectors into the query concatenation.
DATES: Your date formats do not align with the 'YYYY-MM-DD' format most databases expect, and some, like Oracle, require an explicit string-to-date conversion: TO_DATE(mydate, 'YYYY-MM-DD').
A couple of performance / best-practice issues:
PARAMETERIZATION: While parameterization is not needed here for security reasons (your values are not user input that could inject malicious SQL), parameterized queries are still advised for maintainability and readability; a parameterized sketch is shown after the code below.
GROWING OBJECTS: According to Patrick Burns's The R Inferno, Circle 2 (Growing Objects), R programmers should avoid growing multi-dimensional objects like data frames inside a loop, which causes excessive copying in memory. Instead, build a list of data frames and rbind them once outside the loop.
With that said, you can avoid any looping or list-building by saving your data frame as a database table and joining it to the final table for a filtered query import. This assumes your database user has CREATE TABLE and DROP TABLE privileges.
# CONVERT DATE FIELDS TO DATE TYPE
df <- within(df, {
Start_Date = as.Date(Start_Date, format="%m/%d/%Y")
End_Date = as.Date(End_Date, format="%m/%d/%Y")
})
# SAVE DATA FRAME TO DATABASE
sqlSave(dbconnection, df, "myRData", rownames = FALSE, append = FALSE)
# IMPORT JOINED AND DATE FILTERED QUERY
q <- "SELECT ID, Date, Account_Balance
FROM Table t
INNER JOIN myRData r
ON r.ID = t.ID
AND t.Date BETWEEN r.Start_Date AND r.End_Date"
final_df <- sqlQuery(dbconnection, q)
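If you do want the looped, parameterized route instead, here is a minimal sketch of the idea using Python's pyodbc (the connection string and sample rows are placeholders; note how results are collected in a list and combined once at the end rather than grown row by row):
import pyodbc

# Placeholder connection string; substitute your own DSN/credentials
con = pyodbc.connect("DSN=mydsn;UID=user;PWD=password")
cur = con.cursor()

results = []
# One parameterized query, re-executed with new values on each iteration
for rec_id, start, end in [(1, "2008-02-02", "2008-02-09"),
                           (2, "2006-01-01", "2007-01-01")]:
    cur.execute("SELECT t.ID, t.Date, t.Account_Balance FROM Table t "
                "WHERE t.ID = ? AND t.Date > ? AND t.Date <= ?",
                (rec_id, start, end))
    results.extend(cur.fetchall())   # collect now, combine once afterwards
con.close()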

Create a chart with WHILE statement applying only for one column in ASP.Net

I would like to create a statistical chart for my database on the ASP.NET website I created beforehand. The data is in a boolean format; the desired output is a count of all rows in the table on the Y axis and of the true values on the X axis.
As I'm using the Design view, I tried to use the query builder, but it just gives me the wrong results:
SELECT COUNT(Voting.Voting_ID) AS Expr1,
COUNT(Voting.Canvassed) AS Expr2
FROM (Voting
INNER JOIN Elector
ON Voting.Voting_ID = Elector.Voting_ID)
WHERE (Voting.Canvassed = 0)
Result:
3264,3264
The result I am looking for is 4870,3264 (4870 being the total number of entries).
Source code:
SelectCommand="SELECT COUNT(Voting.Canvassed) AS Expr2, COUNT(Voting.Canvassed) AS Expr1 FROM (Voting INNER JOIN Elector ON Voting.Voting_ID = Elector.Voting_ID) WHERE (Voting.Canvassed = 0)">
</asp:AccessDataSource>
I tried changing the SQL query in the script, but it keeps giving me syntax errors. Any idea how this could be done?

How to read csv files in matlab as you would in R?

I have a data set that is saved as a .csv file that looks like the following:
Name,Age,Password
John,9,\i1iiu1h8
Kelly,20,\771jk8
Bob,33,\kljhjj
In R I could open this file by the following:
X = read.csv("file.csv",header=TRUE)
Is there a default command in Matlab that reads .csv files with both numeric and string variables? csvread seems to only like numeric variables.
One step further: in R I could use the attach function to create variables associated with the columns and column headers of the data set, i.e.,
attach(X)
Is there something similar in Matlab?
Although this question is close to being an exact duplicate, the solution suggested in the link provided by @NathanG (i.e., using xlsread) is only one possible way to solve your problem. The author in the link also suggests using textscan, but doesn't provide any information about how to do it, so I thought I'd add an example here:
%# First we need to get the header-line
fid1 = fopen('file.csv', 'r');
Header = fgetl(fid1);
fclose(fid1);
%# Convert Header to cell array
Header = regexp(Header, '([^,]*)', 'tokens');
Header = cat(2, Header{:});
%# Read in the data
fid1 = fopen('file.csv', 'r');
D = textscan(fid1, '%s%d%s', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid1);
Header should now be a row vector of cells, where each cell stores a header. D is a row vector of cells, where each cell stores a column of data.
There is no way I'm aware of to "attach" D to Header. If you wanted, you could put them both in the same structure though, i.e.:
S.Header = Header;
S.Data = D;
Matlab's new table class makes this easy:
X = readtable('file.csv');
By default this will parse the headers, and use them as column names (also called variable names):
>> x
x =
    Name       Age    Password
    _______    ___    ___________
    'John'      9     '\i1iiu1h8'
    'Kelly'    20     '\771jk8'
    'Bob'      33     '\kljhjj'
You can select a column using its name etc.:
>> x.Name
ans =
'John'
'Kelly'
'Bob'
Available since Matlab 2013b.
See www.mathworks.com/help/matlab/ref/readtable.html
I liked this approach, supported by Matlab 2012.
path = 'C:\folder1\folder2\';
file = 'data.csv';
% fullfile joins the folder and file name portably (the original
% sprintf('%s\%s', ...) added a second separator, since path already ends with one)
data = dataset('xlsfile', fullfile(path, file));
Of course you could also do the following:
[file, path] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(path, file));
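As a side note for anyone comparing across languages, the same mixed-type read is a one-liner in Python's pandas as well, mirroring R's read.csv closely:
import pandas as pd

# Reads the header row and infers per-column types (strings and numbers)
X = pd.read_csv("file.csv")
print(X["Name"])   # column access by header, much like X$Name in R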
