SQLite return rank value python - sqlite

I have a sqlite3 table containing students' marks for an assignment. Below is a sample of the table's data:
Id Name  Marks
1  Mark  87
2  John  50
3  Adam  65
4  Cindy 68
5  Ruth  87
I would like to create a new column 'Rank', giving the students a rank according to the marks scored.
There are 2 main criteria to follow:
If two students have the same marks, their rank is the same.
The lowest rank number equals the total number of students. For example, if there are two students with Rank 1, the next student below them is Rank 3.
Below is a sample of the output I need:
Id Name  Marks Rank
1  Mark  87    1
2  John  50    5
3  Adam  65    4
4  Cindy 68    3
5  Ruth  87    1
This is the code that I have at the moment:
import sqlite3
conn = sqlite3.connect('students.sqlite')
cur = conn.cursor()
cur.execute('ALTER TABLE student_marks ADD Rank INTEGER')
conn.commit()

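If you are using a recent version of SQLite (3.25.0 or later, which added window functions), you can skip the update and compute the rank on the fly with the RANK() window function. Order by Marks only: adding Id as a tie-breaker would give students with equal marks different ranks.
SELECT Id, Name, Marks, RANK() OVER (ORDER BY Marks DESC) "Rank"
FROM student_marks
ORDER BY Id;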
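If the Rank column really has to be stored in the table, or the SQLite build is too old for window functions, one possible approach is a correlated subquery: a student's rank is 1 plus the number of students with strictly higher marks. The following is a minimal sketch that assumes an in-memory database and the column layout from the question; the sample rows are the ones shown above.

```python
import sqlite3

# In-memory database standing in for students.sqlite (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_marks (Id INTEGER PRIMARY KEY, Name TEXT, Marks INTEGER)"
)
conn.executemany(
    "INSERT INTO student_marks (Id, Name, Marks) VALUES (?, ?, ?)",
    [(1, "Mark", 87), (2, "John", 50), (3, "Adam", 65), (4, "Cindy", 68), (5, "Ruth", 87)],
)

# Add the column, then fill it: rank = 1 + number of students with higher marks.
# Equal marks give equal ranks, and the rank after a tie skips ahead (1, 1, 3, ...).
conn.execute('ALTER TABLE student_marks ADD COLUMN "Rank" INTEGER')
conn.execute(
    """
    UPDATE student_marks
    SET "Rank" = 1 + (
        SELECT COUNT(*)
        FROM student_marks AS other
        WHERE other.Marks > student_marks.Marks
    )
    """
)
conn.commit()

ranked = conn.execute(
    'SELECT Id, Name, Marks, "Rank" FROM student_marks ORDER BY Id'
).fetchall()
for row in ranked:
    print(row)
```

A stored rank goes stale as soon as marks change, so the UPDATE would have to be re-run after every insert or edit; that is the main reason to prefer computing the rank in the SELECT.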

Related

How to take the cumulative sum based on unique values in a character variable?

I have a df that represents users' browsing behavior over time. The df contains a unique UserId, and each row has a timestamp and represents a visit to a certain website. Each website has a unique website Id and belongs to one website category, say c("electronics", "clothes",....).
Now I want to count per row how many unique websites per category the user has visited up to that row (including that row). I call this variable "breadth" since it represents how broadly a user is browsing the internet.
So far I have only managed to produce clumsy code that computes the total number of unique websites visited per category: it filters on each category, takes the length of the unique vector per user, and then does a left join.
As a result, I lose the information about the development over time.
Thanks so much in advance!
total_breadth <- df %>% filter(category=="electronics") %>%
group_by(user_id) %>%
mutate(breadth=length(unique(website_id)))
#Structure of the df I want to achieve:
user_id time website_id category breadth
1 1 70 "electronics" 1
1 2 93 "clothing" 1
1 3 34 "electronics" 2
1 4 93 "clothing" 1
1 5 26 "electronics" 3
1 6 70 "electronics" 3
#Structure of the df I produce:
user_id time website_id category breadth
1 1 70 "electronics" 3
1 2 93 "clothing" 1
1 3 34 "electronics" 3
1 4 93 "clothing" 1
1 5 26 "electronics" 3
1 6 70 "electronics" 3
This seems to be a case of a split, apply and combine.
Create a binary matrix of 1s and 0s whose dimensions are:
No. of Rows = No. of rows in the original data
No of Columns = No. of unique website categories
Each Row represents the timestamp and each column represents the respective website category. So a cell will be equal to 1 if and only if the user has visited the website for that website category on the respective timestamp else it will be 0.
Take the cumulative sum for individual columns of this matrix and then create a final column where it takes the value only for the visited website category on the respective timestamp.
It isn't an elegant solution, but it should solve your problem for the time being.
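For comparison, here is a minimal Python sketch of the running count. It swaps the binary-matrix idea for a set of already-seen websites per (user, category) pair, so that a repeat visit to the same website does not increase the count. The function name `running_breadth` and the tuple layout are assumptions made for illustration, and the rows are assumed to be sorted by time already.

```python
from collections import defaultdict


def running_breadth(visits):
    """Return, for each visit, the number of distinct websites the user
    has visited so far within that visit's category (current row included)."""
    seen = defaultdict(set)  # (user_id, category) -> set of website_ids seen so far
    breadth = []
    for user_id, _time, website_id, category in visits:
        sites = seen[(user_id, category)]
        sites.add(website_id)  # adding an already-seen site leaves the set unchanged
        breadth.append(len(sites))
    return breadth


# Sample rows from the question: (user_id, time, website_id, category)
visits = [
    (1, 1, 70, "electronics"),
    (1, 2, 93, "clothing"),
    (1, 3, 34, "electronics"),
    (1, 4, 93, "clothing"),
    (1, 5, 26, "electronics"),
    (1, 6, 70, "electronics"),
]
breadth = running_breadth(visits)
print(breadth)
```

On the sample data this reproduces the breadth column the question asks for.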

R - Replace unique identifiers with something less complicated

I have two data frames that are related by a really long user ID, and I want to replace these values with something more readable, like a simple integer value. Obviously I want to keep these values consistent between data frames and I was wondering if there is a simple way to do this. Here is what the data.frames look like:
ArtistData - Shows how many times a user listened to a particular artist:
UserID Artist Plays
00000c289a1829a808ac09c00daf10bc3c4e223b elvenking 706
00000c289a1829a808ac09c00daf10bc3c4e223b lunachicks 538
00001411dc427966b17297bf4d69e7e193135d89 stars 373
... ... ...
UserData - Shows information on each individual user:
UserID gender age country
00001411dc427966b17297bf4d69e7e193135d89 m 21 Germany
00004d2ac9316e22dc007ab2243d6fcb239e707d f 34 Mexico
000063d3fe1cf2ba248b9e3c3f0334845a27a6bf m 27 Poland
... ... ... ...
So basically, can I replace these long strings that have no meaning for me with an integer that is consistent between each data frame?
Convert to factors with simplified labels, using all possible UserIDs from both datasets:
levs <- union(UserData$UserID, ArtistData$UserID)
ArtistData$newid <- factor(
ArtistData$UserID, levels=levs, labels=seq_along(levs)
)
UserData$newid <- factor(
UserData$UserID, levels=levs, labels=seq_along(levs)
)
ArtistData
# UserID Artist Plays newid
#1 00000c289a1829a808ac09c00daf10bc3c4e223b elvenking 706 4
#2 00000c289a1829a808ac09c00daf10bc3c4e223b lunachicks 538 4
#3 00001411dc427966b17297bf4d69e7e193135d89 stars 373 1
UserData
# UserID gender age country newid
#1 00001411dc427966b17297bf4d69e7e193135d89 m 21 Germany 1
#2 00004d2ac9316e22dc007ab2243d6fcb239e707d f 34 Mexico 2
#3 000063d3fe1cf2ba248b9e3c3f0334845a27a6bf m 27 Poland 3
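The same idea, one shared lookup built from the union of both ID columns, can be sketched in Python with a plain dictionary. The helper name `build_id_map` is hypothetical, and the lists below hold only the sample IDs shown in the question.

```python
def build_id_map(*id_columns):
    """Assign 1, 2, 3, ... to each distinct ID in order of first appearance
    across all the given columns."""
    mapping = {}
    for column in id_columns:
        for user_id in column:
            # setdefault keeps the first number an ID was given
            mapping.setdefault(user_id, len(mapping) + 1)
    return mapping


user_data_ids = [
    "00001411dc427966b17297bf4d69e7e193135d89",
    "00004d2ac9316e22dc007ab2243d6fcb239e707d",
    "000063d3fe1cf2ba248b9e3c3f0334845a27a6bf",
]
artist_data_ids = [
    "00000c289a1829a808ac09c00daf10bc3c4e223b",
    "00000c289a1829a808ac09c00daf10bc3c4e223b",
    "00001411dc427966b17297bf4d69e7e193135d89",
]

# Same argument order as union(UserData$UserID, ArtistData$UserID) above.
id_map = build_id_map(user_data_ids, artist_data_ids)
user_new_ids = [id_map[u] for u in user_data_ids]
artist_new_ids = [id_map[u] for u in artist_data_ids]
print(user_new_ids, artist_new_ids)
```

Because both columns are translated through the same dictionary, an ID that appears in both data sets always receives the same integer.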

Error in frequency table in R

I have a dataframe that looks like this:
Name Condition NumMessage
Table 1 NULL 80
Table 1 Fair 20
Table 1 Good 60
Table 1 Ideal 50
Table 1 Great 80
Table 2 NULL 80
Table 2 Fair 100
Table 2 Good 90
Table 2 Ideal 50
Table 2 Great 40
and so on. I tried to create a frequency table for the number of messages in each table.
data = as.data.frame(prop.table(table(dataframe$Name)))
colnames(data) = c('Table Name', 'Frequency')
data
But this returns the same frequency for all tables. For example, Table 1 contains a total of 290 messages whereas Table 2 contains 360, yet the code above gives the same frequency for both.
Also, when I tried to get the frequency of each condition for each table, I got the same numbers across tables.
prop.table(table(dataframe$Condition, dataframe$Name))
NULL | some value
Fair | some value
Good | some value
Ideal | some value
Great | some value
Is this the correct way to get the frequency of the total number of messages for each table and the frequency of conditions for each table?
xtabs is the base R way to get a summed contingency table.
prop.table(xtabs(NumMessage ~ ., data=df), 1)
# Condition
#Name Fair Good Great Ideal NULL
# Table1 0.06896552 0.20689655 0.27586207 0.17241379 0.27586207
# Table2 0.27777778 0.25000000 0.11111111 0.13888889 0.22222222
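The reason the original attempt gives identical numbers is that table() counts rows (five per table) rather than summing NumMessage. The arithmetic behind the row-wise proportions can be sketched in plain Python as follows: sum the messages per table, then divide each condition's count by its table's total. The variable names are illustrative and the rows are copied from the question.

```python
from collections import defaultdict

# (Name, Condition, NumMessage) rows from the question
rows = [
    ("Table 1", "NULL", 80), ("Table 1", "Fair", 20), ("Table 1", "Good", 60),
    ("Table 1", "Ideal", 50), ("Table 1", "Great", 80),
    ("Table 2", "NULL", 80), ("Table 2", "Fair", 100), ("Table 2", "Good", 90),
    ("Table 2", "Ideal", 50), ("Table 2", "Great", 40),
]

# Step 1: total number of messages per table (sum, not row count).
totals = defaultdict(int)
for name, _condition, num_messages in rows:
    totals[name] += num_messages

# Step 2: share of each condition within its own table.
proportions = {
    (name, condition): num_messages / totals[name]
    for name, condition, num_messages in rows
}

for (name, condition), share in proportions.items():
    print(f"{name:8} {condition:6} {share:.4f}")
```

Within each table the shares add up to 1, which is what the second argument `1` to prop.table requests.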
We could try with acast
library(reshape2)
prop.table(acast(df1, Name~Condition, value.var='NumMessage', sum),1)
# Fair Good Great Ideal NULL
#Table 1 0.06896552 0.2068966 0.2758621 0.1724138 0.2758621
#Table 2 0.27777778 0.2500000 0.1111111 0.1388889 0.2222222
If we call your dataset df, then perhaps this is what you are looking for?
df1 = subset(df, Name=='Table1')
df2 = subset(df, Name=='Table2')
prop.table(df1[,3])
prop.table(df2[,3])
aggregate(df1$NumMessage, list(df1$Name), sum)
aggregate(df2$NumMessage, list(df2$Name), sum)
You can always tackle this with the package sqldf.
library(sqldf)
Name<-c('Table1','Table1','Table1','Table1','Table1','Table2','Table2','Table2','Table2','Table2')
Cond<-c(NA,'Fair','Good','Ideal','Great',NA,'Fair','Good','Ideal','Great')
Msg<-c(80,20,60,50,80,80,100,90,50,40)
df<-data.frame(Name,Cond,Msg)
Your dataframe:
Name Cond Msg
1 Table1 <NA> 80
2 Table1 Fair 20
3 Table1 Good 60
4 Table1 Ideal 50
5 Table1 Great 80
6 Table2 <NA> 80
7 Table2 Fair 100
8 Table2 Good 90
9 Table2 Ideal 50
10 Table2 Great 40
Now simply use this statement for sum of messages for each table:
sqldf("select Name, sum(Msg) from df group by Name ")
Name sum(Msg)
1 Table1 290
2 Table2 360
If you want sum of messages for each condition then use:
sqldf("select Cond, sum(Msg) from df group by Cond ")
Cond sum(Msg)
1 <NA> 160
2 Fair 120
3 Good 150
4 Great 120
5 Ideal 100
Hope that helps.

SQLite query with LAST and DISTINCT

I have an example table:
ID | ArticleID | Price | SupplierID | dateAdded
1 1 100 1 2014-08-01
2 1 110 2 2014-08-01
3 2 105 1 2014-08-01
4 2 106 1 2014-08-01
5 2 101 2 2014-08-01
6 3 100 1 2014-08-01
7 1 107 2 2014-09-01
8 3 343 2 2014-09-01
9 3 232 2 2014-09-01
10 1 45 1 2014-09-01
I want to use .query on this table and select LAST value entered for each DISTINCT ArticleID for each SupplierID, resulting in:
ID | ArticleID | Price | SupplierID
10 1 45 1
9 3 232 2
6 3 100 1
7 1 107 2
4 2 106 1
5 2 101 2
I want to get price for last ArticleID entered for each SupplierID.
What should I enter into
public Cursor query (boolean distinct, String table, String[] columns, String selection, String[] selectionArgs, String groupBy, String having, String orderBy, String limit)
I came up with this so far:
String[] columns = new String[]{DatabaseOpenHelper.KEY_ID, DatabaseOpenHelper.KEY_CENA, DatabaseOpenHelper.KEY_IZDELEK_ID};
Cursor crs = database.query(true,"prices", columns, selection, selectionArgs, null, null, null, null);
but now I'm stuck:S
Any hint how to do this?
You can also suggest raw query if possible..
Raw query would be like this:
SELECT ID, ArticleID, Price, SupplierID FROM your_table WHERE ID IN (SELECT max(ID) from your_table GROUP BY ArticleID, SupplierID);
I assumed the IDs are autoincremented, so more recent entries have higher IDs. If that's not the case, change the subquery to pick the latest row per group by the dateAdded column instead of max(ID).
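To check the raw query against the sample data, here is a small sketch using Python's sqlite3 module with an in-memory database. The table name prices is taken from the question's database.query call; the column types are assumptions. On Android the same SQL string could be passed to a raw-query call instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prices (
        ID INTEGER PRIMARY KEY,
        ArticleID INTEGER,
        Price INTEGER,
        SupplierID INTEGER,
        dateAdded TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 100, 1, "2014-08-01"), (2, 1, 110, 2, "2014-08-01"),
        (3, 2, 105, 1, "2014-08-01"), (4, 2, 106, 1, "2014-08-01"),
        (5, 2, 101, 2, "2014-08-01"), (6, 3, 100, 1, "2014-08-01"),
        (7, 1, 107, 2, "2014-09-01"), (8, 3, 343, 2, "2014-09-01"),
        (9, 3, 232, 2, "2014-09-01"), (10, 1, 45, 1, "2014-09-01"),
    ],
)

# Keep only the row with the highest ID in each (ArticleID, SupplierID) group.
latest = conn.execute(
    """
    SELECT ID, ArticleID, Price, SupplierID
    FROM prices
    WHERE ID IN (SELECT MAX(ID) FROM prices GROUP BY ArticleID, SupplierID)
    ORDER BY ID
    """
).fetchall()
for row in latest:
    print(row)
```

Note that in the sample data several rows of the same article and supplier share one dateAdded value (IDs 3 and 4, or 8 and 9), so a version that matches on the maximum date alone would return both rows of such a pair; the ID-based filter returns exactly one row per group.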
After fiddling around a bit, and with the help of a friend, I came up with an SQL query that does what I want; I'm not sure how well optimized it is:
select tab.* from cene tab inner join (
select izdelek_id, trgovina_id, Max(enter_date) as maxDate
from cene group by izdelek_id, trgovina_id) art
on (art.izdelek_id = tab.izdelek_id) and (art.trgovina_id = tab.trgovina_id) and (art.maxDate = tab.enter_date)
izdelek_id = ArticleID
trgovina_id = SupplierID
cene is the name of a table.
Hope it helps somebody.

Get the unique number of entries in column two from table two, where the first column of both tables match

I have two tables
tmp_CID_EIDs:
EID
====
1
2
3
5
EID_PID:
EID PID
==========
1 99
2 99
3 88
5 99
12 55
18 66
I use the following query to get a list of all positions where EID matches in both tables:
SELECT EID,
PID
FROM EID_PID
WHERE EID IN tmp_CID_EIDs
-->
EID PID
=========
1 99
2 99
3 88
5 99
But my final goal is to get the number of unique PIDs from this query.
--> 99
88
How can I do that? Thanks..
SELECT DISTINCT PID FROM EID_PID WHERE EID IN tmp_CID_EIDs;
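The DISTINCT query returns the list of unique PIDs; if the number of unique PIDs is what is wanted, COUNT(DISTINCT PID) gives it directly. Below is a minimal sketch with Python's sqlite3 module and an in-memory database, using the sample data from the question. It spells the filter out as an explicit subquery, which behaves the same as the `IN tmp_CID_EIDs` shorthand for a single-column table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_CID_EIDs (EID INTEGER)")
conn.execute("CREATE TABLE EID_PID (EID INTEGER, PID INTEGER)")
conn.executemany("INSERT INTO tmp_CID_EIDs VALUES (?)", [(1,), (2,), (3,), (5,)])
conn.executemany(
    "INSERT INTO EID_PID VALUES (?, ?)",
    [(1, 99), (2, 99), (3, 88), (5, 99), (12, 55), (18, 66)],
)

# List of unique PIDs whose EID appears in tmp_CID_EIDs.
unique_pids = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT PID
        FROM EID_PID
        WHERE EID IN (SELECT EID FROM tmp_CID_EIDs)
        ORDER BY PID
        """
    )
]

# Number of unique PIDs, computed by the database itself.
(pid_count,) = conn.execute(
    """
    SELECT COUNT(DISTINCT PID)
    FROM EID_PID
    WHERE EID IN (SELECT EID FROM tmp_CID_EIDs)
    """
).fetchone()

print(unique_pids, pid_count)
```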
