How to sum distinctly the values of a single column using htsql? - htsql

I have multiple records in my table's column, and most of them are duplicate entries. I want to sum them so that each duplicated number is counted only once, like:
Numbers
10
10
10
15
20
The summed result should be 45
I am using this query:
/sum(summary.filter(start_time>='2013-01-01'&end_time<='2013-05-01'&student='john'&course='BCS').s_sub_n)
Please help me figure out where to put ^ to make s_sub_n distinct.

OK, if I am not getting you wrong, you want the Number column to be made distinct and then summed.
Distinct Number column (s_sub_n):
10
15
20
and then its sum = 45
so the htsql query would be:
/sum(summary.filter(start_time>='2013-01-01'&end_time<='2013-05-01'&student='john'&course='BCS')^{s_sub_n}{s_sub_n})
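For comparison, the same distinct-then-sum logic in plain SQL (a sketch, assuming a summary table with exactly these columns) would be:
select sum(distinct s_sub_n)
from summary
where start_time >= date '2013-01-01'
  and end_time <= date '2013-05-01'
  and student = 'john'
  and course = 'BCS'
SUM(DISTINCT ...) collapses the repeated 10s before adding, which gives 45 for the example above.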

Related

How to find the ID number of a value?

I am currently working with a dataset of 551 observations and 141 variables. There are some mistakes made by the data entry operators, and I am now screening and correcting them. The problem is that the ID numbers and the row numbers of the dataset do not correspond, and my checks only return the row number where the problematic data lies. It is taking me a lot of time to find the matching ID numbers. Is there any way to get the ID number of the problematic data within one command?
For example, ID B345 is in row 1 and ID B346 is in row 2.
My dataset looks like this:
ID S1 S2 S3 I30 I31 I34
B345 12 23 3 2 1 4
B346 15 4 4 3 2 4
I am using the following command on my original dataset and got the following result: rows 351 and 500, but their actual ID numbers are B456 and B643.
which(x$I30 == 0)
[1] 351 500
I am hoping to get the ID numbers within one command; that would be very helpful.
How about this?
x$ID[which(x$I30==0)]
We can just use the logical condition to subset the 'ID':
x$ID[x$I30 ==0]
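As a quick self-contained check (toy data standing in for your real dataset):
x <- data.frame(ID  = c("B345", "B346", "B456", "B643"),
                I30 = c(2, 3, 0, 0))
which(x$I30 == 0)        # row numbers: 3 4
x$ID[which(x$I30 == 0)]  # "B456" "B643"
x$ID[x$I30 == 0]         # same result via logical subsetting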

Failing to identify the correct alternative in an objective question

My question is related to R.
I have a code snippet for each of 5 answer choices. When I run them, every choice except one gives an error, and the one that does run does not match the question.
My question is:
A B C D E
1 7 4 23 68 15
2 12 53 14 10 20
3 39 88 98 50 84
4 18 38 33 47 72
5 31 6 51 38 27
6 20 15 68 99 50
This data frame is given. To create it I wrote the following code block:
A <- c(7, 12, 39, 18, 31, 20)
B <- c(4, 53, 88, 38, 6, 15)
C <- c(23, 14, 98, 33, 51, 68)
D <- c(68, 10, 50, 47, 38, 99)
E <- c(15, 20, 84, 72, 27, 50)
df_x <- data.frame(A, B, C, D, E)
Question: Which of the following R code snippets will subset the data frame df_x, returning the final three rows?
The answer choices are:
1. df_x[nrow(df_x)-2:nrow(df_x)]
2. df_x[(nrow(df_x)-2):nrow(df_x)]
3. df_x[nrow(df-x)-2:,]
4. df_x[-3:]
5. df_x[(nrow(df_x)-2):nrow(df_x)
Among them, only the 1st choice, df_x[nrow(df_x)-2:nrow(df_x)], produces any output.
Output:
D C B A
1 68 23 4 7
2 10 14 53 12
3 50 98 88 39
4 47 33 38 18
5 38 51 6 31
6 99 68 15 20
I think this is not the correct one, and all the other choices give errors. Can anyone tell me which is the correct choice, or what the actual code to answer the question should be? I am new to R, so it is hard for me to work out the correct one.
df_x[(nrow(df_x)-2):nrow(df_x),]
Keep in mind, the convention is df[rows, columns], and you need to specify both arguments, which is why I put a comma after the row index in the solution.
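As an aside (not among the original choices), base R's tail() returns the same rows without any index arithmetic:
tail(df_x, 3)  # also returns the final three rows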
Cheers,
Joe
The other choices produce errors because they do not construct the indexes properly.
In R, when you subset a data frame, you give the row numbers and the column numbers.
For example, df[row, col] returns the value at the given row and column, while df[row, ] selects all columns for the given row.
If you don't put a comma (,) in the index, you are only selecting columns: for example, df[1:2] selects the first and second columns.
To select multiple rows or columns, you can pass ranges for both, e.g. df[1:3, 3:9].
When you use -, R removes the given rows or columns: df[-1, ] removes the first row, df[, -3] removes the third column, and df[-1:-5, ] removes the first five rows.
Those answer choices all have errors because the commas are missing or misplaced. Open-ended ranges such as df_x[-3:] are Python slicing syntax, not R; if you want to select up to the last row or column in R, you must supply the last row or column number explicitly, which you get from nrow(df) or ncol(df).
The closest answer here is df_x[(nrow(df_x)-2):nrow(df_x)], but you need to add a comma: df_x[(nrow(df_x)-2):nrow(df_x),]
The problem you are expected to recognize (but have not) is operator precedence. The colon operator (for generating sequences) has a higher precedence than binary minus, so the expression nrow(df_x)-2:nrow(df_x) computes the element-wise difference, with recycling, between nrow(df_x) and the vector 2:nrow(df_x). Option 2, which isolates nrow(df_x)-2 from the colon operator with parentheses, therefore builds the correct index. Adding parentheses to make terms explicit is good programming practice. See:
?Syntax
The other problem is the missing comma after those expressions ... I think your course text should have given option 2 as
df_x[(nrow(df_x)-2):nrow(df_x),]
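To see the precedence issue concretely, evaluate the index expressions on their own (using the df_x built above):
nrow(df_x)                           # 6
nrow(df_x) - 2:nrow(df_x)            # 6 - c(2,3,4,5,6)  ->  4 3 2 1 0
(nrow(df_x) - 2):nrow(df_x)          # 4 5 6
df_x[(nrow(df_x) - 2):nrow(df_x), ]  # rows 4, 5 and 6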

Moving sum with dates for Teradata

I have a situation where I have to create a moving sum for the past 6 months. My data looks like
A B 20-Jan-18 20
A B 20-Mar-18 45
A B 10-Apr-18 15
A B 21-May-18 30
A B 30-Jul-18 10
A B 15-Aug-18 25
And the expected result is
A B 20-Jan-18 20 20 Sum of row1
A B 20-Mar-18 45 65 Sum of row1+2
A B 10-Apr-18 15 80 Sum of row1+2+3
A B 21-May-18 30 110 Sum of row1+2+3+4
A B 30-Jul-18 10 100 Sum of row2+3+4+5 (as row1 is > 6 months in the past)
A B 15-Aug-18 25 125 Sum of row2+3+4+5+6
I tried the solution proposed in an earlier thread: inserting dummy records for dates with no record and then using ROWS BETWEEN 181 PRECEDING AND CURRENT ROW.
But there may be situations with multiple records on the same day, in which case choosing the last 181 rows would drop the earliest records.
I have checked a lot of cases on this forum and others but can't find a solution for this moving sum where the window size is not constant. Please help.
Teradata doesn't implement RANGE in windowed aggregates, but you can use old-style SQL to get the same result. If the number of rows per group is not too high, it's very efficient, but it needs an intermediate table (unless the GROUP BY columns are already the PI of the source table). A self-join on the PI columns results in an AMP-local direct join plus local aggregation; without matching PIs it will be a less efficient join plus global aggregation.
create volatile table vt as
( select a, b, datecol, sumcol
  from mytable
) with data
primary index(a,b);

select t1.a, t1.b, t1.datecol,
       sum(t2.sumcol)
from vt as t1
join vt as t2
  on  t1.a = t2.a
  and t1.b = t2.b
  and t2.datecol between t1.datecol - 181 and t1.datecol
group by 1,2,3
Of course this will not work as expected if there are multiple rows per day (the n*m self-join would inflate the number of rows feeding the sum). You need some unique column combination; a column like defect_id might be useful.
Otherwise you need to switch to a scalar subquery, which takes care of non-uniqueness but is usually less efficient:
create volatile table vt as
( select a, b, defect_id, datecol, sumcol
  from mytable
) with data
primary index(a,b);

select t1.*,
       (select sum(t2.sumcol)
        from vt as t2
        where t1.a = t2.a
          and t1.b = t2.b
          and t2.datecol between t1.datecol - 181 and t1.datecol)
from vt as t1
To use your existing approach, you must first aggregate those multiple rows per day.
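For example, a sketch that collapses same-day rows before the windowed sum (column names as in the examples above):
create volatile table vt as
( select a, b, datecol, sum(sumcol) as sumcol
  from mytable
  group by a, b, datecol
) with data
primary index(a,b);
With one row per (a, b, datecol) combination, the 181-day self-join above no longer double-counts, and the dummy-record approach with ROWS BETWEEN becomes safe as well.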

Using list of row numbers as criteria to populate field

I have a list of row numbers representing the rows that contain outliers in a data set. I would like to add an "outlier" column to the original data set that flags those rows, but I can't figure out how to use row numbers as criteria in R.
Example:
I have a dataframe like this:
id <- c("a", "b", "c", "d")
values <- c(10, 11, 22, 33)
df <- data.frame(id, values)
id values
1 a 10
2 b 11
3 c 22
4 d 33
And a list like this containing row numbers (more precisely, "row names"):
outliers <-c(2,4)
I'd like to find a way to use the list of row numbers as criteria in something like:
df$outlier_test <- ifelse(row number is in my list, "outlier", "")
to produce something like this:
id values outlier_test
1 a 10
2 b 11 outlier
3 c 22
4 d 33 outlier
I spent quite a while puzzling over this and had the inspiration as soon as I posted the question. For anyone else who comes here with the same problem:
First:
df$rownumber <- row.names(df)
then:
df$outlier_test <- ifelse(df$rownumber %in% outliers, "outlier", "")
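A note on this solution: row.names(df) returns character strings, so the %in% comparison works by coercion. If outliers really holds row numbers, you can skip the helper column entirely (a one-line sketch):
df$outlier_test <- ifelse(seq_len(nrow(df)) %in% outliers, "outlier", "")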

Dealing with duplicated data: reassigning a new value

It seems that when we have duplicated data, most of the time we want to remove it.
Let's say we do not want to exclude it, but instead want to assign it a new value.
Taking the following data as a example
b <- c(1:100,1:99,1:104,1:105,1:105)
So the values 1-99 are repeated 5 times, the number 100 is repeated 4 times, the numbers 101-104 are repeated 3 times, and 105 appears twice. How can one search through b (ideally in sequential order), find the repeated/duplicate numbers, and assign them a new value?
Try this if you're interested in assigning one (universal) new value:
b <- c(1:100,1:99,1:104,1:105,1:105)
b[duplicated(b)] <- 888  # new value
The duplicated() function flags the positions of all values in b that have already appeared earlier, so only those repeated occurrences are overwritten.
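A minimal sketch on a toy vector, showing exactly which positions get overwritten:
x <- c(10, 10, 10, 15, 20)
duplicated(x)            # FALSE  TRUE  TRUE FALSE FALSE
x[duplicated(x)] <- 888
x                        # 10 888 888 15 20
Only the second and later occurrences are flagged; the first occurrence of each value is kept.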
