How do I get duplicate rows? - teradata

I have a Teradata table; part of it is shown below for reference.
I need to print out the rows that have exactly the same values in every column.
Table Values:
id  Name  City    Country
1   John  Berlin  Germany
2   Mike  Warsaw  Poland
3   Neil  London  England
1   John  Berlin  Germany
2   Mike  Warsaw  Poland
4   Alan  Moscow  Russia
The output that I am expecting is:
id  Name  City    Country
1   John  Berlin  Germany
2   Mike  Warsaw  Poland

This might solve your problem. Group by every column and keep only the groups that occur more than once:
SELECT id, Name, City, Country
FROM TableName
GROUP BY id, Name, City, Country
HAVING COUNT(*) > 1;
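The GROUP BY ... HAVING approach above can be tried out locally. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for Teradata (an assumption made so the example runs anywhere), and the extra `copies` column is added purely for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Teradata table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableName (id INTEGER, Name TEXT, City TEXT, Country TEXT)"
)
rows = [
    (1, "John", "Berlin", "Germany"),
    (2, "Mike", "Warsaw", "Poland"),
    (3, "Neil", "London", "England"),
    (1, "John", "Berlin", "Germany"),
    (2, "Mike", "Warsaw", "Poland"),
    (4, "Alan", "Moscow", "Russia"),
]
conn.executemany("INSERT INTO TableName VALUES (?, ?, ?, ?)", rows)

# Group on every column; a group with more than one member is a duplicated row.
duplicates = conn.execute(
    """
    SELECT id, Name, City, Country, COUNT(*) AS copies
    FROM TableName
    GROUP BY id, Name, City, Country
    HAVING COUNT(*) > 1
    ORDER BY id
    """
).fetchall()

for row in duplicates:
    print(row)
```

Each duplicated row comes back once, together with how many copies of it exist.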

@ManguYogi's solution works fine (and is probably more efficient), but I wanted to add another solution because this is one of the rare cases where EXCEPT ALL is useful:
select * from mytab
EXCEPT ALL
select distinct * from mytab
EXCEPT ALL removes one copy of each distinct row, so a row that exists more than twice is returned multiple times (three copies give two result rows). Of course, if you're interested in this, you can simply add the count to @ManguYogi's SELECT.
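To make the multiset behaviour of EXCEPT ALL concrete, here is a small sketch that emulates it in plain Python with collections.Counter. It does not run any Teradata SQL; the sample rows (with John stored three times) are made up for illustration.

```python
from collections import Counter

# Hypothetical contents of mytab: John appears three times, Mike twice.
mytab = [
    (1, "John", "Berlin", "Germany"),
    (2, "Mike", "Warsaw", "Poland"),
    (3, "Neil", "London", "England"),
    (1, "John", "Berlin", "Germany"),
    (2, "Mike", "Warsaw", "Poland"),
    (1, "John", "Berlin", "Germany"),
]

all_rows = Counter(mytab)            # select * from mytab
distinct_rows = Counter(set(mytab))  # select distinct * from mytab

# Counter subtraction is a multiset difference, like EXCEPT ALL:
# one copy of every distinct row is removed, the extra copies remain.
extra_copies = all_rows - distinct_rows
result = sorted(extra_copies.elements())

for row in result:
    print(row)
```

John is printed twice and Mike once, while rows that occur only once (Neil) drop out entirely.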

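-- this will select the second and any later copy of each duplicated row
-- (partition by every column so that only exact duplicates match)
select * from johns_table
qualify row_number() over (partition by id, Name, City, Country order by id) > 1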

Related

I have to edit one element in a dataframe. How to do it?

I have the following dataframe -
first_name last_name Location Qualification
1 Saif Mehdi Hyderabad M.Tech
2 Rishi Gupta Pune B.Tech
3 Aditi Pandey unknown unkown
4 Poorwa Kunwar Japan B.Tech
5 Rajesh Choudhary Jaipur B.Tech
6 Hari Desai Mumbai M.Tech
7 Sayan Das Kolkata B.Tech
8 Deepjyoti Borah Kolkata B.Tech
9 Bharathi Ganesan R Trichy M.Tech
10 Jayarama Krishna Vizag M.Tech
11 Akhil Gopinath Banglore M.S
Now I want to edit the 4th row's Qualification column.
How do I do it?
I am new to R.
You can access the Qualification column via df$Qualification, where df is your data frame. Then, you can access the 4th element of that via df$Qualification[4]. To replace the existing value with a new value, new_value, you can write: df$Qualification[4] <- new_value.

Subtracting subset from larger dataset in R

Hi all: I have two variables. The first is entitled WITHOUT_VERANDAS. It is a list of cities, aggregated by average rental prices of homes WITHOUT verandas (there are about 200 rows):
City Price
1 Appleton 5000
2 Ames 9000
3 Lodi 1020
4 Milwaukee 2010
5 Barstow 2000
6 Chicago 2320
7 Champaign 2000
The second variable is entitled WITH_VERANDAS. It's a list of cities, aggregated by average rental prices of homes WITH verandas (there are about 10 rows, this is a subset of the previous dataset, since not every city has rental properties with verandas):
City Price
1 Milwaukee 3000
2 Chicago 2050
3 Lodi 5000
For each city on the WITH_VERANDAS list, I want to subtract that city's WITHOUT_VERANDAS price from its WITH_VERANDAS price. I want to see which cities have the highest or lowest differential. Essentially, the result should only include the cities in the WITH_VERANDAS data.
I've tried this:
difference <- WITH_VERANDAS$Price-WITHOUT_VERANDAS$Price
View(difference)
However, this returns as many rows as the WITHOUT_VERANDAS dataset. I also get an error:
longer object length is not a multiple of shorter object length
And the result simply subtracts WITHOUT_VERANDAS's row 1 from WITH_VERANDAS's row 1, and so on down the rows, as seen in the results (for example, row 1 of the output is the value of Milwaukee - Appleton, row 2 is Chicago - Ames, and so forth):
1. -2000
2. -6950
If I could only filter WITHOUT_VERANDAS to include only the cities included in WITH_VERANDAS, I think it would work. Thanks!
R2evans, thank you! This worked great. Now, I have:
City Price.x Price.y
1 Appleton NA 5000
2 Ames NA 9000
3 Lodi 5000 1020
4 Milwaukee 3000 2010
How would I go about filtering this list to take out any row where Price.x is NA, i.e., all rows that did not match? Thanks again!

Construct a vector of names from data frame using R

I have a big data frame that contains data about the outcomes of sports matches. I want to try and extract specific data from the data frame depending on certain criteria. Here's a quick example of what I mean...
Imagine I have a data frame df, which displays data about specific football matches of a tournament on each row, like so:
  Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd      John        England     Barcalona   Carlos       Spain
2 Liverpool    Steve       England     Juventus    Mario        Italy
3 Man utd      John        Scotland    R Madrid    Juan         Spain
4 Paris SG     Teirey      France      Chelsea     Mark         England
So, for example, in row [1] Man utd won against Barcalona, Man utd's captain's name was John and he is from England. Barcalona's (the losers of the match) captain's name was Carlos and he is from Spain.
I want to construct a vector with the names of all English players in the tournament, where the output should look something like this:
[1] "John" "Mark" "Steve"
Here's what I've tried so far...
My first step was to create a data frame that discards all the matches that don't have English captains
> England_player <- data.frame(filter(df, Win_Country == "England" | Lose_Country == "England"))
> England_player
  Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd      John        England     Barcalona   Carlos       Spain
2 Liverpool    Steve       England     Juventus    Mario        Italy
3 Paris SG     Teirey      France      Chelsea     Mark         England
Then I used select() on England_player to isolate just the names:
> England_player_names <- select(England_player, Win_Capt_Nm, Lose_Capt_Nm)
> England_player_names
Win_Capt_Nm Lose_Capt_Nm
1 John Carlos
2 Steve Mario
3 Teirey Mark
And then I get stuck! As you can see, the output displays each English captain's name together with the name of his opponent, which is not what I want!
It's easy to just read the names off this data frame, but the data frame I'm working with is large, so reading the values off by eye is no good!
Any suggestions as to how I'd do this?
english.players <- union(df$Win_Capt_Nm[df$Win_Country == 'England'], df$Lose_Capt_Nm[df$Lose_Country == 'England'])
[1] "John" "Steve" "Mark"

The logic of WHERE Clause along with > operator and the sub-query

I don't get the logic of query 3 below, and hope someone can give me some idea.
For the query 3,
SELECT ID, NAME, AGE, SALARY FROM COMPANY WHERE AGE > (SELECT AGE FROM COMPANY WHERE SALARY < 20000);
I assumed the sub-query would first find the rows where the salary is < 20000, which is what query 2 below shows. The parent query would then compare every age in the table COMPANY (7 records in total: 18, 19, 22, 23, 24, 29, 37) with the ages returned by the sub-query (4 records in total: 18, 19, 23, 29) and show the records with the greater age.
I expected the result to show only ID 7, as below, since only this record meets the condition. The greatest age in the result of the sub-query (query 2) is 29, so this is the only record where the age is over 29.
ID NAME AGE SALARY
7 Vicky 37 32500.0
Unfortunately my expectation is not met, and I get the result shown for query 3 below.
I would like to understand the logic of how query 3 works, and hope someone can assist.
1. sqlite> SELECT ID, NAME, AGE, SALARY FROM COMPANY;
ID NAME AGE SALARY
1 John 24 21000.0
2 Davy 22 20000.0
3 Kenny 19 9700.0
4 Henry 23 13555.0
5 Sam 18 17000.0
6 Ray 29 8000.0
7 Vicky 37 32500.0
2. sqlite> SELECT ID, NAME, AGE, SALARY FROM COMPANY WHERE SALARY < 20000;
ID NAME AGE SALARY
3 Kenny 19 9700.0
4 Henry 23 13555.0
5 Sam 18 17000.0
6 Ray 29 8000.0
3. sqlite> SELECT ID, NAME, AGE, SALARY FROM COMPANY WHERE AGE > (SELECT AGE FROM COMPANY WHERE SALARY < 20000);
ID NAME AGE SALARY
1 John 24 21000.0
2 Davy 22 20000.0
4 Henry 23 13555.0
6 Ray 29 8000.0
7 Vicky 37 32500.0
At a guess, since it doesn't throw an error (which would seem the better behaviour; see also Col. 32's comment):
SQLite just picks the first age the sub-query returns. Which row comes first is not guaranteed without an ORDER BY, but going by the results shown in your query 2 and assuming some consistency, the first result is likely 19. It then picks all rows with an age larger than 19, which is what you see in the results of query 3.
Shuffle things around or create another set of data, and see whether what you then get from queries 2 and 3 is consistent with this assumption.
Someone else may know the internals of SQLite well enough to explain why this happens.
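The behaviour can be reproduced with a short sketch using Python's built-in sqlite3 module and the COMPANY data from the question. The second query, which uses MAX(AGE) in the sub-query, is not from the original posts; it is one possible way to express "older than everyone earning under 20000".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE COMPANY (ID INTEGER, NAME TEXT, AGE INTEGER, SALARY REAL)")
company_rows = [
    (1, "John", 24, 21000.0),
    (2, "Davy", 22, 20000.0),
    (3, "Kenny", 19, 9700.0),
    (4, "Henry", 23, 13555.0),
    (5, "Sam", 18, 17000.0),
    (6, "Ray", 29, 8000.0),
    (7, "Vicky", 37, 32500.0),
]
conn.executemany("INSERT INTO COMPANY VALUES (?, ?, ?, ?)", company_rows)

# Query 3 as posted: the sub-query yields several rows, but used as a scalar
# SQLite takes only the first one (which row is first is not guaranteed).
scalar_result = conn.execute(
    "SELECT ID, NAME, AGE, SALARY FROM COMPANY "
    "WHERE AGE > (SELECT AGE FROM COMPANY WHERE SALARY < 20000)"
).fetchall()

# Assumed alternative: compare against the largest age in the sub-query.
older_than_all = conn.execute(
    "SELECT ID, NAME, AGE, SALARY FROM COMPANY "
    "WHERE AGE > (SELECT MAX(AGE) FROM COMPANY WHERE SALARY < 20000)"
).fetchall()

print("query 3 as posted:", scalar_result)
print("with MAX(AGE):    ", older_than_all)
```

With MAX(AGE) the sub-query returns exactly one value (29), so only Vicky's row remains, which is the result the question expected.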

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3

Resources