I have a table that consists of names, points, and years. I need a query that returns every name for a specific year, even if that name has no row for that year. Example:
Name Points Year
------- -------
tom 8 2011
jim 45 2011
jerry 25 2011
zack 124 2011
jeff 45 2011
tom 62 2012
jim 214 2012
jerry 13 2012
zack 32 2012
arnold 4 2012
Desired result for 2011:
Name Points Year
------- -------
tom 8 2011
jim 45 2011
jerry 25 2011
zack 124 2011
jeff 45 2011
arnold NULL NULL
I figured this would be easy but I am struggling to make it work.
From your explanation, I think you could use something like this:
SELECT DISTINCT
N.`Name`,
D.`Points`,
Y.`Year`
FROM
`MyData` Y
LEFT JOIN (SELECT DISTINCT `Name` FROM `MyData`) N ON 1=1
LEFT JOIN `MyData` D
ON D.`Year` = Y.`Year`
AND D.`Name` = N.`Name`
ORDER BY
Y.`Year`
While it's not pretty, it does seem to work as intended.
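To check the behaviour of this cross-join approach, here is a minimal sketch that runs the same query shape against the sample data in an in-memory SQLite database from Python. SQLite is an assumption on my part (the backtick quoting suggests MySQL, but SQLite accepts it too), and the `WHERE Y.Year = ?` filter and the helper name `names_for_year` are additions for illustration. Note that `Year` comes from the driving table `Y`, so a name with no row for the requested year gets NULL points but still shows that year rather than NULL.

```python
import sqlite3

# Sample rows from the question: (Name, Points, Year)
rows = [
    ("tom", 8, 2011), ("jim", 45, 2011), ("jerry", 25, 2011),
    ("zack", 124, 2011), ("jeff", 45, 2011),
    ("tom", 62, 2012), ("jim", 214, 2012), ("jerry", 13, 2012),
    ("zack", 32, 2012), ("arnold", 4, 2012),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyData (Name TEXT, Points INTEGER, Year INTEGER)")
conn.executemany("INSERT INTO MyData VALUES (?, ?, ?)", rows)

# Same joins as the answer; the WHERE clause is an added assumption
# that restricts the result to a single year.
QUERY = """
SELECT DISTINCT N.`Name`, D.`Points`, Y.`Year`
FROM `MyData` Y
LEFT JOIN (SELECT DISTINCT `Name` FROM `MyData`) N ON 1=1
LEFT JOIN `MyData` D
       ON D.`Year` = Y.`Year`
      AND D.`Name` = N.`Name`
WHERE Y.`Year` = ?
ORDER BY N.`Name`
"""

def names_for_year(year):
    """Return one (Name, Points, Year) row per known name for the given year."""
    return conn.execute(QUERY, (year,)).fetchall()

for row in names_for_year(2011):
    print(row)
```

Every distinct name appears once per year, and `Points` is `None` (SQL NULL) where the name has no row for that year.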
I have two tables. The columns I am interested in from table 1 are "Year" and "CompanyName". Table 2 has 3 columns, including "Year" and "CompanyName".
How can I join these two tables together? The problem I have is that table 1 has many rows with the same combination of values, for example "Year" = "2004" and "CompanyName" = "Adidas". e.g.
# There are many other columns
Year CompanyName Spent
1 2004 Adidas 50
2 2004 Nike 34
3 2004 Adidas 45
4 2005 Reebok 33
5 2006 Reebok 11
6 2006 Adidas 47
7 2007 Nike 33
8 2007 Reebok 92
9 2007 Nike 01
10 2007 Adidas 23
#I want to join this to it
Year CompanyName Loss
1 2004 Nike 23
2 2004 Adidas 22
3 2005 Reebok 633
4 2006 Reebok 2
5 2006 Adidas 09
6 2007 Reebok 22
7 2007 Nike 34
I want to join the tables so that whenever Year is 2004 and CompanyName is Adidas, a Loss column is added with the value 22
Thank You!
You can do that with a left join on both columns:
library(dplyr)
df3 <- df1 %>%
left_join(df2, by = c("Year", "CompanyName"))
Just make sure df2 has no duplicate combinations of Year and CompanyName; otherwise the join repeats the matching rows of df1. You can remove duplicates with dplyr::distinct(df2, Year, CompanyName, .keep_all = TRUE), but that keeps only the first row per combination and may drop relevant information. If you're not certain about it, it might make sense to aggregate by those two dimensions instead:
df2 %>%
group_by(Year, CompanyName) %>%
summarise(Loss = sum(Loss))
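As an illustration of the same idea outside R, here is a small Python sketch of "aggregate the lookup table by key, then left join". The helper names `aggregate_loss` and `left_join` are hypothetical, and the values `01` and `09` from the question are read as the integers 1 and 9, which is an assumption.

```python
from collections import defaultdict

# Table 1 from the question: (Year, CompanyName, Spent)
df1 = [
    (2004, "Adidas", 50), (2004, "Nike", 34), (2004, "Adidas", 45),
    (2005, "Reebok", 33), (2006, "Reebok", 11), (2006, "Adidas", 47),
    (2007, "Nike", 33), (2007, "Reebok", 92), (2007, "Nike", 1),
    (2007, "Adidas", 23),
]

# Table 2 from the question: (Year, CompanyName, Loss)
df2 = [
    (2004, "Nike", 23), (2004, "Adidas", 22), (2005, "Reebok", 633),
    (2006, "Reebok", 2), (2006, "Adidas", 9), (2007, "Reebok", 22),
    (2007, "Nike", 34),
]

def aggregate_loss(rows):
    """Sum Loss per (Year, CompanyName) so each key occurs exactly once."""
    totals = defaultdict(int)
    for year, company, loss in rows:
        totals[(year, company)] += loss
    return dict(totals)

def left_join(left_rows, loss_by_key):
    """Keep every left row; Loss is None where the key has no match."""
    return [
        (year, company, spent, loss_by_key.get((year, company)))
        for year, company, spent in left_rows
    ]

joined = left_join(df1, aggregate_loss(df2))
for row in joined:
    print(row)
```

Because the lookup has one entry per key, the joined result has exactly as many rows as table 1, and unmatched keys (such as Adidas in 2007) get a missing Loss.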
id product id2 year cost
1 biscuits 202-55-3041 2017 2
2 biscuits 903-36-9457 2014 2
3 biscuits 938-33-7254 2014 2
4 biscuits 739-29-5963 2017 2
5 biscuits 731-49-5483 2017 2
6 biscuits 892-15-2567 2018 2
7 biscuits 518-79-7674 2017 2
8 biscuits 305-63-7908 2017 2
This is my current data set; it is called 'total1'.
I am a beginner in R and I was wondering whether there is a way to add up the cost of the product by year, for example:
In 2017 there were 10 biscuits sold
In 2018 there were 8 biscuits sold
I am trying to determine which is the least profitable year in terms of biscuits sold.
I apologise if this is answered elsewhere; if it is, please point me to it. Thank you.
Assuming that the number of sold items is stored in column cost, here's a simple solution using tapply:
tapply(total1$cost, total1$year, sum)
2014 2017 2018
4 10 2
Another simple solution uses aggregate (simplified thanks to @Darren Tsai's comment):
aggregate(cost ~ year, total1, sum)
year cost
1 2014 4
2 2017 10
3 2018 2
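The question also asks which year is the lowest. As a rough sketch of that last step (in Python rather than R, with the variable names `cost_by_year` and `lowest_year` chosen for illustration), the per-year totals can be computed and the minimum picked like this:

```python
from collections import defaultdict

# (year, cost) pairs taken from the rows of total1 shown in the question
total1 = [
    (2017, 2), (2014, 2), (2014, 2), (2017, 2),
    (2017, 2), (2018, 2), (2017, 2), (2017, 2),
]

# Sum cost per year, the same grouping that tapply/aggregate perform
cost_by_year = defaultdict(int)
for year, cost in total1:
    cost_by_year[year] += cost

totals = dict(sorted(cost_by_year.items()))

# The year with the smallest total
lowest_year = min(totals, key=totals.get)

print(totals)       # {2014: 4, 2017: 10, 2018: 2}
print(lowest_year)  # 2018
```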
I've searched but can't find the right answer, and I'm going round in circles.
I have
CREATE TABLE History (yr Int, output Int, cat Text);
yr output cat
---------- ---------- ----------
2015 10 a
2016 20 a
2017 30 a
2018 50 a
2019 70 a
2015 100 b
2016 200 b
2017 300 b
2018 500 b
2019 700 b
2015 1000 c
2016 2000 c
2017 3000 c
2018 5000 c
2019 7000 c
2015 10000 d
2016 20000 d
2017 30000 d
2018 50000 d
2019 70000 d
I've created two views
CREATE VIEW Core AS select * from History where cat = "c" or cat = "d";
CREATE VIEW Plus AS select * from History where cat = "a" or cat = "b";
My query is
select distinct yr, sum(output), (select sum(output) from core group by yr) as _core, (select sum(output) from plus group by yr) as _plus from history group by yr;
yr sum(output) _core _plus
---------- ----------- ---------- ----------
2015 11110 11000 110
2016 22220 11000 110
2017 33330 11000 110
2018 55550 11000 110
2019 77770 11000 110
Each of the individual queries works, but the _core and _plus columns are wrong when it's all put together. How should I approach this, please?
You may generate your expected output without a view, using a single query with conditional aggregation:
SELECT
yr,
SUM(output) AS sum_output,
SUM(CASE WHEN cat IN ('c', 'd') THEN output ELSE 0 END) AS _core,
SUM(CASE WHEN cat IN ('a', 'b') THEN output ELSE 0 END) AS _plus
FROM History
GROUP BY
yr;
If you really wanted to make your current approach work, one way would be to just join the two views by year. But that would leave open the possibility that each view might not have every year present.
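Both approaches can be compared with a short sketch using Python's built-in sqlite3 module. The join-based query below is my reading of "join the two views by year", not code from the answer, and the views use single-quoted string literals because some SQLite builds reject double-quoted ones. As far as I know, the original query repeats the same value on every row because an uncorrelated scalar subquery is evaluated once and only its first row is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE History (yr INT, output INT, cat TEXT);
CREATE VIEW Core AS SELECT * FROM History WHERE cat IN ('c', 'd');
CREATE VIEW Plus AS SELECT * FROM History WHERE cat IN ('a', 'b');
""")

# Rebuild the sample data: each category is the 'a' series times a power of 10
base = {2015: 10, 2016: 20, 2017: 30, 2018: 50, 2019: 70}
scale = {"a": 1, "b": 10, "c": 100, "d": 1000}
rows = [
    (yr, out * factor, cat)
    for cat, factor in scale.items()
    for yr, out in base.items()
]
conn.executemany("INSERT INTO History VALUES (?, ?, ?)", rows)

# Conditional aggregation, as in the answer
CONDITIONAL = """
SELECT yr,
       SUM(output) AS sum_output,
       SUM(CASE WHEN cat IN ('c', 'd') THEN output ELSE 0 END) AS _core,
       SUM(CASE WHEN cat IN ('a', 'b') THEN output ELSE 0 END) AS _plus
FROM History
GROUP BY yr
ORDER BY yr
"""

# Joining per-year totals of the views (LEFT JOIN keeps years missing from a view)
JOINED = """
SELECT h.yr, h.total AS sum_output, c.total AS _core, p.total AS _plus
FROM (SELECT yr, SUM(output) AS total FROM History GROUP BY yr) h
LEFT JOIN (SELECT yr, SUM(output) AS total FROM Core GROUP BY yr) c ON c.yr = h.yr
LEFT JOIN (SELECT yr, SUM(output) AS total FROM Plus GROUP BY yr) p ON p.yr = h.yr
ORDER BY h.yr
"""

conditional = conn.execute(CONDITIONAL).fetchall()
joined = conn.execute(JOINED).fetchall()

for row in conditional:
    print(row)
```

On this data both queries return the same rows. With the join version, a year missing from one view would yield NULL in that column, whereas conditional aggregation yields 0.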
I have the following dataframe:
Count Year
32 2018
346 2017
524 2016
533 2015
223 2014
1 2010
3 2008
1 1992
Is it possible to exclude the years 1992 and 2008? I tried different ways, but couldn't find a flexible solution.
I would like to have the same dataframe without the years 1992 and 2008.
Many thanks in advance,
jeemer
library(dplyr); filter(df, !Year %in% c(1992, 2008))
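A common pitfall with this kind of exclusion is combining the two inequality tests with "or": every year differs from at least one of the two values, so the condition is always true and nothing is dropped. The sketch below shows the difference in plain Python (variable names are illustrative only); the same logic applies to `|` versus `&` in R.

```python
years = [2018, 2017, 2016, 2015, 2014, 2010, 2008, 1992]
excluded = {1992, 2008}

# "or" of two inequalities is always true, so every year is kept
with_or = [y for y in years if y != 1992 or y != 2008]

# "and" of the inequalities excludes both years
with_and = [y for y in years if y != 1992 and y != 2008]

# A membership test scales better when the list of excluded years grows
with_set = [y for y in years if y not in excluded]

print(with_or)   # [2018, 2017, 2016, 2015, 2014, 2010, 2008, 1992]
print(with_and)  # [2018, 2017, 2016, 2015, 2014, 2010]
print(with_set)  # [2018, 2017, 2016, 2015, 2014, 2010]
```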
I'm working on a project where I need to sort data based on how people vote. I cannot find a function where I can delete duplicate rows based on certain conditions being met.
I'm looking for a function that will remove duplicate rows based on one column having duplicate values and another column meeting certain conditions.
For example in the table below I would like to remove voters who voted in three different elections. Paul needs to be removed from this data frame.
df <- data.frame(Name=c("Paul","Paul","Mary","Bill","Jane","Paul","Mary","John",
"Bill","John"),ElectionDay=c("November 2010","November 2014",
"November 2010","November 2010","November 2014","November 2006",
"November 2014","November 2010","November 2014","November 2014"))
df
# Name ElectionDay
# 1 Paul November 2010
# 2 Paul November 2014
# 3 Mary November 2010
# 4 Bill November 2010
# 5 Jane November 2014
# 6 Paul November 2006
# 7 Mary November 2014
# 8 John November 2010
# 9 Bill November 2014
# 10 John November 2014
Below is an example of the result I am looking for:
Name ElectionDay
1 Mary November 2010
2 Bill November 2010
3 Jane November 2014
4 Mary November 2014
5 John November 2010
6 Bill November 2014
7 John November 2014
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Name', we get the length of unique 'ElectionDay' (uniqueN(ElectionDay)). If the length is less than 3, we get the Subset of Data.Table (.SD).
library(data.table)#v1.9.6+
setDT(df)[, if(uniqueN(ElectionDay) < 3) .SD, by = Name]
A similar base R option would be using ave. We get the length of unique elements of 'ElectionDay' grouped by 'Name' and check whether it is less than 3 to get a logical index. The index can be used to subset the rows of dataset.
# factor codes are used so that ave() returns numeric counts, not strings
df[with(df, ave(as.integer(factor(ElectionDay)), Name,
FUN = function(x) length(unique(x)))) < 3, ]
# Name ElectionDay
#3 Mary November 2010
#4 Bill November 2010
#5 Jane November 2014
#7 Mary November 2014
#8 John November 2010
#9 Bill November 2014
#10 John November 2014
The names that occur in more than 2 rows are calculated as
names(which(table(df$Name) > 2))
#[1] "Paul"
So what you need is
df[!(df$Name %in% names(which(table(df$Name) > 2))), ]
# Name ElectionDay
#3 Mary November 2010
#4 Bill November 2010
#5 Jane November 2014
#7 Mary November 2014
#8 John November 2010
#9 Bill November 2014
#10 John November 2014
Or you can use dplyr: count the number of distinct elections in which each person voted, then remove the rows for which that count is 3:
library(dplyr)
df %>%
group_by(Name) %>%
mutate(NumberElections = length(unique(ElectionDay))) %>%
ungroup() %>%
filter(NumberElections != 3)
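The answers above differ in one detail: the `table()` version counts rows per name, while the others count distinct elections per name. The two agree on this data because no voter appears twice for the same election. A minimal Python sketch of the distinct-count variant, with the hypothetical helper name `drop_frequent_voters`, looks like this:

```python
from collections import defaultdict

# (Name, ElectionDay) rows from the question's data frame
votes = [
    ("Paul", "November 2010"), ("Paul", "November 2014"),
    ("Mary", "November 2010"), ("Bill", "November 2010"),
    ("Jane", "November 2014"), ("Paul", "November 2006"),
    ("Mary", "November 2014"), ("John", "November 2010"),
    ("Bill", "November 2014"), ("John", "November 2014"),
]

def drop_frequent_voters(rows, limit=3):
    """Keep rows of voters who appear in fewer than `limit` distinct elections."""
    elections = defaultdict(set)
    for name, day in rows:
        elections[name].add(day)  # a set ignores repeated (name, day) pairs
    return [(name, day) for name, day in rows if len(elections[name]) < limit]

kept = drop_frequent_voters(votes)
for row in kept:
    print(row)
```

The original row order is preserved, and only Paul's three rows are removed.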