How to find & remove duplicates in data frames? - r

I have the following data frame, which happens to be NBA draft data:
draft_year draft_round teamid playerid draft_from
1961 1 Bos Pol1 Nan
2001 1 LA Ben2 Cal
1967 2 Min Mac2 Nan
2001 1 LA Ben2 Cal
2000 1 C Sio1 Bud
2000 1 C Gio1 Bud
I would like to find and remove only those rows that are duplicated in playerid. The repeated values in the other columns are meaningful and must be kept.

In the data.table package, the unique function has a by parameter:
library(data.table)
unique(setDT(df), by = "playerid")
# draft_year draft_round teamid playerid draft_from
# 1: 1961 1 Bos Pol1 Nan
# 2: 2001 1 LA Ben2 Cal
# 3: 1967 2 Min Mac2 Nan
# 4: 2000 1 C Sio1 Bud
# 5: 2000 1 C Gio1 Bud

You can achieve this in base R with duplicated():
new_df <- df[!duplicated( df$playerid), ]
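To cover the "find" half of the question as well, here is a minimal sketch in base R. The data frame is rebuilt from the table in the question, and the name dupes is only illustrative.

```r
# Rebuild the example data from the question
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
draft_year draft_round teamid playerid draft_from
1961 1 Bos Pol1 Nan
2001 1 LA  Ben2 Cal
1967 2 Min Mac2 Nan
2001 1 LA  Ben2 Cal
2000 1 C   Sio1 Bud
2000 1 C   Gio1 Bud
")

# Find: rows whose playerid already appeared in an earlier row
dupes <- df[duplicated(df$playerid), ]

# Remove: keep only the first occurrence of each playerid
new_df <- df[!duplicated(df$playerid), ]
```

duplicated() flags the second and later occurrences only, so the first row for each playerid is the one that survives. Passing fromLast = TRUE would keep the last occurrence instead.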

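You could also use dplyr. Note that unique() has no group_by argument, so unique(df, group_by="playerid") compares whole rows rather than playerid alone; distinct() with .keep_all = TRUE does what is intended:
library(dplyr)
distinct(df, playerid, .keep_all = TRUE)
# draft_year draft_round teamid playerid draft_from
#1 1961 1 Bos Pol1 Nan
#2 2001 1 LA Ben2 Cal
#3 1967 2 Min Mac2 Nan
#4 2000 1 C Sio1 Bud
#5 2000 1 C Gio1 Bud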
Or
df %>%
group_by(playerid) %>%
filter(row_number()==1)

Related

data.table mapping based on another data.table [duplicate]

I have two datasets (read from .xlsx files), DT1 and DT2. I want to create a new column newcol in DT2 by looking up each value of its code column in DT1.
I know this is ambiguous, so I explain more here.
First, here are my two datasets:
DT1
code type
AH1 AM
AS5 AM
NMR AM
TOS AM
IP AD
CC ADCE
CA Wa
DT2
code year month
AH1 2011 2
AH1 2011 5
AS5 2012 7
AS5 2012 6
AS5 2013 3
CC 2014 6
CA 2016 11
Second, the columns year and month in DT2 are unimportant for this question. We don't need to consider them.
Third, the result I want is:
DT2
code year month newcol
AH1 2011 2 AM
AH1 2011 5 AM
AS5 2012 7 AM
AS5 2012 6 AM
AS5 2013 3 AM
CC 2014 6 ADCE
CA 2016 11 Wa
newcol in DT2 is created based on data DT1.
I saw a syntax like DT2[DT1, ...] that solves this, but I have forgotten it. Any help?
Data
DT1 <- " code type
1: AH1 AM
2: AS5 AM
3: NMR AM
4: TOS AM
5: IP AD
6: CC ADCE
7: CA Wa
"
DT1 <- read.table(text=DT1, header = T)
DT1 <- as.data.table(DT1)
DT2 <- "code year month
1: AH1 2011 2
2: AH1 2011 5
3: AS5 2012 7
4: AS5 2012 6
5: AS5 2013 3
6: CC 2014 6
7: CA 2016 11
"
DT2 <- read.table(text=DT2, header =T)
DT2 <- as.data.table(DT2)
P.S. In Excel, the VLOOKUP function solves this:
# Take first obs. as an example.
DT2
code year month
AH1 2011 2
# newcol is column D. So in D2, we type:
=VLOOKUP(TRIM(A2), 'DT1'!$A$2:$B$8, 2, FALSE)
UPDATE based on a comment under @akrun's answer.
My original DT1 has 86 obs. and DT2 has 451125 obs. When I use @akrun's answer, DT2 shrinks to 192409 rows, which is strange. DT2$code doesn't contain any NA, so I don't know why.
length(unique(DT1$code1))
[1] 86
length(unique(DT2$code))
[1] 39
table(DT1$code1)
AHI AHI002 AHI004 AHI005 AHS002 AHS003 AHS004 AHS005 AMR AMR002 AMR003 AMRHI3 CARD CCRU HPA01 HWPA1 HWPA1T IOA IOA01
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
IOA01T IPA010 IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5 IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 PLFI REI SPA SPA001 SPA3 TADS TADS2 TAHI TAHI2 TAHS
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TAHS2 TAMB TAMB2 TAMD TAMD2 TAMR TAMR2 TBURN TBURN2 TCCR TFPS TFS TFS2 THE THIBN THIBN2 TICU TICU2 TIPA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
TIPA2 TIPAK TIPAK2 TNCC TOS TOS2 TSAO TSAO2 TSPA WED
1 1 1 1 1 1 1 1 1 1
table(DT2$code)
AHI002 AHI005 AHS002 AHS005 AMR AMR003 Card HPA01 HWPA1 HWPA1T IOA01 IOA01T IPA011 IPA012 IPA013 IPA014 IPACC3 IPACC4 IPACC5
19408 12215 34184 12226 19408 12215 19408 7344 9198 405 9198 405 12215 5137 1148 2853 31703 9198 7878
IPACC6 IPAR IPAR2 IPARK2 IPARKI NAHI NAHI2 NAMR NAMR2 NCC2 NCC5 NCC5T NNAHI NNAHI2 NNAMR NNAMR2 PL PL2 SPA
9668 41909 9643 2362 2967 10018 3589 10018 3589 7878 2845 536 14776 8104 14754 8118 18624 8302 40856
SPA3
6823
We can do this with join from data.table
library(data.table)
DT2[DT1, on = .(code), nomatch = 0]
# code year month type
#1: AH1 2011 2 AM
#2: AH1 2011 5 AM
#3: AS5 2012 7 AM
#4: AS5 2012 6 AM
#5: AS5 2013 3 AM
#6: CC 2014 6 ADCE
#7: CA 2016 11 Wa
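One possible reason for rows disappearing, as described in the update to the question, is that nomatch = 0 makes this an inner join: any DT2 row whose code has no exact match in DT1 is dropped, and matching is case-sensitive and whitespace-sensitive (for instance "Card" would not match "CARD"). The sketch below uses an update join instead, which keeps every row of DT2, plus an anti-join to list the codes that fail to match. The code "XX9" is a made-up value added only to illustrate a non-match.

```r
library(data.table)

DT1 <- data.table(code = c("AH1", "AS5", "NMR", "TOS", "IP", "CC", "CA"),
                  type = c("AM", "AM", "AM", "AM", "AD", "ADCE", "Wa"))

# Same rows as the question's DT2, plus one hypothetical code ("XX9")
# that does not exist in DT1
DT2 <- data.table(code  = c("AH1", "AH1", "AS5", "AS5", "AS5", "CC", "CA", "XX9"),
                  year  = c(2011, 2011, 2012, 2012, 2013, 2014, 2016, 2016),
                  month = c(2, 5, 7, 6, 3, 6, 11, 1))

# Update join: adds newcol to DT2 by reference; all DT2 rows are kept,
# and codes without a match in DT1 get NA
DT2[DT1, on = "code", newcol := i.type]

# Anti-join: DT2 rows whose code has no counterpart in DT1
unmatched <- DT2[!DT1, on = "code"]
```

Inspecting unmatched (or unique(unmatched$code)) before joining shows which codes need cleaning, for example with toupper() or trimws(), if that turns out to be the cause.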
You can use merge in base R:
DT2 <- (merge(DT1, DT2, by = 'code'))
Note: merge also sorts the result by the 'code' column.
You can also use plyr package:
DT2 <- plyr::join(DT2, DT1, by = "code")
Since you are interested in the data.table package, you can also use a keyed join:
library(data.table)
DT2 <- data.table(DT2, key='code')
DT1 <- data.table(DT1, key='code')
DT2[DT1]
Or qdap package:
DT2$type <- qdap::lookup(DT2$code, DT1)

R: conditional aggregate based on factor level and year

I have a dataset in R which I am trying to aggregate by column level and year which looks like this:
City State Year Status Year_repealed PolicyNo
Pitt PA 2001 InForce 6
Phil. PA 2001 Repealed 2004 9
Pitt PA 2002 InForce 7
Pitt PA 2005 InForce 2
For each Year, I would like to aggregate PolicyNo across each state, taking into account the year in which a policy was repealed. The result I would then get is:
Year State PolicyNo
2001 PA 15
2002 PA 22
2003 PA 22
2004 PA 12
2005 PA 14
I am not sure how to go about splitting and aggregating the data conditional on the repeal date, and was wondering if there is an easy way to achieve this in R.
It may help you to break this up into two distinct problems:
1. Get a table that shows the change in PolicyNo in every city-state-year.
2. Summarize that table to show the PolicyNo in each state-year.
To accomplish (1), we add the missing years with NA PolicyNo and add repeals as negative PolicyNo observations.
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
filter(!is.na(Year_repealed)) %>%
mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
Now the table shows every city-state-year and incorporates repeals. This is a table we can summarize.
df = df %>%
group_by(Year, State) %>%
summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
That gets us PolicyNo change in each state-year. A cumulative sum over the changes gets us levels.
df = df %>%
ungroup() %>%
mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
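The cumulative sum above runs over all rows, which works here because the example contains a single state. With several states, the running total would need to restart for each one. A minimal sketch, assuming a table of annual changes shaped like the one above; the NY values are invented purely for illustration:

```r
library(dplyr)

# Annual changes per state-year: PA values are taken from the output above,
# NY values are hypothetical
changes <- data.frame(
  Year = rep(2001:2005, times = 2),
  State = rep(c("PA", "NY"), each = 5),
  annual_change = c(15, 7, 0, -9, 2, 3, 0, 4, -1, 0)
)

levels_by_state <- changes %>%
  arrange(State, Year) %>%      # cumsum depends on row order within each state
  group_by(State) %>%
  mutate(PolicyNo = cumsum(annual_change)) %>%
  ungroup()
```

Sorting by Year inside each State before cumsum() matters, because the running total follows row order rather than the values of Year.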
With the data.table package you could do it as follows:
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
The result of the above code:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
As you can see, several steps are needed to reach the desired end result:
First, convert to a data.table (setDT(dat)), reshape it into long format, and remove the rows with no Year.
Then make PolicyNo negative for the rows that come from 'Year_repealed'.
With a cross join (CJ), make sure that all the years for each state are present, and convert the NA values in the PolicyNo column to zero.
Finally, summarise by year and state and take a cumulative sum of the result.

How can I aggregate data.table in quarterly frequency?

My data is available at monthly frequency and I'm trying to aggregate it to quarterly frequency. I'm working with data.table, a package I don't understand very well, to be honest.
X.DATA_BASE NOME_INSTITUICAO SALDO.x SALDO.y
1: 199407 ASB S/A - CFI 1694581 1124580
2: 199407 BANCO ARAUCARIA S.A. 40079517 6314782
3: 199407 BANCO ATLANTIS S.A. 200463907 9356445
4: 199407 BANCO BANKPAR 1078342 5770046
5: 199407 BANCO BBI 97812975 31112289
Each date is given by X.DATA_BASE, e.g. 199407 = July 1994. For each date I have several institutions with SALDO.x and SALDO.y values. I want to sum SALDO.x and SALDO.y for each institution in each quarter. One problem is that some institutions enter and leave the data over time. In the end, I want a table with the same columns but at quarterly frequency.
How could I do that?
Here's an example of how to group and sum by quarter (with thanks to @eddi for his suggested improvement). First let's create some fake data:
library(data.table)
set.seed(1485)
dat = data.table(date=rep(c(199401:199412,199501:199512),2),
firm=rep(c("A","B"), each=24),
value1=rnorm(48,1000,10),
value2=rnorm(48,2000,100))
dat
date firm value1 value2
1: 199401 A 1009.8620 2054.251
2: 199402 A 1009.7180 2124.202
3: 199403 A 1014.3421 1919.251
...
46: 199510 B 992.9961 2079.517
47: 199511 B 997.9147 1968.676
48: 199512 B 1002.5993 2006.231
Now, summarize by firm, year, and quarter. To do this, we create year and quarter grouping variables from date (we use integer division (%/%) to create the years and mod (%%) plus integer division to create the quarters), and calculate the sum of value1 and value2 for each sub-group. This all assumes date is numeric. If you have it stored as character or factor, convert to numeric first:
dat.summary = dat[ , list(valueByQuarter = sum(value1) + sum(value2)),
by=list(firm,
year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1)]
dat.summary
firm year quarter valueByQuarter
1: A 1994 1 9131.626
2: A 1994 2 8953.116
3: A 1994 3 8981.407
4: A 1994 4 9175.959
5: A 1995 1 9003.225
6: A 1995 2 8962.690
7: A 1995 3 8809.256
8: A 1995 4 8885.264
9: B 1994 1 9000.791
10: B 1994 2 8936.356
11: B 1994 3 8905.789
12: B 1994 4 8951.369
13: B 1995 1 8922.716
14: B 1995 2 9097.134
15: B 1995 3 8724.188
16: B 1995 4 9047.934
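The question asks for the same columns at quarterly frequency, which suggests summing each value column separately rather than adding them together. A minimal sketch of that variant, using the same grouping expressions with lapply(.SD, sum) and .SDcols; the small table below is hypothetical and only mirrors the shape of the data:

```r
library(data.table)

# Hypothetical monthly data: two firms, six months each
dat <- data.table(date   = rep(199401:199406, 2),
                  firm   = rep(c("A", "B"), each = 6),
                  value1 = 1:12,
                  value2 = (1:12) * 10)

# Sum every column named in .SDcols within each firm-year-quarter
by_quarter <- dat[, lapply(.SD, sum),
                  by = .(firm,
                         year    = date %/% 100,
                         quarter = (date %% 100 - 1) %/% 3 + 1),
                  .SDcols = c("value1", "value2")]
```

Firms that are missing in some months simply contribute fewer rows to a quarter, so institutions entering or leaving the data do not need special handling for the sum itself.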
For dplyr fans, here's a dplyr approach:
library(dplyr)
dat %>%
group_by(firm, year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1) %>%
summarise(valueByQuarter = sum(value1 + value2))

Create count per item by year/decade

I have data in a data.table that is as follows:
> x<-df[sample(nrow(df), 10),]
> x
> Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and as Exporter. So, in the above example, the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried with aggregate and data.table methods as suggested by the post here, but both of them seem to just give me counts of the number of Importers/Exporters per year (or per decade, which I am more interested in).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/ exporter? It does not matter if the summary for importer and exporter are in different tables.
We can do this with data.table methods. Create the 'Decade' column by assignment (:=), melt the data from 'wide' to 'long' format by specifying the measure columns, then reshape it back to 'wide' with dcast, using length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
Decade + Country~variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
I think this will work with aggregate in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
year = as.numeric(format(my.data$my.Date, format = "%Y")),
month = as.numeric(format(my.data$my.Date, format = "%m")),
day = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1

R aggregating on date then character

I have a table that looks like the following:
Year Country Variable 1 Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 2 5
1971 UK 2 3
1971 UK 1 5
1971 USA 2 2
1972 USA 1 1
1972 USA 2 5
I'd be grateful if someone could tell me how I can aggregate the data to group it first by year, then country with the sum of variable 1 and variable 2 coming afterwards so the output would be:
Year Country Sum Variable 1 Sum Variable 2
1970 UK 1 3
1970 USA 1 3
1971 UK 5 13
1971 USA 2 2
1972 USA 3 6
This is the code I've tried to no avail (the real dataframe is 125,000 rows by 30+ columns hence the subset. Please be kind, I'm new to R!)
#making subset from data
GT2 <- subset(GT1, select = c("iyear", "country_txt", "V1", "V2"))
#making sure data types are correct
GT2[,2]=as.character(GT2[,2])
GT2[,3] <- as.numeric(as.character( GT2[,3] ))
GT2[,4] <- as.numeric(as.character( GT2[,4] ))
#removing NA values
GT2Omit <- na.omit(GT2)
#trying to aggregate - i.e. group by year, then country with the sum of Variable 1 and Variable 2 being shown
aggGT2 <-aggregate(GT2Omit, by=list(GT2Omit$iyear, GT2Omit$country_txt), FUN=sum, na.rm=TRUE)
Your aggregate is almost correct:
> aggGT2 <-aggregate(GT2Omit[3:4], by=GT2Omit[c("country_txt", "iyear")], FUN=sum, na.rm=TRUE)
> aggGT2
country_txt iyear V1 V2
1 UK 1970 1 3
2 USA 1970 1 3
3 UK 1971 5 13
4 USA 1971 2 2
5 USA 1972 3 6
dplyr is almost always the answer nowadays.
library(dplyr)
aggGT1 <- GT1 %>% group_by(iyear, country_txt) %>% summarize(sv1=sum(V1), sv2=sum(V2))
Having said that, it is good to learn basic R functions like aggregate and by.
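As one more base R option, aggregate also has a formula interface that reads close to the wording of the question. This is a sketch on a small data frame rebuilt from the example table, using the question's column names iyear, country_txt, V1 and V2:

```r
# Example data rebuilt from the table in the question
GT2 <- data.frame(
  iyear       = c(1970, 1970, 1971, 1971, 1971, 1971, 1972, 1972),
  country_txt = c("UK", "USA", "UK", "UK", "UK", "USA", "USA", "USA"),
  V1          = c(1, 1, 2, 2, 1, 2, 1, 2),
  V2          = c(3, 3, 5, 3, 5, 2, 1, 5)
)

# Sum V1 and V2 within each year-country combination
agg <- aggregate(cbind(V1, V2) ~ iyear + country_txt, data = GT2, FUN = sum)

# Order by year, then country, to match the layout asked for
agg <- agg[order(agg$iyear, agg$country_txt), ]
```

The formula method drops rows with NA in any of the variables used by default (na.action = na.omit), so a separate na.omit() step is not needed for those columns.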
