R, subset two data frames with different rows

df1 <-
Year Month
2011 08
2011 08
2011 09
2011 10
2012 11
2012 11
df2 <-
Year Month
2001 02
2011 08
2011 10
2013 01
2012 11
My goal is to build a data frame containing only the (Year, Month) pairs that are common to both data sets.
goal <-
Year Month
2011 10
2011 08
2012 11
Can anyone please help?

You can merge() the two then find the unique rows.
unique(merge(df1, df2))
# Year Month
# 1 2011 10
# 2 2011 8
# 4 2012 11
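A minimal sketch of why this works, assuming the two data frames are built as below (the data.frame() construction is an assumption; the question only shows printed tables). With no by argument, merge() joins on all columns the inputs share and keeps only matching rows, so the repeated rows in df1 yield repeated matches that unique() then collapses.

```r
# Hypothetical reconstruction of the question's data
df1 <- data.frame(Year  = c(2011, 2011, 2011, 2011, 2012, 2012),
                  Month = c(8, 8, 9, 10, 11, 11))
df2 <- data.frame(Year  = c(2001, 2011, 2011, 2013, 2012),
                  Month = c(2, 8, 10, 1, 11))

# merge() defaults to by = intersect(names(df1), names(df2)): an inner join
both <- merge(df1, df2)

# Duplicated (Year, Month) pairs in df1 survive the join, so drop them
common <- unique(both)
print(common)
```

The row names of the result keep gaps (as in the output above) because unique() drops rows without renumbering them.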

If you load dplyr, you can take the intersection
library(dplyr)
intersect(df1,df2)
# Year Month
# 1 2011 8
# 2 2011 10
# 3 2012 11
which I find intuitive.

Related

How to find maximum value from dataframe with specific condition? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
I have a dataframe named employee with 100 rows like this :
Date Name ride food income bonus sallary
1 01 Jan 2020 Ludociel 10 6 330000 0 330000
2 01 Jan 2020 Estarossa 15 8 465000 100000 565000
3 01 Jan 2020 Tarmiel 8 10 420000 100000 520000
4 01 Jan 2020 Sariel 5 8 315000 0 315000
5 01 Jan 2020 Escanor 15 7 435000 100000 535000
6 01 Jan 2020 Ban 13 9 465000 100000 565000
7 01 Jan 2020 Meliodas 6 15 540000 100000 640000
8 01 Jan 2020 King 15 12 585000 100000 685000
9 01 Jan 2020 Zeldris 15 11 555000 100000 655000
10 01 Jan 2020 Rugal 15 6 405000 100000 505000
11 02 Jan 2020 Ludociel 14 6 390000 100000 490000
12 02 Jan 2020 Estarossa 12 14 600000 100000 700000
...
100 10 Jan 2020 Rugal 13 10 495000 100000 595000
The problem is that I want to find which employee has the highest total salary (the sallary column) from 1 Jan to 10 Jan. My expected output is just a vector like this:
[1] "varName" is the highest with total sallary "varTotal_sallary"
I have tried a for loop with an if clause, but it only returns the total for one name, and I would need a separate function for every name.
function_ludociel<-function(name, date, sallary){
total=integer()
for(i in 100){
if(date[i]=="01 Jan 2020" & name[i]=="Ludociel"){
total=sum(sallary)
}
}
return(total)
}
ludociel=function_ludociel(employee$name,employee$date,employee$sallary)
After that I planned to combine the results in one variable and use max(), but I know that is a silly way to code it.
Does anyone have a solution for this? Thank you very much.
1. Convert the Date column to an actual Date class.
2. Use aggregate to calculate the total salary per name from 1st Jan to 10th Jan.
3. Select the row with the maximum total salary.
4. Print the result.
employee$Date <- as.Date(employee$Date, '%d %b %Y')
sub_data <- aggregate(sallary~Name, employee,
subset = Date >= as.Date('2020-01-01') &
Date <= as.Date('2020-01-10'), sum)
max_data <- sub_data[which.max(sub_data$sallary), ]
sprintf('%s has the highest salary %d', max_data$Name, max_data$sallary)
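The steps above can be tried on a small stand-in data set. This is only a sketch: the four rows below are illustrative, not the asker's full 100-row table, and the dates are written in ISO format (an assumption) so that parsing does not depend on the locale's month abbreviations. The names in_range, totals, top and msg are hypothetical helpers.

```r
# Hypothetical miniature of the 'employee' data (illustrative values only)
employee <- data.frame(
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  Name    = c("Ludociel", "King", "Ludociel", "King"),
  sallary = c(330000, 685000, 490000, 600000),
  stringsAsFactors = FALSE
)
employee$Date <- as.Date(employee$Date)

# Keep only rows inside the requested date window
in_range <- employee$Date >= as.Date("2020-01-01") &
            employee$Date <= as.Date("2020-01-10")

# One total per employee, then pick the largest
totals <- aggregate(sallary ~ Name, data = employee[in_range, ], FUN = sum)
top    <- totals[which.max(totals$sallary), ]

# %.0f prints a whole number whether the sum is stored as integer or double
msg <- sprintf("%s is the highest with total sallary %.0f", top$Name, top$sallary)
print(msg)
```

Note that which.max() returns only the first maximum, so ties between employees would need separate handling.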

Extracting unique records in R?

I tried "unique" and "duplicated" but cannot get R to do what I want, which is basically to compare two sets of data and find out who in the first data set is not in the second data set. data1 contains a customer ID, name and the year that person bought X. data2 contains a customer ID and year (2017) indicating they purchased X this year. What I want to do is extract a list of people from data1 who have NOT purchased X this year, so I can contact them and tell them to buy X again.
> data1
ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014
> data2
ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017
Merging data1 and data2 by ID ( merge(data1, data2, by = "ID") ) gives me:
> merged_d1d2
ID NAME YEAR.x YEAR.y
1 5 Fred 2014 2017
2 7 Sara 2015 2017
3 10 Bill 2014 2017
4 11 Doug 2016 2017
5 15 Matt 2014 2017
...But I want everyone EXCEPT these people! I also added the names into data2 and then combined data1 and data2 using rbind which gives me a data set with duplicates (e.g. 2 Fred, 2 Sara, 2 Bill, etc.) I then tried to use "unique" and "duplicated" but these always leave one of those duplicates (1 Fred, 1 Sara) in the new data. I want everyone from data1 except those people. I have a feeling this is a simple process, but any help would be greatly appreciated.
Simply:
data1[!data1$ID%in%data2$ID,]
ID NAME YEAR
1 8 Ann 2016
4 12 Emma 2015
6 9 Julie 2014
7 13 Karl 2016
9 14 Rhett 2014
11 4 Tom 2014
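To see what the one-liner does, it may help to look at the intermediate logical vector. The sketch below uses a cut-down, hypothetical version of the two data sets; bought and to_contact are names introduced here for illustration.

```r
# Hypothetical miniature of the two data sets
data1 <- data.frame(ID   = c(8, 10, 11, 12),
                    NAME = c("Ann", "Bill", "Doug", "Emma"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = c(29, 10, 11))

# TRUE where a data1 customer also appears in data2
bought <- data1$ID %in% data2$ID

# Negate the mask to keep only customers who are absent from data2
to_contact <- data1[!bought, ]
print(to_contact)
```

Because the rows are selected by a mask, they keep the order and row names they had in data1.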
Or you could try anti_join by ID from dplyr:
data1 <- read.table(text="ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014",header=TRUE, stringsAsFactors=FALSE)
data2 <- read.table(text="ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
anti_join(data1,data2,by="ID")
ID NAME YEAR
1 4 Tom 2014
2 8 Ann 2016
3 9 Julie 2014
4 12 Emma 2015
5 13 Karl 2016
6 14 Rhett 2014

Create groups based on time period

How can I create a new grouping variable for my data based on 5-year steps?
So from this:
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat <- data.frame(group, year)
group year
1 A 2008
2 A 2009
3 A 2010
4 A 2011
5 A 2012
6 A 2013
7 A 2014
8 B 2005
9 B 2006
10 B 2007
11 B 2008
12 B 2009
13 B 2010
14 B 2011
15 B 2012
16 B 2013
17 B 2014
To this:
> dat
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
7 A 2014 2010_2014
8 B 2005 2005_2009
9 B 2006 2005_2009
10 B 2007 2005_2009
11 B 2008 2005_2009
12 B 2009 2005_2009
13 B 2010 2010_2014
14 B 2011 2010_2014
15 B 2012 2010_2014
16 B 2013 2010_2014
17 B 2014 2010_2014
I guess I could use cut(dat$year, breaks = ??) but I don't know how to set the breaks.
Here is one way of doing it:
start <- floor(dat$year/5)*5
dat$period <- paste(start, start + 4, sep = "_")
I guess the trick here is to round each year down to the nearest multiple of x (here 5) with floor(year/x)*x.
Here is a version that should work generally:
x <- 5
yearstart <- 2000
start <- floor((dat$year - yearstart)/x)*x + yearstart
dat$period <- paste(start, start + x - 1, sep = "_")
You can use yearstart to make sure that, e.g., the year 2000 is the first in a group even when 2000 is not a multiple of x.
cut should do the job if you create actual Date objects from your 'year' column.
## convert 'year' column to dates
yrs <- paste0(dat$year, "-01-01")
yrs <- as.Date(yrs)
## create cuts of 5 years and add them to data.frame
dat$period <- cut(yrs, "5 years")
## create desired factor levels
library(lubridate)
lvl <- as.Date(levels(dat$period))
lvl <- paste(year(lvl), year(lvl) + 4, sep = "_")
levels(dat$period) <- lvl
head(dat)
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
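The asker's original idea of cut() with explicit breaks can also work on the numeric years directly. This is a sketch under the assumption that the periods should start at 2005 and that 2015 is a suitable upper bound for the sample data; breaks and labels are names introduced here.

```r
# Data from the question
dat <- data.frame(group = c(rep("A", 7), rep("B", 10)),
                  year  = c(2008:2014, 2005:2014))

# Left edges of each 5-year bin plus one closing edge (assumed range)
breaks <- seq(2005, 2015, by = 5)

# Build labels such as "2005_2009" from every edge except the last
labels <- paste(head(breaks, -1), head(breaks, -1) + 4, sep = "_")

# right = FALSE makes the bins [2005, 2010) and [2010, 2015)
dat$period <- cut(dat$year, breaks = breaks, labels = labels, right = FALSE)
head(dat)
```

Years outside the range of breaks would become NA, so the sequence needs to cover the full span of the data.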

Merge 2 resulting vectors into 1 data frame using R

I have a df like this
Month <- c('JAN','JAN','JAN','JAN','FEB','FEB','MAR','APR','MAY','MAY')
Category <- c('A','A','B','C','A','E','B','D','E','F')
Year <- c(2014,2015,2015,2015,2014,2013,2015,2014,2015,2013)
Number_Combinations <- c(3,2,3,4,1,3,6,5,1,1)
df <- data.frame(Month ,Category,Year,Number_Combinations)
df
Month Category Year Number_Combinations
1 JAN A 2014 3
2 JAN A 2015 2
3 JAN B 2015 3
4 JAN C 2015 4
5 FEB A 2014 1
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
9 MAY E 2015 1
10 MAY F 2013 1
I have another df that I got from the above dataframe with a condition
df1 <- subset(df,Number_Combinations > 2)
df1
Month Category Year Number_Combinations
1 JAN A 2014 3
3 JAN B 2015 3
4 JAN C 2015 4
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
Now I want to create a table reporting the month, the total number of rows for that month in df, and the total number of rows for that month in df1.
Desired Output would be
Month Number_Month_df Number_Month_df1
1 JAN 4 3
2 FEB 2 1
3 MAR 1 1
4 APR 1 1
5 MAY 2 0
I used table(df) and table(df1) and tried merging the results, but I am not getting the desired output. Could someone please help me get the above data frame?
We get the table of the 'Month' column from both 'df' and 'df1', convert to 'data.frame' (as.data.frame), merge by the 'Var1', and change the column names accordingly.
res <- merge(as.data.frame(table(df$Month)),
as.data.frame(table(df1$Month)), by='Var1')
colnames(res) <- c('Month', 'Number_Month_df', 'Number_Month_df1')
res <- data.frame(Number_Month_df=sort(table(df$Month),T),
Number_Month_df1=sort(table(df1$Month),T))
res$Month <- rownames(res)
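One thing to watch: since R 4.0.0, data.frame() keeps strings as character by default, so table(df1$Month) has no entry for a month that is absent from df1 (MAY here), and a plain merge() would drop that row. A possible workaround, sketched below, is to tabulate both columns against the same set of levels; lv is a name introduced here for illustration.

```r
# Data from the question
Month <- c('JAN','JAN','JAN','JAN','FEB','FEB','MAR','APR','MAY','MAY')
Category <- c('A','A','B','C','A','E','B','D','E','F')
Year <- c(2014,2015,2015,2015,2014,2013,2015,2014,2015,2013)
Number_Combinations <- c(3,2,3,4,1,3,6,5,1,1)
df  <- data.frame(Month, Category, Year, Number_Combinations)
df1 <- subset(df, Number_Combinations > 2)

# Months in order of first appearance; shared by both tabulations
lv <- as.character(unique(df$Month))

# Counting against fixed levels yields a 0 for months missing from df1
res <- data.frame(
  Month            = lv,
  Number_Month_df  = as.vector(table(factor(df$Month,  levels = lv))),
  Number_Month_df1 = as.vector(table(factor(df1$Month, levels = lv)))
)
print(res)
```

This also keeps the months in their original order instead of the alphabetical order that merge() sorts by.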

Increase efficiency of dplyr summarising

I am trying to sort and make a new table from a large data set (>60k; NDw) with a sample here
Season ENo HNo Month Day Year Group
638447 2011 A903851 1881023 10 6 2011 Ducks
589219 2010 C409324 3648019 10 8 2010 Ducks
137451 2006 M576033 2506116 10 13 2006 Ducks
883040 2013 P886755 43313010 10 17 2013 Ducks
851378 2013 C700399 36413199 11 5 2013 Geese
552791 2010 M902312 2508141 11 16 2010 Ducks
152368 2006 M599973 2496101 10 3 2006 Ducks
395393 2008 C412049 3646096 10 28 2008 Ducks
857709 2013 C671619 36413012 9 15 2013 Ducks
67354 2005 C349762 3643011 10 22 2005 Geese
67126 2005 C427496 3643037 11 25 2005 Geese
62260 2005 C349776 3643023 10 7 2005 Ducks
847364 2013 C570491 36411001 10 5 2013 Ducks
447414 2009 A686943 1808206 11 3 2009 Geese
474743 2009 M813353 2509214 10 24 2009 Ducks
439477 2009 A746048 1639142 10 26 2009 Ducks
781218 2012 P792862 4142177 11 27 2012 Geese
806946 2013 M052893 20712036 11 5 2013 Ducks
174932 2006 C450351 3645098 12 5 2006 Geese
828816 2013 M054683 25012010 9 30 2013 Ducks
I want to group by Season and HNo and get a number of new variables. These include how many groups each Season/HNo is in, a count of rows total, in each group, and each group during each month. The result would look like this, but with all months.
Season HNo groupN total.envelopes ducks geese Octducks
1 2005 1253041 1 2 2 0 2
2 2005 1254026 1 5 5 0 5
3 2005 1254063 2 26 23 3 0
4 2005 1254115 2 14 10 4 10
5 2005 1274023 2 39 28 11 28
I have code that works but it runs slow and I feel like there should be a better way to code this block. Maybe I'm wrong, and it's not a large issue, just wanted to learn how to make my code more efficient. Here is what I use to get the above output.
NDw1 = NDw %>%
group_by(Season,HNo) %>%
summarise(groupN = n_distinct(Group),
total.envelopes=n(),
ducks = length(ENo[Group %in% 'Ducks']),
geese = length(ENo[Group %in% 'Geese']),
Octducks = length(ENo[Group=='Ducks' & Month == 10]))
The entire code has lines for Aug-Jan ducks and geese. I tried to use count rather than length, but it didn't work with a factor variable such as ENo. Any thoughts would be appreciated. Thanks for your time and help.
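One idea that might help, offered as a sketch rather than a benchmarked fix: each length(ENo[condition]) first builds a subset of ENo and then measures it, whereas sum(condition) counts the TRUE values directly and never touches the factor. The tiny data frame below is a made-up stand-in for NDw, and is_oct_duck, n_by_length and n_by_sum are names introduced for illustration.

```r
# Hypothetical five-row stand-in for NDw (values are made up)
NDw <- data.frame(
  ENo   = factor(c("A1", "C2", "M3", "P4", "C5")),
  Month = c(10, 10, 11, 10, 9),
  Group = c("Ducks", "Geese", "Ducks", "Ducks", "Ducks"),
  stringsAsFactors = FALSE
)

is_oct_duck <- NDw$Group == "Ducks" & NDw$Month == 10

# Approach from the question: subset ENo, then measure the length
n_by_length <- length(NDw$ENo[is_oct_duck])

# Alternative: TRUE counts as 1, so summing the condition gives the same count
n_by_sum <- sum(is_oct_duck)

print(c(n_by_length, n_by_sum))
```

Inside summarise() the same pattern would read, for example, Octducks = sum(Group == 'Ducks' & Month == 10). Whether it is noticeably faster on the full data set is untested here.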
