Extracting unique records in R?

I tried "unique" and "duplicated" but cannot get R to do what I want, which is basically to compare two sets of data and find out who in the first data set is not in the second data set. data1 contains a customer ID, name and the year that person bought X. data2 contains a customer ID and year (2017) indicating they purchased X this year. What I want to do is extract a list of people from data1 who have NOT purchased X this year, so I can contact them and tell them to buy X again.
> data1
ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014
> data2
ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017
Merging data1 and data2 by ID ( merge(data1, data2, by = "ID") ) gives me:
> merged_d1d2
ID NAME YEAR.x YEAR.y
1 5 Fred 2014 2017
2 7 Sara 2015 2017
3 10 Bill 2014 2017
4 11 Doug 2016 2017
5 15 Matt 2014 2017
...But I want everyone EXCEPT these people! I also added the names into data2 and then combined data1 and data2 using rbind which gives me a data set with duplicates (e.g. 2 Fred, 2 Sara, 2 Bill, etc.) I then tried to use "unique" and "duplicated" but these always leave one of those duplicates (1 Fred, 1 Sara) in the new data. I want everyone from data1 except those people. I have a feeling this is a simple process, but any help would be greatly appreciated.

Simply:
data1[!data1$ID %in% data2$ID, ]
ID NAME YEAR
1 8 Ann 2016
4 12 Emma 2015
6 9 Julie 2014
7 13 Karl 2016
9 14 Rhett 2014
11 4 Tom 2014
Or you could try anti_join by ID from dplyr:
data1 <- read.table(text="ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014",header=TRUE, stringsAsFactors=FALSE)
data2 <- read.table(text="ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
anti_join(data1,data2,by="ID")
ID NAME YEAR
1 4 Tom 2014
2 8 Ann 2016
3 9 Julie 2014
4 12 Emma 2015
5 13 Karl 2016
6 14 Rhett 2014
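
The rbind attempt described in the question can also be made to work, which may help explain why "unique" and "duplicated" seemed to fail: duplicated() only flags the second and later occurrences of a value, so the first copy always survives. Flagging from both directions marks every copy. The following is a minimal sketch, assuming IDs are unique within each data set; the names ids, in_both and not_this_year are made up for illustration.

```r
data1 <- data.frame(
  ID   = c(8, 10, 11, 12, 5, 9, 13, 15, 14, 7, 4),
  NAME = c("Ann", "Bill", "Doug", "Emma", "Fred", "Julie",
           "Karl", "Matt", "Rhett", "Sara", "Tom"),
  YEAR = c(2016, 2014, 2016, 2015, 2014, 2014, 2016, 2014, 2014, 2015, 2014),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ID   = c(29, 32, 10, 21, 11, 5, 28, 33, 24, 22, 31, 15,
           25, 30, 26, 7, 23, 27),
  YEAR = 2017
)

# Stack the IDs of both data sets, data1 first.
ids <- c(data1$ID, data2$ID)

# duplicated() marks later copies; fromLast = TRUE marks earlier copies.
# Combining both marks every ID that appears in data1 and data2.
in_both <- duplicated(ids) | duplicated(ids, fromLast = TRUE)

# Keep the data1 rows (the first nrow(data1) entries) that were not flagged.
not_this_year <- data1[!in_both[seq_len(nrow(data1))], ]
print(not_this_year)
```

This should select the same six customers as the %in% approach above, which remains the shorter and clearer option.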

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the kronecker product with a vector of 12 ones and then convert back to a data.frame. The as.data.frame can be omitted if a matrix result is ok.
as.data.frame(as.matrix(df) %x% rep(1, 12))
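
One caveat, as far as I can tell: %x% does not carry the column names over by default, so the result above comes back with generic names (V1, V2, V3) rather than year, xxx and yyy. Below is a small sketch of two ways to keep the names. The object names monthly and monthly_k are made up for illustration.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Option 1: repeat each row index 12 times; column names are kept as-is.
monthly <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(monthly) <- NULL  # reset row names to 1..36

# Option 2: the Kronecker product from the answer, with names restored.
monthly_k <- setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))

print(head(monthly, 13))
```

Row indexing also works when the data frame has non-numeric columns, which the matrix route does not handle.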

How to filter based on equality between several columns

I have a data frame similar to the one below.
I want to filter it so that, for each product, I keep only the rows where month >= start_month when year == start_year, while retaining all months for the years after that.
product year month start_year start_month
A 2015 1 2015 10
A 2015 2 2015 10
A 2015 3 2015 10
A 2015 4 2015 10
A 2015 5 2015 10
A 2015 6 2015 10
A 2015 7 2015 10
A 2015 8 2015 10
A 2015 9 2015 10
A 2015 10 2015 10
A 2015 11 2015 10
A 2015 12 2015 10
A 2016 1 2015 10
A 2016 2 2015 10
A 2016 3 2015 10
A 2016 4 2015 10
...
B 2015 1 2015 11
B 2015 2 2015 11
...
I have tried:
df <- df %>%
group_by(product) %>%
filter(month >= start_month)
But this filters months in the following years too.
So the final results should look like this:
product year month start_year start_month
A 2015 10 2015 10
A 2015 11 2015 10
A 2015 12 2015 10
A 2016 1 2015 10
A 2016 2 2015 10
A 2016 3 2015 10
A 2016 4 2015 10
...
A 2018 12 2015 10
...
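
One possible way to express this condition is to keep a row if its year is after start_year, or if it is in start_year and the month has reached start_month. The sketch below uses base R on a small made-up data set (two products, two years) rather than the asker's real data. With dplyr, the same logical expression should work inside filter(), and no grouping is needed because the start columns are already on every row.

```r
# Hypothetical sample data in the same shape as the question's table.
df <- expand.grid(month = 1:12, year = 2015:2016, product = c("A", "B"),
                  stringsAsFactors = FALSE)
df <- df[, c("product", "year", "month")]
df$start_year  <- 2015
df$start_month <- ifelse(df$product == "A", 10, 11)

# Keep all months of later years, and only months >= start_month
# within the start year itself.
keep <- df$year > df$start_year |
  (df$year == df$start_year & df$month >= df$start_month)
kept <- df[keep, ]
print(head(kept))
```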

Replicating table in R with change in one column

I have this table in R :
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
I want to replicate this table 4 times. All values should stay the same except the Month column, which needs to be incremented by 1 with each copy. The final table should look like this:
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
John 8 2017 8 16
Carol 90 2017 8 30
Bug 9 2017 8 1
John 8 2017 9 16
Carol 90 2017 9 30
Bug 9 2017 9 1
John 8 2017 10 16
Carol 90 2017 10 30
Bug 9 2017 10 1
John 8 2017 11 16
Carol 90 2017 11 30
Bug 9 2017 11 1
Please point out how to do this efficiently in R. Many thanks!
If this is your dataframe:
df = read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)
Then this is your dataframe repeated (the original plus 4 copies, as in your expected output):
df2 = df[rep(seq_len(nrow(df)), 5), ]
And this is it again, but with the months incremented by one per copy:
df2$Month = df2$Month + rep(0:4, each = nrow(df))
In the more general case:
m = 4 # <-- number of additional copies desired
df2 = df[rep(seq_len(nrow(df)), m + 1), ]
df2$Month = df2$Month + rep(0:m, each = nrow(df))

replace NA with previous 2 years values

I have 2 data frames. df1 has NA values in p1, which need to be replaced with the mean of Average_f1 from df2 for the same bin over the previous 2 years.
E.g. in df1, row 5 has year 2015 and bin 5, so the NA should be replaced with the mean of the 2013 and 2014 values for bin 5 in df2. For row 7, only 1 previous year's value is available.
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 NA 5
2016 10 3
2017 NA 7
df2
year bin_p1 Average_f1
2013 5 29.5
2014 5 16.5
2015 NA 30
2016 7 12
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
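
A rough base R sketch of one way to do this, assuming "previous 2 years" means the two calendar years before the row's year and that rows in df2 with an NA bin are ignored. The helper name fill_from_previous is made up for illustration.

```r
df1 <- data.frame(year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
                  p1   = c(20, 24, 10, 11, NA, 10, NA),
                  bin  = c(1, 1, 2, 2, 5, 3, 7))
df2 <- data.frame(year       = c(2013, 2014, 2015, 2016),
                  bin_p1     = c(5, 5, NA, 7),
                  Average_f1 = c(29.5, 16.5, 30, 12))

# Mean of Average_f1 in df2 for the same bin over the two preceding years.
# Returns NA if no matching row exists.
fill_from_previous <- function(year, bin) {
  hit <- !is.na(df2$bin_p1) & df2$bin_p1 == bin &
    df2$year >= year - 2 & df2$year < year
  if (any(hit)) mean(df2$Average_f1[hit]) else NA_real_
}

# Only the rows with a missing p1 are recomputed.
missing <- is.na(df1$p1)
df1$p1[missing] <- mapply(fill_from_previous,
                          df1$year[missing], df1$bin[missing])
print(df1)
```

For the sample data this fills row 5 with the mean of 29.5 and 16.5, and row 7 with the single 2016 value for bin 7.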

How do I keep strings in first column with tidyr::gather?

This may be a very basic question about tidyr, which I just started learning, but I don't seem to find an answer after much searching in SO and Google.
Suppose I have a data frame:
mydf<- data.frame(name=c("Joe","Mary","Bob"),
jan=1:3,
feb=4:6,
mar=7:9,
apr=10:12)
which I want to reshape from wide to long. Before, I used melt, so:
library(reshape)
melt(mydf,id.vars = "name",measure.vars = colnames(mydf)[-1])
Which produces
name variable value
1 Joe jan 1
2 Mary jan 2
3 Bob jan 3
4 Joe feb 4
5 Mary feb 5
6 Bob feb 6
7 Joe mar 7
8 Mary mar 8
9 Bob mar 9
10 Joe apr 10
11 Mary apr 11
12 Bob apr 12
I wanted to use tidyr::gather, so I tried
gather(mydf,month,sales,jan:apr)
Which produces
name month sales
1 2 jan 1
2 3 jan 2
3 1 jan 3
4 2 feb 4
5 3 feb 5
6 1 feb 6
7 2 mar 7
8 3 mar 8
9 1 mar 9
10 2 apr 10
11 3 apr 11
12 1 apr 12
I'm lost here, as I haven't been able to keep the names in the first column.
What am I missing here?
######### EDIT TO ADD #######
> R.Version()$version.string
[1] "R version 3.2.2 (2015-08-14)"
> packageVersion("tidyr")
[1] ‘0.3.0’
It looks like in tidyr 0.3.0 you will need to convert the factor column name to character. I'm not sure why that has changed from version 0.2.0, where it worked without conversion to character. Nevertheless, here we go ...
gather(transform(mydf, name = as.character(name)), month, sales, jan:apr)
# name month sales
# 1 Joe jan 1
# 2 Mary jan 2
# 3 Bob jan 3
# 4 Joe feb 4
# 5 Mary feb 5
# 6 Bob feb 6
# 7 Joe mar 7
# 8 Mary mar 8
# 9 Bob mar 9
# 10 Joe apr 10
# 11 Mary apr 11
# 12 Bob apr 12
R.version.string
# [1] "R version 3.2.2 (2015-08-14)"
packageVersion("tidyr")
# [1] ‘0.3.0’
Credit to @aosmith for finding the closed github issue. You should be able to use the development version without issue now. To install the dev version, use
devtools::install_github(
"hadley/tidyr",
ref = "2e08772d154babcc97912bcae8b0b64b65b964ab"
)
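
As a side note on where the 2, 3, 1 in the question's output come from: they look like the integer codes of the factor, whose levels are sorted alphabetically (Bob, Joe, Mary). The sketch below shows this in base R, along with building the data frame with a character column from the start so that no conversion is needed. In R 4.0.0 and later, stringsAsFactors defaults to FALSE, so this situation should arise less often.

```r
# A factor stores integer codes that index into its sorted levels.
name <- factor(c("Joe", "Mary", "Bob"))
levels(name)      # "Bob"  "Joe"  "Mary"
as.integer(name)  # 2 3 1, the values seen in the gather() output

# Creating the column as character avoids the issue altogether.
mydf <- data.frame(name = c("Joe", "Mary", "Bob"),
                   jan = 1:3,
                   feb = 4:6,
                   mar = 7:9,
                   apr = 10:12,
                   stringsAsFactors = FALSE)
str(mydf$name)
```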
