I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this data frame (or create a new one) so that it keeps the same column names but repeats each row 12 times, once for each month of that year. The year in the first column is therefore also repeated 12 times.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
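Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and then convert back to a data.frame. The Kronecker product drops the column names, so they are restored with setNames. If a matrix result is ok, the as.data.frame step can be omitted (set colnames on the matrix instead).
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))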
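As an alternative sketch (not part of the answer above), the same expansion can be done by repeating row indices, which keeps column names and also works when some columns are not numeric. The month column is a hypothetical extra that the question did not ask for.

```r
# Example data from the question
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat each row index 12 times, then subset; column names are kept
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(df2) <- NULL  # reset row names to 1..36

# Optional, hypothetical month column (1..12 within each year)
df2$month <- rep(1:12, times = nrow(df))
```

Because this only indexes rows, it avoids the round trip through a matrix, which would coerce every column to a single type.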
This is my first question on stackoverflow. I searched for similar questions but I didn't find an answer.
I know that the question in the title isn't clear but I hope you are going to understand what I want as output.
I have a dataframe that looks like this:
ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019
If names in the column "Name" share the same ID, those people are collaborators on a project.
I want to add a column with the number of UNIQUE collaborators per year for each name (counting each person among their own collaborators).
The output should look like this (I added the last column only to explain the count; it is not needed):
ID Name Year Unique_Coll explication
1 1 Anas 2018 3 (Anas, Carl, Catherine)
2 1 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
3 1 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
4 2 Anas 2018 3 (Anas, Carl, Catherine)
5 2 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
6 3 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
7 3 Julien 2018 2 (Julien, Catherine)
8 4 Raul 2018 4 (Raul, Ahmed, Laurence, Carl)
9 4 Ahmed 2018 4 (Ahmed, Raul, Laurence, Carl)
10 4 Laurence 2018 4 (Laurence, Raul, Ahmed, Carl)
11 4 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
12 5 Anas 2019 4 (Anas, Georges, Arman, Pietro)
13 5 Georges 2019 3 (Georges, Anas, Arman)
14 5 Arman 2019 3 (Arman, Anas, Georges)
15 6 Anas 2019 4 (Anas, Georges, Arman, Pietro)
16 6 Pietro 2019 2 (Pietro, Anas)
17 7 Pietro 2019 2 (Pietro, Anas)
18 8 Diego 2019 1 (Diego)
Thank you
You could construct a variable that would be a list of names and count the number of unique names in the following way:
library(dplyr)
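df = df %>%
  # attach the full list of names in each project (ID) to every row
  group_by(ID) %>%
  mutate(group = list(Name)) %>%
  # pool those lists per person and year, then count distinct names
  group_by(Year,Name) %>%
  mutate(n = n_distinct(unlist(list(group)))) %>%
  select(-group)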
# A tibble: 18 x 4
# Groups: Year, Name [12]
ID Name Year n
<int> <chr> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
The following solution uses dplyr to first join all collaborators to every Name, creating a column Name_collab (note that this expands the data frame and could blow it up if it were large). Then, we count the distinct Name_collab for every Name, Year combination and get rid of duplicates.
library(dplyr)
df %>%
  left_join(df, by = c("ID", "Year"), suffix = c("", "_collab")) %>%
  group_by(Name, Year) %>%
  mutate(Unique_Coll = n_distinct(Name_collab)) %>%
  ungroup() %>%
  distinct(ID, Name, Year, Unique_Coll)
which gives
# A tibble: 18 x 4
ID Name Year Unique_Coll
<int> <fct> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
Input:
df <- read.table(text="ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019")
I have a solution using joins.
library(tidyverse)
# read data
dta <- tribble(~ID, ~Name, ~Year,
1, "Anas", 2018,
1, "Carl", 2018,
1, "Catherine", 2018,
2, "Anas", 2018,
2, "Carl", 2018,
3, "Catherine", 2018,
3, "Julien", 2018,
4, "Raul", 2018,
4, "Ahmed", 2018,
4, "Laurence", 2018,
4, "Carl", 2018,
5, "Anas", 2019,
5, "Georges", 2019,
5, "Arman", 2019,
6, "Anas", 2019,
6, "Pietro", 2019,
7, "Pietro", 2019,
8, "Diego", 2019)
nb_collabs <- dta %>%
  left_join(dta, by = c("ID", "Year")) %>%
  select(-ID) %>%
  group_by(Name.x, Year) %>%
  nest(collaborators = Name.y) %>%
  mutate(unique_collaborators = map(collaborators, distinct),
         Unique_Coll = map_int(unique_collaborators, nrow)) %>%
  select(-collaborators, -unique_collaborators)
left_join(dta, nb_collabs, by = c("Name"="Name.x", "Year"))
# A tibble: 18 x 4
# ID Name Year Unique_Coll
# <dbl> <chr> <dbl> <int>
# 1 1 Anas 2018 3
# 2 1 Carl 2018 6
# 3 1 Catherine 2018 4
# 4 2 Anas 2018 3
# 5 2 Carl 2018 6
# 6 3 Catherine 2018 4
# 7 3 Julien 2018 2
# 8 4 Raul 2018 4
# 9 4 Ahmed 2018 4
#10 4 Laurence 2018 4
#11 4 Carl 2018 6
#12 5 Anas 2019 4
#13 5 Georges 2019 3
#14 5 Arman 2019 3
#15 6 Anas 2019 4
#16 6 Pietro 2019 2
#17 7 Pietro 2019 2
#18 8 Diego 2019 1
The first step is to join the data with itself, so that each person's name is in "Name.x" and there is a separate row for each collaborator in "Name.y". Then we nest the collaborator names, which gives one row per Name and Year with a nested data frame of collaborators. From there we only need to remove the duplicates and count the remaining people.
nb_collabs is then a table with each person and their number of collaborators, and we can simply join it back to the original data frame to get the desired format.
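The counting rule used in the answers above can also be sketched in base R, without joins. This is only an illustration under the assumption that the data look like the example in the question; count_collab is a hypothetical helper name, not something from the original answers.

```r
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8),
  Name = c("Anas", "Carl", "Catherine", "Anas", "Carl", "Catherine",
           "Julien", "Raul", "Ahmed", "Laurence", "Carl", "Anas",
           "Georges", "Arman", "Anas", "Pietro", "Pietro", "Diego"),
  Year = c(rep(2018, 11), rep(2019, 7)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one person and year, find the projects (IDs)
# they appear in, then count the distinct names in those projects.
count_collab <- function(name, year) {
  ids <- df$ID[df$Name == name & df$Year == year]
  length(unique(df$Name[df$ID %in% ids & df$Year == year]))
}

df$Unique_Coll <- mapply(count_collab, df$Name, df$Year, USE.NAMES = FALSE)
```

This scans the whole data frame once per row, so it is meant to make the logic explicit rather than to be fast on large data.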
I have data similar to this. I would like to compute a running day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the column counts the days cumulatively across the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote the code below, but the result is wrong and the code is too long. It doesn't count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
  date <- (year-syear)*365+sum(cmonth[1:month])+day
  for(yr in c(syear:year)){
    if(yr==year){
      if(yr%%4==0&&month>2){date<-date+1}
    }else{
      if(yr%%4==0){date<-date+1}
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, so look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
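To connect this back to the hand-written cday function in the question, here is a minimal sketch of a vectorized replacement built on the Date class. The name day_number and the default start date are assumptions for illustration; the start date itself counts as day 1.

```r
# Hypothetical helper: day number counted from a start date,
# where the start date itself is day 1. Leap years are handled
# by the Date arithmetic, not by hand.
day_number <- function(year, month, day, start = "2011-01-01") {
  date <- as.Date(paste(year, month, day, sep = "-"))
  as.integer(date - as.Date(start)) + 1L
}

df <- data.frame(year  = c(2011, 2012, 2013),
                 month = c(1, 3, 2),
                 day   = c(5, 1, 14))
df$day.no <- day_number(df$year, df$month, df$day)
```

Because the function is vectorized, it can be applied to whole columns directly, with no need for apply over rows.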
I tried "unique" and "duplicated" but cannot get R to do what I want, which is basically to compare two data sets and find out who in the first data set is not in the second. data1 contains a customer ID, a name, and the year that person bought X. data2 contains a customer ID and a year (2017) indicating they purchased X this year. What I want to do is extract a list of people from data1 who have NOT purchased X this year, so I can contact them and tell them to buy X again.
> data1
ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014
> data2
ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017
Merging data1 and data2 by ID ( merge(data1, data2, by = "ID") ) gives me:
> merged_d1d2
ID NAME YEAR.x YEAR.y
1 5 Fred 2014 2017
2 7 Sara 2015 2017
3 10 Bill 2014 2017
4 11 Doug 2016 2017
5 15 Matt 2014 2017
...But I want everyone EXCEPT these people! I also added the names into data2 and then combined data1 and data2 using rbind which gives me a data set with duplicates (e.g. 2 Fred, 2 Sara, 2 Bill, etc.) I then tried to use "unique" and "duplicated" but these always leave one of those duplicates (1 Fred, 1 Sara) in the new data. I want everyone from data1 except those people. I have a feeling this is a simple process, but any help would be greatly appreciated.
Simply:
data1[!data1$ID%in%data2$ID,]
ID NAME YEAR
1 8 Ann 2016
4 12 Emma 2015
6 9 Julie 2014
7 13 Karl 2016
9 14 Rhett 2014
11 4 Tom 2014
Or you could try anti_join by ID from dplyr:
data1 <- read.table(text="ID NAME YEAR
8 Ann 2016
10 Bill 2014
11 Doug 2016
12 Emma 2015
5 Fred 2014
9 Julie 2014
13 Karl 2016
15 Matt 2014
14 Rhett 2014
7 Sara 2015
4 Tom 2014",header=TRUE, stringsAsFactors=FALSE)
data2 <- read.table(text="ID YEAR
29 2017
32 2017
10 2017
21 2017
11 2017
5 2017
28 2017
33 2017
24 2017
22 2017
31 2017
15 2017
25 2017
30 2017
26 2017
7 2017
23 2017
27 2017",header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
anti_join(data1,data2,by="ID")
ID NAME YEAR
1 4 Tom 2014
2 8 Ann 2016
3 9 Julie 2014
4 12 Emma 2015
5 13 Karl 2016
6 14 Rhett 2014
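The merge attempt from the question can also be adapted to give the same result. This is a sketch in base R, assuming the column names from the question: keeping all rows of data1 with all.x = TRUE leaves NA in YEAR.y for customers with no 2017 purchase, and those are the rows to keep. The names merged and not_yet are illustrative.

```r
data1 <- data.frame(
  ID   = c(8, 10, 11, 12, 5, 9, 13, 15, 14, 7, 4),
  NAME = c("Ann", "Bill", "Doug", "Emma", "Fred", "Julie",
           "Karl", "Matt", "Rhett", "Sara", "Tom"),
  YEAR = c(2016, 2014, 2016, 2015, 2014, 2014, 2016, 2014, 2014, 2015, 2014),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ID   = c(29, 32, 10, 21, 11, 5, 28, 33, 24, 22, 31, 15, 25, 30, 26, 7, 23, 27),
  YEAR = 2017
)

# Left join: every row of data1 is kept; YEAR.y is NA when the ID
# does not occur in data2
merged <- merge(data1, data2, by = "ID", all.x = TRUE)

# Customers from data1 with no matching row in data2
not_yet <- merged[is.na(merged$YEAR.y), c("ID", "NAME", "YEAR.x")]
```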
This may be a very basic question about tidyr, which I just started learning, but I don't seem to find an answer after much searching in SO and Google.
Suppose I have a data frame:
mydf<- data.frame(name=c("Joe","Mary","Bob"),
jan=1:3,
feb=4:6,
mar=7:9,
apr=10:12)
which I want to reshape from wide to long. Before, I used melt, so:
library(reshape)
melt(mydf,id.vars = "name",measure.vars = colnames(mydf)[-1])
Which produces
name variable value
1 Joe jan 1
2 Mary jan 2
3 Bob jan 3
4 Joe feb 4
5 Mary feb 5
6 Bob feb 6
7 Joe mar 7
8 Mary mar 8
9 Bob mar 9
10 Joe apr 10
11 Mary apr 11
12 Bob apr 12
I wanted to use tidyr::gather, so I tried
gather(mydf,month,sales,jan:apr)
Which produces
name month sales
1 2 jan 1
2 3 jan 2
3 1 jan 3
4 2 feb 4
5 3 feb 5
6 1 feb 6
7 2 mar 7
8 3 mar 8
9 1 mar 9
10 2 apr 10
11 3 apr 11
12 1 apr 12
I'm lost here, as I haven't been able to keep the names in the first column.
What am I missing here?
######### EDIT TO ADD #######
> R.Version()$version.string
[1] "R version 3.2.2 (2015-08-14)"
> packageVersion("tidyr")
[1] ‘0.3.0’
It looks like in tidyr 0.3.0 you will need to convert the factor column name to character. I'm not sure why that has changed from version 0.2.0, where it worked without conversion to character. Nevertheless, here we go ...
gather(transform(mydf, name = as.character(name)), month, sales, jan:apr)
# name month sales
# 1 Joe jan 1
# 2 Mary jan 2
# 3 Bob jan 3
# 4 Joe feb 4
# 5 Mary feb 5
# 6 Bob feb 6
# 7 Joe mar 7
# 8 Mary mar 8
# 9 Bob mar 9
# 10 Joe apr 10
# 11 Mary apr 11
# 12 Bob apr 12
R.version.string
# [1] "R version 3.2.2 (2015-08-14)"
packageVersion("tidyr")
# [1] ‘0.3.0’
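The 2, 3, 1 pattern in the name column of the broken output appears to be the integer codes stored behind the factor. A small base R sketch of that, independent of tidyr:

```r
# data.frame() turned strings into factors by default before R 4.0.0,
# so under R 3.2.2 the name column is a factor like this one
name <- factor(c("Joe", "Mary", "Bob"))

levels(name)        # levels are sorted: "Bob" "Joe" "Mary"
as.integer(name)    # underlying integer codes: 2 3 1
as.character(name)  # the labels: "Joe" "Mary" "Bob"
```

Building the data frame with stringsAsFactors = FALSE is another way to end up with a character column from the start.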
Credit to @aosmith for finding the closed GitHub issue. You should be able to use the development version without issue now. To install the dev version, use
devtools::install_github(
"hadley/tidyr",
ref = "2e08772d154babcc97912bcae8b0b64b65b964ab"
)