I ran into a problem when trying to merge two data frames in R and would appreciate some help.
Suppose I have the following data frames:
> Data_A
Code year score
A 1991 1
A 1992 2
A 1993 3
B 1991 3
B 1993 7
> Data_B
Code year l.score
A 1991 NA
A 1992 1
A 1993 2
A 1994 3
B 1991 NA
B 1992 3
B 1993 NA
B 1994 7
The desired result after merging them should look like this:
> Data_merge
Code year score l.score
A 1991 1 NA
A 1992 2 1
A 1993 3 2
B 1991 3 NA
B 1993 7 NA
That is, when merging these data frames, only the combinations of the shared columns that appear in one of them should be kept (in this case, "Code" and "year" of Data_A). I tried merge(Data_A, Data_B, all = FALSE), but without success. Does anyone have an idea? Thanks for reading!
library(dplyr)
Data_A %>%
left_join(Data_B, by = c('Code', 'year'))
Code year score l.score
1 A 1991 1 NA
2 A 1992 2 1
3 A 1993 3 2
4 B 1991 3 NA
5 B 1993 7 NA
Your own solution seems to work for me (I am using R 3.6.1), so I am not sure why it fails for you:
> merge(Data_A, Data_B, all = FALSE)
Code year score X1.score
1 A 1991 1 NA
2 A 1992 2 1
3 A 1993 3 2
4 B 1991 3 NA
5 B 1993 7 NA
DATA
Data_A <- structure(list(Code = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), year = c(1991L, 1992L, 1993L, 1991L,
1993L), score = c(1L, 2L, 3L, 3L, 7L)), class = "data.frame", row.names = c(NA,
-5L))
Data_B <- structure(list(Code = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), year = c(1991L,
1992L, 1993L, 1994L, 1991L, 1992L, 1993L, 1994L), X1.score = c(NA,
1L, 2L, 3L, NA, 3L, NA, 7L)), class = "data.frame", row.names = c(NA,
-8L))
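The left-join behaviour asked for in the question can also be spelled out explicitly in base R. The following is a minimal sketch, assuming the two data frames from the question are rebuilt by hand (with character rather than factor codes, and with the column named l.score as in the question; both are assumptions of this sketch). Passing by = and all.x = TRUE makes merge() keep every row of Data_A and attach only the matching rows of Data_B.

```r
# Rebuild the question's data (assumed layout, character codes)
Data_A <- data.frame(
  Code  = c("A", "A", "A", "B", "B"),
  year  = c(1991L, 1992L, 1993L, 1991L, 1993L),
  score = c(1L, 2L, 3L, 3L, 7L),
  stringsAsFactors = FALSE
)
Data_B <- data.frame(
  Code    = rep(c("A", "B"), each = 4),
  year    = rep(1991:1994, times = 2),
  l.score = c(NA, 1L, 2L, 3L, NA, 3L, NA, 7L),
  stringsAsFactors = FALSE
)

# Left join: all rows of Data_A, matched on Code and year
Data_merge <- merge(Data_A, Data_B, by = c("Code", "year"), all.x = TRUE)
print(Data_merge)
```

Naming the key columns in by = avoids relying on merge()'s default of joining on every column name the two data frames share.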
I have longitudinal data in which respondents were recruited as cohorts. Right now I have the year in which each person took the survey, but I want to create a new column that simply counts whether it is the first, second, or third time that person took the survey.
Original Table
PersonID SurveyYear SurveyQ1Rating SurveyQ2Rating Gender
12 2013 5 4 f
12 2012 4 4 f
12 2010 3 3 f
2 2007 4 4 m
2 2008 3 3 m
2 2009 3 5 m
2 2010 5 5 m
2 2013 2 2 m
5 2013 4 4 f
5 2014 5 5 f
Target Table (where I created a new column SurveyTime to mark the i-th time a person took the survey)
PersonID SurveyYear SurveyTime SurveyQ1Rating SurveyQ2Rating Gender
12 2013 3 5 4 f
12 2012 2 4 4 f
12 2010 1 3 3 f
2 2007 1 4 4 m
2 2008 2 3 3 m
2 2009 3 3 5 m
2 2010 4 5 5 m
2 2013 5 2 2 m
5 2013 1 4 4 f
5 2014 2 5 5 f
A base solution:
df |>
transform(SurveyTime = ave(SurveyYear, PersonID, FUN = rank))
Its dplyr equivalent:
library(dplyr)
df %>%
group_by(PersonID) %>%
mutate(SurveyTime = dense_rank(SurveyYear)) %>%
ungroup()
Data
df <- structure(list(PersonID = c(12L, 12L, 12L, 2L, 2L, 2L, 2L, 2L,
5L, 5L), SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2008L, 2009L,
2010L, 2013L, 2013L, 2014L), SurveyQ1Rating = c(5L, 4L, 3L, 4L,
3L, 3L, 5L, 2L, 4L, 5L), SurveyQ2Rating = c(4L, 4L, 3L, 4L, 3L,
5L, 5L, 2L, 4L, 5L), Gender = c("f", "f", "f", "m", "m", "m",
"m", "m", "f", "f")), class = "data.frame", row.names = c(NA, -10L))
Using data.table:
library(data.table)
# df is the data frame defined under "Data" above
setDT(df)[, SurveyTime := frank(SurveyYear), PersonID]
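One caveat about the ranking functions used above: rank() and frank() average the ranks of tied values by default, whereas dense_rank() assigns consecutive integers without gaps. The data shown has at most one survey per person per year, so all three agree here. As a hedged base R sketch, assuming ties could occur in other data, a dense rank can be written with match(); the helper name dense is hypothetical and only the two relevant columns are rebuilt.

```r
# Only the columns needed for the ranking, copied from the question
df <- data.frame(
  PersonID   = c(12L, 12L, 12L, 2L, 2L, 2L, 2L, 2L, 5L, 5L),
  SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2008L, 2009L, 2010L,
                 2013L, 2013L, 2014L)
)

# Hypothetical helper: position of each value among the sorted distinct values
dense <- function(x) match(x, sort(unique(x)))

# Apply the helper within each person
df$SurveyTime <- ave(df$SurveyYear, df$PersonID, FUN = dense)
print(df)
# SurveyTime: 3 2 1 1 2 3 4 5 1 2
```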
Hi, I am a Stata user and I am trying to port my code to R. I have panel data as shown below, and I am looking for a command that creates a variable whose value is determined by the year and quarter of each row. In Stata this would be done with gen new_variable = yq(year, quarter)
My data frame looks like this:
id year quarter
1 2007 1
1 2007 2
1 2007 3
1 2007 4
1 2008 1
1 2008 2
1 2008 3
1 2008 4
1 2009 1
1 2009 2
1 2009 3
1 2009 4
2 2007 1
2 2007 2
2 2007 3
2 2007 4
2 2008 1
2 2008 2
2 2008 3
2 2008 4
3 2009 2
3 2009 3
3 2010 2
3 2010 3
My expected output should look like this (the values in new_variable are arbitrary; I am just looking for a value that is always the same for a given year and quarter):
id year quarter new_variable
1 2007 1 220
1 2007 2 221
1 2007 3 222
1 2007 4 223
1 2008 1 224
1 2008 2 225
1 2008 3 226
1 2008 4 227
1 2009 1 228
1 2009 2 229
1 2009 3 230
1 2009 4 231
2 2007 1 220
2 2007 2 221
2 2007 3 222
2 2007 4 223
2 2008 1 224
2 2008 2 225
2 2008 3 226
2 2008 4 227
3 2009 2 229
3 2009 3 230
3 2010 2 233
3 2010 3 234
Any of these will work:
# basic: just concatenate year and quarter
df$new_variable = paste(df$year, df$quarter)
# made for this, has additional options around
# ordering of the categories and including unobserved combos
df$new_variable = interaction(df$year, df$quarter)
# for an integer value, 1 to the number of combos
df$new_variable = as.integer(factor(paste(df$year, df$quarter)))
Here are two options:
library(dplyr) # with dplyr
df %>% mutate(new_variable = group_indices(., year, quarter))
library(data.table) # with data.table
setDT(df)[, new_variable := .GRP, .(year, quarter)]
Data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), year = c(2007L,
2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2009L, 2009L,
2009L, 2009L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2009L, 2009L, 2010L, 2010L), quarter = c(1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
2L, 3L, 2L, 3L)), .Names = c("id", "year", "quarter"), class = "data.frame", row.names = c(NA,
-24L))
1) yearqtr The yearqtr class in the zoo package does this. yearqtr objects have a type of double with the value year + 0 for Q1, year + 1/4 for Q2, etc. When displayed they are shown in a meaningful way; however, they can still be manipulated as if they were plain numbers, e.g. if yq is a yearqtr variable then yq + 1 is the same quarter in the next year.
library(zoo)
transform(df, new_variable = as.yearqtr(year + (quarter - 1)/4))
1a) or
transform(df, new_variable = as.yearqtr(paste(year, quarter, sep = "-")))
Either of these gives:
id year quarter new_variable
1 1 2007 1 2007 Q1
2 1 2007 2 2007 Q2
3 1 2007 3 2007 Q3
4 1 2007 4 2007 Q4
5 1 2008 1 2008 Q1
... etc ...
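2) 220 If you specifically want to assign 220 to the first quarter in the data and have each subsequent calendar quarter increment by 1 (so that quarters missing from the data, such as 2010 Q1, still advance the count, as in the expected output), then:
transform(df, new_variable = 4 * year + quarter - min(4 * year + quarter) + 220)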
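For readers who want the numbers to line up with Stata itself: Stata stores quarterly dates as the number of quarters elapsed since 1960q1, so yq(year, quarter) can be reproduced with plain arithmetic. The sketch below is an illustration under that assumption and is not part of the answers above; the R function name yq is chosen here only to mirror Stata's, and the three-row data frame is made up for the demonstration.

```r
# Hypothetical R counterpart of Stata's yq(): quarters since 1960q1
yq <- function(year, quarter) {
  stopifnot(all(quarter %in% 1:4))
  (year - 1960L) * 4L + (quarter - 1L)
}

# Small illustrative data frame (not the full panel from the question)
df <- data.frame(year = c(2007L, 2007L, 2010L), quarter = c(1L, 2L, 3L))
df$new_variable <- yq(df$year, df$quarter)
print(df)
# 2007 Q1 -> 188, 2007 Q2 -> 189, 2010 Q3 -> 202
```

Because the value depends only on year and quarter, it is constant across ids and increases by one per calendar quarter, whether or not every quarter is observed.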
I have a data frame that shows the number of publications by year, but I am interested only in conference and journal publications. I would like to sum all other categories into an "Other" type.
Example of the data frame:
year type n
1994 Conference 2
1994 Journal 3
1995 Conference 10
1995 Editorship 3
1996 Conference 20
1996 Editorship 2
1996 Books and Thesis 3
And the result would be:
year type n
1994 Conference 2
1994 Journal 3
1995 Conference 10
1995 Other 3
1996 Conference 20
1996 Other 5
With dplyr we can replace anything other than "Journal" or "Conference" with "Other" and then sum n by year and type.
library(dplyr)
df %>%
mutate(type = sub("^(?!(Journal|Conference)$).*$", "Other", type, perl = TRUE)) %>%
group_by(year, type) %>%
summarise(n = sum(n))
# year type n
# <int> <chr> <int>
#1 1994 Conference 2
#2 1994 Journal 3
#3 1995 Conference 10
#4 1995 Other 3
#5 1996 Conference 20
#6 1996 Other 5
We can use data.table
library(data.table)
library(stringr)
# the negative lookahead matches every type except Journal and Conference
setDT(df)[, .(n = sum(n)), .(year, type = str_replace(as.character(type),
'^(?!(Journal|Conference)$).*$', 'Other'))]
# year type n
#1: 1994 Conference 2
#2: 1994 Journal 3
#3: 1995 Conference 10
#4: 1995 Other 3
#5: 1996 Conference 20
#6: 1996 Other 5
levels(df$type)[levels(df$type) %in% c("Editorship", "Books_and_Thesis")] <- "Other"
aggregate(n ~ type + year, data=df, sum)
# type year n
# 1 Conference 1994 2
# 2 Journal 1994 3
# 3 Other 1995 3
# 4 Conference 1995 10
# 5 Other 1996 5
# 6 Conference 1996 20
Input data:
df <- structure(list(year = c(1994L, 1994L, 1995L, 1995L, 1996L, 1996L,
1996L), type = structure(c(2L, 3L, 2L, 1L, 2L, 1L, 1L), .Label = c("Other",
"Conference", "Journal"), class = "factor"), n = c(2L, 3L, 10L,
3L, 20L, 2L, 3L)), .Names = c("year", "type", "n"), row.names = c(NA, -7L), class = "data.frame")
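The same recoding can be done without regular expressions. The following is a minimal base R sketch, assuming the raw rows from the question (the "Input data" above already has the extra categories relabelled as "Other"); the object names pubs and totals are hypothetical.

```r
# Raw example rows from the question, types kept as character
pubs <- data.frame(
  year = c(1994L, 1994L, 1995L, 1995L, 1996L, 1996L, 1996L),
  type = c("Conference", "Journal", "Conference", "Editorship",
           "Conference", "Editorship", "Books and Thesis"),
  n    = c(2L, 3L, 10L, 3L, 20L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Keep the two types of interest, relabel everything else
pubs$type <- ifelse(pubs$type %in% c("Conference", "Journal"),
                    pubs$type, "Other")

# Sum n within each year/type combination and order by year
totals <- aggregate(n ~ year + type, data = pubs, FUN = sum)
totals <- totals[order(totals$year, totals$type), ]
rownames(totals) <- NULL
print(totals)
```

Testing membership with %in% makes the intent (keep these two, lump the rest) explicit, which is harder to get wrong than a pattern that has to match "everything except".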
Let's say I have a data frame like the following:
year stint ID W
1 2003 1 abc 10
2 2003 2 abc 3
3 2003 1 def 16
4 2004 1 abc 15
5 2004 1 def 11
6 2004 2 def 7
I would like to combine the data so that it looks like
year ID W
1 2003 abc 13
3 2003 def 16
4 2004 abc 15
5 2004 def 18
I found a way to combine the data as desired, but I'm very sure that there's a better way.
combinedData = unique(ddply(data, "ID", function(x) {
ddply(x, "year", function(y) {
data.frame(ID=x$ID, W=sum(y$W))
})
}))
combinedData[order(combinedData$year),]
This produces the following output:
year ID W
1 2003 abc 13
7 2003 def 16
4 2004 abc 15
10 2004 def 18
Specifically I don't like that I had to use unique (otherwise I get each unique combo of year,ID,W three times in the outputted data), and I don't like that the row numbers aren't sequential. How can I do this more cleanly?
Do this with base R:
aggregate(W~year+ID, df, sum)
# year ID W
#1 2003 abc 13
#2 2004 abc 15
#3 2003 def 16
#4 2004 def 18
data
df <- structure(list(year = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L
), stint = c(1L, 2L, 1L, 1L, 1L, 2L), ID = structure(c(1L, 1L,
2L, 1L, 2L, 2L), .Label = c("abc", "def"), class = "factor"),
W = c(10L, 3L, 16L, 15L, 11L, 7L)), .Names = c("year", "stint",
"ID", "W"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
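aggregate() returns the groups ordered by ID first, whereas the question's desired output is ordered by year. As a small hedged sketch, assuming the same six rows (rebuilt here with character IDs, and with the hypothetical object name combined), the result can be reordered and its row names reset so that they are sequential:

```r
# Rebuild the question's data (assumed layout, character IDs)
data <- data.frame(
  year  = c(2003L, 2003L, 2003L, 2004L, 2004L, 2004L),
  stint = c(1L, 2L, 1L, 1L, 1L, 2L),
  ID    = c("abc", "abc", "def", "abc", "def", "def"),
  W     = c(10L, 3L, 16L, 15L, 11L, 7L),
  stringsAsFactors = FALSE
)

# Sum W over stints within each year/ID pair
combined <- aggregate(W ~ year + ID, data = data, FUN = sum)

# Order by year, then ID, and renumber the rows 1..n
combined <- combined[order(combined$year, combined$ID), ]
rownames(combined) <- NULL
print(combined)
```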
I am trying to get the maximum value in the column event up to the point where an agreement (dummy) is reached. Events are nested in agreements, and agreements are nested in dyads, which run over years. Note that the years are not always continuous, meaning there are breaks between them (1986, 1987, 2001, 2002).
I am able to get the maximum value within each dyad with ddply and max(event), but I am struggling to assign the different events to the right agreement (until/after). I am basically lacking an 'identifier' that assigns each observation to an agreement.
The results I am looking for are already in the column "result".
dyad year event agreement agreement.name result
1    1985 9
1    1986 4     1         agreement1     9
1    1987
1    2001 3
1    2002       1         agreement2     3
2    1999 1
2    2000 5
2    2001       1         agreement3     5
2    2002 2
2    2003
2    2004       1         agreement 4    2
Here is the data in a format which is hopefully easier to use:
df<-structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L,
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L,
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA,
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "",
"agreement2", "", "", "agreement3", "", "", "agreement 4"), result = c(NA,
9L, NA, NA, 3L, NA, NA, 5L, NA, NA, 2L)), .Names = c("dyad",
"year", "event", "agreement", "agreement.name", "result"), class = "data.frame", row.names = c(NA,
-11L))
Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and create another grouping variable ('ind') that starts a new group each time 'agreement.name' switches from non-empty back to empty. Grouped by both 'dyad' and 'ind', we create a new column 'result', using ifelse to fill the rows where 'agreement.name' is non-empty with the max of 'event'.
library(data.table)
setDT(df)[, ind:=cumsum(c(TRUE,diff(agreement.name=='')>0)),dyad][,
result:=ifelse(agreement.name!='', max(event, na.rm=TRUE), NA) ,
list(dyad, ind)][, ind:=NULL][]
# dyad year event agreement agreement.name result
# 1: 1 1985 9 NA NA
# 2: 1 1986 4 1 agreement1 9
# 3: 1 1987 NA NA NA
# 4: 1 2001 3 NA NA
# 5: 1 2002 NA 1 agreement2 3
# 6: 2 1999 1 NA NA
# 7: 2 2000 5 NA NA
# 8: 2 2001 NA 1 agreement3 5
# 9: 2 2002 2 NA NA
#10: 2 2003 NA NA NA
#11: 2 2004 NA 1 agreement 4 2
Or, instead of ifelse, we can use a numeric index:
setDT(df)[, result:=c(NA, max(event, na.rm=TRUE))[(agreement.name!='')+1L] ,
list(ind= cumsum(c(TRUE,diff(agreement.name=='')>0)),dyad)][]
data
df <- structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L,
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L,
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA,
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "",
"agreement2", "", "", "agreement3", "", "", "agreement 4")),
.Names = c("dyad",
"year", "event", "agreement", "agreement.name"), row.names = c(NA,
-11L), class = "data.frame")
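The missing 'identifier' can also be built in base R. The following is a minimal sketch, assuming the rows are already sorted by dyad and year as in the question; the names has_agr, spell, safe_max and grp_max are hypothetical. It is a variation on the approach above rather than a line-by-line translation: each row is assigned the number of agreements seen on earlier rows of the same dyad, so an agreement row closes its own group.

```r
# Rebuild the relevant columns from the question
df <- data.frame(
  dyad  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  year  = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L, 2001L,
            2002L, 2003L, 2004L),
  event = c(9L, 4L, NA, 3L, NA, 1L, 5L, NA, 2L, NA, NA),
  agreement.name = c("", "agreement1", "", "", "agreement2",
                     "", "", "agreement3", "", "", "agreement 4"),
  stringsAsFactors = FALSE
)

has_agr <- df$agreement.name != ""

# Identifier: number of agreements on earlier rows of the same dyad
spell <- ave(as.integer(has_agr), df$dyad,
             FUN = function(x) cumsum(c(0L, head(x, -1))))

# Maximum that tolerates groups containing only NA
safe_max <- function(x) if (all(is.na(x))) NA_integer_ else max(x, na.rm = TRUE)

# Maximum event within each dyad/spell, reported only on agreement rows
grp_max <- ave(df$event, df$dyad, spell, FUN = safe_max)
df$result <- ifelse(has_agr, grp_max, NA)
print(df)
# result: NA 9 NA NA 3 NA NA 5 NA NA 2
```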