Let's say I have two data frames. Each has a DAY, a MONTH, and a YEAR column along with one other variable, C and P, respectively. I want to merge the two data frames in two different ways. First, I merge by date:
test<-merge(data1,data2,by.x=c("DAY","MONTH","YEAR"),by.y=c("DAY","MONTH","YEAR"),all.x=T,all.y=F)
This works perfectly. The second merge is the one I'm having trouble with. So far I have merged the value for January 5, 1996 from data1 and the value for January 5, 1996 from data2 into one data frame, but now I would like to merge a third value onto each row of the new data frame. Specifically, I want to merge the value for January 4, 1996 from data2 with the two values from January 5, 1996. Any tips on getting merge to be flexible in this way?
sample data:
data1
C DAY MONTH YEAR
1 1 1 1996
6 5 1 1996
5 8 1 1996
3 11 1 1996
9 13 1 1996
2 14 1 1996
3 15 1 1996
4 17 1 1996
data2
P DAY MONTH YEAR
1 1 1 1996
4 2 1 1996
8 3 1 1996
2 4 1 1996
5 5 1 1996
2 6 1 1996
7 7 1 1996
4 8 1 1996
6 9 1 1996
1 10 1 1996
7 11 1 1996
3 12 1 1996
2 13 1 1996
2 14 1 1996
5 15 1 1996
9 16 1 1996
1 17 1 1996
Make a new column that is a Date type, not just day, month, and year integers. You can use as.Date() to do this, though you will need to look up the right value for the format= argument given your string. Let's call that column D1 in both data frames. Now do data2$D2 = data2$D1 + 1, so that each row of data2 also carries the date of the following day. The key point here is that Date types allow simple date arithmetic. Now just merge with by.x="D1" and by.y="D2".
In case that was confusing, the bottom line is that you need to convert your columns to Date types so that you can do date arithmetic.
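A minimal base R sketch of the date-arithmetic idea, using a few rows from the question. The column names D1 and P_prev and the helper data frame prev are assumptions made for illustration, not part of the original answer. Shifting the data2 dates forward by one day lets an ordinary merge pick up the previous day's value.

```r
# A few rows from the question's sample data
data1 <- data.frame(C = c(1, 6, 5), DAY = c(1, 5, 8), MONTH = 1, YEAR = 1996)
data2 <- data.frame(P = c(1, 4, 8, 2, 5, 2, 7, 4), DAY = 1:8, MONTH = 1, YEAR = 1996)

# Build real Date columns from the integer day/month/year parts
data1$D1 <- as.Date(paste(data1$YEAR, data1$MONTH, data1$DAY, sep = "-"), format = "%Y-%m-%d")
data2$D1 <- as.Date(paste(data2$YEAR, data2$MONTH, data2$DAY, sep = "-"), format = "%Y-%m-%d")

# First merge: same-day value of P
same_day <- merge(data1, data2[, c("D1", "P")], by = "D1", all.x = TRUE)

# Second merge: shift data2 forward one day so that Jan 4 lines up with Jan 5
prev <- data.frame(D1 = data2$D1 + 1, P_prev = data2$P)
result <- merge(same_day, prev, by = "D1", all.x = TRUE)
```

With all.x = TRUE, rows of data1 that have no previous day in data2 (here January 1) are kept and get NA in P_prev.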
I have a dataset where I observe individuals for different years (e.g., individual 1 is observed in 2012 and 2014, while individuals 2 and 3 are only observed in 2016). I would like to expand the data for each individual (i.e., each individual would have 3 rows: 2012, 2014 and 2016) in order to create a panel data with an indicator for whether an individual is observed or not.
My initial dataset is:
year individual_id rank
2012 1 11
2014 1 16
2016 2 76
2016 3 125
And I would like to get something like that:
year individual_id rank present
2012 1 11 1
2014 1 16 1
2016 1 . 0
2012 2 . 0
2014 2 . 0
2016 2 76 1
2012 3 . 0
2014 3 . 0
2016 3 125 1
So far I have tried to play with "expand":
bys researcher: egen count=count(year)
replace count=3-count+1
bys researcher: replace count=. if _n>1
expand count
which gives me 3 rows per individual. Unfortunately this just copies one of the initial rows, and I am unable to get from there to the final desired dataset.
Thanks in advance for your help!
You can use expand.grid to create a data frame of all combinations of your inputs. Then full-join the two tables and add a condition to determine whether the individual was present that year or not.
library(dplyr)
dt = data.frame(
year = c(2012,2014,2016,2016),
individual_id = c(1,1,2,3),
rank = c(11,16,76,125)
)
exp = expand.grid(year = c(2012,2014,2016), individual_id = c(1:3))
dt %>%
  full_join(exp, by = c("year","individual_id")) %>%
  mutate(present = ifelse(!is.na(rank), 1, 0)) %>%
  arrange(individual_id, year)
year individual_id rank present
1 2012 1 11 1
2 2014 1 16 1
3 2016 1 NA 0
4 2012 2 NA 0
5 2014 2 NA 0
6 2016 2 76 1
7 2012 3 NA 0
8 2014 3 NA 0
9 2016 3 125 1
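The same idea can be sketched in base R, in case dplyr is not available. The names grid and panel are assumptions for illustration. Note that present is derived from a missing rank, which assumes rank is never NA in the original data.

```r
dt <- data.frame(
  year = c(2012, 2014, 2016, 2016),
  individual_id = c(1, 1, 2, 3),
  rank = c(11, 16, 76, 125)
)

# Every year/individual combination that occurs in the data
grid <- expand.grid(year = unique(dt$year), individual_id = unique(dt$individual_id))

# Keep every row of the grid; unobserved combinations get NA for rank
panel <- merge(grid, dt, by = c("year", "individual_id"), all.x = TRUE)

# Flag observed rows (assumes rank is never missing in the original data)
panel$present <- as.integer(!is.na(panel$rank))
panel <- panel[order(panel$individual_id, panel$year), ]
```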
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region. This compares the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
Same approach is taken for urban. Then, summarize totaling up all the changes for each i. I left in these temporary variables so you can examine if you are getting the results desired.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
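For comparison, here is a base R sketch of the same counting logic on a small excerpt of the question's data (i = 1, and the first six years of i = 4). The helper count_changes is a hypothetical name. It drops NAs and then counts positions where consecutive values differ, so it assumes the rows are already sorted by year within each i.

```r
# Count value changes in a vector, ignoring NAs
count_changes <- function(x) {
  x <- x[!is.na(x)]
  sum(diff(x) != 0)
}

# Excerpt of the question's data, sorted by year within each individual
df <- data.frame(
  i = c(rep(1, 4), rep(4, 6)),
  year = c(1979:1982, 1979:1984),
  region = c(1, 1, 2, 2, 1, NA, 4, 4, 1, 1),
  urban = c(1, 1, 1, 1, 1, NA, 1, 1, 1, 1)
)

# One row per individual with the number of changes in each variable
changes <- data.frame(
  i = sort(unique(df$i)),
  change_reg = as.vector(tapply(df$region, df$i, count_changes)),
  change_urban = as.vector(tapply(df$urban, df$i, count_changes))
)

# Grand totals across all individuals
totals <- colSums(changes[-1])
```

On this excerpt, individual 4 has two region changes rather than three because the later years are left out.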
NOTE: This is a modified version of How do I turn monadic data into dyadic data in R (country-year into pair-year)?
I have data organized by country-year, with an ID for a dyadic relationship. I want to organize this by dyad-year.
Here is how my data is organized:
dyadic_id country_codes year
1 1 200 1990
2 1 20 1990
3 1 200 1991
4 1 20 1991
5 1 200 1991
6 1 300 1991
7 1 300 1991
8 1 20 1991
9 2 300 1990
10 2 10 1990
11 3 100 1990
12 3 10 1990
13 4 500 1991
14 4 200 1991
Here is how I want the data to be:
dyadic_id_want country_codes_1 country_codes_2 year_want
1 1 200 20 1990
2 1 200 20 1991
3 1 200 300 1991
4 1 300 20 1991
5 2 300 10 1990
6 3 100 10 1990
7 4 500 200 1991
Here is reproducible code:
dyadic_id<-c(1,1,1,1,1,1,1,1,2,2,3,3,4,4)
country_codes<-c(200,20,200,20,200,300,300,20,300,10,100,10,500,200)
year<-c(1990,1990,1991,1991,1991,1991,1991,1991,1990,1990,1990,1990,1991,1991)
mydf<-as.data.frame(cbind(dyadic_id,country_codes,year))
dyadic_id_want<-c(1,1,1,1,2,3,4)
country_codes_1<-c(200,200,200,300,300,100,500)
country_codes_2<-c(20,20,300,20,10,10,200)
year_want<-c(1990,1991,1991,1991,1990,1990,1991)
my_df_i_want<-as.data.frame(cbind(dyadic_id_want,country_codes_1,country_codes_2,year_want))
This is a unique problem since more than one pair of countries can participate in each event (identified by a dyadic_id).
You can actually do this in much the same way as akrun's dplyr solution. Unfortunately I'm not well versed enough in data.table to help you with that part, and I'm sure others may have a better solution to this one.
Basically, for the mutate(ind=...) portion you need to be a little more clever about how you construct the indicator, so that it is unique and leads to the result you're looking for. For my solution, I noticed that since the rows come in groups of two, the indicator just needs a modulus operator attached to it.
ind=paste0('country_codes', ((row_number()+1) %% 2+1))
Then you need an identifier for each group of two, which can be constructed using a similar idea.
ind_row = ceiling(row_number()/2)
Then you can proceed as normal in the code.
The full code is as follows:
library(dplyr)
library(tidyr)

mydf %>%
  group_by(dyadic_id, year) %>%
  mutate(ind=paste0('country_codes', ((row_number()+1) %% 2+1)),
         ind_row = ceiling(row_number()/2)) %>%
  spread(ind, country_codes) %>%
  select(-ind_row)
# dyadic_id year country_codes1 country_codes2
#1 1 1990 200 20
#2 1 1991 200 20
#3 1 1991 200 300
#4 1 1991 300 20
#5 2 1990 300 10
#6 3 1990 100 10
#7 4 1991 500 200
All credit to akrun's solution though.
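To see what the two helper columns contain, here is a small base R sketch on the six rows of dyadic_id 1 in 1991. The vector n stands in for row_number() within one group, and the names codes and pairs are assumptions for illustration.

```r
# Country codes for dyadic_id 1 in 1991, in their original row order
codes <- c(200, 20, 200, 300, 300, 20)
n <- seq_along(codes)  # plays the role of row_number() within the group

# Alternates country_codes1, country_codes2, country_codes1, ...
ind <- paste0("country_codes", (n + 1) %% 2 + 1)

# Numbers each consecutive pair: 1, 1, 2, 2, 3, 3
ind_row <- ceiling(n / 2)

# The same pairing expressed directly: fill a two-column matrix row by row
pairs <- matrix(codes, ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("country_codes1", "country_codes2")))
```

Both constructions rely on each group holding an even number of rows, with the two members of a pair adjacent to each other.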
I need to change some of my datasets in the following way.
I have one panel dataset containing a unique firm id as identifier (id), the observation year (year, 2002-2012) and some firm variables with the value for the corresponding year (size, turnover, etc.). It looks somewhat like:
[ID] [year] [size] [turnover] ...
1 2002 14 1200
1 2003 15 1250
1 2004 17 1100
1 2005 18 1350
2 2004 10 5750
2 2005 11 6025
...
I need to transform it now in the following way.
I want to create a separate matrix for every characteristic of interest, where
every firm (according to its id) has only one row and the
corresponding values per year are in separate columns.
It should be mentioned, that not every firm is in the dataset in every year, since they might be founded later, already closed down etc., as illustrated in the example. In the end it should look somehow like the following (example for size variable):
[ID] [2002] [2003] [2004] [2005]
1 14 15 17 18
2 - - 10 11
I tried it so far with the %in% command, but did not manage to get the values in the right columns.
DF <- read.table(text="[ID] [year] [size] [turnover]
1 2002 14 1200
1 2003 15 1250
1 2004 17 1100
1 2005 18 1350
2 2004 10 5750
2 2005 11 6025",header=TRUE)
library(reshape2)
dcast(DF, X.ID.~X.year.,value.var="X.size.")
# X.ID. 2002 2003 2004 2005
# 1 1 14 15 17 18
# 2 2 NA NA 10 11
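If reshape2 is not available, a base R sketch with tapply gives one matrix per variable. The cleaned-up column names (ID, year, size, turnover) and the result names are assumptions for illustration. Since each firm-year pair appears at most once, sum simply returns the single value, and combinations that never occur come back as NA.

```r
DF <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  year = c(2002:2005, 2004, 2005),
  size = c(14, 15, 17, 18, 10, 11),
  turnover = c(1200, 1250, 1100, 1350, 5750, 6025)
)

# One row per firm, one column per year; missing firm-years become NA
size_wide <- tapply(DF$size, list(ID = DF$ID, year = DF$year), FUN = sum)
turnover_wide <- tapply(DF$turnover, list(ID = DF$ID, year = DF$year), FUN = sum)
```

The row and column names of each matrix are the firm ids and years as character strings, so size_wide["2", "2004"] looks up firm 2 in 2004.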
I have a data frame like this:
FisherID Year Month VesselID
1 2000 1 56
1 2000 1 81
1 2000 2 81
1 2000 3 81
1 2000 4 81
1 2000 5 81
1 2000 6 81
1 2000 7 81
1 2000 8 81
1 2000 9 81
1 2000 10 81
1 2001 1 56
1 2001 2 56
1 2001 3 81
1 2001 4 56
1 2001 5 56
1 2001 6 56
1 2001 7 56
1 2002 3 81
1 2002 4 81
1 2002 5 81
1 2002 6 81
1 2002 7 81
...and I need the number of times that the VesselID changes per year, so the output that I want is:
FisherID Year DiffVesselUsed
1 2000 1
1 2001 2
1 2002 0
I tried to get that using aggregate():
aggregate(vesselID, by=list(FisherID,Year,Month ), length)
but what I got was:
FisherID Year DiffVesselUsed
1 2000 2
1 2001 1
1 2002 1
because aggregate() counted those different vessels when those only appeared in the same month. I have tried different way to aggregate without success. Any help will be deeply appreciated. Cheers, Rafael
First a question: Your expected output doesn't seem to reflect what you ask for. You ask for the number of times an ID changes per year, but your expected output seems to indicate that you want to know how many unique VesselIDs are observed per year. For example, in 2000, the ID changes once, and in 2001 the ID changes twice. In both years, two unique IDs are observed.
If you're looking for a statistic by FisherID and Year, then there's no reason to group by Month as well. Instead, you should look at the unique values of VesselID for each combination of FisherID and Year.
aggregate(VesselID, by = list(FisherID, Year), function(x) length(unique(x)))
# Group.1 Group.2 x
# 1 1 2000 2
# 2 1 2001 2
# 3 1 2002 1
If you really want the number of times ID changes, use the rle function.
aggregate(VesselID, by = list(FisherID, Year),
function(x) length(rle(x)$values) - 1)
# Group.1 Group.2 x
# 1 1 2000 1
# 2 1 2001 2
# 3 1 2002 0
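Here is a self-contained sketch of both counts on a shortened version of the data, using the formula interface of aggregate so that no attach() is needed. The helper n_changes and the result names are assumptions for illustration. rle() collapses consecutive repeats into runs, so the number of runs minus one is the number of changes; this assumes the rows are in chronological order within each year.

```r
df <- data.frame(
  FisherID = 1,
  Year = c(rep(2000, 3), rep(2001, 4), rep(2002, 2)),
  VesselID = c(56, 81, 81, 56, 81, 56, 56, 81, 81)
)

# Number of runs of identical consecutive values, minus one
n_changes <- function(x) length(rle(x)$values) - 1

# Changes of vessel per fisher and year
changes <- aggregate(VesselID ~ FisherID + Year, data = df, FUN = n_changes)

# Distinct vessels per fisher and year, for comparison
distinct <- aggregate(VesselID ~ FisherID + Year, data = df,
                      FUN = function(x) length(unique(x)))
```

In 2001 the sequence 56, 81, 56, 56 has three runs, hence two changes, while only two distinct vessels appear.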