I have biographical data on more than 1600 people. The data includes their gender, birth year, hometown, etc., as well as their career trajectories from the year they began working. I'm trying to turn this into a panel dataset, so that I can see how their workplaces have changed since they started their jobs. I have the following problems with this dataset:
1) How do I turn this into a panel dataset? The optimal format I want for each person(id) is:
id gender hometown year job
1 1 1 NY 1990 3
1 1 1 NY 1991 3
1 1 1 NY 1992 3
1 1 1 NY 1993 3
1 1 1 NY 1994 5
2) How do I save information if the person had overlapping positions? For instance, the person can have job 3 and job 5 at the same time. I'm hoping later to only use the job that is higher than the other, but meanwhile I would like to save as much information as possible.
Okay, give this a try.
First select a subset of your data.
> (D = head(origin[, c("id", "name1", "gender", "job1", "job1s", "job1e",
"job2", "job10")]))
id name1 gender job1 job1s job1e job2 job10
1 1 Abulaiti Abureduxiti 1 2305 1980 1991 2303 NA
2 2 Aisihaiti Kelimubai 1 2307 1972 1987 2307 NA
3 3 Ai Zhisheng 1 4509 1996 1997 1075 10103
4 4 An Pingsheng 1 3555 1975 1977 3561 2191
5 5 An Zhiwen 1 2063 1977 1979 1127 2507
6 6 An Ziwen 1 4509 1954 1966 4007 2517
Next we reorganise the data into the format that I think you are after.
> library(reshape2)
> (D = melt(D, id.vars = c("id", "name1", "gender")))
id name1 gender variable value
1 1 Abulaiti Abureduxiti 1 job1 2305
2 2 Aisihaiti Kelimubai 1 job1 2307
3 3 Ai Zhisheng 1 job1 4509
4 4 An Pingsheng 1 job1 3555
5 5 An Zhiwen 1 job1 2063
6 6 An Ziwen 1 job1 4509
7 1 Abulaiti Abureduxiti 1 job1s 1980
8 2 Aisihaiti Kelimubai 1 job1s 1972
9 3 Ai Zhisheng 1 job1s 1996
10 4 An Pingsheng 1 job1s 1975
11 5 An Zhiwen 1 job1s 1977
12 6 An Ziwen 1 job1s 1954
13 1 Abulaiti Abureduxiti 1 job1e 1991
14 2 Aisihaiti Kelimubai 1 job1e 1987
15 3 Ai Zhisheng 1 job1e 1997
16 4 An Pingsheng 1 job1e 1977
17 5 An Zhiwen 1 job1e 1979
18 6 An Ziwen 1 job1e 1966
19 1 Abulaiti Abureduxiti 1 job2 2303
20 2 Aisihaiti Kelimubai 1 job2 2307
21 3 Ai Zhisheng 1 job2 1075
22 4 An Pingsheng 1 job2 3561
23 5 An Zhiwen 1 job2 1127
24 6 An Ziwen 1 job2 4007
25 1 Abulaiti Abureduxiti 1 job10 NA
26 2 Aisihaiti Kelimubai 1 job10 NA
27 3 Ai Zhisheng 1 job10 10103
28 4 An Pingsheng 1 job10 2191
29 5 An Zhiwen 1 job10 2507
30 6 An Ziwen 1 job10 2517
We can see that the job field is empty for a few of these records, so we exclude those.
> (D = D[complete.cases(D),])
id name1 gender variable value
1 1 Abulaiti Abureduxiti 1 job1 2305
2 2 Aisihaiti Kelimubai 1 job1 2307
3 3 Ai Zhisheng 1 job1 4509
4 4 An Pingsheng 1 job1 3555
5 5 An Zhiwen 1 job1 2063
6 6 An Ziwen 1 job1 4509
7 1 Abulaiti Abureduxiti 1 job1s 1980
8 2 Aisihaiti Kelimubai 1 job1s 1972
9 3 Ai Zhisheng 1 job1s 1996
10 4 An Pingsheng 1 job1s 1975
11 5 An Zhiwen 1 job1s 1977
12 6 An Ziwen 1 job1s 1954
13 1 Abulaiti Abureduxiti 1 job1e 1991
14 2 Aisihaiti Kelimubai 1 job1e 1987
15 3 Ai Zhisheng 1 job1e 1997
16 4 An Pingsheng 1 job1e 1977
17 5 An Zhiwen 1 job1e 1979
18 6 An Ziwen 1 job1e 1966
19 1 Abulaiti Abureduxiti 1 job2 2303
20 2 Aisihaiti Kelimubai 1 job2 2307
21 3 Ai Zhisheng 1 job2 1075
22 4 An Pingsheng 1 job2 3561
23 5 An Zhiwen 1 job2 1127
24 6 An Ziwen 1 job2 4007
27 3 Ai Zhisheng 1 job10 10103
28 4 An Pingsheng 1 job10 2191
29 5 An Zhiwen 1 job10 2507
30 6 An Ziwen 1 job10 2517
Sorting out overlapping positions is a secondary problem. If I know that the above is basically what you are after then we can address that next.
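The melted data above is still one row per field, not one row per person-year. As a rough sketch of the remaining step, assuming each job has been arranged into one row per spell with start and end years (as job1s and job1e suggest), each spell can be expanded into yearly rows. The `spells` data frame and its values below are made up for illustration; overlapping jobs are kept as separate rows and only collapsed to the highest job code in a second step.

```r
# Toy spell data: one row per (person, job) spell; the values are made up.
spells <- data.frame(
  id    = c(1, 1, 2),
  job   = c(3, 5, 4),
  start = c(1990, 1992, 1995),
  end   = c(1993, 1994, 1996)
)

# Expand each spell into one row per year it covers.
panel <- do.call(rbind, lapply(seq_len(nrow(spells)), function(i) {
  data.frame(id   = spells$id[i],
             year = seq(spells$start[i], spells$end[i]),
             job  = spells$job[i])
}))

# Overlapping spells yield several rows per person-year, so nothing is lost.
# Collapse to the highest job code only when a single job per year is needed.
top_job <- aggregate(job ~ id + year, data = panel, FUN = max)
top_job <- top_job[order(top_job$id, top_job$year), ]
```

Keeping `panel` and `top_job` separate preserves the full information while still giving a one-row-per-year view for analysis.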
I'm trying to calculate the compound annual growth rate of my data (snippet shown below). Does anyone know the best way to do this, or whether there is a function that does part of the job?
Data: (only worried about the preds column here; the others can be ignored)
year month timestep ymin ymax preds date
1 1998 1 1 17.84037 18.58553 18.21295 1998-01-01
2 1998 2 2 17.05009 17.70642 17.37826 1998-02-01
3 1998 3 3 16.97067 17.61320 17.29193 1998-03-01
4 1998 4 4 18.38551 19.00838 18.69695 1998-04-01
5 1998 5 5 21.39082 21.97338 21.68210 1998-05-01
6 1998 6 6 24.77679 25.35464 25.06571 1998-06-01
7 1998 7 7 27.27057 27.82818 27.54938 1998-07-01
8 1998 8 8 28.24703 28.76702 28.50702 1998-08-01
9 1998 9 9 27.72370 28.24619 27.98494 1998-09-01
10 1998 10 10 25.83783 26.33969 26.08876 1998-10-01
11 1998 11 11 22.94968 23.42268 23.18618 1998-11-01
12 1998 12 12 19.50499 20.05466 19.77982 1998-12-01
13 1999 1 13 17.98323 18.50530 18.24426 1999-01-01
14 1999 2 14 17.20124 17.61746 17.40935 1999-02-01
15 1999 3 15 17.11064 17.53492 17.32278 1999-03-01
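For reference, the compound annual growth rate between a first and a last value is (last / first)^(1 / years) - 1. A minimal sketch in base R follows; the `cagr` helper and the sample values are illustrative and not taken from the data above. With strongly seasonal monthly data, the result depends heavily on which months are chosen as start and end, so comparing the same month across years is probably safer.

```r
# Compound annual growth rate between a first and a last value.
cagr <- function(first, last, years) {
  (last / first)^(1 / years) - 1
}

# Made-up series standing in for the preds column.
preds <- c(100, 102, 104, 121)
dates <- as.Date(c("1998-01-01", "1998-07-01", "1999-01-01", "2000-01-01"))

# Elapsed time in years between the first and last observation.
n <- length(dates)
years <- as.numeric(difftime(dates[n], dates[1], units = "days")) / 365.25

growth <- cagr(preds[1], preds[n], years)
```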
This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate for each individual the number of months from the initial date. I am using two variables: year and month.
First I order the rows by year and month:
mydata2<-mydata%>%group_by(ID,year)%>%arrange(year,month,.by_group=T)
Then I created the variable date, setting the day to 01:
mydata2$date<-paste("01",mydata2$month,mydata2$year,sep = "-")
Then I used lubridate to convert this variable to Date format:
mydata2$date<-dmy(mydata2$date)
After this, I don't know how to get the dataset below (preferably using dplyr):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
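One way could be:
library(dplyr)
mydata %>%
  group_by(ID) %>%
  # date is the first day of each month; diff approximates months elapsed as days / 365 * 12
  # date[1] is the first row of each group, which is the earliest date in this data
  mutate(date = as.Date(sprintf('%d-%d-01', year, month)),
         diff = as.numeric(round((date - date[1])/365*12)))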
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
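Dividing a day count by 365 and rounding is an approximation. Since only year and month are involved, an exact alternative is to count months directly as year * 12 + month and subtract each individual's minimum. The sketch below uses base R rather than dplyr so that it runs without extra packages; the `idx` helper variable is introduced here for illustration.

```r
ID    <- c(rep(1, 10), rep(2, 8))
year  <- c(2007, 2007, 2007, 2008, 2008, 2009, 2010, 2009, 2010, 2011,
           2008, 2008, 2009, 2010, 2009, 2010, 2011, 2011)
month <- c(2, 7, 12, 4, 11, 6, 11, 1, 9, 4, 3, 6, 7, 4, 9, 11, 2, 8)
mydata <- data.frame(ID, year, month)

# Sort within each individual by calendar order.
mydata <- mydata[order(mydata$ID, mydata$year, mydata$month), ]

# Absolute month index, then the offset from each individual's earliest month.
idx <- mydata$year * 12 + mydata$month
mydata$dif_from_init <- idx - ave(idx, mydata$ID, FUN = min)
```

Because the offset is taken from the group minimum, the result does not depend on the row order within each ID.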
I have data similar to this. I would like to create 3 columns (date1, date2, date3) using a loop and rbind, because I am required to use only that method.
(All I was told is to make a loop, subset the data, sort it, make a new data frame, then rbind it to make a new column.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do this without a for loop or subsetting. Here is one way:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
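If the loop, subset, sort and rbind structure really is required, the same three columns can be built that way. This is only a sketch of one reading of the requirements: date2 is taken as a running count per id within a year and date3 as a running count of rows within a year. It uses base R only, on a shortened copy of the data.

```r
df <- data.frame(
  year  = c(2011, 2011, 2011, 2011, 2012, 2012),
  month = c(1, 1, 2, 2, 1, 2),
  day   = c(5, 14, 3, 4, 27, 20),
  id    = c(3101, 3101, 3101, 3101, 3153, 3153)
)

result <- NULL
for (y in unique(df$year)) {
  part <- df[df$year == y, ]                     # subset one year
  part <- part[order(part$month, part$day), ]    # sort it by date
  d <- as.Date(paste(part$year, part$month, part$day, sep = "-"),
               format = "%Y-%m-%d")
  part$date1 <- as.integer(format(d, "%j"))      # day of the year
  part$date2 <- ave(seq_along(d), part$id, FUN = seq_along)  # counter per id
  part$date3 <- seq_len(nrow(part))              # counter per year
  result <- rbind(result, part)                  # append to the new data frame
}
```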
This is the fips data set:
State Fips State.Abbreviation ANSI.Code GU.Name
1 1 67 AL 2403054 Abbeville
2 1 73 AL 2403063 Adamsville
3 1 117 AL 2403069 Alabaster
4 1 95 AL 2403074 Albertville
5 1 123 AL 2403077 Alexander City
6 1 107 AL 2403080 Aliceville
7 1 39 AL 2403097 Andalusia
8 1 15 AL 2403101 Anniston
:
:
:
41774 51 720 VA 1498434 Norton
41775 51 730 VA 1498435 Petersburg
41776 51 735 VA 1498436 Poquoson
41777 51 740 VA 1498556 Portsmouth
41778 51 750 VA 1498438 Radford
41779 51 760 VA 1789073 Richmond
41780 51 770 VA 1498439 Roanoke
41781 51 775 VA 1789074 Salem
41782 51 790 VA 1789075 Staunton
41783 51 800 VA 1498560 Suffolk
41784 51 810 VA 1498559 Virginia Beach
41785 51 820 VA 1498443 Waynesboro
41786 51 830 VA 1789076 Williamsburg
41787 51 840 VA 1789077 Winchester
dim(fips)
[1] 2937 5
This is the head.cancer data set:
PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS Fips State State.Abbreviation
1 93261752 1544 2 15 0 1 3 3 34 NY
2 93264865 1544 2 1 0 1 15 15 34 NY
3 93268186 1544 2 1 0 1 5 5 34 NY
4 93272027 1544 2 1 0 2 17 17 34 NY
5 93274555 1544 1 1 0 1 13 13 34 NY
6 93275343 1544 5 1 0 2 25 25 34 NY
7 93279759 1544 5 1 0 2 9 9 34 NY
8 93280754 1544 2 1 0 2 35 35 34 NY
9 93281166 1544 2 1 0 2 31 31 34 NY
10 93282602 1544 5 1 0 1 33 33 34 NY
11 93287646 1544 1 1 0 1 11 11 34 NY
12 93288255 1544 4 1 4 1 39 39 34 NY
13 93290660 1544 9 1 0 2 25 25 34 NY
14 93291461 1544 1 1 6 1 39 39 34 NY
15 93291778 1544 2 1 0 1 3 3 34 NY
dim(headcancer)
[1] 75313 10
When I merge them, I expect to get the same number of rows as head.cancer (75313), but I get 951423 rows.
Here is my code and output:
n = merge(head.cancer,fips, by=c('State','Fips','State.Abbreviation'), all.x= TRUE)
State Fips State.Abbreviation PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS ANSI.Code GU.Name
1 6 5 CA 70128269 1541 4 1 0 2 5 2409693 Amador City
2 6 5 CA 70128269 1541 4 1 0 2 5 2411446 Plymouth
3 6 5 CA 70128269 1541 4 1 0 2 5 226085 Jackson
4 6 5 CA 70128269 1541 4 1 0 2 5 1675841 Amador
5 6 5 CA 70128269 1541 4 1 0 2 5 2418631 Ione Band of Miwok
6 6 5 CA 70128269 1541 4 1 0 2 5 2412019 Sutter Creek
7 6 5 CA 70128269 1541 4 1 0 2 5 2410110 Ione
8 6 5 CA 70128269 1541 4 1 0 2 5 2410128 Jackson
9 6 5 CA 67476209 1541 2 1 1 2 5 2409693 Amador City
10 6 5 CA 67476209 1541 2 1 1 2 5 2411446 Plymouth
11 6 5 CA 67476209 1541 2 1 1 2 5 226085 Jackson
12 6 5 CA 67476209 1541 2 1 1 2 5 1675841 Amador
13 6 5 CA 67476209 1541 2 1 1 2 5 2418631 Ione Band of Miwok
14 6 5 CA 67476209 1541 2 1 1 2 5 2412019 Sutter Creek
15 6 5 CA 67476209 1541 2 1 1 2 5 2410110 Ione
16 6 5 CA 67476209 1541 2 1 1 2 5 2410128 Jackson
17 6 5 CA 56544761 1541 4 1 0 2 5 2409693 Amador City
18 6 5 CA 56544761 1541 4 1 0 2 5 2411446 Plymouth
19 6 5 CA 56544761 1541 4 1 0 2 5 226085 Jackson
20 6 5 CA 56544761 1541 4 1 0 2 5 1675841 Amador
dim(n)
[1] 951423 12
In rows 1 to 8, the same PUBCSNUM is repeated 8 times. PUBCSNUM is an ID, so it should be unique, and ANSI.Code should have only one value per record, but now there are many values. I don't know why the rows are duplicated like that.
Please help. I have been stuck for a couple of hours and can't figure it out. Thanks.
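The output above shows several GU.Name values for the same State and Fips combination, which suggests the merge key is not unique in fips. `merge` returns one row for every matching pair, so each cancer record is repeated once per place sharing its key. The sketch below reproduces this with made-up `cases` and `places` data frames and shows one possible fix, assuming that any single place per key is acceptable.

```r
# Toy data: two records in the same county, three places in that county.
cases  <- data.frame(State = c(6, 6), Fips = c(5, 5), PUBCSNUM = c(101, 102))
places <- data.frame(State = c(6, 6, 6), Fips = c(5, 5, 5),
                     GU.Name = c("Amador City", "Plymouth", "Jackson"))

# Count lookup rows per key; any count above 1 multiplies rows in the merge.
key_counts <- aggregate(GU.Name ~ State + Fips, data = places, FUN = length)

# 2 records x 3 matching places = 6 rows.
merged <- merge(cases, places, by = c("State", "Fips"), all.x = TRUE)

# Possible fix: keep one lookup row per key before merging.
places_unique <- places[!duplicated(places[c("State", "Fips")]), ]
merged_unique <- merge(cases, places_unique, by = c("State", "Fips"), all.x = TRUE)
```

Whether dropping the extra places is appropriate depends on what the place-level columns are needed for; if only county-level information matters, those columns could be left out of the lookup table instead.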
I'm trying to create a fiscal year variable called 'period', which will run from September through August for six years. My data frame 'dat' is structured as follows:
'data.frame': 52966 obs. of 4 variables:
$ userid : int 96 96 96 101 101 101 101 101 101 101 ...
$ comment.year : int 2008 2009 2009 2008 2008 2008 2008 2008 2008 2009 ...
$ comment.month: int 7 3 8 7 8 9 10 11 12 1 ...
$ num.comments : int 1 1 1 33 51 16 27 29 40 39 ...
I get this error message: Error: unexpected '=' in "dat$period[comment.year=2008 & comment.month="
when I run the following code. I've experimented with double equal signs and putting the month and year integers in quotes, but no success. I'm also wondering if there's a simpler way to do the recode. Since I'm dealing with 6 years, my approach takes 72 lines.
dat$period[comment.year=2008 & comment.month=9]<-"1"
dat$period[comment.year=2008 & comment.month=10]<-"1"
dat$period[comment.year=2008 & comment.month=11]<-"1"
dat$period[comment.year=2008 & comment.month=12]<-"1"
dat$period[comment.year=2009 & comment.month=1]<-"1"
dat$period[comment.year=2009 & comment.month=2]<-"1"
dat$period[comment.year=2009 & comment.month=3]<-"1"
dat$period[comment.year=2009 & comment.month=4]<-"1"
dat$period[comment.year=2009 & comment.month=5]<-"1"
dat$period[comment.year=2009 & comment.month=6]<-"1"
dat$period[comment.year=2009 & comment.month=7]<-"1"
dat$period[comment.year=2009 & comment.month=8]<-"1"
dat$period[comment.year=2009 & comment.month=9]<-"2"
dat$period[comment.year=2009 & comment.month=10]<-"2"
dat$period[comment.year=2009 & comment.month=11]<-"2"
dat$period[comment.year=2009 & comment.month=12]<-"2"
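The error comes from using `=` (assignment) where `==` (comparison) is needed. A likely reason the double equal signs alone did not help is that the columns also have to be referenced through the data frame, as `dat$comment.year`. A minimal sketch on made-up rows, with the conditions grouped so that a whole period takes a single assignment:

```r
dat <- data.frame(comment.year  = c(2008, 2008, 2009, 2009),
                  comment.month = c(8, 9, 8, 9))
dat$period <- NA_character_

# Corrected form of one of the original lines.
dat$period[dat$comment.year == 2008 & dat$comment.month == 9] <- "1"

# One assignment per period: September 2008 through August 2009.
in_period1 <- with(dat, (comment.year == 2008 & comment.month >= 9) |
                        (comment.year == 2009 & comment.month <= 8))
dat$period[in_period1] <- "1"
```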
Rather than doing a bunch of partial assignments, why not just calculate the difference in years, with a bonus bump for months >= 9?
#sample data
dat<-data.frame(
comment.year=rep(2009:2011, each=12),
comment.month=rep(1:12, 3)
)[-(1:8), ]
#assign new period
dat$period<- dat$comment.year-min(dat$comment.year) + ifelse(dat$comment.month>=9,1,0)
which gives you
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
If you want to make sure to start at a certain year, you can use 2009 rather than min(dat$comment.year).
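The same calculation can be wrapped in a small function with a fixed start year, so the result does not depend on which years happen to be present in the data. The name `fiscal_period` and its arguments are illustrative only.

```r
# Period number for a fiscal year that starts in start_month of start_year.
# Months before the first fiscal year starts get period 0.
fiscal_period <- function(year, month, start_year, start_month = 9) {
  (year - start_year) + ifelse(month >= start_month, 1, 0)
}

# Aug 2008, Sep 2008, Aug 2009, Sep 2009
periods <- fiscal_period(c(2008, 2008, 2009, 2009), c(8, 9, 8, 9),
                         start_year = 2008)
periods  # 0 1 1 2
```

For the question's data this would be called as `fiscal_period(dat$comment.year, dat$comment.month, start_year = 2008)`.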
Using MrFlick's sample data:
dat$period = rep(1:3, each=12)[1:28]
dat
comment.year comment.month period
9 2009 9 1
10 2009 10 1
11 2009 11 1
12 2009 12 1
13 2010 1 1
14 2010 2 1
15 2010 3 1
16 2010 4 1
17 2010 5 1
18 2010 6 1
19 2010 7 1
20 2010 8 1
21 2010 9 2
22 2010 10 2
23 2010 11 2
24 2010 12 2
25 2011 1 2
26 2011 2 2
27 2011 3 2
28 2011 4 2
29 2011 5 2
30 2011 6 2
31 2011 7 2
32 2011 8 2
33 2011 9 3
34 2011 10 3
35 2011 11 3
36 2011 12 3
This can easily be extended to your data.
I guess you could also try the following (using @MrFlick's data):
set.seed(42)
dat1 <- dat[sample(1:nrow(dat)),]
dat<- within(dat, {period<- as.numeric(factor(comment.year))
period[comment.month <9] <- period[comment.month <9] -1})
dat
# comment.year comment.month period
#9 2009 9 1
#10 2009 10 1
#11 2009 11 1
#12 2009 12 1
#13 2010 1 1
#14 2010 2 1
#15 2010 3 1
#16 2010 4 1
#17 2010 5 1
#18 2010 6 1
#19 2010 7 1
#20 2010 8 1
#21 2010 9 2
#22 2010 10 2
#23 2010 11 2
#24 2010 12 2
#25 2011 1 2
#26 2011 2 2
#27 2011 3 2
#28 2011 4 2
#29 2011 5 2
#30 2011 6 2
#31 2011 7 2
#32 2011 8 2
#33 2011 9 3
#34 2011 10 3
#35 2011 11 3
#36 2011 12 3
Using the unordered dat1:
within(dat1, {period<- as.numeric(factor(comment.year)); period[comment.month <9] <- period[comment.month <9] -1})[,3]
#[1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3
Cross-checking the results against @MrFlick's method:
dat1$comment.year-min(dat1$comment.year) + ifelse(dat1$comment.month>=9,1,0)
# [1] 3 3 1 2 2 1 2 1 2 2 1 2 2 1 1 2 2 1 1 1 3 1 2 1 2 1 2 3