Duplicated rows when I merge 2 data frames - r

This is the fips data set:
State Fips State.Abbreviation ANSI.Code GU.Name
1 1 67 AL 2403054 Abbeville
2 1 73 AL 2403063 Adamsville
3 1 117 AL 2403069 Alabaster
4 1 95 AL 2403074 Albertville
5 1 123 AL 2403077 Alexander City
6 1 107 AL 2403080 Aliceville
7 1 39 AL 2403097 Andalusia
8 1 15 AL 2403101 Anniston
:
:
:
41774 51 720 VA 1498434 Norton
41775 51 730 VA 1498435 Petersburg
41776 51 735 VA 1498436 Poquoson
41777 51 740 VA 1498556 Portsmouth
41778 51 750 VA 1498438 Radford
41779 51 760 VA 1789073 Richmond
41780 51 770 VA 1498439 Roanoke
41781 51 775 VA 1789074 Salem
41782 51 790 VA 1789075 Staunton
41783 51 800 VA 1498560 Suffolk
41784 51 810 VA 1498559 Virginia Beach
41785 51 820 VA 1498443 Waynesboro
41786 51 830 VA 1789076 Williamsburg
41787 51 840 VA 1789077 Winchester
dim(fips)
[1] 2937 5
This is the head of the head.cancer data:
PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS Fips State State.Abbreviation
1 93261752 1544 2 15 0 1 3 3 34 NY
2 93264865 1544 2 1 0 1 15 15 34 NY
3 93268186 1544 2 1 0 1 5 5 34 NY
4 93272027 1544 2 1 0 2 17 17 34 NY
5 93274555 1544 1 1 0 1 13 13 34 NY
6 93275343 1544 5 1 0 2 25 25 34 NY
7 93279759 1544 5 1 0 2 9 9 34 NY
8 93280754 1544 2 1 0 2 35 35 34 NY
9 93281166 1544 2 1 0 2 31 31 34 NY
10 93282602 1544 5 1 0 1 33 33 34 NY
11 93287646 1544 1 1 0 1 11 11 34 NY
12 93288255 1544 4 1 4 1 39 39 34 NY
13 93290660 1544 9 1 0 2 25 25 34 NY
14 93291461 1544 1 1 6 1 39 39 34 NY
15 93291778 1544 2 1 0 1 3 3 34 NY
dim(headcancer)
[1] 75313 10
When I merge them, I expect to get the same number of rows as head.cancer (75313), but I get 951423 rows.
Here is my code and output:
n = merge(head.cancer,fips, by=c('State','Fips','State.Abbreviation'), all.x= TRUE)
State Fips State.Abbreviation PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS ANSI.Code GU.Name
1 6 5 CA 70128269 1541 4 1 0 2 5 2409693 Amador City
2 6 5 CA 70128269 1541 4 1 0 2 5 2411446 Plymouth
3 6 5 CA 70128269 1541 4 1 0 2 5 226085 Jackson
4 6 5 CA 70128269 1541 4 1 0 2 5 1675841 Amador
5 6 5 CA 70128269 1541 4 1 0 2 5 2418631 Ione Band of Miwok
6 6 5 CA 70128269 1541 4 1 0 2 5 2412019 Sutter Creek
7 6 5 CA 70128269 1541 4 1 0 2 5 2410110 Ione
8 6 5 CA 70128269 1541 4 1 0 2 5 2410128 Jackson
9 6 5 CA 67476209 1541 2 1 1 2 5 2409693 Amador City
10 6 5 CA 67476209 1541 2 1 1 2 5 2411446 Plymouth
11 6 5 CA 67476209 1541 2 1 1 2 5 226085 Jackson
12 6 5 CA 67476209 1541 2 1 1 2 5 1675841 Amador
13 6 5 CA 67476209 1541 2 1 1 2 5 2418631 Ione Band of Miwok
14 6 5 CA 67476209 1541 2 1 1 2 5 2412019 Sutter Creek
15 6 5 CA 67476209 1541 2 1 1 2 5 2410110 Ione
16 6 5 CA 67476209 1541 2 1 1 2 5 2410128 Jackson
17 6 5 CA 56544761 1541 4 1 0 2 5 2409693 Amador City
18 6 5 CA 56544761 1541 4 1 0 2 5 2411446 Plymouth
19 6 5 CA 56544761 1541 4 1 0 2 5 226085 Jackson
20 6 5 CA 56544761 1541 4 1 0 2 5 1675841 Amador
dim(n)
[1] 951423 12
In rows 1 to 8, the same "PUBCSNUM" is repeated 8 times. "PUBCSNUM" is an ID, so it should be unique, and each one should have only 1 "ANSI.Code" value, but now there are many. I don't know why the rows are duplicated like this.
Please help me. I have been stuck for a couple of hours and couldn't figure it out. Thanks.
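The output above suggests that fips holds several rows (one per GU.Name) for the same State/Fips/State.Abbreviation key, and merge() repeats each head.cancer row once per matching fips row. The following is a minimal sketch of that effect on toy data. The values are made up for illustration, and keeping the first fips row per key is only one possible fix, which assumes any one ANSI.Code per county is acceptable.

```r
# Toy stand-ins for the two tables; the values are made up for illustration.
head.cancer <- data.frame(
  PUBCSNUM = c(101, 102, 103),
  State = c(6, 6, 34),
  Fips = c(5, 5, 3),
  State.Abbreviation = c("CA", "CA", "NY")
)
fips <- data.frame(
  State = c(6, 6, 6, 34),
  Fips = c(5, 5, 5, 3),
  State.Abbreviation = c("CA", "CA", "CA", "NY"),
  ANSI.Code = c(2409693, 2411446, 226085, 999999),
  GU.Name = c("Amador City", "Plymouth", "Jackson", "Placeholder")
)
keys <- c("State", "Fips", "State.Abbreviation")

# Each left row is repeated once per matching right row: 2 * 3 + 1 * 1 = 7.
n <- merge(head.cancer, fips, by = keys, all.x = TRUE)
nrow(n)

# Count the fips rows per key; any count above 1 multiplies rows in the merge.
key_counts <- aggregate(GU.Name ~ State + Fips + State.Abbreviation,
                        data = fips, FUN = length)
key_counts

# One possible fix (an assumption): keep the first fips row per key.
fips_unique <- fips[!duplicated(fips[, keys]), ]
n_unique <- merge(head.cancer, fips_unique, by = keys, all.x = TRUE)
nrow(n_unique)  # same as nrow(head.cancer)
```

Dropping duplicates discards the other place names, so whether this is appropriate depends on what the merged ANSI.Code is meant to identify.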

Related

Estimate number of people alive for every single year between 1850 and 1950 in R

I'm still having trouble regarding my workflow. I need to estimate the number of people alive by gender in every single year between 1850 and 1950. I have the following information:
id, birth_year, death_year and gender
id <- c(1:6)
birth_year <- c(1850:1855)
death_year <- c(1890:1895)
gender <- c("female", "male", "female", "male", "male", "male")
df <- data.frame(id, birth_year, death_year, gender)
Thinking about the steps to achieve my goal, I realized that I should add a column to my df for each year. In each column, I would compute the age of person i in year x, then the age of person i + 1 in year x + 1, starting with i = 1 and x = 1850.
df$age1850 <- 1850 - df$birth_year
df$age1851 <- 1851 - df$birth_year
df$age1852 <- 1852 - df$birth_year
df$age1853 <- 1853 - df$birth_year
df$age1854 <- 1854 - df$birth_year
df$age1855 <- 1855 - df$birth_year
# The expected result would be:
id birth_year death_year gender age1850 age1851 age1852 age1853 age1854 age1855
1 1 1850 1890 female 0 1 2 3 4 5
2 2 1851 1891 male -1 0 1 2 3 4
3 3 1852 1892 female -2 -1 0 1 2 3
4 4 1853 1893 male -3 -2 -1 0 1 2
5 5 1854 1894 male -4 -3 -2 -1 0 1
6 6 1855 1895 male -5 -4 -3 -2 -1 0
Thanks in advance!
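The repeated assignments in the question can be generated in a loop. This is a small sketch using the question's own data. The column naming pattern (age plus the year) follows the question, and the year range is kept short here as an assumption for readability.

```r
id <- 1:6
birth_year <- 1850:1855
death_year <- 1890:1895
gender <- c("female", "male", "female", "male", "male", "male")
df <- data.frame(id, birth_year, death_year, gender)

# Add one column per year holding each person's age in that year.
# Negative values mean the person was not yet born.
for (yr in 1850:1855) {
  df[[paste0("age", yr)]] <- yr - df$birth_year
}
df
```

Extending the range to 1850:1950 would produce 101 columns, which works but makes the data frame wide.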
To estimate the number of people alive by gender in every year between 1850 and 1950, you can use table and subset your df by year.
df$gender <- as.factor(df$gender)
years <- 1850:1950
sapply(setNames(years, years), function(i) {
  table(df$gender[df$birth_year <= i & df$death_year >= i])
})
# 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863
#female 1 1 2 2 2 2 2 2 2 2 2 2 2 2
#male 0 1 1 2 3 4 4 4 4 4 4 4 4 4
# 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877
#female 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#male 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891
#female 2 2 2 2 2 2 2 2 2 2 2 2 2 1
#male 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905
#female 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#male 3 3 2 1 0 0 0 0 0 0 0 0 0 0
#...

Growth Rate for daily data

I have sales data for a product and would like to calculate its growth rate, where N_win and N_lose are the wins and losses over the period 1-19 March. I would also like to predict the growth rate and the wins and losses.
Date N_win N_lose tot1 tot2
1 2018-03-01 0 0 0 0
2 2018-03-02 1 0 1 1
3 2018-03-03 0 0 1 1
4 2018-03-04 1 0 2 2
5 2018-03-05 3 0 5 5
6 2018-03-06 0 0 5 5
7 2018-03-07 2 0 7 7
8 2018-03-08 4 0 11 11
9 2018-03-09 4 0 15 15
10 2018-03-10 5 0 20 20
11 2018-03-11 1 1 21 20
12 2018-03-12 24 1 45 44
13 2018-03-13 41 1 86 85
14 2018-03-14 17 2 103 101
15 2018-03-15 15 3 118 115
16 2018-03-16 15 6 133 127
17 2018-03-17 38 6 171 165
18 2018-03-18 67 6 238 232
I tried to apply this function, but it does not seem to work:
Growthrate = function(x1,x2, n){
gr = (x2/x1)^(1/n)-1
return(gr)
}
GR = NULL
for(i in 1:length(DF[,1])){
GR[i] = Growthrate(DF[i,2],DF[i+1,2], sum(i))
}
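Two things in the loop above are likely to cause trouble: on the last iteration DF[i+1, 2] is past the end of the data and yields NA, and whenever N_win is 0 the ratio x2/x1 is Inf or NaN. As a rough sketch, and assuming that a day-over-day growth rate of the cumulative total tot1 is what is wanted (an interpretation, not something the question states), the rate can be computed in a vectorised way. The helper name growth_rate is made up for this example.

```r
# Data copied from the question (Date, N_win and tot1 columns only).
DF <- data.frame(
  Date = as.Date("2018-03-01") + 0:17,
  N_win = c(0, 1, 0, 1, 3, 0, 2, 4, 4, 5, 1, 24, 41, 17, 15, 15, 38, 67),
  tot1 = c(0, 1, 1, 2, 5, 5, 7, 11, 15, 20, 21, 45, 86, 103, 118, 133, 171, 238)
)

# Day-over-day growth rate: (current - previous) / previous.
growth_rate <- function(x) {
  prev <- head(x, -1)
  curr <- tail(x, -1)
  gr <- (curr - prev) / prev
  gr[prev == 0] <- NA  # growth from a zero base is undefined
  c(NA, gr)            # the first day has no previous value
}

DF$GR <- growth_rate(DF$tot1)
head(DF, 5)
```

Forecasting future values is a separate modelling question that this sketch does not address.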

Panel data in long format

I have two data frames:
d1:
Id group occu D Year
12 1 1 12 2007
13 4 2 67 2007
14 6 3 34 2007
15 7 1 88 2007
16 2 2 72 2007
17 1 1 43 2007
18 4 1 66 2007
and d2:
Id group occu D Year
12 1 1 34 2010
13 4 2 100 2010
14 6 3 76 2010
15 7 1 99 2010
16 2 2 102 2010
17 1 1 55 2010
18 4 1 32 2010
The variables "group" and "occu" are factors. I want to build panel data for the years 2007 and 2010 in long form in R.
How can I do this?
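Since d1 and d2 have the same columns and each already carries its own Year, one simple approach is to stack them with rbind() and sort by Id and Year. This is a sketch under that assumption, with the two data frames rebuilt from the tables in the question.

```r
d1 <- data.frame(
  Id = 12:18,
  group = factor(c(1, 4, 6, 7, 2, 1, 4)),
  occu = factor(c(1, 2, 3, 1, 2, 1, 1)),
  D = c(12, 67, 34, 88, 72, 43, 66),
  Year = 2007
)
d2 <- data.frame(
  Id = 12:18,
  group = factor(c(1, 4, 6, 7, 2, 1, 4)),
  occu = factor(c(1, 2, 3, 1, 2, 1, 1)),
  D = c(34, 100, 76, 99, 102, 55, 32),
  Year = 2010
)

# Stack the two waves, then order so each Id's years sit together.
panel <- rbind(d1, d2)
panel <- panel[order(panel$Id, panel$Year), ]
rownames(panel) <- NULL
head(panel, 4)
```

Each Id now appears on two rows, one per year, which is the long format most panel tools expect.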

Creating a rank column based on two other (linked) columns in R

I have the following dataframe (example data) which has the dates of different DVD recordings for different pairs of birds for numerous broods:
PairID BroodRef DVDdate
1 512 2004-05-22
1 512 2004-05-30
1 512 2004-05-26
1 588 2004-06-30
1 588 2004-07-04
1 588 2004-07-09
2 673 2004-07-19
3 543 2004-06-03
3 543 2004-06-07
3 543 2004-06-11
3 620 2004-07-19
3 39 2005-05-19
3 39 2005-05-23
What I'd like is a brood number for each pair, such as:
PairID BroodRef DVDdate BroodNumber
1 512 2004-05-22 1
1 512 2004-05-30 1
1 512 2004-05-26 1
1 588 2004-06-30 2
1 588 2004-07-04 2
1 588 2004-07-09 2
2 673 2004-07-19 1
3 543 2004-06-03 1
3 543 2004-06-07 1
3 543 2004-06-11 1
3 620 2004-07-19 2
3 39 2005-05-19 3
3 39 2005-05-23 3
I have tried
ddply(df,.(PairID),transform,BroodNumber = dense_rank(BroodRef))
which I saw on another question, but this results in Pair 3, BroodRef 39 being BroodNumber 1 rather than the 3 it should be.
Appreciate any help!
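The behaviour described above comes from ranking by value: 39 is the smallest BroodRef in Pair 3, so a dense rank gives it 1. The desired output instead numbers broods by order of first appearance. A small base R sketch of the difference for Pair 3, where as.integer(factor(x)) is used as a stand-in for a dense rank:

```r
df <- data.frame(
  PairID = c(1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3),
  BroodRef = c(512, 512, 512, 588, 588, 588, 673, 543, 543, 543, 620, 39, 39)
)
pair3 <- df$BroodRef[df$PairID == 3]

# Ranking by value: 39 sorts first, so it gets 1.
by_value <- as.integer(factor(pair3))
by_value       # 2 2 2 3 1 1

# Numbering by order of first appearance: 543, then 620, then 39.
by_appearance <- match(pair3, unique(pair3))
by_appearance  # 1 1 1 2 3 3
```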
We could use rleid() from data.table to create a sequence based on BroodRef, grouped by PairID.
library(data.table)
setDT(df)[,BroodNumber := rleid(BroodRef), by = PairID]
# PairID BroodRef DVDdate BroodNumber
# 1: 1 512 2004-05-22 1
# 2: 1 512 2004-05-30 1
# 3: 1 512 2004-05-26 1
# 4: 1 588 2004-06-30 2
# 5: 1 588 2004-07-04 2
# 6: 1 588 2004-07-09 2
# 7: 2 673 2004-07-19 1
# 8: 3 543 2004-06-03 1
# 9: 3 543 2004-06-07 1
#10: 3 543 2004-06-11 1
#11: 3 620 2004-07-19 2
#12: 3 39 2005-05-19 3
#13: 3 39 2005-05-23 3
We can use dplyr
library(dplyr)
df1 %>%
group_by(PairID) %>%
mutate(BroodNumber = match(BroodRef, unique(BroodRef)))
# PairID BroodRef DVDdate BroodNumber
# (int) (int) (chr) (int)
#1 1 512 2004-05-22 1
#2 1 512 2004-05-30 1
#3 1 512 2004-05-26 1
#4 1 588 2004-06-30 2
#5 1 588 2004-07-04 2
#6 1 588 2004-07-09 2
#7 2 673 2004-07-19 1
#8 3 543 2004-06-03 1
#9 3 543 2004-06-07 1
#10 3 543 2004-06-11 1
#11 3 620 2004-07-19 2
#12 3 39 2005-05-19 3
#13 3 39 2005-05-23 3

turning biographical data into panel data [closed]

I have biographical data on more than 1600 people. The data includes their gender, birth year, hometown, etc., as well as their career trajectories from the year they began work. I'm trying to turn this into panel data so that I can see how their workplaces have changed since they started their jobs. I have the following problems with this dataset:
1) How do I turn this into a panel dataset? The optimal format I want for each person(id) is:
id gender hometown year job
1 1 1 NY 1990 3
1 1 1 NY 1991 3
1 1 1 NY 1992 3
1 1 1 NY 1993 3
1 1 1 NY 1994 5
2) How do I save information if the person had overlapping positions? For instance, the person can have job 3 and job 5 at the same time. I'm hoping later to only use the job that is higher than the other, but meanwhile I would like to save as much information as possible.
Okay, give this a try.
First select a subset of your data.
> (D = head(origin[, c("id", "name1", "gender", "job1", "job1s", "job1e",
"job2", "job10")]))
id name1 gender job1 job1s job1e job2 job10
1 1 Abulaiti Abureduxiti 1 2305 1980 1991 2303 NA
2 2 Aisihaiti Kelimubai 1 2307 1972 1987 2307 NA
3 3 Ai Zhisheng 1 4509 1996 1997 1075 10103
4 4 An Pingsheng 1 3555 1975 1977 3561 2191
5 5 An Zhiwen 1 2063 1977 1979 1127 2507
6 6 An Ziwen 1 4509 1954 1966 4007 2517
Next we reorganise the data into the format that I think you are after.
> library(reshape2)
> (D = melt(D, id.vars = c("id", "name1", "gender")))
id name1 gender variable value
1 1 Abulaiti Abureduxiti 1 job1 2305
2 2 Aisihaiti Kelimubai 1 job1 2307
3 3 Ai Zhisheng 1 job1 4509
4 4 An Pingsheng 1 job1 3555
5 5 An Zhiwen 1 job1 2063
6 6 An Ziwen 1 job1 4509
7 1 Abulaiti Abureduxiti 1 job1s 1980
8 2 Aisihaiti Kelimubai 1 job1s 1972
9 3 Ai Zhisheng 1 job1s 1996
10 4 An Pingsheng 1 job1s 1975
11 5 An Zhiwen 1 job1s 1977
12 6 An Ziwen 1 job1s 1954
13 1 Abulaiti Abureduxiti 1 job1e 1991
14 2 Aisihaiti Kelimubai 1 job1e 1987
15 3 Ai Zhisheng 1 job1e 1997
16 4 An Pingsheng 1 job1e 1977
17 5 An Zhiwen 1 job1e 1979
18 6 An Ziwen 1 job1e 1966
19 1 Abulaiti Abureduxiti 1 job2 2303
20 2 Aisihaiti Kelimubai 1 job2 2307
21 3 Ai Zhisheng 1 job2 1075
22 4 An Pingsheng 1 job2 3561
23 5 An Zhiwen 1 job2 1127
24 6 An Ziwen 1 job2 4007
25 1 Abulaiti Abureduxiti 1 job10 NA
26 2 Aisihaiti Kelimubai 1 job10 NA
27 3 Ai Zhisheng 1 job10 10103
28 4 An Pingsheng 1 job10 2191
29 5 An Zhiwen 1 job10 2507
30 6 An Ziwen 1 job10 2517
We can see that the job field is empty for a few of these records, so we exclude those.
> (D = D[complete.cases(D),])
id name1 gender variable value
1 1 Abulaiti Abureduxiti 1 job1 2305
2 2 Aisihaiti Kelimubai 1 job1 2307
3 3 Ai Zhisheng 1 job1 4509
4 4 An Pingsheng 1 job1 3555
5 5 An Zhiwen 1 job1 2063
6 6 An Ziwen 1 job1 4509
7 1 Abulaiti Abureduxiti 1 job1s 1980
8 2 Aisihaiti Kelimubai 1 job1s 1972
9 3 Ai Zhisheng 1 job1s 1996
10 4 An Pingsheng 1 job1s 1975
11 5 An Zhiwen 1 job1s 1977
12 6 An Ziwen 1 job1s 1954
13 1 Abulaiti Abureduxiti 1 job1e 1991
14 2 Aisihaiti Kelimubai 1 job1e 1987
15 3 Ai Zhisheng 1 job1e 1997
16 4 An Pingsheng 1 job1e 1977
17 5 An Zhiwen 1 job1e 1979
18 6 An Ziwen 1 job1e 1966
19 1 Abulaiti Abureduxiti 1 job2 2303
20 2 Aisihaiti Kelimubai 1 job2 2307
21 3 Ai Zhisheng 1 job2 1075
22 4 An Pingsheng 1 job2 3561
23 5 An Zhiwen 1 job2 1127
24 6 An Ziwen 1 job2 4007
27 3 Ai Zhisheng 1 job10 10103
28 4 An Pingsheng 1 job10 2191
29 5 An Zhiwen 1 job10 2507
30 6 An Ziwen 1 job10 2517
Sorting out overlapping positions is a secondary problem. If I know that the above is basically what you are after then we can address that next.
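As a rough sketch of where this could go next, suppose the melted data has been rearranged into one row per person and position with a start and end year. The table spells and its columns (job, start, end) are hypothetical and only loosely modelled on job1, job1s and job1e above. Each spell can then be expanded into one row per year, so that overlapping positions simply show up as several rows for the same id and year.

```r
# Hypothetical table of job spells: one row per person and position.
spells <- data.frame(
  id = c(1, 1, 2),
  job = c(3, 5, 7),
  start = c(1990, 1993, 1991),
  end = c(1993, 1994, 1992)
)

# Expand each spell into one row per year it covers.
rows <- lapply(seq_len(nrow(spells)), function(i) {
  data.frame(
    id = spells$id[i],
    year = spells$start[i]:spells$end[i],
    job = spells$job[i]
  )
})
panel <- do.call(rbind, rows)
panel <- panel[order(panel$id, panel$year, panel$job), ]
rownames(panel) <- NULL
panel  # id 1 has two rows for 1993, one per overlapping position

# Later, keep only the highest job per person-year. This assumes that
# "higher" means the larger job code.
top_job <- aggregate(job ~ id + year, data = panel, FUN = max)
```

Keeping panel as the master table preserves all overlapping positions, while top_job is a derived view that can be rebuilt with a different rule at any time.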
