Merging two data frames with different age formats - R

I have two data frames, "df" and "df1", with different variables. I want to merge df1 into df by "gender", "age" and "district" so that each age in df gets the AC value of its age group. For example, if AC is given for age group 20-24, every row in df with an age between 20 and 24 should get that AC value. Thank you in advance.
df<-
district residence gender age weight id
1 1 1 12 26.8 1
2 2 2 14 21.4 2
3 1 1 20 24.2 3
4 2 2 23 35.8 4
5 1 1 31 42.3 5
6 2 2 16 25.2 6
7 1 1 22 35.3 7
8 2 2 45 25.3 8
9 1 1 48 36.2 9
10 2 2 39 35.5 10
df1<-
district age gender AC
1 15-19 2 0.0301
2 20-24 2 0.0934
3 25-29 2 0.108
4 30-34 2 0.0894
5 35-39 2 0.0444
6 40-44 2 0.00945
7 45-49 2 0.00226
8 15-19 2 0.0258
9 20-24 2 0.0701
10 25-29 2 0.0827

You can separate the age column of df1 into two columns and use fuzzyjoin.
library(dplyr)
library(tidyr)
library(fuzzyjoin)
df1 %>%
separate(age, c('start', 'end'), sep = '-', convert = TRUE) %>%
fuzzy_right_join(df,
by = c('district', 'gender', 'start' = 'age', 'end' = 'age'),
match_fun = c(`==`, `==`, `<=`, `>=`))
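If fuzzyjoin is not available, the same range match can be sketched in base R: split the range into numeric bounds, join on the exact keys, keep the rows whose age falls inside the range, and join the result back by id. This is only a minimal sketch on toy data; the values and the helper names (bounds, cand, hits, res) are assumptions made for illustration, and it relies on df having a unique id column as in the question.

```r
# Toy versions of the two tables (values chosen for illustration only)
df <- data.frame(id       = 1:3,
                 district = c(1, 1, 2),
                 gender   = c(2, 2, 2),
                 age      = c(16, 23, 50))
df1 <- data.frame(district = c(1, 1, 2),
                  age      = c("15-19", "20-24", "20-24"),
                  gender   = c(2, 2, 2),
                  AC       = c(0.0301, 0.0934, 0.0701),
                  stringsAsFactors = FALSE)

# Split "15-19" into numeric start/end columns and drop the text column
bounds    <- do.call(rbind, strsplit(df1$age, "-", fixed = TRUE))
df1$start <- as.numeric(bounds[, 1])
df1$end   <- as.numeric(bounds[, 2])
df1$age   <- NULL

# Join on the exact keys, then keep rows whose age lies inside the range
cand <- merge(df, df1, by = c("district", "gender"))
hits <- cand[cand$age >= cand$start & cand$age <= cand$end, ]

# Join back by id so rows of df without a matching range keep AC = NA
res <- merge(df, hits[, c("id", "AC")], by = "id", all.x = TRUE)
res
```

The intermediate merge pairs every row of df with every range for the same district and gender, so on large data the fuzzyjoin approach (or a data.table non-equi join) is likely to be the better choice.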

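This is actually a poor minimal example, because there are no such matches in your data, so I have modified your data a little (setting district to 1 in both data frames). Also note that some ages in df have no corresponding group in df1; because the lowest break passed to cut() below is 0, every age up to 19 (including 14) is labelled 15-19.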
df$district=1
df1$district=1
df$age1=cut(
df$age,
c(0,as.numeric(unlist(lapply(strsplit(unique(df1$age),"-"),"[[",2)))),
labels=sort(unique(df1$age))
)
merge(
df,
df1,
by.x=c("gender","age1","district"),
by.y=c("gender","age","district")
)
gender age1 district residence age weight id AC
1 2 15-19 1 2 14 21.4 2 0.03010
2 2 15-19 1 2 14 21.4 2 0.02580
3 2 15-19 1 2 16 25.2 6 0.03010
4 2 15-19 1 2 16 25.2 6 0.02580
5 2 20-24 1 2 23 35.8 4 0.07010
6 2 20-24 1 2 23 35.8 4 0.09340
7 2 35-39 1 2 39 35.5 10 0.04440
8 2 45-49 1 2 45 25.3 8 0.00226
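If ages below the first group should stay unmatched instead of being labelled 15-19, the breaks can start just below the first lower bound. The following is a small sketch of that variant, assuming integer ages and contiguous groups; the vector names (groups, lower, upper) are made up for the example.

```r
# Age-group labels as they appear in df1 in the question
groups <- c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49")

# Lower and upper bound of each group
lower <- as.numeric(sub("-.*", "", groups))
upper <- as.numeric(sub(".*-", "", groups))

# Breaks start at 14, so the first interval is (14, 19] and maps to "15-19";
# ages of 14 or less fall outside every interval and become NA
age  <- c(12, 14, 20, 23, 31, 16, 22, 45, 48, 39)
age1 <- cut(age, breaks = c(lower[1] - 1, upper), labels = groups)

as.character(age1)
# NA NA "20-24" "20-24" "30-34" "15-19" "20-24" "45-49" "45-49" "35-39"
```

Rows with NA in age1 then find no partner in df1 and are dropped by merge() unless all.x = TRUE is set.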

Related

changing the order of age-group into normal order

I have a data frame named df. As a first step I converted age into age groups, and then summed pop by age group and gender.
df <- data.frame(age = c(0,1,3,5,6,29,43,12,1,3,5,12,29,43,0,6), pop = c(12,11,33,45,56,54,67,76,65,11,78,90,112,29,70,60), gender = c(2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1))
Converting age into age groups:
x <- df$age %/% 5
x <- pmax(0, pmin(20, x))
df$agegroup<- c(paste(0:19*5, 1:20*5-1, sep="-"), "+100")[x+1]
Summing pop by gender and age group:
df1 <- aggregate(pop ~ gender + agegroup, data = df, FUN = sum)
gender agegroup pop
1 1 0-4 146
2 2 0-4 56
3 1 10-14 90
4 2 10-14 76
5 1 25-29 112
6 2 25-29 54
7 1 40-44 29
8 2 40-44 67
9 1 5-9 138
10 2 5-9 101
As shown in df1, the age group 5-9 comes after 40-44, but I want the age groups in numeric order. My desired output is:
gender agegroup pop
1 1 0-4 146
2 2 0-4 56
3 1 5-9 138
4 2 5-9 101
5 1 10-14 90
6 2 10-14 76
7 1 25-29 112
8 2 25-29 54
9 1 40-44 29
10 2 40-44 67
You're going to want to set agegroup to a factor and specify the factor order. One way to do this is with reorder(). For example
df$agegroup <- reorder(df$agegroup,
as.numeric(gsub("-\\d+","", df$agegroup)))
We use gsub() to take off the second number, and then we can use that to sort by the numeric value of the first number.
Once you've updated the level order to be what you want, you should get the results in the order you want.
levels(df$agegroup)
# [1] "0-4" "5-9" "10-14" "25-29" "40-44"
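The same idea can be sketched without reorder() by building the level order explicitly from the lower bound of each label. This is a minimal illustration on a made-up vector, not the data from the question.

```r
# Labels in the alphabetical order that aggregate() would produce
agegroup <- c("0-4", "10-14", "25-29", "40-44", "5-9", "0-4")

# Numeric lower bound of each label ("10-14" -> 10)
lower <- as.numeric(sub("-.*", "", agegroup))

# Unique labels sorted by their lower bound become the factor levels
lev <- unique(agegroup[order(lower)])
f   <- factor(agegroup, levels = lev)

levels(f)
# "0-4" "5-9" "10-14" "25-29" "40-44"
```

Once agegroup is a factor with these levels, functions that sort by group (aggregate, order, ggplot2 axes) follow the numeric order rather than the alphabetical one.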
I am kind of reinventing the wheel here for something that you have already solved but you can use cut and pass breaks and labels to it.
The benefit of using cut is that it will give you factor levels which are already in the order that you want, you just need to arrange them.
library(dplyr)
x1 <- c(0, seq(4, 100, 5))
labels <- c(paste(x1[-length(x1)] + 1, x1[-1], sep = '-'), '100+')
labels[1] <- '0-4'
df %>%
group_by(gender, agegroup = cut(age, c(x1, Inf), labels, include.lowest = TRUE)) %>%
summarise(pop = sum(pop)) %>%
ungroup %>%
arrange(agegroup)
# gender agegroup pop
# <dbl> <fct> <dbl>
# 1 1 0-4 146
# 2 2 0-4 56
# 3 1 5-9 138
# 4 2 5-9 101
# 5 1 10-14 90
# 6 2 10-14 76
# 7 1 25-29 112
# 8 2 25-29 54
# 9 1 40-44 29
#10 2 40-44 67
We can use mixedorder from gtools
df1[gtools::mixedorder(df1$agegroup),]
gender agegroup pop
1 1 0-4 146
2 2 0-4 56
9 1 5-9 138
10 2 5-9 101
3 1 10-14 90
4 2 10-14 76
5 1 25-29 112
6 2 25-29 54
7 1 40-44 29
8 2 40-44 67
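When gtools is not installed, the rows can also be ordered in base R by the numeric lower bound of each label. A short sketch on a cut-down copy of df1 (only three age groups are included here for brevity):

```r
# A subset of the aggregated result from the question
df1 <- data.frame(gender   = c(1, 2, 1, 2, 1, 2),
                  agegroup = c("0-4", "0-4", "10-14", "10-14", "5-9", "5-9"),
                  pop      = c(146, 56, 90, 76, 138, 101),
                  stringsAsFactors = FALSE)

# Order by the number before the dash, then by gender within each group
lower  <- as.numeric(sub("-.*", "", df1$agegroup))
sorted <- df1[order(lower, df1$gender), ]
sorted$agegroup
# "0-4" "0-4" "5-9" "5-9" "10-14" "10-14"
```

Unlike the factor approach, this only reorders the rows; agegroup itself stays a character column.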

extract data from data frame and delete extracted data

I have a data frame named df with three variables. I want to subset df into "df1" in such a way that the extracted rows no longer exist in df. The extraction itself can be done with "subset", but the extracted rows still remain in df.
Any help would be appreciated.
df<-
gender age pro
1 22 0.0301
2 11 0.0934
1 44 0.108
2 56 0.0894
1 70 0.0444
2 33 0.00945
1 23 0.00226
2 32 0.0258
1 12 0.0701
2 1 0.0827
1 17 0.0657
1 9 0.0324
2 44 0.00755
1 49 0.000456
2 39 0.0255
1 18 0.0828
2 31 0.0931
1 8 0.0717
df1<- subset(df, age > 14 & age< 50 & gender==2)
You can use dplyr::anti_join to remove the extracted rows from the original data.
df1 <- subset(df, age > 14 & age < 50 & gender == 2)
df <- dplyr::anti_join(df, df1)
We could do this with base R by negating the condition, which keeps only the rows that were not extracted:
df1 <- subset(df, !(age > 14 & age < 50 & gender==2))
Output:
gender age pro
<dbl> <dbl> <dbl>
1 1 22 0.0301
2 2 11 0.0934
3 1 44 0.108
4 2 56 0.0894
5 1 70 0.0444
6 1 23 0.00226
7 1 12 0.0701
8 2 1 0.0827
9 1 17 0.0657
10 1 9 0.0324
11 1 49 0.000456
12 1 18 0.0828
13 1 8 0.0717
Using dplyr
library(dplyr)
filter(df, !(age > 14 & age < 50 & gender==2))
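Both pieces (the extracted rows and the remainder) can also be produced from a single logical index, which avoids writing the condition twice. This is a small sketch on a few rows in the shape of the question's data; the name take is an assumption for illustration, and the code assumes age and gender contain no NA values.

```r
df <- data.frame(gender = c(1, 2, 1, 2, 2),
                 age    = c(22, 11, 44, 33, 32),
                 pro    = c(0.0301, 0.0934, 0.108, 0.00945, 0.0258))

# TRUE for the rows to extract
take <- df$age > 14 & df$age < 50 & df$gender == 2

df1 <- df[take, ]    # extracted rows
df  <- df[!take, ]   # original data without the extracted rows
```

After these two assignments, every original row is in exactly one of df1 and df.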

Creating a grouping indicator per row in R

I have following data
x1 <- rnorm(20,0,1)
x2 <- rnorm(20,0,1)
group <- sample(50:55, size=20, replace=TRUE)
data <- data.frame(x1,x2,group)
head(data)
x1 x2 group
1 -0.88001290 0.53866432 50
2 0.34228653 -0.54503078 52
3 -2.42308971 0.09542262 54
4 0.07310148 -1.03226594 50
5 -0.47786709 2.46726615 55
6 0.45224510 -1.46224926 55
I need to create a grouping indicator based on group variable. (so that the rows where group=50 will equal to 1, group=51 equal to 2 so on)
I tried to do this using dplyr package in R. But I am not getting the correct answer as I have not defined the indicator variable correctly.
data %>% arrange(group) %>% group_by(group) %>% mutate(Indicator = n() )
Can anyone help me to correct my code?
Thank you
We need cur_group_id() instead of n(); n() returns the number of rows in each group, not a group index.
library(dplyr)
data %>%
arrange(group) %>%
group_by(group) %>%
mutate(indicator = cur_group_id()) %>%
ungroup
-output
# A tibble: 20 x 4
# x1 x2 group indicator
# <dbl> <dbl> <int> <int>
# 1 -1.24 -0.497 50 1
# 2 -0.648 1.59 50 1
# 3 0.598 -0.325 51 2
# 4 -0.721 0.510 51 2
# 5 0.259 1.62 51 2
# 6 -0.288 0.872 52 3
# 7 0.403 0.785 52 3
# 8 1.84 1.65 52 3
# 9 0.116 -0.0234 52 3
#10 -1.31 -0.244 52 3
#11 -0.615 0.994 53 4
#12 -0.469 0.695 53 4
#13 -0.324 -0.599 53 4
#14 -0.394 -0.971 53 4
#15 1.30 0.323 54 5
#16 0.0242 -1.46 54 5
#17 -0.342 -1.96 54 5
#18 1.10 -0.569 54 5
#19 -0.967 -0.863 54 5
#20 -0.396 -0.441 55 6
Or another option is match
data %>%
mutate(indicator = match(group, sort(unique(group))))
base R using factor()
levels = 50:55
labels = 1:6
data$indicator <- factor(data$group, levels, labels)
or
levels = sort(unique(data$group))
labels = seq_along(levels)
data$indicator <- factor(data$group, levels, labels)
dplyr::dense_rank may also help even without grouping
data %>% mutate(indicator = dense_rank(group) )
baseR way
data$indicator <- as.numeric(as.factor(data$group))
data
x1 x2 group indicator
1 -1.453628399 -1.78776319 55 6
2 -0.119413813 -0.07656982 52 3
3 0.387951296 -0.26845052 55 6
4 3.117977719 0.69280780 51 2
5 -0.938126762 -0.16898209 50 1
6 -1.596371818 0.35289797 52 3
7 -2.291376398 -1.59385221 55 6
8 0.161164263 -0.99387565 54 5
9 -0.281744752 -0.26801191 53 4
10 0.760719223 -0.28255900 50 1
11 -0.204073022 -1.10262114 51 2
12 0.653628314 0.77778039 54 5
13 0.043736298 -0.37896178 55 6
14 0.002800531 1.17034334 55 6
15 0.451136658 -0.38459588 51 2
16 0.151793862 0.60303631 55 6
17 0.173976519 -0.41745808 53 4
18 0.282827170 -0.16794851 52 3
19 0.737444975 -0.45712603 51 2
20 0.014182869 0.99013155 51 2
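As a quick check that the base R approaches agree, the sketch below applies match() and factor() to a short made-up group vector. Both number the groups by their rank among the values that actually occur, so a value missing from the data (say, no 51) would shift the later indicators down by one.

```r
group <- c(55, 52, 55, 51, 50, 52, 54, 53)

# Position of each value among the sorted unique values
ind_match <- match(group, sort(unique(group)))

# Integer codes of a factor, whose levels are sorted by default
ind_factor <- as.integer(factor(group))

ind_match
# 6 3 6 2 1 3 5 4
identical(ind_match, ind_factor)
# TRUE
```

If the mapping must stay fixed at 50 -> 1, ..., 55 -> 6 regardless of which groups are present, factor(group, levels = 50:55) keeps it stable.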

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region by comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is counted as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
drop_na(region, urban) %>%
group_by(i) %>%
mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
summarize(tot_region = sum(reg_change),
tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
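The counting step can also be sketched in base R with diff(), which is nonzero exactly where consecutive values differ. The helper name count_changes is hypothetical, and the sketch assumes the rows of each individual are already sorted by year.

```r
# Number of switches between consecutive non-missing values
count_changes <- function(x) {
  x <- x[!is.na(x)]     # ignore missing observations
  sum(diff(x) != 0)     # one count for every change of value
}

# region and urban for i = 4, copied from the question
region <- c(1, NA, 4, 4, rep(1, 11), NA, 4)
urban  <- c(1, NA, rep(1, 13), NA, 1)
count_changes(region)   # 3
count_changes(urban)    # 0

# Per individual: apply the helper within each id
d <- data.frame(i      = c(1, 1, 1, 1, 45, 45, 45, 45),
                region = c(1, 1, 2, 2, 3, 3, 4, 1))
per_id <- tapply(d$region, d$i, count_changes)
per_id
```

For the toy data frame, per_id holds 1 change for i = 1 and 2 changes for i = 45, matching the expected output in the question.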

Counting the number of appearances of a certain value in a column

I have a column in my data frame which looks like this:
> df
# A tibble: 20 x 1
duration
<dbl>
1 0
2 40.0
3 247.
4 11.8
5 116.
6 10.2
7 171.
8 7.58
9 87.8
10 23.2
11 390.
12 35.8
13 4.73
14 29.1
15 0
16 36.8
17 73.8
18 12.9
19 124.
20 10.7
I need to group this data, so that all rows starting from a 0 to the last row before the next zero are in a group. I've accomplished this using a for-loop:
counter <- 0
df$group <- NA
df$group[1] <- 1
for (i in 2:NROW(df)) {
df$group[i] <-
ifelse(df$duration[i] == 0, df$group[i - 1] + 1, df$group[i - 1])
}
which gives me the desired output:
> df
# A tibble: 20 x 2
duration group
<dbl> <dbl>
1 0 1
2 40.0 1
3 247. 1
4 11.8 1
5 116. 1
6 10.2 1
7 171. 1
8 7.58 1
9 87.8 1
10 23.2 1
11 390. 1
12 35.8 1
13 4.73 1
14 29.1 1
15 0 2
16 36.8 2
17 73.8 2
18 12.9 2
19 124. 2
20 10.7 2
But as my original data frame is quite big, I'm looking for a faster solution. I've been trying to get it working with dplyr, but to no avail. Other related questions count how often the current value has already appeared, not a specific value, so I haven't found a solution to this problem yet.
I'd appreciate your help in finding a vectorized solution for my problem, thanks! Here's the example data:
df <-
structure(
list(
duration = c(
0,
40.0009999275208,
247.248000144958,
11.8349997997284,
115.614000082016,
10.2449998855591,
171.426000118256,
7.58200001716614,
87.805999994278,
23.1909999847412,
390.417999982834,
35.8229999542236,
4.73100018501282,
29.0869998931885,
0,
36.789999961853,
73.8420000076294,
12.8770000934601,
123.771999835968,
10.7190001010895
)
),
row.names = c(NA,-20L),
class = c("tbl_df", "tbl", "data.frame")
)
We can create the desired column using cumsum as below
df %>%
mutate(grp = cumsum(duration == 0))
# A tibble: 20 x 2
# duration grp
# <dbl> <int>
# 1 0 1
# 2 40.0 1
# 3 247. 1
# 4 11.8 1
# 5 116. 1
# 6 10.2 1
# 7 171. 1
# 8 7.58 1
# 9 87.8 1
#10 23.2 1
#11 390. 1
#12 35.8 1
#13 4.73 1
#14 29.1 1
#15 0 2
#16 36.8 2
#17 73.8 2
#18 12.9 2
#19 124. 2
#20 10.7 2
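To see why cumsum works here: duration == 0 is TRUE on each row that starts a group, and the running sum of those TRUE values increases by one at every start. A minimal base R sketch on a shortened, made-up vector:

```r
duration <- c(0, 40, 247, 0, 36.8, 73.8, 0, 10.7)

# TRUE marks the first row of each group
is_start <- duration == 0

# The running count of starts is the group number
grp <- cumsum(is_start)
grp
# 1 1 1 2 2 2 3 3
```

One difference from the loop in the question: if the data does not begin with a 0, the rows before the first 0 get group 0 here, whereas the loop always starts at 1.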

Resources