changing the order of age-group into normal order - r

I have a data frame named df. As a first step, I converted age into age groups and then summed pop for each combination of age group and gender.
df <- data.frame(age = c(0,1,3,5,6,29,43,12,1,3,5,12,29,43,0,6), pop = c(12,11,33,45,56,54,67,76,65,11,78,90,112,29,70,60), gender = c(2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1))
Converting age into age groups:
x <- df$age %/% 5
x <- pmax(0, pmin(20, x))
df$agegroup<- c(paste(0:19*5, 1:20*5-1, sep="-"), "+100")[x+1]
Summing pop by gender and age group:
df1 <- aggregate(pop ~ gender + agegroup, data = df, FUN = sum)
gender agegroup pop
1 1 0-4 146
2 2 0-4 56
3 1 10-14 90
4 2 10-14 76
5 1 25-29 112
6 2 25-29 54
7 1 40-44 29
8 2 40-44 67
9 1 5-9 138
10 2 5-9 101
As shown in df1, the age group 5-9 comes after 40-44, but I want the age groups in numeric order. My desired output is:
gender agegroup pop
1 1 0-4 146
2 2 0-4 56
3 1 5-9 138
4 2 5-9 101
5 1 10-14 90
6 2 10-14 76
7 1 25-29 112
8 2 25-29 54
9 1 40-44 29
10 2 40-44 67

You'll want to make agegroup a factor and specify the order of its levels. One way to do this is with reorder(). For example:
df$agegroup <- reorder(df$agegroup,
as.numeric(gsub("-\\d+","", df$agegroup)))
We use gsub() to strip the second number, and reorder() then sorts the levels by the numeric value of the first number.
Once the levels are in the order you want, re-running aggregate() returns the rows in that order.
levels(df$agegroup)
# [1] "0-4" "5-9" "10-14" "25-29" "40-44"
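As a minimal sketch of the same idea without reorder(), one could build agegroup as a factor from the start, with its levels listed in the intended order. This assumes the question's data and label scheme; all_labels and idx are illustrative names that do not appear in the original code.

```r
# Question's data, rebuilt with base R
df <- data.frame(
  age    = c(0, 1, 3, 5, 6, 29, 43, 12, 1, 3, 5, 12, 29, 43, 0, 6),
  pop    = c(12, 11, 33, 45, 56, 54, 67, 76, 65, 11, 78, 90, 112, 29, 70, 60),
  gender = rep(c(2, 1), each = 8)
)

# All possible labels, already in numeric order
all_labels <- c(paste(0:19 * 5, 1:20 * 5 - 1, sep = "-"), "+100")

# Index of the 5-year band for each age (1-based, capped at the last label)
idx <- pmax(0, pmin(20, df$age %/% 5)) + 1

# A factor keeps the level order given here, so aggregate() sorts by it
df$agegroup <- factor(all_labels[idx], levels = all_labels)

df1 <- aggregate(pop ~ gender + agegroup, data = df, FUN = sum)
print(df1)
```

Because the ordering lives in the factor levels, any later grouping or plotting step should pick it up without re-sorting.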

This somewhat reinvents the wheel for something you have already solved, but you can use cut() and pass breaks and labels to it.
The benefit of cut() is that it returns a factor whose levels are already in the order you want, so you only need to arrange by it.
library(dplyr)
x1 <- c(0, seq(4, 100, 5))
labels <- c(paste(x1[-length(x1)] + 1, x1[-1], sep = '-'), '100+')
labels[1] <- '0-4'
df %>%
group_by(gender, agegroup = cut(age, c(x1, Inf), labels, include.lowest = TRUE)) %>%
summarise(pop = sum(pop)) %>%
ungroup %>%
arrange(agegroup)
# gender agegroup pop
# <dbl> <fct> <dbl>
# 1 1 0-4 146
# 2 2 0-4 56
# 3 1 5-9 138
# 4 2 5-9 101
# 5 1 10-14 90
# 6 2 10-14 76
# 7 1 25-29 112
# 8 2 25-29 54
# 9 1 40-44 29
#10 2 40-44 67
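A slightly simpler variant, sketched here under the assumption that ages are non-negative numbers, is to use left-closed intervals via right = FALSE, so the breaks can be plain multiples of 5 and no label has to be patched afterwards. The names breaks, band_labels and to_agegroup are illustrative.

```r
# Breaks at 0, 5, ..., 100, Inf give intervals [0,5), [5,10), ..., [100, Inf)
breaks <- c(seq(0, 100, by = 5), Inf)

# One label per interval: "0-4", "5-9", ..., "95-99", "100+"
band_labels <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

# Hypothetical helper: map ages to an ordered-by-level factor of age bands
to_agegroup <- function(age) {
  cut(age, breaks = breaks, labels = band_labels, right = FALSE)
}

ages <- c(0, 4, 5, 99, 100, 150)
groups <- to_agegroup(ages)
print(data.frame(age = ages, agegroup = groups))
```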

We can use mixedorder() from gtools:
df1[gtools::mixedorder(df1$agegroup),]
gender agegroup pop
1 1 0-4 146
2 2 0-4 56
9 1 5-9 138
10 2 5-9 101
3 1 10-14 90
4 2 10-14 76
5 1 25-29 112
6 2 25-29 54
7 1 40-44 29
8 2 40-44 67
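If adding a package is not an option, a rough base R equivalent is to extract the lower bound of each label and order by it. This is only a sketch: it assumes every label starts with (or contains) its lower bound as the first run of digits, and lower_bound is an illustrative name.

```r
# The aggregated result from the question, with agegroup as plain character
df1 <- data.frame(
  gender   = rep(1:2, times = 5),
  agegroup = rep(c("0-4", "10-14", "25-29", "40-44", "5-9"), each = 2),
  pop      = c(146, 56, 90, 76, 112, 54, 29, 67, 138, 101),
  stringsAsFactors = FALSE
)

# First run of digits in each label, as a number ("10-14" -> 10, "+100" -> 100)
lower_bound <- as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", df1$agegroup))

# Order by lower bound, then by gender within each age group
sorted <- df1[order(lower_bound, df1$gender), ]
print(sorted)
```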

Related

merging two data frame with different age form

I have two data frames, df and df1. I want to merge df1 into df based on gender, age and district, so that each age in df gets the AC value of its age group. For example, if AC is given for age group 20-24, every row of df with an age between 20 and 24 should get that AC value. Thank you in advance.
df<-
district residence gender age weight id
1 1 1 12 26.8 1
2 2 2 14 21.4 2
3 1 1 20 24.2 3
4 2 2 23 35.8 4
5 1 1 31 42.3 5
6 2 2 16 25.2 6
7 1 1 22 35.3 7
8 2 2 45 25.3 8
9 1 1 48 36.2 9
10 2 2 39 35.5 10
df1<-
district age gender AC
1 15-19 2 0.0301
2 20-24 2 0.0934
3 25-29 2 0.108
4 30-34 2 0.0894
5 35-39 2 0.0444
6 40-44 2 0.00945
7 45-49 2 0.00226
8 15-19 2 0.0258
9 20-24 2 0.0701
10 25-29 2 0.0827
You can separate the age column of df1 into two columns and use fuzzyjoin.
library(dplyr)
library(tidyr)
library(fuzzyjoin)
df1 %>%
separate(age, c('start', 'end'), sep = '-', convert = TRUE) %>%
fuzzy_right_join(df,
by = c('district', 'gender', 'start' = 'age', 'end' = 'age'),
match_fun = c(`==`, `==`, `<=`, `>=`))
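The same range-matching logic can also be sketched in base R: join on the exact keys first, then keep only the rows whose age falls inside the band. The data below is a small made-up sample (people and rates are hypothetical stand-ins for df and df1), and the sketch assumes each person row is uniquely identified by district, gender and age.

```r
# Hypothetical stand-ins for df and df1
people <- data.frame(district = c(1, 1, 1), gender = c(2, 2, 2), age = c(16, 23, 52))
rates <- data.frame(
  district = c(1, 1), gender = c(2, 2),
  age = c("15-19", "20-24"), AC = c(0.0301, 0.0934),
  stringsAsFactors = FALSE
)

# Split "15-19" into numeric start and end columns
bounds <- do.call(rbind, strsplit(rates$age, "-", fixed = TRUE))
rates$start <- as.numeric(bounds[, 1])
rates$end   <- as.numeric(bounds[, 2])
rates$age   <- NULL  # avoid a name clash with people$age

# Exact join on district and gender, then filter on the age range
candidates <- merge(people, rates, by = c("district", "gender"), all.x = TRUE)
in_range <- !is.na(candidates$start) &
  candidates$age >= candidates$start & candidates$age <= candidates$end
matched <- candidates[in_range, c("district", "gender", "age", "AC")]

# Re-attach to the full list of people so unmatched ages keep an NA
result <- merge(people, matched, by = c("district", "gender", "age"), all.x = TRUE)
print(result)
```

The intermediate join grows with the number of bands per district and gender, so for large tables a dedicated range join is likely the better choice.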
This is actually a poor minimal example, because there are no such matches in your data, so I have modified your data a little. Also note that some ages in df have no label in df1. Because the first break below is 0, ages under 15 are also labelled 15-19 (see age 14 in the output).
df$district=1
df1$district=1
df$age1=cut(
df$age,
c(0,as.numeric(unlist(lapply(strsplit(unique(df1$age),"-"),"[[",2)))),
labels=sort(unique(df1$age))
)
merge(
df,
df1,
by.x=c("gender","age1","district"),
by.y=c("gender","age","district")
)
gender age1 district residence age weight id AC
1 2 15-19 1 2 14 21.4 2 0.03010
2 2 15-19 1 2 14 21.4 2 0.02580
3 2 15-19 1 2 16 25.2 6 0.03010
4 2 15-19 1 2 16 25.2 6 0.02580
5 2 20-24 1 2 23 35.8 4 0.07010
6 2 20-24 1 2 23 35.8 4 0.09340
7 2 35-39 1 2 39 35.5 10 0.04440
8 2 45-49 1 2 45 25.3 8 0.00226
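If ages below the first band should stay unmatched instead of falling into 15-19, one option is to derive the lowest break from the labels themselves. A minimal sketch, assuming integer ages and contiguous labels listed in ascending order; band_labels, lower and upper are illustrative names.

```r
band_labels <- c("15-19", "20-24", "25-29")

# Lower and upper bound of each band, parsed from the labels
lower <- as.numeric(sub("-.*$", "", band_labels))   # 15 20 25
upper <- as.numeric(sub("^.*-", "", band_labels))   # 19 24 29

# Intervals (14,19], (19,24], (24,29]; anything outside becomes NA
breaks <- c(min(lower) - 1, upper)

ages <- c(12, 15, 19, 20, 29, 45)
age_band <- cut(ages, breaks = breaks, labels = band_labels)
print(data.frame(age = ages, age_band = age_band))
```

Rows with an NA band then simply drop out of an inner merge(), or keep an NA AC with all.x = TRUE.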

Label columns with an ascending number [duplicate]

I want to label columns with an ascending number, so that in a bigger dataset I can sort the columns into the right order.
How do I code this? Thanks!
set.seed(8)
id <- 1:6
diet <- rep(c("A","B"),3)
period <- rep(c(1,2),3)
score1 <- sample(1:100,6)
score2 <- sample(1:100,6)
score3 <- sample(1:100,6)
df <- data.frame(id, diet, period, score1, score2,score3)
df
id diet period score1 score2 score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
It should look like:
x1id x2diet x3period x4score1 x5score2 x6score3
1 1 A 1 47 30 44
2 2 B 2 21 93 54
3 3 A 1 79 76 14
4 4 B 2 64 63 90
5 5 A 1 31 44 1
6 6 B 2 69 9 26
I was thinking of something like this, but something is missing:
colnames(wellbeing) <- paste(1:ncol, colnames(wellbeing))
Another option:
colnames(df) <- paste0('x', 1:dim(df)[2], colnames(df))
or
df %>%
dplyr::rename_all(~ paste0('x', 1:ncol(df), .))
Both methods would yield the same output:
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
You can use:
names(df) <- paste0('x', seq_along(df), names(df))
df
# x1id x2diet x3period x4score1 x5score2 x6score3
#1 1 A 1 96 1 52
#2 2 B 2 52 93 75
#3 3 A 1 55 50 68
#4 4 B 2 79 3 9
#5 5 A 1 12 6 76
#6 6 B 2 42 86 62
Maybe add an underscore?
names(df) <- paste0('x', seq_along(df), "_", names(df))
names(df)
#[1] "x1_id" "x2_diet" "x3_period" "x4_score1" "x5_score2" "x6_score3"
Here is a mapply() approach:
names(df) <- mapply(paste0, paste0("x", seq_along(df)), names(df), USE.NAMES = FALSE)
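Since the goal is to sort columns by name later, it may help to zero-pad the number: with ten or more columns, a name like x10 can sort before x2 as text. The following is a sketch on a made-up 12-column data frame; wide, width and padded are illustrative names.

```r
# Made-up data frame with 12 columns named V1 ... V12
wide <- as.data.frame(matrix(0, nrow = 2, ncol = 12))

# Pad the position to the number of digits in ncol(wide), e.g. 01, 02, ..., 12
width  <- nchar(ncol(wide))
padded <- formatC(seq_along(wide), width = width, flag = "0")

names(wide) <- paste0("x", padded, "_", names(wide))
print(names(wide))

# Sorting the padded names keeps the original column order
wide_sorted <- wide[, sort(names(wide))]
```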

How to reshape data frame from a row level to person level in R

I have the following code for a Netflix experiment that reduces the price of Netflix to see whether people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)
sample_size <- 10000
set.seed(853)
viewing_data <-
tibble(unique_person_id = sample(x = c(1:100),
size = sample_size,
replace = TRUE),
tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
size = sample_size,
replace = TRUE)
)
I then want to write some code that randomly assigns people to one of two groups, treatment and control. However, the dataset is at the row level, with 10,000 observations. I want to change it to the person level in R so that I can assign each person to be either treated or not. A person should not be both treated and not treated, but each person appears in many rows, one per tv_show record. Does anyone know how to reshape the dataset in this case?
library(dplyr)
treatment <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(treated = sample(c("yes", "no"), size = n(), replace = TRUE)) # n() = number of distinct persons
viewing_data %>%
left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
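One way to change the sampling, sketched here in base R, is to shuffle a balanced vector of labels so the two groups end up (almost) equal in size, and then look the label up for every viewing row with match(). The data is regenerated with data.frame() so the block runs on its own; person_ids, labels_balanced and assignment are illustrative names.

```r
set.seed(853)
sample_size <- 10000
viewing_data <- data.frame(
  unique_person_id = sample(1:100, size = sample_size, replace = TRUE),
  tv_show = sample(c("Broadchurch", "Duty-Shame", "Drive to Survive",
                     "Shetland", "The Crown"),
                   size = sample_size, replace = TRUE),
  stringsAsFactors = FALSE
)

# Person level: one row per id
person_ids <- unique(viewing_data$unique_person_id)

# Balanced labels, shuffled; sizes differ by at most one person
labels_balanced <- rep(c("yes", "no"), length.out = length(person_ids))
assignment <- data.frame(
  unique_person_id = person_ids,
  treated = sample(labels_balanced),
  stringsAsFactors = FALSE
)

# Back to row level: look up each row's person in the assignment table
viewing_data$treated <- assignment$treated[
  match(viewing_data$unique_person_id, assignment$unique_person_id)
]

# Every person should have exactly one distinct label
groups_per_person <- tapply(viewing_data$treated, viewing_data$unique_person_id,
                            function(g) length(unique(g)))
print(table(assignment$treated))
```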
You can do the following, which groups your observations by person id and assigns a single treated/control label per group:
library(dplyr)
viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
# A tibble: 10,000 x 3
# Groups: unique_person_id [100]
unique_person_id tv_show group
<int> <chr> <chr>
1 9 Drive to Survive control
2 64 Shetland treated
3 90 The Crown treated
4 93 Drive to Survive treated
5 17 Duty-Shame treated
6 29 The Crown control
7 84 Broadchurch control
8 83 The Crown treated
9 3 The Crown control
10 33 Broadchurch control
# … with 9,990 more rows
We can check the result: every id has exactly one group (treated or control):
newdata <- viewing_data %>%
group_by(unique_person_id) %>%
mutate(group=sample(c("treated","control"),1))
tapply(newdata$group,newdata$unique_person_id,n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
If you want random and equal allocation of persons to the two groups (complete random allocation), you can use the following code.
library(dplyr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=sample(100), # in case the ids are not truly random
group=ifelse(group %% 2 == 0, 0, 1)) # works if only two groups
Persons
# A tibble: 100 x 2
unique_person_id group
<int> <dbl>
1 1 0
2 2 0
3 3 1
4 4 0
5 5 1
6 6 1
7 7 1
8 8 0
9 9 1
10 10 0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
group n
<dbl> <int>
1 0 50
2 1 50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)
Persons <- viewing_data %>%
distinct(unique_person_id) %>%
mutate(group=complete_ra(N=100, m=50))
Persons %>% count(group) # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
unique_person_id tv_show group
<int> <chr> <int>
1 10 Shetland 1
2 95 Broadchurch 0
3 7 Duty-Shame 1
4 68 Drive to Survive 0
5 17 Drive to Survive 1
6 70 Shetland 0
7 78 Drive to Survive 0
8 21 Broadchurch 1
9 80 The Crown 0
10 70 Shetland 0
# ... with 9,990 more rows

Use dplyr (I think) to manipulate a dataset

I am given a data set called ChickWeight, which contains the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0.
I first cleaned the data set and kept only the chicks that were recorded for all 12 weigh-ins:
library(datasets)
library(dplyr)
Frequency <- plyr::count(ChickWeight$Chick) # plyr::count() returns columns x and freq; dplyr::count() does not accept a vector
colnames(Frequency)[colnames(Frequency)=="x"] <- "Chick"
a <- inner_join(ChickWeight, Frequency, by='Chick')
complete <- a[(a$freq == 12),]
head(complete,3)
The ChickWeight data set ships with R in the datasets package.
You can try:
library(dplyr)
ChickWeight %>%
group_by(Chick) %>%
filter(any(Time == 21)) %>%
mutate(wdiff = weight - first(weight))
# A tibble: 540 x 5
# Groups: Chick [45]
weight Time Chick Diet wdiff
<dbl> <dbl> <ord> <fct> <dbl>
1 42 0 1 1 0
2 51 2 1 1 9
3 59 4 1 1 17
4 64 6 1 1 22
5 76 8 1 1 34
6 93 10 1 1 51
7 106 12 1 1 64
8 125 14 1 1 83
9 149 16 1 1 107
10 171 18 1 1 129
# ... with 530 more rows
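For comparison, here is a base R sketch of the same two steps: keep the chicks with all 12 weigh-ins, then subtract each chick's day-0 weight. It looks up the baseline by Time == 0 explicitly instead of relying on row order; cw, complete_ids and baseline are illustrative names.

```r
# Plain data frame copy of the built-in data set
cw <- as.data.frame(datasets::ChickWeight)

# Chicks that were weighed at all 12 time points
n_obs <- table(cw$Chick)
complete_ids <- names(n_obs)[n_obs == 12]
complete <- cw[cw$Chick %in% complete_ids, ]

# Day-0 weight per chick, as a vector named by chick id
at_day0  <- complete$Time == 0
baseline <- complete$weight[at_day0]
names(baseline) <- as.character(complete$Chick[at_day0])

# Weight difference relative to day 0
complete$wdiff <- complete$weight - unname(baseline[as.character(complete$Chick)])
print(head(complete, 3))
```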

ddply type functionality on multiple data frames

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both these data frames. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both data frames? For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The two data frames are measured at different frequencies, which makes combining them difficult. The only solution I've found so far is to ddply the first data frame to extract average sqft/id/month and then pass that value into a second ddply call on the second data frame.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply (from plyr) as usual:
library(plyr)
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
left_join(., dfa, by = c("id", "month")) %>%
group_by(id, month) %>%
dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
library(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
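The same summarise-then-join idea also works without extra packages. A minimal base R sketch on the question's sample data: aggregate Dataframe B to one row per id/month first, then merge with Dataframe A, so the differing frequencies never produce repeated rows. avg_price and combined are illustrative names.

```r
# Sample data from the question
dfa <- data.frame(id = c(1030, 1030, 1027, 1027),
                  sqft = c(16, 15, 1, 2),
                  traf = c(35, 32, 31, 31),
                  month = c(1, 2, 1, 2))
dfb <- data.frame(id = rep(c(1030, 1027), each = 5),
                  price = c(8, 9, 10, 3, 7, 6, 1, 2, 4, 5),
                  frequency = c(196, 101, 156, 137, 190, 188, 198, 123, 185, 122),
                  month = rep(c(1, 1, 1, 2, 2), times = 2),
                  day = rep(c(1, 15, 30, 1, 15), times = 2))

# Step 1: reduce dfb to the same grain as dfa (one row per id/month)
avg_price <- aggregate(price ~ id + month, data = dfb, FUN = mean)

# Step 2: join on the shared keys and compute the combined variable
combined <- merge(dfa, avg_price, by = c("id", "month"))
combined$pricePerSqft <- combined$price / combined$sqft
print(combined)
```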
