creating new column based on other grouped variables and multiple conditions - r

Suppose I have the following data:
df1<- data.frame(province= c(1,1,2,3,3,3,4,4,4,4,4,5,5,5),year= c(2001,2001,2001,2001,2001,2001,2002,2002,2003,2003,2003,2004,2005,2005),
residence= c(1,1,1,2,2,2,1,1,1,2,2,2,2,2),marriage= c(1,2,2,1,2,1,1,1,2,1,1,1,2,1),count=c(4,1,3,5,3,2,2,3,2,1,2,4,2,5))
In my data, marriage = 1 is ever-married and marriage = 2 is never-married. The proportion of ever-married can be estimated from the column count: ever-married / (ever-married + never-married).
What I want is to estimate the proportion of ever-married for each combination of province, year and residence, subject to two conditions:
1- if a group has no ever-married rows, the proportion should be 0
2- if a group has no never-married rows, the proportion should be 100.
My expected output would look like this:
province year residence sub
1 2001 1 0.80
2 2001 1 0.00
3 2001 2 0.70
4 2002 1 100.00
4 2003 1 0.00
4 2003 2 100.00
5 2004 2 100.00
5 2005 2 0.71
Thank you in advance.

We group by 'province', 'year' and 'residence'. Within each group, an if/else returns 0 when no row has 'marriage' equal to 1 and 100 when no row has 'marriage' equal to 2. Otherwise it takes the 'count' values where 'marriage' is 1, divides them by the group's total 'count', and sums the resulting proportions.
library(dplyr)
df1 %>%
group_by(province, year, residence) %>%
summarise(sub = if(!any(marriage == 1)) 0
else if(!any(marriage == 2)) 100 else
sum(count[marriage == 1]/sum(count)), .groups = 'drop')
-output
# A tibble: 8 × 4
province year residence sub
<dbl> <dbl> <dbl> <dbl>
1 1 2001 1 0.8
2 2 2001 1 0
3 3 2001 2 0.7
4 4 2002 1 100
5 4 2003 1 0
6 4 2003 2 100
7 5 2004 2 100
8 5 2005 2 0.714
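The rule above can also be checked without dplyr. The following is a minimal base R sketch, not the answer's own method: the helper name prop_ever_married is hypothetical, and the data frame is the df1 from the question.

```r
# Data from the question
df1 <- data.frame(
  province  = c(1,1,2,3,3,3,4,4,4,4,4,5,5,5),
  year      = c(2001,2001,2001,2001,2001,2001,2002,2002,2003,2003,2003,2004,2005,2005),
  residence = c(1,1,1,2,2,2,1,1,1,2,2,2,2,2),
  marriage  = c(1,2,2,1,2,1,1,1,2,1,1,1,2,1),
  count     = c(4,1,3,5,3,2,2,3,2,1,2,4,2,5)
)

# Hypothetical helper: proportion of ever-married (marriage == 1) in one group,
# with the two special cases requested in the question.
prop_ever_married <- function(marriage, count) {
  if (!any(marriage == 1)) return(0)    # no ever-married rows
  if (!any(marriage == 2)) return(100)  # no never-married rows
  sum(count[marriage == 1]) / sum(count)
}

# Split into province/year/residence groups and apply the helper to each one
groups <- split(df1, list(df1$province, df1$year, df1$residence), drop = TRUE)
res <- do.call(rbind, lapply(groups, function(g) {
  data.frame(province  = g$province[1],
             year      = g$year[1],
             residence = g$residence[1],
             sub       = prop_ever_married(g$marriage, g$count))
}))
res <- res[order(res$province, res$year, res$residence), ]
rownames(res) <- NULL
res
```

Keeping the rule in one small function makes the two special cases easy to test on their own, at the cost of being more verbose than the grouped summarise.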

Related

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region, by comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), it is treated as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
drop_na(region, urban) %>%
group_by(i) %>%
mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
summarize(tot_region = sum(reg_change),
tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
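As a cross-check on the lag-based approach, here is a minimal base R sketch that counts changes in a single vector. The helper name count_changes is hypothetical, and the two vectors are typed in by hand from the i = 4 rows shown in the question. Unlike drop_na(region, urban), this drops NAs per column rather than per row.

```r
# Hypothetical helper: number of times consecutive non-missing values differ
count_changes <- function(x) {
  x <- x[!is.na(x)]               # ignore missing observations
  if (length(x) < 2) return(0L)   # nothing to compare
  sum(diff(x) != 0)               # TRUE wherever the value changed
}

# region and urban for i = 4, copied from the question (17 rows)
region_4 <- c(1, NA, 4, 4, rep(1, 11), NA, 4)
urban_4  <- c(1, NA, rep(1, 13), NA, 1)

count_changes(region_4)  # 1 -> 4, 4 -> 1, 1 -> 4
count_changes(urban_4)   # urban never changes
```

For the full data, something like tapply(df$region, df$i, count_changes) would apply the helper to each individual, assuming the rows are already sorted by year within i.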

Trying to keep values of a column based on the unique values of two other columns

I want to keep only the 2 largest values in a column of a df according to the unique pair of values in two other columns. e.g., I have this df:
df <- data.frame('ID' = c(1,1,1,2,2,3,4,4,4,5),
'YEAR' = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
'WAGES' = c(100,98,60,120,80,300,50,40,30,500));
I want to drop the 3rd and 9th rows; in other words, keep only the two largest values of the WAGES column within each group. The df has roughly 300,000 rows.
You can use dplyr's top_n:
library(dplyr)
df %>%
group_by(ID) %>%
top_n(n = 2, wt = WAGES)
## A tibble: 8 x 3
## Groups: ID [5]
# ID YEAR WAGES
# <dbl> <dbl> <dbl>
#1 1 2002 100
#2 1 2002 98
#3 2 2002 120
#4 2 2003 80
#5 3 2005 300
#6 4 2010 50
#7 4 2011 40
#8 5 2008 500
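In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A minimal sketch of the same idea with the newer verb, using the df from the question; with_ties = FALSE is an added assumption so that at most two rows per ID are returned even when wages tie.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  ID    = c(1,1,1,2,2,3,4,4,4,5),
  YEAR  = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
  WAGES = c(100,98,60,120,80,300,50,40,30,500)
)

# Keep the two rows with the largest WAGES within each ID
top2 <- df %>%
  group_by(ID) %>%
  slice_max(WAGES, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```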
If I understood your question correctly, using base R (note that this drops the two rows with the largest WAGES in the whole data frame, not per ID):
for (i in 1:2) {
max_row <- which.max(df$WAGES)
df <- df[-c(max_row), ]
}
df
# ID YEAR WAGES
# 1 1 2002 100
# 2 1 2002 98
# 3 1 2003 60
# 4 2 2002 120
# 5 2 2003 80
# 7 4 2010 50
# 8 4 2011 40
# 9 4 2012 30
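Note the - and the , in df <- df[-c(max_row), ]: the minus sign drops the selected row, and the comma makes the index apply to rows rather than columns.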
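If the goal is instead to keep the two largest WAGES within each ID (which is what dropping the 3rd and 9th rows implies), a base R sketch could rank the wages inside each ID with ave() and filter on the rank. This is an alternative to the loop above, not a description of it; the variable names rank_in_id and top2 are hypothetical.

```r
# Data from the question
df <- data.frame(
  ID    = c(1,1,1,2,2,3,4,4,4,5),
  YEAR  = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
  WAGES = c(100,98,60,120,80,300,50,40,30,500)
)

# Rank wages within each ID, largest first (negating turns max into rank 1);
# ties.method = "first" breaks ties by row order so ranks are whole numbers
rank_in_id <- ave(-df$WAGES, df$ID,
                  FUN = function(x) rank(x, ties.method = "first"))

# Keep rows ranked 1 or 2 inside their ID; original row order is preserved
top2 <- df[rank_in_id <= 2, ]
top2
```

Because this is vectorised over the whole column, it should also be workable on a data frame of roughly 300,000 rows.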

How to change date format(d.m.Y) to year(Y) & find annual cumulative sum?

I have a two-column data frame x, as shown below. The "Publication.Date" column carries the format "%d.%m.%Y". Is there any way to create a new column "year" with the format "%Y" from "Publication.Date"?
head(x,10)
Publication.Date n
1 1979-09-05 1
2 1979-09-19 1
3 1980-03-19 1
4 1980-10-01 1
5 1980-12-10 1
6 1981-01-07 1
7 1981-04-02 1
8 1981-05-06 1
9 1981-11-18 1
10 1982-01-20 2
I tried creating a new column with the cumulative sum using dplyr (as shown below), but what I actually want is a new column "Annual cumulative sum, N", obtained by adding up "n" annually.
y <- mutate(x, N=cumsum(n))
head(y,10)
Publication.Date n N
1 1979-09-05 1 1
2 1979-09-19 1 2
3 1980-03-19 1 3
4 1980-10-01 1 4
5 1980-12-10 1 5
6 1981-01-07 1 6
7 1981-04-02 1 7
8 1981-05-06 1 8
9 1981-11-18 1 9
10 1982-01-20 2 11
My desired outcome is shown below. I would appreciate any advice. Thanks.
Year n N
1 1979 2 2
3 1980 3 5
6 1981 4 9
10 1982 2 11
You could do this manually, but I would take the year function from data.table and do something like this directly on your original data set x:
library(data.table)
library(dplyr)  # provides %>%, group_by, tally and mutate
x %>%
group_by(Year = year(Publication.Date)) %>%
tally(wt = n) %>%  # weight by the existing n column explicitly
mutate(N = cumsum(n))
# Source: local data frame [4 x 3]
#
# Year n N
# (int) (int) (int)
# 1 1979 2 2
# 2 1980 3 5
# 3 1981 4 9
# 4 1982 2 11
Though I would just do this, without calculating n beforehand:
x %>%
count(Year = year(Publication.Date)) %>%
mutate(N = cumsum(n))
# Source: local data frame [4 x 3]
#
# Year n N
# (int) (int) (int)
# 1 1979 2 2
# 2 1980 3 5
# 3 1981 4 9
# 4 1982 1 10
This doesn't exactly match your desired output, because you predefined n without providing the full data, but this approach seems better to me anyway.
We can extract the 'Year' using regex, group by it, and use summarise to get the desired output. Starting from 'y' in the OP's post:
y %>%
group_by(Year= sub('-.*', '', Publication.Date)) %>%
summarise(n= sum(n), N= last(N))
# Year n N
# (chr) (int) (int)
#1 1979 2 2
#2 1980 3 5
#3 1981 4 9
#4 1982 2 11
Or use year from library(lubridate) to extract 'Year' and use summarise.
library(lubridate)
y %>%
group_by(Year = year(as.Date(Publication.Date))) %>%
summarise(n= sum(n), N= last(N))
# Year n N
# (int) (int) (int)
#1 1979 2 2
#2 1980 3 5
#3 1981 4 9
#4 1982 2 11
If we are using data.table, we convert the initial dataset to a 'data.table' (setDT(x)), group by 'Year' (extracted using year), get the sum of 'n', and create a new column 'N' as the cumsum of 'n'.
library(data.table)
setDT(x)[, list(n= sum(n)), .(Year= year(Publication.Date))][, N:= cumsum(n)][]
# Year n N
#1: 1979 2 2
#2: 1980 3 5
#3: 1981 4 9
#4: 1982 2 11
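For comparison, the same result can be sketched in base R alone. The data frame below is typed in from the ten rows printed in the question; the names annual and Year are illustrative. If the dates were stored as strings such as "05.09.1979", as.Date() would need format = "%d.%m.%Y".

```r
# The ten rows shown in the question
x <- data.frame(
  Publication.Date = c("1979-09-05", "1979-09-19", "1980-03-19", "1980-10-01",
                       "1980-12-10", "1981-01-07", "1981-04-02", "1981-05-06",
                       "1981-11-18", "1982-01-20"),
  n = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Extract the year; for "%d.%m.%Y" strings use as.Date(..., format = "%d.%m.%Y")
x$Year <- as.integer(format(as.Date(x$Publication.Date), "%Y"))

# Sum n within each year, then accumulate across years
annual <- aggregate(n ~ Year, data = x, FUN = sum)
annual$N <- cumsum(annual$n)
annual
```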

R - compare multiple columns and create new columns indicating matches

I'd like to know how I can compare multiple columns to the values in a single column, then use those matches to create a table of differences. I have a political dataset of policy outcomes, and whether certain organizations supported or opposed those outcomes, by year. Here's some mock data:
Outcome 0 means the law never happened, outcome 1 means it happened.
For organizations, a negative number means they opposed the law and positive means they supported it:
set.seed(123)
Data <- data.frame(
year = sample(1998:2004, 200, replace = TRUE),
outcome = sample(0:1, 200, replace = TRUE),
union = sample(-1:1, 200, replace = TRUE),
chamber = sample(-1:1, 200, replace = TRUE),
pharma = sample(-1:1, 200, replace = TRUE),
gun = sample(-1:1, 200, replace = TRUE),
dem = sample(-1:1, 200, replace = TRUE),
repub = sample(-1:1, 200, replace = TRUE)
)
I would like to know how many times an organization matched the support or opposition of the union, per year.
I imagine it's going to be some table like this, where a match equals 1 and anything else -1 (there are also many NAs in the data where organizations take no position):
Data$contra <- ifelse(Data$union == Data$chamber, 1, -1)
In the dataset, there's about 50 organizations in consecutive columns. It seems unwieldy to create 50 new columns, one for each match. Even if that is the best way to do it, I don't know how to apply the function to create 50 new columns.
Eventually, I'd like to create a heatmap or a way to visualize which organizations match the union column. But, first, I think, I need some kind of table of data.
Thanks for your help!
When you say "I would like to know how many times an organization matched the support or opposition of the union, per year.", I'm assuming that you want the net number of agreements: the number of times a 1/1 or a -1/-1 pairing occurred, minus the number of disagreements, ignoring the cases where one of the votes was 0.
Before running your code I used set.seed(123) for reproducibility:
> head(Data)
year outcome union chamber pharma gun dem repub
1 2000 0 1 -1 0 -1 1 -1
2 2003 1 -1 1 0 0 1 -1
3 2000 1 1 -1 -1 -1 0 -1
4 2004 1 0 -1 -1 1 1 0
5 2004 0 0 -1 -1 1 0 -1
6 1998 1 0 1 1 0 1 1
> head( Data[-(1:3)] * Data[[3]])
chamber pharma gun dem repub
1 -1 0 -1 1 -1
2 -1 0 0 -1 1
3 -1 -1 -1 0 -1
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
This makes 1/1 and -1/-1 pairings equal 1, -1/1 and 1/-1 pairings equal -1, and all others equal 0. Now one can aggregate this by year:
> head( aggregate( Data[-(1:3)] * Data[[3]], Data[1], sum) )
year chamber pharma gun dem repub
1 1998 0 -2 1 2 6
2 1999 0 0 2 4 3
3 2000 -3 2 -3 -4 -11
4 2001 2 3 2 9 1
5 2002 0 -1 7 9 1
6 2003 0 -2 -11 5 -2
If instead you wanted only the number of agreements, it would be:
> aggregate( Data[-(1:3)] * Data[[3]], Data[1], function(x) {sum(x==1)} )
year chamber pharma gun dem repub
1 1998 5 4 5 7 9
2 1999 8 7 7 9 9
3 2000 5 8 5 3 3
4 2001 7 9 7 11 4
5 2002 7 6 11 12 9
6 2003 7 5 1 8 5
7 2004 4 4 9 2 4
Using dplyr
library(dplyr)
Data %>%
select(-outcome) %>%
group_by(year, union) %>%
mutate_each(funs(union * .)) %>%
group_by(year) %>%
summarise_each(funs(sum(. == 1)), -union)
You get:
Source: local data frame [7 x 6]
year chamber pharma gun dem repub
1 1998 5 4 5 7 9
2 1999 8 7 7 9 9
3 2000 5 8 5 3 3
4 2001 7 9 7 11 4
5 2002 7 6 11 12 9
6 2003 7 5 1 8 5
7 2004 4 4 9 2 4
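mutate_each(), summarise_each() and funs() are deprecated in recent dplyr releases. A minimal sketch of the same count of agreements using across(), shown on a small hand-made data frame (tiny is hypothetical, not the simulated Data) so the result can be verified by eye:

```r
library(dplyr)

# Hypothetical toy data: union's position and two organizations
tiny <- data.frame(
  year    = c(2000, 2000, 2001, 2001),
  union   = c(1, -1, 0, 1),
  chamber = c(1, 1, -1, -1),
  gun     = c(-1, -1, 1, 1)
)

# For each organization column, the product with union is 1 only when both
# took the same non-zero position; count those rows per year
agreements <- tiny %>%
  group_by(year) %>%
  summarise(across(chamber:gun, ~ sum(.x * union == 1)))

agreements
```

With the real data, a column range such as chamber:repub (or whichever range covers the organization columns) would replace chamber:gun.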
Using gather() from tidyr to get data in a tall format and ggvis heatmap
library(dplyr)
library(tidyr)
library(ggvis)
Data %>%
select(-outcome) %>%
group_by(year, union) %>%
mutate_each(funs(union * .)) %>%
group_by(year) %>%
summarise_each(funs(sum(. == 1)), -union) %>%
gather(org, value, -year) %>%
mutate(org = as.factor(org), year = as.factor(year)) %>%
ggvis(~year, ~org, fill=~value) %>%
layer_rects(width = band(), height = band()) %>%
layer_text(
x = prop("x", ~year, scale = "xcenter"),
y = prop("y", ~org, scale = "ycenter"),
text:=~value, fontSize := 14, fill:="white",
baseline:="middle", align:="center") %>%
scale_nominal("x", padding = 0, points = FALSE) %>%
scale_nominal("y", padding = 0, points = FALSE) %>%
scale_nominal("x", name = "xcenter", padding = 1, points = TRUE) %>%
scale_nominal("y", name = "ycenter", padding = 1, points = TRUE) %>%
hide_legend("fill")
Maybe the following helps. First, create a new data frame that records, for each organisation and each row, whether its position equals the union's (note that two 0s also count as a match here):
match.union <- data.frame(year=Data$year,
lapply(Data[,4:ncol(Data)],function(col) col==Data$union))
It is important to add the column with the year for the next step, which is to sum up the number of agreements with the union per year:
aggregate(.~year,match.union,sum)
The output I get from this is
year chamber pharma gun dem repub
1 1998 11 9 10 9 7
2 1999 10 8 16 9 14
3 2000 8 9 8 7 12
4 2001 7 9 10 9 13
5 2002 11 12 11 13 8
6 2003 5 7 8 5 6
7 2004 13 13 15 15 10

R replace value with the value shown by an index

I have a table called "merged" like:
Nationality CustomerID_count ClusterId
1 argentina 1 1
2 ARGENTINA 26 1
3 ARGENTINO 1 1
4 argentona 1 1
5 boliviana 14 2
6 paragauy 1 3
7 paraguay 1 3
8 PARAGUAY 1 3
I need to create a new Nationality column that holds, for each cluster, the Nationality with the maximum CustomerID_count.
I did this other table with the following code:
merged1<-data.table(merged)
merged2<-merged1[, which.max(CustomerID), by = ClusterId]
And I got:
ClusterId V1
1: 1 2
2: 2 1
3: 3 1
After that I did a merge:
tot<-merge(x=merged, y=merged2, by= "ClusterId", all.x=TRUE)
And I got the following table:
ClusterId Nationality CustomerID V1
1 1 argentina 1 2
2 1 ARGENTINA 26 2
3 1 ARGENTINO 1 2
4 1 argentona 1 2
5 2 boliviana 14 1
6 3 paragauy 1 1
7 3 paraguay 1 1
8 3 PARAGUAY 1 1
But I didn't know how to finish. I tried this:
tot[,5]=tot[V1,5]
Because I want to have for each row the Nationality that is in the line shown in column V1. This didn't work.
How can I do the last part? Also, is there a better way to solve this?
Thanks!
Note that you may have more than one CustomerID_count that matches the maximum value (e.g. all versions of "paraguay" have CustomerID_count == 1, which is the max for that cluster).
It's very easy using the plyr package:
library(plyr)
ddply(merged, .(ClusterId), mutate, Nationality2 = Nationality[CustomerID_count == max(CustomerID_count)])
This could be a good use-case for dplyr:
library(dplyr)
merged <- merged %>%
group_by(ClusterId) %>%
mutate(newNat=Nationality[CustomerID_count == max(CustomerID_count)]) %>%
ungroup
print(merged)
## Source: local data frame [8 x 4]
##
## Nationality CustomerID_count ClusterId newNat
## 1 argentina 1 1 ARGENTINA
## 2 ARGENTINA 26 1 ARGENTINA
## 3 ARGENTINO 1 1 ARGENTINA
## 4 argentona 1 1 ARGENTINA
## 5 boliviana 14 2 boliviana
## 6 paragauy 1 3 paragauy
## 7 paraguay 1 3 paraguay
## 8 PARAGUAY 1 3 PARAGUAY
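The output above only works for cluster 3 because the number of tied maxima happens to equal the group size. If a single label per cluster is wanted, one option is which.max(), which returns the position of the first maximum. This is a sketch under that assumption (first row wins on ties), with the table typed in from the question:

```r
library(dplyr)

# Table from the question
merged <- data.frame(
  Nationality = c("argentina", "ARGENTINA", "ARGENTINO", "argentona",
                  "boliviana", "paragauy", "paraguay", "PARAGUAY"),
  CustomerID_count = c(1, 26, 1, 1, 14, 1, 1, 1),
  ClusterId = c(1, 1, 1, 1, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)

# which.max() picks exactly one row per cluster, so the value is recycled
# across the whole group regardless of how many rows tie for the maximum
result <- merged %>%
  group_by(ClusterId) %>%
  mutate(newNat = Nationality[which.max(CustomerID_count)]) %>%
  ungroup()

result
```

With this rule every row of cluster 3 gets "paragauy", the first of the tied entries; a different tie-break (for example, alphabetical) would need an explicit sort before the mutate.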
