Creating a Bilateral Migration Matrix in Stata or R

My data is similar to the table given below. I have data on the province where an individual is currently living and on the previous province where they were living last year. I want to construct the outflow rate and inflow rate of migrants in Stata or R.
Province   Previous Province
Delhi      Mumbai
Mumbai     Kolkata
Kolkata    Mumbai
Delhi      Mumbai
Kolkata    Delhi
Mumbai     Mumbai
I want a matrix which is as follows:
         Delhi  Mumbai  Kolkata
Delhi      0      2       0
Mumbai     0      1       1
Kolkata    1      1       0

In base R, table() cross-tabulates the two columns directly:
table(df)
Previous_Province
Province Delhi Kolkata Mumbai
Delhi 0 0 2
Kolkata 1 0 1
Mumbai 0 1 1
In dataframe format:
as.data.frame.matrix(table(df))
Delhi Kolkata Mumbai
Delhi 0 0 2
Kolkata 1 0 1
Mumbai 0 1 1
library(tidyr)
pivot_wider(df, names_from = Previous_Province,
            values_from = Previous_Province, values_fn = length,
            values_fill = 0)
# A tibble: 3 × 4
Province Mumbai Kolkata Delhi
<chr> <int> <int> <int>
1 Delhi 2 0 0
2 Mumbai 1 1 0
3 Kolkata 1 0 1
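The answers above give counts rather than the rates the question asks for. Once you have the count matrix, prop.table() can normalize it; this is a minimal sketch, assuming "outflow rate" means the share of migrants leaving each previous province for each destination, and "inflow rate" the share of each current province's in-migrants by origin (adjust the margin to match your definitions):
m <- table(df)                        # Province x Previous_Province counts
outflow <- prop.table(m, margin = 2)  # each column (origin) sums to 1
inflow  <- prop.table(m, margin = 1)  # each row (destination) sums to 1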

Number of reports one week before an event in R

I'm trying to add a column (AC_1_before) to my dataframe that would count the number of reports in the week (or two, three or four weeks) prior to an event within a park.
My dataframe currently looks like this:
View(Reaction_per_park_per_day_3)
Park Date Type_1_2 Coy_season AC_code Year Total_prior_AC
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl>
1 Airways Park 2019-01-14 1 1 3 2019 0
2 Airways Park 2019-01-16 0 1 2 2019 1
3 Airways Park 2019-01-24 0 1 2 2019 2
4 Auburn Bay 2021-03-02 1 1 1 2021 0
5 Auburn Bay 2021-03-03 0 1 1 2021 1
6 Auburn Bay 2021-05-08 0 1 1 2021 2
7 Bears Paw 2019-05-22 0 2 1 2019 0
8 Bears Paw 2019-05-22 0 2 2 2019 1
Where Type_1_2 represents a specific reaction, Coy_season refers to a season, AC_code represents a treatment, and Total_prior_AC represents the total number of events prior to a report within a park.
With the added column, I would like my dataframe to look like this:
Park Date Type_1_2 Coy_season AC_code Year Total_prior_AC AC_1_before
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Airways Park 2019-01-14 1 1 3 2019 0 0
2 Airways Park 2019-01-16 0 1 2 2019 1 1
3 Airways Park 2019-01-24 0 1 2 2019 2 1
4 Auburn Bay 2021-03-02 1 1 1 2021 0 0
5 Auburn Bay 2021-03-03 0 1 1 2021 1 1
6 Auburn Bay 2021-05-08 0 1 1 2021 2 0
7 Bears Paw 2019-05-22 0 2 1 2019 0 0
8 Bears Paw 2019-05-22 0 2 2 2019 1 1
I tried this:
library(lubridate)
library(dplyr)
Reaction_per_park_per_day_4 <- Reaction_per_park_per_day_3 %>%
  group_by(Park, Date) %>%
  mutate(Start_date = min(Date)) %>%
  group_by(Park, Date, Start_date) %>%
  summarise(AC_1_before = sum(Date <= Start_date & Date >= Start_date - weeks(1)),
            .groups = "drop")
This does not seem to work; although the code does run, the result is not correct (I get 1s where I should get 0s, and the sums are often wrong). By grouping by Park and Date, I also group together events that were conducted in the same park on the same day, which I do not want to do.
Any ideas on how I could do this?
If I understood you correctly, one way to do this could be to use a for loop. For simplicity I made a new dataframe:
library(dplyr)
library(lubridate)
Reaction_per_park_per_day_3 <- data.frame(
  "Park" = c(rep("Airways Park", 3), rep("Auburn Bay", 3), rep("Bears Paw", 2)),
  "Date" = as.POSIXct(c("2019-01-14", "2019-01-16", "2019-01-24", "2021-03-02",
                        "2021-03-03", "2021-05-08", "2019-05-22", "2019-05-22")),
  "Type_1_2" = c(1, 0, 0, 1, 0, 0, 0, 0),
  "Coy_season" = c(1, 1, 1, 1, 1, 1, 2, 2),
  "AC_code" = c(3, 2, 2, 1, 1, 1, 1, 2),
  "Year" = c(2019, 2019, 2019, 2021, 2021, 2021, 2019, 2019),
  "Total_prior_AC" = c(0, 1, 2, 0, 1, 2, 0, 1))
for (i in 1:nrow(Reaction_per_park_per_day_3)) {
  Reaction_per_park_per_day_3$AC_1_before[i] <- nrow(
    Reaction_per_park_per_day_3[0:(i - 1), ] %>%
      filter(Park == Reaction_per_park_per_day_3$Park[i] &
               Date %within% interval(Reaction_per_park_per_day_3$Date[i] - 604800,
                                      Reaction_per_park_per_day_3$Date[i])))
  # 604800 is the number of seconds in a week
}
So for each row, count the number of preceding rows that match the current row's Park and whose Date falls within the seven days up to (and including) the current row's Date. I'm sure there's a better way to do this, but this could work, I think!
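If you want to avoid the explicit loop, a grouped version with dplyr should give the same counts as the loop above; this is a sketch, assuming the data frame is sorted by Park and then Date (as in the example):
library(dplyr)
library(lubridate)
Reaction_per_park_per_day_3 %>%
  group_by(Park) %>%
  mutate(AC_1_before = sapply(seq_along(Date), function(i)
    # count earlier rows in this park within the week before row i
    sum(Date[seq_len(i - 1)] >= Date[i] - weeks(1)))) %>%
  ungroup()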

R - clean up data based on preceding and following values

I have a table which is later divided into multiple intervals based on multiple conditions. In some rare cases, I have one or more rows which do not fall into the defined interval, so I'd like to perform some extra clean-up on the data.
For each group (name, location), if the row value in stop == 0, I need to count how many of those rows are in the interval. If that count is fewer than 3, I need to check how many continuous rows are marked as stop == 1 above and below the zero interval. If the rows immediately above and below the zero interval are marked stop == 1, then I need to change the values in the zero interval to 1.
I hope the example will make it clearer:
df <- read.table(text="name location stop
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John London 0
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = FALSE)
You could iterate over the rows, but it seems that all you want to do is replace every instance of 101 with 111 and every 1001 with 1111 in stop. You can do this by collapsing the stop column into a string and then making substitutions with gsub():
stopString = paste0(df$stop, collapse = "")
stopString = gsub("101","111",stopString)
stopString = gsub("1001","1111",stopString)
df$stop = as.numeric(unlist(strsplit(stopString,"")))
> df
name location stop
1 John London 1
2 John London 1
3 John London 1
4 John London 1
5 John London 1
6 John London 1
7 John London 1
8 John London 1
9 John London 1
10 John London 1
11 John London 1
12 John London 1
13 John London 1
14 John London 1
15 John London 1
16 John London 0
17 John New_York 0
18 John New_York 0
19 John New_York 0
20 John New_York 1
21 John New_York 0
Edit: grouping by name and location:
df <- read.table(text="name location stop
John London 1
John London 0
John London 1
John New_York 0
John New_York 1
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = TRUE)
f <- function(x)
{
  stopString = paste0(x, collapse = "")
  stopString = gsub("101", "111", stopString)
  stopString = gsub("1001", "1111", stopString)
  as.numeric(unlist(strsplit(stopString, "")))
}
> df %>% dplyr::group_by(name, location) %>%
    dplyr::mutate(s = f(stop))
# A tibble: 10 x 4
# Groups: name, location [2]
name location stop s
<fct> <fct> <int> <dbl>
1 John London 1 1
2 John London 0 1
3 John London 1 1
4 John New_York 0 0
5 John New_York 1 1
6 John New_York 0 0
7 John New_York 0 0
8 John New_York 0 0
9 John New_York 1 1
10 John New_York 0 0
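An alternative that encodes the stated rule more directly is run-length encoding: flip any interior run of fewer than 3 zeros, since interior zero runs are by construction bounded by runs of 1s. A sketch on the same df; fix_stop is a hypothetical helper name:
fix_stop <- function(x) {
  r <- rle(x)
  n <- length(r$values)
  # flip zero runs shorter than 3 that are not at either end of the group
  flip <- r$values == 0 & r$lengths < 3 & seq_len(n) > 1 & seq_len(n) < n
  r$values[flip] <- 1
  inverse.rle(r)
}
df %>% dplyr::group_by(name, location) %>%
  dplyr::mutate(stop = fix_stop(stop))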

How to manually enter a cell in a dataframe? [duplicate]

This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
dplyr replacing na values in a column based on multiple conditions
(2 answers)
Closed 2 years ago.
This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row where the NA appears changes based on the day, so I would like to assign the value based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where county is New York C:
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to substitute the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
  mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
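Equivalently, replace() inside mutate() does the same assignment and avoids the type coercion that ifelse() can introduce (a sketch on the same df):
library(dplyr)
df <- df %>%
  mutate(FIPS = replace(FIPS, is.na(FIPS) & county == "New York C", 55555))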

Sum up values with same ID from different column in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
My data set sometimes contains multiple observations for the same year as below.
id country ccode year region protest protestnumber duration
201990001 Canada 20 1990 North America 1 1 1
201990002 Canada 20 1990 North America 1 2 1
201990003 Canada 20 1990 North America 1 3 1
201990004 Canada 20 1990 North America 1 4 57
201990005 Canada 20 1990 North America 1 5 2
201990006 Canada 20 1990 North America 1 6 1
201991001 Canada 20 1991 North America 1 1 8
201991002 Canada 20 1991 North America 1 2 5
201992001 Canada 20 1992 North America 1 1 2
201993001 Canada 20 1993 North America 1 1 1
201993002 Canada 20 1993 North America 1 2 62
201994001 Canada 20 1994 North America 1 1 1
201994002 Canada 20 1994 North America 1 2 1
201995001 Canada 20 1995 North America 1 1 1
201995002 Canada 20 1995 North America 1 2 1
201996001 Canada 20 1996 North America 1 1 1
201997001 Canada 20 1997 North America 1 1 13
201997002 Canada 20 1997 North America 1 2 16
I need to sum up all values for the same year into one value per year, so that I receive one value per year in every column. I want to apply this across the whole data set, for all years and countries. Any help is much appreciated. Thank you!
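The question is closed as a duplicate, but for reference, here is a minimal dplyr sketch, assuming the numeric columns protest, protestnumber, and duration are the ones to sum (id is dropped because it is unique to each observation):
library(dplyr)
df %>%
  group_by(country, ccode, region, year) %>%
  summarise(across(c(protest, protestnumber, duration), sum), .groups = "drop")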

subsetting data to first occurrence in R

I'm trying to subset the data so it only preserves the first occurrence of a variable. I'm looking at panel data that traces the careers of workers, and I'm trying to subset the data so that it only shows rows up to the point at which each person became Boss.
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
So I would want the data to look like:
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
2 1990 Jane Manager 0
2 1991 Jane Boss 1
This seems like basic censoring, but for the sake of my analysis this is crucial! Any help would be appreciated.
Here's a dplyr solution that uses two useful window functions lag() and cumall():
df <- read.table(header = TRUE, text = "
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
", stringsAsFactors = FALSE)
library(dplyr)
# Use mutate to see the values of the new variables
df %>%
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), cumall(last_job != "Boss"))
# Use filter to see the results
df %>%
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))
We use lag() to figure out what job each person had in the previous year, and then use cumall() to keep all rows up to the first instance of "Boss". If the data wasn't already sorted by year, you could use lag(job, order_by = year) to make sure lag() used the value of year, rather than the row order, to determine which was "last" year.
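A more compact variant of the same idea uses the numeric job2 flag instead of the job string; a sketch, assuming job2 is 1 exactly on the Boss rows (like the filter above, it keeps people who never became Boss):
df %>%
  group_by(id) %>%
  filter(cumsum(lag(job2, default = 0)) == 0) %>%
  ungroup()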
Base solution:
do.call(
  rbind,
  by(dat, dat$name, function(x) {
    if ("Boss" %in% x$job) x[1:min(which(x$job == "Boss")), ]
  })
)
# id year name job job2
#Bon.1 1 1990 Bon Manager 0
#Bon.2 1 1991 Bon Manager 0
#Bon.3 1 1992 Bon Manager 0
#Bon.4 1 1993 Bon Boss 1
#Jane.6 2 1990 Jane Manager 0
#Jane.7 2 1991 Jane Boss 1
An alternative base solution:
dat$keep <- with(dat,
  ave(job == "Boss", name, FUN = function(x) if (1 %in% x) cumsum(x) else 2)
)
with(dat, dat[keep == 0 | (job == "Boss" & keep == 1), ])
# id year name job job2 keep
#1 1 1990 Bon Manager 0 0
#2 1 1991 Bon Manager 0 0
#3 1 1992 Bon Manager 0 0
#4 1 1993 Bon Boss 1 1
#6 2 1990 Jane Manager 0 0
#7 2 1991 Jane Boss 1 1
And a data.table solution:
library(data.table)
dat <- as.data.table(dat)
dat[,if("Boss" %in% job) .SD[1:min(which(job=="Boss"))],by=name]
# name id year job job2
#1: Bon 1 1990 Manager 0
#2: Bon 1 1991 Manager 0
#3: Bon 1 1992 Manager 0
#4: Bon 1 1993 Boss 1
#5: Jane 2 1990 Manager 0
#6: Jane 2 1991 Boss 1
The 'sqldf' library can also do the work:
library(sqldf)
miny <- sqldf("select id, min(year) as year from df where job='Boss' group by id")
sqldf("select df.* from df join miny on (df.id=miny.id and df.year<=miny.year)")
If your data is stored in a data frame called df:
library(plyr)
ddply(.data = df, .variables = c("name"), .fun = function(x) {
  i <- which(x$job == "Boss")[1]
  if (!is.na(i)) x[1:i, ]  # omit lifelong managers
})
# id year name job job2
# 1 1 1990 Bon Manager 0
# 2 1 1991 Bon Manager 0
# 3 1 1992 Bon Manager 0
# 4 1 1993 Bon Boss 1
# 5 2 1990 Jane Manager 0
# 6 2 1991 Jane Boss 1
