How to match datasets in R based on two non-unique variables?

I have two dataframes, one of failed firms and one of non-failed firms.
Both consist of rows of firm-level observations; the variables include the firm's industry, the year in which the financial information was recorded, the firm's total asset size, and others.
I want to match each failed firm with one non-failed firm of the same industry, total asset size, and year of financial information.
I am happy to throw away observations with no match, and if one failed firm matches multiple non-failed firms, I am happy to just pick one at random.
Currently, my code looks like this:
merge(cessdurc1[cessdurc1$afcyear=="2007",], cessdura[cessdura$afcyear=="2007",], by=c("ssic", "total_assets"), all.x=TRUE, all.y=FALSE)
This does not work, because the merge columns are not unique.
My data looks like this:
>head(alivefirms)
failed within year total_assets afcyear ssic
1 0 9e+07 2007 20
2 0 7e+06 2007 43
3 0 7e+05 2007 46
4 0 1e+07 2007 82
5 0 1e+08 2007 93
6 0 1e+06 2007 11
> head(failedfirms)
failed within year total_assets afcyear ssic
26 1 20000 2007 41
79 1 5000 2007 73
105 1 400 2007 30
127 1 4000 2007 18
133 1 2000 2007 70
154 1 10000 2007 41
I want the output to match failed firms to alive firms that have the same ssic, total_assets and afcyear, so something that looks like this:
> head(wantedoutput)
failed within year total_assets afcyear ssic
26 1 20000 2007 41
79 0 20000 2007 41
105 1 400 2007 30
127 0 400 2007 30
133 1 2000 2007 70
154 0 2000 2007 70
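One way to do this in base R is to merge on all three keys and then keep one randomly chosen control per failed firm. A minimal sketch with tiny hypothetical data frames (the id column and the exact-equality match on total_assets are assumptions; real total-asset values rarely match exactly, so nearest-neighbour matching may be preferable):

```r
# Hypothetical miniature versions of the two data frames
failedfirms <- data.frame(id = c(26, 79),
                          total_assets = c(20000, 400),
                          afcyear = 2007, ssic = c(41, 30))
alivefirms  <- data.frame(id = c(1, 2, 3),
                          total_assets = c(20000, 20000, 400),
                          afcyear = 2007, ssic = c(41, 41, 30))

# Inner join: keeps only failed firms that have at least one control
cand <- merge(failedfirms, alivefirms,
              by = c("ssic", "total_assets", "afcyear"),
              suffixes = c(".failed", ".alive"))

# If a failed firm matches several controls, keep one at random:
# shuffle the candidate rows, then keep the first row per failed firm
set.seed(1)
cand    <- cand[sample(nrow(cand)), ]
matched <- cand[!duplicated(cand$id.failed), ]
```

Dropping `all.x`/`all.y` (i.e. an inner join) discards unmatched failed firms automatically, which matches the "happy to throw away" requirement.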

Related

Importing Data in R

I want to import data into R but I am getting a few errors. I downloaded my ".CSV" file to my computer, set the file path with setwd("C:/Users/intellipaat/Desktop/BLOG/files"), and then ran read.data <- read.csv("file1.csv"), but the console returned an error like this:
read.data <- read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
object 'file1.csv' not found
What should I do about this? I also tried the internet-link route, but again I encountered a problem.
I wrote this:
install.packages("XML")
install.packages("RCurl")
and then ran the following to load the packages:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console returned this error:
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you could help me with this.
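A note on the first error before turning to the URL issue: read.csv(file1.csv) without quotes makes R look for an object named file1.csv, hence "object not found". Quoting the filename fixes it. A self-contained illustration using a temporary file:

```r
# Unquoted, R treats file1.csv as a variable name and fails:
# read.data <- read.csv(file1.csv)    # Error: object 'file1.csv' not found

# Quoted, it is interpreted as a file path relative to the working directory:
# read.data <- read.csv("file1.csv")

# Demonstration with a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)
read.data <- read.csv(tmp)
```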
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
    # convert years 2017 - 2020 to character because pivot_longer()
    # requires all columns to be of the same data type
    mutate_at(3:6, as.character) %>%
    pivot_longer(-c(Classification, Jurisdiction),
                 names_to = "Year", values_to = "Rank") %>%
    # convert Rank and Year to numeric values (introducing NA values)
    mutate_at(c("Rank", "Year"), as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>

Implementation of the Difference in Difference (DID) model in R with panel data

I am trying to implement the difference-in-differences model in R in order to analyze the effect of a regulation on households.
I have panel data, meaning that I have observations for different households at different periods.
Let's say (for example) that I have the data below:
Name Europe? 2000 2001 2002 2003 2004
A YES 56 84 95 32 15
B NO 63 45 9 25 14
C NO 47 72 123 54 95
D YES 28 64 874 14 358
E YES 45 68 48 32 674
If the regulation came into force in 2003, and only in Europe, how can I implement this in R?
I know that I have to create one dummy variable for the treatment group (European) and another for the years after the regulation came into force, but how does it work exactly?
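A common way to set this up is to reshape the wide table to long format and regress the outcome on the treatment dummy, the post-period dummy, and their interaction; the interaction coefficient is the difference-in-differences estimate. The sketch below uses base R only, and the long-format names (y, post) are my own choices:

```r
# Hypothetical data in the wide layout shown above
df <- data.frame(Name   = c("A", "B", "C", "D", "E"),
                 Europe = c(1, 0, 0, 1, 1),
                 yr2000 = c(56, 63, 47, 28, 45),
                 yr2001 = c(84, 45, 72, 64, 68),
                 yr2002 = c(95,  9, 123, 874, 48),
                 yr2003 = c(32, 25, 54, 14, 32),
                 yr2004 = c(15, 14, 95, 358, 674))

# Reshape to long format: one row per household-year
long <- reshape(df, direction = "long",
                varying = paste0("yr", 2000:2004), v.names = "y",
                timevar = "year", times = 2000:2004, idvar = "Name")

# Post-regulation indicator; the Europe:post interaction is the DID effect
long$post <- as.numeric(long$year >= 2003)
fit <- lm(y ~ Europe * post, data = long)
summary(fit)
```

With real panel data you would usually also cluster standard errors at the household level (e.g. via the sandwich/lmtest packages) rather than rely on plain lm() inference.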

creating unique sequence for October 15 to April 30th following year- R

Basically, I'm looking at snowpack data. I want to assign a unique value to each date (column "snowday") over the period 15 October to 15 May the following year (the winter season, of course), ~215 days. Then I want to add a column "snowmonth" that corresponds to the sequential months of the seasonal data, as well as a "snowyear" column that gives the year in which each seasonal record starts.
There are some missing dates, but instead of finding those dates and inserting NAs into the rows, I've opted to skip that step and instead go the sequential route, which can then be plotted with respect to "snowmonth".
Basically, I just need to get the "snowday" sequence of about 1:215 (+1 for leap years) down a column, and the rest I can do myself. The data looks like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 NA NA
12 27 1955 24 1955-12-27 361 NA NA
12 28 1955 24 1955-12-28 362 NA NA
12 29 1955 24 1955-12-29 363 NA NA
12 30 1955 26 1955-12-30 364 NA NA
12 31 1955 26 1955-12-31 365 NA NA
1 1 1956 25 1956-01-01 1 NA NA
1 2 1956 25 1956-01-02 2 NA NA
1 3 1956 26 1956-01-03 3 NA NA
library(data.table)
library(lubridate)  # for yday()

man <- read.delim('mansfieldstake.txt', header=TRUE, check.names=FALSE)
setDT(man)  # convert to a data.table in place
man[is.na(man)] <- 0
man$date <- paste(man$yy, man$mm, man$dd, sep="-")
man$yearday <- NA  # day of the year, 1-365
colnames(man) <- c("month","day","year","depth","date","yearday")
man$date <- as.Date(man$date)
man$yearday <- yday(man$date)
man$snowday <- NA
man$snowmonth <- NA
man[420:500,]
head(man)
output would look something like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 73 3
12 27 1955 24 1955-12-27 361 74 3
12 28 1955 24 1955-12-28 362 75 3
12 29 1955 24 1955-12-29 363 76 3
12 30 1955 26 1955-12-30 364 77 3
12 31 1955 26 1955-12-31 365 78 3
1 1 1956 25 1956-01-01 1 79 4
1 2 1956 25 1956-01-02 2 80 4
1 3 1956 26 1956-01-03 3 81 4
I've thought about loops and all that, but they're inefficient, and leap years mess things up as well. This has become more challenging than I thought; a good first project, though!
I'm just looking for a simple sequence here, dropping all non-snow months. Thanks to anybody who's got input!
If I understand correctly that snowday should be the number of days since the beginning of the season, all you need to make this column using data.table is:
day_one <- as.Date("1955-10-15")  # season start: 15 October
man[, snowday := as.integer(date - day_one) + 1]
If all you want is a sequence of unique values, then seq() is your best bet.
Then you can create the snowmonth using:
library(lubridate)
man[, snowmonth := floor(-time_length(interval(date, day_one), unit = "month")) + 1]
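The fixed day_one above anchors only the 1955-56 season. A base-R sketch that restarts the count every autumn by computing each date's own season start (the dates here are illustrative):

```r
# Anchor each date to the 15 October that starts its own season
# ("snow year"), so the day count restarts every autumn and leap
# years take care of themselves.
dates <- as.Date(c("1955-12-26", "1956-01-01", "1956-10-15"))

mo <- as.integer(format(dates, "%m"))
yr <- as.integer(format(dates, "%Y"))
snow_year    <- ifelse(mo >= 10, yr, yr - 1)           # year the season started
season_start <- as.Date(paste0(snow_year, "-10-15"))
snowday      <- as.integer(dates - season_start) + 1   # 15 Oct = day 1
snowmonth    <- (mo - 10) %% 12 + 1                    # Oct = 1, ..., May = 8
```

For 1955-12-26 this gives snowday 73 and snowmonth 3, matching the wanted output in the question.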

meta regression and bubble plot with metafor package in R

I am working on a meta-regression of the association between year and medication prevalence with the 'metafor' package.
The model I used is 'rma.glmm', a mixed-effects model with a logit transformation, from the 'metafor' package.
My R script is below:
dat<-escalc(xi=A, ni=Sample, measure="PLO")
print(dat)
model_A<-rma.glmm(xi=A, ni=Sample, measure="PLO", mods=~year)
print(model_A)
I did get significant results, so I made a bubble plot for this model. But I found there is no way to produce a bubble plot straight from the 'rma.glmm' fit, so I did something alternative:
wi<-1/dat$vi
plot(year, transf.ilogit(dat$yi), cex=wi)
Apparently I got some 'crazy' results. My questions are:
1> How can I weight the points in the bubble plot by study sample size? The points should be proportional to the study weights. Here I used wi <- 1/dat$vi; vi stands for the sampling variance, which I got from escalc(). But it doesn't seem right.
2> Is my model correct for investigating the association between year and medication prevalence? When I tried an 'rma' model I got totally different results.
3> Is there an alternative way to produce the bubble plot? I also tried:
percentage<-A/Sample
plot(year, percentage)
The database is below:
study year Sample A
study 1 2007 414 364
study 2 2010 142 99
study 3 1999 15 0
study 4 2000 17 0
study 5 2001 20 0
study 6 2002 22 5
study 7 2003 21 6
study 8 2004 24 7
study 9 1999 203 82
study 10 2009 647 436
study 11 2009 200 169
study 12 2010 156 128
study 13 2009 10753 6374
study 14 2007 143 109
study 15 2001 247 36
study 16 2004 318 184
study 17 2012 611 565
study 18 2013 180 167
study 19 2006 344 337
study 20 2007 209 103
study 21 2013 470 354
study 22 2010 180 146
study 23 2005 522 302
study 24 2000 62 30
study 25 2001 79 39
study 26 2002 85 43
study 27 2011 548 307
study 28 2009 218 216
study 29 2006 2901 2332
study 30 2008 464 259
study 31 2010 650 393
study 32 2008 2514 704
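On question 1, a sketch of inverse-variance point sizing in base R, using the first few studies from the table above. The sampling variance of a logit-transformed proportion is approximately 1/x + 1/(n-x), which is what escalc(measure="PLO") returns; the continuity correction and scaling constant below are my own choices, and recent versions of metafor also provide regplot() for rma() fits:

```r
# First four studies from the table above
year   <- c(2007, 2010, 1999, 2000)
Sample <- c(414, 142, 15, 17)
A      <- c(364, 99, 0, 0)

x  <- A + 0.5               # continuity correction for 0 or n counts
n  <- Sample + 1
vi <- 1/x + 1/(n - x)       # sampling variance on the logit scale
wi <- 1/vi                  # inverse-variance weight

# Scale point areas so the most precise study gets cex = 3
cex_sizes <- 3 * sqrt(wi / max(wi))
plot(year, A / Sample, cex = cex_sizes,
     xlab = "Year", ylab = "Prevalence")
```

cex scales the point radius, so taking the square root of the relative weight makes the point *area* proportional to the weight, which is the usual bubble-plot convention.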

Selecting unique non-repeating values

I have some panel data from 2004-2007 which I would like to select according to unique values. To be more precise, I'm trying to find the entries and exits of individual stores throughout the period. Data sample:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
As well as how many stores have entered the market, which would imply:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
UPDATE:
I didn't mention that it should also identify incumbent stores, such as:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
Since I'm pretty new to R, I've been struggling to do this right even on a year-by-year basis. Any suggestions?
Using the data.table package, if your data.frame is called df:
dt = data.table(df)
exit = dt[,list(ExitYear = max(year)),by=store]
exit = exit[ExitYear != 2007] #Or whatever the "current year" is for this table
enter = dt[,list(EntryYear = min(year)),by=store]
enter = enter[EntryYear != 2003]
UPDATE
To get all columns instead of just the year and store, you can do:
exit = dt[,.SD[year == max(year)], by=store]
exit[year != 2007]
store year rev space market
1: 2 2004 35000 800 136
2: 6 2006 7000 345 191
Using only base R functions, this is pretty simple:
> subset(aggregate(df["year"],df["store"],max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(df["year"],df["store"],min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
or using formula syntax:
> subset(aggregate(year~store,df,max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(year~store,df,min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
Update: Getting all the columns isn't possible with aggregate, so we can use base by instead. by isn't as clever at reassembling the result:
Filter(function(x)x$year!=2007,by(df,df$store,function(s)s[s$year==max(s$year),]))
$`2`
store year rev space market
5 2 2004 35000 800 136
$`6`
store year rev space market
18 6 2006 7000 345 191
So we need to take that step - let's build a little wrapper:
by2=function(x,c,...){Reduce(rbind,by(x,x[c],simplify=FALSE,...))}
And now use that instead:
> subset(by2(df,"store",function(s)s[s$year==max(s$year),]),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:
statmatch=function(column,stat)function(df){df[df[column]==stat(df[column]),]}
> subset(by2(df,"store",statmatch("year",max)),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
Dplyr
Using all of these base functions which don't really resemble each other starts to get fiddly after a while, so it's a great idea to learn and use the excellent (and performant) dplyr package:
> df %>% group_by(store) %>%
arrange(-year) %>% slice(1) %>%
filter(year != 2007) %>% ungroup
Source: local data frame [2 x 5]
store year rev space market
1 2 2004 35000 800 136
2 6 2006 7000 345 191
and
> df %>% group_by(store) %>%
arrange(+year) %>% slice(1) %>%
filter(year != 2004) %>% ungroup
Source: local data frame [3 x 5]
store year rev space market
1 4 2005 17500 320 136
2 5 2005 45000 580 191
3 7 2007 10000 500 191
NB The ungroup is not strictly necessary here, but puts the table back in a default state for further calculations.
