I want to import data into R, but I am getting a few errors. I downloaded my .csv file to my computer and set the working directory with setwd("C:/Users/intellipaat/Desktop/BLOG/files"). Then I ran read.data <- read.csv("file1.csv"), but the console returned this error:
> read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do about this? I also tried reading the data from a web link, but ran into another problem.
This is what I wrote:
install.packages("XML")
install.packages("RCurl")
Then I loaded the packages and requested the page:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console gave me this error:
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be grateful for any help with this.
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma-separated values file, so read.csv() is not the right tool for it.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
  # convert years 2017 - 2020 to character because pivot_longer()
  # requires all columns to be of same data type
  mutate_at(3:6, as.character) %>%
  pivot_longer(-c(Classification, Jurisdiction),
               names_to = "Year", values_to = "Rank") %>%
  # convert Rank and Year to numeric values (introducing NA values)
  mutate_at(c("Rank", "Year"), as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
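On the read.csv() error in the question: the console echo shows read.csv(file1.csv) without quotes, so R looked for an object named file1.csv instead of a file. A minimal sketch of the difference, assuming a small made-up CSV written to a temporary folder (the folder, the column names id and value, and the variable msg are illustrative, not from the original post):

```r
# Write a tiny CSV to a temporary folder so the example is self-contained
csv_path <- file.path(tempdir(), "file1.csv")
write.csv(data.frame(id = 1:3, value = c(10, 20, 30)),
          csv_path, row.names = FALSE)

# File name passed as a character string: R opens the file
read.data <- read.csv(csv_path)
head(read.data)

# File name without quotes: R treats file1.csv as a variable name,
# which does not exist, so read.csv() fails. tryCatch() captures the message.
msg <- tryCatch(read.csv(file1.csv),
                error = function(e) conditionMessage(e))
print(msg)
```

With setwd() pointing at the folder that holds the file, read.csv("file1.csv") with the quotes should behave like the first call.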
I am trying to implement a difference-in-differences (diff-in-diff) model in R in order to analyze the effect of a regulation on households.
I have panel data, meaning that I have observations for different households at different periods.
Let's say, for example, that I have the data below:
Name Europe? 2000 2001 2002 2003 2004
A YES 56 84 95 32 15
B NO 63 45 9 25 14
C NO 47 72 123 54 95
D YES 28 64 874 14 358
E YES 45 68 48 32 674
If the regulation came into force in 2003 and only in Europe, how can I implement this in R?
I know that I have to create one dummy variable for the treated group (European) and another for the years after the regulation came into force, but how does it work exactly?
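One possible way to set this up, as a minimal sketch in base R using the example numbers above: reshape the data to long format, build the two dummies, and regress the outcome on their interaction. The object and column names (wide, long, value, treated, post) are made up for illustration, and a real analysis would usually also consider household and year fixed effects and clustered standard errors.

```r
# Example data from the question, in wide format (one column per year)
wide <- data.frame(
  Name   = c("A", "B", "C", "D", "E"),
  Europe = c("YES", "NO", "NO", "YES", "YES"),
  `2000` = c(56, 63, 47, 28, 45),
  `2001` = c(84, 45, 72, 64, 68),
  `2002` = c(95, 9, 123, 874, 48),
  `2003` = c(32, 25, 54, 14, 32),
  `2004` = c(15, 14, 95, 358, 674),
  check.names = FALSE
)

# Reshape to long format: one row per household and year
years <- 2000:2004
long <- data.frame(
  Name   = rep(wide$Name, times = length(years)),
  Europe = rep(wide$Europe, times = length(years)),
  year   = rep(years, each = nrow(wide)),
  value  = unlist(wide[as.character(years)], use.names = FALSE)
)

# Dummy for the treated group and dummy for the post-regulation period
long$treated <- as.integer(long$Europe == "YES")
long$post    <- as.integer(long$year >= 2003)

# The coefficient on treated:post is the diff-in-diff estimate
fit <- lm(value ~ treated * post, data = long)
did <- coef(fit)["treated:post"]
summary(fit)
```

In this simple two-group, two-period setup the interaction coefficient equals (treated after minus treated before) minus (control after minus control before), computed on group means.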
Here are the first few observations of the data frame. All variables are categorical, and some have more than 100 levels.
ac2.surcat ac2.typeonenum ac2.countrynum ac2.sumnewnum
1 Average survival rate 248 556 16
2 Poor survival rate 82 375 12
3 Poor survival rate 73 104 16
4 Below average survival rate 252 <NA> 6
5 Poor survival rate 252 200 11
6 Below average survival rate 252 83 19
7 Poor survival rate 252 200 12
8 Poor survival rate 210 111 5
9 Poor survival rate 252 178 19
10 Poor survival rate 252 178 18
11 Poor survival rate 230 200 5
I know that the randomForest package in R only accepts categorical predictors with up to 53 levels. This data is already simplified: the number of levels has been reduced from the 4000s to the 100s, and I cannot simplify it further.
The dependent variable is ac2$surcat (the first column).
This is air crash data. The last three columns are 'type of aircraft', 'country' and 'type of crash' respectively (the independent variables).
I have two dataframes, one of failed firms, and one of non-failed firms.
Each consists of rows of firm observations. The variables include the industry of the firm, the year in which the financial information was recorded, the firm's total assets, and others.
I want to match each failed firm with one non-failed firm of the same industry, the same total asset size and the same year of recorded financial information.
I am happy to throw away observations with no match. If one failed firm matches multiple non-failed firms, I am happy to choose one at random.
Currently, my code looks like this:
merge(cessdurc1[cessdurc1$afcyear=="2007",], cessdura[cessdura$afcyear=="2007",], by=c("ssic", "total_assets"), all.x=TRUE, all.y=FALSE)
This does not work, because the combination of columns used for matching would need to be unique.
My data looks like this:
> head(alivefirms)
failed within year total_assets afcyear ssic
1 0 9e+07 2007 20
2 0 7e+06 2007 43
3 0 7e+05 2007 46
4 0 1e+07 2007 82
5 0 1e+08 2007 93
6 0 1e+06 2007 11
> head(failedfirms)
failed within year total_assets afcyear ssic
26 1 20000 2007 41
79 1 5000 2007 73
105 1 400 2007 30
127 1 4000 2007 18
133 1 2000 2007 70
154 1 10000 2007 41
I want the output to match failed firms to alive firms that have the same ssic, total_assets and afcyear, so something that looks like this:
> head(wantedoutput)
failed within year total_assets afcyear ssic
26 1 20000 2007 41
79 0 20000 2007 41
105 1 400 2007 30
127 0 400 2007 30
133 1 2000 2007 70
154 0 2000 2007 70
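One way this kind of matching could be approached, as a rough base R sketch on small made-up data: shuffle the alive firms, keep the first alive firm per key combination (which makes the choice random), and pair it with each failed firm that shares the key. The column name failed stands in for "failed within year", and the helper names (keys, alive_one, pair, matched) are illustrative. Note that two failed firms with identical keys would be paired with the same alive firm here, and that matching on exact total_assets assumes the values are stored identically in both data frames.

```r
set.seed(1)  # only so the random choice is reproducible

# Small made-up stand-ins for the two data frames in the question
alivefirms <- data.frame(failed = 0,
                         total_assets = c(20000, 20000, 400, 9e7),
                         afcyear = 2007,
                         ssic = c(41, 41, 30, 20))
failedfirms <- data.frame(failed = 1,
                          total_assets = c(20000, 400, 5000),
                          afcyear = 2007,
                          ssic = c(41, 30, 73))
keys <- c("ssic", "total_assets", "afcyear")

# Shuffle alive firms, then keep one row per key combination:
# the row that survives is a random pick among the candidates
alive_shuffled <- alivefirms[sample(nrow(alivefirms)), ]
alive_one <- alive_shuffled[!duplicated(alive_shuffled[keys]), ]

# Build a single text key per row to compare the two data frames
failed_key <- do.call(paste, c(failedfirms[keys], sep = "|"))
alive_key  <- do.call(paste, c(alive_one[keys], sep = "|"))

# Keep failed firms that have a match, and fetch their alive partner
has_match      <- failed_key %in% alive_key
matched_failed <- failedfirms[has_match, ]
matched_alive  <- alive_one[match(failed_key[has_match], alive_key), ]

# Interleave the pairs: failed firm first, then its matched alive firm
matched_failed$pair <- seq_len(nrow(matched_failed))
matched_alive$pair  <- seq_len(nrow(matched_alive))
matched <- rbind(matched_failed, matched_alive)
matched <- matched[order(matched$pair, -matched$failed), ]
matched
```

Failed firms without any alive counterpart (the 5000 / ssic 73 row in this toy data) are dropped, which matches the stated willingness to discard unmatched observations.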
I have a sample dataset with 45 rows, shown below.
itemid title release_date
16 573 Body Snatchers 1993
17 670 Body Snatchers 1993
41 1645 Butcher Boy, The 1998
42 1650 Butcher Boy, The 1998
1 218 Cape Fear 1991
18 673 Cape Fear 1962
27 1234 Chairman of the Board 1998
43 1654 Chairman of the Board 1998
2 246 Chasing Amy 1997
5 268 Chasing Amy 1997
11 309 Deceiver 1997
37 1606 Deceiver 1997
28 1256 Designated Mourner, The 1997
29 1257 Designated Mourner, The 1997
12 329 Desperate Measures 1998
13 348 Desperate Measures 1998
9 304 Fly Away Home 1996
15 500 Fly Away Home 1996
26 1175 Hugo Pool 1997
39 1617 Hugo Pool 1997
31 1395 Hurricane Streets 1998
38 1607 Hurricane Streets 1998
10 305 Ice Storm, The 1997
21 865 Ice Storm, The 1997
4 266 Kull the Conqueror 1997
19 680 Kull the Conqueror 1997
22 876 Money Talks 1997
24 881 Money Talks 1997
35 1477 Nightwatch 1997
40 1625 Nightwatch 1997
6 274 Sabrina 1995
14 486 Sabrina 1954
33 1442 Scarlet Letter, The 1995
36 1542 Scarlet Letter, The 1926
3 251 Shall We Dance? 1996
30 1286 Shall We Dance? 1937
32 1429 Sliding Doors 1998
45 1680 Sliding Doors 1998
20 711 Substance of Fire, The 1996
44 1658 Substance of Fire, The 1996
23 878 That Darn Cat! 1997
25 1003 That Darn Cat! 1997
34 1444 That Darn Cat! 1965
7 297 Ulee's Gold 1997
8 303 Ulee's Gold 1997
What I am trying to do is merge the item IDs of rows that have the same movie title and the same release date. For example, the movie 'Ulee's Gold' has two item IDs, 297 and 303. I am trying to find a way to automate checking the release date of the movie and, if it is the same, replacing the second itemid of that movie with the first. For the time being I have done it manually by putting the item IDs into two vectors, x and y, and replacing one with the other in a loop. I want to know if there is a better way of getting this done, because this sample has only 18 movies with multiple IDs but the full dataset has a few hundred. Finding and processing these manually would be very time consuming.
Here is the code I have used so far.
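# duplicate item IDs to be replaced
x <- c(670,1650,1654,268,1606,1257,348,500,1617,1607,865,680,881,1625,1680,1658,1003,303)
# item IDs to keep, in the same order as x
y <- c(573,1645,1234,246,309,1256,329,304,1175,1395,305,266,876,1477,1429,711,878,297)
for (i in seq_along(x)) {
  # replace by value rather than by row position
  df$itemid[df$itemid == x[i]] <- y[i]
}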
Is there a better way to get this done?
I think you can do this in dplyr quite directly.
Here is a brief example built from a few rows of your data:
itemid <- c(878,1003,1444,297,303)
title <- c(rep("That Darn Cat!", 3), rep("Ulee's Gold", 2))
year <- c(1997,1997,1965,1997,1997)
temp <- data.frame(itemid,title,year)
temp
library(dplyr)
temp %>% group_by(title,year) %>% mutate(itemid1 = min(itemid))
(I renamed 'release_date' to 'year'.) This groups the rows by title and year, finds the minimum itemid within each group, and mutate() stores that lowest itemid in a new variable, itemid1.
which gives:
# itemid title year itemid1
#1 878 That Darn Cat! 1997 878
#2 1003 That Darn Cat! 1997 878
#3 1444 That Darn Cat! 1965 1444
#4 297 Ulee's Gold 1997 297
#5 303 Ulee's Gold 1997 297
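If dplyr is not available, the same grouping idea can be sketched in base R with ave(), which applies a function within groups and returns a vector aligned with the original rows. This reuses the example vectors from the answer; whether to overwrite itemid or keep a separate itemid1 column is a choice left open here.

```r
# Same example data as above
itemid <- c(878, 1003, 1444, 297, 303)
title  <- c(rep("That Darn Cat!", 3), rep("Ulee's Gold", 2))
year   <- c(1997, 1997, 1965, 1997, 1997)
temp   <- data.frame(itemid, title, year)

# Lowest itemid within each title/year group, aligned with the rows of temp
temp$itemid1 <- ave(temp$itemid, temp$title, temp$year, FUN = min)
temp
```

To replace the IDs in place instead of adding a column, the result of ave() could be assigned back to temp$itemid.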