It looks like the website is blocking direct access from Curl.
library(XML)
library(RCurl)
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURL(theurl)
page # fail
[1] "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don't have permission to access \"http://www.london2012.com/medals/medal-count/\" on this server.<P>\nReference #18.358a503f.1343590091.c056ae2\n</BODY>\n</HTML>\n"
Let's see whether readHTMLTable can read the table directly from the URL.
page <- readHTMLTable(theurl)
No luck there:
Error in htmlParse(doc) : error in creating parser for http://www.london2012.com/medals/medal-count/
How would you go about getting this table into R?
Update: in response to comments and some experimenting, faking a user agent string worked to get the content. But readHTMLTable still returns an error.
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")
It looks like this works:
rr <- readHTMLTable(page,header=FALSE)
rr2 <- setNames(rr[[1]],
c("rank","country","gold","silver","bronze","junk","total"))
rr3 <- subset(rr2,select=-junk)
## oops, numbers all got turned into factors ...
tmpf <- function(x) { as.numeric(as.character(x)) }
rr3[,-2] <- sapply(rr3[,-2],tmpf)
head(rr3)
## rank country gold silver bronze total
## 1 1 People's Republic of China 6 4 2 12
## 2 2 United States of America 3 5 3 11
## 3 3 Italy 2 3 2 7
## 4 4 Republic of Korea 2 1 2 5
## 5 5 France 2 1 1 4
## 6 6 Democratic People's Republic of Korea 2 0 1 3
with(rr3,dotchart(total,country))
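The as.numeric(as.character(x)) step above matters because calling as.numeric directly on a factor returns its internal level codes rather than the printed values. A minimal sketch with a made-up factor (the values here are illustrative, not taken from the medal table):

```r
# Hypothetical factor holding numbers stored as text
f <- factor(c("6", "3", "12"))

# Levels are sorted as strings: "12", "3", "6"
codes  <- as.numeric(f)                # level codes, not the values
values <- as.numeric(as.character(f))  # the values actually printed

print(codes)
print(values)
```

Going through as.character first recovers the labels, which can then be parsed as numbers.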
Here is what I came up with using regular expressions. It is very specific to this page and definitely not better than the readHTMLTable approach in the other answer; it is more to show that you can go quite far with text mining in R:
# file <- "~/Documents/R/medals.html"
# page <- readChar(file,file.info(file)$size)
library(RCurl)
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")
# Remove html tags:
page <- gsub("<(.|\n)*?>","",page)
# Remove newlines and tabs:
page <- gsub("\\n","",page)
# match table:
page <- regmatches(page,regexpr("(?<=Total).*(?=Detailed)",page,perl=TRUE))
# Extract country+medals+rank
codes <- regmatches(page,gregexpr("\\d+[^\\r]*\\d+",page,perl=TRUE))[[1]]
codes <- codes[seq(1,length(codes)-2,by=2)]
# Extract country and medals:
Names <- gsub("\\d","",codes)
Medals <- sapply(regmatches(codes,gregexpr("\\d",codes)),function(x)x[(length(x)-2):length(x)])
# Create data frame:
data.frame(
Country = Names,
Gold = as.numeric(Medals[1,]),
Silver = as.numeric(Medals[2,]),
Bronze = as.numeric(Medals[3,]))
And the output:
Country Gold Silver Bronze
1 People's Republic of China 6 4 2
2 United States of America 3 5 3
3 Italy 2 3 2
4 Republic of Korea 2 1 2
5 France 2 1 1
6 Democratic People's Republic of Korea 2 0 1
7 Kazakhstan 2 0 0
8 Australia 1 1 1
9 Brazil 1 1 1
10 Hungary 1 1 1
11 Netherlands 1 1 0
12 Russian Federation 1 0 3
13 Georgia 1 0 0
14 South Africa 1 0 0
15 Japan 0 2 3
16 Great Britain 0 1 1
17 Colombia 0 1 0
18 Cuba 0 1 0
19 Poland 0 1 0
20 Romania 0 1 0
21 Taipei (Chinese Taipei) 0 1 0
22 Azerbaijan 0 0 1
23 Belgium 0 0 1
24 Canada 0 0 1
25 Republic of Moldova 0 0 1
26 Norway 0 0 1
27 Serbia 0 0 1
28 Slovakia 0 0 1
29 Ukraine 0 0 1
30 Uzbekistan 0 0 1
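To make the lookaround step easier to follow, here is a small offline sketch on a made-up string (the text in txt is an assumption standing in for the stripped page, not the real page content). It also uses \\d+ rather than \\d, since the single-digit pattern in the code above appears to assume every medal count is below 10:

```r
# Hypothetical stand-in for the page after the tags have been removed
txt <- "Rank Country Gold Silver Bronze Total 1 China 6 4 2 12 Detailed medal list"

# Keep only what lies between "Total" and "Detailed" (perl = TRUE enables lookarounds)
body <- regmatches(txt, regexpr("(?<=Total).*(?=Detailed)", txt, perl = TRUE))
body <- trimws(body)

# Pull out whole numbers, so that a two-digit count such as 12 stays intact
nums <- regmatches(body, gregexpr("\\d+", body))[[1]]

print(body)
print(nums)
```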
I apologize for what I imagine is a fairly simple question. Unfortunately while my searches on here have returned a number of results for making adjacency matrices from dyadic dataframes, I haven't come across anything for the opposite process - creating a dyadic dataframe from an adjacency matrix.
Here is a subset of the xls file I am working from in R:
ccode 2 20 31 40 41 42
year 2010 2010 2010 2010 2010 2010
abbrev USA CAN BHM CUB HAI DOM
2 2010 USA 0 1 1 1 1 1
20 2010 CAN 1 0 0 1 1 1
31 2010 BHM 1 1 0 1 1 0
40 2010 CUB 1 1 1 0 1 1
41 2010 HAI 1 1 1 1 0 1
42 2010 DOM 1 1 0 1 1 0
51 2010 JAM 1 1 0 1 0 0
I want it to look like this:
ccode ccode2 year Value
2 20 2010 1
2 31 2010 1
2 40 2010 1
...
20 31 2010 0
20 40 2010 1
20 41 2010 1
...
What R package(s)/code is necessary to perform such a transformation?
For those wishing to access the full data, it is the DIPCON 3.0 database and it can be found here: https://www.volgy.org/projects-and-data
path = "DIPCON_3.0.xlsx" # Put the correct path to your file
library(readxl)
sheets = excel_sheets(path)
my_read = function(x){
dat = read_excel(path,x)
c_names = 1:4 # The column names, which are also the same as the row names
col_names = do.call(paste,data.frame(t(dat[c_names,-c_names])))
row_names = do.call(paste,dat[-c_names,c_names])
dat1 = as.table(matrix(as.numeric(unlist(dat[-c_names,-c_names])),
nrow(dat)-4,dimnames=list(row_names,col_names)))
d = data.frame(dat1)
l = nrow(d)
proto = data.frame(ccode=numeric(l),Year=numeric(l),C1=character(l),C2=character(l))
m = do.call(cbind,lapply(d[2:1],function(x) strcapture("(\\d+) (\\d+) (\\w+) (\\w+)",x,proto)))
cbind(m,d[3])
}
my_read(sheets[1])
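The core of the reshaping can also be sketched in base R without the Excel file. This is only an illustration under stated assumptions: the matrix m below is typed in by hand from the first three countries in the question, and the year is filled in as a constant because the subset shown only covers 2010.

```r
# Hand-typed 3 x 3 subset of the adjacency matrix (USA = 2, CAN = 20, BHM = 31)
codes <- c("2", "20", "31")
m <- matrix(c(0, 1, 1,
              1, 0, 0,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(ccode = codes, ccode2 = codes))

# as.table + as.data.frame gives one row per cell of the matrix
dyads <- as.data.frame(as.table(m), responseName = "Value",
                       stringsAsFactors = FALSE)
dyads$ccode  <- as.integer(dyads$ccode)
dyads$ccode2 <- as.integer(dyads$ccode2)
dyads$year   <- 2010  # assumption: a single year, as in the subset shown

# Drop the diagonal (a country paired with itself) and sort the dyads
dyads <- dyads[dyads$ccode != dyads$ccode2, ]
dyads <- dyads[order(dyads$ccode, dyads$ccode2),
               c("ccode", "ccode2", "year", "Value")]
rownames(dyads) <- NULL
print(dyads)
```

Naming the dimnames (ccode, ccode2) is what gives the resulting columns their names; without names they would be Var1 and Var2.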
I have file 1 with attributes like (706 attributes)
Matchid TeamName Opp_TeamName TeamRank Opp_TeamRank Team_Top10RankingBatsman
1 New Zealand Bangladesh 1 10 2
2 New Zealand India 1 2 2
3 India England 2 5 1
4 Australia England 6 5 1
and file 2 with attributes (706 attributes)
id actual predicted error
3 79 206.828 127.828
1 90 182.522 92.522
2 101 193.486 92.486
4 89 174.889 85.889
I want to match "Matchid and id" of both files and add file2 attributes in file1 so that the final result is
Matchid TeamName Opp_TeamName TeamRank Opp_TeamRank Team_Top10RankingBatsman id actual predicted error
1 New Zealand Bangladesh 1 10 2 1 90 182.522 92.522
2 New Zealand India 1 2 2 2 101 193.486 92.486
3 India England 2 5 1 3 79 206.828 127.828
4 Australia England 6 5 1 4 89 174.889 85.889
So far I have tried a simple merge and it didn't work. How can I achieve this?
merge(file1,file2,by.x="Matchid",by.y="id")
Maybe this way?
The dplyr way:
library(dplyr)
joined <- inner_join(file1, file2, by = c("Matchid" = "id"))
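For reference, here is a self-contained base R sketch of the same join on a cut-down version of the two tables from the question (only a few of the columns are typed in). Note that merge keeps the key under its by.x name and drops the by.y column, so id is copied back by hand to match the desired output:

```r
# Cut-down versions of the two files from the question
file1 <- data.frame(
  Matchid      = 1:4,
  TeamName     = c("New Zealand", "New Zealand", "India", "Australia"),
  Opp_TeamName = c("Bangladesh", "India", "England", "England"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  id        = c(3, 1, 2, 4),
  actual    = c(79, 90, 101, 89),
  predicted = c(206.828, 182.522, 193.486, 174.889)
)

# Match Matchid against id; the result is sorted by the key
joined <- merge(file1, file2, by.x = "Matchid", by.y = "id")

# merge() drops the by.y column, so restore it if it is wanted in the result
joined$id <- joined$Matchid
print(joined)
```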
Dear all: I'm thinking of creating a "time to event" variable in R and need your expertise to get it done. Below you can see a small sample of my data. The time variable is in years; it starts at 0 and resets itself after a year in which Event = 1.
In the real data the observation period starts in 1989 but there are some countries (that had not ratified certain conventions before 1989) that come in later on, like the US in the example below. Whenever it starts, the first value for the "time to event" variable should be zero.
Thanks for all suggestions!
Country year Event **Time-to-event**
USA 2000 0 0
USA 2001 0 1
USA 2002 1 2
USA 2003 0 0
USA 2004 0 1
USA 2005 0 2
USA 2006 1 3
USA 2007 0 0
USA 2008 1 1
USA 2009 0 0
USA 2010 0 1
USA 2011 0 2
USA 2012 0 3
We can use ave
i1 <- with(df2, ave(Event, Country, FUN=
function(x) cumsum(c(TRUE, diff(x)<0))))
df2$Time_to_event <- with(df2, ave(i1, i1, Country, FUN= seq_along)-1)
df2$Time_to_event
#[1] 0 1 2 0 1 2 3 0 1 0 1 2 3
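The diff(x) < 0 test above starts a new spell only when Event drops from 1 to 0, so two events in consecutive years would not reset the counter. A possible variant, sketched here on the sample data from the question, builds the spell id from the lagged Event instead. It assumes rows are already sorted by year within each country:

```r
# Sample data from the question
df2 <- data.frame(
  Country = "USA",
  year    = 2000:2012,
  Event   = c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)
)

# Spell id: goes up by one in the row after each event, per country
spell <- with(df2, ave(Event, Country,
                       FUN = function(x) cumsum(c(0, head(x, -1)))))

# Count rows within each (country, spell) pair, starting at zero
df2$Time_to_event <- ave(seq_along(spell), df2$Country, spell,
                         FUN = seq_along) - 1
print(df2$Time_to_event)
```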
Since count_until(x) is always equal to rev(count_since(rev(x))), one might use something like this:
count_since<-function(trigger)
{
i <- seq_along(trigger)
(i - cummax(i*trigger))*cummax(trigger)
}
count_until<-function(x)rev(count_since(rev(x)))
> count_until(1:10%%5==0)
[1] 4 3 2 1 0 4 3 2 1 0
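A short walk-through on a made-up trigger vector (ev is an assumption for illustration) may help show what the two helpers return at the edges: count_since gives 0 before the first trigger, and count_until gives 0 after the last one.

```r
count_since <- function(trigger) {
  i <- seq_along(trigger)
  # i - cummax(i * trigger): distance to the most recent trigger position
  # cummax(trigger): zero until the first trigger has been seen
  (i - cummax(i * trigger)) * cummax(trigger)
}
count_until <- function(x) rev(count_since(rev(x)))

# Hypothetical example: triggers at positions 3 and 6
ev <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

since <- count_since(ev)  # 0 0 0 1 2 0 1
until <- count_until(ev)  # 2 1 0 2 1 0 0
print(since)
print(until)
```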
I have a table called "merged" like:
Nationality CustomerID_count ClusterId
1 argentina 1 1
2 ARGENTINA 26 1
3 ARGENTINO 1 1
4 argentona 1 1
5 boliviana 14 2
6 paragauy 1 3
7 paraguay 1 3
8 PARAGUAY 1 3
I need to create a new Nationality column that holds, for each cluster, the Nationality with the maximum CustomerID_count.
I did this other table with the following code:
merged1<-data.table(merged)
merged2<-merged1[, which.max(CustomerID), by = ClusterId]
And I got:
ClusterId V1
1: 1 2
2: 2 1
3: 3 1
After that I did a merge:
tot<-merge(x=merged, y=merged2, by= "ClusterId", all.x=TRUE)
And I got the following table:
ClusterId Nationality CustomerID V1
1 1 argentina 1 2
2 1 ARGENTINA 26 2
3 1 ARGENTINO 1 2
4 1 argentona 1 2
5 2 boliviana 14 1
6 3 paragauy 1 1
7 3 paraguay 1 1
8 3 PARAGUAY 1 1
But I didn't know how to finish. I tried this:
tot[,5]=tot[V1,5]
Because I want to have for each row the Nationality that is in the line shown in column V1. This didn't work.
How can I do the last part? Also, is there a better way to solve this?
Thanks!
Note that you may have more than one CustomerID_count that matches the maximum value (e.g. all versions of "paraguay" have CustomerID_count == 1, which is the max for that cluster).
It's very easy using the plyr package:
library(plyr)
ddply(merged, .(ClusterId), mutate, Nationality2 = Nationality[CustomerID_count == max(CustomerID_count)])
This could be a good use-case for dplyr:
library(dplyr)
merged <- merged %>%
group_by(ClusterId) %>%
mutate(newNat=Nationality[CustomerID_count == max(CustomerID_count)]) %>%
ungroup
print(merged)
## Source: local data frame [8 x 4]
##
## Nationality CustomerID_count ClusterId newNat
## 1 argentina 1 1 ARGENTINA
## 2 ARGENTINA 26 1 ARGENTINA
## 3 ARGENTINO 1 1 ARGENTINA
## 4 argentona 1 1 ARGENTINA
## 5 boliviana 14 2 boliviana
## 6 paragauy 1 3 paragauy
## 7 paraguay 1 3 paraguay
## 8 PARAGUAY 1 3 PARAGUAY
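Because of the ties mentioned earlier, cluster 3 ends up with three different labels in the output above. If a single label per cluster is wanted, one option is which.max, which returns only the first maximum. A base R sketch on the table from the question follows; breaking ties by first occurrence is an assumption, not something the question specifies:

```r
merged <- data.frame(
  Nationality = c("argentina", "ARGENTINA", "ARGENTINO", "argentona",
                  "boliviana", "paragauy", "paraguay", "PARAGUAY"),
  CustomerID_count = c(1, 26, 1, 1, 14, 1, 1, 1),
  ClusterId = c(1, 1, 1, 1, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)

# For each cluster, find the row index of the first maximum count
top_row <- ave(seq_len(nrow(merged)), merged$ClusterId,
               FUN = function(i) i[which.max(merged$CustomerID_count[i])])

# Look up the Nationality on that row for every member of the cluster
merged$newNat <- merged$Nationality[top_row]
print(merged)
```

With this tie rule cluster 3 gets "paragauy", the first of the three tied rows, so a different rule (for example preferring a known spelling) may be more appropriate for real data.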
I am interested in learning how a specific factor such as foreign investments behaves 5 years before and after change, e.g. outbreak of civil war.
This is the structure of my data (the factor is not included here):
year country change time
2001 A 0 ? (-1)
2002 A 1 0
2003 A 0 ? (+1)
2004 A 0 ? (+2)
2002 B 0 ? (-2)
2003 B 0 ? (-1)
2004 B 1 0
...
I am seeking to replace the question marks by the respective values in brackets, e.g., "-1" for the year prior to change (t-1) and "+1" for the year following change (t+1). The presence of change is coded with 1.
How would you do this? I am grateful for any suggestions.
> dat <- read.table(text="year country change time
+ 2001 A 0 ?(-1)
+ 2002 A 1 0
+ 2003 A 0 ?(+1)
+ 2004 A 0 ?(+2)
+ 2002 B 0 ?(-2)
+ 2003 B 0 ?(-1)
+ 2004 B 1 0
+ ", header=TRUE)
> with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) )
$A
[1] -1 0 1 2
$B
[1] -2 -1 0
> dat$time <-unlist( with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
>
Slightly less complex would be to use ave instead of unlist(tapply(...))
> dat$time <- with(dat, ave(change, country, FUN=function(x) seq(length(x))-which(x==1) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
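Both versions assume every country has exactly one change year; for a country with no change, which(x==1) is empty and the assignment fails. A hedged sketch that returns NA for such countries and then keeps the 5-year window mentioned in the question is below. Country C and the helper name rel_time are made up for illustration, and rows are assumed to be sorted by year with no gaps:

```r
dat <- data.frame(
  year    = c(2001:2004, 2002:2004, 2001:2003),
  country = c(rep("A", 4), rep("B", 3), rep("C", 3)),
  change  = c(0, 1, 0, 0,  0, 0, 1,  0, 0, 0)  # C never experiences a change
)

# Years relative to the first change; NA when the country has none
rel_time <- function(x) {
  onset <- which(x == 1)
  if (length(onset) == 0) return(rep(NA_real_, length(x)))
  seq_along(x) - onset[1]
}

dat$time <- ave(dat$change, dat$country, FUN = rel_time)

# Keep only observations within 5 years before or after the change
around_change <- subset(dat, !is.na(time) & abs(time) <= 5)
print(around_change)
```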