Using the cshapes package in R, I want to create a list of matrices that record, for each year, whether two countries are neighbors or not.
install.packages("cshapes")
library(cshapes)
Running the code for one year (here 1990) works fine:
wmat <- distmatrix(as.Date("1990-1-1"), type="mindist", tolerance=0.5, useGW=FALSE)
This gives a matrix with the following structure:
A B C D
1 A 0 0 210 0
2 B 0 0 637 305
3 C 210 637 0 73
4 D 0 305 73 0
In a next step, I set all combinations with 0 distance between two countries to 1, all other combinations to 0, and the diagonal to 0 again:
wmat[wmat>0]<-5
wmat[wmat==0]<-1
wmat[wmat==5]<-0
diag(wmat)<-0
This gives me the following matrix:
A B C D
1 A 0 1 0 1
2 B 1 0 0 0
3 C 0 0 0 0
4 D 1 0 0 0
What I struggle to do is to automatically create matrices for all the years between 1960 and 2014, do the corrections for each year and store the results into a list of matrices where I can recall each matrix by the respective year.
Any inputs are highly welcome.
You could try
lst <- lapply(1960:2014, function(x) {
wmat <- distmatrix(as.Date(paste0(x, '-1-1')),
type="mindist", tolerance=0.5, useGW=FALSE)
wmat[wmat>0]<-5
wmat[wmat==0]<-1
wmat[wmat==5]<-0
diag(wmat)<-0
wmat
}
)
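Since the question asks to recall each matrix by its year, the list can also be named. The following is only a sketch: fake_distmatrix() is a hypothetical stand-in for distmatrix() (it returns the small example matrix from the question for every year) so that the snippet runs without cshapes, and to_adjacency() is an illustrative helper name.

```r
# Hypothetical stand-in for cshapes::distmatrix(); returns the toy matrix from the question.
fake_distmatrix <- function(year) {
  matrix(c(  0,   0, 210,   0,
             0,   0, 637, 305,
           210, 637,   0,  73,
             0, 305,  73,   0),
         nrow = 4, byrow = TRUE,
         dimnames = list(LETTERS[1:4], LETTERS[1:4]))
}

# Distance 0 means "neighbors" (1); anything else is 0; the diagonal is reset to 0.
to_adjacency <- function(d) {
  adj <- (d == 0) + 0
  diag(adj) <- 0
  adj
}

years <- 1960:2014
lst <- setNames(lapply(years, function(y) to_adjacency(fake_distmatrix(y))), years)

# Look up a matrix by year (names are stored as character strings).
lst[["1990"]]
```

With the real package, replacing fake_distmatrix(y) by the distmatrix() call from the answer should give the same structure.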
I have this data set of batting data from the GameDay servers:
eliasID teamID gameID gameDate h hr bb so rbi ab runs t d lob sb cs sf hbp
1 430203 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 2 0 0 0 0 0 0 0 0
2 459714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 0 0 1 0 3 1 0 0 1 0 0 0 0
3 325392 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 1 0 0 0 0 0 0 0 0
4 429801 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 0 3 0 0 0 2 0 0 0 0
5 456714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 4 0 0 0 2 0 0 0 0
6 150449 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 1 4 0 0 0 2 0 0 0 0
ba ID gameDateFormat year Year
1 0.345 1 2010-04-01 2010 NA
2 0.250 2 2010-04-01 2010 NA
3 0.319 3 2010-04-01 2010 NA
4 0.327 4 2010-04-01 2010 NA
5 0.333 5 2010-04-01 2010 NA
6 0.217 6 2010-04-01 2010 NA
My issue is that I'd like to build a running total of at bats (ab) for each game, but total only those at bats from games with gameDate lower than the row's gameDate, and with games in the same gameYear.
I've looked at the for loop, and dplyr has been suggested, but these all want to sum the whole ab column for one player, when I need an ongoing sum added to each game to show the player's ab total for the year so far at that game.
I'm attempting to build the equivalent of the kind of statistics you see on baseball-ref.com.
In English, I'm looking for:
For each EliasID, gameID in Batting:
sum(ab) for the EliasID where the gameDate < this row's gameDate and the gameYear = this row's gameYear
What do you think?
Welcome to R programming. Because you did not provide complete sample data (i.e. dput() rather than just a print()), this answer makes a couple of assumptions:
your data.frame is called df. You can replace this name with the actual name.
your gameDate is an actual date vector, not just a string vector. If it is a string, change it to a date with df$gameDate <- as.Date(df$gameDate, format = "%m/%d/%Y")
It appears that what you want is a "cumulative sum with lag, by group." (I recommend that you make this your title to make it clear that this is what you want.) Let's look at both of those parts.
Cumulative sum, with lag
As suggested in this answer, an easy way to introduce a lag of 1 into cumsum() is to replace the vector x1, x2, ... xn with 0, x1, x2, ... xn-1. Thus:
cumsumLag1 <- function(x){
cumsum(c(0, head(x, n = -1))) # see ?cumsum and ?head, particularly the note on negative n
}
# test it out on first 5 counting numbers
cumsumLag1(1:5) # returns: 0 1 3 6 10
Your dataset should be in the right chronological order for the cumulative function. So you could do something with ?order like:
df <- df[order(df$gameDate), ]
but we will use arrange() in dplyr (see below) to keep things simple.
By group
There are many ways to do sum (and similar functions) by group. Perhaps the simplest syntax is %>% group_by(thing) in dplyr. You want to group by year, and perhaps other variables (maybe teamID or eliasID). One really unclear part of your question is what you're trying to group by, so please just focus on the concept here. The first challenge is that you don't have a year variable, and there are lots of ways to do this. Let's just do something like this:
df$gameYear <- as.POSIXlt(df$gameDate)$year + 1900 # see ?POSIXlt for more details
Putting it together
Using the chain operator %>%, we just sequence what we've already reviewed.
library(dplyr)
cumsumLag1 <- function(x) cumsum(c(0, head(x, n = -1)))
df %>%
mutate(gameYear = as.POSIXlt(gameDate)$year + 1900) %>%
arrange(gameDate) %>%
group_by(gameYear) %>%
mutate(priorAtBats = cumsumLag1(ab))
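If the running total should restart for each player as well as each year, as the question's pseudocode suggests, the same idea can be sketched in base R with ave(). The batting data frame below is made up for illustration; only the column names eliasID, gameDate and ab come from the question.

```r
# Made-up sample data: two players, two seasons.
batting <- data.frame(
  eliasID  = c(1, 1, 1, 2, 2, 1, 2),
  gameDate = as.Date(c("2010-04-01", "2010-04-03", "2010-04-05",
                       "2010-04-01", "2010-04-03",
                       "2011-04-01", "2011-04-02")),
  ab       = c(3, 4, 2, 1, 5, 4, 2)
)

# Cumulative sum lagged by one: each row sees only the rows before it.
cumsumLag1 <- function(x) cumsum(c(0, head(x, n = -1)))

# Sort chronologically, derive the year, then apply the lagged sum per player and year.
batting <- batting[order(batting$gameDate), ]
batting$gameYear <- as.integer(format(batting$gameDate, "%Y"))
batting$priorAtBats <- ave(batting$ab, batting$eliasID, batting$gameYear,
                           FUN = cumsumLag1)
batting
```

In the dplyr version, the equivalent change would be group_by(eliasID, gameYear) instead of group_by(gameYear).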
I have data where X-column is review and then columns of words that most reviews have. Is it possible to create a graph where nodes would be reviews and edges would be words?
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0
Another idea is to cluster the reviews in the graph according to the words used and their frequency.
Thank you very much. Any help is appreciated.
Here are three approaches to explore the relationships in your data:
par(mfrow=c(1,3))
# two mode network (reviews+words)
library(igraph)
set.seed(1)
g <- graph_from_data_frame(subset(reshape2::melt(df, 1), !!value, -value)[2:1])
V(g)$type <- bipartite.mapping(g)$type
plot(g, layout = layout_as_bipartite(g)[, 2:1], vertex.color = V(g)$type+1L)
# just the reviews:
library(reshape2)
lst <- with(subset(melt(df, 1), !!value)[2:1], split(X, variable))
lst <- lst[lengths(lst)>1]
lst <- lapply(lst, function(x) t(combn(x, m=2)))
g <- graph_from_edgelist(do.call(rbind, lst), directed = FALSE)
E(g)$label <- rep(names(lst), sapply(lst, nrow))
plot(g)
# review clustering
plot(hclust(dist(df[-1], method = "binary"))) # written without %>% so no pipe package is needed
Data:
df <- read.table(header=T, text="
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0")
PS: There may be a shortcut to no. 2 (reviews as nodes & words as edges) - feel free to add it.
With your example data one could create a graph consisting of nodes (each review) and edges (two reviews are connected when they use the same word). You could also weight the edges according to how many words two reviews have in common, and use different shapes or colors for the edges to represent the different words.
There are several ways to create a graph with your data. First, you could create an adjacency matrix, where each row and each column represents a review. The adjacency matrix only records whether two reviews have a word in common: if they share at least one word the entry is 1, otherwise it is 0.
The adjacency matrix would look similar to this, where the letters denote column and row labels:
Review A B C D
A 0 1 1 1
B 1 0 0 1
C 1 0 0 1
D 1 1 1 0
With the R command graph_from_adjacency_matrix( ) in the igraph package you could then create a graph and use the plot functions.
Second, you could also create a weight matrix, which counts how many words are shared between two reviews. Using the same command with graph_from_adjacency_matrix( , weighted=TRUE) you could create a graph from that matrix:
Review A B C D
A 0 2 3 1
B 2 0 0 2
C 3 0 0 2
D 1 2 2 0
You can find a good introduction to network analysis with the igraph package here: http://kateto.net/networks-r-igraph
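As a rough sketch of how both matrices could be derived from the review-by-word table in the question, one option is a matrix cross-product in base R. The names inc, w and adj are illustrative, and counting each shared word once (regardless of frequency) is an assumption.

```r
# Example data copied from the question: one row per review, one column per word.
df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0")

# Incidence matrix: 1 if the review uses the word at all, 0 otherwise.
inc <- (as.matrix(df[-1]) > 0) + 0
rownames(inc) <- df$X

# Weight matrix: number of distinct words each pair of reviews shares.
w <- tcrossprod(inc)
diag(w) <- 0

# Adjacency matrix: 1 if the pair shares at least one word.
adj <- (w > 0) + 0

w
# With igraph installed, a weighted graph could then be built with:
# g <- igraph::graph_from_adjacency_matrix(w, mode = "undirected", weighted = TRUE)
```

In this small sample only reviews 1 and 4 share words (action and age), so they are the only connected pair.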
Third, you could specify the graph from an edges data frame and a nodes data frame.
The nodes data frame would contain a short id of each node and maybe the name and all other information you may want to include about the nodes :
id long_review_name
R1 A
R2 B
R3 C
R4 D
The edges data frame collects all the information about the connections between two reviews. First and most importantly, it records all edges in the columns from and to. It could also contain the frequency as a weight on the edges, and type would denote which word connects the two nodes:
from to weight type
R1 R2 1 american
R1 R2 1 age
R1 R3 2 american
R1 R3 1 age
R1 R4 1 age
R2 R4 2 american
To turn the edges and the node data frame into a graph you would need to use the command graph_from_data_frame(d=links, vertices=nodes).
Say I got a data.table (can also be data.frame, doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
Particularly, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I don't get it together properly. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add the condition "and a.id < b.id" to the join if each row should not be paired with itself and the reverse of each pair should be excluded, or add "and not a.id = b.id" to exclude only self pairs.
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), as.integer(Var1) < as.integer(Var2) & Freq == 1)[1:2] # Var1/Var2 are factors, so compare their integer codes
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here are a network graph and a raster plot of the above:
# network graph
library(igraph)
g <- graph_from_adjacency_matrix(m) # current name of the deprecated graph.adjacency()
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
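To get the result table the question asks for (only rows that are similar to at least one other row), the similarity matrix can be reduced with rowSums(). This is a small sketch on a made-up articles data frame with the column names a to e from the question; the values are hypothetical.

```r
# Made-up articles: rows 1 and 2 are within +/- 1 on a, b and c; rows 3 and 4 are not.
articles <- data.frame(
  a = c(1, 2, 10, 5),
  b = c(1, 1, 10, 5),
  c = c(3, 4, 10, 9),
  d = c(100, 0, 7, 7),
  e = c(-5, 50, 1, 1)
)

# Maximum (Chebyshev) distance <= 1 means no column differs by more than 1.
m <- as.matrix(dist(articles[c("a", "b", "c")], method = "maximum")) <= 1
diag(m) <- FALSE  # a row is not counted as similar to itself

# Keep each row once if it has at least one similar partner.
similar <- articles[rowSums(m) > 0, ]
similar
```

Because the rows are selected by a logical index, each qualifying row appears exactly once, which takes care of the duplicate requirement.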
I am quite new to R, so don't expect too much.
What if you build, for each of your columns (which are basically vectors), a matrix of the pairwise differences between the values? Then you can find the combinations that differ by no more than 1 from each other. This gives you the matching pairs for (a). Repeat this for (b) and (c) and keep the pairs that appear in all three.
Alternatively, this can probably be done as a cube as well.
Just a thought.
I am trying to create an adjacency matrix M from a list pList (called s in the example below) containing the column indices that should be set to 1 in each row of M.
For example, M is a 5x4 matrix.
The list contains 5 elements, each of which is a vector of indices.
Example :
s <- list("1210", c("254", "534"), "254", "534", "364")
M <- matrix(0, nrow = 5, ncol = length(unique(unlist(s))), dimnames = list(1:5, unique(unlist(s))))
Currently, my overly simple solution is the brute-force one, with a for loop over the rows of the matrix:
for (i in 1:nrow(M)){
M[i, as.character(s[[i]])] <- 1
}
So that the expected result is :
M
1210 254 534 364
1 1 0 0 0
2 0 1 1 0
3 0 1 0 0
4 0 0 1 0
5 0 0 0 1
The problem is that I have to manipulate matrices with several thousand rows, and it takes too much time.
I am not an "apply" expert, but I wonder if there is a quicker solution.
Thanks
Regards
We can convert the list to a matrix of row/column index, use that index to assign the elements in 'M' to 1.
M[as.matrix(stack(setNames(s, seq_along(s)))[,2:1])] <- 1
M
# 1210 254 534 364
#1 1 0 0 0
#2 0 1 1 0
#3 0 1 0 0
#4 0 0 1 0
#5 0 0 0 1
Or instead of using stack to convert to a data.frame, we can unlist the 's' to get the column index, cbind with row index created by replicating the sequence of list with length of each list element (using lengths) and assign the elements in 'M' to 1.
M[cbind(rep(seq_along(s), lengths(s)), unlist(s))] <- 1
Or yet another option would be to create a sparseMatrix
library(Matrix)
Un1 <- unlist(s)
sparseMatrix(i = rep(seq_along(s), lengths(s)),
j=as.integer(factor(Un1, levels = unique(Un1))),
x=1)
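As a quick sanity check, the matrix-indexing idea can be compared against the original loop on the example data. This sketch uses a numeric index matrix built with match(), which is a small variation on the character-based versions above; M_loop, M_idx, cols and idx are illustrative names.

```r
s <- list("1210", c("254", "534"), "254", "534", "364")
cols <- unique(unlist(s))

# Reference result: the row-by-row loop from the question.
M_loop <- matrix(0, nrow = length(s), ncol = length(cols),
                 dimnames = list(seq_along(s), cols))
for (i in seq_along(s)) M_loop[i, s[[i]]] <- 1

# Vectorised version: one (row, column) pair per list entry, assigned in a single step.
M_idx <- matrix(0, nrow = length(s), ncol = length(cols),
                dimnames = list(seq_along(s), cols))
idx <- cbind(rep(seq_along(s), lengths(s)),  # row number, repeated per element
             match(unlist(s), cols))         # column position of each label
M_idx[idx] <- 1

identical(M_loop, M_idx)
```

The single indexed assignment avoids the per-row overhead of the loop, which is where the time goes with several thousand rows.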
What I'm trying to do is to create a single cataract variable from three different datasets that asked about cataract. (Basically, a phone interview, a wave using a short questionnaire, and a wave using a longer questionnaire.) These datasets have been merged, such that there are missing values created for the values for participants in the wave they didn't participate in. I've coded each of the three separate cataract vars as 1=YES and 0=NO.
In the following code, I'm trying to say if you respond yes (1) to any of the three vars, then give a value of 1, then if you are a No (0) to any give a value of 0, otherwise "NA".
survey$cataract<-ifelse(survey$ew3_cat==1 | survey$lq3_catnum==1 | survey$sq3_cat==1,1,
ifelse(survey$ew3_cat==0 | survey$lq3_catnum==0 | survey$sq3_cat==0,0,NA))
As you can see from the following result, I get the 1's, but everything else is "NA", no zeros.
> table(survey$cataract,useNA="ifany")
1 <NA>
10303 63322
Now, if I change the order, say do all the zeros first, then I get the correct 0's, but no 1's.
survey$cataract<-ifelse(survey$ew3_cat==0 | survey$lq3_catnum==0 | survey$sq3_cat==0,0,
ifelse(survey$ew3_cat==1 | survey$lq3_catnum==1 | survey$sq3_cat==1,1,NA))
> table(survey$cataract,useNA="ifany")
0 <NA>
63315 10310
The correct count from the three separate vars should be:
10,303 = 1
63,315 = 0
7= NA
I also tried replicating this problem with made-up data as follows:
x <- c(rep(1,100),rep(0,200),rep(NA,400))
y <- c(rep(NA,300),rep(1,100),rep(0,100),rep(NA,200))
z <- c(rep(NA,500),rep(1,100),rep(0,100))
cat <- ifelse(x==1|y==1|z==1,1,
ifelse(x==0|y==0|z==0,0,NA))
> table(cat,useNA="ifany")
cat
1 <NA>
300 400
Same problem if I reverse the order:
cat <- ifelse(x==0|y==0|z==0,0,
ifelse(x==1|y==1|z==1,1,NA))
> table(cat,useNA="ifany")
cat
0 <NA>
400 300
Any suggestions about what logical thing I'm missing here?
This is a little hackish but should give you the right result:
tmp <- as.numeric(mapply(any, as.logical(x),as.logical(y),as.logical(z), na.rm=TRUE))
tmp[which(mapply(all, is.na(x), is.na(y), is.na(z)))] <- NA
Basically it looks for any values of 1, returning 1 for those values and 0 otherwise. Then it goes back and puts NA values back in wherever all of x, y, and z are NA.
> table(tmp)
tmp
0 1
400 300
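The reason the nested ifelse() in the question never reaches its second branch is R's three-valued logic: an "or" with a missing operand is TRUE only if another operand is TRUE, and otherwise stays NA, and ifelse() returns NA wherever its test is NA. A minimal illustration (the vectors a and b are made up):

```r
# One variable observed, the other missing (as in the merged survey waves).
a <- c(1, 0, NA)
b <- c(NA, NA, NA)

test <- a == 1 | b == 1
test  # TRUE for the first element, NA for the other two

# ifelse() passes the NA test straight through instead of evaluating the "no" branch.
res <- ifelse(test, 1, 0)
res
```

So every row without a 1 that has at least one missing variable ends up as NA, which matches the tables in the question.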
Note: Your example data don't seem particularly good for testing this because you have no cases that are NA-NA-NA:
> ftable(x,y,z, useNA='always')
z 0 1 NA
x y
0 0 0 0 0
1 0 0 0
NA 0 0 200
1 0 0 0 0
1 0 0 0
NA 0 0 100
NA 0 0 0 100
1 0 0 100
NA 100 100 0
So, here's a slightly modified version of your data that shows the above code works correctly:
x <- c(rep(1,100),rep(0,200),rep(NA,400))
y <- c(rep(NA,300),rep(1,100),rep(0,100),rep(NA,200))
z <- c(rep(NA,500),rep(1,100),rep(0,50),rep(NA,50))
The result for those data:
> ftable(x,y,z, useNA='always')
z 0 1 NA
x y
0 0 0 0 0
1 0 0 0
NA 0 0 200
1 0 0 0 0
1 0 0 0
NA 0 0 100
NA 0 0 0 100
1 0 0 100
NA 50 100 50
> table(tmp, useNA='always')
tmp
0 1 <NA>
350 300 50
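Because the variables are coded 0/1, a possibly simpler alternative to the mapply() approach is the parallel maximum with missing values ignored. This is offered as a sketch rather than as part of the answer above; it relies on pmax(..., na.rm = TRUE) returning NA only where all inputs are NA.

```r
# The modified test data from above.
x <- c(rep(1, 100), rep(0, 200), rep(NA, 400))
y <- c(rep(NA, 300), rep(1, 100), rep(0, 100), rep(NA, 200))
z <- c(rep(NA, 500), rep(1, 100), rep(0, 50), rep(NA, 50))

# 1 if any variable is 1, 0 if the observed ones are all 0, NA if all three are missing.
cataract <- pmax(x, y, z, na.rm = TRUE)
table(cataract, useNA = "always")
```

On these data the counts agree with the table above: 350 zeros, 300 ones and 50 NA.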