Creating List of Observations from Matrix Values in R - r

I have a large matrix that calculates the distance between pairs of zip codes (using the rgeosphere package). I would like to run a function that finds all zip code pairings that are <= x distance from each other and creates a list of them. The data look like this:
        91423  92231  94321
90034    3     4.5    2.25
93201    3.75  2.5    1.5
94501    2     6      0.5
So if I ran the function to extract all zip code pairings that are < 2 miles apart, I would end up with these zip codes:
94321
94321
93201
94501
The goal is basically to identify all zip codes in the US that are adjacent to a list of zip codes I have. If there is a better way to do this, I am open to suggestions.

Perhaps something like the following. It will be slow, but it should work.
# `distance` is the threshold; pairs are (row zip, column zip)
for (i in 1:nrow(data)) {
  for (j in 1:ncol(data)) {
    if (data[i, j] < distance) {
      if (!exists("hold.zips")) {
        hold.zips <- matrix(c(rownames(data)[i], colnames(data)[j]), ncol = 2)
      } else {
        temp <- matrix(c(rownames(data)[i], colnames(data)[j]), ncol = 2)
        hold.zips <- rbind(hold.zips, temp)
      }
    }
  }
}
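As an aside, the same pairs can probably be collected without the explicit loop using which() with arr.ind = TRUE (a rough, untested sketch):
idx <- which(data < distance, arr.ind = TRUE)  # row/column indices of entries below the threshold
hold.zips <- cbind(rownames(data)[idx[, "row"]], colnames(data)[idx[, "col"]])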

This should work. Gives a nice list as output (calling your data x):
rn = rownames(x)
apply(x, 2, function(z) rn[z < 2])
# $`91423`
# character(0)
#
# $`92231`
# character(0)
#
# $`94321`
# [1] "93201" "94501"
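If you would rather end up with a two-column table of pairs than a named list, something like this should flatten it (assuming the apply() call returns a list, which it does whenever the columns match different numbers of rows):
res <- apply(x, 2, function(z) rn[z < 2])
pairs <- data.frame(zip_row = unlist(res),                    # zip from the matrix rows
                    zip_col = rep(names(res), lengths(res)))  # matching zip from the columns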

Here is the Tidyverse solution:
library(dplyr)
library(tidyr)

# your data
dat <- matrix(c(3, 3.75, 2, 4.5, 2.5, 6, 2.25, 1.5, 0.5), nrow = 3, ncol = 3)
rownames(dat) <- c(90034, 93201, 94501)
colnames(dat) <- c(91423, 92231, 94321)

# tidyverse solution
r <- rownames(dat)
dat_tidy <- dat %>%
  as_tibble() %>%
  mutate(x = r) %>%
  select(x, everything()) %>%
  gather(key = y,
         value = distance,
         -x) %>%
  filter(distance < 2)

print(dat_tidy)

# note: if your matrix is symmetric, then to remove duplicate pairs the filter would be:
#   filter(x < y,
#          distance < 2)

Related

Deriving cosine values for vector contrasts distributed over rows in a dataframe (rows to individual vectors)

I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, and each row denotes a unique 15-item vector.
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2), where vec1 is the vector for a given row and vec2 is the vector for the next row (the lead). I need this to work over the rows of very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted with the next row, stopping at the second-to-last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but I am getting an 'argument is not a matrix' error.
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works for row 1 vs. row 2, not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)

# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i) {
  rnorm(300)
})

# Convert to matrix
mat <- t(simplify2array(gen_list))

# Obtain desired values
vals <- unlist(
  lapply(
    2:nrow(mat), function(i) {
      cosine(mat[i - 1, ], mat[i, ])
    }
  )
)
You can ignore the gen_list code; it was only there to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
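For illustration, the conversion step could look something like this, where your_df is just a placeholder name for your data frame:
mat <- as.matrix(your_df)  # cosine() from {lsa} expects vectors/matrices, not data frames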
Runs quickly -- 3.39 seconds on my computer
My answer is similar to Kat's, but I first packaged the 15 row values into a list and then created a new column holding the lead of that list.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
  rowwise %>%
  summarise(row_v = list(c_across())) %>%
  mutate(nextrow_v = lead(row_v)) %>%
  replace_na(list(nextrow_v = list(rep(NA, 15)))) %>% # replace NA with a list of NAs
  rowwise %>%
  summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
   cosr[,1]
      <dbl>
 1    0.820
 2    0.791
 3    0.780
 4    0.785
 5    0.838
 6    0.808
 7    0.718
 8    0.743
 9    0.773
10   NA
I'm assuming that you aren't looking for vectorization either (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you, so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide: the first 15 are the vector; the second 15 are the lead.
df2 %>%
  rowwise %>%
  mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
  select(cosr) %>%
  unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for (i in 1:(nrow(df1) - 1)) {
  v1 <- df1[i, ] %>% unlist()
  v2 <- df1[i + 1, ] %>% unlist()
  message(cosine(v1, v2))
}

invisible(
  lapply(1:(nrow(df1) - 1),
         function(i) {
           message(cosine(unlist(df1[i, ]), unlist(df1[i + 1, ])))
         }))

Error flattening (converting to data.frame) an XML file in R using xml2 and xmltools

I am trying to convert this xml_file (and many other similar ones) to a data.frame in R. Desired outcome: a data.frame (or tibble, data.table, etc.) with:
One row per Deputado (which is the main tag/level of xml_file; there are 4 of those)
All variables within each Deputado should be columns.
Nested categories with multiple values (such as comissao, cargoComissoes, etc.) can be ignored.
In the code below, I tried to follow Example 2 in the readme of github/.../xmltools closely, but I got the error:
...
+ dplyr::mutate_all(empty_as_na)
Error: Argument 4 must be length 4, not 39
Any help fixing this (or a different strategy with a complete example) would be greatly appreciated.
The code (with reproducible error) is:
file <- "https://www.camara.leg.br/SitCamaraWS/Deputados.asmx/ObterDetalhesDeputado?ideCadastro=141428&numLegislatura="

doc <- file %>%
  xml2::read_xml()

nodeset <- doc %>%
  xml2::xml_children()

length(nodeset) # lots of nodes!

nodeset[1] %>% # let's look at ONE node's tree
  xml_view_tree()

# let's assume that most nodes share the same structure
terminal_paths <- nodeset[1] %>%
  xml_get_paths(only_terminal_parent = TRUE)

terminal_xpaths <- terminal_paths %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()

# xml_to_df (XML package based)
## note that we use file, not doc, hence is_xml = FALSE
# df1 <- lapply(xpaths, xml_to_df, file = file, is_xml = FALSE, dig = FALSE) %>%
#   dplyr::bind_cols()
# df1

# xml_dig_df (xml2 package based)
## faster!
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x) ## since ifelse won't work with factors
  if(class(x) == "character") ifelse(as.character(x) != "", x, NA) else x
}

terminal_nodesets <- lapply(terminal_xpaths, xml2::xml_find_all, x = doc) # use xml docs, not nodesets! I think this is because it searches the 'root'.

df2 <- terminal_nodesets %>%
  purrr::map(xml_dig_df) %>%
  purrr::map(dplyr::bind_rows) %>%
  dplyr::bind_cols() %>%
  dplyr::mutate_all(empty_as_na)
Here is an approach with the XML package.
library(tidyverse)
library(XML)

df = xmlInternalTreeParse("./Data/ObterDetalhesDeputado.xml")
df_root = xmlRoot(df)
df_children = xmlChildren(df_root)

df_flattened = map_dfr(df_children, ~ .x %>%
                         xmlToList() %>%
                         unlist %>%
                         stack %>%
                         mutate(ind = as.character(ind),
                                ind = make.unique(ind)) %>% # for duplicate identifiers
                         spread(ind, values))
The following nodes are nested lists, so they will appear as duplicate columns with numbers affixed. You can remove them accordingly:
cargosComissoes 2
partidoAtual 3
gabinete 3
historicoLider 4
comissoes 11
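For example, something along these lines should drop them, assuming the nested nodes end up as columns whose names start with the node names listed above (a rough sketch, not tested against your file):
df_clean <- df_flattened %>%
  select(-starts_with("cargosComissoes"),
         -starts_with("partidoAtual"),
         -starts_with("gabinete"),
         -starts_with("historicoLider"),
         -starts_with("comissoes"))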

Dataframe is too big for supercomputer

I am trying to create a matrix of donors and recipients, populated with the sum of donations produced by each pair, keeping the eventual NAs.
It works well for small datasets (see the toy example below), but when I switch to national datasets (3M entries) several problems emerge: besides being painstakingly slow, the creation of the fill df consumes all the memory of the (super)computer and I get the error "Error: cannot allocate vector of size 1529.0 Gb".
How should I tackle the problem?
Thanks a lot!
library(dplyr)
library(tidyr)
library(bigmemory)

candidate_id <- c("cand_1","cand_1","cand_1","cand_2","cand_3")
donor_id <- c("don_1","don_1","don_2","don_2","don_3")
donation <- c(1, 2, 3.5, 4, 10)

df <- data.frame(candidate_id, donor_id, donation)
colnames(df) <- c("candidate_id","donor_id","donation")

fill <- df %>%
  group_by(df$candidate_id, df$donor_id) %>%
  summarise(tot_donation = sum(as.numeric(donation))) %>%
  complete(df$candidate_id, df$donor_id)
fill <- unique(fill[, 1:3])
colnames(fill) <- c("candidate_id","donor_id","tot_donation")

nrow <- length(unique(df$candidate_id))
ncol <- length(unique(df$donor_id))
row_names <- unique(fill$candidate_id)
col_names <- unique(fill$donor_id)

x <- big.matrix(nrow, ncol, init = NA, dimnames = list(row_names, col_names))
for (i in 1:nrow) {
  for (j in 1:ncol) {
    x[i, j] <- fill[which(fill$candidate_id == row_names[i] &
                            fill$donor_id == col_names[j]), 3]
  }
}
I see you're using unique because your output has duplicated values. Based on this question, you should try the following in order to avoid duplication:
fill <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup %>%
  complete(candidate_id, donor_id)
Can you then try to create your desired output? I think unique can be very resource-heavy, so try to avoid calling it.
The tidyr version of what Benjamin suggested should be:
spread(fill, donor_id, tot_donation)
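As a side note, on tidyr 1.0 or later spread() is superseded, and the pivot_wider() equivalent should be:
pivot_wider(fill, names_from = donor_id, values_from = tot_donation)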
EDIT: By the way, since you tagged the question with sparse-matrix, you could indeed use sparsity to your advantage:
library(Matrix)
library(dplyr)

df <- data.frame(
  candidate_id = c("cand_1","cand_1","cand_1","cand_2","cand_3"),
  donor_id = c("don_1","don_1","don_2","don_2","don_3"),
  donation = c(1, 2, 3.5, 4, 10)
)

summ <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup

num_candidates <- nlevels(df$candidate_id)
num_donors <- nlevels(df$donor_id)

smat <- Matrix(0, num_candidates, num_donors, sparse = TRUE, dimnames = list(
  levels(df$candidate_id),
  levels(df$donor_id)
))

indices <- summ %>%
  select(candidate_id, donor_id) %>%
  mutate_all(unclass) %>%
  as.matrix

smat[indices] <- summ$tot_donation
smat
3 x 3 sparse Matrix of class "dgCMatrix"
       don_1 don_2 don_3
cand_1     3   3.5     .
cand_2     .   4.0     .
cand_3     .     .    10
You might try
library(reshape2)
dcast(fill, candidate_id ~ donor_id,
      value.var = "tot_donation",
      fun.aggregate = sum)
I don't know if it will avoid the memory issue, but it will likely be much faster than a double for loop.
I have to run to a meeting, but part of me wonders if there is a way to do this with outer.
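For what it's worth, a rough (untested) sketch of that outer() idea: build a named lookup of donations keyed on the candidate/donor pair and let outer() fill the matrix. Note that it still materializes the full dense matrix, so it may not solve the memory problem:
lookup <- setNames(fill$tot_donation,
                   paste(fill$candidate_id, fill$donor_id, sep = "."))
x <- outer(row_names, col_names,
           function(r, c) unname(lookup[paste(r, c, sep = ".")]))
dimnames(x) <- list(row_names, col_names)  # pairs with no recorded donation stay NA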

How to automatically find the start point of the largest break in a given dataset in R

I am trying to find the starting point of the largest break in a given dataset. Here is my example:
data <- data.frame(month = c(1:12), countx = c(60,69,10,13,65,80,59,84,43,21,18,10))
select <- data[data$countx >= 50, ] # take values >= 50 into account

# find the breaks
wtym <- select$month
breaks <- c(0, which(diff(wtym) != 1), length(wtym))
allbreak <- sapply(seq(length(breaks) - 1),
                   function(i) wtym[(breaks[i] + 1):breaks[i + 1]])
> allbreak
[[1]]
[1] 1 2

[[2]]
[1] 5 6 7 8
The question is: I need to do this for a large number of datasets (and the breaks obviously vary). Is there any way to automatically pick out the start point of the largest break in a series (in this example, month 5, i.e. gap no. 2)? Any idea is highly appreciated. Thanks.
Sounds like a run-length-encoding ?rle task where you are looking for runs of x < 50 and x >= 50. Here's a function:
bigbreak <- function(x, cutoff) {
  r <- rle(x >= cutoff)
  # find the longest TRUE run, take the cumulative length up to the run
  # before it (its end position) and add 1 to get the start position
  cumsum(r$l)[which(r$l == max(r$l[r$v]) & r$v) - 1] + 1
}
bigbreak(data$countx, 50)
#[1] 5
Now let's try it on 5 million records:
set.seed(1)
x <- sample(c(50, 0), 5e6, replace = TRUE)
system.time({
  bigbreak(x, 50)
})
#   user  system elapsed
#   0.41    0.00    0.41
Under half a second, not too bad.
A solution using dplyr and data.table.
# Create example data frame
data <- data.frame(month = c(1:12), countx = c(60,69,10,13,65,80,59,84,43,21,18,10))

# Load packages
library(dplyr)
library(data.table)

# Process the data
data2 <- data %>%
  mutate(Condition = countx >= 50) %>%
  mutate(RunID = rleid(Condition)) %>%
  filter(Condition) %>%
  group_by(RunID) %>%
  mutate(num = n()) %>%
  ungroup() %>%
  filter(num == max(num))
# Show the number of the first month
data2$month[1]
[1] 5

Calling recursive functions in R

Assuming I have a dataframe, df, with this info:
group  wk  source  revenue
    1   1       C      100
    1   1       D      200
    1   1       A      300
    1   1       B      400
    1   2       C      500
    1   2       D      600
I'm trying to programmatically filter down to rows of unique combinations of group, wk and source, and then perform some operations on them, before combining them back into another dataframe. I want to write a function that can scale to any number of segments (not just the example scenario here) and filter down the rows. All I need to pass is the column names by which I want to segment,
e.g.
seg <- c("group", "wk", "source")
One unique combination to filter rows in df would be
df %>% filter(group == 1 & wk == 1 & source == "A")
I wrote a recursive function (get_rows) to do so, but it doesn't seem to do what I want. Could anyone provide input on where I'm going wrong?
library(dplyr)

filter_row <- function(df, x)
{
  df %>% filter(group == x$group & wk == x$wk & source == x$source)
}

seg <- c("group", "wk", "source")

get_rows <- function(df, seg, pos = 1, l = list())
{
  while (pos <= (length(seg) + 1))
  {
    if (pos <= length(seg))
      for (j in 1:length(unique(df[, seg[pos]])))
      {
        k <- unique(df[, seg[pos]])
        l[seg[pos]] <- k[j]
        get_rows(df, seg, pos + 1, l)
        return()
      }
    if (pos > length(seg))
    {
      tmp <- df %>% filter_row(l)
      # <call some function on tmp>
      return()
    }
  }
}

get_rows(df, seg)
EDIT: I understand there are prebuilt methods I can use to get what I need, but I'm curious about where I'm going wrong in the recursive function I wrote.
There might be a data.table/dplyr solution out there, but this one is pretty simple.
# Just paste together the values of the columns you want to aggregate over.
# This creates a vector of factors
f <- function(data, v) {apply(data[, v, drop = F], 1, paste, collapse = ".")}

# Aggregate, tapply, ave, and a few more functions can do the same thing
by(data = df,                                    # Your data here
   INDICES = f(df, c("group", "wk", "source")),  # Your data and columns here
   FUN = identity, simplify = F)                 # Your function here
You can also use library(dplyr) and library(data.table):
df %>% as.data.table() %>% group_by(group, wk, source) %>%
  do(your_function(.)) # inside do(), `.` is the current group's data; your_function is a placeholder
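Another option along the same lines is dplyr's group_split() with purrr::map(), which hands each unique combination to a function as its own data frame (your_function is again a placeholder):
library(purrr)
df %>%
  group_split(group, wk, source) %>%  # one tibble per unique group/wk/source combination
  map(your_function)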
