How to delete an NA (empty) vertex - r

I'm making an (undirected) social network plot with the igraph package, which I've practised with a subset of my data. For this, the vertices have to be in two columns so that all associations are shown. However, sometimes a vertex (an individual; I'm working with animals) is encountered alone, with no associations. That animal will be in the left column, and the right column will contain nothing, an empty cell.
However, in the igraph package R thinks that "NA"/nothing is an animal's ID, so it makes a vertex out of it. In my subset I had solved this problem like this:
y <- data.frame(data$ID1, data$ID2)
ID1 and ID2 are the codes from the PIT-tag readers used to recognize the individual animals; they're basically the animals' names.
graph.data.frame(y, directed=FALSE)
I'm calling this graph: net
net <- graph.data.frame(y, directed=FALSE)
net <- delete_vertices(net, "")
So in the code shown above, the empty values, where no animal ID was filled in, are deleted from the graph. I was thrilled I achieved this, but as I said, it was in a subset of my data which I had already manually edited.
For the whole dataset, I had to wrangle the data, because the animals were observed in larger groups: I had 8 columns of animal IDs which were all associated together. This had to be reshaped into two columns covering all possible pairwise combinations at a single location (so for a group of 4 animals I needed the combinations 1-2; 1-3; 1-4; 2-3; 2-4 and 3-4, and groups vary from 1 to 8 animals (vertices)). I've done this with the tidyr and dplyr packages (and help). When there's no value in one of the columns (because an individual was by itself), R says:
Warning messages:
1: In graph.data.frame(y, directed = FALSE) :
In `d' `NA' elements were replaced with string "NA"
2: In `[<-.factor`(`*tmp*`, thisvar, value = "NA") :
invalid factor level, NA generated
So in my opinion it replaces the empty space with NA, which is also what it shows when I tell R to print the new wrangled data. However, the trick with deleting vertices doesn't work anymore. It keeps saying "invalid vertex name". I've tried it with "", with "NA", NA, "<NA>" and everything logical I could think of, but I cannot seem to solve this.
I'm hoping it's an error which is easily solvable with a different " or , or something. Anyone have any ideas?
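One possible way around this, sketched under assumptions and not taken from the original post, is to avoid creating the "NA" vertex in the first place: pass only complete pairs as edges, and list every real ID in the vertices argument so that lone animals still appear as isolated vertices. The toy IDs below are made up, the helper is_blank() is hypothetical, and graph_from_data_frame() is the current igraph name for graph.data.frame().

```r
library(igraph)

# Hypothetical toy data: animal "D" was seen alone, so ID2 is missing.
y <- data.frame(ID1 = c("A", "A", "B", "D"),
                ID2 = c("B", "C", "C", NA),
                stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for NA or empty-string IDs.
is_blank <- function(x) is.na(x) | x == ""

# Keep only complete pairs as edges.
edges <- y[!is_blank(y$ID1) & !is_blank(y$ID2), ]

# List every real ID so lone animals remain as isolated vertices.
ids <- unique(c(y$ID1, y$ID2))
ids <- ids[!is_blank(ids)]

net <- graph_from_data_frame(edges, directed = FALSE,
                             vertices = data.frame(name = ids))
```

If the ID columns are factors (which the "invalid factor level" warning suggests), converting them with as.character() before this step is probably needed.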

Related

r Terra issue with multicategorical raster. How to properly extract the categories and their values into layers without losing data?

I am working with terra in R and having an issue with the CONUS historical disturbance dataset from LANDFIRE found here: https://landfire.gov/version_download.php (HDist is the name). To summarize, I want to take this dataset, crop and project it to my extent, then take the values of the cells and separate them into layers. So I want a layer for severity, one for disturbance type, etc. The historical disturbance data has these things all in one attribute table. In terra, this attribute table is set up under categories, and this is causing a lot of problems. I have not had issues with the crop or the reprojection; the problem is getting at the values and separating the categories into layers. I have the following code:
library(terra)
setwd("your pathway to historical disturbance tif here")
h1 <- terra::rast("LC16_HDst_200.tif") #read in the Hdist tif
h2 <- terra::project(h1, "EPSG:5070", method = "near") #project it using nearest neighbor
h3 <- crop(h2, ext(xmin, xmax, ymin, ymax)) #crop to the extent (xmin, xmax, ymin, ymax are placeholders)
h3
This then gives the output in the extent and projection I want but the main focus is the categories
categories : Count, HDIST_ID, DISTCODE_V, DIST_TYPE, TYPE_CONFI, SEVERITY, SEV_CONFID, HDIST_CAT, FDIST, R, G, B
So I learned that with these kinds of datasets, the values are stored under these categories.
if I plot with plot(h3)
I only get the first row of the count category. In order to switch that category I can use
activeCat(h3) <- 4
h3
and I would get
name : DIST_TYPE
min value : Clearcut
max value : Wildland Fire Use
The default active category was Count, but now it's DIST_TYPE, the fourth category, nothing too crazy. I try plotting
plot(h3)
I only get NoData plotted. None of the others. There is a function called catalyze() that claims to take your categories and convert them all into numerical layers
h4 <- catalyze(h3)
which gave me a thirteen layer dataset, which makes sense because there are 13 categories and it takes them and converts them into numeric layers. I tried plotting
plot(h4, 4) #plot h4 layer 4, which would correspond to DIST_TYPE category
it only plots a value of 8, and it looks to only show what are likely NoData values. The map is mostly green, which is in line with the NoData from HDist.
Anytime I try directly accessing values, it crashes. When I look at the min and max values, I get 8 for both: names: DIST_TYPE, min values: 8, max values: 8. Other categories show a similar pattern. So it appears to just take the first row of values for each category and make that the entire layer.
In summary, it is clear that terra stores all of the values that would easily be seen in an attribute table if the dataset were brought into arcgis. However, whenever I try to plot it or work with it, even before any real manipulation, it only accesses the top row of that attribute table, and when I catalyze, it just seems to mess everything up even more. I know this is really easy to solve in arcgis pro, but I want to keep everything in r from a documentation coherency standpoint. Any terra whizzes know what to do about this? I figure it has to be something very easy, but I don't really know what else to try. Maybe it is some major issue too. I have the same issue with LANDFIRE evt data. I have not had this issue with simple rasters such as dem, canopy cover, etc. It is only with these rasters with multiple categories (or columns in an attribute table)
That failed because the (ESRI) VAT IDs are not in the expected (for GDAL categories) 0..255 range. This has now been fixed and I get:
library(terra)
#terra version 1.4.6
r <- rast("LC16_HDst_200.tif")
activeCat(r) <- 4
r <- crop(r, ext(-93345, -57075, 1693125, 1716735))
plot(r)

Why does mutate() command create NAs?

I am currently working on an amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the amazon data, and see whether certain products have a higher variance in star ratings than other ones. I have a variable indicating product ID (asin), a variable indicating the star rating (overall), and want to create a variance variable.
I have thus used dplyr's group_by function in combination with the mutate function. Even though all input variables don't have NAs/Missings, my output variable does. I have attempted to look for a solution, yet only found solutions on what to do if the input has NAs.
See my code attached:
any(is.na(data$asin))
#[1] FALSE
any(is.na(data$overall))
# [1] FALSE
#create variable that represents variance of rating, grouped by product type
data <- data %>%
group_by(asin) %>%
mutate(ProductVariance = var(overall))
any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still like to get accurate means (NAs hinder the usage of tapply) and be as precise as possible in follow-up analyses.
Thank you in advance!
var will return NA if the input is length one. So any ASINs that appear once in your data will have NA variance. Depending what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
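As a small illustration of the point above (a sketch on made-up data, not the asker's dataset), the toy frame below has one product with a single review. Adding a per-group count next to the variance makes it easy to confirm that the NAs line up with groups of size one; the column names n_reviews and ProductVariance0 are assumptions introduced for this example.

```r
library(dplyr)

# Hypothetical toy reviews: product "b" has a single review.
reviews <- data.frame(asin = c("a", "a", "b"),
                      overall = c(5, 3, 4))

out <- reviews %>%
  group_by(asin) %>%
  mutate(n_reviews = n(),                               # group size
         ProductVariance = var(overall),                # NA when n_reviews == 1
         ProductVariance0 = coalesce(var(overall), 0)) %>%  # NA replaced by 0
  ungroup()

print(out)
```

Whether 0 is the right replacement depends on the analysis; filtering on n_reviews > 1 is an alternative if single-review products should be excluded instead.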
Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with .drop.
When .drop = TRUE, empty groups are dropped.

Making a histogram

This sounds pretty basic, but every time I try to make a histogram, R says x needs to be numeric. I've been looking everywhere but can't find an answer relating to my problem. I have data with 240 obs of 5 variables:
Nipper length
Number of Whiskers
Crab Carapace
Sex
Estuary location
There are 3 locations and I'm trying to make a histogram of nipper length.
I've tried making new factors and levels, with the 80 obs in each location, but it's not working.
Crabs.data <-read.table(pipe("pbpaste"),header = FALSE)##Mac
names(Crabs.data)<-c("Crab Identification","Estuary Location","Sex","Crab Carapace","Length of Nipper","Number of Whiskers")
Crabs.data<-Crabs.data[,-1]
attach(Crabs.data)
hist(`Length of Nipper`~`Estuary Location`)
Error in hist.default(Length of Nipper ~ Estuary Location) :
'x' must be numeric
Instead of correct result
hist() doesn't seem to like taking more than one variable.
I think you'd have the best luck subsetting the data, that is, making a vector of nipper lengths for all crabs in a given estuary.
crabs.data <- read.table("whatever you're calling it")
# set names(crabs.data) as you have it; "Location" below is a placeholder for one estuary's name
Estuary1 <- as.vector(unlist(subset(crabs.data, `Estuary Location` == "Location", select = `Length of Nipper`)))
hist(Estuary1)
Repeat the last two lines for your other two estuaries. You may not need the unlist() command, depending on your table. I've tended to need it for Excel files, but I don't know what format your table is in (that would've been helpful).
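The subsetting idea above can also be sketched with split(), which builds one numeric vector per estuary in a single step. This is only an illustration on simulated data: the column names estuary and nipper, the location labels, and the values are all assumptions.

```r
# Simulated stand-in for the crab data: 3 locations, 80 observations each.
set.seed(1)
crabs <- data.frame(estuary = rep(c("A", "B", "C"), each = 80),
                    nipper  = rnorm(240, mean = 10, sd = 2))

# hist() needs a numeric vector, so check the column type first.
stopifnot(is.numeric(crabs$nipper))

# One numeric vector of nipper lengths per estuary.
by_loc <- split(crabs$nipper, crabs$estuary)

# Draw the three histograms side by side.
op <- par(mfrow = c(1, 3))
for (loc in names(by_loc)) {
  hist(by_loc[[loc]], main = loc, xlab = "Length of Nipper")
}
par(op)
```

If the is.numeric() check fails on the real data, one likely cause is a header row read in as data (header = FALSE), which would turn the whole column into text.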

Filtering grouped data in R

I was wondering if anyone can help with grouping the data below as I'm trying to use the subset function to filter out volumes below a certain threshold but given that the data represents groups of objects, this creates the problem of removing certain items that should be kept.
In Column F (and I) you can see Blue, Red, and Yellow Objects. Each represents one of three separate colored probes on one DNA strand. Odd-numbered or non-numbered Blue, Red, and Yellow are paired with a homologous strand represented by an even-numbered Blue, Red, and Yellow. I.e., the data in rows 2, 3, and 4 are one "group" and pair with the "group" shown in rows 5, 6, and 7. This then repeats, so 8, 9, 10 are a new group and that group pairs with the one in 11, 12, 13.
What I would like to do is subset the groups so that only those below a certain Distance to Midpoint (column M) are kept. The Midpoint here is the midpoint of the line that connects the blue of one group with the blue of its partner, so the subset should only apply to the Blue distance to midpoint, and that is where I'm having a problem. For instance, if I ask to keep blue distances to midpoint that are less than 3, then the objects in rows 3 and 4 should be kept because they are part of the group with the blue distance below 3. Right now, though, when I filter with the subset function I lose Red Selection and Yellow Selection. I'm confident there is a straightforward solution to this in R, but I'd also be open to some type of filtering in Excel if anyone has suggestions via that route instead.
EDIT
I managed to work something out in Excel last night after posting the question. Solution isn't pretty but it works well enough. I just added a new column next to "distance to midpoint" that gives all the objects in one group the same distance so that when I filter the data I won't lose any objects that I shouldn't. If it helps anyone in the future, the formula I used in excel was:
=SQRT ( ((INDEX($B$2:$B$945,1+QUOTIENT(ROWS(B$2:B2)-1,3)*3))- (INDEX($O$2:$O$945,1+QUOTIENT(ROWS(O$2:O2)-1,3)*3)) ) ^2 +( (INDEX($C$2:$C$945,1+QUOTIENT(ROWS(C$2:C2)-1,3)*3))-(INDEX($P$2:$P$945,1+QUOTIENT(ROWS(P$2:P2)-1,3)*3)) ) ^2 +( (INDEX($D$2:$D$945,1+QUOTIENT(ROWS(D$2:D2)-1,3)*3))-(INDEX($Q$2:$Q$945,1+QUOTIENT(ROWS(Q$2:Q2)-1,3)*3)) ) ^2)
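Would be easier with a reproducible example, but here's a (hacky) dplyr solution. It assumes the rows come in consecutive triplets with the Blue row first, and that the distance column is named distance_to_midpoint:
library(dplyr)
filterframe <- function(df, threshold) {
  df %>%
    mutate(grouper = (row_number() - 1) %/% 3) %>%  # rows 1-3 -> group 0, rows 4-6 -> group 1, ...
    group_by(grouper) %>%
    filter(first(distance_to_midpoint) < threshold) %>%  # test only the Blue (first) row of each group
    ungroup()
}
filterframe(mydata, threshold = 3)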
A base R solution is provided below. The idea is that once your data are in R, you (edit) keep! rows iff they meet 2 criteria. First, the Surpass column has to contain the word "blue", which is checked with the grepl function. Second, the distance must be below a certain threshold (set arbitrarily by thresh).
fakeData=data.frame(Surpass=c('blue', 'red', 'green', 'blue'),
distance=c(1,2,5,3), num=c(90,10,9,4))
#thresh is your distance threshold
thresh = 2
fakeDataNoBlue = fakeData[which(grepl('blue', fakeData$Surpass)
& fakeData$distance < thresh),]
There's probably also a quick dplyr solution using filter, but I haven't fully explored the functionality there. Also, I may be a bit confused on if you also want to keep the other colors. If so, that's the same as saying you want to remove the blue ones exceeding a certain distance threshold, which you would just do a -which command, and turn the < operator into a > operator.
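If the other colors should be kept whenever their group's blue object passes the threshold, one base R option is to spread each group's blue distance across the whole group with ave() and then filter on that. This is a sketch under assumptions: the data are made up, rows are assumed to come in consecutive triplets, and the names group, blue_only, blue_dist, and kept are introduced only for illustration.

```r
# Made-up data: three groups of (blue, red, yellow) in consecutive rows.
fakeData <- data.frame(
  Surpass  = rep(c("blue", "red", "yellow"), times = 3),
  distance = c(1, 4, 5,  3, 1, 1,  1.5, 9, 9)
)
thresh <- 2

# Group index: rows 1-3 -> 0, rows 4-6 -> 1, rows 7-9 -> 2.
group <- (seq_len(nrow(fakeData)) - 1) %/% 3

# Keep the distance only on blue rows, NA elsewhere.
blue_only <- ifelse(grepl("blue", fakeData$Surpass), fakeData$distance, NA)

# Copy each group's blue distance to every row of that group.
blue_dist <- ave(blue_only, group, FUN = function(x) x[!is.na(x)][1])

# Keep whole groups whose blue distance is below the threshold.
kept <- fakeData[blue_dist < thresh, ]
print(kept)
```

Here the first and third groups are kept in full, even though their red and yellow distances exceed thresh, because only the blue distance is tested.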

Importing edge list in igraph in R

I'm trying to import an edge list into igraph's graph object in R. Here's how I'm trying to do so:
graph <- read.graph(edgeListFile, directed=FALSE)
I've used this method before a million times, but it just won't work for this specific data set:
294834289 476607837
560992068 2352984973
560992068 575083378
229711468 204058748
2432968663 2172432571
2473095109 2601551818
...
R throws me this error:
Error in read.graph.edgelist(file, ...) :
At structure_generators.c:84 : Invalid (negative) vertex id, Invalid vertex id
The only difference I see between this dataset and the ones I previously used is that those were in sorted form, starting from 1:
1 1
1 2
2 4
...
Any clues?
It seems likely that it's trying to interpret the values as indexes rather than node names and it's probably storing them in a signed integer field that is too small and is probably overflowing into negative numbers. One potential work around is
library("igraph")
dd <- read.table("test.txt")
gg <- graph.data.frame(dd, directed=FALSE)
plot(gg)
It seems this method doesn't have the overflow problem (assuming that's what it was).
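A self-contained version of this workaround might look like the sketch below. It reads a few of the edges from the question as text instead of a file, and forces the columns to character so the IDs are clearly treated as vertex names. The inline edge_text variable is an assumption made for the example, and graph_from_data_frame() is the current igraph name for graph.data.frame().

```r
library(igraph)

# A few of the edges from the question, inlined as text for illustration.
edge_text <- "294834289 476607837
560992068 2352984973
560992068 575083378"

# Read IDs as character so they are used as names, not as integer indexes.
dd <- read.table(text = edge_text, colClasses = "character")

gg <- graph_from_data_frame(dd, directed = FALSE)

# Some IDs exceed the largest 32-bit signed integer, consistent with the
# overflow explanation above.
print(.Machine$integer.max)
print(V(gg)$name)
```

The original IDs remain available as V(gg)$name, while igraph numbers the vertices internally from 1.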
