Unable to subset data in a shapefile

I would like to subset my shapefile without specifying the name of the first column in the .dbf file.
To be more precise, I would like to select all the rows with value 1 in the first column of the .dbf, but I don't want to specify the name of this column.
For example, this script works because I specify the name of the column (as columnName):
library(rgdal) # readOGR
shapeIn <- readOGR(nomeFile)
shapeOut <- subset(shapeIn, columnName == 1)
whereas this does not work:
shapeOut <- (shapeIn[,1] == 1)
and I get an error message:
comparison (1) is possible only for atomic and list types
shapeOut and shapeIn are ESRI vector files.
This is the header of my shapeIn
coordinates mask_1000_
1 (54000, 1218000) 0
2 (55000, 1218000) 0
3 (56000, 1218000) 0
Can you help me? Thank you

This
shapeOut <- (shapeIn[,1] == 1)
doesn't work because a SpatialPolygonsDataFrame contains other information besides the attribute data, so "common" data.frame subsetting doesn't work in the same way. To make it work, you must do the "logical check" for subsetting on the @data slot. This should work (using either subset or "direct" indexing):
shapeOut <- subset(shapeIn, shapeIn@data[,1] == 1)
OR
shapeOut <- shapeIn[shapeIn@data[,1] == 1,]
(however, from recent experience, referring to data by column number is seldom a good idea... ;-) )
ciao Giacomo !!!
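A minimal sketch of the data-slot approach, assuming the sp package is installed. The three points reuse the coordinates shown in the question, but the attribute values are invented for illustration, and a SpatialPointsDataFrame built by hand stands in for whatever readOGR returns.

```r
library(sp)

# Toy stand-in for shapeIn: coordinates from the question, invented attribute values
pts <- data.frame(x = c(54000, 55000, 56000),
                  y = c(1218000, 1218000, 1218000),
                  mask_1000_ = c(0, 1, 1))
coordinates(pts) <- ~ x + y   # promotes the data.frame to a SpatialPointsDataFrame

# Test the first attribute column by position, via the data slot
keep <- pts@data[, 1] == 1

# Use the logical vector to subset the rows of the spatial object itself
ptsOut <- pts[keep, ]
nrow(ptsOut@data)
```

If the column name is needed later, names(pts)[1] returns it without hard-coding it.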

Related

Character vectors for subset conditions

I'm trying to subset a dataframe with character conditions from a vector!
This works:
temp <- USA[USA$RegionName == "Virginia",]
Now for a loop I created a column-vector containing all States by name, so I could filter through them:
> states
[1] "virginia" "Alaska" "Alabama" (...)
But if I now try to pass the "RegionName" condition via the column-vector, it does not work anymore:
temp <- USA[USA$RegionName == states[1],]
What I tried so far:
paste(states[1])
as.factor(states[1])
as.character(states[1])
For recreation of the relevant dataframe:
string <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv")
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
(In my vector Virginia is just on top for convenience!)

Based on the reproducible example, the first element of 'states' is empty
states[1]
[1] ""
We need to remove the blank elements
states <- states[nzchar(states)]
and then execute the code
dim(USA[USA$RegionName == states[1],])
[1] 569 51
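The same fix can be tried on a small made-up data frame, without downloading the CSV. The rows and the cases column below are invented for illustration; only the RegionName column and the nzchar step come from the answer above.

```r
# Toy stand-in for USA: some rows have an empty RegionName
USA <- data.frame(RegionName = c("", "Virginia", "Virginia", "Alaska", ""),
                  cases = c(10, 1, 2, 3, 20),
                  stringsAsFactors = FALSE)

states <- unique(USA$RegionName)
states[1]                          # the empty string, so the filter matches the wrong rows

# Drop the blank entries, then filter by the first real state name
states <- states[nzchar(states)]
temp <- USA[USA$RegionName == states[1], ]
temp
```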

I have not tried akrun's option, as the classic import function is very slow. When using a more modern approach, I noticed that the automatic type detection makes RegionName a logical, so every value gets converted to NA. Here is my approach to your problem:
# much faster way to read in CSV data; all columns are set to character because automatic type detection makes RegionName a logical, returning all NA
string <- readr::read_csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", col_types = readr::cols(.default = "c"))
USA <- string[string$CountryCode=="USA",]
USA <- USA[USA$Jurisdiction=="STATE_TOTAL",]
states <- unique(USA$RegionName)
temp <- USA[USA$RegionName == states[1],]
You will have to convert the columns according to your needs, or specify exactly on import which column should have which data type.

I must admit that this is a poorly asked question. Actually I wanted to subset the dataframe based on two conditions, a date condition and a state condition.
The date condition worked fine. The state condition did not work in combination or alone, so I assumed that the error was here.
In fact, I did a very bizarre transformation of the date from the original source. After I implemented another, much more reliable transformation, the state condition also worked fine, with the code in the question asked.
The error apparently lay in a badly implemented date transformation! Sorry for the rookie mistake; my next question will be more sophisticated.

Subsetting a Cell Data Set within Monocle

If anyone has any experience using the monocle package in R:
I am trying to subset my data based on a vector of sample names, but I cannot accomplish it.
I have tried:
x@phenoData$sampleNames <- example.cells
but I am getting this error:
replacement has 661 rows, data has 5809
The object I am trying to subset is a Cell Data Set (CDS) created from a Seurat object by the importCDS function.
I have also assigned a Cell Type to every sample that is called "CellType" which is part of the meta.data of the Seurat object and is listed under the varLabels slot of the phenoData after it is converted to a CDS.
I would like help subsetting based on either of these variables, thank you.

According to the monocle tutorial, low-quality cells were filtered with this code (HSMM is the monocle object):
valid_cells <- row.names(subset(pData(HSMM),
Cells.in.Well == 1 &
Control == FALSE &
Clump == FALSE &
Debris == FALSE &
Mapped.Fragments > 1000000))
HSMM <- HSMM[,valid_cells]
So for your example this should work:
x = x[,example.cells]
or (directly from Seurat):
x = x[,rownames(data.seurat@meta.data[data.seurat@meta.data$CellType == "interesting_cell",])]

This: x@phenoData$sampleNames <- example.cells is adding new data to the data frame representing your sample treatments, instead of subsetting.
Try using x@phenoData$sampleNames %in% example.cells to get a logical vector (TRUE/FALSE) and filter with it:
x@phenoData[x@phenoData$sampleNames %in% example.cells,]
One caveat: this may mess up your CDS data structure, so be careful. It may be better to filter before generating the CDS, or to generate a new one from the old data.
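The indexing ideas in both answers can be tried without monocle installed. In the sketch below a plain matrix with cells as columns stands in for the CDS, and a small data frame stands in for its phenoData; the gene names, cell names and cell types are all made up, so this only illustrates the subsetting pattern, not monocle's own behaviour.

```r
# Stand-in for the expression part of a CDS: genes in rows, cells in columns
expr <- matrix(1:12, nrow = 3,
               dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Stand-in for the phenoData: one row per cell
pheno <- data.frame(CellType = c("A", "B", "A", "B"),
                    row.names = colnames(expr),
                    stringsAsFactors = FALSE)

example.cells <- c("cell2", "cell4")

# Subset columns directly by cell name
sub_by_name <- expr[, example.cells]

# Or build a logical vector with %in% and subset with that
keep <- colnames(expr) %in% example.cells
sub_by_flag <- expr[, keep]

# Subset by a per-cell annotation such as CellType
sub_by_type <- expr[, rownames(pheno)[pheno$CellType == "A"]]
colnames(sub_by_type)
```

Indexing by name returns columns in the order of example.cells, while the %in% version keeps the original column order; the two agree here only because the names are already in column order.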

how to fill up NA entries in the vector generated with assign command?

I would like to generate a series of vectors like india_a, india_b, india_c. These vectors will have length 3. For example, the 1st entry in india_a will be the sum of 'total' when yrs=1 and crs=a.
for (i in crimes){
assign(paste("india_",i,sep=""),rep(NA,12))
for (j in 1:12){
india_i[j] <-sum(juvenile_crime$total[
juvenile_crime$year==years[j]&juvenile_crime$crime==crimes[i]])
}
}
This is the message I get when I run the above code:
Error in india_i[j] <- sum(juvenile_crime$total[juvenile_crime$year == :
object 'india_i' not found
this example might help:
sts <- c(rep("s1",9),rep("s2",9))
yrs <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
crs <- c("a","b","c","a","b","c","a","b","c")
total <- c(1:18)
cri <- data.frame(sts,yrs,crs,total)
attach(cri)
yr <- levels(cri$yrs)
cr <- levels(cri$crs)
for (i in crs){
assign(paste("india_",i,sep=""),rep(NA,3))
for (j in 1:3){
india_i[j] <-sum(total[yrs==yr[j]&crs==cr[i]]
}
}

There is no object with the name "india_i". We don't have any information about what is in the "crimes" vector but if it's the numbers 1:12 then the objects have names like "india_1". You should learn to make named lists, rather than using separate objects.
After your edit, we can demonstrate this using a slightly modified version of your code ( and adding the missing close-parenthesis for the sum-call).
India_L <- list() # create an (empty) master list
for (i in crs){
assign(paste("india_",i,sep=""),rep(NA,3))
for (j in 1:3){
India_L[[paste("india_",j,sep="")]] <-sum(total[yrs==yr[j]&crs==cr[i]])
}
}
India_L # print to see the structure
#---- result
$india_1
[1] 0
$india_2
[1] 0
$india_3
[1] 0
The reason you got all zeroes was that there are no levels for the yrs column of the cri-object. It was "numeric" and only "factor"-classes have levels in R. A comment on your strategy: I wasn't really sure what goal you had set for yourself (besides getting the assign function to succeed). The sum of those logical tests didn't seem particularly informative. Perhaps you meant to use the %in% operator. Using == with a vector will not generally be informative.
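The all-zero result can be reproduced in a few lines of base R. The vectors below are a shortened, made-up version of the question's data.

```r
yrs <- c(1, 1, 2, 3)           # numeric, not a factor
total <- c(10, 20, 30, 40)

yr <- levels(yrs)              # NULL: a numeric vector has no levels
is.null(yr)

hits <- yrs == yr[1]           # comparing against NULL gives a zero-length logical
length(hits)

sum(total[hits])               # sum over nothing is 0

# unique() works for numeric and factor columns alike
yr_ok <- sort(unique(yrs))
sum(total[yrs == yr_ok[1]])
```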
Using attach is generally very unwise. Notice the warning you got:
> attach(cri)
The following objects are masked _by_ .GlobalEnv:
crs, sts, total, yrs
The following objects are masked from cri (pos = 3):
crs, sts, total, yrs
So the objects named might be changed with an edit:
sts <- c(rep("s1.a",9),rep("s2.a",9))
What object do you think was altered? And if you then detach-ed the cri-dataframe, where do you think the edit would reside? One of the big problems with attaching objects is that the user gets confused about what is actually being changed.
It would be more clear to create the values with the dataframe in one pass and then work on components of the dataframe:
cri <- data.frame(
sts = c(rep("s1",9),rep("s2",9)),
yrs = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
crs = c("a","b","c","a","b","c","a","b","c"),
total = c(1:18) )

If you want to create 3 vectors, you may do something like this:
for (i in unique(crs)){
# Note: only one of each value
vect <- numeric(3)
# create a help vector
for (j in 1:3) {
vect[j] <-sum(total[yrs==yr[j] & crs==cr[i]])
}
assign(paste("india_",i,sep=""), vect)
# assign this vector to "india_i"
}
However, this program does not work. As yrs is numeric it will be included in the cri data frame as-is, and does not have any levels, and hence yrs==yr[j] is never true.
Another point: it is usually better to use lists instead of assignment of india_i. I would do
india <- vector("list", 3)
names(india) <- letters[1:3]
and the assignment later would be like
india[[i]] <- vect
And please, make sure your code runs (except for the error you are struggling with). Currently it does not even parse, as a closing parenthesis is missing from india_i[j] <-sum(total[yrs==yr[j]&crs==cr[i]].
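Putting the suggestions from both answers together (a named list instead of assign, and unique() instead of levels() on a numeric column), one possible working version looks like the sketch below. It assumes the crs values are meant to repeat a, b, c across all 18 rows, which is how data.frame would recycle the 9-element vector in the question; the names years, crimes and india are chosen here for illustration.

```r
cri <- data.frame(
  sts   = rep(c("s1", "s2"), each = 9),
  yrs   = rep(rep(1:3, each = 3), times = 2),
  crs   = rep(c("a", "b", "c"), times = 6),
  total = 1:18,
  stringsAsFactors = FALSE
)

years  <- sort(unique(cri$yrs))   # works for numeric columns, unlike levels()
crimes <- sort(unique(cri$crs))

# One named list holds all the result vectors
india <- vector("list", length(crimes))
names(india) <- crimes

for (cr in crimes) {
  vect <- numeric(length(years))
  for (j in seq_along(years)) {
    vect[j] <- sum(cri$total[cri$yrs == years[j] & cri$crs == cr])
  }
  india[[cr]] <- vect
}

india$a   # totals for crime "a" in years 1, 2, 3
```

Each vector is then available as india$a, india[["b"]] and so on, and the whole set can be passed to lapply or sapply in one go.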

R warning message - invalid factor level, NA generated

I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. So here is everything I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what it is. I have tried googling this, and also tried troubleshooting using other similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!

Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I am going to hint at it: Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, TRUE or FALSE? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
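The preallocation idea can be shown on its own, away from the assignment. This is only a generic sketch: the three state codes and the hospital names are placeholders, and the ranking logic is deliberately left out.

```r
allstate <- c("AK", "AL", "AZ")   # hypothetical subset of states

# Create the data frame at its final size, with character (not factor) columns
ranklist <- data.frame(hospital = character(length(allstate)),
                       state    = character(length(allstate)),
                       stringsAsFactors = FALSE)

# Fill row i in place instead of calling rbind on every iteration
for (i in seq_along(allstate)) {
  ranklist[i, "hospital"] <- paste("Hospital", i)   # placeholder value
  ranklist[i, "state"]    <- allstate[i]
}

str(ranklist)
```

Because both columns are character vectors of the right length from the start, no factor levels are involved and nothing has to be coerced when a row is filled.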

How to subset a list based on the length of its elements in R

In R I have a function (ip2coordinates from the package RDSTK) which looks up 11 fields of data for each IP address you supply.
I have a list of IPs called ip.addresses:
> head(ip.addresses)
[1] "128.177.90.11" "71.179.12.143" "66.31.55.111" "98.204.243.187" "67.231.207.9" "67.61.248.12"
Note: Those or any other IP's can be used to reproduce this problem.
So I apply the function to that object with sapply:
ips.info <- sapply(ip.addresses, ip2coordinates)
and get a list called ips.info as my result. This is all well and good, but I can't do much more with a list, so I need to convert it to a data frame. The problem is that not all IP addresses are in the databases, so some list elements have only 1 field and I get this error:
> ips.df <- as.data.frame(ips.info)
Error in data.frame(`128.177.90.10` = list(ip.address = "128.177.90.10", :
arguments imply differing number of rows: 1, 0
My question is -- "How do I remove the elements with missing/incomplete data or otherwise convert this list into a data frame with 11 columns and 1 row per IP address?"
I have tried several things.
First, I tried to write a loop that removes elements with a length of less than 11:
for (i in 1:length(ips.info)){
if (length(ips.info[i]) < 11){
ips.info[i] <- NULL}}
This leaves some records with no data and makes others say "NULL", but even those with "NULL" are not detected by is.null
Next, I tried the same thing with double square brackets and get
Error in ips.info[[i]] : subscript out of bounds
I also tried complete.cases() to see if it could potentially be useful
Error in complete.cases(ips.info) : not all arguments have the same length
Finally, I tried a variation of my for loop which was conditioned on length(ips.info[[i]]) == 11 and wrote complete records to another object, but somehow it results in an exact copy of ips.info

Here's one way you can accomplish this using the built-in Filter function
#input data
library(RDSTK)
ip.addresses<-c("128.177.90.10","71.179.13.143","66.31.55.111","98.204.243.188",
"67.231.207.8","67.61.248.15")
ips.info <- sapply(ip.addresses, ip2coordinates)
#data.frame creation
lengthIs <- function(n) function(x) length(x)==n
do.call(rbind, Filter(lengthIs(11), ips.info))
or if you prefer not to use a helper function
do.call(rbind, Filter(function(x) length(x)==11, ips.info))

Alternative solution based on base package.
# find non-complete elements
ids.to.remove <- sapply(ips.info, function(i) length(i) < 11)
# remove found elements
ips.info <- ips.info[!ids.to.remove]
# create data.frame
df <- do.call(rbind, ips.info)
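Both answers can be tried without RDSTK or a network connection. The sketch below uses a made-up list with 3 fields per complete record instead of 11, and it converts each record with as.data.frame before binding, which is a small variation on the answers above. It also shows why the loop in the question never removed anything: single brackets return a list of length 1, so the length test has to use double brackets.

```r
# Toy stand-in for ips.info: the second lookup came back incomplete
ips.info <- list(
  "1.1.1.1" = list(ip = "1.1.1.1", lat = 1, lon = 2),
  "2.2.2.2" = list(ip = "2.2.2.2"),
  "3.3.3.3" = list(ip = "3.3.3.3", lat = 3, lon = 4)
)

length(ips.info[1])     # 1: a sub-list holding one element
length(ips.info[[1]])   # 3: the element itself

# Keep only the complete records
complete <- Filter(function(x) length(x) == 3, ips.info)

# One row per remaining IP address
df <- do.call(rbind, lapply(complete, as.data.frame, stringsAsFactors = FALSE))
df
```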
