Replacing column values based on related column in R - r

I'm currently working on a dataset which has an address and a zip code column. I'm trying to deal with the invalid/missing data in the zip code column by finding a different record with the same address and then filling in the missing zip code from it. What would be the best approach to go about doing this?

Step 1. Using the non-missing addresses and zip codes, construct a dictionary
data frame of sorts. For example, in a data frame "df" with an "address"
column and a "zip_code" column, you could get this via:
library(dplyr)
# keep only complete address/zip pairs, one row per unique pair
zip_dictionary <- na.omit(select(df, address, zip_code))
zip_dictionary <- distinct(zip_dictionary)
This assumes there is only one unique value of "zip_code" for each "address"
in your data. If not, you need to figure out which value to use and filter or
recode it accordingly, as in the sketch below.
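For example, here is a quick way to find conflicting addresses and, as one option, keep the most frequent zip code per address (a sketch assuming the "df" and "zip_dictionary" names from above; slice_max() needs dplyr >= 1.0.0):
library(dplyr)
# addresses that map to more than one zip code
conflicts <- zip_dictionary %>%
  count(address) %>%
  filter(n > 1)
# one option: keep the most common zip code for each address
zip_dictionary <- df %>%
  filter(!is.na(address), !is.na(zip_code)) %>%
  count(address, zip_code) %>%
  group_by(address) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(address, zip_code)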
Step 2. Install the {elucidate} package from GitHub and use the translate()
function to fill in the missing zip codes using the extracted dictionary from
step 1:
remotes::install_github("bcgov/elucidate")
library(elucidate)
df <- df %>%
  mutate(zip_code = if_else(is.na(zip_code),
                            translate(address,
                                      old = zip_dictionary$address,
                                      new = zip_dictionary$zip_code),
                            zip_code)) # if_else() requires a `false` argument, so keep the existing zip code
disclaimer: I am the author of the {elucidate} package
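If you'd rather avoid installing a package, a base R sketch of the same lookup using match() (assuming the "df" and "zip_dictionary" objects from step 1):
missing <- is.na(df$zip_code)
# look up each missing record's address in the dictionary
df$zip_code[missing] <- zip_dictionary$zip_code[
  match(df$address[missing], zip_dictionary$address)
]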

Related

Error message in RStudio while loading the data

Can anyone tell me why I get this type of error every time I load the data in R? Is there any solution for it?
The data you're reading in probably has the columns poison, time and treat, and those names are already in use on R's search path.
When you attach the table, R makes the names poison, time, and treat refer to the respective columns, masking any existing objects with those names. An example with the built-in sleep dataset:
data1 <- as.data.frame(sleep)
names(data1)
#> [1] "extra" "group" "ID"
attach(data1)
# extra, group, and ID now refer to the columns of data1
data2 <- as.data.frame(sleep)
attach(data2)
# The message you're getting:
#> The following objects are masked from data1:
#>     extra, group, ID
To avoid this problem, which can lead to unintended results, make sure to call detach(data1) before attaching another dataset that shares column names, as sketched below.
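A minimal sketch of that fix, starting from a fresh session:
data1 <- as.data.frame(sleep)
attach(data1)
# ... work with data1 ...
detach(data1) # remove data1 from the search path first
data2 <- as.data.frame(sleep)
attach(data2) # no masking message this time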

How to retrieve data using the rentrez package by giving a list of query names instead of a single one?

So I'm trying to use the rentrez package to retrieve DNA sequence data from GenBank, giving as input a list of species.
What I've done is create a vector of the species I want to query, then build a term specifying the types of sequence data I want to retrieve, then run a search that retrieves all the occurrences matching my query, and finally fetch the actual sequence data in FASTA format into data.
library(rentrez)
species<-c("Ablennes hians","Centrophryne spinulosa","Doratonotus megalepis","Entomacrodus cadenati","Katsuwonus pelamis","Lutjanus fulgens","Pagellus erythrinus")
for (x in species) {
  term <- paste(x, "[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])", sep = "", collapse = NULL)
  search <- entrez_search(db = "nuccore", term = term, retmax = 99999)
  data <- entrez_fetch(db = "nuccore", id = search$ids, rettype = "fasta")
}
Basically what I'm trying to do is concatenate the results of the queries for each species into a single variable. I began with a for loop, but in this form it makes no sense because the data for each newly queried species simply replaces the previous species' data in data.
For some elements of species there will be no data to retrieve, and R shows this error:
Error: Vector of IDs to send to NCBI is empty, perhaps entrez_search or entrez_link found no hits?
In the cases where this error is shown, and therefore there is no data for that particular species, I want the code to just keep going and ignore it.
My desired output is a variable data containing the sequence data retrieved for all the names in species.
library(rentrez)
species <- c("Ablennes hians", "Centrophryne spinulosa", "Doratonotus megalepis", "Entomacrodus cadenati", "Katsuwonus pelamis", "Lutjanus fulgens", "Pagellus erythrinus")
data <- list()
for (x in species) {
  term <- paste(x, "[Organism] AND (((COI[Gene] OR CO1[Gene] OR COXI[Gene] OR COX1[Gene]) AND (500[SLEN]:3000[SLEN])) OR complete genome[All Fields] OR mitochondrial genome[All Fields])", sep = "", collapse = NULL)
  search <- entrez_search(db = "nuccore", term = term, retmax = 99999)
  # store one element per species; if NCBI returns no hits, record NA and keep going
  data[[x]] <- tryCatch(entrez_fetch(db = "nuccore", id = search$ids, rettype = "fasta"),
                        error = function(e) NA)
}
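If you then want a single variable rather than a list, one follow-up sketch (assuming the data list built above) is to drop the species that returned NA and collapse the remaining FASTA text:
# drop species with no hits, then concatenate the FASTA strings
all_fasta <- paste(unlist(data[!is.na(data)]), collapse = "\n")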

A cell in a CSV is (wrongly) read as a character vector of length 2 in R

I have a data frame that I read in from a .csv (or .xlsx; I've tried both), and one of the variables in it is a vector of dates.
You can generate similar data with this:
Name <- rep("Date", 15)
num <- seq(1:15)
Name <- paste(Name, num, sep = "_")
data1 <- data.frame(
Name,
Due.Date = seq(as.Date("2020/09/24", origin = "1900-01-01"),
as.Date("2020/10/08", origin = "1900-01-01"), "days")
)
When I reference one of the cells specifically, like this: str(project_dates$Due.Date[241]) it reads the date as normal.
However, the exact position of the important dates varies from project to project, so I wrote a command that identifies where the important dates are in the sheet, like this: str(project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"])
This code worked on a few projects, but on the current project it now returns a character vector of length 2. One of the values is the date, and the other is NA. To make matters worse, the positions of the date and the NA are not fixed across cells: the date is the first value in some cells and the second in others (otherwise I would just reference, e.g., the first item in the vector).
What is going on here, but more importantly, how do I fix this?!
Clarification on the second command:
When I was originally reading from an Excel file, the command was project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"]$Due.Date because it was returning a 1x1 tibble, and I needed the value in the tibble.
When I switched to reading in data as a csv, I had to remove the $Due.Date because the command was now reading the value as an atomic vector, so the $ operator was no longer valid.
Help me, Oh Blessed 1's (with) Knowledge! You're my only hope!
Edited to include an image of the data like the one that generates the error
I feel sheepish.
I was able to remove the NAs with:
data1 <- data1[!is.na(data1$Due.Date), ]
I assumed that command would listwise delete the rows with any missing values, so that if a cell contained the length-2 vector I would lose the whole row of data. Instead, it removed the NA from the cell, leaving only the date.
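One other thing worth checking in the command above (an aside that may or may not apply to your sheet): str_detect() does substring matching, so the pattern "Date_17" would also match names like "Date_170" and can return more than one value. Anchoring the pattern restricts it to the exact name:
library(stringr)
# match the exact name only, not "Date_170", "Date_171", ...
project_dates[str_detect(project_dates$Name, "^Date_17$"), "Due.Date"]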
Thank you to everyone who commented and offered help!

Selecting and renaming columns in SpatialPointsDataFrame

I'm working with a feature class dataset extracted from a geodatabase, which I've filtered to my area of interest and intersected with a SpatialPointsDataFrame. In order to export it to a shapefile with writeOGR, I need to format the attribute names, and I also want to select only specific columns to export in my final shapefile. I have been running into a lot of errors using standard select() or base R subsetting techniques. For some reason R doesn't seem to recognize the column names when I try to select them. I've tried lots of different methods and can't figure out where I'm going wrong.
```
bfcln %>%
  select(STATEFP, DP2_HC03_V, DP2_HC03V.1)
#> Error in tolower(use) : object 'STATEFP' not found
```
This is the workflow that produced bfcln:
```
# create a spatial join between bf_pop and or_acs
# check CRS
crsbf <- bf_pop@proj4string
# change acs CRS to match bf_pop
oracs_reprj <- spTransform(or_acs, crsbf)
# join by spatial attributes
bf_int <- raster::intersect(bf_pop, oracs_reprj)
# truncate field names to 10 characters for ESRI formatting
names(bf_int) <- strtrim(names(bf_int), 10)
# remove duplicates from attribute table
bfcln <- bf_int[which(!duplicated(bf_int$id)), ]
```
After failing with the select() method multiple times, I tried renaming columns.
```
# rename variables of interest
bfcln1 <- bfcln %>%
  select(DP2_HC03_V) %>%
  rename(DP2_HC03_V = pcntunmar) %>%
  select(DP2_HC03_V.1) %>%
  rename(DP2_HC03_V.1 = pcntirsh)
#> Error in tolower(use) : object 'DP2_HC03_V' not found
```
To use dplyr verbs like select() and rename() on Spatial* objects, you'll need to install the package spdplyr.
Similarly to dplyr, you'd do:
```
library(spdplyr)

df <- df %>%
  rename(newName = oldName)
```
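Applied to the question's data, that would look something like this (a sketch assuming the bfcln object and the column names shown above):
```
library(spdplyr) # extends dplyr verbs to Spatial*DataFrame objects

bfcln1 <- bfcln %>%
  select(STATEFP, DP2_HC03_V, DP2_HC03V.1) %>%
  rename(pcntunmar = DP2_HC03_V, # rename() takes new_name = old_name
         pcntirsh = DP2_HC03V.1)
```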

R spCbind error

I have successfully added information to shapefiles before (see my post on http://rusergroup.swansea.ac.uk/Healthmap.ashx?HL=map ).
However, I just tried to do it again with a slightly different shapefile (new local health boards for Wales) and the code fails at spCbind with a "row names not identical" error:
o <- match(wales.lonlat$NEW_LABEL, wds$HB_CD)
wds.xtra <- wds[o,]
wales.ncchd <- spCbind(wales.lonlat, wds.xtra)
My rows did have different names before and that didn't cause any problems. I relabeled the column in wds.xtra to match "NEW_LABEL" and that doesn't help.
The labels and order of labels do match exactly between wales.lonlat and wds.xtra.
(I'm using Revolution R 5.0, which is built on R 2.13.2)
I use match() to merge data into the sp object's data slot based on row names (or any other common ID). This avoids needing maptools for the spCbind function.
# based on row names
sdata@data <- data.frame(sdata@data, new.df[match(rownames(sdata@data), rownames(new.df)), ])
# based on a common ID
sdata@data <- data.frame(sdata@data, new.df[match(sdata@data$ID, new.df$ID), ])
# where sdata is your sp object and new.df is a data.frame that you want to merge into sdata
I had the same error and resolved it by deleting all the columns that were not actually meant to be added. I suppose they confused spCbind, because the matching tried to match all row elements, not only the one given. In my example, I used
xtra2 <- data.frame(xtra$ID_3, xtra$COMPANY)
to extract the relevant fields and fed them to spCbind afterwards:
gadm <- spCbind(gadm, xtra2)
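If you want to stay with spCbind itself, which requires identical row names on its two arguments, another option is to copy the row names across before binding (a sketch assuming the wales.lonlat and wds objects from the question):
library(maptools)

o <- match(wales.lonlat$NEW_LABEL, wds$HB_CD)
wds.xtra <- wds[o, ]
# give the extra data the same row names as the sp object before binding
row.names(wds.xtra) <- row.names(wales.lonlat)
wales.ncchd <- spCbind(wales.lonlat, wds.xtra)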
