Use of substr() on DataFrame column in SparkR - r

I am using SparkR and want to use the substr() command to isolate the last character of a string that is contained in a column. I can get substr() to work if I set the StartPosition and EndPosition to a constant:
substr(sdfIris$Species, 8, 8)
But when I try to set these parameters using a value sourced from the DataFrame:
sdfIris <- createDataFrame(sqlContext, iris)
sdfIris$Len <- length(sdfIris$Species)
sdfIris$Last <- substr(sdfIris$Species, sdfIris$Len, sdfIris$Len)
Error in as.integer(start - 1) : cannot coerce type 'S4' to vector of type 'integer'
It seems that the result being returned from sdfIris$Len is perhaps a one-cell DataFrame, and the parameter needs an integer.
I have tried collect(sdfIris$Len), but:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"Column"’
This seems incongruous. substr() seems to see sdfIris$Len as a DataFrame, but collect() seems to see it as a Column.
I have already identified a work-around by using registerTempTable and using SparkSQL's substr to isolate the last character, but I was hoping to avoid the unnecessary steps of switching to SQL.
How can I use SparkR substr() on a DataFrame column with dynamic Start and Finish parameters?

It is not optimal but you can use expr:
df <- createDataFrame(
sqlContext,
data.frame(s=c("foo", "bar", "foobar"), from=c(1, 2, 0), to=c(2, 3, 5))
)
select(df, expr("substr(s, from, to)")) %>% head()
## substr(s,from,to)
## 1 fo
## 2 ar
## 3 fooba
or selectExpr:
selectExpr(df, "substr(s, from, to)") %>% head()
## substr(s,from,to)
## 1 fo
## 2 ar
## 3 fooba
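as well as an equivalent SQL query. Note that in Spark SQL the third argument of substr is a length rather than an end position, so the last character of a column can be written as substr(Species, length(Species), 1).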

Related

ERROR in Biostrings while trying to MSA using ggmsa

I want to do an MSA of the same peptide in 3 species (rat, zebrafish, and pupfish) and match it (find identities/disparities) against 2 synthetic peptides that I have (M35 and M871), but I'm getting the following error after building the vector:
library(ggmsa)
galanin_table <- c("MACSKHLVLFLTILLSLAETPDSAPAHRGRGGWTLNSAGYLLGPVLHLSSKANQGRKTDSALEILDLWKAIDGLPYSRSPRMTKRSMGETFVKPRTGDLRIVDKNVPDEEATLNL", "Rat", "MHRCVGGVCVSLIVCAFLTETLGMVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDTPSARGREDLLGQYAIDSHRSLSDKHGLAGKREMPLDEDFKTGALRIADEDVVHTIIDFLSYLKLKEIGALDSLPSSLTSEEISQP", "Zebrafish", "MQRSFAVFCVSLIFCATLSETIGLVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDSPSARGRDELVNQYGIDGHRTLGDKAGLAGKRDMAQEDDVRTGPLRIGDEDIIHTVIDFLSYLKLKEMGALDSLPSPLTSDELANP", "Pupfish", "GWTLNSAGYLLGPPPGFSPFR","M35", "WTLNSAGYLLGPEHPPPALALA","M871")
galanin_matrix <- matrix(galanin_table, byrow=T, nrow=5)
galanin_table <- as.data.frame(galanin_matrix, stringsAsFactors = F)
colnames(galanin_table) <- c("Sequences", "Species")
galanin_table <- as.data.frame(galanin_table)
galanin_list <- as.list(galanin_table)
galanin_asvector <- as.vector(galanin_list)
galanin_asvector_ss <- Biostrings::AAStringSet(x= galanin_asvector)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'seqtype' for signature '"character"'
Probably I'm building the vector in the wrong way.
You've certainly started out with an interesting approach for importing your sequences into R. ggmsa() expects either a file containing sequences in a recognized format like FASTA, or an XStringSet object of your sequences. I don't know if you've actually stored your sequences in a character vector, or if that was just an easy way of including them in this example, but assuming that's what you've got, this should get you started:
# load in decipher for the aligner
suppressMessages(library(DECIPHER))
# load in ggmsa
library(ggmsa)
# your sequences
galanin_table <- c("MACSKHLVLFLTILLSLAETPDSAPAHRGRGGWTLNSAGYLLGPVLHLSSKANQGRKTDSALEILDLWKAIDGLPYSRSPRMTKRSMGETFVKPRTGDLRIVDKNVPDEEATLNL", "Rat", "MHRCVGGVCVSLIVCAFLTETLGMVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDTPSARGREDLLGQYAIDSHRSLSDKHGLAGKREMPLDEDFKTGALRIADEDVVHTIIDFLSYLKLKEIGALDSLPSSLTSEEISQP", "Zebrafish", "MQRSFAVFCVSLIFCATLSETIGLVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDSPSARGRDELVNQYGIDGHRTLGDKAGLAGKRDMAQEDDVRTGPLRIGDEDIIHTVIDFLSYLKLKEMGALDSLPSPLTSDELANP", "Pupfish", "GWTLNSAGYLLGPPPGFSPFR","M35", "WTLNSAGYLLGPEHPPPALALA","M871")
# grab your sequences; c(T,F) will recycle over the original vector to select
# a 1,3,5,7,etc pattern
# conversely c(F,T) can grab the names in the opposite pattern
seqs <- AAStringSet(galanin_table[c(T,F)])
names(seqs) <- galanin_table[c(F,T)]
# align your sequences
ali <- AlignSeqs(seqs)
# call ggmsa
ggmsa(msa = ali,
color = "Clustal",
font = "DroidSansMono",
char_width = 0.5,
seq_name = TRUE)
Good luck!
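The recycling trick in the answer can be checked in base R before any Bioconductor package is involved. This is a minimal sketch on a shortened copy of the vector (only the two synthetic peptides); seqs_chr, seq_names and named_seqs are names made up for the illustration. The error in the question most likely comes from passing a list (the result of as.list on a data frame) where AAStringSet expects a plain character vector.

```r
# Shortened interleaved vector: sequence, name, sequence, name, ...
galanin_table <- c("GWTLNSAGYLLGPPPGFSPFR", "M35",
                   "WTLNSAGYLLGPEHPPPALALA", "M871")

# c(TRUE, FALSE) is recycled along the vector and keeps positions 1, 3, ...
seqs_chr <- galanin_table[c(TRUE, FALSE)]

# c(FALSE, TRUE) keeps positions 2, 4, ... (the names)
seq_names <- galanin_table[c(FALSE, TRUE)]

# A named character vector: each sequence labelled with its name
named_seqs <- setNames(seqs_chr, seq_names)
named_seqs
```

A plain character vector like seqs_chr is what the answer passes to AAStringSet before assigning the names.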

Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments

I am trying to do some Bioconductor exercises on RStudio Cloud. Running the first two code chunks (#1, #2) has been fine, but the last one (#3) gives the error message. Can anyone help, please?
#1 Extract the first 21 bases of zikaVirus into dna_seq and print it
dna_seq <- subseq(unlist(zikaVirus), end = 21)
dna_seq
21-letter "DNAString" instance
seq: AGTTGTTGATCTGTGTGAGTC
#2 Transcribe dna_seq into an RNAString object and print it
rna_seq <- RNAString(dna_seq)
rna_seq
21-letter "RNAString" instance
seq: AGUUGUUGAUCUGUGUGAGUC
#3 Translate rna_seq into an AAString object and print it
aa_seq <- translate(rna_seq)
aa_seq
aa_seq <- translate(rna_seq)
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
aa_seq
Error: object 'aa_seq' not found
Thank you. I managed to solve the problem: I think there was a clash with the translate() function, because it is defined by both the seqinr and Biostrings packages (I had loaded both). I had to unload seqinr, because the exercises I was doing were based on the Biostrings package.
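If both packages need to stay loaded, one option is to call the function through its namespace instead. A minimal sketch, assuming Biostrings is installed and using the 21-base RNA sequence from the question typed in as a literal:

```r
library(Biostrings)

# The RNA sequence printed in step #2 of the question, entered as a literal
rna_seq <- RNAString("AGUUGUUGAUCUGUGUGAGUC")

# The Biostrings:: prefix selects this package's translate(),
# even if another attached package (such as seqinr) masks the name
aa_seq <- Biostrings::translate(rna_seq)
aa_seq

# Alternatively, detach the masking package (only valid if it is attached):
# detach("package:seqinr", unload = TRUE)
```

The two UGA codons in this fragment are stop codons, which Biostrings prints as "*" in the translated sequence.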

R, getting an invalid argument to unary operator when using order function

I'm essentially doing the exact same thing 3 times, and when adding a new variable I get this error
Error in -emps$EV : invalid argument to unary operator
The code chunk causing this is
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
Works like a charm for the first list, but the identical code thereafter causes the error.
This specific line is causing the error
sort3<-emps[order(-emps$EV),]
How can I fix/workaround this?
Full Code
library(RCurl)  # provides getURL()
url <- getURL("https://raw.githubusercontent.com/M-ttM/Basketball/master/class.csv")
shots <- read.csv(text = url)
shots$make<-shots$points>0
shots2<-shots[which(!(shots$player=="Luc Richard Mbah a Moute")),]
fit1<-glm(make~factor(type)+factor(period), data=shots2,family="binomial")
summary(fit1)
shots2$makeodds<-fitted(fit1)
shots2$EV<-shots2$makeodds*ifelse(shots2$type=="3pt",3,2)
shots3<-shots2[which(shots2$y>7),]
locmakes<-data.frame(table(shots3[, c("x", "y")]))
s1k <- shots2[with(shots2, player %in% names(which(table(player)>=1000))), ]
pps<-aggregate(points~player,s1k,mean)
sort<-pps[order(-pps$points),]
head(sort,10)
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
The same error message can also occur when you negate a character (chr) column inside order(). A possible workaround is to use the rev() function instead of the minus sign, like so:
column_a = c("a","a","b","b","c","c")
column_b = seq(6)
df = data.frame(column_a, column_b)
df$column_a = as.character(df$column_a)
df[with(df, order(-column_a, column_b)),]
> Error in -column_a : invalid argument to unary operator
df[with(df, order(rev(column_a), column_b)),]
column_a column_b
5 c 5
6 c 6
3 b 3
4 b 4
1 a 1
2 a 2
Let me know if it works in your case.
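One caveat about rev(): it reverses the positions of the elements rather than the sort order, so it produces a descending result only when the data happens to be arranged like the example above. A more general sketch, using base R's xtfrm() to turn the character column into sortable numbers that can be negated (the toy data frame here is made up for the illustration):

```r
# Toy data in an arbitrary (unsorted) order
df <- data.frame(column_a = c("b", "a", "c", "a"),
                 column_b = 1:4,
                 stringsAsFactors = FALSE)

# xtfrm() maps each string to a number that sorts the same way,
# so the minus sign now works: column_a descending, column_b ascending
desc <- df[order(-xtfrm(df$column_a), df$column_b), ]
desc
```

When only a single sort key is needed, order(df$column_a, decreasing = TRUE) does the same job without xtfrm().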
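In the line that fails, emps$EV doesn't exist: emps was built with aggregate(EM ~ player, ...), so it has only the columns player and EM. emps$EV is therefore NULL, and negating NULL raises the error.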
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
You probably meant
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EM),]
head(sort3,10)
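This can be confirmed on a small made-up data frame; the values below are invented for the illustration and only the column names follow the question:

```r
# Invented stand-in for s1k with the columns used in the question
s1k <- data.frame(player = c("A", "A", "B", "B"),
                  points = c(2, 0, 3, 2),
                  EV     = c(1.0, 1.1, 1.2, 0.9))

s1k$EM <- s1k$points - s1k$EV
emps <- aggregate(EM ~ player, s1k, mean)

names(emps)        # only "player" and "EM"
is.null(emps$EV)   # TRUE, so -emps$EV is -NULL, which is an error

# Ordering by the column that actually exists works
sort3 <- emps[order(-emps$EM), ]
sort3
```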

Selecting features from a feature set using mRMRe package

I am a new user of R and trying to use mRMRe R package (mRMR is one of the good and well known feature selection approaches) to obtain feature subset from a feature set. Please excuse if my question is simple as I really want to know how I can fix an error. Below is the detail.
Suppose, I have a csv file (gene.csv) having feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates positive class and '-1' stands for negative class). Here's a sample gene.csv file:
[G1.1.1.1] [G1.1.1.2] [G1.1.1.3] [G1.1.1.4] [G1.1.1.5] [G1.1.1.6] [Output]
11.688312 0.974026 4.87013 7.142857 3.571429 10.064935 -1
12.538226 1.223242 3.669725 6.116208 3.363914 9.174312 1
10.791367 0.719424 6.115108 6.47482 3.597122 10.791367 -1
13.533835 0.37594 6.766917 7.142857 2.631579 10.902256 1
9.737828 2.247191 5.992509 5.992509 2.996255 8.614232 -1
11.864407 0.564972 7.344633 4.519774 3.389831 7.909605 -1
11.931818 0 7.386364 5.113636 3.409091 6.818182 1
16.666667 0.333333 7.333333 4.333333 2 8.333333 -1
I am trying to get best feature subset of 2 attributes (out of above 6 attributes) and wrote following R code.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7,
feature_count = 2, solution_count = 1)
When I run this code, I am getting following error for the statement f_data <- mRMR.data(data = data.frame(df)):
Error in .local(.Object, ...) :
data columns must be either of numeric, ordered factor or Surv type
However, the data in each column of the csv file are real numbers. So, how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), as my target class variable name is "[Output]" in the gene.csv file.
I will appreciate much if anyone can help me to obtain the best feature subset based on the gene.csv file using mRMRe R package.
I solved the problem by modifying my code as follows.
library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
df[[7]] <- as.numeric(df[[7]])
f_data <- mRMR.data(data = data.frame(df))
results <- mRMR.classic("mRMRe.Filter", data = f_data, target_indices = 7,
feature_count = 2)
solutions(results)
It worked fine. The output of the code gives the indices of the selected 2 features.
I think it has to do with your Output column, which is probably of class integer. You can check that using class(df[[7]]).
To convert it to numeric, as required by the error message, just type:
df[[7]] <- as.numeric(df[[7]])
That worked for me.
As for the other question, after reading the documentation, setting target_indices = 7 seems the right choice.
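If more than one column might have been read as integer, the same conversion can be applied to every column at once. A minimal base R sketch on a made-up two-row data frame (the column names G1 and Output are placeholders, not the names read.csv would produce from the file):

```r
# Made-up data: one double column and one integer class column
df <- data.frame(G1 = c(11.688312, 12.538226),
                 Output = c(-1L, 1L))

sapply(df, class)   # Output is "integer" here

# Convert every column to double; df[] keeps the data frame structure
df[] <- lapply(df, as.numeric)

sapply(df, class)   # all "numeric" now
```

This assumes every column is meant to be numeric; a factor or character column would need separate handling before as.numeric is applied.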

Nested Loop not Working to gather data from NOAA

I'm using the R package rnoaa (along with the other packages it requires) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("ropensci/rnoaa")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
for(bid in lid){
bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
bid.year.data <- data.frame(bid.year$data)
write.csv(bid.year.data, file='cwind/bid_range.csv')
}
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not use the loop variables in the file name; it just names the file bid_range.csv.
This error message shows that there are no data for a given station id in 1990. Because you were using a for loop, it stops as soon as it gets an error.
Here I introduce the use of tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "Grid" containing all combinations of id and year
The expand function from tidyr can create the combination of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
select(id) %>%
expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for the map2 function
We will use map2 to loop through all the combinations of id and year via its .x and .y arguments, so we reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now, when it meets an error, it stores the error message and moves on to the next combination rather than stopping.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data are stored in wlevel_data. Each element in wlevel_data has two parts. $result holds the data if the download was successful; otherwise, it is NULL. $error is NULL if the download was successful; otherwise, it holds the error message.
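To see what safely() returns without contacting the NOAA servers, here is a minimal sketch with a hypothetical stand-in function, fake_fetch, in place of buoy. It assumes purrr is installed; the ids and years are invented.

```r
library(purrr)

# Hypothetical stand-in for buoy(): pretends odd years have no data
fake_fetch <- function(id, year) {
  if (year %% 2 == 1) stop("no data for ", id, " in ", year)
  data.frame(id = id, year = year)
}

# safely() wraps the function so errors are captured instead of thrown
fake_safe <- safely(fake_fetch)

# One call per (id, year) pair; the failing pair does not stop the loop
res <- map2(c("a", "a", "b"), c(1990, 1991, 1992), fake_safe)

is.null(res[[1]]$error)            # TRUE: first call succeeded
is.null(res[[2]]$result)           # TRUE: second call failed
conditionMessage(res[[2]]$error)   # the captured error message
```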
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements: result and error. We can store these two and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
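If you would rather keep the original nested for loop and write one CSV file per id and year, base R's tryCatch can skip the failing combinations. The sketch below uses a hypothetical stand-in, fetch_one, in place of the buoy call so that it runs offline; the ids, years and output directory are invented. Note that it loops over a plain vector of ids: iterating over a data frame, as in the question, visits its columns rather than its rows.

```r
# Hypothetical stand-in for the download: pretends 1990 has no data
fetch_one <- function(id, year) {
  if (year == 1990) stop("no data for ", id, " in ", year)
  data.frame(id = id, year = year, value = 1)
}

# Write into a temporary directory for the illustration
out_dir <- file.path(tempdir(), "wlevel")
dir.create(out_dir, showWarnings = FALSE)

ids <- c("st1", "st2")   # a plain character vector of station ids

for (year in 1990:1991) {
  for (id in ids) {
    # Return NULL instead of stopping when a download fails
    dat <- tryCatch(fetch_one(id, year), error = function(e) NULL)
    if (is.null(dat)) next

    # Build the file name from the loop variables
    out_file <- file.path(out_dir, paste0(id, "_", year, ".csv"))
    write.csv(dat, out_file, row.names = FALSE)
  }
}

written <- sort(list.files(out_dir))
written
```

Building the path with paste0 is what puts the id and year into each file name; a literal string such as 'cwind/bid_range.csv' is never interpolated by R.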
