I have an R script that contains a function, which I received in an answer to this question: R: For loop nested in for loop.
The script has been working fine on the first part of my data set, but I am now trying to use it on another part which, as far as I can tell, has exactly the same format as the first. For some reason I get an error when running the script on it, and I cannot figure out what causes the error.
This is the script I am using:
require(data.table)
MappingTable_Calibrated = read.csv2(file.choose(), header=TRUE)
head(MappingTable_Calibrated)
#The data is sorted primarily by Scaffold in ascending order, and secondarily by Cal_Startgen in ascending order.
MappingTable_Calibratedord = MappingTable_Calibrated[order(MappingTable_Calibrated$Scaffold, MappingTable_Calibrated$Cal_Startgen),]
head(MappingTable_Calibratedord)
dt <- data.table(MappingTable_Calibratedord, key = "Scaffold,Cal_Startgen")
head(dt)
# The following function creates pairs of loci for each scaffold.
# The function is a modified version of a function retrieved from http://www.stackoverflow.com
fn = function(dtIn, id){
  # Creates the object dtHead containing all lines of dtIn except the last line
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  # The names of dtHead are appended with _a. paste0(...) is short for paste(..., sep = "")
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  # Creates the object dtTail containing all lines of dtIn except the first line
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  # The names of dtTail are appended with _b.
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  # dtHead and dtTail are combined. Scaffold is defined as id. The blank column "Pairwise_Distance" is added to the table.
  cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0)
}
#The function is run on the data. .SDcols defines the columns to be included in .SD (and hence in the output).
output = dt[, fn(.SD, Scaffold), by = Scaffold, .SDcols = c("Name", "Startpos", "Endpos", "Rev", "Startgen", "Endgen", "Cal_Startgen", "Cal_Endgen", "Length")]
output = as.data.frame(output[, with = FALSE])
But when trying to create "output" I get the following error:
Error in data.table(..., key = key(..1)) : Item 1 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 2 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.
dt looks like this:
Name Length Startpos Endpos Scaffold Startgen Endgen Rev Match Cal_Startgen Cal_Endgen
1: Locus_7173 144 0 144 34 101196 101340 1 1 101196 101340
2: Locus_133 110 0 110 34 223659 223776 1 1 223659 223776
3: Locus_2746 161 0 89 65 101415 101504 1 1 101415 101576
A full dput of "dt" can be found here: https://www.dropbox.com/sh/3j4i04s2rg6b63h/AADkWG3OcsutTiSsyTl8L2Vda?dl=0
Start by tracking down the data that cause the error:
fn = function(dtIn, id){
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  r <- tryCatch(cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0), error = function(e) NULL)
  if(is.null(r)) browser()
  r
}
Then you can see you are trying to cbind elements of different nrow/length:
Browse[1]> dtHead
Empty data.table (0 rows) of 9 cols: Name_a,Startpos_a,Endpos_a,Rev_a,Startgen_a,Endgen_a...
Browse[1]> dtTail
Empty data.table (0 rows) of 9 cols: Name_b,Startpos_b,Endpos_b,Rev_b,Startgen_b,Endgen_b...
Browse[1]> id
[1] 76
Browse[1]> 0
[1] 0
Which is not allowed.
I recommend putting in an if (nrow(dtIn) > 1) check, or something similar, and returning columns id = integer(), Pairwise_Distance = numeric() for the zero-row cases.
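For example, a minimal sketch of that guard (my addition, not from the original answer; the empty-case column types are assumptions you may need to adjust to your real columns):
fn = function(dtIn, id){
  # Scaffolds with only one locus would produce zero-row head/tail tables,
  # so return a zero-row result with the right structure instead of cbind-ing mismatched lengths.
  if (nrow(dtIn) < 2) {
    dtHead0 = dtIn[0]
    setnames(dtHead0, paste0(colnames(dtHead0), "_a"))
    dtTail0 = dtIn[0]
    setnames(dtTail0, paste0(colnames(dtTail0), "_b"))
    return(cbind(dtHead0, dtTail0, Scaffold = integer(), Pairwise_Distance = numeric()))
  }
  dtHead = head(dtIn, n = nrow(dtIn) - 1)
  setnames(dtHead, paste0(colnames(dtHead), "_a"))
  dtTail = tail(dtIn, n = nrow(dtIn) - 1)
  setnames(dtTail, paste0(colnames(dtTail), "_b"))
  cbind(dtHead, dtTail, Scaffold = id, Pairwise_Distance = 0)
}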
dt <- data.frame(name = "John",children = I(list(c(1,2,3))))
name children
1 John 1, 2, 3
After trying this
dt[nrow(dt) + 1,] = c("Amos", I(list(c(3,4,5))))
I get this warning:
Warning message:
In `[<-.data.frame`(`*tmp*`, nrow(dt) + 1, , value = list("Amos", :
replacement element 2 has 3 rows to replace 1 rows
I think your issue is the c().
A row of a data frame is a list (one element per column), not an atomic vector.
Try:
dt[nrow(dt) + 1,] = list("Amos", I(list(c(3,4,5))))
or:
dt <- rbind(dt, list("Amos", I(list(c(3,4,5)))))
Do note though, as per r2evans' note below and the link to the R Inferno, that this way of doing things gets very slow and memory-hungry with large data frames, because the whole data frame is duplicated for each new line. There are better ways to approach the issue; my answer is just intended to address why your original code was not working.
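If you do need to add many rows, one common pattern (a sketch of my own, with a made-up third row, not part of the original answer) is to collect the rows in a list and combine them once at the end, so the data frame is not copied on every append:
# Collect each new row as a one-row data frame in a list...
rows <- vector("list", 3)
rows[[1]] <- data.frame(name = "John", children = I(list(c(1, 2, 3))))
rows[[2]] <- data.frame(name = "Amos", children = I(list(c(3, 4, 5))))
rows[[3]] <- data.frame(name = "Mary", children = I(list(c(6, 7))))
# ...and bind them together in a single call.
dt <- do.call(rbind, rows)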
I am creating a function that takes a list of user-specified words and labels each word with a number according to its position in the list. The user can specify lists of different lengths.
For example:
myNotableWords<-c("No_IM","IM","LGD","HGD","T1a")
aa<-c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD")
aa<-data.frame(aa,stringsAsFactors=FALSE)
Intended Output
new <- c(1,2,1,4,5,4,5,2,3)
Is there a way of getting the index of the original list and then looking up where each element of the target list sits in that index and replacing it with the index number?
Why not just use the factor functionality of R?
A "factor data type" stores an integer that references a "level" (= character string) via the index number:
myNotableWords<-c("No_IM","IM","LGD","HGD","T1a")
aa<-c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD")
aa <- as.integer(factor(aa, myNotableWords, ordered = TRUE))
aa
# [1] 1 2 1 4 5 4 5 2 3
new <- c()
for (item in aa) {
new <- c(new, which(myNotableWords == item))
}
print(new)
#[1] 1 2 1 4 5 4 5 2 3
You can do this using data.frame; the syntax shouldn't change. I prefer using data.table though.
library(data.table)
myWords <- c("No_IM","IM","LGD","HGD","T1a")
myIndex <- data.table(keywords = myWords, word_index = seq(1, length(myWords)))
The myIndex line simply pairs each keyword in myWords with its position, giving a lookup table.
aa <- data.table(keywords = c("No_IM","IM","No_IM","HGD","T1a",
                              "HGD","T1a","IM","LGD"))
aa <- merge(aa, myIndex, by = "keywords", all.x = TRUE)
And now you have a table that shows each keyword and its unique number. Note that merge sorts the result by the join key, so the rows will no longer be in aa's original order.
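If the original row order of aa matters (e.g. to reproduce the intended output 1, 2, 1, 4, 5, ...), one option (my suggestion, not part of the original answer) is a data.table update join, which adds the index column without reordering aa:
library(data.table)
myWords <- c("No_IM", "IM", "LGD", "HGD", "T1a")
myIndex <- data.table(keywords = myWords, word_index = seq_along(myWords))
aa <- data.table(keywords = c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD"))
# Update join: look up each keyword in myIndex and write its index into aa,
# leaving aa's row order untouched.
aa[myIndex, on = "keywords", word_index := i.word_index]
aa$word_index
# [1] 1 2 1 4 5 4 5 2 3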
I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
  for(i in 1:nrow(x))
  {
    if(grep(".DERIVED", x[i,]) >= 1)
    {
      x <- x[-i,]
    }
  }
  for(i in 1:ncol(x))
  {
    if(is.numeric(x[,i]) != TRUE)
    {
      x <- x[,-i]
    }
  }
  return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no NULL values in it. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), which has zero length; your inequality then returns logical(0) rather than TRUE or FALSE. The error is telling you that the if statement cannot evaluate whether logical(0) >= 1.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet: you're removing rows/columns inside a loop. Say you remove the 5th column; the next loop iteration (when i = 6) will be handling what was the 7th column! (This will eventually end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
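A minimal base-R sketch of that idea (my addition, assuming x is the data frame from the question): decide what to keep first, then subset once, so the indices never shift under you.
# Rows to keep: those where no cell matches ".DERIVED"
keep_rows <- !apply(x, 1, function(r) any(grepl("\\.DERIVED", r)))
# Columns to keep: the numeric ones
keep_cols <- sapply(x, is.numeric)
x <- x[keep_rows, keep_cols, drop = FALSE]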
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Note that you should use the regex "\\.DERIVED" rather than ".DERIVED", which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the relevant columns are of numeric class,
# as in the example code provided. It will not detect columns that contain
# numbers stored as character strings.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
I'm working on a Stack Overflow data dump .csv file and I need to find:
The top 8 most frequent tags in the dataset.
To do this, I look at the set of tags associated with each row in the data1.PostTypeId column. The frequency of a tag is the number of questions (i.e. the number of rows) that have that tag.
Note 1: The file is very large, with over 1 million rows.
Note 2: I'm a beginner in R, so I need the simplest way. My attempt was to use the table function, but what I got was a list of tags and I couldn't figure out the top ones.
A sample of the table I'm using is below:
Let's say, for example, that "java" has the highest frequency (because it appears in the most rows), then the tag "python-3.x" has the second highest frequency (because it appears in the second most rows), and so on.
So basically I need to go over the second column of the table and find the top 8 tags that appear there.
Using base R with (optional) magrittr pipes for readability:
library(magrittr)
# Make a vector of all the tags present in data
tags_sep <- tags %>%
strsplit("><") %>%
unlist()
# Clean out the remaining < and >
tags_sep <- gsub("<|>", "", tags_sep)
# Frequency table sorted
tags_table <- tags_sep %>%
table() %>%
sort(decreasing = TRUE)
# Print the top 10 tags
tags_table[1:10]
java android amazon-ec2 amazon-web-services android-mediaplayer
4 2 1 1 1
antlr antlr4 apache-kafka appium asp.net
1 1 1 1 1
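Since the question asks for the top 8 specifically, take the first 8 entries of the sorted table:
head(tags_table, 8)  # or equivalently tags_table[1:8]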
Data
tags <- c(
"<java><android><selenium><appium>",
"<java><javafx><javafx-2>",
"<apache-kafka>",
"<java><spring><eclipse><gradle><spring-boot>",
"<c><stm32><led>",
"<asp.net>",
"<python-3.x><python-2.x>",
"<http><server><Iocalhost><ngrok>",
"<java><android><audio><android-mediaplayer>",
"<antlr><antlr4>",
"<ios><firebase><swift3><push-notification>",
"<amazon-web-services><amazon-ec2><terraform>",
"<xamarin.forms>",
"<gnuplot>",
"<rx-java><rx-android><rx-binding>",
"<vim><vim-plugin><syntastic>",
"<plot><quantile>",
"<node.js><express-handlebars>",
"<php><html>"
)
If I understood correctly, this should solve your problem:
library(stringr)
library(data.table)
# some dummy data
dat = data.table(id = 1:3, tags = c("<java><android><selenium>",
                                    "<java><javafx>",
                                    "<apache><android>"))
tags = apply(str_split(dat$tags, pattern = "><", simplify = T),
             2, function(x) str_replace(x, "<|>", "")) # separate one tag per column
foo = cbind(dat[, .(id)], tags) # add the separated tags to the data
foo[foo==""] = NA # substitute empty strings with NA
foo = melt.data.table(foo, id.vars = "id") # transform to long format
foo = foo[, .N, by = value] # calculate frequency
foo[, .SD[N %in% head(N, n = 1)]] # change the value of "n" to the number you want
value N
1: java 2
2: android 2
3: NA 2
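One caveat (my own note, not part of the original answer): .SD[N %in% head(N, n = 1)] keeps the rows whose count equals the first value of N, which only happens to pick out the most frequent tags here. For the top 8 tags it is more robust to sort by frequency first:
foo <- foo[!is.na(value)]   # drop the NA padding rows
head(foo[order(-N)], 8)     # the 8 most frequent tags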
I'm kind of new to data.tables and I have a table containing DNA genomic coordinates like this:
chrom pause strand coverage
1: 1 3025794 + 1
2: 1 3102057 + 2
3: 1 3102058 + 2
4: 1 3102078 + 1
5: 1 3108840 - 1
6: 1 3133041 + 1
I wrote a custom function that I want to apply to each row of my roughly 2-million-row table. It uses GenomicFeatures' mapToTranscripts to retrieve two related values, a string and a new coordinate, which I want to add to my table as two new columns, like this:
chrom pause strand coverage transcriptID CDS
1: 1 3025794 + 1 ENSMUST00000116652 196
2: 1 3102057 + 2 ENSMUST00000116652 35
3: 1 3102058 + 2 ENSMUST00000156816 888
4: 1 3102078 + 1 ENSMUST00000156816 883
5: 1 3108840 - 1 ENSMUST00000156816 882
6: 1 3133041 + 1 ENSMUST00000156816 880
The function is the following:
get_feature <- function(dt){
  coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand)
  hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE)
  tx_id <- tx_names[as.character(seqnames(hit))]
  cds_coordinate <- sapply(ranges(hit), '[[', 1)
  if(length(tx_id) == 0 || length(cds_coordinate) == 0) {
    out <- list('NaN', 0)
  } else {
    out <- list(tx_id, cds_coordinate)
  }
  return(out)
}
Then, I do:
counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]
And I get these warnings, indicating that the function is returning vectors shorter than the original table, instead of one new element per row:
Warning messages:
1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... :
Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... :
Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).
I assumed that using the .I operator would apply the function on a row basis and return one value per row. I also made sure the function was not returning empty values using the if statement.
Then I tried this mock version of the function:
get_feature <- function(dt) {
return('I should be returned once for each row')
}
And called it like this:
new.table <- counts[, get_feature(.SD), by = .I]
It makes a 1-row data table instead of one of the original length. So I concluded that my function, or maybe the way I'm calling it, is collapsing the elements of the resulting vector somehow. What am I doing wrong?
Update (with solution): As @StatLearner pointed out, it is explained in this answer (and in ?data.table) that .I is only intended for use in j (as in DT[i, j, by=]). Therefore by = .I is equivalent to by = NULL, and the proper syntax is by = 1:nrow(dt) in order to group by row number and apply the function row-wise.
Unfortunately, for my particular case this is utterly inefficient: I measured an execution time of 20 seconds per 100 rows, so for my 36-million-row dataset it would take about 3 months to complete.
In my case, I had to give up on the row-wise approach and instead ran mapToTranscripts on the entire table like this, which takes a couple of seconds and was obviously the intended use:
get_features <- function(dt){
  coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinates
  hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map them to transcripts
  tx_hit <- as.character(seqnames(hits)) # get transcript number
  tx_id <- tx_names[tx_hit] # get transcript name from translation table
  return(data.table('transcriptID' = tx_id,
                    'CDS_coordinate' = start(hits)))
}
density <- counts[, get_features(.SD)]
Then I mapped back to the genome using mapFromTranscripts from the GenomicFeatures package, so I could use a data.table join to retrieve information from the original table, which was the intended purpose of what I was trying to do.
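For reference, here is a sketch of an alternative way to attach the results back to the original rows without a second mapping step. This is my own suggestion, not what the poster did; it relies on the xHits metadata column that mapToTranscripts() adds to its result to index the query ranges:
hits <- mapToTranscripts(GRanges(counts$chrom, IRanges(counts$pause, width = 1), counts$strand),
                         cds_canonical, ignore.strand = FALSE)
idx <- mcols(hits)$xHits   # row numbers of `counts` that produced each hit
# Write the results into the matching rows by reference; unmatched rows stay NA,
# and positions mapping to several transcripts keep the last hit.
counts[idx, transcriptID := tx_names[as.character(seqnames(hits))]]
counts[idx, CDS := start(hits)]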
The way I do it when I need to apply a function for each row in a data.table is grouping it by row number:
counts[, get_feature(.SD), by = 1:nrow(counts)]
As explained in this answer, .I is not intended for use in by; it is meant for use in j, where it returns the sequence of row indices produced by grouping. The reason why by = .I doesn't throw an error is that data.table defines an object .I equal to NULL in its namespace, hence by = .I is equivalent to by = NULL.
Note that using by = 1:nrow(dt) groups by row number, so your function only ever sees a single row of the data.table:
require(data.table)
counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
                     pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
                     strand = sample(c('-','+'), size = 100, replace = TRUE),
                     coverage = sample.int(3, size = 100, replace = TRUE))
get_feature <- function(dt){
  coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
  rowNum <- nrow(coordinate)
  return(list(text = 'Number of rows in dt', rowNum = rowNum))
}
counts[, get_feature(.SD), by = 1:nrow(counts)]
will produce a data.table with the same number of rows as counts, but coordinate will contain just a single row of counts:
nrow text rowNum
1: 1 Number of rows in dt 1
2: 2 Number of rows in dt 1
3: 3 Number of rows in dt 1
4: 4 Number of rows in dt 1
5: 5 Number of rows in dt 1
while by = NULL will supply the entire data.table to the function:
counts[, get_feature(.SD), by = NULL]
text rowNum
1: Number of rows in dt 100
which is the intended way for by to work.