I have three data frames created from different n-gram counts (unigram, bigram, trigram). Each data frame contains the separated n-gram, its frequency count (n), and a probability that I added using smoothing.
I have written three functions that look through the tables and return the most probable word based on an input string, and I have bound them together in a wrapper function:
##Prediction Model
trigramwords <- function(FirstWord, SecondWord, n = 5 , allow.cartesian =TRUE) {
probword <- trigramtable[.(FirstWord, SecondWord), allow.cartesian = TRUE][order(-Prob)]
if(any(is.na(probword)))
return(bigramwords(SecondWord, n))
if(nrow(probword) > n)
return(probword[1:n, ThirdWord])
count <-nrow(probword)
bgramwords <- bigramwords(SecondWord, n)[1:(n - count)]
return(c(probword[, ThirdWord], bgramwords))
}
bigramwords <- function(FirstWord, n = 5 , allow.cartesian = TRUE){
probword <- bigramtable[FirstWord][order(-Prob)]
if(any(is.na(probword)))
return(Unigramword(n))
if (nrow(probword) > n)
return(probword[1:n, SecondWord])
count <- nrow(probword)
word1 <- Unigramword(n)[1:(n - count)]
return(c(probword[, SecondWord], word1))
}
##Back off Model
Unigramword <- function(n = 5, allow.cartesian = TRUE){
return(sample(UnigramTable[, FirstWord], size = n))
}
## Bind Functions
predictword <- function(str) {
require(quanteda)
tokens <- tokens(x = char_tolower(str))
tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
words <- trigramwords(tokens[1], tokens[2], 5)
chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
print(words[1])
}
However, I receive the following warning message, and the output is always the same word. If I use only the bigramwords function it works fine, but when I add the trigram function I get the warning. I believe this is because 1:n is not defined correctly.
Warning message:
In 1:n : numerical expression has 5718534 elements: only the first used
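One possible cause, offered as a guess since the tables themselves are not shown: the question says each table has a frequency column called n, and data.table evaluates the row index in probword[1:n, ThirdWord] with the table's columns in scope, so n may resolve to that column rather than to the function argument. The sketch below uses a made-up table (toy) and a hypothetical helper (top_words) to show one way of taking the top rows that does not depend on that lookup.

```r
library(data.table)

# Hypothetical toy table; the column named n mirrors the frequency
# column described in the question
toy <- data.table(
  ThirdWord = c("a", "b", "c", "d"),
  n         = c(40L, 30L, 20L, 10L),
  Prob      = c(0.4, 0.3, 0.2, 0.1)
)

# toy[1:n, ThirdWord] would read n as the column, not as a row count

top_words <- function(dt, k = 2) {
  # head() receives k as an ordinary argument, outside the table's scope,
  # so no column can shadow it
  head(dt[order(-Prob)], k)$ThirdWord
}

top_words(toy, 2)
```

Renaming either the argument or the column so the two no longer share a name would be another way to rule this out.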
I have a dataframe that has two columns, x and y (both populated with numbers). I am trying to look at a moving window within the data, and I've done it like this (source):
# Extract just x and y from the original data frame
df <- dat_fin %>% select(x, y)
# Moving window creation
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
This lapply creates a list of tibbles that are each 10 (x, y) pairs. At this point, I am trying to compute a single quantity using each of the sets of 10 pairs; my current (not working) code looks like this:
library(shotGroups)
for (f in 1:length(windfs)) {
tsceps[f] = getCEP(windfs[f], accuracy = TRUE)
}
When I run this, I get the error:
Error in getCEP.default(windfs, accuracy = TRUE) : xy must be numeric
My goal is that the variable that I've called tsceps should be a 1 x length(windfs) data frame, each value in which comes from the getCEP calculation for each of the windowed subsets.
I've tried various things with unnest and unlist, all of which were unsuccessful.
What am I missing?
Working code:
df <- dat_fin %>% select(x, y)
nr <- nrow(df)
windowSize <- 10
windfs <- lapply(seq_len(nr - windowSize + 1), function(i) df[i:(i + windowSize - 1), ])
tsceps <- vector(mode = "numeric", length = length(windfs))
library(shotGroups)
for (j in 1:length(windfs)) {
tsceps[j] <- getCEP(windfs[[j]], type = "CorrNormal", CEPlevel = 0.50, accuracy = TRUE)
}
ults <- unlist(tsceps)
ults_cep <- vector(mode = "numeric", length = length(ults))
for (k in 1:length(ults)) {
ults_cep[k] <- ults[[k]]
}
To get this working with multiple type arguments to getCEP, just use additional code blocks for each type required.
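The change that matters between the failing loop and the working one is windfs[f] versus windfs[[f]]. A minimal base R sketch of that difference, using a made-up list (windows) and a stand-in summary (the mean of x) instead of getCEP:

```r
# Hypothetical stand-in for the list of moving windows
windows <- list(
  data.frame(x = 1:3, y = 4:6),
  data.frame(x = 7:9, y = 10:12)
)

# Single brackets return a list of length one, which a function
# expecting numeric x/y data will not accept
single <- class(windows[1])

# Double brackets extract the data frame itself
double <- class(windows[[1]])

# vapply() visits each element directly and checks that every result
# is a single numeric value
window_means <- vapply(windows, function(w) mean(w$x), numeric(1))
```

Here single is "list" and double is "data.frame", which is consistent with the "xy must be numeric" error appearing only in the single-bracket version.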
I've been trying to randomly subsample my Seurat object.
I'm interested in subsampling based on 2 columns: condition and cell type. I have 5 conditions and 5 cell types. The main goal is to have 1000 cells for each cell type in each condition.
I've tried this so far:
First thing is subsetting my seurat object:
my.list <- list(hipo.c1.neurons = hipo %>%
subset(., condition %in% "c1" & group %in% "Neurons"),
hipo.c1.oligo = hipo %>%
subset(., condition %in% "c1" & group %in% "Oligod")...etc...)
And then subsample it using sample function:
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
x <- x[,sample(ncol(x), 1000, replace = FALSE)]
})
This fails because some objects have fewer than 1000 cells: error in evaluating the argument 'j' in selecting a method for function '[': cannot take a sample larger than the population when 'replace = FALSE'
Then I've tried with this function:
lapply_with_error <- function(X,FUN,...){
lapply(X, function(x, ...) tryCatch(FUN(x, ...),
error = function(e)NULL))
}
But then it gives me 0 for those objects that have fewer than 1000 cells. How can I skip the objects that have fewer than 1000 cells and leave them as they are (without sampling them)?
Is there a simpler way to do this, so I don't have to subset all of my objects separately?
I can't say for certain without seeing your data, but could you just add an if statement in the function? It looks like you're sampling column-wise, so check the number of columns. Just return x if the number of columns is less than the number you'd like to sample.
set.seed(0)
my.list.sampled <- lapply(X = my.list, FUN = function(x) {
if(ncol(x) > 1000){
x <- x[,sample(ncol(x), 1000, replace = FALSE)]
} else {
x
}
})
You could make it more flexible if you want to sample something other than 1000.
set.seed(0)
my.list.sampled <- lapply(X = my.list, B = 1000, FUN = function(x, B) {
if(ncol(x) > B){
x <- x[,sample(ncol(x), B, replace = FALSE)]
} else {
x
}
})
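On the question of avoiding a separate subset per combination, one option is to work on the cell names grouped by both columns and cap each group at the target size. This is only a sketch on a made-up metadata table (meta, with hypothetical columns cell, condition and group) and a small cap so the result is easy to check; it assumes the real object can afterwards be column-indexed by cell name in the same way the question indexes it by position.

```r
set.seed(0)

# Hypothetical metadata: one row per cell
meta <- data.frame(
  cell      = paste0("cell", 1:10),
  condition = c(rep("c1", 7), rep("c2", 3)),
  group     = c(rep("Neurons", 5), rep("Oligod", 2), rep("Neurons", 3))
)

cap <- 3  # stands in for the 1000 cells per group in the question

# Split cell names by every condition/cell-type combination that occurs
groups <- split(meta$cell, list(meta$condition, meta$group), drop = TRUE)

# Sample at most `cap` cells per group; smaller groups are kept whole
keep <- unlist(lapply(groups, function(cells) {
  cells[sample.int(length(cells), min(length(cells), cap))]
}), use.names = FALSE)
```

Using sample.int on the group length, rather than sample on the names directly, avoids surprises when a group contains a single cell.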
I have a data set, drawn as a sample without replacement, that looks like this:
The picture shows the frequency of each species, and there are 50 tables data.c[[k]] like this one.
I am now trying jackknife resampling (without replacement) to estimate coverage, with the code below:
data.c <- sapply(1:50, function(k)table(data[,k])) #freq
mdata <- sapply(1:50, function(k)sum(data.c[[k]]==1))
True_c <- 1- sum(np*(exp(lchoose(N-data.c[[k]], i))/exp(lchoose(N,i))))
The True_c line fails with the following error message:
Error in N - data.c : non-numeric argument to binary operator
I want True_c to subtract each species' frequency from N (the population size) and pass the result to lchoose. How can I adjust my code to do that?
My full code is below:
### without replacement
for (seed in c(99,100)){
set.seed(seed)
for (s in c(100,1000)){
sdata <- rlnorm(s,0,1)
p <- sdata/sum(sdata)
gn <- p*s*10
gn <- round(gn)
M <-replace(gn, gn==0,1) #or M=gn[gn==0]=1
N <- sum(M); N
np <- M/N #new prob
pop_index = rep(1:s, time=M)
for (i in c(100,500,1000,5000,N))
{
data=replicate(50, sample(pop_index, i,
replace = FALSE, prob = NULL))
data.c=sapply(1:50, function(k)table(data[,k])) #freq
mdata=sapply(1:50, function(k)sum(data.c[[k]]==1)) #each group, total freq=1
True_c <- 1- sum(np*(exp(lchoose(N-data.c, i))/exp(lchoose(N,i))))
c.hat <- (1-(1-(i/N))*(mdata/i)) #geo
bias=mean(c.hat)-True_c
var=var(c.hat)
cat("sample_size",i,"\n",
"True_C=",True_c,"\n",
"bias =",bias,"\n",
"variance=",var,"\n","\n")
}
}
}
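The error message is consistent with data.c being a list: when the 50 frequency tables have different lengths, sapply cannot simplify them into a matrix and returns a list, and N - data.c is then not defined. A minimal sketch of doing the subtraction one table at a time, with made-up values for N, i and the tables (counts); it does not address how np should be matched to the observed species, which depends on the intended estimator.

```r
N <- 20  # hypothetical population size
i <- 3   # hypothetical sample size

# Two frequency tables of different lengths, kept in a list
counts <- list(table(c(1, 1, 2, 3)), table(c(2, 2, 2)))

# N - counts would fail with "non-numeric argument to binary operator"
ratio <- lapply(counts, function(tab) {
  f <- as.numeric(tab)                    # frequencies as a plain numeric vector
  exp(lchoose(N - f, i) - lchoose(N, i))  # same value as exp(a) / exp(b)
})
```

Subtracting the two log terms before calling exp gives the same ratio as dividing the two exponentials, and it is less prone to overflow when N is large.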
I am trying to write code that automatically calculates the Wilcoxon test p-value for several comparisons.
Data used: two data sets with the same structure, representing two groups of participants who completed the same 5 tasks. Each table therefore contains 5 columns (tasks) and X rows of task scores.
data_17_18_G2 # first data set (in data.table format)
data_18_20_G2 # second data set (in data.table format)
Both data sets have identical column names, which are to be used in the Wilcoxon test as follows:
wilcox.test(Group1Task1, Group2Task1, paired = F)
wilcox.test(Group1Task2, Group2Task2, paired = F)
and so on.
The inputs (e.g., Group1Task1) are two vectors of task scores (the first one from data_17_18_G2 and the other from data_18_20_G2).
Desired output: a data table with a column of p-values.
The problem I faced is that, no matter how I manipulated the empty objects val1 and val2, in the second and third lines of the loop the right-hand side "as.numeric(unlist(data_17_18_G2[, ..i]))" gives the correct output (a numeric vector), but the left-hand side "val1[i]" always receives only one value from the vector. That gave me the idea that the main problem lies in how the empty vector is created, but I wasn't able to solve it.
Empty objects:
result <- data.table(matrix(ncol=2))
val1 <- as.numeric() # here I also tried functions "numeric" and "vector"
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
For loop
for (i in 1:5) {
val1[i] <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2[i] <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
Output:
Error in `[<-.data.table`(`*tmp*`, i, 2, value = NULL) :
When deleting columns, i should not be provided
1: In val1[i] <- as.numeric(unlist(data_17_18_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
2: In val2[i] <- as.numeric(unlist(data_18_20_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
3: In res[i] <- wilcox.test(val1[i], val2[i], paired = F) :
number of items to replace is not a multiple of replacement length
Alternative:
I changed the second and the third lines
for (i in 1:5) {
val1[i] <- as.numeric(data_17_18_G2[ , ..i])
val2[i] <- as.numeric(data_18_20_G2[ , ..i])
res[i] <- wilcox.test(val1[i], val2[i], paired = F)
result[i, 1] <- i
result[i, 2] <- res$p.value
}
And got this
Error in as.numeric(data_17_18_G2[, ..i]) :
(list) object cannot be coerced to type 'double'
which means that the function wilcox.test cannot interpret this type of input.
How can I improve the code so that I get a data table of p-values?
There appear to be some bugs in the code. I have rewritten it using the cars dataset as an example.
## Use the cars dataset as an example (replace with the appropriate data)
library(data.table)
data(cars)
data_17_18_G2 <- as.data.table(cars)
data_18_20_G2 <- data_17_18_G2[,2:1]
## Fixed code
result <- data.table(matrix(as.numeric(), nrow=ncol(data_17_18_G2), ncol=2))
val1 <- as.numeric()
val2 <- as.numeric()
res <- vector(mode = "list", length = ncol(data_17_18_G2))
for (i in seq_len(ncol(data_17_18_G2))) {
val1 <- as.numeric(unlist(data_17_18_G2[ , ..i]))
val2 <- as.numeric(unlist(data_18_20_G2[ , ..i]))
res[[i]] <- wilcox.test(val1, val2, paired = F)
result[i, 1] <- as.numeric(i)
result[i, 2] <- as.numeric(res[[i]]$p.value)
}
Hope this gives you the output you are after.
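As an alternative to filling objects inside a loop, the column pairs can be walked in parallel with mapply, which avoids the val1[i] assignments altogether. This is a sketch on made-up data (group1 and group2 are hypothetical stand-ins for the two tables); it relies only on the fact that a data frame, like a data.table, is a list of columns.

```r
set.seed(1)

# Hypothetical stand-ins for the two tables, with identical column names
group1 <- data.frame(task1 = rnorm(20), task2 = rnorm(20))
group2 <- data.frame(task1 = rnorm(20, mean = 1), task2 = rnorm(20))

# mapply() pairs column k of group1 with column k of group2
p_values <- mapply(
  function(a, b) wilcox.test(a, b, paired = FALSE)$p.value,
  group1, group2
)

# One row per task, one column of p-values
result <- data.frame(task = names(p_values), p_value = unname(p_values))
```

The pairing is by position, so it assumes both tables list the tasks in the same column order.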
I'm trying the same code as in https://thiloshon.wordpress.com/2018/03/11/build-your-own-word-sentence-prediction-application-part-02/ to do word-level prediction. The input textual data is also in the mentioned link and I use en_US.news.txt file as my only input file.
library(quanteda)
library(data.table)
#read the .txt file
df=readLines('en_US.news.txt')
#take a sample of the df
sampleHolderNews <- sample(length(df), length(df) * 0.1)
US_News_Sample <- df[sampleHolderNews]
#build the corpus of the data
corp <- corpus(US_News_Sample)
#Preprocessing
master_Tokens <- tokens(x = tolower(corp),remove_punct =
TRUE,remove_numbers = TRUE,remove_hyphens = TRUE,remove_symbols = TRUE)
stemed_words <- tokens_wordstem(master_Tokens, language = "english")
#tokenization#
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)
uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)
# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)
bi_words <- data.table(
word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
count = sums_B)
tri_words <- data.table(
word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
count = sums_T)
#indexing#
setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
######## Finding Bi-Gram Probability #################
discount_value <- 0.75
# Finding number of bi-gram words
numOfBiGrams <- nrow(bi_words[.(word_1, word_2)])
# Dividing number of times word 2 occurs as second part of bigram, by total number of bigrams.
# Finding probability for a word given the number of times it was second word of a bigram
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)
# Assigning the probabilities as second word of bigram, to unigrams
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]
# Finding number of times word 1 occurred as word 1 of bi-grams
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)
# Assigning total times word 1 occured to bigram cn1
bi_words[, Cn1 := uni_words[word_1, count]]
# Kneser-Ney algorithm
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 *
n1wi[word_1, N] * uni_words[word_2, Prob])]
######## End of Finding Bi-Gram Probability #################
######## Finding Tri-Gram Probability #################
# Finding count of word1-word2 combination in bigram
tri_words[, Cn2 := bi_words[.(word_1, word_2), .N]]
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)
# Kneser-Ney algorithm
tri_words[, Prob := ((count - discount_value) / Cn2 + discount_value / Cn2 *
n1w12[.(word_1, word_2), .N] * bi_words[.(word_1, word_2), Prob])]
Here I get the following error from the Kneser-Ney step for trigrams:
Error in `[.data.table`(tri_words, , `:=`(Prob, ((count - discount_value)/Cn2 + :
Supplied 13867 items to be assigned to 3932 items of column 'Prob'. If you wish to 'recycle'
the RHS please use rep() to make this intent clear to readers of your code.
In addition: Warning messages:
1: In discount_value/Cn2 * n1w12[list(word_1, word_2), .N] * bi_words[list(word_1, :
longer object length is not a multiple of shorter object length
2: In (count - discount_value)/Cn2 + discount_value/Cn2 * n1w12[list(word_1, :
longer object length is not a multiple of shorter object length
I found some similar questions about this data.table error, but I can't work out how to fix it in this code.
The problem is in your attempt to multiply the quantities in the last line. This expression:
(count - discount_value) / Cn2 + discount_value / Cn2
is length 20, like tri_words. But the next expression
n1w12[.(word_1, word_2), .N]
is length 19. Then the last part,
bi_words[.(word_1, word_2), Prob]
is length 155 (and contains a lot of NAs).
The error messages are saying that the shorter item cannot be recycled into the longer item because the longer item's length is not a multiple of the length of the shorter item. To fix this, every lookup has to return exactly one value per row of tri_words, in the same row order, before the terms are combined.
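One way to get a lookup that is guaranteed to have one value per trigram row is to match on the key columns explicitly. The sketch below uses base R on two made-up tables (tri and bi, hypothetical stand-ins for tri_words and bi_words) and the same (word_1, word_2) key as the question, purely to illustrate the alignment; it is not a full Kneser-Ney implementation.

```r
# Hypothetical miniature versions of the trigram and bigram tables
tri <- data.frame(
  word_1 = c("a", "a", "b"),
  word_2 = c("b", "b", "c"),
  word_3 = c("c", "d", "a"),
  stringsAsFactors = FALSE
)
bi <- data.frame(
  word_1 = c("a", "b", "c"),
  word_2 = c("b", "c", "a"),
  Prob   = c(0.5, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# For each trigram row, find the position of its (word_1, word_2) pair in bi;
# match() returns exactly one index per trigram row (NA if the pair is missing)
idx <- match(paste(tri$word_1, tri$word_2), paste(bi$word_1, bi$word_2))

# The looked-up column therefore has the same length and order as tri
tri$bi_prob <- bi$Prob[idx]
```

Because the result always has nrow(tri) elements, it can be combined with the other per-row terms without any recycling. A keyed data.table join that adds the column by reference would serve the same purpose.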