How to calculate similarities between test and train documents - vector

I'm trying to calculate similarities between test and train documents and label them. Here is the code, but it doesn't work. I'd also appreciate it if somebody could explain the idea.
def calculate_similarities(self, vecTestDoc, vectorsOfTrainDocs):
    list_of_similarities = []
    for vector in vectorsOfTrainDocs:
        label = vectorsOfTrainDocs.key()
        list_of_similarities += [(self.calculate_similarities(vector, vecTestDoc), label)]
    return list_of_similarities
Here's the error:
File "..\classification.py", line 98, in calculate_similarities
label = vectorsOfTrainDocs.key()
AttributeError: 'list' object has no attribute 'key'
Edit: I've defined two more functions and have been working on a different solution. Here they are:
def cosine_similarity(self, weightedA, weightedB):
    dotAB = dot(weightedA, weightedB)
    normA = math.sqrt(dot(weightedA, weightedA))
    normB = math.sqrt(dot(weightedB, weightedB))
    return dotAB / (normA * normB)

def fit(self, doc_collection):
    self.doc_collection = doc_collection
    self.vectorsOfDoc_collection = [(doc, self.doc_collection.tfidf(doc.token_counts))
                                    for doc in self.doc_collection.docid_to_doc.values()]
I believe something like this would work but there are still error messages... What should I change?
return [self.doc_collection.cosine_similarity(vecTestDoc) in vectorsOfTrainDocs]
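For what it's worth, here is a minimal sketch of one way the lookup could work, assuming the training vectors are kept in a dict that maps each label to its tf-idf vector (the dict layout is an assumption; cosine_similarity is the helper defined in the edit above):

def calculate_similarities(self, vecTestDoc, vectorsOfTrainDocs):
    # Assumption: vectorsOfTrainDocs is a dict of {label: train_vector}.
    list_of_similarities = []
    for label, vector in vectorsOfTrainDocs.items():
        # Compare each training vector with the test vector.
        similarity = self.cosine_similarity(vector, vecTestDoc)
        list_of_similarities.append((similarity, label))
    return list_of_similarities

The idea is simply to score every training document against the test document with cosine similarity and keep (similarity, label) pairs, so the test document can then be labelled from its most similar training documents. The original AttributeError happens because .key() is called on a list; a plain list carries no labels, so they have to come either from a dict like the one assumed above or from (label, vector) tuples stored in the list.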

Related

Using the try and tryCatch functions on .WAV files

I'm trying to cut a bunch of audio (.WAV) files into smaller samples in R. For this example, I'm using a loop to cut out 1-minute samples at 140 minutes.
For some files, the recording ends before 140 minutes due to an error in the recording device. When this occurs, an error appears and the loop stops. I'm trying to make the loop continue by using the try or tryCatch function; however, I keep getting errors.
The code is as follows:
for(i in 1:length(AR_CD288)){
  CUT_AR288_5 <- try({readWave(AR_CD288[i], from = 140, to = 141, units = "minutes")})
  FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
  OUT.PATH_AR288_5 <- file.path("New files", basename(FILE.OUT_AR288_5))
  writeWave(CUT_AR288_5, extensible=FALSE, filename = OUT.PATH_AR288_5[i])
}
I get the following two errors from the code:
Error in readBin(con, int, n = N, size = bytes, signed = (bytes != 1), :
invalid 'n' argument
Error in writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i]) :
'object' needs to be of class 'Wave' or 'WaveMC'
The loop still saves some samples into the "New files" directory; however, once the loop reaches a file shorter than 140 minutes, it stops.
I am very stuck! Any help would be greatly appreciated.
Cheers.
When I use try, I always do one (or both) of:
check the return value to see if it inherits "try-error", indicating that the command failed; or
add try(., silent = TRUE), indicating that I don't care if it succeeded (but this implies that I will not use its return value, either).
Try this:
for (i in seq_along(AR_CD288)) {
  CUT_AR288_5 <- try({
    readWave(AR_CD288[i], from = 140, to = 141, units = "minutes")
  }, silent = TRUE)
  if (!inherits(CUT_AR288_5, "try-error")) {
    FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
    OUT.PATH_AR288_5 <- file.path("New files", basename(FILE.OUT_AR288_5))
    writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i])
  }
}
Three notes:
I changed 1:length(.) to seq_along(.); the latter is more resilient in automated use when it is feasible that the vector might have length 0. For example, if AR_CD288 can ever be length 0, intuitively we expect 1:length(AR_CD288) to return nothing, so that the for loop will not run; unfortunately, it resolves to 1:0, which returns a vector of length 2, and the loop body will then usually fail (depending on what code is operating in the loop). seq_along(.) will always return a length-0 vector for an empty input, which is what we need. (Alternatively and equivalently, seq_len(length(AR_CD288)), though that is really what seq_along is intended to do.)
If you do not add silent=TRUE (or explicitly add silent=FALSE), then you will get an error message indicating that the command failed. Unfortunately, the error message may not indicate which i failed, so you may be left in the dark as far as fixing or removing the errant file. You may prefer to add an else to the if (inherits(.,"try-error")) clause so that you can provide a clearer error, such as
if (inherits(CUT_AR288_5, "try-error")) {
  warning("'readWave' failed on ", sQuote(AR_CD288[i]), call. = FALSE)
} else {
  FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
  # ...
}
(noting that I put the "things worked" code in the else clause here ... I find it odd to do if (!...) {} else {}, seems like a double-negation :-).
The choice to wrap one function or the whole block depends on your needs: I tend to prefer to know exactly where things fail, so the will-possibly-fail functions are often individually wrapped with try so that I can react (or log/message) accordingly. If you don't need that resolution of error-detection, then you can certainly wrap the whole code-block in a sense:
for (i in seq_along(AR_CD288)) {
  ret <- try({
    CUT_AR288_5 <- readWave(AR_CD288[i], from = 140, to = 141, units = "minutes")
    FILE.OUT_AR288_5 <- sub("\\.wav$", "_140.wav", AR_CD288)
    OUT.PATH_AR288_5 <- file.path("New files", basename(FILE.OUT_AR288_5))
    writeWave(CUT_AR288_5, extensible = FALSE, filename = OUT.PATH_AR288_5[i])
  }, silent = TRUE)
  if (inherits(ret, "try-error")) {
    # do or log something
  }
}

How to implement self and __init__() in julia

I would like to know what is the correct approach to implement self and __init__() in Julia?
Example
class rectangle:
    def __init__(self, length, breadth, height):
        self.length = length
        self.breadth = breadth
        self.height = height

    def get_area(self):
        return self.length * self.breadth

r = rectangle(160, 20, 1000)
print("area is", r.get_area())
I have tried this in Julia, but it neither behaves as expected nor gives the right results.
struct rectangle
    length
    breadth
    height
end

function __init__(rectangle)
    rectangle.length = length
    rectangle.breadth = breadth
    rectangle.height = height
end

function get_area(rectangle)
    return rectangle.length * rectangle.breadth
end

data_obj = __init__()
r = get_area(data_obj)
end
Please suggest an appropriate approach to achieve the Python example in Julia.
Thanks in advance!!
A bold move to just literally translate from Python. It doesn't work that way, obviously.
However, the following should be enough:
struct Rectangle{T}
    length::T
    breadth::T
    height::T
end

area(rectangle) = rectangle.length * rectangle.breadth

r = Rectangle(160, 20, 1000)
println(area(r))
(The type parameter is not something you asked for, but recommended.)
Now, if you need to do something more than simply assign the fields, you can write an outer constructor:
function Rectangle(l, b, h)
    ...
    return Rectangle(l, b, h)
end
But there's no need for this unless some actual logic is required.

Gradient flow stopped on a combined model

I have a problem where the gradient does not backpropagate through a combined network. I checked lots of answers but cannot find a relevant solution to this problem. I would really appreciate any help with this.
I wanted to calculate the gradient for input data in this code:
for i, (input, target, impath) in tqdm(enumerate(data_loader)):
    # print('input.shape:', input.shape)
    input = Variable(input.cuda(), requires_grad=True)
    output = model(input)
    loss = criterion(output, target.cuda())
    loss = Variable(loss, requires_grad=True)
    loss.backward()
    print('input:', input.grad.data)
but I got this error:
    print('input:', input.grad.data)
AttributeError: 'NoneType' object has no attribute 'data'
and my model is a combined model whose parameters I loaded from two pretrained models.
I checked the requires_grad attribute of the model weights; it is True. However, the gradient of the model weights is None.
Is loading the state dicts what blocks the gradient?
How can I deal with this problem?
The model structure is attached below:
class resnet_model(nn.Module):
    def __init__(self, opt):
        super(resnet_model, self).__init__()

        resnet = models.resnet101()
        num_ftrs = resnet.fc.in_features
        resnet.fc = nn.Linear(num_ftrs, 1000)

        if opt.resnet_path != None:
            state_dict = torch.load(opt.resnet_path)
            resnet.load_state_dict(state_dict)
            print("resnet load state dict from {}".format(opt.resnet_path))

        self.model1 = torch.nn.Sequential()
        for chd in resnet.named_children():
            if chd[0] != 'fc':
                self.model1.add_module(chd[0], chd[1])

        self.model2 = torch.nn.Sequential()
        self.classifier = LINEAR_LOGSOFTMAX(input_dim=2048, nclass=200)
        if opt.pretrained != None:
            self.classifier_state_dict = torch.load('../checkpoint/{}_cls.pth'.format(opt.pretrained))
            print("classifier load state dict from ../checkpoint/{}_cls.pth".format(opt.pretrained))
            self.classifier.load_state_dict(self.classifier_state_dict)
        for chd in self.classifier.named_children():
            self.model2.add_module(chd[0], chd[1])

    def forward(self, x):
        x = self.model1(x)
        x = x.view(-1, 2048)
        x = self.model2(x)
        return x
The problem is solved with this comment:
Why do you have this line: loss = Variable(loss, requires_grad=True) ?
Variable should not be used anymore.
So the line above should be deleted, and to mark a Tensor for which you want gradients, you can use:
input = input.cuda().requires_grad_()
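Putting the comment into practice, the loop might look like the sketch below (a minimal adaptation of the question's loop, not the poster's exact code: the loss = Variable(...) line is removed and requires_grad_() replaces the Variable wrapper on the input):

for i, (input, target, impath) in tqdm(enumerate(data_loader)):
    # Mark the input tensor so autograd tracks gradients with respect to it.
    input = input.cuda().requires_grad_()
    output = model(input)
    loss = criterion(output, target.cuda())
    # Do not re-wrap loss: wrapping it in a new Variable creates a fresh leaf
    # tensor detached from the graph, so nothing upstream receives gradients.
    loss.backward()
    print('input grad:', input.grad)

Re-wrapping loss was exactly what cut the computation graph, which is why input.grad stayed None.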

How do I loop through nextToken

I just kind of figured out how to use nextToken in boto3. The API call I am making should return about 300 entries, but I only get 100. I know I need to loop through the next token, but I am struggling with how to do that. I am new to the Python army.
def myservers():
    response = client.get_servers(maxResults=100,)
    additional = client.get_servers(nextToken=response['nextToken'])
This little snippet gives me the first 50 items plus the page from the first 'nextToken', for a total of 100 items. Clearly I need to iterate over and over to get the rest; I am expecting 300-plus items.
I would just do a simple while loop:
response = client.get_servers()
results = response["serverList"]
while "NextToken" in response:
    response = client.get_servers(NextToken=response["NextToken"])
    results.extend(response["serverList"])
I used the suggestion here:
https://github.com/boto/botocore/issues/959#issuecomment-429116381
You have to keep calling client.get_servers() passing in the NextToken.
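In code, that repeated call might look like the sketch below. It follows the key names from the question's own snippet (maxResults, nextToken, serverList); the exact casing of these keys depends on the service, and get_all_servers is just a hypothetical helper name:

def get_all_servers(client):
    servers = []
    kwargs = {"maxResults": 100}
    while True:
        response = client.get_servers(**kwargs)
        servers.extend(response.get("serverList", []))
        token = response.get("nextToken")
        if not token:
            # No token in the response means this was the last page.
            break
        kwargs["nextToken"] = token
    return servers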
Derived from the answer @Luke mentioned, here is a simple helper to call any method that uses nextToken:
def get_all_boto_records(method, args_dict, response_to_list):
    all_records = []
    extraArgs = {**args_dict}
    while True:
        resp = method(**extraArgs)
        all_records += response_to_list(resp)
        if "nextToken" in resp:
            extraArgs["nextToken"] = resp["nextToken"]
        else:
            break
    return all_records
Here is how to use it:
all_log_records = get_all_boto_records(
    client.filter_log_events,
    {
        "logGroupName": "",
        "filterPattern": "",
    },
    lambda x: x["events"],
)
Here is another example, using a different method, to get the log groups:
all_log_groups = get_all_boto_records(
    client.describe_log_groups,
    {"logGroupNamePrefix": f"/aws/lambda/{q}"},
    lambda x: x["logGroups"],
)
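As an aside, many boto3 clients also expose built-in paginators that handle the token loop for you; when a paginator exists for the call you are making, it is usually less code. A sketch for the CloudWatch Logs example above (the log group prefix here is illustrative):

import boto3

logs = boto3.client("logs")
paginator = logs.get_paginator("describe_log_groups")

all_log_groups = []
for page in paginator.paginate(logGroupNamePrefix="/aws/lambda/"):
    # Each page is a full response dict; collect the records of interest.
    all_log_groups.extend(page["logGroups"])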

How to use S4 object programming in R

What's wrong with my R script? I'm trying to use a vector of user-defined objects (here a vector of "Page" objects) within another user-defined object (here a "Book" object)
setClass("Page",
slots = c(PageNo = "numeric", #scalar
Contents = "character") #vector of strings
)
setClass("Book",
slots = c(Pages = "vector", # Something wrong here? vector of pages ? "Page" or vector" or "list"
Title = "character") #vector of strings
)
setGeneric(name="AddPage", def=function(aBook, pageNo){standardGeneric("AddPage")})
setMethod(f="AddPage", signature="Book",
definition=function(aBook, pageNo)
{
page1 = new("Page")
page1#PageNo = pageNo
aBook#Pages = c(aBook#Pages, page1) # Something wrong here?
}
)
book1 = new("Book")
book1#Title = "Sample Book"
book1
book1#Pages
AddPage(book1, 1)
AddPage(book1, 2)
book1#Pages
Remember that R does not use reference semantics, so AddPage(book1, 1) creates a copy of book1, and updates that. In the method you don't return the updated object, and book1 remains unchanged.
Update the method so that it returns the modified object
setMethod(f="AddPage", signature="Book",
definition=function(aBook, pageNo)
{
page1 = new("Page")
page1#PageNo = pageNo
aBook#Pages = c(aBook#Pages, page1) # Something wrong here?
aBook
}
)
and assign the return value to the old variable
book1 = AddPage(book1, 1)
But this is a very inefficient approach: the line aBook@Pages = c(aBook@Pages, page1) makes a copy of all existing pages (on the right-hand side, to create a longer vector; this will scale with the square of the number of Pages added to the book) and then copies the entire Book (for the assignment). In addition, creating individual objects is expensive and does not exploit R's 'vectorization'.
A first step is to think of the object 'Page' as instead 'Pages', where the object models the columns rather than the rows of a data frame. 'Book' then doesn't hold a vector of Page objects, but a single Pages object. This also implies a different approach to creating your 'book'.
