In my Jupyter notebook I have the following Markdown code block:
````python
[[0 1 0 0 0 0]
[0 1 1 0 1 1] - [0 0 1 0 0 0]
[0 0 0 0 1 0]
[0 0 0 0 0 1]]
````
which, rather unexpectedly, renders incorrectly both on my local machine and on Nbviewer. In other words, the whitespace in front of the first vector [[0 1 0 0 0 0] seems to be ignored. I have used single spaces to create this whitespace, not tabs.
Any idea on how to resolve this problem? Thank you.
I want to create a random matrix of zeros and ones in which zeros are more frequent than ones. So I guess I need something like a weighted Bernoulli distribution that chooses between 0 and 1 for each entry, with a higher probability for 0. I would prefer not to be limited to n×n matrices. I can try a rather non-standard approach like this:
julia> let mat = Matrix{Int64}(undef, 3, 5)
zero_or_one(shift) = rand()+shift>0.5 ? 0 : 1
foreach(x->mat[x]=zero_or_one(0.3), eachindex(mat))
end
julia> mat
3×5 Matrix{Int64}:
1 1 1 0 1
0 1 1 1 1
0 1 1 0 1
Note that this doesn't do the job: as you can see, I get more ones than zeros in the result.
Is there a more efficient, or at least more creative, way? Or a module that does this?
Update:
It seems the result of this code never changes, whether or not I change the value of shift 😑.
How about using SparseArrays?
julia> using SparseArrays
julia> sprand(Bool, 1_000_000,1_000_000, 1e-9)
1000000×1000000 SparseMatrixCSC{Bool, Int64} with 969 stored entries:
⠀⠀⠁⠀⠢⠀⠂⠆⡄⠀⠀⠀⡈⠀⠐⠀⠁⠐⠂⠀⠀⢀⠤⠀⠀⠀⠄⠐⢀⠘⠈⠀⢂⠐⠀⠀⠆⠀⠠⠀⠀⠀⠀⢀⠈⠁⠀⠑⠀⢀⠐⠀
⠄⠀⡀⠀⠒⠠⠨⢀⣀⠀⠀⠀⠐⠤⠈⠀⠀⠀⠀⠀⠁⠁⠄⠐⠑⠅⢄⠠⠐⠀⠀⠀⠁⢀⠋⠂⠂⠂⠀⠀⠀⠀⠀⠀⠈⠄⠀⠀⠀⠄⠈⠀
⠠⠄⠀⢀⠀⢁⠐⠀⠁⠂⢂⠂⠀⠀⠠⠀⠀⠀⠁⠀⠈⠀⠀⠂⠀⠀⢀⠂⠀⠈⠀⠀⠀⠠⠀⠂⠄⠀⠄⠀⢀⠀⠀⠉⠀⠠⠤⠀⠒⡐⠀⠂
⢀⠂⠁⠀⠐⠀⠀⠀⠄⠀⢀⡘⠁⠂⠁⠀⠂⢀⠂⠅⡀⠀⠈⠡⠈⠉⢀⠩⠉⠄⡀⠀⠀⠐⠀⡀⡄⠈⠀⢀⠀⠂⠌⠀⠀⠂⠀⠀⠁⠀⠀⠀
⠂⠠⠀⡀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠐⠀⠀⠂⡁⠁⠉⠈⠀⠀⠁⠀⠄⢀⠤⠀⠀⠁⡂⠠⠀⠄⠀⠀⡀⠀⢀⠥⠀⢉⠀⠀⠄⠁⠀
⠈⠀⠀⠀⠀⠅⠀⠀⠈⠀⠄⡄⠀⠀⠀⠠⠄⠈⠠⠀⠀⠐⠂⠀⠀⠀⠀⠆⠠⠀⠀⠀⠐⠀⠐⠀⠀⠀⠀⠀⠀⡀⠌⢠⠀⠀⠂⠐⠈⠀⠀⠐
⠀⡀⠁⠈⡀⠀⢀⠁⠈⠠⡈⠁⢄⠈⠀⠀⠀⢁⠐⣀⠂⠄⠄⢀⠠⠀⠐⠀⠡⠠⠄⠈⢄⢈⠂⠈⠆⠀⠁⠀⠀⠀⠃⠀⠀⠠⠀⠐⠀⠐⠘⡀
⠀⠂⠁⠰⠁⠀⠀⠀⠀⠄⠀⣐⠀⡄⠤⡀⠀⠄⠀⠐⠀⠉⠁⠀⠈⢀⣐⠠⠀⠀⠂⠀⠀⠀⠠⠂⠐⠁⠀⠀⠀⠐⡈⠀⠐⡀⠠⡁⡀⠠⡀⠈
⠈⠀⠀⠠⠀⠀⠀⠁⠠⠐⠀⠐⠄⡀⠠⠀⠀⠀⠐⡀⠀⠀⠄⠀⠀⢒⠈⠊⠀⢢⡠⠀⠀⠀⡈⠀⠀⠀⠀⠈⠉⠃⠀⡀⡉⠀⢁⠔⠀⠀⠂⠀
⠀⠀⠀⡐⠠⢀⠀⡐⠀⠈⢀⠀⠀⠀⠐⠪⠀⠂⡄⠐⠀⢀⠀⠈⠀⠀⠰⠀⠀⠀⠈⠀⠀⠠⠀⠀⠐⠀⠀⠠⠀⠀⡀⠄⠈⢂⠂⠌⠀⠀⠐⠀
⢀⠜⢈⠀⠤⠂⢄⠀⠘⠀⠀⠀⠈⠀⠀⢀⠄⠀⠠⠀⠠⠀⠀⠀⠁⠐⠁⠀⠀⠈⠁⠀⠀⢀⠀⢄⠀⠄⠀⠀⠀⠀⠀⡀⢄⠀⠅⠀⠀⠀⠀⠀
⠠⠦⠀⡐⠈⠐⠀⡄⠀⠄⠀⠀⠀⠀⡐⠀⠀⠌⠀⠨⠀⠀⠩⢀⠁⠀⠈⠐⠐⠀⠀⠀⠀⡐⠈⠀⠁⠘⠀⢀⠀⠀⠈⠀⠈⠀⠀⠐⠀⠐⠀⠈
⠀⠀⠀⢄⠤⠀⡀⠀⠀⠬⠀⠀⠂⡡⠀⠌⠠⠠⠀⠀⢀⢔⠀⠀⠀⠀⢀⠄⠀⡈⠀⠀⠈⠄⡀⠐⠀⠠⠀⠀⠠⠂⠠⠑⠀⠀⡄⢀⠁⠀⠀⢁
⠀⡀⠀⠀⠄⠀⠀⡀⠀⠀⠀⠄⠀⠂⠀⠁⠀⠀⠁⡠⠀⠀⠡⠀⠂⠂⠄⠀⣀⠄⠊⢀⠁⠀⠄⠀⠀⢀⠀⠄⠀⠁⡀⠈⠁⠀⠀⠀⢂⠀⠈⠂
⠀⠀⠀⢀⠀⠀⠀⠀⠀⠀⠠⡠⢐⠀⠀⠁⠀⠂⠀⠐⠀⠒⠈⡀⡂⢀⠀⠀⠀⠡⠌⠀⠀⢀⠄⠀⢐⠀⠀⢀⠠⠀⠀⠂⠀⠀⠈⠄⠠⡠⠀⡀
⢀⠲⠀⠀⠈⠀⠀⠂⠀⠀⠀⠀⠀⣀⠨⠁⢀⠀⠀⠀⠀⠀⠰⠀⠀⢠⠀⠁⠀⢀⢀⢀⠀⡡⠀⠈⠁⠀⠁⠠⠀⡀⠀⡀⠀⠐⠀⠐⠁⡀⠂⠈
⢀⠄⠀⠀⠀⠀⠡⠀⠀⠀⠀⠀⠀⠀⢀⠀⣂⠀⠀⠀⠂⠀⠀⠀⠀⠀⠁⠀⢀⠐⠀⠀⠐⠋⠀⠀⠀⢢⠠⠀⠂⠐⠄⢈⠠⠤⠀⡀⠀⠀⠀⠀
⠀⠠⠀⠄⢀⠄⠀⠑⠀⠀⠀⠄⠀⡠⠁⡀⢔⠠⢐⠀⢀⠀⠢⠀⠀⠈⠐⠀⠀⠀⠄⠂⠀⠀⠀⠀⠀⠀⡄⠀⡈⠀⠀⠀⡀⠀⠊⡀⠀⢠⠀⠀
⠀⠀⠒⠀⡀⢐⠄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠁⠄⠀⠀⠀⠀⠀⡄⢀⡀⠀⠀⠀⠀⢀⢀⢀⠁⠁⠀⠁⠔⠀⠀⠀⠂⠀⠒⠀⢀⢈⢀⠀⠀
⠈⠀⠀⡂⠀⠁⢐⡀⠀⠀⠂⠀⠀⡂⠄⠊⠀⠀⠄⢀⠈⠈⠁⠀⠀⠈⠒⠀⠠⠑⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠄⠆⢄⠀⠀⠂⠂⠀⡀⠀⠀
⠀⠠⠄⠀⠀⠠⡀⠠⠀⠠⠀⠐⠀⠀⡌⠨⢀⠀⠀⠁⠀⠂⠀⡀⠄⠴⠀⢠⠄⠄⠄⡀⠀⠀⠂⢠⠀⠀⠀⠜⠐⠀⠀⠁⢠⠀⠄⠐⠁⠂⠀⠁
⠈⠀⠀⠀⠈⠐⠂⠈⠆⢈⠐⡀⠈⢀⠀⠐⠀⠰⠂⠀⠀⠀⠀⠀⠀⠠⠀⡂⠨⠀⠈⡀⠁⠀⠤⠈⠐⠂⠀⠀⡀⠀⠀⠀⠀⠢⠀⠠⠀⠀⠁⠀
⠠⠈⠀⠈⠠⡀⠀⠠⠀⠠⠀⠀⠐⢄⠜⠀⠈⠀⠄⡁⠀⠠⠀⠀⠀⠁⠀⠡⡀⠈⠐⠀⠂⠀⠀⠀⠀⠐⠐⡈⢀⡠⡂⠀⠀⠐⠀⠄⠀⠀⠀⠁
⢀⠁⠀⢠⠂⢁⠄⡅⠀⠠⠀⠄⠀⠠⠀⠈⡀⠈⠂⠀⠨⠈⢀⠀⢀⡈⠀⠈⡈⠂⠀⠈⠀⠀⡀⠀⠀⠀⠀⠀⠀⠊⠄⠠⠀⠀⠄⠊⠀⠈⠄⠀
⠂⠀⠀⠀⠌⠁⢀⠀⠐⠀⠀⠈⠀⠀⠁⠀⢀⠁⠪⠠⠀⠀⢐⠀⠀⠄⠀⠂⢀⡀⢐⠁⠀⣀⠒⠀⢀⢀⠀⠠⢂⠀⠀⠠⠀⠄⠐⠄⠁⠀⠀⠀
⠐⠀⠠⠀⠀⡀⠀⠄⠄⠐⠀⠁⠀⠀⠀⠀⠄⠄⠀⠀⢀⠂⠀⠰⠀⠀⠊⠀⢀⠀⠤⠀⠀⠀⠉⠀⠀⢀⠀⠁⠁⠀⠈⠁⠀⡠⡀⠐⠐⠀⠀⠀
I'd choose the Bernoulli distribution for this. Specify a success rate p; each entry then takes the value 1 with probability p and 0 with probability 1-p.
using Distributions
mat = rand(Bernoulli(0.1), 3, 4)
3×4 Matrix{Bool}:
1 0 0 0
0 0 0 0
0 0 0 0
As for your code, you chose rand()+shift>0.5 ? 0 : 1, that means if you write zero_or_one(0.3) it will give ones with probability 0.2 and zeros with probability 0.8, etc.
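To check that arithmetic, here is a small simulation. It is written in Python rather than Julia, purely as an illustrative sketch: `zero_or_one` mirrors the rule from the question, while `ones_fraction`, the sample size and the fixed seed are assumptions added for the example.

```python
import random

def zero_or_one(shift, rng):
    # Same rule as in the question: 0 when rand() + shift > 0.5, otherwise 1.
    return 0 if rng.random() + shift > 0.5 else 1

def ones_fraction(shift, n=100_000, seed=0):
    # Empirical share of ones over n draws (seed fixed only for repeatability).
    rng = random.Random(seed)
    return sum(zero_or_one(shift, rng) for _ in range(n)) / n

# For shift between -0.5 and 0.5 the rule yields a one with probability 0.5 - shift.
print(ones_fraction(0.3))  # close to 0.2
print(ones_fraction(0.0))  # close to 0.5
```

So a larger shift means fewer ones, which is the behaviour the question asked for.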
If you are OK with a BitMatrix:
julia> onesandzeros(shape...; threshold=0.5) = rand(shape...) .< threshold
onesandzeros (generic function with 1 method)
julia> onesandzeros(5, 8; threshold=0.2)
5×8 BitMatrix:
0 0 0 0 0 0 1 1
0 0 0 0 1 1 1 0
0 1 1 0 0 0 0 0
0 0 0 0 1 0 1 0
0 0 0 0 0 0 0 0
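The same thresholding idea can be sketched outside Julia. The following Python version is only an illustration using the standard library; the name `ones_and_zeros` and the `seed` parameter are assumptions made for the example, not part of the answer above.

```python
import random

def ones_and_zeros(rows, cols, threshold=0.5, seed=None):
    # Draw a uniform number in [0, 1) per entry; the entry is 1 when the
    # draw falls below the threshold, so `threshold` is the probability of a 1.
    rng = random.Random(seed)
    return [[int(rng.random() < threshold) for _ in range(cols)]
            for _ in range(rows)]

mat = ones_and_zeros(5, 8, threshold=0.2, seed=1)
for row in mat:
    print(*row)
```

Because each entry is drawn independently, the shape is arbitrary and does not need to be square.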
This amounts to sampling each entry from a Binomial distribution with a single trial (n = 1), i.e. a Bernoulli distribution.
If 0 and 1 should be equally probable, the default parameters of Binomial() (n = 1, p = 0.5) encode this:
julia> using Distributions
julia> rand(Binomial(), 3, 5)
3×5 Matrix{Int64}:
1 1 1 1 1
1 0 1 0 0
0 0 0 1 1
The expected number of 1s in the matrix is proportional to the parameter p, so if the matrix should contain on average ~10% ones and ~90% zeros, sample from Binomial(1, 0.1):
julia> rand(Binomial(1, 0.1), 3, 5)
3×5 Matrix{Int64}:
0 0 1 0 0
0 0 0 0 0
0 0 0 1 1
See also: Distributions.Binomial
Although some comments had half convinced me that the result of my code was reasonable, I kept coming back to its simple procedure, and I eventually found the snag. I forgot that a let block creates a new hard scope, so the mat assigned inside it is not the mat I inspected afterwards in the global scope. If I return mat from the block, it shows a different (and expected) result on each run:
julia> let mat = Matrix{Int64}(undef, 3, 5)
zero_or_one(shift) = rand()+shift>0.5 ? 0 : 1
foreach(x->mat[x]=zero_or_one(0.3), eachindex(mat))
return mat
end
3×5 Matrix{Int64}:
0 0 0 0 1
0 0 0 0 0
1 0 0 1 0
Then, to make it available in the global scope, a begin block does the job:
julia> begin mat = Matrix{Int64}(undef, 3, 5)
zero_or_one(shift) = rand()+shift>0.5 ? 0 : 1
foreach(x->mat[x]=zero_or_one(0.3), eachindex(mat))
end
julia> mat
3×5 Matrix{Int64}:
0 1 0 0 1
0 0 0 0 0
1 0 0 1 0
(Note that the above results aren't precisely the same.)
When dealing with recursive equations in mathematics, it is common to write equations that hold over some range k = 1,...,d with the implicit convention that if d < 1 then the set of equations is considered to be empty. When programming in R I would like to be able to write for loops in the same way as a mathematical statement (e.g., a recursive equation) so that it interprets a range with upper bound lower than the lower bound as being empty. This would ensure that the syntax of the algorithm mimics the syntax of the mathematical statement on which it is based.
Unfortunately, R does not interpret the for loop in this way, and so this commonly leads to errors when you program your loops in a way that mimics the underlying mathematics. For example, consider a simple function where we create a vector of zeros with length n and then change the first d values to ones using a loop over the elements in the range k = 1,...,d. If we input d < 1 into this function we would like the function to recognise that the loop is intended to be empty, so that we would get a vector of all zeros. However, using a standard for loop we get the following:
#Define a function using a recursive pattern
MY_FUNC <- function(n,d) {
OBJECT <- rep(0, n);
for (k in 1:d) { OBJECT[k] <- 1 }
OBJECT }
#Generate some values of the function
MY_FUNC(10,4);
[1] 1 1 1 1 0 0 0 0 0 0
MY_FUNC(10,1);
[1] 1 0 0 0 0 0 0 0 0 0
MY_FUNC(10,0);
[1] 1 0 0 0 0 0 0 0 0 0
#Not what we wanted
MY_FUNC(10,-2);
[1] 1 1 1 1 1 1 1 1 1 1
#Not what we wanted
My Question: Is there any function in R that performs loops like a for loop, but interprets the loop as empty if the upper bound is lower than the lower bound? If there is no existing function, is there a way to program R to read loops this way?
Please note: I am not seeking answers that simply re-write this example function in a way that removes the loop. I am aware that this can be done in this specific case, but my goal is to get the loop working more generally. This example is shown only to give a clear view of the phenomenon I am dealing with.
There is, imho, no generic for loop that does what you want, but you could easily get that behaviour by adding
if (d < 1) break
as the first statement inside the loop.
EDIT
If you don't want an error to be raised when negative input is given, you can use pmax with seq_len:
MY_FUNC <- function(n,d) {
OBJECT <- rep(0, n);
for (k in seq_len(pmax(0, d))) { OBJECT[k] <- 1 }
OBJECT
}
MY_FUNC(10, 4)
#[1] 1 1 1 1 0 0 0 0 0 0
MY_FUNC(10, 1)
#[1] 1 0 0 0 0 0 0 0 0 0
MY_FUNC(10, 0)
#[1] 0 0 0 0 0 0 0 0 0 0
MY_FUNC(10, -2)
#[1] 0 0 0 0 0 0 0 0 0 0
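To make the difference between the two index sequences explicit, here is a rough emulation in Python (not R). The helper names `r_colon` and `seq_len_clamped` are hypothetical and only stand in for R's `1:d` and `seq_len(pmax(0, d))` on integer input.

```python
def r_colon(a, b):
    # Emulates R's a:b for integers: it counts up or down so that
    # both endpoints are always included, and is therefore never empty.
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))

def seq_len_clamped(d):
    # Emulates seq_len(pmax(0, d)): 1, ..., d, or empty when d < 1.
    return list(range(1, max(0, d) + 1))

print(r_colon(1, 4))        # [1, 2, 3, 4]
print(r_colon(1, 0))        # [1, 0]
print(r_colon(1, -2))       # [1, 0, -1, -2]
print(seq_len_clamped(4))   # [1, 2, 3, 4]
print(seq_len_clamped(0))   # []
print(seq_len_clamped(-2))  # []
```

The count-down behaviour of `1:d` is why the original loop still runs for d = 0 and d = -2, while the clamped sequence matches the mathematical convention of an empty range.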
Previous Answer
Prefer seq_len over 1:d; it takes care of the d = 0 case, although it raises an error for negative d:
MY_FUNC <- function(n,d) {
OBJECT <- rep(0, n);
for (k in seq_len(d)) { OBJECT[k] <- 1 }
OBJECT
}
MY_FUNC(10, 4)
#[1] 1 1 1 1 0 0 0 0 0 0
MY_FUNC(10, 1)
#[1] 1 0 0 0 0 0 0 0 0 0
MY_FUNC(10, 0)
#[1] 0 0 0 0 0 0 0 0 0 0
MY_FUNC(10, -2)
Error in seq_len(d) : argument must be coercible to non-negative integer
The function can be vectorized
MY_FUNC <- function(n,d) {
rep(c(1, 0), c(d, n -d))
}
MY_FUNC(10, 4)
#[1] 1 1 1 1 0 0 0 0 0 0
MY_FUNC(10, 1)
#[1] 1 0 0 0 0 0 0 0 0 0
MY_FUNC(10, 0)
#[1] 0 0 0 0 0 0 0 0 0 0
MY_FUNC(10, -2)
Error in rep(c(1, 0), c(d, n - d)) : invalid 'times' argument
I have this document-term matrix from the R package tm, which I have coerced with as.matrix. MWE here:
> inspect(dtm[1:ncorpus, intersect(colnames(dtm), thai_list)])
<<DocumentTermMatrix (documents: 15, terms: 4)>>
Non-/sparse entries: 17/43
Sparsity : 72%
Maximal term length: 12
Weighting : term frequency (tf)
Terms
Docs toyota_suv gmotors_suv ford_suv nissan_suv
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 2 0 0
5 0 4 0 0
6 1 1 0 0
7 1 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
I need to subset this as.matrix(dtm) so that I get only the documents (rows) which refer to toyota_suv but to no other vehicle. I can get a subset for one term (toyota_suv) using dmat<-as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota_suv")]), which works well. How do I set up a query for documents where toyota_suv is non-zero but the values of all non-toyota_suv columns are zero? I could have specified ==0 column by column, but this matrix is dynamically generated: in some markets there may be four cars, in others ten, so I cannot specify the colnames beforehand. How do I (dynamically) require all the non-toyota_suv columns to be zero, like all_others==0?
Any help will be much appreciated.
You can accomplish this by getting the index position for toyota_suv, and then subsetting dtm to match that for non-zero, and all other columns using negative indexing on the same index variable to ensure they are all zero.
Here I modified your dtm slightly so that the two cases where toyota_suv is non-zero meet the criteria you are looking for (since none in your example met them):
dtm <- read.table(textConnection("
toyota_suv gmotors_suv ford_suv nissan_suv
0 1 0 0
0 1 0 0
0 1 0 0
0 2 0 0
0 4 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
0 1 0 0"), header = TRUE)
Then it works:
# get the index of the toyota_suv column
index_toyota_suv <- which(colnames(dtm) == "toyota_suv")
# select only cases where toyota_suv is non-zero and others are zero
dtm[dtm[, "toyota_suv"] > 0 & !rowSums(dtm[, -index_toyota_suv]), ]
## toyota_suv gmotors_suv ford_suv nissan_suv
## 6 1 0 0 0
## 7 1 0 0 0
Note: This is not really a text analysis question at all, but rather one for how to subset matrix objects.
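The selection logic does not depend on the column names being known in advance. As a hedged illustration in Python (not R), using the modified data from this answer, it might look like the sketch below; the function name `only_term_row_numbers` is made up for the example.

```python
def only_term_row_numbers(rows, term):
    # Return 1-based row numbers where `term` is non-zero and every
    # other column, whatever its name, is zero.
    return [
        i for i, row in enumerate(rows, start=1)
        if row[term] > 0
        and all(v == 0 for name, v in row.items() if name != term)
    ]

cols = ["toyota_suv", "gmotors_suv", "ford_suv", "nissan_suv"]
data = [
    [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 2, 0, 0], [0, 4, 0, 0],
    [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0],
]
rows = [dict(zip(cols, values)) for values in data]

print(only_term_row_numbers(rows, "toyota_suv"))  # [6, 7]
```

The "all other columns" test plays the same role as the negative index combined with rowSums in the R code.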
It would be helpful if you provided the exact code you are running and a sample data set to work with so that we can replicate your work and provide a working example.
That said, if I understand your question correctly, you are looking for a way to set all non-toyota columns to zero. You could try:
df[colnames(df) != "toyota"] <- 0
I have an object currency. I would like to select one column, named by the variable Pair, and only the rows where that column equals 1.
>currency
EURUSD EURUSDi USDJPY USDJPYi GBPUSD GBPUSDi AUDUSD AUDUSDi XAUUSD XAUUSDi zeroes
2000-07-16 0 0 0 0 0 1 0 0 0 0 0
2000-07-23 0 0 0 0 0 1 0 0 0 0 0
2000-07-30 0 0 0 0 0 1 0 0 0 0 0
2000-08-06 0 0 0 0 0 0 0 0 0 1 0
2000-08-13 0 1 0 0 0 0 0 0 0 0 0
From the console I can do it with subset like this :
> subset(currency$GBPUSDi, GBPUSDi == 1)
GBPUSDi
2000-07-16 1
2000-07-23 1
2000-07-30 1
2000-08-06 1
2000-08-13 1
2000-08-20 1
But as soon as it is passed in a script with variable Pair it fails. I've searched for hours in the documentation and I'm having a headache trying to figure out what is wrong.
Here are the different commands I've tried:
subset (currency$Pair, Pair == 1)
subset (currency, Pair = 1, select = Pair)
weights$Cur[currency$Pair = 1]
The one that works is currency[,c(Pair)], but it only selects the column. How can I add the row selection where Pair equals 1?
currency[,c(Pair)][Pair = 1] and subset (currency[,c(Pair)], Pair = 1) don't work, with either = or ==.
currency$Pair[currency$Pair == 1] should work ($Pair selects the column Pair and [currency$Pair == 1] selects the values equal to 1). It looks like it doesn't work in your case because currency doesn't contain a variable named Pair.
If currency is not a data frame but a matrix, you can try
currency[currency[, c("Pair")] == 1, c("Pair")]
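The distinction between a column literally named Pair and a variable that holds a column name can be sketched as follows. This is Python rather than R, with a few values copied from the table in the question; `select_ones` and the lowercase `pair` variable are hypothetical names used only for illustration.

```python
def select_ones(table, dates, column):
    # Look the column up by the name stored in `column`, then keep
    # only the dates whose value in that column equals 1.
    return {d: v for d, v in zip(dates, table[column]) if v == 1}

dates = ["2000-07-16", "2000-07-23", "2000-07-30", "2000-08-06", "2000-08-13"]
currency = {
    "GBPUSDi": [1, 1, 1, 0, 0],
    "EURUSDi": [0, 0, 0, 0, 1],
    "XAUUSDi": [0, 0, 0, 1, 0],
}

pair = "GBPUSDi"            # the variable holds the column name
print("Pair" in currency)   # False: no column is literally called "Pair"
print(select_ones(currency, dates, pair))
```

Indexing with the variable's value, rather than with the literal name, is what lets the same code run for any pair.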
I have a series of data in true/false format, e.g. it looks like it could be generated from rbinom(n, 1, .1). I want a column that represents the number of rows since the last true. So the resulting data will look like
true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1
What is an efficient way to go from true/false to gap? (In practice this will be done on a large dataset with many different ids.)
DF <- read.table(text="true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1", header=TRUE)
DF$gap2 <- sequence(rle(DF$true.false)$lengths) * #create a sequence for each run length
(1 - DF$true.false) * #multiply with 0 for all 1s
(cumsum(DF$true.false) != 0L) #multiply with zero for the leading zeros
# true.false gap gap2
#1 0 0 0
#2 0 0 0
#3 1 0 0
#4 0 1 1
#5 0 2 2
#6 1 0 0
#7 1 0 0
#8 0 1 1
The cumsum part might not be the most efficient for large vectors. Something like
if (DF$true.false[1] == 0) DF$gap2[seq_len(rle(DF$true.false)$lengths[1])] <- 0
might be an alternative (and of course the rle result could be stored temporarily to avoid calculating it twice).
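For readers who want to see the intended result computed step by step, here is a single-pass sketch in Python (not R). It assumes, as in the example table, that rows before the first true get a gap of 0; the function name `gaps_since_last_true` is invented for the illustration.

```python
def gaps_since_last_true(flags):
    # Count rows since the most recent true (1). Rows before the first
    # true are reported as 0, matching the expected output in the question.
    gaps = []
    seen_true = False
    count = 0
    for flag in flags:
        if flag:
            seen_true = True
            count = 0
        elif seen_true:
            count += 1
        gaps.append(count)
    return gaps

print(gaps_since_last_true([0, 0, 1, 0, 0, 1, 1, 0]))
# [0, 0, 0, 1, 2, 0, 0, 1]
```

With many ids, the same function could be applied to each id's flags separately.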
OK, let me put this in an answer.
1) No-brainer method
data['gap'] = 0
for (i in 2:nrow(data)) {
  # only rows that are false extend the gap; true rows keep gap = 0
  if (data[i,'true/false'] == 0) {
    data[i,'gap'] = data[i-1,'gap'] + 1
  }
}
2) No if check
data['gap'] = 0
for (i in 2:nrow(data)) {
  # multiply by (1 - true/false) so that a true row resets the gap to 0
  data[i,'gap'] = (data[i-1,'gap'] + 1) * (1 - data[i,'true/false'])
}
I really don't know which is faster: both perform the same number of reads from data, but (1) has an if statement, and I don't know how fast that is compared to a single multiplication.