Updating running list if event happens - count

Sample data:
clear
* Input data
input student CITATION EXPELLED hadCITATION hadEXPELLED
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
2 0 0 0 0
2 0 0 0 0
2 1 0 1 0
2 1 0 1 0
2 0 0 1 0
3 1 0 1 0
3 0 1 1 1
3 1 1 1 1
3 1 0 1 1
3 1 0 1 1
4 . . . .
4 . 0 . 0
4 0 0 0 0
4 0 1 0 1
4 1 0 1 0
end
I want to create these hadCITATION and hadEXPELLED variable columns that update based on the responses of CITATION and EXPELLED.

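This may help. I can't see that this makes sense without a time or sequence variable, so the code below creates one from the current sort order. My guess is that once you've had a CITATION or an expulsion, then that's your history. The rules may be more complicated, but I can't see that you're explaining them. I can't see the rationale for your example for student 4. Note that sum() treats missing values as zero, which is why want1 and want2 are 0 where your variables are missing.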
clear
input student CITATION EXPELLED hadCITATION hadEXPELLED
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
2 0 0 0 0
2 0 0 0 0
2 1 0 1 0
2 1 0 1 0
2 0 0 1 0
3 1 0 1 0
3 0 1 1 1
3 1 1 1 1
3 1 0 1 1
3 1 0 1 1
4 . . . .
4 . 0 . 0
4 0 0 0 0
4 0 1 0 1
4 1 0 1 0
end
gen long time = _n
bysort student (time) : gen want1 = sum(CITATION) > 0
by student: gen want2 = sum(EXPELLED) > 0
list student CIT EXP hadCIT hadEXP want?, sepby(student)
+---------------------------------------------------------------------+
| student CITATION EXPELLED hadCIT~N hadEXP~D want1 want2 |
|---------------------------------------------------------------------|
1. | 1 0 0 0 0 0 0 |
2. | 1 0 0 0 0 0 0 |
3. | 1 0 0 0 0 0 0 |
4. | 1 0 0 0 0 0 0 |
5. | 1 0 0 0 0 0 0 |
|---------------------------------------------------------------------|
6. | 2 0 0 0 0 0 0 |
7. | 2 0 0 0 0 0 0 |
8. | 2 1 0 1 0 1 0 |
9. | 2 1 0 1 0 1 0 |
10. | 2 0 0 1 0 1 0 |
|---------------------------------------------------------------------|
11. | 3 1 0 1 0 1 0 |
12. | 3 0 1 1 1 1 1 |
13. | 3 1 1 1 1 1 1 |
14. | 3 1 0 1 1 1 1 |
15. | 3 1 0 1 1 1 1 |
|---------------------------------------------------------------------|
16. | 4 . . . . 0 0 |
17. | 4 . 0 . 0 0 0 |
18. | 4 0 0 0 0 0 0 |
19. | 4 0 1 0 1 0 1 |
20. | 4 1 0 1 0 1 1 |
+---------------------------------------------------------------------+
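For readers working in R rather than Stata, the same "ever had the event so far" logic can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `df` and the helper `ever()` are hypothetical names, only students 2 and 4 are typed in, rows are assumed to be in time order within each student already, and missing values are counted as zero to mirror Stata's sum().

```r
# Hypothetical data frame mirroring students 2 and 4 from the example above
df <- data.frame(
  student  = c(2, 2, 2, 2, 2, 4, 4, 4, 4, 4),
  CITATION = c(0, 0, 1, 1, 0, NA, NA, 0, 0, 1),
  EXPELLED = c(0, 0, 0, 0, 0, NA, 0, 0, 1, 0)
)

# Running sum within each student, with missing counted as 0;
# the indicator is 1 from the first event onward (hypothetical helper)
ever <- function(x, g) {
  x[is.na(x)] <- 0
  as.integer(ave(x, g, FUN = cumsum) > 0)
}

df$want1 <- ever(df$CITATION, df$student)
df$want2 <- ever(df$EXPELLED, df$student)
df
```

If a real time variable exists, sorting by student and time before calling the helper would play the role of bysort student (time).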

Related

comparison.wordcloud error of strwidth(words[i], cex = size[i], ...) : invalid 'cex' value (SOLVED)

I have a TermDocumentMatrix tdm1 and I passed it to this call:
comparison.cloud(tdm1, random.order=FALSE,
colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
title.size=1, max.words=50, scale=c(4, 0.5),rot.per=0.4)
However, I got this error: "Error in strwidth(words[i], cex = size[i], ...) : invalid 'cex' value"
I am not sure which cex value I am missing.
The tdm1 is as follows:
Docs
Terms anger anticipation disgust fear joy sadness surprise trust
bag 1 0 0 0 0 1 0 0
choices 1 1 0 0 1 2 1 1
limited 1 0 0 0 0 1 0 0
plastic 1 0 0 0 0 1 0 0
provided 1 0 0 0 0 1 0 0
abit 0 1 0 0 1 1 1 1
ai 0 2 0 0 2 1 2 2
always 0 1 0 1 0 0 0 1
amazed 0 1 0 0 1 1 1 1
amount 0 1 0 0 1 0 1 1
app 0 2 0 1 2 1 2 2
area 0 1 0 0 1 0 1 1
areas 0 1 0 0 0 0 0 0
around 0 1 0 0 1 0 1 1
atmosphere 0 1 0 0 1 0 1 1
attended 0 1 0 0 1 1 1 1
back 0 1 0 0 1 1 1 0
basah 0 1 0 1 1 0 1 1
bought 0 1 0 0 0 0 0 0
brands 0 1 0 0 1 0 1 1
bras 0 1 0 1 1 0 1 1
breeze 0 1 0 0 1 1 1 1
buy 0 1 0 1 1 1 1 1
can 0 2 0 0 1 1 1 0
cant 0 1 0 0 0 0 0 0
cashiers 0 1 0 0 1 0 1 1
cbd 0 1 0 0 0 0 0 0
charged 0 1 0 0 1 1 1 1
choose 0 2 0 0 1 0 0 0
chopstick 0 1 0 0 0 0 0 0
classes 0 1 0 0 1 0 0 0
come 0 2 0 0 2 2 2 1
concept 0 4 0 0 3 0 3 3
confused 0 1 0 0 1 1 1 1
contains 0 1 0 0 1 1 1 1
convenient 0 8 0 0 5 1 4 4
cool 0 4 0 0 4 1 4 4
correct 0 1 0 0 1 1 1 1
cream 0 1 0 0 1 0 1 1
cup 0 1 0 0 0 0 0 0
curious 0 1 0 0 0 0 0 0
current 0 1 0 0 1 0 1 1
customer 0 1 0 0 0 0 0 0
cutlery 0 1 0 0 0 0 0 0
doesnt 0 1 0 0 1 0 1 1
dont 0 1 0 0 1 0 1 1
don’t 0 1 0 1 1 1 1 1
download 0 1 0 0 1 1 1 1
drinks 0 1 0 0 1 0 0 0
easy 0 3 0 0 2 0 2 2
eat 0 1 0 0 1 1 1 1
electronic 0 1 0 0 1 1 1 1
eleven 0 1 0 0 1 0 1 1
entering 0 1 0 0 1 1 1 1
ereciept 0 1 0 0 1 1 1 1
especially 0 1 0 0 1 0 1 1
even 0 2 0 2 1 1 1 1
exit 0 1 0 0 1 1 1 1
experience 0 3 0 1 2 1 2 2
explained 0 1 0 0 1 1 1 1
feel 0 1 0 0 1 0 1 1
first 0 1 0 0 1 1 1 1
found 0 1 0 0 1 0 1 1
free 0 1 0 0 1 0 1 1
friends 0 1 0 0 1 1 1 1
fussfree 0 1 0 0 1 0 1 1
gantry 0 1 0 0 1 0 1 1
get 0 2 0 1 2 1 1 1
go 0 3 0 1 3 1 3 3
good 0 2 0 0 2 0 2 2
goods 0 1 0 1 1 1 1 1
goto 0 1 0 0 0 0 0 0
great 0 4 0 0 3 0 3 3
greatly 0 1 0 0 1 0 1 1
hasslefree 0 1 0 0 0 0 0 1
history 0 1 0 0 1 1 1 1
hope 0 1 0 0 1 0 1 1
hour 0 1 0 0 1 0 1 1
hours 0 1 0 0 1 0 1 1
ice 0 1 0 0 1 0 1 1
im 0 1 0 0 1 1 1 0
inside 0 1 0 0 1 1 1 1
items 0 3 0 0 3 2 3 2
jiffy 0 1 0 0 1 1 1 1
just 0 4 0 0 3 2 3 2
large 0 1 0 0 1 0 1 1
leave 0 1 0 0 1 1 1 0
less 0 1 0 0 0 0 0 1
link 0 2 0 0 2 1 2 2
linked 0 1 0 0 1 0 1 1
lots 0 1 0 0 0 0 0 1
love 0 3 0 0 3 1 3 2
lovely 0 1 0 0 1 1 1 1
makes 0 1 0 0 1 0 1 1
making 0 1 0 0 0 0 0 1
many 0 1 0 0 0 0 0 0
method 0 1 0 0 1 0 1 1
methods 0 1 0 0 1 1 1 1
minute 0 2 0 0 1 1 1 2
mrt 0 1 0 1 1 0 1 1
muchneeded 0 1 0 0 1 0 1 1
near 0 1 0 0 1 0 1 1
nearby 0 1 0 0 1 0 1 1
new 0 1 0 0 1 0 1 1
newly 0 1 0 0 0 0 0 0
nice 0 1 0 0 0 0 0 0
noodles 0 1 0 0 0 0 0 0
number 0 1 0 0 1 1 1 1
offers 0 1 0 0 1 0 1 1
often 0 2 0 0 2 2 2 1
opened 0 1 0 0 0 0 0 0
operate 0 1 0 0 1 0 1 1
order 0 1 0 0 1 0 1 1
outlets 0 1 0 0 1 1 1 0
patiently 0 1 0 0 1 1 1 1
pay 0 1 0 0 1 0 1 1
payment 0 3 0 0 3 1 3 3
people 0 1 0 0 0 0 0 0
perfect 0 1 0 0 1 1 1 1
pick 0 4 0 0 3 0 3 3
picking 0 1 0 0 1 1 1 1
prepare 0 1 0 0 0 0 0 0
prices 0 1 0 0 1 0 1 1
product 0 2 0 0 2 2 2 2
products 0 7 0 0 5 1 5 4
promotion 0 1 0 1 1 0 1 1
promotions 0 1 0 0 0 0 0 1
quench 0 1 0 0 1 1 1 1
queue 0 2 0 0 2 1 2 2
quite 0 1 0 0 1 1 1 1
range 0 2 0 0 1 1 1 0
ready 0 1 0 0 1 1 1 1
reasonable 0 1 0 0 1 0 1 1
recieved 0 1 0 0 1 1 1 1
recommend 0 1 0 0 0 0 0 1
reduced 0 1 0 0 1 0 1 1
rush 0 1 0 0 1 1 1 0
salut 0 1 0 0 0 0 0 1
sandwiches 0 1 0 0 1 1 1 1
scan 0 1 0 0 1 0 1 1
see 0 3 0 0 2 0 2 2
seems 0 1 0 0 0 0 0 0
setup 0 1 0 0 0 0 0 1
shop 0 3 0 0 2 1 2 2
shopping 0 4 0 1 4 2 4 4
show 0 1 0 0 1 1 1 1
small 0 1 0 0 0 0 0 0
smu 0 2 0 0 2 0 2 2
snacks 0 3 0 0 3 1 1 1
spent 0 1 0 0 1 0 1 1
staff 0 3 0 1 3 3 3 3
stared 0 1 0 1 1 1 1 1
stop 0 1 0 0 1 1 1 1
store 0 14 0 0 10 5 9 9
stores 0 1 0 0 1 0 1 1
students 0 1 0 0 1 0 0 0
stuff 0 1 0 0 0 0 0 0
super 0 2 0 0 2 0 2 2
sure 0 1 0 0 1 1 1 1
sweets 0 1 0 0 1 0 0 0
take 0 1 0 0 1 1 1 0
technology 0 4 0 0 4 2 4 4
thankfully 0 1 0 0 1 1 1 1
theres 0 1 0 0 0 0 0 0
thirst 0 1 0 0 1 1 1 1
thought 0 1 0 0 0 0 0 0
time 0 2 0 0 2 1 2 2
took 0 1 0 0 0 0 0 1
truly 0 1 0 0 0 0 0 1
unmanned 0 1 0 0 1 1 1 1
use 0 1 0 0 1 0 1 1
used 0 1 0 0 1 0 1 1
useful 0 1 0 0 1 0 0 0
users 0 1 0 0 1 0 1 1
variety 0 4 0 0 4 0 3 3
wait 0 2 0 0 1 0 1 1
waited 0 1 0 0 1 1 1 1
walk 0 2 0 0 1 0 1 1
want 0 1 0 0 0 0 0 0
wanted 0 1 0 0 1 0 1 1
wasnt 0 1 0 0 1 1 1 1
whatever 0 1 0 0 1 0 1 1
wide 0 4 0 0 3 1 2 1
won’t 0 1 0 1 1 1 1 1
worry 0 1 0 1 1 1 1 1
avoid 0 0 0 1 0 0 0 0
away 0 0 0 1 0 0 0 0
better 0 0 0 1 0 0 0 0
choice 0 0 0 1 0 0 0 0
customers 0 0 0 1 0 0 0 0
deceptive 0 0 0 1 0 0 0 0
expired 0 0 0 1 0 0 0 0
listed 0 0 0 1 0 0 0 0
make 0 0 0 1 0 0 0 0
marketing 0 0 0 1 0 0 0 0
minutes 0 0 0 1 0 0 0 0
much 0 0 0 1 0 0 0 0
purchases 0 0 0 1 0 0 0 0
qr 0 0 0 1 0 0 0 0
resulting 0 0 0 1 0 0 0 0
scanner 0 0 0 1 0 0 0 0
screen 0 0 0 1 0 0 0 0
showing 0 0 0 1 0 0 0 0
still 0 0 0 1 0 0 0 0
takes 0 0 0 1 0 0 0 0
tries 0 0 0 1 0 0 0 0
trusted 0 0 0 1 0 0 0 0
trusting 0 0 0 1 0 0 0 0
works 0 0 0 1 0 0 0 0
Hence, I am not sure what the issue is, since there is no NA!
Hope you can help. Thank you!
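One thing visible in the matrix above is that every count in the disgust column is 0. If comparison.cloud rescales each column by its column total (I believe it does, but treat that as an assumption), a zero total gives NaN sizes, which could produce an invalid 'cex' even though the input has no NA. A minimal sketch on a small hypothetical matrix `m`, standing in for as.matrix(tdm1) with three rows copied from above:

```r
# Small hypothetical stand-in for as.matrix(tdm1): three rows from above
m <- matrix(
  c(1, 0, 0, 0, 0, 1, 0, 0,
    1, 1, 0, 0, 1, 2, 1, 1,
    0, 1, 0, 1, 0, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Terms = c("bag", "choices", "always"),
    Docs  = c("anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust")
  )
)

# Dividing a column by a zero total yields NaN, not NA
m[, "disgust"] / sum(m[, "disgust"])

# Keep only the documents (columns) that contain at least one term
keep <- colSums(m) > 0
m_ok <- m[, keep, drop = FALSE]
colnames(m_ok)
```

With the real data, the filtered matrix would then be passed to comparison.cloud in place of tdm1.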

R equivalent of Stata `tabulate , generate( )` command

I want to mimic the behavior of Stata's tabulate , generate() command in R. It is illustrated below; the command's functionality is twofold. First, in my example, it produces a one-way table of frequency counts. Second, it generates a dummy variable for each value of the variable (var1), naming them (d_1 - d_7) with the prefix (stubname) declared in the generate() option. My question concerns the second functionality. Base R solutions are preferred, but package-dependent ones are also welcome.
[Edit]: My final goal is to generate a data.frame() that emulates the last data set printed on the screen.
clear all
input var1
0
1
2
2
2
2
42
42
777
888
999999
end
tabulate var1 ,gen(d_)
/* var1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 1 9.09 9.09
1 | 1 9.09 18.18
2 | 4 36.36 54.55
42 | 2 18.18 72.73
777 | 1 9.09 81.82
888 | 1 9.09 90.91
999999 | 1 9.09 100.00
------------+-----------------------------------
Total | 11 100.00 */
list, sep(11)
/* +--------------------------------------------------+
| var1 d_1 d_2 d_3 d_4 d_5 d_6 d_7 |
|--------------------------------------------------|
1. | 0 1 0 0 0 0 0 0 |
2. | 1 0 1 0 0 0 0 0 |
3. | 2 0 0 1 0 0 0 0 |
4. | 2 0 0 1 0 0 0 0 |
5. | 2 0 0 1 0 0 0 0 |
6. | 2 0 0 1 0 0 0 0 |
7. | 42 0 0 0 1 0 0 0 |
8. | 42 0 0 0 1 0 0 0 |
9. | 777 0 0 0 0 1 0 0 |
10. | 888 0 0 0 0 0 1 0 |
11. | 999999 0 0 0 0 0 0 1 |
+--------------------------------------------------+ */
set.seed(123)
df = data.frame(var1 = factor(sample(10, 20, TRUE)))
df = data.frame(df, model.matrix(~0+var1, df)) # 0 here suppresses the intercept, so every level of var1 gets its own dummy column and no base group is dropped.
names(df)[-1] = paste0('d_', 1:(ncol(df)-1))
df
var1 d_1 d_2 d_3 d_4 d_5 d_6 d_7 d_8 d_9
1 3 0 1 0 0 0 0 0 0 0
2 3 0 1 0 0 0 0 0 0 0
3 10 0 0 0 0 0 0 0 0 1
4 2 1 0 0 0 0 0 0 0 0
5 6 0 0 0 0 1 0 0 0 0
6 5 0 0 0 1 0 0 0 0 0
7 4 0 0 1 0 0 0 0 0 0
8 6 0 0 0 0 1 0 0 0 0
9 9 0 0 0 0 0 0 0 1 0
10 10 0 0 0 0 0 0 0 0 1
11 5 0 0 0 1 0 0 0 0 0
12 3 0 1 0 0 0 0 0 0 0
13 9 0 0 0 0 0 0 0 1 0
14 9 0 0 0 0 0 0 0 1 0
15 9 0 0 0 0 0 0 0 1 0
16 3 0 1 0 0 0 0 0 0 0
17 8 0 0 0 0 0 0 1 0 0
18 10 0 0 0 0 0 0 0 0 1
19 7 0 0 0 0 0 1 0 0 0
20 10 0 0 0 0 0 0 0 0 1
I guess you are assuming each value in var1 is unique so that you get dummy variables rather than counts in the d_ fields.
You could try something like this:
var1 <- 1:5
dummy_matrix <- vapply(var1, function(x) as.numeric(var1 == x), rep(1, 5)) # create a matrix of dummy vars
colnames(dummy_matrix) <- paste0("d_", var1) # name the columns
cbind(var1, dummy_matrix) # bind to var1
Output:
var1 d_1 d_2 d_3 d_4 d_5
1 1 1 0 0 0 0
2 2 0 1 0 0 0
3 3 0 0 1 0 0
4 4 0 0 0 1 0
5 5 0 0 0 0 1
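The snippet above relies on the values of var1 being unique. A possible variant for the data in the question, where values repeat, is to loop over the sorted distinct values instead. This is a sketch only; the names `lev` and `dummies` are made up for illustration, and the columns are numbered in sorted order to match d_1 to d_7 in the Stata listing.

```r
# var1 as entered in the Stata example of the question
var1 <- c(0, 1, 2, 2, 2, 2, 42, 42, 777, 888, 999999)

# One column per distinct value, in sorted order
lev <- sort(unique(var1))
dummies <- sapply(lev, function(v) as.integer(var1 == v))
colnames(dummies) <- paste0("d_", seq_along(lev))

# Bind the dummy columns to the original variable
df <- data.frame(var1, dummies)
df
```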

Empty nodes when creating a SOM in R

I am trying to create a SOM map based on records with different discrete classifications (tags) like the example below
Record Tag1 Tag2 Tag3 Tag4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
The problem I am facing is that there are many empty nodes in the SOM. From what I have read, each node should have 5-10 records but still, this is not working.
Could it be that all observations are very different from one another?
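One check that may help here: records with identical tag vectors always map to the same node, so the number of occupied nodes cannot exceed the number of distinct tag patterns. A rough sketch using only the records shown above (the names `tags`, `pattern` and `counts` are hypothetical):

```r
# The records shown above, typed in as a data frame
tags <- data.frame(
  Record = c(3555, 6447, 5523, 7550, 6330, 2451, 4308, 8917, 4780,
             6802, 2021, 5792, 5475, 4198, 223, 4811, 678),
  Tag1 = c(1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  Tag2 = 0,
  Tag3 = c(0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1),
  Tag4 = 0
)

# Collapse each record's tags into one pattern string and count the patterns
pattern <- do.call(paste0, tags[c("Tag1", "Tag2", "Tag3", "Tag4")])
counts <- table(pattern)
counts
length(counts)  # number of distinct tag patterns
```

If the full data set has far fewer distinct patterns than the grid has nodes, empty nodes are to be expected, and the issue would be that many observations are identical rather than very different.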

Retrieve values in each cluster in R

I have successfully run the DBSCAN algorithm (here is the stripped down command):
results <- dbscan(data,MinPts=15, eps=0.01)
and plotted my clusters:
plot(results, data)
results$cluster returns a vector of numeric values (0 marks noise points). The value at each index is the cluster to which the original data at that index belongs:
[1] 0 1 2 1 0 0 2 1 0 0 0 1 2 0 2 0 2 0 0 1 2 0 2 2 0 1 2 0 1 0 1 0 2 0 0 0 1 1 0 1 2 0 0 0 1 0 0 1 1 0 1
[52] 0 2 2 0 0 1 2 2 0 2 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 2 2 2 2 2 0 0 0 0 0 2 1 2 1 0 2 0 0 1 1 1 0 0 1
[103] 2 1 1 0 1 0 1 1 0 0 0 0 1 2 0 0 1 1 1 1 0 0 0 1 0 0 2 2 1 1 0 1 2 1 0 0 1 0 1 2 0 0 2 0 0 2 2 2 2 0 1
However, how can I retrieve the values of the original data that is in each cluster? For example, how can I get all the values from the original data that are in cluster #2?
Okay, this should do the trick for, e.g., cluster #2:
data[results$cluster==2,]
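To get every cluster at once instead of one at a time, base R's split() can be used. The sketch below uses a small made-up `data` and a made-up `cluster` vector standing in for results$cluster, so it does not depend on any clustering package:

```r
# Hypothetical stand-ins for `data` and `results$cluster`
data <- data.frame(x = c(0.10, 0.11, 0.50, 0.51, 0.90),
                   y = c(0.20, 0.21, 0.60, 0.61, 0.05))
cluster <- c(1, 1, 2, 2, 0)

# One data frame per cluster label, returned as a named list
by_cluster <- split(data, cluster)
by_cluster[["2"]]        # rows of the original data in cluster 2
nrow(by_cluster[["2"]])  # size of cluster 2
```

The list is indexed by the cluster label as a string, so the element named "0" holds the points labelled 0.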

Calculating all possible combinations within a range in R

I'm trying to generate all combinations of four variables, where each variable is an integer between 0 and 10. Is there an easy way to do this in R?
X | Y | Z | W
-------------
0 | 0 | 0 | 0
1 | 0 | 0 | 0
1 | 1 | 0 | 0
1 | 1 | 1 | 0
. . . .
. . . .
. . . .
10|10 |10 |10
If W, X, Y and Z already exist as vectors of the allowed values:
expand.grid(W = W, X = X, Y = Y, Z = Z)
W X Y Z
1 0 0 0 0
2 1 0 0 0
3 2 0 0 0
4 3 0 0 0
5 4 0 0 0
6 5 0 0 0
7 6 0 0 0
8 7 0 0 0
9 8 0 0 0
10 9 0 0 0
11 10 0 0 0
12 0 1 0 0
13 1 1 0 0
14 2 1 0 0
15 3 1 0 0
...
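A self-contained version of the call above, assuming each variable simply ranges over 0:10 (the name `grid` is only for illustration):

```r
# Each variable ranges over the integers 0 to 10
W <- X <- Y <- Z <- 0:10

# One row per combination; the first variable varies fastest
grid <- expand.grid(W = W, X = X, Y = Y, Z = Z)
nrow(grid)     # 11^4 combinations
head(grid, 3)
```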
All combinations can also be generated with table. Converting the result to a data frame yields what you're looking for.
> as.data.frame(table(W=0:10, X=0:10, Y=0:10, Z=0:10))[, c('W','X','Y','Z')]
W X Y Z
1 0 0 0 0
2 1 0 0 0
3 2 0 0 0
4 3 0 0 0
5 4 0 0 0
6 5 0 0 0
7 6 0 0 0
8 7 0 0 0
9 8 0 0 0
10 9 0 0 0
11 10 0 0 0
12 0 1 0 0
13 1 1 0 0
...

Resources