I am quite new to SPSS and I need to count the number of certain errors made in a test (Stroop Test). There are three kinds of variables:
theCongruencies - can be 'I' or 'C' for incongruent or congruent
theWordkeys - code for a key that indicates the first letter of a word
thePressedKeys - code for the key pressed by the user
Each type exists 80 times, named e.g. theCongruencies_1 to theCongruencies_80.
I want to count how many times there is the same value in theWordKeys_x and thePressedKeys_x when theCongruencies_x has the value 'I'.
Example: theCongruencies_42 = 'I' theWordKeys_42 = 88 thePressedKeys_42 = 88
So I need to do something like this in my SPSS Code:
COMPUTE InhibErrs = COUNT(
IF(
theCongruencies_1 to theCongruencies_80 EQ 'I'
AND theWordkeys_1 to theWordkeys_80 EQ thePressedKeys_1 to thePressedKeys_80)).
execute.
Thanks a lot
Deego
Try this:
compute countVar=0.
do repeat theCongruencies=theCongruencies_1 to theCongruencies_80
/theWordkeys=theWordkeys_1 to theWordkeys_80
/thePressedKeys=thePressedKeys_1 to thePressedKeys_80.
compute countVar=sum(countVar, (theCongruencies="I" and theWordkeys=thePressedKeys)).
end repeat.
exe.
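For anyone who wants to check the logic outside SPSS, here is a minimal Python sketch of what the do repeat loop computes for a single case (row). The function name and the example lists are made up for illustration; they are not part of the SPSS syntax above.

```python
def count_inhibition_errors(congruencies, word_keys, pressed_keys):
    """Count trials that are incongruent ('I') and where the pressed key
    equals the word key, mirroring the condition inside do repeat."""
    return sum(
        1
        for c, w, p in zip(congruencies, word_keys, pressed_keys)
        if c == "I" and w == p
    )

# Hypothetical data for one participant with four trials.
congruencies = ["I", "C", "I", "I"]
word_keys = [88, 71, 66, 82]
pressed_keys = [88, 71, 89, 82]
print(count_inhibition_errors(congruencies, word_keys, pressed_keys))  # 2
```

In SPSS the logical expression evaluates to 1 or 0 per trial, which is why summing it into countVar gives the same count.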
I am currently having an issue. Basically, I have 2 similar functions in terms of concept but the results do not align. These are the codes I learned from Bioinformatics I on Coursera.
The first code is simply creating a dictionary of occurrences of each k-mer pattern from a text (which is a long stretch of nucleotides). In this case, k is 5.
def FrequencyMap(text,k):
    freq ={}
    for i in range (0, len(text)-k+1):
        freq[text[i:i+k]]=0
        for j in range (0, len(text)-k+1):
            if text[j:j+k] == text[i:i+k]:
                freq[text[i:i+k]] +=1
    return freq, max(freq)
The text and the result dictionary are kinda long, but the main point is when I call max(freq), it returns the key 'TTTTC', which has a value of 1.
Meanwhile, I wrote another code that is simply based on the previous code to generate the 5-mer patterns that have the max values (number of occurrences in the text).
def FrequentWords(text, k):
    a = FrequencyMap(text, k)
    m = max(a.values())
    words = []
    for i in a:
        if a[i]==m:
            words.append(i)
    return words,m
And this code returns 'ACCTA', which has the value of 99, meaning it appears 99 times in the text. This makes total sense.
I used the same text and k (k=5) for both codes. I ran the codes on Jupyter Notebook. Why does the first one not return 'ACCTA'?
Thank you so much,
Here is the text, if anyone wants to try:
"ACCATCCCTAGGGCATACCTAAGTCTACCTAAAAGGCTACCTAATACCATACCTAATTACCTAACTACCTAAAATAAGTCTACCTAATACCTAATACCTAAAGTTACCTAACGTACCTAATACCTAATACCTAACCACTACCTAATCCGATTTACCTAACAACCGATCGAGTACCTAATCGATACCTAAATAACGGACAATATACCTAATTACCTAATACCTAATACCTAAGTGTACCTAAGACGTCTACCTAATTGTACCTAACTACCTAATTACCTAAGATTAATACCTAATACCTAATTTACCTAATACCTAACGTGGACTACCTAATACCTAACTTTTCCCCTACCTAATACCTAACTGTACCTAAATACCTAATACCTAAGCTACCTAAAGAACAACATTGTACGTGCGCCGTACCTAAATACCTAACAACTACCTAACTGATACCTAATAGTGATTACCTAACGCTTCTACCTAACTACCTAAGTACCTAACGCTACCTAACTACCTAATGTCCACAAAATACCTAATACCTAATAGCTACCTAATTGTGTACCTAAGTACCTAACCTACCTAATAATACCTAAAAATACCTAAGTACCTAACGTACCTAAATTTTACCTAATCTACCTAACGTACCTAATACCTAATTATACCTAATTACCTAATGGTTACCTAAGTTACCTAATATGCCACTACCTAACCTTACCTAAGACCTACCTAATAGGTACCTAACTGGGTACCTAAGGCAGTTTACCTAATTCAGGGCTACCTAATGTACCTAATACCTAAGTACCTAATACCTAATCCCATACCTAATATTTACCTAAGGGCACCGGTACCTAATACCTAATACCTAATACCTAAACCTTCGTACCTAAATACCTAATCTACCTAATGTACCTAAGGTACCTAATACCTAAGTCACTACCTAATACCTAATACCTAATGGGAGGAGCTTACCTAAGGTTACCTAATTACCTAAATACCTAATCGTTACCTAA"
Why does the first one not return 'ACCTA'?
Because max(freq) returns the maximum key of the dictionary. In this case the keys are strings (the k-mers), and strings are compared lexicographically (alphabetically). Hence the maximum one is the last string when they are sorted alphabetically.
If you want the first function to return the k-mer that occurs most often, you should change max(freq) to max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]. Here, you are sorting the (kmer, count) pairs (that's the key_value_pair parameter of the lambda expression) based on the frequency and then selecting the kmer.
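To see the difference on a small scale, here is a minimal sketch with a made-up dictionary (the counts are invented for illustration, not computed from your text):

```python
# Hypothetical k-mer counts, only to illustrate how max() behaves.
freq = {"ACCTA": 99, "TTTTC": 1, "CCTAA": 98}

# max() over a dict iterates over its keys, so this is the
# lexicographically largest k-mer, regardless of its count.
largest_key = max(freq)

# Comparing by count instead gives the most frequent k-mer.
most_frequent = max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]

# Equivalent and a bit shorter: let max() look up each key's count.
most_frequent_short = max(freq, key=freq.get)

print(largest_key)    # TTTTC
print(most_frequent)  # ACCTA
```

Note that if several k-mers share the highest count, max returns only the first one it encounters, which is why your second function collects all of them in a list.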
I have a simple data set with training sessions for some athletes. Let's say I want to visualize how many training sessions are done as an average of the number of athletes, either in total or divided by the clubs that exist. I hope the data set is somewhat self-describing.
To norm the number of activities by the number of athletes I use two measures:
TotalSessions = COUNTA(Tab_Sessions[Session key])
AvgAthlete = AVERAGEX(VALUES(Tab_Sessions[Athlete]),[TotalSessions])
I use AvgAthlete as the value in both visuals shown below. If I filter on the clubs, the values are as expected, but with no filter applied I get some strange values.
What I guess happens is that since Athlete B doesn't do any strength, Athlete B is not included in the norming factor for strength. Is there a DAX function that can solve this?
If I didn't have the training sessions as a hierarchy (Type-Intensity), it would be pretty straightforward to do some kind of workaround with a calculated column, but it won't work with hierarchical categories. The expected results calculated in Excel are shown below:
Data set as csv:
Session key;Club;Athlete;Type;Intensity
001;Fast runners;A;Cardio;High
002;Fast runners;A;Strength;Low
003;Fast runners;B;Cardio;Low
004;Fast runners;B;Cardio;High
005;Fast runners;B;Cardio;High
006;Brutal boxers;C;Cardio;High
007;Brutal boxers;C;Strength;High
If you specifically want to aggregate this across whatever choice you have made in your Club selection, then you simply write out a simple measure that does that:
AvgAthlete =
VAR _athletes =
    CALCULATE (
        DISTINCTCOUNT ( 'Table'[Athlete] ),
        ALLEXCEPT ( 'Table', 'Table'[Club] )
    )
RETURN
    DIVIDE (
        [Sessions],
        _athletes
    )
Here we use a distinct count of values in the Athlete column, with all filters removed apart from on the Club column. This is, as far as I interpret your question, the denominator you are after.
Then divide the total number of sessions by this number of athletes. Here is the result:
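If you want to sanity-check the numbers outside Power BI, the same calculation can be sketched in plain Python on the sample rows from the question. This only illustrates the idea (count the sessions per Type, divide by all athletes in the selected clubs); the function name avg_per_athlete is made up here and is not a DAX or Power BI feature.

```python
from collections import Counter

# Sample rows from the question: (session key, club, athlete, type, intensity).
sessions = [
    ("001", "Fast runners", "A", "Cardio", "High"),
    ("002", "Fast runners", "A", "Strength", "Low"),
    ("003", "Fast runners", "B", "Cardio", "Low"),
    ("004", "Fast runners", "B", "Cardio", "High"),
    ("005", "Fast runners", "B", "Cardio", "High"),
    ("006", "Brutal boxers", "C", "Cardio", "High"),
    ("007", "Brutal boxers", "C", "Strength", "High"),
]

def avg_per_athlete(rows, clubs=None):
    """Sessions per Type divided by the number of distinct athletes in the
    selected clubs (all clubs when clubs is None)."""
    selected = [r for r in rows if clubs is None or r[1] in clubs]
    athletes = {r[2] for r in selected}         # denominator ignores Type
    per_type = Counter(r[3] for r in selected)  # numerator per Type
    return {t: n / len(athletes) for t, n in per_type.items()}

print(avg_per_athlete(sessions))
print(avg_per_athlete(sessions, clubs={"Fast runners"}))
```

The point is that the athlete count is taken before splitting by Type, so Athlete B still counts in the denominator for Strength even though B has no Strength sessions.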
I have to include participants in a data frame (a new or the existing one) if they have a higher score in the invalid condition than in the valid condition. But I have data from two time points (T1 and T3).
I have tried this one: data_new <- subset(data_raw, T1_invalid > T1_valid & T3_invalid > T3_valid)
However, it did not work because, for instance, some participants may have higher invalid score in just one time (T1), not in the second time (T3), or vice versa.
For example, a person can have higher invalid in one of the times, let's say T1_invalid > T1_valid. This should be included to the new data frame, it is okay. But, T3_invalid - T3_valid should be excluded because the invalid score is not higher than the valid score. But when you use AND operator, it excludes the person because, they have to have higher invalid scores in both T1 and T3. So, we over exclude in that case.
When you use OR operator it is the same. For example, a person has a higher score in T1_invalid > T1_valid, but not in the T3_invalid - T3_valid. Then, since one of the conditions is okay, it includes the person, but this person failed at T3. So, we should exclude T3_invalid - valid scores.
So basically, I was looking for something that can check them separately. Then, I decided to set them to null one by one like this:
data_raw[data_raw$T1_invalid < data_raw$T1_valid, c("T1_invalid", "T1_valid")] <- NA
data_raw[data_raw$T3_invalid < data_raw$T3_valid, c("T3_invalid", "T3_valid")] <- NA
However, it did not let me do this because I use the variables twice, once in the condition and once when setting them to null.
Does anyone have any idea? By the way they have to be in the same data frame for using in the model.
Here I provide a plain data.table solution. You can give it a try.
library(data.table)
setDT(data_raw)
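# Set each time point to NA separately when invalid is lower than valid.
# The *_valid lines rely on the NA just written to *_invalid:
# NA < x is NA, so ifelse() returns NA for those rows as well.
data_raw[, T1_invalid := ifelse(T1_invalid < T1_valid, NA, T1_invalid)]
data_raw[, T1_valid := ifelse(T1_invalid < T1_valid, NA, T1_valid)]
data_raw[, T3_invalid := ifelse(T3_invalid < T3_valid, NA, T3_invalid)]
data_raw[, T3_valid := ifelse(T3_invalid < T3_valid, NA, T3_valid)]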
This is a Teradata-specific question. In the RANDOM function, I want the lower bound to be taken directly from one of the columns, e.g. I want a random value between the age of the subscriber and today. So I want to write RANDOM(int_tenure, 0). I am receiving the error below:
"Syntax error, expected something like an integer or a decimal number or a floating point number or '+' or '-' between '(' and the word 'int_tenure'"
RANDOM can only take literals (no field/column names), and the first parameter has to be lower than or equal to the second one. So what you want is not possible in a single step. But you can work around it: generate a random factor in [0;1] and apply this factor to the interval.
select 10 as lower_bound
,20 as upper_bound
-- ,random(lower_bound, upper_bound) -- will not work
,random(0, 1000)/1000.0000 as RND_Factor -- a random factor between 0 and 1
,(upper_bound-lower_bound)*RND_Factor+lower_bound;
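The same idea, written as a small Python sketch for illustration only (Python's random module stands in for Teradata's RANDOM here, and the function name is made up): draw an integer from a fixed literal range, turn it into a factor between 0 and 1, and stretch that factor over the per-row interval.

```python
import random

def random_between(lower_bound, upper_bound, steps=1000):
    """Scale a random factor in [0, 1] onto [lower_bound, upper_bound]."""
    # random.randint(0, steps) plays the role of random(0, 1000) in the SQL.
    rnd_factor = random.randint(0, steps) / steps
    return (upper_bound - lower_bound) * rnd_factor + lower_bound

value = random_between(10, 20)
print(value)
```

The steps value controls how many distinct results are possible; with 1000 steps the result can only take 1001 different values inside the interval, so pick a larger literal range if you need finer granularity.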
I'm trying to find the best way (best as in performance) to add a new column called "Season", holding one of the four seasons of the year, to a data frame of the form:
MON DAY YEAR
1 1 1 2010
2 1 1 2010
3 1 1 2010
4 1 1 2010
5 1 1 2010
6 1 1 2010
One straightforward way to do this is to create a loop conditioned on the MON and DAY columns and assign the values one by one, but I think there is a better way. I've seen suggestions in other posts for ifelse or := or apply, but in most of those problems the choice is just binary, or the value can be assigned by a single given function f of the parameters.
In my situation I believe a vector containing the four season labels plus somehow the conditions would suffice, but I don't see how to put everything together. My situation resembles more of a switch case.
Using modulo arithmetic and the fact that arithmetic operators coerce logical-values to 0/1 will be far more efficient if the number of rows is large:
d$SEASON <- with(d, c( "Winter","Spring", "Summer", "Autumn")[
1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )
The first added "1" shifts the range of the %%4 operation on all the results inside the parentheses from 0:3 to 1:4. The second, subtracted "1" shifts the (inner) 1:12 range back to 0:11, and the (DAY >= 21) advances the boundary months forward by one.
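To make the arithmetic easier to follow, here is the same mapping as a minimal Python sketch (an illustration only, not a replacement for the vectorized R one-liner). Python lists are 0-indexed, so the leading "1+" is not needed; the names SEASONS and season are made up for this example.

```python
SEASONS = ["Winter", "Spring", "Summer", "Autumn"]

def season(mon, day):
    """Map a month (1-12) and a day to a season, switching on the 21st of
    March, June, September and December."""
    # (day >= 21) is a bool; Python treats it as 0 or 1 in arithmetic.
    shifted_month = (day >= 21) + mon - 1  # 0..12
    # Integer division groups months in threes; % 4 wraps late December
    # (index 4) back to Winter (index 0).
    return SEASONS[(shifted_month // 3) % 4]

print(season(1, 1))    # Winter
print(season(3, 21))   # Spring
print(season(12, 21))  # Winter
```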
I'll start by giving a simple answer then I'll delve into the details.
A quick way to do this would be to check the values of MON and DAY and output the correct season. This is trivial:
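f=function(m,d){
  # seasons switch on the 21st of March, June, September and December
  if(m==12 && d>=21) i=3                # winter (late December)
  else if(m>9 || (m==9 && d>=21)) i=2   # autumn
  else if(m>6 || (m==6 && d>=21)) i=1   # summer
  else if(m>3 || (m==3 && d>=21)) i=0   # spring
  else i=3                              # winter (January to 20 March)
  i                                     # return the season code
}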
This f function, given a month and a day, will return an integer corresponding to the season (it doesn't matter much if it's an integer or a string; an integer only saves a bit of memory, but that's a technicality).
Now you want to apply it to your data.frame. No need to use a loop for this ; we'll use mapply. d will be our simulated data.frame. We'll factor the output to have nice season names.
d=data.frame(MON=rep(1:12,each=30),DAY=rep(1:30,12),YEAR=2012)
d$SEA=factor(
mapply(f,d$MON,d$DAY),
levels=0:3,
labels=c("Spring","Summer","Autumn","Winter")
)
There you have it !
I realize seasons don't always change on the 21st. If you need fine tuning, you should define a 3-dimension array as a global variable to store the accurate days. Given a season and a year, you could access the corresponding day and replace the "21"s in the f function with the right calls (you would obviously add a third argument for the year).
About the things you mentioned in your question:
ifelse is the "functional" way to make a conditional test. On atomic variables it's only slightly better than the conditional statements, but it is vectorized, meaning that if the argument is a vector, it will loop over its elements by itself. I'm not familiar with it, but it's the way to go for an optimized solution
mapply is derived from sapply of the "apply family" and allows to call a function with several arguments on vector (see ?mapply)
I don't think := is a standard operator in R, which brings me to my next point :
data.table ! It's a package that provides a new structure that extends data.frame for fast computing and typing (among other things). := is an operator in that package and allows to define new columns. In our case you could write d[,SEA:=mapply(f,MON,DAY)] if d is a data.table.
If you really care about performance, I can't insist enough on using data.table, as it is a major improvement if you have a lot of data. I don't know if it would really affect the computing time of the solution I proposed, though.