SPSS: Compute variable based on COUNT of different values

I've simplified this into an example of what I'm trying to do. I have 3 different variables (Mention1, Mention2, Mention3) with values ranging from 1 to 5.
I need to compute a new variable that counts how often each value appears across all cases and all 3 variables. The goal is to show how many times each value appears (e.g. 5 appears 3 times). I've been told this can be achieved with the Compute Variable method, but I haven't been able to figure out how.
An example .sav file can be found here, if it helps. Thanks in advance for any answers!

You can use a loop to create 5 new variables, each containing the number of occurrences of a specific value in each row. Then to get the complete count you just sum these variables:
do repeat vr=occ1 to occ5/vl=1 to 5.
compute vr=sum(Mention1=vl, Mention2=vl, Mention3=vl).
end repeat.
exe.
Now you have the 5 new variables (e.g. occ1 holds the number of occurrences of the value 1 in each row). There are a few ways to get the complete total.
To simply get it in the output window:
descriptives occ1 to occ5/statistics=sum.
But in your question you mentioned adding the counts to the actual dataset. This can be done using the aggregate command:
aggregate /out=* mode=addvariables /break= /TotOcc1 to TotOcc5=sum(occ1 to occ5).
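For readers who want to cross-check the SPSS result elsewhere, the same two-step logic (per-row counts, then column sums) can be sketched in R, the language used in the rest of this page. The data frame below is made up for illustration; only the variable names Mention1 to Mention3 and occ1 to occ5 come from the answer above.

```r
# Hypothetical data standing in for the three Mention variables (values 1 to 5).
dat <- data.frame(
  Mention1 = c(1, 5, 3, 2),
  Mention2 = c(5, 5, 4, 1),
  Mention3 = c(2, 3, 1, 1)
)
mention_cols <- c("Mention1", "Mention2", "Mention3")

# Step 1: per-row counts, one new column per value (mirrors occ1 to occ5).
for (vl in 1:5) {
  dat[[paste0("occ", vl)]] <- rowSums(dat[mention_cols] == vl)
}

# Step 2: overall totals, i.e. the sum of each occ column over all cases.
totals <- colSums(dat[paste0("occ", 1:5)])
print(totals)
```

In this made-up data the value 5 appears 3 times, so the occ5 total is 3.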

I wasn't able to access your .sav file from this computer, but this is the solution I came up with for your example data:
*Dataset1 should equal your original dataset name.
DATASET ACTIVATE Dataset1.
*This creates a table with the counts of each occurrence of 1 through 5.
FREQUENCIES VARS MENTION1 MENTION2 MENTION3.
EXECUTE.
*This translates the frequencies table to a dataset called 'freqtbl'.
DATASET DECLARE freqtbl.
OMS Select Tables
/IF SUBTYPES =['Frequencies']
/DESTINATION format = sav outfile=freqtbl.
freq vars= MENTION1 MENTION2 MENTION3.
OMSEND.
*This will make sure you are referencing the new dataset you created: 'freqtbl'.
*The ALTER TYPE statement turns VAR2, which holds the values 1 through 5, back into a numeric variable.
DATASET ACTIVATE freqtbl.
ALTER TYPE VAR2 (F1.0).
*This is a simple if/then statement across all of your values.
IF VAR2 = 1 FREQ_1 = SUM(FREQUENCY).
IF VAR2 = 2 FREQ_2 = SUM(FREQUENCY).
IF VAR2 = 3 FREQ_3 = SUM(FREQUENCY).
IF VAR2 = 4 FREQ_4 = SUM(FREQUENCY).
IF VAR2 = 5 FREQ_5 = SUM(FREQUENCY).
EXECUTE.
*This cleans up your dataset and computes a dummy variable to aggregate counts.
DELETE VARIABLES COMMAND_ TO CumulativePercent.
COMPUTE DUMMY = 1.
EXECUTE.
*This aggregates your frequency columns to count all occurrences of 1 through 5.
AGGREGATE
/OUTFILE = *
MODE = ADDVARIABLES
/BREAK DUMMY
/COUNT_1 = SUM(FREQ_1)
/COUNT_2 = SUM(FREQ_2)
/COUNT_3 = SUM(FREQ_3)
/COUNT_4 = SUM(FREQ_4)
/COUNT_5 = SUM(FREQ_5).
EXECUTE.

Related

For every unit increase in one column value , another column entries increase

I have a simulation dataset with 500 replicates - each replicate contains 300 ids. When rep = 1, id ranges from 1-300; when rep = 2, id again ranges from 1-300 and so on.
I want to get the following: when rep = 1, id 1-300; when rep = 2, id 301-600; and so on. This can easily be done with nested if-else statements if the number of replicates is relatively small. For example, the following code does the job for four replicates:
d1 <- mutate(d1, ID = ifelse(rep==1, id,
                      ifelse(rep==2, id+300,
                      ifelse(rep==3, id+600, id+900))))
But how should I address this when I have 500 replicates?
So essentially my question is: how do I code it so that for every unit increase in the replicate column, the id column increases by 300? I have attached the result for 4 replicates (the result of the above code).
Here is a snapshot of the data (images for replicate 1 and replicate 4 not shown).
I would compute the new ID directly from rep, adding 300 for every replicate after the first:
d1 <- mutate(d1, ID = id + 300*(rep-1))
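A minimal base R sketch of the same arithmetic on a made-up miniature dataset (3 replicates of 4 ids instead of 500 replicates of 300). The name n_ids is an assumption introduced here so the offset is not hard-coded.

```r
# Hypothetical miniature of the simulation: 3 replicates of 4 ids each.
n_ids <- 4
d1 <- data.frame(
  rep = rep(1:3, each = n_ids),
  id  = rep(1:n_ids, times = 3)
)

# Offset each id by the number of ids in all earlier replicates.
d1$ID <- d1$id + n_ids * (d1$rep - 1)
print(d1)
```

With n_ids set to 300 this is the same formula as in the answer, and it scales to any number of replicates without nested ifelse() calls.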

How to compare sum of several columns to a single column and returns an error if there is a difference?

I am creating some pre-load files that need to be cleaned by ensuring that the sum of 2 columns is equal to the total column. The data entry was done manually by RAs, so the data is prone to error. My problem is verifying that the data is clean and, if there is an error, finding the easiest way to identify the rows that don't add up by returning their ID numbers.
This is my data
df1 <- data.frame(
id = c(1,2,3,4,5,6,7),
male = c(2,4,2,6,3,4,5),
female = c(3,6,4,9,2,4,1),
Total = c(5,10,7,15,6,8,7)
)
The code I am looking for is supposed to check whether male + female = Total in each row, and ONLY return an error where there is a disagreement. In my data above, I would expect an error saying that the sum of male and female is not equal to the total in 3 rows, with IDs 3, 5 and 7.
You could also do something more fancy like this one liner:
df1$id[apply(df1[c('male','female')], 1, sum) != df1$Total]
which will give you just the ids (Aziz's answer works great too)
You can use:
mismatch_rows = which(df1$male + df1$female != df1$Total)
To get the indices of the rows that don't match. If you want the actual values, you can simply use:
df1[mismatch_rows,]
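Since the question asks for an error that names the offending IDs, the comparison above can be wrapped in a small function. This is only a sketch: check_totals() is a hypothetical helper, not part of any package, and treating rows with missing values as mismatches is an assumption.

```r
df1 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7),
  male = c(2, 4, 2, 6, 3, 4, 5),
  female = c(3, 6, 4, 9, 2, 4, 1),
  Total = c(5, 10, 7, 15, 6, 8, 7)
)

# Hypothetical helper: stops with an error listing the ids that do not add up.
check_totals <- function(df) {
  bad <- df$male + df$female != df$Total
  bad[is.na(bad)] <- TRUE  # assumption: a row with missing values counts as a mismatch
  if (any(bad)) {
    stop("male + female does not equal Total for id(s): ",
         paste(df$id[bad], collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Capture the error message here so the script keeps running.
msg <- tryCatch(check_totals(df1), error = function(e) conditionMessage(e))
print(msg)
```

Called without tryCatch(), check_totals(df1) halts the script, which may be what you want in a pre-load cleaning step.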

R function that creates indicator variable values unique between several columns

I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and if not NA, then if DRUGID_1 is not in list respDRUGID then create a new column variable 'r.DRUGID' and set value to 1. Also increase the unique count by 1. Otherwise the value of DRUGID_1 is already in list respDRUGID then set r.DRUGID = 1
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
I'm taking a guess at your data and the required result format. This uses the tidyverse package:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, key = "slot", value = "DRUGID", -patient, na.rm = TRUE) %>%
arrange(patient, DRUGID) %>%
group_by(patient) %>%
summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99, whose post helped me think about the problem in another way.
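The binary indicator columns described in the question can also be sketched directly in base R, without building strings first. The toy data reuses the guessed layout from the answer above, and the r. prefix follows the naming in the question; everything else is an assumption about the real DAWN data.

```r
# Toy data in the same shape as the guessed example (NA = no drug mentioned).
drug_df <- data.frame(
  patient  = c("A", "B", "C", "D"),
  DRUGID_1 = c(1, 2, 2, 3),
  DRUGID_2 = c(2, NA, 1, 1),
  DRUGID_3 = c(3, NA, NA, 2)
)

drug_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)

# All distinct drug codes mentioned anywhere; sort() drops the NAs.
drug_ids <- sort(unique(unlist(drug_df[drug_cols])))

# One 0/1 column per code: 1 if the code appears in any DRUGID column of the row.
ind <- vapply(
  drug_ids,
  function(d) as.integer(rowSums(drug_df[drug_cols] == d, na.rm = TRUE) > 0),
  integer(nrow(drug_df))
)
colnames(ind) <- paste0("r.", drug_ids)

# Rows stay in their original order, so other variables still line up.
result <- cbind(drug_df, ind)
print(result)
```

Because no rows are added, dropped or reordered, the result keeps the obs x variable layout needed for the survey weights. With 2 million patients and many distinct codes the indicator matrix can get large, so it may be worth restricting drug_ids to the drugs of interest.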

Dealing with duplicated data, reassign a new value

It seems that when we have duplicated data, most of the time we want to remove the duplicated data.
Let's say we do not want to exclude it, but instead want to assign it a new value.
Taking the following data as a example
b <- c(1:100,1:99,1:104,1:105,1:105)
So we see that the values 1-99 are repeated 5 times, the number 100 is repeated 4 times, the number 101 is repeated 3 times, etc.
How can one search through b (ideally in sequential order), find a repeated/duplicate number and then assign it a new value?
Try this if you're interested in assigning one (universal) new value:
b <- c(1:100,1:99,1:104,1:105,1:105)
b[duplicated(b)] = 888 # new value
The duplicated() function returns TRUE at the position of every second or later occurrence of a value in b (first occurrences are FALSE), so only the repeats are overwritten.
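A small sketch of how duplicated() behaves, on a shorter stand-in vector. Option 2 is an assumption about what "a new value" could mean: each repeat gets its own fresh value above the current maximum rather than one shared value.

```r
b <- c(1:3, 1:2, 1:4)   # stand-in for the longer vector: 1 2 3 1 2 1 2 3 4
dup <- duplicated(b)    # FALSE for first occurrences, TRUE for every repeat

# Option 1: one universal replacement value, as in the answer above.
b_flag <- replace(b, dup, 888L)

# Option 2 (assumption): give every repeat its own new value beyond max(b).
b_new <- b
b_new[dup] <- max(b) + seq_len(sum(dup))

print(b_flag)
print(b_new)
```

Both options walk through b in order and leave the first occurrence of each value untouched.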

R scaling multiple columns while maintaining one specific column

Alright, I've been asked to be more specific and provide code when I ask questions. So here we go!
I need to calculate z-scores for 6 columns in a data set ("Grade2"), which are my 9th-14th columns. Column 1 is a number ID. Columns 2-8 are demographics. Ultimately, I need these scaled columns to be appended onto my existing dataframe. My approach was this: create dataframe of scaled scores, rename columns in new dataframe, merge onto old.
Grade2z = as.data.frame(scale(Grade2[ ,9:14]))#create new dataframe that ONLY has the scaled CBM and aR scores
colnames(Grade2z) = c("Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2 = merge(Grade2, Grade2z)
This caused an issue. Since I had no ID to merge by, it created 40,000 some observations. So I went back and tried this:
Grade2z = as.data.frame(scale(Grade2[ ,c(1,9:14)]))#create new dataframe that ONLY has the scaled CBM and aR scores
colnames(Grade2z) = c("Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2 = merge(Grade2, Grade2z)
This didn't work either as column 1 is a numeric vector and therefore was also scaled. Is there a simple solution I'm missing? Without manipulating the original dataset (Grade2), how can I create a new one that includes columns 1 and 9:14 without scaling column 1?
Edit: The actual data don't really matter in this case. They are just strings of values. This random dataframe should work.
Grade2 = data.frame(replicate(14, sample(0:50, 1000, replace=TRUE)))
colnames(Grade2)[1] = "ID"
Grade2$ID = 1:1000 #unique IDs so that merging by ID matches rows one-to-one
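Bind the unscaled ID column to the scaled columns, then merge on ID (this assumes the IDs are unique):
Grade2z = as.data.frame(cbind(Grade2[,1,drop=FALSE], scale(Grade2[ ,9:14])))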
colnames(Grade2z) = c("ID", "Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2 = merge(Grade2, Grade2z)
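If the scaled columns only need to be appended, the merge can be skipped entirely, because scale() returns its rows in the same order as the input. A minimal sketch on a randomly generated stand-in dataset (the seed and the sequential IDs are assumptions made here for reproducibility):

```r
set.seed(1)  # reproducible stand-in data
Grade2 <- data.frame(replicate(14, sample(0:50, 1000, replace = TRUE)))
colnames(Grade2)[1] <- "ID"
Grade2$ID <- seq_len(nrow(Grade2))

# Names for the six z-score columns, taken from the question.
z_names <- c("Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")

# Assign the scaled matrix straight into six new columns; ID is never touched.
Grade2[z_names] <- scale(Grade2[, 9:14])

str(Grade2[c("ID", z_names)])
```

This avoids the row explosion from merging without a key and also works when IDs are not unique.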
