Alright, I've been asked to be more specific and provide code when I ask questions. So here we go!
I need to calculate z-scores for 6 columns in a data set ("Grade2"), which are my 9th-14th columns. Column 1 is a number ID. Columns 2-8 are demographics. Ultimately, I need these scaled columns to be appended onto my existing dataframe. My approach was this: create dataframe of scaled scores, rename columns in new dataframe, merge onto old.
Grade2z = as.data.frame(scale(Grade2[ ,9:14]))#create new dataframe that ONLY has the scaled CBM and aR scores
colnames(Grade2z) = c("Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2 = merge(Grade2, Grade2z)
This caused an issue. Since I had no ID to merge by, it created some 40,000 observations. So I went back and tried this:
Grade2z = as.data.frame(scale(Grade2[ ,c(1,9:14)]))#create new dataframe that ONLY has the scaled CBM and aR scores
colnames(Grade2z) = c("Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2 = merge(Grade2, Grade2z)
This didn't work either as column 1 is a numeric vector and therefore was also scaled. Is there a simple solution I'm missing? Without manipulating the original dataset (Grade2), how can I create a new one that includes columns 1 and 9:14 without scaling column 1?
Edit: The actual data don't really matter in this case. They are just strings of values. This random dataframe should work.
Grade2 = data.frame(replicate(14, sample(0:50, 1000, replace = TRUE))) # replacement is needed to draw 1000 values from only 51 possibilities
colnames(Grade2)[1] = "ID"
Grade2z = as.data.frame(cbind(Grade2[, 1, drop = FALSE], scale(Grade2[, 9:14])))
colnames(Grade2z) = c("ID", "Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2 = merge(Grade2, Grade2z)
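An alternative sketch that skips the merge (and the ID bookkeeping) entirely, by assigning the scaled columns straight onto the data frame; the toy Grade2 here is a stand-in for the real one:

```r
# toy stand-in for the real Grade2 (14 columns, first is ID)
Grade2 <- data.frame(replicate(14, sample(0:50, 1000, replace = TRUE)))
colnames(Grade2)[1] <- "ID"

# assign the scaled columns directly under new names; no merge needed
zcols <- c("Fallz", "Winterz", "Springz", "aRFallz", "aRWinterz", "aRSpringz")
Grade2[zcols] <- scale(Grade2[, 9:14])
```

Because the scaled columns are added in place, row order is preserved and no join key is required.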
I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Values from the whole dataset.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur once, therefore no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
Create a data frame that defines the hierarchy, and assigns each possibility a numeric score.
hi <- data.frame(Poss = unique(df$Set), Nums = c(105, 104))
In this case, A gets a numerical value of 105, B gets a numerical score of 104 (so A would be preferred over B in the case of a tie).
Join the hierarchy to the original data frame.
library(dplyr)
matched <- left_join(df, hi, by = c("Set" = "Poss"))
Then, add a frequency column to your original data frame that lists the number of times each unique Set-Values combination occurs (this step uses data.table).
library(data.table)
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row of each Set-Values combo, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set (ascending), then mult (descending), and use distinct() to keep the row with the highest mult score within each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of B (numerical score = 104) will be added together (3 instances would give B a total score in the mult column of 312) but whenever A and B occur at the same frequency, A will win out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure they are spaced out enough for the dataset at hand. For example, using 2 for A and 1 for B doesn't work, because it requires 3 instances of B to trump A instead of only 2. Likewise, if you anticipate large differences in the frequencies of A and B, use scores like 1005 and 1004, since with the scores above A eventually catches up to B even when B is more frequent (200 * 104 is less than 199 * 105).
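The steps above can be condensed into a single pipeline; this is a sketch of the same idea using dplyr alone, with the df and hi frames as defined earlier:

```r
library(dplyr)

df <- data.frame(Set = c("A", "A", "A", "B", "B", "B", "B"),
                 Values = c(1, 1, 2, 1, 1, 2, 2))
hi <- data.frame(Poss = c("A", "B"), Nums = c(105, 104))

final <- df %>%
  left_join(hi, by = c("Set" = "Poss")) %>%
  count(Set, Values, Nums, name = "freq") %>%  # frequency of each Set-Values combo
  mutate(mult = Nums * freq) %>%               # weight frequency by hierarchy score
  arrange(Set, desc(mult)) %>%
  distinct(Set, .keep_all = TRUE)              # keep the top-scoring row per Set
```

The count() call replaces the setDT()/distinct() pair, since it both tallies and collapses the combinations in one step.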
I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
If one chose more than one observation, the answers however are not separated in the data (Data)
My goal is to create a matrix with all possible observations as variables and marked with 1 (yes) and 0 (No). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames(strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021))
This gives me the matrix I want, but it does not separate the observations, so now I have a matrix with 20 variables instead of the seven.
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), sep = ";")
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
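One possible approach, sketched under the assumption that the Q12 answers are concatenated with ";" as the separator (the toy data frame below is a stand-in for the real strategisch_2021): split each answer into one row per selected category, then pivot those categories into 0/1 indicator columns.

```r
library(dplyr)
library(tidyr)

# toy stand-in for strategisch_2021
strategisch_2021 <- data.frame(
  id  = 1:3,
  Q12 = c("Inhalt", "Inhalt;Arbeit", "Spitzenpolitiker")
)

einzeln <- strategisch_2021 %>%
  separate_rows(Q12, sep = ";") %>%  # one row per selected answer
  mutate(value = 1) %>%
  pivot_wider(names_from = Q12, values_from = value, values_fill = 0)
```

Only categories that actually occur in the data become columns; any of the seven that were never chosen would need to be added afterwards (e.g. with a default of 0).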
I am creating some pre-load files that need to be cleaned by ensuring that the sum of two columns equals the total-sum column. The data entry was done manually by RAs, so the data are prone to error. My problem is verifying that the data are clean and, where there is an error, finding the easiest way to identify the rows that don't add up by returning the ID number.
This is my data
df1 <- data.frame(
id = c(1,2,3,4,5,6,7),
male = c(2,4,2,6,3,4,5),
female = c(3,6,4,9,2,4,1),
Total = c(5,10,7,15,6,8,7)
)
The code I am looking for is supposed to compare whether male + female equals Total in each row, and ONLY return an error where there is disagreement. In my data above, I would expect an error saying that the sums of male and female in the 3 rows with IDs 3, 5, and 7 are not equal to the Total.
You could also do something more fancy like this one liner:
df1$id[rowSums(df1[c('male', 'female')]) != df1$Total]
which will give you just the ids (Aziz's answer works great too)
You can use:
mismatch_rows = which(df1$male + df1$female != df1$Total)
To get the indices of the rows that don't match. If you want the actual values, you can simply use:
df1[mismatch_rows,]
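Combining the two ideas, a small usage sketch that surfaces the offending IDs as a warning (df1 as defined in the question):

```r
df1 <- data.frame(
  id     = c(1, 2, 3, 4, 5, 6, 7),
  male   = c(2, 4, 2, 6, 3, 4, 5),
  female = c(3, 6, 4, 9, 2, 4, 1),
  Total  = c(5, 10, 7, 15, 6, 8, 7)
)

# rows where the parts do not add up to the Total
bad <- df1[df1$male + df1$female != df1$Total, ]

if (nrow(bad) > 0) {
  warning("Totals do not add up for ID(s): ",
          paste(bad$id, collapse = ", "))
}
```

This keeps the full mismatching rows in bad for inspection while reporting just the IDs.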
I've simplified this for an example of what I'm trying to do. I have 3 different variables (Mention1 to Mention3) with values ranging from 1 to 5.
I need to compute a new variable that counts the appearances of each value across all cases and all 3 variables. The goal is to be able to show how many times each value appears (e.g. 5 appears 3 times), and I've been told that can be achieved with the Compute Variable method, but I haven't been able to figure out how.
Example .sav file can be found here, if it helps. Thanks in advance for answers!
You can use a loop to create 5 new variables, each containing the number of occurrences of a specific value in each row. Then to get the complete count you just sum these variables:
do repeat vr=occ1 to occ5/vl=1 to 5.
compute vr=sum(Mention1=vl, Mention2=vl, Mention3=vl).
end repeat.
exe.
Now you have the 5 new variables (eg. var occ1 has the count of the occurrences of the number 1 in each row). There are a few ways to get the complete total.
To simply get it in the output window:
descriptives occ1 to occ5/statistics=sum.
But in your question you mentioned adding the counts to the actual dataset. This can be done using the aggregate command:
aggregate /out=* mode=addvariables /break= /TotOcc1 to TotOcc5=sum(occ1 to occ5).
I wasn't able to access your .sav file from this computer, but this is the solution I came up with for your example data:
*Dataset1 should equal your original dataset name.
DATASET ACTIVATE Dataset1.
*This creates a table with the counts of each occurrence of 1 through 5.
FREQUENCIES VARIABLES=MENTION1 MENTION2 MENTION3.
EXECUTE.
*This translates the frequencies table to a dataset called 'freqtbl'.
DATASET DECLARE freqtbl.
OMS /SELECT TABLES
/IF SUBTYPES =['Frequencies']
/DESTINATION format = sav outfile=freqtbl.
freq vars= MENTION1 MENTION2 MENTION3.
OMSEND.
*This will make sure you are referencing the new dataset you created: 'freqtbl'.
*The ALTER TYPE statement turns VAR2, which holds the occurrences of 1 through 5, back into a numeric variable.
DATASET ACTIVATE freqtbl.
ALTER TYPE VAR2 (F1.0).
*This is a simple if/then statement across all of your values.
IF VAR2 = 1 FREQ_1 = SUM(FREQUENCY).
IF VAR2 = 2 FREQ_2 = SUM(FREQUENCY).
IF VAR2 = 3 FREQ_3 = SUM(FREQUENCY).
IF VAR2 = 4 FREQ_4 = SUM(FREQUENCY).
IF VAR2 = 5 FREQ_5 = SUM(FREQUENCY).
EXECUTE.
*This cleans up your dataset and computes a dummy variable to aggregate counts.
DELETE VARIABLES COMMAND_ TO CumulativePercent.
COMPUTE DUMMY = 1.
EXECUTE.
*This aggregates your frequency columns to count all occurrences of 1 through 5.
AGGREGATE
/OUTFILE = *
MODE = ADDVARIABLES
/BREAK DUMMY
/COUNT_1 = SUM(FREQ_1)
/COUNT_2 = SUM(FREQ_2)
/COUNT_3 = SUM(FREQ_3)
/COUNT_4 = SUM(FREQ_4)
/COUNT_5 = SUM(FREQ_5).
EXECUTE.
It seems that when we have duplicated data, most of the time we want to remove it.
Let's say we do not want to exclude it, but instead assign it a new value.
Taking the following data as a example
b <- c(1:100,1:99,1:104,1:105,1:105)
So we see that the values 1-99 are repeated 5 times, the number 100 appears 4 times, the numbers 101-104 appear 3 times each, and 105 appears twice.
How can one search through b (ideally in sequential order), find a repeated/duplicate number, and then assign it a new value?
Try this if you're interested in assigning one (universal) new value
b <- c(1:100,1:99,1:104,1:105,1:105)
b[duplicated(b)] <- 888 # new value
The duplicated() function flags the position of every value in b that has already appeared earlier, so only the repeats are overwritten.
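If each repeat should instead get its own distinct value rather than one universal one, a sketch using ave() to number the occurrences of each value (the offset of 1000 is an arbitrary choice, picked to keep the recoded ranges disjoint):

```r
b <- c(1:100, 1:99, 1:104, 1:105, 1:105)

# occurrence index of each value: 1 the first time it appears, 2 the second, ...
occ <- ave(seq_along(b), b, FUN = seq_along)

# keep first occurrences as-is; recode repeats as value + 1000 * (occurrence - 1)
b2 <- ifelse(occ == 1, b, b + 1000 * (occ - 1))
```

Because every (value, occurrence) pair maps to a different number, b2 contains no duplicates while the first occurrence of each value is left untouched.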