Get rid of 0s before running chi-square test - r

My variable RaceCat looks like this after coding table(data$RaceCat)
I want to run a chi-square test, but I know I first need to get rid of the races American Indian and Middle Eastern, since those have counts of 0. American Indian is coded as 5 and Middle Eastern as 4. My thought process is to do
data %>% filter(RaceCat != 5) %>% filter(RaceCat != 4)
However, when I try to do that, R says length of 'dimnames' [2] not equal to array extent.
Here is another thing I tried to do:
hivneg<-droplevels(hivneg)
a<- table(hivneg$RaceCat, hivneg$PrEP2)
chisq.test(a, correct=F)
However, the table() and chisq.test() won't run. It says
Warning message in mapply(FUN = f, ..., SIMPLIFY = FALSE):
“longer argument not a multiple of length of shorter”

From the image shown, it seems that the column 'RaceCat' is of class factor and has some unused levels. One option is to update the data with droplevels(), which removes the unused levels from every factor column:
data <- droplevels(data)
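NOTE: table() returns the frequency count of each unique value or, if the input is a factor, the count of each of the factor's levels. Unused levels are still counted, which is why they show up as 0s even after the rows have been filtered out.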
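As a minimal sketch of the point above: the data frame below is a made-up stand-in for the asker's hivneg data (the values and the "yes"/"no" coding of PrEP2 are assumptions, not taken from the question). It shows that levels 4 and 5 are tabulated with a count of 0 until droplevels() is applied, after which the contingency table only has the observed races.

```r
# Hypothetical stand-in data: RaceCat is a factor whose levels 4 and 5 never occur.
hivneg <- data.frame(
  RaceCat = factor(c(1, 1, 2, 2, 3, 3, 1, 2, 3, 1), levels = 1:5),
  PrEP2   = factor(c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "no", "yes"))
)

# Unused levels 4 and 5 are still listed, each with a count of 0.
table(hivneg$RaceCat)

# Drop unused levels from every factor column of the data frame.
hivneg <- droplevels(hivneg)

# The contingency table now only contains the observed races.
tab <- table(hivneg$RaceCat, hivneg$PrEP2)
tab

# With counts this small chisq.test() warns that the approximation may be
# poor, so the warning is suppressed in this toy example only.
res <- suppressWarnings(chisq.test(tab, correct = FALSE))
res
```

Note that filtering rows, whether with dplyr's filter() or with base subsetting, does not by itself remove factor levels, so droplevels() is still needed after the filter.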

Related

R: Separating several observations of a variable and building a matrix

I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
However, if a respondent chose more than one option, the answers are not separated in the data (Data).
My goal is to create a matrix with all possible observations as variables and marked with 1 (yes) and 0 (No). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me the matrix I want but it does not separate the observations, so now I have a matrix with 20 variables instead of the seven (variables).
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
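One possible approach, sketched in base R under stated assumptions: the answers are taken to be joined by ";" (as in the asker's separate() call), and the three example responses in Q12 are invented for illustration. Each response is split on the separator and then checked against the fixed list of seven options, so the columns always come out in the same order.

```r
# The seven possible options, in the column order wanted for the matrix.
opts <- c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition",
          "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker")

# Hypothetical responses; the real Q12 column is assumed to look similar.
Q12 <- c("Inhalt;Arbeit", "Spitzenpolitiker", "Arbeit;Verhindern Koalition")

# Split every response on ";" into its individual answers.
parts <- strsplit(Q12, ";", fixed = TRUE)

# For each respondent, mark every option with 1 if it was chosen, else 0.
ind <- t(sapply(parts, function(p) as.integer(opts %in% trimws(p))))
colnames(ind) <- opts
ind
```

The resulting matrix can be attached to the remaining columns with cbind() or data.frame().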

Principal Components Analysis Error X must be numeric but my column contains positive and negative number and yes or no

I have a CSV file which consists of 10k rows of data. I am trying to perform PCA with prcomp on the dataset. The data consists of 4 columns: Money, Pay, Balance and Enough.
I tried to perform PCA on the data frame below and it reports the error
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I understand that the data need to be numeric, but as far as I can see everything is numeric except the Enough column. Even when I use only the 1st to 3rd columns, it still does not work and reports the same error.
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I tried removing the - sign in the Balance column, and then PCA worked. Isn't -12345 numeric?
I understand that PCA will not work if non-numeric values are stored in the data. Is it possible to run it on data that contain Yes or No?
Money  Pay  Balance  Enough
24     25   -1       No
30     26   4        Yes
40     45   -5       No
31     20   11       Yes
My R code
results <- prcomp(dataf, scale = TRUE)
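A hedged sketch of how one might investigate this: the data frame below is rebuilt by hand from the four rows shown, so it cannot reproduce whatever is happening in the real CSV. The idea is to check how R actually stored each column, since prcomp() raises this error whenever a column is not numeric, and to recode Yes/No as 1/0 (an assumption about what is acceptable for the analysis) before running the PCA.

```r
# Hypothetical reconstruction of the four rows shown in the question.
dataf <- data.frame(
  Money   = c(24, 30, 40, 31),
  Pay     = c(25, 26, 45, 20),
  Balance = c(-1, 4, -5, 11),
  Enough  = c("No", "Yes", "No", "Yes")
)

# Check the class of each column; a column that was read in as character
# or factor would show up here instead of "numeric".
sapply(dataf, class)

# Assumption: coding Yes as 1 and No as 0 is acceptable for this analysis.
dataf$Enough <- ifelse(dataf$Enough == "Yes", 1, 0)

# Keep only the numeric columns before calling prcomp().
num_cols <- sapply(dataf, is.numeric)
results <- prcomp(dataf[, num_cols], scale. = TRUE)
summary(results)
```

If sapply(dataf, class) reports "character" for Balance in the real data, that column would need to be cleaned and converted with as.numeric() first.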

T test using column variable from 2 different data frames in R

I am attempting to conduct a t test in R to try and determine whether there is a statistically significant difference in salary between US and foreign born workers in the Western US. I have 2 different data frames for the two groups based on nativity, and want to compare the column variable I have on salary titled "adj_SALARY". For simplicity, say that there are 3 observations in the US_Born_west frame, and 5 in the Immigrant_West data frame.
US_born_West <- data.frame(adj_SALARY = c(30000, 25000, 22000))
Immigrant_West <- data.frame(adj_SALARY = c(14000, 20000, 12000, 16000, 15000))
#Here is what I attempted to run:
t.test(US_born_West$adj_SALARY~Immigrant_West$adj_SALARY, alternative="greater",conf.level = .95)
However I received this error message: "Error in model.frame.default(formula = US_born_West$adj_SALARY ~ Immigrant_West$adj_SALARY) :
variable lengths differ (found for 'Immigrant_West$adj_SALARY')"
Any ideas on how I can fix this? Thank you!
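US_born_West$adj_SALARY and Immigrant_West$adj_SALARY are of unequal length. The formula interface of t.test expects the response on the left and a grouping variable of the same length on the right, which is why it gives this error. We can pass the two samples as individual vectors instead.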
t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
       alternative = "greater", conf.level = .95)
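If the formula interface is preferred, one option is to stack both samples into a single long data frame with a grouping column. The sketch below uses the eight salaries from the question; the names both and nativity are made up for illustration. The factor levels are ordered so that US_born comes first, which keeps the direction of alternative = "greater" the same as in the two-vector call.

```r
US_born_West   <- data.frame(adj_SALARY = c(30000, 25000, 22000))
Immigrant_West <- data.frame(adj_SALARY = c(14000, 20000, 12000, 16000, 15000))

# Two-vector form, as in the answer above.
res_vec <- t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
                  alternative = "greater", conf.level = 0.95)

# Formula form: one salary column plus a grouping column of the same length.
both <- rbind(
  data.frame(adj_SALARY = US_born_West$adj_SALARY,   nativity = "US_born"),
  data.frame(adj_SALARY = Immigrant_West$adj_SALARY, nativity = "Immigrant")
)
# The first level is the group whose mean is tested as being greater.
both$nativity <- factor(both$nativity, levels = c("US_born", "Immigrant"))

res_form <- t.test(adj_SALARY ~ nativity, data = both,
                   alternative = "greater", conf.level = 0.95)
res_form
```

Both calls run the same Welch two-sample test, so they should report the same statistic and p-value.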

targeting a single value of a dichotomous variable in R

I am trying to use the command assocstats() in order to get Cramer's V for 2 variables. This is not a problem as long as I target the entirety of both variables:
assocstats(table(democrat, sex))
Problems arise when I try to target only 1 specific value of the dichotomous variable sex, which consists of 1 and 2.
I thought that dplyr might be of help with the filter command, but
assocstats(table(democrat, filter(sex==1))
does not yield any results.
Does anybody know how I can target only 1 value of the variable sex in this case?
Many thanks
Using the Arthritis data from library(vcd) as an example, we filter the rows that match 'Male' (or 1 in your dataset), select the columns of interest ('Treatment' and 'Sex'), get the frequencies with table and pass them to assocstats.
library(vcd)
assocstats(table(Arthritis[Arthritis$Sex=='Male', c('Treatment', 'Sex')]))
Assuming that the OP has two vectors, i.e. 'democrat' and 'sex':
i1 <- sex ==1
assocstats(table(democrat[i1], sex[i1]))

R:More than 52 levels in a predicting factor, truncated for printout

Hi, I'm a beginner in the R programming language. I wrote some code for a regression tree using the rpart package. In my data, some of my independent variables have more than 100 levels. After running the rpart function
I'm getting the following warning message: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very weird way. For example, my tree splits on location, which has around 70 distinct levels, but the label displayed in the tree is "ZZZZZZZZZZZZZZZZ...........", even though I don't have any location called "ZZZZZZZZ".
Please help me.
Thanks in advance.
Many functions in R limit the number of levels a factor-type variable can have (e.g. older versions of randomForest limit a factor to 32 levels).
One way that I've seen it dealt with especially in data mining competitions is to:
1) Determine maximum number of levels allowed for a given function (call this X).
2) Use table() to determine the number of occurrences of each level of the factor and rank them from greatest to least.
3) For the top X - 1 levels of the factor leave them as is.
4) Recode all remaining, lower-ranked levels to a single new level that marks them as low-occurrence levels.
Here's an example that's a bit long but hopefully helps:
# Generate 1000 random whole numbers between 0 and 100.
vars1 <- data.frame(values1 = round(runif(1000) * 100, 0))
# Change values to a factor variable.
vars1$values1 <- factor(vars1$values1)
# Show top 6 rows of data frame.
head(vars1)
# Show the number of unique factor levels.
length(unique(vars1$values1))
# Create a table showing the frequency of each level's occurrence.
# Tabulating the column directly gives columns named Var1 and Freq.
table1 <- data.frame(table(vars1$values1))
# Order the table in descending order of frequency.
table1 <- table1[order(-table1$Freq), ]
head(table1)
# Assuming we want to use CART, we choose the top 51
# levels to leave unchanged.
# Get the labels of the top 51 occurring levels.
noChange <- as.character(table1$Var1[1:51])
# We use '-1000' as the catch-all level to avoid overlap w/ other levels (i.e. if '52' was
# actually one of the levels).
# ifelse() checks whether the factor level is in the list of the top 51
# levels. If present it is kept as is, if not it is changed to '-1000'.
# as.character() is needed because ifelse() would otherwise return the
# factor's internal integer codes instead of its labels.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange, as.character(vars1$values1), "-1000"))
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
Finally, you may want to consider using truncated variables in rpart as the tree display gets very busy when there are a large number of variables or they have long names.
