Mean/standard deviation plot of survey items with missing data - r

I'm an R beginner attempting to do what I figured (erroneously) would be a beginner-type task: produce a simple plot of means/standard deviations for multiple survey questions (vectors), grouped by a second variable (say, group).
So I am reading variables (say, q1-q10) into R from Stata and have even managed to melt the data following this suggestion.
What I would like is essentially the graph presented in the solution:
However, my data contain missing values (NA), and the NUMBER of missing values varies by question. So when I try to use ggplot to plot the 'melted' data, I get an error saying the vector lengths do not match.

Suppose your variables q1-q10 are separate vectors; then you should first combine them into a data frame df:
df <- data.frame(q1, q2, ...,q10)
And then you can clean it such that you only have complete cases, i.e. only observations without NA:
df <- df[complete.cases(df),]
Afterwards, you should not have problems with ggplot.
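Note that complete.cases drops every respondent who skipped even one question. If that discards too much data, an alternative is to keep the NAs and skip them per question when summarising. The following is only a minimal base-R sketch under assumptions: the toy data, the group column, and the three items q1 to q3 are made up for illustration and are not from the question.

```r
# Toy data (hypothetical): three survey items with different numbers of NAs
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  q1 = c(1, 2, NA, 4, 5, 3),
  q2 = c(2, NA, NA, 3, 4, 5),
  q3 = c(5, 4, 3, NA, 1, 2)
)

# Reshape to long format: one row per respondent-question pair
long <- data.frame(
  group = rep(df$group, times = 3),
  question = rep(c("q1", "q2", "q3"), each = nrow(df)),
  value = c(df$q1, df$q2, df$q3)
)

# Drop only the individual missing answers, not whole respondents
long <- long[!is.na(long$value), ]

# Mean and standard deviation per group and question
means <- aggregate(value ~ group + question, data = long, FUN = mean)
sds <- aggregate(value ~ group + question, data = long, FUN = sd)
summary_df <- merge(means, sds, by = c("group", "question"),
                    suffixes = c("_mean", "_sd"))
```

A summary table like summary_df has one row per group and question, so all columns have the same length and it can be handed to ggplot directly. A cell with a single answer has an undefined standard deviation (NA).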

Related

R calculate averages by group with uneven categorical data

I want to calculate averages for categorical data. My data is in a long format, and I do not understand why I am not succeeding.
Here is an example (imagine it as individual participants, indicated by id, picking different options, in this example m_ex):
id <- c(1,1,1,1,1,2,2,2,3,3,3)
m_ex <- c("a","b","c","b","a","b","c","b","a","a","c")
df <- data.frame(id , m_ex)
print (df)
I want to calculate averages for m_ex, that is, the average number of times each m_ex option is picked. I am trying to achieve this with dplyr, but I do not quite understand how to proceed when the ids have different numbers of rows. What would I have to divide by then? And is it a problem that the ids do not have equal lengths?
I really appreciate any help you can provide.
I have tried using dplyr and grouping by id and summarizing the results without much success. I would, in particular, like to understand what I do not understand right now.
I get something like this, but how do I get the averages?
[1]: https://i.stack.imgur.com/7nxze.jpg
[![example picture][1]][1]
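What to divide by depends on which average is meant. A minimal base-R sketch of two possible readings, using the example data from the question (with c() added so it runs); the names counts, avg_picks and avg_share are introduced here for illustration only:

```r
id <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
m_ex <- c("a", "b", "c", "b", "a", "b", "c", "b", "a", "a", "c")
df <- data.frame(id, m_ex)

# Rows are participants, columns are options; options never picked count as 0
counts <- table(df$id, df$m_ex)

# Reading 1: average number of picks of each option per participant
avg_picks <- colMeans(counts)

# Reading 2: average share of each participant's picks,
# which adjusts for participants having different numbers of rows
avg_share <- colMeans(prop.table(counts, margin = 1))
```

Under the first reading the divisor is the number of participants. Under the second, each participant's counts are first divided by that participant's own number of picks, so unequal lengths are not a problem.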

R Beginner: how to combine two variables and merge them into a new dataframe

If I have the mean and the standard error of the mean (SE) for a particular set of numbers, how would I go about combining the two values into one dataframe? For example, I have the variable mean_boeing (for the average) and stde_boeing (for the standard error) and I want to combine these two into one dataframe. Ultimately, I will be doing this for several other variables, combining them all into one big dataframe so that I can graph them in ggplot.
Thanks
We can use data.frame to create a data.frame
df1 <- data.frame(mean_boeing, stde_boeing)
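For the later goal of one big data frame covering several variables, it may help to give each variable its own row with a label column. A minimal sketch, assuming a second pair of variables called mean_airbus and stde_airbus (hypothetical names) and placeholder numbers that stand in for real results:

```r
# Placeholder values (hypothetical); in practice these come from your own data
mean_boeing <- 12.5
stde_boeing <- 1.2
mean_airbus <- 10.1
stde_airbus <- 0.9

# One row per variable, with a label column to map to an axis or colour
summary_df <- data.frame(
  name = c("boeing", "airbus"),
  mean = c(mean_boeing, mean_airbus),
  se = c(stde_boeing, stde_airbus)
)
```

With this layout, further variables are added as extra rows rather than extra columns.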

How can I see multiple variables' outliers in one boxplot using R?

I am a newbie to R and have a question. To check a variable for outliers we generally use:
boxplot(train$rate)
Here rate is a variable in my dataset and train is the name of the dataset. But when I have many variables, say 100 or 150, checking them for outliers one by one is very time consuming. Is there a function that shows the outliers of all 100 variables in one boxplot?
If yes, which function can then remove the outliers of all those variables at once instead of one by one? Please help me solve this problem.
Thanks in advance
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2 <- train      # Copy the old dataset
outvalue <- list()   # Create two empty lists
outindex <- list()
for (i in seq_len(ncol(train2))) {                        # For every column in your dataset
  outvalue[[i]] <- boxplot(train2[, i])$out               # Plot and get the outlier values
  outindex[[i]] <- which(train2[, i] %in% outvalue[[i]])  # Get the outlier indices
  train2[outindex[[i]], i] <- NA                          # Replace the outliers with NA
}
This works and plots the data, but it is quite slow. If you don't want to plot the data and just want the outliers, you could look into other outlier functions. The extremevalues package has a function that takes a different approach to identifying outliers and doesn't require a plot.
The following uses the getOutliers function from the extremevalues package:
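library(extremevalues)
outRight <- list()
outLeft <- list()
for (i in seq_len(ncol(train2))) {
  res <- getOutliers(train2[, i])  # Run the detection once per column
  outRight[[i]] <- res$iRight      # Indices of outliers on the right (high) side
  outLeft[[i]] <- res$iLeft        # Indices of outliers on the left (low) side
  train2[outRight[[i]], i] <- NA
  train2[outLeft[[i]], i] <- NA
}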
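If the boxplot rule itself is what you want, base R's boxplot.stats applies it without drawing anything. This is only a sketch on made-up data (the columns rate and size and their values are hypothetical), and it swaps the loop over boxplot() for lapply over boxplot.stats():

```r
# Toy data (hypothetical): two numeric columns, each with one extreme value
train <- data.frame(
  rate = c(1, 2, 3, 4, 5, 100),
  size = c(10, 11, 12, 13, 14, -50)
)

# Only numeric columns can be checked this way
num_cols <- names(train)[sapply(train, is.numeric)]

# boxplot.stats() uses the same 1.5 * IQR whisker rule as boxplot(), without plotting
outliers <- lapply(train[num_cols], function(x) boxplot.stats(x)$out)

# Flag outliers as NA in a copy, leaving the original data intact
train2 <- train
for (col in num_cols) {
  train2[[col]][train2[[col]] %in% outliers[[col]]] <- NA
}
```

To see all variables side by side in one figure, boxplot(train[num_cols]) draws one box per column, although with 100 or more columns on different scales the plot may be hard to read.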
The function boxplot returns a value. If you see the Value section of its help page you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[!train$rate %in% bp$out] # to remove the outliers (also works when there are none)
I also would not do that. Outliers are data, and normal/likely to occur. By eliminating them you are not taking into account the entirety of your data, a bad practice.

How can I make a histogram for more than one column of a data frame?

I have a data frame df and want to make a histogram with ggplot2's geom_histogram, merging the data of two columns, column1 and column2.
So I tried:
v<-c(df$column1,df$column2)
myplot = ggplot(v)
myplot+geom_histogram()
I get an error:
ggplot2 doesn't know how to deal with data of class numeric
Is there another way to merge columns?
My only problem is that I have yearly data and I just want to compare it without considering years. Put differently, I want to pool it all together.
v <- c(df$column1, df$column2)
library(ggplot2)
ggplot() + aes(v) + geom_histogram(binwidth = 0.01) + xlim(c(-0.1, 0.1)) + labs(x = "Jahresuberschuss", y = "count")
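The original error arises because ggplot() expects a data frame rather than a bare numeric vector. Another way around it is to wrap the pooled values in a data frame first. A minimal sketch on made-up numbers; the toy df, the name pooled, and the column name value are assumptions for illustration:

```r
# Toy data (hypothetical) standing in for the two yearly columns
df <- data.frame(column1 = c(-0.02, 0.01, 0.03),
                 column2 = c(0.00, 0.02, -0.01))

# Pool both columns into a single-column data frame
pooled <- data.frame(value = c(df$column1, df$column2))

# Build the plot only if ggplot2 is installed; print(p) would draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pooled, aes(x = value)) +
    geom_histogram(binwidth = 0.01)
}
```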

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1, v2, ..., v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1}, {v2}, ..., {v8}, {v1,v2}, ..., {v1,v2,v3}, ..., {v1,v2,...,v8}.
My goal is to produce a specific statistic based on these groupings and then compare which subset produces the better statistic. My problem is how to produce these combinations.
Thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time, one per column of the returned matrix. To cover every subset, repeat the call for m = 1 through 8.
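Since a single combn call covers only one subset size, one way to collect every non-empty subset is to loop over all sizes. A minimal sketch; subsets_by_size and all_subsets are names made up here:

```r
varnames <- paste0("V", 1:8)

# For each subset size m = 1..8, list every combination of that size
subsets_by_size <- lapply(seq_along(varnames), function(m) {
  combn(varnames, m, simplify = FALSE)
})

# Flatten into one list of character vectors, one per subset
all_subsets <- unlist(subsets_by_size, recursive = FALSE)
n_subsets <- length(all_subsets)  # 2^8 - 1 = 255 non-empty subsets
```

Each element can then be used to pick columns, for example yourdataframe[, all_subsets[[k]], drop = FALSE] for the k-th subset.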
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn <- 8L
# 100 rows: an id column (the extraneous variable) plus nn random columns V1..V8
dt <- setnames(as.data.table(cbind(1:100, matrix(rnorm(100 * nn), ncol = nn))),
               c("id", paste0("V", 1:nn)))
# should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
# basically, this generates the indices to include when subsetting:
# a 0/1 indicator matrix (one row per subset) times the column numbers 1..nn
x <- cbind(rep(c(0, 1), each = 128),
           rep(rep(c(0, 1), each = 64), 2),
           rep(rep(c(0, 1), each = 32), 4),
           rep(rep(c(0, 1), each = 16), 8),
           rep(rep(c(0, 1), each = 8), 16),
           rep(rep(c(0, 1), each = 4), 32),
           rep(rep(c(0, 1), each = 2), 64),
           rep(c(0, 1), 128)) *
  t(matrix(1:nn, nrow = nn, ncol = 2^nn))
# now get the correct column names for each subset
# by subscripting the nonzero elements
incl <- lapply(1:(2^nn), function(y) { paste0("V", 1:nn)[x[y, ][x[y, ] != 0]] })
# now subset the data.table for each subset
ans <- lapply(1:(2^nn), function(y) { dt[, incl[[y]], with = FALSE] })
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2 <- lapply(1:(2^nn), function(y) { unlist(dt[, incl[[y]], with = FALSE]) })
# skip the first element, which is the empty subset (no columns selected)
means <- lapply(2:(2^nn), function(y) { mean(ans2[[y]]) })
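As for a more easily generalized way to build the indicator matrix, base R's expand.grid can generate all include/exclude patterns for any nn. This is a sketch in plain data frames rather than data.table, with made-up names (dat, incl_mat, subset_means) and an arbitrary seed; the row order differs from the hand-built matrix above, but the same subsets are covered.

```r
nn <- 8L
set.seed(1)  # arbitrary seed, only to make the toy data reproducible
dat <- as.data.frame(matrix(rnorm(100 * nn), ncol = nn))
names(dat) <- paste0("V", 1:nn)

# One row per subset, one logical column per variable (TRUE = included)
incl_mat <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), nn)))

# Drop the empty subset (the row with no variable included)
incl_mat <- incl_mat[rowSums(incl_mat) > 0, , drop = FALSE]

# Mean of all values in each subset of columns
subset_means <- apply(incl_mat, 1, function(keep) {
  mean(unlist(dat[, keep, drop = FALSE]))
})
```

Replacing mean with another function gives a different per-subset statistic, and which.max(subset_means) or which.min(subset_means) then points to the row of incl_mat describing the best subset.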
