I'm looking to generate means of ratings as a new variable/column in a data frame. Currently every method I've tried either generates a column that shows the mean of the entire dataset (for the chosen items) or doesn't generate means at all. Using the rowMeans function on the whole data frame doesn't work, as I'm not looking for a mean of every value in a row, just a mean of the chosen values in a given row. So for example, I'm looking for the mean of 10 ratings:
fun <- mean(T1.1,T2.1,T3.1,T4.1,T5.1,T6.1,T7.1,T8.1,T9.1,T10.1, trim = 0, na.rm = TRUE)
I want a different mean printed for every row because each row represents a different set of observations (a different subject, in my case). The issues I'm looking to correct with this are twofold: 1) it generates only one mean, the mean of all values for each of the 10 variables, and 2) this vector is not a part of the dataframe. I tried to generate a new column in the dataframe by using "exp$fun" but that just creates a column whose every value (for every row) is the grand mean. Could anyone advise as to how to program this sort of row-based mean? I'm sure it's simple enough but I haven't been able to figure it out through Googling or trawling StackOverflow.
Thanks!
It's hard to figure out an answer without a reproducible example, but have you tried subsetting your dataset to only include the 10 columns from which you'd like to derive your means and then using an apply statement? Something along the lines of apply(df, 1, mean), where the first argument is your (subsetted) dataframe, the second argument specifies whether to apply the function over rows (1) or columns (2), and the third argument is the function you wish to apply. Any further arguments, such as na.rm = TRUE, are passed on to that function.
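A minimal sketch of that suggestion, assuming a made-up data frame exp_df with three of the rating columns (the data and the names exp_df, rating_cols and rating_mean are hypothetical). Note that mean() only averages its first argument, which is why passing the ten columns as separate arguments does not give a per-row result.

```r
# Hypothetical data: two subjects, three rating columns and one unrelated column
exp_df <- data.frame(
  id   = c(1, 2),
  T1.1 = c(4, 2),
  T2.1 = c(5, NA),
  T3.1 = c(3, 4),
  age  = c(30, 41)
)

# Only these columns should contribute to the per-row mean
rating_cols <- c("T1.1", "T2.1", "T3.1")

# rowMeans() on the subset gives one mean per row, skipping missing ratings
exp_df$rating_mean <- rowMeans(exp_df[, rating_cols], na.rm = TRUE)

# The apply() form from the answer above computes the same values
rating_mean_apply <- apply(exp_df[, rating_cols], 1, mean, na.rm = TRUE)

print(exp_df)
```

Because the result is assigned with exp_df$rating_mean, it becomes a column of the data frame rather than a separate vector.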
I want to find the mean of each column in my dataset, which contains null / blank values.
I've attached screenshots of actual and sample data for reference.
I don't see the data?
Usually, you just have to call mean() on a column extracted from the data frame, provided the column is numeric. You get a data frame directly when you import an Excel file in RStudio.
It is easier to work with the data frame if you name your columns.
names(dataframe_name) <- c("column1", "column2", "column3")
Then, you can easily extract the mean of a column. Since your data contain blanks, add na.rm = TRUE so that missing values are skipped.
mean(dataframe_name$column1, na.rm = TRUE)
library(tidyverse)
df %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
This calculates the mean for every numeric column in your dataframe.
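For comparison, here is a base R sketch of the same idea that needs no packages. The data frame and its column names are made up for illustration.

```r
# Hypothetical data with missing values and one non-numeric column
df <- data.frame(
  a     = c(1, NA, 3),
  b     = c(10, 20, NA),
  label = c("x", "y", "z")
)

# Flag the numeric columns, then average each one while skipping NAs
numeric_cols <- vapply(df, is.numeric, logical(1))
col_means <- colMeans(df[, numeric_cols, drop = FALSE], na.rm = TRUE)

print(col_means)
```

drop = FALSE keeps the subset a data frame even if only one numeric column is present.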
The best thing to do is to add a helper column, using this formula:
=IfError(B1;0)
(Obviously you might need to use another cell reference)
This formula replaces all error values by zero, you can use this column as an input for calculating your averages.
There are two simple ways to solve your problem.
1. Set up a separate table next to the one you use and take the mean of the respective cells:
=IfError(Cell;0)
=AVERAGE(StartCell:EndCell)
2. Replace the null values with 0 by using the Replace All function and take the mean afterwards.
Note: Both approaches include the zeros when calculating the mean. If you want to avoid this, replace the null values with "nothing" (an empty cell) instead. Hope that helps.
I am working with a data-frame in R. I have the following function which removes all rows of a data-frame df where, for a specified column index/attribute, the value at that row is outside mean (of column) plus or minus n*stdev (of column).
remove_outliers <- function(df,attr,n){
outliersgone <- df[df[,attr]<=(mean(df[,attr],na.rm=TRUE)+n*sd(df[,attr],na.rm=TRUE)) & df[,attr]>=(mean(df[,attr],na.rm=TRUE)-n*sd(df[,attr],na.rm=TRUE)),]
return(outliersgone)
}
There are two parts to my question.
(1) My data-frame df also has a column 'Group', which specifies a class label. I would like to be able to remove outliers according to mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So you would remove from the data-frame a row labelled with group A if, in the specified column/attribute, the value at that row is outside mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.
How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group) followed by mutate but I'm not sure what to pass to mutate, given my function remove_outliers seems to require the whole data-frame to be passed into it (so it can return the whole data-frame with rows only removed based on the chosen attribute attr).
I am open to hearing suggestions for changing the function remove_outliers as well, as long as they also return the whole data-frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).
(2) Is there a straightforward way I could combine outlier considerations across multiple columns? e.g. remove from the dataframe df those rows which are outliers wrt at least $N$ attributes out of a specified vector of attributes/column indices (length≥N). or a more complex condition like, remove from the dataframe df those rows which are outliers wrt Attribute 1 and at least 2 of Attributes 2,4,6,8.
(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)
Ok - part 1 (and trying to avoid loops wherever possible):
Here's some test data:
test_data=data.frame(
group=c(rep("a",100),rep("b",100)),
value=rnorm(200)
)
We'll find the groups:
groups=levels(test_data[,1]) # or unique(test_data[,1]) if it isn't a factor
And we'll calculate the outlier limits (here I'm specifying only 1 sd) - sorry for the loop, but it's only over the groups, not the data:
outlier_sds=1
outlier_limits=sapply(groups,function(g) {
m=mean(test_data[test_data[,1]==g,2])
s=sd(test_data[test_data[,1]==g,2])
return(c(m-outlier_sds*s,m+outlier_sds*s))
})
So we can define the limits for each row of test_data:
test_data_limits=outlier_limits[,test_data[,1]]
And use this to determine the outliers:
outliers=test_data[,2]<test_data_limits[1,] | test_data[,2]>test_data_limits[2,]
(or, combining those last steps):
outliers=test_data[,2]<outlier_limits[1,test_data[,1]] | test_data[,2]>outlier_limits[2,test_data[,1]]
Finally:
test_data_without_outliers=test_data[!outliers,]
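As an alternative sketch that keeps the shape of the original remove_outliers function, base R's ave() can compute the group mean and sd for every row without an explicit loop. The function name remove_outliers_by_group and the group argument are hypothetical, and rows with a missing value in the chosen column are dropped here, which is an assumption about the desired behaviour.

```r
# Remove rows whose value in column `attr` lies more than n group-sds
# away from the mean of their own group (column `group`)
remove_outliers_by_group <- function(df, attr, n, group = "Group") {
  x <- df[[attr]]
  g <- df[[group]]
  # ave() returns, for each row, the statistic of that row's group
  centre <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))
  spread <- ave(x, g, FUN = function(v) sd(v, na.rm = TRUE))
  keep <- !is.na(x) & abs(x - centre) <= n * spread
  df[keep, , drop = FALSE]
}

# Small made-up example: one extreme value per group
d <- data.frame(
  Group = rep(c("A", "B"), each = 4),
  value = c(1, 2, 3, 100, 10, 11, 12, -50)
)
res <- remove_outliers_by_group(d, "value", n = 1)
print(res)
```

The whole data frame is returned with only the offending rows removed, so the function can be called once per attribute of interest.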
EDIT: now part 2 (apply part 1 with a loop over all the columns in the data):
Some test data with more than one column of values:
test_data2=data.frame(
group=c(rep("a",100),rep("b",100)),
value1=rnorm(200),
value2=2*rnorm(200),
value3=3*rnorm(200)
)
Combine all the steps of part 1 into a new function find_outliers that returns a logical vector indicating whether any value is an outlier for its respective column & group:
find_outliers = function(values,n_sds,groups) {
  # factor() makes this work whether groups is a factor or a character vector
  group_names=levels(factor(groups))
  outlier_limits=sapply(group_names,function(g) {
    m=mean(values[groups==g])
    s=sd(values[groups==g])
    return(c(m-n_sds*s,m+n_sds*s))
  })
  return(values < outlier_limits[1,groups] | values > outlier_limits[2,groups])
}
And then apply this function to each of the data columns:
test_groups=test_data2[,1]
test_data_outliers=apply(test_data2[,-1],2,function(d) find_outliers(values=d,n_sds=1,groups=test_groups))
The rowSums of test_data_outliers indicate how many times each row is considered an 'outlier' in the various columns, with respect to its own group:
rowSums(test_data_outliers)
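The row counts can then drive the combined conditions asked about in part 2. A small sketch, using a hand-written logical matrix in place of test_data_outliers so the result is easy to check (the matrix and the names N, drop_at_least_N and drop_compound are made up for illustration):

```r
# Hypothetical outlier flags: one row per observation, one column per attribute
outlier_flags <- cbind(
  value1 = c(TRUE,  FALSE, TRUE,  FALSE),
  value2 = c(TRUE,  FALSE, FALSE, FALSE),
  value3 = c(TRUE,  TRUE,  FALSE, FALSE),
  value4 = c(FALSE, TRUE,  TRUE,  FALSE)
)

# Condition A: outlier in at least N of the attributes
N <- 2
drop_at_least_N <- rowSums(outlier_flags) >= N

# Condition B: outlier in value1 AND in at least 2 of value2, value3, value4
drop_compound <- outlier_flags[, "value1"] &
  (rowSums(outlier_flags[, c("value2", "value3", "value4")]) >= 2)

print(drop_at_least_N)
print(drop_compound)
```

Either logical vector can be negated and used as a row index, in the same way as test_data[!outliers,] in part 1.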
I am trying to learn the recommended ways to create a new column for a data table when the column of interest is a list (or vector), and the selection is done relative to another column, and there may be a preliminary selection done as part of a chain.
Consider these data named (tmp). We want to find the minimum value of sacStartT greater than stimTime (in the real data one or the other of these could be empty and no minimum exist).
tmp = data.table("pid" = c(14,14,9,9),"trialNumber" = c(25,26,25,26),"stimTime" = c(100,200,1,2),"sacStartT" = list(c(98,99,101,102), c(201,202), c(5), c(-2,-3,3)))
This works:
tmp[,"mintime" := as.integer(min(unlist(sacStartT)[unlist(sacStartT)>stimTime])),by=seq_len(nrow(tmp))]
But if I wanted to first subselect the data I don't know how to get that row number for the row-by-row analysis, e.g.
tmp[pid == 9][,"mintime" := as.integer(min(unlist(sacStartT)[unlist(sacStartT)>stimTime])),by=seq_len(nrow(.N))]
fails because .N refers to the number of rows in tmp, and not the subset in the chain.
In summary, the question is the composition of:
Recommendations for doing this row by row analysis?
How to find the right number for the by argument in a chain?
Recommendations for dealing with data.table elements that contain lists? Do you just have to manually unlist them all?
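One possible direction, sketched here in base R on a plain data frame rather than a data.table: mapply() pairs each list element with the stimTime of the same row, so no row-number grouping and no unlist() is needed. The helper first_after is hypothetical, and whether the same mapply() call behaves identically inside a data.table j expression is an assumption that has not been checked here.

```r
# The example data, rebuilt as a base data frame with a list column
trials <- data.frame(pid = c(14, 14, 9, 9), stimTime = c(100, 200, 1, 2))
trials$sacStartT <- list(c(98, 99, 101, 102), c(201, 202), c(5), c(-2, -3, 3))

# Smallest time strictly greater than the threshold, or NA if there is none
first_after <- function(times, threshold) {
  later <- times[times > threshold]
  if (length(later) == 0) NA_real_ else min(later)
}

# Subset first, then compute row by row on the subset only
sub <- trials[trials$pid == 9, ]
sub$mintime <- mapply(first_after, sub$sacStartT, sub$stimTime)

print(sub$mintime)
```

Returning NA for an empty selection avoids the Inf (and warning) that min() produces when no saccade follows the stimulus.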
I have two datasets H and G. They have a column named 'diff' that as the name suggests, holds difference between two columns within each dataset. I used lapply to calculate the percentage for each dataset (I have more datasets than H and G, so would like to calculate the percentage of the two columns in each dataset), but for some reason lapply gives me the output however doesn't create "perc" column in the datasets that pass through it. What am I doing wrong here?
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
G<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[c(2,3,7,9),9]<-NA
G[c(1,5,7,8),9]<-NA
H$diff<-H$X10-H$X9
G$diff<-G$X10-G$X9
dsay<-list(H,G)
lapply(dsay,function(x)x$perc<-round((x$diff/x$X10)*100,1))
Extension of this question:
once I have the percent differences as columns using:
H<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
G<-data.frame(replicate(10,sample(0:20,10,rep=TRUE)))
H[c(2,3,7,9),9]<-NA
G[c(1,5,7,8),9]<-NA
H$diff<-H$X10-H$X9
G$diff<-G$X10-G$X9
H$perc<-round((H$diff/H$X10)*100,1)
G$perc<-round((G$diff/G$X10)*100,1)
I generated a plot using:
xyplot(X8+X9+X10~X1,H,type=c('p','l','g'),
col = c('yellow', 'green', 'blue','red'),
ylab='Count',layout=c(3, 1),
xlab=paste("H",'difference',min(pmin(H$perc, na.rm = TRUE),na.rm=TRUE),
'% change count'))
Never mind the plot it will generate; what I'm trying to get to is that I also want to display the value of the corresponding difference from the "diff" column along with the lowest percentage difference (which is what the min function finds). I've tried using "match" in vain. Could someone help, please?
If we need the changes to reflect in the dataframe objects as well, list2env or assign can be used. But, I would do all the computations within the list itself.
list2env(lapply(mget(c('H','G')), function(x)
{x$perc<-round((x$diff/x$X10)*100,1);x}), envir=.GlobalEnv)
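A sketch of the "keep everything in the list" option, using small fixed data frames instead of the random ones so the numbers can be checked by hand. The original lapply call returned only the value of the assignment; returning x as the last expression gives back the modified data frame. For the follow-up question, which.min() gives the row of the lowest percentage, which can then be used to look up the matching diff.

```r
# Small fixed stand-ins for H and G (hypothetical values)
H <- data.frame(X9 = c(2, NA, 5), X10 = c(4, 10, 20))
G <- data.frame(X9 = c(1, 3, NA), X10 = c(2, 4, 8))
H$diff <- H$X10 - H$X9
G$diff <- G$X10 - G$X9

# A named list keeps track of which data frame is which
dsay <- list(H = H, G = G)

# Return x itself so that lapply() hands back the modified data frames
dsay <- lapply(dsay, function(x) {
  x$perc <- round((x$diff / x$X10) * 100, 1)
  x
})

# Row with the lowest percentage in H (NAs are ignored by which.min)
lowest <- which.min(dsay$H$perc)
print(dsay$H[lowest, c("diff", "perc")])
```

The values in dsay$H[lowest, ] could then be pasted into the xlab string of the plot.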
I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot, with X1 giving the x-axis values and facets defined by the values in B and C.
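library(reshape2)  # assumed source of melt()
library(ggplot2)
plotdata<-function(l){
  # reshape to long format, keeping only the requested measure column(s)
  melted<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
  plot<-ggplot(melted,aes(x=X1,y=value))+geom_point()
  plot2<-plot+facet_grid(B ~ C)
  ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}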
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric
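To illustrate that point without needing the plotting packages, here is a small sketch in which a stand-in function builds only the file name that plotdata would save to. The names cols_of_interest and describe_column are hypothetical.

```r
# Columns chosen by the user; any character vector of column names works
cols_of_interest <- c("Y1", "Y3", "Y4")

# Stand-in for plotdata(): returns the file name it would write
describe_column <- function(l) paste("X_vs_", l, "_faceted.jpeg", sep = "")

# A for loop iterates directly over the names, not over numeric positions
for (x in cols_of_interest) {
  message(describe_column(x))
}

# The same iteration written with vapply(), collecting the results
file_names <- vapply(cols_of_interest, describe_column, character(1))
```

Replacing describe_column with plotdata in the for loop gives the call from the answer above.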