How to get mean of all elements in data.frame in R

I have a data frame that contains 150 numerical values that I want the mean of. Which column or row they're on is not relevant at all, that's just how the data was given.
I have found a solution to do this, but it's so shamefully disgusting that I'd prefer to use a better method. I've literally just added up the mean of each column and divided by the number of columns...
This is still a 1-liner so it's not that bad, but there must be better ways to do this.
A thousand thanks in advance!

The best solution for me was to flatten it with the unlist() function. Thanks to #H 1!
Then you can simply use the mean() function on the result.
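A minimal sketch of that approach, assuming a small made-up data frame (the real 150-value data is not shown in the question):

```r
# Made-up stand-in for the data frame from the question.
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# unlist() flattens all columns into one numeric vector; mean() then averages it.
overall_mean <- mean(unlist(df))

# The original workaround (mean of the column means), shown for comparison.
mean_of_col_means <- mean(colMeans(df))
```

Because every column of a data frame has the same length, both values agree as long as there are no missing values, but mean(unlist(df)) states the intent directly.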

Related

Filtering with dplyr not working as expected

How are you?
I have a problem that is very strange, because the task is very simple.
I want to filter on one of my factor variables in R, but the result is an empty data frame.
My data frame is called "data_2022". If I execute this code:
sum(data_2022$CANALDEVENTA=="WEB")
The result is 2704800, which is the number of rows for which the condition is TRUE.
a= data_2022 %>% filter(CANALDEVENTA=="WEB")
This returns an empty data frame.
I know I am not an expert in R, but I have done this a million times and never had this problem before.
Do you have a clue what the problem is?
Sorry I did not make a reproducible example.
Thank you in advance.
You could use the subset() function:
a<-subset(data_2022, CANALDEVENTA=="WEB")
If you are using the tidyverse, make sure you are calling dplyr's filter() and not a filter() from another package that masks it. dplyr::filter() expects a logical expression built from the columns of the data frame. Try this code too:
my_names<-c("WEB")
a<-dplyr::filter(data_2022, CANALDEVENTA %in% my_names)
Hope it works.
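The suggestions above can be sketched as follows. The data frame here is a made-up stand-in for data_2022 (only the column name CANALDEVENTA comes from the question; the column importe and all values are invented), and the dplyr call is guarded so the snippet also runs where dplyr is not installed:

```r
# Hypothetical stand-in for data_2022; the real data is not shown in the question.
data_2022 <- data.frame(
  CANALDEVENTA = factor(c("WEB", "TIENDA", "WEB", "TELEFONO")),
  importe = c(10, 20, 30, 40)
)

# Base R: logical indexing and subset() select the same rows.
web_base <- data_2022[data_2022$CANALDEVENTA == "WEB", ]
web_subset <- subset(data_2022, CANALDEVENTA == "WEB")

# dplyr, called with an explicit namespace so no other filter() can be picked up.
if (requireNamespace("dplyr", quietly = TRUE)) {
  web_dplyr <- dplyr::filter(data_2022, CANALDEVENTA == "WEB")
}
```

If the namespaced call returns rows while a bare filter() does not, that would suggest another function named filter is being used in the session.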

Selecting rows in a long dataframe based on a short list

I'm sure this should be easier to do than the way I know how to do it.
I'd like to apply fields from a short dataframe back into a long one based on matching a common factor.
Example short dataframe, list of valid cases:
$ptid (factor) values 1,2,3,4,5...20
$valid 1/0 (to represent true/false; variable through ptid)
long dataframe has 15k rows, each level of $ptid will have several thousand rows
I want to apply $valid to those rows when it is 1/true in the list above.
The way I know how to do it is to loop through each row of the long dataframe, but this is horribly inelegant and also slow.
I have a niggling feeling there is a much better way with dplyr or similar, and I'd really like to learn how.
Worked this out based on the comments, thank you Colonel.
combination_dataset <- merge(short_dataframe, long_dataframe) worked (very quickly).
Thanks to those who commented.
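A minimal sketch of that merge, assuming small made-up versions of the two data frames (the column value and all numbers are invented; only ptid and valid come from the question):

```r
# Short table: one row per ptid with its valid flag.
short_dataframe <- data.frame(ptid = factor(1:3), valid = c(1, 0, 1))

# Long table: several rows per ptid.
long_dataframe <- data.frame(
  ptid = factor(c(1, 1, 2, 2, 3, 3)),
  value = c(10, 11, 20, 21, 30, 31)
)

# merge() joins on the shared column, copying valid onto every matching long row.
combination_dataset <- merge(short_dataframe, long_dataframe, by = "ptid")

# Keep only the rows whose ptid was marked valid.
valid_rows <- combination_dataset[combination_dataset$valid == 1, ]
```

Naming the key with by = "ptid" is optional here, since merge() defaults to the columns the two data frames have in common, but it makes the join explicit.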

Using ifelse statement to condense variables

New to R, taking a very accelerated class with very minimal instruction. So I apologize in advance if this is a rookie question.
The assignment I have is to take a specific column that has 21 levels from a dataframe, and condense them into 4 levels, using an if, or ifelse statement. I've tried what feels like hundreds of combinations, but this is the code that seemed most promising:
> b2$LANDFORM=ifelse(b2$LANDFORM=="af","af_type",
ifelse(b2$LANDFORM=="aflb","af_type",
ifelse(b2$LANDFORM=="afub","af_type",
ifelse(b2$LANDFORD=="afwb","af_type",
ifelse(b2$LANDFORM=="afws","af_type",
ifelse(b2$LANDFORM=="bfr","bf_type",
ifelse(b2$LANDFORM=="bfrlb","bf_type",
ifelse(b2$LANDFORM=="bfrwb","bf_type",
ifelse(b2$LANDFORM=="bfrwbws","bf_type",
ifelse(b2$LANDFORM=="bfrws","bf_type",
ifelse(b2$LANDFORM=="lb","lb_type",
ifelse(bs$LANDFORM=="lbaf","lb_type",
ifelse(b2$LANDFORM=="lbub","lb_type",
ifelse(b2$LANDFORM=="lbwb","lb_type","ws_type"))))))))))))))
LANDFORM is a factor, but I tried changing it to a character too, and the code still didn't work.
"ws_type" is the catch-all for the remaining levels.
The code runs without errors, but when I check it, all I get is:
> unique(b2$LANDFORM)
[1] NA "af_type"
Am I even on the right path? Any suggestions? Should I bite the bullet and make a new column with substr()? Thanks in advance.
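One possible cause of the NA values, offered as an observation on the posted code rather than something stated in the thread: two of the lines refer to b2$LANDFORD and bs$LANDFORM instead of b2$LANDFORM. The sketch below uses a made-up three-row b2 to show how a misspelled column name can turn the fall-through branch into NA without raising an error:

```r
# Made-up stand-in for b2; only the column name LANDFORM comes from the question.
b2 <- data.frame(LANDFORM = c("af", "bfr", "ws"), stringsAsFactors = TRUE)

# A misspelled column name returns NULL instead of raising an error.
missing_col <- b2$LANDFORD

# NULL == "afwb" is a zero-length logical, so the inner ifelse() is zero-length too,
# and the outer ifelse() fills every non-matching position with NA.
res <- ifelse(b2$LANDFORM == "af", "af_type",
       ifelse(b2$LANDFORD == "afwb", "af_type", "ws_type"))
```

Here res contains "af_type" for the first row and NA for the others, which matches the shape of the output shown in the question.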
If your new levels are just the first two letters of the old ones followed by _type you can easily achieve what you want through:
#prototype of your column
mycol<-factor(sample(c("aflb","afub","afwb","afws","bfrlb","bfrwb","bfrws","lb","lbwb","lbws","wslb","wsub"), replace=TRUE, size=100))
as.factor(paste(sep="",substr(mycol,1,2),"_type"))
After a great deal of experimenting, I consulted a co-worker, who was able to simplify this considerably. Basically, I should have made a new column holding the first two letters of each value in LANDFORM, and then used that new column to replace the values in LANDFORM, which makes the ifelse() statement much shorter. The code is:
> b2$index=as.factor(substring(b2$LANDFORM,1,2))
b2$LANDFORM=ifelse(b2$index=="af","af_type",
ifelse(b2$index=="bf","bf_type",
ifelse(b2$index=="lb","lb_type",
ifelse(b2$index=="wb","wb_type",
ifelse(b2$index=="ws","ws_type","ub_type")))))
b2$LANDFORM=as.factor(b2$LANDFORM)
Thanks to everyone who gave me some guidance!
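For the original four-level grouping, the nested chain can also be shortened with %in%. This is only a sketch: the level names are copied from the question, while the sample vector landform and the helper vectors af_levels, bf_levels, and lb_levels are made up for illustration.

```r
# Made-up sample of LANDFORM values.
landform <- factor(c("af", "aflb", "bfr", "bfrws", "lb", "lbub", "ws", "wsub"))

# Groups of old levels, taken from the ifelse() chain in the question.
af_levels <- c("af", "aflb", "afub", "afwb", "afws")
bf_levels <- c("bfr", "bfrlb", "bfrwb", "bfrwbws", "bfrws")
lb_levels <- c("lb", "lbaf", "lbub", "lbwb")

# One ifelse() per new level; anything left over falls through to "ws_type".
condensed <- ifelse(landform %in% af_levels, "af_type",
             ifelse(landform %in% bf_levels, "bf_type",
             ifelse(landform %in% lb_levels, "lb_type", "ws_type")))
condensed <- factor(condensed)
```

Each comparison is written once per group, which leaves far fewer places for a mistyped object or column name.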

R: Average value of vector

I'm playing around with R and trying to get the average of a column. Just mean(V1) doesn't work.
Could anybody give me some advice?
Thank you!
A: Use mean(D$V1). Column names are not first-class objects; they belong to a data.frame, and the data.frame's name has to be used to reach them.
Q: Yes, thanks. So $ is like the dot operator? – user1170330
A: Perhaps (depending on which language it is being compared to). It is possible to construct list objects with cascaded calls to $<- and then use obj$V1$subV1 to extract. Review ?Extract very carefully.
If you just want mean(V1), you can attach the data frame with attach(D), but I would not recommend it: later you may have other data frames with the same variable names and end up with a mess of attach and detach commands. So mean(D$V1) is the best way.
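A short sketch of the options discussed above, assuming a made-up data frame D with a column V1:

```r
# Made-up data frame; the names D and V1 come from the thread.
D <- data.frame(V1 = c(2, 4, 6), V2 = c(1, 1, 1))

# Reach the column through the data frame with $ ...
m1 <- mean(D$V1)

# ... or with [[ ]], which is convenient when the column name is stored in a variable.
m2 <- mean(D[["V1"]])

# with() evaluates the expression inside D without attaching it to the search path.
m3 <- with(D, mean(V1))
```

with() gives the convenience of writing mean(V1) while avoiding the side effects of attach().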

Calculate e.g. a mean in a list with multi-column data.frames

I have a list of several data.frames. Each data.frame has several columns.
By using
mean(mylist$first_dataframe$a)
I can get the mean for a in this one data.frame.
However, I do not know how to calculate it over all the data.frames stored in my list, or for specific data.frames.
I could use a loop, but I was told that apply() and its variations are better.
I tried several solutions I found via search, but somehow it just doesn't work.
I assume I need to use unlist().
Could you provide an example of how to calculate, e.g., a mean for a data structure like mine: a list with several data.frames, each containing several columns?
Update:
I'm sorry for the confusion. I wanted the grand mean for a specific column in all dataframes.
Thanks to Thomas for providing a working solution for calculating a grand mean for a specific column in all dataframes, and to psychometriko for providing a useful solution for calculating means over all columns in all dataframes (and even for the case where non-numeric data is involved).
Thanks!
Is this what you are looking for?
set.seed(42)
mylist <- list(a=data.frame(foo=rnorm(10),
bar=rnorm(10)),
b=data.frame(foo=rnorm(10),
bar=rnorm(10)),
c=data.frame(foo=rnorm(10),
bar=rnorm(10)))
sapply(do.call("rbind",mylist),mean)
foo bar
0.1163340 -0.1696556
Note: do.call("rbind",mylist) returns something similar to what you referred to above with the unlist function, and then sapply, as referred to by Roland in his answer, just calls the function mean on each component (column) of the data.frame that results from the above do.call function.
Edit: In response to the question of how to deal with non-numeric data.frame components, the below solution admittedly isn't very elegant and I'm sure better ones exist, but here's the first thing I was able to think of:
set.seed(42)
mylist <- list(a=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
b=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)),
c=data.frame(rand=rnorm(10),
lets=sample(LETTERS,10,replace=TRUE)))
sapply(do.call("rbind",mylist),function(x) {
if (is.numeric(x)) mean(x)
})
$rand
[1] -0.02470602
$lets
NULL
This basically just creates a custom function that first tests whether each component is numeric and, if it is, returns the mean. If it isn't, it skips it.
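An alternative sketch for the same situation, assuming the same kind of made-up list: dropping the non-numeric columns first with Filter() gives a plain named numeric vector instead of a list containing NULL entries.

```r
set.seed(42)
# Made-up list of data frames with one numeric and one non-numeric column each.
mylist <- list(
  a = data.frame(rand = rnorm(10), lets = sample(LETTERS, 10, replace = TRUE)),
  b = data.frame(rand = rnorm(10), lets = sample(LETTERS, 10, replace = TRUE))
)

# Stack the data frames, keep only numeric columns, then average each one.
combined <- do.call(rbind, mylist)
numeric_means <- sapply(Filter(is.numeric, combined), mean)
```

Filter(is.numeric, combined) returns a data frame with only the numeric columns, so sapply() never sees the letter column.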
The whole do.call('rbind', List) thing can be quite slow and prone to mishaps. If there is only one column you need the mean for, the best way is:
mean(sapply(mylist, function(X) X$rand))
It's about 10x faster than the do.call method.
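A sketch of the grand mean for one column across all data frames, assuming made-up data with the column name foo from the earlier answer. unlist(lapply(...)) is used here instead of sapply() because it also works when the data frames have different numbers of rows, in which case sapply() would return a list rather than a matrix.

```r
set.seed(42)
# Made-up list; the second data frame deliberately has fewer rows.
mylist <- list(
  a = data.frame(foo = rnorm(10), bar = rnorm(10)),
  b = data.frame(foo = rnorm(5),  bar = rnorm(5))
)

# Collect every foo value into one vector, then take a single mean.
all_foo <- unlist(lapply(mylist, function(df) df$foo))
grand_mean_foo <- mean(all_foo)

# Per-data-frame means, for comparison.
per_frame_means <- sapply(mylist, function(df) mean(df$foo))
```

Note that the grand mean weights every value equally, so it differs from the mean of per_frame_means whenever the data frames have different numbers of rows.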
