I am wondering if it is possible to get the geometric mean of a set of values based upon the value of another column using dplyr, or if there is a better way.
I have something like this as a data.frame
Days.Stay | Svc
5 | Med
6 | Surg
... | ...
I'd like to add a column, Geo.Mean.Days.Stay or something like that, whose value is the geometric mean of Days.Stay grouped by Svc, so each Svc has its own geometric mean. I would also like to extend this to the geometric standard deviation. So a data.frame result like so:
Days.Stay | Svc | Geo.Mean.Days.Stay | Geo.SD.Days.Stay
5 | Med | 6.78 | 2.7
6 | Surg| 5.4 | 2.1
Is dplyr a good package for this or should I use an alternate method?
This should work:
library("dplyr")
dd %>% group_by(Svc) %>%
    summarise(Geo.Mean.Days.Stay = exp(mean(log(Days.Stay))),
              Geo.SD.Days.Stay   = exp(sd(log(Days.Stay))))
If you are going to use the geometric mean and SD on a regular basis, it would be a good idea to define helper functions (e.g. gmean <- function(x) exp(mean(log(x)))) to improve readability.
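Such helpers might look like this (a minimal sketch in base R; the names gmean and gsd are hypothetical, not from any package):

```r
# Hypothetical helpers: geometric mean and geometric SD via log transforms
gmean <- function(x) exp(mean(log(x)))
gsd   <- function(x) exp(sd(log(x)))

x <- c(5, 6, 8, 12)
gmean(x)  # geometric mean of x
gsd(x)    # geometric SD of x (always >= 1)
```

With these defined, the summarise() call reduces to summarise(Geo.Mean.Days.Stay = gmean(Days.Stay), Geo.SD.Days.Stay = gsd(Days.Stay)).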
I have a df:
Year | Stage | Home.Team.Name | Home.Team.Goals | Away.Team.Name | Away.Team.Goals
1998 | Group A | Brazil | 2 | Scotland | 1
and so on.
What I'm trying to do is create a new column based off the result of each game, so the winner's name appears in a new column. The code I currently have is:
RecentWorldCups$Game.Winner <- ifelse(RecentWorldCups$Home.Team.Goals > RecentWorldCups$Away.Team.Goals,
                                      RecentWorldCups$Home.Team.Name,
                               ifelse(RecentWorldCups$Away.Team.Goals > RecentWorldCups$Home.Team.Goals,
                                      RecentWorldCups$Away.Team.Name,
                                      "Draw"))
The result of this is that it gives me a number (the factor's underlying integer code, perhaps?) instead of the name of the team.
Anyone able to help?
Cheers
You need to extract the character level value from your factor columns. Try this:
df <- RecentWorldCups # for readability of your code
df$Game.Winner <- ifelse(df$Home.Team.Goals > df$Away.Team.Goals,
                         levels(df$Home.Team.Name)[df$Home.Team.Name],
                  ifelse(df$Away.Team.Goals > df$Home.Team.Goals,
                         levels(df$Away.Team.Name)[df$Away.Team.Name],
                         "Draw"))
If you find it cumbersome to do these factor conversions, then one workaround would be to create your data frame with all strings set to not be factors, e.g. something like this:
RecentWorldCups <- data.frame(Home.Team.Goals=c(...), ..., stringsAsFactors=FALSE)
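An equivalent approach is to convert the factor columns to character up front with as.character(), after which ifelse() returns the names directly. A sketch on a hypothetical miniature of the data:

```r
# Hypothetical miniature of the World Cup data, with team names as factors
df <- data.frame(Home.Team.Name  = c("Brazil", "France"),
                 Away.Team.Name  = c("Scotland", "Brazil"),
                 Home.Team.Goals = c(2, 0),
                 Away.Team.Goals = c(1, 0),
                 stringsAsFactors = TRUE)

# Convert the factor columns to character once ...
df$Home.Team.Name <- as.character(df$Home.Team.Name)
df$Away.Team.Name <- as.character(df$Away.Team.Name)

# ... then ifelse() yields team names, not factor codes
df$Game.Winner <- ifelse(df$Home.Team.Goals > df$Away.Team.Goals, df$Home.Team.Name,
                  ifelse(df$Away.Team.Goals > df$Home.Team.Goals, df$Away.Team.Name,
                         "Draw"))
df$Game.Winner  # "Brazil" for the first row, "Draw" for the second
```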
(The title of the question is terrible, I'm sorry. I was having a hard time finding a pithy way to express it.)
I have a "tall" data.frame that I have compiled. It looks like this:
id | rating
-----------
3 | 5.5
4 | 6
4 | 7
5 | 3
5 | 5
6 | 7.5
7 | 9
...
I want to turn that into this:
id | avg rating
-----------
3 | 5.5
4 | 6.5
5 | 4
6 | 7.5
7 | 9
...
I don't just want to remove duplicates. I want to take the rows that have the same duplicate id, remove the duplicates, but update the rating field to be the average.
I'm not sure how to go about this. I'm not even sure whether I should be modifying the original data frame or instead creating a new one with the modified data.
(Note: I think a good answer would be a bit agnostic to the specifics of the operation. Like, if I wanted to do something similar but instead have the resulting rating column be a sum or a count, hopefully your answer would apply to those situations as well.)
You also have the option to use SQL, if you are familiar with it.
You will need the sqldf package: library(sqldf)
sqldf("
select id, avg(rating) `avg_rating`
from your_data
group by id
")
A version using dplyr and including a sum example.
library(dplyr)
df %>%
group_by(id) %>%
summarize(avg_rating = mean(rating),
sum_rating = sum(rating))
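For an operation-agnostic base R alternative, aggregate() works the same way: swapping the FUN argument switches between mean, sum, and count. A sketch using the sample values from the question:

```r
# Tall data frame matching the question's example
ratings <- data.frame(id     = c(3, 4, 4, 5, 5, 6, 7),
                      rating = c(5.5, 6, 7, 3, 5, 7.5, 9))

# One row per id; change FUN to change the operation
aggregate(rating ~ id, data = ratings, FUN = mean)   # average per id
aggregate(rating ~ id, data = ratings, FUN = sum)    # total per id
aggregate(rating ~ id, data = ratings, FUN = length) # count per id
```

This builds a new summarized data frame rather than modifying the original, which keeps the raw data intact for other operations.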
I have a big dataset something like this below:
Image   | Length | Angel
--------|--------|------
DSC_001 | 233.22 | 2.00
DSC_001 | 24.897 | 1.2
DSC_001 | 28.55  | 2.87
DSC_002 | 23.76  | 3.71
DSC_002 | 34.21  | 3.21
I want to average Length and Angel for each set (DSC_001 is one set, DSC_002 is another, and so on).
I can do it manually in Excel, but it takes a long time with around 4000 data points.
How can I do this in R, or in Excel in a smarter way?
In R, we can use dplyr
library(dplyr)
df1 %>%
    group_by(Image) %>%
    summarise_each(funs(mean))
# summarise_each() is superseded; in current dplyr this is written as
# summarise(across(everything(), mean))
Or with data.table
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = Image]
Or using aggregate from base R
aggregate(. ~ Image, df1, FUN = mean)
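As a quick check, here is the base R version run on a small mock-up of the table above (the data frame name df1 follows the answer; the values are copied from the question):

```r
# Mock-up of the data shown in the question
df1 <- data.frame(Image  = c("DSC_001", "DSC_001", "DSC_001", "DSC_002", "DSC_002"),
                  Length = c(233.22, 24.897, 28.55, 23.76, 34.21),
                  Angel  = c(2.00, 1.2, 2.87, 3.71, 3.21))

# One row per Image, with the mean of every other column
aggregate(. ~ Image, data = df1, FUN = mean)
```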
In Excel:
Make a new list with the unique values in the Image column as described here.
Add the column names above your new list (not mandatory but important for clear presentation of the data).
Use AVERAGEIF() to compute a conditioned average with the formula =AVERAGEIF(A2:A10,E3,B2:B10), assuming A2:A10 is the Image column, B2:B10 is the column of values to average, and E3 is the cell holding the Image whose mean is to be calculated.
Hope it helps ;)
I'm having trouble running a Friedman test over my data.
I'm trying to run a Friedman test using this command:
friedman.test(mean ~ isi | expId, data=monoSum)
On the following database (https://www.dropbox.com/s/2ox0y1b4gwld0ai/monoSum.csv):
> monoSum
expId isi N mean
1 m80B1 1 10 100.000000
2 m80B1 2 10 73.999819
3 m80B1 3 10 45.219362
4 m80B1 4 10 116.566174
. . . . .
18 m80L2 2 10 82.945491
19 m80L2 3 10 57.675480
20 m80L2 4 10 207.169277
. . . . . .
25 m80M2 1 10 100.000000
26 m80M2 2 10 49.752687
27 m80M2 3 10 19.042592
28 m80M2 4 10 150.411035
It gives me back the error:
Error in friedman.test.default(c(100, 73.9998193095267, 45.2193621626293, :
not an unreplicated complete block design
I figure it gives the error because, when monoSum$isi==1 the value of mean is always 100. Is this correct?
However, monoSum$isi==1 is always 100 because it is the control group on which all the other monoSum$isi groups are normalized. I cannot assume a normal distribution, so I cannot run a repeated-measures ANOVA…
Is there a way to run a friedman test on this data or am I missing a very essential point here?
Many thanks in advance!
I don't get an error if I run your dataset:
Friedman rank sum test
data: mean and isi and expId
Friedman chi-squared = 17.9143, df = 3, p-value = 0.0004581
However, you have to make sure that expId and isi are coded as factors. Run these commands:
monoSum$expId <- factor(monoSum$expId)
monoSum$isi <- factor(monoSum$isi)
Then run the test again. This has worked for me with a similar problem.
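For context, here is a minimal sketch (with made-up data, not the monoSum dataset) of the layout friedman.test() expects: every treatment level occurring exactly once within every block.

```r
# Hypothetical unreplicated complete block design:
# 3 treatment levels, each appearing exactly once in each of 4 blocks
d <- data.frame(block = factor(rep(paste0("b", 1:4), each = 3)),
                treat = factor(rep(c("t1", "t2", "t3"), times = 4)),
                y     = c(1, 2, 3,  2, 4, 6,  1, 3, 5,  2, 3, 4))

# Runs without the "not an unreplicated complete block design" error
friedman.test(y ~ treat | block, data = d)
```

If any treatment/block combination is missing or duplicated, the same formula call raises the error from the question.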
I know this is pretty old but for future generations (see also: me when I forget and google this again):
You can determine what the missing values are in your dataframe by running table(groups, blocks), or in the case of this question table(monoSum$isi, monoSum$expId). This will return a table of counts; the missing records are in the cells with 0s.
I ran into this problem after trying to remove the blocks that had incomplete results; taking a subset of the data did not remove the blocks for some reason.
Just thought I would mention I found this post because I was getting a similar error message. The above suggestions did not solve it. Strangely, I had to sort my dataframe so that within each block the groups appeared in the same order, i.e. I could not have the following:
Block 1 A
Block 1 B
Block 2 B
Block 2 A
It had to appear as A, B, A, B.
I ran into the same cryptic error message in R, though in my case it was resolved when I applied as.matrix() to what was originally a data frame I had imported from CSV with read.csv().
I also had a missing data point in my original data set, and I found that when my data was transformed into a matrix for the friedman.test() call, the entire row containing the missing data point was omitted automatically.
Using the function as.matrix() to transform my dataframe is the magic that got the function to run for me.
I had this exact error too with my dataset.
It turns out that the function friedman.test() accepts data frames (e.g. those created by data.frame()) but not tibbles (those created by dplyr and other tidyverse tools). The solution for me was to convert my dataset to a data frame first.
D_fri <- D_all %>% dplyr::select(FrustrationEpisode, Condition, Participant)
D_fri <- as.data.frame(D_fri)
str(D_fri) # confirm the object should now be a 'data.frame'
friedman.test(FrustrationEpisode ~ Condition | Participant, D_fri)
I ran into this problem too. Fixed mine by removing the NAs.
# My data (called layers) looks like:
| resp.no | av.l.all | av.baseem | av.base |
| 1 | 1.5 | 1.3 | 2.3 |
| 2 | 1.4 | 3.2 | 1.4 |
| 3 | 2.5 | 2.8 | 2.9 |
...
| 1088 | 3.6 | 1.1 | 3.3 |
# Remove NAs
layers1 <- na.omit(layers)
# Re-organise the data so the scores are stacked, with a column holding the
# original column name as a factor; gather() is from tidyr and
# convert_as_factor() (and %>%) from rstatix, so load those first
library(tidyr)
library(rstatix)
layers2 <- layers1 %>%
    gather(key = "layertype", value = "score", av.l.all, av.baseem, av.base) %>%
    convert_as_factor(resp.no, layertype)
# Data now looks like this
| resp.no | layertype | score |
| 1 | av.l.all | 1.5 |
| 1 | av.baseem | 1.3 |
| 1 | av.base | 2.3 |
| 2 | av.l.all | 1.4 |
...
| 1088 | av.base | 3.3 |
# Then do Friedman test
friedman.test(score ~ layertype | resp.no, data = layers2)
Just want to share what my problem was. My ID factor did not have correct levels after doing pivot_longer(), and because of this the same error was given. I rebuilt the levels and it worked: df$ID <- as.factor(as.character(df$ID))
Reviving an old thread with new information. I ran into a similar problem after removing NAs. My group and block were factors before the NA removal. However, after removing NAs, the factors retained the levels before the removal even though some levels were no longer in the data!
Running friedman.test() with the as.matrix() trick (e.g., friedman.test(a ~ b | c, as.matrix(df))) was fine, but running frdAllPairsExactTest() or friedman_effsize() would throw the "not an unreplicated complete block design" error. I ended up re-factoring the group and block (i.e., dropping the levels that were no longer in the data: df$block <- factor(df$block)) to make things work. After the re-factor, I did not need the as.matrix() trick either.
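The stale-levels issue and the re-factoring fix can be sketched in a few lines of base R (droplevels() is equivalent to re-calling factor() on the column):

```r
# A factor keeps all of its levels even after the data is subset
f  <- factor(c("a", "b", "c"))
f2 <- f[f != "c"]     # the value "c" is gone from the data ...
levels(f2)            # ... but "c" is still listed as a level

# Re-factoring (or droplevels) discards the unused level
f3 <- droplevels(f2)  # equivalent to factor(f2)
levels(f3)            # only "a" and "b" remain
```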
I am working with rather large datasets (approx. 4 million rows per month, with 25 numeric attributes and 4 factor attributes). I would like to create a graph that contains, per month (for the last 36 months), a boxplot for each numeric attribute per product (one of the 4 factor attributes).
So as an example for product A:
[ASCII sketch: for product A, a row of side-by-side boxplots, one per month, along a time axis labelled jan '10, feb '10, mar '10 ... feb '13]
But since these are quite large datasets, I would like some advice on how to approach this. My idea (though I am not sure it is possible) is to:
a) extract the data per month per product
b) create a boxplot for that specific month (so let's say jan'10 for product A)
c) store the boxplot summary data somewhere
d) repeat a-c for all months until feb '13
e) combine all the stored boxplot summary data into one
f) plot the combined boxplot
g) repeat a-f for all other products
So my main question is: is it possible to combine separate boxlot summaries into one and create the combined graph as sketched above from this?
Any help would be appreciated,
Thank you
Here's a long-hand example that you can probably cook something up around:
Read in the individual datasets - you might want to overwrite the same data or wrap this step in a function given the large data you are using.
dset1 <- 1:10
dset2 <- 10:20
dset3 <- 20:30
Store some boxplot info, notice the plot=FALSE
result1 <- boxplot(dset1,plot=FALSE,names="month1")
result2 <- boxplot(dset2,plot=FALSE,names="month2")
result3 <- boxplot(dset3,plot=FALSE,names="month3")
Group up the data and plot with bxp
mylist <- list(result1, result2, result3)
groupbxp <- do.call(mapply, c(cbind, mylist))
bxp(groupbxp)
You will not be able to predict with absolute precision what the "fivenum" values of the combined data will be. Consider two groups for which you have each group's 75th percentile and observation count: if the percentiles are unequal, you cannot simply take the weighted mean of the percentiles to get the 75th percentile of the aggregated values. See the help page ?boxplot.stats. I would think, however, that you might come very close by using the medians of the fivenum collections; this might be a place to start your examinations.
# sapply over split() returns a matrix: one row per month, columns 1-5 the
# five-number summary, column 6 the count
mo.mtx <- t(sapply(split(dat$values, dat$month),
                   function(mo.dat) c(fivenum(mo.dat), length(mo.dat))))
matplot(mo.mtx[, 1:5], type = "l")
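The percentile caveat above can be illustrated with hypothetical numbers: the pooled 75th percentile generally differs from the count-weighted mean of the per-group 75th percentiles.

```r
# Two made-up groups of different sizes and scales
g1 <- c(1, 2, 3, 4)
g2 <- c(10, 20, 30, 40, 50, 60)

# Count-weighted mean of the per-group 75th percentiles ...
q1 <- quantile(g1, 0.75)
q2 <- quantile(g2, 0.75)
weighted <- (length(g1) * q1 + length(g2) * q2) / (length(g1) + length(g2))

# ... versus the 75th percentile of the pooled data
pooled <- quantile(c(g1, g2), 0.75)

c(weighted = unname(weighted), pooled = unname(pooled))  # the two disagree
```

This is why combining boxplot summaries is an approximation: the exact combined quantiles depend on the raw values, not just each group's summary statistics.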