I have a big dataset that looks something like this:
Image   | Length | Angle
--------|--------|------
DSC_001 | 233.22 | 2.00
DSC_001 | 24.897 | 1.2
DSC_001 | 28.55  | 2.87
DSC_002 | 23.76  | 3.71
DSC_002 | 34.21  | 3.21
I want to compute the average of Length and Angle for each set (DSC_001 is one set, DSC_002 is another, and so on).
I can do it manually in Excel, but it takes a huge amount of time with around 4000 data points.
I'd like to know how to do this in R, or in Excel, in a much smarter way.
In R, we can use dplyr:
library(dplyr)
df1 %>%
  group_by(Image) %>%
  summarise(across(everything(), mean))  # summarise_each(funs(mean)) in dplyr < 1.0
Or with data.table:
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = Image]
Or using aggregate from base R:
aggregate(. ~ Image, df1, FUN = mean)
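For a quick sanity check, here is a minimal reproducible sketch of the base R route, with the question's rows typed in as df1 (the printed result is just the arithmetic on this toy data):

df1 <- data.frame(
  Image  = c("DSC_001", "DSC_001", "DSC_001", "DSC_002", "DSC_002"),
  Length = c(233.22, 24.897, 28.55, 23.76, 34.21),
  Angle  = c(2.00, 1.2, 2.87, 3.71, 3.21)
)

aggregate(. ~ Image, df1, FUN = mean)
#     Image   Length    Angle
# 1 DSC_001 95.55567 2.023333
# 2 DSC_002 28.98500 3.460000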
In Excel:
Make a new list with the unique values in the Image column.
Add the column names above your new list (not mandatory, but it makes the presentation of the data clearer).
Use AVERAGEIF() to compute a conditional average with the formula =AVERAGEIF(A2:A10,E3,B2:B10), where A2:A10 is the Image column, B2:B10 is the column of values to average, and E3 is the cell holding the Image name whose mean you want.
Hope it helps ;)
I have two tables: one with participants and one with an encoding of scores based on birth dates.
The score table looks like this:
score_table
Key | Value
--------------------
01/01/1900 | 15
01/01/1940 | 25
01/01/1950 | 30
All participants with birth dates between 01/01/1900 and 01/01/1940 should get a score of 15.
Participants born between 01/01/1940 and 01/01/1950 should get a score of 25, etc.
My participants' table looks like this:
participant_table
BirthDate | Gender
-----------------------
05/05/1930 | M
02/07/1954 | V
01/11/1941 | U
I would like to add a score to get the output table:
BirthDate | Gender | Score
------------------------------------
05/05/1930 | M | 15
02/07/1954 | V | 30
01/11/1941 | U | 25
I built several solutions for similar problems when the exact values are in the score table (using dplyr::left_join or base::match) or for numbers which can be rounded to another value. Here, the intervals are irregular and the dates arbitrary.
I know I can build a solution by iterating through the score table, using this method:
as.Date("05/05/1930", format="%d/%m/%Y) < as.Date("01/01/1900", format="%d/%m/%Y)
This returns a Boolean and thus allows me to walk through the scores until I find a date that is bigger, and then use the last score. However, there must be a better way to do this.
Maybe I can create some sort of bins from the dataframe, as such:
Bin 1 | Bin 2 | Bin 3
Date 1 : Date 2 | Date 2 : Date 3 | Date 3 : inf
But I don't yet see how. Does anyone see an efficient way to create such bins from a dataframe, so that I can efficiently retrieve scores from this table?
MRE:
Score table:
structure(list(key=c("1/1/1900", "2/1/2013", "2/1/2014","2/1/2015", "4/1/2016", "4/1/2017"), value=c(65,65,67,67,67,68)), row.names=1:6, class="data.frame")
Participant File:
structure(list(birthDate=c("10/10/1968", "6/5/2015","10/10/2017"), Gender=c("M", "U", "F")), row.names=1:3, class="data.frame")
Goal File:
structure(list(birthDate=c("10/10/1968", "6/5/2015","10/10/2017"), Gender=c("M", "U", "F"), Score = c(65,67,68)), row.names=1:3, class="data.frame")
Here is an approach using lead() along with sqldf:
library(dplyr)  # for lead()
library(sqldf)

score_table$Key2 <- as.Date(lead(score_table$Key), format="%d/%m/%Y")
score_table$Key <- as.Date(score_table$Key, format="%d/%m/%Y")
names(score_table) <- c("Date1", "Value", "Date2")
participant_table$BirthDate <- as.Date(participant_table$BirthDate, format="%d/%m/%Y")
sql <- "SELECT p.BirthDate, p.Gender, s.Value AS Score
FROM participant_table p
INNER JOIN score_table s
ON (p.BirthDate >= s.Date1 OR s.Date1 IS NULL) AND
(p.BirthDate < s.Date2 OR s.Date2 IS NULL)"
participant_table <- sqldf(sql)
The logic here is to join the participant to the score table using a range of matching dates in the latter. For the edge cases of the first and last rows of the score table, we allow a missing date in either column to represent any date whatsoever. For example, in the last row of the score table, the only requirement for a match is that a date be greater than the lower portion of the range.
I do not have R running locally at the moment, but I verified in an SQLite demo that the SQL logic works correctly.
I have found a very simple solution using only arithmetic.
To retrieve a score, I count how many keys the input date exceeds:
rownum <- sum(as.Date(input_date, format="%d/%m/%Y") >
as.Date(score_table$Key, format="%d/%m/%Y"))
Then the corresponding score can be retrieved with:
score <- score_table[["Value"]][rownum]
Thus, the spacing of the dates becomes irrelevant and it works quite fast.
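As an aside, base R's findInterval() should perform the same count for all participants in one vectorised call; a sketch, assuming the keys are sorted ascending and every birth date falls on or after the first key (left.open = TRUE reproduces the strict > comparison above):

keys  <- as.Date(score_table$Key, format = "%d/%m/%Y")
dates <- as.Date(participant_table$BirthDate, format = "%d/%m/%Y")

# for each date, the index of the last key strictly before it
rownums <- findInterval(dates, keys, left.open = TRUE)
participant_table$Score <- score_table$Value[rownums]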
I thought I'd share my solution in case it might be of use.
Thanks everyone for the effort and responses!
I have a df:
Year | Stage   | Home.Team.Name | Home.Team.Goals | Away.Team.Name | Away.Team.Goals
1998 | Group A | Brazil         | 2               | Scotland       | 1
and so on.
What I'm trying to do is create a new column based on the result of each game, so that the winner's name appears in it. The code I currently have is:
RecentWorldCups$Game.Winner <- ifelse(RecentWorldCups$Home.Team.Goals > RecentWorldCups$Away.Team.Goals,
                                      RecentWorldCups$Home.Team.Name,
                                      ifelse(RecentWorldCups$Away.Team.Goals > RecentWorldCups$Home.Team.Goals,
                                             RecentWorldCups$Away.Team.Name,
                                             "Draw"))
The result of this is that it gives me a number (perhaps a factor number?) instead of the name of the team.
Anyone able to help?
Cheers
You need to extract the character values from the levels of your factor columns. Try this:
df <- RecentWorldCups  # shorter alias for readability

# indexing a factor's levels by the factor itself recovers the character values
df$Game.Winner <- ifelse(df$Home.Team.Goals > df$Away.Team.Goals,
                         levels(df$Home.Team.Name)[df$Home.Team.Name],
                         ifelse(df$Away.Team.Goals > df$Home.Team.Goals,
                                levels(df$Away.Team.Name)[df$Away.Team.Name],
                                "Draw"))
If you find it cumbersome to do these factor conversions, then one workaround would be to create your data frame with all strings set to not be factors, e.g. something like this:
RecentWorldCups <- data.frame(Home.Team.Goals=c(...), ..., stringsAsFactors=FALSE)
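As a side note, since R 4.0.0 data.frame() and read.csv() default to stringsAsFactors = FALSE, so on a current R version the original ifelse() may already return names. On older versions, another option is to convert the two name columns once up front; a sketch using the same columns:

RecentWorldCups$Home.Team.Name <- as.character(RecentWorldCups$Home.Team.Name)
RecentWorldCups$Away.Team.Name <- as.character(RecentWorldCups$Away.Team.Name)

# the original ifelse() now returns team names rather than factor codes
RecentWorldCups$Game.Winner <- ifelse(
  RecentWorldCups$Home.Team.Goals > RecentWorldCups$Away.Team.Goals,
  RecentWorldCups$Home.Team.Name,
  ifelse(RecentWorldCups$Away.Team.Goals > RecentWorldCups$Home.Team.Goals,
         RecentWorldCups$Away.Team.Name,
         "Draw"))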
I'm struggling a bit with conditionally subsetting data in order to average the subset.
I have 2 datasets:
type<-c("flesh","wholefish","wholefish","wholefishdelip")
group<-c("two","four",'five','five')
N<-c(10.2,11.1,10.7,11.3)
prey <- cbind(type,group,N)
sample<-c('plasma','wholeblood','redbloodcell')
group1<-c('four','four','two')
group2<-c('','five','four')
group3<-c('','','five')
avgN<-c("","","")
penguin<-cbind(sample,group1,group2,group3,avgN)
I want the output to look like this:
sample | group1 | group2 | group3 | avgNwf
plasma | four | | | 11.1 #made up by (11.1/1)
wholeblood | four | five | | 10.9 #(11.1+10.7)/2
redbloodcell | two | four | five | 10.9 #(11.1+10.7)/2
I want to calculate a value for penguin$avgN according to conditions per row: the average of prey$N where prey$type == "wholefish" and prey$group matches penguin$group1, penguin$group2, or penguin$group3. Not all penguin groups have entries, so I ran into a problem in Excel where I couldn't make it ignore the #N/A values. (And Excel doesn't have a built-in function for conditional standard deviations.)
I.e., for each row of the penguin data frame, I want to average N (from the prey df) over all wholefish in that row's listed groups.
I have tried the following with fewer conditions, just to see if I am on track, but to no avail:
avgN <-mean(ifelse(prey$group==penguin$group1,prey$N, "nope"))
avgN <-mean(prey$N[prey$group==penguin$group1,])
The following is not what I want to achieve:
avgN = summaryBy(N ~group+type, data=prey, FUN=c(mean, sd), na.rm=T)
as it brings back a summary version of information instead of an individual result for each entry with its own conditions.
avgN <-mean(prey$N)
as it lacks the conditions for each individual sample.
In excel I would use cell references to work with conditions unique to a row.
So here is an answer for anyone struggling with something similar.

# prey and penguin were built with cbind(), so they are character matrices,
# not data frames; extract the N column once as a numeric vector
prey3 <- as.numeric(prey[, 3])
# add a sixth column to penguin to hold the standard deviation
penguin <- cbind(penguin, sdN = "")

for (i in 1:3) {
  # condition 1: prey$type == "wholefish"
  a <- which(prey[, 1] == "wholefish")

  # condition 2: prey group matches penguin group1
  b <- which(prey[, 2] == penguin[i, 2])
  ad <- a[which(match(a, b) > 0)]     # rows meeting conditions 1 and 2

  # condition 3: prey group matches penguin group2
  bb <- which(prey[, 2] == penguin[i, 3])
  add <- a[which(match(a, bb) > 0)]

  # condition 4: prey group matches penguin group3
  bbb <- which(prey[, 2] == penguin[i, 4])
  addd <- a[which(match(a, bbb) > 0)]

  # some of these objects come back as integer(0), which would break the
  # mean, so drop the empty ones
  relrows <- if (identical(add, integer(0))) ad else c(ad, add)
  relrows2 <- if (identical(addd, integer(0))) relrows else c(relrows, addd)

  # then do the calculations (columns 5 and 6 are avgN and sdN)
  penguin[i, 5] <- mean(prey3[relrows2])
  penguin[i, 6] <- sd(prey3[relrows2])
}
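For reference, the same per-row logic can be written more compactly with apply(); a sketch, assuming prey and penguin are still the character matrices built in the question:

wf <- prey[, "type"] == "wholefish"

penguin[, "avgN"] <- apply(penguin[, c("group1", "group2", "group3")], 1, function(g) {
  # prey rows that are wholefish and fall in one of this sample's groups
  vals <- as.numeric(prey[wf & prey[, "group"] %in% g[g != ""], "N"])
  mean(vals)
})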
Thank you Z.Lin for your help
I am wondering if it is possible to get the geometric mean of a set of values based upon the value of another column using dplyr, or if there is a better way.
I have something like this as a data.frame
Days.Stay | Svc
5 | Med
6 | Surg
... | ...
I'd like to add a column, named something like Geo.Mean.Days.Stay, whose value is the geometric mean of Days.Stay grouped by Svc, so each Svc has its own geometric mean. I would also like to extend this to the geometric standard deviation. So a data.frame result like so:
Days.Stay | Svc  | Geo.Mean.Days.Stay | Geo.SD.Days.Stay
5         | Med  | 6.78               | 2.7
6         | Surg | 5.4                | 2.1
Is dplyr a good package for this or should I use an alternate method?
This should work:
library("dplyr")
dd %>%
  group_by(Svc) %>%
  summarise(Geo.Mean.Days.Stay = exp(mean(log(Days.Stay))),
            Geo.SD.Days.Stay = exp(sd(log(Days.Stay))))
If you were going to use the geometric mean and SD on a regular basis it would be a good idea to define some helper functions (gmean <- function(x) exp(mean(log(x)))) to improve readability ...
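If you want the table shown in the question, with the group statistics repeated on every row rather than one summary row per Svc, mutate() can replace summarise(); a sketch using helper functions along the lines suggested above (gmean and gsd are assumed names):

library(dplyr)

gmean <- function(x) exp(mean(log(x)))  # geometric mean
gsd <- function(x) exp(sd(log(x)))      # geometric standard deviation

dd %>%
  group_by(Svc) %>%
  mutate(Geo.Mean.Days.Stay = gmean(Days.Stay),
         Geo.SD.Days.Stay = gsd(Days.Stay)) %>%
  ungroup()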
I want to merge two different csv files into one based on one column. Both csv datasets have a column with the same description (NAME). Now I want to copy the contents of two columns (POINT_X and POINT_Y) from Table B to Table A, matching on the NAME column.
Every row of Table A with the name "TestTestTest" should get the corresponding values from the Table B row with the name "TestTestTest".
TableA
FID | NAME| job | school | superma | traffic | fun | shopping |
TableB
FID | NAME| pop | POINT_X | POINT_Y | POINT_Z
I've already tried to use the merge function.
newdata = merge(TableA, TableB, all="TRUE")
write.csv(newdata, file = "merge.csv")
This works somehow, but it writes a strange new .csv with many columns, which I don't want. I just want to add the columns "POINT_X" and "POINT_Y" to TableA, matched on the column "NAME".
Thanks!
You could still use merge, but pass TableB limited to the columns NAME, POINT_X, and POINT_Y:
newdata = merge(TableA, TableB[,c("NAME", "POINT_X", "POINT_Y")], all=TRUE)
write.csv(newdata, file = "merge.csv")
merge is still the best way, and you can add the by parameter:
newdata = merge(TableA, TableB, by="NAME", all=TRUE)
However, it can also be achieved with match(). Note the argument order, match(TableA$NAME, TableB$NAME), which aligns the TableB rows to TableA:
TableA$POINT_X <- TableB[match(TableA$NAME, TableB$NAME), "POINT_X"]
TableA$POINT_Y <- TableB[match(TableA$NAME, TableB$NAME), "POINT_Y"]
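If you prefer dplyr, a left join restricted to the wanted columns does the same thing; a sketch, assuming the NAME values are unique in TableB:

library(dplyr)

newdata <- TableA %>%
  left_join(select(TableB, NAME, POINT_X, POINT_Y), by = "NAME")

write.csv(newdata, file = "merge.csv", row.names = FALSE)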