Using dplyr::summarize() function for multi-step arithmetical process? - r

So I've got some golf data that I'm messing with in R:
player rd hole shot distToPin distShot
E. Els 1 1 1 525 367.6
E. Els 1 1 2 157.4 130.8
E. Els 1 1 3 27.5 27.4
E. Els 1 1 4 1.2 1.2
E. Els 1 2 1 222 216.6
E. Els 1 2 2 6.8 6.6
E. Els 1 2 3 0.3 0.3
E. Els 2 1 1 378 244.4
E. Els 2 1 2 135.9 141.6
E. Els 2 1 3 6.7 6.9
E. Els 2 1 4 0.1 0.1
I'm trying to make an "efficiency" computation. Basically, I want to compute the following formula (which I made up, if you can't tell) by round:
E = hole yardage / (sum(distance of all shots) - hole yardage)
And ultimately, I want my results to look like this:
player   rd   efficiency
E. Els    1       205.25
E. Els    2        25.2
That efficiency column is the averaged result of the efficiency for each hole over the entire round. The issue that I'm having is I can't quite figure out how to do such a complex calculation using dplyr::summarize():
efficiency <- df %>%
  group_by(player, rd) %>%
  summarize(efficiency = (sum(distShot) - distToPin))
But the problem with that particular script is that it returns the error:
Error: expecting a single value
I think my problem is that, were it to run, it wouldn't be able to tell WHICH distToPin to subtract, and the one I want is obviously the first distToPin of each hole, i.e. the actual hole length (unfortunately, I don't have a column of just "hole yardage"). I want to pull that first distToPin of each hole out and use it within my summarize() arithmetic. Is this even possible?
I'm guessing that there is a way to do these types of complex, multi-step calculations within the summarize() function, but maybe there's not! Any ideas or advice?

You seem to be missing some steps. Here is a deliberately labored version, using dplyr, that spells them out. It assumes that your data frame is named golfdf and keeps your column name rd for the round:
golfdf %>%
  group_by(player, rd, hole) %>%
  summarise(hole.length = first(distToPin), shots.length = sum(distShot)) %>%
  group_by(player, rd) %>%
  summarise(efficiency = sum(hole.length) / (sum(shots.length) - sum(hole.length)))
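As a quick check against the numbers in the question (a sketch that just rebuilds the sample rows): the sum-based ratio above is a round-level aggregate; taking the mean of the per-hole efficiencies in the second summarise() reproduces the question's 205.25 and 25.2 exactly.
library(dplyr)

golfdf <- data.frame(
  player    = "E. Els",
  rd        = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  hole      = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1),
  shot      = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4),
  distToPin = c(525, 157.4, 27.5, 1.2, 222, 6.8, 0.3, 378, 135.9, 6.7, 0.1),
  distShot  = c(367.6, 130.8, 27.4, 1.2, 216.6, 6.6, 0.3, 244.4, 141.6, 6.9, 0.1)
)

golfdf %>%
  group_by(player, rd, hole) %>%
  summarise(hole.length = first(distToPin), shots.length = sum(distShot)) %>%
  group_by(player, rd) %>%
  summarise(efficiency = mean(hole.length / (shots.length - hole.length)))
# player    rd  efficiency
# E. Els     1      205.25
# E. Els     2       25.2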

Related

How to setup two dynamic conditions in SUMIFS like problem in R?

I already tried my best but I'm still pretty much a newbie to R.
I have about 500 MB of input data that currently looks like this:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days
1 2818 5829821 335511.0 1
2 20168 5829746 335265.2 3
3 25428 5830640 331534.6 0
4 27886 5832156 332003.1 3
5 28658 5830888 329727.2 3
6 28871 5829980 332071.3 7
I need to calculate the conditional sum of reviews_last30days - the conditions being a specific and changing area range for each respective record, i.e. R should sum only those reviews for which the calc.latitude and calc.longitude do not deviate more than +/-500 from the longitude and latitude values in each row.
EXAMPLE:
ROW 1 has a calc.latitude 5829821 and a calc.longitude 335511.0, so R should take the sum of all reviews_last30days for which the following ranges apply: calc.latitude 5829321 to 5830321 (value of Row 1 latitude +/-500)
calc.longitude 335011.0 to 336011.0 (value of Row 1 longitude +/-500)
So my intended output would look somewhat like this in column 5:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days reviewsper1000
1 2818 5829821 335511.0 1 4
2 20168 5829746 335265.2 3 4
3 25428 5830640 331534.6 0 10
4 27886 5832156 332003.1 3 3
5 28658 5830888 331727.2 3 10
6 28871 5829980 332071.3 7 10
Hope I calculated correctly in my head, but you get the idea.
So far I'm particularly struggling with the fact that my sum conditions are dynamic and "newly assigned", since the latitude and longitude conditions have to be adjusted for each record.
My current code looks like this but it obviously doesn't work that way:
review1000 <- function(TOTALLISTINGS = NULL){
  # tibble to return
  to_return <- TOTALLISTINGS %>%
    group_by(listing_id) %>%
    summarise(
      reviews1000 = sum(reviews_last30days[(calc.latitude >= (calc.latitude - 500) | calc.latitude <= (calc.latitude + 500))]))
  return(to_return)
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)
I know I also would have to add something for longitude in the code above
Does anyone have an idea how to fix this?
Any help or hints highly appreciated & thanks in advance! :)
See whether the code below helps. Note that between() here comes from dplyr, so load that package first.
library(dplyr)

TOTALLISTINGS$reviews1000 <- sapply(1:nrow(TOTALLISTINGS), function(r) {
  currentLATI <- TOTALLISTINGS$calc.latitude[r]
  currentLONG <- TOTALLISTINGS$calc.longitude[r]
  sum(TOTALLISTINGS$reviews_last30days[
    between(TOTALLISTINGS$calc.latitude, currentLATI - 500, currentLATI + 500) &
    between(TOTALLISTINGS$calc.longitude, currentLONG - 500, currentLONG + 500)])
})
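If you prefer to keep this wrapped in a function, as in your own attempt, the same logic can be packaged like this (a sketch along the lines of your review1000(), again relying on dplyr's between()):
review1000 <- function(TOTALLISTINGS) {
  TOTALLISTINGS$reviews1000 <- sapply(seq_len(nrow(TOTALLISTINGS)), function(r) {
    lat <- TOTALLISTINGS$calc.latitude[r]
    lon <- TOTALLISTINGS$calc.longitude[r]
    # sum reviews for all listings within +/-500 of this row's coordinates
    sum(TOTALLISTINGS$reviews_last30days[
      between(TOTALLISTINGS$calc.latitude, lat - 500, lat + 500) &
      between(TOTALLISTINGS$calc.longitude, lon - 500, lon + 500)])
  })
  TOTALLISTINGS
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)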

R aggregate by a variable then find out proportion of a each column

Sorry, I've tried my best but I didn't find the answer. As a beginner, I'm not sure that I'm able to put the question clearly. Thanks in advance.
So I have a dataframe about consumption with 24000 rows.
In this dataframe, there is a series of variables about the number of objects bought within the last two months:
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession" recorded as a number.
So now the data looks like this:
profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1 1 1 1 1
individu2 3 2 4 1
individu3 2 2 0 0
individu4 6 0 3 2
individu5 5 0 2 3
individu6 7 1 0 5
individu7 4 3 1 2
I would like to know the structure of consumption by profession and get something like this :
ProportionOfCoat ProportionOfShirt ProportionOfShoe...
profession1 0.3 0.5 0.1
profession2 0.1 0.2 0.4
profession3 0.2 0.6 0.1
profession4 0.1 0.1 0.2
I don't know if it is clear, but in the end I want to be able to say:
10% of the clothing products that doctors bought are T-shirts, whereas 20% of what teachers bought are T-shirts.
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose we can use dplyr?
Thank you very much!
temp <- aggregate( . ~ profession, data=zzz, FUN=sum)
cbind(temp[1],temp[-1]/rowSums(temp[-1]))
or also using prop.table
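For example, prop.table() applied row-wise (margin = 1) to the aggregated counts gives the same proportions (a sketch, assuming the same data frame name zzz as above):
temp <- aggregate(. ~ profession, data = zzz, FUN = sum)
cbind(temp[1], prop.table(as.matrix(temp[-1]), margin = 1))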
As other people noted, it is always better to post a reproducible example. I'll try to post one with my solution, which is longer than the ones already posted but, for the same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # set a seed because I'll use the sample() function
n <- 1:100   # vector from 1 to 100 to draw the number of products bought from
p <- 1:8     # vector of profession ids
profession <- sample(p, 50, replace = TRUE)
NumberOfCoat <- sample(n, 50, replace = TRUE)
NumberOfShirt <- sample(n, 50, replace = TRUE)
NumberOfShoes <- sample(n, 50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
                          NumberOfShirt, NumberOfShoes))
Once you've got the dataframe, you can explain what you have tried so far, or show a possible solution. Here I used dplyr.
df <- df %>% group_by(profession) %>% summarize(coats = sum(NumberOfCoat),
shirts = sum(NumberOfShirt),
shoes = sum(NumberOfShoes)) %>%
mutate(tot_prod = coats + shirts + shoes,
ProportionOfCoat = coats/tot_prod,
ProportionOfShirt = shirts/tot_prod,
ProportionofShoes = shoes/tot_prod) %>%
select(profession, ProportionOfCoat, ProportionOfShirt,
ProportionofShoes)
df corresponds to the second dataframe you show, where you have the proportion of each product bought by each profession. In my example it looks like this:
profession ProportionOfCoat ProportionOfShirt ProportionofShoes
<int> <dbl> <dbl> <dbl>
1 1 0.3910483 0.2343934 0.3745583
2 2 0.4069641 0.3525571 0.2404788
3 3 0.3330804 0.3968134 0.2701062
4 4 0.2740657 0.3952435 0.3306908
5 5 0.2573991 0.3784753 0.3641256
6 6 0.2293814 0.3543814 0.4162371
7 7 0.2245841 0.3955638 0.3798521
8 8 0.2861635 0.3490566 0.3647799
If you want to produce a stacked barplot, you have to reshape your data to a long format in order to be able to use ggplot2. As @alistaire noted, you can do it with the gather function from the tidyr package.
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
ggplot(df, aes(x=profession, y=proportion, fill=product)) +
geom_bar(stat="identity")
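Because the proportions already sum to 1 within each profession, the stacks above already reach 100%. If you ever start from raw counts instead, ggplot2 can rescale each stack for you with position = "fill" (a sketch, assuming a hypothetical long-format data frame df_counts with columns profession, product and count):
ggplot(df_counts, aes(x = factor(profession), y = count, fill = product)) +
  geom_bar(stat = "identity", position = "fill")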

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want, and it leads to the error Error in list2env(data) : first argument must be a named list when I run the plotly code. Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
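With the data frame this fixed function returns, one way to get the Score vs. Percentage plot is shown below (a sketch using plotly's formula interface, which may differ from the older syntax in the question):
library(plotly)

saPct <- scoreFun(sanders.scores$score)
# table() stores the scores as a factor; convert so the x-axis is numeric
saPct$Score <- as.numeric(as.character(saPct$Score))

plot_ly(saPct, x = ~Score, y = ~Percentage, type = "bar") %>%
  layout(xaxis = list(title = "Score", range = c(-15, 15)),
         yaxis = list(title = "Sanders", range = c(0, 50)))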

Blank information after merge

R version 3.2.1
I'm following this article on how to create Heat Maps and downloaded the required files (I just need help with the logic).
This works
setwd("D:/GIS/london")
library(maptools)
library(ggplot2)
library(gpclib)
sport <- readShapeLines("london_sport.shp")
When I run
names(sport)
I get normal output, i.e.
[1] "ons_label" "name" "Partic_Per" "Pop_2001"
And when I print sport
print(sport)
I get this (showing first 6 lines)
geometry ons_label name Partic_Per Pop_2001
0 MULTILINESTRING((541177.7 173555.7 ...)) 00AF Bromley 21.7 295535
1 MULTILINESTRING((522957.6 178071.3 ...)) 00BD Richmond upon Thames 26.6 172330
2 MULTILINESTRING((505114.9 184625.1 ...)) 00AS Hillingdon 21.5 243006
3 MULTILINESTRING((552108.3 194151.8 ...)) 00AR Havering 17.9 224262
4 MULTILINESTRING((519370.8 163657.4 ...)) 00AX Kingston upon Thames 24.4 147271
5 MULTILINESTRING((525554.3 166815.8 ...)) 00BF Sutton 19.3 179767
6 MULTILINESTRING((513062.8 178187.2 ...)) 00AT Hounslow 16.9 212352
So far everything looks normal. I understand the next few lines, to create plot points
p <- ggplot(sport@data, aes(Partic_Per, Pop_2001))
p + geom_point(aes(color="Partic_Per", size = "Pop_2001")) + geom_text(size=2, aes(label=name))
gpclibPermit()
Then we make the shape file (sport) into a data frame
sport_geom <- fortify(sport, region="ons_label")
And when I execute head(sport_geom) I get
long lat order piece group id
1 541177.7 173555.7 1 1 0.1 0
2 541872.2 173305.8 2 1 0.1 0
3 543441.5 171429.9 3 1 0.1 0
4 544361.6 172379.2 4 1 0.1 0
5 546662.4 170451.9 5 1 0.1 0
6 548187.1 170582.3 6 1 0.1 0
Then the command to merge data, where the problem arises
sport_geom <- merge(sport_geom, sport@data, by.x = "id", by.y = "ons_label")
When I go according to the document and print by., the only values that pop up are by.data.frame and by.default
And when I execute head(sport_geom), the data is blank!!!
[1] id long lat order piece group name Partic_Per Pop_2001
<0 rows> (or 0-length row.names)
What am I missing, and how do I troubleshoot this?
Perhaps this is an error in the document (this is the simplest tutorial I could find on heat maps), because I was troubleshooting another error for the past hour.
Please help! I know it's long, but it's the best I could do to explain the issue and possibly reproduce it.
If you need links to download the data, I'll get them for you.
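One quick way to start troubleshooting (a sketch, using the objects defined above): merge() returns zero rows whenever the two key columns share no values, so it is worth checking whether the ids produced by fortify() actually overlap with the ons_label codes before merging.
# values of the fortify() id column
head(unique(sport_geom$id))
# values of the shapefile attribute used for the join
head(unique(sport@data$ons_label))
# any overlap at all? an empty result here explains the empty merge
intersect(unique(sport_geom$id), unique(sport@data$ons_label))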

Transferring categorical means to a new table

I'm fairly new to R, but I've tackled much larger challenges than my current problem, which makes it particularly frustrating. I searched the forums and found some related topics, but none would do the trick for this situation.
I've got a dataset with 184 observations of 14 variables:
> head(diving)
tagID ddmmyy Hour.GMT. Hour.Local. X0 X3 X10 X20 X50 X100 X150 X200 X300 X400
1 122097 250912 0 9 0.0 0.0 0.3 12.0 15.3 59.6 12.8 0.0 0 0
2 122097 260912 0 9 0.0 2.4 6.9 5.5 13.7 66.5 5.0 0.0 0 0
3 122097 260912 6 15 0.0 1.9 3.6 4.1 12.7 39.3 34.6 3.8 0 0
4 122097 260912 12 21 0.0 0.2 5.5 8.0 18.1 61.4 6.7 0.0 0 0
5 122097 280912 6 15 2.4 9.3 6.0 3.4 7.6 21.1 50.3 0.0 0 0
6 122097 290912 18 3 0.0 0.2 1.6 6.4 41.4 50.4 0.0 0.0 0 0
This is tagging data, with each date having one or more 6-hour time bins (not a continuous dataset due to transmission interruptions). In each 6-hour bin, the depths to which the animal dived are broken down, by %, into 10 bins. So X0 = % of time spent between 0-3m, X3= % of time spent between 3-10m, and so on.
What I want to do for starters is take the mean % time spent in each depth bin and plot it. To start, I did the following:
avg0<-mean(diving$X0)
avg3<-mean(diving$X3)
avg10<-mean(diving$X10)
avg20<-mean(diving$X20)
avg50<-mean(diving$X50)
avg100<-mean(diving$X100)
avg150<-mean(diving$X150)
avg200<-mean(diving$X200)
avg300<-mean(diving$X300)
avg400<-mean(diving$X400)
At this point, I wasn't sure how to then plot the resulting means, so I made them a list:
divingmeans<-list(avg0, avg3, avg10, avg20, avg50, avg100, avg150, avg200, avg300, avg400)
boxplot(divingmeans) sort of works, providing 1:10 on the X axis and the % 0-30 on the y axis. However, I would prefer a histogram, as well as the x-axis providing categorical bin names (e.g. avg3 or X3), rather than just a rank 1:10.
hist() and plot() provide the following:
> plot(divingmeans)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' is a list, but does not have components 'x' and 'y'
> hist(divingmeans)
Error in hist.default(divingmeans) : 'x' must be numeric
I've also tried:
> df<-as.data.frame(divingmeans)
> df
X3.33097826086957 X3.29945652173913 X8.85760869565217 X17.6461956521739 X30.2614130434783
1 3.330978 3.299457 8.857609 17.6462 30.26141
X29.3565217391304 X6.44510869565217 X0.664130434782609 X0.135869565217391 X0.0016304347826087
1 29.35652 6.445109 0.6641304 0.1358696 0.001630435
and
> df <- data.frame(matrix(unlist(divingmeans), nrow=10, byrow=T))
> df
matrix.unlist.divingmeans...nrow...10..byrow...T.
1 3.330978261
2 3.299456522
3 8.857608696
4 17.646195652
5 30.261413043
6 29.356521739
7 6.445108696
8 0.664130435
9 0.135869565
10 0.001630435
neither of which provide the sort of table I'm looking for.
I know there must be a really basic solution for converting this into an appropriate table, but I can't figure it out for the life of me. I'd like to be able to make a basic histogram showing the % of time spent in each diving bin, on average. It seems the best format for the data to be in for this purpose would be a table with two columns: col1=bin (category; e.g. avg50), and col2=% (numeric; mean % time spent in that category).
You'll also notice that the data is broken up into different timing bins; eventually I'd like to be able to separate out the data by time of day, to see if, for example, the average diving depths shift between day/night, and so on. I figure that once I have this initial bit of code worked out, I can then do the same by time-of-day by selecting, for example X0[which(Hour.GMT.=="6")]. Tips on this would also be very welcome.
I think you will find it far easier to deal with the data in long format.
You can reshape using reshape. I will use data.table to show how to easily calculate the means by group.
library(data.table)
DT <- data.table(diving)
DTlong <- reshape(DT, varying = list(5:14), direction = 'long',
times = c(0,3,10,20,50,100,150,200,300,400),
v.names = 'time.spent', timevar = 'hours')
timeByHours <- DTlong[,list(mean.time = mean(time.spent)),by=hours]
# you can then plot the two column data.table
plot(timeByHours, type = 'l')
You can now analyse by any combination of date / hour / time at depth
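For example (a sketch, keeping the column names created above), the same summary can be split by the 6-hour GMT bin as well as by depth:
# mean % time at each depth, separately for each GMT time bin
# ('hours' here holds the lower depth bound, as named in the reshape above)
DTlong[, list(mean.time = mean(time.spent)), by = list(hours, Hour.GMT.)]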
How would you like to plot them?
# grab the means of each depth column (the first four columns are id/time variables)
diving.means <- colMeans(diving[, -(1:4)])
# plot it
plot(diving.means)
# boxplot
boxplot(diving.means)
If you'd like to grab the lower bounds of the intervals from the column names, simply strip away the X:
lowerIntervalBound <- gsub("X", "", names(diving)[-(1:4)])
# you can convert these to numeric and plot against them
lowInts <- as.numeric(lowerIntervalBound)
plot(x=lowInts, y=diving.means)
# ... or taking log
plot(x=log(lowInts), y=diving.means)
# ... or as factors (similar to basic plot)
plot(x=factor(lowInts), y=diving.means)
Instead of putting the diving means in a list, try putting them in a vector (using c()).
If you want to combine it into a data.frame:
data.frame(lowInts, diving.means)
# or adding a row id if needed.
data.frame(rowid=seq(along=diving.means), lowInts, diving.means)
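A minimal sketch of that last suggestion, which also gives the categorical x-axis labels asked for (assuming the diving data frame from the question, with its first four columns being the id/time variables):
# named vector of column means, one per depth bin
diving.means <- colMeans(diving[, -(1:4)])
# bar chart; barplot() uses the names (X0, X3, ...) as x-axis labels
barplot(diving.means, xlab = "Depth bin", ylab = "Mean % of time")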
