Questions about how to divide and find averages of a dataset - r

Let's say I have a dataset where I have a list of names and their ages
Tom 65
Sam 40
Sue 88
Kay 4
Jon 25
Lia 85
Ian 39
Joe 10
Bea 17
Jan 43
Jen 17
Ike 24
Jay 35
Cam 77
Jin 12
Ron 1
Ray 45
Leo 29
Ken 98
Mel 56
Amy 49
Joy 67
Ivy 3
Noe 14
Max 31
Jax 61
Lee 19
Ace 28
Ben 5
Guy 74
I'm trying to divide the dataset into ten equal-sized bins in descending order of age (e.g. the first bin will have Ken, Sue, and Lia, and the last bin will have Kay, Ivy, and Ron), and I want to find the average age for each bin (so the average age for the first bin would be 90.33). I was able to do this quite easily in MS Excel, but I'm not sure how to do it efficiently in R. Any suggestions?

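We can use ntile to create ten equal-sized groups and then summarise by taking the mean. (cut with breaks = 10 would instead split the age range into ten equal-width intervals, which generally hold different numbers of rows.)
library(dplyr)
df1 %>%
group_by(grp = ntile(desc(v2), 10)) %>%
summarise(v1 = list(v1), v2 = mean(v2))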
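For comparison, here is a minimal base R sketch of the equal-count binning described in the question. It assumes the columns are named v1 (name) and v2 (age), as in the answer above, and that the number of rows is a multiple of ten; the names sorted, bin_means and bin_names are only illustrative.

```r
# Sample data from the question; v1/v2 are the column names assumed by the answer
df1 <- data.frame(
  v1 = c("Tom", "Sam", "Sue", "Kay", "Jon", "Lia", "Ian", "Joe", "Bea", "Jan",
         "Jen", "Ike", "Jay", "Cam", "Jin", "Ron", "Ray", "Leo", "Ken", "Mel",
         "Amy", "Joy", "Ivy", "Noe", "Max", "Jax", "Lee", "Ace", "Ben", "Guy"),
  v2 = c(65, 40, 88, 4, 25, 85, 39, 10, 17, 43,
         17, 24, 35, 77, 12, 1, 45, 29, 98, 56,
         49, 67, 3, 14, 31, 61, 19, 28, 5, 74),
  stringsAsFactors = FALSE
)

# Sort by age, oldest first
sorted <- df1[order(-df1$v2), ]

# Label consecutive rows 1..10 (assumes nrow is a multiple of 10)
sorted$bin <- rep(1:10, each = nrow(sorted) / 10)

# Mean age and member names per bin
bin_means <- tapply(sorted$v2, sorted$bin, mean)
bin_names <- tapply(sorted$v1, sorted$bin, paste, collapse = ", ")

print(data.frame(bin = names(bin_means), members = bin_names, mean_age = bin_means))
```

With the sample data, bin 1 contains Ken, Sue and Lia and has a mean age of about 90.33, matching the figure in the question.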

Related

Match strings by distance between non-equal length ones

Say we have the following datasets:
Dataset A:
name age
Sally 22
Peter 35
Joe 57
Samantha 33
Kyle 30
Kieran 41
Molly 28
Dataset B:
name company
Samanta A
Peter B
Joey C
Samantha A
My aim is to match the two datasets by name, order the resulting matches by distance, and keep only the relevant matches. In other words, the output should look like this:
name_a name_b age company distance
Peter Peter 35 B 0.00
Samantha Samantha 33 A 0.00
Samantha Samanta 33 A 0.04166667
Joe Joey 57 C 0.08333333
In this example I'm calculating the distance using method = "jw" in stringdist, but I'm happy with any other method that works. So far I've been experimenting with packages such as stringr and stringdist.
You can use stringdist_inner_join to join the two dataframes and use levenshteinSim to get the similarity between the two names.
library(fuzzyjoin)
library(dplyr)
stringdist_inner_join(A, B, by = 'name') %>%
mutate(distance = 1 - RecordLinkage::levenshteinSim(name.x, name.y)) %>%
arrange(distance)
# name.x age name.y company distance
#1 Peter 35 Peter B 0.000
#2 Samantha 33 Samantha A 0.000
#3 Samantha 33 Samanta A 0.125
#4 Joe 57 Joey C 0.250
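If you would rather avoid extra packages, the same idea can be sketched in base R with adist(), which computes the Levenshtein edit distance. This is only an illustration: the 0.25 cutoff and the normalisation by the longer name are assumptions chosen to reproduce the distances shown above, and the object names (pairs, matches) are hypothetical.

```r
A <- data.frame(
  name = c("Sally", "Peter", "Joe", "Samantha", "Kyle", "Kieran", "Molly"),
  age = c(22, 35, 57, 33, 30, 41, 28),
  stringsAsFactors = FALSE
)
B <- data.frame(
  name = c("Samanta", "Peter", "Joey", "Samantha"),
  company = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Build every (A row, B row) combination
idx <- expand.grid(a = seq_len(nrow(A)), b = seq_len(nrow(B)))
pairs <- data.frame(
  name_a = A$name[idx$a],
  name_b = B$name[idx$b],
  age = A$age[idx$a],
  company = B$company[idx$b],
  stringsAsFactors = FALSE
)

# Edit distance per pair, scaled by the length of the longer name
edits <- unname(mapply(function(x, y) adist(x, y)[1, 1], pairs$name_a, pairs$name_b))
pairs$distance <- edits / pmax(nchar(pairs$name_a), nchar(pairs$name_b))

# Keep close matches only (0.25 is an assumed cutoff) and sort by distance
matches <- pairs[pairs$distance <= 0.25, ]
matches <- matches[order(matches$distance), ]
print(matches)
```

Comparing every pair is fine for small tables like these, but the number of comparisons grows with nrow(A) * nrow(B), so a dedicated package is likely the better choice for large data.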

R Extract names from text

I'm trying to extract a list of rugby players' names from a string. The string contains all of the information from a table: the headers (team names) and the name of the player in each position for each team. It also has the player rankings, but I don't care about those.
Important: a lot of player rankings are missing. I found a solution, but it doesn't handle missing rankings (for example, in the string below, Rabah Slimani is the first player without a recorded ranking).
Note that the numbers 1-15 indicate positions, and there are always two names following each position (home player and away player).
Here's the sample string:
" Team Sheets # FRA France RPI IRE Ireland RPI 1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85 3 Rabah Slimani Tadhg Furlong 85 4 Arthur Iturria 82 Iain Henderson 84 5 Sebastien Vahaamahina 84 James Ryan 92 6 Wenceslas Lauret 82 Peter O'Mahony 93 7 Yacouba Camara 70 Josh van der Flier 64 8 Kevin Gourdon CJ Stander 91 9 Maxime Machenaud Conor Murray 87 10 Matthieu Jalibert Johnny Sexton 90 11 Virimi Vakatawa Jacob Stockdale 89 12 Henry Chavancy Bundee Aki 83 13 Rémi Lamerat Robbie Henshaw 78 14 Teddy Thomas Keith Earls 89 15 Geoffrey Palis Rob Kearney 80 Substitutes # FRA France RPI IRE Ireland RPI 16 Adrien Pelissie Sean Cronin 84 17 Dany Priso 70 Jack McGrath 70 18 Cedate Gomes Sa 71 John Ryan 86 19 Paul Gabrillagues 77 Devin Toner 90 20 Marco Tauleigne Dan Leavy 80 21 Antoine Dupont 92 Luke McGrath 22 Anthony Belleau 65 Joey Carbery 86 23 Benjamin Fall Fergus McFadden "
Note - it comes from here: https://www.rugbypass.com/live/six-nations/france-vs-ireland-at-stade-de-france-on-03022018/2018/info/
So basically what I want is just the list of names with the team names as the headers e.g.
France Ireland
Jefferson Poirot Cian Healy
Guilhem Guirado Rory Best
... ...
Any help would be much appreciated!
I tried this in an advanced text editor: I searched for occurrences of two consecutive numbers and replaced each one with a newline. The regex is
\d+\s+\d+
Once you are done replacing, you will be left with two names on each line, separated by a number. Then use the regex below to replace that number with a single tab
\s+\d+\s+
Hope that helps
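The two replacement steps above can also be run in R with gsub(). The sketch below applies them to a short excerpt of the sample string in which both rankings are present; it is an illustration of the two regexes only, and rows where a ranking is missing (such as position 3 in the full string) would not be split correctly this way. The variable names are hypothetical.

```r
# Excerpt of the sample string: positions 1 and 2, both rankings present
x <- " 1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85 "

# Step 1: two consecutive numbers (away ranking + next position) become a newline
step1 <- gsub("\\d+\\s+\\d+", "\n", x, perl = TRUE)

# Step 2: any remaining number surrounded by whitespace becomes a tab
step2 <- gsub("\\s+\\d+\\s+", "\t", step1, perl = TRUE)

# Split into rows, trim stray whitespace, then split each row on the tab
rows <- trimws(strsplit(step2, "\n", fixed = TRUE)[[1]])
players <- do.call(rbind, strsplit(rows, "\t", fixed = TRUE))
colnames(players) <- c("France", "Ireland")
print(players)
```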

Select rows based on the values in a column [duplicate]

I have a data frame with two columns, student names and marks:
Stud_name Marks
Jon 25
john 20
ajay 50
ram 27
jay 61
jess 46
troy 23
mike 42
steve 45
glenn 43
I want only a few of the names and their marks.
Expected output:
Stud_name Marks
john 20
ajay 50
jess 46
troy 23
ram 27
glenn 43
Please help.
I tried:
pd <- filter(df,Stud_name == "john" , "ajay" , "jess")
Error in filter_impl(.df, quo) :
Evaluation error: operations are possible only for numeric, logical or
complex types.
You can try this, if you are open to a base R solution:
# your data
dats <- read.table(text='Stud_name Marks
Jon 25
john 20
ajay 50
ram 27
jay 61
jess 46
troy 23
mike 42
steve 45
glenn 43',sep='', header=T)
# vector with chosen names
names <- c("john","ajay","jess")
dats[which(dats$Stud_name %in% names),]
or, without which() (thanks @markus):
dats[(dats$Stud_name %in% names),]
Stud_name Marks
2 john 20
3 ajay 50
6 jess 46
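As a further sketch, the six names from the expected output can be selected the same way. The attempt with == fails because == compares element by element and cannot take several alternatives as separate arguments, whereas %in% tests membership in a vector. The names wanted, picked and ordered are only illustrative; match() is used here on the assumption that you want the rows in the order you listed the names.

```r
dats <- data.frame(
  Stud_name = c("Jon", "john", "ajay", "ram", "jay",
                "jess", "troy", "mike", "steve", "glenn"),
  Marks = c(25, 20, 50, 27, 61, 46, 23, 42, 45, 43),
  stringsAsFactors = FALSE
)

wanted <- c("john", "ajay", "jess", "troy", "ram", "glenn")

# Rows whose name is in `wanted`, kept in their original order
picked <- dats[dats$Stud_name %in% wanted, ]

# Same rows, but arranged in the order given by `wanted`
ordered <- dats[match(wanted, dats$Stud_name), ]

print(picked)
print(ordered)
```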

Manipulating R Data Frames

I've currently got two separate data frames, excerpts as per below:
mydata
Player TG% Pts Team Opp Yr Rd Grnd
John 56 42 A 1 2015 1 Grnd1
James 94 64 B 2 2015 1 Grnd2
Jerry 85 78 C 3 2015 1 Grnd3
Daniel 97 51 D 4 2015 1 Grnd4
John 89 61 A 1 2015 1 Grnd2
James 65 26 B 4 2015 1 Grnd3
Jerry 73 34 C 3 2015 1 Grnd2
Daniel 73 40 D 2 2015 1 Grnd2
John 89 26 A 1 2015 1 Grnd3
James 92 42 B 3 2015 1 Grnd1
Jerry 89 25 C 2 2015 1 Grnd2
Daniel 80 41 D 4 2015 1 Grnd2
John 73 82 A 3 2015 1 Grnd3
James 73 41 B 4 2015 1 Grnd3
Jerry 89 76 C 2 2015 1 Grnd1
Daniel 91 77 D 1 2015 1 Grnd2
round
Team Opp Grnd
A 1 Grnd1
B 3 Grnd4
C 4 Grnd2
D 2 Grnd3
What I want to be able to do is manipulate this so that I generate a second data frame as per below
Player Gms Avg.Pts Avg.Last3 Avg.v.Opp Avg.#.Grnd
John
James
Jerry
Daniel
I know how to do this in Excel, but I'm struggling in R.
Gms - total number of games for each individual player (in Excel this would be COUNTIF)
Avg.Pts - the average of Pts for each Player name (in Excel this would be AVERAGEIF)
Avg.Last3 - the average of Pts for each Player over their last 3 games. Note that the data frame is ordered with the most recent games at the end.
Avg.v.Opp - the average of Pts for each player against the next opponent as defined in data frame round. For example, John plays for team A and his next opponent is Opp 1. (in Excel this would be AVERAGEIFS)
Avg.#.Grnd - the average of Pts for each player at the next ground as defined in data frame round. For example, John plays for team A and his next game is held at Grnd1. (in Excel this would be AVERAGEIFS)
I've tried using dplyr and a number of other options but haven't yet managed to put together something that works. Note that the mydata data frame runs to over 10,000 rows.
I think this will work. If you share your sample data with dput(), I'll be happy to copy/paste it and check (and debug if necessary).
First I'll do the easy ones, the ones that don't depend on round:
library(dplyr)
group_by(mydata, Player) %>%
summarize(Gms = n(),
Avg.Pts = mean(Pts),
Avg.Last3 = mean(tail(Pts, 3)))
I wanted to do that one separately to emphasize how clean dplyr can be for simple cases. All the "ifs" in your Excel commands are taken care of by the single group_by at the beginning. n() is the count, and mean() is the average. tail() is a handy base function that returns the end of a data frame or vector.
To add in the round data, we'll want to join the data frames together based on the Team column. We'll still want to be able to tell whether the other columns come from mydata or round, so I'll rename the round columns:
round = rename(round, next_opp = Opp, next_grnd = Grnd)
Then we'll start with the join and proceed as before. This time we do need some ifs at the end, which I'll do with a simple subset inside the mean calls:
left_join(mydata, round, by = "Team") %>%
# convert ground columns to character as discussed in comments
mutate(next_grnd = as.character(next_grnd),
Grnd = as.character(Grnd)) %>%
group_by(Player) %>%
summarize(Gms = n(),
Avg.Pts = mean(Pts),
Avg.Last3 = mean(tail(Pts, 3)),
Avg.v.Opp = mean(Pts[Opp == next_opp]),
Avg.at.Grnd = mean(Pts[Grnd == next_grnd]))
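To check the logic without dplyr, here is a rough base R sketch of the same steps on the sample data. Only the columns needed for the summary are included, the fixtures table stands in for the renamed round data frame, and summarise_player is a hypothetical helper. The lookup uses match() rather than merge() because merge() may reorder rows, which would break the "last 3 games" calculation.

```r
mydata <- data.frame(
  Player = rep(c("John", "James", "Jerry", "Daniel"), times = 4),
  Pts = c(42, 64, 78, 51, 61, 26, 34, 40, 26, 42, 25, 41, 82, 41, 76, 77),
  Team = rep(c("A", "B", "C", "D"), times = 4),
  Opp = c(1, 2, 3, 4, 1, 4, 3, 2, 1, 3, 2, 4, 3, 4, 2, 1),
  Grnd = c("Grnd1", "Grnd2", "Grnd3", "Grnd4", "Grnd2", "Grnd3", "Grnd2", "Grnd2",
           "Grnd3", "Grnd1", "Grnd2", "Grnd2", "Grnd3", "Grnd3", "Grnd1", "Grnd2"),
  stringsAsFactors = FALSE
)
fixtures <- data.frame(
  Team = c("A", "B", "C", "D"),
  next_opp = c(1, 3, 4, 2),
  next_grnd = c("Grnd1", "Grnd4", "Grnd2", "Grnd3"),
  stringsAsFactors = FALSE
)

# Look up each row's next opponent and ground without changing row order
i <- match(mydata$Team, fixtures$Team)
joined <- cbind(mydata, fixtures[i, c("next_opp", "next_grnd")])

# One summary row per player; mean() of an empty selection is NaN
summarise_player <- function(d) {
  data.frame(
    Player = d$Player[1],
    Gms = nrow(d),
    Avg.Pts = mean(d$Pts),
    Avg.Last3 = mean(tail(d$Pts, 3)),
    Avg.v.Opp = mean(d$Pts[d$Opp == d$next_opp]),
    Avg.at.Grnd = mean(d$Pts[d$Grnd == d$next_grnd])
  )
}
result <- do.call(rbind, lapply(split(joined, joined$Player), summarise_player))
print(result)
```

A NaN in the result means the player has no recorded game against that opponent or at that ground in the sample.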

How to get multiple means scores from a variable in R

Here is my example data file.
Animal Scores
1 Dogs 10
2 Dogs 11
3 Dogs 12
4 Dogs 13
5 Dogs 14
6 Dogs 15
7 Dogs 16
8 Dogs 17
9 Dogs 18
10 Dogs 19
11 Cats 20
12 Cats 21
13 Cats 22
14 Cats 23
15 Cats 24
16 Cats 25
17 Cats 26
18 Cats 27
19 Cats 28
20 Cats 29
21 Birds 30
22 Birds 31
23 Birds 32
24 Birds 33
25 Birds 34
26 Birds 35
27 Birds 36
28 Birds 37
29 Birds 38
30 Birds 39
I have only just begun to learn R and am a complete beginner in the coding world, so I have been doing this the very long way.
e.g.
#### Separate each animal out
dogs <- animaldata$Animal == "Dogs"
cats <- animaldata$Animal == "Cats"
birds <- animaldata$Animal == "Birds"
#### Get the means for each animal's scores
dogsmean <- mean(animaldata$Scores[dogs])
catsmean <- mean(animaldata$Scores[cats])
birdsmean <- mean(animaldata$Scores[birds])

#### Group all means and plot
finalmeans <- c(14.5, 24.5, 34.5)  # I manually type the numbers of all found means here
plot(finalmeans, type="o")
I would like an efficient way to get the mean scores for the dogs cats and birds and then plot the mean for each animal on a graph.
P.S. This is my first post! :) I am guessing I have broken most forum posting rules in the process. I'm still figuring it out. All feedback is welcome! :)
Your post is fine, don't worry. One way to obtain the result you described is to use the aggregate() function:
> aggregate( . ~ Animal, animaldata, mean)
# Animal Scores
#1 Birds 34.5
#2 Cats 24.5
#3 Dogs 14.5
If your data is in a data frame, you can use (factor() makes this work whether the first column is a factor or a character vector):
sapply(levels(factor(animaldata[,1])),function(x) mean(animaldata[animaldata[,1]==x,2]))
Birds Cats Dogs
34.5 24.5 14.5
I would use ddply from the plyr package...
library(plyr)
ddply(animaldata, .(Animal), summarise, mean(Scores))
The aggregate function might come in handy for your problem:
animaldata <- aggregate(animaldata[, 2], list(animaldata$Animal), mean)
> animaldata
Group.1 x
1 Birds 34.5
2 Cats 24.5
3 Dogs 14.5
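Here is code for a simple plot of this animal data (factor() ensures the group labels are treated as categories even if they are stored as character strings):
par(las=1)
plot(factor(animaldata$Group.1), animaldata$x, xlab="Animal", ylab="Mean Score")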
title("Mean Animal Scores")
I would suggest doing it the data.table way. It is one of the fastest packages for this kind of grouped operation.
library(data.table)
animaldata <- as.data.table(animaldata)
means <- animaldata[, .(mean = mean(Scores)), by = Animal]
plot(means$mean,type="o")
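If you prefer to stay in base R, tapply() is another option. The sketch below rebuilds the example data, computes one mean per animal, and plots the means with the animal names on the x-axis. The axis labels and the object name means are illustrative choices, not taken from the answers above.

```r
# Example data from the question: 10 scores each for Dogs, Cats and Birds
animaldata <- data.frame(
  Animal = rep(c("Dogs", "Cats", "Birds"), each = 10),
  Scores = 10:39
)

# One mean per animal; groups come back in alphabetical order
means <- tapply(animaldata$Scores, animaldata$Animal, mean)
print(means)

# Line plot of the means, labelling the x-axis with the animal names
plot(seq_along(means), means, type = "o", xaxt = "n",
     xlab = "Animal", ylab = "Mean score")
axis(1, at = seq_along(means), labels = names(means))
```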

Resources