I have a data set like the one below:
> head(worldcup)
Team Position Time Shots Passes Tackles Saves
Abdoun Algeria Midfielder 16 0 6 0 0
Abe Japan Midfielder 351 0 101 14 0
Abidal France Defender 180 0 91 6 0
Abou Diaby France Midfielder 270 1 111 5 0
Aboubakar Cameroon Forward 46 2 16 0 0
Abreu Uruguay Forward 72 0 15 0 0
Then there is code that computes the mean of certain variables:
wc_3 <- worldcup %>%
  select(Time, Passes, Tackles, Saves) %>%
  summarize(Time = mean(Time),
            Passes = mean(Passes),
            Tackles = mean(Tackles),
            Saves = mean(Saves))
and the output is:
> wc_3
Time Passes Tackles Saves
1 208.8639 84.52101 4.191597 0.6672269
Then I need to produce output like the below:
var mean
Time 208.8638655
Passes 84.5210084
Tackles 4.1915966
Saves 0.6672269
I tried to do it like this:
wc_3 <- worldcup %>%
  select(Time, Passes, Tackles, Saves) %>%
  summarize(Time = mean(Time),
            Passes = mean(Passes),
            Tackles = mean(Tackles),
            Saves = mean(Saves)) %>%
  gather(var, mean, Time:Saves, factor_key = TRUE)
The output is the same. My question: is there any way to produce the same output in a different way?
This is for a course, but my submission was rejected. I do not know why, but I have asked about this.
Please advise.
One option would be to gather first, group by 'Var', and summarise to get the mean of 'Val':
library(dplyr)
library(tidyr)
worldcup %>%
  gather(Var, Val, Time:Saves) %>%
  filter(Var != "Shots") %>%
  group_by(Var) %>%
  summarise(Mean = mean(Val))
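In newer versions of tidyr, gather() has been superseded by pivot_longer(); a minimal sketch of the same idea, assuming tidyr >= 1.0:
worldcup %>%
  pivot_longer(c(Time, Passes, Tackles, Saves),
               names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  summarise(mean = mean(val))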
Another option is to transpose your output wc_3, as follows:
result <- as.data.frame(t(wc_3))
Set the name of your "mean" variable:
names(result)[1] <- "mean"
The names of the columns from wc_3 have become rownames in 'result', so we need to get these as values of the column "var":
result$var <- rownames(result)
Set the rownames in our 'result' table as NULL:
rownames(result) <- NULL
Interchange the order of columns:
result <- result[,c(2,1)]
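Putting those steps together (assuming wc_3 is the one-row summary created above):
result <- as.data.frame(t(wc_3))  # column names become rownames
names(result)[1] <- "mean"
result$var <- rownames(result)
rownames(result) <- NULL
result <- result[, c(2, 1)]       # put var before mean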
I am trying to organize and mutate my data in R.
Essentially, I am trying to graph the average of B for ranges of values in A.
Original Data Set
A B
<dbl> <dbl>
1 200 28
2 1053 67.3
3 17000. 30
4 7565. 12
5 14525 56
6 3411 30
What I am trying to transform my data into
Ranges Average
0 - 999.99 23%
1000 - 1999.99 45%
2000 - 2999.99 32%
3000 - 3999.99 50%
This is what I have so far for this function
A1 <- read_excel("file")
DataRange <- data.frame( A= A1$C,
B= A1$R)
# Function 1
ranges1 <- DataRange %>% mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999))
The output of ranges1 is:
232 699.00 23.00000 (699,700]
233 445.00 33.00000 (445,446]
234 3112.00 28.00000 (3112,3113]
235 1235.00 98.00000 (1235,1236]
This is the fuller function I am working toward:
# Function 2
ranges1 <- DataRange %>% mutate(new_range=cut(A, breaks = seq(min(A),max(A)), by = 999)
%>% group_by(new_range)
%>% dplyr::summarize(mean_1 = mean(B))
%>% as.data.frame())
The output of ranges1 is:
Error in `mutate()`:
! Problem while computing `new_range = ... %>% as.data.frame()`.
Caused by error in `UseMethod()`:
! no applicable method for 'group_by' applied to an object of class "factor"
Run `rlang::last_error()` to see where the error occurred.
As you can tell, I am jumping the gun on the first problem, but the later function is where I am trying to take this expression.
I am really confused about how to fix the first function. Any suggestions?
This is a syntax error. You need to have the %>% pipes at the ends of lines, not at the start of lines. When a line ends after the mutate(), R thinks that command is complete; the next line then starts with %>% and the data never actually get piped through. Note also that by = 999 belongs inside seq(): as written, seq(min(A), max(A)) steps by 1 and the stray by argument is ignored by cut(), which is why you get width-1 intervals like (699,700].
Change it to this:
ranges1 <- DataRange %>%
  mutate(new_range = cut(A, breaks = seq(min(A), max(A), by = 999))) %>%
  group_by(new_range) %>%
  dplyr::summarize(mean_1 = mean(B)) %>%
  as.data.frame()
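If the goal is the fixed 0-999.99, 1000-1999.99, ... ranges from the desired output, one sketch is to build the breaks from 0 up past max(A) rather than from min(A); the bin width of 1000 and the open/closed ends here are assumptions based on the ranges shown above:
library(dplyr)

ranges1 <- DataRange %>%
  mutate(new_range = cut(A,
                         breaks = seq(0, max(A) + 1000, by = 1000),
                         right = FALSE, include.lowest = TRUE)) %>%
  group_by(new_range) %>%
  summarise(Average = mean(B)) %>%
  as.data.frame()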
I'm not able to slice according to the code specified. See a reproducible example below:
library(alr4)
library(tidyverse)
modelUN <- lm(fertility ~ ppgdp, data = UN11)
I want to label the two highest and lowest residuals.
library(broom)
UN11 <- UN11 %>% mutate(Residuals = augment(modelUN) %>% pull(.resid))
UN11 %>% arrange(Residuals) %>% slice_head(n = 2)
This does not give me the lowest residuals. I tried saving the dataset (thinking that it's fetching from the original df), but the result is the same. How should I go ahead?
slice_head or slice_tail returns the head or tail rows based on the n given. If the goal is to get both ends, we can use slice with the row indexes (1:2 for the head, and (n()-1):n() for the tail):
library(dplyr)
UN11 %>%
  dplyr::arrange(Residuals) %>%
  dplyr::slice(c(1:2, (n()-1):n()))
Or make use of row_number with head/tail
UN11 %>%
  dplyr::arrange(Residuals) %>%
  dplyr::slice(c(head(row_number(), 2), tail(row_number(), 2)))
# region group fertility ppgdp lifeExpF pctUrban Residuals
#1 Europe other 1.134 4477.7 78.40 49 -1.900575
#2 Europe other 1.450 1625.8 73.48 48 -1.675868
#3 Africa africa 6.300 1237.8 50.04 36 3.161712
#4 Africa africa 6.925 357.7 55.77 17 3.758539
Or using head:
UN11 %>%
  arrange(Residuals) %>%
  head(2)
# region group fertility ppgdp lifeExpF pctUrban Residuals
#1 Europe other 1.134 4477.7 78.40 49 -1.900575
#2 Europe other 1.450 1625.8 73.48 48 -1.675868
Or another option is slice_min/slice_max and bind them together with bind_rows (but it is less efficient and less direct than the index option in slice)
UN11 %>%
  slice_min(Residuals, n = 2) %>%
  bind_rows(UN11 %>%
              slice_max(Residuals, n = 2))
My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT98 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!) so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong. I changed your code a bit just to add another column to the long-format data table. Here is what I did in the end:
df %>%
  melt(id = c("ID", "DOB")) %>%
  tbl_df() %>%
  mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
So it looks something like this:
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
I'm on my phone so I can't check this right now, but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column, but where the condition does not hold it will be equal to NA.
As @Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package does exactly that. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
df %>%
  melt(id = "DOB") %>%
  tbl_df() %>%
  mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))  # parentheses matter: %in% binds tighter than binary "-"
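For reference, the same reshape can be written with tidyr, which has largely superseded reshape2 for wide-to-long pivots; a minimal sketch, assuming tidyr >= 1.0 and the df built above:
library(tidyr)

df %>%
  pivot_longer(-DOB, names_to = "variable", values_to = "value") %>%
  mutate(dummy = ifelse((value - DOB) %in% c(19, 20), 1, 0))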
@Warner shows a way to create the variable (or at least the 1s; the assumption is that the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but to have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
I would like to find the monthly usage of all the aircraft (based on tailnum).
Let's say this is required for some kind of maintenance activity that needs to be done after x number of trips.
As of now I am doing it like below:
library(nycflights13)
N14228 <- filter(flights,tailnum=="N14228")
by_month <- group_by(N14228 ,month)
usage <- summarise(by_month,freq = n())
freq_by_months<- arrange(usage, desc(freq))
This has to be done for all aircraft, and for that the above approach won't work, as there are 4044 distinct tailnums.
I went through the dplyr vignette and found an example that comes very close to this, but it is aimed at finding overall delays, as shown below:
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Apart from this, I tried using aggregate and apply but couldn't get the desired results.
Check out the data.table package.
library(data.table)
flt <- data.table(flights)
flt[, .N, by = c("tailnum", "month")]
tailnum month N
1: N14228 1 15
2: N24211 1 14
3: N619AA 1 1
4: N804JB 1 29
5: N668DN 1 4
---
37984: N225WN 9 1
37985: N528AS 9 1
37986: N3KRAA 9 1
37987: N841MH 9 1
37988: N924FJ 9 1
Here, .N gives the number of rows in each tailnum/month group.
Not sure if this is exactly what you're looking for, but regardless, for these kinds of counts, it's hard to beat data.table for execution speed and syntactical simplicity.
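For comparison, the same per-tailnum monthly count can be done in dplyr itself; a minimal sketch using count(), which is shorthand for group_by() plus summarise(n = n()) (keeping the freq_by_months name from the question):
library(dplyr)
library(nycflights13)

freq_by_months <- flights %>%
  count(tailnum, month) %>%   # one row per tailnum/month, count in column n
  arrange(desc(n))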
I chose the hflights dataset as an example.
I am trying to create a variable/column that contains the TailNum of the planes, but only for the planes that are among the 10% with the longest airtime.
install.packages("hflights")
library("hflights")
flights <- tbl_df(hflights)
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new_var=TailNum)
EDIT: The resulting dataframe has only 22208 obs instead of 227496. Is there a way to keep the original dataframe, but add a new variable with the TailNum for the planes with the top 10 percent airtime?
You don't need to refer to flights inside mutate() after the pipe.
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new = TailNum)
Also, new is a function, so best avoid that as a variable name. See ?new.
As an illustration:
flights <- tbl_df(hflights)
flights %>%
  filter(cume_dist(desc(AirTime)) < 0.1) %>%
  mutate(new_var = TailNum, new = TailNum) %>%
  select(AirTime, TailNum, new_var)
Source: local data frame [22,208 x 3]
AirTime TailNum new_var
1 255 N614AS N614AS
2 257 N627AS N627AS
3 260 N627AS N627AS
4 268 N618AS N618AS
5 273 N607AS N607AS
6 278 N624AS N624AS
7 274 N611AS N611AS
8 269 N607AS N607AS
9 253 N609AS N609AS
10 315 N626AS N626AS
.. ... ... ...
To retain all observations, lose the filter(). My normal approach is to use ifelse() instead. Others may be able to suggest a better solution.
f2 <- flights %>%
  mutate(cumdist = cume_dist(desc(AirTime)),
         new_var = ifelse(cumdist < 0.1, TailNum, NA)) %>%
  select(AirTime, TailNum, cumdist, new_var)
table(is.na(f2$new_var))
FALSE TRUE
22208 205288
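An equivalent sketch with dplyr::if_else() instead of ifelse(); if_else() is type-strict, so the NA has to be a character NA to match TailNum:
f2 <- flights %>%
  mutate(cumdist = cume_dist(desc(AirTime)),
         new_var = if_else(cumdist < 0.1, TailNum, NA_character_)) %>%
  select(AirTime, TailNum, cumdist, new_var)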