How to correctly slice after arrange? (R)

I'm not able to slice according to the code specified. See a reproducible example below:
library(alr4)
library(tidyverse)
modelUN <- lm(fertility ~ ppgdp, data = UN11)
I want to label the two highest and lowest residuals.
library(broom)
UN11 <- UN11 %>% mutate(Residuals = augment(modelUN) %>% pull(.resid))
UN11 %>% arrange(Residuals) %>% slice_head(n = 2)
This does not give me the lowest residuals. I tried saving the dataset (thinking that it's fetching from the original df), but the result is the same. How should I proceed?

slice_head and slice_tail return the first and last rows, respectively, based on the n given. To get both ends at once, we can use slice with row indices: 1:2 for the head and (n()-1):n() for the tail.
library(dplyr)
UN11 %>%
dplyr::arrange(Residuals) %>%
dplyr::slice(c(1:2, (n()-1):n()))
Or make use of row_number with head/tail
UN11 %>%
dplyr::arrange(Residuals) %>%
dplyr::slice(c(head(row_number(), 2), tail(row_number(), 2)))
# region group fertility ppgdp lifeExpF pctUrban Residuals
#1 Europe other 1.134 4477.7 78.40 49 -1.900575
#2 Europe other 1.450 1625.8 73.48 48 -1.675868
#3 Africa africa 6.300 1237.8 50.04 36 3.161712
#4 Africa africa 6.925 357.7 55.77 17 3.758539
and using head
UN11 %>%
arrange(Residuals) %>%
head(2)
# region group fertility ppgdp lifeExpF pctUrban Residuals
#1 Europe other 1.134 4477.7 78.40 49 -1.900575
#2 Europe other 1.450 1625.8 73.48 48 -1.675868
Or another option is slice_min/slice_max and bind them together with bind_rows (but it is less efficient and less direct than the index option in slice)
UN11 %>%
slice_min(Residuals, n = 2) %>%
bind_rows(UN11 %>%
slice_max(Residuals, n = 2))
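To make the two approaches easy to compare, here is a minimal sketch on a made-up data frame. The `toy` data and its `country` column are hypothetical stand-ins for UN11, not taken from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for UN11 with its Residuals column
toy <- data.frame(country = c("A", "B", "C", "D", "E", "F"),
                  Residuals = c(0.4, -1.9, 3.2, -0.3, 3.8, -1.7),
                  stringsAsFactors = FALSE)

# Index approach: sort once, then keep the first two and the last two rows
both_ends <- toy %>%
  arrange(Residuals) %>%
  slice(c(1:2, (n() - 1):n()))

# slice_min/slice_max approach: two separate selections stacked together
min_max <- bind_rows(slice_min(toy, Residuals, n = 2),
                     slice_max(toy, Residuals, n = 2))
```

Both objects should contain the same four rows, though not necessarily in the same order, since the index approach keeps the ascending sort throughout.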

Related

How to group and average a dataset by order then plot a broken line plot?

I have a data frame that contains ability estimates for 5000 examinees along with their test scores; both are continuous variables. Since there are too many examinees, it would be messy to plot all their scores, so I wish to draw a 'broken line plot' or 'conditional mean plot' that averages the test scores of several examinees with similar ability levels at a time and plots their average score against their average ability, like the plot below.
I already managed to do this with the code below.
df<-cbind(rnorm(100,set.seed(123)),sample(100,set.seed(123)),) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability variable
seq<-round(seq(from=1, to=nrow(df), length.out=10),0) #divide the data equally to nine groups (which is also gonna be the 9 points that appear in my plot)
b<-data.frame()
for (i in 1:9) {
b[i,1]<-mean(df[seq[i]:seq[i+1],1]) #calculate the mean of the ability by group
b[i,2]<-mean(df[seq[i]:seq[i+1],2]) # calculate the mean of test score by group
}.
I got the mean of the ability and test score using this for loop, and it looks like this
and finally, do the plot
plot(b$V1,b$V2, type='b',
xlab="ability",
ylab="score",
main="Conditional score")
This code meets my goal, but I can't help wondering whether there's a simpler way to do this. Drawing a broken line plot by averaging data sorted from low to high seems like a common task.
I wonder if there is any function or trick for this. All ideas are welcome! :)
Here is a solution to create the data to be plotted using dplyr:
library(dplyr)
set.seed(123)
df<-cbind(rnorm(100,1),sample(100,50)) %>%
as.data.frame() %>%
setNames(c("ability","score")) #simulate the dataset
df<-df[order(df$ability),] #sort the data from low to high according to the ability variable
df$id <- seq(1, nrow(df))
df %>% mutate(bin = ntile(id, 10)) %>%
group_by(bin) %>%
dplyr::summarize(meanAbility = mean(ability, na.rm=T),
meanScore = mean(score, na.rm=T)) %>%
as.data.frame()
bin meanAbility meanScore
1 1 -0.81312770 41.6
2 2 -0.09354171 52.3
3 3 0.29089892 54.4
4 4 0.68490709 45.8
5 5 0.93078744 59.8
6 6 1.17380069 34.0
7 7 1.42942368 41.3
8 8 1.64965315 40.1
9 9 1.95290596 35.6
10 10 2.50277510 52.9
I would approach the whole thing a bit differently (note also that your code has several errors and won't run the way you showed it).
The example below will lead to different numbers than yours (due to the random generation of numbers and your non-working code).
library(tidyverse)
df <- data.frame(ability = rnorm(100),
score = sample(100)) %>%
arrange(ability) %>%
mutate(seq = ntile(n = 9)) %>%
group_by(seq) %>%
summarize(mean_ability = mean(ability),
mean_score = mean(score))
which gives:
# A tibble: 9 x 3
seq mean_ability mean_score
<int> <dbl> <dbl>
1 1 -1.390807 45.25
2 2 -0.7241746 56.18182
3 3 -0.4315872 49
4 4 -0.2223723 48.81818
5 5 0.06313174 56.36364
6 6 0.3391321 42
7 7 0.6118022 53.27273
8 8 1.021438 50.54545
9 9 1.681746 53.54545
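Neither answer goes on to draw the plot itself, so here is a rough sketch of the full path from simulated data to the broken line plot. The seed and the names `sim` and `binned` are arbitrary choices for illustration, and the numbers will not match the outputs shown above.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only so the sketch is reproducible
sim <- data.frame(ability = rnorm(100), score = sample(100))

binned <- sim %>%
  mutate(bin = ntile(ability, 9)) %>%   # 9 roughly equal-sized groups, ordered by ability
  group_by(bin) %>%
  summarize(mean_ability = mean(ability),
            mean_score = mean(score))

# One point per bin, connected by line segments
plot(binned$mean_ability, binned$mean_score, type = "b",
     xlab = "ability", ylab = "score", main = "Conditional score")
```

Passing `ability` to `ntile()` directly means the data do not need to be sorted first; the bins are formed from the ranks of `ability`.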

Paired bar chart with conditional labeling based on multiple factors

I am trying to create a graphical output like the picture below for the following sample of data but the code I have included gives an error:
Error in mutate_impl(.data, dots) : Evaluation error: Column n must be length 43 (the number of rows) or one, not 42.
My goal is to plot all providers from the same location on the same chart and then include only one name on the axis, so that each provider can see how they compare to others in their area without revealing the identity of the other providers. I have tried specifying n = 43 (the length of the full dataset) but have not had any success. Additionally, I would like to do a paired bar chart to show how each provider compares to their previous month's rates.
Provider Month Payment Location
Andrew 2 32.62 OH
Dillard 2 40 OH
Henry 2 32.28 OH
Lewis 2 47.79 IL
Marcus 2 73.04 IL
Matthews 2 45.22 NY
Paul 2 65.73 NY
Reed 2 27.67 NY
Andrew 1 33.23 OH
Dillard 1 36.63 OH
Henry 1 42.68 OH
Lewis 1 71.45 IL
Marcus 1 39.51 IL
Matthews 1 59.11 NY
Paul 1 27.67 NY
Reed 1 28.78 NY
library(tidyverse)
library(purrr)
df <- 1:nrow(PaymentsFeb) %>%
purrr::map( ~PaymentsFeb) %>%
set_names(PaymentsFeb$Provider) %>%
bind_rows(.id = "ID") %>%
nest(-ID) %>%
mutate(Location=map2(data,ID, ~.x %>% filter(Provider == .y) %>% select(Location))) %>%
mutate(data=
map2(data, ID, ~.x %>%
mutate(n=paste0("#", sample(seq_len(n()), size = n())),
Provider=ifelse(Provider == .y, as.character(Provider), n),
Provider=factor(Provider, levels = c(.y, sample(n, n())))))) %>%
mutate(plots=map2(data,Location, ~ggplot(data=.x,aes(x = Provider, y = scores, fill = scores))+
geom_col() +geom_text(aes(label=Per.Visit.Bill.Rate), vjust=-.3)+
ggtitle("test scores by Location- February 2018", subtitle = .y$Location)
))

dplyr calculations involving two columns of a data frame (R)

I'm pretty new to R and couldn't find a clear answer to my question after extensively searching the web. I'm trying to get dplyr functions to do the following task:
I have the following data.frame as a tibble: columns starting with X. indicate different samples, and rows indicate how much a specific gene is expressed.
head(immgen_dat)
# A tibble: 6 x 212
ProbeSetID GeneName Description X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
<int> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10344620 " Gm1056~ " predicted gene 1~ 15.6 15.3 17.2 16.1 18.1
2 10344622 " Gm1056~ " predicted gene 1~ 240. 255. 224. 312. 272.
3 10344624 " Lypla1" " lysophospholipas~ 421. 474. 349. 478. 459.
4 10344633 " Tcea1" " transcription el~ 802. 950. 864. 968. 1056.
5 10344637 " Atp6v1~ " ATPase H+ transp~ 199. 262. 167. 267. 255.
6 10344653 " Oprk1" " opioid receptor ~ 14.8 12.8 18.0 13.2 15.3
# ... with 204 more variables: X.proB_FrBC_FL. <dbl>,
I added a mean expression variable at the end for each gene by using the following code (the range of variables are the first and the last sample):
immgen_avg <- immgen_dat %>%
rowwise() %>%
mutate(Average = mean(X.proB_CLP_BM.:X.MLP_FL.))
Here, I have a quick question: The returned mean value I get from this code doesn't match the average I calculated elsewhere (in Excel). I don't think there are any missing values.
What I'd like to do is the following: For each gene, I'd like to compare the sample values with the average value and calculate a log2-fold difference (log2 difference of gene expression in a sample compared to the average expression value across all the samples). I'd like to store this dataframe with the name of immgen_log2 and do some subsequent analyses. In this new data frame, I'd like to keep the gene names because I'm thinking to merge this with another data table to compare log2 change between different experiments.
What is the best way of doing this? I appreciate your answers.
I will explain what is happening in a short while, but one way to solve for the row means of your intended variables is:
immgen_dat %>%
mutate(Average = apply(.[, 4:8], 1, mean)) %>%
select(Average)
# Average
# 1 16.46
# 2 260.60
# 3 436.20
# 4 928.00
# 5 230.00
# 6 14.82
To see what is happening with your code, we can use the do function as follows:
df2 <- immgen_dat %>%
rowwise() %>%
do(Average = .$X.proB_CLP_BM.:.$X.proB_FrBC_BM.)
df2$Average[1]
# [[1]]
# [1] 15.6 16.6 17.6
You will see that : generates a sequence from 15.6 in steps of 1. You can see this explained in more detail by typing help(":"). So in
immgen_dat %>%
rowwise() %>%
mutate(Average = mean(X.proB_CLP_BM.:X.proB_FrBC_BM.))
you are computing the means of the values of these sequences.
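A small base R check makes the difference concrete. It uses the five sample values from the first row of the table above; the variable names are only for illustration.

```r
# First row of the five sample columns shown in head(immgen_dat)
x <- c(15.6, 15.3, 17.2, 16.1, 18.1)

# What `first:last` actually builds: a sequence from 15.6 in steps of 1, up to 18.1
seq_vals <- x[1]:x[5]

mean_of_sequence <- mean(seq_vals)  # mean of 15.6, 16.6, 17.6
mean_of_values <- mean(x)           # mean of the five actual measurements

print(seq_vals)
print(c(mean_of_sequence, mean_of_values))
```

The two means differ because the sequence only depends on the first and last values, not on the columns in between.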
Edit
The logarithm of a ratio is of course the difference of the logarithms (provided the denominator is nonzero). So if you are trying to find the difference between the log2 of each of the other numerical variables and the log2 of the Average, you can do something like:
# sample columns 4:8 as in the head() shown above; adjust the range for the full data
immgen_log2 <- immgen_dat
avg <- rowMeans(immgen_dat[, 4:8])
immgen_log2[, 4:8] <- log2(immgen_dat[, 4:8]) - log2(avg)
I'm not entirely sure I understand what you need to do, but whenever you use dplyr or the tidyverse in general (also ggplot2), a long representation of your data works best. I assume that you want to calculate the mean of all variables starting with X. for each ProbeSetID. Then, for each X.-column and ProbeSetID, calculate the ratio and take log2, i.e. log2(X.bla/mean):
df <- read.table(text = 'ProbeSetID X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
10344620 15.6 15.3 17.2 16.1 18.1
10344622 240. 255. 224. 312. 272.
10344624 421. 474. 349. 478. 459.
10344633 802. 950. 864. 968. 1056.
10344637 199. 262. 167. 267. 255.
10344653 14.8 12.8 18.0 13.2 15.3', header = T)
library(dplyr)
library(tidyr)
result <-
df %>%
# transform to long:
gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>%
# group by IDs, ie make rowwise calculations if it was still wide, but faster:
group_by(ProbeSetID) %>%
# calculate group-mean on the fly and calculate log-ratio directly:
mutate(log2_ratio = log2(value / mean(value)))
# transform back to wide, if needed:
result %>%
# remove initial values to have only 1 value variable:
select(-value) %>%
# go back to wide:
spread(key = key, value = log2_ratio)
# or, if you want to keep all values:
df %>%
# transform to long:
gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>%
# group by IDs, ie make rowwise calculations if it was still wide, but faster:
group_by(ProbeSetID) %>%
# calculate the mean of each observation:
mutate(mean_value = mean(value)) %>%
# go back to wide:
spread(key, value) %>%
# now do the transformation to each variable that begins with X.:
mutate_at(.vars = vars(matches("^X\\.")),
# funs() is deprecated in current dplyr; a named list with a formula does the same
.funs = list(log2_ratio = ~ log2(. / mean_value)))
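For readers on a recent tidyr, the same long-then-wide idea can be sketched with `pivot_longer()` and `pivot_wider()`, which supersede `gather()` and `spread()`. This is only an illustration on the first two rows and three sample columns of the example table, not a replacement for the answer's code.

```r
library(dplyr)
library(tidyr)

# First two rows and three sample columns of the example table above
df <- data.frame(ProbeSetID = c(10344620, 10344622),
                 X.proB_CLP_BM. = c(15.6, 240),
                 X.proB_CLP_FL. = c(15.3, 255),
                 X.proB_FrA_BM. = c(17.2, 224))

log2_wide <- df %>%
  # long format: one row per ProbeSetID and sample
  pivot_longer(starts_with("X."), names_to = "key", values_to = "value") %>%
  group_by(ProbeSetID) %>%
  # log2 ratio of each sample to the per-gene mean
  mutate(log2_ratio = log2(value / mean(value))) %>%
  ungroup() %>%
  select(-value) %>%
  # back to wide: one column per sample
  pivot_wider(names_from = key, values_from = log2_ratio)
```

Keeping `ProbeSetID` (or the gene name) as the identifying column through the reshape is what allows a later join with another table.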

Another way to do pivot table in R

I have data set like below:
> head(worldcup)
Team Position Time Shots Passes Tackles Saves
Abdoun Algeria Midfielder 16 0 6 0 0
Abe Japan Midfielder 351 0 101 14 0
Abidal France Defender 180 0 91 6 0
Abou Diaby France Midfielder 270 1 111 5 0
Aboubakar Cameroon Forward 46 2 16 0 0
Abreu Uruguay Forward 72 0 15 0 0
Then there is code that computes the mean of certain variables:
wc_3 <- worldcup %>%
select(Time, Passes, Tackles, Saves) %>%
summarize(Time = mean(Time),
Passes = mean(Passes),
Tackles = mean(Tackles),
Saves = mean(Saves))
and the output is:
> wc_3
Time Passes Tackles Saves
1 208.8639 84.52101 4.191597 0.6672269
Then I need to produce output like the below:
var mean
Time 208.8638655
Passes 84.5210084
Tackles 4.1915966
Saves 0.6672269
I tried to do like this:
wc_3 <- worldcup %>%
select(Time, Passes, Tackles, Saves) %>%
summarize(Time = mean(Time),
Passes = mean(Passes),
Tackles = mean(Tackles),
Saves = mean(Saves)) %>%
gather(var, mean, Time:Saves, factor_key=TRUE)
The output is the same. My question: is there any way to produce the same output in a different way?
This is for a course, but my submission was rejected. I do not know why, so I am asking about it here.
Please advise.
One option will be to gather first, group by 'Var' and summarise to get the mean of 'Val'
library(dplyr)
library(tidyr)
worldcup %>%
gather(Var, Val, Time:Saves) %>%
filter(Var!= "Shots") %>%
group_by(Var) %>%
summarise(Mean = mean(Val))
Another option is to transpose your output wc_3, as follows:
result <- as.data.frame(t(wc_3))
Set the name of your "mean" variable:
names(result)[1] <- "mean"
The names of the columns from wc_3 have become rownames in 'result', so we need to get these as values of the column "var":
result$var <- rownames(result)
Set the rownames in our 'result' table as NULL:
rownames(result) <- NULL
Interchange the order of columns:
result <- result[,c(2,1)]
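As yet another possible route, the same two-column layout can be built in base R from `colMeans()`. This is a sketch on a hypothetical three-row stand-in for worldcup (the values are taken from the first rows of the head() output), so the means differ from those in the question.

```r
# Hypothetical stand-in for worldcup, using the first three rows shown above
wc <- data.frame(Time = c(16, 351, 180),
                 Passes = c(6, 101, 91),
                 Tackles = c(0, 14, 6),
                 Saves = c(0, 0, 0))

# Named numeric vector of column means
m <- colMeans(wc[, c("Time", "Passes", "Tackles", "Saves")])

# One row per variable: names become the 'var' column, values the 'mean' column
result <- data.frame(var = names(m), mean = unname(m))
print(result)
```

This avoids the transpose and the row-name bookkeeping, at the cost of leaving the dplyr pipeline.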

Use dplyr's filter and mutate to generate a new variable

I chose the hflights dataset as an example.
I am trying to create a variable/column that contains the "TailNum" of the planes, but only for the planes that are among the 10% with the longest airtime.
install.packages("hflights")
library("hflights")
flights <-tbl_df(hflights)
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new_var=TailNum)
EDIT: The resulting dataframe has only 22208 obs instead of 227496. Is there a way to keep the original dataframe, but add a new variable with the TailNum for the planes with top-10-percent airtime?
You don't need the flights in mutate() after the pipe.
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new = TailNum)
Also, new is a function, so best avoid that as a variable name. See ?new.
As an illustration:
flights <-tbl_df(hflights)
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>%
mutate(new_var = TailNum, new = TailNum) %>%
select(AirTime, TailNum, new_var)
Source: local data frame [22,208 x 3]
AirTime TailNum new_var
1 255 N614AS N614AS
2 257 N627AS N627AS
3 260 N627AS N627AS
4 268 N618AS N618AS
5 273 N607AS N607AS
6 278 N624AS N624AS
7 274 N611AS N611AS
8 269 N607AS N607AS
9 253 N609AS N609AS
10 315 N626AS N626AS
.. ... ... ...
To retain all observations, lose the filter(). My normal approach is to use ifelse() instead. Others may be able to suggest a better solution.
f2 <- flights %>% mutate(cumdist = cume_dist(desc(AirTime)),
new_var = ifelse(cumdist < 0.1, TailNum, NA)) %>%
select(AirTime, TailNum, cumdist, new_var)
table(is.na(f2$new_var))
FALSE TRUE
22208 205288
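The ifelse() pattern can be tried without installing hflights. The sketch below uses a made-up 25-row data frame; `toy` and its values are hypothetical, and only the column names follow the question.

```r
library(dplyr)

# Hypothetical stand-in for hflights: 25 planes with increasing air time
toy <- data.frame(TailNum = paste0("N", 1:25),
                  AirTime = seq(10, 250, by = 10),
                  stringsAsFactors = FALSE)

flagged <- toy %>%
  mutate(cumdist = cume_dist(desc(AirTime)),              # small values = longest air time
         new_var = ifelse(cumdist < 0.1, TailNum, NA))    # keep TailNum only for the top 10%

table(is.na(flagged$new_var))
```

All 25 rows are kept; only the rows whose cumulative distribution value falls below 0.1 carry a TailNum in `new_var`, and the rest are NA.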
