Making quick calculations on subsets with R

Hello, and thanks to all in advance.
I have the following data:
set.seed(123)
data <- data.frame(name = LETTERS[sample(1:26, 500, replace = TRUE)],
                   present = sample(0:1, 500, replace = TRUE))
And I want to quickly calculate the percentage of present observations (1's) for each letter. I can do it manually, but I believe there is an easier way to do this:
library(dplyr)
A <- filter(data, name=="A" & present==1)
A2 <- filter(data, name=="A")
data$Percentage[data$name=="A"] <- nrow(A)/nrow(A2)
And so on until I arrive at "Z".
Can I do this automatically, without having to change the value of the "name" column manually?
Best regards,

We can use prop.table with table to get the proportion
prop.table(table(data), 1)[,2]
To add it as a column, we can expand it by matching against the 'name' values:
data$Percentage <- prop.table(table(data), 1)[,2][as.character(data$name)]
Or, as @Lars Lau Raket suggested, we don't need to convert to character:
prop.table(table(data), 1)[,2][data$name]
If we need to create a column
library(dplyr)
data %>%
  group_by(name) %>%
  mutate(Percentage = mean(present == 1))
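For comparison, the same per-row column can be added in base R with ave(), whose default FUN is mean (and the mean of a 0/1 vector is exactly the proportion of 1s):
# base R equivalent: group mean of a 0/1 vector = proportion of 1s
data$Percentage <- ave(data$present, data$name)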

Related

How to use apply to change the elements of one data frame based on the columns of another?

I have a data frame where two columns mark the beginning and end of regions I need to manipulate in another data frame. Instead of using a for loop, I decided to create a logical vector flagging the rows I'm interested in:
df <- data.frame(b = c(7, 25, 32, 44), e = c(11, 27, 39, 48), n = c('a', 'b', 'c', 'd'))
logint <- rep(FALSE, 50)
log_vec <- apply(df[, c('b', 'e')], 1, function(x) {
  logint[x['b']:x['e']] <- TRUE
  return(logint)
})
However, the result is a matrix with one column for each row of df. I know I can collapse this with
log_vec <- Reduce(`|`,as.data.frame(log_vec))
but if the number of rows in df is too large, there is not enough memory to allocate the matrix resulting from apply.
Do you have a better solution?
Thanks!
We can use mapply/Map to create a sequence between each pair of b and e values and set those positions to TRUE:
logint <- rep(FALSE,50)
logint[unlist(Map(`:`, df$b, df$e))] <- TRUE
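A quick sanity check against the matrix-plus-Reduce result from the question (the apply() call below is the OP's own, lightly tidied):
log_vec <- apply(df[, c('b', 'e')], 1, function(x) {
  v <- rep(FALSE, 50)
  v[x['b']:x['e']] <- TRUE
  v
})
identical(logint, Reduce(`|`, as.data.frame(log_vec))) # should be TRUE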
We can also do this with map2:
library(dplyr)
library(purrr)
df %>%
  transmute(new = map2(b, e, `:`)) %>%
  pull(new) %>%
  flatten_int() %>%
  replace(logint, ., TRUE)

How do I aggregate certain columns from data frame by a Unique ID?

I have daily Statcast data going back to 2016, and I am attempting to aggregate it to find the mean for each PitcherID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by = list(pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)
This aggregates every single column, but I am looking to aggregate only certain columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can use QAsena's approach and add the summarise_at function like so:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
Replace the first argument (pitchingstat) with the name of the column you want to aggregate (or a vector of column names).
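In base R, the formula interface of aggregate() also lets you name just the columns to summarise; col1 and col2 below are placeholders for your actual column names:
# aggregate only col1 and col2 by PitcherID (col1/col2 are hypothetical)
aggpitch <- aggregate(cbind(col1, col2) ~ PitcherID,
                      data = pitchingstat, FUN = mean, na.rm = TRUE)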
How about:
library(tidyverse)
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(variable)) # replace 'variable' with your variable of interest
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
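For reference, a minimal made-up dataset to test either pipeline on (PitcherID, var_1, and var_2 are stand-ins for the real columns):
set.seed(1)
pitchingstat <- data.frame(PitcherID = rep(1:3, each = 4),
                           var_1 = rnorm(12),
                           var_2 = rnorm(12))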

Why are the numbers not mapped to each row?

So I am trying to find the number of occurrences of each name in another dataset. The code I am trying to run is:
Data$Count <- grep(Data$Name, OtherDataSet$LeadName) %>% length()
The issue is when I run this, the number for the first name gets mapped to each spot in that column. Why is this happening?
grep() only uses the first element of its pattern argument (with a warning), so grep(Data$Name, OtherDataSet$LeadName) searches for "Dog" alone; length() then returns that single count, which gets recycled down the whole column. Compute one count per name instead:
library(tidyverse)
Data <- tibble(Name = c("Dog", "Cat", "Bird"))
OtherDataSet <- tibble(LeadName = c("Frog", "Cat", "Catfish", "BirdOfPrey", "Bird", "Bird"))
Data <- Data %>%
  mutate(Count = map_int(Name, ~ sum(str_detect(OtherDataSet$LeadName, .x))))
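With the argument order corrected, the counts for the toy data come out as intended: Dog matches nothing (0), Cat matches "Cat" and "Catfish" (2), and Bird matches "BirdOfPrey" and the two "Bird" rows (3).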

How to analyse a data set both grouped by and ungrouped in one analysis using dplyr

This is my first stackoverflow question.
I'm trying to use dplyr to process and output a summary of data grouped by a categorical variable (inj_length_cat3) in my dataset; I actually generate this variable (from inj_length) on the fly using mutate(). I also want to output the same summary of the data without grouping. The only way I've figured out how to do that is to run the analysis twice, once with and once without grouping, and then combine the outputs. Ugh.
I'm sure there is a more elegant solution than this and it bugs me. I wonder if anyone would be able to help.
Thanks!
library(dplyr)
df <- data.frame(year = sample(c(2005, 2006), 20, replace = TRUE),
                 inj_length = sample(1:10, 20, replace = TRUE),
                 hiv_status = sample(0:1, 20, replace = TRUE))
tmp <- df %>%
  mutate(inj_length_cat3 = cut(inj_length, breaks = c(0, 3, 100),
                               labels = c('<3 years', '>3 years'))) %>%
  group_by(year, inj_length_cat3) %>%
  summarise(
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  ) %>%
  filter(inj_length_cat3 %in% c('<3 years', '>3 years'))
tmp_all <- df %>%
  group_by(year) %>%
  summarise(
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  )
tmp_all$inj_length_cat3 <- as.factor('All')
tmp <- merge(tmp_all, tmp, all = TRUE)
I'm not sure you'll consider this more elegant, but you can make it work if you first create a data frame that contains all your data twice: once to get the subgroups and once to get the overall summary:
df1 <- rbind(df, df)
df1$inj_length_cat3 <- cut(df1$inj_length, breaks = c(0, 3, 100, Inf),
                           labels = c('<3 years', '>3 years', 'All'))
df1$inj_length_cat3[-(1:nrow(df))] <- "All"
Now you just need to run your first analysis without mutate():
tmp <- df1 %>%
  group_by(year, inj_length_cat3) %>%
  summarise(
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  ) %>%
  filter(inj_length_cat3 %in% c('<3 years', '>3 years', 'All'))
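If duplicating the data rubs you the wrong way, another option is to keep the original two-pass idea but factor the repeated summarise() into a helper and stack the two results with bind_rows(); this is only a sketch of that refactoring, not a tested drop-in:
library(dplyr)

# helper: the summary from the question, applied to an already-grouped data frame
summarise_hiv <- function(d) {
  d %>%
    summarise(
      r = sum(hiv_status, na.rm = TRUE),
      n = length(hiv_status),
      p = prop.test(r, n)$estimate,
      cilow = prop.test(r, n)$conf.int[1],
      cihigh = prop.test(r, n)$conf.int[2]
    )
}

tmp <- bind_rows(
  df %>%
    mutate(inj_length_cat3 = cut(inj_length, breaks = c(0, 3, 100),
                                 labels = c('<3 years', '>3 years'))) %>%
    group_by(year, inj_length_cat3) %>%
    summarise_hiv(),
  df %>%
    group_by(year) %>%
    summarise_hiv() %>%
    mutate(inj_length_cat3 = factor('All'))
)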

How could I reduce a dataframe in R with aggregate (or similar) to only retain the 100 highest values for each group?

I have a data frame like this:
probe.id       gene.name  variance    database
A_23_P100002   FAM174B    0.93285966  Database1
A_23_P100013   AP3S2      0.48936044  Database1
...
A_23_P100020   RBPMS2     0.77441359  Database2
A_23_P100072   AVEN       0.36194383  Database2
...
I am interested in reducing this data frame so that only the 100 genes with the highest variance per database remain. It seems that aggregate could do the job, but I can't work out what function I would pass to it. I would greatly appreciate any help.
Thank you!
There are a lot of ways to skin this cat, so you'll get a variety of answers. In base R this one should work pretty well (note that you want rank(), not order(): order() returns a permutation of indices, not each value's rank within its group):
o <- ave(dat$variance, dat$database, FUN = function(x) rank(-x, ties.method = "first"))
dat100 <- dat[o <= 100, ]
Try this:
library(dplyr)
myData %>%
  group_by(database) %>%
  arrange(desc(variance)) %>%
  slice(1:100)
Try data.table:
library(data.table)
# setDT(DF) converts DF to a data.table in place (revert with setDF(DF));
# head(.SD, 100) avoids NA padding when a group has fewer than 100 rows
setDT(DF)[order(-variance), head(.SD, 100), by = database]
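If you want to compare the three approaches, a made-up dataset along the lines of the question (names and values invented) is enough:
set.seed(42)
dat <- data.frame(probe.id = paste0("A_23_P", 1:1000),
                  gene.name = paste0("gene", 1:1000),
                  variance = runif(1000),
                  database = sample(c("Database1", "Database2"), 1000, replace = TRUE))
# each method should keep the same 100 rows per database (possibly in a different order)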
