Need help organizing and summarizing column data into R Markdown - r

Sorry if this is an easy question, but I have a problem.
I have a .csv file imported into RStudio. The picture linked below is an example of what it looks like. I want to create individual data frames for each type (BMW, Mercedes, Honda) and then create summary statistics for each subsetted data frame.
example
I am so lost that I can't even figure out a correct title for this question. Any help would be appreciated.

Creating individual data.frames for each type can be done with the split() function; you can then calculate summary statistics for each data.frame by using lapply() on the resulting list of data frames.
split_dfs <- split(your_data, your_data$type)
summary_stats <- lapply(split_dfs, function(x) {
  data.frame(
    mean_price = mean(x$price)
  )
})
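Each element of both lists is named after a type, so a single group's data or its summary can be pulled out by name (type names taken from the question), e.g.:
split_dfs[["BMW"]]       # the BMW-only data frame
summary_stats[["BMW"]]   # its summary statistics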
A more modern approach is not to create separate data.frames but to work with a grouped data.frame, using group_by() and summarise() from the dplyr package.
library(dplyr)
your_data %>%
  group_by(type) %>%
  summarise(
    mean_price = mean(price)
  )
Another package that makes the computation easier, and above all faster for large datasets with many groups, is data.table. The computation would look something like this:
library(data.table)
your_dt <- as.data.table(your_data)
summary_stats <- your_dt[, .(mean_price = mean(price)), by = "type"]

Related

Transposing a CSV in R so each row contains one data point

I am trying to manipulate a CSV in R to match a very specific formatting need. Pretty confident I can nest a few loops to write to a file, but I'm hoping there's an easier way in R.
Before:
1,2,400,410,420,430,450,490,75700,75701,77035,77035,77035,77035
*,Facility Name,1234 Test Street,Michigan,49503,123-456-7891,,MPI_ID_TYPE,1,Sober Living Community,Clothing,Diapers,Food Pantry
After:
1,*
2,Facility Name
400,123 Test Street
410,TestCity
420,TestState
430,12345
450,123-456-7891
75700,MPI_ID_TYPE
75701,1
77035,Sober Living Community
77035,Clothing
77035,Diapers
77035,Food Pantry
So far, I have ingested data and manipulated it to achieve the Before chunk.
You can use t() to transpose data:
df <- mtcars[, 1:2]
df2 <- t(df)   # rows become columns
df3 <- t(df2)  # transposing again restores the original layout
To further address your question, I keep only one row of the mtcars dataset and then use pivot_longer() from tidyr to transpose it, keeping the original headers as a column in the data.
library(tidyr)
df <- mtcars[1,]
df2 <- pivot_longer(df, cols = everything())
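Applied to the Before data above, a minimal base-R sketch (assuming the two comma-separated rows live in a file, hypothetically named before.csv) would pair each code in row 1 with its value in row 2 and write the two-column result:
# read both rows as character data; row 1 is data, not a header
before <- read.csv("before.csv", header = FALSE, colClasses = "character")
# pair row 1 (codes) with row 2 (values)
after <- data.frame(code = unlist(before[1, ]), value = unlist(before[2, ]))
write.table(after, "after.csv", sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)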

Applying dplyr's tally over a large number of columns to create a codebook

I have a dataframe of 100+ variables and I would like to create a codebook to see the frequencies of each variable (and ideally output this to Excel). Right now, I'm using the following code:
freq_fun <- function(var) {
  var <- enquo(var)
  frequencies <- raw %>% group_by(group, !!var) %>% tally()
  return(frequencies)
}
I added the return() in the hope that looping over column names would at least show me the output, but this was unsuccessful.
At this point, my plan is to do the following:
for (i in colnames(rawxl[, 9:107])) {
  assign(paste0(i, "freq"), freq_fun(!!i))
}
output each data frame to a CSV, and then copy and paste everything into one Excel doc. This is undesirable for obvious reasons, but I can't see a clear way around it. What is a better way to do this?
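For what it's worth, a sketch of one way the loop could be expressed without assign(): collect every tally into a named list and write one Excel sheet per variable (the column range, rawxl, and group come from the question; using the writexl package is an assumption):
library(dplyr)
library(purrr)
library(writexl)
vars <- names(rawxl)[9:107]
# one frequency table per variable, grouped by `group` and the variable itself
freqs <- map(vars, function(v) {
  rawxl %>% group_by(group, across(all_of(v))) %>% tally()
})
names(freqs) <- vars
# write_xlsx() writes a named list of data frames as one sheet each
write_xlsx(freqs, "codebook.xlsx")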

How to analyse a data set both grouped by and ungrouped in one analysis using dplyr

This is my first stackoverflow question.
I'm trying to use dplyr to process and output a summary of data grouped by a categorical variable (inj_length_cat3) in my dataset. Actually, I generate this variable (from inj_length) on the fly using mutate(). I also want to output the same summary of the data without grouping. The only way I figured out how to do that is to do the analysis twice over, once with, once without grouping, and then combine the outputs. Ugh.
I'm sure there is a more elegant solution than this and it bugs me. I wonder if anyone would be able to help.
Thanks!
library(dplyr)
df <- data.frame(
  year = sample(c(2005, 2006), 20, replace = TRUE),
  inj_length = sample(1:10, 20, replace = TRUE),
  hiv_status = sample(0:1, 20, replace = TRUE)
)
tmp <- df %>%
  mutate(inj_length_cat3 = cut(inj_length, breaks = c(0, 3, 100),
                               labels = c('<3 years', '>3 years'))) %>%
  group_by(year, inj_length_cat3) %>%
  summarise(
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  ) %>%
  filter(inj_length_cat3 %in% c('<3 years', '>3 years'))
tmp_all <- df %>%
  group_by(year) %>%
  summarise(
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  )
tmp_all$inj_length_cat3 <- as.factor('All')
tmp <- merge(tmp_all, tmp, all = TRUE)
I'm not sure you'll consider this more elegant, but you can get this to work if you first create a data frame that contains all your data twice: once to provide the subgroups and once to provide the overall summary:
df1 <- rbind(df, df)
df1$inj_length_cat3 <- cut(df1$inj_length, breaks = c(0, 3, 100, Inf),
                           labels = c('<3 years', '>3 years', 'All'))
# relabel the second copy of the data as 'All'
df1$inj_length_cat3[-(1:nrow(df))] <- "All"
Now you just need to run your first analysis without mutate():
tmp <- df1 %>%
  group_by(year, inj_length_cat3) %>%
  summarise(
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  ) %>%
  filter(inj_length_cat3 %in% c('<3 years', '>3 years', 'All'))
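A variant that avoids duplicating the raw data is to put the shared summarise() in a small helper and stack the grouped and ungrouped results with bind_rows(); a sketch (summarise_hiv is a hypothetical name, df as in the question):
library(dplyr)
summarise_hiv <- function(d) {
  summarise(d,
    r = sum(hiv_status, na.rm = TRUE),
    n = length(hiv_status),
    p = prop.test(r, n)$estimate,
    cilow = prop.test(r, n)$conf.int[1],
    cihigh = prop.test(r, n)$conf.int[2]
  )
}
df2 <- df %>%
  mutate(inj_length_cat3 = cut(inj_length, breaks = c(0, 3, 100),
                               labels = c('<3 years', '>3 years')))
tmp <- bind_rows(
  df2 %>% group_by(year, inj_length_cat3) %>% summarise_hiv(),
  df2 %>% group_by(year) %>% summarise_hiv() %>%
    mutate(inj_length_cat3 = factor('All'))
)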

How could I reduce a dataframe in R with aggregate (or similar) to only retain the 100 highest values for each group?

I have a dataframe as such:
probe.id      gene.name  variance    database
A_23_P100002  FAM174B    0.93285966  Database1
A_23_P100013  AP3S2      0.48936044  Database1
...
A_23_P100020  RBPMS2     0.77441359  Database2
A_23_P100072  AVEN       0.36194383  Database2
...
I am interested in reducing this dataframe so that only the 100 genes with the highest variance per database remain. It seems that aggregate() could do the job, but I don't have any idea how to write the function I would pass to aggregate(). I would greatly appreciate any help.
Thank you!
There are a lot of ways to skin this cat, so you'll get a variety of answers. In base R, this one should work pretty well.
# rank each variance within its database, largest first
o <- ave(dat$variance, dat$database, FUN = function(x) rank(-x, ties.method = "first"))
dat100 <- dat[o <= 100, ]
Or try this with dplyr:
library(dplyr)
myData %>%
  group_by(database) %>%
  arrange(desc(variance)) %>%
  slice(1:100)
Or try data.table:
library(data.table)
# assume DF is your data frame; setDT(DF) converts it to a data.table
# by reference, and it can be reverted to a data frame with setDF(DF)
setDT(DF)[order(-variance), .SD[1:100], by = database]

Split up a dataframe by number of rows

I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Make your own grouping variable.
d <- split(my_data_frame, rep(1:400, each = 1000))
You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
edited for brevity, after Hadley's comments.
If you don't know how many rows are in the data frame, or if its length might not be an exact multiple of your desired chunk size, you can do
chunk <- 1000
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(my_data_frame, r)
You could also use
r <- ggplot2::cut_width(1:n, chunk, boundary = 0)
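Since the goal is to run functions on the pieces and then reassemble, a minimal sketch of that round trip (f stands for a hypothetical per-chunk function):
results <- lapply(d, f)                  # apply f to each chunk
reassembled <- do.call(rbind, results)   # stack the chunks back together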
For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like
(my_data_frame
  %>% mutate(index = rep(1:ngrps, each = full_number)[seq_len(n())])
  %>% group_by(index)
  %>% [mutate, summarise, do()] ...
)
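As a concrete sketch of that dplyr pattern (the chunk size and the summarise() step are placeholders to adapt):
library(dplyr)
chunk <- 1000
out <- my_data_frame %>%
  mutate(index = rep(seq_len(ceiling(n() / chunk)), each = chunk)[seq_len(n())]) %>%
  group_by(index) %>%
  summarise(across(where(is.numeric), mean))  # example per-chunk operation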
There are also many related answers elsewhere on this site.
I had a similar question and used this:
library(tidyverse)
n <- 100  # number of rows per group
split <- df %>% group_by(row_number() %/% n) %>% group_map(~ .x)
From left to right:
you assign your result to split
you start with df as your input dataframe
then you group your data by the integer division of the row number by n, so every n consecutive rows end up in the same group
then you pass each group through group_map(), which returns a list.
So in the end, split is a list with one group of your dataset in each element.
On the other hand, you could also write your data out immediately by replacing the group_map() call with e.g. group_walk(~ write_csv(.x, paste0("file_", .y, ".csv"))).
You can find more info on these tools in the dplyr cheat sheet, which explains group_by(), and in the documentation for the group_map()/group_walk() family of follow-up functions.
