Applying dplyr's tally over large amount of columns to create codebook - r

I have a dataframe of 100+ variables and I would like to create a codebook to see the frequencies of each variable (and ideally output this to Excel). Right now, I'm using the following code:
freq_fun <- function(var){
var <- enquo(var)
frequencies <- raw %>% group_by(group, !!var) %>% tally()
return(frequencies)
}
I added in the return in the hopes that looping by column names would at least show me the output but this was unsuccessful.
At this point, my plan is to do the following:
for(i in colnames(rawxl[,9:107])){
assign(paste0(i,"freq"), freq_queue(!!i))
}
and then output each dataframe to a CSV and copy and paste the results into one Excel document. This is undesirable for obvious reasons, but I can't see a clear way around it. What is a better way to do this?
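One possible way around the per-variable copy and paste is to pass column names as strings, count each variable within group, and stack the results into a single long table. The sketch below is only an illustration under assumptions: the small `raw` data frame, the column names `q1` and `q2`, and the output file name are made up, and values are converted to character so that counts from columns of different types can be stacked.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the real data: a grouping column plus two variables
raw <- data.frame(
  group = c("a", "a", "b", "b"),
  q1    = c("yes", "no", "yes", "yes"),
  q2    = c(1, 1, 2, 1)
)

# Count one variable within each group; `var` is a column name given as a string
freq_fun <- function(data, var) {
  data %>%
    count(group, value = as.character(.data[[var]]), name = "n") %>%
    mutate(variable = var, .before = 1)
}

# Every column except the grouping column
vars <- setdiff(colnames(raw), "group")

# One long codebook: variable, group, value, n
codebook <- map(vars, ~ freq_fun(raw, .x)) %>% bind_rows()

# A single CSV that Excel can open (written to a temporary directory here)
out_file <- file.path(tempdir(), "codebook.csv")
write.csv(codebook, out_file, row.names = FALSE)
```

With the real data, `vars` could be `colnames(raw)[9:107]`. A package such as writexl or openxlsx could write the table to an .xlsx file directly instead of a CSV; that step is not shown here.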

Related

Create a dataframe looping a function's results

So this is a simplification of my problem.
I have a dataframe like this:
df <- data.frame(name=c("lucas","julio","jack","juan"),number=c(1,15,100,22))
And I have a function that creates new values for every name, like this:
var_number <- function(x) {
example <- df %>%
filter(name %in% unique(df$name)[x]) %>%
select(-name) %>%
mutate(value1=number/2^5, value2=number^5)
(example)
}
var_number(1)
0.03125 1
Now I have two new values for every name and I would like to create a loop to save each result in a new dataframe.
I know how to solve this particular problem, but I need a general solution that allows me to save the results of all functions into a dataframe.
I'm looking for an automatic way to do something like this:
result<- bind_rows(var_number(1),var_number(2),var_number(3),var_number(4))
I would have to apply var_number around 1000 times, and the length would change with every test I do.
Is there any way I can do something like this? I was thinking about doing it with "for", but I'm not really sure how to do it; I have just started with R and I am a total newbie.
This answers my problem:
library(tidyverse) # contains purrr library
#an arbitrary function that always outputs a dataframe
# with a consistent number of columns, in this case 3
myfunc <- function(x){
data.frame(a=x*2,
b=x^2,
c=log2(x))
}
# iterate over 1:10 as inputs to myfunc, and
# combine the results rowwise into a df
purrr::map_dfr(1:10,
~myfunc(.))
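Applied to the var_number function from the question, the same idea might look like the sketch below. It uses `map()` followed by `bind_rows()` rather than `map_dfr()`, which is a substitution on my part; the data frame and function are copied from the question, and the number of iterations is taken from the data instead of being typed out.

```r
library(dplyr)
library(purrr)

# Data and function as given in the question
df <- data.frame(name = c("lucas", "julio", "jack", "juan"),
                 number = c(1, 15, 100, 22))

var_number <- function(x) {
  df %>%
    filter(name %in% unique(df$name)[x]) %>%
    select(-name) %>%
    mutate(value1 = number / 2^5, value2 = number^5)
}

# Call var_number once per unique name and stack the results row-wise
result <- map(seq_along(unique(df$name)), var_number) %>% bind_rows()
```

Because the index vector comes from `seq_along(unique(df$name))`, the loop adapts when the number of names changes between tests.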
Why do you want to apply var_number function for each name, create a new dataframe for each and then combine all of them together?
Do it only once in the same dataframe.
library(dplyr)
df1 <- df %>%
mutate(value1=number/2^5,value2=number^5) %>%
select(-name)
If you want to do it only for specific names, you can filter them first before applying the above.

Need help organizing and summarizing column data into R Markdown

Sorry if this is an easy question, but I have a problem.
I have a .csv file imported into RStudio. The picture linked below is an example of what it looks like. I want to create individual data frames for each type (BMW, Mercedes, Honda) and then create summary statistics for each subsetted data frame.
example
I am so lost that I can't even figure out a correct title for this question. Any help would be appreciated.
Creating single data.frames for each type can be done with the split function. You can then calculate summary statistics for each data.frame by using lapply on the list of data frames.
split_dfs <- split(your_data, your_data$type)
summary_stats <- lapply(split_dfs, function(x){
data.frame(
mean_price = mean(x$price)
)
})
A more modern version would be not to create single data.frames but to use a grouped data.frame. Use group_by and summarise from the dplyr package.
require(tidyverse)
your_data %>%
group_by(type) %>%
summarise(
mean_price = mean(price)
)
Another library that makes the computation easier and, above all, faster for large datasets with many groups is data.table. The computation would look something like this:
require(data.table)
your_dt <- as.data.table(your_data)
summary_stats <- your_dt[, .(mean_price=mean(price)), by="type"]
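As a small self-contained illustration of the grouped dplyr approach above, the sketch below computes several statistics per type in one call. The data values are made up, and the column names `type` and `price` are assumed to match the ones used in the answer.

```r
library(dplyr)

# Made-up example data; the real file's columns may differ
your_data <- data.frame(
  type  = c("BMW", "BMW", "Mercedes", "Honda", "Honda"),
  price = c(40000, 50000, 60000, 20000, 30000)
)

# Several summary statistics per type
summary_stats <- your_data %>%
  group_by(type) %>%
  summarise(
    n          = n(),
    mean_price = mean(price),
    min_price  = min(price),
    max_price  = max(price),
    .groups    = "drop"
  )
```

The result has one row per type, so it can be printed directly in an R Markdown chunk.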

Need Help Incorporating Tidyr's Spread into a Function that Outputs a List of Dataframes with Grouped Counts

library(tidyverse)
Using the sample data at the bottom, I want to find counts of the Gender and FP variables, then spread these variables using tidyr::spread(). I'm attempting to do this by creating a list of dataframes, one for the Gender counts, and one for FP counts. The reason I'm doing this is to eventually cbind both dataframes. However, I'm having trouble incorporating the tidyr::spread into my function.
The function below creates a list of two dataframes with counts for Gender and FP, but the counts are not "spread."
group_by_quo=quos(Gender,FP)
DF2<-map(group_by_quo,~DF%>%
group_by(Code,!!.x)%>%
summarise(n=n()))
If I add tidyr::spread, it doesn't work. I'm not sure how to incorporate this since each dataframe in the list has a different variable.
group_by_quo=quos(Gender,FP)
DF2<-map(group_by_quo,~DF%>%
group_by(Code,!!.x)%>%
summarise(n=n()))%>%
spread(!!.x,n)
Any help would be appreciated!
Sample Code:
Subject<-c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code<-c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Code2<-c("AAA2","BBB2","AAA2","CCC2","CCC2","DDD2","BBB2","AAA2","BBB2","DDD2","CCC2","DDD2")
Gender<-c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP<-c("F","P","P","P","F","F","F","F","F","F","F","F")
DF<-data_frame(Subject,Code,Code2,Gender,FP)
I think you misplaced the closing parenthesis. This code works for me:
library(tidyverse)
Subject<-c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code<-c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Code2<-c("AAA2","BBB2","AAA2","CCC2","CCC2","DDD2","BBB2","AAA2","BBB2","DDD2","CCC2","DDD2")
Gender<-c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP<-c("F","P","P","P","F","F","F","F","F","F","F","F")
DF<-tibble(Subject,Code,Code2,Gender,FP)
group_by_quo <- quos(Gender, FP)
DF2 <- map(group_by_quo,
~DF %>%
group_by(Code,!!.x) %>%
summarise(n=n()) %>%
spread(!!.x,n))
This last part is a bit more concise using count:
DF2 <- map(group_by_quo,
~DF %>%
count(Code,!!.x) %>%
spread(!!.x,n))
And by using count the unnecessary grouping information is removed as well.
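In current tidyr, spread() is superseded by pivot_wider(). The sketch below is one way the same counts could be produced with it and then combined by Code, which is what the question ultimately aims for with cbind. Passing the variable names as strings, the helper name `count_wide`, and joining with left_join instead of cbind are my own choices here, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Relevant columns of the sample data from the question
DF <- tibble(
  Code   = c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD"),
  Gender = c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male"),
  FP     = c("F","P","P","P","F","F","F","F","F","F","F","F")
)

# Hypothetical helper: count `var` within Code, then spread the counts wide
count_wide <- function(data, var) {
  data %>%
    group_by(across(all_of(c("Code", var)))) %>%
    summarise(n = n(), .groups = "drop") %>%
    pivot_wider(names_from = all_of(var), values_from = n, values_fill = 0L)
}

# One wide data frame per variable
DF2 <- map(c("Gender", "FP"), ~ count_wide(DF, .x))

# Combine by Code rather than relying on row order as cbind would
combined <- reduce(DF2, left_join, by = "Code")
```

Joining on Code avoids the silent misalignment that cbind can cause if the two tables ever end up with different row orders.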

Expand each row in R dataframe with multiple rows

I need a dataframe containing the names of some files matching a pattern, mapped to each line in those files. My problem is that I am unable to generate multiple rows for each row; the dataframe should grow in columns and rows, expanded per row. What I need is basically a left outer join, but I am struggling with the syntax.
library(dplyr)
app.lsts <- data.frame(
file=list.files(path='.', pattern='app.lst', recursive=TRUE)
) %>%
mutate(command=paste0('cat ', file)) %>%
mutate(packages=system(command, intern=TRUE))
The last mutate does not work because packages is a list of lines. How do I "unwrap" these?
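First, some working (but not very good) code:
require(tidyverse)
# list.files() takes a regular expression, so match names ending in .foo
fnames <- list.files(path='.', pattern='\\.foo$', recursive=TRUE)
out_df <-
fnames %>%
# readLines() opens and closes the file itself when given a path
map(~readLines(.x)) %>%
setNames(fnames) %>%
t %>%
as.data.frame %>%
gather(fname, lines) %>%
unnest(lines)
out_df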
This is a tidyverse-style command to generate the data that I think you want. Since I don't have your input files, I made up these sample files:
contents of f1.foo
line_1_f1
line_2_f1
contents of f2.foo
line_1_f2
line_2_f2
line_3_f2
Changes relative to your approach:
1. Avoid the use of the built-in function file() as a column name. I used fname instead.
2. Don't use system to read the files; there are built-in R functions to do that. Use of system() needlessly makes porting your code to other operating systems far less likely to succeed.
3. Build the data frame after all the data is read into R, not before. Because of the way non-standard evaluation with dplyr works, it's hard to use readLines(...) inside of a mutate() where the file connection to be read varies.
4. Use purrr::map() to generate a list of lists of file content lines from a list of filenames. This is a tidyverse way of writing a for-loop.
5. Set the names of the list elements with setNames().
6. Munge this list into a data.frame using t() and as.data.frame().
7. Tidy the data with gather() to collapse the data frame that has one column per file into a data frame with one file per row.
8. Expand the list using unnest().
I don't think this approach is very pretty, but it works. Another approach that avoids the ugly steps 5 and 6 is a for loop.
# list.files() takes a regular expression, so match names ending in .foo
fnames <- list.files(path='.', pattern='\\.foo$', recursive=TRUE)
out_df <- data.frame(fname = c(), lines=c())
for(fname in fnames){
# readLines() opens and closes the file itself when given a path
fcontents <- readLines(fname) %>% as.character
this_df <- data.frame(fname = fname, lines = fcontents)
out_df <- bind_rows(out_df, this_df)
}
The output in either case is
fname lines
1 f1.foo line_1_f1
2 f1.foo line_2_f1
3 f2.foo line_1_f2
4 f2.foo line_2_f2
5 f2.foo line_3_f2
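A third variant, sketched below, builds one small data frame per file inside map() and stacks them, which avoids both the reshaping steps and growing a data frame in a loop. This is my own variation rather than part of the answer above. The two sample files are recreated in a temporary directory so the example is self-contained; the directory name `foo_demo` is made up.

```r
library(dplyr)
library(purrr)

# Recreate the two sample files in a temporary directory
tmp <- file.path(tempdir(), "foo_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("line_1_f1", "line_2_f1"), file.path(tmp, "f1.foo"))
writeLines(c("line_1_f2", "line_2_f2", "line_3_f2"), file.path(tmp, "f2.foo"))

# File names relative to tmp, matched with a regular expression
fnames <- list.files(path = tmp, pattern = "\\.foo$", recursive = TRUE)

# One data frame per file (the file name is recycled across its lines), then stack
out_df <- map(fnames, ~ tibble(fname = .x,
                               lines = readLines(file.path(tmp, .x)))) %>%
  bind_rows()
```

Each file contributes as many rows as it has lines, which gives the "one row per line, labelled by file" shape the question asks for.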

Filter data with R from csv file

I have a csv file of Facebook data with around 190,000 rows. The column names are the following:
comment_id, status_id, parent_id, comment_message, comment_author, comment_published, comment_likes, Positive, Negative, Sentiment
I want to find out which comment_author has the most comments (# of comment_message) and a Sentiment > 0.
Does anybody know how to apply this filter using R?
If df is your data frame, you can use the dplyr package as follows:
df %>% group_by(comment_author, Sentiment) %>%
# comment_message is text, so count rows with n() rather than summing it
dplyr::summarize(total_number_comment = n()) %>%
as.data.frame() %>%
arrange(desc(total_number_comment)) %>%
filter(Sentiment > 0)
I didn't understand what you really want to do with the Sentiment variable (you would need to provide an example), but the grouping part is done.
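If the intent is to count, per author, only the comments whose Sentiment is above zero, one reading of the question could be sketched as follows. That interpretation is an assumption, and the small data frame is made up for illustration; only the column names come from the question.

```r
library(dplyr)

# Made-up sample with the relevant columns from the question
df <- data.frame(
  comment_author  = c("ann", "bob", "ann", "cat", "bob", "ann"),
  comment_message = c("m1", "m2", "m3", "m4", "m5", "m6"),
  Sentiment       = c(0.5, -0.2, 0.1, 0.9, 0.3, -0.4)
)

# Keep positive-sentiment comments, then count them per author, largest first
top_authors <- df %>%
  filter(Sentiment > 0) %>%
  count(comment_author, name = "total_number_comment", sort = TRUE)

# The first row is the author with the most positive-sentiment comments
top_authors[1, ]
```

Filtering before counting keeps one row per author instead of one row per author and Sentiment value.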
