Expand each row in R dataframe with multiple rows - r

I need a data frame that maps the names of files matching a pattern to each line in those files. My problem is that I am unable to generate multiple rows for each row: the data frame should grow in both columns and rows, expanded per row. What I need is basically a left outer join, but I am struggling with the syntax.
library(dplyr)
app.lsts <- data.frame(
file=list.files(path='.', pattern='app.lst', recursive=TRUE)
) %>%
mutate(command=paste0('cat ', file)) %>%
mutate(packages=system(command, intern=TRUE))
The last mutate does not work because packages is a list of lines. How do I "unwrap" these?

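First, some working (but not very pretty) code:
library(tidyverse)
# list.files() expects a regular expression, not a shell glob
fnames <- list.files(path='.', pattern='\\.foo$', recursive=TRUE)
out_df <-
  fnames %>%
  map(~readLines(.x)) %>%   # readLines() opens and closes each file itself
  setNames(fnames) %>%
  t %>%
  as.data.frame %>%
  gather(fname, lines) %>%
  unnest(lines)
out_df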
This is a tidyverse-style command to generate the data that I think you want. Since I don't have your input files, I made up these sample files:
contents of f1.foo
line_1_f1
line_2_f1
contents of f2.foo
line_1_f2
line_2_f2
line_3_f2
Changes relative to your approach:
1. Avoid using the name of the built-in function file() as a column name. I used fname instead.
2. Don't use system() to read the files; there are built-in R functions for that. Using system() needlessly makes your code much less likely to work on other operating systems.
3. Build the data frame after all the data is read into R, not before. Because of the way non-standard evaluation works in dplyr, it's hard to use readLines(...) inside a mutate() where the file to be read varies by row.
4. Use purrr::map() to generate a list of vectors of file content lines from a vector of filenames. This is the tidyverse way of writing a for loop.
5. Set the names of the list elements with setNames().
6. Munge this list into a data.frame using t() and as.data.frame().
7. Tidy the data with gather() to collapse the data frame that has one column per file into a data frame with one file per row.
8. Expand the list column using unnest().
I don't think this approach is very pretty, but it works. Another approach that avoids the ugly steps 5 and 6 is a for loop.
fnames <- list.files(path='.', pattern='\\.foo$', recursive=TRUE)
out_df <- data.frame(fname = character(), lines = character())
for (fname in fnames) {
  # readLines() opens and closes the file itself when given a path
  fcontents <- readLines(fname)
  this_df <- data.frame(fname = fname, lines = fcontents)
  out_df <- bind_rows(out_df, this_df)
}
The output in either case is
   fname     lines
1 f1.foo line_1_f1
2 f1.foo line_2_f1
3 f2.foo line_1_f2
4 f2.foo line_2_f2
5 f2.foo line_3_f2
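For comparison, here is a minimal base R sketch of the same idea that skips the t()/as.data.frame() steps. It is an alternative to the two approaches above, not the code that produced the output shown. It recreates the made-up sample files in a temporary directory (the tmp and contents names are only for illustration), so it can be run anywhere.

```r
# Recreate the two made-up sample files in a throwaway directory
tmp <- tempfile("foo_demo_")
dir.create(tmp)
writeLines(c("line_1_f1", "line_2_f1"), file.path(tmp, "f1.foo"))
writeLines(c("line_1_f2", "line_2_f2", "line_3_f2"), file.path(tmp, "f2.foo"))

# list.files() takes a regular expression and returns sorted relative paths
fnames <- list.files(path = tmp, pattern = "\\.foo$", recursive = TRUE)

# Read each file into one element of a named list
contents <- lapply(file.path(tmp, fnames), readLines)
names(contents) <- fnames

# One row per line: repeat each file name once per line in that file
out_df <- data.frame(
  fname = rep(names(contents), lengths(contents)),
  lines = unlist(contents, use.names = FALSE),
  stringsAsFactors = FALSE
)
out_df
```

The rep(names, lengths) idiom does the "expanding" explicitly: each file name is repeated as many times as that file has lines, which lines up with the flattened vector of lines.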

Related

In an R list, how to set sub-list names

How do I set list names? The code is below.
Currently, split_data includes two sub-lists, [[1]] and [[2]]. How do I set names for them separately?
I want to set the name 'A' for [[1]] and 'B' for [[2]], so that I can retrieve data with split_data['A'].
Can anyone help with this? Thanks.
For instance, with ma <- list(a=c('a1','a2'),b=c('b1','b2')) I can use ma["a"] to get a sub-list.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
Others have shown you in the comments how to get what you want using split() instead of group_split(). That seems like the easiest solution.
However, if you're stuck with the existing code, here's an alternative that keeps your current code, and adds the names.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
names(split_data) <- test_data %>% group_by(category) %>% group_keys() %>% apply(1, paste, collapse = ".")
The idea is to use group_by to split in the same way group_split does, then extract the keys as a tibble. This will have one row per group, but will have the different variables in separate columns, so I put them together by pasting the columns with a dot as separator. The last expression in the pipe is equivalent to apply(keys, 1, f), where f is function(row) paste(row, collapse = "."). It applies f to each row of the tibble, producing a single name.
This should work even if the split happens on multiple variables, and produces names similar to those produced by split().
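For reference, the split() route mentioned at the start of this answer might look like the sketch below. It uses only base R and the same test_data as the question; split() names the list elements after the grouping values automatically.

```r
test_data <- data.frame(category = c('A','B','A','B','A','B','A','B'),
                        sales = c(1,2,4,5,8,1,4,6))

# split() returns a list named after the values of the grouping variable
split_data <- split(test_data, test_data$category)

names(split_data)    # the group labels
split_data[["A"]]    # the rows where category is 'A'
```

Note that split_data['A'] returns a list of length one, while split_data[["A"]] returns the data frame itself.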

Batch Convert Columns from chr to num with either read_excel or dplyr

I have a database saved in excel, and when I bring it into R there are many columns that should be numeric, but they get listed as characters. I know that in read_excel I can specify each column format using the col_types = "numeric", but I have > 500 columns, so this gets a bit tedious.
Any suggestions on how to do this either when importing with read_excel, or after with dplyr or something similar?
I can do this 1 by 1 using a function that I wrote but it still requires writing out each column name
convert_column <- function(data, col_name) {
  new_col_name <- paste0(col_name)
  data %>% mutate(!!new_col_name := as.numeric(!!sym(col_name)))
}
# df is the data frame returned by read_excel
df %>%
  convert_column("gFat_OVX") %>%
  convert_column("gLean_OVX") %>%
  convert_column("pFat_OVX") %>%
  convert_column("pLean_OVX")
I would ideally like to say "if a column contains the text "Fat" or "Lean" in the header, then convert to numeric", but I'm open to suggestions.
select(df, contains("Fat" | "Lean"))
I'm not sure how to make an example that allows people to test this out, given that we're starting with an excel sheet here.
dplyr::mutate and across may be a solution after reading in the data.
Something like this, where df1 is your data frame from read_excel:
library(dplyr)
df1 <- df1 %>%
mutate(across(contains(c("Fat", "Lean")), ~as.numeric(.x)))
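Since there is no Excel sheet to test against, here is a small self-contained sketch of the same across() call on a made-up data frame. The id column and all values are invented for the example; only the column-name pattern follows the question. It assumes dplyr 1.0 or later, where across() is available.

```r
library(dplyr)

# Made-up stand-in for the result of read_excel(): numbers stored as text
df1 <- data.frame(
  id        = c("m1", "m2"),
  gFat_OVX  = c("1.5", "2.25"),
  pLean_OVX = c("70", "65.5"),
  stringsAsFactors = FALSE
)

# Convert every column whose name contains "Fat" or "Lean"; leave the rest alone
df1 <- df1 %>%
  mutate(across(contains(c("Fat", "Lean")), ~as.numeric(.x)))

str(df1)
```

Values that cannot be parsed as numbers become NA with a warning, so it is worth checking for unexpected NAs after the conversion.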

How many columns can be selected in a data frame in R?

I want to select 3117 columns out of a data frame,
I tried to select them by column names:
dataframe %>%
select(
'AAACCTGAGCACGCCT-1',
'AAACCTGAGCGCTTAT-1',
'AAACCTGAGCGTTGCC-1',
......,
'TTGGAACCACGGACAA-1'
)
or
firstpickupnames <- ('AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-1',......,'TTGGAACCACGGACAA-1')
Both ways the R console just replied
'AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-
1',......,'TTGGAACCACGGACAA-1'
+ )
+
What does this mean? Is there a limitation of columns that I can select in R?
Without a reproducible example, it's difficult to know what exactly you're looking for, but dplyr::select() has several options for selecting columns, and dplyr::everything() might be what you're looking for:
library(dplyr)
# this reorders the column names, but keeps everything without having to name the columns specifically:
mtcars %>%
select(carb, gear, everything())
# from a list of column names:
keep_columns <- c('cyl','disp','hp')
mtcars %>%
select(one_of(keep_columns))
# specific names, and a range of names:
mtcars %>%
select(hp, qsec:gear)
#You could also use `contains()`, `starts_with()`, `ends_with()`, or `matches()`. Note that calling all of the following at once will give you no results:
mtcars %>%
select(contains('t')) %>%
select(starts_with('a')) %>%
select(ends_with('b')) %>%
select(matches('^m.+g$'))
The way that the console replies (with the + indicating that it is waiting for the rest of the expression) strongly suggests that you are hitting a limit on how long a command the console can process (you are assembling a very long command by pasting from the clipboard), rather than an inherent limit on the number of columns that can be selected. The only mention of this limit that I could find in the documentation says "Command lines entered at the console are limited to about 4095 bytes."
In the comments you said that the column names that you wanted to select were in a csv file. You didn't say much about the structure of the csv file, but say that you have a csv file that contains a single list of column names. As an example, I created a file named "colnames.csv" which has a single line:
Sepal.Width, Petal.Length
Note that there is no need to manually place quote marks around the column names in the text file. Then in the R console I typed:
iris %>% select(one_of(as.character(read.csv("colnames.csv",header = FALSE, strip.white = TRUE,stringsAsFactors = FALSE))))
which worked as expected. Even though this example only used 2 columns, there is no reason that it should fail with 3000+, since the number of columns per se wasn't the problem with what you were doing.
If the structure of the csv file is different from the example then you would need to adjust the call to read.csv and perhaps the way that you convert it to a character vector, but you should be able to tweak this approach to your situation.
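A slightly more explicit variant of the same idea, sketched in base R, reads the names into a character vector first and then selects with it. The temporary file stands in for "colnames.csv", and the csv_path and keep names are only for illustration.

```r
# Write a stand-in for colnames.csv: a single comma-separated line of names
csv_path <- tempfile(fileext = ".csv")
writeLines("Sepal.Width, Petal.Length", csv_path)

# scan() returns a plain character vector; strip.white drops the space after the comma
keep <- scan(csv_path, what = character(), sep = ",",
             strip.white = TRUE, quiet = TRUE)

# Select by name; the command stays short however many names the file holds
selected <- iris[, keep]
names(selected)
```

Because the names live in a vector rather than in the command itself, the console's line-length limit never comes into play. With dplyr, the same vector could be passed as select(all_of(keep)).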

Applying dplyr's tally over large amount of columns to create codebook

I have a data frame of 100+ variables and I would like to create a codebook to see the frequencies of each variable (and ideally output this to Excel). Right now, I'm using the following code:
freq_fun <- function(var){
var <- enquo(var)
frequencies <- raw %>% group_by(group, !!var) %>% tally()
return(frequencies)
}
I added in the return in the hopes that looping by column names would at least show me the output but this was unsuccessful.
At this point, my plan is to do the following:
for(i in colnames(rawxl[,9:107])){
assign(paste0(i,"freq"), freq_queue(!!i))
}
and then to output each data frame to a csv and copy and paste them into one Excel document. This is undesirable for obvious reasons, but I can't see a clear way around it. What is a better way to do this?

How to use purrr with dplyr to filter list elements and export lists into Excel

I'm fairly new to working with lists in R and have a quick question that also involves using purrr. Below are two small sample data frames as an example.
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals <- c("Cat","Cat","Dog","Rat","Bird")
Living <- c("House","Condo","Condo","Apartment","House")
Data1 <- data.frame(Client1,Animals,Living)
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals2 <- c("Cat","Dog","Dog","Rat","Cat")
Living2 <- c("House","Apartment","Apartment","Family","Apartment")
Data2 <- data.frame(Client1,Animals2,Living2)
Bonus if you can include how to rename list elements at once instead of using the two lines below:
names(Data1)[1:3] <- c("Client","Animals","Living")
names(Data2)[1:3] <- c("Client","Animals","Living")
Next, I want to filter each data frame by Animals and then export each one to a spreadsheet, using the two lines of code below:
Data1 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data1.csv")
Data2 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data2.csv")
However, to be more efficient I can join both data frames into a list and use purrr to filter each at the same time.
DataList <- list(Data1,Data2)
DataList %>% map(~filter(.,Animals=="Cat"))
For the above code, I will use multiple ~filter lines for each animal, so not sure if there's a more efficient way that will avoid writing many different lines of code while still using purrr and dplyr?
Also, how do I use write.csv with purrr? I could either export the whole list into one spreadsheet, although I'm not sure how to break up the list so that it exports properly, or export each list element into a separate spreadsheet. It would be great to see a solution for both situations.
If I understand your question correctly, you want to write a separate file for each of the Animals of both the data frames:
DataList <- list(Data1, Data2)
library(purrr)
a <- DataList %>% map(., function(x) {
colnames(x) <- c("Client","Animals","Living")
x
}) %>% map(., function(x) {
split(x, x$Animals)
}) %>% flatten(.)
names(a) <- paste0("Data", (1:length(a)))
lapply(1:length(a), function(x) write.csv(a[[x]],
file = paste0(names(a[x]), ".csv"),
row.names = FALSE))
We first dump both the data frames in DataList, then rename the columns for both the data frames with the first map, then split both the data frames by Animals, and finally flatten the nested list.
I wish I could do this without breaking the chain, but I couldn't find another way.
From here, we first rename the elements of the list, then use lapply to loop over all the elements in the list and apply write.csv on each of them.
You mentioned Excel: you can just as easily replace write.csv with any of the functions for writing Excel files from R.
Here is one option, involving binding the two datasets together before re-splitting.
library(purrr)
library(dplyr)
DataList %>%
map(~setNames(.x, c("Client","Animals","Living"))) %>%
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id") %>%
split(list(.$id, .$Animals), drop = TRUE) %>%
map(~select(.x, -id) %>%
write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
row.names = FALSE))
The first map line shows how to rename the columns of all the datasets in a list at once via setNames.
DataList %>%
map(~setNames(.x, c("Client","Animals","Living")))
I then set the names of the datasets in the list via setNames. While stacking the datasets together into a single data.frame via dplyr's bind_rows, these names are added as a new column, id.
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id")
The last step is to split the combined data.frame by id and Animal before writing each split into a separate csv file. Information is pulled out of the dataset for naming the individual files by dataset and animal (this was the reason to name the elements of DataList). I removed the id variable via select prior to writing the files, as it may be extraneous to your needs.
split(list(.$id, .$Animals), drop = TRUE) %>%
map(~select(.x, -id) %>%
write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
row.names = FALSE))
This can all be done without putting these into a single data.frame, but I had trouble with naming the files at the end.
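One possible way around the naming problem without combining the data sets is purrr::iwalk(), which passes both the list element and its name to the function. The sketch below is only an illustration under a few assumptions: the data frames are built with the final column names directly (so the renaming step is skipped), the list is named by hand, and files go to a temporary folder (out_dir is a made-up name).

```r
library(purrr)

Data1 <- data.frame(Client  = c("John", "Chris", "Yutaro", "Dean", "Andy"),
                    Animals = c("Cat", "Cat", "Dog", "Rat", "Bird"),
                    Living  = c("House", "Condo", "Condo", "Apartment", "House"))
Data2 <- data.frame(Client  = c("John", "Chris", "Yutaro", "Dean", "Andy"),
                    Animals = c("Cat", "Dog", "Dog", "Rat", "Cat"),
                    Living  = c("House", "Apartment", "Apartment", "Family", "Apartment"))
DataList <- list(Data1 = Data1, Data2 = Data2)

out_dir <- tempfile("animal_csv_")   # throwaway output folder
dir.create(out_dir)

# Outer loop: one data set at a time, with its list name available as `id`
iwalk(DataList, function(dat, id) {
  # Inner loop: one animal at a time, with the split name available as `animal`
  iwalk(split(dat, dat$Animals), function(piece, animal) {
    write.csv(piece,
              file = file.path(out_dir, paste0(id, animal, ".csv")),
              row.names = FALSE)
  })
})

sort(list.files(out_dir))
```

iwalk() is used instead of imap() because only the side effect (writing the file) matters here; the file names follow the same id-plus-animal pattern as the answer above.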
