Need advice on using R to clean up data

I have multiple CSV files in the same format that I need to combine. Before combining, I need to:
1. Deal with the header: it is not the first row but the 4th. Should I drop the first 3 rows with skip, or reassign the header?
2. Add a column containing the ID of the file (same as the file name).
3. Extract only 4 of the 7 columns.
4. Sum up the numbers under a category.
5. Combine all CSV files into one.
This is what I have so far. It does steps 1, 3, 4, then 2 (adding the ID column), then 5; I'm not sure whether I should add the ID column first or not.
files = list.files(pattern = "*.csv", full.names = TRUE)
library("tidyverse")
library("dplyr")
data = data.frame()
for (file in files){
  temp <- read.csv(file, skip = 3, header = TRUE)
  colnames(temp) <- c("Volume", "Unit", "Category", "Surpass Object", "Time", "ID")
  temp <- temp[, c("Volume", "Category", "Surpass Object")]
  temp <- subset(temp, Category == "Surface")
  mutate(id = file)
  aggregate(temp$Volume, by = list(Category = temp$Category), FUN = sum)
}
And I got an error:
Error in is.data.frame(.data) :
argument ".data" is missing, with no default
The code runs fine if I leave out the mutate line, so I think the main problem comes from there, but any advice will be appreciated.
I am quite new to R and really appreciate all the comments I can get here.
Thanks in advance!

Since you appear to be trying to use dplyr, I'll stick with that theme.
library(dplyr)
library(purrr)

files = list.files(pattern = "*.csv", full.names = TRUE)

results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip = 3, header = TRUE), .id = "filename") %>%
  select(filename, Category, Volume, Surpass) %>%  # no idea why you want Surpass
  group_by(filename, Category) %>%
  summarize(Volume = sum(Volume))                  # Surpass is discarded here
Walk-through:
purrr::map_dfr iterates our function (read.csv(...)) over each of the inputs (each file in files) and row-concatenates the results. Since we named the files with themselves (setNames(nm = files) is akin to names(files) <- files), we can use .id = "filename", which adds a "filename" column reflecting which file each row was taken from.
select(...) keeps whatever four columns you said you needed. Frankly, since you're aggregating, we really only need c("filename", "Category", "Volume"); if you need anything else, you have likely missed something in your explanation.
group_by(...) gives us one row for each filename/Category pair, where Volume is a sum (calculated in the next step, summarize).
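Your original loop also kept only the "Surface" rows via subset(temp, Category == "Surface"). If you still want that, a filter() slots in before the grouping; a minimal sketch along the same lines (still assuming Category and Volume are the column names in the raw files):
results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip = 3, header = TRUE), .id = "filename") %>%
  select(filename, Category, Volume) %>%
  filter(Category == "Surface") %>%
  group_by(filename, Category) %>%
  summarize(Volume = sum(Volume))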

You can use read.csv(), but if there are many files, I suggest fread() from the data.table package: it is significantly faster. I used fread() here, but the code will still work if you switch it out for read.csv(). fread() is also more advanced; you will find that even arguments like skip can sometimes be left out and the file will still be read correctly.
library(tidyverse)
library(data.table)

add_filename <- function(flnm){
  fread(flnm, skip = 3) %>%        # read file
    mutate(id = basename(flnm))    # create new col id with the basename of the file
}
# single data frame of all CSVs; id in first col
df <- list.files(pattern = "*.csv", full.names = TRUE) %>%
  map_df(add_filename) %>%
  select(id, Volume, Category, `Surpass Object`)
I get the impression that you wanted to aggregate but keep the consolidated data frame, as well. If that's the case, you'll keep the aggregation separate from building the data frame.
df %>%                               # not assigned to a new object, so only shown in console
  filter(Category == "Surface") %>%  # filter for the desired category
  {sum(.$Volume)}                    # sum the remaining Volume values
If you are not aware, the period in that last call is the data carried forward through the pipe - in this case, the filtered data. The simplest way (perhaps not the best way) to explain the {} is that sum() is not designed to handle data frames, so it isn't inherently friendly with dplyr piping; the braces stop the pipe from inserting the data frame as the first argument of sum().
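An alternative that avoids the braces, if you prefer to stay within dplyr's verbs, is to pull() the column out first (a sketch under the same column assumptions):
df %>%
  filter(Category == "Surface") %>%
  pull(Volume) %>%
  sum()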
If you wanted the sum of Volume for every category, instead of only the "Surface" category you had coded in your question, you would use this instead:
df %>%
  group_by(Category) %>%
  summarise(sum(Volume))
Notice I used the British spelling, summarise(), here. A function named summarize() lives in a lot of packages; I have just found it easier to use the British spelling whenever I want to be sure it's the dplyr function I've called. (The tidyverse accepts both the American and British spellings for nearly all functions, I think.)
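Another way to remove the ambiguity entirely is to namespace the call, which guarantees dplyr's version no matter what else is loaded:
df %>%
  group_by(Category) %>%
  dplyr::summarise(total_volume = sum(Volume))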

Related

How to check in R if the name of the list element contains "this text" in it and pass to the next element in a for loop?

I'm new at R and have a large list of 30 elements, each of which is a data frame with a few hundred rows and around 20 columns (this varies from data frame to data frame). Each data frame is named after the original .csv file name (for example "experiment data XYZ QWERTY 01"). How can I go through the whole list and filter only those data frames whose names don't contain a specific text, and also add a unique id column to those filtered data frames (the id value being the first three characters of the file name)? For example, all the elements/data frames/files in the list whose names include "XYZ QWERTY" won't be filtered and don't need a unique id. I had this pseudo-style code:
for(i in 1:length(list_of_dataframes)){
  if
    list_of_dataframes[[i]] contains "this text" then don't filter
  else
    list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule) AND add unique.id.of.first.three.char.of.list_of_dataframes[[i]]
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn. (As a bonus, if you have any good resources/websites for learning to automate and do similar stuff with R, I would be more than glad to get some recommendations! :-))
EDIT:
The code I tried for the filtering part was:
for(i in 1:length(tbl)){
  if (!(str_detect(tbl[[i]], "OLD"))){
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, there were error messages stating "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames with a filter condition (here: do not contain the substring OLD in the filename). Just remove the ! to only include old experiments instead. A new column id is added containing the file path:
library(tidyverse)

list.files("files")
paths <- list.files("files", full.names = TRUE)
names(paths) <- list.files("files", full.names = TRUE)
list_of_dataframes <- paths %>% map(read_csv)

list_of_dataframes %>%
  enframe() %>%
  filter(! name %>% str_detect("OLD")) %>%
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)
  })) %>%
  pull(value)
A good resource to start with is the free book R for Data Science.
This is a much simpler approach without a list to get one big combined table of files matching the same condition:
list.files("files", full.names = TRUE) %>%
tibble(id = .) %>%
# discard old experiments
filter(! id %>% str_detect("OLD")) %>%
# read the csv table for every matching file
mutate(data = id %>% map(read_csv)) %>%
# combine the tables into one big one
unnest(data)
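The question also asked for an id made of the first three characters of the file name. A sketch of that on top of the combined table, using stringr's str_sub() on the basename (the df_combined name is my own, for illustration):
df_combined <- list.files("files", full.names = TRUE) %>%
  tibble(id = .) %>%
  filter(! id %>% str_detect("OLD")) %>%
  mutate(data = id %>% map(read_csv)) %>%
  unnest(data) %>%
  # "files/experiment 1.csv" -> basename "experiment 1.csv" -> "exp"
  mutate(id = str_sub(basename(id), 1, 3))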

How to loop through text files, find the average of each, and store it in a dataframe in R?

I have 500 text files, each with a different header and different columns of numbers, in a directory that I would like to do the following with:
Calculate the mean for each file
Store that mean into a new row in a dataframe
Repeat for all files
To read the file names I use:
# read file names
Filenames <- list.files(path = "path/",
                        pattern = "*.txt",
                        full.names = TRUE)
To calculate the mean and store it in a data frame, I am trying variations of the following (and sapply):
df <- imap_dfr(Filenames, ~ vroom(.x) %>%
                 summarise_if(is.numeric, mean))
However, the headers are being broken down into several columns, each receiving its own mean. I would like to either delete the header row or ignore it before calculating the mean of the whole file.
Help appreciated.
Select the numeric columns, unlist them into a vector, and calculate the mean.
library(dplyr)
library(purrr)
library(vroom)

mean_values <- map_dbl(Filenames, ~ vroom(.x) %>%
                         select(where(is.numeric)) %>%
                         unlist %>%
                         mean(na.rm = TRUE))

mean_values
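To land these in a data frame with one row per file, as the question asked, one more line does it (tibble() is re-exported by dplyr, which is already loaded; mean_df is my own name):
mean_df <- tibble(file = basename(Filenames), mean = mean_values)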

Only strings can be converted to symbols within a function in R

I have a function that is intended to operate on data obtained from a variety of sources with many manual entry fields. Since I don't know what to expect for the layout or naming convention used in these files, I want it to 'scan' a data frame for columns with the character string 'fix', 'name', or 'agent', and mutate the column to a new column with name 'Firm', then proceed to do string cleaning on the entries of that column, then finally, remove the original column. I have gotten it to work with SOME of the CSVs that I have already, but now have run into this error: ONLY STRINGS CAN BE CONVERTED TO SYMBOLS. I have checked into this thread ERROR: Only strings can be converted to symbols but to no avail.
Here is the function at the moment:
clean_firm_names2 <- function(df){
  df <- df %>%
    mutate(Firm := !!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(.), ignore.case = T, value = T)) %>%
             str_replace_all(pattern = "(\\W)+", " ") %>%
             ...str manipulations...
             str_squish()) %>%
    dplyr::select(-(!!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(.), ignore.case = T, value = T))))
  return(df)
}
I have tried using as.character() around the grep() function but that did not solve the problem. I have looked at the CSV that the function is meant to operate on and all of the column names are character strings. I read in the CSV using vroom(), as with my other CSVs, and that works fine, all of the column names appear. I can perform other dplyr functions on the df, suggesting to me that the df is behaving normally otherwise. I have run out of ideas as to why the function is choking up only on SOME of my CSVs but works as intended on others. Has anyone run into similar issues or got any clues as to what might be causing this error? This is the first time I've used SO-- I'm sorry if this question isn't very clear. I'll try and edit as needed.
Thanks!
Note that grep() by default returns the indices of the matches (integers), not the matches themselves (strings); your value = T is what asks for the strings. Integer indices can be passed directly to dplyr::rename, so perhaps the following may work better?
i <- grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(df), ignore.case = T)
df <- df %>%
  rename(Firm = i) %>%
  mutate(Firm = ...str manipulations...)
(There is an implicit assumption here that your grep() returns a single index. Additional code may be required to handle multiple matches.)
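For illustration, one hedged way to make that assumption explicit is to check the length of the grep() result before renaming (the stop() message is just a placeholder):
i <- grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(df), ignore.case = TRUE)
# fail loudly rather than renaming the wrong column
if (length(i) != 1) {
  stop("Expected exactly one matching column, found ", length(i))
}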

combining ldply to combine multiple csv files AND add column with file names via mutate/basename

I'm trying to apply the code here, which uses ldply to combine multiple csv files into one data frame.
I'm trying to figure out what the appropriate tidyverse syntax is to add a column that lists the name of the file from which the data comes from.
Here's what I have
test <- ldply(.data = list.files(pattern = "*.csv"),
              .fun = read.csv,
              header = TRUE) %>%
  mutate(filename = gsub(".csv", "", basename(x)))
I get:
Error in basename(x) : object 'x' not found
My understanding is that basename() takes a path, but when I set the path to the folder containing the files, the filename column that ends up getting added just has the folder name.
Any help is much appreciated!
You could use purrr::map_dfr:
purrr::map_dfr(list.files(pattern = "*.csv", full.names = TRUE),
               ~ read.csv(.x) %>% mutate(file = sub(".csv$", "", basename(.x))))
We can use imap_dfr:
library(purrr)
library(dplyr)
library(stringr)
library(readr)

files <- list.files(pattern = "*.csv", full.names = TRUE)
fileSub <- str_remove(basename(files), "\\.csv$")

imap_dfr(setNames(files, fileSub), ~ read_csv(.x) %>%
           mutate(file = .y))
I don't know if this helps anyone; I stumbled across this solution, which is very simple.
Context: the .id column created by ldply lists the names of each item in your input vector. So, to combine multiple csv files and create a new column with the file name, you can do:
library(plyr)

# get csv files in current working directory as a character vector
file_names <- list.files(pattern = "*.csv")  # for the example above it is .data = list.files(pattern = "*.csv")
# name the items (in this case equal to the items themselves, but can be subbed out for sample IDs)
names(file_names) <- paste(file_names)  # or, for the example above, names(.data) <- paste(.data)
# then use ldply to do the hard work
combined_csv <- ldply(file_names, read.csv)
# the names are stored under .id
combined_csv$.id
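If, as in the question, you want the file name without the .csv extension, one more line cleans up the .id column:
combined_csv$.id <- sub("\\.csv$", "", combined_csv$.id)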

How to use purrr with dplyr to filter list elements and export lists into Excel

I'm fairly new to working with lists in R and have a quick question that also involves using purrr. Below are two small sample data frames as an example.
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals <- c("Cat","Cat","Dog","Rat","Bird")
Living <- c("House","Condo","Condo","Apartment","House")
Data1 <- data.frame(Client1,Animals,Living)
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals2 <- c("Cat","Dog","Dog","Rat","Cat")
Living2 <- c("House","Apartment","Apartment","Family","Apartment")
Data2 <- data.frame(Client1,Animals2,Living2)
Bonus if you can include how to rename list elements at once instead of using the two lines below:
names(Data1)[1:3] <- c("Client","Animals","Living")
names(Data2)[1:3] <- c("Client","Animals","Living")
So next, if I want to filter each data frame by Animals and then export each into an Excel spreadsheet, I can use the two lines of code below:
Data1 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data1.csv")
Data2 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data2.csv")
However, to be more efficient I can join both data frames into a list and use purrr to filter each at the same time.
DataList <- list(Data1,Data2)
DataList %>% map(~filter(.,Animals=="Cat"))
For the above code, I would need multiple ~filter lines, one for each animal, so I'm not sure if there's a more efficient way that avoids writing many different lines of code while still using purrr and dplyr.
Also, how do I use write.csv with purrr? I can export the list into one spreadsheet, but I'm not sure how to break up the list so that it exports properly. Alternatively, I could export each list element into a separate spreadsheet. It would be great to see a solution for both of these situations.
If I understand your question correctly, you want to write a separate file for each of the Animals of both the data frames:
DataList <- list(Data1, Data2)

library(purrr)

a <- DataList %>%
  map(., function(x) {
    colnames(x) <- c("Client", "Animals", "Living")
    x
  }) %>%
  map(., function(x) {
    split(x, x$Animals)
  }) %>%
  flatten(.)

names(a) <- paste0("Data", 1:length(a))

lapply(1:length(a), function(x) write.csv(a[[x]],
                                          file = paste0(names(a[x]), ".csv"),
                                          row.names = FALSE))
We first dump both the data frames in DataList, then rename the columns for both the data frames with the first map, then split both the data frames by Animals, and finally flatten the nested list.
I wish I could do this without breaking the chain, but I couldn't find another way.
From here, we first rename the elements of the list, then use lapply to loop over all the elements in the list and apply write.csv on each of them.
You mentioned Excel: you can just as easily replace write.csv with any of the functions for writing Excel files from R.
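For example, the writexl package (one option among several; the file name here is just illustrative) takes a named list of data frames and writes one sheet per element, which fits the list a built above:
library(writexl)

# each element of the named list becomes its own worksheet
write_xlsx(a, path = "animals.xlsx")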
Here is one option, involving binding the two datasets together before re-splitting.
library(purrr)
library(dplyr)
DataList %>%
  map(~setNames(.x, c("Client", "Animals", "Living"))) %>%
  setNames(c("Data1", "Data2")) %>%
  bind_rows(.id = "id") %>%
  split(list(.$id, .$Animals), drop = TRUE) %>%
  map(~select(.x, -id) %>%
        write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
                  row.names = FALSE))
The first map line shows how to rename the columns of all the datasets in a list at once via setNames.
DataList %>%
  map(~setNames(.x, c("Client", "Animals", "Living")))
I then set the names of the datasets in the list via setNames. While stacking the datasets together into a single data.frame via dplyr's bind_rows, these names are added as a new column, id.
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id")
The last step is to split the combined data.frame by id and Animal before writing each split into a separate csv file. Information is pulled out of the dataset for naming the individual files by dataset and animal (this was the reason to name the elements of DataList). I removed the id variable via select prior to writing the files, as it may be extraneous to your needs.
split(list(.$id, .$Animals), drop = TRUE) %>%
  map(~select(.x, -id) %>%
        write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
                  row.names = FALSE))
This can all be done without putting these into a single data.frame, but I had trouble with naming the files at the end.
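For what it's worth, a sketch of that list-only route, using purrr's iwalk() to carry the names along so the files can still be named by dataset and animal (the naming scheme matches the one above; untested against your real data):
DataList %>%
  map(~setNames(.x, c("Client", "Animals", "Living"))) %>%
  setNames(c("Data1", "Data2")) %>%
  iwalk(function(dat, dat_name) {
    # split one dataset by animal, then write each piece to its own file
    split(dat, dat$Animals) %>%
      iwalk(function(d, animal) {
        write.csv(d, file = paste0(dat_name, animal, ".csv"), row.names = FALSE)
      })
  })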
