I have a function that is intended to operate on data obtained from a variety of sources with many manual entry fields. Since I don't know what layout or naming convention to expect in these files, I want it to 'scan' a data frame for a column whose name contains the character string 'fix', 'name', or 'agent', copy that column to a new column named 'Firm', do string cleaning on the entries of that column, and finally remove the original column. I have gotten it to work with SOME of the CSVs that I have already, but have now run into this error: ONLY STRINGS CAN BE CONVERTED TO SYMBOLS. I have looked at the thread "ERROR: Only strings can be converted to symbols", but to no avail.
Here is the function at the moment:
clean_firm_names2 <- function(df){
df <- df %>%
mutate(Firm := !!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T)) %>%
str_replace_all(pattern = "(\\W)+"," ") %>%
...str manipulations...
str_squish()) %>%
dplyr::select(-(!!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T))))
return(df)
}
I have tried using as.character() around the grep() function but that did not solve the problem. I have looked at the CSV that the function is meant to operate on and all of the column names are character strings. I read in the CSV using vroom(), as with my other CSVs, and that works fine, all of the column names appear. I can perform other dplyr functions on the df, suggesting to me that the df is behaving normally otherwise. I have run out of ideas as to why the function is choking up only on SOME of my CSVs but works as intended on others. Has anyone run into similar issues or got any clues as to what might be causing this error? This is the first time I've used SO-- I'm sorry if this question isn't very clear. I'll try and edit as needed.
Thanks!
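Note that grep() returns the indices of the matches (integers) by default; it returns the matching strings only when value = TRUE, as in your function. Either way, rlang::sym() needs exactly one string, so the call fails whenever grep() matches zero columns or more than one, which would explain why only some of your CSVs trigger the error. Integer indices can be passed directly to dplyr::rename, so perhaps the following may work better?
i <- grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(df), ignore.case = TRUE)
df <- df %>%
rename(Firm = all_of(i)) %>%
mutate(Firm = ...str manipulations... )
(There is an implicit assumption here that your grep() returns a single index. Additional code may be required to handle zero or multiple matches.)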
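A minimal sketch of that additional handling, assuming you want to stop on zero matches and fall back to the first match when several columns qualify. The helper find_firm_column(), the function clean_firm_names3(), and the example data are hypothetical, and the two string-cleaning steps stand in for the elided manipulations:

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not from the original post): return exactly one
# matching column name, or fail with a readable message.
find_firm_column <- function(df, pattern = "(AGENT)|(NAME)|(FIX)") {
  hits <- grep(pattern, colnames(df), ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) {
    stop("No column name matches the pattern: ", pattern)
  }
  if (length(hits) > 1) {
    warning("Several columns match (", paste(hits, collapse = ", "),
            "); using the first one.")
  }
  hits[1]
}

clean_firm_names3 <- function(df) {
  firm_col <- find_firm_column(df)
  df %>%
    rename(Firm = all_of(firm_col)) %>%          # rename drops the old name
    mutate(Firm = Firm %>%
             str_replace_all("\\W+", " ") %>%    # non-word runs -> one space
             str_squish())                       # trim and collapse spaces
}

# Made-up example data
example <- data.frame(`Agent Name` = c("  Acme,  Inc. ", "Foo&Bar LLC"),
                      amount = c(1, 2), check.names = FALSE)
cleaned <- clean_firm_names3(example)
```

Because rename() replaces the old column name, no separate select(-...) step is needed to drop the original column.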
How do I set names on a list? My code is below.
Currently, split_data contains two sub-lists, [[1]] and [[2]]. How can I name them separately?
I want to name [[1]] 'A' and [[2]] 'B', so that I can retrieve data using split_data['A'], and so on.
Can anyone help with this? Thanks.
For instance, with ma <- list(a=c('a1','a2'),b=c('b1','b2')) I can use ma["a"] to get a sub-list.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
Others have shown you in the comments how to get what you want using split() instead of group_split(). That seems like the easiest solution.
However, if you're stuck with the existing code, here's an alternative that keeps your current code, and adds the names.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
names(split_data) <- test_data %>% group_by(category) %>% group_keys() %>% apply(1, paste, collapse = ".")
The idea is to use group_by to split in the same way group_split does, then extract the keys as a tibble. This will have one row per group, but will have the different variables in separate columns, so I put them together by pasting the columns with a dot as separator. The last expression in the pipe is equivalent to apply(keys, 1, f), where f is function(row) paste(row, collapse = "."). It applies f to each row of the tibble, producing a single name.
This should work even if the split happens on multiple variables, and produces names similar to those produced by split().
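For comparison, here is a small base R sketch of the split() approach mentioned at the top, including a split on two variables to show the dot-separated names. The region column is made up for illustration:

```r
test_data <- data.frame(category = c("A", "B", "A", "B", "A", "B", "A", "B"),
                        sales = c(1, 2, 4, 5, 8, 1, 4, 6))

# split() names the pieces after the grouping values
by_category <- split(test_data, test_data$category)
names(by_category)          # "A" "B"
by_category[["A"]]$sales    # 1 4 8 4

# Hypothetical second grouping variable
test_data$region <- rep(c("East", "West"), each = 4)
by_both <- split(test_data, list(test_data$category, test_data$region))
"A.East" %in% names(by_both)   # TRUE: names join the keys with a dot
```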
I have multiple CSV files in the same format that I need to combine, but before that I need to:
1. Deal with the header, which is not the first row but the 4th. Should I remove the first 3 rows with skip, or should I reassign the header?
2. Add a column holding the ID of the file (same as the file name) before I combine.
3. Extract only 4 columns from a total of 7.
4. Sum up the numbers under a category.
5. Combine all the CSV files into one.
This is what I have so far. I do steps 1, 3 and 4, then step 2 to add the column, then step 5. I'm not sure whether I should add the ID column first.
files = list.files(pattern = "*.csv", full.names = TRUE)
library("tidyverse")
library("dplyr")
data = data.frame()
for (file in files){
temp <- read.csv(file, skip=3, header = TRUE)
colnames(temp) <- c("Volume", "Unit", "Category", "Surpass Object", "Time", "ID")
temp <- temp [, c("Volume", "Category", "Surpass Object")]
temp <- subset(temp, Category =="Surface")
mutate(id = file)
aggregate(temp$Volume, by=list(Category=temp$Category), FUN=sum)
}
And I got an error:
Error in is.data.frame(.data) :
argument ".data" is missing, with no default
The code is fine if I didn't put in the mutate line so I think the main problem comes from there but any advice will be appreciated.
I am quite new to R and really appreciate all the comments that I can get here.
Thanks in advance!
Since you appear to be trying to use dplyr, I'll stick with that theme.
library(dplyr)
library(purrr)
files = list.files(pattern = "*.csv", full.names = TRUE)
results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip=3, header=TRUE), .id = "filename") %>%
select(filename, Category, Volume, Surpass) %>% # no idea why you want Surpass
group_by(filename, Category) %>%
summarize(Volume = sum(Volume)) # Surpass is discarded here
Walk-through:
purrr::map_dfr iterates our function (read.csv(...)) over each of the inputs (each file in files) and row-concatenates the results. Since we named the files with themselves (setNames(nm=files) is akin to names(files) <- files), we can use .id="filename", which adds a "filename" column that reflects which file each row was taken from.
select(...) whatever four columns you said you needed. Frankly, since you're aggregating, we really only need c("filename", "Category", "Volume"), anything else and you likely have missed something in your explanation.
group_by(..) will allow us to get one row for each filename, each Category, where Volume is a sum (calculated in the next step, summarize).
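As for the error in the question: mutate(id = file) was called without a data frame, which is why dplyr reports that ".data" is missing. A minimal sketch of the loop with that fixed, assuming made-up two-column CSVs written to a temporary folder (the file contents and the pieces/combined names are illustrative only):

```r
library(dplyr)

# Made-up input: two small CSVs with three junk lines above the header
dir <- tempfile("csv_demo")
dir.create(dir)
writeLines(c("junk", "junk", "junk",
             "Volume,Category", "1,Surface", "2,Surface", "5,Other"),
           file.path(dir, "a.csv"))
writeLines(c("junk", "junk", "junk",
             "Volume,Category", "10,Surface", "3,Other"),
           file.path(dir, "b.csv"))

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
pieces <- list()
for (file in files) {
  temp <- read.csv(file, skip = 3, header = TRUE)
  temp <- subset(temp, Category == "Surface")
  temp <- mutate(temp, id = basename(file))   # data frame passed explicitly
  pieces[[file]] <- temp %>%
    group_by(id, Category) %>%
    summarise(Volume = sum(Volume), .groups = "drop")
}
combined <- bind_rows(pieces)   # one row per file and category
```

Each per-file result is assigned and collected, so nothing is lost between iterations.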
You can use read.csv(), but if there are many files, I suggest using fread() from the data.table package. It is significantly faster. I used fread() here, but the code will still work if you switch it out for read.csv(). fread() is also more advanced. You will find that even things like skip can sometimes be left out, and the file will still be read correctly.
library(tidyverse)
library(data.table)
add_filename <- function(flnm){
fread(flnm, skip = 3) %>% # read file
mutate(id = basename(flnm)) # creates new col id w/ basename of the file
}
# single data frame all CSVs; id in first col
df <- list.files(pattern = "*.csv", full.names = TRUE) %>%
map_df(add_filename) %>% # call add_filename() on each file and row-bind the results
select(id, Volume, Category, `Surpass Object`)
I get the impression that you wanted to aggregate but keep the consolidated data frame, as well. If that's the case, you'll keep the aggregation separate from building the data frame.
df %>% # not assigned to a new object, so only shown in console
filter(Category == "Surface") %>% # filter for the category desired
{sum(.$Volume)} # sum the remaining values for volume
If you are not aware, the period in that last call is the data carried forward, so in this case, the filtered data. The simplest way (perhaps not the best way) to explain the {} is that sum() is not designed to handle data frames - therefore isn't inherently friendly with dplyr piping.
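A small sketch of the same idea on made-up data, showing the braces next to an alternative that avoids them by extracting the column with pull() first (the toy data frame is an assumption, not the asker's data):

```r
library(dplyr)

# Toy stand-in for the combined data frame
toy <- data.frame(Category = c("Surface", "Surface", "Other"),
                  Volume = c(1, 2, 5))

# Braces: the filtered data frame is available as `.` inside { }
total_braces <- toy %>%
  filter(Category == "Surface") %>%
  {sum(.$Volume)}

# Alternative: pull() returns the column as a plain vector, which sum() accepts
total_pull <- toy %>%
  filter(Category == "Surface") %>%
  pull(Volume) %>%
  sum()
```

Both approaches give the same total here; which one reads better is largely a matter of taste.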
If you wanted the sum of volume for every category instead of only "Surface" that you had coded in your question, then you would use this instead:
df %>%
group_by(Category) %>%
summarise(sum(Volume))
Notice I used the British spelling of summarize here. The function summarize() is in a lot of packages. I have just found it easier to use the British spelling for this function whenever I want to make sure it's the dplyr function that I've called. (tidyverse accepts the American and British spelling for nearly all functions, I think.)
I tried to read a subset of columns from a 'table' using spark_read_parquet,
temp <- spark_read_parquet(sc, name='mytable',columns=c("Col1","Col2"),
path="/my/path/to/the/parquet/folder")
But I got the error:
Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....
Is my syntax right? I tried googling for a (real) code example using the columns argument but couldn't find any.
(And my apologies in advance... I don't really know how to give you a reproducible example involving a spark and cloud.)
TL;DR This is not how columns works. When applied like this, the values are used to rename the columns, hence the vector's length should be equal to the number of columns in the input.
The way to use it is (please note memory = FALSE, it is crucial for this to work correctly):
spark_read_parquet(
sc, name = "mytable", path = "/tmp/foo",
memory = FALSE
) %>% select(Col1, Col2)
optionally followed by
... %>%
sdf_persist()
If you have a character vector, you can use rlang:
library(rlang)
cols <- c("Col1", "Col2")
spark_read_parquet(sc, name="mytable", path="/tmp/foo", memory=FALSE) %>%
select(!!! syms(cols))
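The column-selection step can be tried without a Spark connection. This sketch uses a local data frame as a stand-in (an assumption, since no cluster is available here) and compares unquote-splicing with rlang::syms() against tidyselect's all_of(); whether all_of() behaves identically on a Spark table depends on your sparklyr and dbplyr versions:

```r
library(dplyr)
library(rlang)

# Local stand-in for the Spark table
local_tbl <- data.frame(Col1 = 1:2, Col2 = c("a", "b"), Col3 = c(TRUE, FALSE))
cols <- c("Col1", "Col2")

# Turn each name into a symbol and splice them into select()
by_syms <- local_tbl %>% select(!!!syms(cols))

# Select by a character vector of names
by_all_of <- local_tbl %>% select(all_of(cols))
```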
I see a lot of questions on that but nothing works in my particular case.
I am building a function which pivots certain columns from long to wide. It spreads factors so they can be transformed to integers (flags). Inside I use the function spread(), which takes the name of the column to be spread. I tried all the combinations I could find and nothing works. This function is going to be reusable and will pivot various columns in one pass, so passing column names is essential.
This is one of the many tricks I tried that did not work: key = dataFrame[, columnName] inside of spread() (last line of the function body).
Here is the function code:
pivotColumn <- function(dataFrame, columnName) {
as.data.frame(dataFrame %>%
group_by_( .dots = names(dataFrame)[1:ncol(dataFrame)] ) %>%
tally %>% dplyr::rename(temporary = n) %>%
spread( key = dataFrame[, columnName], value = "temporary", fill = ""))
}
If someone wants to use this function, a dummy column needs to be added with unique values otherwise some observations will be removed as duplicates. After operation the column can be removed.(I do it at the beginning of cleaning and remove at the end).
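One possible direction, offered as a sketch rather than a confirmed fix: pass the column name as a string and let tidy-select resolve it with all_of(). This version swaps spread() for tidyr::pivot_wider() and fills with 0L instead of "", both of which are assumptions; pivot_column_sketch() and the demo data are hypothetical names:

```r
library(dplyr)
library(tidyr)

# Hypothetical rewrite; not the original pivotColumn()
pivot_column_sketch <- function(dataFrame, columnName) {
  dataFrame %>%
    group_by(across(everything())) %>%             # group on every column
    summarise(temporary = n(), .groups = "drop") %>%
    pivot_wider(names_from = all_of(columnName),   # column chosen by string
                values_from = temporary,
                values_fill = 0L) %>%              # 0 where a level is absent
    as.data.frame()
}

# Made-up data; row_id plays the role of the dummy unique column
demo <- data.frame(row_id = 1:3, colour = c("red", "blue", "red"))
flags <- pivot_column_sketch(demo, "colour")
```

With the unique row_id column present, each input row keeps its own output row, and the levels of colour become flag columns.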
I'm doing a simple operation using dplyr in R and got 'expecting single value' error
test <- data.frame(a=rep("item",3),b=c("step1","step2","step3"))
test %>% group_by(a) %>% summarize(seq = paste0(b))
I've seen similar threads but those use cases were more complex, and I couldn't figure out why these 2 lines don't work.
Since you only have one group ("item"), paste0 gets the vector of three items in b as input and returns a vector of three strings, but summarize expects a single value per group. You need to collapse the paste0 result to a single string like this:
library(dplyr)
test <- data.frame(a=rep("item",3), b=c("step1","step2","step3"))
test %>% group_by(a) %>% summarize(seq = paste0(b, collapse = ""))
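The difference is easy to see outside of dplyr. A short base R sketch (the " > " separator is just an illustrative choice):

```r
b <- c("step1", "step2", "step3")

n_without <- length(paste0(b))                 # 3: one string per element
n_with    <- length(paste0(b, collapse = ""))  # 1: a single string

joined <- paste0(b, collapse = "")             # "step1step2step3"
arrows <- paste(b, collapse = " > ")           # "step1 > step2 > step3"
```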