My task is fairly simple, and I would like to solve it in R. I have hundreds of Excel files (.xlsx) in a folder, and I want to replace a specific text string without altering the formatting of the worksheet and while preserving the rest of the text in each cell. For example:
Text to look for:
F13 A
Replace with:
F20
Text currently in a cell:
F13 A Year 2019
Desired result:
F20 Year 2019
I have googled a lot and haven't found anything appropriate, even though it seems like a common task. I have a solution using PowerShell, but it is very slow, and I can't believe there is no simple way to do this in R. I'm sure someone has had the same problem before; I'll take any suggestions.
You can try:

text_to_look <- 'F13 A'
text_to_replace <- 'F20'

all_files <- list.files('/path/to/files', pattern = '\\.xlsx$', full.names = TRUE)

lapply(all_files, function(x) {
  df <- openxlsx::read.xlsx(x)
  # Or use the readxl package:
  # df <- readxl::read_excel(x)
  # gsub() replaces the target text inside each cell while keeping the rest
  # of the string; only character columns are touched
  df[] <- lapply(df, function(col) {
    if (is.character(col)) gsub(text_to_look, text_to_replace, col, fixed = TRUE) else col
  })
  openxlsx::write.xlsx(df, basename(x))  # writes to the working directory
})
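Note that read.xlsx/write.xlsx rebuilds each file from a plain data frame, so the original worksheet formatting is lost. If keeping the formatting matters, a sketch along the following lines may work better: openxlsx's loadWorkbook() keeps the existing cell styles in the workbook object, and saveWorkbook() writes them back out (how faithfully every style survives the round trip depends on the file, so test on a copy first):

library(openxlsx)

text_to_look <- 'F13 A'
text_to_replace <- 'F20'

all_files <- list.files('/path/to/files', pattern = '\\.xlsx$', full.names = TRUE)

for (f in all_files) {
  wb <- loadWorkbook(f)  # loads values together with existing styles
  for (sh in names(wb)) {
    df <- readWorkbook(wb, sheet = sh, colNames = FALSE)
    # only touch text columns, so numbers and dates are left alone
    df[] <- lapply(df, function(col) {
      if (is.character(col)) gsub(text_to_look, text_to_replace, col, fixed = TRUE) else col
    })
    writeData(wb, sheet = sh, x = df, colNames = FALSE)  # overwrite values in place
  }
  saveWorkbook(wb, f, overwrite = TRUE)
}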
I have named one file for each day of the years 2012 through 2016 like this: 2012-01-01.csv, 2012-01-02.csv, ... up to 2016-12-31.csv. But for 2017 and 2018 the files are named like this: 20170101.csv, 20170102.csv, ... Could someone help me add the hyphens to the second set of files so that they follow the same naming scheme as the first ones?
Thanks!
Maybe you can try the following code with list.files + file.rename. Note that as.Date() takes a single format string (a vector of formats is recycled element-wise over the input, which silently mis-parses every other name), so the pattern below selects only the eight-digit names that still need renaming:

old <- list.files(pattern = "^\\d{8}\\.csv$")  # only the names without hyphens
new <- paste0(as.Date(sub(".csv", "", old, fixed = TRUE), format = "%Y%m%d"), ".csv")
file.rename(old, new)
Edit: use this instead:

setwd("C:\\Users\\...Path to your data")
DataFileNames <- list.files(pattern = "\\.csv$")
NewDataFileNames <- sub("(\\d{4})(\\d{2})(\\d{2})(.*)", "\\1-\\2-\\3\\4", DataFileNames)
file.rename(DataFileNames, NewDataFileNames)
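As a quick sanity check of the pattern: names that are already hyphenated have no run of eight consecutive digits, so sub() leaves them unchanged, while a new-style name is rewritten as intended:

sub("(\\d{4})(\\d{2})(\\d{2})(.*)", "\\1-\\2-\\3\\4", "20170101.csv")
# [1] "2017-01-01.csv"
sub("(\\d{4})(\\d{2})(\\d{2})(.*)", "\\1-\\2-\\3\\4", "2012-01-01.csv")
# [1] "2012-01-01.csv"   (no match, unchanged)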
Old answer:
There's a lot of missing information in your question, but you should be able to adjust the code below to suit your needs. Mostly, you'll need to edit the read.csv line if you have headers, and adjust other parameters.
Note: this will overwrite all your tables, so make sure the data is imported properly by the read.csv inside the lapply before you run the write.csv (last line).
setwd("C:\\Users\\...Path to your data")
DataFileNames <- list.files(pattern="\\.csv$")
Datafiles <- lapply(DataFileNames, read.csv, header=FALSE)
DataFileNames <- sub("(\\d{4})(\\d{2})(\\d{2}).csv","\\1-\\2-\\3",DataFileNames)
lapply(1:length(Datafiles), function(x) write.csv(Datafiles[x], DataFileNames[x]))
I have a list of HTML documents: I have taken some texts from the web and read them in with read_html.
My objects are named like this:
a1 <- read_html(link of the text)
a2 <- read_html(link of the text)
.
.
. ## until:
a100 <- read_html(link of the text)
I am trying to create a corpus from these.
Any ideas on how I can do it?
Thanks.
You could allocate a list beforehand (read_html() returns a parsed document, so store the results in a list rather than an atomic vector):

text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)
Even better, organize your links as a vector. Then, as suggested in the comments, you can use lapply:
text <- lapply(links, read_html)
(here links is a vector of the links).
It would be rather bad coding style to use assign:
# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))
since this is slow and leaves you with a hundred separate objects that are hard to process further.
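If the end goal is a corpus object, one way to get there (a sketch assuming the tm package, and that plain text of each page is what you want) is to flatten each parsed document to a string first; html_text() is rvest's standard extractor:

library(rvest)
library(tm)

docs <- lapply(links, read_html)                  # parse each page
texts <- vapply(docs, html_text, character(1))    # strip markup to plain text
corpus <- VCorpus(VectorSource(texts))            # one document per link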
I would suggest using purrr for this solution:
library(tidyverse)  # loads purrr
library(rvest)

files <- list.files("path/to/html_links", full.names = TRUE)

all_html <- tibble(file_path = files) %>%
  mutate(filenames = basename(files)) %>%
  mutate(text = map(file_path, read_html))

This is a nice way to keep track of which piece of text belongs to which file. It also makes things like sentiment analysis, or any other document-level analysis, easy.
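From there, plain text can be pulled out in the same pipeline; for example (html_text() again being rvest's extractor, and the column names those defined above):

all_html <- all_html %>%
  mutate(plain_text = map_chr(text, html_text))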
I have an issue that really bugs me: I've been trying to switch to RStudio projects (.Rproj) lately, because I would like to make my data and scripts available at some point. But with one of them, I get an error that, I think, should not occur. Here is the tiny piece of code that gives me so much trouble, the .Rproj being available at: https://github.com/fredlm/mockup.
library(readxl)
list <- list.files(path = "data", pattern = "file.*.xls") #List excel files
#Aggregate all excel files
df <- lapply(list, read_excel)
for (i in 1:length(df)){
df[[i]] <- cbind(df[[i]], list[i])
}
df <- do.call("rbind", df)
It gives me the following error right after "df <- lapply(list, read_excel)":
Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, : path[1]="file_1.xls": No such file or directory
Do you know why? When I do it old school, i.e. using 'setwd' before creating 'list', everything works just fine. So it looks like lapply does not know where to look for the files when used in an .Rproj, which seems very odd...
What did I miss?
Thanks :)
Thanks to a non-stackoverflower, a solution was found. It's silly, but 'list' was missing the directory, so lapply couldn't find the files. The following works just fine:

list <- paste("data/", list.files(path = "data", pattern = "file.*.xls"), sep = "") #List Excel files
I have some problems with R and merging tab-delimited files.
I guess my biggest problem is that we are talking about 165 files with a size of around 180 MB.
I now work on an Ubuntu server, after my local OS X machine was not able to handle the data.
I tried several methods, outlined below: the "lapply" method, the "plyr" method, a for-loop solution, and a newer method using the fread function.
Nevertheless, it has not been possible to merge all these files into one data.frame.
require(data.table)
require(bit64)
require(plyr)
setwd("~/Documents/Data/App/feed/")
options(stringsAsFactors = FALSE)
options(scipen = 999)
# List of all the contained files
file_list <- list.files()
#FREAD-Method
dataset = rbindlist(lapply(file_list, fread, header=FALSE, sep="\t"))
# LAPPLY-Method
dataset2 <- do.call("rbind",lapply(file_list, FUN=function(files){read.table(files, header=TRUE, sep="\t")}))
#LDPLY-Method
dataset3 <- ldply(file_list, read.table, header=FALSE, sep="\t", .progress = "text", .inform = TRUE)
I guess the for-loop solution is not very elegant either.
Now my big question: how can I merge my (big) files into one data.frame? Or is R simply not able to handle data of this size?
I would be happy if anybody could help me out; I have been trying to find a solution, with various parameters, for two days now.
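A sketch worth trying if the files' columns don't all line up (rbind-style merging fails on differing column counts): data.table's rbindlist can pad missing columns via fill = TRUE and record each row's source file via idcol:

library(data.table)
dataset <- rbindlist(
  lapply(file_list, fread, header = FALSE, sep = "\t"),
  use.names = TRUE, fill = TRUE, idcol = "source_file"
)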
I have a folder with several hundred CSV files. I want to use lapply to calculate the mean of one column within each CSV file and save those values into a new CSV file that would have two columns: column 1 would be the name of the original file, and column 2 would be the mean value for the chosen field from that file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have (at least) two problems that I cannot figure out.
First, why doesn't R recognize that column 1 is numeric? I double- and triple-checked the CSV files and I'm sure this column is numeric.
Second, how do I get the output file to return the two columns the way I described above? I haven't gotten far with the second part yet; I wanted to get the first part working first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
i <- 1:2  ## modify as needed

## create an empty data frame to collect results
df <- NULL

## directory from which all files are to be read
directory <- "C:/mydir/"

## read all file names from the directory
x <- list.files(directory, pattern = "csv")
xpath <- paste0(directory, x)

## for loop to read each file and save the metric and the file name
for (j in i) {
  file <- read.csv(xpath[j], header = TRUE, sep = ",")
  first_col <- file[, 1]
  d <- data.frame(mean = mean(first_col), filename = x[j])
  df <- rbind(df, d)
}

## write all output to csv
write.csv(df, file = "C:/mydir/final.csv", row.names = FALSE)
The output CSV looks like this:
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
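For an lapply-style variant of the same idea (a sketch; it assumes the first column is the one of interest and that every file parses cleanly with read.csv):

filenames <- list.files("C:/mydir", pattern = "\\.csv$", full.names = TRUE)
means <- unlist(lapply(filenames, function(f) mean(read.csv(f)[[1]], na.rm = TRUE)))
out <- data.frame(filename = basename(filenames), mean = means)
write.csv(out, "Expected_Value.csv", row.names = FALSE)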
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The CSV files that I had were originally one file, which I had split into multiple files by location. At the time, I thought this was necessary to calculate the mean for each location. Clearly, that was a mistake. I went back to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
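Equivalently, aggregate's formula interface does the same grouping and also names the result column after the variable:

EV <- aggregate(points ~ Loc, data = allshots, FUN = mean)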
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.