I have 1000 files with similar column names. For example:
df1
DATE PRICE CLOSE
df2
DATE PRICE CLOSE
and so on...
If I merge them by date, the merge works, but the columns keep their original names. I want to rename them in a loop.
The merged data set currently looks like this:
Date Price Close PRICE CLOSE
I want something like
DATE PRICE1 CLOSE1 PRICE2 CLOSE2.
Is there any easy way to do it?
I have tried a couple of things, none of which gives the correct output.
This one uses the plyr package:
mod_join = function(mypath){
filenames=list.files(path=mypath, full.names=TRUE)
datalist = lapply(filenames, function(x){read.csv(file=x,header=T)[,c('Date','High','Low')]})
join_all(datalist,by = "Date")
}
This one uses merge on all the data frames:
merge2 = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  # read each file, keeping only the columns of interest
  datalist = lapply(filenames, function(x){read.csv(file=x,header=T)[,c('Date','High','Low')]})
  # full outer join of all data frames on Date
  Reduce(function(x,y) {merge(x,y,by.x= "Date",by.y = "Date",all=T)}, datalist)
}
I also tried a for loop that takes each data frame in turn, subsets it, and merges it afterwards, but somehow it is not subsetting the data frames:
for (i in 1:1000){
data_subset <- sprintf('data_%d',i)
mydata_subset <- data.frame(,data_subset["Date"],data_subset["High"],data_subset["DayLow"])
obj_name <- paste('subset_Pricedata',i,sep ="_")
assign(obj_name,value = mydata_subset)
}
Any help will be great.
Thanks
Hopefully, this will do your job:
library(plyr)
df1 = rename(df1,c("PRICE"="PRICE1","CLOSE"="CLOSE1"))
df2 = rename(df2,c("PRICE"="PRICE2","CLOSE"="CLOSE2"))
new = merge(df1,df2,all=TRUE)
Please comment if you face any difficulties.
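With 1000 files, renaming each data frame by hand is not practical. A minimal sketch of the same idea in a loop follows, assuming the data frames are already collected in a list (here a hypothetical `dfs` with made-up values) and that the key column is called DATE:

```r
# Toy stand-ins for the real files (hypothetical data)
dfs <- list(
  data.frame(DATE = c("2020-01-01", "2020-01-02"), PRICE = c(1, 2), CLOSE = c(3, 4)),
  data.frame(DATE = c("2020-01-01", "2020-01-02"), PRICE = c(5, 6), CLOSE = c(7, 8)),
  data.frame(DATE = c("2020-01-02", "2020-01-03"), PRICE = c(9, 10), CLOSE = c(11, 12))
)

# Append the list position to every column name except the key
renamed <- lapply(seq_along(dfs), function(i) {
  d <- dfs[[i]]
  is_key <- names(d) == "DATE"
  names(d)[!is_key] <- paste0(names(d)[!is_key], i)
  d
})

# Full outer join of all renamed data frames on DATE
merged <- Reduce(function(x, y) merge(x, y, by = "DATE", all = TRUE), renamed)
names(merged)
```

Renaming before merging means no two inputs ever share a non-key column name, so merge never has to disambiguate them itself.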
What about this approach?
It should be fast as it uses data.table and its fread-function
library(data.table)
merge2 <- function(mypath){
filenames <- list.files(path=mypath, full.names=TRUE)
fileslist <- lapply(filenames, function(nam){
# reads the file
file <- fread(nam)
setnames(file, 2, "price") # renames the second col to "price"
setnames(file, 3, "close") # third to "close"
return(file)
})
dat <- rbindlist(fileslist)
return(dat)
}
EDIT
I just realised that you want to merge your data instead of having it in the long format. What you can do is just add a variable with a name to the data.table "file" before returning the file by adding:
file[, varnam := nam]
and then cast the final data.table "dat" before returning it, using the reshape2 library and its dcast function.
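A rough sketch of that casting step, using made-up data in place of the stacked files. Note that it uses data.table's own dcast method rather than reshape2's, because the data.table method accepts several value columns at once; the column names date, price, close and the file labels f1/f2 are assumptions for illustration:

```r
library(data.table)

# Hypothetical long-format result of rbindlist(), with the file label in varnam
long <- data.table(
  date   = rep(c("2020-01-01", "2020-01-02"), times = 2),
  price  = c(1, 2, 5, 6),
  close  = c(3, 4, 7, 8),
  varnam = rep(c("f1", "f2"), each = 2)
)

# One row per date, one price_* and one close_* column per file
wide <- dcast(long, date ~ varnam, value.var = c("price", "close"))
names(wide)
```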
I had a similar problem. Here's what I ended up using, although there is likely a cleaner way.
The function suffix_col_names will add a suffix to a subset of columns. I use this because I eventually merge the week1 and week2 data on columns 1-10.
# function called suffix_col_names
suffix_col_names <- function(your_df, start_col, end_col, your_str, your_sep){
  for (i in start_col:end_col){
    # append your_str to the i-th column name, separated by your_sep
    colnames(your_df)[i] <- paste(colnames(your_df)[i], your_str, sep = your_sep)
  }
  return(your_df)
}
#call function to rename columns in week1 and week2
week_1_data<-suffix_col_names(week1,11,24,"1",".")
week_2_data<-suffix_col_names(week2,11,24,"2",".")
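The same renaming can be sketched without an explicit loop, since paste is vectorised over column names. The example below uses small made-up week1/week2 data frames with a hypothetical key column id. It also shows merge's suffixes argument, which adds suffixes to clashing non-key columns during the merge itself:

```r
# Toy data (hypothetical); "id" plays the role of the shared key columns
week1 <- data.frame(id = 1:2, a = 3:4, b = 5:6)
week2 <- data.frame(id = 1:2, a = 7:8, b = 9:10)

# Vectorised renaming of a range of columns
cols <- 2:3
week_1_data <- week1
names(week_1_data)[cols] <- paste(names(week_1_data)[cols], "1", sep = ".")

# Alternatively, let merge() suffix the clashing columns
merged <- merge(week1, week2, by = "id", suffixes = c(".1", ".2"))
names(merged)
```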
Related
I am pulling 10-Ks off the SEC website using the EDGAR package in R. Fortunately, the text files come with a consistent file naming convention: CIK number (this is a unique filing ID)_File type_Date.
Ultimately I want to analyze these by SIC/industry group, so I think the best way to do this would be to add the SIC industry code to this filename rule.
I am including an image of what I would like to do below. It is kind of like a database join except my file names would be taking the new field. Not sure how to do that, I am pretty new to R and file scripting.
I am assuming that you have a data.frame with a column filenames. (Or a vector containing all the filenames) See the code below:
# A data.frame with a character column 'filenames'
df$CIK <- sapply(df$filenames, FUN = function(x) {unlist(strsplit(x, split = "_"))[1]})
df$CIK <- as.character(df$CIK)
Now, let us assume that you have another data.frame with two columns: CIK and SIC.
# A data.frame with two character columns: 'CIK' and 'SIC'
# df2.
#
# We add another column to the first data.frame: 'new_filenames'
df$new_filenames <- sapply(1:nrow(df), FUN = function(idx, CIK, filenames, df2) {
  # look up the SIC code that belongs to this file's CIK
  SIC <- df2$SIC[which(df2$CIK == CIK[idx])]
  new_filename <- as.character(paste(SIC, "_", filenames[idx], sep = ""))
  new_filename
}, CIK = df$CIK, filenames = df$filenames, df2 = df2)
# Now the new filenames are available in df$new_filenames
View(df)
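The lookup above can also be written without sapply. The sketch below is one possible vectorised variant using match; the file names, CIK numbers and SIC codes are invented for illustration, and it assumes each CIK appears exactly once in df2:

```r
# Hypothetical inputs: file names of the form CIK_type_date, and a CIK-to-SIC table
df <- data.frame(filenames = c("1000_10-K_2015-03-01.txt", "2000_10-K_2015-04-01.txt"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(CIK = c("2000", "1000"), SIC = c("7372", "3571"),
                  stringsAsFactors = FALSE)

# Everything before the first underscore is the CIK
df$CIK <- sub("_.*$", "", df$filenames)

# match() gives, for each file, the row of df2 holding its CIK
df$SIC <- df2$SIC[match(df$CIK, df2$CIK)]
df$new_filenames <- paste(df$SIC, df$filenames, sep = "_")

# To rename the files on disk (left commented out here):
# file.rename(from = df$filenames, to = df$new_filenames)
```

A file whose CIK is missing from df2 gets NA as its SIC, so it is worth checking for NAs before renaming anything on disk.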
I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
And I would like to be able to take the date and agent name(pasting its constituent parts) and append them as columns to the right of the table, up until it finds a different name and date, doing the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am having is probably in using conditionals to make the script pick up the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report<-read.csv('test15.csv', header=T, stringsAsFactors=F)
# first, allocate the two additional columns (with NAs)
report$date <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))
# step through the rows
for (i in 1:nrow(report)) {
# grab current name and date if "Agent:"
if (report[i,1] == 'Agent:') {
currDate <- report[i+1,2]
currName=paste(report[i,2:5], collapse=' ')
# otherwise append the name/date
} else {
report[i,'date'] <- currDate
report[i,'agent'] <- currName
}
}
write.csv(report, 'test15a.csv')
The following loop takes ages. Is there a more time-efficient way to do this? The data.table below consists of 27 variables and more than 600k observations.
data <- read.table("file.txt", header = T, sep= "|")
colnames(data)[c(1)] <- c("X")
data <- as.data.table(data)
n <- 1
vector <- vector()
for (i in 2:nrow(data)) {
  if (data[["X"]][i] != data[["X"]][i-1]) {
    n <- 1
    vector[i] <- 1
  } else {
    n <- n + 1
    vector[i] <- n
  }
}
Basically, I need to index every appearance of a unique entry in X, i.e. the first time it appeared, the second time it appeared, etc., and then merge this into the existing data as an additional column. However, I got stuck at building the vector.
Thank you.
First off, use fread:
DT <- fread("file.txt", sep = "|")
Next, use setnames:
setnames(DT, 1, "X")
Finally, use rowid:
DT[ , vector := rowid(X)]
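A small illustration of what rowid produces, on a made-up column. One caveat, offered as an observation rather than a correction: rowid counts every appearance of a value across the whole column, whereas the loop in the question restarts the count whenever the value differs from the previous row. If the run-based behaviour is what is wanted, combining it with rleid is one option:

```r
library(data.table)

# Toy data (hypothetical values)
DT <- data.table(X = c("a", "a", "b", "a", "b"))

# Counts each appearance of a value over the whole column: 1 2 1 3 2
DT[, vector := rowid(X)]

# Counts within consecutive runs only, like the original loop: 1 2 1 1 1
DT[, run_vector := rowid(rleid(X))]
DT
```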
I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and a site. After reading, a particular column (the 13th here) is assigned to a variable aot500. I want this variable to be returned from the function for each combination of product and site. For example, I need variable names such as aot500.AF, aot500.CM and aot500.RB in the list that the function returns. I am having trouble with the return statement. There is no error, but there is nothing in dat_in; I expect it to have dat_in$aot500.AF etc. Please let me know what is wrong with the return statement. Furthermore, I want to read the files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')
readf <- function(pp, kk) {
fname.dsa <- paste("../data/site_data_",prod[pp],"/daily_",site[kk],".dat",sep="")
inp.aod <- read.csv(fname.dsa,skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
aot500 <- inp.aod[,13]
return(list(assign(paste("aot500",siteabbr[kk],sep="."),aot500)))
}
There is almost never a need to use assign(). We can solve the problem in two steps: read the files into a list, then give the list elements names.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')
# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")
# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
fname.dsa <- paste0("../data/site_data_",
prod_site[i, "prod"],
"/daily_",
prod_site[i, "site"],
".dat")
inp.aod <- read.csv(fname.dsa,
skip = 4,
stringsAsFactors = FALSE,
na.strings = "N/A")
inp.aod[, 13]
})
# Step 2: assign names to a list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
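To show how the named list is then used, here is a sketch that swaps the file reading for dummy numeric vectors (an assumption made only so the example runs without the data files):

```r
prod <- c("inv", "tot")
site <- c("AF", "CM", "RB")
prod_site <- expand.grid(prod = prod, site = site, stringsAsFactors = FALSE)

# Stand-in for reading column 13 of each file (hypothetical values)
res <- lapply(seq_len(nrow(prod_site)), function(i) c(0.1, 0.2) * i)
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")

names(res)
res$aot500.inv.AF        # access by name with $
res[["aot500.tot.RB"]]   # or with [[ ]], handy when the name is built in code
```

Because every element sits in one list, looping over all product/site combinations is just another lapply or for loop over names(res).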
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements
#' name of a product and name of a site
readAOT_500 <- function(prodsite,
selectedcolumn = c("AOT_500"),
path = tempdir()){
cat(path, prodsite)
filename <- paste0(path, prodsite[1],
prodsite[2], ".csv")
dtf <- read.csv(filename, stringsAsFactors = FALSE)
dtf <- dtf[selectedcolumn]
dtf$prod <- prodsite[1]
dtf$site <- prodsite[2]
return(dtf)
}
# Load one file for example
readAOT_500(c("inv", "AF"))
listofsites <- list(c("inv","AF"),
c("tot","AF"),
c("inv", "CM"),
c( "tot", "CM"),
c("inv", "RB"),
c("tot", "RB"))
# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)
# Combine all data frames together
prodsitedata <- Reduce(rbind,prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat",skip=4,sep=",",stringsAsFactors=F,na.strings="N/A")
# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv','tot'),
site = c('AF','CM','RB')) %>%
# Group variables to use do() later on
group_by(prod, site)
Create 6 fake files by sampling from the data you provided
You can skip this section when you have real data.
I used various sample length so that the number of observations
differs for each site.
prodsite$samplelength <- sample(1:495,nrow(prodsite))
prodsite %>%
do(stuff = write.csv(sample_n(daily_CM,.$samplelength),
paste0(tempdir(),.$prod,.$site,".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
do(read.csv(paste0(tempdir(),.$prod,.$site,".csv"),
stringsAsFactors = FALSE))
# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
select(prod, site, AOT_500)
I need to efficiently parse one of my data frame's columns (a URL string)
by calling a function (strsplit) on it, e.g.:
url <- c("www.google.com/nir1/nir2/nir3/index.asp")
unlist(strsplit(url,"/"))
My data frame, spark.data.url.clean, looks like this:
classes url
[107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3
This df has 100k rows and I don't want to loop/iterate over it, parse each url separately and write the results to a new data frame.
What I DO need/want is to create a new six-column data frame:
df.result <- data.frame(fullurl = as.character(),baseurl=as.character(), firstlevel = as.character(), secondlevel=as.character(),thirdlevel=as.character(),classificaiton=as.character())
I then want to call one of the "apply" family functions over spark.data.url.clean$url
and write the results to the new data frame df.result, such that the first column (fullurl) is populated with the relevant spark.data.url.clean$url and the 2nd to 5th columns are populated with the results of applying
unlist(strsplit(url,"/"))
- taking only the first, 2nd, 3rd and 4th elements of the resulting vector and putting them in the 2nd, 3rd, 4th and 5th columns of df.result, and finally putting spark.data.url.clean$classes in the new column df.result$classificaiton.
Sorry for the complication, and let me know if anything needs further clarification.
There is no need for apply, as far as I see it.
Try this:
spark.data.url.clean <- data.frame(classes = c(107,662,685,508,111,654,509),
url = c("drudgereport.com/level1/level2/level3", "drudgeddddreport.com/levelfe1/lefvel2/leveel3",
"drudgeaasreport2.com/lefvel13/lffvel244/fel223", "otherurl.com/level1/second/level3",
"whateversite.com/level13/level244/level223", "esportsnow.com/first/level2/level3",
"reeport2.com/level13/level244/third"), stringsAsFactors = FALSE)
df.result <- spark.data.url.clean
names(df.result) <- c("classification", "fullurl")
df.result[c("baseurl", "firstlevel", "secondlevel", "thirdlevel")] <- do.call(rbind, strsplit(df.result$fullurl, "/"))
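One thing to watch with do.call(rbind, strsplit(...)) is that it assumes every URL splits into the same number of parts; with unequal lengths rbind recycles values and warns. A hedged sketch of a variant that always keeps exactly four parts and pads short URLs with NA follows (the URLs are made up):

```r
# Hypothetical URLs with differing depths
urls <- c("a.com/l1/l2/l3", "b.com/l1", "c.com/l1/l2/l3/l4")

parts <- strsplit(urls, "/", fixed = TRUE)

# p[1:4] keeps the first four parts and yields NA where a part is missing
mat <- t(vapply(parts, function(p) p[1:4], character(4)))
colnames(mat) <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel")

df.result <- data.frame(fullurl = urls, mat, stringsAsFactors = FALSE)
df.result
```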
You could consider using the package splitstackshape to do this; we can use its cSplit function. Setting drop to F ensures that the original column is preserved. Note that it returns a data.table, not a data.frame.
library(splitstackshape)
output <- cSplit(dat,2,sep="/", drop=F)
data used:
dat <- data.frame(classes="[107,662,685,508,111,654,509]",
url="drudgereport.com/level1/level2/level3")
Here's an option with data.table which should be pretty fast. If your data looks like this:
> df
# classes url
#1 [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3
You can do the following:
library(data.table)
setDT(df) # convert to data.table
cols <- c("baseurl", "firstlevel", "secondlevel", "thirdlevel") # define new column names
df[, (cols) := tstrsplit(url, "/", fixed = TRUE)[1:4]] # assign new columns
Now, the data looks like this:
> df
# classes url baseurl firstlevel secondlevel thirdlevel
#1: [107,662,685,508,111,654,509] drudgereport.com/level1/level2/level3 drudgereport.com level1 level2 level3
The simple solution is to use:
apply(row, 2, function(col) {})