I am importing multiple Excel workbooks, processing them, and appending the results one after another. I want to create a temporary data frame (tempfile?) that holds nothing at the beginning and is appended to after each successive workbook is processed. How do I create such a temporary data frame at the start?
I am coming from Stata, where I use tempfile a lot. Is there an R counterpart to Stata's tempfile?
As @James said, you do not need an empty data frame or a tempfile; simply add newly processed data frames to the first data frame. Here is an example (based on csv, but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
# do stuff
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
rbind(pre_processor(read.csv('2.csv'))) %>%
...
Now, if you have a lot of files or very complicated pre-processing, you might have other questions (e.g. how to loop over the list of files, or how to write the right pre_processing function; see the sketch below), but those should be separate questions, and we would really need more specifics (example data, code so far, etc.).
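If you do end up looping, here is a minimal sketch of that pattern; the file-matching pattern is an assumption, and pre_processor is the placeholder function from above:
library(dplyr)
# assumes the csv files sit in the working directory
list_of_files <- list.files(pattern = "\\.csv$")
# read and pre-process each file, then stack the results row-wise
dataframe <- list_of_files %>%
  lapply(function(f) pre_processor(read.csv(f))) %>%
  bind_rows()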
Related
In R, when a data frame is filtered, for example, are any changes made to the source data frame? What are best practices for preserving the original data frame?
Okay, so I do not understand exactly what you mean, but if you have a .csv file (say "example.csv") in your working directory and you create an R object (example) from it, the original .csv file stays intact.
The example object, however, changes whenever you apply functions or filters to it. The easiest way to preserve the original data frame is to apply those functions to a differently named object (i.e. example2).
You can also save it as another data frame, or write it out to a file for preservation:
library(dplyr)
mtcars1 <- mtcars %>%
  select(mpg, cyl, hp, vs)
# Save one object to a file
saveRDS(mtcars1, file = "my_data.rds")
# Restore the object
mtcars1 <- readRDS(file = "my_data.rds")
# Save multiple objects
save(mtcars, mtcars1, file = "multi_data.RData")
# Restore the multiple objects again
load("multi_data.RData")
I am creating a data frame from a .csv file. However, when I run my code, it does not recognize the objects in the file. It recognizes some of them, but not all.
smallsample <- data.frame(read.csv("SmallSample.csv",header = TRUE),smallsample$age,smallsample$income,smallsample$gender,smallsample$marital,smallsample$numkids,smallsample$risk)
smallsample
It won't recognize marital or numkids, even though those are column names in the table in the .csv file.
When you use read.csv, the output is already a data frame. Your code also refers to smallsample$age etc. before smallsample exists, which is why those columns are not recognized.
You can simply use smallsample <- read.csv("SmallSample.csv")
Result using a dummy csv file:
  age income gender marital numkids      risk
1  32  34932 Female  Single       1 0.9611315
2  22  50535   Male  Single       0 0.7257541
3  40  42358   Male  Single       1 0.6879534
4  40  54648   Male  Single       3  0.568068
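For reference, a minimal sketch of reading the file and then working with the columns directly (this assumes SmallSample.csv is in your working directory):
smallsample <- read.csv("SmallSample.csv", header = TRUE)
str(smallsample)      # confirm all columns (including marital and numkids) were read
smallsample$marital   # columns are accessed from the existing object, not during creation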
I created hundreds of data frames in R, and I want to export them to a local destination. The names of all the data frames are stored in a vector:
name.vec <- c('df1','df2','df3','df4','df5','df6')
Each element of name.vec is the name of a data frame.
What I want to do is export those data frames as Excel files, but I do not want to do it the way below:
library("xlsx")
write.xlsx(df1,file ="df1.xlsx")
write.xlsx(df2,file ="df2.xlsx")
write.xlsx(df3,file ="df3.xlsx")
because with hundreds of data frames this is tedious and error-prone.
I want something like the following instead:
library('xlsx')
for (k in name.vec) {
write.xlsx(k,file=paste0(k,'.xlsx'))
}
but this does not work.
Does anyone know how to achieve this? Your time and knowledge would be deeply appreciated. Thanks in advance.
The first reason the for loop doesn't work is that the code attempts to write a single name, 'df1' for example, as the xlsx file contents instead of the data frame itself, because name.vec stores the names of the data frames, not the data frames. To fix the for loop, you'd have to do something more like this:
df.list<-list(df1,df2,df3)
name.vec<-c('df1','df2','df3')
library('xlsx')
for (k in seq_along(df.list)){
write.xlsx(df.list[[k]],file=paste0(name.vec[k],'.xlsx'))
}
However, you may prefer a more compact apply-style version that does the same thing:
sapply(seq_along(df.list),
       function(i) write.xlsx(df.list[[i]], file = paste0(name.vec[i], '.xlsx')))
The output is three xlsx files, one per data frame in the list, named by the name vector.
It may also be best at some point to switch to the newer package for this: writexl.
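If you only have the names (as in name.vec), one alternative sketch is to fetch the objects by name with mget() and write them out; this assumes the data frames live in the global environment and uses writexl:
library(writexl)
df.list <- mget(name.vec, envir = globalenv())  # look up each data frame by its name
for (k in name.vec) {
  write_xlsx(df.list[[k]], path = paste0(k, ".xlsx"))
}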
I have to run a correlation analysis on over 100 .txt files. I have a script which reads a single file, organizes the data the way I need, and then stores the correlation value as a new variable. The script is quite large, as the data gets reformatted a lot.
My question: how can I make this script run repeatedly over all 100+ .txt files and store the single correlation value for each in one data frame? Ideally the final data frame would have two columns, one with the .txt ID and another with the correlation coefficient, with 100+ rows.
Can I literally copy and paste the script into a for loop? If so, how would that look? I'm a newbie!
Any ideas?
Thanks!
As akrun mentioned, you can do this with lapply. Without seeing your data, I would recommend something like this:
my.files <- list.files(pattern = "txt") # use a pattern that only matches the files you want to read in
output <- lapply(my.files, correlation_function)
# Combine list of outputs into a single data.frame
output.df <- do.call(rbind, output)
This assumes that you have a function called correlation_function that takes a filename as input, loads the file into R, runs the correlation analysis, and returns a data.frame.
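For illustration only, here is a minimal sketch of what such a function could look like; the column names x and y and the whitespace-separated format are assumptions about your data:
correlation_function <- function(filename) {
  dat <- read.table(filename, header = TRUE)   # adjust sep/header to match your files
  data.frame(id = filename,                    # one row per file: file ID plus coefficient
             correlation = cor(dat$x, dat$y))
}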
I am working on a large questionnaire - and I produce summary frequency tables for different questions (e.g. df1 and df2).
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
c<-c(1,1,5,2,1)
Percent<-c(10,10,50,20,10)
df2<-data.frame(a,c,Percent)
rm(a,b,c,Percent)
I normally export the dataframes as csv files using the following command:
write.csv(df1, file = "df1.csv")
However, as my questionnaire has many questions and therefore many data frames, I was wondering if there is a way in R to combine different data frames (say, with a blank line separating them) and export them to a single csv, which I can then open in Excel. That way I would have just one file with all my question tables in it, one below the other, which would be much easier than opening individual files in turn to view the results.
Many thanks in advance.
If your end goal is an Excel spreadsheet, I'd look into some of the tools available in R for directly writing an xls file. Personally, I use the XLConnect package, but there is also xlsx and also several write.xls functions floating around in various packages.
I happen to like XLConnect because it allows for some handy vectorization in situations just like this:
require(XLConnect)
#Put your data frames in a single list
# I added two more copies for illustration
dfs <- list(df1,df2,df1,df2)
#Create the xls file and a sheet
# Note that XLConnect doesn't seem to do tilde expansion!
wb <- loadWorkbook("/Users/jorane/Desktop/so.xls",create = TRUE)
createSheet(wb,"Survey")
#Starting row for each data frame
# Note the +1 to get a gap between each
n <- length(dfs)
rows <- cumsum(c(1,sapply(dfs[1:(n-1)],nrow) + 1))
#Write the file
writeWorksheet(wb,dfs,"Survey",startRow = rows,startCol = 1,header = FALSE)
#If you don't call saveWorkbook, nothing will happen
saveWorkbook(wb)
I specified header = FALSE since otherwise it would write the column header for each data frame. But adding a single header row at the top of the xls file afterwards isn't much additional work.
As James commented, you could use
merge(df1, df2, by="a")
but that would combine the data horizontally. If you want to combine them vertically you could use rbind:
rbind(df1, df2, df3,...)
(Note: the column names need to match for rbind to work; here df1 has column b where df2 has column c, so you would need to rename one of them first.)
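If you do want the blank-line separator from the original question, here is a minimal sketch using write.table with append = TRUE; the output filename is an assumption:
out <- "all_questions.csv"
write.table(df1, file = out, sep = ",", col.names = NA)                 # first table, with header
cat("\n", file = out, append = TRUE)                                    # blank separator line
write.table(df2, file = out, sep = ",", col.names = NA, append = TRUE)  # second table below it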