In R data frames functions to replace original frame - r

In R, each time a data frame is filtered for example are there any changes made to the source data frame? What are best practices for preserving the original data frame?

Okay, so I do not understand exactly what you mean but, if you have a .csv file for example ("example.csv") in your working directory and you create an r-object (example) from it, the original .csv file is maintained intact.
The example object however changes whenever you apply functions or filters to it. The easiest way to maintain an original data frame is to apply those functions to a differently named object (i.e. example2)

you may save as another data frame or output them for preservation
mtcars1 <- mtcars %>%
select(mpg,cyl,hp,vs)
Save one object to a file
saveRDS(mtcars1 , file = "my_data.rds")
Restore the object
readRDS(file = "my_data.rds")
Save multiple objects
save(mtcars, mtcars1, file = "multi_data.RData")
Restore multiple objects again
load("multi_data.RData")

Related

How to export list as CSV

I have created a dataset that consists of 574 Rows and 85 Columns. The data type is a list. I want to export this data to CSV as I want to perform some analysis. I tried converting List to Dataframe using dataFrame <- as.data.frame(Data) command. I also looked out for other commands but was not able to convert the list to dataframe, or any other format. My goal is to export the data to a CSV file.
This image is a preview of the dataset:
This image shows that data type is list of dimension 574*85:
You can try this "write.csv" function on your list.
write.csv(list,"a.csv")
it will automatically save in your working directory.
Provide a list format of yours. I'm not sure below answer is useful for you or not.
If your list is as below
all_data_list = [[1,2,3],[1,4,5],[1,5,6],...]
you have to do:
df = pd.DataFrame(all_data_list)

How best to handle unknown object names in R

I am writing a generic procedure and I don't understand how to handle names of objects that are unknown. In this case I am loading all *.Rda files in a directory and doing rbind to make a data frame. The names and number of Rda files can vary. My question is how best to handle this situation?
library(data.table)
# Load all data frames in wd
my_files <- list.files(pattern='*.Rda',full.names = TRUE)
# Names of files without .Rda suffix
my_files_names <- gsub(".Rda$","",list.files(pattern='*.Rda'))
# load each data frame, creates objects with names in my_files_names
for(i in 1:length(my_files)){
load(my_files[i])
}
# make large data frame from all loaded data frames
combined_df <- rbindlist(my_files_names)
I am getting the error
Input is character but should be a plain list of items to be stacked
combined_df <- rbindlist(as.list(my_files_names)) doesn't work.
The example works using rbind with each object as an argument, but for some reason a character vector can't be used to refer to objects with names not known at run-time. What am I missing?
The solution was a two-liner:
library(dplyr)
my_files <- list.files(pattern='*.rds',full.names = TRUE)
combined_df <- bind_rows(lapply(my_files, readRDS))
First, the names of the objects were not important so I could this different approach. Second, the use of .Rda files was causing problems. This file type can contain more than one object. Although my files only had a single data frame per file, the code about would not run with load as an argument in lapply. I converted my files to .rds files, which only allow one data frame per file and the code ran fine.

Creating temporary data frames in R

I am importing multiple excel workbooks, processing them, and appending them subsequently. I want to create a temporary dataframe (tempfile?) that holds nothing in the beginning, and after each successive workbook processing, append it. How do I create such temporary dataframe in the beginning?
I am coming from Stata and I use tempfile a lot. Is there a counterpart to tempfile from Stata to R?
As #James said you do not need an empty data frame or tempfile, simply add newly processed data frames to the first data frame. Here is an example (based on csv but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
# do stuff
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
rbind(pre_processor(read.csv('2.csv'))) %>%>
...
Now if you have a lot of files or a very complicated pre_processsing then you might have other questions (e.g. how to loop over the list of files or to write the right pre_processing function) but these should be separate and we really need more specifics (example data, code so far, etc.).

how do i get my variables to be recognized?

I am writing a dataframe using a csv file. I am making a data frame. However, when I go to run it, it's not recognizing the objects in the file. It will recognize some of them, but not all.
smallsample <- data.frame(read.csv("SmallSample.csv",header = TRUE),smallsample$age,smallsample$income,smallsample$gender,smallsample$marital,smallsample$numkids,smallsample$risk)
smallsample
It wont recognize marital or numkids, despite the fact that those are the column names in the table in the .csv file.
When you use read.csv the output is already in a dataframe.
You can simple use smallsample <- read.csv("SmallSample.csv")
Result using a dummy csv file
<table><tbody><tr><th> </th><th>age</th><th>income</th><th>gender</th><th>marital</th><th>numkids</th><th>risk</th></tr><tr><td>1</td><td>32</td><td>34932</td><td>Female</td><td>Single</td><td>1</td><td>0.9611315</td></tr><tr><td>2</td><td>22</td><td>50535</td><td>Male</td><td>Single</td><td>0</td><td>0.7257541</td></tr><tr><td>3</td><td>40</td><td>42358</td><td>Male</td><td>Single</td><td>1</td><td>0.6879534</td></tr><tr><td>4</td><td>40</td><td>54648</td><td>Male</td><td>Single</td><td>3</td><td>0.568068</td></tr></tbody></table>

R: give data frames new names based on contents of their current name

I'm writing a script to plot data from multiple files. Each file is named using the same format, where strings between “.” give some info on what is in the file. For example, SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv.
These data will be from multiple sites, so SITE, or WDSD_30, or any other string, may be different depending on where the data is from, though it's position in the file name will always indicate a specific feature such as location or measurement.
So far I have each file read into R and saved as a data frame named the same as the file. I'd like to get something like the following to work: if there is a data frame in the global environment that contains WDSD_30, then plot a specific column from that data frame. The column will always have the same name, so I could write plot(WDSD_30$meas), and no matter what site's files were loaded in the global environment, the script would find the WDSD_30 file and plot the meas variable. My goal for this script is to be able to point it to any folder containing files from a particular site, and no matter what the site, the script will be able to read in the data and find files containing the variables I'm interested in plotting.
A colleague suggested I try using strsplit() to break up the file name and extract the element I want to use, then use that to rename the data frame containing that element. I'm stuck on how exactly to do this or whether this is the best approach.
Here's what I have so far:
site.files<- basename(list.files( pattern = ".csv",recursive = TRUE,full.names= FALSE))
sfsplit<- lapply(site.files, function(x) strsplit(x, ".", fixed =T)[[1]])
for (i in 1:length(site.files)) assign(site.files[i],read.csv(site.files[i]))
for (i in 1:length(site.files))
if (sfsplit[[i]][10]==grep("PARQL", sfsplit[[i]][10]))
{assign(data.frame.getting.named.PARQL, sfsplit[[i]][10])}
else if (sfsplit[i][10]==grep("IRBT", sfsplit[[i]][10]))
{assign(data.frame.getting.named.IRBT, sfsplit[[i]][10])
...and so on for each data frame I'd like to eventually plot from.Is this a good approach, or is there some better way? I'm also unclear on how to refer to the objects I made up for this example, data.frame.getting.named.xxxx, without using the entire filename as it was read into R. Is there something like data.frame[1] to generically refer to the 1st data frame in the global environment.

Resources