So I have 29 data files that I want to load into R. The files are called "1.csv", "2.csv", and so on up to "29.csv". Here is pseudocode depicting what I'm trying to do:
file.number <- c(1:29)
"the value in file.number".data <- read.csv("the value in file.number"".csv")
Basically I am looking for a way to load files based on a list, and label them accordingly. Is this possible?
Any help will be greatly appreciated!!!
This would probably work:
dfList <- setNames(lapply(paste0(1:29, ".csv"), read.csv), paste0(1:29, ".data"))
Now you've got a named list of 29 data frames. You can access each individual data frame with the $ operator, e.g. dfList$"4.data". Note that you'll need quotes or backticks, since the names begin with a digit. You can avoid that by using [[ to access the elements, i.e. dfList[["4.data"]], or by choosing different names such as paste0("data", 1:29), or any name that doesn't begin with a digit.
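For example, once dfList exists:

dfList$`4.data`       # back-ticks (or quotes) needed because the name starts with a digit
dfList[["4.data"]]    # [[ avoids the quoting issue
dfList2 <- setNames(lapply(paste0(1:29, ".csv"), read.csv), paste0("data", 1:29))
dfList2$data4         # a valid identifier, so $ works without quoting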
Another option would be Map:
Map(read.csv, paste0(1:29, ".csv"))
This will automatically set the names to the names of the file being read i.e. 1.csv, 2.csv, etc. But again, backticks or quotes would be needed to access the elements with the $ operator because the names begin with digits.
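For example:

dfList <- Map(read.csv, paste0(1:29, ".csv"))
dfList[["4.csv"]]                       # access by file name
names(dfList) <- paste0("data", 1:29)   # optional: switch to names that don't start with a digit
dfList$data4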
listwithdfs <- lapply(1:29, function(x) read.csv(paste0(x, ".csv")) )
names(listwithdfs) <- 1:29
It's better to have only a single object in the workspace.
Now you can index with:
listwithdfs[[13]]
I have a folder with multiple .dta files and I'm using the read_dta() function of the haven library to bind them. The problem is that some of the files have their column names in lower case and others have them in upper case.
I was wondering if there is a way to read only the specific columns, normalizing their names to lower case in every file, without reading the whole file and then selecting the columns, since the files are really large and that would take forever.
I was hoping that the .name_repair = argument of read_dta() would let me do this, but I really don't know how.
I'm trying something like this:
# Load haven for read_dta():
library(haven)

# Set working directory:
setwd("T:/")

# List of .dta file names to bind:
list_names <- list.files()
list_names <- list_names[grepl("_sdem.dta", list_names)]

# Variable names to select from those files:
vars_select <- c("r_def", "c_res", "ur", "con", "n_hog", "v_sel", "n_pro_viv", "fac", "n_ren", "upm", "eda", "clase1", "clase2", "clase3", "ent", "sex", "e_con", "niv_ins", "eda7c", "tpg_p8a", "emp_ppal", "tue_ppal", "sub_o")

# Read and bind ONLY the selected variables from the list of files:
dataset <- data.frame()
for (i in 1:length(list_names)) {
  temp_data <- read_dta(list_names[i], col_select = vars_select)
  dataset <- rbind(dataset, temp_data)
}
The problem is that when some of the files have their variable names in upper case, those variables are not in the vars_select list, and therefore the following error appears:
Error: Can't subset columns that don't exist.
x Columns `r_def`, `c_res`, `n_hog`, `v_sel`, `n_pro_viv`, etc. don't exist.
I was trying to use the .name_repair = argument of the read_dta() function to correct this, by using the tolower() function.
I was trying something like this with a specific file that has upper case variable names:
example_data <- read_dta("T:/2017_2_sdem.dta", col_select = vars_select, .name_repair = tolower(names()))
But the same error appears:
Error: Can't subset columns that don't exist.
x Columns `r_def`, `c_res`, `n_hog`, `v_sel`, `n_pro_viv`, etc. don't exist.
Thanks so much for your help!
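A possible workaround (a sketch, not from the thread; it assumes each file uses one case consistently, and read_sdem is a hypothetical helper name):

library(haven)
read_sdem <- function(path, vars) {
  # Try the lower-case selection first; on failure, fall back to the upper-case names.
  out <- tryCatch(
    read_dta(path, col_select = all_of(vars)),
    error = function(e) read_dta(path, col_select = all_of(toupper(vars)))
  )
  names(out) <- tolower(names(out))  # normalise the names after the selective read
  out
}
dataset <- do.call(rbind, lapply(list_names, read_sdem, vars = vars_select))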
I have 500 .csv files with data that looks like:
sample data
I want to extract one cell (e.g. B4, i.e. the value 0.477) from each csv file and combine those values into a single csv. What are some recommendations on how to do this easily?
You can try something like this:
library(readr)  # for read_lines() and write_lines()

# Store the names of the csv files in the path as a character vector:
all.fi <- list.files("/path/to/csvfiles", pattern="\\.csv$", full.names=TRUE)

ans <- sapply(all.fi, function(i) {
  eachline <- read_lines(i, skip=3, n_max=1)  # read only the 4th line of the file
  unlist(strsplit(eachline, ","))[2]          # split the line on commas, then take the 2nd field (column B)
})

write_lines(ans, "/path/to/output.csv")
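If you also want to record which file each value came from (hypothetical output path):

write_csv(data.frame(file = basename(all.fi), value = ans), "/path/to/output.csv")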
I can't add a comment, so I will write my comment here.
Since your data is very large and difficult to load individually, try this: Importing multiple .csv files into R. It is similar to the first part of your problem. For the second part, try this:
You can save your data as a data.frame (as in the comment by @Bruno Zamengo) and then use the select and merge functions in R. Then you can easily combine everything into a single csv file. With select and merge you can pick the values you need and then combine them. I used this idea in my project. Do not forget to use lapply.
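A minimal sketch of that select-then-combine idea (hypothetical path and column name):

csv_files <- list.files("/path/to/csvfiles", pattern = "\\.csv$", full.names = TRUE)
df_list <- lapply(csv_files, read.csv)                               # read each file once
df_list <- lapply(df_list, function(d) d[ , "value", drop = FALSE])  # select the column you need
combined <- do.call(rbind, df_list)                                  # combine into one table
write.csv(combined, "combined.csv", row.names = FALSE)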
I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each file has 5 rows (or fewer, since some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern="\\.txt$")
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header=TRUE)
  a$ID_Column <- x   # overwrite the (possibly incorrect) ID with the file name
  a
}))
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean of (inside the data frames), and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files.
  x <- id[1]
  x
Get the starting point of the user-specified ID.
Problem
  for (i in x:length(id)) {
    df <- rep(NA, length(id))
    df[i] <- lapply(files[i], read.csv, header=T)
    result <- do.call(rbind, df)
    return(df)
  }
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could proceed?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv$")[10:25]  # here [10:25] ... in production use your function parameter (id) here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles  # OPTIONAL: name each element after its csv, so rows can be traced back later
df <- do.call("rbind", df_list)
mean(df[ , "columnName"])
These code snippets should be easy to adapt and incorporate into your routine, as sketched below.
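For instance (a sketch only; it assumes the zero-padded file names sort in numeric order, so positional indexing with id matches the file numbers, and that pollutant is a column name):

pm <- function(directory, pollutant, id = 1:332) {
  csvFiles <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)[id]
  df_list <- lapply(csvFiles, read.csv, header = TRUE)
  df <- do.call("rbind", df_list)
  mean(df[ , pollutant], na.rm = TRUE)  # na.rm is an assumption; drop it if NAs should propagate
}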
You can aggregate your csv files into one big table like this:

bigtable <- data.frame()  # start with an empty table to rbind onto
for (i in 100:250) {
  infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep="")
  newtable <- read.csv(infile)
  newtable <- cbind(newtable, rep(i, dim(newtable)[1]))  # if you want to be able to identify tables after they are aggregated
  bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099; you'll have to distinguish those from the others because of the leading zeros, but it's fixable with a little treatment.
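For example, zero-padding the index with sprintf would cover those too (a small sketch, not part of the original answer):

infile <- paste0("C:/Users/cw/Documents/", sprintf("%03d", i), ".csv")  # 1 -> "001.csv", 10 -> "010.csv", 100 -> "100.csv"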
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=TRUE), which also handles the zero-padded file names.
They should also teach you to never use <<-.
CSV file looks like this (modified for brevity). Several columns have spaces in their titles, and R can't seem to distinguish them.
Alias;Type;SerialNo;DateTime;Main status; [...]
E1;E-70;781733;01/04/2010 11:28;8; [...]
Here is the code I am trying to execute:
s_data <- read.csv2( file=f_name )
attach(s_data)
s_df = data.frame(
scada_id=ID,
plant=PlantNo,
date=DateTime,
main_code=Main status,
seco_code=Additional Status,
main_text=MainStatustext,
seco_test=AddStatustext,
duration=Duration)
detach(s_data)
I have also tried substituting
main_code=Main\ status
and
main_code="Main status"
Unless you specify check.names=FALSE, R will convert column names that are not valid variable names (e.g. containing spaces or special characters, or starting with numbers) into valid variable names, e.g. by replacing spaces with dots. Try names(s_data). If you do use check.names=FALSE, then use single back-quotes (`) to surround the names.
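For example:

s_data <- read.csv2(file=f_name, check.names=FALSE)
names(s_data)           # the names keep their spaces
s_data$`Main status`    # back-quotes needed for names containing spaces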
I would also recommend using rename from the reshape package (or, these days, dplyr::rename).
s_data <- read.csv2( file=f_name )
library(reshape)
s_df <- rename(s_data, c(ID="scada_id",
    PlantNo="plant", DateTime="date", Main.status="main_code",
    Additional.Status="seco_code", MainStatustext="main_text",
    AddStatustext="seco_test", Duration="duration"))
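With dplyr::rename the argument order is reversed (new_name = old_name), so the equivalent would be something like:

library(dplyr)
s_df <- rename(s_data, scada_id = ID, plant = PlantNo, date = DateTime,
               main_code = Main.status, seco_code = Additional.Status,
               main_text = MainStatustext, seco_test = AddStatustext,
               duration = Duration)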
For what it's worth, the tidyverse tools (e.g. readr::read_csv) have the opposite default; they don't transform the column names to make them legal R symbols unless you explicitly request it.
s_data <- read.csv2( file=f_name , check.names=FALSE)
I believe spaces get replaced by dots "." when importing CSV files. So you'd write e.g. Main.status. You can check by entering names(s_data) to see what the names are.