Differences in imported data from one file vs. lots of files


I have built a function which allows me to process .csv files one by one. This involves importing data using the read.csv function, assigning one of the columns a name, and making a series of calculations based on that one column. However, I'm having problems with how to apply this function to a whole folder of files. Once a list of files is generated, do I need to read the data from each file from within my function, or prior to the application of it? This is what I had previously to import the data:
AllData <- read.csv("filename.csv", header=TRUE, skip=7)
DataForCalcs <- AllData[5]
My code resulted in the calculation of a number of variables, which I put into a matrix at the end of the code, and used the apply function to calculate the max of each of those variables.
NewVariables <- matrix(c(Variable1, Variable2, Variable3, Variable4, Variable5), ncol = 5)
colnames(NewVariables) <- c("Variable1", "Variable2", "Variable3", "Variable4", "Variable5")
apply(NewVariables, 2, max, na.rm=TRUE)
This worked great, but I then need to write this table to a new .csv file, which contains these results for each of the ~300 files I want to process, preceded by the name of each file. I'm new to this, so I would really appreciate your time helping me out!

Have you thought about reading in all your .csv files in a loop that combines them into one dataframe? I do this all the time like this:
df <- c()
for (x in list.files(pattern="\\.csv$")) { # pattern is a regular expression, not a glob
  u <- read.csv(x, skip=6)
  u$Label <- factor(x) # A column that is the filename
  df <- rbind(df, u)
}
This of course assumes that every .csv file has an equal number of columns that are named the same thing. But if that assumption is true then you can simply treat the resulting dataframe like one dataframe.
Once you have your dataframe loaded, you can use the Label column as your group-by variable. You'll also need to select only the 5th and 13th variables as well as the Label variable. Then, if your goal is to take, say, the max values for each .csv file and produce another dataframe of those max values, you'd go about it like this.
library(dplyr)
df.summary <- df %>%
  group_by(Label) %>%
  summarise(across(everything(), max)) ## Take the max value of each column except Label
There are better ways to do this using gather() but I don't want to overwhelm you.
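An alternative to combining everything first, and closer to the function the question already has, is to let the function take a file path and do the reading itself, then apply it over the list of files. What follows is only a rough sketch in base R under stated assumptions: the name `summarise_file`, the temporary folder, the demo files and the two placeholder calculations are all made up for illustration, and the real files would need `skip = 7` as in the question.

```r
# Hypothetical stand-in for the real folder: write three small .csv files
# into a temporary directory so the sketch can run on its own.
folder <- file.path(tempdir(), "csv_demo")
dir.create(folder, showWarnings = FALSE)
for (i in 1:3) {
  demo <- data.frame(a = 1:4, b = 5:8, c = 9:12, d = 13:16,
                     signal = c(1, 2, NA, 4) * i)
  write.csv(demo, file.path(folder, paste0("file", i, ".csv")),
            row.names = FALSE)
}

# The function takes a file path, so the reading happens inside it.
# The real files in the question would also need skip = 7.
summarise_file <- function(path) {
  all_data <- read.csv(path, header = TRUE)
  x <- all_data[[5]]                      # the one column the calculations use
  # Placeholder calculations; the real Variable1..Variable5 would go here.
  data.frame(File      = basename(path),
             Variable1 = max(x, na.rm = TRUE),
             Variable2 = max(x^2, na.rm = TRUE))
}

files   <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
results <- do.call(rbind, lapply(files, summarise_file))

# One row per file, preceded by the file name. The output is written
# outside the input folder so it is not picked up as an input later.
out_path <- file.path(tempdir(), "summary_out.csv")
write.csv(results, out_path, row.names = FALSE)
print(results)
```

Either design works; reading inside the function keeps only one file in memory at a time, which may matter with ~300 files.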

Related

Combining a list of data frames into a new data frame in R

This is a 3rd edit to the question (leaving below thread just in case):
The following code makes some sample data frames, selects those with "_areaX" in the title and makes a list of them. The goal is then to combine the data frames in the list into 1 data frame. It almost works...
Area1 <- 100
Area2 <- 200
Area3 <- 300
Zone <- 3
a1_areaX <- data.frame(Area1)
a2_areaX <- data.frame(Area2)
a3_areaX <- data.frame(Area3)
a_zoneX <- data.frame(Zone)
library(dplyr)
pattern = "_areaX"
df_list <- mget(ls(envir = globalenv(), pattern = pattern))
big_data = bind_rows(df_list, .id = "FileName")
The problem is the newly created data frame looks like this:
And I need it to look like this:
File Name | Area measurement
a1_areaX | 100
a2_areaX | 200
a3_areaX | 300
Below are my earlier attempts at asking this question. Edited from first version:
I have csv files imported into R Global Env that look like this (I'd share the actual file(s) but there doesn't seem to be a way to do this here):
They all have a name, the above one is called "s6_section_area". There are many of them (with different names) and I've put them all together into a list using this code:
pattern = "section_area"
section_area_list <- list(mget(grep(pattern,ls(globalenv()), value = TRUE), globalenv()))
Now I want a new data frame that looks like this, put together from the data frames in the above made list.
File Name | Area measurement
a1_section_area | a number
a2_section_area | another number
many more | more numbers
So, the first column should list the name of the original file and the second column the measurement provided in that file.
Hope this is clearer - Not sure how else to provide reproducible example without sharing the actual files (which doesn't seem to be an option).
addition to edit: Using this code
section_area_data <- bind_rows(section_area_list, .id = "FileName")
I get (it goes on and on to the right)
I'm after a table that looks like the sample above, left column is file name with a list of file names going down. Right column is the measurement for that file name (taken from original file).

Note that in your list of dataframes (df_list) all the columns have different names (Area1, Area2, Area3) whereas in your output dataframe they all have been combined into one single column. So for that you need to change the different column names to the same one and bind the dataframes together.
library(dplyr)
library(purrr)
result <- map_df(df_list, ~.x %>%
rename_with(~"Area", contains('Area')), .id = 'FileName')
result
# FileName Area
#1 a1_areaX 100
#2 a2_areaX 200
#3 a3_areaX 300
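For readers without dplyr or purrr, the same idea (give every data frame the same column name, then bind) can be sketched in base R. This is an illustrative alternative rather than the answer's method; the list is built by hand here instead of with mget() so the example does not depend on what else is in the global environment.

```r
# Sample data frames from the question.
a1_areaX <- data.frame(Area1 = 100)
a2_areaX <- data.frame(Area2 = 200)
a3_areaX <- data.frame(Area3 = 300)
df_list <- list(a1_areaX = a1_areaX, a2_areaX = a2_areaX, a3_areaX = a3_areaX)

# Give every data frame the same single column name, then stack them.
renamed <- lapply(df_list, setNames, "Area")
stacked <- do.call(rbind, renamed)

# Build the FileName column from the list names, repeated once per row
# of each original data frame.
result <- data.frame(
  FileName = rep(names(renamed), vapply(renamed, nrow, integer(1))),
  Area     = stacked$Area
)
print(result)
```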

Thanks everyone for your suggestions. In the end, I was able to combine the suggestions and some more thinking and came up with this, which works perfectly.
library("dplyr")
pattern = "section_area"
section_area_list <- mget(ls(envir = globalenv(), pattern = pattern))
section_area_data <- bind_rows(section_area_list, .id = "FileName") %>%
select(-V1)
So, a bunch of csv files were imported into the R Global Env. A list of all files with a name ending in "section_area" was made. Those files were then bound into one big data frame, with the file names as one column and the value (area measurement in this case) in the other column (there was a pointless column in the original csv files called "V1" which I deleted).
This is what one of the many csv files looks like:
[image: sample csv file]
And this is the layout of the final data frame (it goes on for about 150 rows):
[image: final data frame]

Appending two excel files into one dataframe

I am trying to append two Excel files into one dataframe in R.
I am using the following code to do so:
rm(list = ls(all.names = TRUE))
library(rio) #this is for the excel appending
library("dplyr") #this is required for filtering and selecting
library(tidyverse)
library(openxlsx)
path1 <- "A:/Users/Desktop/Test1.xlsx"
path2 <- "A:/Users/Desktop/Test2.xlsx"
dat = bind_rows(path1,path2)
Output
> dat = bind_rows(path1,path2)
Error: Argument 1 must have names.
Run `rlang::last_error()` to see where the error occurred
I appreciate that this is more for combining rows together, but can someone help me with combining different workbooks into one dataframe in R Studio?

bind_rows() works with data frames AFTER they have been loaded into the R environment. Here you are merely trying to "bind" 2 character strings together, hence the error. First you need to import the data from Excel, which you could do with something like:
test_df1 <- readxl::read_xlsx(path1)
test_df2 <- readxl::read_xlsx(path2)
and then you should be able to run:
test_df <- bind_rows(test_df1, test_df2)
A quicker way would be to iterate the process using the map function from purrr:
test_df <- map_df(c(path1, path2), readxl::read_xlsx)

If you want to append one under the other, meaning that both Excel files have the same columns, I would:
1. select the rows I wanted from the first Excel file and create a dataframe
2. select the rows from the second Excel file and create a second dataframe
3. append them with rbind().
On the other hand, if you want to place one next to the other, I would choose the columns needed from the first and second Excel files into two dataframes respectively and then go with cbind().
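As a rough illustration of the two cases, here is a small base R sketch. The data frames `excel1` and `excel2` are hypothetical stand-ins for sheets that have already been imported; the row filter and column choice are arbitrary.

```r
# Hypothetical stand-ins for the two imported Excel sheets.
excel1 <- data.frame(id = 1:3, value = c(10, 20, 30))
excel2 <- data.frame(id = 4:6, value = c(40, 50, 60))

# Same columns in both: stack selected rows of the first on top of the second.
stacked <- rbind(excel1[excel1$value > 10, ], excel2)

# Same number of rows in both: place selected columns side by side.
side_by_side <- cbind(excel1["id"], value2 = excel2$value)

print(stacked)
print(side_by_side)
```

Note that rbind() needs matching column names, while cbind() needs matching row counts.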

How to read cell from data frame in a for loop where the frame name increases with the loop

Sorry for the terrible title. First post here, and new with R.
I am trying to import data from multiple CSV files, extract a single row from each CSV to individual data frames then make a new data frame for a specific value from each initial data frame. I hope this makes sense.
Here is the code I have used so far:
# Take downloaded IFD csv's for 15 points, extract 1% AEP, 6 hour rainfall depths.
files <- list.files(path = "C:PATH")
for (i in 1:length(files)){ # Head of for-loop, length is 15 files
assign(paste0("data", i), # Read and store data frames for row containing 6 hour depths
read.csv2(paste0("C:PATH", files[i]), sep = ",", header = FALSE, nrows = 1, skip = 26))
}
#final value in data frame, position [1,9] is the 1% AEP depth for 6 hours. Extract all of these values from the initial 15 data frames into new dataframes.
for (i in 1:15) {
SixHourOnePercentAEP[i] <- data[i][1,9]
}
In the last argument, an error is returned trying to call data[i][1,9] since dataframe[x,y] is trying to find a cell where the iteration of the i occurs. Looking for a way around this.

It seems that you are trying to create dataframes such as data1, data2, etc for each corresponding file. Then you are trying to access the i-th dataframe with the syntax data[i].
But that's not how it works. "data" is not an array of dataframes, but instead you have different variables named data1, data2, etc. What you need is to access given variable by name. You can do it this way:
SixHourOnePercentAEP <- c() # create the result vector before filling it
for (i in 1:15) {
  SixHourOnePercentAEP[i] <- get(paste0("data", i))[1,9]
}
The get() function gets a variable whose name has been passed as a character argument.
However, I find your code extremely inefficient. Why store all the entire dataframes beforehand, when the only thing you need is one cell from each one? You should rewrite your first loop to extract the desired value from the dataframe immediately and then store it, discarding the rest of the data right away, if I understand your purpose correctly.
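The suggestion of extracting the value straight away could be sketched roughly as follows. The temporary folder, the demo files and `skip = 2` are assumptions made so the example runs on its own; the files in the question would use their own path and `skip = 26`.

```r
# Hypothetical stand-in for the downloaded files: three small CSVs with
# 9 columns, written to a temporary folder so the sketch can run on its own.
folder <- file.path(tempdir(), "ifd_demo")
dir.create(folder, showWarnings = FALSE)
for (i in 1:3) {
  rows <- matrix(i * 100 + 1:27, nrow = 3, ncol = 9, byrow = TRUE)
  write.table(rows, file.path(folder, paste0("point", i, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# Read only the row of interest from each file and keep only cell [1, 9].
# skip = 2 fits the demo files; the question's files use skip = 26.
SixHourOnePercentAEP <- vapply(files, function(f) {
  one_row <- read.csv(f, header = FALSE, nrows = 1, skip = 2)
  as.numeric(one_row[1, 9])
}, numeric(1))

# Label each value with the file it came from.
names(SixHourOnePercentAEP) <- basename(files)
print(SixHourOnePercentAEP)
```

This avoids assign() and get() altogether, since no numbered data frames are ever created.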

How to fill dataframe rows for progressive files in a for loop in R

I'm trying to analyze some data acquired from experimental tests with several variables being recorded. I've imported a dataframe into R and I want to obtain some statistical information by processing these data.
In particular, I want to fill in an empty dataframe with the same variable names of the imported dataframe but with statistical features like mean, median, mode, max, min and quantiles as rows for each variable.
The input dataframes are something like 60 columns x 250k rows each.
I've already managed to do this using apply as in the following lines of code for a single input file.
df[1,] <- apply(mydata,2,mean,na.rm=T)
df[2,] <- apply(mydata,2,sd,na.rm=T)
...
Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file.
I tried in several different ways, trying with apply and assign but I can't really manage to access each row of interest in the output dataframes cycling over the several input files.
I would like to do something like the code below (I know that this code does not work, it's just to give an idea of what I want to do).
The output df dataframes are already defined and empty.
for (xx in 1:number_of_mydata_files) {
df_xx[1,]<-apply(mydata_xx,2,mean,na.rm=T)
df_xx[2,]<-apply(mydata_xx,2,sd,na.rm=T)
...
}
Actually I can't remember the error message given by this code, but the problem is that I can't even run this because it does not work.
I'm quite a beginner with R, so I don't have much experience in using this language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?
EDIT:
I add here a simple table description that represents the input dataframes I’m using. Sorry for the poor data visualization right here. Basically the input dataframes I’m using are .csv imported files, looking like tables with the first row being the column description, aka the name of the measured variable, and the following rows being the acquired data. I have 250 000 acquisitions for each variable in each file, and I have something like 5-8 files like this being my input.
Current [A] | Force [N] | Elongation [%] | ...
—————————————————————————————————————
Value_a_1 | Value_b_1 | Value_c_1 | ...
I just want to obtain a data frame like this as an output, with the same variables name, but instead with statistical values as rows. For example, the first row, instead of being the first values acquired for each variable, would be the mean of the 250k acquisitions for each variable. The second row would be the standard deviation, the third the variance and so on.
I’ve managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them and do this iteratively in a for loop.

Not sure what your data looks like but you can do the following where lst represents your list of data frames.
lst <- list(iris[,-5],mtcars,airquality)
lapply(seq_along(lst),
function(x) sapply(lst[[x]],function(x)
data.frame(Mean=mean(x,na.rm=TRUE),
sd=sd(x,na.rm=TRUE))))
Or as suggested by @G. Grothendieck simply:
lapply(lst, sapply, function(x)
data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
If all your files are in the same directory, set that as the working directory and use list.files() to walk along your input files (or ls() if the data frames are already loaded as objects).
If they share the same column names, you can rbind the result into a single data set.
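To get the layout the question describes (same column names as the input, one statistic per row), a sketch along these lines might work. The helper name `summarise_columns` is made up, and built-in data sets stand in for mydata_1, mydata_2, ...; keeping the inputs in a named list replaces the numbered variables and the pre-built empty data frames.

```r
# Built-in data sets stand in for mydata_1, mydata_2, ...
inputs <- list(mydata_1 = iris[, -5], mydata_2 = mtcars)

# One summary data frame per input: same column names, statistics as rows.
summarise_columns <- function(d) {
  stats <- rbind(mean   = sapply(d, mean,   na.rm = TRUE),
                 sd     = sapply(d, sd,     na.rm = TRUE),
                 median = sapply(d, median, na.rm = TRUE),
                 min    = sapply(d, min,    na.rm = TRUE),
                 max    = sapply(d, max,    na.rm = TRUE))
  as.data.frame(stats)
}

# A list of summary data frames, one per input, with the same names.
summaries <- lapply(inputs, summarise_columns)
print(summaries$mydata_1)
```

Each element of `summaries` can then be accessed by name or written out in a loop over `names(summaries)`.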

Comparing column headers of two files to fetch data in R

I have a large CSV file, say INPUT, with about 500+ columns. I also have a dataframe DF that contains a subset of the column headers of INPUT which changes at every iteration.
I have to fetch the data from only those columns of INPUT that are present in the dataframe DF and write it into another CSV file, say OUTPUT.
In short,
INPUT.csv:
ID,Col_A,Col_B,Col_C,Col_D,Col_E,Col_F,,,,,,,,,,,,,Col_S,,,,,,,,,,,,,,,,Col_Z
1,009,abcd,67,xvz,33,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,oup,,,,,,,,,,,,,,,,,,90
2,007,efgh,87,wuy,56,67,,,,,,,,,,,,,,,,,,,,,,,,,,,,ghj,,,,,,,,,,,,,,,,,,,,888
print(DF):
[1] "Col_D" "Col_Z"
[3] "Col_F" "Col_S"
OUTPUT.csv
ID,Col_D,Col_Z,Col_F,Col_S
1,xvz,90,50,oup
2,wuy,888,67,ghj
I'm a beginner when it comes to R. I would prefer the matching of the dataframe with the INPUT file to be automated, because I don't want to do this by hand every day when the dataframe gets updated.

I'm not sure whether this is the answer :
input <- read.table(...)
input[colnames(input) %in% colnames(DF)]

If I understand it correctly, you need to import the INPUT.csv file into R and then match the columns of your DF with the columns of your INPUT, is that correct?
You can either use the match function or just import the INPUT.csv file into RStudio via the "Import Dataset" button and subset it. Subsetting imported dataframes is fairly easy.
If you import your dataset as INPUT, then you can make a subset of these columns in the following way: INPUT[,c(1,2,4)]
and that will get you the first, second and fourth columns of the INPUT dataset.

First, reading the csv is simple:
dataframe_read <- read.csv('/Path/to/csv/')
If I understand correctly that one dataframe's columns are always a subset of the other's, the code is as follows:
### Example Dataframes
df1 <- data.frame(one = c(1,3,4), two = c(1,2,3), three = c(1,2,3))
df2 <- data.frame(one = c(1,3,4), three = c(1,2,3))
### Make new data frame
df3 <- df1[,colnames(df2)]
### Write new dataframe
write.csv(df3, 'hello.csv')
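Putting the pieces together for the layout in the question, a minimal sketch might look like this. It assumes DF is a character vector of header names (as the printed output in the question suggests) and uses a made-up miniature of INPUT.csv in a temporary folder; the real file and paths would differ.

```r
# Hypothetical miniature of INPUT.csv, written to a temporary file.
input_path  <- file.path(tempdir(), "INPUT.csv")
output_path <- file.path(tempdir(), "OUTPUT.csv")
writeLines(c("ID,Col_A,Col_D,Col_F,Col_S,Col_Z",
             "1,009,xvz,50,oup,90",
             "2,007,wuy,67,ghj,888"), input_path)

# The wanted headers, assumed here to be a character vector as printed
# in the question; this is the part that changes at every iteration.
DF <- c("Col_D", "Col_Z", "Col_F", "Col_S")

INPUT <- read.csv(input_path, stringsAsFactors = FALSE)

# Keep ID plus every wanted header that really exists in INPUT,
# in the order given by DF.
wanted <- c("ID", intersect(DF, names(INPUT)))
OUTPUT <- INPUT[, wanted, drop = FALSE]

write.csv(OUTPUT, output_path, row.names = FALSE, quote = FALSE)
print(readLines(output_path))
```

Using intersect() rather than indexing by DF directly means a header that is missing from INPUT is silently skipped instead of raising an error; whether that is desirable depends on the use case.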
