Apply series of changes to multiple similar datasets in R - r

I have 20 csv files of data that are formatted exactly the same, about 40 columns of different numbers, but with different values in each column. I want to apply a series of changes to each data frame in order to extract specific information from every one of them.
Specifically I want to extract four columns from each data frame, find the maximum value of each column in each data frame and then add all of these maximum values together, so I get one final number for each data frame. Something like this:
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
I have the code written above to do all these steps for every data frame individually, but is it possible to apply this code to all of them at once?

If you put all 20 filenames into a vector called files
Maxes <- numeric(length(files))
i <- 1
for (file in files) {
data <- read.csv(file)
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
Maxes[i] <- Add
i <- i+1
}
Note that str(data) will just cause a lot of output to print to the console 20 times. I'm not sure of its value here, but it was in your question so I included it.
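If you'd rather not manage the counter yourself, the same computation fits into a single sapply() call. A minimal sketch, assuming the 20 CSVs sit in the working directory and the first four columns are the ones you want, as in your snippet:
files <- list.files(pattern = "\\.csv$") # assumes the CSVs are in the working directory
Maxes <- sapply(files, function(file) {
data <- read.csv(file)
Extract <- data[c(1, 2, 3, 4)] # first four columns
sum(apply(Extract, 2, max)) # sum of the four column maxima
})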

Put all your files into a common folder such as /path/temp/
csvs <- list.files("/path/temp", pattern = "\\.csv$", full.names = TRUE) # vector of csv file paths
Use a custom function for colMax:
colMax <- function(data) sapply(data, max, na.rm = TRUE)
Using foreach, dplyr, and readr:
library(foreach)
library(dplyr)
library(readr)
foreach(i=1:length(csvs), .combine="c") %do% { read_csv(csvs[i]) %>%
select(1:4) %>%
colMax(.) %>%
sum(.)
} # returns a vector
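If the files are large, the same loop can run in parallel by switching %do% to %dopar%. A minimal sketch, assuming the doParallel package is installed and keeping the colMax helper from above (the core count is just a placeholder):
library(doParallel)
registerDoParallel(cores = 2) # adjust to your machine
foreach(i=1:length(csvs), .combine="c", .packages=c("readr", "dplyr")) %dopar% {
read_csv(csvs[i]) %>%
select(1:4) %>%
colMax(.) %>%
sum(.)
} # same vector as above, computed in parallel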

Related

Observations in the environment of R studio but empties dataframes

I am working with precipitation data in R and have run into a problem I cannot solve. I will put the code here to be clearer. I have precipitation data (mm/h) at minute resolution for 47 meteorological stations, but I need hourly data and a single file with all the stations so that I can then interpolate. The problem at the moment is creating the data frames for the 47 stations; these data frames should normally have 47 observations of 3 variables.
The problem is that in the environment everything apparently worked, but when I open one of the data frames I am surprised to find only a single value, as you can see in the image.
[Screenshot: the data frame 071212, containing a single value]
This is the code I have used to generate the data frames.
setwd("D:/Escritorio/ohiiunam/estaciones")
temp <- list.files(pattern="*.csv")
lista = lapply(temp, read.csv)
lista<-data.table::rbindlist(lista)
n_last <- 6
lista$id2<- substr(lista$id, nchar(lista$id) - n_last + 1, nchar(lista$id))
unicos <- unique(lista$id2)
fun <- function(i) {
i<-lista %>% select(id, intensidad.mm.h, id2) %>% filter(lista$id2==i)
}
for (i in unicos) {
i <- as.data.frame(fun(i))
}
Well, you have a slight problem: you've named your data.frames with names that aren't syntactically valid (variable names cannot start with numbers). To work with them, you'll need to surround the name in backticks.
View(`071208`)
It's not clear how you loaded those data.frames, but it might be better to change that import routine to prefix the names with some character value.
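For example, a quick sketch of that idea with base R's make.names(), which turns an invalid name into a valid one by prefixing it (the station IDs below are just illustrative):
ids <- c("071208", "071212") # hypothetical station IDs
make.names(ids) # returns "X071208" "X071212", which are valid object names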
I have solved the problem with a modification using the assign() function.
for (i in unicos) {
assign(paste0("jul", sep=".", i), data.frame(lista %>% select(id, id2, intensidad.mm.h) %>% filter (lista$id2==i)))
}
Thank you for the help.
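As a side note, a sketch of an alternative that avoids creating one object per station with assign(): split() returns a single named list holding one data frame per station (this assumes the same column names as in the code above):
julio <- lista %>%
select(id, id2, intensidad.mm.h) %>%
split(.$id2)
julio[["071208"]] # one station's data frame, looked up by name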

How to clean multiple excel files in one time in R?

I have more than one hundred Excel files that need cleaning, and all of them share the same data structure. The code listed below is what I use to clean a single Excel file. The file names all follow a pattern like 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add a new column named 'project_name' and fill it with the project name
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end){
return(df[start:end,])
}
dividers = which(df$Product_Models %in% 'SKU' == TRUE)
df <- lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
df <-do.call(rbind, df)
# remove the 'SKU' and 'Sub Total:' marker rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I get a well-structured data frame (see the well-structured data frame example in the screenshot).
Can you help me write code that cleans all the Excel files in one go, so I do not need to run this code a hundred times, and then aggregates all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
df = read_excel(file, sheet="EQuote")
... all processing steps ...
df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
X__1="New.1",
X__5="New.5",
X__9="New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
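A rough sketch of that two-step read is below. The skip value and the cell used for the project name are assumptions about where the header row and the project name sit in your sheets, so adjust them to your actual layout:
library(readxl)
file <- file_names[1] # one of your Excel files
# read only the first row to grab the project name cell (your code used df[1,2])
first_row <- read_excel(file, sheet = "EQuote", n_max = 1, col_names = FALSE)
project_name <- first_row[[2]][1] # second cell of the first row
# then skip down to the real header row so read_excel picks up proper column names
df <- read_excel(file, sheet = "EQuote", skip = 2) # hypothetical: header assumed on row 3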

R: find Index of data frame with non-unique/duplicated values

I want to extract some values out of a vector, modify them and put them back at the original position.
I have been searching a lot and tried different approaches to this problem. I'm afraid this might be really simple but I'm not seeing it yet.
Create a vector and convert it to a data frame with an ID column. Also create an empty data frame for the results.
hight <- c(5,6,1,3)
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
hight_min_df <- data.frame()
Extract, for every pair of consecutive values, the smaller value with its corresponding ID.
for(i in 1:(length(hight_df[,2])-1))
{
hight_min_df[i,1] <- which(grepl(min(hight_df[,2][i:(i+1)]), hight_df[,2]))
hight_min_df[i,2] <- min(hight_df[,2][i:(i+1)])
}
Modify the extracted values and aggregate rows with the same ID by taking the higher value. At the end, write the modified values back.
hight_min_df[,2] <- hight_min_df[,2]+20
adj_hight <- aggregate(x=hight_min_df[,2],by=list(hight_min_df[,1]), FUN=max)
hight[adj_hight[,1]] <- adj_hight[,2]
This works perfectly as long as I have only unique values in hight.
How can I run this script with a vector like this: hight <- c(5,6,1,3,5)?
Alright, there's a lot to unpack here. Instead of looping, I would suggest piping functions with dplyr. Read the dplyr vignette - it is an outstanding resource and an excellent approach to data manipulation in R.
So using dplyr we can rewrite your code like this:
library(dplyr)
hight <- c(5,6,1,3,5) #skip straight to the test case
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
adj_hight <- hight_df %>%
#logic pseudo-code: if the previous hight (using the lag() function),
# going from the first row to the last,
# is greater than the current row's hight, take the current row's value; else
# take the previous row's value
mutate(subst.id = ifelse(lag(hight) > hight, ID, lag(ID)),
subst.val = ifelse(lag(hight) > hight, hight, lag(hight)) + 20) %>%
filter(!is.na(subst.val)) %>% #remove extra rows
select(subst.id, subst.val) %>% #take just the columns we want
#grouping - rewrite of your use of aggregate
group_by(subst.id) %>%
summarise(subst.val = max(subst.val)) %>%
data.frame(.)
#tying back in
hight[adj_hight[,1]] <- adj_hight[,2]
print(hight)
Giving:
[1] 25 6 21 23 5

Creating functions in R with iterative code

I work with surveys and would like to export a large number of tables (drawn from data frames) into an .xlsx or .csv file. I use the xlsx package to do this. This package requires me to stipulate which column in the excel file is the first column of the table. Because I want to paste multiple tables into the .csv file I need to be able to stipulate that the first column for table n is the length of table (n-1) + x number of spaces. To do this I planned on creating values like the following.
dt# is made by changing a table into a data frame.
table1 <- table(df$y, df$x)
dt1 <- as.data.frame.matrix(table1)
Here I create the values for the starting column numbers:
startcol1 = 1
startcol2 = NCOL(dt1) + 3
startcol3 = NCOL(dt2) + startcol2 + 3
startcol4 = NCOL(dt3) + 3 + startcol2 + startcol3
And so on. I will probably need to produce somewhere between 50-100 tables. Is there a way in R to make this an iterative process so I can create the 50 values of starting columns without having to write 50+ lines of code with each one building on the previous?
I found material on Stack Overflow and other blogs about writing for-loops or using apply-type functions in R, but it all seemed to deal with manipulating a vector as opposed to adding values to the workspace. Thanks
You can use a structure similar to this:
Your list of files to read:
file_list = list.files("~/test/",pattern="*csv",full.names=TRUE)
For each file, read and process the data frame, and capture how many columns there are in the frame you are reading/processing:
columnsInEachFile = sapply(file_list,
function(x)
{
df = read.csv(x,...) # with your appropriate arguments
# do any necessary processing you require per file
return(ncol(df))
}
)
The cumulative sum of the number of columns, with a 1 prepended for the first table, indicates the start columns of the data frames when your processed data are placed next to each other:
columnsToStartDataFrames = cumsum(c(1, columnsInEachFile))
columnsToStartDataFrames = columnsToStartDataFrames[-length(columnsToStartDataFrames)] # the last value is the end, not the start, of a data frame
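To tie this back to the xlsx package, here is a rough sketch of how those start columns could be used when writing the tables side by side. addDataFrame() and its startColumn argument come from the xlsx package; the list df_list of processed data frames is an assumption about how you hold your tables:
library(xlsx)
wb <- createWorkbook()
sheet <- createSheet(wb, sheetName = "tables")
for (k in seq_along(df_list)) {
addDataFrame(df_list[[k]], sheet, startColumn = columnsToStartDataFrames[k], row.names = FALSE) # df_list is hypothetical
}
saveWorkbook(wb, "all_tables.xlsx")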
Assuming tab.lst is a list containing tables, then you can do:
cumsum(c(1, sapply(head(tab.lst, -1), ncol)))
Basically, what I'm doing here is I'm looping through all the tables but the last one (since that one's start col is determined by the second to last), and getting each table's width with ncol. Then I'm doing the cumulative sum over that vector to get all the start positions.
And here is how I created the tables (tables based on all possible combinations of columns in df):
df <- replicate(5, sample(1:10), simplify=F) # list of 5 columns, used like a data frame below
names(df) <- tail(letters, 5) # name the cols
name.combs <- combn(names(df), 2) # get all 2 col combinations
tab.lst <- lapply( # make tables for each 2 col combination
split(name.combs, col(name.combs)), # loop through every column in name.combs
function(x) table(df[[x[[1]]]], df[[x[[2]]]]) # ... and make a table
)

R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing a R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to values for the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.
The data frames are iteratively read into the workspace from file based on a class variable, using the 'by' function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a concatenation of user-provided statistical functions to each metric within a class that corresponds to a given value for the year, month, and day (i.e., the mean [function] low temperature [class] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]).
I want the end result to be new data frames containing values for every date within a region and a year range for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of it: it is currently outputting NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean","sd","scale") # Statistical functions to apply
x <- "C:/Data/" # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp") # Metrics to subset the data
z <- c(1970:2000) # Year range to subset the data
CSVstnClass <- data.frame(CSVstations,CSVclasses)
by(CSVstnClass, CSVstnClass[,2], function(a){ # Station list by class
suppressWarnings(assign(paste(a[,2]),paste(a[,1]),envir=.GlobalEnv))
apply(a, 1, function(b){ # Data frame list, row-wise
classData <- data.frame()
sapply(y, function(d){ # Element list
CSV_DF <- read.csv(paste(x,b[2],"/",b[1],".csv",sep="")) # Read in CSV files as data frames
CSV_DF1 <- CSV_DF[!is.na("Value")]
CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d),]
assign(paste(b[2],"_",d,sep=""),CSV_DF2,envir=.GlobalEnv)
if(nrow(CSV_DF2) > 0){ # Remove empty data frames
classData <<- rbind(classData,CSV_DF2) # Bind all data frames by row for a class and element
assign(paste(b[2],"_",d,"_bound",sep=""),classData,envir=.GlobalEnv)
sapply(w, function(g){ # Function list
# Aggregate results of bound data frame for each unique date
dataFunc <- aggregate(Value~Year+Month+Day+Element,data=classData,FUN=g,na.action=na.pass)
assign(paste(b[2],"_",d,"_",g,sep=""),dataFunc,envir=.GlobalEnv)
})
}
})
})
})
I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)
# Read in CSV files as data frames
raw <- lapply(paths, read.csv)
# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset,
!is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)
# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]
# Compute aggregates
my_aggregate <- function(df, fun) {
aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
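If you then want a single table per statistic rather than a list of per-station results, a small sketch of one way to combine them (labelling each piece with its source file is an assumption about what you need, not part of the original code):
names(means) <- basename(paths)[rows > 0] # label each result with its source file
means_by_station <- do.call(rbind, means) # one data frame; row names carry the file name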
