I am relatively new to R. I am merging data contained in multiple csv files into a single zoo object.
Here is a snippet of the code in my for loop:
temp <- read.csv(filename, stringsAsFactors=F)
temp_dates <- as.Date(temp[,2])
temp <- zoo(temp[,17], temp_dates)
dataset <- temp[seq_specified_dates]
# merge data into output
if (length(output) == 0)
output <- dataset
else
output <- merge(output, dataset, all=FALSE)
When I run head() on the output zoo object, I notice bizarre column names like 'dataset.output.output.output'. How can I assign more meaningful names to the merged columns?
Also, how do I reference a particular column in a zoo object? For example, if output were a data frame, I could reference the 'Patient_A' column as output$Patient_A. How do I reference a specific column in a merged zoo object?
I think this would work regardless of whether the dates end up in a zoo object. If you provide an example I may be able to fix the details, but this should be a good starting point.
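#1- Put your multiple csv files in one folder
setwd("your path")  # replace with the folder that holds the csv files
listnames = list.files(pattern="\\.csv$")  # regular expression: names ending in .csv
#2-use package plyr
library(plyr)
pp1 = ldply(listnames,read.csv,header=TRUE) #put all the files in one data.frame
names(pp1)=c('name1','name2','name3',...)
pp1$date = as.Date(pp1$date)  # parse the date column (assumes yyyy-mm-dd strings)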
# Reshape data frame so it gets organized by date
pp2=reshape(pp1,timevar='name1',idvar='date',direction='wide')
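The same read-stack-reshape idea can be sketched in base R without plyr. This is only an illustration under assumptions: the two demo files, the folder csv_demo, and the column names date, value and source are made up here so the snippet is self-contained; they are not the asker's data.

```r
# Hypothetical setup: two small csv files in a temporary folder
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
dates <- as.character(as.Date("2012-01-01") + 0:2)
write.csv(data.frame(date = dates, value = 1:3),
          file.path(demo_dir, "patient_a.csv"), row.names = FALSE)
write.csv(data.frame(date = dates, value = 4:6),
          file.path(demo_dir, "patient_b.csv"), row.names = FALSE)

# Read every csv file and remember which file each row came from
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
long <- do.call(rbind, lapply(files, function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d$source <- tools::file_path_sans_ext(basename(f))
  d
}))
long$date <- as.Date(long$date)

# One row per date, one value column per source file
wide <- reshape(long, timevar = "source", idvar = "date", direction = "wide")
names(wide)
```

Each value column is named after the file it came from, which avoids the auto-generated names described in the question.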
read.zoo is able to read and merge multiple files. For example:
idx <- seq(as.Date('2012-01-01'), by = 'day', length = 30)
dat1<- data.frame(date = idx, x = rnorm(30))
dat2<- data.frame(date = idx, x = rnorm(30))
dat3<- data.frame(date = idx, x = rnorm(30))
write.table(dat1, file = 'ex1.csv')
write.table(dat2, file = 'ex2.csv')
write.table(dat3, file = 'ex3.csv')
datMerged <- read.zoo(c('ex1.csv', 'ex2.csv', 'ex3.csv'))
If you want to access a particular column you can use the $ method:
datMerged$ex1.csv
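For the in-memory merge from the question, one option is to set the column names after merging. The following is a minimal sketch, assuming two made-up series (z_a and z_b) and the hypothetical names Patient_A and Patient_B; it is not the asker's actual data.

```r
library(zoo)

# Two made-up series on the same dates
dates <- as.Date("2012-01-01") + 0:4
z_a <- zoo(c(1.1, 1.2, 1.3, 1.4, 1.5), dates)
z_b <- zoo(c(2.1, 2.2, 2.3, 2.4, 2.5), dates)

# Merge, then give the columns meaningful names
output <- merge(z_a, z_b, all = FALSE)
colnames(output) <- c("Patient_A", "Patient_B")

# Reference a single column either way
output$Patient_A
output[, "Patient_B"]
```

In a loop, the same colnames() assignment can be done once at the end, using a vector of names collected while reading the files.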
EDITED:
You can extract a time period with the window method:
window(datMerged, start='2012-01-28', end='2012-01-30')
The xts package includes more extraction methods:
library(xts)
datMergedx <- as.xts(datMerged)  # convert the zoo object to xts first
datMergedx['2012-01-03']
datMergedx['2012-01-28/2012-01-30']
I have a problem when I import my Excel file into R. It converts the time cells to another format and I don't know how to change that.
Here is my excel file:
And here is what I obtain in R:
This is the code I used to import my files:
file.list <- list.files(pattern='*.xlsx',recursive = TRUE)
file.list <- setNames(file.list, file.list)
df.list <- lapply(file.list, read_xlsx, skip=20)
Actibrut <- bind_rows(df.list, .id = "id")
Do you know what is wrong?
Thank you.
Your data is transposed in Excel. This is a problem because data.frames are column-major. Adapting a helper function from another answer, we can fix this:
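library(readxl)  # read_xlsx
library(dplyr)   # bind_rows
# Read one sheet whose variables run across rows and return it transposed
read.transposed.xlsx <- function(file, sheetIndex = 1, as.is = TRUE) {
  df <- read_xlsx(file, sheet = sheetIndex, col_names = FALSE)
  dft <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)
  names(dft) <- df[[1]]
  rownames(dft) <- NULL
  # check.names = FALSE keeps names with spaces, such as "Woke up"
  dft <- as.data.frame(lapply(dft, type.convert, as.is = as.is), check.names = FALSE)
  return(dft)
}
df <- bind_rows(lapply(file.list, \(file){
  df <- read.transposed.xlsx(file)  # pass the file path, not df
  df[['id']] <- file
  df  # return the data frame rather than the value of the assignment
}))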
Afterwards you'll have to convert the columns appropriately, for example (note origin may depend on your machine):
# If it comes in as a number of days, convert to seconds first (as.POSIXct counts seconds)
df$"Woke up" <- as.POSIXct(df$"Woke up" * 86400, origin = '1899-12-31', tz = 'UTC')
# If it comes in as "hh:mm:ss" use
library(lubridate)
df$"Woke up" <- hms(df$"Woke up")
There are a couple of things you need to do.
First, it appears that your data is transposed, meaning that your first row looks like variable names and the columns contain data. You can easily transpose your data before you import it into RStudio. This will address the (..1, ..2) variable names you see when you import the data.
Second, import the date columns as strings.
The command df.list <- lapply(file.list, read_xlsx, skip=20) uses the read_xlsx function. I think you need to explicitly specify the column types (its col_types argument) or import the columns as strings.
Then you can use a package such as lubridate to convert the strings to date variables (stringr can help tidy the strings first). Also consider providing the code that you have used.
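To illustrate the conversion step in base R, here is a small sketch covering two forms a time-only Excel cell may take after import. The screenshots are missing, so both inputs are assumptions: a date-time anchored to 1899-12-31, and a fraction of a day. The variable names are made up for the example.

```r
# Case 1 (assumed): the cell arrived as a date-time with a dummy date
as_datetime <- as.POSIXct("1899-12-31 07:30:00", tz = "UTC")
time_from_datetime <- format(as_datetime, "%H:%M:%S")

# Case 2 (assumed): the cell arrived as a fraction of a day (0.3125 = 07:30)
day_fraction <- 0.3125
seconds <- day_fraction * 86400  # 86400 seconds in a day
time_from_fraction <- format(
  as.POSIXct(seconds, origin = "1970-01-01", tz = "UTC"),
  "%H:%M:%S"
)

time_from_datetime
time_from_fraction
```

Fixing tz = "UTC" in both cases keeps the local time zone from shifting the clock time.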
I have a list of data frames and have given each element in the list (i.e. each data frame) a name:
e.g.
df1 <- data.frame(x = c(1:5), y = c(11:15))
df2 <- data.frame(x = c(1:5), y = c(11:15))
mylist <- list(A = df1, B = df2)
I have a function that I want to apply to each data frame. In this function, I want to include a line that writes the results to a file (eventually I want to do more complicated things, like saving plots of the correlation between two variables for each data frame, but I thought I'd start simple).
e.g.
NewVar <- function(mydata, whichVar, i) {
mydata$newVar <- mydata[, whichVar] + 1
write.csv(mydata, file = i)
}
I want to use lapply() to apply this function to each data frame in my list
something like:
hh<-lapply(mylist, NewVar, whichVar = "y")
I can't figure out how to assign the "i" within the context of lapply so that i iterates over the names in the list of data frames, saving multiple files with different names (in this case, two files named A and B) that correspond with the modified data frames.
It will work with the following lapply call:
lapply(names(mylist), function(x) NewVar(mylist[[x]], "y", x))
There are many options. For example:
lapply(names(mylist),
       function(x) write.csv(mylist[[x]],  # [[x]] extracts the data frame itself
                             file = paste0(x, '.csv')))
or using indexes:
lapply(seq_along(mylist),
       function(i) write.csv(mylist[[i]],
                             file = paste0(names(mylist)[i], '.csv')))
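Another way to pair each data frame with its name is Map(), which walks over both in parallel. The sketch below reuses NewVar and mylist from the question; writing into a temporary folder (out_dir) and returning the modified data frame are additions made here for illustration.

```r
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:5, y = 11:15)
mylist <- list(A = df1, B = df2)

# Assumption for the demo: write into a temporary folder
out_dir <- file.path(tempdir(), "newvar_demo")
dir.create(out_dir, showWarnings = FALSE)

NewVar <- function(mydata, whichVar, i) {
  mydata$newVar <- mydata[, whichVar] + 1
  write.csv(mydata, file = file.path(out_dir, paste0(i, ".csv")), row.names = FALSE)
  mydata  # return the modified data frame so the result list is useful
}

# Map() passes the data frame and its name to each call
hh <- Map(function(d, nm) NewVar(d, "y", nm), mylist, names(mylist))
list.files(out_dir)
```

The result keeps the names A and B, so hh$A is the modified version of df1.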
I have more than one hundred Excel files that need cleaning, all with the same data structure. The code below is what I use to clean a single Excel file. The file names all follow a pattern like 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end){
return(df[start:end,])
}
dividers = which(df$Product_Models %in% 'SKU' == TRUE)
df <- lapply(1:(length(dividers)-1), function(x) in_between(df, start =
dividers[x], end = dividers[x+1]))
df <-do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I get a well-structured data frame.
Can you help me write code that cleans all the Excel files at once, so I don't need to run this code a hundred times, and then aggregates all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet="EQuote")
  # ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
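The same clean-each-then-combine pattern can be sketched in base R. Everything here is a stand-in: clean_quote is a hypothetical, much reduced version of the cleaning steps, and raw_files holds small made-up data frames in place of what read_excel would return for each workbook.

```r
# Hypothetical cleaning step: keep the rows strictly between the "SKU" header
# row and the "Sub Total:" row, then tag them with the project name
clean_quote <- function(raw, project_name) {
  first <- which(raw$Product_Models == "SKU")[1] + 1
  last <- which(raw$Product_Models == "Sub Total:")[1] - 1
  out <- raw[first:last, , drop = FALSE]
  out$Project_Name <- project_name
  out
}

# Stand-ins for the data frames read from two Excel files
raw_files <- list(
  projA = data.frame(Product_Models = c("SKU", "M1.x", "M2.y", "Sub Total:"),
                     Qty = c(NA, 2, 3, NA), stringsAsFactors = FALSE),
  projB = data.frame(Product_Models = c("SKU", "M3.z", "Sub Total:"),
                     Qty = c(NA, 5, NA), stringsAsFactors = FALSE)
)

# Clean every element, then stack the results into one data frame
cleaned <- Map(clean_quote, raw_files, names(raw_files))
all_quotes <- do.call(rbind, cleaned)
rownames(all_quotes) <- NULL

# Save the combined result as a single csv file
out_file <- file.path(tempdir(), "all_quotes.csv")
write.csv(all_quotes, out_file, row.names = FALSE)
```

Keeping the per-file logic in one function means a fix to the cleaning rules is made once and applies to every file.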
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
X__1="New.1",
X__5="New.5",
X__9="New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
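If the column order cannot be relied on, a named lookup vector does the same renaming in base R. This is a sketch on a made-up three-column data frame; the lookup covers only a few of the names from the question.

```r
# Made-up data frame with auto-generated column names
df <- data.frame(X__2 = "ABC-1", X__3 = 2, X__4 = 9.5, stringsAsFactors = FALSE)

# Old name on the left, new name on the right
lookup <- c(X__2 = "Product_Models", X__3 = "Qty", X__4 = "List_Price")

# Rename only the columns that appear in the lookup
hits <- names(df) %in% names(lookup)
names(df)[hits] <- unname(lookup[names(df)[hits]])
names(df)
```

Columns that are not in the lookup keep their original names.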
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
Here is my csv data (test.csv):
type,com,year,month,value
A,CH,2015,1,1000
A,CH,2015,2,5000
A,CH,2016,1,1500
A,MI,2015,1,1300
A,MI,2016,1,5006
B,CH,2015,1,7651
B,CH,2015,2,8684
B,MI,2016,1,2321
B,ZU,2015,1,6842
C,CH,2015,1,1562
C,CH,2016,2,6452
C,CH,2016,3,1562
C,MI,2016,1,6425
C,MI,2016,2,2682
C,ZU,2015,1,8543
C,ZU,2015,2,7531
How can I extract each type into its own data frame with R?
To be more precise, I want to build 3 new data frames (typeA, typeB and typeC). Also, how can I combine year and month into one column so I can plot it with ggplot2?
An additional question: where can I find some reference material on sorting out data like this?
More generally:
df <- read.csv("test.csv", header=TRUE)
df_list <- split(df, factor(df$type))
Every entry in df_list is now a new data.frame with one type, e.g. df_list[[1]] or df_list$A.
Try:
data =read.csv("test.csv", header=T)
dataA = data[which(data$type =="A"),]
dataB = data[which(data$type =="B"),]
dataC = data[which(data$type =="C"),]
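The question also asks how to combine year and month into one column for plotting. A minimal sketch, assuming it is acceptable to represent each month by its first day; the inline text is a shortened copy of test.csv and the column name date is chosen here.

```r
# A few rows of the question's data, inlined so the example runs on its own
csv_text <- "type,com,year,month,value
A,CH,2015,1,1000
A,CH,2015,2,5000
A,CH,2016,1,1500
B,CH,2015,1,7651
C,ZU,2015,2,7531"
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Build a Date from year and month, using day 1 as a placeholder
df$date <- as.Date(sprintf("%d-%02d-01", df$year, df$month))

# One data frame per type, each with the new date column
df_list <- split(df, df$type)
df_list$A

# With ggplot2 installed, the date column can go straight on the x axis, e.g.:
# ggplot(df_list$A, aes(x = date, y = value, colour = com)) + geom_line()
```

Creating the date column before splitting means every per-type data frame gets it automatically.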
Using R, how do I make a column of a dataframe the dataframe's index? Let's assume I read in my data from a .csv file. One of the columns is called 'Date' and I want to make that column the index of my dataframe.
For example in Python, NumPy, Pandas; I would do the following:
df = pd.read_csv('/mydata.csv')
d = df.set_index('Date')
Now how do I do that in R?
I tried in R:
df <- read.csv("/mydata.csv")
d <- data.frame(V1=df['Date'])
# or
d <- data.frame(Index=df['Date'])
# but these just make a new dataframe with one 'Date' column.
#The Index is still 0,1,2,3... and not my Dates.
I assume that by "Index" you mean row names. You can assign to the row names vector:
rownames(df) <- df$Date
The index can be set while reading the data, in both pandas and R.
In pandas:
import pandas as pd
df = pd.read_csv('/mydata.csv', index_col="Date")
In R:
df <- read.csv("/mydata.csv", header=TRUE, row.names="Date")
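As a self-contained illustration of the row.names argument, the sketch below reads a small made-up table from inline text instead of /mydata.csv. The Open and Close columns are invented for the example.

```r
# Made-up data, inlined so the example does not depend on a file
csv_text <- "Date,Open,Close
2020-01-02,10.0,10.5
2020-01-03,10.5,10.2
2020-01-06,10.2,10.8"

# Use the Date column as row names while reading
df <- read.csv(text = csv_text, row.names = "Date")

rownames(df)                 # the dates, stored as character strings
df["2020-01-03", "Close"]    # look up a row by its date label
```

Note that row names must be unique and are kept as character strings, so this works as a label-based lookup rather than a true date index.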
The tidyverse solution:
library(tidyverse)
df %>% column_to_rownames(., var = "Date")
When saving the data frame, use row.names=F to leave the row names out of the file,
e.g. write.csv(prediction.df, "my_file.csv", row.names=F)