Importing Excel files with time-formatted cells - r

I have a problem when I import my Excel file into R: it converts the time cells to another format and I don't know how to change that.
Here is my Excel file:
And here is what I obtain in R:
This is the code I used to import my files:
library(readxl)
library(dplyr)

file.list <- list.files(pattern = '*.xlsx', recursive = TRUE)
file.list <- setNames(file.list, file.list)  # keep the file names so bind_rows can use them as ids
df.list <- lapply(file.list, read_xlsx, skip = 20)
Actibrut <- bind_rows(df.list, .id = "id")
Do you know what is wrong?
Thank you.

Your data is transposed in Excel. This is a problem because data frames are column-major. Using this answer we can fix it:
read.transposed.xlsx <- function(file, sheetIndex, as.is = TRUE) {
  df <- read_xlsx(file, sheet = sheetIndex, col_names = FALSE)
  dft <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)  # transpose everything but the name column
  names(dft) <- df[[1]]
  rownames(dft) <- NULL
  dft <- as.data.frame(lapply(dft, type.convert, as.is = as.is))
  return(dft)
}
df <- bind_rows(lapply(file.list, \(file) {
  df <- read.transposed.xlsx(file, sheetIndex = 1)  # sheetIndex = 1 assumed; the original mistakenly passed df here
  df[['id']] <- file
  df  # return the data frame itself, not the result of the assignment
}))
Afterwards you'll have to convert the columns appropriately, for example (note the origin may depend on your machine):
df$`Woke up` <- as.POSIXct(df$`Woke up`, origin = '1899-12-31')
# If it comes in as "hh:mm:ss", use
library(lubridate)
df$`Woke up` <- hms(df$`Woke up`)

There are a couple of things you need to do.
First, it appears that your data is transposed, meaning that your first row looks like variable names and the columns contain the data. You can easily transpose your data before you import it into RStudio. This will address the (..1, ..2) variable names you see when you import the data.
Secondly, import the date columns as strings.
The command df.list <- lapply(file.list, read_xlsx, skip=20) uses the read_xlsx function. I think you need to explicitly specify the column types or import them as strings.
Then you can use the stringr package (or any other package) to convert the strings to date variables, as in the sketch below. Also consider providing the code that you have used.
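For instance, a minimal sketch of that approach, assuming the readxl and lubridate packages; the file name, the skip value, and a time column called "Woke up" are all assumptions taken from the question:
library(readxl)
library(lubridate)

# Force every column in as text so read_xlsx cannot reinterpret the times,
# then parse the time column explicitly.
df <- read_xlsx("abc.xlsx", skip = 20, col_types = "text")
df$`Woke up` <- hms(df$`Woke up`)  # parses "hh:mm:ss" strings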

Related

How do I use the concatenate function in R correctly when working with large datasets?

I'm assuming this is a super easy question to answer. I am working with a cervical cancer dataset and I have an Excel spreadsheet that I have already imported into R. I needed to convert the character variables to numeric variables so I could properly analyze them. That worked. But I have NO IDEA how to use the concatenate function in R for importing the actual data. Since there are 859 rows in the data set, I put c(1:859), but I think that just populates the spreadsheet with 1, 2, 3, 4, 5, ..., 859. I already have a data set that I've imported, but I have no idea how to code this so I can just transfer what's in the Excel document.
My code:
cervical <- read.csv("/Users/sophia/Downloads/risk_factors_cervical_cancer.csv")
sapply(cervical, class)
summary(cervical)
cervical <- data.frame(Number.of.sexual.partners = c(1:859),
                       First.sexual.intercourse = c(1:859),
                       Num.of.pregnancies = c(1:859),
                       Smokes..years. = c(1:859),
                       Hormonal.Contraceptives..years. = c(1:859),
                       IUD..years. = c(1:859))
cervical$Number.of.sexual.partners <- as.character(cervical$Number.of.sexual.partners)
cervical$First.sexual.intercourse <- as.character(cervical$First.sexual.intercourse)
cervical$Num.of.pregnancies <- as.character(cervical$Num.of.pregnancies)
cervical$Smokes..years. <- as.character(cervical$Smokes..years.)
cervical$Hormonal.Contraceptives..years. <- as.character(cervical$Hormonal.Contraceptives..years.)
cervical$IUD..years. <- as.character(cervical$IUD..years.)
sapply(cervical, class)
cervical$Number.of.sexual.partners <- as.numeric(as.character(cervical$Number.of.sexual.partners))
cervical$First.sexual.intercourse <- as.numeric(as.character(cervical$First.sexual.intercourse))
cervical$Num.of.pregnancies <- as.numeric(as.character(cervical$Num.of.pregnancies))
cervical$Smokes..years. <- as.numeric(as.character(cervical$Smokes..years.))
cervical$Hormonal.Contraceptives..years. <- as.numeric(as.character(cervical$Hormonal.Contraceptives..years.))
cervical$IUD..years. <- as.numeric(as.character(cervical$IUD..years.))
sapply(cervical, class)
sapply(cervical, class)
Not so sure what you need, and it is hard to help without the actual data. But if you want to convert all character variables into numeric, you can use dplyr with mutate(), across(), where(), is.character(), and as.numeric():
library(dplyr)
cervical %>% mutate(across(where(is.character), as.numeric))
Example:
# Create a data frame with numbers stored as characters:
df <- data.frame(a = as.character(1:4), b = as.character(sample(1:4)), stringsAsFactors = FALSE)
sapply(df, class)
#           a           b
# "character" "character"
Change the classes with dplyr:
df <- df %>% mutate(across(where(is.character), as.numeric))
sapply(df, class)
#         a         b
# "numeric" "numeric"

Using read_csv specifying data types for groups of columns in R

I would like to use read_csv because I am working with large data. The variable types are read incorrectly because I have many missing values. It would be possible to identify the type of a column from its name: it includes "DATE" if it is a date type, "Name" if it is a character type, and the rest of the variables can have the default 'col_guess' type. I do not want to type out all 55 variables, so I tried this code first:
df <- read_csv('df.csv', col_types = cols((grepl("DATE$", colnames(df))==T)=col_date()), cols((grepl("Name$", colnames(df))==T)=col_character()))
I received this message:
Error: unexpected '=' in "df <- read_csv('df.csv', col_types = cols((grepl("DATE$", colnames(df))==T)="
So I tried to write a loop instead, since the df data is already in R (but the values of the wrongly identified variables have been deleted).
for (colname in colnames(df)) {
  if (grepl("DATE$", colname) == T) {
    ct1 <- cols(colname = col_date("%d/%m/%Y"))
  } else if (grepl("Name$", colname) == T) {
    ct2 <- cols(colname = col_character())
  } else {
    ct3 <- cols(colname = col_guess())
    tx <- c(ct1, ct2, ct3)
    print(tx)
  }
}
It does not produce the output I would like, and I do not know how I would need to continue even if I got the loop right.
The data is a public data, you can download it here (BasicCompanyDataAsOneFile): http://download.companieshouse.gov.uk/en_output.html
Any suggestion would be appreciated, thank you.
Since the data is already read into R, you can identify the columns by their names and apply the appropriate conversion to each group of columns.
df <- readr::read_csv('df.csv')
date_cols <- grep('DATE$', names(df))
char_cols <- grep('Name$', names(df))
df[date_cols] <- lapply(df[date_cols], as.Date)
df[char_cols] <- lapply(df[char_cols], as.character)
You can also try type.convert, which automatically changes data to their respective types, but it might not work for date columns:
df <- type.convert(df, as.is = TRUE)
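If you would rather have read_csv parse the types correctly on the initial read, one option (a sketch, not from the original answers) is to read only the header row first and build the col_types specification programmatically from the column names:
library(readr)
# Read zero data rows just to get the column names ('df.csv' as in the question)
hdr <- names(read_csv("df.csv", n_max = 0, col_types = cols(.default = "c")))
# Map each name to a collector using the same patterns as above
spec <- setNames(lapply(hdr, function(nm) {
  if (grepl("DATE$", nm)) col_date("%d/%m/%Y")
  else if (grepl("Name$", nm)) col_character()
  else col_guess()
}), hdr)
df <- read_csv("df.csv", col_types = do.call(cols, spec))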
I read the data in using read_csv:
df <- read_csv('DF.csv', col_types = cols(.default = "c"))
Then I used the following code to change the columns' data types:
date_cols <- grep('DATE$', names(df))
df[date_cols] <- lapply(df[date_cols], as.Date)
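One caution about that last step: as.Date() assumes ISO dates (yyyy-mm-dd) by default. If the dates in this file are in day/month/year order (an assumption about the Companies House data), pass the format explicitly:
# format = "%d/%m/%Y" is an assumption about how the dates are written
df[date_cols] <- lapply(df[date_cols], as.Date, format = "%d/%m/%Y")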

How to clean multiple excel files in one time in R?

I have more than one hundred Excel files that need cleaning, all with the same data structure. The code listed below is what I use to clean a single Excel file. The file names all follow a pattern like 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end) {
  return(df[start:end, ])
}
dividers <- which(df$Product_Models %in% 'SKU')
df <- lapply(1:(length(dividers) - 1), function(x)
  in_between(df, start = dividers[x], end = dividers[x + 1]))
df <- do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I can get a well-structured data frame.
Can you guys help me write code that can clean all the Excel files at one time, so I do not need to run this code a hundred times, and then aggregate all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")
  ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
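In case it is not obvious how to build file_names, a small sketch (the folder name and extension are assumptions about where your files live):
# All .xlsx files in a 'data' folder, with paths included so read_excel can find them
file_names <- list.files(path = "data", pattern = "\\.xlsx$", full.names = TRUE)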
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
                       X__1 = "New.1",
                       X__5 = "New.5",
                       X__9 = "New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
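A sketch of that two-read approach, assuming the project name sits in cell B1 and the real header row starts two rows down (both are assumptions about your layout):
library(readxl)
# First read: just the project-name cell
project_name <- read_excel("abc.xlsx", sheet = "EQuote",
                           range = "B1", col_names = FALSE)[[1]]
# Second read: skip down to the header row so column names come from the file
df <- read_excel("abc.xlsx", sheet = "EQuote", skip = 2)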

Importing values and labels from SPSS with memisc

I want to import both values and labels from a dataset, but I don't understand how to do it with this package (the documentation is not clear). I know it is possible because Rz (a GUI interface for R) uses memisc to do this. I prefer, though, not to depend on too many packages.
Here the only piece of code I have:
dataset <- spss.system.file("file.sav")
See the example in ?importer, which covers spss.system.file().
spss.system.file() creates an 'importer' object that can show you the variable names.
To actually use the data, you need to do one of the following:
## To get the whole file
dataset2 <- as.data.set(dataset)
## To get selected variables
dataset2 <- subset(dataset, select = c(variable names))
You end up with a data.set object, which is quite complex but does have what you want. For analysis, you usually need to call as.data.frame() on dataset2.
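Putting those steps together, a minimal sketch of the whole workflow (the file name is taken from the question):
library(memisc)
imp <- spss.system.file("file.sav")  # importer object: metadata and variable names only
ds  <- as.data.set(imp)              # actually read the data into a data.set
df  <- as.data.frame(ds)             # plain data.frame; labelled values typically become factors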
I figured out a solution to this that I like:
library(foreign)

df <- suppressWarnings(read.spss("C:/Users/yada/yada/yada/ - SPSS_File.sav",
                                 to.data.frame = TRUE, use.value.labels = TRUE))
var_labels <- attr(df, "variable.labels")
labels_df <- data.frame(column = 1:ncol(df), names(df), labels = var_labels, row.names = NULL)
names(df) <- labels_df$labels
names(df) <- make.names(names(df))  # was make.names(df)): extra parenthesis and wrong argument

zoo merge() and merged column names

I am relatively new to R. I am merging data contained in multiple csv files into a single zoo object.
Here is a snippet of the code in my for loop:
temp <- read.csv(filename, stringsAsFactors = FALSE)
temp_dates <- as.Date(temp[, 2])
temp <- zoo(temp[, 17], temp_dates)
dataset <- temp[seq_specified_dates]
# merge data into output
if (length(output) == 0) {
  output <- dataset
} else {
  output <- merge(output, dataset, all = FALSE)
}
When I run head() on the output zoo object, I notice bizarrely named columns like 'dataset.output.output.output', etc. How can I assign more meaningful names to the merged columns?
Also, how do I reference a particular column in a zoo object? For example, if output were a data frame, I could reference the 'Patient_A' column as output$Patient_A. How do I reference a specific column in a merged zoo object?
I think this would work regardless of the date being a zoo class. If you provide an example I may be able to fix the details, but all in all this should be a good starting point.
# 1 - Put your multiple csv files in one folder
setwd("your/path")
listnames <- list.files(pattern = "\\.csv$")
# 2 - Use the plyr package
library(plyr)
library(zoo)
pp1 <- ldply(listnames, read.csv, header = TRUE)  # put all the files in one data.frame
names(pp1) <- c('name1', 'name2', 'name3', ...)
pp1$date <- zoo(pp1$date)
# Reshape the data frame so it gets organized by date
pp2 <- reshape(pp1, timevar = 'name1', idvar = 'date', direction = 'wide')
read.zoo is able to read and merge multiple files. For example:
idx <- seq(as.Date('2012-01-01'), by = 'day', length = 30)
dat1 <- data.frame(date = idx, x = rnorm(30))
dat2 <- data.frame(date = idx, x = rnorm(30))
dat3 <- data.frame(date = idx, x = rnorm(30))
write.table(dat1, file = 'ex1.csv')
write.table(dat2, file = 'ex2.csv')
write.table(dat3, file = 'ex3.csv')
datMerged <- read.zoo(c('ex1.csv', 'ex2.csv', 'ex3.csv'))
If you want to access a particular column you can use the $ method:
datMerged$ex1.csv
EDITED:
You can extract a time period with the window method:
window(datMerged, start='2012-01-28', end='2012-01-30')
The xts package includes more extraction methods:
library(xts)
datMergedx <- as.xts(datMerged)  # convert the merged zoo object to xts first
datMergedx['2012-01-03']
datMergedx['2012-01-28/2012-01-30']
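To answer the naming part of the question directly: merge.zoo uses its argument names as column names, so naming each series as you merge avoids the 'dataset.output.output' columns. A sketch with made-up series (the 'Patient' names are assumptions from the question):
library(zoo)
d  <- as.Date("2012-01-01") + 0:2
z1 <- zoo(rnorm(3), d)
z2 <- zoo(rnorm(3), d)
output <- merge(Patient_A = z1, Patient_B = z2, all = FALSE)
output$Patient_A  # $ works on a multi-column zoo object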
