How to write different classes into one Excel sheet in R

I have a set of data containing both characters and numerics in R. I know that I cannot save them in a data frame as-is, since all values within a column must be of the same type. Therefore I saved them in a list.
Is there a way to save this list into an excel sheet so that the numerics are saved as numbers and the characters as characters?
I have been using XLConnect so far, and it works for a data frame where each column is either all numerics or all characters.
Thanks in advance
Example:
I have several measurements of different quantities (let's call them a and b).
These measurements can either be a number (numeric) or "n.d." (character). For example 3 measurements:
First: a = 5, b ="n.d."
Second: a = 3, b = 4
Third: a = "n.d.", b = "n.d."
EDIT:
The solution given by Roland works great:
Encode the missing values as NA in the data frame so that the columns are all numeric.
Then use
setMissingValue(wb, value = "n.d.")
to set the missing cells to "n.d." in the Excel sheet
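For reference, a minimal sketch of that approach with XLConnect (the file name, sheet name and example data are placeholders):
library(XLConnect)
# Encode the "n.d." measurements as NA so both columns stay numeric
df <- data.frame(a = c(5, 3, NA), b = c(NA, 4, NA))
wb <- loadWorkbook("measurements.xlsx", create = TRUE)
createSheet(wb, name = "data")
setMissingValue(wb, value = "n.d.")     # missing cells are written as "n.d."
writeWorksheet(wb, df, sheet = "data")
saveWorkbook(wb)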
Thanks a lot for your time and help

Given the example:
df <- data.frame(a = c(5, 'n.d.', 'n.d.'), b = c('n.d.', 4, 'n.d.'))
you can use
write.csv(df,file='data.csv')
Then you can import the CSV file into Excel manually.
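If you go the CSV route, note that write.csv also has an na argument, so a variant of this (assuming the missing values are encoded as NA, as in the accepted approach) can write "n.d." directly while keeping the columns numeric in R:
df <- data.frame(a = c(5, 3, NA), b = c(NA, 4, NA))
write.csv(df, file = 'data.csv', na = "n.d.", row.names = FALSE)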

Related

R uploads numeric value into boolean column - why?

I want to upload an Excel file as a dataframe in R.
It is a large file with a lot of numbers and some #NV values.
The upload works well for the majority of columns (there are 4,000 columns in total). But for some columns, R converts the values to "TRUE" or "FALSE", creating a boolean column.
I don't want that, since all of the columns are supposed to be numeric.
Do you know why R does that?
It would really help if you provided code snippets, because there are many different excel-to-dataframe libraries/methods/behaviors.
But assuming that you are using readxl, the read_excel function has a guess_max parameter for this kind of case. guess_max is 1000 by default.
Try df <- read_excel(path = filepath, sheet = sheet_name, guess_max = 100000)
Since dataframes cannot have different data types in the same column, read_excel has to read your Excel file and guess what data type each column should be before actually filling the dataframe. If a column happens to have only NA values in its first 1000 rows, read_excel will assume you have a column of booleans, and all values encountered in later rows will be cast accordingly. So if you set guess_max to something huge, read_excel becomes slower, but it can avoid the casting of numerics to booleans.
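A sketch of both options, assuming readxl and placeholder file/sheet names: raise guess_max, or skip the guessing entirely with col_types:
library(readxl)
# Option 1: let read_excel guess column types from up to 100000 rows
df <- read_excel(path = "data.xlsx", sheet = "Sheet1", guess_max = 100000)
# Option 2: skip the guessing and force every column to numeric
# (non-numeric cells such as "#NV" become NA; listing them in `na` avoids warnings)
df <- read_excel(path = "data.xlsx", sheet = "Sheet1",
                 col_types = "numeric", na = "#NV")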

Change the class of a cell of a data-frame to Date

Hello everyone!
As part of my clinical study I created an xlsx spreadsheet containing a data set. Only columns 2 to 12 and rows 1 to 307 are useful to me. I now manipulate my spreadsheet in R, after importing it (read_excel, etc.).
In my columns 11 and 12 ('data' and 'raw_data'), some cells correspond to dates (for example the first 2 rows of 'data' and 'raw_data'); these are the patients' visit dates. However, these dates are given to me as a number of days since the origin "1899-12-30", and I would like to transform them into a regular date format (2019-07-05).
My problem is that these columns don't only contain dates; they also hold various numerical results (times, means, scores, etc.).
I started by transforming the class of my columns from character to factor/numeric so that I could better manipulate the columns later. But I can't change only the format of cells corresponding to a date.
Do you know if it is possible to transform only the cells concerned and if so how?
I attach my code and a preview of my data frame.
Part "Unsuccessful trial": I tried with this kind of thing. Of course the date changes format here but as soon as I try to make this change in the data frame it doesn't work.
Thank you for your help!
library(readxl)   # read_excel(), cell_cols()
library(dplyr)    # %>%, mutate_at(), vars(), as_tibble()

# Indicate the id of the patient
id <- "01_AA"

# Get protocol data of the patient
idlst <- dir("/data/protocolData", full.names = TRUE, pattern = id)

# Convert the xlsx database into a dataframe
idData <- data.table::rbindlist(
  lapply(
    idlst,
    read_excel,
    n_max = 307,
    range = cell_cols("B:M")  # just keep the table
  ),
  fill = TRUE
)
idData <- as_tibble(idData)
idData <- idData %>%
  mutate_at(vars(1:10), as.factor) %>%
  mutate_at(vars(11:length(idData)), as.numeric)

# Unsuccessful trial
as.Date.character(data[1:2, 11:12], origin = "1899-12-30")
Thank you for your comments and indeed this is one of the problems with R.
I solved my problem with the following code where idData is my df.
# Change the data format of the date cells of the column Data and Raw_data:
idData$Data[grepl("date", idData$Measure)] <- as.character(as.Date(
  as.numeric(idData$Data[grepl("date", idData$Measure)]),
  origin = "1899-12-30"
))

How do I import data from a .csv file into R without repeating the values of the first column into all the other ones?

I want to import data into R from a .csv file.
So far I have done the following:
# Clear environment
rm(list = ls())

# Read my data into R
myData <- read.csv("C:/Users/.../flow.csv", header = TRUE)

# Convert from list to array
myData <- array(as.numeric(unlist(myData)), dim = c(264, 3))

# Create vectors with specific values of interest: qMax, qMin
qMax <- myData[, 2]
qMin <- myData[, 3]

# Transform vectors into matrices
qMax <- matrix(qMax, nrow = 12, ncol = round(length(qMax) / 12))
qMin <- matrix(qMin, nrow = 12, ncol = round(length(qMin) / 12))
After importing the data using read.csv, I have a list. I then proceed to transform this list into an array with 264 rows of data spread across 3 columns. Here I have my first problem.
I know that each column of my list holds a different set of data; the values are not the same. However, after checking what I imported, it seems that only the first column is imported correctly, and it then repeats itself in columns two and three.
Here's an image for better explanation:
The matrix has the right layout, yet wrong data. Columns 2 and 3 should have different values from each other and from column 1.
How do I correct that? I have checked the source and the original document has all the correct values.
Also, assuming I will eventually get rid of this mistake, will the lines of code in the block "#Transform vectors into matrices" deliver a 12 x 22 matrix? The first six elements of both qMax and qMin are NA and I wish to keep them that way in the matrix. Will R do that with these lines of code, or will I need to change them?
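(For reference: matrix() fills column by column and keeps NA values as they are, so a length-264 vector does reshape into a 12 x 22 matrix. A quick sketch with made-up values:)
q <- c(rep(NA, 6), seq_len(258))   # stand-in for qMax: 6 leading NAs, 264 values in total
m <- matrix(q, nrow = 12, ncol = length(q) / 12)
dim(m)       # 12 22
m[1:6, 1]    # NA NA NA NA NA NA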
Thank you.
Edit: As suggested by akrun, here are the results of str(myData) and dput(droplevels(head(myData)))

Subtracting Columns in R

I have two columns I am trying to subtract and put into a new one, but one of them contains values that read '#NULL!' after converting over from SPSS and Excel, so R reads the column as a factor and will not let me subtract. What is the easiest way to fix this, knowing I have 19,000+ rows of data?
While reading the dataset using read.table/read.csv, we can specify the na.strings argument for those values that need to be transformed to NA or missing values. So, for your dataset it would be
dat <- read.table('yourfile.txt', na.strings = c("#NULL!", "-99999", "-88888"),
                  header = TRUE, stringsAsFactors = FALSE)
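Once the '#NULL!' strings come in as NA, the columns are numeric and the subtraction works directly; if a column has already been imported as a factor, converting through character is the usual fix (the column names below are placeholders):
dat$diff <- dat$score1 - dat$score2                 # NA propagates where '#NULL!' used to be
# If a column is already a factor, convert via character first:
dat$score1 <- as.numeric(as.character(dat$score1))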

Omitting NA in specific rows when analyzing data from 2 columns of a very large dataframe

I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large dataframe with several columns (up to 40) and rows (up to 200ish). I want to use data from one of the columns to do simple stats (wilcox.test, boxplot, etc.): one column holds a continuous variable (V1), while another holds a binary variable (V2; 0 or 1) that divides the data into 2 groups. I want to do this for the continuous variable using different binary variables like V2, which are unrelated to one another. I organized this data in Excel, saved it as CSV and am using RStudio.
All these columns have interspersed NA values, and when I use na.omit, it simply removes every single row where an NA value is present, which takes away an awful lot of data. Is there any simple solution to this? I have seen some answers to similar topics, but none seems to be quite what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand correctly, you want to apply the function to a pair of columns each time:
wilcox.test(V1, V2)
wilcox.test(V1, V3) ...
where the Vi have no missing values. I would do something like this:
## use complete.cases to assert that you have no missing values
## for the selected pair
apply_clean <- function(x, y) {
  ok <- complete.cases(x, y)
  wilcox.test(x[ok], y[ok])
}

## apply this function to all columns after removing the continuous column
lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1, 2, 3, 4, 5, 6, 7, NA, 9, 10),
                          col2 = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1), ]
This code uses is.na() to test whether the value in a specific column of a row is NA. The ! means "not", and will omit that row.
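Applied to the original use case, the same idea keeps only the rows that are complete for the pair of columns actually used in a given test (V1 and V2 below stand for the continuous and binary columns):
keep <- !is.na(df$V1) & !is.na(df$V2)             # complete rows for this pair only
wilcox.test(df$V1[keep] ~ factor(df$V2[keep]))    # V2 is the 0/1 grouping variable
boxplot(df$V1[keep] ~ factor(df$V2[keep]))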
