Importing values and labels from SPSS with memisc - r

I want to import both values and labels from a dataset, but I don't understand how to do it with this package (the documentation is not clear). I know it is possible because Rz (a GUI for R) uses memisc to do this. I prefer, though, not to depend on too many packages.
Here is the only piece of code I have:
dataset <- spss.system.file("file.sav")

See the example in ?importer, which covers spss.system.file().
spss.system.file() creates an 'importer' object that can show you the variable names.
To actually use the data, you need to do one of the following:
## To get the whole file
dataset2 <- as.data.set(dataset)
## To get selected variables
dataset2 <- subset(dataset, select=c(variable names))
You end up with a data.set object, which is quite complex but does contain what you want. For analysis, you will usually need to call as.data.frame() on dataset2.
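Putting it together, a minimal sketch of the full workflow (assuming the file is "file.sav" as in the question):
library(memisc)
## Create the importer object; no data is read into memory yet
dataset <- spss.system.file("file.sav")
## Inspect the variable labels
description(dataset)
## Materialise the data; value-labelled variables become labelled items
dataset2 <- as.data.set(dataset)
## Convert to a plain data.frame for analysis; labelled items become factors
df <- as.data.frame(dataset2)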

I figured out a solution to this that I like, using read.spss() from the foreign package:
library(foreign)
df <- suppressWarnings(read.spss("C:/Users/yada/yada/yada/ - SPSS_File.sav", to.data.frame = TRUE, use.value.labels = TRUE))
var_labels <- attr(df, "variable.labels")
names <- data.frame(column = 1:ncol(df), names(df), labels = var_labels, row.names = NULL)
names(df) <- names$labels
names(df) <- make.names(names(df), unique = TRUE)

Related

Importing Excel files with time-formatted cells

I have a problem when I import my Excel file into R: it converts the time cells into another format and I don't know how to change that.
Here is my Excel file:
And here is what I obtain in R:
This is the code I used to import my files:
library(readxl)
library(dplyr)
file.list <- list.files(pattern = '*.xlsx', recursive = TRUE)
file.list <- setNames(file.list, file.list)
df.list <- lapply(file.list, read_xlsx, skip = 20)
Actibrut <- bind_rows(df.list, .id = "id")
Do you know what is wrong?
Thank you.
Your data is transposed in Excel. This is a problem because data.frames are column-major. Using this answer we can fix that:
read.transposed.xlsx <- function(file, sheetIndex = 1, as.is = TRUE) {
  df <- read_xlsx(file, sheet = sheetIndex, col_names = FALSE)
  ## Drop the first column (the variable names) and transpose the rest
  dft <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)
  names(dft) <- df[[1]]
  rownames(dft) <- NULL
  ## Re-guess the column types after transposing
  dft <- as.data.frame(lapply(dft, type.convert, as.is = as.is))
  return(dft)
}
df <- bind_rows(lapply(file.list, \(file) {
  df <- read.transposed.xlsx(file)
  df[['id']] <- file
  df
}))
Afterwards you'll have to convert the columns appropriately, for example (note origin may depend on your machine):
df$"Woke up" <- as.POSIXct(df$"Woke up", origin = '1899-12-31')
# If it comes in as "hh:mm:ss" use
library(lubridate)
df$"Woke up" <- hms(df$"Woke up")
There are a couple of things you need to do.
First, it appears that your data is transposed, meaning that your first column looks like variable names and the data runs across the rows. You can easily transpose the data in Excel before you import it into R. This will address the (..1, ..2) variable names you see when you import the data.
Secondly, import the date columns as strings.
The command df.list <- lapply(file.list, read_xlsx, skip = 20) uses the read_xlsx() function. I think you need to explicitly specify the column types or import them as strings.
Then you can use the stringr package (or any other package) to convert the strings to date variables. Also consider providing the code that you have used.
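For example, a minimal sketch of that approach (untested against your workbook; it reuses file.list from the question, and the "Woke up" column name is borrowed from the first answer):
library(readxl)
library(dplyr)
library(lubridate)
## col_types = "text" is recycled to every column, so readxl leaves the
## time cells as character strings instead of converting them
df.list <- lapply(file.list, read_xlsx, skip = 20, col_types = "text")
Actibrut <- bind_rows(df.list, .id = "id")
## Parse a time column that now arrives as "hh:mm:ss" text
Actibrut$"Woke up" <- hms(Actibrut$"Woke up")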

How do I use the concatenate function in R correctly when working with large datasets?

I'm assuming this is a super easy question to answer. I am working with a cervical cancer dataset and I have an Excel spreadsheet that I have already imported into R. I needed to convert the character variables to numeric variables so I could properly analyze them. That worked. But I have NO IDEA how to use the concatenate function in R for importing the actual data. Since there are 859 rows in the data set, I put c(1:859), but I think that just populates the spreadsheet with 1, 2, 3, 4, 5, ..., 859. I already have a data set that I've imported, but I have no idea how to code this so I can just transfer what's in the Excel document.
My code:
cervical <- read.csv("/Users/sophia/Downloads/risk_factors_cervical_cancer.csv")
sapply(cervical, class)
summary(cervical)
cervical <- data.frame(Number.of.sexual.partners = c(1:859),
                       First.sexual.intercourse = c(1:859),
                       Num.of.pregnancies = c(1:859),
                       Smokes..years. = c(1:859),
                       Hormonal.Contraceptives..years. = c(1:859),
                       IUD..years. = c(1:859))
cervical$Number.of.sexual.partners <- as.character(cervical$Number.of.sexual.partners)
cervical$First.sexual.intercourse <- as.character(cervical$First.sexual.intercourse)
cervical$Num.of.pregnancies <- as.character(cervical$Num.of.pregnancies)
cervical$Smokes..years. <- as.character(cervical$Smokes..years.)
cervical$Hormonal.Contraceptives..years. <-
as.character(cervical$Hormonal.Contraceptives..years.)
cervical$IUD..years. <- as.character(cervical$IUD..years.)
sapply(cervical, class)
cervical$Number.of.sexual.partners <-
as.numeric(as.character(cervical$Number.of.sexual.partners))
cervical$First.sexual.intercourse <-
as.numeric(as.character(cervical$First.sexual.intercourse))
cervical$Num.of.pregnancies <- as.numeric(as.character(cervical$Num.of.pregnancies))
cervical$Smokes..years. <- as.numeric(as.character(cervical$Smokes..years.))
cervical$Hormonal.Contraceptives..years. <-
as.numeric(as.character(cervical$Hormonal.Contraceptives..years.))
cervical$IUD..years. <- as.numeric(as.character(cervical$IUD..years.))
sapply(cervical, class)
Not so sure what you need, and it is also hard to help without the actual data.
But if you want to convert all character variables to numeric, you can use dplyr with mutate(), across(), where(), is.character(), and as.numeric():
library(dplyr)
cervical %>% mutate(across(where(is.character), as.numeric))
example:
# Create a data frame with numbers stored as characters:
df <- data.frame(a = as.character(1:4), b = as.character(sample(1:4)), stringsAsFactors = FALSE)
sapply(df, class)
a b
"character" "character"
Change class with dplyr:
df <- df %>% mutate(across(where(is.character), as.numeric))
sapply(df, class)
a b
"numeric" "numeric"

Using read_csv specifying data types for groups of columns in R

I would like to use read_csv because I am working with a large dataset. The variable types are being read incorrectly because I have many missing values. It should be possible to identify the type of a variable (column) from its name, because the name includes "DATE" if it is a date type and "Name" if it is a character type; the rest of the variables can keep the default col_guess type. I do not want to type out all 55 variables, so I tried this code first:
df <- read_csv('df.csv', col_types = cols((grepl("DATE$", colnames(df))==T)=col_date()), cols((grepl("Name$", colnames(df))==T)=col_character()))
I received this message:
Error: unexpected '=' in "df <- read_csv('df.csv', col_types = cols((grepl("DATE$", colnames(df))==T)="
So I tried to write a loop, since the df data is already in R (though the values of the wrongly identified variables have been deleted).
for (colname in colnames(df)) {
  if (grepl("DATE$", colname) == T) {
    ct1 <- cols(colname = col_date("%d/%m/%Y"))
  } else if (grepl("Name$", colname) == T) {
    ct2 <- cols(colname = col_character())
  } else {
    ct3 <- cols(colname = col_guess())
    tx <- c(ct1, ct2, ct3)
    print(tx)
  }
}
It does not do what I would like to get as output, and I do not know how I would need to continue even if I got the loop right.
The data is public; you can download it here (BasicCompanyDataAsOneFile): http://download.companieshouse.gov.uk/en_output.html
Any suggestion would be appreciated, thank you.
Since the data is already read into R, you can identify the columns by their names and apply the appropriate conversion function to each group of columns.
df <- readr::read_csv('df.csv')
date_cols <- grep('DATE$', names(df))
char_cols <- grep('Name$', names(df))
df[date_cols] <- lapply(df[date_cols], as.Date)
df[char_cols] <- lapply(df[char_cols], as.character)
You can also try type.convert(), which automatically converts columns to their respective types, but it might not work for date columns.
df <- type.convert(df)
I read the data in using read_csv:
df <- read_csv('DF.csv', col_types = cols(.default = "c"))
Then I used the following code to change the columns' data types:
date_cols <- grep('DATE$', names(df))
df[date_cols] <- lapply(df[date_cols], as.Date)
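Alternatively, if you would rather have read_csv assign the correct types in the first place, one possible sketch (untested against the actual Companies House file) is to read only the header row and build the cols() specification from the name patterns:
library(readr)
## Read just the header to get the column names
hdr <- names(read_csv('df.csv', n_max = 0, col_types = cols(.default = "c")))
## Map each column name to a collector based on its pattern
spec <- lapply(hdr, function(nm) {
  if (grepl("DATE$", nm)) col_date("%d/%m/%Y")
  else if (grepl("Name$", nm)) col_character()
  else col_guess()
})
names(spec) <- hdr
df <- read_csv('df.csv', col_types = do.call(cols, spec))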

R data frames: delete variables whose names all contain the same string

This may very well be a dupe, but I can't figure out the terminology to google it.
I know how to delete a variable from a data frame the normal way. But now I'm importing Qualtrics data, where I systematically assigned variable names like timer1_1, timer2_1, timer3_1, timer1_2, timer2_2, timer3_2 and so on.
Basically, in this example I want to delete every column whose name contains "timer".
Is there a way to do this? I have 56 variables named timer*, and I want them gone (among other variables that have the same kind of structure).
The similar question I saw was about the values in a column, so maybe some kind of grep() voodoo will work here as well.
You can do:
df <- df[grep("timer", names(df), value = TRUE, invert = TRUE)]
This will work with your typical case as well as any of these corner cases:
df <- data.frame(x = 1:2, y = 1:2)
df <- data.frame(x = 1:2, timer1 = 1:2)
df <- data.frame(timer1 = 1:2)
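If you already use the tidyverse, a dplyr equivalent (an alternative, not from the original answer) would be:
library(dplyr)
# Drop every column whose name contains "timer"
df <- select(df, -contains("timer"))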

R: read .dta file and use value labels only for selected variables to create a factor

What is the easiest way to read a .dta file in R and convert only specific variables to factors, using the Stata value labels? I didn't find a way to apply the convert.factors option of the foreign package to only some variables. I also failed with the memisc package.
library('foreign')
df <- read.dta("statafile.dta", convert.factors = TRUE)
I'd suggest something like this:
df <- read.dta("statafile.dta", convert.factors = FALSE)
df2 <- read.dta("statafile.dta", convert.factors = TRUE)
cols2convert <- c(3,7,9,11,36) # columns for which you want convert.factors to be TRUE
df[,cols2convert] <- df2[,cols2convert]
rm(df2)
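If you would rather pick the columns by name than by position, you can build cols2convert with grep() before copying the columns over (the name pattern below is purely hypothetical; replace it with whatever identifies your labelled variables):
cols2convert <- grep("^region|^status", names(df2))  # hypothetical pattern
df[cols2convert] <- df2[cols2convert]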
