I'm trying to analyse a large survey created with SurveyMonkey. The CSV export has hundreds of columns, and the output format is difficult to use because the headers run over two lines.
Has anybody found a simple way of managing the headers in the CSV file so that the analysis is manageable?
How do other people analyse results from SurveyMonkey?
Thanks!
You can export the data from SurveyMonkey in a form that is convenient for R: download the responses in 'Advanced Spreadsheet Format'.
What I did in the end was print out the headers using LibreOffice, labelled V1, V2, etc. Then I just read in the file as
m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)
and then just did the analysis against m1$V10, m1$V23 etc...
To get around the mess of multiple columns I used the following little function
# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
# e.g. mcols(df, c(14:18))
exp <- paste('df[,', cols, ']', sep='', collapse=',' )
# this creates something like...
# "df[,14],df[,15],df[,16],df[,17],df[,18]"
# now we just want to do a paste of this expression...
nexp <- paste(" paste(", exp, ", sep=' ')")
# so now nexp looks something like...
# " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep=' ')"
# now we just need to parse this text... and eval() it...
newcol <- eval(parse(text=nexp))
newcol <- gsub(' +', ' ', newcol) # replace runs of spaces by a single one (' *' would also match the empty string)
newcol <- gsub('^ *', '', newcol) # remove leading spaces
gsub(' *$', '', newcol) # remove trailing spaces
}
# mcols(df, c(14:18))
No doubt somebody will be able to clean this up!
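One possible clean-up, offered as a sketch rather than a definitive version: do.call() can hand the selected columns straight to paste(), which avoids building and parsing a code string. The name merge_cols and the demo data frame are made up for illustration.

```r
# Sketch: merge several columns into one string per row without eval(parse()).
# `merge_cols` is a hypothetical name, not part of the original answer.
merge_cols <- function(df, cols) {
  # pass each selected column to paste() as a separate argument
  parts <- unname(as.list(df[, cols, drop = FALSE]))
  merged <- do.call(paste, c(parts, sep = " "))
  merged <- gsub(" +", " ", merged)  # collapse runs of spaces
  trimws(merged)                     # drop leading and trailing whitespace
}

# Made-up demo data: each row has answers scattered over three columns
demo <- data.frame(a = c("x", ""), b = c("", "y"), c = c("z", ""),
                   stringsAsFactors = FALSE)
merge_cols(demo, 1:3)
```

The unname() call is there so that column names are never mistaken for named arguments of paste().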
To tidy up Likert-like scales I used:
# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
tidylik4 <- function(x) {
xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
x <- as.character(x) # guard against factor input, where ifelse() would return integer codes
y <- ifelse(x == '', NA, x)
ordered(y, levels=xlevels)
}
for (i in 44:52) {
m2[,i] <- tidylik4(m2[,i])
}
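The same tidy-up can probably be applied to a whole block of columns in one go with lapply(). This is only a sketch: the names lik_levels, to_likert and survey are invented for the example, and the level labels are assumed to match the export exactly.

```r
# Sketch: convert several Likert columns to ordered factors in one step.
# `lik_levels`, `to_likert` and `survey` are hypothetical names.
lik_levels <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")

to_likert <- function(x, levels = lik_levels) {
  x <- as.character(x)              # guard against factor input
  x[!is.na(x) & x == ""] <- NA      # treat blank answers as missing
  factor(x, levels = levels, ordered = TRUE)
}

# Made-up survey data with two Likert questions
survey <- data.frame(q1 = c("Agree", "", "Strongly Agree"),
                     q2 = c("Disagree", "Agree", ""),
                     stringsAsFactors = FALSE)
lik_cols <- c("q1", "q2")
survey[lik_cols] <- lapply(survey[lik_cols], to_likert)
table(survey$q1, useNA = "ifany")
```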
Feel free to comment as no doubt this will come up again!
I have to deal with this pretty frequently, and having the headers spread over two rows is a bit painful. This function fixes that issue so that you only have a one-row header to deal with. It also joins the multipunch questions so you have "top - bottom" style naming.
#' @param x The path to a surveymonkey csv file
fix_names <- function(x) {
rs <- read.csv(
x,
nrows = 2,
stringsAsFactors = FALSE,
header = FALSE,
check.names = FALSE,
na.strings = "",
encoding = "UTF-8"
)
rs[rs == ""] <- NA
rs[rs == "NA"] <- "Not applicable"
rs[rs == "Response"] <- NA
rs[rs == "Open-Ended Response"] <- NA
nms <- c()
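pre <- NULL # prefix carried over from the current multi-column question
for(i in seq_len(ncol(rs))) {
current_top <- rs[1,i]
current_bottom <- rs[2,i]
if(i < ncol(rs)) {
coming_top <- rs[1, i+1]
coming_bottom <- rs[2, i+1]
} else {
# last column: nothing follows it
coming_top <- NA
coming_bottom <- NA
}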
if(is.na(coming_top) & !is.na(current_top) & (!is.na(current_bottom) | grepl("^Other", coming_bottom)))
pre <- current_top
if((is.na(current_top) & !is.na(current_bottom)) | (!is.na(current_top) & !is.na(current_bottom)))
nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")
if(!is.na(current_top) & is.na(current_bottom))
nms[i] <- current_top
}
nms
}
Note that it returns the names only. I typically just call read.csv with ..., skip = 2, header = FALSE, save the result to a variable, and overwrite the names of that variable. It also helps a lot to set your na.strings and stringsAsFactors = FALSE.
nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms
As of November 2013, the webpage layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You'll get raw data as first row = question headers / each following row = 1 response, possibly split between multiple files if you have many responses / questions.
The issue with the headers is that columns with "select all that apply" will have a blank top row, and the column heading will be the row below. This is only an issue for those types of questions.
With this in mind, I wrote a loop that goes through all columns and replaces the column name with the value from the second row whenever the column name is blank (a blank name has a character length of 1).
Then you can drop the second row of the data and have a tidy data frame.
for(i in 1:ncol(df)){
newname <- colnames(df)[i]
if(nchar(newname) < 2){
colnames(df)[i] <- as.character(df[1,i])
}
}
df <- df[-1,]
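For what it's worth, the same idea can be written without a loop. The sketch below uses a tiny made-up data frame (raw) in place of a real export and treats a name as blank when nothing is left after trimming whitespace, which is an assumption about how the export was read in.

```r
# Sketch: promote second-row labels into blank column names without a loop.
# `raw` is a made-up stand-in for a SurveyMonkey export.
raw <- data.frame(c("", "1", "2"), c("Red", "x", ""), c("Blue", "", "x"),
                  stringsAsFactors = FALSE)
names(raw) <- c("ID", "", " ")   # blank names, as for "select all that apply"

blank <- nchar(trimws(names(raw))) == 0            # which names are blank?
names(raw)[blank] <- as.character(unlist(raw[1, blank]))
clean <- raw[-1, , drop = FALSE]                   # drop the label row
names(clean)
```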
Coming to the party late, but this is still an issue, and the best workaround I've found is a function that pastes the column names and sub-column names together, based on repeating values.
For instance, if you export to .csv, the blank column names are automatically replaced with names starting with X when the file is read into R. If you export to .xlsx, they start with ... instead (as in ...11 in the sample below).
Here's a base R solution:
sm_header_function <- function(x, rep_val){
orig <- x
sv <- x
sv <- sv[1,]
sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
sv <- t(sv)
sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
names(sv)[1] <- "name"
names(sv)[2] <- "value"
sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
sv$new_value <- paste0(sv$new_value, " ", sv$value)
new_names <- as.character(sv$new_value)
colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
orig <- orig[-c(1),]
return(orig)
}
sm_header_function(df, "X")
sm_header_function(df, "...")
With some sample data, the change in column names would look like this:
Original export from SurveyMonkey:
> colnames(sample)
[1] "Respondent ID" "Please provide your contact information:" "...11"
[4] "...12" "...13" "...14"
[7] "...15" "...16" "...17"
[10] "...18" "...19" "I wish it would have snowed more this winter."
Cleaned export from SurveyMonkey:
> colnames(sample_clean)
[1] "Respondent ID" "Please provide your contact information: Name"
[3] "Please provide your contact information: Company" "Please provide your contact information: Address"
[5] "Please provide your contact information: Address 2" "Please provide your contact information: City/Town"
[7] "Please provide your contact information: State/Province" "Please provide your contact information: ZIP/Postal Code"
[9] "Please provide your contact information: Country" "Please provide your contact information: Email Address"
[11] "Please provide your contact information: Phone Number" "I wish it would have snowed more this winter. Response"
Sample data:
structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621,
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin",
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale",
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's",
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2",
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia",
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa",
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.",
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104",
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country",
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com",
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com",
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646",
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944",
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response",
"Strongly disagree", "Strongly agree", "Neither agree nor disagree",
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
How about the following: use read.csv() with header=FALSE. Make two data frames, one with the two lines of headings and one with the answers to the survey. Then paste() the two heading rows together. Finally, assign the result with colnames().
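A minimal sketch of that approach, using a small in-memory CSV with made-up survey text instead of a real export (the variable names are also just for illustration):

```r
# Sketch: read a two-row header with header = FALSE and paste the rows together.
csv_text <- c(
  "Respondent ID,Contact,,Snow",
  ",Name,City,Response",
  "1,Ann,Oslo,Agree",
  "2,Bob,Rome,Disagree"
)
raw <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE,
                colClasses = "character")

head_rows <- raw[1:2, ]      # the two header lines
answers   <- raw[-(1:2), ]   # the survey responses

top    <- unlist(head_rows[1, ], use.names = FALSE)
bottom <- unlist(head_rows[2, ], use.names = FALSE)
colnames(answers) <- trimws(paste(top, bottom))  # join and trim stray spaces
rownames(answers) <- NULL
colnames(answers)
```

Note that this does not carry the question text forward into columns whose top cell is blank (the City column here ends up named just "City"), so multi-column questions may still need extra handling.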
I have a dataframe in the following long format:
I need to convert it into a list which should look something like this:
Wherein, each of the main element of the list would be the "Instance No." and its sub-elements should contain all its corresponding Parameter & Value pairs - in the format of "Parameter X" = "abc" as you can see in the second picture, listed one after the other.
Is there any existing function which can do this? I wasn't really able to find any. Any help would be really appreciated.
Thank you.
A dplyr solution
require(dplyr)
df_original <- data.frame("Instance No." = c(3,3,3,3,5,5,5,2,2,2,2),
"Parameter" = c("age", "workclass", "education", "occupation",
"age", "workclass", "education",
"age", "workclass", "education", "income"),
"Value" = c("Senior", "Private", "HS-grad", "Sales",
"Middle-aged", "Gov", "Hs-grad",
"Middle-aged", "Private", "Masters", "Large"),
check.names = FALSE)
# the split function requires a factor to use as the grouping variable.
# Param_Value will be the properly formatted vector
df_modified <- mutate(df_original,
Param_Value = paste0(Parameter, "=", Value))
# drop the parameter and value columns now that the data is contained in Param_Value
df_modified <- select(df_modified,
`Instance No.`,
Param_Value)
# there is now a list containing dataframes with rows grouped by Instance No.
list_format <- split(df_modified,
df_modified$`Instance No.`)
# The Instance No. is still in each dataframe. Loop through each and strip the column.
list_simplified <- lapply(list_format,
select, -`Instance No.`)
# unlist the remaining Param_Value column and drop the names.
list_out <- lapply(list_simplified,
unlist, use.names = FALSE)
There should now be a list of vectors formatted as requested.
$`2`
[1] "age=Middle-aged" "workclass=Private" "education=Masters" "income=Large"
$`3`
[1] "age=Senior" "workclass=Private" "education=HS-grad" "occupation=Sales"
$`5`
[1] "age=Middle-aged" "workclass=Gov" "education=Hs-grad"
The posted data.table solution is faster, but I think this is a bit more understandable.
require(data.table)
your_dt <- data.table(your_df)
dt_long <- melt.data.table(your_dt, id.vars='Instance No.')
class(dt_long) # for debugging
dt_long[, strVal:=paste(variable,value, sep = '=')]
result_list <- list()
for (i in unique(dt_long[['Instance No.']])){
result_list[[as.character(i)]] <- dt_long[`Instance No.`==i, strVal]
}
Just for reference, here is a base R one-liner to do this. df is your data frame.
l <- lapply(split(df, df[["Instance No."]]),
function(x) paste0(x$Parameter, "=", x$Value))
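A slightly shorter variant, sketched on a small made-up data frame (long) with the same column names as in the question: build the "Parameter=Value" strings first, then split that character vector by instance number.

```r
# Sketch: split a vector of "Parameter=Value" strings by instance number.
# `long` is made-up data mirroring the question's column names.
long <- data.frame(`Instance No.` = c(3, 3, 5, 2),
                   Parameter = c("age", "workclass", "age", "age"),
                   Value = c("Senior", "Private", "Middle-aged", "Middle-aged"),
                   check.names = FALSE, stringsAsFactors = FALSE)

pairs <- paste0(long$Parameter, "=", long$Value)
by_instance <- split(pairs, long[["Instance No."]])
by_instance[["3"]]
```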
I am trying to combine multiple tableGrob objects with a ggplot object into a single .png file; I am having trouble understanding how to edit the tableGrob theme parameters in a way that lets me adjust the padding and dimensions of the table objects. Ideally, I want them to be in a 4*1 grid, with minimal padding between each. The text of the table objects should also be left-justified.
I am working with dummy data, and each row of my input dataset will be used to create its own .png file (two rows are included in the code snippet below to generate a reproducible example).
I tried to use this post as an example, and set the grid.arrange spacing based on the "heights" attributes of each table object, but this hasn't quite done the trick. As a side note, right now, the plot gets overwritten each time; I'll fix this later, am just concerned with getting the output dimensions/arrangement correct. Code is below; edited to include library calls, and fixed a typo:
require("ggplot2")
require("gridExtra")
require("grid")
# Generate dummy data frame
sampleVector <- c("1", "Amazing", "Awesome", "0.99", "0.75", "0.5", "$5,000.00", "0.55", "0.75", "0.31", "0.89", "0.25", "Strong community support", "Strong leadership", "Partners had experience", "", "CBO not supportive", "Limited experience", "Limited monitoring", "")
sampleVectorB <- c("3", "Amazing", "Awesome", "0.99", "0.75", "0.5", "$5,000.00", "0.55", "0.75", "0.31", "0.89", "0.25", "Strong community support", "Strong leadership", "Partners had experience", "", "CBO not supportive", "Limited experience", "Limited monitoring", "")
sampleDF <- data.frame(rbind(sampleVector, sampleVectorB))
colnames(sampleDF) <- c("CBO", "PMQ", "HMQ", "ER", "PR", "HR", "NS", "CTI", "Home and Hosp", "Home", "Phone", "Other", "S1", "S2", "S3", "S4", "C1", "C2", "C3", "C4")
indata <- sampleDF
#Finds the longest string from a vector of strings (i.e. #chars, incld. whitespace); returns index into this vector that corresponds to this string
findMax <- function(tempVector){
tempMaxIndex <- 1
for(i in 1:length(tempVector)){
print(nchar(tempVector[i]))
if(nchar(tempVector[i]) > nchar(tempVector[tempMaxIndex])){
tempMaxIndex <- i
}
}
return(tempMaxIndex)
}
# Same as above but w/o the colon:
addWhitespacePlain <- function(stringVec, maxNum){
for(i in 1:length(stringVec))
{
string <- stringVec[i]
while(nchar(string) < maxNum+1){
string <- paste(string, " ")
}
stringVec[i] <- string
}
return(stringVec)
}
staticText <- c("Participant Match Quality", "Hospital-Level Match Quality", "Enrollment Rate", "Participant Readmissions", "Hospital Readmissions", "Net Savings",
"Strengths", "Challenges")
m <- findMax(staticText)
staticText <- addWhitespacePlain(staticText, nchar(staticText[m]))
# Loop through our input data and keep only one CBO each pass
for(i in 1:length(indata$CBO)){
# Select only the row that has this CBO's data
temp <- indata[i,]
###############################################################################################
# MAKE TOP TEXT TABLE (as a DF)
# Get values from our input data set to fill in the values for the top text portion of the graphic
topVals <- t(data.frame(temp[2], temp[3], temp[4], temp[5], temp[6], temp[7]))
topDF <- data.frame(staticText[1:6], topVals, row.names=NULL)
colnames(topDF) <- c(paste("CBO", temp[1]), " ")
# Find which of the strings from the top text portion is the longest (i.e. max # of chars; including whitespace)
m2 <- findMax(rownames(topDF)) # returns an index into the vector; this index corresponds to string w/max num of chars
# Add whitespace to non-max strings so all have the same length and also include colons
rownames(topDF) <- addWhitespacePlain(rownames(topDF), nchar(rownames(topDF)[m2]))
# for testing
# print(topDF, right=FALSE)
###############################################################################################
# MAKE BAR CHART
#Subset the data to select the vars we need for the horizontal bar plot
graphdata <- t(data.frame(temp[,8:12]))
vars <- c("CTI", "Home & Hosp.", "Home", "Phone", "Other")
graphDF <- data.frame(vars, graphdata, row.names = NULL)
colnames(graphDF) <- c("vars", "values")
# Make the plot (ggplot object)
barGraph <- ggplot(graphDF, aes(x=vars, y=values,fill=factor(vars))) +
geom_bar(stat = "identity") +
theme(axis.title.y=element_blank())+
theme(legend.position="none")+
coord_flip()
# print(barGraph)
###############################################################################################
# MAKE BOTTOM TEXT TABLE
strengths <- t(data.frame(temp[13], temp[14], temp[15], temp[16]))
challenges <- t(data.frame(temp[17], temp[18], temp[19], temp[20]))
#Drop nulls
strengths <- data.frame(strengths[which(!is.na(strengths)),], row.names=NULL)
challenges <- data.frame(challenges[which(!is.na(challenges)),], row.names=NULL)
colnames(strengths) <- c(staticText[7])
colnames(challenges) <- c(staticText[8])
###############################################################################################
# OUTPUT (padding not resolved yet)
# Set the path for the combined image
png("test1", height=1500, width=1000)
#customTheme <- ttheme_minimal(core=list(fg_params=list(hjust=0, x=0.1)),
# rowhead=list(fg_params=list(hjust=0, x=0)))
# top<-tableGrob(topDF, theme=customTheme)
# bottom_strength <- tableGrob(strengths, theme=customTheme)
# bottom_challenges <- tableGrob(challenges, theme=customTheme)
top<-tableGrob(topDF)
bottom_strength <- tableGrob(strengths)
bottom_challenges <- tableGrob(challenges)
x <- sum(top$heights)
y <- sum(bottom_strength$heights)
z <- sum(bottom_challenges$heights)
grid.arrange(top, barGraph, bottom_strength, bottom_challenges,
as.table=TRUE,
heights=c(2, 1, 2, 2),
nrow = 4)
# heights= unit.c(x, unit(1), y, z))
dev.off()
}
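Your example is too complicated, so I'm not sure exactly what the issue is. The following 4x1 layout has zero padding; is that what you're after?
library(ggplot2) # qplot(), theme() and the midwest data set
library(gridExtra) # tableGrob(), grid.arrange()
library(grid) # unit.c(), unit(), grid.newpage()
ta <- tableGrob(iris[1:4,1:2])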
tb <- tableGrob(mtcars[1:3,1:3])
tc <- tableGrob(midwest[1:5,1:2])
p <- qplot(1,1) + theme(plot.background=element_rect(colour = "black"))
h <- unit.c(sum(ta$heights), unit(1,"null"), sum(tb$heights), sum(tc$heights))
grid.newpage()
grid.arrange(ta,p,tb,tc, heights=h)
I am using a package called "memisc" to generate a codebook for my survey of 2000 variables. The codebook is pretty much a frequency table with a description and the wording of each variable. The package provides a function called codebook that returns a codebook object. The problem is that I can't write this object anywhere. I tried to write it to a text file and to a PDF file, and neither works.
This is a code to generate a codebook (the author's code):
library(memisc)
Data <- data.set(
vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
income = exp(rnorm(300,sd=.7))*2000
)
Data <- within(Data,{
description(vote) <- "Vote intention"
description(region) <- "Region of residence"
description(income) <- "Household income"
wording(vote) <- "If a general election would take place next tuesday,
the candidate of which party would you vote for?"
wording(income) <- "All things taken into account, how much do all
household members earn in sum?"
foreach(x=c(vote,region),{
measurement(x) <- "nominal"
})
measurement(income) <- "ratio"
labels(vote) <- c(
Conservatives = 1,
Labour = 2,
"Liberal Democrats" = 3,
"Don't know" = 8,
"Answer refused" = 9,
"Not applicable" = 97,
"Not asked in survey" = 99)
labels(region) <- c(
England = 1,
Scotland = 2,
Wales = 3,
"Not applicable" = 97,
"Not asked in survey" = 99)
foreach(x=c(vote,region,income),{
annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
})
missing.values(vote) <- c(8,9,97,99)
missing.values(region) <- c(97,99)
})
r=codebook(Data)
So my final objective is to write the object r to a pdf/word/excel/text file. Any of these would be just great.
The easiest way to get the text file from this would be to just use capture.output:
capture.output(r, file="test.txt")
Here are the first few lines read back into R:
head(readLines("test.txt"))
# [1] "==================================================================================="
# [2] ""
# [3] " vote 'Vote intention'"
# [4] ""
# [5] " \"If a general election would take place next tuesday, the candidate of which"
# [6] " party would you vote for?\""
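The same pattern should work for any object that has a print method. Here is a sketch that swaps in summary(cars) for the codebook so it runs without memisc, and writes to a temporary file (out_file is just an illustrative name):

```r
# Sketch: write the printed form of an object to a text file.
# summary(cars) stands in for the codebook object here.
out_file <- tempfile(fileext = ".txt")
obj <- summary(cars)

capture.output(obj, file = out_file)  # prints obj into the file
printed <- readLines(out_file)        # read the lines back to check
head(printed, 3)
```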
It's possible to output the codebook directly to a txt file using the Write function:
Write(codebook(Data), file = "datacodebook.txt")
I'm having a looping issue. It should be simple to solve, but "R for Stata Users" (I've coded in Stata for a couple of years), Roger Peng's videos, and Google don't seem to be helping me. Can one of you please explain what I'm doing wrong?
I'm trying to write a loop that runs through the 'thresholds' data frame to pull out information from three sets of columns. I can do what I want by writing the same segment of code three times, but as the code gets more complicated, this will become quite cumbersome.
Here is a sample of 'thresholds' (see dput output below, added by a friendly reader):
       threshold_1_name threshold_1_dir threshold_1_value
1            overweight               >                25
2 possible malnutrition               <                31
3                    Q1               >               998
4                    Q1               >               998
5                    Q1               >               998
6                    Q1               >               998
  threshold_1_units threshold_2_name threshold_2_dir threshold_2_value threshold_2_units
1            kg/m^2            obese               >                30            kg/m^2
2                cm             <NA>               >                NA
3              <NA>               Q3               >               998
4                                 Q3               >               998
5                                 Q3               >               998
6                                 Q3               >               998
This code does what I want to do:
newvars1 <- paste(thresholds$varname, thresholds$threshold_1_name, sep = "_")
noval <- is.na(thresholds$threshold_1_value)
newvars1 <- newvars1[!noval]
newvars2 <- paste(thresholds$varname, thresholds$threshold_2_name, sep = "_")
noval <- is.na(thresholds$threshold_2_value)
newvars2 <- newvars2[!noval]
newvars3 <- paste(thresholds$varname, thresholds$threshold_3_name, sep = "_")
noval <- is.na(thresholds$threshold_3_value)
newvars3 <- newvars3[!noval]
And here is how I am trying to loop:
variables <- NULL
for (i in 1:3) {
valuevar <- paste("threshold", i, "value", sep = "_")
namevar <- paste("threshold", i, "name", sep = "_")
newvar <- paste("varnames", i, sep = "")
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (check == FALSE) {
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
variables <- c(variables, newvars)
}
And here is the error I am receiving:
Error: unexpected '}' in "}"
I think something about the way I am calling the 'i' is messing things up, but I'm not sure how to do it correctly. My Stata habits using locals are really biting me in the butt as I switch to R.
EDIT to add dput output, by a friendly reader:
thresholds <- structure(list(varname = structure(1:6, .Label = c("varA", "varB",
"varC", "varD", "varE", "varF"), class = "factor"), threshold_1_name = c("overweight",
"possible malnutrition", "Q1", "Q1", "Q1", "Q1"), threshold_1_dir = c(">",
"<", ">", ">", ">", ">"), threshold_1_value = c(25L, 31L, 998L,
998L, 998L, 998L), threshold_1_units = c("kg/m^2", "cm", NA,
NA, NA, NA), threshold_2_name = c("obese", "<NA>", "Q3", "Q3",
"Q3", "Q3"), threshold_2_dir = c(">", ">", ">", ">", ">", ">"
), threshold_2_value = c(30L, NA, 998L, 998L, 998L, 998L), threshold_2_units = c("kg/m^2",
"cm", NA, NA, NA, NA)), .Names = c("varname", "threshold_1_name",
"threshold_1_dir", "threshold_1_value", "threshold_1_units",
"threshold_2_name", "threshold_2_dir", "threshold_2_value", "threshold_2_units"
), row.names = c(NA, -6L), class = "data.frame")
The first problem I see is in if (check = "FALSE"): that = is an assignment; to test a condition you need ==. Also, quoting "FALSE" means you are comparing against the string (literally the word FALSE), not the logical value, which is FALSE without quotation marks.
The second problem has been rightly pointed out by @BlueMagister: you're missing a ) at the end of for (j in 1:length(...)) {
See # bad!
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (check = "FALSE") { # bad!
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
See # good!
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (check == FALSE) { # good!
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
But because it's an if statement you can use really simple logic, especially on logicals (TRUE / FALSE values).
See # better!
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (!check) { # better!
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
There is obviously a missing bracket in your for loop. You should consider using an editor that supports brace matching to avoid that kind of error.
I think the easiest thing to do would be to just write a function that does what your desired non-looping code does. For reference, here's the output from that code, using the dput output from the edit to your question.
> newvars1 <- paste(thresholds$varname, thresholds$threshold_1_name, sep = "_")
> newvars1 <- newvars1[!is.na(thresholds$threshold_1_value)]
> newvars2 <- paste(thresholds$varname, thresholds$threshold_2_name, sep = "_")
> newvars2 <- newvars2[!is.na(thresholds$threshold_2_value)]
> c(newvars1, newvars2)
[1] "varA_overweight" "varB_possible malnutrition"
[3] "varC_Q1" "varD_Q1"
[5] "varE_Q1" "varF_Q1"
[7] "varA_obese" "varC_Q3"
[9] "varD_Q3" "varE_Q3"
[11] "varF_Q3"
Here's what that function would look like:
unlist(lapply(1:2, function(k) {
newvars <- paste(thresholds$varname,
thresholds[[paste("threshold", k, "name", sep="_")]], sep = "_")
newvars <- newvars[!is.na(thresholds[[paste("threshold", k, "value", sep="_")]])]
}))
# [1] "varA_overweight" "varB_possible malnutrition"
# [3] "varC_Q1" "varD_Q1"
# [5] "varE_Q1" "varF_Q1"
# [7] "varA_obese" "varC_Q3"
# [9] "varD_Q3" "varE_Q3"
#[11] "varF_Q3"
I tried to figure out what was going on in your loop but there was a lot in there that didn't make sense to me; here's how I'd write it if I was going to loop in that way.
variables <- NULL
for (i in 1:2) {
valuevar <- paste("threshold", i, "value", sep = "_")
namevar <- paste("threshold", i, "name", sep = "_")
newvars <- c()
for (j in 1:nrow(thresholds)) {
if (!is.na(thresholds[[valuevar]][j])) {
newvars <- c(newvars, paste(thresholds$varname[j],
thresholds[[namevar]][j], sep = "_"))
}
}
variables <- c(variables, newvars)
}
variables
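If the number of threshold sets may grow, one option is to discover the column sets from the column names instead of hard-coding 1:2. This is a sketch on a cut-down, made-up version of the data (thr); it assumes every threshold_N_value column has a matching threshold_N_name column.

```r
# Sketch: find every threshold_N_value column and build the names from it.
# `thr` is a reduced, made-up version of the 'thresholds' data frame.
thr <- data.frame(
  varname = c("varA", "varB"),
  threshold_1_name = c("overweight", "possible malnutrition"),
  threshold_1_value = c(25L, 31L),
  threshold_2_name = c("obese", NA),
  threshold_2_value = c(30L, NA),
  stringsAsFactors = FALSE
)

value_cols <- grep("^threshold_[0-9]+_value$", names(thr), value = TRUE)
new_names <- unlist(lapply(value_cols, function(vcol) {
  name_col <- sub("_value$", "_name", vcol)  # matching *_name column
  keep <- !is.na(thr[[vcol]])                # rows that have a threshold value
  paste(thr$varname[keep], thr[[name_col]][keep], sep = "_")
}))
new_names
```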
I'm working in R, but I need to deliver some data in SPSS format with both 'variable labels' and 'value labels', and I'm kind of stuck.
I've added variable labels to my data using the label function from Hmisc. This adds the variable labels as a label attribute, which is handy when using describe() from the Hmisc package. The problem is that I cannot get the write.foreign() function, from the foreign package, to recognize these labels as variable labels. I imagine I need to modify write.foreign() to use the label attribute as the variable label when writing the .sps file.
I looked at the R list and at Stack Overflow, but I could only find a post from 2006 on the R list about exporting variable labels to SPSS from R, and it doesn't seem to answer my question.
Here is my working example,
# First I create a dummy dataset
df <- data.frame(id = c(1:6), p.code = c(1, 5, 4, NA, 0, 5),
p.label = c('Optometrists', 'Nurses', 'Financial analysts',
'<NA>', '0', 'Nurses'), foo = LETTERS[1:6])
# Second, I add some variable labels using label from the Hmisc package
# install.packages('Hmisc', dependencies = TRUE)
library(Hmisc)
label(df) <- "Sweet sweet data"
label(df$id) <- "id !@#$%^"
label(df$p.label) <- "Profession with human readable information"
label(df$p.code) <- "Profession code"
label(df$foo) <- "Variable label for variable x.var"
# modify the name of one variable, just to see what happens when exported.
names(df)[4] <- "New crazy name for 'foo'"
# Third I export the data with write.foreign from the foreign package
# install.packages('foreign', dependencies = TRUE)
setwd('C:\\temp')
library(foreign)
write.foreign(df,"df.wf.txt","df.wf.sps", package="SPSS")
list.files()
[1] "df.wf.sps" "df.wf.txt"
When I inspect the .sps file (see the content of 'df.wf.sps' below), my variable labels are identical to my variable names, except for foo, which I renamed to "New crazy name for 'foo'". This variable has a new and seemingly random name, but the correct variable label.
Does anyone know how to get the label attributes and the variable names exported as 'variable labels' and 'label names' into a .sps file? Maybe there is a smarter way to store 'variable labels' than my current method?
Any help would be greatly appreciated.
Thanks, Eric
Content of 'df.wf.sps' export using write.foreign from the foreign package
DATA LIST FILE= "df.wf.txt" free (",")
/ id p.code p.label Nwcnf.f. .
VARIABLE LABELS
id "id"
p.code "p.code"
p.label "p.label"
Nwcnf.f. "New crazy name for 'foo'"
.
VALUE LABELS
/
p.label
1 "0"
2 "Financial analysts"
3 "Nurses"
4 "Optometrists"
/
Nwcnf.f.
1 "A"
2 "B"
3 "C"
4 "D"
5 "E"
6 "F"
.
EXECUTE.
Update April 16 2012 at 15:54:24 PDT;
What I am looking for is a way to tweak write.foreign to write a .sps file where this part,
[…]
VARIABLE LABELS
id "id"
p.code "p.code"
p.label "p.label"
Nwcnf.f. "New crazy name for 'foo'"
[…]
looks like this,
[…]
VARIABLE LABELS
id "id !@#$%^"
p.code "Profession code"
p.label "Profession with human readable information"
"New crazy name for 'foo'" "New crazy name for 'foo'"
[…]
The last line is a bit ambitious; I don't really need variables with white space in their names, but I would like the label attributes to be transferred to the .sps file (that I produce with R).
Try this function and see if it works for you. If not, add a comment and I can see what I can do as far as troubleshooting goes.
# Step 1: Make a backup of your data, just in case
df.orig = df
# Step 2: Load the following function
get.var.labels = function(data) {
a = do.call(llist, data)
tempout = vector("list", length(a))
for (i in 1:length(a)) {
tempout[[i]] = label(a[[i]])
}
b = unlist(tempout)
structure(c(b), .Names = names(data))
}
# Step 3: Apply the variable.label attributes
attributes(df)$variable.labels = get.var.labels(df)
# Step 4: Load the write.SPSS function available from
# https://stat.ethz.ch/pipermail/r-help/2006-January/085941.html
# Step 5: Write your SPSS datafile and codefile
write.SPSS(df, "df.sav", "df.sps")
The above example is assuming that your data is named df, and you have used Hmisc to add labels, as you described in your question.
Update: A Self-Contained Function
If you do not want to alter your original file, as in the example above, and if you are connected to the internet while you are using this function, you can try this self-contained function:
write.Hmisc.SPSS = function(data, datafile, codefile) {
a = do.call(llist, data)
tempout = vector("list", length(a))
for (i in 1:length(a)) {
tempout[[i]] = label(a[[i]])
}
b = unlist(tempout)
label.temp = structure(c(b), .Names = names(data))
attributes(data)$variable.labels = label.temp
source("http://dl.dropbox.com/u/2556524/R%20Functions/writeSPSS.R")
write.SPSS(data, datafile, codefile)
}
Usage is simple:
write.Hmisc.SPSS(df, "df.sav", "df.sps")
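To see what the label-collecting step does without Hmisc or an internet connection, here is a base R sketch. It assumes the labels live in each column's "label" attribute, which is where Hmisc's label() stores them as described in the question; collect_labels and the data frame d are made up, and falling back to the column name for unlabelled columns is a design choice of this sketch, not of the functions above.

```r
# Sketch: collect per-column "label" attributes into a named character vector.
# `collect_labels` is a hypothetical helper; unlabelled columns fall back to
# their own name.
collect_labels <- function(data) {
  vapply(names(data), function(nm) {
    lab <- attr(data[[nm]], "label", exact = TRUE)
    if (is.null(lab) || !nzchar(lab)) nm else as.character(lab)
  }, character(1))
}

d <- data.frame(id = 1:3, code = c(1, 5, 4))
attr(d$id, "label") <- "Respondent id"   # set a label by hand
collect_labels(d)
```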
The function that you linked to (here) should work, but I think the problem is that your dataset doesn't actually have the variable.labels and label.table attributes that would be needed to write the SPSS script file.
I don't have access to SPSS, but try the following and see if it at least points you in the right direction. Unfortunately, I don't see an easy way to do this other than editing the output of dput manually.
df = structure(list(id = 1:6,
p.code = c(1, 5, 4, NA, 0, 5),
p.label = structure(c(5L, 4L, 2L, 3L, 1L, 4L),
.Label = c("0", "Financial analysts",
"<NA>", "Nurses",
"Optometrists"),
class = "factor"),
foo = structure(1:6,
.Label = c("A", "B", "C", "D", "E", "F"),
class = "factor")),
.Names = c("id", "p.code", "p.label", "foo"),
label.table = structure(list(id = NULL,
p.code = NULL,
p.label = structure(c("1", "2", "3", "4", "5"),
.Names = c("0", "Financial analysts",
"<NA>", "Nurses",
"Optometrists")),
foo = structure(1:6,
.Names = c("A", "B", "C", "D", "E", "F"))),
.Names = c("id", "p.code", "p.label", "foo")),
variable.labels = structure(c("id !@#$%^", "Profession code",
"Profession with human readable information",
"New crazy name for 'foo'"),
.Names = c("id", "p.code", "p.label", "foo")),
codepage = 65001L)
Compare the above with the output of dput for your sample dataset. Notice that label.table and variable.labels have been added, and a line that said something like row.names = c(NA, -6L), class = "data.frame" was removed.
Update
NOTE: This will not work with the default write.foreign function in R. To test this you first need to load the write.SPSS function shared here, and (of course), make sure that you have the foreign package loaded. Then, you write your files as follows:
write.SPSS(df, datafile="df.sav", codefile="df.sps")