I'm having a looping issue. It should be simple to solve, but "R for Stata Users" (I've coded in Stata for a couple of years), Roger Peng's videos, and Google don't seem to be helping me. Can one of you please explain to me what I'm doing wrong?
I'm trying to write a loop that run through the 'thresholds' dataframe to pull out information from three sets of columns. I can do what I want to do by writing the same segment of code three times, but as the code gets more complicated, this will become quite cumbersome.
Here is a sample of 'thresholds' (see dput output below, added by a friendly reader):
threshold_1_name threshold_1_dir threshold_1_value
1 overweight > 25
2 possible malnutrition < 31
3 Q1 > 998
4 Q1 > 998
5 Q1 > 998
6 Q1 > 998
threshold_1_units threshold_2_name threshold_2_dir threshold_2_value threshold_2_units
1 kg/m^2 obese > 30 kg/m^2
2 cm <NA> > NA
3 <NA> Q3 > 998
4 Q3 > 998
5 Q3 > 998
6 Q3 > 998
This code does what I want to do:
newvars1 <- paste(thresholds$varname, thresholds$threshold_1_name, sep = "_")
noval <- is.na(thresholds$threshold_1_value)
newvars1 <- newvars1[!noval]
newvars2 <- paste(thresholds$varname, thresholds$threshold_2_name, sep = "_")
noval <- is.na(thresholds$threshold_2_value)
newvars2 <- newvars2[!noval]
newvars3 <- paste(thresholds$varname, thresholds$threshold_3_name, sep = "_")
noval <- is.na(thresholds$threshold_3_value)
newvars3 <- newvars3[!noval]
And here is how I am trying to loop:
variables <- NULL
for (i in 1:3) {
valuevar <- paste("threshold", i, "value", sep = "_")
namevar <- paste("threshold", i, "name", sep = "_")
newvar <- paste("varnames", i, sep = "")
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (check == FALSE) {
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
variables <- c(variables, newvars)
}
And here is the error I am receiving:
Error: unexpected '}' in "}"
I think something about the way I am calling the 'i' is messing things up, but I'm not sure how to do it correctly. My Stata habits using locals are really biting me in the butt as I switch to R.
EDIT to add dput output, by a friendly reader:
thresholds <- structure(list(varname = structure(1:6, .Label = c("varA", "varB",
"varC", "varD", "varE", "varF"), class = "factor"), threshold_1_name = c("overweight",
"possible malnutrition", "Q1", "Q1", "Q1", "Q1"), threshold_1_dir = c(">",
"<", ">", ">", ">", ">"), threshold_1_value = c(25L, 31L, 998L,
998L, 998L, 998L), threshold_1_units = c("kg/m^2", "cm", NA,
NA, NA, NA), threshold_2_name = c("obese", "<NA>", "Q3", "Q3",
"Q3", "Q3"), threshold_2_dir = c(">", ">", ">", ">", ">", ">"
), threshold_2_value = c(30L, NA, 998L, 998L, 998L, 998L), threshold_2_units = c("kg/m^2",
"cm", NA, NA, NA, NA)), .Names = c("varname", "threshold_1_name",
"threshold_1_dir", "threshold_1_value", "threshold_1_units",
"threshold_2_name", "threshold_2_dir", "threshold_2_value", "threshold_2_units"
), row.names = c(NA, -6L), class = "data.frame")
The first problem I see is in if(check = "FALSE") that's an assignment = if you're testing a condition it needs to be ==. Also, quoting the word "FALSE" means you're testing a variable for the string value (literally the word FALSE), not the logical value, which is FALSE without the quotations.
The second problem has been rightly pointed out by #BlueMagister, you're missing ) at the end of for (j in 1:length(...)) {
See # bad!
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (check = "FALSE") { # bad!
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
See # good!
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (check == FALSE) { # good!
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
But because it's an if statement you can use really simple logic, especially on logicals (TRUE / FALSE values).
See # better!
for (j in 1:length(thresholds$varname)) {
check <- is.na(thresholds[valuevar[j]])
if (!check) { # better!
newvars <- paste(thresholds$varname, thresholds[namevar], sep = "_")
}
}
There is obviously a missing bracket in you for loop. You should consider to use an editor that supports brace matching to avoid those kind of errors.
I think the easiest thing to do would be to just write a function that does what your desired non-looping code does. For reference, here's the output from that code, using the dput output from the edit to your question.
> newvars1 <- paste(thresholds$varname, thresholds$threshold_1_name, sep = "_")
> newvars1 <- newvars1[!is.na(thresholds$threshold_1_value)]
> newvars2 <- paste(thresholds$varname, thresholds$threshold_2_name, sep = "_")
> newvars2 <- newvars2[!is.na(thresholds$threshold_2_value)]
> c(newvars1, newvars2)
[1] "varA_overweight" "varB_possible malnutrition"
[3] "varC_Q1" "varD_Q1"
[5] "varE_Q1" "varF_Q1"
[7] "varA_obese" "varC_Q3"
[9] "varD_Q3" "varE_Q3"
[11] "varF_Q3"
Here's what that function would look like:
unlist(lapply(1:2, function(k) {
newvars <- paste(thresholds$varname,
thresholds[[paste("threshold", k, "name", sep="_")]], sep = "_")
newvars <- newvars[!is.na(thresholds[[paste("threshold", k, "value", sep="_")]])]
}))
# [1] "varA_overweight" "varB_possible malnutrition"
# [3] "varC_Q1" "varD_Q1"
# [5] "varE_Q1" "varF_Q1"
# [7] "varA_obese" "varC_Q3"
# [9] "varD_Q3" "varE_Q3"
#[11] "varF_Q3"
I tried to figure out what was going on in your loop but there was a lot in there that didn't make sense to me; here's how I'd write it if I was going to loop in that way.
variables <- NULL
for (i in 1:2) {
valuevar <- paste("threshold", i, "value", sep = "_")
namevar <- paste("threshold", i, "name", sep = "_")
newvars <- c()
for (j in 1:nrow(thresholds)) {
if (!is.na(thresholds[[valuevar]][j])) {
newvars <- c(newvars, paste(thresholds$varname[j],
thresholds[[namevar]][j], sep = "_"))
}
}
variables <- c(variables, newvars)
}
variables
Related
I have a dataset with proteins accession numbers (DataGranulomeTidy). I have written a function (extractInfo) in r to scrape some information of those proteins from the ncbi website. The function works as expected when I run it in a short "for" loop.
DataGranulomeTidy <- tibble(GIaccessionNumber = c("29436380", "4504165", "17318569"))
extractInfo <- function(GInumber){
tempPage <- readLines(paste("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=", GInumber, "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000", sep = ""), skipNul = TRUE)
tempPage <- base::paste(tempPage, collapse = "")
Accession <- str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)")
Symbol <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
GeneID <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
out <- paste(Symbol, Accession, GeneID, sep = "---")
return(out)
}
for(n in 1:3){
print(extractInfo(GInumber = DataGranulomeTidy$GIaccessionNumber[n]))
}
[1] "MYH9--- AAH49849---4627"
[1] "GSN--- NP_000168---2934"
[1] "KRT1--- NP_006112---3848"
When I use the same function in a dplyr pipe I doesn't work and I can't figure our why.
> DataGranulomeTidy %>% mutate(NewVar = extractInfo(.$GIaccessionNumber))
Error in file(con, "r") : argumento 'description' inválido
At this point I could make things work without using the "pipe" operator by using the "for" operator but I would like so much to understand why the function does not work in the dplyr pipe.
It is the cause that your UDF can't treat vector.
vectorized_extractInfo <- Vectorize(extractInfo, "GInumber")
DataGranulomeTidy %>%
mutate(NewVar = vectorized_extractInfo(GIaccessionNumber))
As #cuttlefish44 already pointed out, the problem is that your fun is not a vectorized fun. My approach uses purrr::map_chr. Another option would be to use dplyr::rowwise:
library(tidyverse)
DataGranulomeTidy <- tibble(GIaccessionNumber = c("29436380", "4504165", "17318569"))
extractInfo <- function(GInumber){
tempPage <- readLines(paste("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=", GInumber, "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000", sep = ""), skipNul = TRUE)
tempPage <- base::paste(tempPage, collapse = "")
Accession <- str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)")
Symbol <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
GeneID <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
out <- paste(Symbol, Accession, GeneID, sep = "---")
return(out)
}
DataGranulomeTidy %>% mutate(NewVar = map_chr(GIaccessionNumber, extractInfo))
#> # A tibble: 3 x 2
#> GIaccessionNumber NewVar
#> <chr> <chr>
#> 1 29436380 MYH9--- AAH49849---4627
#> 2 4504165 GSN--- NP_000168---2934
#> 3 17318569 KRT1--- NP_006112---3848
Created on 2020-04-17 by the reprex package (v0.3.0)
There is a rentrez package for NCBI queries, for example:
library(rentrez)
protein <- entrez_summary("protein", id = 29436380)
protein$caption
# [1] "AAH49849"
links <- entrez_link(dbfrom = "protein", id = 29436380, db = "gene")
links$links$protein_gene
# [1] "4627"
gene <- entrez_summary("gene", id = links$links$protein_gene)
gene$name
# [1] "MYH9"
Wrap this up into a function, then we don't need to mess about with regex.
My dataframe:
>datasetM
Mean
ENSORLG00000001933:tex11 2500.706
ENSORLG00000010797: 44225.330
ENSORLG00000003008:pabpc1a 11788.555
ENSORLG00000001973:sept6 3100.493
ENSORLG00000000997: 5418.796
Output needed:
>out
[1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6" "ENSORLG00000000997"
I tried this, but I only retrieve the part before the separator:
titles <- rownames(datasetM)
vapply(strsplit(titles,":"), `[`, 1, FUN.VALUE=character(1))
Note: There is not logic in the alternance of ENS000:name and ENS00:
Note 2: ENSOR are rownames
Note 3: When there is nothing after ":" I want the ENSOR
Here is a solution with base R:
sapply(strsplit(rownames(df), ":"), function(x) x[length(x)])
# [1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6"
# [5] "ENSORLG00000000997"
Another solution with sub, might be simpler:
sub("^\\w+:(?=\\w)|:", "", rownames(df), perl = TRUE)
# [1] "tex11" "ENSORLG00000010797" "pabpc1a" "sept6"
# [5] "ENSORLG00000000997"
Data:
df = read.table(text = " Mean
ENSORLG00000001933:tex11 2500.706
ENSORLG00000010797: 44225.330
ENSORLG00000003008:pabpc1a 11788.555
ENSORLG00000001973:sept6 3100.493
ENSORLG00000000997: 5418.796", header = TRUE, row.names = 1)
Here is a vectorized way to do this using a regex (taken from here) to identify the last character of each rowname,
rownames(df)[!sub('.*(?=.$)', '', rownames(df), perl=TRUE) == ':'] <-
sub('.*:', '', rownames(df)[!sub('.*(?=.$)', '', rownames(df), perl=TRUE) == ':'])
which gives,
V2
tex11 2500.706
ENSORLG00000010797: 44225.330
pabpc1a 11788.555
sept6 3100.493
ENSORLG00000000997: 5418.796
DATA
dput(df)
structure(list(V2 = c(2500.706, 44225.33, 11788.555, 3100.493,
5418.796)), .Names = "V2", row.names = c("tex11", "ENSORLG00000010797:",
"pabpc1a", "sept6", "ENSORLG00000000997:"), class = "data.frame")
NOTE You can remove the colons from rownames simply by
rownames(df) <- sub(':', '', rownames(df))
In r is there a way to add the comma separator to thousands on a Venn diagram.
venn.plot <- VennDiagram::draw.pairwise.venn(10000, 7000, 3000, c("First", "Second"), scaled = FALSE)
grid::grid.draw(venn.plot)
The resulting chart looks like the one below.
It does not look like this function was designed to do that. If you really want to use this function, you could "hack" it to replace the default formatting code it uses for labels. Note that this method is very fragile since we are editing particular "lines" of code. First make a copy of the function
myvenn <- VennDiagram::draw.pairwise.venn
Here's the default formatter
body(myvenn)[[46]]
# wrapLab <- function(num) {
# stri = ""
# if (print.mode[1] == "percent") {
# stri <- paste(signif(num * 100/denom, digits = sigdigs),
# "%", sep = "")
# if (isTRUE(print.mode[2] == "raw")) {
# stri <- paste(stri, "\n(", num, ")", sep = "")
# }
# }
# if (print.mode[1] == "raw") {
# stri <- num
# if (isTRUE(print.mode[2] == "percent")) {
# stri <- paste(stri, "\n(", paste(signif(num * 100/denom,
# digits = sigdigs), "%)", sep = ""), sep = "")
# }
# }
# return(stri)
# }
Let's replace that with a call to prettyNum to add the commas
body(myvenn)[[46]][[3]] <- quote(function(x) {
prettyNum(x ,big.mark=",",scientific=FALSE)
})
Now we can call our version of the function
venn.plot <- myvenn(10000, 7000, 3000, c("First", "Second"), scaled = FALSE)
grid::grid.draw(venn.plot)
You could also edit the items manually.
venn.plot[[5]][["label"]] <- "7,000"
venn.plot[[6]][["label"]] <- "4,000"
venn.plot[[7]][["label"]] <- "3,000"
grid::grid.draw(venn.plot)
Here is another way with a loop
venn.plot <- VennDiagram::draw.pairwise.venn(10000, 7000, 3000, c("First", "Second"), scaled = FALSE)
for(i in 1:length(venn.plot)){
if(!is.null(venn.plot[[i]][["label"]]) &&
!is.na(as.numeric(venn.plot[[i]][["label"]]))
) {
venn.plot[[i]][["label"]] <- prettyNum(venn.plot[[i]][["label"]], big.mark = ",")
}
}
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
grid::grid.draw(venn.plot)
I have list of data.frame as an output of custom function, so I intend to split each data.frame by its last column, where threshold is given. However, I manipulated the two list nicely, and combined them to get only one table. But I have an error when manipulating this new table. I can't figure out where is issue come from. How can I fix this error ? Can anyone point me out to possibly fix this error ? If this error can be fixed, I want to implement wrapper. How can I easily manipulate list of data.frame ? Any better idea to debug the error ?
mini example :
savedDF <- list(
bar = data.frame(.start=c(12,21,37), .stop=c(14,29,45), .score=c(5,9,4)),
cat = data.frame(.start=c(18,42,18,42,81), .stop=c(27,46,27,46,114), .score=c(10,5,10,5,34)),
foo = data.frame(.start=c(3,3,33,3,33,91), .stop=c(24,24,10,24,10,17), .score=c(22,22,6,22,6,7))
)
discardedDF <- list(
bar = data.frame(.start=c(16,29), .stop=c(20,37), .score=c(2,11)),
cat = data.frame(.start=c(21,31), .stop=c(23,43), .score=c(1,9)),
foo = data.frame(.start=c(54, 79), .stop=c(71,93), .score=c(3,8))
)
I can manipulate this way :
both <- do.call("rbind", c(savedDF, discardedDF))
cn <- c("letter", "seq")
# FIXME :
DF <- cbind(
read.table(text = chartr("_", ".", rownames(both)), header=T, sep = ".", col.names = cn),
both)
DF <- transform(DF, isPassed = ifelse(.score > 8, "Pass", "Fail"))
by(DF, DF[c("letter", "isPassed")],
function(x) write.csv(x[-(1:length(savedDF))],
sprintf("%s_%s_%s.csv", x$letter[1], x$isPassed[1])))
But I have an error
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 15 did not have 2 elements
Why I have this error ? Can anyone point me out how to fix this ?
my desired output is list of CSV file as follows :
bar.saved.Pass.csv
bar.saved.Fail.csv
bar.discarded.Pass.csv
bar.discarded.Fail.csv
cat.saved.Pass.csv
cat.saved.Fail.csv
cat.discarded.Pass.csv
cat.discarded.Fail.csv
foo.saved.Pass.csv
foo.saved.Fail.csv
foo.discarded.Pass.csv
foo.discarded.Fail.csv
But I think controlling exported CSV files still not desired. How can I improve functionality of this wrapper ? I intend to let use choose output directory by custom, or more dynamic would be nice. Any idea ? Thanks a lot
Is this what you are looking for?
library(tidyverse)
library(magrittr)
both <- do.call("rbind", c(savedDF, discardedDF))
both %<>% rownames_to_column(var = "cn")
both %<>% separate(cn, c("letters", "seq"), sep = "\\.")
both %<>% mutate(isPassed = ifelse(.score > 8, "Passed", "Failed"),
isDiscard = ifelse(is.na(seq), "Saved", "Discarded"))
list_of_dfs <- both %>% split(list(.$letters, .$isPassed, .$isDiscard))
csv_names <- paste0("/Users/nathanday/Desktop/", names(list_of_dfs), ".csv") # change this path
mapply(write.csv, list_of_dfs, csv_names)
The %<>% operator is short hand so both %<>% rownames_to_columm(var = "cn") is identical to both <- rownames_to_column(both, var = "cn")
To make it more "dynamic" for allowing output path input, you could wrap this in the function structure you already have like this:
output_where <- function(output_path, list1, list2) {
if (!dir.exists(output_path)) {
dir.create(file.path(output_path))
}
both <- do.call(rbind, c(list1, list2))
both %<>% rownames_to_column(var = "cn")
both %<>% separate(cn, c("letters", "seq"), sep = "\\.")
both %<>% mutate(isPassed = ifelse(.score > 8, "Passed", "Failed"), isDiscard = ifelse(is.na(seq), "Saved", "Discarded"))
list_of_dfs <- both %>% split(list(.$letters, .$isPassed, .$isDiscard))
csv_names <- paste0(output_path, names(list_of_dfs), ".csv")
return(mapply(write.csv, list_of_dfs, csv_names))
}
output_where("~/Desktop/", savedDF, discardedDF)
for even more dynamics:
output_where <- function(output_path, list1, list2) {
if (!dir.exists(output_path)) {
dir.create(file.path(output_path))
}
names(list1) <- paste("list1", names(list1), sep = ".")
names(list2) <- paste("list2", names(list2), sep = ".")
both <- do.call(rbind, c(list1, list2))
both %<>% rownames_to_column(var = "cn")
both %<>% separate(cn, c("original_list", "letters", "seq"), sep = "\\.")
both %<>% mutate(isPassed = ifelse(.score > 8, "Passed", "Failed"))
list_of_dfs <- both %>% split(list(.$letters, .$isPassed, .$original_list))
csv_names <- paste0(output_path, names(list_of_dfs), ".csv")
return(mapply(write.csv, list_of_dfs, csv_names))
}
I'm trying to analyse a large survey created with surveymonkey which has hundreds of columns in the CSV file and the output format is difficult to use as the headers run over two lines.
Has anybody found a simple way of managing the headers in the CSV file so that the analysis is manageable ?
How do other people analyse results from Surveymonkey?
Thanks!
You can export it in a convenient form that fits R from Surveymonkey, see download responses in 'Advanced Spreadsheet Format'
What I did in the end was print out the headers using libreoffice labeled as V1,V2, etc. then I just read in the file as
m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)
and then just did the analysis against m1$V10, m1$V23 etc...
To get around the mess of multiple columns I used the following little function
# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
# e.g. mcols(df, c(14:18))
exp <- paste('df[,', cols, ']', sep='', collapse=',' )
# this creates something like...
# "df[,14],df[,15],df[,16],df[,17],df[,18]"
# now we just want to do a paste of this expression...
nexp <- paste(" paste(", exp, ", sep=' ')")
# so now nexp looks something like...
# " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep='')"
# now we just need to parse this text... and eval() it...
newcol <- eval(parse(text=nexp))
newcol <- gsub(' *', ' ', newcol) # replace duplicate spaces by a single one
newcol <- gsub('^ *', '', newcol) # remove leading spaces
gsub(' *$', '', newcol) # remove trailing spaces
}
# mcols(df, c(14:18))
No doubt somebody will be able to clean this up!
To tidy up Likert-like scales I used:
# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
tidylik4 <- function(x) {
xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
y <- ifelse(x == '', NA, x)
ordered(y, levels=xlevels)
}
for (i in 44:52) {
m2[,i] <- tidylik4(m2[,i])
}
Feel free to comment as no doubt this will come up again!
I have to deal with this pretty frequently, and having the headers on two columns is a bit painful. This function fixes that issue so that you only have a 1 row header to deal with. It also joins the multipunch questions so you have top: bottom style naming.
#' #param x The path to a surveymonkey csv file
fix_names <- function(x) {
rs <- read.csv(
x,
nrows = 2,
stringsAsFactors = FALSE,
header = FALSE,
check.names = FALSE,
na.strings = "",
encoding = "UTF-8"
)
rs[rs == ""] <- NA
rs[rs == "NA"] <- "Not applicable"
rs[rs == "Response"] <- NA
rs[rs == "Open-Ended Response"] <- NA
nms <- c()
for(i in 1:ncol(rs)) {
current_top <- rs[1,i]
current_bottom <- rs[2,i]
if(i + 1 < ncol(rs)) {
coming_top <- rs[1, i+1]
coming_bottom <- rs[2, i+1]
}
if(is.na(coming_top) & !is.na(current_top) & (!is.na(current_bottom) | grepl("^Other", coming_bottom)))
pre <- current_top
if((is.na(current_top) & !is.na(current_bottom)) | (!is.na(current_top) & !is.na(current_bottom)))
nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")
if(!is.na(current_top) & is.na(current_bottom))
nms[i] <- current_top
}
nms
}
If you note, it returns the names only. I typically just read.csv with ...,skip=2, header = FALSE, save to a variable and overwrite the names of the variable. It also helps ALOT to set your na.strings and stringsAsFactor = FALSE.
nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms
As of November 2013, the webpage layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You'll get raw data as first row = question headers / each following row = 1 response, possibly split between multiple files if you have many responses / questions.
The issue with the headers is that columns with "select all that apply" will have a blank top row, and the column heading will be the row below. This is only an issue for those types of questions.
With this in mind, I wrote a loop to go through all columns and replace the column names with the value from the second row if the column name was blank- which has a character length of 1.
Then, you can kill the second row of the data and have a tidy data frame.
for(i in 1:ncol(df)){
newname <- colnames(df)[i]
if(nchar(newname) < 2){
colnames(df)[i] <- df[1,i]
}
df <- df[-1,]
Coming to the party late, but this is still an issue and the best workaround I've found is using a function to paste the column names and sub-column names together, based on repeating values.
For instance, if exporting to .csv, the repeated column names will automatically be replaced with an X in RStudio. If exporting to .xlsx, the repeated value will be ....
Here's a base R solution:
sm_header_function <- function(x, rep_val){
orig <- x
sv <- x
sv <- sv[1,]
sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
sv <- t(sv)
sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
names(sv)[1] <- "name"
names(sv)[2] <- "value"
sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
sv$new_value <- paste0(sv$new_value, " ", sv$value)
new_names <- as.character(sv$new_value)
colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
orig <- orig[-c(1),]
return(orig)
}
sm_header_function(df, "X")
sm_header_function(df, "...")
With some sample data, the change in column names would look like this:
Original export from SurveyMonkey:
> colnames(sample)
[1] "Respondent ID" "Please provide your contact information:" "...11"
[4] "...12" "...13" "...14"
[7] "...15" "...16" "...17"
[10] "...18" "...19" "I wish it would have snowed more this winter."
Cleaned export from SurveyMonkey:
> colnames(sample_clean)
[1] "Respondent ID" "Please provide your contact information: Name"
[3] "Please provide your contact information: Company" "Please provide your contact information: Address"
[5] "Please provide your contact information: Address 2" "Please provide your contact information: City/Town"
[7] "Please provide your contact information: State/Province" "Please provide your contact information: ZIP/Postal Code"
[9] "Please provide your contact information: Country" "Please provide your contact information: Email Address"
[11] "Please provide your contact information: Phone Number" "I wish it would have snowed more this winter. Response"
Sample data:
structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621,
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin",
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale",
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's",
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2",
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia",
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa",
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.",
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104",
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country",
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins#gmail.com",
"mjemison#nasa.gov", "stargazer#gmail.com", "dubois#web.com",
"firstnurse#aol.com", "galileo123#yahoo.com", "imthinking#gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646",
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944",
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response",
"Strongly disagree", "Strongly agree", "Neither agree nor disagree",
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
How about the following: use read.csv() with header=FALSE. Make two arrays, one with the two lines of headings and one with the answers to the survey. Then paste() the two rows/sentences of together. Finally, use colnames().