I have SPSS data that I need to migrate to R. The data set is large, with 202 columns and thousands of rows:
v1 v2 v3 v4 v5
1 USA Male 21 Married
2 INDIA Female 54 Single
3 CHILE Male 33 Divorced ...and so on...
The data file contains the variable labels "Identification No", "Country of origin", "Gender", "(Current) Year", "Marital Status - Candidate".
I read my data from SPSS with the following command:
library(foreign)  # provides read.spss
data <- read.spss("file.sav", to.data.frame=TRUE, reencode='utf-8')
The column names are read as v1, v2, v3, v4, etc., but I want the variable labels as the column names of my data frame. I used the following commands to find the variable labels and set them as names:
vname<-attr(data,"variable.labels")
vl <- character(length(vname))  # vl must exist before it is filled
for(i in seq_along(vname)){vl[i]<-vname[[i]]}
names(data)<-vl
Now the problem is that I have to address a column like data$"Identification No", which is not very nice. I want to remove the quotation marks around the column names. How can I do that?
You can't. An unquoted space is a syntactic symbol that breaks the grammar up.
One option is to change the names to ones without spaces; the make.names function does that.
> N = c("foo","bar baz","bar baz")
> make.names(N)
[1] "foo" "bar.baz" "bar.baz"
You might want to make sure you have unique names:
> make.names(N, unique=TRUE)
[1] "foo" "bar.baz" "bar.baz.1"
The quotation marks were there because the names had spaces in them. print(vl, quote=FALSE) displayed the text without quotation marks, but I still had to use quotation marks to refer to each name as a single variable name. Without them, the spaces would break up the variable names.
This can be solved by removing the spaces from the names. I did that with the gsub command:
vl<-gsub(" ","",vl)
names(data)<-vl
Now most of the column names can be accessed without quotation marks, but names containing other punctuation marks still need them.
Also, the solution by Spacedman worked fine and seems easier to use.
make.names(vl, unique=TRUE)
But I liked the solution by David Arenburg.
gsub("[ [:punct:]]", "" , vl)
It removed all spaces and punctuation marks, which left the column names clean.
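To compare the two renaming approaches side by side, here is a minimal sketch. It assumes a small stand-in data frame that carries a "variable.labels" attribute like the one read.spss attaches; the data values and the names syntactic and stripped are made up for illustration.

```r
# Stand-in for the imported SPSS data (values are illustrative only).
data <- data.frame(v1 = 1:3,
                   v2 = c("USA", "INDIA", "CHILE"),
                   v3 = c("Male", "Female", "Male"),
                   v4 = c(21, 54, 33),
                   v5 = c("Married", "Single", "Divorced"),
                   stringsAsFactors = FALSE)
attr(data, "variable.labels") <- c(v1 = "Identification No",
                                   v2 = "Country of origin",
                                   v3 = "Gender",
                                   v4 = "(Current) Year",
                                   v5 = "Marital Status - Candidate")

# Pull the labels out as a plain character vector (no loop needed).
vl <- unname(attr(data, "variable.labels"))

# Option 1: let make.names turn every label into a syntactic name.
syntactic <- make.names(vl, unique = TRUE)

# Option 2: strip spaces and punctuation entirely.
stripped <- gsub("[ [:punct:]]", "", vl)

# Either vector can be assigned; option 2 is used here.
names(data) <- stripped
data$CurrentYear  # no quotation marks needed any more
```

Stripping characters can make two different labels collapse into the same name, so it may be worth checking anyDuplicated(stripped) before assigning.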
Spaces are okay in data.table column names without much fuss. But, no, there's no way to avoid using quotation marks for the reason Spacedman gave: spaces break up the syntax.
require(data.table)
DT <- data.table(a = c(1,1), "bc D" = c(2,3))
# three identical results:
DT[['bc D']]
DT$bc
DT[,`bc D`]
Okay, so partial matching with $ (which also works with data.frames) gets you out of using quotes. But it will bring trouble if you get it wrong.
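As a rough illustration of how partial matching can go wrong, the sketch below uses a plain data.frame (so no package is needed). The second column, "bc E", is a hypothetical addition that makes the prefix ambiguous.

```r
# A data frame with a non-syntactic column name.
df <- data.frame(a = c(1, 1), "bc D" = c(2, 3), check.names = FALSE)

exact   <- df[["bc D"]]  # exact match with a quoted string
partial <- df$bc         # $ falls back to partial matching on the prefix "bc"
none    <- df[["bc"]]    # [[ matches exactly by default, so this is NULL

# Add a second column sharing the same prefix (hypothetical example).
df[["bc E"]] <- c(9, 9)

# The prefix "bc" is now ambiguous, so $ silently returns NULL.
ambiguous <- df$bc
```

Because the ambiguous case returns NULL rather than raising an error, code that relies on the shortcut can break quietly when a new column is added.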
Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyways - I'm looking to do quite a bit of manipulation on this data set, however the variable names are extraordinarily messy: ie. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" - you get my gist. Each column head has an equally messy name as this example and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name manipulation solution via R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical and plus the special characters seem to break R's functions anyways.
Does anyone know how to do a global replace of all names, or a global rename by column number rather than by name?
I've tried
replace()
for loops
lapply()
The first gsub removes non-printing characters. trimws then trims whitespace off the ends, and the second gsub replaces each run of a repeated character with a single instance of it. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
## X Y Z
## 1 0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
## X.Y.Z
## 1 0
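The same idea can be wrapped in a small helper and applied to every column at once. The sketch below is a variation, not the code above: clean_names is a hypothetical helper, and it collapses only runs of whitespace (to an underscore) instead of every repeated character, since the "(.)\\1+" pattern would also shorten genuine names such as "Address". The column names and values are made up.

```r
# Hypothetical helper: tidy a character vector of column names.
clean_names <- function(x) {
  x <- gsub("[^[:print:]]", "", x)  # drop \r, \n and other non-printing characters
  x <- trimws(x)                    # trim leading and trailing whitespace
  gsub("\\s+", "_", x)              # collapse each run of whitespace into one underscore
}

# Two messy column names, built with setNames to keep the example readable.
d <- setNames(data.frame(0, 1),
              c("\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ", "  Total  Count "))

# Global rename: all names are cleaned in one assignment.
names(d) <- clean_names(names(d))

# Rename by column number rather than by name.
names(d)[2] <- "n"
names(d)
```

Assigning to names(d) as a whole, or to names(d)[i], avoids ever having to spell out the original messy name.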
df <- read.csv(
text = '"2019-Jan","2019-Feb",
"3","1"',
check.names = FALSE
)
OK, so I use check.names = FALSE and now my column names are not syntactically valid. What are the practical consequences?
df
#> 2019-Jan 2019-Feb
#> 1 3 1 NA
And why is this NA appearing in my data frame? I didn't put that in my code. Or did I?
Here's the check.names man page for reference:
check.names logical. If TRUE then the names of the variables in the
data frame are checked to ensure that they are syntactically valid
variable names. If necessary they are adjusted (by make.names) so that
they are, and also to ensure that there are no duplicates.
The only consequence is that you need to escape or quote the names to work with them. You either string-quote and use standard evaluation with the [[ column subsetting operator:
df[['2019-Jan']]
… or you escape the identifier name with backticks (R confusingly also calls this quoting), and use the $ subsetting:
df$`2019-Jan`
Both work, and can be used freely (as long as they don’t lead to exceedingly unreadable code).
To make matters more confusing, R allows using '…' and "…" instead of `…` in certain contexts:
df$'2019-Jan'
Here, '2019-Jan' is not a character string as far as R is concerned! It’s an escaped identifier name. [1]
This last one is a really bad idea because it confuses names [2] with character strings, which are fundamentally different. The R documentation advises against this. Personally I’d go further: writing 'foo' instead of `foo` to refer to a name should become a syntax error in future versions of R.
[1] Kind of. The R parser treats it as a character string. In particular, both ' and " can be used, and are treated identically. But during the subsequent evaluation of the expression, it is treated as a name.
[2] “Names”, or “symbols”, in R refer to identifiers in code that denote a variable or function parameter. As such, a name is either (a) a function name, (b) a non-function variable name, (c) a parameter name in a function declaration, or (d) an argument name in a function call.
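A short sketch of the access styles discussed above, using the same two column names without the trailing comma. The variable names by_string, by_backtick, total and checked are illustrative.

```r
csv_text <- '"2019-Jan","2019-Feb"\n"3","1"'

# Keep the original, non-syntactic column names.
df <- read.csv(text = csv_text, check.names = FALSE)

by_string   <- df[["2019-Jan"]]   # string-quoted name with [[
by_backtick <- df$`2019-Jan`      # backtick-escaped name with $

# Backticks are also needed wherever the name is evaluated as code.
total <- with(df, `2019-Jan` + `2019-Feb`)

# For comparison: the default check.names = TRUE rewrites the names.
checked <- read.csv(text = csv_text, check.names = TRUE)
names(checked)
# [1] "X2019.Jan" "X2019.Feb"
```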
The NA issue is unrelated to the names. read.csv expects input with no comma after the last column. You have a comma after the last column of the header, so read.csv reads the empty field after "2019-Feb" as the name of a third column. There is no data for this column, so an NA value is assigned.
Remove the extra comma and it reads properly. Of course, it may be easier to just remove the last column after using read.csv.
df <- read.csv(
text = '"2019-Jan","2019-Feb"
"3","1"',
check.names = FALSE
)
df
# 2019-Jan 2019-Feb
# 1 3 1
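If editing the input is not practical, the empty column can be dropped after reading, as suggested above. This is a minimal sketch; raw, keep and cleaned are illustrative names, and it assumes that any column consisting only of NA is unwanted.

```r
# Same input as in the question, including the trailing comma in the header.
raw <- read.csv(
  text = '"2019-Jan","2019-Feb",\n"3","1"',
  check.names = FALSE
)

# Flag the columns that contain at least one non-missing value.
keep <- !vapply(raw, function(col) all(is.na(col)), logical(1))

# Keep only those columns; the unnamed all-NA column is dropped.
cleaned <- raw[keep]
names(cleaned)
```

Selecting with a logical vector avoids having to refer to the empty column name at all.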
Consider df$foo where foo is a column name. Syntactically invalid names will not work.
As for the NA it’s a consequence of there being three columns in your first line and only two in your second.
There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) are always of this format: open parentheses, word beginning with capital letter, possible spaces and other words, v., another word beginning with a capital letter, and possibly more spaces and words, close parentheses.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!
I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
"(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
"I am trying to confuse you (this is not a court case)",
"this one is also confusing (But with Capital Letters)",
"this is confusing (With Capitols and v. d)"),
stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
df$text, perl=T))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
Change your regex as follows:
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]
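The two suggestions can be combined: require a capital letter on each side of " v. " and use [^()]* so that a match cannot run across a closing parenthesis into the next parenthetical. The sketch below is one possible combination under those assumptions; the example texts and the names pattern and numCases are illustrative.

```r
df <- data.frame(text = c(
  "(Roe v. Wade) is not about boats",
  "(Dred Scott v. Sandford) and (Plessy v. Ferguson) are both cases",
  "(Not a case) but (Roe v. Wade) is one",
  "this is confusing (With Capitals and v. d)",
  "no parentheses here at all"),
  stringsAsFactors = FALSE)

# "(" lookbehind, capitalised word(s), " v. ", capitalised word(s), ")" lookahead.
# [^()]* keeps each match inside a single pair of parentheses.
pattern <- "(?<=\\()[[:upper:]][^()]* v\\. [[:upper:]][^()]*(?=\\))"

cases <- regmatches(df$text, gregexpr(pattern, df$text, perl = TRUE))

# One count per row; rows without a match get 0.
df$numCases <- lengths(cases)
df$numCases
# [1] 1 2 1 0 0
```

In the third row a lazy ".*?" version would also count one match, but the matched text would start at "Not a case" and span both parentheticals, which is why the character class is used here.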
I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R, such that for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D, which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically named. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I ran the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
The first line of your code is splitting the variable names by . and extracting the first piece. It sounds like you instead want to split by . and extract all the pieces except for the last one:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE),
function(x) paste(x[1:(length(x)-1)], collapse=""))))
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=F,
text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
sep=",")
loci <- gsub("\\.\\d$",replace="",unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
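Putting the suffix-stripping idea together with the original calculation might look like the sketch below. Freqs here is a made-up one-row data frame, since the real structure is not shown, and base_names is an illustrative name. The columns are selected by exact comparison of the stripped names rather than by grepl, because a pattern such as "Penta" matches every column that contains it.

```r
# Hypothetical allele frequencies; the real Freqs is not shown in the question.
Freqs <- data.frame(CSF1PO  = 0.5, CSF1PO.1  = 0.5,
                    Penta.D = 0.2, Penta.D.1 = 0.8,
                    Penta.E = 1.0, Penta.E.1 = 0.0)

# Strip a trailing ".<digits>" suffix, leaving "Penta.D" intact.
base_names <- sub("\\.\\d+$", "", names(Freqs))
loci <- unique(base_names)

# For each locus, use exactly the columns whose stripped name equals it.
Expected <- sapply(loci, function(locus) {
  1 - sum(unlist(Freqs[base_names == locus])^2)
})
Expected
```

This assumes no locus name legitimately ends in a dot followed by digits; such a name would lose its ending as well.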
The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column renaming by passing the check.names=FALSE argument to read.table. If you then refer explicitly to such columns, you'll need to back-quote their names:
df$`Penta C`
I have a column of gene symbols that I have retrieved directly from a database, and some of the rows contain two or more symbols which are comma separated (see example below).
SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26
I would like to remove the commas, and place the separated symbols into new rows like so:
SLC6A13
ATP5J2-PTCD1
BUD31
PTCD1
ACOT7
BUD31
PDAP1
TTC26
I haven't been able to find a straightforward way to do this in R. Does anyone have any suggestions?
scan can split on the commas and return a vector, which you can then put into a matrix or a data.frame:
vec <- scan(text="SLC6A13
ATP5J2-PTCD1,BUD31,PTCD1
ACOT7
BUD31,PDAP1
TTC26", what=character(), sep=",")
Read 8 items
vec
[1] "SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7" "BUD31" "PDAP1"
[8] "TTC26"
Perhaps:
as.matrix(vec)
(The scan function can also read from files. The "text" parameter was only added relatively recently, but it saves typing file=textConnection("...").)
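If the symbols already sit in a data frame column, a base R sketch along the following lines may be more convenient. The data frame genes, its column symbol, and the extra source_row column are assumptions for illustration.

```r
# Hypothetical data frame holding the comma-separated symbols.
genes <- data.frame(symbol = c("SLC6A13",
                               "ATP5J2-PTCD1,BUD31,PTCD1",
                               "ACOT7",
                               "BUD31,PDAP1",
                               "TTC26"),
                    stringsAsFactors = FALSE)

# Split every entry on commas; the result is a list with one element per row.
parts <- strsplit(genes$symbol, ",", fixed = TRUE)

# One output row per symbol, remembering which input row it came from.
long <- data.frame(source_row = rep(seq_along(parts), lengths(parts)),
                   symbol     = unlist(parts),
                   stringsAsFactors = FALSE)
nrow(long)
# [1] 8
```

Keeping source_row makes it possible to join the separated symbols back to any other columns of the original data frame.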
Another option is to use readLines and strsplit (here txt is assumed to hold the input text as a single string):
unlist(strsplit(readLines(textConnection(txt)),','))
"SLC6A13" "ATP5J2-PTCD1" "BUD31" "PTCD1" "ACOT7"
"BUD31" "PDAP1" "TTC26"