Converting a chr into num in R

I have some data that is currently in character form, and I need to put it into numeric form so that I can get the mean. I'm new to R so any help will be much appreciated. My initial thought was that the missing data is causing it to not be read as num, but could it be because the numbers are "3" instead of 3?
Here's what I have:
X
chr [1:1964] "3", "4", "4", "1", NA
I've tried different methods of converting X from chr to num:
X <- na.omit(Y, Z, as.numeric)
mean(X)
# [1] NA
# Warning message:
# In mean.default(X) :
# argument is not numeric or logical: returning NA
X <- c(Y, Z, na.rm=TRUE)
mean(X, na.rm=TRUE)
# [1] NA
# Warning message:
# In mean.default(X, na.rm = TRUE) :
# argument is not numeric or logical: returning NA
X <- c(Y, Z, na.rm=TRUE)
str(X)
# Named chr [1:1965] "3" "4" "4" "1" "5" "7" NA "6" NA "5" ...
# - attr(*, "names")= chr [1:1965] "" "" "" "" ...

As always, an example of your actual data is helpful. I think I can answer anyway, though. If your data are character data, then converting to numeric like this will work most of the time:
X2 <- as.numeric(X)
If you have missing values, are they showing up as NA? Or did you write something else there to indicate missingness such as "missing"? If you've got something other than NA in your original data, then when you do the as.numeric(X) conversion, R will convert those values to NA and give you a warning message.
To take the mean of a numeric object that has missing values, use:
mean(X2, na.rm=TRUE)
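For instance, here is a minimal sketch of the whole workflow, using a made-up vector in place of the question's X and assuming missingness is coded as the string "missing":
X <- c("3", "4", "4", "1", "missing", NA)
X2 <- as.numeric(X)   # "missing" becomes NA, with a coercion warning
# Warning message:
# NAs introduced by coercion
mean(X2, na.rm = TRUE)
# [1] 3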

This should work:
mean(as.numeric(X), na.rm=TRUE)
Calling as.numeric() will introduce an NA for values like "X", and many of the summary functions have an na.rm parameter to ignore NA values in the vector.
But of course taking the mean of a list of chromosomes is a pretty weird operation.

Related

Set values NA from first occurrence of Pattern to end

Is there a faster/shorter way to set values after and including a match to NA?
vec <- 1:10; vec[c(3,5,7)] <- c(NA, NaN, "remove")
#"1" "2" NA "4" "NaN" "6" "remove" "8" "9" "10"
Desired Outcome:
#"1" "2" NA "4" "NaN" "6" NA NA NA NA
My code:
vec[{grep("^remove$",vec)[1]}:length(vec)]<-NA
Please note:
In this case, we assume there will always be a "remove" element present, so the solution does not have to take care of the case where there isn't one.
You can use match to stop searching after the first match is found:
m = match("remove", vec) - 1L
if (is.na(m)){
vec
} else {
c(head(vec, m), rep(vec[NA_integer_], length(vec)-m))
}
You'd have to have a pretty large vector to notice a speed difference, though, I guess. Alternatively, this might prove faster:
m = match("remove", vec)
if (!is.na(m)){
vec[m:length(vec)] <- NA
}
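Applied to the example vec from the question, this gives the desired outcome (a quick check I added, not part of the original answer):
vec <- 1:10; vec[c(3,5,7)] <- c(NA, NaN, "remove")
m <- match("remove", vec)               # 7
if (!is.na(m)) vec[m:length(vec)] <- NA
vec
# [1] "1"   "2"   NA    "4"   "NaN" "6"   NA    NA    NA    NA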
Not sure if this is shorter or faster, but here is one alternative:
vec[which.max(vec == "remove"):length(vec)] <- NA
vec
#[1] "1" "2" NA "4" "NaN" "6" NA NA NA NA
Here, we find the first occurrence of "remove" using which.max and then assign NA from that position to the end of the vector.
The OP has mentioned that there is always a "remove" element present, so we need not take care of the other case. However, in case we still want to keep a check, we can add an additional condition:
inds <- vec == "remove"
if (any(inds)) {
vec[which.max(inds) : length(vec)] <- NA
}
We can use cumsum on a logical vector
vec[cumsum(vec %in% "remove") > 0] <- NA
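To see why this works, here are the intermediate values for the example vec as originally defined (an added illustration):
vec %in% "remove"
# [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
cumsum(vec %in% "remove")
# [1] 0 0 0 0 0 0 1 1 1 1
# the cumulative sum is greater than 0 from the first "remove" onwards, so those
# positions (including the match itself) get set to NA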
We can also just extend the vec to the desired length:
`length<-`(vec[1:(which(vec=="remove")-1)],length(vec))
[1] "1" "2" NA "4" "NaN" "6" NA NA NA NA

A way to grep using regular expressions to receive a dataframe or a list in R

I have a column in a dataframe that looks like this:
peptide <- c("aaa(0.011)bbb(0.989)ccc","aaa(1)bbbccc","aaabbb(0.15)ccc(0.85)ddd",
"aaabbb(0.75)cc(0.24)ddd(0.01)")
I would like to extract the text flanking each of the brackets. Sometimes there are up to 7 sets of brackets in each string (in my example there is a maximum of 3). While extracting the text, I would like to get rid of the brackets and numbers all together, and just keep the letters. Let’s say I want to extract up to five letters on each side of each bracket pair. If I achieved that, my output would look like this:
col1       col2     col3
aaabbbcc   aabbbccc NA
aaabbbcc   NA       NA
aabbbcccdd bbcccddd NA
aabbbccddd bbbccddd ccddd
Where each row corresponds to strings extracted from one peptide.
I am quite new to R, and completely new to grep/sub, and am unable to find a way to grep into a data-frame.
The closest thing I came up with is this:
before<- sub(".*([[:print:]][[:print:]][[:print:]][[:print:]][[:print:]])\\(.*","\\1", peptide)
after<- sub(".*\\)([[:print:]][[:print:]][[:print:]][[:print:]][[:print:]]).*","\\1", peptide)
final <- paste(before,after,sep="")
This does not return what I want.
> final
[1] "1)bbbbbb(0" "aaa(1)bbbcccbbbcc" "5)cccccc(0" "75)cccc(0."
First, it just returns one string per peptide, while I would want it to return as many strings as there are pairs of brackets. Second, I know that my regular expressions are not correct - I do not omit numbers and brackets, and I would like to.
EDIT: I edited the output, because there was a typo in it, and I removed a mention to another question that I have not had time to ask before receiving answers here!
Any suggestions welcome!
First define sep to be any character that does not appear in peptide. We used a space below.
Then create two variables in which the numeric fields have been removed and the parentheses around them have been removed too. p0 is precisely that while ps is like p0 but the last character of each of the non-numeric fields is replaced with sep (so that we can later locate it).
Using the above variables compute pos which is a numeric matrix whose ith column contains the character positions of the end of the ith fields in p0. To do this we use gregexpr to find the locations of sep in ps and then manipulate that into a numeric matrix pos.
Then for each element of pos determine the character positions of the corresponding output string's start and end and use substring to extract those substrings from p0 reshaping to the same dimensions as pos.
sep <- " "                         # any character that does not appear in peptide
pat <- "(.)\\(.*?\\)"              # one character followed by a parenthesised number
ps  <- gsub(pat, sep, peptide)     # that character and the number collapsed to sep
p0  <- gsub(pat, "\\1", peptide)   # the number dropped, the character kept
g   <- gregexpr(sep, ps, fixed = TRUE)  # sep positions = ends of the fields in p0
# cbind-ing ts objects pads the shorter position vectors with NA; transpose so
# each row of pos corresponds to one peptide
pos <- t(unname(do.call("cbind", lapply(g, ts))))
replace(pos, TRUE, substring(p0, pos - 5 + 1, pos + 5))  # 5 characters on each side
giving:
[,1] [,2] [,3]
[1,] "aaabbbcc" "aabbbccc" NA
[2,] "aaabbbcc" NA NA
[3,] "aabbbcccdd" "bbcccddd" NA
[4,] "aabbbccddd" "bbbccddd" "ccddd"
My first thought is to use strsplit using the numbers/parens as separators:
str(
strsplit(peptide, '[().[:digit:]]+')
)
# List of 4
# $ : chr [1:3] "aaa" "bbb" "ccc"
# $ : chr [1:2] "aaa" "bbbccc"
# $ : chr [1:3] "aaabbb" "ccc" "ddd"
# $ : chr [1:3] "aaabbb" "cc" "ddd"
This looks good so far, so we can now iterate over each break and grab the before/after concatenations. (Ignore for now the removeqmark= option, I'll justify it in a moment.)
surrounding <- function(vec, k=5, removeqmark=TRUE) {
  l <- length(vec)
  out <- sapply(seq_len(l-1), function(i) {
    bef <- paste(vec[1:i], collapse="")
    aft <- paste(vec[(i+1):l], collapse="")
    paste0(substr(bef, max(1, nchar(bef)-k+1), nchar(bef)),
           substr(aft, 1, min(k, nchar(aft))))
  })
  if (removeqmark) out <- gsub("\\?", "", out)
  out
}
Now we can iterate over the split-string vectors using this function:
str(
lapply(strsplit(peptide, '[().[:digit:]]+'), surrounding)
)
# List of 4
# $ : chr [1:2] "aaabbbcc" "aabbbccc"
# $ : chr "aaabbbcc"
# $ : chr [1:2] "aabbbcccdd" "bbcccddd"
# $ : chr [1:2] "aabbbccddd" "bbbccddd"
Unfortunately, it's dropping the third element of the last vector. This is not surprising, since strsplit does not return a trailing empty string when the string ends on a separator. So we can add something to each string if and only if it ends on a separator:
( peptide2 <- gsub("([().[:digit:]])$", "\\1?", peptide) )
# [1] "aaa(0.011)bbb(0.989)ccc" "aaa(1)bbbccc" "aaabbb(0.15)ccc(0.85)ddd"
# [4] "aaabbb(0.75)cc(0.24)ddd(0.01)?"
str(
strsplit(peptide2, '[().[:digit:]]+')
)
# List of 4
# $ : chr [1:3] "aaa" "bbb" "ccc"
# $ : chr [1:2] "aaa" "bbbccc"
# $ : chr [1:3] "aaabbb" "ccc" "ddd"
# $ : chr [1:4] "aaabbb" "cc" "ddd" "?"
str(
lapply(strsplit(peptide2, '[().[:digit:]]+'), surrounding)
)
# List of 4
# $ : chr [1:2] "aaabbbcc" "aabbbccc"
# $ : chr "aaabbbcc"
# $ : chr [1:2] "aabbbcccdd" "bbcccddd"
# $ : chr [1:3] "aabbbccddd" "bbbccddd" "ccddd"
where our default is to remove the question mark from the resulting surrounds. To use a different surrounding number than 5, just do:
lapply(strsplit(peptide2, '[().[:digit:]]+'), surrounding, k=2)
In order to combine this into a data.frame, you need some more work, since you have rows of different lengths.
rows <- lapply(strsplit(peptide2, '[().[:digit:]]+'), surrounding)
( maxrows <- max(lengths(rows)) )
# [1] 3
rows <- lapply(rows, function(r) c(r, rep(NA_character_, maxrows - length(r))))
do.call(rbind, rows)
# [,1] [,2] [,3]
# [1,] "aaabbbcc" "aabbbccc" NA
# [2,] "aaabbbcc" NA NA
# [3,] "aabbbcccdd" "bbcccddd" NA
# [4,] "aabbbccddd" "bbbccddd" "ccddd"
(This is generating a matrix ... sandwich in as.data.frame if you need a frame.)
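For instance, a small sketch of that last step (the column names are my own choice, not from the question):
out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- paste0("col", seq_len(ncol(out)))
out   # a 4 x 3 character data.frame matching the desired output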
You can use a function that will create a left- and right-side for each set of brackets (so you will get n - 1 strings for n brackets) and collapse everything to the left and right with a comma. Then just sub out at most 5 characters from each side of the comma.
peptide <- c("aaa(0.011)bbb(0.989)ccc","aaa(1)bbbccc","aaabbb(0.15)ccc(0.85)ddd",
"aaabbb(0.75)cc(0.24)ddd(0.01)")
f <- function(x) {
  l <- lapply(seq_along(x), function(ii) {
    # collapse the pieces back together, inserting a comma after the ii-th piece
    x <- rbind(trimws(x), replace(gsub('.', '', x), ii, ','))
    paste(x, collapse = '')
  })
  # keep at most 5 letters on each side of the comma and drop everything else
  sapply(l[-length(l)], function(x)
    gsub('([a-z]{1,5}),([a-z]{1,5})?|.', '\\1\\2', x))
}
sp <- strsplit(gsub('\\([0-9.]+\\)', ', ', peptide), ',')
## for example
f(sp[[4L]])
# [1] "aabbbccddd" "bbbccddd" "ccddd"
## apply to everything and return as a data frame
l <- lapply(sp, f)
l <- lapply(l, function(x) {
  ml <- max(lengths(l))
  setNames(`length<-`(x, ml), paste0('col', seq.int(ml)))
})
data.frame(do.call('rbind', l))
# col1 col2 col3
# 1 aaabbbcc aabbbccc <NA>
# 2 aaabbbcc <NA> <NA>
# 3 aabbbcccdd bbcccddd <NA>
# 4 aabbbccddd bbbccddd ccddd

Coerce variables in data frame to appropriate format

I'm working with a data frame which consists of multiple different data types (numerics, characters, timestamps), but unfortunately all of them are received as characters. Hence I need to coerce them into their "appropriate" format dynamically and as efficiently as possible.
Consider the following example:
df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
I obviously want val1 to be numeric and val2 to remain as a character. Therefore, my result should look like this:
'data.frame': 4 obs. of 2 variables:
$ val1: num 1 2 3 4
$ val2: chr "A" "B" "C" "D"
Right now I'm accomplishing this by checking if the coercion would result in NA and then proceeding with the coercion if this isn't the case:
res <- as.data.frame(lapply(df, function(x){
  x <- sapply(x, function(y) {
    if (is.na(as.numeric(y))) {
      return(y)
    } else {
      y <- as.numeric(y)
      return(y)
    }
  })
  return(x)
}), stringsAsFactors = FALSE)
However, this doesn't strike me as the correct solution because of multiple issues:
I suspect that there is a faster way of accomplishing this
For some reason I receive the warning In FUN(X[[i]], ...) : NAs introduced by coercion, although this isn't the case (see result)
This seems inappropriate when handling other data types, i.e. dates
Is there a general, heuristic approach to this, or another, more sustainable solution? Thanks
Recent file readers like data.table::fread or the readr package do a pretty decent job of identifying and converting columns to the appropriate type.
So my first reaction was to suggest writing the data to file and reading it in again, e.g.,
library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
- attr(*, ".internal.selfref")=<externalptr>
or without actually writing to disk:
df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
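As an aside, and assuming a reasonably recent version of data.table (the text argument was added around 1.11.0, if I remember correctly), the paste/collapse step can be skipped:
df_new <- fread(text = capture.output(fwrite(df, "")))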
However, d.b's suggestions are much smarter but need some polishing to avoid coercion to factor:
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of 2 variables:
$ val1: int 1 2 3 4
$ val2: chr "A" "B" "C" "D"
or
df[] <- lapply(df, readr::parse_guess)
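Since the question also mentions dates: as far as I know, parse_guess will also pick up ISO-formatted date columns. A hedged sketch, not from the original answer (df2 and its columns are made up):
df2 <- data.frame(val1 = c("1", "2"),
                  val3 = c("2018-01-01", "2018-02-03"),
                  stringsAsFactors = FALSE)
df2[] <- lapply(df2, readr::parse_guess)
sapply(df2, class)   # expected: val1 "numeric", val3 "Date"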
You should check out the dataPreparation package. You will find the findAndTransformNumerics function, which will do exactly what you want.
require(dataPreparation)
data("messy_adult")
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"character" "character" "factor"
messy_adult is an ugly data set to illustrate functions from this package. Here num1 and num2 are strings :/
messy_adult <- findAndTransformNumerics(messy_adult)
[1] "findAndTransformNumerics: It took me 0.18s to identify 3 numerics column(s), i will set them as numerics"
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum1"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum2"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I am doing the columnnum3"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "findAndTransformNumerics: It took me 0.09s to transform 3 column(s) to a numeric format."
Here we performed the search and it logged what it found.
And now:
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"numeric" "numeric" "factor"
Hope it helps!
Disclaimer: I'm the author of this package.

What is attr(*, "value.labels") when reading SPSS into R?

I have an SPSS file, but not SPSS. So I want to open it in R.
If I open it using:
library(foreign)
dat <- read.spss("file.sav", to.data.frame=TRUE)
I get the warning
re-encoding from CP1252
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
If I understand correctly, the encoding notification is not a problem (I'm in a UTF-8 locale), but what does the warning about levels mean?
If I open the file using:
dat <- read.spss("file.sav", to.data.frame=TRUE, use.value.labels = FALSE)
the warning disappears, but I'm not sure if what I do is correct.
Also, calling str(dat) gives me output like:
pt_art : atomic 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "value.labels")= Named chr "2" "1"
.. ..- attr(*, "names")= chr "IPT" "VT"
What does attr(*, "value.labels") mean? I know that "pt_art" means "type of psychotherapy", that "IPT" and "VT" are the two therapy types, and that "2" and "1" are the numeric codes representing those types, so what we have here corresponds to levels and labels in R. But how do I correctly transfer that into R?
The warning occurs when you try to define a factor with a labels argument that contains duplicate values.
(x <- sample(letters[1:4], 10, replace = TRUE))
## [1] "b" "c" "d" "d" "b" "c" "d" "c" "c" "c"
factor(x, levels = x)
## [1] b c d d b c d c c c
## Levels: b c d d b c d c c c
## Warning message:
## In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
## duplicated levels will not be allowed in factors anymore
SPSS usually uses value labels to denote categorical variables (which should become factors in R). However, note this section from the ?read.spss help page:
Occasionally in SPSS, value labels will be added to some values of
a continuous variable (e.g. to distinguish different types of
missing data), and you will not want these variables converted to
factors. By setting 'max.value.labels' you can specify that
variables with a large number of distinct values are not converted
to factors even if they have value labels. In addition, variables
will not be converted to factors if there are non-missing values
that have no value label. The value labels are then returned in
the '"value.labels"' attribute of the variable.

R: numeric vector becoming non-numeric after cbind of dates

I have a numeric vector (future_prices) in my case. I use a date vector from another vector (here: pred_commodity_prices$futuredays) to create numbers for the months. After that I use cbind to bind the months to the numeric vector. However, what happened is that the numeric vector became non-numeric. Do you know what the reason for this is? When I use as.numeric(future_prices) I get strange values. What could be an alternative? Thanks
head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a
1 68.33907 62.37888
2 68.08553 62.32658
is.numeric(future_prices)
[1] TRUE
> month = format(as.POSIXlt.date(pred_commodity_prices$futuredays), "%m")
> future_prices <- cbind (future_prices, month)
> head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a month
1 "68.3390747063745" "62.3788824938719" "01"
is.numeric(future_prices)
[1] FALSE
The reason is that cbind returns a matrix, and a matrix can only hold one data type. You could use a data.frame instead:
n <- 1:10
b <- LETTERS[1:10]
m <- cbind(n,b)
str(m)
chr [1:10, 1:2] "1" "2" "3" "4" "5" "6" "7" "8" "9" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "n" "b"
d <- data.frame(n,b)
str(d)
'data.frame': 10 obs. of 2 variables:
$ n: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
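Applied to the objects from the question (a sketch, assuming future_prices is the numeric matrix shown above and month the character vector of month numbers):
future_prices <- data.frame(future_prices, month, stringsAsFactors = FALSE)
str(future_prices)
# the two pred_peak_* columns stay numeric; month stays character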
See ?format. The format function returns:
An object of similar structure to ‘x’ containing character
representations of the elements of the first argument ‘x’ in a
common format, and in the current locale's encoding.
From ?cbind, cbind returns
... a matrix combining the ‘...’ arguments
column-wise or row-wise. (Exception: if there are no inputs or
all the inputs are ‘NULL’, the value is ‘NULL’.)
and all elements of a matrix must be of the same class, so everything is coerced to character.
F.Y.I.
When one column is a factor, simply/directly using as.numeric will change the values in that column (you get the underlying level codes rather than the original numbers). The proper way is:
data.frame[,2] <- as.numeric(as.character(data.frame[,2]))
Find more details: Converting values to numeric, stack overflow
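A small illustration of why the as.character step matters (my own example, not from the linked question):
f <- factor(c("68.3", "62.4"))
as.numeric(f)                # [1] 2 1       (the underlying level codes)
as.numeric(as.character(f))  # [1] 68.3 62.4 (the original values)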
