Problem with replacing a comma with a period - r

I replace the comma with a period in the data.frame column
data[,22] <- as.numeric(sub(",", ".", sub(".", "", data[,22], fixed=TRUE), fixed=TRUE))
But I also have values that already use a period: 110.00, 120.00, 130.00...
After replacing, I get these values: 11000.0, 12000.0, 13000.0
And I would like to get: 110.0, 120.0, 130.0...
My column 22 data.frame:
| n |
|--------|
| 92,5 |
| 94,5 |
| 96,5 |
| 110.00|
| 120.00|
| 130.00|
What I want to get:
| n |
|--------|
| 92.5 |
| 94.5 |
| 96.5 |
| 110.0|
| 120.0|
| 130.0|
or
| n |
|--------|
| 92.5 |
| 94.5 |
| 96.5 |
| 110.00|
| 120.00|
| 130.00|

Don't remove the periods, since those values are already in the format that you want. Replace only the commas with periods, then convert the column to numeric.
data[[22]] <- as.numeric(sub(',', '.', data[[22]], fixed = TRUE))
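To see why the original attempt fails, here is a minimal sketch on the sample values from the question. The vector `n` is a stand-in for column 22 and is an assumption, not the asker's actual data.

```r
# Stand-in for data[[22]]: mixed decimal separators stored as text
n <- c("92,5", "94,5", "96,5", "110.00", "120.00", "130.00")

# Original approach: the inner sub() deletes the first period,
# so "110.00" becomes "11000" before the comma is ever touched
broken <- as.numeric(sub(",", ".", sub(".", "", n, fixed = TRUE), fixed = TRUE))

# Suggested approach: only the comma is replaced
fixed_n <- as.numeric(sub(",", ".", n, fixed = TRUE))

print(broken)
print(fixed_n)
```

Whether 110 prints as 110.0 or 110.00 is a display matter only; the stored number is the same.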

Using str_replace from stringr (the pattern comes second, the replacement third):
library(stringr)
data[[22]] <- as.numeric(str_replace(data[[22]], fixed(","), "."))

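You can use gsub as shown below:
transform(
  df,
  n = as.numeric(gsub("\\D", ".", n))
)
where every non-digit character, i.e., "," or ".", is replaced by ".". Note that this also replaces any other non-digit, such as a minus sign or a space, so it only suits columns that contain nothing but digits and one separator.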


Split a single row into multiple rows keeping the delimiter intact

I am trying to split a single row in my data set into multiple rows by keeping the delimiter intact.
This is a sample of my input data set
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1. Teams must be split into two |
| | 2. Teams must have ten players in each team |
| | 3. Each player must bring their own gear |
|---------------------|----------------------------------------------- |
When I use the strsplit function with a lookahead on digits, I get the following output:
df = data.frame(rules = unlist(strsplit(as.character(df$Rules), "(?=[[:digit:]])", perl = TRUE)))
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1 |
| 1 | .Teams must be split into two |
| 1 | 2 |
| 1 | .Teams must have ten players in each team |
|---------------------|----------------------------------------------- |
My desired output:
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1.Teams must be split into two |
| 1 | 2.Teams must have ten players in each team |
|---------------------|----------------------------------------------- |
Here is a way to paste each number together with the character string that follows it in column Rules. as.numeric() warns that NAs were introduced by coercion for the non-numeric strings; these are warnings, not errors.
grp <- cumsum(!is.na(as.numeric(df$Rules)))
res <- lapply(split(df, grp), function(X){
  data.frame(Group = X[[1]][1],
             Rules = paste(X[[2]], collapse = ""))
})
res <- do.call(rbind, res)
res
# Group Rules
#1 1 1.Teams must be split into two
#2 1 2.Teams must have ten players in each team
Data.
df <- data.frame(Group = rep(1, 4),
                 Rules = c(1, ".Teams must be split into two",
                           2, ".Teams must have ten players in each team"),
                 stringsAsFactors = FALSE)
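If the coercion warnings are unwanted, one possible variant builds the grouping index with a regular expression instead of as.numeric(). This is only a sketch under the assumption that the number rows consist of digits alone; it reuses the sample data above, and `res2` is a name introduced here for illustration.

```r
# Same sample data as in the answer above
df <- data.frame(Group = rep(1, 4),
                 Rules = c(1, ".Teams must be split into two",
                           2, ".Teams must have ten players in each team"),
                 stringsAsFactors = FALSE)

# A new group starts at every row that holds digits only (no warnings raised)
grp <- cumsum(grepl("^[0-9]+$", df$Rules))

# Keep the first Group value and paste the Rules pieces of each group
res2 <- data.frame(
  Group = unname(vapply(split(df$Group, grp), function(g) g[1], numeric(1))),
  Rules = unname(vapply(split(df$Rules, grp), paste, character(1), collapse = "")),
  stringsAsFactors = FALSE
)
print(res2)
```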

R: Regex to match more than one pipe occurrence

I have a dataset in which I paste values in a dplyr chain and collapse with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get recurring pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | |BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is formulating a regex that removes every pipe except the ones that should remain as actual separators between terms. We can match on the following pattern:
\|\s+(?=\|)
and then replace each match with an empty string. This pattern removes any pipe (and the whitespace following it) as long as what follows is another pipe. No removal occurs when a pipe is followed by an actual term, or when it is followed by the end of the string.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Edit:
If you expect to have inputs like | | | which are devoid of any terms, and you would expect empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
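We might also fold this into a single pattern with an alternation. The all-pipes branch has to match the whole string and come first, because after earlier matches have been consumed, ^ can no longer match at the last remaining pipe:
result <- gsub("^(?:\\|\\s*)+$|\\|\\s+(?=\\|)", "", badstring, perl=TRUE)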
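The two-step approach (gsub for repeated pipes, then sub for a lone leftover pipe) can be wrapped in a small helper. This is a sketch only; `collapse_pipes` is a hypothetical name, and the short input strings are made up for illustration.

```r
# Hypothetical helper: drop every pipe that is directly followed
# (after whitespace) by another pipe, then clear a string that is
# nothing but a single leftover pipe
collapse_pipes <- function(x) {
  x <- gsub("\\|\\s+(?=\\|)", "", x, perl = TRUE)
  sub("^\\|$", "", x)
}

with_terms <- collapse_pipes("| | | A,B | | | C,D | |")
only_pipes <- collapse_pipes("| | |")

print(with_terms)
print(only_pipes)
```

Because both gsub and sub are vectorised, the helper can be used directly inside mutate() on a whole column.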

How can I remove certain part of row names in data frame

I have a data set with the following format:
ID | Value
-------------------------- | -------------------------------
AAA1|404744 | 1.7554
ANKHD1-EIF4EBP3|404734 | 0.5174
HLA-B|3106 | 11.7659
HLA-A|3105 | 18.0851
What I want is removing certain part of the row names like this:
ID | Value
--------------------- | -------------------------------
AAA1 | 1.7554
ANKHD1-EIF4EBP3 | 0.5174
HLA-B | 11.7659
HLA-A | 18.0851
Thanks a lot!
We can do this with sub. Match the | followed by any characters (.*) and replace the match with an empty string (""). Because | is a regex metacharacter meaning "or", either escape it (\\|) or place it in brackets ([|]) to get the literal character.
df$ID <- sub("[|].*", "", df$ID)
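A short sketch on the sample rows from the question shows that the bracketed and the escaped form give the same result. The data frame is rebuilt here by hand as an assumption about how the data is stored.

```r
# Sample rows from the question, with ID as a regular character column
df <- data.frame(
  ID = c("AAA1|404744", "ANKHD1-EIF4EBP3|404734", "HLA-B|3106", "HLA-A|3105"),
  Value = c(1.7554, 0.5174, 11.7659, 18.0851),
  stringsAsFactors = FALSE
)

bracketed <- sub("[|].*", "", df$ID)   # literal pipe via a character class
escaped   <- sub("\\|.*", "", df$ID)   # literal pipe via escaping

print(bracketed)
```

If the identifiers are stored as actual row names rather than in a column, the same call can be applied to rownames(df), provided the shortened names stay unique.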

R apply script output in different formats for similar inputs

I'm using a double apply function to get a list of p-values for cor.test between any two columns of two tables.
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90x11 (90 rows, 11 columns, i.e. dim(otud) gives 90 11) and will be used with different data.frames.
bc and hel are both 90x2 data.frames, so for each of them I get 2*11 = 22 p-values out of the functions:
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc, I get an output with dim = NULL: a nested list with one p-value per element, addressed as in $axis1$Otu037 (a format that I have always gotten from these scripts and am happy with).
But for hel, I get an output with dim 11 2: an 11x2 table with the p-values written inside.
Shortened examples of the output:
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is it like that when the input formats are all the same? (Shortened examples)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How can I force my scripts to always produce "flat" output, as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional in the bc_plist case. If I modify the code to replace possible NULLs with NAs, I get 2D tables in every case.
So, to keep things consistent:
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2D table for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output for any correct input.
If anyone has an idea how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
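One way to get a predictable shape either way is to choose the container explicitly rather than relying on how apply() simplifies its results. The sketch below runs on random stand-in data; `sig_p`, `p_matrix` and `p_list` are hypothetical names, and the tables are assumptions that only mimic the 90x11 and 90x2 dimensions from the question.

```r
# Stand-in data with the dimensions described in the question
set.seed(1)
otud <- as.data.frame(matrix(rnorm(90 * 11), nrow = 90,
                             dimnames = list(NULL, sprintf("Otu%03d", 1:11))))
hel <- data.frame(axis1 = rnorm(90), axis2 = rnorm(90))

# Run the test once; return the p-value if it is below alpha, else NA
sig_p <- function(x, y, alpha = 0.05) {
  p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
  if (isTRUE(p < alpha)) p else NA_real_
}

# Always a matrix: one row per otud column, one column per axis
p_matrix <- sapply(hel, function(x) vapply(otud, function(y) sig_p(x, y), numeric(1)))

# Always a nested list, addressed as p_list$axis1$Otu001
p_list <- lapply(hel, function(x) lapply(otud, function(y) sig_p(x, y)))
```

lapply() never simplifies, so the nested list form is returned regardless of how many tests are significant. Calling cor.test once per pair also halves the work compared with testing in the condition and again in the body.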

By group: sum of variable values under condition

I want to sum the values of a variable by group, excluding the value that belongs to the current row's sTicker.
How can I do this elegantly without transposing?
In the table below, for each (fTicker, DATE_f) group and each sTicker, I want the sum of wght over the group with that sTicker's own wght excluded.
For example, excl_val for sTicker = A in the group (fTicker = XLK, DATE_f = 6/20/2003) is wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK, without the wght for sTicker = A.
+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker | DATE_f | wght | excl_val |
+---------+---------+-----------+-------------+-------------+
| A | XLK | 6/20/2003 | 0.087600002 | 1.980834016 |
| A | XLK | 6/23/2003 | 0.08585 | 1.898560068 |
| A | XLK | 6/24/2003 | 0.085500002 | |
| AAPL | XLK | 6/20/2003 | 0.070080002 | |
| AAPL | XLK | 6/23/2003 | 0.06868 | |
| AAPL | XLK | 6/24/2003 | 0.068400002 | |
| AA | XLK | 6/20/2003 | 1.910754014 | |
| AA | XLK | 6/23/2003 | 1.829880067 | |
| AA | XLK | 6/24/2003 | 1.819775 | |
| | | | | |
| | | | | |
+---------+---------+-----------+-------------+-------------+
There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.
I did it by transposing in SAS, with a resulting file of about 6 GB, but the same approach in R blew memory use up to 40 GB and is basically unworkable.
In R, I got as far as this:
weights$excl_val <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))
but it's just a simple sum (without excluding the necessary observation), and the row lengths do not match. If I could condition the sum to exclude the current sTicker's wght, I think it might work.
About the length of excl_val: I computed it in Excel for just 2 cells, which is why it's short.
Thank you!
Arsenio
When you have data in a data.frame, it is better if the rows are meaningful (in particular, the columns should have the same length): in this case, excl_val looks like a separate vector. After putting the information it contains in the data.frame, things become easier.
# Sample data
k <- 5
d <- data.frame(
  sTicker = rep(LETTERS[1:k], k),
  fTicker = rep(LETTERS[1:k], each=k),
  DATE_f = sample( seq(Sys.Date(), length.out=2, by=1), k*k, replace=TRUE ),
  wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))
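If excl_val is meant as a leave-one-out sum, as the worked example in the question suggests (the group total minus the row's own wght), it may be computed without transposing by subtracting each weight from its group total. This is a sketch under the assumption that each sTicker appears at most once per (fTicker, DATE_f) group and that wght has no missing values; the data frame is typed in from the question's table.

```r
# Rows from the question's table
weights <- data.frame(
  sTicker = rep(c("A", "AAPL", "AA"), each = 3),
  fTicker = "XLK",
  DATE_f  = rep(c("6/20/2003", "6/23/2003", "6/24/2003"), times = 3),
  wght    = c(0.087600002, 0.08585, 0.085500002,
              0.070080002, 0.06868, 0.068400002,
              1.910754014, 1.829880067, 1.819775),
  stringsAsFactors = FALSE
)

# Total weight of each (fTicker, DATE_f) group, repeated on every row
group_total <- ave(weights$wght, weights$fTicker, weights$DATE_f, FUN = sum)

# Leave-one-out sum: group total minus the row's own weight
weights$excl_val <- group_total - weights$wght
print(weights)
```

The first two rows agree with the values the asker computed by hand (about 1.980834 and 1.898560). Since ave() returns a vector as long as its input, no reshaping or merging is needed, which should keep memory use close to the size of the original table.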
