How to replace a specific character string with a number in R?

I'm working with a dataframe, entitled Clutch, of information about cards in a trading card game. One of the variables, CMD+, can consist of the following values:
"R+1"
"L+1"
"R+2"
"L+2"
0
What I want to do is to create a new variable, Clutch$C+, that takes these string values for each data point and replaces them with numbers. R+1 and L+1 are replaced with 0.5, and R+2 and L+2 are replaced with 1. 0 is unchanged.
How do I do this? Sorry if this is a basic question; my R skills aren't great at the minute, but I'm working on getting better.
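(For reference, one compact way to express this kind of mapping is a named lookup vector — a sketch, assuming the column is stored as character or factor and really is named CMD+, which needs backticks or [[ ]]:)
lookup <- c("R+1" = 0.5, "L+1" = 0.5, "R+2" = 1, "L+2" = 1, "0" = 0)
Clutch[["C+"]] <- unname(lookup[as.character(Clutch[["CMD+"]])])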

Probably not the most beautiful solution, but this should work (the non-syntactic names C+ and CMD+ need backticks, and don't forget the L+2 case):
Clutch$`C+` <- rep(0, nrow(Clutch))
Clutch$`C+`[which(Clutch$`CMD+` == "R+1")] <- 0.5
Clutch$`C+`[which(Clutch$`CMD+` == "L+1")] <- 0.5
Clutch$`C+`[which(Clutch$`CMD+` == "R+2")] <- 1
Clutch$`C+`[which(Clutch$`CMD+` == "L+2")] <- 1

With x as the character values, e.g. x <- c("R+1", "L+1", "R+2", "L+2"), you can try:
paste0(as.numeric(gsub("\\D", "", x))/2, sub("\\D", "", x))
[1] "0.5+1" "0.5+1" "1+2" "1+2"
(Keep just as.numeric(gsub("\\D", "", x))/2 if you only want the numeric 0.5/1 values.)

Here is one way, using the fact that the result is half the digit in your string (note that data.frame() converts the non-syntactic name CMD+ to CMD. unless you pass check.names = FALSE):
Clutch <- data.frame(`CMD+` = sample(c("R+1", "L+1", "R+2", "L+2", 0), 10, replace = TRUE))
Clutch[["C+"]] <- as.numeric(gsub("[^0-9]", "", Clutch$CMD.))/2
> Clutch
CMD. C+
1 R+1 0.5
2 R+2 1.0
3 R+1 0.5
4 L+1 0.5
5 L+1 0.5
6 R+1 0.5
7 R+1 0.5
8 L+1 0.5
9 0 0.0
10 L+1 0.5

You can simply use gsub, with a as the vector of values, e.g. a <- c("R+1", "L+1", "R+2", "L+2", 0):
> as.numeric(gsub(".*[+]","",a))/2
[1] 0.5 0.5 1.0 1.0 0.0
If it is a data frame, you can use this:
> library(data.table)
> dt <- data.frame(CMD = c("R+1", "L+1", "R+2", "L+2", 0))
> setDT(dt)[,CMD:=as.numeric(gsub(".*[+]","",CMD))/2]
> dt
CMD
1: 0.5
2: 0.5
3: 1.0
4: 1.0
5: 0.0

Another idea is to use a simple ifelse statement that replaces strings containing a 1 with 0.5 and strings containing a 2 with 1, i.e.
#where x is your column,
as.numeric(ifelse(grepl('1', x), 0.5, ifelse(grepl('2', x), 1, x)))
#[1] 0.5 0.5 1.0 1.0 0.0

Related

How to use Loops or Lapply to generate 100 new variables from a matrix?

So, I have a data frame containing 100 different variables. Now, I want to create 100 new variables, one corresponding to each variable in the original data frame. Currently, I am trying loops and lapply to figure out a way to do it, but haven't had much luck so far.
Here is a snapshot of what the data frame looks like (suppose the data frame is named er):
a b c d
1 2 3 4
5 6 7 8
9 0 1 2
Using each of these 4 variables I have to create a new variable, so 4 new variables in total. The new variables should be, say, a1 = 0.5 + a, b1 = 0.5 + b, and so on.
I am trying the following two approaches:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
and alternatively, I am trying lapply as follows:
dep <- lapply(er, function(x) {
x<-0.5+er
}
But neither of them is working. Can anyone let me know what's the problem with these attempts, or suggest an efficient way to do this? I have shown just 4 variables here for demonstration; I have around 100 of them.
You could directly add 0.5 (or any number) to the dataframe.
er[paste0(names(er), '1')] <- er + 0.5
er
# a b c d a1 b1 c1 d1
#1 1 2 3 4 1.5 2.5 3.5 4.5
#2 5 6 7 8 5.5 6.5 7.5 8.5
#3 9 0 1 2 9.5 0.5 1.5 2.5
Ronak's answer provides the most efficient way of solving your problem. I'll focus on why your attempts didn't work.
er <- data.frame(a = c(1, 5, 9), b = c(2, 6, 0), c = c(3, 7, 1), d = c(4, 8, 2))
A. for loop:
for (i in 1:ncol(er)) {
[[i]] <- 0.5 + [[i]]
}
Think about how R interprets each iteration of your loop. It will go from 1 to however many columns er has, using i as a placeholder, so on the first iteration it will do:
[[1]] <- 0.5 + [[1]]
Which doesn't make sense because you're not indicating what you are indexing at all. Instead, what you would want is:
for (i in 1:ncol(er)) {
er[[i]] <- 0.5 + er[[i]]
}
Here, each iteration will mean "assign to the ith column of er the ith column of er plus 0.5". If you additionally want to create new variables, you would do the following (which is somewhat similar to Ronak's answer, just less efficient):
for (i in 1:ncol(er)) {
er[[paste0(names(er)[i], "1")]] <- 0.5 + er[[i]]
}
As a side note, it is preferred to use seq_along(er) instead of 1:ncol(er).
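For illustration, the same loop written with seq_along() (a minimal sketch; seq_along(er) returns integer(0) for a zero-column data frame, whereas 1:ncol(er) would produce the problematic sequence 1:0):
for (i in seq_along(er)) {
  er[[paste0(names(er)[i], "1")]] <- 0.5 + er[[i]]
}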
B. lapply:
dep <- lapply(er, function(x) {
x<-0.5+er
}
When creating a function, you need to specify what it returns. Here, function(x) { x + 0.5 } is sufficient to indicate that you want to return the variable plus 0.5. Since lapply() returns a list (the function's name is short for "list apply"), you'll want to use as.data.frame():
as.data.frame(lapply(er, function(x) { x + 0.5 }))
However, this doesn't change the variable names, and there's no easy way to do that inside the lapply() call itself, so rename afterwards and bind the result to the original:
dep <- as.data.frame(lapply(er, function(x) { x + 0.5 }))
names(dep) <- paste0(names(dep), "1")
cbind(er, dep)
a b c d a1 b1 c1 d1
1 1 2 3 4 1.5 2.5 3.5 4.5
2 5 6 7 8 5.5 6.5 7.5 8.5
3 9 0 1 2 9.5 0.5 1.5 2.5
C. Another way would be using dplyr syntax, which is more elegant and readable:
library(dplyr)
mutate(er, across(everything(), ~ . + 0.5, .names = "{.col}1"))
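For the example er above this should add the same a1 to d1 columns as the base R approaches; note that across() with the .names argument requires dplyr 1.0.0 or later.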

How can I normalize column values in a data frame for all rows that share the same ID given in another column?

I have a dataframe that looks like this
ID value
1 0.5
1 0.6
1 0.7
2 0.5
2 0.5
2 0.5
and I would like to add a column that normalizes the values within each ID, i.e. norm = value / max(values with the same ID):
ID value norm
1 0.5 0.5/0.7
1 0.6 0.6/0.7
1 0.7 1
2 0.5 1
2 0.3 0.3/0.5
2 0.5 1
Is there an easy way to do this in R without first sorting and then looping?
Cheers
A solution using base R tools:
data$norm <- with(data, value / ave(value, ID, FUN = max))
Function ave is pretty useful, and you may want to read ?ave.
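A quick check on the example data (a sketch; data here stands for the question's data frame):
data <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                   value = c(0.5, 0.6, 0.7, 0.5, 0.5, 0.5))
data$norm <- with(data, value / ave(value, ID, FUN = max))
data
#   ID value      norm
# 1  1   0.5 0.7142857
# 2  1   0.6 0.8571429
# 3  1   0.7 1.0000000
# 4  2   0.5 1.0000000
# 5  2   0.5 1.0000000
# 6  2   0.5 1.0000000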
# Create an example data frame
dt <- read.csv(text = "ID, value
1, 0.5
1, 0.6
1, 0.7
2, 0.5
2, 0.5
2, 0.5")
# Load package
library(tidyverse)
# Create a new data frame with a column showing normalization
dt2 <- dt %>%
# Group the ID, make sure the following command works only in each group
group_by(ID) %>%
# Create the new column norm
# norm equals each value divided by the maximum value of each ID group
mutate(norm = value/max(value))
We can use data.table
library(data.table)
setDT(dt)[, norm := value/max(value), ID]

First index of longest ordered portion of a vector

I am looking to extract the longest ordered portion of a vector. So for example with this vector:
x <- c(1,2,1,0.5,1,4,2,1:10)
x
[1] 1.0 2.0 1.0 0.5 1.0 4.0 2.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
I'd apply some function and get the following returned:
x_ord <- some_func(x)
x_ord
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
I've been trying to leverage is.unsorted() to determine at what point the vector is no longer sorted. Here is my messy attempt and what I have so far:
for(i in 1:length(x)){
if( is.unsorted(x[i:length(x)])==TRUE ){
cat(i,"\n")}
else{x_ord=print(x[i])}
}
However, this clearly isn't right as x_ord is producing a 10. I am also hoping to make this more general and cover non increasing numbers after the ordered sequence as well with a vector something like this:
x2 <- c(1,2,1,0.5,1,4,2,1:10,2,3)
Right now though I am stuck on identifying the increasing sequence in the first vector mentioned.
Any ideas?
This seems to work:
s = 1L + c(0L, which( x[-1L] < x[-length(x)] ), length(x))
w = which.max(diff(s))
x[s[w]:(s[w+1]-1L)]
# 1 2 3 4 5 6 7 8 9 10
s are where the runs start, plus length(x)+1, for convenience:
the first run starts at 1
subsequent runs start where there is a drop
we tack on length(x)+1, where the next run would start if the vector continued
diff(s) are the lengths of the runs and which.max takes the first maximizer, to break ties.
s[w] is the start of the chosen run; s[w+1L] is the start of the next run; so to get the numbers belonging to the chosen run: s[w]:(s[w+1]-1L).
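Wrapped up as the some_func() the question asks for (a small sketch of the same approach; the function name is just illustrative):
some_func <- function(x) {
  s <- 1L + c(0L, which(x[-1L] < x[-length(x)]), length(x))
  w <- which.max(diff(s))
  x[s[w]:(s[w + 1L] - 1L)]
}
some_func(c(1, 2, 1, 0.5, 1, 4, 2, 1:10))
# 1 2 3 4 5 6 7 8 9 10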
Alternately, label each non-decreasing run with cumsum() (the counter increments at every drop), split on those labels, and then select the longest subvector:
sp = split(x, cumsum(x < c(-Inf, x[-length(x)])))
sp[[which.max(lengths(sp))]]
# 1 2 3 4 5 6 7 8 9 10

R populating a vector [duplicate]

I have a vector of zeros, say of length 10. So
v = rep(0,10)
I want to populate some values of the vector, based on a set of indexes in v1 and another vector v2 that holds the values in sequence. So the vector v1 has the indexes, say
v1 = c(1,2,3,7,8,9)
and
v2 = c(0.1,0.3,0.4,0.5,0.1,0.9)
In the end I want
v = c(0.1,0.3,0.4,0,0,0,0.5,0.1,0.9,0)
So the indexes in v1 get the corresponding values from v2 and the remaining entries stay 0. I can obviously write a for loop, but that's taking too long in R, owing to the length of the actual matrices. Is there a simple way to do this?
You can assign it this way:
v[v1] = v2
For example:
> v = rep(0,10)
> v1 = c(1,2,3,7,8,9)
> v2 = c(0.1,0.3,0.4,0.5,0.1,0.9)
> v[v1] = v2
> v
[1] 0.1 0.3 0.4 0.0 0.0 0.0 0.5 0.1 0.9 0.0
You can also do it with replace
v = rep(0,10)
v1 = c(1,2,3,7,8,9)
v2 = c(0.1,0.3,0.4,0.5,0.1,0.9)
replace(v, v1, v2)
[1] 0.1 0.3 0.4 0.0 0.0 0.0 0.5 0.1 0.9 0.0
See ?replace for details.

Using Rollapply on two columns

I'm trying to do something similar to what I was asking about here, and unfortunately I cannot work it out.
This is my data frame (data), a time series of prices:
Date Price Vol
1998-01-01 200 0.3
1998-01-02 400 0.4
1998-01-03 600 -0.2
1998-01-04 100 0.1
...
1998-01-20 100 0.1
1998-01-21 200 -0.4
1998-01-21 500 0.06
....
1998-02-01 100 0.2
1998-02-02 200 0.4
1998-02-03 500 0.3
1998-02-04 100 0.1
etc.
I would like to tell R to:
take the 1st value of "Vol" and divide it by the 20th value of "Price", then
take the 2nd value of "Vol" and divide it by the 21st value of "Price", then
take the 3rd value of "Vol" and divide it by the 22nd value of "Price", then
etc.
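(For reference, a plain-vector sketch of that calculation, taking the description literally, i.e. quo[i] = Vol[i] / Price[i + 19]; the xts code further down uses an offset of hold = 20 rows instead, so adjust k to taste:)
k <- 19                                  # row offset between Vol and Price
n <- nrow(data)
data$quo <- c(data$Vol[seq_len(n - k)] / data$Price[(k + 1):n], rep(NA, k))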
In my other post, I was able to use this function to calculate a return over a holding period of 20 days:
> data.xts <- xts(data[, -1], data[, 1])
> hold <- 20
> f <- function(x) log(tail(x, 1)) - log(head(x, 1))
> data.xts$returns.xts <- rollapply(data.xts$Price, FUN=f,
width=hold+1, align="left", na.pad=T)
Is there a way to do something very similar for the problem stated above? So something like
f1 <- function(x,y) head(x, 1) / tail(y,1)
where x is "Vol" and y is "Price" and then apply "rollapply"?
Thank you very much
UPDATE (reply to Dr G):
Thanks for your suggestions. With a slight change, it did what I wanted!
data.xts <- xts(data[, -1], data[, 1])
hold <- 20
data.xts$quo <- lag(data.xts[,2], hold) / data.xts[,1]
Now my problem is that the resulting data frame looks like this:
Date Price Vol quo
1 1998-01-01 200 0.3 NA
2 1998-01-02 400 0.4 NA
3 1998-01-03 600 -0.2 NA
4 1998-01-04 100 0.1 NA
...
21 1998-01-20 180 0.2 0.003
I know that there must be NAs in the result, but they should be in the last 20 observations, not the first 20. The formula above calculates the correct values, but puts them starting at the 21st row instead of the first row. Do you know how I could change that?
Use by.column = FALSE in rollapply. In order to use the posted data we will divide the volume in the first row by the price in the 3rd row and so on for purposes of reproducible illustration:
library(zoo)
Lines <- "Date Price Vol
1998-01-01 200 0.3
1998-01-02 400 0.4
1998-01-03 600 -0.2
1998-01-04 100 0.1
1998-01-20 100 0.1
1998-01-21 200 -0.4
1998-01-21 500 0.06
1998-02-01 100 0.2
1998-02-02 200 0.4
1998-02-03 500 0.3
1998-02-04 100 0.1"
# read in and use aggregate to remove all but last point in each day.
# In reality we would replace textConnection(Lines) with something
# like "myfile.dat"
z <- read.zoo(textConnection(Lines), header = TRUE,
aggregate = function(x) tail(x, 1))
# divide Volume by the Price of the point 2 rows ahead using by.column = FALSE
# Note use of align = "left" to align with the volume.
# If we used align = "right" it would align with the price.
rollapply(z, 3, function(x) x[1, "Vol"] / x[3, "Price"], by.column = FALSE,
align = "left")
# and this is the same as rollapply with align = "left" as above
z$Vol / lag(z$Price, 2)
# this is the same as using rollapply with align = "right"
lag(z$Vol, -2) / z$Price
By the way, note that zoo uses the same convention for the sign of lag as does R but xts uses the opposite convention so if you convert the above to xts you will have to negate the lags.
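A tiny illustration of that sign difference (a sketch with made-up data):
library(zoo)
library(xts)
z <- zoo(1:3, as.Date("2000-01-01") + 0:2)
lag(z, -1)         # zoo/stats convention: the previous value (1, 2), aligned to the last two dates
lag(as.xts(z), 1)  # xts convention: also the previous value, but with positive k and NA-padded at the start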
It's actually easier than that. Just do this:
data.xts <- xts(data[, -1], data[, 1])
hold <- 20
returns.xts = data.xts[,2] / lag(data.xts[,1], hold)
Actually for this using zoo instead of xts would work as well:
data.zoo<- zoo(data[, -1], data[, 1])
hold <- 20
returns.zoo = data.zoo[,2] / lag(data.zoo[,1], -hold)
The only thing that changes is the sign of the lag (the zoo convention is different from the xts one).
You just need to use
data.xts$quo <- data.xts[,2] / lag( data.xts[,1], -hold)
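(In xts a negative k in lag() shifts the series the other way, so lag(data.xts[,1], -hold) lines each Vol up with the Price hold rows later, and the NAs now fall in the last hold rows of quo, which is what the update asked for.)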
