I have a variable (say, VarX) with values 1:4 in a dataset with approximately 2000 rows. There are other variables in the dataset too. I would like to create a new variable (NewVar) so that if the value of VarX is 1, the value of NewVar is 0.32 (the value from myMat[1, 1]), if the value of VarX is 2, the value of NewVar is 0.05 (the value from myMat[2, 1]), and so on.
myMat
     VarA VarB VarC
[1,] 0.32 0.34 0.27
[2,] 0.05 0.02 0.11
[3,] 0.11 0.11 0.17
[4,] 0.52 0.52 0.45
I have tried the following and it works fine:
df$NewVar <- ifelse(df$VarX == 1, 0.32,
             ifelse(df$VarX == 2, 0.05,
             ifelse(df$VarX == 3, 0.11,
             ifelse(df$VarX == 4, 0.52, 0))))
However, I have another variable (say, VarY) which has 182 values and another matrix with 182 different values. So, using ifelse() would be quite tedious. Is there another way to perform the task in R? Thank you!
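Since VarX holds the row numbers of myMat, one possible approach is to index the matrix directly with the variable. This is a minimal sketch, assuming df and myMat look like the ones described above; the small data frame here is made up for illustration.

```r
# Lookup matrix from the question
myMat <- matrix(c(0.32, 0.05, 0.11, 0.52,
                  0.34, 0.02, 0.11, 0.52,
                  0.27, 0.11, 0.17, 0.45),
                nrow = 4,
                dimnames = list(NULL, c("VarA", "VarB", "VarC")))

# Hypothetical stand-in for the real ~2000-row dataset
df <- data.frame(VarX = c(1, 4, 2, 3, 1))

# Use VarX as a row index into the VarA column of myMat
df$NewVar <- myMat[df$VarX, "VarA"]
```

Unlike the nested ifelse(), this gives NA for a missing VarX and raises an error for a value larger than the number of rows, rather than returning 0. The same line should work for VarY with a 182-row matrix. If the codes are not consecutive row numbers, match() against a vector of codes could be used to get the row index first.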
I'd like to calculate data for two new columns in a data.frame where the results are based on the values in the previous row. However, the previous row also needs to be calculated, which means that there is a dependency between the two columns (the input for one calculation is the output of the other). I could do it with a for loop, but maybe that's not the right way.
This is a sample for this case:
df <- data.frame(A=c(0.91,0.98,1,1.1), B=c(0.81, 1.11, 0.83, 0.92), C=c(0.09,0.06,0.09,0.08))
df$D <- NA
df$E <- NA
df[1,]$D <- 0.0
I've been trying it through dplyr::mutate.
df %>%
mutate(D = ifelse( lag(A) < 1, lag(E), lag(E) - lag(E) * lag(A)),
E = B - (B - D) * exp(-C)
)
This is how the output should be:
> df
A B C D E
1 0.91 0.81 0.09 0.00000000 0.06971574
2 0.98 1.11 0.06 0.06971574 0.13029718
3 1.00 0.83 0.09 0.13029718 0.19051977
4 1.10 0.92 0.08 0.00000000 0.07073296
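Because each row's D needs the previous row's E, and each E needs the current D, a vectorised mutate() with lag() cannot see values that have not been computed yet. A plain loop is one straightforward way to get the expected output. This is a sketch that assumes the formulas in the mutate() attempt above are the intended ones.

```r
df <- data.frame(A = c(0.91, 0.98, 1, 1.1),
                 B = c(0.81, 1.11, 0.83, 0.92),
                 C = c(0.09, 0.06, 0.09, 0.08))

n <- nrow(df)
D <- numeric(n)   # D[1] starts at 0, as in the question
E <- numeric(n)

for (i in seq_len(n)) {
  if (i > 1) {
    # D depends on the previous row's A and E
    D[i] <- if (df$A[i - 1] < 1) E[i - 1] else E[i - 1] - E[i - 1] * df$A[i - 1]
  }
  # E depends on the current row's B and C and on the D just computed
  E[i] <- df$B[i] - (df$B[i] - D[i]) * exp(-df$C[i])
}

df$D <- D
df$E <- E
```

Preallocating D and E and filling them in place keeps the loop reasonably fast even for large data frames.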
I have a large data set and I need to get the standard deviation for the Main column based on the number of rows in other columns. Here is a sample data set:
df1 <- data.frame(
Main = c(0.33, 0.57, 0.60, 0.51),
B = c(NA, NA, 0.09,0.19),
C = c(NA, 0.05, 0.07, 0.05),
D = c(0.23, 0.26, 0.23, 0.26)
)
View(df1)
# Main B C D
# 1 0.33 NA NA 0.23
# 2 0.57 NA 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
Take column B as an example, since row 1&2 are NA, its standard deviation will be sd(df1[3:4,1]); column C&D will be sd(df1[2:4,1]) and sd(df1[1:4,1]). Therefore, the result will be:
# B C D
# 1 0.06 0.05 0.12
I tried the following, but it only returned one number, 0.0636:
df2 <- df1[,-1]!=0
sd(df1[df2,1], na.rm = T)
My data set has many more columns, and I'm wondering if there is a more efficient way to get it done? Many thanks!
Try:
sapply(df1[,-1], function(x) sd(df1[!is.na(x), 1]))
# B C D
# 0.06363961 0.04582576 0.12093387
x <- colnames(df1) # list all columns you want to calculate sd of
value <- sapply(seq_along(x), function(i) sd(df1[, x[i], drop = TRUE], na.rm = TRUE))
names(value) <- x
# Main B C D
# 0.12093387 0.07071068 0.01154701 0.01732051
We can get this with colSds from matrixStats
library(matrixStats)
colSds(`dim<-`(df1[,1][NA^is.na(df1[-1])*row(df1[-1])], dim(df1[,-1])), na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387
I have two datasets (multicolumn data before and after treatment):
Before treatment
Data1<-read.csv("before.csv")
X1 X2 X3
1 0.21 0.32 0.42
2 0.34 0.23 0.33
3 0.42 0.14 0.11
4 0.35 0.25 0.35
5 0.25 0.41 0.44
After treatment
Data2<-read.csv("after.csv")
X1 X2 X3
1 0.33 0.43 0.7
2 0.28 0.51 0.78
3 0.11 0.78 0.34
4 0.54 0.34 0.34
5 0.42 0.64 0.22
I would like to combine the data by columns (i.e. X1 in Data1 with X1 in Data2, X2 in Data1 with X2 in Data2, and so on) and perform the Johansen cointegration test for each pair.
What I tried:
library("urca")
x1 <- cbind(Data1$X1, Data2$X1)
Jo1 <- ca.jo(x1, type="trace", K=2, ecdet="none", spec="longrun")
summary(Jo1)
x2 <- cbind(Data1$X2, Data2$X2)
Jo2 <- ca.jo(x2, type="trace", K=2, ecdet="none", spec="longrun")
summary(Jo2)
This gives me what I want but I would like to automate the process, i.e. instead of manually combining data, to have all pair-wise combinations.
Based on krishna's answer, but with a modified loop:
for(i in 1:ncol(Data1)) {
  col <- paste0("X", as.character(i))
  data <- cbind(Data1[, col], Data2[, col])
  colnames(data) <- c(paste0("Data1_", col), paste0("Data2_", col)) # add column names
  Jo <- ca.jo(data, type="trace", K=2, ecdet="none", spec="longrun")
  print(summary(Jo)) # print the summary to the console
}
You can loop through the column names and run the Johansen cointegration test on each pair as follows:
# Create a sample data frame
Data1<- data.frame(X1 = rnorm(10, 0, 1), X2 = rnorm(10, 0, 1), X3 = rnorm(10, 0, 1))
Data2 <-data.frame(X1 = rnorm(10, 0, 1), X2 = rnorm(10, 0, 1), X3 = rnorm(10, 0, 1))
library("urca")
# loop through all column indices
for(i in seq_len(ncol(Data1))) {
  col <- paste0("X", as.character(i)) # find the column name
  data <- cbind(Data1[, col], Data2[, col]) # get the data from Data1 and Data2, all rows of a column = col
  # Your method for finding Ca.Jo ...
  Jo <- ca.jo(data, type="trace", K=2, ecdet="none", spec="longrun")
  print(summary(Jo)) # summary() does not auto-print inside a for loop
}
You can also use colnames for looping as:
for(col in colnames(Data1)) {
print(col)
data <- cbind(Data1[, col], Data2[, col])
print(data)
#Jo<- ca.jo(data, type="trace",K=2,ecdet="none", spec="longrun")
#summary(Jo)
}
Hope this will help you.
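As a variation on the loops above, the paired matrices can be collected in a named list with lapply(), so the results stay available after the loop finishes. This is only a sketch: the random data stands in for before.csv and after.csv, and the names common and pairs are illustrative.

```r
# Hypothetical stand-ins for before.csv / after.csv
set.seed(1)
Data1 <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10))
Data2 <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10))

# Columns present in both data frames
common <- intersect(colnames(Data1), colnames(Data2))

# One two-column matrix per shared column, kept in a named list
pairs <- lapply(setNames(common, common), function(col) {
  cbind(before = Data1[[col]], after = Data2[[col]])
})

# Each element can then be passed to the test, e.g. (requires urca):
# results <- lapply(pairs, urca::ca.jo, type = "trace", K = 2,
#                   ecdet = "none", spec = "longrun")
# lapply(results, summary)
```

Keeping the fitted objects in a list makes it possible to inspect a single pair later, for example results$X2, without rerunning the whole loop.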
I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is. I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements. I kept getting an error when the price was 'NA' and so I added the is.na():
for(i in fake.list){
sum=0
for(j in fakedata$fakeprice){
if(is.na(fakedata$fakeprice[j])==TRUE){
NULL
} else {
if(fakedata$fakeprice[j]==fake.list[i]){
sum <- sum+fakedata$fakeconversion[j]
}}
}
fake.sum[i]=sum
}
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!
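The doubled totals appear to come from for(j in fakedata$fakeprice), which loops over the price values themselves rather than over row numbers, so fakedata$fakeprice[j] keeps revisiting rows 1 to 5. Here is a sketch of the same loop iterating over row indices instead (total, price and conv are illustrative names, not from the question):

```r
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <- c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)

fake.list <- sort(unique(fakedata$fakeprice))  # sort() drops the NA
fake.sum <- numeric(length(fake.list))

for (i in seq_along(fake.list)) {
  total <- 0
  for (j in seq_len(nrow(fakedata))) {         # j is a row number here
    price <- fakedata$fakeprice[j]
    conv <- fakedata$fakeconversion[j]
    # skip rows where either value is missing
    if (!is.na(price) && !is.na(conv) && price == fake.list[i]) {
      total <- total + conv
    }
  }
  fake.sum[i] <- total
}

sumdata <- data.frame(fake.list, fake.sum)
```

Using a name other than sum for the running total also avoids masking the built-in sum() function.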
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
The na.rm = TRUE argument is passed on to sum(), so the NA conversion value for fakeprice 4 is ignored.
aggregate() works by splitting your data into subsets defined by one or more grouping variables and then running a function, FUN, on each subset.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on. by is a list of grouping variables; give it several elements if you wish to split the data by multiple columns.
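To make the roles of the arguments concrete, here is the same call written with every argument named. It is a sketch on the question's data; res and the fake.list name given to the grouping column are illustrative choices.

```r
fakedata <- data.frame(
  fakeprice = c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA),
  fakeconversion = c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
)

# x:   the column(s) to summarise
# by:  a named list of grouping vectors; the names become column names
# FUN: the summary function; extra arguments (here na.rm) are forwarded to it
res <- aggregate(x = fakedata["fakeconversion"],
                 by = list(fake.list = fakedata$fakeprice),
                 FUN = sum, na.rm = TRUE)
```

Rows whose grouping value is NA (rows 5 and 11 here) are left out of the result, which is why only prices 1 to 5 appear.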
I have a data.frame containing numerics. I want to create a new column within that data.frame that will house factor labels using (letters[]). I want these factor labels to be built from a sequence of numbers that I have, and can change every time.
For example, my original DF has 1 column x containing numerics, I then have a sequence of numbers (3,7,9). So I need the new FLABEL column to populate according to the number sequence, i.e. first 3 lines are a, next 4 lines b and so on.
x FLABEL
0.23 a
0.21 a
0.19 a
0.27 b
0.25 b
0.22 b
0.15 b
0.09 c
0.32 c
0.19 d
0.17 d
I'm struggling with how to do this. I'm assuming some form of for loop is needed, given that my number sequence can vary in length every time I run it, so I could be populating letters a and b, or many more.
Based on the comment by @scoa, I suggest the following modified approach:
series <- c(3, 7, 9)
series <- c(series, nrow(DF)) # This ensures that the sequence extends to the last row of DF
series2 <- c(series[1] ,diff(series))
DF$FLABEL <- rep(letters[1:length(series2)], series2)
#> DF
# x FLABEL
#1 0.23 a
#2 0.21 a
#3 0.19 a
#4 0.27 b
#5 0.25 b
#6 0.22 b
#7 0.15 b
#8 0.09 c
#9 0.32 c
#10 0.19 d
#11 0.17 d
By using diff() the length of each sequence is calculated based on the index numbers in the input vector series. In this case, the index values 3, 7, 9 are converted into the number of repetitions of subsequent letters up to the last row of the data frame and stored in series2: 3, 4, 2, 2.
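Since the sequence can change on every run, the same diff() and rep() steps could be wrapped in a small function. label_runs is a hypothetical helper written for illustration; it assumes the break points are increasing row numbers and that no more than 26 labels are needed.

```r
# Hypothetical helper: label rows 1..n in runs that end at each break point
label_runs <- function(n, breaks) {
  ends <- unique(c(breaks[breaks < n], n))  # make sure the last run ends at row n
  run_lengths <- diff(c(0, ends))           # e.g. ends 3, 7, 9, 11 -> 3, 4, 2, 2
  stopifnot(length(run_lengths) <= length(letters))
  rep(letters[seq_along(run_lengths)], times = run_lengths)
}

label_runs(11, c(3, 7, 9))
label_runs(6, c(2, 4))
```

With the data frame from this answer, DF$FLABEL <- label_runs(nrow(DF), c(3, 7, 9)) should give the same labels as the code above.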
data
text <- "x FLABEL
0.23 x
0.21 x
0.19 x
0.27 x
0.25 x
0.22 x
0.15 x
0.09 x
0.32 x
0.19 x
0.17 x"
DF <- read.table(text = text, header=T)