if() statement with paste0() or grep() in R

I made a reproducible minimal example, but my real data is really huge:
ac_1 <- c(0.1, 0.3, 0.03, 0.03)
ac_2 <- c(0.2, 0.4, 0.1, 0.008)
ac_3 <- c(0.8, 0.043, 0.7, 0.01)
ac_4 <- c(0.2, 0.73, 0.1, 0.1)
c_2 <- c(1, 2, 5, 23)
check_1 <- c(0.01, 0.902, 0.02, 0.07)
check_2 <- c(0.03, 0.042, 0.002, 0.00001)
check_3 <- c(0.01, 0.02, 0.5, 0.001)
check_4 <- c(0.001, 0.042, 0.02, 0.2)
id <- 1:4
df <- data.frame(id, ac_1, ac_2, ac_3, ac_4, c_2, check_1, check_2, check_3, check_4)
So, the data frame looks like this:
> df
id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
1 1 0.10 0.200 0.800 0.20 1 0.010 0.03000 0.010 0.001
2 2 0.30 0.400 0.043 0.73 2 0.902 0.04200 0.020 0.042
3 3 0.03 0.100 0.700 0.10 5 0.020 0.00200 0.500 0.020
4 4 0.03 0.008 0.010 0.10 23 0.070 0.00001 0.001 0.200
And what I want to do is:
If check_1 is 0.02, I want to make the corresponding ac_1 value missing.
If check_2 is 0.02, I want to make the corresponding ac_2 value missing.
I will keep doing this for every "check" and "ac" column pair.
For example, in the check_1 column, the person with id 3 has 0.02,
so that person's ac_1 score (0.03) should become missing data (NA).
In the check_3 column, the person with id 2 has 0.02,
so that person's ac_3 score should become missing data.
In the check_4 column, the person with id 3 has 0.02,
so that person's ac_4 score should become missing data.
So, what I did is as follows:
for(i in 1:4){
  if(paste0("df$check_", i) == 0.02){
    paste0("df$ac_", i) == NA
  }
}
But it did not work...

You're really close, but you're off on a few fundamentals.
You can't (easily) use strings to refer to objects, so "df$check_1" won't work. You can use strings to refer to column names, but not with $; you need [ or [[, so df[["check_1"]] will work.
if isn't vectorized, so it won't work on each value in a column. Use ifelse instead, or better yet, in this case we can skip the if entirely.
Using == on non-integer numbers is risky due to floating-point precision issues, so we'll use a tolerance instead (see the quick demonstration below).
A minor issue: paste0("df$ac_",i)==NA isn't right, because == is for checking equality. You need = or <- for assignment on that line.
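A quick demonstration of the floating-point issue (an added illustration, not from the original answer):
0.1 + 0.2 == 0.3
# [1] FALSE  -- binary floating point can't represent these values exactly
abs((0.1 + 0.2) - 0.3) < 1e-10
# [1] TRUE   -- comparing with a small tolerance works as intended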
Addressing all of these issues:
for(i in 1:4){
  df[
    ## rows to replace
    abs(df[[paste0("check_", i)]] - 0.02) < 1e-10,
    ## column to replace
    paste0("ac_", i)
  ] <- NA
}
df
# id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
# 1 1 0.10 0.200 0.80 0.20 1 0.010 0.03000 0.010 0.001
# 2 2 0.30 0.400 NA 0.73 2 0.902 0.04200 0.020 0.042
# 3 3 NA 0.100 0.70 NA 5 0.020 0.00200 0.500 0.020
# 4 4 0.03 0.008 0.01 0.10 23 0.070 0.00001 0.001 0.200

It's often better to work with long-format data, even if just temporarily. Here is an example of doing so, using dplyr and tidyr:
library(dplyr)
library(tidyr)

pivot_longer(df, -c(id, c_2)) %>%
  separate(name, into = c("type", "pos")) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  mutate(ac = if_else(near(check, 0.02), as.double(NA), ac)) %>%
  pivot_wider(names_from = pos, values_from = ac:check)
(Updated with near() thanks to Gregor)
Output:
id c_2 ac_1 ac_2 ac_3 ac_4 check_1 check_2 check_3 check_4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0.2 0.8 0.2 0.01 0.03 0.01 0.001
2 2 2 0.3 0.4 NA 0.73 0.902 0.042 0.02 0.042
3 3 5 NA 0.1 0.7 NA 0.02 0.002 0.5 0.02
4 4 23 0.03 0.008 0.01 0.1 0.07 0.00001 0.001 0.2

Related

Why does manova() not give me p values?

Whenever I use manova(), then summary.aov(), I only get Df, Sum Sq, and Mean Sq, with no p-value.
My data frame looks like: (sorry I'm not sure if there's a better way to display this!)
subtype lymphocytosis anemia thrombocytopenia eosinophilia hypercalcemia hyperglobulinemia
1 MBC 0.60 0.18 0.17 0.02 0.01 0.04
2 SBC 0.25 0.18 0.14 0.03 0.02 0.12
3 BCLL 1.00 0.29 0.18 0.08 0.03 0.21
neutrophilia neutropenia lymphadenopathy_peripheral lymphadenopathy_visceral splenomegaly
1 0.23 0.02 1.00 0.65 0.60
2 0.22 0.04 0.99 0.62 0.49
3 0.23 0.04 0.40 0.25 0.49
hepatomegaly pleural_effusion peritoneal_effusion intestinal_mass mediastinal_mass pulmonary_mass
1 0.41 0.02 0.05 0.10 0.09 0.22
2 0.37 0.03 0.05 0.17 0.12 0.22
3 0.27 0.01 0.04 0.25 0.03 0.25
The values in the data frame represent the mean number of cases from each subtype for each clinical sign. I am a little worried that, for manova() to work, I should have each individual case and its clinical signs entered so that manova can do its own math? That would be a huge pain for me to assemble, hence why I've done it this way. Either way, I still think I should be getting p-values; they just might be wrong if my data frame is wrong?
The code I am using is:
cs_comp_try <- manova(cbind(lymphocytosis, anemia, thrombocytopenia, eosinophilia,
                            hypercalcemia, hyperglobulinemia, neutrophilia, neutropenia,
                            lymphadenopathy_peripheral, lymphadenopathy_visceral,
                            splenomegaly, hepatomegaly, pleural_effusion, peritoneal_effusion,
                            intestinal_mass, mediastinal_mass, pulmonary_mass) ~ subtype,
                      data = cs_comp)
summary(cs_comp_try)
summary.aov(cs_comp_try)
The result I get for summary.aov() is:
Response peritoneal_effusion :
Df Sum Sq Mean Sq
subtype 2 6.6667e-05 3.3333e-05
Response intestinal_mass :
Df Sum Sq Mean Sq
subtype 2 0.011267 0.0056333
Response mediastinal_mass :
Df Sum Sq Mean Sq
subtype 2 0.0042 0.0021
Response pulmonary_mass :
Df Sum Sq Mean Sq
subtype 2 6e-04 3e-04
I think I've replicated all the examples I've seen on the internet, so I'm not sure why I'm not getting an F statistic and p value when I run this code.
You can just use the summary function to get the p-values like this (I use iris data as an example):
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit)
#> Df Pillai approx F num Df den Df Pr(>F)
#> Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
#> Residuals 147
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-07-15 by the reprex package (v2.0.1)
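As an aside (not part of the original answer), summary() for manova objects also accepts a test argument if you prefer a statistic other than the default Pillai's trace:
# same fit as above; request a different multivariate test statistic
summary(fit, test = "Wilks")
summary(fit, test = "Hotelling-Lawley")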
If you want to extract the actual p-values, you can use the following code:
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit)$stats[1, "Pr(>F)"]
#> [1] 2.216888e-42
Created on 2022-07-15 by the reprex package (v2.0.1)

R: calculating interests and balance at each step

I have a stupid question, but I can't solve it easily with lag/lead or other functions.
Let's say I have this table. I have an initial balance of 100, Position indicates whether I bid or not, and Percentage_change is what I get if I bid. How can I calculate the balance to get something like this?
Position Percentage_change Balance
       0              0.01   100
       0             -0.01   100
       1              0.02   102
       1              0.05   107.1
       0             -0.02   107.1
       1              0.03   110.3
cumprod is the function you are looking for, e.g.:
df <- data.frame(Position = c(0, 0, 1, 1, 0, 1),
                 Percentage_change = c(0.01, -0.01, 0.02, 0.05, -0.02, 0.03))

# convert into multiplier form, e.g. 100 * 1.01
df$Multiplier <- df$Percentage_change + 1

# when Position is 0, reset the multiplier to 1 so there is no change to the balance
df[df$Position == 0, ]$Multiplier <- 1

# take the starting balance of 100 and multiply by the cumulative product of the multipliers
df$Balance <- 100 * cumprod(df$Multiplier)
df
Position Percentage_change Multiplier Balance
1 0 0.01 1.00 100.000
2 0 -0.01 1.00 100.000
3 1 0.02 1.02 102.000
4 1 0.05 1.05 107.100
5 0 -0.02 1.00 107.100
6 1 0.03 1.03 110.313
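If you prefer dplyr, the same cumprod() idea fits in a single mutate() call (a small equivalent sketch, not part of the original answer, using the df defined above):
library(dplyr)

df %>%
  mutate(Balance = 100 * cumprod(if_else(Position == 0, 1, 1 + Percentage_change)))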

Assign different values to a large number of columns

I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4*0 + 0.3*0 + 0.2*0.05 + 0.1*0.07 = 0.017, and SWeights = 0.2 + 0.1 = 0.3.
My initial idea was to assign the weights to each column, like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. I soon realized that this is impractical, as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type. I'm not sure whether that's the case with your real data, but I would first convert them to an appropriate type:
df1[-1] <- lapply(df1[-1], type.convert)
Next, we convert the column names to numeric too:
vec <- as.numeric(names(df1)[-1])
Finally, we can easily create the new columns in two simple steps. This does incur a matrix-conversion overhead, but maybe you should work with matrices in the first place. Either way, this is fully vectorized:
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2000-04-30 0.05 0.008 0.01 0.03 0.070 1.0
Or, you could convert to long format first (here's a data.table example), though I believe it will be less efficient, as these are basically by-row operations:
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE)[
  , c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
        SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2000-04-30 0.070 1.0
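For completeness, here is a dplyr/tidyr sketch of the same long-format idea (not from the original answer; it assumes df1 as originally constructed, i.e. before the new columns were added, with all non-Date columns still character):
library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(-Date, names_to = "weight", values_to = "ret") %>%
  mutate(weight = as.numeric(weight), ret = as.numeric(ret)) %>%
  group_by(Date) %>%
  summarise(SWeightedR = sum(weight * ret),
            SWeights   = sum(weight * (ret > 0)))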

sum, roundoff and replace values in R

I have two questions.
> data<-read.table("UC.txt",header=TRUE, sep="\t")
> data$tot<-data$P1+data$P2+data$P3+data$P4
> head(data, 5)
geno P1 P2 P3 P4 tot
1 G1 0.015 0.007 0.026 0.951 0.999
2 G2 0.008 0.006 0.015 0.970 0.999
3 G3 0.009 0.006 0.017 0.968 1.000
4 G4 0.011 0.007 0.017 0.965 1.000
5 G5 0.013 0.005 0.021 0.961 1.000
Question #1: sometimes the number of columns varies, so how do I sum from column 2 to the last column? Something like data[2]:data[n].
library("plyr")
> VD<-function(P4, tot){
if(tot > 1) {return(P4-0.01)}
if(tot < 1) {return(P4+0.01)}
if(tot == 1) {return(P4)}
}
> minu<-ddply(data, 'geno', summarize, Result=VD(P4, tot))
> v <- data$geno==minu$geno
> data[v, "P4"] <- minu[v, "Result"]
> data <- subset(data, select = -tot)
> data$tot<-data$P1+data$P2+data$P3+data$P4
> head(data, 5)
geno P1 P2 P3 P4 tot
1 G1 0.02 0.01 0.03 0.94 1
2 G2 0.01 0.01 0.02 0.96 1
3 G3 0.01 0.01 0.02 0.96 1
4 G4 0.01 0.01 0.02 0.96 1
5 G5 0.01 0.01 0.02 0.96 1
Question #2: here, I need to round off 'tot' to 1 by adjusting P1 to P4.
Conditions:
1) I should adjust the maximum among P1 to P4.
2) The adjustment may differ, e.g. 0.01, 0.001, 0.0001 (it is based on 1 - tot).
How to do this?
Thanks in advance
For Question #1, to sum all columns except the first one (geno):
data$tot <- rowSums(data[, -1])
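For Question #2, here is a sketch of one way to do it (not from the original answer; it assumes the share columns are exactly P1 to P4): add the rounding difference 1 - tot to whichever of P1 to P4 is largest in each row, so that every row then sums to exactly 1.
p_cols <- c("P1", "P2", "P3", "P4")
for (i in seq_len(nrow(data))) {
  adj <- 1 - data$tot[i]                             # e.g. 0.01, 0.001, -0.001, ...
  if (adj != 0) {
    j <- p_cols[which.max(unlist(data[i, p_cols]))]  # largest of P1..P4 in this row
    data[i, j] <- data[i, j] + adj
  }
}
data$tot <- rowSums(data[, p_cols])                  # every row now sums to exactly 1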

Speed up `strsplit` when the possible outputs are known

I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
## Time Location Class Replicate Population
##1 1 1 LIDE.1.S 1 0.03859605
##2 2 1 LIDE.1.S 1 0.03852957
##3 3 1 LIDE.1.S 1 0.03846853
##4 4 1 LIDE.1.S 1 0.03841260
##5 5 1 LIDE.1.S 1 0.03836147
##6 6 1 LIDE.1.S 1 0.03831485
Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)
head(data.df)
## Time Location Species SizeClass Infected Replicate Population
##1 1 1 LIDE 1 S 1 0.03859605
##2 2 1 LIDE 1 S 1 0.03852957
##3 3 1 LIDE 1 S 1 0.03846853
##4 4 1 LIDE 1 S 1 0.03841260
##5 5 1 LIDE 1 S 1 0.03836147
##6 6 1 LIDE 1 S 1 0.03831485
summaryRprof("str.out")
$by.self
self.time self.pct total.time total.pct
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"do.call" 0.04 1.49 2.54 94.78
"unique.default" 0.04 1.49 0.04 1.49
"data.frame" 0.02 0.75 0.12 4.48
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
$by.total
total.time total.pct self.time self.pct
"do.call" 2.54 94.78 0.04 1.49
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"cbind" 0.14 5.22 0.00 0.00
"data.frame" 0.12 4.48 0.02 0.75
"as.data.frame.matrix" 0.08 2.99 0.00 0.00
"as.data.frame" 0.08 2.99 0.00 0.00
"as.factor" 0.08 2.99 0.00 0.00
"factor" 0.06 2.24 0.00 0.00
"unique.default" 0.04 1.49 0.04 1.49
"unique" 0.04 1.49 0.00 0.00
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
"[.data.frame" 0.02 0.75 0.00 0.00
"[" 0.02 0.75 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.68
Is there any way to speed up this operation? I note that there are a small (<5) number of each of the categories "Species", "SizeClass", and "Infected", and I know what these are in advance.
Notes:
stringr::str_split_fixed performs this task, but not any faster
The data frame is actually initially generated by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
data.rds at http://dl.getdropbox.com/u/3356641/data.rds
This should probably offer quite an increase:
library(data.table)
DT <- data.table(data.df)
DT[, c("Species", "SizeClass", "Infected") :=
     as.list(strsplit(as.character(Class), "\\.")[[1]]), by = Class]
The reasons for the increase:
data.table pre allocates memory for columns
every column assignment in data.frame reassigns the entirety of the data (data.table in contrast does not)
the by statement allows you to run the strsplit task once per unique value.
Here is a nice quick method for the whole process.
# Save the new col names as a character vector
newCols <- c("Species", "SizeClass", "Infected")
# split the string, then convert the new cols to columns
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]
# remove the old column. This is instantaneous.
DT[, Class := NULL]
## Have a look:
DT[, lapply(.SD, class)]
# Time Location Replicate Population Species SizeClass Infected
# 1: integer integer integer numeric factor factor factor
DT
You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:
data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)
# using `strsplit`
system.time({
  cl <- which(names(data.df) == "Class")
  Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
  colnames(Classes) <- c("Species", "SizeClass", "Infected")
  data.df <- cbind(data.df[, 1:(cl-1)], Classes, data.df[(cl+1):(ncol(data.df))])
})
   user  system elapsed
  3.349   0.062   3.411

# using `gsub`
system.time({
  data.df$Class <- as.character(data.df$Class)
  data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class, perl = TRUE)
  data.df$Infected  <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class, perl = TRUE)
  data.df$Class     <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class, perl = TRUE)
})
   user  system elapsed
  0.812   0.037   0.848
Looks like you have a factor, so work on the levels and then map back. Use fixed = TRUE in strsplit, adjusting the pattern to split = "." (no regex escaping needed).
Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
cbind(data.df, df0)
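As a more recent alternative (not in the original answers), data.table's tstrsplit() does the split-into-columns step in a single call:
library(data.table)
DT <- data.table(data.df)
# split the "Species.SizeClass.Infected" string into three new columns at once
DT[, c("Species", "SizeClass", "Infected") := tstrsplit(as.character(Class), ".", fixed = TRUE)]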
