Sum, round off and replace values in R

I have two questions.
> data<-read.table("UC.txt",header=TRUE, sep="\t")
> data$tot<-data$P1+data$P2+data$P3+data$P4
> head(data, 5)
geno P1 P2 P3 P4 tot
1 G1 0.015 0.007 0.026 0.951 0.999
2 G2 0.008 0.006 0.015 0.970 0.999
3 G3 0.009 0.006 0.017 0.968 1.000
4 G4 0.011 0.007 0.017 0.965 1.000
5 G5 0.013 0.005 0.021 0.961 1.000
Question #1: sometimes the number of columns varies, so how do I sum from column 2 to the last column? Something like data[2]:data[n].
library("plyr")
> VD <- function(P4, tot){
    if(tot > 1) {return(P4 - 0.01)}
    if(tot < 1) {return(P4 + 0.01)}
    if(tot == 1) {return(P4)}
  }
> minu<-ddply(data, 'geno', summarize, Result=VD(P4, tot))
> v <- data$geno==minu$geno
> data[v, "P4"] <- minu[v, "Result"]
> data <- subset(data, select = -tot)
> data$tot<-data$P1+data$P2+data$P3+data$P4
> head(data, 5)
geno P1 P2 P3 P4 tot
1 G1 0.02 0.01 0.03 0.94 1
2 G2 0.01 0.01 0.02 0.96 1
3 G3 0.01 0.01 0.02 0.96 1
4 G4 0.01 0.01 0.02 0.96 1
5 G5 0.01 0.01 0.02 0.96 1
Question #2: Here, I need to round off 'tot' to 1 by adjusting P1 to P4.
Conditions:
1) I should adjust the maximum among P1 to P4.
2) The adjusting value may differ, e.g. 0.01, 0.001, 0.0001 (it is based on 1 - tot).
How to do this?
Thanks in advance

For question 1, to sum all columns except the first one:
data$tot <- rowSums(data[, -1])
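For question 2, here is a base-R sketch (assuming, as in the data shown, that the P columns are everything between geno and tot): find the column holding each row's maximum with max.col, then add 1 - tot to just that cell via matrix indexing. The toy dat below merely mirrors the head() output in the question.

```r
# toy data mirroring the question (column names are assumptions)
dat <- data.frame(geno = c("G1", "G2"),
                  P1 = c(0.02, 0.008), P2 = c(0.01, 0.006),
                  P3 = c(0.03, 0.015), P4 = c(0.95, 0.970))
dat$tot <- rowSums(dat[, -1])              # question 1: sum column 2 to the last
pcols <- 2:(ncol(dat) - 1)                 # the P columns, however many there are
j     <- pcols[max.col(dat[, pcols])]      # column index of each row's maximum
adj   <- 1 - dat$tot                       # e.g. -0.01, 0.001, ... (based on 1 - tot)
ij    <- cbind(seq_len(nrow(dat)), j)      # (row, col) pairs for matrix indexing
dat[ij] <- dat[ij] + adj                   # nudge only the row maximum
dat$tot <- rowSums(dat[, pcols])           # recompute: every tot is now 1
```

Because the adjustment is 1 - tot itself, it automatically scales (0.01, 0.001, ...) as required by condition 2.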


if() statement with paste0() or grep() in R

I made a minimal reproducible example, but my real data is really huge.
ac_1 <- c(0.1, 0.3, 0.03, 0.03)
ac_2 <- c(0.2, 0.4, 0.1, 0.008)
ac_3 <- c(0.8, 0.043, 0.7, 0.01)
ac_4 <- c(0.2, 0.73, 0.1, 0.1)
c_2 <- c(1, 2, 5, 23)
check_1 <- c(0.01, 0.902, 0.02, 0.07)
check_2 <- c(0.03, 0.042, 0.002, 0.00001)
check_3 <- c(0.01, 0.02, 0.5, 0.001)
check_4 <- c(0.001, 0.042, 0.02, 0.2)
id <- 1:4
df <- data.frame(id, ac_1, ac_2, ac_3, ac_4, c_2, check_1, check_2, check_3, check_4)
so, the dataframe is like this:
> df
id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
1 1 0.10 0.200 0.800 0.20 1 0.010 0.03000 0.010 0.001
2 2 0.30 0.400 0.043 0.73 2 0.902 0.04200 0.020 0.042
3 3 0.03 0.100 0.700 0.10 5 0.020 0.00200 0.500 0.020
4 4 0.03 0.008 0.010 0.10 23 0.070 0.00001 0.001 0.200
and what I want to do is:
if check_1 is 0.02, I will make the corresponding ac_1 missing data.
if check_2 is 0.02, I will make the corresponding ac_2 missing data.
I will keep doing this for every "check" and "ac" column pair.
For example, in the check_1 column, the person with id 3 has 0.02,
so this person's ac_1 score (0.03) should become missing data (NA).
In the check_3 column, the person with id 2 has 0.02,
so this person's ac_3 score should become missing data.
In the check_4 column, the person with id 3 has 0.02,
so this person's ac_4 score should become missing data.
So, what I did is as follows:
for(i in 1:4){
  if(paste0("df$check_",i)==0.02){
    paste0("df$ac_",i)==NA
  }
}
But, it did not work...
You're really close, but you're off on a few fundamentals.
You can't (easily) use strings to refer to objects, so "df$check_1" won't work. You can use strings to refer to column names, but not with $, you need to use [ or [[, so df[["check_1"]] will work.
if isn't vectorized, so it won't work on each value in a column. Use ifelse instead, or even better in this case we can skip the if entirely.
Using == on non-integer numbers is risky due to precision issues. We'll use a tolerance instead.
Minor issue: paste0("df$ac_",i)==NA isn't right either; == is for checking equality, while you need = or <- for assignment on that line.
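To illustrate the precision point above:

```r
# binary floating point cannot represent 0.1, 0.2 or 0.3 exactly
0.1 + 0.2 == 0.3                  # FALSE
abs((0.1 + 0.2) - 0.3) < 1e-10    # TRUE: compare with a tolerance instead
```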
Addressing all of these issues:
for(i in 1:4){
  df[
    ## rows to replace
    abs(df[[paste0("check_", i)]] - 0.02) < 1e-10,
    ## column to replace
    paste0("ac_", i)
  ] <- NA
}
df
# id ac_1 ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
# 1 1 0.10 0.200 0.80 0.20 1 0.010 0.03000 0.010 0.001
# 2 2 0.30 0.400 NA 0.73 2 0.902 0.04200 0.020 0.042
# 3 3 NA 0.100 0.70 NA 5 0.020 0.00200 0.500 0.020
# 4 4 0.03 0.008 0.01 0.10 23 0.070 0.00001 0.001 0.200
It's often better to work with long-format data, even if just temporarily. Here is an example of doing so, using dplyr and tidyr:
library(dplyr)
library(tidyr)
pivot_longer(df, -c(id, c_2)) %>%
  separate(name, into = c("type", "pos")) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  mutate(ac = if_else(near(check, 0.02), as.double(NA), ac)) %>%
  pivot_wider(names_from = pos, values_from = ac:check)
(Updated with near() thanks to Gregor)
Output:
id c_2 ac_1 ac_2 ac_3 ac_4 check_1 check_2 check_3 check_4
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0.2 0.8 0.2 0.01 0.03 0.01 0.001
2 2 2 0.3 0.4 NA 0.73 0.902 0.042 0.02 0.042
3 3 5 NA 0.1 0.7 NA 0.02 0.002 0.5 0.02
4 4 23 0.03 0.008 0.01 0.1 0.07 0.00001 0.001 0.2

Convert column headers into new columns

My data frame consists of time series financial data from many public companies. I purposely set companies' weights as their column headers while cleaning the data, and I also calculated log returns for each of them in order to calculate weighted returns in the next step.
Here is an example. There are four companies: A, B, C and D, and their corresponding weights in the portfolio are 0.4, 0.3, 0.2 and 0.1 respectively. So the current data set looks like:
df1 <- data.frame(matrix(vector(),ncol=9, nrow = 4))
colnames(df1) <- c("Date","0.4","0.4.Log","0.3","0.3.Log","0.2","0.2.Log","0.1","0.1.Log")
df1[1,] <- c("2004-10-29","103.238","0","131.149","0","99.913","0","104.254","0")
df1[2,] <- c("2004-11-30","104.821","0.015","138.989","0.058","99.872","0.000","103.997","-0.002")
df1[3,] <- c("2004-12-31","105.141","0.003","137.266","-0.012","99.993","0.001","104.025","0.000")
df1[4,] <- c("2005-01-31","107.682","0.024","137.08","-0.001","99.782","-0.002","105.287","0.012")
df1
Date 0.4 0.4.Log 0.3 0.3.Log 0.2 0.2.Log 0.1 0.1.Log
1 2004-10-29 103.238 0 131.149 0 99.913 0 104.254 0
2 2004-11-30 104.821 0.015 138.989 0.058 99.872 0.000 103.997 -0.002
3 2004-12-31 105.141 0.003 137.266 -0.012 99.993 0.001 104.025 0.000
4 2005-01-31 107.682 0.024 137.08 -0.001 99.782 -0.002 105.287 0.012
I want to create new columns that contain company weights so that I can calculate weighted returns in my next step:
Date 0.4 0.4.W 0.4.Log 0.3 0.3.W 0.3.Log 0.2 0.2.W 0.2.Log 0.1 0.1.W 0.1.Log
1 2004-10-29 103.238 0.400 0.000 131.149 0.300 0.000 99.913 0.200 0.000 104.254 0.100 0.000
2 2004-11-30 104.821 0.400 0.015 138.989 0.300 0.058 99.872 0.200 0.000 103.997 0.100 -0.002
3 2004-12-31 105.141 0.400 0.003 137.266 0.300 -0.012 99.993 0.200 0.001 104.025 0.100 0.000
4 2005-01-31 107.682 0.400 0.024 137.080 0.300 -0.001 99.782 0.200 -0.002 105.287 0.100 0.012
We can try extracting the numeric column names with a regex and assigning each back as a constant new column:
v1 <- grep("^[0-9.]+$", names(df1), value = TRUE)
df1[paste0(v1, ".w")] <- as.list(as.numeric(v1))
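If you also want the new columns interleaved as in the desired output (value, weight, log-return per company), a sketch follows; it rebuilds a two-company version of df1 for brevity and uses the suffix ".W" to match the desired header:

```r
# two-company reconstruction of df1 (check.names = FALSE keeps "0.4" etc. as-is)
df1 <- data.frame(Date = c("2004-10-29", "2004-11-30"),
                  "0.4" = c(103.238, 104.821), "0.4.Log" = c(0, 0.015),
                  "0.3" = c(131.149, 138.989), "0.3.Log" = c(0, 0.058),
                  check.names = FALSE)
v1 <- grep("^[0-9.]+$", names(df1), value = TRUE)   # "0.4" "0.3"
df1[paste0(v1, ".W")] <- as.list(as.numeric(v1))    # one constant weight column each
# interleave: Date, then value / weight / log-return for every company
df1 <- df1[c("Date", as.vector(rbind(v1, paste0(v1, ".W"), paste0(v1, ".Log"))))]
```

rbind() builds a 3-row matrix of names, and as.vector() reads it column by column, which yields exactly the value/weight/log ordering.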

Working with unsplit in R

This question is related to this one, which I accepted too early, as it doesn't solve what I actually needed.
The data looks more like this:
m4 <- read.table(header=T, text='
model1 model2 model3 Output Model
0.13 0.113 0.18 0.4 m4
0.157 0.11 0.21 0.50 m4
0.058 0.03 0.18 0.46 m4 ')
m3 <- read.table(header=T, text='
model1 model2 model3 Output Model
0.13 0.113 0.18 0.4 m3
0.157 0.11 0.21 0.50 m3
0.058 0.03 0.18 0.46 m3 ')
m2 <- read.table(header=T, text='
model1 model2 model3 Output Model
0.200 0.099 NA NA m3
0.356 0.25 NA NA m3 ')
m1 <- read.table(header=T, text='
model1 model2 model3 Output Model
0.200 0.099 0.3 0.9 m1
0.35 0.252 0.4 0.9 m1 ')
models <- list(m4=m4, m3=m3, m2=m2, m1=m1)
EDIT1:
Desired result with unsplit:
model1 model2 model3 Output Model
0.200 0.099 0.3 0.9 m1
0.35 0.252 0.4 0.9 m1
0.13 0.113 0.18 0.4 m4
0.157 0.11 0.21 0.50 m4
0.058 0.03 0.18 0.46 m4
The desired solution must be within unsplit... that means: a grouping like (4, 4) means two rows of the 4th list entry; likewise (1, 1, 1) means the first list entry with 3 rows.
EDIT2: Can someone point me to where I can read more about unsplit? I cannot find anything, even in books.
EDIT 3: Now suppose that I have this helper function to provide the indexing for extraction from the list:
mat <- matrix(c(1,1,1,1,1),5,4)
mat[1,1] <- 0.66; mat[1,2] <- 0.33; mat[1,3] <- 0.33
mat[2,1] <- .66; mat[2,2] <- 0.33; mat[2,3] <- 0.33
extract <- apply(as.matrix(mat),1,which.max)
This is supposed to work:
unsplit(models, extract)
unsplit doesn't do what you think it does. To extract the 1st and 4th models, you just need your usual square bracket indexing.
models[c("m1", "m4")]
or
models[c(1, 4)]
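On the EDIT2 question: ?unsplit documents it as the inverse of split(), i.e. it reassembles a list that split() produced back into the original element order, which is why it is the wrong tool for plain extraction. A minimal demo:

```r
x <- c(10, 20, 30, 40)
f <- factor(c("a", "b", "a", "b"))
s <- split(x, f)       # $a: 10 30   $b: 20 40
unsplit(s, f)          # 10 20 30 40: the original order restored
```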
You could use rbind and just access the elements with [:
do.call(rbind, models[c("m1", "m4")])
model1 model2 model3 Output Model
m1.1 0.200 0.099 0.30 0.90 m1
m1.2 0.350 0.252 0.40 0.90 m1
m4.1 0.130 0.113 0.18 0.40 m4
m4.2 0.157 0.110 0.21 0.50 m4
m4.3 0.058 0.030 0.18 0.46 m4

find the index of max value in data frame and add the value

This is my data frame:
>head(dat)
geno P1 P2 P3 P4 dif
1 G1 0.015 0.007 0.026 0.951 0.001
2 G2 0.008 0.006 0.015 0.970 0.001
3 G3 0.009 0.006 0.017 0.968 0.000
4 G4 0.011 0.007 0.017 0.965 0.000
5 G5 0.013 0.005 0.021 0.961 0.000
6 G6 0.009 0.006 0.007 0.977 0.001
Here, I need to find the max in each row and add dat$dif to that max.
When I used which.max(dat[,-1]), I got this error:
Error in which.max(dat[,-1]) :
(list) object cannot be coerced to type 'double'
A previous answer (by Scriven) gives most of it but as others have stated, it incorrectly includes the last column. Here is one method that works around it:
idx <- (! names(dat) %in% c('geno','dif'))
dat$dif + apply(dat[,idx], 1, max)
# 1 2 3 4 5 6
# 0.952 0.971 0.968 0.965 0.961 0.978
You can easily put the idx stuff directly into the dat[,...] subsetting, but I broke it out here for clarity.
idx can be defined by numerous things here, such as "all but the first and last columns": idx <- names(dat)[-c(1, ncol(dat))]; or "anything that looks like P#": idx <- grep('^P[0-9]+', names(dat)).
There's an app (er, function) for that :-).
max.col finds the index of the maximum position for each row of a matrix. Note that, since max.col expects a matrix (numeric values only), you have to exclude the "geno" column when applying this function.
sapply(1:6,function(x) dat[x,max.col(dat[,2:5])[x] +1]) + dat$dif
[1] 0.952 0.971 0.968 0.965 0.961 0.978
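The sapply() loop can also be replaced by matrix indexing, which picks one cell per row in a single vectorized step. A sketch, rebuilding the first rows of the question's dat:

```r
# rebuild the question's data frame (first three rows)
dat <- data.frame(geno = paste0("G", 1:3),
                  P1 = c(0.015, 0.008, 0.009), P2 = c(0.007, 0.006, 0.006),
                  P3 = c(0.026, 0.015, 0.017), P4 = c(0.951, 0.970, 0.968),
                  dif = c(0.001, 0.001, 0.000))
pm <- as.matrix(dat[, 2:5])                          # numeric P columns only
dat$dif + pm[cbind(seq_len(nrow(dat)), max.col(pm))]
# [1] 0.952 0.971 0.968
```

The cbind(row, col) matrix selects exactly one element per row, so no explicit loop over rows is needed.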

ggplot Error: Don't know how to automatically pick scale for object of type function

I plotted a stacked bar graph in R using the ggplot2 package.
data<-read.table("K.txt",header=TRUE, sep="\t")
> data
Sample P1 P2 P3 P4
1 G1 0.02 0.01 0.03 0.95
2 G2 0.01 0.01 0.02 0.97
3 G3 0.01 0.01 0.02 0.97
4 G4 0.01 0.01 0.02 0.97
5 G5 0.01 0.01 0.02 0.96
6 G6 0.01 0.01 0.01 0.98
7 G7 0.05 0.01 0.01 0.93
8 G8 0.34 0.01 0.01 0.64
9 G9 0.43 0.01 0.01 0.56
> library("reshape2", lib.loc="C:/Program Files/R/R-2.15.2/library")
> data1<-melt(data)
Using Sample as id variables
> head(data1)
Sample variable value
1 G1 P1 0.02
2 G2 P1 0.01
3 G3 P1 0.01
4 G4 P1 0.01
5 G5 P1 0.01
> library("ggplot2", lib.loc="C:/Program Files/R/R-2.15.2/library")
ggplot(data=data1, aes(x=sample, y=value, fill=variable))+geom_bar(width=1)+scale_y_continuous(expand = c(0,0))+ opts(axis.text.x=theme_text(angle=90))
Don't know how to automatically pick scale for object of type function. Defaulting to continuous
Error in data.frame(x = function (x, size, replace = FALSE, prob = NULL) :
arguments imply differing number of rows: 0, 36
Can anyone help me sort out this error?
Many thanks,
Ramesh
Change sample (a built-in function) to Sample (your variable):
ggplot(data=data1, aes(x=Sample, y=value, fill=variable)) +
geom_bar(width=1) +
scale_y_continuous(expand = c(0,0)) +
opts(axis.text.x=theme_text(angle=90))
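Note that opts() and theme_text() date from ggplot2 0.9-era releases and have since been removed. In current ggplot2 the equivalent would be roughly the following (untested against the original K.txt data; geom_col() is used because the bar heights are pre-computed values):

```r
library(ggplot2)
ggplot(data1, aes(x = Sample, y = value, fill = variable)) +
  geom_col(width = 1) +                            # bars from pre-computed values
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 90))    # replaces opts(axis.text.x = theme_text(...))
```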
