I have a dataframe in R
> df
Dataset a1 b2 c3
1 past 0.0029 0.00250 0.0011
2 present 0.0035 0.00078 -0.0018
3 future1 0.0020 0.02100 0.0200
4 future2 0.0390 0.04000 0.0460
How can I multiply columns a1, b2 and c3 by the value 4193215.58165948?
You can use @Otto Kässi's approach and use cbind to bind the result to your Dataset column.
df <- cbind(Dataset = df$Dataset, df[,2:4] * 4193215.58165948)
You can do
df[,2:4] <- df[,2:4] * 4193215.58165948
Or a more explicit (and safe) approach:
v <- c("a1", "b2", "c3")
df[,v] <- df[,v] * 4193215.58165948
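If you would rather not list the columns at all, one option is to pick the numeric columns programmatically. This is only a sketch: the data frame is rebuilt from the values in the question, and num_cols and k are names made up for the illustration.

```r
# Data frame rebuilt from the values shown in the question
df <- data.frame(
  Dataset = c("past", "present", "future1", "future2"),
  a1 = c(0.0029, 0.0035, 0.0020, 0.0390),
  b2 = c(0.00250, 0.00078, 0.02100, 0.04000),
  c3 = c(0.0011, -0.0018, 0.0200, 0.0460)
)

k <- 4193215.58165948  # multiplier from the question

# Logical vector marking which columns are numeric (hypothetical helper name)
num_cols <- vapply(df, is.numeric, logical(1))

# Multiply only the numeric columns; Dataset is left untouched
df[num_cols] <- df[num_cols] * k
df
```

This assumes every numeric column should be scaled; if the data frame also holds numeric ID columns, naming the columns explicitly as above is the safer choice.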
I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.
ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")
df <- data.frame(ID,locus,Acceptable,Unacceptable)
I want to make another column, Unique_values, that returns all the unique values that are only present in Unacceptable, and that are not in Acceptable. So the output should be this.
I already have a poorly optimized method to find the duplicates between the two columns:
df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))
#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub(" +"," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)
(If anyone knows a faster method to return these duplicates, please let me know!)
I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.
Any suggestions?
A similar approach using setdiff:
lA <- strsplit(df$Acceptable, " ")
lU <- strsplit(df$Unacceptable, " ")
df$Unique_values <- sapply(seq_len(nrow(df)), function(i) paste(setdiff(lU[[i]], lA[[i]]), collapse = " "))
df
#>   ID locus          Acceptable        Unacceptable Unique_values
#> 1  1     A         K44 Z33 G49         K44 Z33 G48           G48
#> 2  1     B           K72 QR123       K72 QR123 B22           B22
#> 3  1     C    B92 N12 PQ16 G99    B92 N12 PQ16 G99
#> 4  2     A                  V3            V3 N9 Q7         N9 Q7
#> 5  2     B L89 I203 UPA66 QF29 L89 I203 UPA66 QF29
#> 6  2     C                                      B8            B8
#> 7  3     A         K44 Z33 K72             K44 Z33
#> 8  3     B            B92 PQ14
#> 9  3     C        J22 M43 VC78        J22 M43 VC78
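The question also asked for a quicker way to get the values shared by both columns. A possible sketch, assuming "duplicates" means tokens present in both cells of a row: use intersect for the shared values and setdiff for the Unacceptable-only values, both driven by mapply. The three-row data frame is a subset of the question's data, and split_tokens is a helper name invented here.

```r
# A few rows taken from the question's data
df <- data.frame(
  Acceptable   = c("K44 Z33 G49", "V3", " "),
  Unacceptable = c("K44 Z33 G48", "V3 N9 Q7", "B8"),
  stringsAsFactors = FALSE
)

# Trim first, then split on runs of whitespace, so a blank cell
# becomes an empty token set instead of a set containing ""
split_tokens <- function(x) strsplit(trimws(x), "\\s+")

lA <- split_tokens(df$Acceptable)
lU <- split_tokens(df$Unacceptable)

# Tokens only in Unacceptable
df$Unique_values <- mapply(
  function(u, a) paste(setdiff(u, a), collapse = " "), lU, lA
)

# Tokens present in both columns
df$Duplicate_values <- mapply(
  function(u, a) paste(intersect(u, a), collapse = " "), lU, lA
)
df
```

Note that intersect and setdiff drop repeated tokens within a cell, which may or may not match what the paste-and-duplicated approach returns for your real data.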
Here is the df:
# A tibble: 6 x 5
t a b c d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3999. 0.00586 0.00986 0.00728 0.00856
2 3998. 0.0057 0.00958 0.00702 0.00827
3 3997. 0.00580 0.00962 0.00711 0.00839
4 3996. 0.00602 0.00993 0.00726 0.00875
I want to compute the mean of each row, excluding the first column. The code I wrote:
df$Mean <- rowMeans(df[select(df, -"t")])
The error I get:
Error: Must subset columns with a valid subscript vector.
x Subscript `select(group1, -"t")` has the wrong type `tbl_df<
p2 : double
p8 : double
p10: double
p9 : double
>`.
ℹ It must be logical, numeric, or character.
I tried to convert df to matrix, but then I get another error. How should I solve this?
Now I'm trying to calculate standard error using the code:
se <- function(x){sd(df[,x])/sqrt(length(df[,x]))}
sapply(group1[,2:5],se)
I try to indicate which columns should be used to calculate the error, but again an error pops up:
Error: Must subset columns with a valid subscript vector.
x Can't convert from `x` <double> to <integer> due to loss of precision.
I have used valid column subscripts, so I don't understand why the error occurs.
A similar base R solution would be:
df$Mean <- rowMeans(df[,-1], na.rm = TRUE)
Output:
t a b c d Mean
1 3999 0.00586 0.00986 0.00728 0.00856 0.0078900
2 3998 0.00570 0.00958 0.00702 0.00827 0.0076425
3 3997 0.00580 0.00962 0.00711 0.00839 0.0077300
4 3996 0.00602 0.00993 0.00726 0.00875 0.0079900
We can use setdiff to return the columns that are not 't' and then get the rowMeans. This works wherever the column 't' sits, because it is excluded by name rather than by position.
df$Mean <- rowMeans(df[setdiff(names(df), "t")], na.rm = TRUE)
df
# t a b c d Mean
#1 3999 0.00586 0.00986 0.00728 0.00856 0.0078900
#2 3998 0.00570 0.00958 0.00702 0.00827 0.0076425
#3 3997 0.00580 0.00962 0.00711 0.00839 0.0077300
#4 3996 0.00602 0.00993 0.00726 0.00875 0.0079900
select from dplyr returns a subset of the data.frame, not column names or indices, so its result cannot be used as a subscript inside df[...]. Instead, we can apply rowMeans to it directly:
library(dplyr)
rowMeans(select(df, -t), na.rm = TRUE)
Or in a pipe
df <- df %>%
mutate(Mean = rowMeans(select(., -t), na.rm = TRUE))
Update
If we need to get the standard error per row, we can use apply with MARGIN as 1
apply(df[setdiff(names(df), 't')], 1,
function(x) sd(x)/sqrt(length(x)))
Or with rowSds from matrixStats
library(matrixStats)
rowSds(as.matrix(df[setdiff(names(df), 't')]))/sqrt(ncol(df)-1)
data
df <- structure(list(t = c(3999, 3998, 3997, 3996), a = c(0.00586,
0.0057, 0.0058, 0.00602), b = c(0.00986, 0.00958, 0.00962, 0.00993
), c = c(0.00728, 0.00702, 0.00711, 0.00726), d = c(0.00856,
0.00827, 0.00839, 0.00875)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
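As for why the se function in the question fails: sapply hands each column's values to the function as x, so df[, x] ends up trying to subset by those values. A minimal base R sketch, assuming the function should just work on the vector it receives (value_cols and col_se are names chosen for the illustration):

```r
# Same data as in the question
df <- data.frame(
  t = c(3999, 3998, 3997, 3996),
  a = c(0.00586, 0.0057, 0.0058, 0.00602),
  b = c(0.00986, 0.00958, 0.00962, 0.00993),
  c = c(0.00728, 0.00702, 0.00711, 0.00726),
  d = c(0.00856, 0.00827, 0.00839, 0.00875)
)

# Fix the set of value columns once, before adding any new columns,
# so later results do not leak into the calculations
value_cols <- setdiff(names(df), "t")

# Standard error of a numeric vector; x is the data itself, not an index
se <- function(x) sd(x) / sqrt(length(x))

# Per-column standard error (what the sapply call was aiming for)
col_se <- sapply(df[value_cols], se)

# Per-row mean and standard error
df$Mean <- rowMeans(df[value_cols], na.rm = TRUE)
df$SE   <- apply(df[value_cols], 1, se)
df
```

Storing value_cols up front matters: once Mean has been added, setdiff(names(df), "t") would include it in any later row-wise calculation.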
I have a data frame with time series financial data, and I want to calculate the log returns for each of them.
Here is a simplified example (in reality, I have hundreds of columns):
df <- data.frame(Date=c("2004/10/29","2004/11/30","2004/12/31","2005/01/31"), B126 =c("103.238","104.821","105.141","107.682"), H251 =c("131.149","138.989","137.266","137.080"))
df
Date B126 H251
1 2004/10/29 103.238 131.149
2 2004/11/30 104.821 138.989
3 2004/12/31 105.141 137.266
4 2005/01/31 107.682 137.080
I want to get the following:
Date B126 Log H251 Log
1 2004/10/29 103.238 131.149
2 2004/11/30 104.821 0.0152 138.989 0.0580
3 2004/12/31 105.141 0.0030 137.266 -0.0124
4 2005/01/31 107.682 0.0238 137.080 -0.0013
I know how to get log returns for each column by using:
logB126 <- as.numeric(as.character(df$B126))
log_returns <- diff(log(logB126), lag = 1)
It is impossible for me to repeat the above steps hundreds of times, so I'm wondering if there is a better way to perform the task.
You could use plyr::colwise
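# the columns are stored as text, so convert to numeric before taking logs
calc_log_return <- function(x) diff(log(as.numeric(as.character(x))), lag = 1)
logReturns <- plyr::colwise(calc_log_return)(df[, -1])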
This will make a new data.frame of just the log returns. You can easily append the dates column.
A simple for loop should do the job:
df2 <- df[,1:2]
for(name in names(df)[2:length(names(df))]){
df2[,name] <- df[,name]
df2[2:nrow(df2),paste0(name, ".Log")] <- diff(log(as.numeric(as.character(df[,name]))), lag = 1)
}
head(df2)
We can use mutate_each from dplyr
library(dplyr)
df %>% mutate_each(funs(round(c(NA, diff(log(as.numeric(as.character(.))))),3)),
B126:H251)
# Date B126 H251
#1 2004/10/29 NA NA
#2 2004/11/30 0.015 0.058
#3 2004/12/31 0.003 -0.012
#4 2005/01/31 0.024 -0.001
Another dplyr solution. After applying mutate_each, use merge from base R to add the new columns to the original data.
library(dplyr)
# clean up data (convert strings to numbers)
df <- df %>% mutate_each(funs(as.numeric(as.character(.))), B126:H251)
# calculate log diff and merge
df <- df %>% merge(df %>% mutate_each(funs(c(NA,diff(log(.)))), B126:H251), by='Date', suffixes=c('','_log'))
# optionally apply rounding function
df %>% mutate_each(funs(round(.,3)), B126_log:H251_log)
Output:
Date B126 H251 B126_log H251_log
1 2004/10/29 103.238 131.149 NA NA
2 2004/11/30 104.821 138.989 0.015 0.058
3 2004/12/31 105.141 137.266 0.003 -0.012
4 2005/01/31 107.682 137.080 0.024 -0.001
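For completeness, the same result can be had in base R without any packages. This is a sketch under the assumption that every column other than Date holds prices; price_cols, log_ret and result are names picked for the example.

```r
# Data from the question (prices stored as text)
df <- data.frame(
  Date = c("2004/10/29", "2004/11/30", "2004/12/31", "2005/01/31"),
  B126 = c("103.238", "104.821", "105.141", "107.682"),
  H251 = c("131.149", "138.989", "137.266", "137.080"),
  stringsAsFactors = FALSE
)

# Assumption: everything except Date is a price column
price_cols <- setdiff(names(df), "Date")

# Convert text (or factor) prices to numbers
df[price_cols] <- lapply(df[price_cols], function(x) as.numeric(as.character(x)))

# Log return for each column, padded with NA for the first row
log_ret <- lapply(df[price_cols], function(x) c(NA, diff(log(x))))
names(log_ret) <- paste0(price_cols, "_log")

# Bind the new columns next to the original data
result <- cbind(df, as.data.frame(log_ret))
round(result[paste0(price_cols, "_log")], 3)
```

Because lapply walks over price_cols, this scales to hundreds of columns without naming any of them.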
This may have been asked many times here, but I cannot relate the existing answers to my case, since my function returns a data frame.
I have a custom function that fits a model and returns a data frame with the slope (coeff2) in one column, the intercept (coeff1) in another, the number of input records in a third, and so on. In other words, I build the data frame inside the function and return it. Now I want to subset my input data frame based on a column and apply my function to each subset.
Example:
f.get_reg <- function(df) {
linear.model <- lm(DM ~ FW, data = df)
N <- length(df$DM)
slope <- coef(linear.model)[2]
intercept <- coef(linear.model)[1]
S <- summary(linear.model)$sigma
df.out <- data.frame (N,slope, intercept, S)
return (df.out)
}
sample_id FW DM StdDev_DM Median_DM Count X90 X60 crit Z.scores
6724 116.39 16.20690 0.9560414 16.0293 60 3.35 3.2 3.2 1
6724 116.39 16.20690 0.9560414 16.0293 60 3.35 3.2 3.2 1
6724 110.24 16.73077 0.9560414 16.0293 60 3.35 3.2 3.2 1
6728 110.24 16.73077 0.9560414 16.0293 60 3.35 3.2 3.2 1
6728 112.81 16.15542 0.9560414 16.0293 60 3.35 3.2 3.2 1
6728 112.81 16.15542 0.9560414 16.0293 60 3.35 3.2 3.2 1
Now I want to apply my function to each unique subset of sample_ids and output only one data frame with one record as an output for each subset.
dplyr
You could use do in dplyr:
library(dplyr)
df %>%
group_by(sample_id) %>%
do(f.get_reg(.))
Which gives:
sample_id N slope intercept S
(int) (int) (dbl) (dbl) (dbl)
1 6724 3 -0.08518211 26.12125 7.716050e-15
2 6728 3 -0.22387160 41.41037 5.551115e-17
data.table
Use .SD in data.table:
library(data.table)
df <- data.table(df)
df[,f.get_reg(.SD),sample_id]
Which gives the same result:
sample_id N slope intercept S
1: 6724 3 -0.08518211 26.12125 7.716050e-15
2: 6728 3 -0.22387160 41.41037 5.551115e-17
base R
Using by:
resultList <- by(df,df$sample_id,f.get_reg)
sample_id <- names(resultList)
result <- do.call(rbind,resultList)
result$sample_id <- sample_id
rownames(result) <- NULL
Which gives:
N slope intercept S sample_id
1 3 -0.08518211 26.12125 7.716050e-15 6724
2 3 -0.22387160 41.41037 5.551115e-17 6728
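Another base R variant uses split and lapply, which keeps the group labels as list names. This is a sketch only: the FW and DM values below are made up for the illustration (the sample_id values are borrowed from the question), and the function is a lightly tidied copy of f.get_reg.

```r
# Lightly tidied copy of the question's function
f.get_reg <- function(d) {
  fit <- lm(DM ~ FW, data = d)
  data.frame(
    N         = nrow(d),
    slope     = unname(coef(fit)[2]),
    intercept = unname(coef(fit)[1]),
    S         = summary(fit)$sigma
  )
}

# Made-up toy data: two groups of four observations each
df <- data.frame(
  sample_id = rep(c(6724, 6728), each = 4),
  FW        = rep(c(1, 2, 3, 4), times = 2),
  DM        = c(2.1, 3.9, 6.2, 7.8, 1.0, 1.9, 3.2, 3.9)
)

# One data frame per sample_id, named by the group value
pieces <- split(df, df$sample_id)

# Apply the function to each piece and stack the one-row results
stacked <- do.call(rbind, lapply(pieces, f.get_reg))

# Put the group label back as a regular column
result <- data.frame(sample_id = names(pieces), stacked, row.names = NULL)
result
```

The sample_id column comes back as text here because it is taken from the list names; convert it with as.numeric if you need the original type.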
I want to create a new column "non_coded" from three existing columns: allele_2, allele_1 and A1.
The conditions I want satisfied are:
if allele_2 == A1 then non_coded = allele_1
if allele_2 != A1 then non_coded = allele_2
Thanks in advance,
Rad
OK This is what the data looks like:
SNPID chrom STRAND IMPUTED allele_2 allele_1 MAF CALL_RATE HET_RATE
1 rs1000000 12 + Y A G 0.12160 1.00000 0.2146
2 rs10000009 4 + Y G A 0.07888 0.99762 0.1386
HWP RSQ PHYS_POS A1 M1_FRQ M1_INFO M1_BETA M1_SE M1_P
1 1.0000 0.9817 125456933 A 0.1173 0.9452 -0.0113 0.0528 0.83090
2 0.1164 0.8354 71083542 A 0.9048 0.9017 -0.0097 0.0593 0.87000
The code I tried:
Hy_MVA$non_coded <- ifelse(Hy_MVA$allele_2 == Hy_MVA$A1, Hy_MVA$allele_1, Hy_MVA$allele_2)
result:
SNPID chrom STRAND IMPUTED allele_2 allele_1 MAF CALL_RATE HET_RATE
1 rs1000000 12 + Y A G 0.12160 1.00000 0.2146
2 rs10000009 4 + Y G A 0.07888 0.99762 0.1386
HWP RSQ PHYS_POS A1 M1_FRQ M1_INFO M1_BETA M1_SE M1_P non_coded
1 1.0000 0.9817 125456933 A 0.1173 0.9452 -0.0113 0.0528 0.83090 3
2 0.1164 0.8354 71083542 A 0.9048 0.9017 -0.0097 0.0593 0.87000 3
What I want:
SNPID chrom STRAND IMPUTED allele_2 allele_1 MAF CALL_RATE HET_RATE
1 rs1000000 12 + Y A G 0.12160 1.00000 0.2146
2 rs10000009 4 + Y G A 0.07888 0.99762 0.1386
HWP RSQ PHYS_POS A1 M1_FRQ M1_INFO M1_BETA M1_SE M1_P non_coded
1 1.0000 0.9817 125456933 A 0.1173 0.9452 -0.0113 0.0528 0.83090 G
2 0.1164 0.8354 71083542 A 0.9048 0.9017 -0.0097 0.0593 0.87000 G
As Chase said, use ifelse(). I guess the code then becomes:
non_coded <- ifelse(allele_2 == A1, allele_1, allele_2)
Edit
After seeing the updated question, it makes sense that you get numbers, because allele_1 and allele_2 are factors. Wrapping them in as.character() should fix this:
A1 <- c("A","A","B")
allele_1 <- as.factor(c("A","C","C"))
allele_2 <- as.factor(c("A","B","B"))
non_coded <- ifelse(allele_2 == A1, as.character(allele_1), as.character(allele_2))
non_coded
[1] "A" "B" "C"
Since you want non_coded to be one of two values:
Hy_MVA$non_coded <- as.character(Hy_MVA$allele_2)
swap <- as.character(Hy_MVA$allele_2) == as.character(Hy_MVA$A1)
Hy_MVA$non_coded[swap] <- as.character(Hy_MVA$allele_1[swap])
That replaces values with allele_1 values only in the rows where allele_2 == A1. It sounds as though your problem is ifelse converting a factor to a numeric; converting to character first, as here, avoids that.
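To see where the numbers come from, here is a small self-contained sketch using the two rows shown in the question, with the columns assumed to be factors (codes and labels are names invented for the demonstration):

```r
# The two rows from the question, with the allele columns as factors
allele_1 <- factor(c("G", "A"))
allele_2 <- factor(c("A", "G"))
A1       <- c("A", "A")

# ifelse() drops the factor attributes, so this yields the internal
# integer level codes rather than the letters
codes <- ifelse(allele_2 == A1, allele_1, allele_2)
codes

# Converting to character first keeps the letters
labels <- ifelse(allele_2 == A1,
                 as.character(allele_1),
                 as.character(allele_2))
labels
```

The exact codes you see depend on how many levels your factors have, which is why the real data shows a different number than this toy example would.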