How can I loop through a consecutive window? - r

I have a df like this:
> df
symbol x1 x2
1 A 3.6 5.2
2 A 10.0 4.8
3 A 5.2 0.2
4 A -10.2 0.4
5 A 5.4 -2.5
6 B 9.9 6.5
7 B 15.8 -1.8
8 B 4.5 -5.9
9 C -2.0 0.5
10 C -10.0 2.6
11 C 7.7 8.9
12 C 10.5 18.5
I want to calculate the r squared between x1 and x2 column by symbol so I want to get a new df like this
symbol r squared
1 A 0.27
2 B 0.30
3 C 0.68
I use ifelse but it isn't working.
for (i in 1:12){
results[i] <- ifelse(df$symbol == symbollist[i], summary(lm(df$x1~df$x2))$r.squared,0)
}
How can I solve this problem in R?

You can use by to perform lm for each symbol:
by(df, df$symbol, function(x) summary(lm(x1~x2, x))$r.squared)
df$symbol: A
[1] 0.07445258
-----------------------------------------------------------------------------------------------------------
df$symbol: B
[1] 0.09014209
-----------------------------------------------------------------------------------------------------------
df$symbol: C
[1] 0.687236
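If a data frame is wanted rather than the printed by object, the same per-symbol regression can be collected with split and sapply. This is a minimal sketch that rebuilds the question's data; the names r2 and result are illustrative and not part of the original answer.

```r
# Rebuild the example data from the question
df <- data.frame(
  symbol = rep(c("A", "B", "C"), times = c(5, 3, 4)),
  x1 = c(3.6, 10.0, 5.2, -10.2, 5.4, 9.9, 15.8, 4.5, -2.0, -10.0, 7.7, 10.5),
  x2 = c(5.2, 4.8, 0.2, 0.4, -2.5, 6.5, -1.8, -5.9, 0.5, 2.6, 8.9, 18.5)
)

# Fit one regression per symbol and keep only the R squared
r2 <- sapply(split(df, df$symbol),
             function(d) summary(lm(x1 ~ x2, data = d))$r.squared)

# Collect the named vector into a data frame with one row per symbol
result <- data.frame(symbol = names(r2), r_squared = unname(r2))
result
```

The values should agree with the by output above, since both fit lm(x1 ~ x2) on each subset.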

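You can use the dplyr package for this. Note that cor() returns the correlation, so square it to get the R squared of a simple regression. Try:
library(dplyr)
result <- df %>%
  group_by(symbol) %>%
  summarize(r_squared = cor(x1, x2)^2)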

Related

Setting values to NA in one column based on conditions in another column

Here's a simplified mock dataframe:
df1 <- data.frame(amb = c(2.5,3.6,2.1,2.8,3.4,3.2,1.3,2.5,3.2),
warm = c(3.6,5.3,2.1,6.3,2.5,2.1,2.4,6.2,1.5),
sensor = c(1,1,1,2,2,2,3,3,3))
I'd like to set all values in the "amb" column to NA if they're in sensor 1, but retain the values in the "warm" column for sensor 1. Here's what I'd like the final output to look like:
amb warm sensor
NA 3.6 1
NA 5.3 1
NA 2.1 1
2.8 6.3 2
3.4 2.5 2
3.2 2.1 2
1.3 2.4 3
2.5 6.2 3
3.2 1.5 3
Using R version 4.0.2, Mac OS X 10.13.6
A possible solution, based on dplyr:
library(dplyr)
df1 %>%
mutate(amb = ifelse(sensor == 1, NA, amb))
#> amb warm sensor
#> 1 NA 3.6 1
#> 2 NA 5.3 1
#> 3 NA 2.1 1
#> 4 2.8 6.3 2
#> 5 3.4 2.5 2
#> 6 3.2 2.1 2
#> 7 1.3 2.4 3
#> 8 2.5 6.2 3
#> 9 3.2 1.5 3
This seems to be best handled with the vectorized replacement function is.na<-:
is.na(df1$amb) <- df1$sensor %in% c(1) # that c() isn't needed
To be more general and allow for floating-point error when testing for equality, compare the absolute difference against a small tolerance (the 1e-16 used here is very tight, so a larger value may be more practical):
is.na(df1$amb) <- abs(df1$sensor - 1) < 1e-16
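The tolerance idea can also be written with plain logical indexing. The sketch below assumes the df1 from the question; the tolerance tol and the helper vector near_one are illustrative choices, not something the answers above prescribe.

```r
df1 <- data.frame(amb = c(2.5, 3.6, 2.1, 2.8, 3.4, 3.2, 1.3, 2.5, 3.2),
                  warm = c(3.6, 5.3, 2.1, 6.3, 2.5, 2.1, 2.4, 6.2, 1.5),
                  sensor = c(1, 1, 1, 2, 2, 2, 3, 3, 3))

# Assumed tolerance: the square root of machine epsilon (about 1.5e-8)
tol <- sqrt(.Machine$double.eps)

# TRUE where sensor is 1 up to floating-point error
near_one <- abs(df1$sensor - 1) < tol

# Blank out amb only for those rows; warm is left untouched
df1$amb[near_one] <- NA
df1
```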

R - How to show the real order from 1.1 to 1.12?

In my dataset, there are values from 1.1 to 1.12
R, by default, orders them showing 1.10, 1.11 and 1.12 before 1.2.
Example:
## V1 V2
## A 1.1
## B 1.10
## J 1.11
## K 1.12
## G 1.2
## D 1.3
## E 1.4
## J 1.5
## G 1.6
## T 1.7
## R 1.8
## O 1.9
How can I change the order of V2 ordering the values from 1.1 to 1.12?
(In this way)
## V1 V2
## A 1.1
## G 1.2
## D 1.3
## E 1.4
## J 1.5
## G 1.6
## T 1.7
## R 1.8
## O 1.9
## B 1.10
## J 1.11
## K 1.12
Of course my real dataset is more complex, but I believe that a solution for this case could solve my problem.
Thanks!
If the column V2 is of class "character", the following code using function str_order of package stringr will order the data.frame by numeric order of V2.
i <- stringr::str_order(df1$V2, numeric = TRUE)
df1 <- df1[i, ]
df1
# V1 V2
#1 A 1.1
#5 E 1.2
#6 F 1.3
#7 G 1.4
#8 H 1.5
#9 I 1.6
#10 J 1.7
#11 K 1.8
#12 L 1.9
#2 B 1.10
#3 C 1.11
#4 D 1.12
I have left the row names unordered to show that the entire rows have changed their position, not just column V2. To reset the row names,
row.names(df1) <- NULL
Test data
V1 <- LETTERS[1:12]
V2 <- sprintf("1.%d", 1:12)
V2 <- sort(V2)
df1 <- data.frame(V1, V2)
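If adding stringr is not an option, a base R sketch along the same lines is to treat the values as version numbers. This assumes every entry of V2 is a valid dotted version string such as "1.10"; the name df1_sorted is illustrative.

```r
# Same test data as above: a character sort puts "1.10" before "1.2"
V1 <- LETTERS[1:12]
V2 <- sort(sprintf("1.%d", 1:12))
df1 <- data.frame(V1, V2, stringsAsFactors = FALSE)

# numeric_version() compares each dot-separated component as an integer
i <- order(numeric_version(df1$V2))
df1_sorted <- df1[i, ]
df1_sorted
```

As with str_order, whole rows move together, so V1 stays attached to its V2 value.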
Here's a tidyverse solution:
library(tidyverse)
data <- data.frame(V1 = LETTERS[seq(1, 20)], V2 = paste0(1, ".", seq(1,20)))
data_sorted <- data %>%
  # convert = TRUE turns the pieces into integers so they sort numerically, not alphabetically
  separate(V2, sep = "\\.", into = c("left", "right"), remove = FALSE, convert = TRUE) %>%
  arrange(left, right) %>%
  select(-left, -right)
data_sorted
#> V1 V2
#> 1 A 1.1
#> 2 B 1.2
#> 3 C 1.3
#> 4 D 1.4
#> 5 E 1.5
#> 6 F 1.6
#> 7 G 1.7
#> 8 H 1.8
#> 9 I 1.9
#> 10 J 1.10
#> 11 K 1.11
#> 12 L 1.12
#> 13 M 1.13
#> 14 N 1.14
#> 15 O 1.15
#> 16 P 1.16
#> 17 Q 1.17
#> 18 R 1.18
#> 19 S 1.19
#> 20 T 1.20
Created on 2021-01-08 by the reprex package (v0.3.0)

How to get Mean SD and Pvalue for multiple groups in r?

My data frame looks like this:
category class test1 test2
1 Yes 5.5 4.2
1 No 5.8 4.3
1 Yes 6.6 3.2
2 Yes 6 7.7
2 No 5.7 5.8
3 No 9.7 4.5
3 Yes 6.8 8.5
2 No 6.3 9.6
3 Yes 8.5 2.6
I want to calculate the mean, SD, and p values (between test1 and test2) based on class and category respectively.
I used dplyr to calculate mean and SD and now I am struggling to calculate the p value, as my dataset contains 1000 lines, 4 different categories, and 8 classes.
Here is what I get after using dplyr for the mean and sd:
category class test1_Mean test1_SD test2_Mean test2_SD
1 Yes 6 1 3.7 1.1
1 No 5.8 0 4.3 0
2 Yes 9.6 0 4.4 0
2 No 6 1.1 7.7 1
3 Yes 7.6 0.5 5.5 0.8
3 No 9.7 0 4.5 0
The output I want is:
category class test1_Mean test1_SD test2_Mean test2_SD Pvalue
1 Yes 6 1 3.7 1.1 0.05
1 No 5.8 0 4.3 0 0.14
2 Yes 9.6 0 4.4 0 0.69
2 No 6 1.1 7.7 1 0.001
3 Yes 7.6 0.5 5.5 0.8 2.00E+05
3 No 9.7 0 4.5 0 0.04
Thanks in advance.
You can try:
library(dplyr)
df %>%
  group_by(category, class) %>%
  summarise(pvalue = t.test(test1, test2)$p.value)
I think this is what you are looking for:
library(dplyr)
df %>% group_by(category, class) %>%
  summarise(test1_mean = mean(test1), test2_mean = mean(test2),
            test1_SD = sd(test1), test2_SD = sd(test2),
            pvalue = t.test(test1, test2)$p.value)
An option with data.table:
library(data.table)
setDT(df)[, .(pvalue = t.test(test1, test2)$p.value), .(category, class)]
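One caveat with the approaches above: t.test stops with an error when a group has fewer than two observations, which happens for several groups in the sample rows. A minimal base R sketch of a guard follows; it assumes the grouping column is named class, and safe_p is a hypothetical helper, not a function from any package.

```r
# Sample rows from the question (grouping column assumed to be named "class")
df <- data.frame(
  category = c(1, 1, 1, 2, 2, 3, 3, 2, 3),
  class    = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"),
  test1    = c(5.5, 5.8, 6.6, 6, 5.7, 9.7, 6.8, 6.3, 8.5),
  test2    = c(4.2, 4.3, 3.2, 7.7, 5.8, 4.5, 8.5, 9.6, 2.6)
)

# Hypothetical helper: return NA instead of failing on groups that are too small
safe_p <- function(x, y) {
  if (length(x) < 2 || length(y) < 2) return(NA_real_)
  t.test(x, y)$p.value
}

# One p value per category/class combination that actually occurs
groups <- split(df, list(df$category, df$class), drop = TRUE)
pvals <- sapply(groups, function(g) safe_p(g$test1, g$test2))
pvals
```

The same helper can be dropped into summarise() or the data.table call in place of the bare t.test(...)$p.value.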

R dplyr left_join error - missing values produced when joining values rounded to one decimal place

UPDATE: Following comments (thanks!) I have now successfully merged these datasets by converting the ratios from 'double' to 'character'. I am still interested to know why R has trouble merging rounded 'double' variables though (the problem also occurs when using merge and sqldf and is not restricted to dplyr).
I have been having trouble with merging dataframes using dplyr’s left_join and ‘double’ variables. I’m attempting to join small tables of data (only 20 rows) on the basis of a grouping of the ratio of target to predictor to identify what proportion of these ratios fall within 5% bandings for training, validation and test set results (from models produced by h2o).
To produce the groups of ratio I’ve simply rounded them to one decimal place (so the ratio variable remains a ‘double’ type like the original calculated value). However, when I attempt to merge on these rounded values, random rows of NAs appear in the merged dataframe.
Here’s an example of the code and the tables produced - the final table has a row of missing values for the ratios rounded to 0.7, but all are present in the original tables.
Is this a possible bug?
(I know that I could do the calculations more efficiently in dplyr, but as it’s not merging at the moment there seems little point in tidying things up).
trainR <- mutate(trainR, ratio_GBM = round((ZZ_TOP_PRICE_R/pred_GBM),digits=1))
> train_pcs_GBM <- count(trainR, ratio_GBM)
> train_pcs_GBM <- rename(train_pcs_GBM, train_count_GBM = n)
> train_pcs_GBM <- mutate(train_pcs_GBM, train_pc_GBM = round((train_count_GBM/length(trainR$ratio_GBM)*100),1))
> train_pcs_GBM
# A tibble: 32 x 3
ratio_GBM train_count_GBM train_pc_GBM
<dbl> <int> <dbl>
1 0.0 1 0.0
2 0.1 15 0.0
3 0.2 302 0.1
4 0.3 440 0.1
5 0.4 1575 0.3
6 0.5 23304 4.6
7 0.6 62381 12.4
8 0.7 71098 14.1
9 0.8 65792 13.0
10 0.9 66991 13.3
# ... with 22 more rows
> #View(train_pcs_GBM)
>
> validR <- mutate(validR, ratio_GBM = round((ZZ_TOP_PRICE_R/pred_GBM),digits=1))
> valid_pcs_GBM <- count(validR, ratio_GBM)
> valid_pcs_GBM <- rename(valid_pcs_GBM, valid_count_GBM = n)
> valid_pcs_GBM <- mutate(valid_pcs_GBM, valid_pc_GBM = round((valid_count_GBM/length(validR$ratio_GBM)*100),1))
> valid_pcs_GBM
# A tibble: 34 x 3
ratio_GBM valid_count_GBM valid_pc_GBM
<dbl> <int> <dbl>
1 0.1 9 0.0
2 0.2 148 0.1
3 0.3 168 0.1
4 0.4 688 0.4
5 0.5 8058 4.8
6 0.6 20980 12.5
7 0.7 23446 14.0
8 0.8 21932 13.1
9 0.9 22030 13.1
10 1.0 19342 11.5
# ... with 24 more rows
> View(valid_pcs_GBM)
>
> testR <- mutate(testR, ratio_GBM = round((ZZ_TOP_PRICE_R/pred_GBM),digits=1))
> test_pcs_GBM <- count(testR, ratio_GBM)
> test_pcs_GBM <- rename(test_pcs_GBM, test_count_GBM = n)
> test_pcs_GBM <- mutate(test_pcs_GBM, test_pc_GBM = round((test_count_GBM/length(testR$ratio_GBM)*100),1))
> test_pcs_GBM
# A tibble: 31 x 3
ratio_GBM test_count_GBM test_pc_GBM
<dbl> <int> <dbl>
1 0.1 10 0.0
2 0.2 148 0.1
3 0.3 154 0.1
4 0.4 600 0.4
5 0.5 8359 5.0
6 0.6 20739 12.4
7 0.7 23283 13.9
8 0.8 21899 13.1
9 0.9 22216 13.2
10 1.0 19202 11.4
# ... with 21 more rows
>
> # Merge table of percentages ******************************************************************************
> res_GBM <- data.frame(ratio = numeric(20))
> res_GBM$ratio <- seq(0.1,2,0.1)
> res_GBM <- left_join(res_GBM, train_pcs_GBM, by = c("ratio" = "ratio_GBM"))
> res_GBM <- left_join(res_GBM, valid_pcs_GBM, by = c("ratio" = "ratio_GBM"))
> res_GBM <- left_join(res_GBM, test_pcs_GBM, by = c("ratio" = "ratio_GBM"))
> res_GBM
ratio train_count_GBM train_pc_GBM valid_count_GBM valid_pc_GBM test_count_GBM test_pc_GBM
1 0.1 15 0.0 9 0.0 10 0.0
2 0.2 302 0.1 148 0.1 148 0.1
3 0.3 NA NA NA NA NA NA
4 0.4 1575 0.3 688 0.4 600 0.4
5 0.5 23304 4.6 8058 4.8 8359 5.0
6 0.6 62381 12.4 20980 12.5 20739 12.4
7 0.7 NA NA NA NA NA NA
8 0.8 65792 13.0 21932 13.1 21899 13.1
9 0.9 66991 13.3 22030 13.1 22216 13.2
10 1.0 58921 11.7 19342 11.5 19202 11.4
11 1.1 47096 9.3 15071 9.0 15086 9.0
12 1.2 NA NA NA NA NA NA
13 1.3 NA NA NA NA NA NA
14 1.4 NA NA NA NA NA NA
15 1.5 NA NA NA NA NA NA
16 1.6 6761 1.3 2372 1.4 2359 1.4
17 1.7 NA NA NA NA NA NA
18 1.8 NA NA NA NA NA NA
19 1.9 NA NA NA NA NA NA
20 2.0 1095 0.2 482 0.3 435 0.3
>
with(DL_FC_res, lineplot(ratio, test_pc, train_pc, valid_pc, FC_pc))
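A likely explanation for the missing rows is floating-point representation rather than a bug: seq(0.1, 2, 0.1) builds each value by arithmetic, and some results differ in the last bit from the value that round(x, 1) produces, so an exact join on doubles finds no match. The sketch below illustrates this with a small made-up table; counts, grid and key are illustrative names, and the numbers are copied from the training table only for flavour.

```r
# Hypothetical stand-in for one count table, keyed by ratios rounded to 1 decimal
counts <- data.frame(ratio_GBM = round(c(0.1, 0.2, 0.3, 0.7), 1),
                     n = c(15L, 302L, 440L, 71098L))

grid <- seq(0.1, 2, 0.1)

# The third grid value is computed as 0.1 + 2 * 0.1, which is not exactly 0.3
grid[3] == 0.3              # FALSE
print(grid[3], digits = 17)

# Option 1: round the grid the same way the join keys were rounded
grid_rounded <- round(grid, 1)

# Option 2: join on an integer key (number of tenths), which compares exactly
res <- data.frame(ratio = grid, key = as.integer(round(grid * 10)))
counts$key <- as.integer(round(counts$ratio_GBM * 10))
merged <- merge(res, counts[, c("key", "n")], by = "key", all.x = TRUE)
head(merged, 8)
```

Converting the ratios to character, as in the update above, works for the same reason: the comparison no longer depends on the exact binary value of each double.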

Data Transformations based on certain transformation criteria

I want to transform a dataset based on certain conditions. These conditions are given in another dataset. Let me explain it using an example.
Suppose I've a dataset in the following format:
Date Var1 Var2
3/1/2016 8 14
3/2/2016 7 8
3/3/2016 7 6
3/4/2016 10 8
3/5/2016 5 10
3/6/2016 9 15
3/7/2016 2 5
3/8/2016 6 14
3/9/2016 8 15
3/10/2016 8 8
And the following dataset has the transformation conditions and is in the following format:
Variable Trans1 Trans2
Var1 1||2 0.5||0.7
Var2 1||2 0.3||0.8
Now, I want to extract the first conditions from the transformation table for Var1 (1 and 0.5), then add 1 to Var1 and multiply the result by 0.5. I'll do the same for Var2: add 1 and multiply by 0.3. This transformation will give me the new variables Var1_1 and Var2_1. I'll do the same for the second transformation, which will give me Var1_2 and Var2_2. For Var1_2, the transformation is Var1 plus 2, multiplied by 0.7.
After the transformation, the dataset will look like the following:
Date Var1 Var2 Var1_1 Var2_1 Var1_2 Var2_2
3/1/2016 8 14 4.5 4.5 7 11.2
3/2/2016 7 8 4 2.7 6.3 7
3/3/2016 7 6 4 2.1 6.3 5.6
3/4/2016 10 8 5.5 2.7 8.4 7
3/5/2016 5 10 3 3.3 4.9 8.4
3/6/2016 9 15 5 4.8 7.7 11.9
3/7/2016 2 5 1.5 1.8 2.8 4.9
3/8/2016 6 14 3.5 4.5 5.6 11.2
3/9/2016 8 15 4.5 4.8 7 11.9
3/10/2016 8 8 4.5 2.7 7 7
Given that your original data.frame is called df and your conditions table cond1 then we can create a custom function,
funV1Cond1 <- function(x){
t1 <- as.numeric(gsub("[||].*", "", cond1$Trans1[cond1$Variable == "Var1"]))
t2 <- as.numeric(gsub("[||].*", "", cond1$Trans2[cond1$Variable == "Var1"]))
result <- (x$Var1 + t1)*t2
return(result)
}
funV1Cond1(df)
#[1] 4.5 4.0 4.0 5.5 3.0 5.0 1.5 3.5 4.5 4.5
Same way with function 2
funV1Cond2 <- function(x){
t1 <- as.numeric(gsub(".*[||]", "", cond1$Trans1[cond1$Variable == "Var1"]))
t2 <- as.numeric(gsub(".*[||]", "", cond1$Trans2[cond1$Variable == "Var1"]))
result <- (x$Var1 + t1)*t2
return(result)
}
funV1Cond2(df)
#[1] 7.0 6.3 6.3 8.4 4.9 7.7 2.8 5.6 7.0 7.0
Assuming that the Trans1 column has 3 conditions, i.e. 1, 2, 3, then (str_split comes from the stringr package),
library(stringr)
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[2]))
#[1] 2
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[1]))
#[1] 1
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[3]))
#[1] 3
Note that I changed the delimiter to a ','.
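The two functions above handle only Var1 and one condition each. A more general sketch, assuming the tables are called df and cond1 as in the answer and keep the original "||" delimiter, loops over every variable and every condition; the loop variables add, mul, v and k are illustrative.

```r
df <- data.frame(
  Date = sprintf("3/%d/2016", 1:10),
  Var1 = c(8, 7, 7, 10, 5, 9, 2, 6, 8, 8),
  Var2 = c(14, 8, 6, 8, 10, 15, 5, 14, 15, 8)
)
cond1 <- data.frame(
  Variable = c("Var1", "Var2"),
  Trans1 = c("1||2", "1||2"),          # values to add
  Trans2 = c("0.5||0.7", "0.3||0.8"),  # values to multiply by
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(cond1))) {
  v <- cond1$Variable[i]
  # fixed = TRUE treats "||" as a literal string, not a regular expression
  add <- as.numeric(strsplit(cond1$Trans1[i], "||", fixed = TRUE)[[1]])
  mul <- as.numeric(strsplit(cond1$Trans2[i], "||", fixed = TRUE)[[1]])
  for (k in seq_along(add)) {
    # New column such as Var1_1: (Var1 + add) * mul
    df[[paste0(v, "_", k)]] <- (df[[v]] + add[k]) * mul[k]
  }
}
df
```

The new columns come out grouped by variable (Var1_1, Var1_2, Var2_1, Var2_2); reorder them afterwards if the layout in the question is required.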
