I have the below data set:
Profit      MRO 15x5   D30
$150.00     -9.189     -0.24
$12.50      -6.076     -0.248
-$125.00    -7.699     -0.282
-$162.50    -8.008     -0.281
-$175.00    -0.183     -0.056
-$175.00    -0.235     -0.061
$275.00     0.141      -0.027
-$175.00    -4.062     -0.103
-$162.50    -5.654     -0.258
-$162.50    -1.578     -0.051
-$175.00    -3.336     -0.205
-$162.50    -1.523     -0.022
$412.50     -1.524     -0.194
$337.50     -1.049     -0.055
$100.00     -1.043     -0.059
I want to first sort the data by column D30 in ascending order and then look at the Profit column. If the first n rows and the last n rows (contiguous runs of cells at the top and bottom) have Profit values less than -50, those entire rows should be deleted from the data set.
The result would be like this:
Profit      MRO 15x5   D30
$275.00     0.141      -0.027
-$162.50    -1.578     -0.051
$337.50     -1.049     -0.055
-$175.00    -0.183     -0.056
$100.00     -1.043     -0.059
-$175.00    -0.235     -0.061
-$175.00    -4.062     -0.103
$412.50     -1.524     -0.194
-$175.00    -3.336     -0.205
$150.00     -9.189     -0.24
$12.50      -6.076     -0.248
This output is the result of deleting the top row and the bottom three rows of the sorted data set, because the Profit values in those rows were all less than -50.
Can anyone please help me do this in R, using dplyr or some other filtering package?
I would be thankful for your kind support.
Regards,
Farhan
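Use cumany. Combined with filter, it drops rows until a condition first becomes true (here Profit > -50), so the leading run of rows with Profit <= -50 is removed. Sorting in the opposite direction and filtering again removes the run at the other end.
The mutate step parses your Profit column, which is stored as currency text, into a numeric column.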
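library(dplyr)
library(readr)    # parse_number()
library(stringr)  # str_replace()
data %>%
  # Rewrite "-$175.00" as "$-175.00" before parsing the number
  mutate(Profit = parse_number(str_replace(Profit, "^-\\$(.*)$", "$-\\1"))) %>%
  arrange(D30) %>%
  filter(cumany(Profit > -50)) %>%   # drop the leading run with Profit <= -50
  arrange(desc(D30)) %>%
  filter(cumany(Profit > -50))       # drop the run at the other end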
Profit MRO_15x5 D30
1 275.0 0.141 -0.027
2 -162.5 -1.578 -0.051
3 337.5 -1.049 -0.055
4 -175.0 -0.183 -0.056
5 100.0 -1.043 -0.059
6 -175.0 -0.235 -0.061
7 -175.0 -4.062 -0.103
8 412.5 -1.524 -0.194
9 -175.0 -3.336 -0.205
10 150.0 -9.189 -0.240
11 12.5 -6.076 -0.248
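If it helps to see what cumany is doing, here is a rough base R sketch of the same trimming. It assumes Profit is already numeric and contains no NA values, in which case cumany(x) behaves like cumsum(x) > 0. The names sorted, ok and trimmed are just placeholders for this illustration, and only the Profit and D30 columns are reproduced.

```r
# Data from the question, with Profit already converted to numeric
data <- data.frame(
  Profit = c(150, 12.5, -125, -162.5, -175, -175, 275, -175,
             -162.5, -162.5, -175, -162.5, 412.5, 337.5, 100),
  D30 = c(-0.24, -0.248, -0.282, -0.281, -0.056, -0.061, -0.027, -0.103,
          -0.258, -0.051, -0.205, -0.022, -0.194, -0.055, -0.059)
)

# Sort by D30 (descending, to match the printed result)
sorted <- data[order(data$D30, decreasing = TRUE), ]

# TRUE for rows that should "stop" the trimming
ok <- sorted$Profit > -50

# TRUE from the first ok row onwards (what cumany() computes)
seen_from_top <- cumsum(ok) > 0
# Same idea, scanning from the last row upwards
seen_from_bottom <- rev(cumsum(rev(ok)) > 0)

# Keep everything between the first and last ok row, inclusive
trimmed <- sorted[seen_from_top & seen_from_bottom, ]
print(trimmed)
```

Rows in the middle with Profit below -50 are kept on purpose; only the runs at the two ends are removed.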
I am running PCR (principal component regression) on a data set, but the results give me the same values for both CV and adjCV. Is this correct, or is there something wrong with the data?
Here is my code:
pcr <- pcr(F1~., data = data, scale = TRUE, validation = "CV")
summary(PCR)
validationplot(pcr)
validationplot(pcr, val.type = "MSEP")
validationplot(pcr, val.type = "R2")
predplot(pcr)
coefplot(PCR)
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE,
prob = c(0.8,0.2))
train <- data[ind ==1,]
test <- data[ind ==2,]
pcr_train <- pcr(F1~., data = train, scale =TRUE, validation = "CV")
y_test <- test[, 1]
pcr_pred <- predict(pcr, test, ncomp = 4)
mean((pcr_pred - y_test) ^2)
And I get this warning when I run the mean() command:
Warning in mean.default((pcr_pred - y_test)^2) :
argument is not numeric or logical: returning NA
Sample data:
F1 F2 F3 F4 F5
4.378 2.028 -5.822 -3.534 -0.546
4.436 2.064 -5.872 -3.538 -0.623
4.323 1.668 -5.954 -3.304 -0.782
5.215 3.319 -5.863 -4.139 -0.632
4.074 1.497 -6.018 -3.176 -0.697
4.403 1.761 -6 -3.339 -0.847
4.99 3.105 -5.985 -3.97 -0.638
4.783 2.968 -5.94 -3.903 -0.481
4.361 1.786 -5.866 -3.397 -0.685
4.594 1.958 -5.985 -3.457 -0.91
0.858 -4.734 -6.104 -0.692 -0.87
0.878 -3.846 -6.289 -1.064 -0.618
0.876 -4.479 -6.148 -0.803 -0.801
0.937 -5.498 -5.958 -0.376 -1.184
0.953 -4.71 -6.123 -0.705 -0.96
0.738 -5.386 -5.877 -0.444 -0.884
0.833 -5.562 -5.937 -0.343 -1.104
1.184 -3.52 -6.221 -1.234 -0.38
1.3 -4.129 -6.168 -0.963 -0.73
3.359 -3.618 -5.302 0.481 -0.649
3.483 -2.938 -5.361 0.157 -0.482
3.673 -3.779 -5.326 0.516 -1.053
2.521 -6.577 -4.499 1.861 -1.374
2.52 -4.757 -4.866 1.182 -0.736
2.482 -4.732 -4.857 1.142 -0.708
2.543 -6.699 -4.496 1.947 -1.426
2.458 -3.182 -5.219 0.514 -0.255
2.558 -5.66 -4.757 1.558 -1.142
2.627 -1.806 -5.313 -1.808 1.054
3.773 -0.526 -5.236 -0.6 -0.23
3.65 -0.954 -4.97 -0.361 -0.413
3.816 -1.18 -5.228 -0.284 -0.575
3.752 -0.522 -5.346 -0.562 -0.293
3.961 -0.24 -5.423 -0.69 -0.408
3.734 -0.711 -5.307 -0.479 -0.347
4.094 -0.415 -5.103 -0.729 -0.35
3.894 -0.957 -5.133 -0.435 -0.457
3.741 -0.484 -5.363 -0.574 -0.279
3.6 -0.698 -5.422 -0.435 -0.306
3.845 -0.351 -5.306 -0.666 -0.269
3.886 -0.481 -5.332 -0.596 -0.39
3.552 -2.106 -5.043 0.128 -0.634
4.336 -10.323 -2.95 3.346 -3.494
3.918 -0.809 -5.315 -0.442 -0.567
3.757 -0.502 -5.347 -0.572 -0.288
3.712 -0.627 -5.353 -0.505 -0.314
3.954 -0.72 -5.492 -0.428 -0.691
4.088 -0.588 -5.412 -0.53 -0.688
3.728 -0.641 -5.338 -0.505 -0.321
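One possible cause of that warning, sketched below in base R: if `data` is a tibble (an assumption, the question does not say), then `test[, 1]` is still a data frame rather than a numeric vector, and mean() of a data frame returns NA with exactly this message. The objects `test` and `pred` here are made-up stand-ins, not the real model output. Separately, note that the posted code predicts from `pcr` (fitted on all the data) instead of `pcr_train`.

```r
# Hypothetical stand-ins for the test set and the predictions
test <- data.frame(F1 = c(1, 2, 3), F2 = c(0.1, 0.2, 0.3))
pred <- c(1.5, 2.5, 2)  # made-up predictions, for illustration only

# drop = FALSE mimics what a tibble returns for test[, 1]: a one-column data frame
y_df <- test[, 1, drop = FALSE]
# mean() cannot average a data frame, so this is NA (warning suppressed here)
mse_bad <- suppressWarnings(mean((pred - y_df)^2))

# Extracting the column by name gives a plain numeric vector
y_vec <- test[["F1"]]
mse_ok <- mean((pred - y_vec)^2)

print(mse_bad)
print(mse_ok)
```

If this is what is happening, using `test$F1` (or `test[["F1"]]`) for `y_test` should be enough to get a number back.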
I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
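A possible approach for the wide layout, as a base R sketch: find each header line that starts with FILE, split it and the line below it on whitespace, and pick the values sitting under QUAD and LAI. This assumes the datasheet has been read into a character vector of lines (for example with readLines) and that every field in the value row is non-empty, so the columns line up by position. The names `sheet_lines` and `extract_one` are placeholders.

```r
# Hypothetical input: the first lines of the datasheet, one element per line
sheet_lines <- c(
  "FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP",
  "20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4",
  "ANGLES 7.000 23.00 38.00 53.00 68.00",
  "CNTCT# 1.969 1.517 0.981 1.579 1.386"
)

# Positions of all header lines (one per measurement block)
header_idx <- grep("^FILE\\s", sheet_lines)

# Read the value line that follows header line i
extract_one <- function(i) {
  keys <- strsplit(trimws(sheet_lines[i]), "\\s+")[[1]]
  vals <- strsplit(trimws(sheet_lines[i + 1]), "\\s+")[[1]]
  data.frame(
    QUAD = as.numeric(vals[match("QUAD", keys)]),
    LAI  = as.numeric(vals[match("LAI", keys)])
  )
}

data_extract <- do.call(rbind, lapply(header_idx, extract_one))
print(data_extract)
```

If a field such as LOC can be blank in some files, splitting on whitespace will shift the columns, and a fixed-width reader would be the safer choice.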
I have to plot data from immunized animals in a way that makes possible correlations with protection visible. As background, when we vaccinate an animal it produces antibodies, which might or might not be linked to protection. We immunized cattle with 9 different proteins and measured antibody titers, which go up to 1.5 (optical density, O.D.). We also measured tick load, which goes up to about 5000. Each animal has different titers for each protein and a different tick load. Some proteins may be more important for protection than others, and we think that a heatmap could illustrate this.
TL;DR: Plot a heatmap with one variable (Ticks) that goes from 6 up to 5000, and another variable (Prot1 to Prot9) that goes up to 1.5.
A sample of my data:
Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
G1-54-102 control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
G1-130-102 control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
G1-133-102 control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
G3-153-102 vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
G3-200-102 vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
G3-807-102 vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035
I have little knowledge of R, but I'm really excited to learn more about it. So feel free to post whatever code you want and I will try my best to understand it.
Thank you in advance.
Luiz
Here is an option to use the ggplot2 package to create a heatmap. You will need to convert your data frame from wide format to long format. It is also important to convert the Ticks column from numeric to factor if the numbers are discrete.
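library(tidyverse)
library(viridis)
# Wide to long: one row per animal/protein pair
dat2 <- dat %>%
  gather(Prot, Value, starts_with("Prot"))
# One tile per pair; Ticks as a factor so each animal gets its own column
ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
  geom_tile() +
  scale_fill_viridis()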
DATA
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
'G1-54-102' control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
'G1-130-102' control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
'G1-133-102' control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
'G3-153-102' vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
'G3-200-102' vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
'G3-807-102' vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035",
header = TRUE, stringsAsFactors = FALSE)
In ggplot2 3.0.0 and later, you don't even need to explicitly load the viridis package. The continuous scale is included as scale_fill_viridis_c(). Exciting times!
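For reference, a sketch of the same heatmap using only tidyr and ggplot2, with pivot_longer in place of gather and the built-in viridis scale. It uses three of the six animals and three of the nine proteins from the sample data to stay short; `dat_long` and `p` are just illustrative names.

```r
library(tidyr)
library(ggplot2)

# Subset of the sample data from the question
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2 Prot3
'G1-54-102' control 3030 0.734 0.402 0.620
'G3-153-102' vaccinated 150 0.890 0.524 0.928
'G3-200-102' vaccinated 97 1.370 0.957 1.183",
header = TRUE, stringsAsFactors = FALSE)

# Wide to long: one row per animal/protein pair
dat_long <- pivot_longer(dat, cols = starts_with("Prot"),
                         names_to = "Prot", values_to = "Value")

# Tick load on the x axis (as a factor), proteins on the y axis, titer as fill
p <- ggplot(dat_long, aes(x = factor(Ticks), y = Prot, fill = Value)) +
  geom_tile() +
  scale_fill_viridis_c()

print(p)
```

Because Ticks is converted to a factor, the x axis shows the animals ordered by tick load rather than spaced by it, which sidesteps the very different ranges of the two variables.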
Suppose I have many data frames that have different numbers of rows but share a common Date column, e.g.:
DF1:
Date Index Change
05-04-17 29911.55 0
03-04-17 29910.22 0.0098
31-03-17 29620.5 -0.0009
30-03-17 29647.42 0.0039
29-03-17 29531.43 0.0041
28-03-17 29409.52 0.0059
27-03-17 29237.15 -0.0063
24-03-17 29421.4 0.003
And
DF2:
Date NG NG_Change
05-04-17 213.8 0.0047
04-04-17 212.8 0.0421
03-04-17 204.2 -0.0078
31-03-17 205.8 -0.0068
30-03-17 207.2 -0.0166
29-03-17 210.7 0.0483
28-03-17 201 0.005
27-03-17 200 -0.0015
24-03-17 200.3 0.0137
And another one:
DF3:
Date TI_Price TI_Change
05-04-17 51.39 0.0071
04-04-17 51.03 0.0157
03-04-17 50.24 -0.0071
31-03-17 50.6 0.005
30-03-17 50.35 0.017
29-03-17 49.51 0.0236
28-03-17 48.37 0.0134
I want to combine them using the Date column as the common variable, so that the final result contains only the rows whose dates appear in every data frame. Such as:
Date TI_Price TI_Change NG NG_Change TI_Price TI_Change
05-04-17 51.39 0.0071 213.8 0.0047 51.39 0.0071
04-04-17 51.03 0.0157 212.8 0.0421 51.03 0.0157
03-04-17 50.24 -0.0071 204.2 -0.0078 50.24 -0.0071
31-03-17 50.6 0.005 205.8 -0.0068 50.6 0.005
30-03-17 50.35 0.017 207.2 -0.0166 50.35 0.017
29-03-17 49.51 0.0236 210.7 0.0483 49.51 0.0236
28-03-17 48.37 0.0134 201 0.005 48.37 0.0134
I am wondering whether there is a way to merge them all in one go, unlike the merge() function, which takes DF1 and DF2, merges them, and then requires the result to be merged with DF3.
What I tried and tweaked (without success):
myfulldata = merge(DF1, DF2, all.x=T)
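One common way to do this in base R is to put the data frames in a list and fold merge() over it with Reduce(). The sketch below uses the first four rows of each table above. merge() performs an inner join by default, so only dates present in all three frames survive. The name `merged` is a placeholder.

```r
# First four rows of each data frame from the question
DF1 <- data.frame(Date = c("05-04-17", "03-04-17", "31-03-17", "30-03-17"),
                  Index = c(29911.55, 29910.22, 29620.5, 29647.42),
                  Change = c(0, 0.0098, -0.0009, 0.0039))
DF2 <- data.frame(Date = c("05-04-17", "04-04-17", "03-04-17", "31-03-17"),
                  NG = c(213.8, 212.8, 204.2, 205.8),
                  NG_Change = c(0.0047, 0.0421, -0.0078, -0.0068))
DF3 <- data.frame(Date = c("05-04-17", "04-04-17", "03-04-17", "31-03-17"),
                  TI_Price = c(51.39, 51.03, 50.24, 50.6),
                  TI_Change = c(0.0071, 0.0157, -0.0071, 0.005))

# Merge pairwise from left to right, keeping only dates common to all frames
merged <- Reduce(function(x, y) merge(x, y, by = "Date"),
                 list(DF1, DF2, DF3))
print(merged)
```

Under the hood this is still a sequence of pairwise merges, but it scales to any number of data frames without writing each step by hand. Note that merge() sorts the result by Date as text, so converting Date to a real date class first may be worthwhile if the order matters.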
I'm trying to declare the colorAxis and let a series of computed "Scores" define the gradient for coloring the bubbles. The visualization just keeps giving me random colors, each listed with its "OutlierScore" value in an ugly legend to the right of the plot. I don't understand what I'm doing wrong, as my options list matches all of the demo code I can find. I'm using the final gvisBubbleChart statement as the output of my renderGvis code in server.R.
Here's some sample data:
Attribute CloseRate Quotes OutlierScore Size
AdvancedShopper:N 0.261 3411 292.47 1.016
AdvancedShopper:Y 0.119 10421 259.68 2.283
PriorCarrier:HP 0.277 1876 186.46 0.739
Vehicles:1 0.183 8784 179.98 1.988
Vehicles:2 0.106 3471 121.81 1.027
LeadType:Cold 0.104 3177 117.09 0.974
SPINOFF:Y 0.414 510 115.65 0.492
LeadType:Warm 0.223 2184 115.47 0.795
MULTI_CAR_DSCNT_FLG:HMC 0.303 879 107.88 0.559
MULTI_CAR_DSCNT_FLG:MC 0.111 3451 105.75 1.024
PRI_CARR_NME:HP 0.253 1287 100.58 0.633
PriorCarrier:GEICO 0.099 2476 99.74 0.847
PriorCarrier:No Prior Insurance 0.304 802 99.61 0.545
PRI_CARR_NME:No Prior Insurance 0.304 802 99.61 0.545
FR_BAND:P-R 0.112 3227 98.15 0.983
PIP_DED:2,500 0.197 3053 95.11 0.952
AgencyName:South Agency 0.213 2120 94.81 0.783
RSrc:SPIN-OFF Additional Policy 0.434 373 91.99 0.467
CompanionType:None 0.141 11332 87.60 2.448
D2V:D1V1 0.175 5830 85.67 1.454
Here's my gvisBubbleChart declaration.
YLim = c(0,max(GData$Quotes)*1.05)
XLim = c(0,max(GData$CloseRate)*1.01)
gvisBubbleChart(GData, idvar="Attribute", xvar="CloseRate", yvar="Quotes", colorvar="OutlierScore", sizevar="Size",
options=list(title="One-Way Bubble Chart",
hAxis=paste("{title: 'Close Rate', minValue:0, maxValue:",XLim[2],"}",sep=""),
vAxis=paste("{title: 'Quotes', minValue:0, maxValue:",YLim[2],"}",sep=""),
width=1400, height=600, colorAxis="{minValue: 0, colors: ['red', 'green']}",
sizeAxis = '{minValue: 0, maxSize: 10}',
bubble="{textStyle:{color: 'none'}}"))