Data pattern visualization using R - r

I have a table like
0.5625 0.037037037 0.009923785
0.7734375 0.0781893 0.009923785
0.9609375 0.127572016 0.009923785
0.26953125 0.008230453 0.009923785
0.85546875 0.144032922 0.009923785
0.873046875 0.187928669 0.009923785
0.969726563 0.138545953 0.009923785
0.711914063 0.031550069 0.009923785
0.588867188 0.066300869 0.009923785
0.670898438 0.038866027 0.009923785
0.331054688 0.004572474 0.009328358
0.670898438 0.038866027 0.009923785
0.8203125 0.1015625 0.009923785
0.794921875 0.115234375 0.009923785
0.947265625 0.228515625 0.009923785
0.284179688 0.032226563 0.009923785
0.987304688 0.079101563 0.009923785
0.485351563 0.081054688 0.009923785
0.584960938 0.012695313 0.009288663
0.485351563 0.081054688 0.009923785
0.862048458 0.112664883 0.00996348
0.844804516 0.126747993 0.00996348
0.859089866 0.072807892 0.00996348
0.069334708 0.013713014 0.00996348
0.515944115 0.001011122 0.009288663
0.787155502 0.089283342 0.009923785
I want to visualize the data so that each result value sits at a center point and is connected to every (A, B) point that produced it. For example, if 0.009288663 is generated by (0.515944115, 0.001011122) and (0.485351563, 0.081054688), then 0.009288663 should be connected to (a1,b1) and (a2,b2).
The sketch below resembles the desired result:
(a2,b2)<-----.------------>(a1,b1)
1st Approach:
I tried the following code:
library(scatterplot3d)
scatterplot3d(x = test$A, # x axis
              y = test$B, # y axis
              z = test$Result, # z axis
              x.ticklabs = levels(test$A),
              y.ticklabs = levels(test$B))
But I realized that this method only plots the points in 3D space, which is not the layout I need.
2nd Approach:
I tried plotting all the points and connecting them based on a condition. That could serve as a workaround, but I still couldn't figure out where to place the result itself.
Any help with the query will be much appreciated.
Thanks

Is this what you mean? Note that I only read in the first two columns with [,1:2] here, but it should work even if you read in the full dataset:
> library(dplyr)
> library(tibble)
> test <- as_tibble(read.table("yourdata.txt",header=TRUE))[,1:2]
> test
# A tibble: 26 x 2
A B
<dbl> <dbl>
1 0.562 0.0370
2 0.773 0.0782
3 0.961 0.128
4 0.270 0.00823
5 0.855 0.144
6 0.873 0.188
7 0.970 0.139
8 0.712 0.0316
9 0.589 0.0663
10 0.671 0.0389
# ... with 16 more rows
Create columns containing the midpoints of the x's and midpoints of the y's:
> test %>% mutate(xdiff=((A+lag(A))/2),ydiff=((B+lag(B))/2))
# A tibble: 26 x 4
A B xdiff ydiff
<dbl> <dbl> <dbl> <dbl>
1 0.562 0.0370 NA NA
2 0.773 0.0782 0.668 0.0576
3 0.961 0.128 0.867 0.103
4 0.270 0.00823 0.615 0.0679
... the rest are truncated
And then feed all the center points to a plot:
> library(ggplot2)
> test %>% mutate(xdiff=((A+lag(A))/2),ydiff=((B+lag(B))/2)) %>%
ggplot() + geom_point(aes(x=xdiff,y=ydiff))
You can even draw the segments that produced those midpoints by adding geom_segment, but you will need to spend some time on a color strategy, because the extra lines make the plot look messy:
> test %>% mutate(xdiff=((A+lag(A))/2),ydiff=((B+lag(B))/2)) %>%
ggplot() + geom_point(aes(x=xdiff,y=ydiff)) +
geom_segment(aes(x=A,y=B,xend=lead(A),yend=lead(B)))
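If the goal is instead one center per result value, a hub-and-spoke layout may be closer to what the question describes. The sketch below is only one possible reading: it assumes the columns are named A, B and Result, uses four rows copied from the question's table, and places each hub at the centroid of its contributing points, which is an arbitrary choice made here for illustration. The names hubs and spokes are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Four rows taken from the question's table (assumed column names)
test <- data.frame(
  A      = c(0.515944115, 0.584960938, 0.331054688, 0.5625),
  B      = c(0.001011122, 0.012695313, 0.004572474, 0.037037037),
  Result = c(0.009288663, 0.009288663, 0.009328358, 0.009923785)
)

# One hub per distinct Result, placed at the centroid of its points
# (the centroid placement is an assumption, not part of the question)
hubs <- test %>%
  group_by(Result) %>%
  summarise(cx = mean(A), cy = mean(B), .groups = "drop")

# Attach the hub coordinates to every contributing point
spokes <- left_join(test, hubs, by = "Result")

p <- ggplot(spokes) +
  geom_segment(aes(x = cx, y = cy, xend = A, yend = B,
                   colour = factor(Result))) +
  geom_point(aes(x = A, y = B)) +
  geom_point(data = hubs, aes(x = cx, y = cy, colour = factor(Result)),
             size = 3) +
  labs(colour = "Result")
p
```

Coloring by result keeps the spokes of different hubs apart; with many distinct results, faceting by Result may be easier to read.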

Related

The if else statement compare to 0

I am trying to get a signed square root of numbers that may be negative. I take the absolute value of the data; for positive numbers I use the square root of the absolute value directly, otherwise I add a negative sign to the result. However, all the numbers I get are negative...
My code
Results shown
I expected both negative and positive results, but I only got negative numbers.
Library and Data
Not sure exactly what you are doing because your original data frame isn't included in the question. However, I have simulated a dataset that should emulate what you want depending on what you are doing. First, I loaded the tidyverse package for data wrangling like creating/manipulating variables, then set a random seed so you can reproduce the simulated data.
#### Load Library ####
library(tidyverse)
#### Set Random Seed ####
set.seed(123)
Now I create a normally distributed x variable that contains both positive and negative values.
#### Create Randomly Distributed X w/Neg Values ####
tib <- tibble(
x = rnorm(n=100)
)
Creating Variables
Now we can make absolute values, followed by square roots, which are made negative if the original raw value was negative.
#### Create Absolute and Sqrt Values ####
new.tib <- tib %>%
mutate(
abs.x = abs(x),
sq.x = sqrt(abs.x),
final.x = ifelse(x < 0,
sq.x * -1,
sq.x)
)
new.tib
If you print new.tib, the end result will look like this:
# A tibble: 100 × 4
x abs.x sq.x final.x
<dbl> <dbl> <dbl> <dbl>
1 2.20 2.20 1.48 1.48
2 1.31 1.31 1.15 1.15
3 -0.265 0.265 0.515 -0.515
4 0.543 0.543 0.737 0.737
5 -0.414 0.414 0.644 -0.644
6 -0.476 0.476 0.690 -0.690
7 -0.789 0.789 0.888 -0.888
8 -0.595 0.595 0.771 -0.771
9 1.65 1.65 1.28 1.28
10 -0.0540 0.0540 0.232 -0.232
If you just want to select the final x values, you can simply select them, like so:
new.tib %>%
select(final.x)
Giving you just this column:
# A tibble: 100 × 1
final.x
<dbl>
1 1.48
2 1.15
3 -0.515
4 0.737
5 -0.644
6 -0.690
7 -0.888
8 -0.771
9 1.28
10 -0.232
# … with 90 more rows
Using the first example in ?ifelse:
x <- c(6:-4)
[1] 6 5 4 3 2 1 0 -1 -2 -3 -4
sqrt(ifelse(x >= 0, x, -x))
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000
[7] 0.000000 1.000000 1.414214 1.732051 2.000000
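Note that the ?ifelse example above returns the square root of the absolute value and drops the sign. If the sign should be kept, as the question asks, a compact vectorized form is possible without ifelse. This is a minimal sketch; signed_sqrt is a hypothetical helper name, not a base R function.

```r
# Signed square root: sqrt of the magnitude, with the original sign restored
signed_sqrt <- function(x) {
  sign(x) * sqrt(abs(x))
}

x <- c(6:-4)
res <- signed_sqrt(x)
res
```

Because sign() returns -1, 0 or 1 element by element, negative inputs give negative results and positive inputs give positive ones.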

How to plot my tibble in r like shown below? [duplicate]

This question already has answers here:
How to add texture to fill colors in ggplot2
(8 answers)
Closed 9 months ago.
My tibble looks like this:
# A tibble: 5 × 6
clusters neuroticism introverty empathic open unconscious
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.242 1.02 0.511 0.327 -0.569
2 2 -0.285 -0.257 -1.36 0.723 -0.994
3 3 0.904 -0.973 0.317 0.0622 -0.0249
4 4 -0.836 0.366 0.519 0.269 1.00
5 5 0.0602 -0.493 -1.03 -1.53 -0.168
I was wondering how I can plot this with ggplot2 so that it looks like the Big Five personality profiles shown in this picture:
My goal is to plot a personality profile for each cluster.
In order to plot it you'd typically need to have the data in a long format. A tidyverse-solution using pivot_longer and then ggplot could look like:
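library(tidyverse) # provides pivot_longer(), ggplot() and tribble()
# df is the tibble defined under "Data:" below
df |>
  pivot_longer(-clusters) |>
  ggplot(aes(x = name,
             y = value,
             fill = as.factor(clusters))) +
  geom_col(position = "dodge")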
Plot:
Data:
df <-
tribble(
~clusters,~neuroticism,~introverty,~empathic,~open,~unconscious,
1,0.242,1.02,0.511,0.327,-0.569,
2,-0.285,-0.257,-1.36,0.723,-0.994,
3,0.904,-0.973,0.317,0.0622,-0.0249,
4,-0.836,0.366,0.519,0.269,1.00,
5,0.0602,-0.493,-1.03,-1.53,-0.168
)
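Since the goal is one profile per cluster, a faceted line plot might come closer to a classic profile chart than dodged bars. The following is a sketch under that assumption, reusing the data above; the column names trait and score and the object name long are choices made here for illustration.

```r
library(tibble)
library(tidyr)
library(ggplot2)

df <- tribble(
  ~clusters, ~neuroticism, ~introverty, ~empathic, ~open, ~unconscious,
  1,  0.242,   1.02,   0.511,  0.327,  -0.569,
  2, -0.285,  -0.257, -1.36,   0.723,  -0.994,
  3,  0.904,  -0.973,  0.317,  0.0622, -0.0249,
  4, -0.836,   0.366,  0.519,  0.269,   1.00,
  5,  0.0602, -0.493, -1.03,  -1.53,   -0.168
)

# Long format: one row per cluster and trait
long <- pivot_longer(df, -clusters, names_to = "trait", values_to = "score")

# One panel per cluster, with a dashed zero line as reference
p <- ggplot(long, aes(x = trait, y = score, group = clusters)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line() +
  geom_point() +
  facet_wrap(~ clusters, labeller = label_both)
p
```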

How to calculate the AUC of a graph in R? [duplicate]

This question already has answers here:
Calculate the Area under a Curve
(7 answers)
Closed 1 year ago.
I have a dataframe (gdata) with x (as "r") and y (as "km") coordinates of a function.
When I plot it like this:
plot(x = gdata$r, y = gdata$km, type = "l")
I get the graph of the function:
Now I want to calculate the area under the curve from x = 0 to x = 0.6. When I look for appropriate packages, I only find tools for calculating the AUC of a ROC curve. Is there a way to calculate the AUC of an ordinary function?
The area under the curve (AUC) of a given set of data points can be obtained by numerical integration:
Let data be your data frame containing x and y values. You can get the area under the curve from the lower limit x0=0 to the upper limit x1=0.6 by integrating a function that linearly interpolates your data.
This is a numerical approximation and not exact, because we do not have an infinite number of data points: for y=sqrt(x) with 20 rows we get 0.3033 instead of the true value of 0.3098. With 200 rows in data the result improves to auc=0.3096.
library(tidyverse)
data <-
tibble(
x = seq(0, 2, length.out = 20)
) %>%
mutate(y = sqrt(x))
data
#> # A tibble: 20 × 2
#> x y
#> <dbl> <dbl>
#> 1 0 0
#> 2 0.105 0.324
#> 3 0.211 0.459
#> 4 0.316 0.562
#> 5 0.421 0.649
#> 6 0.526 0.725
#> 7 0.632 0.795
#> 8 0.737 0.858
#> 9 0.842 0.918
#> 10 0.947 0.973
#> 11 1.05 1.03
#> 12 1.16 1.08
#> 13 1.26 1.12
#> 14 1.37 1.17
#> 15 1.47 1.21
#> 16 1.58 1.26
#> 17 1.68 1.30
#> 18 1.79 1.34
#> 19 1.89 1.38
#> 20 2 1.41
ggplot(data, aes(x, y)) + geom_point()
integrate(approxfun(data$x, data$y), 0, 0.6)
#> 0.3033307 with absolute error < 8.8e-05
Created on 2021-10-03 by the reprex package (v2.0.1)
The absolute error reported by integrate is only correct if the true function between every two data points is exactly the linear interpolation we assumed.
I used the package MESS to solve the problem:
# Toy example
library(MESS)
x <- seq(0,3, by=0.1)
y <- x^2
auc(x,y, from = 0.1, to = 2, type = "spline")
The analytical result is 7999/3000, which is approximately 2.666333.
The script above gives 2.66632 using the spline approximation and 2.6695 using the linear approximation.
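For reference, the linear approximation used in both answers amounts to the trapezoidal rule, which can be written in base R without extra packages. The sketch below is a hand-rolled illustration; trapezoid_auc is a hypothetical helper name, and it assumes x is sorted and that from and to lie inside the range of x.

```r
# Trapezoidal rule on linearly interpolated data between `from` and `to`
trapezoid_auc <- function(x, y, from, to) {
  f  <- approxfun(x, y)                        # linear interpolation
  xs <- sort(unique(c(from, x[x > from & x < to], to)))
  ys <- f(xs)
  # Sum of trapezoid areas between consecutive x values
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
}

# Same toy data as the sqrt example above
x <- seq(0, 2, length.out = 20)
y <- sqrt(x)
area <- trapezoid_auc(x, y, from = 0, to = 0.6)
area
```

The integration limits are added as extra nodes so that partial intervals at both ends are handled by interpolation rather than dropped.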

How could I use R to pull a few select lines out of a large text file?

I am fairly new to stack overflow but did not find this in the search engine. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. Then you can use regex to extract the desired rows. However, I feel that my approach is rather clumsy, I bet there are more elegant ways (maybe also with (unix) command line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
resp2 = resp2[!is.na(resp2)],
lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
A solution based in the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
enframe() %>%
filter(str_detect(value, 'Resp2|LAI')) %>%
transmute(value = str_squish(value)) %>%
separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36
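A base R variant that returns one row per entry could look like the sketch below. It assumes every entry contains exactly one Resp2 line and one LAI line, in the same order. The second entry in the sample is made up for illustration, and get_field is a hypothetical helper name; in practice lines would come from readLines() on the real file.

```r
# Stand-in for readLines("testfile.txt"); second entry is invented
lines <- c(
  "AI_File 10", "Resp2 1012", "LAI 4.36", "SEL 0.47",
  "AI_File 11", "Resp2 1013", "LAI 3.90", "SEL 0.51"
)

# Return the text after `key` for every line that starts with `key`
get_field <- function(lines, key) {
  hits <- grep(paste0("^", key, "[[:space:]]"), lines, value = TRUE)
  trimws(sub(paste0("^", key), "", hits))
}

result <- data.frame(
  resp2 = get_field(lines, "Resp2"),
  lai   = as.numeric(get_field(lines, "LAI"))
)
result
```

Resp2 is kept as character here because the sample does not guarantee that the response is always numeric.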

Conditional sorting / reordering of column values in R

I have a data set similar to the following with 1 column and 60 rows:
value
1 0.0423
2 0.0388
3 0.0386
4 0.0342
5 0.0296
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
40 0.1424
.
60 -0.0312
I want to reorder the rows so that certain conditions are met. For example one condition could be: sum(df$value[4:7]) > 0.1000 & sum(df$value[4:7]) <0.1100
With the data set looking like this for example.
value
1 0.0423
2 0.0388
3 0.0386
4 0.1312
5 -0.0312
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
.
.
60 0.0342
What I tried was using repeat and sample as in the following:
repeat{
  # Shuffle the rows, then test the condition on the shuffled data
  shuffled <- tibble(value = sample(df$value))
  if (sum(shuffled$value[4:7]) > 0.1000 & sum(shuffled$value[4:7]) < 0.1100) break
}
Unfortunately, this method takes quite some time, and I was wondering if there is a faster way to reorder rows based on mathematical conditions such as sum or prod.
Here's a quick implementation of the hill-climbing method I outlined in my comment. I've had to slightly reframe the desired condition as "distance of sum(x[4:7]) from 0.105" to make it continuous, although you can still use the exact condition when doing the check that all requirements are satisfied. The benefit is that you can add extra conditions to the distance function easily.
# Using same example data as Jon Spring
set.seed(42)
vs = rnorm(60, 0.05, 0.08)
get_distance = function(x) {
  distance = abs(sum(x[4:7]) - 0.105)
  # Add to the distance with further conditions if needed
  distance
}
max_attempts = 10000
best_distance = Inf
swaps_made = 0
for (step in 1:max_attempts) {
  # Copy the vector and swap two random values
  new_vs = vs
  swap_inds = sample.int(length(vs), 2, replace = FALSE)
  new_vs[swap_inds] = rev(new_vs[swap_inds])
  # Keep the new vector if the distance has improved
  new_distance = get_distance(new_vs)
  if (new_distance < best_distance) {
    vs = new_vs
    best_distance = new_distance
    swaps_made = swaps_made + 1
  }
  complete = (sum(vs[4:7]) < 0.11) & (sum(vs[4:7]) > 0.1)
  if (complete) {
    print(paste0("Solution found in ", step, " steps"))
    break
  }
}
sum(vs[4:7])
There's no real guarantee that this method will reach a solution, but I often try this kind of basic hill-climbing when I'm not sure if there's a "smart" way to approach a problem.
Here's an approach using combn from base R, and then filtering using dplyr. (I'm sure there's a way w/o it but my base-fu isn't there yet.)
With only 4 numbers from a pool of 60, there are "only" 488k different combinations (ignoring order; =60*59*58*57/4/3/2), so it's quick to brute force in about a second.
# Make a vector of 60 numbers like your example
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08)
all_combos <- combn(my_nums, 4) # Get all unique combos of 4 numbers
library(tidyverse)
combos_table <- all_combos %>%
  t() %>%
  as.data.frame() %>% # names the columns V1..V4
  as_tibble() %>%
  mutate(sum = V1 + V2 + V3 + V4) %>%
  filter(sum > 0.1, sum < 0.11)
> combos_table
# A tibble: 8,989 x 5
V1 V2 V3 V4 sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.160 0.00482 0.0791 -0.143 0.100
2 0.160 0.00482 0.101 -0.163 0.103
3 0.160 0.00482 0.0823 -0.145 0.102
4 0.160 0.00482 0.0823 -0.143 0.104
5 0.160 0.00482 -0.0611 -0.00120 0.102
6 0.160 0.00482 -0.0611 0.00129 0.105
7 0.160 0.00482 0.0277 -0.0911 0.101
8 0.160 0.00482 0.0277 -0.0874 0.105
9 0.160 0.00482 0.101 -0.163 0.103
10 0.160 0.00482 0.0273 -0.0911 0.101
# … with 8,979 more rows
This says that in this example, there are about 9000 different sets of 4 numbers from my sequence which meet the criteria. We could pick any of these and put them in positions 4-7 to meet your requirement.
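To actually perform the reordering, it may be easier to enumerate combinations of positions rather than values, so the chosen elements can be moved directly. The sketch below takes the first qualifying combination and places it in positions 4-7. Leaving all other values in their original relative order is an assumption made here, and the names idx_combos, chosen and reordered are illustrative.

```r
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08)

# All combinations of 4 positions (not values), one per column
idx_combos <- combn(length(my_nums), 4)

# Sum of the four values for every combination
sums <- colSums(matrix(my_nums[idx_combos], nrow = 4))

# Take the first combination that satisfies the condition
ok     <- which(sums > 0.1 & sums < 0.11)
chosen <- idx_combos[, ok[1]]

# Move the chosen values to positions 4-7, keeping the rest in order
rest      <- my_nums[-chosen]
reordered <- c(rest[1:3], my_nums[chosen], rest[-(1:3)])
sum(reordered[4:7])
```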
