I am fairly new to Stack Overflow and did not find this via the search function. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information; I am looking to run some code over the whole file that returns only the "Resp2" and "LAI" lines for all 16 entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines; you can then use regex to extract the desired rows. However, I feel my approach is rather clumsy; I bet there are more elegant ways (maybe also with Unix command-line tools).
data <- readLines("testfile.txt")

library(stringr)

# Pull whatever follows "Resp2"/"LAI" at the start of a line;
# non-matching lines yield NA and are dropped below.
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))

data_extract <- data.frame(
  resp2 = resp2[!is.na(resp2)],
  lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
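For reference, the same extraction also works in base R without stringr. A minimal sketch, under the same testfile.txt assumption:
lines <- readLines("testfile.txt")

# Keep only lines that start with the key, then strip the key itself
resp2 <- as.numeric(sub("^Resp2\\s+", "", grep("^Resp2\\s", lines, value = TRUE)))
lai   <- as.numeric(sub("^LAI\\s+",   "", grep("^LAI\\s",   lines, value = TRUE)))

data.frame(resp2, lai)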
A solution based on the tidyverse could look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)

vroom_lines('data') %>%
  enframe() %>%                                # one row per line of the file
  filter(str_detect(value, 'Resp2|LAI')) %>%   # keep only the two keys
  transmute(value = str_squish(value)) %>%     # collapse runs of whitespace
  separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36
I have to plot data from immunized animals in a way that visualizes possible correlations with protection. As background, when we vaccinate an animal it produces antibodies, which may or may not be linked to protection. We immunized cattle with 9 different proteins and measured antibody titers, which go up to 1.5 (optical density, O.D.). We also measured tick load, which goes up to 5000. Each animal has different titers for each protein and a different tick load; some proteins may matter more for protection than others, and we think a heatmap could illustrate this.
TL;DR: Plot a heatmap with one variable (Ticks) that goes from 6 up to 5000, and another variable (Prot1 to Prot9) that goes up to 1.5.
A sample of my data:
Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
G1-54-102 control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
G1-130-102 control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
G1-133-102 control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
G3-153-102 vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
G3-200-102 vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
G3-807-102 vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035
I have little knowledge in R, but I'm really excited to learn more about it. So feel free to put whatever code you want and I will try my best to understand it.
Thank you in advance.
Luiz
Here is an option to use the ggplot2 package to create a heatmap. You will need to convert your data frame from wide format to long format. It is also important to convert the Ticks column from numeric to factor if the numbers are discrete.
library(tidyverse)
library(viridis)

# Convert from wide to long format: one row per animal/protein pair
dat2 <- dat %>%
  gather(Prot, Value, starts_with("Prot"))

ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
  geom_tile() +
  scale_fill_viridis()
DATA
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
'G1-54-102' control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
'G1-130-102' control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
'G1-133-102' control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
'G3-153-102' vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
'G3-200-102' vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
'G3-807-102' vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035",
header = TRUE, stringsAsFactors = FALSE)
In the newest version of ggplot2 / the tidyverse, you don't even need to load the viridis package explicitly; the scale is included via scale_fill_viridis_c(). Exciting times!
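For instance, the plot above could then be written without library(viridis) (assuming ggplot2 >= 3.0.0, where the viridis scales were added):
ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
  geom_tile() +
  scale_fill_viridis_c()  # continuous viridis scale bundled with ggplot2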
I conducted a Factor Analysis in R and discovered there are 4 latent variables (factors).
I am trying to create a diagram/cluster visualization to better represent my findings.
I would want something like this:
Here is my output:
print(factor2$loadings,cutoff = 0.3)
Loadings:
MR2 MR1 MR4 MR3
Inflatie 0.796
Dobanda 0.439
optq20 0.627
optq22 0.661
optq25 0.489
optq27 0.462
optq28 0.651
optq29 0.359
optq30 0.636
optq36 0.322
optq37 0.621
optq38 0.517
optq39 0.620
optq43 0.543
MR2 MR1 MR4 MR3
SS loadings 1.560 1.524 1.225 0.873
Proportion Var 0.111 0.109 0.087 0.062
Cumulative Var 0.111 0.220 0.308 0.370
I have searched all over the Internet but could not find any useful code for this.
Thank you!
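A minimal sketch of one option, assuming factor2 is the object returned by psych::fa() (the MR factor names suggest it is): the psych package itself can draw a diagram of the loadings.
library(psych)

# Path diagram of the factor solution; loadings below the cutoff are
# hidden, matching the cutoff = 0.3 used in the print() call above.
# Assumes factor2 came from psych::fa(..., nfactors = 4).
fa.diagram(factor2, cut = 0.3)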
I have a two-part problem. I've searched all over Stack Overflow and found answers related to my problems, but none of the variations I've tried have worked yet. Thanks in advance for any help!
I have a large data frame that contains many variables.
First, I want to (1) standardize a variable within the levels of a grouping variable (in my case, speaker), and (2) filter out values once the variable has been standardized (greater than 2 standard deviations away from the mean). (1) and (2) can be taken care of by a function using dplyr.
Second, I have many variables I want to do this for, so I'm trying to find an automated way to do it, such as a for loop.
Problem 1: Writing a function containing dplyr functions
Here is a sample of what my data frame looks like:
df = data.frame(
  speaker = c("eng1","eng1","eng1","eng1","eng1","eng1","eng2","eng2","eng2","eng2","eng2"),
  ratio_means001 = c(0.56,0.202,0.695,0.436,0.342,10.1,0.257,0.123,0.432,0.496,0.832),
  ratio_means002 = c(0.66,0.203,0.943,0.432,0.345,0.439,0.154,0.234,NA,0.932,0.854))
Output:
speaker ratio_means001 ratio_means002
1 eng1 0.560 0.660
2 eng1 0.202 0.203
3 eng1 0.695 0.943
4 eng1 0.436 0.432
5 eng1 0.342 0.345
6 eng1 10.100 0.439
7 eng2 0.257 0.154
8 eng2 0.123 0.234
9 eng2 0.432 NA
10 eng2 0.496 0.932
11 eng2 0.832 0.854
Below is the basic code I want to turn into a function:
standardized_data = group_by(df, speaker) %>%
  mutate(zRatio1 = as.numeric(scale(ratio_means001))) %>%  # z-score within speaker
  filter(!abs(zRatio1) > 2)                                # drop values beyond 2 SD
So that the data frame will now look like this (for example):
speaker ratio_means001 ratio_means002 zRatio1
(fctr) (dbl) (dbl) (dbl)
1 eng1 0.560 0.660 -0.3792191
2 eng1 0.202 0.203 -0.4699781
3 eng1 0.695 0.943 -0.3449943
4 eng1 0.436 0.432 -0.4106552
5 eng1 0.342 0.345 -0.4344858
6 eng2 0.257 0.154 -0.6349445
7 eng2 0.123 0.234 -1.1325034
8 eng2 0.432 NA 0.0148525
9 eng2 0.496 0.932 0.2524926
10 eng2 0.832 0.854 1.5001028
Here is what I have in terms of a function so far. The mutate part works, but I've been struggling with adding the filter part:
library(lazyeval)

standardize_variable = function(col1, new_col_name) {
  mutate_call = lazyeval::interp(b = interp(~ scale(a)), a = as.name(col1))
  group_by(data, speaker) %>%
    mutate_(.dots = setNames(list(mutate_call), new_col_name)) %>%
    filter_(interp(~ !abs(b) > 2.5, b = as.name(new_col_name))) # this part does not work
}
I receive the following error when I try to run the function:
data = standardize_variable("ratio_means001","zRatio1")
Error in substitute_(`_obj`[[2]], values) :
argument "_obj" is missing, with no default
Problem 2: Looping over the function
There are many variables that I'd like to apply the above function to, so I would like to find a way to either use a loop or another helpful function to help automate this process. The variable names differ only in a number at the end, so I have come up with something like this:
d <- data.frame()
for(i in 1:2)
{
col1 <- paste("ratio_means00", i, sep = "")
new_col <- paste("zRatio", i, sep = "")
d <- rbind(d, standardize_variable(col1, new_col))
}
However, I get the following error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
Thanks again for any help on these issues!
Alternative 1
I believe the main problem you were having with your function had to do with calling interp twice. Fixing that led to an additional problem with filter, which I think was due to scale adding attributes (I'm using a development version of dplyr, dplyr_0.4.3.9001). Wrapping as.numeric around scale gets rid of that.
So with the fixes your function looks like:
standardize_variable = function(col1, new_col_name) {
  # Build the call as.numeric(scale(<col1>)); as.numeric drops the
  # attributes that scale() attaches
  mutate_call = lazyeval::interp(~ as.numeric(scale(a)), a = as.name(col1))
  group_by(df, speaker) %>%
    mutate_(.dots = setNames(list(mutate_call), new_col_name)) %>%
    filter_(interp(~ !abs(b) > 2, b = as.name(new_col_name)))
}
I found the loop through the variables to be a bit more complicated than what you had, as I believe you want to merge your datasets back together once you make one for each variable. One option is to save them to a list and then use do.call with merge to get the final dataset.
d = list()
for(i in 1:2) {
  col1 <- paste("ratio_means00", i, sep = "")
  new_col <- paste("zRatio", i, sep = "")
  d[[i]] = standardize_variable(col1, new_col)
}
do.call(merge, d)
speaker ratio_means001 ratio_means002 zRatio1 zRatio2
1 eng1 0.202 0.203 -0.4699781 -1.1490444
2 eng1 0.342 0.345 -0.4344858 -0.6063693
3 eng1 0.436 0.432 -0.4106552 -0.2738853
4 eng1 0.560 0.660 -0.3792191 0.5974521
5 eng1 0.695 0.943 -0.3449943 1.6789806
6 eng2 0.123 0.234 -1.1325034 -0.7620572
7 eng2 0.257 0.154 -0.6349445 -0.9590348
8 eng2 0.496 0.932 0.2524926 0.9565726
9 eng2 0.832 0.854 1.5001028 0.7645194
Alternative 2
An alternative to all of this would be to use mutate_each and rename_ for the first part of the problem, and then use interp with an lapply loop to filter all of the scaled variables simultaneously.
In the code below I take advantage of the fact that mutate_each allows naming for single functions starting in dplyr_0.4.3.9001. Things look a bit complicated in rename_ because I was constructing the names you wanted for the new columns. To simplify things you could leave the names ending in _z as produced by mutate_each and save yourself the complicated rename_ step with gsub and grepl.
df2 = df %>%
  group_by(speaker) %>%
  mutate_each(funs(z = as.numeric(scale(.))), starts_with("ratio_means00")) %>%
  rename_(.dots = setNames(names(.)[grepl("z", names(.))],
                           paste0("zR", gsub("r|_z|_means00", "",
                                             names(.)[grepl("z", names(.))]))))
Once that's done, you just need to filter on multiple columns. I think it's easiest to build the list of filtering conditions with interp and lapply, and then pass that list to the .dots argument of filter_.
dots = lapply(names(df2)[starts_with("z", vars = names(df2))],
              function(y) interp(~ abs(x) < 2, x = as.name(y)))
filter_(df2, .dots = dots)
Source: local data frame [9 x 5]
Groups: speaker [2]
speaker ratio_means001 ratio_means002 zRatio1 zRatio2
(fctr) (dbl) (dbl) (dbl) (dbl)
1 eng1 0.560 0.660 -0.3792191 0.5974521
2 eng1 0.202 0.203 -0.4699781 -1.1490444
3 eng1 0.695 0.943 -0.3449943 1.6789806
4 eng1 0.436 0.432 -0.4106552 -0.2738853
5 eng1 0.342 0.345 -0.4344858 -0.6063693
6 eng2 0.257 0.154 -0.6349445 -0.9590348
7 eng2 0.123 0.234 -1.1325034 -0.7620572
8 eng2 0.496 0.932 0.2524926 0.9565726
9 eng2 0.832 0.854 1.5001028 0.7645194
Alternative 3
I often find these problems most straightforward if I reshape the dataset instead of working across columns. For example, still using the newest version of mutate_each but skipping the renaming step for simplicity, you could gather all the standardized columns together with the gather function from tidyr and then filter on the single new column.
library(tidyr)

df %>%
  group_by(speaker) %>%
  mutate_each(funs(z = as.numeric(scale(.))), starts_with("ratio_means00")) %>%
  gather(group, zval, ends_with("_z")) %>%
  filter(abs(zval) < 2)
# First 12 lines of output
Source: local data frame [20 x 5]
Groups: speaker [2]
speaker ratio_means001 ratio_means002 group zval
<fctr> <dbl> <dbl> <chr> <dbl>
1 eng1 0.560 0.660 ratio_means001_z -0.3792191
2 eng1 0.202 0.203 ratio_means001_z -0.4699781
3 eng1 0.695 0.943 ratio_means001_z -0.3449943
4 eng1 0.436 0.432 ratio_means001_z -0.4106552
5 eng1 0.342 0.345 ratio_means001_z -0.4344858
6 eng2 0.257 0.154 ratio_means001_z -0.6349445
7 eng2 0.123 0.234 ratio_means001_z -1.1325034
8 eng2 0.432 NA ratio_means001_z 0.0148525
9 eng2 0.496 0.932 ratio_means001_z 0.2524926
10 eng2 0.832 0.854 ratio_means001_z 1.5001028
11 eng1 0.560 0.660 ratio_means002_z 0.5974521
12 eng1 0.202 0.203 ratio_means002_z -1.1490444
...
If the desired final form is the wide format, you can use spread (also from tidyr) for that. One advantage (to me) is that you can keep all values of one variable even when another variable failed the filtering step.
df %>%
  group_by(speaker) %>%
  mutate_each(funs(z = as.numeric(scale(.))), starts_with("ratio_means00")) %>%
  gather(group, zval, ends_with("_z")) %>%
  filter(abs(zval) < 2) %>%
  spread(group, zval)
Source: local data frame [11 x 5]
Groups: speaker [2]
speaker ratio_means001 ratio_means002 ratio_means001_z ratio_means002_z
<fctr> <dbl> <dbl> <dbl> <dbl>
1 eng1 0.202 0.203 -0.4699781 -1.1490444
2 eng1 0.342 0.345 -0.4344858 -0.6063693
3 eng1 0.436 0.432 -0.4106552 -0.2738853
4 eng1 0.560 0.660 -0.3792191 0.5974521
5 eng1 0.695 0.943 -0.3449943 1.6789806
6 eng1 10.100 0.439 NA -0.2471337
7 eng2 0.123 0.234 -1.1325034 -0.7620572
8 eng2 0.257 0.154 -0.6349445 -0.9590348
9 eng2 0.432 NA 0.0148525 NA
10 eng2 0.496 0.932 0.2524926 0.9565726
11 eng2 0.832 0.854 1.5001028 0.7645194
If you don't want to keep the NA values, you can always remove them with na.omit at a later time.
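For instance, assuming the result of the spread pipeline above were stored as df_wide (a hypothetical name):
na.omit(df_wide)  # drops every row that still contains an NA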
I have multiple xts objects that look something like this:
t x0 x1 x2 x3 x4
2000-01-01 0.397 0.262 -0.039 0.440 0.186
2000-01-02 0.020 -0.219 0.197 0.301 -0.300
2000-01-03 0.010 0.064 -0.034 0.214 -0.451
2000-01-04 -0.002 -0.483 -0.251 0.023 0.153
2000-01-05 0.451 0.375 0.566 0.258 -0.092
2000-01-06 0.411 0.219 0.108 0.137 -0.087
2000-01-07 0.111 -0.039 0.187 -0.271 0.346
2000-01-08 0.131 -0.078 0.163 -0.057 -0.313
2000-01-09 -0.409 -0.022 0.211 -0.297 0.217
...
Now, I am interested in finding out how well the average of the other variables explains each single x variable during each period 𝜏 (e.g. monthly).
Or in other words, I'm interested in building a time series of the R-squared from the following regression:
x_{i,t} = 𝛽_0 + 𝛽_1 · avg_{i,t} + 𝜀_{i,t}
where avg_{i,t} is the average of all other variables at time t, and the regression is run for each of the n variables (i = 1, …, n) over the observations t in the period 𝜏 (e.g. 𝜏 is a month). I would like to be able to run this with any period 𝜏, which is where I believe xts might help, since there are functions such as endpoints or apply.monthly?
I have years and years of data, and several of these xts objects, so the question is: what would be a sensible way to run those regressions and collect the time series of R-squared values? A for-loop can take care of iterating over the separate xts objects, but I really don't think a for-loop over every period within each single object would be wise?
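One possible pattern is to combine endpoints with period.apply so that xts handles the iteration over periods. A minimal sketch, assuming x is one of the xts objects above and 𝜏 is monthly; rsq_one_period and list_of_xts are hypothetical names:
library(xts)

# For one chunk of rows (one period tau): regress each column on the
# row-wise mean of the remaining columns and return the R-squared values.
# Assumes each period contains enough observations to fit a regression.
rsq_one_period <- function(chunk) {
  sapply(seq_len(ncol(chunk)), function(i) {
    avg <- rowMeans(chunk[, -i], na.rm = TRUE)
    summary(lm(as.numeric(chunk[, i]) ~ avg))$r.squared
  })
}

ep  <- endpoints(x, on = "months")               # tau = monthly boundaries
rsq <- period.apply(x, INDEX = ep, FUN = rsq_one_period)
colnames(rsq) <- colnames(x)                     # one R-squared series per variable

# For several xts objects collected in a list:
# rsq_list <- lapply(list_of_xts, function(x)
#   period.apply(x, endpoints(x, on = "months"), rsq_one_period))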