R. Remove blocks of observations in df if they fulfill condition - r

I have a huge dataframe (>1,000,000 rows) like this.
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579
It is composed of blocks of 7 observations: (Intercept), GT, SEX, M, N, GT:SEX and GT:N. However, a few blocks lack one or more of the observations (e.g. the third block lacks GT:SEX). Using R, I want to remove these incomplete blocks. In this toy example I would get:
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904

I think you'd want to group by SNP and check those blocks for whether they comply with your expectations:
library(dplyr)
expected_terms <- c("(Intercept)", "GT", "SEX", "M", "N", "GT:SEX", "GT:N")
df %>%
  group_by(SNP) %>%
  filter(
    all(expected_terms %in% term)
  )
Stricter than that, if you need to make sure that each of your terms appears exactly once and that no other terms appear:
df %>%
  group_by(SNP) %>%
  filter(
    # use `table` to count occurrence of terms, keep only if all are counted exactly once
    all(table(term)[expected_terms] == 1),
    # keep only if no terms remain after removing your expected set
    length(setdiff(term, expected_terms)) == 0
  )
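The same per-SNP completeness check can be sketched in base R. This is only an illustration under stated assumptions: `toy`, `is_complete` and the shortened three-term `expected_terms` are made-up names and data, not taken from the question.

```r
# Hypothetical toy data: block "a" is complete, block "b" lacks "GT:SEX"
expected_terms <- c("(Intercept)", "GT", "GT:SEX")
toy <- data.frame(
  term = c("(Intercept)", "GT", "GT:SEX", "(Intercept)", "GT"),
  SNP  = c("a", "a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# TRUE when a block holds exactly the expected terms, each of them once
is_complete <- function(terms) {
  length(terms) == length(expected_terms) && setequal(terms, expected_terms)
}

# One logical value per SNP, named by SNP
complete_by_snp <- tapply(toy$term, toy$SNP, is_complete)

# Look up each row's SNP and keep only rows from complete blocks
keep <- as.vector(complete_by_snp[toy$SNP])
kept <- toy[keep, ]
```

Here `kept` contains only the three rows of block "a". On a very large data frame the grouped dplyr version may be more convenient, but the logic is the same.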

Assuming that (Intercept) is present every time and marks the start of each block, you can test whether the length of each block is 7.
x[unlist(lapply(split(seq_len(nrow(x)), cumsum(x$term == "(Intercept)")),
function(y) {if(length(y) == 7) y else NULL})), ]
# term estimate st.error statistic p.value SNP
#1 (Intercept) 7.68 0.17 44.64 0.00 rs1406947
#2 GT 0.01 0.01 0.07 0.19 rs1406947
#3 SEX 1.52 0.14 10.87 0.10 rs1406947
#4 M 0.12 0.29 0.41 0.67 rs1406947
#5 N -0.06 0.12 -0.48 0.63 rs1406947
#6 GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
#7 GT:N 0.00 0.06 -0.08 0.93 rs1406947
#8 (Intercept) 9.23 0.20 34.64 0.00 rs25904
#9 GT 0.05 0.04 0.12 0.22 rs25904
#10 SEX 1.67 0.76 10.34 0.10 rs25904
#11 M 0.14 0.39 0.51 0.55 rs25904
#12 N -0.08 0.05 -0.46 0.55 rs25904
#13 GT:SEX -0.19 0.11 -0.34 0.44 rs25904
#14 GT:N -0.22 0.33 -0.44 0.55 rs25904
Data:
x <- read.table(header=TRUE, text="term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579")

Related

Error in cor(Auto[, -9]) : 'x' must be numeric

data("Auto")
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
cor(Auto[, -9])
Error in cor(Auto[, -9]) : 'x' must be numeric
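I don't know how to deal with this error.
I think the data frame you pass to cor() still contains a non-numeric column, which is what the error message says. Starting from a fresh copy of ISLR::Auto, where column 9 (name) is the only non-numeric column, works. Try -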
data <- ISLR::Auto
data$mpg01 <- as.integer(data$mpg > median(data$mpg))
cor(data[, -9])
# mpg cylinders displacement horsepower weight acceleration year origin mpg01
#mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42 0.58 0.57 0.84
#cylinders -0.78 1.00 0.95 0.84 0.90 -0.50 -0.35 -0.57 -0.76
#displacement -0.81 0.95 1.00 0.90 0.93 -0.54 -0.37 -0.61 -0.75
#horsepower -0.78 0.84 0.90 1.00 0.86 -0.69 -0.42 -0.46 -0.67
#weight -0.83 0.90 0.93 0.86 1.00 -0.42 -0.31 -0.59 -0.76
#acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00 0.29 0.21 0.35
#year 0.58 -0.35 -0.37 -0.42 -0.31 0.29 1.00 0.18 0.43
#origin 0.57 -0.57 -0.61 -0.46 -0.59 0.21 0.18 1.00 0.51
#mpg01 0.84 -0.76 -0.75 -0.67 -0.76 0.35 0.43 0.51 1.00
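If you would rather not rely on a column position such as -9, one option is to select the numeric columns explicitly before calling cor(). A minimal sketch, using the built-in iris data as a stand-in (an assumption made for illustration, since it ships with base R and has one factor column):

```r
# iris has four numeric columns plus Species, which is a factor
numeric_cols <- vapply(iris, is.numeric, logical(1))

# cor() only accepts numeric input, so drop everything else first
cor_matrix <- cor(iris[, numeric_cols])
```

The same pattern applied to the Auto data would drop any non-numeric column regardless of where it sits.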

Create data frame from EFA output in R

I am working on EFA and would like to customize my tables. There is a function, print.psych, that suppresses factor loadings below a certain value to make the table easier to read. When I run this function, it prints the loadings and the summary stats in the console (in an .Rmd document, it produces console text and a separate data frame of the factor loadings with loadings suppressed). However, if I attempt to save the result as an object, it does not keep this data.
Here is an example:
library(psych)
bfi_data=bfi
bfi_data=bfi_data[complete.cases(bfi_data),]
bfi_cor <- cor(bfi_data)
factors_data <- fa(r = bfi_cor, nfactors = 6)
print.psych(factors_data, cut = .32, sort = TRUE)
In an R script, it produces this:
item MR2 MR3 MR1 MR5 MR4 MR6 h2 u2 com
N2 17 0.83 0.654 0.35 1.0
N1 16 0.82 0.666 0.33 1.1
N3 18 0.69 0.549 0.45 1.1
N5 20 0.47 0.376 0.62 2.2
N4 19 0.44 0.43 0.506 0.49 2.4
C4 9 -0.67 0.555 0.45 1.3
C2 7 0.66 0.475 0.53 1.4
C5 10 -0.56 0.433 0.57 1.4
C3 8 0.56 0.317 0.68 1.1
C1 6 0.54 0.344 0.66 1.3
In R Markdown, it produces this:
How can I save that data.frame as an object?
Looking at the str() of the object, it doesn't look like what you want is built in. An ugly way would be to use capture.output() and convert the resulting character vector to a data frame with string manipulation. However, since the values are displayed, they must be stored somewhere in the object itself. I found vectors of the same length that can be combined to form the data frame:
loadings <- unclass(factors_data$loadings)
h2 <- factors_data$communalities
#There is also factors_data$communality which has same values
u2 <- factors_data$uniquenesses
com <- factors_data$complexity
data <- cbind(loadings, h2, u2, com)
data
This returns :
# MR2 MR3 MR1 MR5 MR4 MR6 h2 u2 com
#A1 0.11 0.07 -0.07 -0.56 -0.01 0.35 0.38 0.62 1.85
#A2 0.03 0.09 -0.08 0.64 0.01 -0.06 0.47 0.53 1.09
#A3 -0.04 0.04 -0.10 0.60 0.07 0.16 0.51 0.49 1.26
#A4 -0.07 0.19 -0.07 0.41 -0.13 0.13 0.29 0.71 2.05
#A5 -0.17 0.01 -0.16 0.47 0.10 0.22 0.47 0.53 2.11
#C1 0.05 0.54 0.08 -0.02 0.19 0.05 0.34 0.66 1.32
#C2 0.09 0.66 0.17 0.06 0.08 0.16 0.47 0.53 1.36
#C3 0.00 0.56 0.07 0.07 -0.04 0.05 0.32 0.68 1.09
#C4 0.07 -0.67 0.10 -0.01 0.02 0.25 0.55 0.45 1.35
#C5 0.15 -0.56 0.17 0.02 0.10 0.01 0.43 0.57 1.41
#E1 -0.14 0.09 0.61 -0.14 -0.08 0.09 0.41 0.59 1.34
#E2 0.06 -0.03 0.68 -0.07 -0.08 -0.01 0.56 0.44 1.07
#E3 0.02 0.01 -0.32 0.17 0.38 0.28 0.51 0.49 3.28
#E4 -0.07 0.03 -0.49 0.25 0.00 0.31 0.56 0.44 2.26
#E5 0.16 0.27 -0.39 0.07 0.24 0.04 0.41 0.59 3.01
#N1 0.82 -0.01 -0.09 -0.09 -0.03 0.02 0.67 0.33 1.05
#N2 0.83 0.02 -0.07 -0.07 0.01 -0.07 0.65 0.35 1.04
#N3 0.69 -0.03 0.13 0.09 0.02 0.06 0.55 0.45 1.12
#N4 0.44 -0.14 0.43 0.09 0.10 0.01 0.51 0.49 2.41
#N5 0.47 -0.01 0.21 0.21 -0.17 0.09 0.38 0.62 2.23
#O1 -0.05 0.07 -0.01 -0.04 0.57 0.09 0.36 0.64 1.11
#O2 0.12 -0.09 0.01 0.12 -0.43 0.28 0.30 0.70 2.20
#O3 0.01 0.00 -0.10 0.05 0.65 0.04 0.48 0.52 1.06
#O4 0.10 -0.05 0.34 0.15 0.37 -0.04 0.24 0.76 2.55
#O5 0.04 -0.04 -0.02 -0.01 -0.50 0.30 0.33 0.67 1.67
#gender 0.20 0.09 -0.12 0.33 -0.21 -0.15 0.18 0.82 3.58
#education -0.03 0.01 0.05 0.11 0.12 -0.22 0.07 0.93 2.17
#age -0.06 0.07 -0.02 0.16 0.03 -0.26 0.10 0.90 2.05
Ronak Shaw answered my question above, and I used his answer to help create the following function, which nearly reproduces the print.psych data.frame of fa.sort output:
library(psych)
library(dplyr)
library(tibble)
fa_table <- function(x, cut) {
  # get sorted loadings from the fitted object passed in as x
  loadings <- fa.sort(x)$loadings %>% round(3)
  # blank out loadings whose absolute value is below the cut,
  # so that strong negative loadings are kept
  loadings[abs(loadings) < cut] <- ""
  # get additional info
  add_info <- cbind(x$communalities,
                    x$uniquenesses,
                    x$complexity) %>%
    as.data.frame() %>%
    rename("commonality" = V1,
           "uniqueness" = V2,
           "complexity" = V3) %>%
    rownames_to_column("item")
  # build table
  loadings %>%
    unclass() %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    left_join(add_info, by = "item") %>%
    mutate(across(where(is.numeric), ~ round(.x, 3)))
}
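The suppression step can be tried on its own with a plain matrix. The sketch below uses a small made-up loadings matrix (the names `toy_loadings`, `cut_value` and `shown` are hypothetical) and compares absolute values against the cut, so that a strong negative loading such as -0.67 stays visible, as it does in the print.psych output shown in the question.

```r
# Hypothetical 2 x 2 loadings matrix, filled column by column
toy_loadings <- matrix(
  c(0.82, -0.05, 0.10, -0.67),
  nrow = 2,
  dimnames = list(c("N1", "C4"), c("MR2", "MR3"))
)
cut_value <- 0.32

# Format to two decimals first, then blank the cells below the cut
shown <- format(round(toy_loadings, 2), nsmall = 2)
shown[abs(toy_loadings) < cut_value] <- ""
```

Note that the result is a character matrix, so it is suitable for display but not for further arithmetic.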

R. Add column to df where rows have names of element from list

I have a list of all files (dataframes) within a directory:
library("plyr")
library("dplyr")
library("broom")
library("tidyr")
snp_list <- list.files(pattern="*.txt", all.files = T,full.names = F)
I also have a dataframe A obtained through the following code:
A <- lapply(snp_list, function(x) tidy(lm(PV ~ GT*SEX + M + GT*N, read.table(x, header = TRUE)))) %>%
  bind_rows()
Dataframe A has 7 rows ((Intercept), GT, SEX, M, N, GT:SEX, GT:N) for each element in list snp_list. In this toy example the list has 3 elements (rs1406947.txt rs25904.txt rs7133579.txt), but in reality there are 1,200,000 elements
A:
term estimate st.error statistic p.value
(Intercept) 7.68 0.17 44.64 0
GT 0.01 0.01 0.07 0.19
SEX 1.52 0.14 10.87 0.1
M 0.12 0.29 0.41 0.67
N -0.06 0.12 -0.48 0.63
GT:SEX -0.03 0.08 -0.44 0.65
GT:N -0.00 0.06 -0.08 0.93
(Intercept) 9.23 0.20 34.64 0
GT 0.05 0.04 0.12 0.22
SEX 1.67 0.76 10.34 0.1
M 0.14 0.39 0.51 0.55
N -0.08 0.05 -0.46 0.55
GT:SEX -0.19 0.11 -0.34 0.44
GT:N -0.22 0.33 -0.44 0.55
(Intercept) 7.99 0.66 44.44 0
GT 0.01 0.3 0.04 0.33
SEX 1.22 0.22 10.44 0.15
M 0.88 0.22 0.33 0.44
N -0.5 0.5 -0.5 0.6
GT:SEX -0.06 0.09 -0.74 0.35
GT:N -0.00 0.03 -0.04 0.78
I want to add a new column "SNP" to A, where each row has the name of the element the row belongs to (nrows = 7*1,200,000). I would get this:
term estimate st.error statistic p.value SNP
(Intercept) 7.68 0.17 44.64 0 rs1406947
GT 0.01 0.01 0.07 0.19 rs1406947
SEX 1.52 0.14 10.87 0.1 rs1406947
M 0.12 0.29 0.41 0.67 rs1406947
N -0.06 0.12 -0.48 0.63 rs1406947
GT:SEX -0.03 0.08 -0.44 0.65 rs1406947
GT:N -0.00 0.06 -0.08 0.93 rs1406947
(Intercept) 9.23 0.20 34.64 0 rs25904
GT 0.05 0.04 0.12 0.22 rs25904
SEX 1.67 0.76 10.34 0.1 rs25904
M 0.14 0.39 0.51 0.55 rs25904
N -0.08 0.05 -0.46 0.55 rs25904
GT:SEX -0.19 0.11 -0.34 0.44 rs25904
GT:N -0.22 0.33 -0.44 0.55 rs25904
(Intercept) 7.99 0.66 44.44 0 rs7133579
GT 0.01 0.3 0.04 0.33 rs7133579
SEX 1.22 0.22 10.44 0.15 rs7133579
M 0.88 0.22 0.33 0.44 rs7133579
N -0.5 0.5 -0.5 0.6 rs7133579
GT:SEX -0.06 0.09 -0.74 0.35 rs7133579
GT:N -0.00 0.03 -0.04 0.78 rs7133579
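Here's how to do what you asked:
A$SNP <- character(nrow(A))
for (i in seq_len(nrow(A))) {
  # rows 1-7 map to element 1, rows 8-14 to element 2, and so on
  A$SNP[i] <- sub("\\.txt$", "", snp_list[(i - 1) %/% 7 + 1])
}
Using integer division on the zero-based row number, each run of 7 consecutive rows maps to one element of snp_list; sub() strips the .txt extension so the labels match the desired output. This assumes every file contributed exactly 7 rows.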
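As an alternative to the loop, the labels can be built in one vectorized step, which matters with more than a million files. A minimal sketch, assuming every file contributes exactly 7 rows; the `A` below is a hypothetical stand-in with the right number of rows, and `rows_per_file` and `snp_names` are names introduced for illustration:

```r
# Stand-ins for the objects in the question (3 files, 7 rows each)
snp_list <- c("rs1406947.txt", "rs25904.txt", "rs7133579.txt")
rows_per_file <- 7
A <- data.frame(estimate = seq_len(length(snp_list) * rows_per_file))

# Strip the ".txt" extension, then repeat each name once per row of its block
snp_names <- sub("\\.txt$", "", snp_list)
A$SNP <- rep(snp_names, each = rows_per_file)
```

If some models return fewer than 7 rows, this positional mapping would drift, so in that case the label needs to be attached to each file's result before the results are combined.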

unique() function output in r not accurate - replicates values

I used the unique() function to extract unique timestamps from a large electrophysiology dataset. To my knowledge, the value of length(unique(dataset$time)) should have been 201; however, I am getting 1321.
Desired output:
unique(dataset$time)
[1] -1.00 -0.99 -0.98 -0.97 -0.96 -0.95 -0.94 -0.93 -0.92 -0.91 -0.90 -0.89 -0.88 -0.87 -0.86
[16] -0.85 -0.84 -0.83 -0.82 -0.81 -0.80 -0.79 -0.78 -0.77 -0.76 -0.75 -0.74 -0.73 -0.72 -0.71
[31] -0.70 -0.69 -0.68 -0.67 -0.66 -0.65 -0.64 -0.63 -0.62 -0.61 -0.60 -0.59 -0.58 -0.57 -0.56
[46] -0.55 -0.54 -0.53 -0.52 -0.51 -0.50 -0.49 -0.48 -0.47 -0.46 -0.45 -0.44 -0.43 -0.42 -0.41
[61] -0.40 -0.39 -0.38 -0.37 -0.36 -0.35 -0.34 -0.33 -0.32 -0.31 -0.30 -0.29 -0.28 -0.27 -0.26
[76] -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11
[91] -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04
[106] 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19
[121] 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34
[136] 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49
[151] 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 0.63 0.64
[166] 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79
[181] 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90 0.91 0.92 0.93 0.94
[196] 0.95 0.96 0.97 0.98 0.99 1.00
The current output replicates this exact pattern and starts repeating the -1, -0.9 etc after reaching 1.0.
Any ideas on why this would happen?

How to subset a time series in R

In particular, I'd like to subset the temperature measurements from 1960 onwards in the time series gtemp in the package astsa:
require(astsa)
gtemp
Time Series:
Start = 1880
End = 2009
Frequency = 1
[1] -0.28 -0.21 -0.26 -0.27 -0.32 -0.32 -0.29 -0.36 -0.27 -0.17 -0.39 -0.27 -0.32
[14] -0.33 -0.33 -0.25 -0.14 -0.11 -0.25 -0.15 -0.07 -0.14 -0.24 -0.30 -0.34 -0.24
[27] -0.19 -0.39 -0.33 -0.35 -0.33 -0.34 -0.32 -0.30 -0.15 -0.10 -0.30 -0.39 -0.33
[40] -0.20 -0.19 -0.14 -0.26 -0.22 -0.22 -0.17 -0.02 -0.15 -0.12 -0.26 -0.08 -0.02
[53] -0.08 -0.19 -0.07 -0.12 -0.05 0.07 0.10 0.01 0.04 0.10 0.03 0.09 0.19
[66] 0.06 -0.05 0.00 -0.04 -0.07 -0.16 -0.04 0.03 0.11 -0.10 -0.10 -0.17 0.08
[79] 0.08 0.06 -0.01 0.07 0.04 0.08 -0.21 -0.11 -0.03 -0.01 -0.04 0.08 0.03
[92] -0.10 0.00 0.14 -0.08 -0.05 -0.16 0.12 0.01 0.08 0.18 0.26 0.04 0.26
[105] 0.09 0.05 0.12 0.26 0.31 0.19 0.37 0.35 0.12 0.13 0.23 0.37 0.29
[118] 0.39 0.56 0.32 0.33 0.48 0.56 0.55 0.48 0.62 0.54 0.57 0.43 0.57
The individual time points are not labeled in years, so although I can do gtemp[3] to get -0.26, I can't do something like gtemp[as.date(1960)] to get the value in 1960.
How can I bring out the correspondence between year and measurements, so as to later subset values?
We can make use of the window function
gtemp1 <- window(gtemp, start = 1960)
gtemp1
#Time Series:
#Start = 1960
#End = 2009
#Frequency = 1
#[1] -0.01 0.07 0.04 0.08 -0.21 -0.11 -0.03 -0.01 -0.04 0.08 0.03
#[12]-0.10 0.00 0.14 -0.08 -0.05 -0.16 0.12 0.01 0.08 0.18 0.26
#[23] 0.04 0.26 0.09 0.05 0.12 0.26 0.31 0.19 0.37 0.35 0.12
#[34] 0.13 0.23 0.37 0.29 0.39 0.56 0.32 0.33 0.48 0.56 0.55
#[45] 0.48 0.62 0.54 0.57 0.43 0.57
The function time() can also help answer your question, "How can I bring out the correspondence between year and measurements, so as to later subset values?"
head(time(gtemp))
[1] 1880 1881 1882 1883 1884 1885
If you want the value that corresponds to 1961, you can write
gtemp[time(gtemp) == 1961]
[1] 0.07
As mentioned in the first answer, you can also use the function window
window(gtemp, start = 1961, end = 1961)
Time Series:
Start = 1961
End = 1961
Frequency = 1
[1] 0.07
which returns the result as a one-point time series. You can convert it to a number with
as.numeric(window(gtemp, start = 1961, end = 1961))
[1] 0.07
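Both approaches can be tried without installing astsa. The sketch below uses a small made-up annual series (`toy_ts` and its values are assumptions for illustration, not the gtemp data) to show that window() and logical indexing on time() select the same observations.

```r
# Hypothetical annual series covering 2000 to 2004
toy_ts <- ts(c(10, 20, 30, 40, 50), start = 2000, frequency = 1)

# window() keeps the time series attributes
from_2003 <- window(toy_ts, start = 2003)

# Indexing with time() returns a plain numeric vector
from_2003_plain <- toy_ts[time(toy_ts) >= 2003]

# A single year, as a plain number
value_2001 <- as.numeric(window(toy_ts, start = 2001, end = 2001))
```

window() is the safer choice when the result is passed to other time series functions, since the start year and frequency travel with it.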
