Import specific rows from a "txt" file into R

I have an "example.txt" file that looks like this:
SIGNAL: 40 41 42
0.406 0.043 0.051 0.021 0.013
0.056 0.201 0.026 0.009 0.000
0.000 0.128 0 0.009 0.000
TOTAL: 0.657
SIGNAL: 44 45 46 48
0.128 0.338 0.026
0.333 0.03 0.000
0.060 0.013 0.004
0.009 0.017 0.009
0.013 0 0.000
TOTAL: 0.704
SIGNAL: 51 52 54
0.368 0.081 0.085 0.004
0.162 0.09 0.064 0.073
0.013 0.017 0.009 0.000
TOTAL: 0.266
SIGNAL: 60 61 62 63 64 65 66 67
0.530 0.030
0.009 0.179
0.154 0.004
0.068 0.009
TOTAL: 0.796
I want to import the rows between "SIGNAL: 44 45 46 48" and "TOTAL: 0.704" into R. Using read.table("example.txt", skip=6, nrow=5) extracts exactly these rows:
V1 V2 V3
1 0.128 0.338 0.026
2 0.333 0.030 0.000
3 0.060 0.013 0.004
4 0.009 0.017 0.009
5 0.013 0.000 0.000
However, my real data is very large (450,000 rows). If I want to extract the rows between "SIGNAL: 3000 3001 3002 3003" and the next "TOTAL", how can I do it? Thank you so much!

I have worked it out based on akrun's code. For example, to extract the first two sets, I can use:
lines <- readLines('example.txt')
g <- c(40, 44) # first number of each SIGNAL header to extract
sapply(seq_along(g), function(x) {
  start <- grep(paste('SIGNAL:', g[x]), lines) # line of the wanted SIGNAL header
  end <- grep('TOTAL', lines)[which(start == grep('SIGNAL', lines))] # TOTAL line of the same block
  Map(function(i, j) read.table(text = lines[(i + 1):(j - 1)], sep = '', header = FALSE), start, end)
})
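For a single block, the same idea can be written as a small helper. This is only a sketch: the function name read_signal_block is made up for illustration, and it assumes every "SIGNAL:" header is eventually followed by a "TOTAL:" line. It matches the full header text, so a pattern such as "SIGNAL: 40" cannot accidentally match "SIGNAL: 400" in a larger file.

```r
# Recreate the first two blocks of example.txt in a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "SIGNAL: 40 41 42",
  "0.406 0.043 0.051 0.021 0.013",
  "0.056 0.201 0.026 0.009 0.000",
  "0.000 0.128 0 0.009 0.000",
  "TOTAL: 0.657",
  "SIGNAL: 44 45 46 48",
  "0.128 0.338 0.026",
  "0.333 0.03 0.000",
  "0.060 0.013 0.004",
  "0.009 0.017 0.009",
  "0.013 0 0.000",
  "TOTAL: 0.704"
), path)

# Hypothetical helper: read the rows between one SIGNAL header and the next TOTAL.
read_signal_block <- function(path, header) {
  lines <- trimws(readLines(path))
  start <- which(lines == header)[1]      # exact match on the whole header line
  if (is.na(start)) stop("header not found: ", header)
  totals <- grep("^TOTAL:", lines)
  end <- totals[totals > start][1]        # first TOTAL line after the header
  if (is.na(end)) stop("no TOTAL line after: ", header)
  read.table(text = lines[(start + 1):(end - 1)], header = FALSE)
}

block <- read_signal_block(path, "SIGNAL: 44 45 46 48")
print(block)
```

Because the file is read once with readLines, the helper could be adapted to take the lines vector instead of the path when many blocks are needed from the same large file.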

Related

selectiveinference package: fixedLassoInf after lasso Logit

Following Taylor and Tibshirani (2015), I'm applying the selectiveInference package in R after a lasso logit fit with glmnet. Specifically, I'm interested in inference for the lasso at a fixed lambda.
My code is reported below.
First, I standardized the X matrix (as suggested in the package manual: https://cran.r-project.org/web/packages/selectiveInference/selectiveInference.pdf).
Then I fit glmnet and extracted the beta coefficients for a lambda previously chosen by LOOCV.
X.std <- std(X1[1:2833,])
fit = glmnet(X.std, Y[1:2833],alpha=1, family=c("binomial"))
fit$lambda
lambda=0.00431814
n=2833
beta_hat = coef(fit, x=X.std, y=Y[1:2833], s=lambda/n, exact=TRUE)
beta_hat
out = fixedLassoInf(X.std, Y[1:2833],beta_hat,lambda,family="binomial")
out
After running the code, this is what I get. I understand that the warning relates to the KKT conditions and that the problem is specific to the lasso logit: when I try family="gaussian", I get no warnings or errors.
Warning message:
In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = type, :
Solution beta does not satisfy the KKT conditions (to within specified tolerances)
> res
Call:
fixedLassoInf(x = X.std, y = Y[1:2833], beta = b, lambda = lam,
family = c("binomial"))
Testing results at lambda = 0.004, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 58.558 6.496 0.000 46.078 124.807 0.049 0.050
2 -8.008 -2.815 0.005 -13.555 -3.106 0.049 0.049
3 -18.514 -6.580 0.000 -31.262 -14.153 0.049 0.048
4 -1.070 -0.390 0.447 -22.976 19.282 0.050 0.050
5 -0.320 -1.231 0.610 -0.660 1.837 0.050 0.000
6 -0.448 -1.906 0.619 -2.378 5.056 0.050 0.050
7 -47.732 -9.624 0.000 -161.370 -44.277 0.050 0.050
8 -39.023 -8.378 0.000 -54.988 -31.510 0.050 0.048
10 23.827 1.991 0.181 -20.151 42.867 0.049 0.049
11 -2.454 -0.522 0.087 -269.951 9.345 0.050 0.050
12 0.045 0.018 0.993 -Inf -14.962 0.000 0.050
13 -18.647 -1.143 0.156 -149.623 25.464 0.050 0.050
14 -3.508 -1.140 0.305 -8.444 7.000 0.049 0.049
15 -0.620 -0.209 0.846 -3.486 46.045 0.050 0.050
16 -3.960 -1.288 0.739 -6.931 47.641 0.049 0.050
17 -8.587 -3.010 0.023 -42.700 -2.474 0.050 0.049
18 2.851 0.986 0.031 2.745 196.728 0.050 0.050
19 -6.612 -1.258 0.546 -14.967 37.070 0.049 0.000
20 -11.621 -2.291 0.021 -29.558 -2.536 0.050 0.049
21 -76.957 -0.980 0.565 -186.701 483.180 0.049 0.050
22 -13.556 -5.053 0.000 -126.367 -13.274 0.050 0.049
23 -4.836 -0.388 0.519 -109.667 125.933 0.050 0.050
24 11.355 0.898 0.492 -55.335 30.312 0.050 0.049
25 -1.118 -0.146 0.919 -4.439 232.172 0.049 0.050
26 -7.776 -1.298 0.200 -17.540 8.006 0.050 0.049
27 0.678 0.234 0.515 -42.265 38.710 0.050 0.050
28 32.938 1.065 0.335 -77.314 82.363 0.050 0.049
Does anyone know how to resolve this warning?
I would like to understand which "tolerances" I should specify.
Thanks for the help.

R: How to find which rows in a file are responsible for two 'populations'

Let's say I have two input data files. The first looks like this:
1 0.00038 0.75053 0.50 35 6000 0.75346
2 0.00038 0.75053 0.50 35 6050 0.72079
3 0.00038 0.75053 0.50 35 6100 0.69229
4 0.00038 0.75053 0.50 35 6150 0.66689
5 0.00038 0.75053 0.50 35 6200 0.64382
6 0.00038 0.75053 0.50 35 6250 0.62269
7 0.00038 0.75053 0.50 35 6300 0.60313
8 0.00038 0.75053 0.50 35 6350 0.58481
9 0.00038 0.75053 0.50 35 6400 0.56756
10 0.00038 0.75053 0.50 35 6450 0.55122
And the second one looks like this:
1 -0.123 -0.306 inf 1.043 0.000 0.010 0.000 0.653 0.000 0.091 0.000 0.009 0.000 3.097 0.000 0.137 0.002
2 -0.142 -0.170 inf 1.035 0.000 0.064 0.000 0.538 0.000 0.560 0.000 0.289 0.000 3.168 0.000 6.182 0.000
3 -0.160 -0.143 inf 1.027 0.000 0.086 0.000 0.401 0.000 0.631 0.000 0.400 0.000 3.348 0.000 0.130 0.000
4 -0.176 -0.117 inf 1.020 0.000 0.107 0.000 0.249 0.000 0.592 0.000 0.435 0.000 3.526 0.000 0.402 0.001
5 -0.191 -0.110 inf 1.014 0.000 0.133 0.000 0.091 0.000 0.514 0.000 0.425 0.000 3.644 0.001 0.598 0.001
6 -0.206 -0.099 inf 1.008 0.000 0.162 0.000 6.247 0.000 0.435 0.001 0.392 0.001 3.675 0.001 0.707 0.002
7 -0.220 -0.093 0.976 1.003 0.000 0.194 0.000 6.168 0.001 0.377 0.001 0.352 0.001 3.602 0.003 0.740 0.003
8 -0.233 -0.092 inf 0.999 0.000 0.226 0.000 6.137 0.001 0.353 0.001 0.302 0.001 3.445 0.004 0.712 0.005
9 -0.246 -0.124 inf 0.996 0.000 0.258 0.000 6.145 0.001 0.363 0.001 0.252 0.001 3.242 0.004 0.620 0.006
10 -0.259 -0.119 inf 0.994 0.000 0.289 0.000 6.172 0.001 0.393 0.001 0.206 0.001 3.028 0.005 0.456 0.008
Now, as you can see, there appear to be two populations in this graph. I would like to find out which rows in the second file correspond to each population. What would be the best way to do this?
If you would like to reproduce this yourself here is the first input file and the second input file.

Need to read specific lines of text file that doesn't have a great format into data frame (preferably multiple data frames)?

I'm completely new to R and I'm not sure of the best way to deal with this file, so I'm hoping someone can at least point me in the right direction. I've searched for other solutions and tried using grepl, but I can't figure out how to read only some of the data. The file I'm trying to read looks something like the text below:
##BLOCKS= 8
Plate: Plate01 1.3 PlateFormat Endpoint Absorbance Raw FALSE 1 1 630 1 12 96 1 8 None
Temperature(°C) 1 2 3 4 5 6 7 8 9 10 11 12
0.00 0.042 0.067 0.292 0.206 0.071 0.067 0.04 0.063 0.059 0.04 0.066 0.04
0.043 0.172 0.179 0.199 0.073 0.067 0.04 0.062 0.058 0.039 0.066 0.039
0.04 0.066 0.29 0.185 0.072 0.067 0.04 0.062 0.058 0.039 0.065 0.039
0.039 0.068 0.291 0.189 0.075 0.069 0.04 0.064 0.058 0.041 0.064 0.039
0.042 0.063 0.271 0.191 0.07 0.068 0.04 0.065 0.058 0.041 0.066 0.04
0.041 0.067 0.342 0.199 0.069 0.066 0.041 0.065 0.057 0.04 0.065 0.042
0.044 0.064 0.295 0.198 0.069 0.067 0.039 0.064 0.057 0.04 0.067 0.041
0.041 0.067 0.29 0.211 0.066 0.067 0.043 0.056 0.058 0.042 0.067 0.042
~End
Plate: Plate#1 1.3 PlateFormat Endpoint Absorbance Raw FALSE 1 1 630 1 12 96 1 8 None
Temperature(°C) 1 2 3 4 5 6 7 8 9 10 11 12
0.00 0.042 0.072 0.257 0.165 0.074 0.07 0.04 0.067 0.055 0.04 0.07 0.04
0.042 0.164 0.136 0.195 0.075 0.07 0.041 0.066 0.055 0.04 0.069 0.04
0.041 0.07 0.344 0.198 0.074 0.069 0.041 0.065 0.055 0.04 0.068 0.04
0.04 0.069 0.307 0.199 0.075 0.072 0.041 0.067 0.055 0.043 0.068 0.041
0.043 0.068 0.296 0.214 0.072 0.071 0.042 0.067 0.055 0.041 0.068 0.041
0.041 0.071 0.452 0.241 0.072 0.069 0.042 0.067 0.054 0.041 0.068 0.043
0.044 0.068 0.299 0.182 0.071 0.071 0.042 0.067 0.054 0.041 0.069 0.041
0.042 0.071 0.333 0.13 0.068 0.07 0.042 0.058 0.054 0.042 0.07 0.041
~End
I only want the columns numbered 1-12 (next to Temperature) and the data under them. I'm new to R but have some programming experience, so I don't necessarily need someone to tell me exactly how to do this. If anyone could point me toward the functions I should be looking at, I'd really appreciate the help!
Step 1: Get the data into the R session with readLines
Lines <- readLines(textConnection("##BLOCKS= 8
Plate: Plate01 1.3 PlateFormat Endpoint Absorbance Raw FALSE 1 1 630 1 12 96 1 8 None
Temperature(°C) 1 2 3 4 5 6 7 8 9 10 11 12
0.00 0.042 0.067 0.292 0.206 0.071 0.067 0.04 0.063 0.059 0.04 0.066 0.04
0.043 0.172 0.179 0.199 0.073 0.067 0.04 0.062 0.058 0.039 0.066 0.039
0.04 0.066 0.29 0.185 0.072 0.067 0.04 0.062 0.058 0.039 0.065 0.039
0.039 0.068 0.291 0.189 0.075 0.069 0.04 0.064 0.058 0.041 0.064 0.039
0.042 0.063 0.271 0.191 0.07 0.068 0.04 0.065 0.058 0.041 0.066 0.04
0.041 0.067 0.342 0.199 0.069 0.066 0.041 0.065 0.057 0.04 0.065 0.042
0.044 0.064 0.295 0.198 0.069 0.067 0.039 0.064 0.057 0.04 0.067 0.041
0.041 0.067 0.29 0.211 0.066 0.067 0.043 0.056 0.058 0.042 0.067 0.042
~End
Plate: Plate#1 1.3 PlateFormat Endpoint Absorbance Raw FALSE 1 1 630 1 12 96 1 8 None
Temperature(°C) 1 2 3 4 5 6 7 8 9 10 11 12
0.00 0.042 0.072 0.257 0.165 0.074 0.07 0.04 0.067 0.055 0.04 0.07 0.04
0.042 0.164 0.136 0.195 0.075 0.07 0.041 0.066 0.055 0.04 0.069 0.04
0.041 0.07 0.344 0.198 0.074 0.069 0.041 0.065 0.055 0.04 0.068 0.04
0.04 0.069 0.307 0.199 0.075 0.072 0.041 0.067 0.055 0.043 0.068 0.041
0.043 0.068 0.296 0.214 0.072 0.071 0.042 0.067 0.055 0.041 0.068 0.041
0.041 0.071 0.452 0.241 0.072 0.069 0.042 0.067 0.054 0.041 0.068 0.043
0.044 0.068 0.299 0.182 0.071 0.071 0.042 0.067 0.054 0.041 0.069 0.041
0.042 0.071 0.333 0.13 0.068 0.07 0.042 0.058 0.054 0.042 0.07 0.041
~End"))
Steps 2 & 3: Build a condition that keeps the good data lines, then build a grouping variable
?strsplit
# Couldn't remember name of `substr`, figured the ?strsplit page would show link
start <- substr(Lines, 1,1) # 1st char was sufficient to build a rule
table(start)
#--- result ----
start
      #  ~  0  P  T
 2 14  1  2  2  2  2
# the 14 is the count of " " (just spaces); the "#" in the header row is a first character, not a comment
#end table
goodL <- Lines[start %in% c(" ","T","0") ]
goodL # Look at result
group <- cumsum(substr(goodL , 1,4)=="Temp") #build grouping
group # check the grouping variable
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
Step 4: Process the groups with lapply(split(goodL, group), function(x) ...
dfrms <- lapply(split(goodL, group),
                function(x) read.table(text = substr(x, 16, 100), # stuff to right of 16th char
                                       header = TRUE))
str(dfrms) # check result: not correct, the 12th column is missing
List of 2
$ 1:'data.frame': 8 obs. of 11 variables:
..$ X1 : num [1:8] 0.042 0.043 0.04 0.039 0.042 0.041 0.044 0.041
..$ X2 : num [1:8] 0.067 0.172 0.066 0.068 0.063 0.067 0.064 0.067
# -----snipped output
dfrms <- lapply(split(goodL, group), # will be a list of dataframes
function(x) read.table(text =substr(x, 16, 120), header=TRUE))
str(dfrms) # Looks good
List of 2
$ 1:'data.frame': 8 obs. of 12 variables:
..$ X1 : num [1:8] 0.042 0.043 0.04 0.039 0.042 0.041 0.044 0.041
..$ X2 : num [1:8] 0.067 0.172 0.066 0.068 0.063 0.067 0.064 0.067
..$ X3 : num [1:8] 0.292 0.179 0.29 0.291 0.271 0.342 0.295 0.29
#--- snippped output
I'd like to give credit to @G.Grothendieck for this strategy. Doing a search on "user:516548 readLines" will pull up many other elegant examples of a similar approach.
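The same block-splitting idea can also be sketched without relying on fixed character positions. The version below is only an illustration under stated assumptions: it uses a made-up miniature of the file (3 wells per row instead of 12, and a simplified "Temperature" header), it assumes each block starts at a "Temperature" line and ends at "~End", and it keeps the last n_wells numbers of each row so that the leading temperature value on the first row is dropped. The names n_wells and parse_block are hypothetical.

```r
# Hypothetical miniature of the plate file: two blocks, three wells per row.
Lines <- c(
  "##BLOCKS= 2",
  "Plate: Plate01 1.3 PlateFormat Endpoint",
  "Temperature 1 2 3",
  "0.00 0.042 0.067 0.292",
  "0.043 0.172 0.179",
  "~End",
  "Plate: Plate02 1.3 PlateFormat Endpoint",
  "Temperature 1 2 3",
  "0.00 0.042 0.072 0.257",
  "0.042 0.164 0.136",
  "~End"
)

n_wells <- 3                                    # 12 for the real file
starts <- which(startsWith(Lines, "Temperature"))
ends   <- which(startsWith(Lines, "~End"))

# Parse the rows strictly between a header line s and its "~End" line e.
parse_block <- function(s, e) {
  rows <- Lines[(s + 1):(e - 1)]
  vals <- lapply(strsplit(trimws(rows), "[[:space:]]+"), as.numeric)
  # Keep the last n_wells numbers of each row (drops the temperature on row 1).
  mat <- t(vapply(vals, function(v) tail(v, n_wells), numeric(n_wells)))
  as.data.frame(mat)
}

plates <- Map(parse_block, starts, ends)        # list with one data frame per plate
str(plates)
```

Pairing starts with ends position by position assumes the two vectors have the same length, which holds when every block is properly closed.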

mean() returning error "argument is not numeric or logical: returning NA" but only for some columns in data frame?

I'm pretty new to R, so maybe this is something obvious, but I'm not sure what's going on. I load a file that has a bunch of data that I then split up into separate data frames. They look something like:
V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
3 1.000 2 3 4 5 6 7.000 8.000 9.000 10.000 11.000 12.000
4 0.042 0.067 0.292 0.206 0.071 0.067 0.040 0.063 0.059 0.040 0.066 0.040
5 0.043 0.172 0.179 0.199 0.073 0.067 0.040 0.062 0.058 0.039 0.066 0.039
6 0.040 0.066 0.29 0.185 0.072 0.067 0.040 0.062 0.058 0.039 0.065 0.039
7 0.039 0.068 0.291 0.189 0.075 0.069 0.040 0.064 0.058 0.041 0.064 0.039
8 0.042 0.063 0.271 0.191 0.07 0.068 0.040 0.065 0.058 0.041 0.066 0.040
9 0.041 0.067 0.342 0.199 0.069 0.066 0.041 0.065 0.057 0.040 0.065 0.042
10 0.044 0.064 0.295 0.198 0.069 0.067 0.039 0.064 0.057 0.040 0.067 0.041
11 0.041 0.067 0.29 0.211 0.066 0.067 0.043 0.056 0.058 0.042 0.067 0.042
I'm trying to find the means of rows 4-6 and 7-9 for each column. I have each data frame in a list called "plates". When I use the line:
plates[[1]][2:4, 7]
I end up with the output:
[1] 0.04 0.04 0.04
If I wrap the code above in mean(), it works fine for columns 7 and higher. However, when I use the same code for columns lower than 7, say column 2, I end up with:
[1] 0.067 0.172 0.066
57 Levels: 0.063 0.064 0.066 0.067 0.068 0.069 0.07 0.071 0.072 0.08 0.081 0.082 0.083 0.084 0.085 ... PlateFormat
I have no idea what this 57 Levels: thing is but I'm assuming this is my problem. I only want the mean of the 3 numbers (0.067, 0.172, 0.066) but this 57 Levels being returned appears to be causing mean() to give me the error in the title. Any help with this would be greatly appreciated.
There is an entry somewhere in that column that can't be processed into a number, so read.csv() (or whatever you used) is reading the data in as a factor. It could be a typo (something as simple as an extra decimal point or a trailing comma) or a missing-value code such as "?". Here, the Levels: printout ends in PlateFormat, so at least one text entry ended up in that column.
You can use
numify <- function(x) as.numeric(as.character(x))
mydata[] <- lapply(mydata, numify)
to convert by brute force, but it would be better to use
bad_vals <- function(x) {
  # entries that are present but cannot be converted to a number
  x[!is.na(x) & is.na(numify(x))]
}
lapply(mydata, bad_vals)
to identify what the bad values are, so you can fix them upstream in your data file (or add missing-value codes to the na.strings= argument in your input code)
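A small self-contained illustration of both steps follows. The data frame here is invented for the example (one factor column containing a stray "PlateFormat" entry, one clean numeric column), and numify is wrapped in suppressWarnings only to keep the expected coercion warning quiet.

```r
# Hypothetical data: V4 became a factor because of one stray text entry.
mydata <- data.frame(
  V4 = factor(c("0.067", "0.172", "0.066", "PlateFormat")),
  V9 = c(0.040, 0.040, 0.040, 0.041)
)

# Convert via character; as.numeric() directly on a factor would return level codes.
numify <- function(x) suppressWarnings(as.numeric(as.character(x)))

# Entries that are present but do not convert to a number.
bad_vals <- function(x) x[!is.na(x) & is.na(numify(x))]

bad <- lapply(mydata, function(col) as.character(bad_vals(col)))
print(bad)                      # shows which entries need fixing upstream

# Brute-force conversion: the bad entry becomes NA, the rest become numeric.
clean <- mydata
clean[] <- lapply(mydata, numify)
print(mean(clean$V4[1:3]))
```

Listing the bad values first makes it easier to decide whether they belong in na.strings= or point to a parsing problem, such as header text being read as data.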

Convert 2 rows in one in a data frame

I have a data frame like this:
> mydata <- read.csv("mydata.csv", header=T, stringsAsFactors=F)
> tbl_df(mydata)
# A tibble: 16,499 x 60
SRC_035_01 SRC_035_01.1 SRC_035_02 SRC_035_02.1
1 Force Time Force Time
2 -0.0037 0.000 0.0041 0.000
3 0.0000 0.004 0.0073 0.004
4 0.0079 0.008 0.0156 0.008
5 0.0150 0.012 0.0228 0.012
6 0.0177 0.016 0.0262 0.016
7 0.0141 0.020 0.0236 0.020
8 0.0103 0.024 0.0206 0.024
9 0.0080 0.028 0.0193 0.028
10 0.0102 0.032 0.0226 0.032
I need to combine the first row with the header, column by column, resulting in this output:
# A tibble: 16,498 x 60
SRC_035_01_Force SRC_035_01.1_Time SRC_035_02_Force SRC_035_02.1_Time
2 -0.0037 0.000 0.0041 0.000
3 0.0000 0.004 0.0073 0.004
4 0.0079 0.008 0.0156 0.008
5 0.0150 0.012 0.0228 0.012
6 0.0177 0.016 0.0262 0.016
7 0.0141 0.020 0.0236 0.020
8 0.0103 0.024 0.0206 0.024
9 0.0080 0.028 0.0193 0.028
10 0.0102 0.032 0.0226 0.032
Could anyone give me a code hint?
Thanks a lot!
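One possible approach in base R, shown as a sketch on a made-up two-column excerpt of the data: paste the first row onto the column names, drop that row, and convert the remaining values to numeric. It assumes the columns were read as character (as with stringsAsFactors=F above) and that every entry after the first row is a number.

```r
# Hypothetical excerpt of the data as read by read.csv(..., stringsAsFactors = FALSE).
mydata <- data.frame(
  SRC_035_01   = c("Force", "-0.0037", "0.0000"),
  SRC_035_01.1 = c("Time", "0.000", "0.004"),
  stringsAsFactors = FALSE
)

# Append the first-row labels to the existing column names.
names(mydata) <- paste(names(mydata), unlist(mydata[1, ]), sep = "_")

# Drop the label row and turn the remaining character columns into numbers.
mydata <- mydata[-1, , drop = FALSE]
mydata[] <- lapply(mydata, as.numeric)

str(mydata)
```

The row names keep their original values (2, 3, ...), which matches the desired output in the question.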

Resources