fread together with grepl - r

I have a large dataset (125,000 rows, ~20 MB) in which rows containing a certain string need to be deleted and certain columns need to be selected during the reading process.
First, I discovered that the grepl function does not work properly, since fread reads the data in as a single column, as also indicated in this question.
The example data can be found here (following @akrun's advice), and the head of the data looks like this:
head(sum_data)
TRIAL : 1 3331 9091
TRIAL : 2 1384786531 278055555
2 0.10 0.000E+00 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -0.135324E-09 0.278754E-01
2 0.20 0.000E+00 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -0.133521E-09 0.425567E-01
2 0.30 0.000E+00 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -0.134103E-09 0.255356E-01
2 0.40 0.000E+00 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -0.134819E-09 0.257300E-01
2 0.50 0.000E+00 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -0.131515E-09 0.440350E-01
I attempted to read the data with fread and use grepl to remove the rows:
files <- dir(pattern = "*sum.txt", full.names = FALSE)
library(data.table)
library(dplyr) # for bind_rows
fread_files <- function(files){
  sum_data_read <- fread(files, skip = 2, sep = "\t") # separator is tab
  df_grep <- sum_data_read[!grepl("TRI", sum_data_read$V1), ] # remove the lines that contain "TRIAL" in column V1. But so far no V1 column is recognized!!
  df <- bind_rows(df_grep) # bind rows after removing
  write.table(as.data.table(df), file = gsub("(.*)(\\..*)", "\\1_new\\2", files), row.names = FALSE, col.names = TRUE)
}
and finally called it with lapply:
lapply(files, fread_files)
When I perform this, only one row of data is created as output, so something is going on, but I don't know what.
Thanks in advance for your help!

Firstly, I discovered that grepl function does not work properly since
fread makes the data as one column indicated also in this question.
But that question's accepted answer says the problem was fixed in v1.9.6. Which version are you using? That's why we ask you to state the version number up front; it saves time when answering.
It is a great example file, and the question is a good one.
I would not try to reinvent the wheel: operations like these have long been implemented as command-line tools, which you can use together with fread directly. The advantage is that you don't churn through R memory; you can leave the filtering to the command tool, and that can be much more efficient. For example, if you load all the lines into R, those strings will be cached in R's global string cache (at least temporarily). Doing the filter outside R first saves that cost.
I downloaded your file and tested the following, which works:
> fread("grep -v TRIAL sum_data.txt")
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 2 0.1 0 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -1.35324e-10 0.0278754
2: 2 0.2 0 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -1.33521e-10 0.0425567
3: 2 0.3 0 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -1.34103e-10 0.0255356
4: 2 0.4 0 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -1.34819e-10 0.0257300
5: 2 0.5 0 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -1.31515e-10 0.0440350
---
124247: 250 49.5 0 -0.0040 0.0141 0.9802 -0.0152 0.0203 -0.9877 -0.0015 0.0123 -0.9901 0.0069 0.0003 0.9997 -1.30220e-10 0.0213215
124248: 250 49.6 0 -0.0006 0.0284 0.9819 0.0021 0.0248 -0.9920 0.0264 0.0408 -0.9919 0.0028 -0.0028 0.9997 -1.30295e-10 0.0284142
124249: 250 49.7 0 0.0378 0.0305 0.9779 -0.0261 0.0232 -0.9897 -0.0236 0.0137 -0.9928 0.0102 -0.0023 0.9997 -1.29890e-10 0.0410760
124250: 250 49.8 0 0.0569 -0.0203 0.9800 -0.0028 -0.0009 -0.9906 -0.0139 -0.0169 -0.9918 0.0039 -0.0017 0.9997 -1.31555e-10 0.0513482
124251: 250 49.9 0 0.0234 -0.0358 0.9840 -0.0340 0.0114 -0.9873 -0.0255 0.0134 -0.9888 0.0006 0.0009 0.9997 -1.30862e-10 0.0334976
>
The -v makes grep return all lines except those containing the string TRIAL. Given the number of high-quality engineers who have looked at grep over the years, it is most likely as fast as you can get, as well as correct, convenient, well documented online, and easy to learn and search for solutions for specific tasks. If you need more complicated string filters (e.g. matching at the beginning or end of lines), grep syntax is very powerful, and learning it is a skill that transfers to other languages and environments.
For further examples of using command-line tools in fread, see the article Convenience features of fread. Please note that "On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep".
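In recent versions of data.table the shell command is passed through fread's cmd argument rather than as the first positional argument. As a rough sketch (assuming a data.table version new enough to have cmd; shQuote and fwrite are my additions, not part of the original code), the whole per-file pipeline could look like this:
library(data.table)
fread_files <- function(f) {
  # Filter the TRIAL lines in the shell, then read the rest directly
  dt <- fread(cmd = paste("grep -v TRIAL", shQuote(f)))
  # Write the result next to the input, reusing the questioner's renaming pattern
  fwrite(dt, gsub("(.*)(\\..*)", "\\1_new\\2", f))
}
lapply(files, fread_files)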

To read a file and remove rows based on a string criterion, you can use the readLines function and filter the result.
I use the stringr package for string manipulation.
library(stringr)
# Read your file by lines
DT <- readLines("sum_data")
length(DT)
#> [1] 124501
# Detect which lines contain TRIAL
trial_lines <- str_detect(DT, "TRI")
head(trial_lines)
#> [1] TRUE TRUE FALSE FALSE FALSE FALSE
# Remove those lines
DT <- DT[!trial_lines]
length(DT)
#> [1] 124251
# Rewrite your file by line
writeLines(DT, "new_file")
If you have performance issues, you could try read_lines from the readr package instead of base readLines.
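For example, a minimal sketch of the same filter with readr (assuming readr is installed; read_lines and write_lines are its line-oriented reader and writer):
library(readr)
library(stringr)
# Read, filter, and rewrite line by line
DT <- read_lines("sum_data")
write_lines(DT[!str_detect(DT, "TRI")], "new_file")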

Related

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting of the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
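One possible sketch, untested against the real files and assuming every entry repeats the FILE ... header row directly above its value row: locate each header line, parse the line below it, and keep only the QUAD and LAI columns. The file name datasheet.txt is hypothetical.
lines <- readLines("datasheet.txt")
# Each entry's header row starts with FILE; its values sit on the next line
hdr <- which(startsWith(lines, "FILE"))
vals <- read.table(text = lines[hdr + 1])
names(vals) <- strsplit(lines[hdr[1]], "\\s+")[[1]]
vals[, c("QUAD", "LAI")]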

Reading data from url, fread warning

I am reading data from a URL; the page has some additional lines about the data. I tried read.table and read.csv to read the URL and got the data as a list, so I tried fread instead and got the data perfectly in a table. However, I am getting a warning, which I realize is due to the additional lines on that page. Is there a way to avoid this warning?
c <- fread('https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data', header = FALSE)
The warning I am getting is shown below.
In fread("https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data", :
Stopped early on line 154. Expected 13 fields but found 1. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<-9999>>
I understand that by adding fill = TRUE I can remove the warning, but then the whole page gets read. How can I filter out the last part?
I have figured out the answer to my question. I just needed the complete.cases function to remove the rows with empty cells.
dat <- fread('https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data',
             header = FALSE, fill = TRUE)
m = dat[complete.cases(dat),]
Use fill = TRUE
dat <- fread('https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data',header = FALSE, fill = TRUE)
Then, get the subset of rows with
dat1 <- dat[1:154]
Output:
> head(dat1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1: 1870 2021 NA NA NA NA NA
2: 1870 -0.373 -0.256 0.277 0.027 -0.400 -0.434 -0.554 -0.409 -0.622 -0.476 -0.278 -0.306
3: 1871 -0.208 -0.090 -0.112 -0.073 -0.035 -0.049 -0.347 -0.263 -0.230 -0.368 -0.094 -0.159
4: 1872 0.028 0.121 0.024 -0.009 -0.069 0.030 -0.189 -0.213 -0.227 -0.111 0.017 -0.041
5: 1873 0.127 -0.239 -0.304 -0.196 -0.331 -0.473 -0.593 -0.688 -0.588 -0.319 -0.229 -0.233
6: 1874 -0.316 -0.308 -0.486 -0.678 -0.361 -0.351 -0.242 -0.232 -0.708 -0.999 -0.480 -0.720
> dim(dat1)
[1] 154 13

How could I use R to pull a few select lines out of a large text file?

I am fairly new to Stack Overflow but did not find this through the search. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information; I am looking to run some code over the whole file that will give me only the lines for "Resp2" and "LAI" for all 16 entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. Then you can use regex to extract the desired rows. However, I feel that my approach is rather clumsy; I bet there are more elegant ways (maybe also with (unix) command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
resp2 = resp2[!is.na(resp2)],
lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
A solution based on the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
enframe() %>%
filter(str_detect(value, 'Resp2|LAI')) %>%
transmute(value = str_squish(value)) %>%
separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36

Why does MARS (earth package) generate so many predictors?

I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). The head of my dataset is shown below:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why is earth identifying 13,169 predictors when there are actually only 10? It seems that MARS is treating single observations of candidate predictors as predictors themselves. How can I prevent MARS from doing so?
Thanks for your help!
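No answer is reproduced here, but the term names in the output (e.g. LST_f128.68, which looks like an indicator for LST_f1 == 28.68) suggest the predictors were read in as character or factor columns, which earth expands into one indicator per level. A hedged diagnostic sketch, assuming D9_RTO_avg is the first column of CE as in the head shown:
library(earth)
# Check the column types: character/factor predictors inflate the predictor count
str(CE)
# If the predictors came in as character/factor, convert them to numeric
CE[-1] <- lapply(CE[-1], function(x) as.numeric(as.character(x)))
mymodel.mod <- earth(D9_RTO_avg ~ ., data = CE, nk = 10)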

boxplots grouped as factor

I'm trying to plot boxplots of multiple variables (columns in a table), grouping the subjects by the levels in cl.med.
This is what I tried:
boxplot(scfa[,c("acetate","propionate")]~as.factor(cl.med),outline=FALSE)
this is my table:
aceticacid.methylester.1 acetate butyrate fumarate caprate propionate X3phenylpropionate valerate formate
01.BA.V 4.509 0.1430 0.0168 4e-04 0.0080 0.0174 0.0008 0.0030 5e-04
01.BA.VG 2.750 0.2736 0.0228 4e-04 0.0047 0.0261 0.0012 0.0014 4e-04
01.BO.VG 15.281 0.1667 0.0159 6e-04 0.0049 0.0191 0.0008 0.0011 4e-04
01.PR.O 0.317 0.2470 0.0327 4e-04 0.0078 0.0293 0.0006 0.0016 4e-04
01.TO.VG 0.210 0.1406 0.0186 4e-04 0.0034 0.0161 0.0006 0.0026 6e-04
and this is my class vector:
01.BA.V 01.BA.VG 01.BO.VG 01.PR.O 01.TO.VG 02.BA.VG
1 2 3 1 3 2
This produces 3 boxes (for the 3 classes, as expected), but the two variables are merged. How could I modify it to obtain 3 boxes for each variable?
Thanks
Your code really shouldn't work at all, let alone produce 3 boxes. Usually you have to add the plots one at a time using the add=TRUE argument:
boxplot(scfa[,c("acetate")]~as.factor(cl.med), outline=FALSE, ylim=c(0, 0.3))
boxplot(scfa[,c("propionate")]~as.factor(cl.med), outline=FALSE, add=TRUE)
You either need to 'melt' or 'stack' your data so that the 2 response columns are stacked, or split the data yourself (a sketch of the first option follows below). Here is an example of the second option:
tmp <- split(iris[,c('Sepal.Width','Petal.Width')], iris$Species)
tmp2 <- unlist(tmp, recursive=FALSE)
boxplot(tmp2)
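For the first option, a minimal sketch with base stack() on the same iris data (ind is the variable-name column that stack() creates):
# Stack the two response columns into long form, then group by
# variable and species in the boxplot formula
long <- cbind(stack(iris[, c("Sepal.Width", "Petal.Width")]),
              Species = iris$Species)
boxplot(values ~ ind + Species, data = long, outline = FALSE, las = 2)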
