Reading data from url, fread warning - r

I am reading data from a URL; the page contains the data plus some additional lines describing it. I tried read.table and read.csv to read the URL and got the data as a list. With fread the data is read cleanly into a table, but I get a warning, which I realize is caused by the additional lines on that page. Is there a way to avoid this warning?
c <- fread('https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data',header = FALSE)
The warning I am getting is shown below.
In fread("https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data", :
Stopped early on line 154. Expected 13 fields but found 1. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<-9999>>
I understand that adding fill = TRUE removes the warning, but then the whole page gets read. How can I then filter out the last part?

I have figured out the answer to my question. I just needed the complete.cases function to remove the rows with empty cells.
dat <- fread('https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data',
header = FALSE,fill = TRUE)
m = dat[complete.cases(dat),]

Use fill = TRUE
dat <- fread('https://psl.noaa.gov/gcos_wgsp/Timeseries/Data/dmi.had.long.data',header = FALSE, fill = TRUE)
Then, get the subset of rows with
dat1 <- dat[1:154]
-output
> head(dat1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1: 1870 2021 NA NA NA NA NA
2: 1870 -0.373 -0.256 0.277 0.027 -0.400 -0.434 -0.554 -0.409 -0.622 -0.476 -0.278 -0.306
3: 1871 -0.208 -0.090 -0.112 -0.073 -0.035 -0.049 -0.347 -0.263 -0.230 -0.368 -0.094 -0.159
4: 1872 0.028 0.121 0.024 -0.009 -0.069 0.030 -0.189 -0.213 -0.227 -0.111 0.017 -0.041
5: 1873 0.127 -0.239 -0.304 -0.196 -0.331 -0.473 -0.593 -0.688 -0.588 -0.319 -0.229 -0.233
6: 1874 -0.316 -0.308 -0.486 -0.678 -0.361 -0.351 -0.242 -0.232 -0.708 -0.999 -0.480 -0.720
> dim(dat1)
[1] 154 13
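If slicing by a hard-coded row number feels fragile, another option is to drop the non-data lines before parsing. The sketch below is not what either answer above does: it uses base R only, on a made-up miniature of the file (the txt vector and its three-field layout are assumptions for illustration), and keeps just the lines that have the expected number of whitespace-separated fields.

```r
# Made-up miniature of the file: a year-range line, data rows, trailing notes.
txt <- c(
  "1870 1872",
  "1870 -0.373 -0.256",
  "1871 -0.208 -0.090",
  "1872  0.028  0.121",
  "-9999",
  "DMI HadISST1.1"
)

expected_fields <- 3  # assumption for this toy input; the real file has 13

# Count whitespace-separated fields on every line.
n_fields <- lengths(strsplit(trimws(txt), "\\s+"))

# Parse only the lines that look like data rows.
dat <- read.table(text = txt[n_fields == expected_fields], header = FALSE)
dat
```

For the real page, txt would presumably come from readLines() on the URL, with expected_fields set to 13 as reported in the warning.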

Related

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
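One possible approach for the wide layout, sketched under assumptions: each record starts with a line beginning with FILE, and the line right after it carries one value per header name. The lines vector below is a shortened copy of the sample above, and extract_one is a hypothetical helper name, not part of any package.

```r
# Shortened copy of the sample datasheet, one element per line.
lines <- c(
  "FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP",
  "20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4",
  "ANGLES 7.000 23.00 38.00 53.00 68.00",
  "CNTCT# 1.969 1.517 0.981 1.579 1.386"
)

# Locate every header line (assumed to start with "FILE").
hdr_idx <- grep("^FILE[[:space:]]", lines)

# Hypothetical helper: pair header names with the values on the next line.
extract_one <- function(i) {
  keys <- strsplit(trimws(lines[i]), "\\s+")[[1]]
  vals <- strsplit(trimws(lines[i + 1]), "\\s+")[[1]]
  names(vals) <- keys  # assumes both lines have the same number of fields
  data.frame(QUAD = as.numeric(vals[["QUAD"]]),
             LAI  = as.numeric(vals[["LAI"]]))
}

data_extract <- do.call(rbind, lapply(hdr_idx, extract_one))
data_extract
```

With several datasheets concatenated in one file, hdr_idx would hold one index per record and data_extract one row per record. A record whose value line has a missing field (for example an empty LOC) would break the name pairing, so that case needs separate handling.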

Error: longer object length is not a multiple of shorter object length while creating multiple variables in data.table

I'm trying to create multiple columns in a data.table in one command, since the logic is simple. I have a column of starting values, a0, and need to create the time evolution by adding a constant to each successive column.
Here is a reproducible example:
dt <- data.table(a0 = c(0.3, 0.34, 0.45, 0.6, 0.37, 0.444))
dt[, paste0('a', 1:5) := a0 + 1:5 / 4]
I would expect this to produce columns a1, a2, a3, a4, a5, each 1/4 larger than the previous one. Instead I get a warning and an incorrect result:
longer object length is not a multiple of shorter object length
dt
a0 a1 a2 a3 a4 a5
1: 0.300 0.550 0.550 0.550 0.550 0.550
2: 0.340 0.840 0.840 0.840 0.840 0.840
3: 0.450 1.200 1.200 1.200 1.200 1.200
4: 0.600 1.600 1.600 1.600 1.600 1.600
5: 0.370 1.620 1.620 1.620 1.620 1.620
6: 0.444 0.694 0.694 0.694 0.694 0.694
It looks like R is calculating along the wrong dimension. I tried wrapping the right-hand side in a list, dt[, paste0('a', 1:5) := list(a0 + 1:5 / 4)], but without luck.
You get the warning because length(dt$a0) is 6 whereas length(1:5) is 5.
dt$a0 + 1:5
#[1] 1.300 2.340 3.450 4.600 5.370 1.444
Warning message:
In dt$a0 + 1:5 :
longer object length is not a multiple of shorter object length
Here the first value of 1:5 is recycled and added to dt$a0[6].
You cannot reference the previous column directly like that. If you want to add new columns based on the previous column's values, you can do something like this:
library(data.table)
n <- 5
dt[, paste0('a', seq_len(n)) := lapply(seq_len(n)/4, function(x) x + a0)]
dt
# a0 a1 a2 a3 a4 a5
#1: 0.300 0.550 0.800 1.050 1.300 1.550
#2: 0.340 0.590 0.840 1.090 1.340 1.590
#3: 0.450 0.700 0.950 1.200 1.450 1.700
#4: 0.600 0.850 1.100 1.350 1.600 1.850
#5: 0.370 0.620 0.870 1.120 1.370 1.620
#6: 0.444 0.694 0.944 1.194 1.444 1.694
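The recycling issue can also be seen as a dimension problem: the goal is every starting value combined with every step, which is an outer sum. As a base R sketch (an alternative to the lapply call above, not a replacement for it; the names steps, m and res are chosen here for illustration):

```r
a0 <- c(0.3, 0.34, 0.45, 0.6, 0.37, 0.444)
steps <- seq_len(5) / 4          # 0.25, 0.50, ..., 1.25

# outer() pairs each element of a0 with each step: a 6 x 5 matrix,
# so no recycling of a length-5 vector against a length-6 vector occurs.
m <- outer(a0, steps, "+")
colnames(m) <- paste0("a", seq_len(5))

res <- data.frame(a0 = a0, m)
res
```

Here row i, column aj holds a0[i] + j/4, matching the table printed above.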

Two types of variables in a single Heatmap (using R)

I have to plot data from immunized animals so as to visualize possible correlations with protection. As background, when we vaccinate an animal it produces antibodies, which may or may not be linked to protection. We immunized cattle with 9 different proteins and measured antibody titers, which go up to 1.5 (optical density, O.D.). We also measured tick load, which goes up to 5000. Each animal has different titers for each protein and a different tick load; some proteins may be more important for protection than others, and we think that a heatmap could illustrate this.
TL;DR: Plot a heatmap with one variable (Ticks) that goes from 6 up to 5000, and another variable (Prot1 to Prot9) that goes up to 1.5.
A sample of my data:
Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
G1-54-102 control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
G1-130-102 control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
G1-133-102 control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
G3-153-102 vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
G3-200-102 vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
G3-807-102 vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035
I have little knowledge in R, but I'm really excited to learn more about it. So feel free to put whatever code you want and I will try my best to understand it.
Thank you in advance.
Luiz
Here is an option to use the ggplot2 package to create a heatmap. You will need to convert your data frame from wide format to long format. It is also important to convert the Ticks column from numeric to factor if the numbers are discrete.
library(tidyverse)
library(viridis)
dat2 <- dat %>%
gather(Prot, Value, starts_with("Prot"))
ggplot(dat2, aes(x = factor(Ticks), y = Prot, fill = Value)) +
geom_tile() +
scale_fill_viridis()
DATA
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2 Prot3 Prot4 Prot5 Prot6 Prot7 Prot8 Prot9
'G1-54-102' control 3030 0.734 0.402 0.620 0.455 0.674 0.550 0.654 0.508 0.618
'G1-130-102' control 5469 0.765 0.440 0.647 0.354 0.528 0.525 0.542 0.481 0.658
'G1-133-102' control 2070 0.367 0.326 0.386 0.219 0.301 0.231 0.339 0.247 0.291
'G3-153-102' vaccinated 150 0.890 0.524 0.928 0.403 0.919 0.593 0.901 0.379 0.647
'G3-200-102' vaccinated 97 1.370 0.957 1.183 0.658 1.103 0.981 1.051 0.534 1.144
'G3-807-102' vaccinated 606 0.975 0.706 1.058 0.626 1.135 0.967 0.938 0.428 1.035",
header = TRUE, stringsAsFactors = FALSE)
In the newest version of ggplot2 / the tidyverse, you don't even need to explicitly load the viridis-package. The scale is included via scale_fill_viridis_c(). Exciting times!
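If Ticks should appear as a colored column next to the proteins rather than on an axis, the two ranges (up to about 5000 versus up to 1.5) have to be brought onto a common scale first. One way, sketched here as an assumption rather than as part of the answer above, is min-max rescaling of each column to [0, 1]. The data are three rows and two protein columns taken from the sample, and rescale01 is a helper defined only for this example.

```r
dat <- read.table(text = "Animal Group Ticks Prot1 Prot2
'G1-54-102' control 3030 0.734 0.402
'G3-200-102' vaccinated 97 1.370 0.957
'G3-807-102' vaccinated 606 0.975 0.706",
header = TRUE, stringsAsFactors = FALSE)

# Helper for this example: map a numeric vector onto [0, 1].
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

num_cols <- c("Ticks", "Prot1", "Prot2")
scaled <- sapply(dat[num_cols], rescale01)  # matrix, one column per variable
rownames(scaled) <- dat$Animal
scaled

# Optional base R plot; skipped when the script is run non-interactively.
if (interactive()) heatmap(scaled, Rowv = NA, Colv = NA, scale = "none")
```

After rescaling, colors are comparable only within a column (highest versus lowest animal for that variable), not across columns, which is worth stating in the figure legend.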

fread together with grepl

I have a large data set (125000 rows, ~20 MB) in which rows containing a certain string need to be deleted and only some columns need to be selected while reading.
Firstly, I discovered that grepl function does not work properly since fread makes the data as one column indicated also in this question.
The example data can be found here (following @akrun's advice), and the head of the data looks like this:
head(sum_data)
TRIAL : 1 3331 9091
TRIAL : 2 1384786531 278055555
2 0.10 0.000E+00 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -0.135324E-09 0.278754E-01
2 0.20 0.000E+00 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -0.133521E-09 0.425567E-01
2 0.30 0.000E+00 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -0.134103E-09 0.255356E-01
2 0.40 0.000E+00 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -0.134819E-09 0.257300E-01
2 0.50 0.000E+00 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -0.131515E-09 0.440350E-01
I attempted to read the data with fread and used grepl for removing the rows;
files <-dir(pattern = "*sum.txt",full.names = FALSE)
library(data.table)
fread_files <- function(files){
sum_data_read <- fread(files,skip=2, sep="\t", ) #seperation is tab.
df_grep <- sum_vgm_read [!grepl("TRI",sum_vgm_read$V1),] # for removing the lines that contain "TRIAL" letter in V1 column. But so far there is no V1 column is recognized!!
df <- bind_rows(df_grep) #binding rows after removing
write.table(as.data.table(df),file = gsub("(.*)(\\..*)", "\\1_new\\2", files),row.names = FALSE,col.names = TRUE)
}
and finally lapply
lapply(files, fread_files)
When I perform this, only one row of data is written as output, so something is going wrong, but I don't know what.
Thanks for help in advance!
Firstly, I discovered that grepl function does not work properly since
fread makes the data as one column indicated also in this question.
But that question's accepted answer says that problem was fixed in v1.9.6. Which version are you using? That's why we ask you to please state the version number up front, to save time answering.
It is a great example file and the question is great.
I would not try to reinvent the wheel as operations like these have long been implemented as command line tools, which you can use together with fread directly. The advantage is that you won't churn through R memory, you can leave the filtering to the command tool and that can be much more efficient. For example, if you load all the lines as lines into R, those strings will be cached in R's global string cache (at least temporarily). Doing that filter outside R first will save that cost.
I downloaded your great file and tested the following which works.
> fread("grep -v TRIAL sum_data.txt")
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 2 0.1 0 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -1.35324e-10 0.0278754
2: 2 0.2 0 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -1.33521e-10 0.0425567
3: 2 0.3 0 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -1.34103e-10 0.0255356
4: 2 0.4 0 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -1.34819e-10 0.0257300
5: 2 0.5 0 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -1.31515e-10 0.0440350
---
124247: 250 49.5 0 -0.0040 0.0141 0.9802 -0.0152 0.0203 -0.9877 -0.0015 0.0123 -0.9901 0.0069 0.0003 0.9997 -1.30220e-10 0.0213215
124248: 250 49.6 0 -0.0006 0.0284 0.9819 0.0021 0.0248 -0.9920 0.0264 0.0408 -0.9919 0.0028 -0.0028 0.9997 -1.30295e-10 0.0284142
124249: 250 49.7 0 0.0378 0.0305 0.9779 -0.0261 0.0232 -0.9897 -0.0236 0.0137 -0.9928 0.0102 -0.0023 0.9997 -1.29890e-10 0.0410760
124250: 250 49.8 0 0.0569 -0.0203 0.9800 -0.0028 -0.0009 -0.9906 -0.0139 -0.0169 -0.9918 0.0039 -0.0017 0.9997 -1.31555e-10 0.0513482
124251: 250 49.9 0 0.0234 -0.0358 0.9840 -0.0340 0.0114 -0.9873 -0.0255 0.0134 -0.9888 0.0006 0.0009 0.9997 -1.30862e-10 0.0334976
>
The -v makes grep return all lines except lines containing the string TRIAL. Given the number of high quality engineers that have looked at the command tool grep over the years, it is most likely that it is as fast as you can get, as well as being correct, convenient, well documented online, easy to learn and search for solutions for specific tasks. If you need to do more complicated string filters (e.g. strings at the beginning or the end of the lines, etc) then grep syntax is very powerful. Learning its syntax is a transferable skill to other languages and environments.
For further examples on the use of command line tools in fread, you may check the article Convenience features of fread. Please note that "On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep".
To read a file and remove rows based on a string criterion, you could use the readLines function and filter the result.
I use the stringr package for string manipulation.
library(stringr)
# Read your file by lines
DT <- readLines("sum_data")
length(DT)
#> [1] 124501
# detect which lines contains trial
trial_lines <- str_detect(DT, "TRI")
head(trial_lines)
#> [1] TRUE TRUE FALSE FALSE FALSE FALSE
# Remove those lines
DT <- DT[!trial_lines]
length(DT)
#> [1] 124251
# Rewrite your file by line
writeLines(DT, "new_file")
If you have performance issues, you could try read_lines from the readr package instead of base readLines.
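The same line filter can be wrapped in a function and parsed straight into a table instead of being written back to disk, which is closer to the lapply-over-files pattern in the question. This is a base R sketch: filter_trial_lines is a hypothetical name, and the temporary file is a four-line stand-in for the real data, truncated to four columns.

```r
# Hypothetical helper: drop lines containing "TRIAL", then parse the rest.
filter_trial_lines <- function(path) {
  lines <- readLines(path)
  kept <- lines[!grepl("TRIAL", lines, fixed = TRUE)]
  read.table(text = kept, header = FALSE)
}

# Small stand-in file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "TRIAL :          1        3331        9091",
  "TRIAL :          2  1384786531   278055555",
  "2 0.10 0.000E+00 -0.0047",
  "2 0.20 0.000E+00 -0.0121"
), tmp)

res <- filter_trial_lines(tmp)
res
unlink(tmp)
```

For several files, lapply(files, filter_trial_lines) would return a list of data frames, one per file.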

gvisBubbleChart colorAxis Option via Shiny not working

I'm trying to declare the colorAxis and let a series of computed "Scores" define the gradient for coloring the bubbles. The visualization just keeps giving me random colors, all with the "OutlierScore" next to them on an ugly legend to the right of the plot. I don't understand what I'm doing wrong as my options list matches all of the demo codes I find. I'm using the final gvisBubbleChart statement as the output to my renderGvis code in server.R.
Here's some sample data:
Attribute CloseRate Quotes OutlierScore Size
AdvancedShopper:N 0.261 3411 292.47 1.016
AdvancedShopper:Y 0.119 10421 259.68 2.283
PriorCarrier:HP 0.277 1876 186.46 0.739
Vehicles:1 0.183 8784 179.98 1.988
Vehicles:2 0.106 3471 121.81 1.027
LeadType:Cold 0.104 3177 117.09 0.974
SPINOFF:Y 0.414 510 115.65 0.492
LeadType:Warm 0.223 2184 115.47 0.795
MULTI_CAR_DSCNT_FLG:HMC 0.303 879 107.88 0.559
MULTI_CAR_DSCNT_FLG:MC 0.111 3451 105.75 1.024
PRI_CARR_NME:HP 0.253 1287 100.58 0.633
PriorCarrier:GEICO 0.099 2476 99.74 0.847
PriorCarrier:No Prior Insurance 0.304 802 99.61 0.545
PRI_CARR_NME:No Prior Insurance 0.304 802 99.61 0.545
FR_BAND:P-R 0.112 3227 98.15 0.983
PIP_DED:2,500 0.197 3053 95.11 0.952
AgencyName:South Agency 0.213 2120 94.81 0.783
RSrc:SPIN-OFF Additional Policy 0.434 373 91.99 0.467
CompanionType:None 0.141 11332 87.60 2.448
D2V:D1V1 0.175 5830 85.67 1.454
Here's my gvisBubbleChart declaration.
YLim = c(0,max(GData$Quotes)*1.05)
XLim = c(0,max(GData$CloseRate)*1.01)
gvisBubbleChart(GData, idvar="Attribute", xvar="CloseRate", yvar="Quotes", colorvar="OutlierScore", sizevar="Size",
options=list(title="One-Way Bubble Chart",
hAxis=paste("{title: 'Close Rate', minValue:0, maxValue:",XLim[2],"}",sep=""),
vAxis=paste("{title: 'Quotes', minValue:0, maxValue:",YLim[2],"}",sep=""),
width=1400, height=600, colorAxis="{minValue: 0, colors: ['red', 'green']}",
sizeAxis = '{minValue: 0, maxSize: 10}',
bubble="{textStyle:{color: 'none'}}"))

Resources