I want to grab content from the URL below, where the original data is laid out in simple columns and rows. I tried readHTMLTable and it isn't working. Using web scraping with XPath, how can I get clean data without the '\n...' characters and keep it in a data.frame? Is this possible without saving to CSV? Kindly help me improve my code. Thank you.
library(rvest)
library(dplyr)
page <- read_html("http://weather.uwyo.edu/cgi-bin/sounding?region=seasia&TYPE=TEXT%3ALIST&YEAR=2006&MONTH=09&FROM=0100&TO=0100&STNM=48657")
xpath <- '/html/body/pre[1]'
txt <- page %>% html_node(xpath=xpath) %>% html_text()
txt
[1] "\n-----------------------------------------------------------------------------\n PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV\n hPa m C C % g/kg deg knot K K K \n-----------------------------------------------------------------------------\n 1009.0 16 23.8 22.7 94 17.56 170 2 296.2 346.9 299.3\n 1002.0 78 24.6 21.6 83 16.51 252 4 297.6 345.6 300.5\n 1000.0 96 24.4 21.3 83 16.23 275 4 297.6 344.8 300.4\n 962.0 434 22.9 20.0 84 15.56 235 10 299.4 345.0 302.1\n 925.0 777 21.4 18.7 85 14.90 245 11 301.2 345.2 303.9\n 887.0 1142 20.3 16.0 76 13.04 255 15 303.7 342.7 306.1\n 850.0 1512 19.2 13.2 68 11.34 230 17 306.2 340.6 308.3\n 839.0 1624 18.8 11.8 64 10.47 225 17 307.0 338.8 308.9\n 828.0 1735 18.0 11.4 65 10.33 ... <truncated>
We can extend your base code and treat the web page as an API endpoint since it takes parameters:
library(httr)
library(rvest)
I use more packages than the two above via :: below, but I don't want to pollute the namespace.
I'd usually end up writing a small parameterized function, or a small package with a couple of parameterized functions, to encapsulate the logic below.
httr::GET(
  url = "http://weather.uwyo.edu/cgi-bin/sounding",
  query = list(
    region = "seasia",
    TYPE = "TEXT:LIST",
    YEAR = "2006",
    MONTH = "09",
    FROM = "0100",
    TO = "0100",
    STNM = "48657"
  )
) -> res
^^ makes the web page request and gathers the response.
httr::content(res, as="parsed") %>%
html_nodes("pre") -> wx_dat
^^ parses the response into an html_document and pulls out the <pre> nodes
Now, we extract the readings:
html_text(wx_dat[[1]]) %>% # turn the first <pre> node into text
strsplit("\n") %>% # split it into lines
unlist() %>% # turn it back into a character vector
{ col_names <<- .[3]; . } %>% # pull out the column names (we'll use them later)
.[-(1:5)] %>% # strip off the header
paste0(collapse="\n") -> readings # turn it back into a big text blob
^^ cleaned up the table; we'll use readr::read_table() to parse it. We'll also turn the extracted column names into the actual column names:
readr::read_table(readings, col_names = tolower(unlist(strsplit(trimws(col_names), " +"))))
## # A tibble: 106 x 11
## pres hght temp dwpt relh mixr drct sknt thta thte thtv
## <dbl> <int> <dbl> <dbl> <int> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 1009 16 23.8 22.7 94 17.6 170 2 296. 347. 299.
## 2 1002 78 24.6 21.6 83 16.5 252 4 298. 346. 300.
## 3 1000 96 24.4 21.3 83 16.2 275 4 298. 345. 300.
## 4 962 434 22.9 20 84 15.6 235 10 299. 345 302.
## 5 925 777 21.4 18.7 85 14.9 245 11 301. 345. 304.
## 6 887 1142 20.3 16 76 13.0 255 15 304. 343. 306.
## 7 850 1512 19.2 13.2 68 11.3 230 17 306. 341. 308.
## 8 839 1624 18.8 11.8 64 10.5 225 17 307 339. 309.
## 9 828 1735 18 11.4 65 10.3 220 17 307. 339. 309.
## 10 789 2142 15.1 10 72 9.84 205 16 308. 339. 310.
## # ... with 96 more rows
You didn't say you wanted the station metadata, but we can get that too (it's in the second <pre>):
html_text(wx_dat[[2]]) %>%
strsplit("\n") %>%
unlist() %>%
trimws() %>% # get rid of whitespace
.[-1] %>% # blank line removal
strsplit(": ") %>% # separate field and value
lapply(function(x) setNames(as.list(x), c("measure", "value"))) %>% # make each pair a named list
dplyr::bind_rows() -> metadata # turn it into a data frame
metadata
## # A tibble: 30 x 2
## measure value
## <chr> <chr>
## 1 Station identifier WMKD
## 2 Station number 48657
## 3 Observation time 060901/0000
## 4 Station latitude 3.78
## 5 Station longitude 103.21
## 6 Station elevation 16.0
## 7 Showalter index 0.34
## 8 Lifted index -1.40
## 9 LIFT computed using virtual temperature -1.63
## 10 SWEAT index 195.39
## # ... with 20 more rows
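As mentioned above, all of this is the kind of thing I'd wrap in a small parameterized function. Here's a minimal sketch; the function name get_sounding() and its parameter defaults are mine, not an existing API — the query parameters just mirror the fields in the URL:

```r
library(rvest)

# Hypothetical wrapper around the request + parsing logic above.
get_sounding <- function(stnm, year, month, from, to, region = "seasia") {
  res <- httr::GET(
    url = "http://weather.uwyo.edu/cgi-bin/sounding",
    query = list(
      region = region, TYPE = "TEXT:LIST",
      YEAR = year, MONTH = month, FROM = from, TO = to, STNM = stnm
    )
  )
  httr::stop_for_status(res)  # fail loudly on HTTP errors
  pre <- httr::content(res, as = "parsed") %>% html_nodes("pre")
  lines <- unlist(strsplit(html_text(pre[[1]]), "\n"))
  col_names <- tolower(unlist(strsplit(trimws(lines[3]), " +")))
  readr::read_table(paste0(lines[-(1:5)], collapse = "\n"), col_names = col_names)
}

# readings <- get_sounding("48657", "2006", "09", "0100", "0100")
```

The call at the bottom is commented out so the sketch doesn't hit the network when sourced.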
Your data is truncated, so I'll work with what I can:
txt <- "\n-----------------------------------------------------------------------------\n PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV\n hPa m C C % g/kg deg knot K K K \n-----------------------------------------------------------------------------\n 1009.0 16 23.8 22.7 94 17.56 170 2 296.2 346.9 299.3\n 1002.0 78 24.6 21.6 83 16.51 252 4 297.6 345.6 300.5\n 1000.0 96 24.4 21.3 83 16.23 275 4 297.6 344.8 300.4\n 962.0 434 22.9 20.0 84 15.56 235 10 299.4 345.0 302.1\n 925.0 777 21.4 18.7 85 14.90 245 11 301.2 345.2 303.9\n 887.0 1142 20.3 16.0 76 13.04 255 15 303.7 342.7 306.1\n 850.0 1512 19.2 13.2 68 11.34 230 17 306.2 340.6 308.3\n"
It appears to be fixed-width, with lines compacted into a single string using the \n delimiter, so let's split it up:
strsplit(txt, "\n")
# [[1]]
# [1] ""
# [2] "-----------------------------------------------------------------------------"
# [3] " PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV"
# [4] " hPa m C C % g/kg deg knot K K K "
# [5] "-----------------------------------------------------------------------------"
# [6] " 1009.0 16 23.8 22.7 94 17.56 170 2 296.2 346.9 299.3"
# [7] " 1002.0 78 24.6 21.6 83 16.51 252 4 297.6 345.6 300.5"
# [8] " 1000.0 96 24.4 21.3 83 16.23 275 4 297.6 344.8 300.4"
# [9] " 962.0 434 22.9 20.0 84 15.56 235 10 299.4 345.0 302.1"
# [10] " 925.0 777 21.4 18.7 85 14.90 245 11 301.2 345.2 303.9"
# [11] " 887.0 1142 20.3 16.0 76 13.04 255 15 303.7 342.7 306.1"
# [12] " 850.0 1512 19.2 13.2 68 11.34 230 17 306.2 340.6 308.3"
It seems that row 1 is empty, and rows 2 and 5 are separator lines that need to be removed. Rows 3-4 appear to be the column header and units, respectively; since R doesn't allow multi-row headers, I'll remove the units row and leave it to you to save it elsewhere if you need it.
From here, it's a straightforward call (noting the [[1]] to index into strsplit's returned list):
read.table(text=strsplit(txt, "\n")[[1]][-c(1,2,4,5)], header=TRUE)
# PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
# 1 1009 16 23.8 22.7 94 17.56 170 2 296.2 346.9 299.3
# 2 1002 78 24.6 21.6 83 16.51 252 4 297.6 345.6 300.5
# 3 1000 96 24.4 21.3 83 16.23 275 4 297.6 344.8 300.4
# 4 962 434 22.9 20.0 84 15.56 235 10 299.4 345.0 302.1
# 5 925 777 21.4 18.7 85 14.90 245 11 301.2 345.2 303.9
# 6 887 1142 20.3 16.0 76 13.04 255 15 303.7 342.7 306.1
# 7 850 1512 19.2 13.2 68 11.34 230 17 306.2 340.6 308.3
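If you do need the units row, you can capture it before dropping it. A small self-contained sketch, using a shortened stand-in for the same text:

```r
# Shortened stand-in for the sounding text above (3 columns, 2 data rows)
txt <- paste0(
  "\n----\n PRES HGHT TEMP\n hPa m C \n----\n",
  " 1009.0 16 23.8\n 1002.0 78 24.6\n"
)
lines <- strsplit(txt, "\n")[[1]]
units <- unlist(strsplit(trimws(lines[4]), " +"))  # "hPa" "m" "C"
dat <- read.table(text = lines[-c(1, 2, 4, 5)], header = TRUE)
dat
#   PRES HGHT TEMP
# 1 1009   16 23.8
# 2 1002   78 24.6
```

You could then attach the units to the data, e.g. as `names(units) <- names(dat)`, or store them in an attribute.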
I have two different data.frames. I want to add the grouping$.group column to the phenology1 data.frame under the conditions given by the grouping data.frame (LEVEL and SPECIES). I tried the merge() function with by=, but it keeps giving me "Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column". Sorry if this seems like a very easy thing; I'm a beginner.
> head(phenology1)
YEAR GRADIENT SPECIES ELEVATION SITE TREE_ID CN b_E b_W b_M d_E d_W d_X c_E c_W t_max r_max r_delta_t LEVEL
1 2019 1 Pseudotsuga menziesii 395 B1_D B1_D1 59 119 135.5 143.0 139.0 148.5 165 258.0 284 154 0.7908536 0.4244604 lower
2 2019 1 Pseudotsuga menziesii 395 B1_D B1_D2 69 106 127.0 142.0 177.0 173.0 194 283.0 300 156 0.9807529 0.3898305 lower
3 2019 1 Pseudotsuga menziesii 395 B1_D B1_D3 65 97 125.0 154.5 169.0 174.0 202 266.0 299 167 NA 0.3846154 lower
4 2019 1 Picea abies 405 B1_F B1_F1 68 162 171.5 182.0 106.5 127.5 137 268.5 299 190 NA 0.6384977 lower
5 2019 1 Picea abies 405 B1_F B1_F2 78 139 165.5 176.5 152.0 140.5 167 291.0 306 181 0.9410427 0.5131579 lower
6 2019 1 Picea abies 405 B1_F B1_F3 34 147 177.5 188.0 100.0 97.5 128 247.0 275 187 0.5039245 0.3400000 lower
> grouping
LEVEL SPECIES emmean SE df lower.CL upper.CL .group
lower Pseudotsuga menziesii 107 8.19 12 89.5 125 1
upper Pseudotsuga menziesii 122 8.19 12 103.8 140 12
lower Abies alba 128 8.19 12 110.2 146 12
upper Abies alba 144 8.19 12 126.7 162 12
upper Picea abies 147 8.19 12 129.2 165 2
lower Picea abies 149 8.19 12 131.5 167 2
You can use left_join() from dplyr package (join phenology1 with only the columns LEVEL, SPECIES and .group from grouping):
library(dplyr)
phenology1 %>%
left_join(grouping %>% select(LEVEL, SPECIES, .group))
This automatically joins on the column names that appear in both data frames. If you want to set these explicitly, you can add by = c("LEVEL" = "LEVEL", "SPECIES" = "SPECIES").
Base R, using the match function on the combined keys (matching on both columns at once):
phenology1$.group <- grouping$.group[match(paste(phenology1$LEVEL, phenology1$SPECIES), paste(grouping$LEVEL, grouping$SPECIES))]
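For completeness, the merge() call the question attempted also works once by= names both key columns (the original error came from by= not specifying valid columns). A sketch with tiny stand-ins for the two data frames:

```r
# Minimal stand-ins for the question's data frames
phenology1 <- data.frame(
  SPECIES = c("Pseudotsuga menziesii", "Picea abies"),
  LEVEL   = c("lower", "lower"),
  CN      = c(59, 68)
)
grouping <- data.frame(
  LEVEL   = c("lower", "lower"),
  SPECIES = c("Pseudotsuga menziesii", "Picea abies"),
  .group  = c("1", "2")
)

# Join on both key columns; all.x = TRUE keeps every phenology1 row
merged <- merge(phenology1, grouping[, c("LEVEL", "SPECIES", ".group")],
                by = c("LEVEL", "SPECIES"), all.x = TRUE)
merged
```

Note that merge() may reorder rows; the match()-based approach preserves the original row order.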
I tried converting my data frame into polygons using the code from a previous
post, but I got an error message. I need assistance figuring out how to fix this.
Thanks. Below are my data and code:
County MEDIAN_V latitude longitude RACE DRAG AGIP AGIP2 AGIP3
Akpa 18.7 13.637 46.048 3521875 140.1290323 55 19 5
Uopa 17.9 12.85 44.869 3980000 86.71929825 278 6 4
Kaop 15.7 14.283 45.41 6623750 167.6746988 231 66 17
Nguru 14.7 13.916 44.764 3642500 152.256705 87 15 11
Nagima 20.2 14.7666636 43.249999 23545500 121.699 271 287 450
Dagoja 17.2 16.7833302 45.5166646 2316000 135.5187713 114 374 194
AlKoma 20.7 16.7999968 51.7333304 767000 83.38818565 NA NA NA
Ikaka 18.1 15.46833146 43.5404978 5687500 99.86455331 18 29 11
Maru 17.4 15.452 44.2173 10845625 90.98423127 679 424 159
Nko 19.4 16.17 43.89 10693000 109.7594937 126 140 60
Dfor 16.8 14.702 44.336 16587000 120.7656012 74 52 30
Hydr 20.7 16.666664 49.499998 5468000 126.388535 2 5 NA
lami 23 16.17 43.156 10432875 141.3487544 359 326 795
Ntoka 16.9 13.9499962 44.1833326 21614750 134.3637902 153 84 2
Lakoje 20.6 13.244 44.606 4050250 100.5965167 168 108 75
Mbiri 14.6 15.4499982 45.333332 2386625 166.9104478 465 452 502
Masi 18.2 14.633 43.6 4265250 117.16839 6 1 NA
Sukara 20.6 16.94021 43.76393 6162750 66.72009029 974 928 1176
Shakara 18.9 15.174 44.213 10721000 151.284264 585 979 574
Bambam 18.8 14.5499978 46.83333 3017625 142.442623 101 84 134
Erika 17.8 13.506 43.759 23565000 93.59459459 697 728 1034
mydata %>%
group_by(County) %>%
summarise(geometry = st_sfc(st_cast(st_multipoint(cbind(longitude,
latitude)), 'POLYGON'))) %>%
st_sf()
After running the above I got an error message:
Error in ClosePol(x) : polygons require at least 4 points
Please can someone help me out with how to fix this.
I have a function that takes in a dataframe, a percentile threshold, and the name of a given column, and computes a new column flagging values in the given column against that threshold (0 for <, and 1 for >=). However, it won't allow me to do df$column_name inside the quantile function, because column_name is not actually a column name but a variable storing the actual column name, so df$column_name returns NULL. Is there any way to work around this and keep the code format somewhat similar to what it is currently? Or do I have to specify the actual numerical column position instead of the name? While I can do this, it is definitely not as convenient/comprehensible as just passing in the column name.
func1 <- function(df, threshold, column_name) {
  threshold_value <- quantile(df$column_name, c(threshold))
  new_df <- df %>%
    mutate(ifelse(column_name > threshold_value, 1, 0))
  return(new_df)
}
Thank you so much for your help!
I modified your function as follows. Now the function takes a data frame, a threshold, and a column name (as a string), and it only needs base R.
# Modified function
func1 <- function(df, threshold, column_name) {
  threshold_value <- quantile(df[[column_name]], threshold)
  new_df <- df
  new_df[["new_col"]] <- ifelse(df[[column_name]] > threshold_value, 1, 0)
  return(new_df)
}
# Take the trees data frame as an example
head(trees)
# Girth Height Volume
# 1 8.3 70 10.3
# 2 8.6 65 10.3
# 3 8.8 63 10.2
# 4 10.5 72 16.4
# 5 10.7 81 18.8
# 6 10.8 83 19.7
# Apply the function
func1(trees, 0.5, "Volume")
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 0
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
If you still want to use dplyr, it is essential to learn how to deal with non-standard evaluation. Please see the programming vignette to learn more (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html). The following code also works; note that it compares with >= rather than >, which is why row 11 (Volume 24.2, exactly at the threshold) is flagged 1 here but 0 above.
library(dplyr)
func2 <- function(df, threshold, column_name) {
  col_en <- enquo(column_name)
  threshold_value <- quantile(df %>% pull(!!col_en), threshold)
  new_df <- df %>%
    mutate(new_col := ifelse(!!col_en >= threshold_value, 1, 0))
  return(new_df)
}
func2(trees, 0.5, Volume)
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 1
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
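As an aside, newer dplyr versions (via rlang 0.4+) offer the curly-curly operator {{ }}, which collapses the enquo()/!! pair into a single step. A sketch of the same idea in that style (func3 is my own name for it):

```r
library(dplyr)

# Same logic as func2, using {{ }} instead of enquo()/!!
func3 <- function(df, threshold, column_name) {
  threshold_value <- quantile(pull(df, {{ column_name }}), threshold)
  df %>%
    mutate(new_col = ifelse({{ column_name }} >= threshold_value, 1, 0))
}

func3(trees, 0.5, Volume)  # same result as func2(trees, 0.5, Volume)
```

As with func2, the column name is passed unquoted.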
http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504
This is the website from which I want to read data.
My code is as follows:
library(XML)
fileurl <- "http://www.aqistudy.cn/historydata/daydata.php?city=苏州&month=201404"
doc <- htmlTreeParse(fileurl, useInternalNodes = TRUE, encoding = "utf-8")
rootnode <- xmlRoot(doc)
pollution <- xpathSApply(rootnode, "/td", xmlValue)
But I got a lot of garbled text, and I don't know how to fix this problem.
I'd appreciate any help!
This can be simplified by using library(rvest) to read the table directly:
library(rvest)
url <- "http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504"
doc <- read_html(url) %>%
html_table()
doc[[1]]
# 日期 AQI 范围 质量等级 PM2.5 PM10 SO2 CO NO2 O3 排名
# 1 2015-04-01 106 67~144 轻度污染 79.3 105.1 20.2 1.230 89.5 76 308
# 2 2015-04-02 74 31~140 良 48.1 79.7 18.8 1.066 51.5 129 231
# 3 2015-04-03 98 49~136 良 72.9 89.2 16.0 1.323 50.9 62 293
# 4 2015-04-04 92 56~158 良 67.6 78.2 14.3 1.506 57.4 93 262
# 5 2015-04-05 87 42~167 良 63.7 56.1 16.9 1.245 50.8 91 215
# 6 2015-04-06 46 36~56 优 29.1 30.8 10.0 0.817 37.5 98 136
# 7 2015-04-07 45 34~59 优 27.0 42.4 12.0 0.640 36.6 77 143
Is there a way to aggregate multiple sub-totals with reshape2? E.g. for the airquality dataset
require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
aqm <- subset(aqm, month %in% 5:6 & day %in% 1:7)
I can make a subtotal column for each month, that has the average for all variables within that month:
dcast(aqm, day ~ month+variable, mean, margins = "variable")
day 5_ozone 5_solar.r 5_wind 5_temp 5_(all) 6_ozone 6_solar.r
1 1 41 190 7.4 67 76.350 NaN 286
2 2 36 118 8.0 72 58.500 NaN 287
3 3 12 149 12.6 74 61.900 NaN 242
4 4 18 313 11.5 62 101.125 NaN 186
5 5 NaN NaN 14.3 56 35.150 NaN 220
6 6 28 NaN 14.9 66 36.300 NaN 264
7 7 23 299 8.6 65 98.900 29 127
6_wind 6_temp 6_(all)
1 8.6 78 124.20000
2 9.7 74 123.56667
3 16.1 67 108.36667
4 9.2 84 93.06667
5 8.6 85 104.53333
6 14.3 79 119.10000
7 9.7 82 61.92500
I can also make a subtotal column for each variable, that has the average for all months within that variable:
dcast(aqm, day ~ variable+month, mean, margins = "month")
day ozone_5 ozone_6 ozone_(all) solar.r_5 solar.r_6 solar.r_(all)
1 1 41 NaN 41 190 286 238.0
2 2 36 NaN 36 118 287 202.5
3 3 12 NaN 12 149 242 195.5
4 4 18 NaN 18 313 186 249.5
5 5 NaN NaN NaN NaN 220 220.0
6 6 28 NaN 28 NaN 264 264.0
7 7 23 29 26 299 127 213.0
wind_5 wind_6 wind_(all) temp_5 temp_6 temp_(all)
1 7.4 8.6 8.00 67 78 72.5
2 8.0 9.7 8.85 72 74 73.0
3 12.6 16.1 14.35 74 67 70.5
4 11.5 9.2 10.35 62 84 73.0
5 14.3 8.6 11.45 56 85 70.5
6 14.9 14.3 14.60 66 79 72.5
7 8.6 9.7 9.15 65 82 73.5
Is there a way to tell reshape2 to calculate both sets of subtotals in one command? This command is close, adding in the grand total, but omits the monthly subtotals:
dcast(aqm, day ~ variable+month, mean, margins = c("variable", "month"))
If I get your question right, you can use
acast(aqm, day ~ variable ~ month, mean, margins = c("variable", "month"))[,,'(all)']
The acast call gets you the summary for each day, over each variable, over each month. The total-aggregate "slice" ([,,'(all)']) has a row for each day, with a column for each variable (averaged over all months) and an '(all)' column averaging each day over all variables and all months.
Is this what you needed?
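To see that the slice really carries both sets of subtotals, here's the example run end-to-end, with a spot-check of one margin cell against the raw data (the '(all)'/'(all)' cell for a day should just be the mean of all of that day's observations):

```r
library(reshape2)

names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id = c("month", "day"), na.rm = TRUE)
aqm <- subset(aqm, month %in% 5:6 & day %in% 1:7)

# 3-D array: day x variable x month, with "(all)" margins on two dimensions
res <- acast(aqm, day ~ variable ~ month, mean, margins = c("variable", "month"))
tot <- res[, , "(all)"]

# day-1 grand mean over both months and all variables
mean(aqm$value[aqm$day == 1])  # same value as tot["1", "(all)"]
```

Slicing with res[, , "5"] or res[, , "6"] recovers the per-month subtotals from the first dcast example.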