I am trying to read a .txt file from the internet and get it into a usable form in R. It seems like it should be easy, but I am struggling.
The data is from Berkeley Earth:
b_earth_url <- 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
I have tried the following:
read.table(b_earth_url, sep = '\t', comment.char = '%', row.names = NULL)
or:
b_earth_data <- readLines(b_earth_url)[!grepl('%', readLines(b_earth_url))]
data.frame(b_earth_data, stringsAsFactors = F)
I have tried a few other options, but can't get past a data frame with a single variable containing a fixed-width character vector.
I have tried extract(), separate() and strsplit(), and can't get any of them to work. I don't think I know how to specify a fixed-width separator for sep =.
The separator is whitespace (spaces), not tabs. read.table's default sep = "" already splits on any amount of whitespace, so just drop the sep argument:
out <- read.table(b_earth_url, comment.char = '%')
head(out)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1 1850 1 -0.781 0.382 NaN NaN NaN NaN NaN NaN NaN NaN
# 2 1850 2 -0.260 0.432 NaN NaN NaN NaN NaN NaN NaN NaN
# 3 1850 3 -0.399 0.348 NaN NaN NaN NaN NaN NaN NaN NaN
# 4 1850 4 -0.696 0.296 NaN NaN NaN NaN NaN NaN NaN NaN
# 5 1850 5 -0.690 0.320 NaN NaN NaN NaN NaN NaN NaN NaN
# 6 1850 6 -0.392 0.228 -0.529 0.147 NaN NaN NaN NaN NaN NaN
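To make the result usable, assign names from the file's commented header. Something like this works for the twelve columns; the labels below are paraphrased, so check the file's % header for the exact wording:
names(out) <- c("year", "month",
                "monthly_anomaly", "monthly_unc",
                "annual_anomaly", "annual_unc",
                "five_year_anomaly", "five_year_unc",
                "ten_year_anomaly", "ten_year_unc",
                "twenty_year_anomaly", "twenty_year_unc")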
I have some data gathered from a weather buoy:
station longitude latitude time wd wspd gst wvht dpd apd mwd bar
42001 -89.658 25.888 1975-08-13T22:00:00Z 23 4.1 NaN NaN NaN NaN NaN 1017.4
42001 -89.658 25.888 1975-08-13T23:00:00Z 59 3.1 NaN NaN NaN NaN NaN 1017.3
42001 -89.658 25.888 1975-08-14T00:00:00Z 30 5.2 NaN NaN NaN NaN NaN 1017.4
42001 -89.658 25.888 1975-08-14T01:00:00Z 70 2 NaN NaN NaN NaN NaN 1017.8
42001 -89.658 25.888 1975-08-14T02:00:00Z 87 5.7 NaN NaN NaN NaN NaN 1018.2
42001 -89.658 25.888 1975-08-14T03:00:00Z 105 5.6 NaN NaN NaN NaN NaN 1018.6
42001 -89.658 25.888 1975-08-14T04:00:00Z 116 5.8 NaN NaN NaN NaN NaN 1018.7
42001 -89.658 25.888 1975-08-14T05:00:00Z 116 5 NaN NaN NaN NaN NaN 1018.5
42001 -89.658 25.888 1975-08-14T06:00:00Z 123 4.5 NaN NaN NaN NaN NaN 1018.1
42001 -89.658 25.888 1975-08-14T07:00:00Z 137 4.1 NaN NaN NaN NaN NaN 1017.9
42001 -89.658 25.888 1975-08-14T08:00:00Z 151 3.6 NaN NaN NaN NaN NaN 1017.7
42001 -89.658 25.888 1975-08-14T09:00:00Z 153 3.5 NaN NaN NaN NaN NaN 1017.6
42001 -89.658 25.888 1975-08-14T10:00:00Z 180 3.5 NaN NaN NaN NaN NaN 1017.7
42001 -89.658 25.888 1975-08-14T11:00:00Z 189 2.8 NaN NaN NaN NaN NaN 1018
42001 -89.658 25.888 1975-08-14T12:00:00Z 183 1.7 NaN NaN NaN NaN NaN 1018.3
42001 -89.658 25.888 1975-08-14T13:00:00Z 172 0.7 NaN NaN NaN NaN NaN 1018.8
42001 -89.658 25.888 2001-11-18T11:00:00Z 38 7.3 8.8 1.1 6.67 4.51 69 1021
42001 -89.658 25.888 2001-11-18T12:00:00Z 29 7.9 9.3 1.01 5.88 4.42 57 1021.4
42001 -89.658 25.888 2001-11-18T13:00:00Z 29 7.4 8.3 1.02 7.14 4.42 65 1022.1
42001 -89.658 25.888 2001-11-18T14:00:00Z 23 8 9.5 0.97 5.56 4.48 55 1022.6
42001 -89.658 25.888 2001-11-18T15:00:00Z 16 7.6 8.9 1 6.67 4.5 64 1023.2
42001 -89.658 25.888 2001-11-18T16:00:00Z 26 8.9 10.2 0.94 4.17 4.49 29 1023.1
42001 -89.658 25.888 2001-11-18T17:00:00Z 26 8.5 10.2 0.98 4.55 4.48 36 1022.7
42001 -89.658 25.888 2001-11-18T18:00:00Z 17 7.8 9.1 1.07 4.76 4.56 30 1021.9
42001 -89.658 25.888 2001-11-18T19:00:00Z 24 8.1 9.1 1.07 4.55 4.6 29 1021
42001 -89.658 25.888 2001-11-18T20:00:00Z 18 8.3 11.1 1.21 6.25 4.6 69 1020
42001 -89.658 25.888 2001-11-18T21:00:00Z 30 8 9.4 1.2 6.67 4.72 77 1019.8
42001 -89.658 25.888 2001-11-18T22:00:00Z 39 8.2 9.6 1.32 6.67 4.8 76 1019.8
42001 -89.658 25.888 2001-11-18T23:00:00Z 32 8.5 9.6 1.21 6.67 4.63 71 1019.7
42001 -89.658 25.888 2001-11-19T00:00:00Z 38 8.9 10.3 1.28 6.25 4.6 72 1019.8
42001 -89.658 25.888 2001-11-19T01:00:00Z 48 8.3 9.6 1.26 6.67 4.53 71 1020.2
42001 -89.658 25.888 2001-11-19T02:00:00Z 54 10.1 11.6 1.28 6.67 4.59 65 1021.1
42001 -89.658 25.888 2001-11-19T03:00:00Z 60 3 4.7 1.29 5.88 4.58 72 1021.5
42001 -89.658 25.888 2001-11-19T04:00:00Z 77 0.8 1.7 1.25 6.67 4.92 63 1021.2
42001 -89.658 25.888 2001-11-19T05:00:00Z 153 2.1 3 1.21 6.67 4.91 64 1021
42001 -89.658 25.888 2001-11-19T06:00:00Z 20 2.2 5.5 1.18 6.25 4.92 65 1020.6
42001 -89.658 25.888 2001-11-19T07:00:00Z 158 6.2 9.7 1.31 6.67 5.22 67 1020.3
42001 -89.658 25.888 2001-11-19T08:00:00Z 162 7.4 9 1.26 6.67 5.42 73 1020.1
42001 -89.658 25.888 2001-11-19T09:00:00Z 218 4.8 6.2 1.2 7.69 4.98 65 1019.9
How could I create a data frame from aggregating the data (using the mean) on a monthly basis while leaving out the NaN values? The start of the data has numerous rows with NaN, but for several years there are values in those rows.
I've tried:
DF2 <- transform(buoy1, time = substring(time, 1, 7))
aggregate(as.numeric(wd) ~ time, DF2[-1, ], mean, na.rm = TRUE)
which generates:
401 2010-09 109.20556
402 2010-10 107.42473
403 2010-11 130.67222
404 2010-12 135.75000
405 2011-01 156.11306
406 2011-02 123.33931
407 2011-03 137.29744
408 2011-04 119.85139
409 2011-05 148.65276
410 2011-06 104.74722
411 2011-07 88.16393
412 2011-09 106.60229
413 2011-10 93.32527
414 2011-11 149.52712
415 2011-12 123.09005
416 2012-01 145.38731
417 2012-02 115.40288
418 2012-03 127.44415
419 2012-04 133.02503
420 2012-05 122.34683
421 2012-06 146.95265
422 2012-07 133.58199
423 2012-08 149.08356
Is there a more efficient way to aggregate across all the columns at once?
Something like
DF2[,5:20] <- sapply(DF2[,5:20], as.numeric, na.rm=TRUE)
monthAvg <- aggregate(DF2[, 5:20], cut(time, "month"),mean)
But then I get:
Error in cut.default(time, "month") : 'x' must be numeric
Here is a base R solution. (The cut() error happens because the time column is character, not a date-time class; converting it, or simply reformatting the string down to year-month, fixes that.)
d <- within(buoy1[-1:-3], time <- format(as.POSIXct(time), "%Y-%m"))
aggregate(. ~ time, d, mean, na.rm = TRUE, na.action = NULL)
# "." means anything other than the RHS, which is `time` in this case
Output
time wd wspd gst wvht dpd apd mwd bar
1 1975-08 118.37500 3.806250 NaN NaN NaN NaN NaN 1018.000
2 2001-11 58.04348 6.882609 8.452174 1.157391 6.186957 4.690435 61.04348 1021.043
You could create a new column with year and month information and take the mean of multiple columns using across().
library(dplyr)
df %>%
group_by(time = format(as.POSIXct(time), '%Y-%m')) %>%
summarise(across(gst:bar, mean, na.rm = TRUE)) -> result
result
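One caveat: since dplyr 1.1.0, passing extra arguments such as na.rm through across() is deprecated; the recommended form uses an anonymous function instead:
df %>%
  group_by(time = format(as.POSIXct(time), '%Y-%m')) %>%
  summarise(across(gst:bar, \(x) mean(x, na.rm = TRUE))) -> result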
I have some challenges with a wiki table and hope someone who has done this before can give me advice. I need to get the data from the wikitable mw-collapsible table into a pandas data frame, but the code below does not work: this initial attempt to pull the data raises ValueError: Length of values does not match length of index. I will appreciate your help!
import urllib.request
url = "https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_South_Africa"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store in 'all_tables' variable
all_tables=soup.find_all("table")
all_tables
right_table=soup.find('table', class_='wikitable mw-collapsible')
right_table
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
H=[]
I=[]
J=[]
K=[]
L=[]
M=[]
N=[]
O=[]
P=[]
Q=[]
U=[]
for row in right_table.findAll('tr'):
cells=row.findAll('td')
if len(cells)==17:
A.append(cells[0].find(text=True))
B.append(cells[1].find(text=True))
C.append(cells[2].find(text=True))
D.append(cells[3].find(text=True))
E.append(cells[4].find(text=True))
F.append(cells[5].find(text=True))
G.append(cells[6].find(text=True))
H.append(cells[7].find(text=True))
I.append(cells[8].find(text=True))
J.append(cells[9].find(text=True))
K.append(cells[10].find(text=True))
L.append(cells[11].find(text=True))
M.append(cells[12].find(text=True))
N.append(cells[13].find(text=True))
P.append(cells[14].find(text=True))
Q.append(cells[15].find(text=True))
U.append(cells[16].find(text=True))
import pandas as pd
df=pd.DataFrame(A,columns=['DATE'])
df['EC']=B
df['FS']=C
df['GAU']=D
df['KJN']=F
df['LIM']=G
df['MPU']=H
df['NW']=I
df['NC']=J
df['WC']=K
df['NEW']=L
df['TOTAL']=M
df['NEW']=N
df['TOTAL']=O
df['REC']=P
df['TESTED']=Q
df['REF']=U
df
An awful lot of work to get into a dataframe when pandas has the read_html() function to do precisely that (it actually uses beautifulsoup under the hood). Incidentally, the ValueError in your attempt comes from the bookkeeping: the loop never appends to O (it jumps from N straight to P), so assigning the empty O to df['TOTAL'] fails with a length mismatch; E is filled but never assigned to the dataframe at all.
.read_html() will return a list of dataframes (i.e. the <table> tags in the html). It's just a matter of pulling out the one you want; note that an index like dfs[3] can shift whenever tables are added to or removed from the page.
import pandas as pd
url = "https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_South_Africa"
dfs = pd.read_html(url)
df = dfs[3]
Output:
print (df.to_string())
Date EC FS GP KZN LP MP NW NC WC Confirmed Deaths Rec Tested Ref
Date EC FS GP KZN LP MP NW NC WC New Total New Total Rec Tested Ref
0 2020-03-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 181 [22]
1 2020-03-05 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN [2]
2 2020-03-06 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 1.0 NaN NaN NaN NaN NaN
3 2020-03-07 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 1.0 2.0 NaN NaN NaN NaN [11]
4 2020-03-08 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN 1.0 3.0 NaN NaN NaN NaN [23]
5 2020-03-09 NaN NaN NaN 4.0 NaN NaN NaN NaN NaN 4.0 7.0 NaN NaN NaN NaN [24]
6 2020-03-10 NaN NaN 2.0 1.0 NaN NaN NaN NaN NaN 3.0 10.0 NaN NaN NaN 239 [25]
7 2020-03-11 NaN NaN 2.0 NaN NaN NaN NaN NaN 1.0 3.0 13.0 NaN NaN NaN 645 [12][26]
8 2020-03-12 NaN 0.0 1.0 1.0 NaN 1.0 NaN NaN NaN 3.0 16.0 NaN NaN NaN 848 [27][28][29]
9 2020-03-13 NaN NaN 4.0 2.0 NaN NaN NaN NaN 2.0 8.0 24.0 NaN NaN NaN 924 [30][31]
10 2020-03-14 NaN NaN 7.0 1.0 NaN NaN NaN NaN 6.0 14.0 38.0 NaN NaN NaN 1017 [32][33]
11 2020-03-15 NaN NaN 7.0 1.0 NaN NaN NaN NaN 5.0 13.0 51.0 NaN NaN NaN 1476 [34][3][35]
12 2020-03-16 NaN NaN 7.0 NaN 1.0 1.0 NaN NaN 2.0 11.0 62.0 NaN NaN NaN 2405 [17][36]
13 2020-03-17 NaN NaN 14.0 4.0 NaN NaN NaN NaN 5.0 23.0 85.0 NaN NaN NaN 2911 [18][37]
14 2020-03-18 NaN NaN 16.0 3.0 NaN 2.0 NaN NaN 10.0 31.0 116.0 NaN NaN NaN 3070 [38][19][39]
15 2020-03-19 NaN NaN 15.0 3.0 NaN 1.0 NaN NaN 15.0 34.0 150.0 NaN NaN NaN 4832 [40][41][42]
16 2020-03-20 NaN 7.0 33.0 1.0 NaN NaN NaN NaN 11.0 52.0 202.0 NaN NaN 2 6438 [43][44]
17 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 Cases 0.0 7.0 109.0 24.0 1.0 5.0 0.0 0.0 56.0 NaN NaN NaN including local transmission including local transmission including local transmission including local transmission
I am plotting a chart using highcharter. You can see the timestamps start from June 29th, but when I plot it, the graph shows data starting from June 28, 18:30. How do I change this time zone?
> head(d)
timestamps x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12
47948 2017-06-29 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 48.5 1210.87
47949 2017-06-29 00:01:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 49.2 1213.91
47950 2017-06-29 00:02:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 49.0 1213.59
47951 2017-06-29 00:03:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 50.0 1214.28
47952 2017-06-29 00:04:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 50.0 1212.13
47953 2017-06-29 00:05:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 49.8 1216.06
library(highcharter)
highchart() %>%
hc_title(text = "A nice chart") %>%
hc_add_series_times_values(d$timestamps,
d$x12, name = "x12")
Any help is appreciated. Thank you.
This is how I managed to deactivate UTC in highcharter.
hcGopts <- getOption("highcharter.global")
hcGopts$useUTC <- FALSE
options(highcharter.global = hcGopts)
This is the R-side equivalent of setting the Highcharts global options, which from JavaScript would look like this:
Highcharts.setOptions({
global: {
useUTC: false
}
});
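Note that the option should be set before the chart is built so it takes effect. Reusing the chart from the question (so d and its columns are assumed to exist in your session), the full sequence would be:
library(highcharter)

hcGopts <- getOption("highcharter.global")
hcGopts$useUTC <- FALSE
options(highcharter.global = hcGopts)

highchart() %>%
  hc_title(text = "A nice chart") %>%
  hc_add_series_times_values(d$timestamps, d$x12, name = "x12")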
I have a very large time series data set in the following format.
"Tag.1","1/22/2015 11:59:54 PM","570.29895",
"Tag.1","1/22/2015 11:59:56 PM","570.29895",
"Tag.1","1/22/2015 11:59:58 PM","570.29895",
"Tag.1","1/23/2015 12:00:00 AM","649.67133",
"Tag.2","1/22/2015 12:00:02 AM","1.21",
"Tag.2","1/22/2015 12:00:04 AM","1.21",
"Tag.2","1/22/2015 12:00:06 AM","1.21",
"Tag.2","1/22/2015 12:00:08 AM","1.21",
"Tag.2","1/22/2015 12:00:10 AM","1.21",
"Tag.2","1/22/2015 12:00:12 AM","1.21",
I would like to separate this out into a data frame with a common column for the time stamp and one column each for the tags.
Date.Time, Tag.1, Tag.2, Tag.3...
1/22/2015 11:59:54 PM,570.29895,
Any suggestions would be appreciated!
Maybe something like this:
cast(df,V2~V1,mean,value='V3')
V2 Tag.1 Tag.2
1 1/22/2015 11:59:54 PM 570.2989 NaN
2 1/22/2015 11:59:56 PM 570.2989 NaN
3 1/22/2015 11:59:58 PM 570.2989 NaN
4 1/22/2015 12:00:02 AM NaN 1.21
5 1/22/2015 12:00:04 AM NaN 1.21
6 1/22/2015 12:00:06 AM NaN 1.21
7 1/22/2015 12:00:08 AM NaN 1.21
8 1/22/2015 12:00:10 AM NaN 1.21
9 1/22/2015 12:00:12 AM NaN 1.21
10 1/23/2015 12:00:00 AM 649.6713 NaN
cast is part of the reshape package.
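reshape has long since been superseded; if you prefer the tidyverse, pivot_wider() from tidyr gives a similar result. This sketch assumes df has the same default column names used above (V1 = tag, V2 = timestamp, V3 = value); values_fn = mean mirrors the mean aggregation in the cast() call:
library(tidyr)
# one row per timestamp, one column per tag; duplicates are averaged
wide <- pivot_wider(df, names_from = V1, values_from = V3, values_fn = mean)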
I have a table looking as the following:
ID period 1 period 2 period 3 period 4
A 4 2 25 42
B 3 56 2 45
C 16 1 34 67
D 56 2 8 48
I want to check in R how many columns in each row have values lower than 10. For example, in row A I have two values lower than 10.
Any ideas?
I used the quantile values and got the following:
quantile(v[,2:5],na.rm=TRUE)
0% 25% 50% 75% 100%
1.00 2.75 20.50 45.75 67.00
But this is not exactly what I need; I want to know the percentage (or count) of values below 10. I tried using the following and it also didn't work:
limit
[1] 10
v$tot <- count(v, c("ID", "period1", "period2"), wt_var = limit)
The first few rows of the actual dataset are as follows:
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 xxxlll 61 36 277 462 211 182 45 41 128 174 179 87 18 NaN NaN NaN NaN
2 ccvvbb 281 340 592 455 496 348 422 491 408 548 596 611 570 580 530 602 614
3 ddffgr 587 964 895 866 1120 725 547 90 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 rrteww 257 331 320 411 442 316 334 403 355 444 522 661 508 499 520 413 494
5 oiertw 261 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I guess I'll add an answer in case the OP doesn't, but in this case I'd use rowSums and logical comparison...
# '-1' drops the ID column
x <- rowSums( df[ ,-1 ] < 10 )
names(x) <- df$ID
x
#A B C D
#2 2 1 2
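Since the real dataset contains NaN, add na.rm = TRUE so those cells are simply left out of the count (the comparison returns NA for them, which would otherwise poison the row sums):
# NA/NaN cells are ignored rather than propagated
x <- rowSums(df[, -1] < 10, na.rm = TRUE)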