Tabulize function in R - r

I want to extract the table on page 112 of this PDF document:
http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040
# report 2017
url_location <-"http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040"
out <- extract_tables(url_location, pages = 112)
I have tried following some tutorials about the 'tabulizer' package, but I largely failed. Some aspects are difficult, and I am not experienced enough in R to handle them.
Can someone suggest an approach and help me with this?
Installation
devtools::install_github("ropensci/tabulizer")
# load package
library(tabulizer)

Java dependencies, while getting easier to deal with, aren't necessary when the tables are this clean. A bit of string wrangling will get you what you need:
library(pdftools)
library(stringi)
library(tidyverse)
# read it with pdftools
book <- pdf_text("global-wealth-databook.pdf")
# go to the page
lines <- stri_split_lines(book[[113]])[[1]]
# remove footer
lines <- discard(lines, stri_detect_fixed, "Credit Suisse")
# find line before start of table
start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1
# find line after table
end <- last(which(lines == ""))-1
# smush into something read.[table|csv] can read
tab <- paste0(stri_replace_all_regex(lines[start:end], "[[:space:]][[:space:]]+", "\t"), collapse="\n")
#read it
read.csv(text=tab, header=FALSE, sep="\t", stringsAsFactors = FALSE)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 Egypt 56,036 3,168 324 98.1 1.7 0.2 0.0 100.0 91.7
## 2 El Salvador 3,957 14,443 6,906 66.0 32.8 1.2 0.0 100.0 65.7
## 3 Equatorial Guinea 670 8,044 2,616 87.0 12.2 0.7 0.1 100.0 77.3
## 4 Eritrea 2,401 3,607 2,036 94.5 5.4 0.1 100.0 57.1 NA
## 5 Estonia 1,040 43,158 27,522 22.5 72.2 5.1 0.2 100.0 56.4
## 6 Ethiopia 49,168 153 103 100.0 0.0 100.0 43.4 NA NA
## 7 Fiji 568 6,309 3,059 85.0 14.6 0.4 0.0 100.0 68.2
## 8 Finland 4,312 159,098 57,850 30.8 33.8 33.5 1.9 100.0 76.7
## 9 France 49,239 263,399 119,720 25.3 21.4 49.3 4.0 100.0 70.2
## 10 Gabon 1,098 15,168 7,367 62.0 36.5 1.5 0.0 100.0 68.4
## 11 Gambia 904 898 347 99.2 0.7 0.0 100.0 72.4 NA
## 12 Georgia 2,950 19,430 9,874 50.7 47.6 1.6 0.1 100.0 66.8
## 13 Germany 67,244 203,946 47,091 29.5 33.7 33.9 2.9 100.0 79.1
## 14 Ghana 14,574 809 411 99.5 0.5 0.0 100.0 66.1 NA
## 15 Greece 9,020 111,684 54,665 20.7 52.9 25.4 1.0 100.0 67.7
## 16 Grenada 70 17,523 4,625 74.0 24.3 1.5 0.2 100.0 81.5
## 17 Guinea 5,896 814 374 99.4 0.6 0.0 100.0 69.7 NA
## 18 Guinea-Bissau 884 477 243 99.8 0.2 100.0 65.6 NA NA
## 19 Guyana 467 5,345 2,510 89.0 10.7 0.3 0.0 100.0 67.2
## 20 Haiti 6,172 2,879 894 96.2 3.6 0.2 0.0 100.0 76.9
## 21 Hong Kong 6,172 193,248 46,079 26.3 50.9 20.9 1.9 100.0 85.1
## 22 Hungary 7,846 39,813 30,111 11.8 83.4 4.8 0.0 100.0 45.3
## 23 Iceland 245 587,649 444,999 13.0 72.0 15.0 100.0 46.7 NA
## 24 India 834,608 5,976 1,295 92.3 7.2 0.5 0.0 100.0 83.0
## 25 Indonesia 167,559 11,001 1,914 81.9 17.0 1.1 0.1 100.0 83.7
## 26 Iran 56,306 3,831 1,856 94.1 5.7 0.2 0.0 100.0 67.3
## 27 Ireland 3,434 248,466 84,592 31.2 22.7 42.3 3.6 100.0 81.3
## 28 Israel 5,315 198,406 78,244 22.3 38.7 36.7 2.3 100.0 74.2
## 29 Italy 48,544 223,572 124,636 21.3 22.0 54.1 2.7 100.0 66.0
## 30 Jamaica 1,962 9,485 3,717 79.0 20.2 0.8 0.0 100.0 74.3
## 31 Japan 105,228 225,057 123,724 7.9 35.7 53.9 2.6 100.0 60.9
## 32 Jordan 5,212 13,099 6,014 65.7 33.1 1.2 0.0 100.0 76.1
## 33 Kazakhstan 12,011 4,441 334 97.6 2.1 0.3 0.0 100.0 92.6
## 34 Kenya 23,732 1,809 662 97.4 2.5 0.1 0.0 100.0 77.2
## 35 Korea 41,007 160,609 67,934 20.0 40.5 37.8 1.7 100.0 70.0
## 36 Kuwait 2,996 97,304 37,788 30.3 48.3 20.4 1.0 100.0 76.9
## 37 Kyrgyzstan 3,611 4,689 2,472 92.7 7.0 0.2 0.0 100.0 62.9
## 38 Laos 3,849 5,662 1,382 94.6 4.7 0.7 0.0 100.0 84.9
## 39 Latvia 1,577 27,631 17,828 29.0 68.6 2.2 0.1 100.0 53.6
## 40 Lebanon 4,085 24,161 6,452 69.0 28.5 2.3 0.2 100.0 82.0
## 41 Lesotho 1,184 3,163 945 95.9 3.8 0.3 0.0 100.0 79.8
## 42 Liberia 2,211 2,193 959 97.3 2.6 0.1 0.0 100.0 71.6
## 43 Libya 4,007 45,103 24,510 29.6 61.1 9.2 0.2 100.0 59.9
## 44 Lithuania 2,316 27,507 17,931 27.3 70.4 2.1 0.1 100.0 51.6
## 45 Luxembourg 450 313,687 167,664 17.0 20.0 58.8 4.2 100.0 68.1
## 46 Macedonia 1,607 9,044 5,698 77.0 22.5 0.5 0.0 100.0 56.4
UPDATE
This is more generic, but you'll still have to do some manual cleanup. I think you would have to even if you used Tabula.
library(pdftools)
library(stringi)
library(tidyverse)
# read it with pdftools
book <- pdf_text("~/Downloads/global-wealth-databook.pdf")
transcribe_page <- function(book, pg) {
  # go to the page
  lines <- stri_split_lines(book[[pg]])[[1]]
  # remove footer
  lines <- discard(lines, stri_detect_fixed, "Credit Suisse")
  # find line before start of table
  start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1
  # find line after table
  end <- last(which(lines == ""))-1
  # get the target rows
  rows <- lines[start:end]
  # map out where data values are
  stri_replace_first_regex(rows, "([[:alpha:]]) ([[:alpha:]])", "$1_$2") %>%
    stri_replace_all_regex("[^[:blank:]]", "X") %>%
    map(~rle(strsplit(.x, "")[[1]])) -> pos
  # compute the number of data fields
  nfields <- ceiling(max(map_int(pos, ~length(.x$lengths))) / 2)
  # do our best to get them into columns
  # (tibble() replaces the deprecated data_frame())
  tibble(rec = rows) %>%
    separate(rec, into=sprintf("X%s", 1:nfields), sep="[[:space:]]{2,}", fill="left") %>%
    print(n=length(rows))
}
transcribe_page(book, 112)
transcribe_page(book, 113)
transcribe_page(book, 114)
transcribe_page(book, 115)
Take a look at the outputs of the calls above. They aren't in terrible shape, and some of the cleanup can be programmatic.
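As a minimal sketch of what such programmatic cleanup could look like, the base R snippet below parses two sample rows and converts the comma-formatted counts to numbers. The column spacing of the sample lines is an assumption (the values are taken from the output shown earlier), and treating columns 2 to 4 as the comma-formatted ones is illustrative only.

```r
# Two sample rows roughly as they might appear in the extracted page text.
# The exact spacing is an assumption; the values come from the output above.
lines <- c(
  "Egypt                 56,036     3,168       324    98.1    1.7    0.2    0.0    100.0    91.7",
  "El Salvador            3,957    14,443     6,906    66.0   32.8    1.2    0.0    100.0    65.7"
)

# Collapse every run of two or more whitespace characters into one tab,
# so single spaces inside country names survive.
tab <- paste0(gsub("[[:space:]]{2,}", "\t", lines), collapse = "\n")
df <- read.csv(text = tab, header = FALSE, sep = "\t", stringsAsFactors = FALSE)

# Strip the thousands separators from the count columns and convert to numeric.
df[2:4] <- lapply(df[2:4], function(x) as.numeric(gsub(",", "", x)))

str(df)
```

Rows with missing fields (such as Eritrea or Ethiopia in the output above) would still need manual alignment, since the blank cells are indistinguishable from padding once the whitespace is collapsed.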

Related

How to sample data non-random

I have a weather dataset whose observations are date-dependent.
I want to predict the temperature from 07 May 2008 until 18 May 2008 (roughly 10-15 observations in total); the dataset has around 200 rows.
I will be using decision tree/RF, SVM and NN models to make the prediction.
I've never handled data like this, so I'm not sure how to sample non-random data.
I want to split the data into 80% train data and 20% test data, but in the original order rather than randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200,]
dataset <- dataset[order(dataset$Date),]
set.seed(321)
sample_data = sample(nrow(dataset), nrow(dataset)*.8)
test<-dataset[sample_data,] # 30%
train<-dataset[-sample_data,] # 80%
output
> head(dataset)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3 NW 30
2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7 ENE 39
3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
6 2007-11-06 Canberra 6.2 16.9 0.0 5.8 8.2 SE 44
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1 SW NW 6 20 68 29 1019.7
2 E W 4 17 80 36 1012.4
3 N NNE 6 6 82 69 1009.5
4 WNW W 30 24 62 56 1005.5
5 SSE ESE 20 28 68 49 1018.3
6 SE E 20 24 70 57 1023.8
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1 1015.0 7 7 14.4 23.6 No 3.6 Yes
2 1008.4 5 3 17.5 25.7 Yes 3.6 Yes
3 1007.2 8 7 15.4 20.2 Yes 39.8 Yes
4 1007.0 2 7 13.5 14.1 Yes 2.8 Yes
5 1018.5 7 7 11.1 15.4 Yes 0.0 No
6 1021.7 7 5 10.9 14.8 No 0.2 No
> head(test)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra -1.8 14.8 0.0 1.4 7.0 N 28
77 2008-01-16 Canberra 17.9 33.2 0.0 10.4 8.4 N 59
88 2008-01-27 Canberra 13.2 31.3 0.0 6.6 11.6 WSW 46
58 2007-12-28 Canberra 15.1 28.3 14.4 8.8 13.2 NNW 28
96 2008-02-04 Canberra 18.2 22.6 1.8 8.0 0.0 ENE 33
126 2008-03-05 Canberra 12.0 27.6 0.0 6.0 11.0 E 46
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182 E N 2 19 80 40 1024.2
77 N NNE 15 20 58 62 1008.5
88 N WNW 4 26 71 28 1013.1
58 NNW NW 6 13 73 44 1016.8
96 SSE ENE 7 13 92 76 1014.4
126 SSE WSW 7 6 69 35 1025.5
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182 1020.5 1 7 5.3 13.9 No 0.0 No
77 1006.1 6 7 24.5 23.5 No 4.8 Yes
88 1009.5 1 4 19.7 30.7 No 0.0 No
58 1013.4 1 5 18.3 27.4 Yes 0.0 No
96 1011.5 8 8 18.5 22.1 Yes 9.0 Yes
126 1022.2 1 1 15.7 26.2 No 0.0 No
> head(train)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7 2007-11-07 Canberra 6.1 18.2 0.2 4.2 8.4 SE 43
9 2007-11-09 Canberra 8.8 19.5 0.0 4.0 4.1 S 48
11 2007-11-11 Canberra 9.1 25.2 0.0 4.2 11.9 N 30
16 2007-11-16 Canberra 12.4 32.1 0.0 8.4 11.1 E 46
22 2007-11-22 Canberra 16.4 19.4 0.4 9.2 0.0 E 26
25 2007-11-25 Canberra 15.4 28.4 0.0 4.4 8.1 ENE 33
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7 SE ESE 19 26 63 47 1024.6
9 E ENE 19 17 70 48 1026.1
11 SE NW 6 9 74 34 1024.4
16 SE WSW 7 9 70 22 1017.9
22 ENE E 6 11 88 72 1010.7
25 SSE NE 9 15 85 31 1022.4
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7 1022.2 4 6 12.4 17.3 No 0.0 No
9 1022.7 7 7 14.1 18.9 No 16.2 Yes
11 1021.1 1 2 14.6 24.0 No 0.2 No
16 1012.8 0 3 19.1 30.7 No 0.0 No
22 1008.9 8 8 16.5 18.3 No 25.8 Yes
25 1018.6 8 2 16.8 27.3 No 0.0 No
I use mtcars as an example. One way to split your data into train and test non-randomly is to first compute the training sample size from the number of rows in your data. After that, you can use split to cut the data exactly at the 80% mark. You can use the following code:
smp_size <- floor(0.80 * nrow(mtcars))
# group 1 = first smp_size rows, group 2 = the remaining rows
split <- split(mtcars, rep(1:2, times = c(smp_size, nrow(mtcars) - smp_size)))
With the following code you can turn the split into train and test:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split into train and test without losing its order.
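The same ordered split can also be sketched with head() and tail(), which avoids building a grouping vector. The helper name split_in_order and its train_frac argument are hypothetical, introduced only for this illustration; for the weather data you would first sort by date, as the question already does with dataset[order(dataset$Date), ].

```r
# Hypothetical helper: the first train_frac of the rows go to train,
# the remaining rows go to test, and the row order is left untouched.
split_in_order <- function(data, train_frac = 0.8) {
  n_train <- floor(train_frac * nrow(data))
  list(
    train = head(data, n_train),
    test  = tail(data, nrow(data) - n_train)
  )
}

parts <- split_in_order(mtcars)
nrow(parts$train)
nrow(parts$test)
```

Because the test set is always the most recent block of rows, this kind of split mimics predicting future dates from past ones, which a random sample would not.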

Scrapy Xpath return empty list

It works if the XPath uses the contains function:
response.xpath('//table[contains(@class, "wikitable sortable")]')
However, it returns an empty list with the code below:
response.xpath('//table[@class="wikitable sortable jquery-tablesorter"]')
Is there any explanation for why it returns an empty list?
For more information: I'm trying to extract the territory rankings table from this site, https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population, as practice.
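A likely explanation is that the jquery-tablesorter class is added by JavaScript in the browser, so it is not present in the raw HTML that Scrapy downloads; an exact match on the full class string then finds nothing, while contains() still matches. You can also extract the territory rankings table easily using only pandas, as follows: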
Code:
import pandas as pd
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population',attrs={'class':'wikitable sortable'})
df = dfs[0]#.to_csv('d.csv')
print(df)
Output:
Rank State or territory ... % of the total U.S. pop.[d] % of Elec. Coll.
'20 '10 State or territory ... 2010 Ch.2010–2020 % of Elec. Coll.
0 1.0 1.0 California ... 11.91% –0.11% 10.04%
1 2.0 2.0 Texas ... 8.04% 0.66% 7.43%
2 3.0 4.0 Florida ... 6.01% 0.42% 5.58%
3 4.0 3.0 New York ... 6.19% –0.17% 5.20%
4 5.0 6.0 Pennsylvania ... 4.06% –0.18% 3.53%
5 6.0 5.0 Illinois ... 4.10% –0.28% 3.53%
6 7.0 7.0 Ohio ... 3.69% –0.17% 3.16%
7 8.0 9.0 Georgia ... 3.10% 0.10% 2.97%
8 9.0 10.0 North Carolina ... 3.05% 0.07% 2.97%
9 10.0 8.0 Michigan ... 3.16% –0.15% 2.79%
10 11.0 11.0 New Jersey ... 2.81% –0.04% 2.60%
11 12.0 12.0 Virginia ... 2.56% 0.02% 2.42%
12 13.0 13.0 Washington ... 2.15% 0.15% 2.23%
13 14.0 16.0 Arizona ... 2.04% 0.09% 2.04%
14 15.0 14.0 Massachusetts ... 2.09% 0.00% 2.04%
15 16.0 17.0 Tennessee ... 2.03% 0.03% 2.04%
16 17.0 15.0 Indiana ... 2.07% –0.05% 2.04%
17 18.0 19.0 Maryland ... 1.85% –0.00% 1.86%
18 19.0 18.0 Missouri ... 1.91% –0.08% 1.86%
19 20.0 20.0 Wisconsin ... 1.82% –0.06% 1.86%
20 21.0 22.0 Colorado ... 1.61% 0.12% 1.86%
21 22.0 21.0 Minnesota ... 1.70% 0.01% 1.86%
22 23.0 24.0 South Carolina ... 1.48% 0.05% 1.67%
23 24.0 23.0 Alabama ... 1.53% –0.03% 1.67%
24 25.0 25.0 Louisiana ... 1.45% –0.06% 1.49%
25 26.0 26.0 Kentucky ... 1.39% –0.04% 1.49%
26 27.0 27.0 Oregon ... 1.22% 0.04% 1.49%
27 28.0 28.0 Oklahoma ... 1.20% –0.02% 1.30%
28 29.0 30.0 Connecticut ... 1.14% –0.07% 1.30%
29 30.0 29.0 Puerto Rico ... 1.19% –0.21% —
30 31.0 35.0 Utah ... 0.88% 0.09% 1.12%
31 32.0 31.0 Iowa ... 0.97% –0.02% 1.12%
32 33.0 36.0 Nevada ... 0.86% 0.06% 1.12%
33 34.0 33.0 Arkansas ... 0.93% –0.03% 1.12%
34 35.0 32.0 Mississippi ... 0.95% –0.06% 1.12%
35 36.0 34.0 Kansas ... 0.91% –0.04% 1.12%
36 37.0 37.0 New Mexico ... 0.66% –0.03% 0.93%
37 38.0 39.0 Nebraska ... 0.58% 0.00% 0.93%
38 39.0 40.0 Idaho ... 0.50% 0.05% 0.74%
39 40.0 38.0 West Virginia ... 0.59% –0.06% 0.74%
40 41.0 41.0 Hawaii ... 0.43% 0.00% 0.74%
41 42.0 43.0 New Hampshire ... 0.42% –0.01% 0.74%
42 43.0 42.0 Maine ... 0.42% –0.02% 0.74%
43 44.0 44.0 Rhode Island ... 0.34% –0.01% 0.74%
44 45.0 45.0 Montana ... 0.32% 0.01% 0.74%
45 46.0 46.0 Delaware ... 0.29% 0.01% 0.56%
46 47.0 47.0 South Dakota ... 0.26% 0.00% 0.56%
47 48.0 49.0 North Dakota ... 0.21% 0.02% 0.56%
48 49.0 48.0 Alaska ... 0.23% –0.01% 0.56%
49 50.0 51.0 District of Columbia ... 0.19% 0.01% 0.56%
50 51.0 50.0 Vermont ... 0.20% –0.01% 0.56%
51 52.0 52.0 Wyoming ... 0.18% –0.01% 0.56%
52 53.0 53.0 Guam[8] ... 0.05% –0.00% —
53 54.0 54.0 U.S. Virgin Islands[9] ... 0.03% –0.00% —
54 55.0 55.0 American Samoa[10] ... 0.02% –0.00% —
55 56.0 56.0 Northern Mariana Islands[11] ... 0.02% –0.00% —
56 NaN NaN Contiguous United States ... 98.03% 0.23% 98.70%
57 NaN NaN The fifty states ... 98.50% 0.21% 99.44%
58 NaN NaN The fifty states and D.C. ... 98.69% 0.22% 100.00%
59 NaN NaN Total United States ... — — —
[60 rows x 16 columns]

How can I organise and move row of data based on label matches?

I have raw data shown below. I'm trying to move a row of data that corresponds to a label it matches to a new location in the dataframe.
dat<-read.table(text='RowLabels col1 col2 col3 col4 col5 col6
L 24363.7 25944.9 25646.1 25335.4 23564.2 25411.5
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
181920 652.9 729.2 652.1 689.1 612.5 738.4
1 104.3 107.3 103.5 104.2 98.3 110.1
2 103.6 102.6 100.1 103.2 88.8 117.7
3 53.5 99.1 46.7 70.3 53.9 32.5
4 93.5 107.2 98.3 99.3 97.3 121.1
5 96.8 109.3 104 102.2 98.7 112.9
6 103.6 96.9 104.7 104.4 91.5 137.7
7 97.6 106.8 94.8 105.5 84 106.4
181930 732.1 709.6 725.8 729.5 554.5 873.1
1 118.4 98.8 102.3 102 101.9 115.8
2 96.7 103.3 104.6 105.2 81.9 128.7
3 96 98.2 99.4 97.9 69.8 120.6
4 100.7 101 103.6 106.6 59.6 136.2
5 106.1 103.4 104.7 104.8 76.1 131.8
6 105 102.1 103 108.3 81 124.7
7 109.2 102.8 108.2 104.7 84.2 115.3
N 3836.4 4395.8 4227.3 4567.4 4009.9 4434.6
610 88.1 96.3 99.6 92 90 137.6
1 88.1 96.3 99.6 92 90 137.6
181920 113.1 100.6 106.5 104.2 87.3 108.2
1 113.1 100.6 106.5 104.2 87.3 108.2
181930 111.3 99.1 104.5 115.5 103.6 118.8
1 111.3 99.1 104.5 115.5 103.6 118.8
',header=TRUE)
I want to match the rows of the three labels under the N prefix (610, 181920 and 181930) with their corresponding labels under the L prefix. Basically, each of those rows should move into the matching L group as a new row, labeled 0 or 8 for example. So the result for label 610 would look like:
RowLabels col1 col2 col3 col4 col5 col6
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
8 88.1 96.3 99.6 92 90 137.6
Is this possible? I searched and found some resources pointing toward dplyr, tidyr or aggregate, but I can't find a good example that matches my case. The closest were "How to combine rows based on unique values in R?" and "Aggregate rows by shared values in a variable".
library(dplyr)
library(zoo)
df <- dat %>%
  filter(grepl("^\\d+$",RowLabels)) %>%
  mutate(RowLabels_temp = ifelse(grepl("^\\d{3,}$",RowLabels), as.numeric(as.character(RowLabels)), NA)) %>%
  na.locf() %>%
  select(-RowLabels) %>%
  distinct() %>%
  group_by(RowLabels_temp) %>%
  mutate(RowLabels_indexed = row_number()-1) %>%
  arrange(RowLabels_temp, RowLabels_indexed) %>%
  mutate(RowLabels_indexed = ifelse(RowLabels_indexed==0, RowLabels_temp, RowLabels_indexed)) %>%
  rename(RowLabels=RowLabels_indexed) %>%
  data.frame()
df <- df %>% select(-RowLabels_temp)
df
Output is
col1 col2 col3 col4 col5 col6 RowLabels
1 411.4 439.0 437.3 436.9 420.7 516.9 610
2 86.4 113.9 103.5 113.5 80.3 129.0 1
3 102.1 99.5 96.3 100.4 99.5 86.0 2
4 109.7 102.2 100.2 112.9 92.3 123.8 3
5 88.9 87.1 103.6 102.5 93.6 134.1 4
6 -50.3 -40.2 -72.3 -61.4 -27.0 -22.7 5
7 -35.3 -9.3 25.3 -0.3 15.6 -27.3 6
8 109.9 85.8 80.7 69.3 66.4 94.0 7
9 88.1 96.3 99.6 92.0 90.0 137.6 8
...
It sounds like you want to use the match() function, for example:
# placeholder: put the values of your target order inside c()
target <- c(...)
df <- df[match(target, df$column_to_reorder), ]
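To make the match() idea concrete, here is a small self-contained sketch. The data frame, its id column and the target order are all made up for illustration; only the match() indexing pattern comes from the answer above.

```r
# Toy data: the column names and values are illustrative assumptions.
df <- data.frame(id = c("a", "b", "c"), value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# The order in which the rows should appear.
target <- c("c", "a", "b")

# match() returns, for each element of target, its position in df$id;
# indexing the rows with those positions reorders the data frame.
reordered <- df[match(target, df$id), ]
reordered
```

Any value in target that is missing from the column yields NA in match(), which shows up as a row of NAs, so it is worth checking anyNA(match(target, df$id)) on real data.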

run function on consecutive vals with specific range in the vector with R

Suppose I have a vector tmp of size 100.
I want to know where the average of each run of 4 consecutive elements
is, for example, larger than 10.
That is,
I want to know which of these: mean(tmp[c(1,2,3,4)]), mean(tmp[c(2,3,4,5)]), mean(tmp[c(3,4,5,6)]), and so on up to mean(tmp[c(97,98,99,100)])
are larger than 10.
How can I do this without a loop?
(A loop takes too long, since I have a table of 500000 rows by 60 columns.)
Moreover, I need not only the average but also the difference, the sum and so on.
I have tried building shifted index vectors like this:
tmp<-seq(1,100,1)
one<-seq(1,97,1)
two<-seq(2,98,1)
tree<-seq(3,99,1)
four<-seq(4,100,1)
aa<-(tmp[one]+tmp[two]+tmp[tree]+tmp[four])/4
which(aa>10)
It works, but it is not practical if you want, for example, the average of 12 elements.
Here is an example of what I do, to be clear:
b12<-seq(1,988,1)
b11<-seq(2,989,1)
b10<-seq(3, 990,1)
b9<-seq(4,991,1)
b8<-seq(5,992,1)
b7<-seq(6,993,1)
b6<-seq(7,994,1)
b5<-seq(8, 995,1)
b4<-seq(9,996,1)
b3<-seq(10,997,1)
b2<-seq(11,998,1)
b1<-seq(12,999,1)
now<-seq(13, 1000,1)
po<-rpois(1000,4)
nor<-rnorm(1000,5,0.2)
uni<-runif(1000,10,75)
chis<-rchisq(1000,3,0)
which((po[now]/nor[now])>1 & (nor[b12]/nor[now])>1 &
((po[now]/po[b4])>1 | (uni[now]-uni[b4])>=0) &
((chis[now]+chis[b1]+chis[b2]+chis[b3])/4)>2 &
(uni[now]/max(uni[b1],uni[b2],uni[b3],uni[b4],
uni[b5],uni[b6],uni[b7],uni[b8]))>0.5)+12
This code gives me the exact indices in the real table
that match all the conditions,
and I have 58 variables with 550000 rows.
Thank you.
The question is not very clear. Based on the wording, I guess this should help:
n <- 100
res <- sapply(1:(n-3), function(i) mean(tmp[i:(i+3)]))
which(res >10)
Also,
m1 <- matrix(tmp[1:4 + rep(0:96, each = 4)], ncol = 4, byrow = TRUE)
which(rowMeans(m1) >10)
Maybe you should look at the rollapply function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
# [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)
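For very long vectors, a fully vectorised base R alternative is to take differences of a cumulative sum. This is only a sketch under the assumption that the window statistic is a mean (or a sum, if you drop the division); the helper name roll_mean is hypothetical and not part of any package.

```r
# Hypothetical helper: rolling mean over windows of k consecutive elements.
# cs[i + 1] holds the sum of x[1:i], so the sum of x[j:(j + k - 1)]
# equals cs[j + k] - cs[j].
roll_mean <- function(x, k) {
  n <- length(x)
  cs <- cumsum(c(0, x))
  (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k
}

tmp <- seq(1, 100, 1)
aa <- roll_mean(tmp, 4)

# Start positions of the 4-element windows whose mean exceeds 10.
hits <- which(aa > 10)
head(hits)
```

Changing the window to 12 elements is then just roll_mean(tmp, 12). With very large or widely varying values, the cumulative sum can accumulate floating-point error, so rollapply remains the safer general tool for other statistics such as max or differences.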

R: Overlapping ggplots

I have a quick question about R. I am trying to make a layered histogram from some data I am pulling out of files but I am having a hard time getting ggplot to work with me. I keep getting this error and I have been looking around for an answer but I haven't seen much.
Error: ggplot2 doesn't know how to deal with data of class uneval
Execution halted
Here is a brief look at my program so far.
library("ggplot2")
ex <- '/home/Data/run1.DOC'
ex2 <- '/home/Data/run2.DOC'
...
ex<- read.table(ex,header=TRUE)
ex2<- read.table(ex2,header=TRUE)
...
colnames(ex) <- c(1:18)
colnames(ex2) <- c(1:18)
...
Ex <- c(ex$'14')
Ex2 <- c(ex2$'14')
...
ggplot()+
geom_histogram(data = Ex, fill = "red", alpha = 0.2) +
geom_histogram(data = Ex2, fill = "blue", alpha = 0.2)
And my data is in the files and look a bit like this:
head(ex,10)
1 2 3 4 5 6 7 8 9 10 11 12
1 1:28 400 0.42 400 0.42 1 1 2 41.8 0 0.0 0.0
2 1:96 5599 39.99 5599 39.99 34 42 50 100.0 100 100.0 100.0
3 1:53 334 0.63 334 0.63 1 2 2 62.1 0 0.0 0.0
4 1:27 6932 49.51 6932 49.51 48 52 57 100.0 100 100.0 100.0
5 1:36 27562 124.15 27562 124.15 97 123 157 100.0 100 100.0 100.0
6 1:14 2340 16.71 2340 16.71 13 17 21 100.0 100 100.0 95.7
7 1:96 8202 49.71 8202 49.71 23 43 80 100.0 100 100.0 100.0
8 1:34 3950 28.21 3950 28.21 22 33 36 100.0 100 100.0 100.0
9 1:60 5563 24.62 5563 24.62 11 24 41 100.0 100 96.5 75.2
10 1:06 1646 8.11 1646 8.11 7 8 13 100.0 100 87.2 32.0
13 14 15 16 17 18
1 0.0 0.0 0.0 0.0 0.0 0.0
2 93.6 82.9 57.9 24.3 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 100.0 97.1 87.1 57.1 0.0 0.0
5 100.0 100.0 100.0 100.0 88.3 71.2
6 40.0 0.0 0.0 0.0 0.0 0.0
7 81.2 66.7 54.5 47.9 29.1 0.0
8 76.4 55.7 0.0 0.0 0.0 0.0
9 57.5 35.4 26.5 4.4 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 0.0
But much larger. The values in ex and ex2 are percentages from 0 to 100. The colnames lines replace column headers like %_above_30 with something R likes better, so I simply number each column.
Does anyone see the problem here? I am not really getting it.
Thanks!!
Maybe try combining the two data frames into one and supplying that to a single geom_histogram:
#maybe reshape it something like this (base reshape or the
#reshape package may be a better tool)
dat <- data.frame(rbind(ex, ex2),
    colvar=factor(c(rep("ex", nrow(ex)), rep("ex2", nrow(ex2)))))
#data.frame() renames the numeric column "14" to X14
ggplot(data = dat, aes(x = X14, fill = colvar))+
    geom_histogram(position="identity", alpha = 0.2)
This is untested as your code isn't reproducible (please see this link on how to make a reproducible example).
Here's the idea I'm talking about with a reproducible example:
library(ggplot2)
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart <- read.table(path, sep=",",head=T,row.names=1)
fmla <- "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model <- glm(fmla, data=saheart, family=binomial(link="logit"),
na.action=na.exclude)
dframe <- data.frame(chd=as.factor(saheart$chd),
prediction=predict(model, type="response"))
ggplot(dframe, aes(x=prediction, fill=chd)) +
geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
