Tabulize function in R
I want to extract the table on page 112 of this PDF document:
http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040
# report 2017
url_location <-"http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040"
out <- extract_tables(url_location, pages = 112)
I have tried following these tutorials (link1, link2) about the 'tabulizer' package, but I largely failed. There are some difficult aspects here that I am not experienced enough to handle in R.
Can someone suggest an approach and help me with this?
Installation
devtools::install_github("ropensci/tabulizer")
# load package
library(tabulizer)
Java dependencies, while getting easier to deal with, aren't necessary when the tables are this clean. A bit of string wrangling will get you what you need:
library(pdftools)
library(stringi)
library(tidyverse)
# read it with pdftools
book <- pdf_text("global-wealth-databook.pdf")
# go to the page (note: the PDF's physical page index may be offset from the printed page number)
lines <- stri_split_lines(book[[113]])[[1]]
# remove footer
lines <- discard(lines, stri_detect_fixed, "Credit Suisse")
# find the first line of the table (one past the last whitespace-indented header line)
start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1
# find the last line of the table (one before the final blank line)
end <- last(which(lines == ""))-1
# smush it into something read.[table|csv] can read
tab <- paste0(stri_replace_all_regex(lines[start:end], "[[:space:]][[:space:]]+", "\t"), collapse="\n")
#read it
read.csv(text=tab, header=FALSE, sep="\t", stringsAsFactors = FALSE)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 Egypt 56,036 3,168 324 98.1 1.7 0.2 0.0 100.0 91.7
## 2 El Salvador 3,957 14,443 6,906 66.0 32.8 1.2 0.0 100.0 65.7
## 3 Equatorial Guinea 670 8,044 2,616 87.0 12.2 0.7 0.1 100.0 77.3
## 4 Eritrea 2,401 3,607 2,036 94.5 5.4 0.1 100.0 57.1 NA
## 5 Estonia 1,040 43,158 27,522 22.5 72.2 5.1 0.2 100.0 56.4
## 6 Ethiopia 49,168 153 103 100.0 0.0 100.0 43.4 NA NA
## 7 Fiji 568 6,309 3,059 85.0 14.6 0.4 0.0 100.0 68.2
## 8 Finland 4,312 159,098 57,850 30.8 33.8 33.5 1.9 100.0 76.7
## 9 France 49,239 263,399 119,720 25.3 21.4 49.3 4.0 100.0 70.2
## 10 Gabon 1,098 15,168 7,367 62.0 36.5 1.5 0.0 100.0 68.4
## 11 Gambia 904 898 347 99.2 0.7 0.0 100.0 72.4 NA
## 12 Georgia 2,950 19,430 9,874 50.7 47.6 1.6 0.1 100.0 66.8
## 13 Germany 67,244 203,946 47,091 29.5 33.7 33.9 2.9 100.0 79.1
## 14 Ghana 14,574 809 411 99.5 0.5 0.0 100.0 66.1 NA
## 15 Greece 9,020 111,684 54,665 20.7 52.9 25.4 1.0 100.0 67.7
## 16 Grenada 70 17,523 4,625 74.0 24.3 1.5 0.2 100.0 81.5
## 17 Guinea 5,896 814 374 99.4 0.6 0.0 100.0 69.7 NA
## 18 Guinea-Bissau 884 477 243 99.8 0.2 100.0 65.6 NA NA
## 19 Guyana 467 5,345 2,510 89.0 10.7 0.3 0.0 100.0 67.2
## 20 Haiti 6,172 2,879 894 96.2 3.6 0.2 0.0 100.0 76.9
## 21 Hong Kong 6,172 193,248 46,079 26.3 50.9 20.9 1.9 100.0 85.1
## 22 Hungary 7,846 39,813 30,111 11.8 83.4 4.8 0.0 100.0 45.3
## 23 Iceland 245 587,649 444,999 13.0 72.0 15.0 100.0 46.7 NA
## 24 India 834,608 5,976 1,295 92.3 7.2 0.5 0.0 100.0 83.0
## 25 Indonesia 167,559 11,001 1,914 81.9 17.0 1.1 0.1 100.0 83.7
## 26 Iran 56,306 3,831 1,856 94.1 5.7 0.2 0.0 100.0 67.3
## 27 Ireland 3,434 248,466 84,592 31.2 22.7 42.3 3.6 100.0 81.3
## 28 Israel 5,315 198,406 78,244 22.3 38.7 36.7 2.3 100.0 74.2
## 29 Italy 48,544 223,572 124,636 21.3 22.0 54.1 2.7 100.0 66.0
## 30 Jamaica 1,962 9,485 3,717 79.0 20.2 0.8 0.0 100.0 74.3
## 31 Japan 105,228 225,057 123,724 7.9 35.7 53.9 2.6 100.0 60.9
## 32 Jordan 5,212 13,099 6,014 65.7 33.1 1.2 0.0 100.0 76.1
## 33 Kazakhstan 12,011 4,441 334 97.6 2.1 0.3 0.0 100.0 92.6
## 34 Kenya 23,732 1,809 662 97.4 2.5 0.1 0.0 100.0 77.2
## 35 Korea 41,007 160,609 67,934 20.0 40.5 37.8 1.7 100.0 70.0
## 36 Kuwait 2,996 97,304 37,788 30.3 48.3 20.4 1.0 100.0 76.9
## 37 Kyrgyzstan 3,611 4,689 2,472 92.7 7.0 0.2 0.0 100.0 62.9
## 38 Laos 3,849 5,662 1,382 94.6 4.7 0.7 0.0 100.0 84.9
## 39 Latvia 1,577 27,631 17,828 29.0 68.6 2.2 0.1 100.0 53.6
## 40 Lebanon 4,085 24,161 6,452 69.0 28.5 2.3 0.2 100.0 82.0
## 41 Lesotho 1,184 3,163 945 95.9 3.8 0.3 0.0 100.0 79.8
## 42 Liberia 2,211 2,193 959 97.3 2.6 0.1 0.0 100.0 71.6
## 43 Libya 4,007 45,103 24,510 29.6 61.1 9.2 0.2 100.0 59.9
## 44 Lithuania 2,316 27,507 17,931 27.3 70.4 2.1 0.1 100.0 51.6
## 45 Luxembourg 450 313,687 167,664 17.0 20.0 58.8 4.2 100.0 68.1
## 46 Macedonia 1,607 9,044 5,698 77.0 22.5 0.5 0.0 100.0 56.4
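read.csv() returns the columns as V1:V10. If you want readable names, set them yourself. A minimal sketch, where the names below are only my guesses at the databook's column meanings, so verify them against the printed table header on page 112 before relying on them:
tab_df <- read.csv(text = tab, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
# hypothetical names -- read the real ones off the PDF header
colnames(tab_df) <- c("country", "adults_thousands", "mean_wealth", "median_wealth",
                      "share_band_1", "share_band_2", "share_band_3", "share_band_4",
                      "total_pct", "last_col")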
UPDATE
This is more generic but you'll still have to do some manual cleanup. I think you would even if you used Tabula.
library(pdftools)
library(stringi)
library(tidyverse)
# read it with pdftools
book <- pdf_text("~/Downloads/global-wealth-databook.pdf")
transcribe_page <- function(book, pg) {

  # go to the page
  lines <- stri_split_lines(book[[pg]])[[1]]

  # remove footer
  lines <- discard(lines, stri_detect_fixed, "Credit Suisse")

  # find the first line of the table (one past the last whitespace-indented header line)
  start <- last(which(stri_detect_regex(lines, "^[[:space:]]+"))) + 1

  # find the last line of the table (one before the final blank line)
  end <- last(which(lines == "")) - 1

  # get the target rows
  rows <- lines[start:end]

  # map out where the data values are: glue multi-word names with "_",
  # mask every non-blank character, then run-length encode the result
  stri_replace_first_regex(rows, "([[:alpha:]]) ([[:alpha:]])", "$1_$2") %>%
    stri_replace_all_regex("[^[:blank:]]", "X") %>%
    map(~rle(strsplit(.x, "")[[1]])) -> pos

  # compute the number of data fields
  nfields <- ceiling(max(map_int(pos, ~length(.x$lengths))) / 2)

  # do our best to get them into columns
  data_frame(rec = rows) %>%
    separate(rec, into = sprintf("X%s", 1:nfields), sep = "[[:space:]]{2,}", fill = "left") %>%
    print(n = length(rows))

}
transcribe_page(book, 112)
transcribe_page(book, 113)
transcribe_page(book, 114)
transcribe_page(book, 115)
Take a look at the outputs of the calls above. They aren't in terrible shape, and some of the cleanup can be programmatic.
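For example, one programmatic step is to strip the comma thousands separators and let readr re-guess the column types. A minimal sketch, assuming readr is installed (the pipeline above prints the tibble but also returns it invisibly, so it can be captured):
df <- transcribe_page(book, 112)
# drop thousands separators in every column, then re-guess the types
df[] <- lapply(df, stri_replace_all_fixed, ",", "")
df <- readr::type_convert(df)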
Related
How to sample data non-randomly
I have a weather dataset. My data is date-dependent, and I want to predict the temperature from 07 May 2008 until 18 May 2008 (which is maybe a total of 10-15 observations). My data size is around 200, and I will be using decision tree/RF, SVM and NN to make my prediction. I've never handled data like this, so I'm not sure how to sample non-random data. I want to split the data into 80% train data and 20% test data, but I want to sample the data in the original order, not randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200, ]
dataset <- dataset[order(dataset$Date), ]
set.seed(321)
sample_data = sample(nrow(dataset), nrow(dataset) * .8)
test <- dataset[sample_data, ]   # this actually takes 80% of the rows
train <- dataset[-sample_data, ] # and this the remaining 20%
Output:
> head(dataset)
        Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra     8.0    24.3      0.0         3.4      6.3          NW            30
2 2007-11-02 Canberra    14.0    26.9      3.6         4.4      9.7         ENE            39
3 2007-11-03 Canberra    13.7    23.4      3.6         5.8      3.3          NW            85
4 2007-11-04 Canberra    13.3    15.5     39.8         7.2      9.1          NW            54
5 2007-11-05 Canberra     7.6    16.1      2.8         5.6     10.6         SSE            50
6 2007-11-06 Canberra     6.2    16.9      0.0         5.8      8.2          SE            44
  WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1         SW         NW            6           20          68          29      1019.7
2          E          W            4           17          80          36      1012.4
3          N        NNE            6            6          82          69      1009.5
4        WNW          W           30           24          62          56      1005.5
5        SSE        ESE           20           28          68          49      1018.3
6         SE          E           20           24          70          57      1023.8
  Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1      1015.0        7        7    14.4    23.6        No     3.6          Yes
2      1008.4        5        3    17.5    25.7       Yes     3.6          Yes
3      1007.2        8        7    15.4    20.2       Yes    39.8          Yes
4      1007.0        2        7    13.5    14.1       Yes     2.8          Yes
5      1018.5        7        7    11.1    15.4       Yes     0.0           No
6      1021.7        7        5    10.9    14.8        No     0.2           No
> head(test)
          Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra    -1.8    14.8      0.0         1.4      7.0           N            28
77  2008-01-16 Canberra    17.9    33.2      0.0        10.4      8.4           N            59
88  2008-01-27 Canberra    13.2    31.3      0.0         6.6     11.6         WSW            46
58  2007-12-28 Canberra    15.1    28.3     14.4         8.8     13.2         NNW            28
96  2008-02-04 Canberra    18.2    22.6      1.8         8.0      0.0         ENE            33
126 2008-03-05 Canberra    12.0    27.6      0.0         6.0     11.0           E            46
    WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182          E          N            2           19          80          40      1024.2
77           N        NNE           15           20          58          62      1008.5
88           N        WNW            4           26          71          28      1013.1
58         NNW         NW            6           13          73          44      1016.8
96         SSE        ENE            7           13          92          76      1014.4
126        SSE        WSW            7            6          69          35      1025.5
    Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182      1020.5        1        7     5.3    13.9        No     0.0           No
77       1006.1        6        7    24.5    23.5        No     4.8          Yes
88       1009.5        1        4    19.7    30.7        No     0.0           No
58       1013.4        1        5    18.3    27.4       Yes     0.0           No
96       1011.5        8        8    18.5    22.1       Yes     9.0          Yes
126      1022.2        1        1    15.7    26.2        No     0.0           No
> head(train)
         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7  2007-11-07 Canberra     6.1    18.2      0.2         4.2      8.4          SE            43
9  2007-11-09 Canberra     8.8    19.5      0.0         4.0      4.1           S            48
11 2007-11-11 Canberra     9.1    25.2      0.0         4.2     11.9           N            30
16 2007-11-16 Canberra    12.4    32.1      0.0         8.4     11.1           E            46
22 2007-11-22 Canberra    16.4    19.4      0.4         9.2      0.0           E            26
25 2007-11-25 Canberra    15.4    28.4      0.0         4.4      8.1         ENE            33
   WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7          SE        ESE           19           26          63          47      1024.6
9           E        ENE           19           17          70          48      1026.1
11         SE         NW            6            9          74          34      1024.4
16         SE        WSW            7            9          70          22      1017.9
22        ENE          E            6           11          88          72      1010.7
25        SSE         NE            9           15          85          31      1022.4
   Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7       1022.2        4        6    12.4    17.3        No     0.0           No
9       1022.7        7        7    14.1    18.9        No    16.2          Yes
11      1021.1        1        2    14.6    24.0        No     0.2           No
16      1012.8        0        3    19.1    30.7        No     0.0           No
22      1008.9        8        8    16.5    18.3        No    25.8          Yes
25      1018.6        8        2    16.8    27.3        No     0.0           No
I use mtcars as an example. An option to non-randomly split your data into train and test is to first compute a sample size based on the number of rows in your data. After that you can use split() to cut the data at exactly 80% of the rows, using the following code:
smp_size <- floor(0.80 * nrow(mtcars))
split <- split(mtcars, rep(1:2, each = smp_size))
With the following code you can turn the split into train and test:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split into train and test without losing their order.
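An equivalent sketch with plain indexing, if you prefer to avoid split():
smp_size <- floor(0.80 * nrow(mtcars))
train <- mtcars[1:smp_size, ]                    # first 80% of the rows, in order
test  <- mtcars[(smp_size + 1):nrow(mtcars), ]   # remaining 20%, in order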
Scrapy XPath returns an empty list
It works if the XPath uses the contains() function:
response.xpath('//table[contains(@class, "wikitable sortable")]')
However, it returns an empty list with the code below:
response.xpath('//table[@class="wikitable sortable jquery-tablesorter"]')
Any explanation of why it returns an empty list? For more information, I'm trying to extract the territory rankings table from this site as practice: https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population
You can extract the territory rankings table easily using only pandas, as follows:
Code:
import pandas as pd
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population',
                   attrs={'class': 'wikitable sortable'})
df = dfs[0]  #.to_csv('d.csv')
print(df)
Output:
    Rank       State or territory  ... % of the total U.S. pop.[d] % of Elec. Coll.
     '20   '10 State or territory  ...   2010 Ch.2010–2020          % of Elec. Coll.
0    1.0   1.0  California ...  11.91%  –0.11%  10.04%
1    2.0   2.0  Texas ...   8.04%   0.66%   7.43%
2    3.0   4.0  Florida ...   6.01%   0.42%   5.58%
3    4.0   3.0  New York ...   6.19%  –0.17%   5.20%
4    5.0   6.0  Pennsylvania ...   4.06%  –0.18%   3.53%
5    6.0   5.0  Illinois ...   4.10%  –0.28%   3.53%
6    7.0   7.0  Ohio ...   3.69%  –0.17%   3.16%
7    8.0   9.0  Georgia ...   3.10%   0.10%   2.97%
8    9.0  10.0  North Carolina ...   3.05%   0.07%   2.97%
9   10.0   8.0  Michigan ...   3.16%  –0.15%   2.79%
10  11.0  11.0  New Jersey ...   2.81%  –0.04%   2.60%
11  12.0  12.0  Virginia ...   2.56%   0.02%   2.42%
12  13.0  13.0  Washington ...   2.15%   0.15%   2.23%
13  14.0  16.0  Arizona ...   2.04%   0.09%   2.04%
14  15.0  14.0  Massachusetts ...   2.09%   0.00%   2.04%
15  16.0  17.0  Tennessee ...   2.03%   0.03%   2.04%
16  17.0  15.0  Indiana ...   2.07%  –0.05%   2.04%
17  18.0  19.0  Maryland ...   1.85%  –0.00%   1.86%
18  19.0  18.0  Missouri ...   1.91%  –0.08%   1.86%
19  20.0  20.0  Wisconsin ...   1.82%  –0.06%   1.86%
20  21.0  22.0  Colorado ...   1.61%   0.12%   1.86%
21  22.0  21.0  Minnesota ...   1.70%   0.01%   1.86%
22  23.0  24.0  South Carolina ...   1.48%   0.05%   1.67%
23  24.0  23.0  Alabama ...   1.53%  –0.03%   1.67%
24  25.0  25.0  Louisiana ...   1.45%  –0.06%   1.49%
25  26.0  26.0  Kentucky ...   1.39%  –0.04%   1.49%
26  27.0  27.0  Oregon ...   1.22%   0.04%   1.49%
27  28.0  28.0  Oklahoma ...   1.20%  –0.02%   1.30%
28  29.0  30.0  Connecticut ...   1.14%  –0.07%   1.30%
29  30.0  29.0  Puerto Rico ...   1.19%  –0.21%   —
30  31.0  35.0  Utah ...   0.88%   0.09%   1.12%
31  32.0  31.0  Iowa ...   0.97%  –0.02%   1.12%
32  33.0  36.0  Nevada ...   0.86%   0.06%   1.12%
33  34.0  33.0  Arkansas ...   0.93%  –0.03%   1.12%
34  35.0  32.0  Mississippi ...   0.95%  –0.06%   1.12%
35  36.0  34.0  Kansas ...   0.91%  –0.04%   1.12%
36  37.0  37.0  New Mexico ...   0.66%  –0.03%   0.93%
37  38.0  39.0  Nebraska ...   0.58%   0.00%   0.93%
38  39.0  40.0  Idaho ...   0.50%   0.05%   0.74%
39  40.0  38.0  West Virginia ...   0.59%  –0.06%   0.74%
40  41.0  41.0  Hawaii ...   0.43%   0.00%   0.74%
41  42.0  43.0  New Hampshire ...   0.42%  –0.01%   0.74%
42  43.0  42.0  Maine ...   0.42%  –0.02%   0.74%
43  44.0  44.0  Rhode Island ...   0.34%  –0.01%   0.74%
44  45.0  45.0  Montana ...   0.32%   0.01%   0.74%
45  46.0  46.0  Delaware ...   0.29%   0.01%   0.56%
46  47.0  47.0  South Dakota ...   0.26%   0.00%   0.56%
47  48.0  49.0  North Dakota ...   0.21%   0.02%   0.56%
48  49.0  48.0  Alaska ...   0.23%  –0.01%   0.56%
49  50.0  51.0  District of Columbia ...   0.19%   0.01%   0.56%
50  51.0  50.0  Vermont ...   0.20%  –0.01%   0.56%
51  52.0  52.0  Wyoming ...   0.18%  –0.01%   0.56%
52  53.0  53.0  Guam[8] ...   0.05%  –0.00%   —
53  54.0  54.0  U.S. Virgin Islands[9] ...   0.03%  –0.00%   —
54  55.0  55.0  American Samoa[10] ...   0.02%  –0.00%   —
55  56.0  56.0  Northern Mariana Islands[11] ...   0.02%  –0.00%   —
56   NaN   NaN  Contiguous United States ...  98.03%   0.23%  98.70%
57   NaN   NaN  The fifty states ...  98.50%   0.21%  99.44%
58   NaN   NaN  The fifty states and D.C. ...  98.69%   0.22%  100.00%
59   NaN   NaN  Total United States ...   —   —   —
[60 rows x 16 columns]
How can I organise and move rows of data based on label matches?
I have raw data shown below. I'm trying to move a row of data that corresponds to a label it matches to a new location in the data frame.
dat <- read.table(text='RowLabels col1 col2 col3 col4 col5 col6
L 24363.7 25944.9 25646.1 25335.4 23564.2 25411.5
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
181920 652.9 729.2 652.1 689.1 612.5 738.4
1 104.3 107.3 103.5 104.2 98.3 110.1
2 103.6 102.6 100.1 103.2 88.8 117.7
3 53.5 99.1 46.7 70.3 53.9 32.5
4 93.5 107.2 98.3 99.3 97.3 121.1
5 96.8 109.3 104 102.2 98.7 112.9
6 103.6 96.9 104.7 104.4 91.5 137.7
7 97.6 106.8 94.8 105.5 84 106.4
181930 732.1 709.6 725.8 729.5 554.5 873.1
1 118.4 98.8 102.3 102 101.9 115.8
2 96.7 103.3 104.6 105.2 81.9 128.7
3 96 98.2 99.4 97.9 69.8 120.6
4 100.7 101 103.6 106.6 59.6 136.2
5 106.1 103.4 104.7 104.8 76.1 131.8
6 105 102.1 103 108.3 81 124.7
7 109.2 102.8 108.2 104.7 84.2 115.3
N 3836.4 4395.8 4227.3 4567.4 4009.9 4434.6
610 88.1 96.3 99.6 92 90 137.6
1 88.1 96.3 99.6 92 90 137.6
181920 113.1 100.6 106.5 104.2 87.3 108.2
1 113.1 100.6 106.5 104.2 87.3 108.2
181930 111.3 99.1 104.5 115.5 103.6 118.8
1 111.3 99.1 104.5 115.5 103.6 118.8
', header=TRUE)
I want to match the values of the three N-prefix labels (610, 181920 and 181930) with their corresponding L-prefix labels: basically, move that row of data into the L-prefix block as a new row, labeled 0 or 8 for example. So the result for label 610 would look like:
RowLabels col1 col2 col3 col4 col5 col6
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
8 88.1 96.3 99.6 92 90 137.6
Is this possible? I tried searching and found some resources pointing toward dplyr, tidyr or aggregate, but I can't find a good example that matches my case: How to combine rows based on unique values in R? and Aggregate rows by shared values in a variable
library(dplyr)
library(zoo)
df <- dat %>%
  filter(grepl("^\\d+$", RowLabels)) %>%
  mutate(RowLabels_temp = ifelse(grepl("^\\d{3,}$", RowLabels),
                                 as.numeric(as.character(RowLabels)), NA)) %>%
  na.locf() %>%
  select(-RowLabels) %>%
  distinct() %>%
  group_by(RowLabels_temp) %>%
  mutate(RowLabels_indexed = row_number() - 1) %>%
  arrange(RowLabels_temp, RowLabels_indexed) %>%
  mutate(RowLabels_indexed = ifelse(RowLabels_indexed == 0, RowLabels_temp, RowLabels_indexed)) %>%
  rename(RowLabels = RowLabels_indexed) %>%
  data.frame()
df <- df %>% select(-RowLabels_temp)
df
Output is
   col1  col2  col3  col4  col5  col6 RowLabels
1 411.4 439.0 437.3 436.9 420.7 516.9       610
2  86.4 113.9 103.5 113.5  80.3 129.0         1
3 102.1  99.5  96.3 100.4  99.5  86.0         2
4 109.7 102.2 100.2 112.9  92.3 123.8         3
5  88.9  87.1 103.6 102.5  93.6 134.1         4
6 -50.3 -40.2 -72.3 -61.4 -27.0 -22.7         5
7 -35.3  -9.3  25.3  -0.3  15.6 -27.3         6
8 109.9  85.8  80.7  69.3  66.4  94.0         7
9  88.1  96.3  99.6  92.0  90.0 137.6         8
...
It sounds like you want to use the match() function, for example:
target <- c(...)  # the values of your target order
df <- df[match(target, df$column_to_reorder), ]
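A toy illustration of the idea (the data frame and labels here are made up):
# hypothetical example: reorder rows to follow a given label order
df <- data.frame(id = c("b", "a", "c"), val = 1:3)
target <- c("a", "b", "c")
df[match(target, df$id), ]
#   id val
# 2  a   2
# 1  b   1
# 3  c   3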
Run a function on consecutive values within a sliding window of a vector in R
Suppose I have a vector tmp of size 100. I want to know where there is, for example, an average greater than 10 across each window of 4 elements, i.e. which of mean(tmp[c(1,2,3,4)]), mean(tmp[c(2,3,4,5)]), mean(tmp[c(3,4,5,6)]), and so on up to mean(tmp[c(97,98,99,100)]), are larger than 10. How can I do it without a loop? (A loop takes too long, since I have a table of 500,000 rows by 60 columns.) And not only the average, but also differences, sums and so on. I have tried splitting into shifted index vectors, like this:
tmp <- seq(1, 100, 1)
one <- seq(1, 97, 1)
two <- seq(2, 98, 1)
tree <- seq(3, 99, 1)
four <- seq(4, 100, 1)
aa <- (tmp[one] + tmp[two] + tmp[tree] + tmp[four]) / 4
which(aa > 10)
It works, but it is not practical if you want, for example, an average over 12 elements. Here is an example of what I actually do, to be clear:
b12 <- seq(1, 988, 1)
b11 <- seq(2, 989, 1)
b10 <- seq(3, 990, 1)
b9 <- seq(4, 991, 1)
b8 <- seq(5, 992, 1)
b7 <- seq(6, 993, 1)
b6 <- seq(7, 994, 1)
b5 <- seq(8, 995, 1)
b4 <- seq(9, 996, 1)
b3 <- seq(10, 997, 1)
b2 <- seq(11, 998, 1)
b1 <- seq(12, 999, 1)
now <- seq(13, 1000, 1)
po <- rpois(1000, 4)
nor <- rnorm(1000, 5, 0.2)
uni <- runif(1000, 10, 75)
chis <- rchisq(1000, 3, 0)
which((po[now] / nor[now]) > 1 &
      (nor[b12] / nor[now]) > 1 &
      ((po[now] / po[b4]) > 1 | (uni[now] - uni[b4]) >= 0) &
      ((chis[now] + chis[b1] + chis[b2] + chis[b3]) / 4) > 2 &
      (uni[now] / max(uni[b1], uni[b2], uni[b3], uni[b4],
                      uni[b5], uni[b6], uni[b7], uni[b8])) > 0.5) + 12
This code gives me the exact indices in the real table that match all the conditions, and my real data has 58 variables with 550,000 rows. Thank you.
The question is not very clear. Based on the wording, I guess this should help:
n <- 100
res <- sapply(1:(n - 3), function(i) mean(tmp[i:(i + 3)]))
which(res > 10)
Also:
m1 <- matrix(tmp[1:4 + rep(0:96, each = 4)], ncol = 4, byrow = TRUE)
which(rowMeans(m1) > 10)
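If speed matters at 550,000 rows, base R's stats::filter() computes a rolling mean in vectorized C code. A sketch for width 4 (with sides = 1 the value at position i averages elements i-3 through i, so the first 3 results are NA):
ma4 <- stats::filter(tmp, rep(1/4, 4), sides = 1)
which(ma4 > 10) - 3  # shift so each index refers to the start of its window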
Maybe you should look at the rollapply() function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
#  [1]  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)
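The same pattern covers the sums and differences the question mentions; FUN can be any window function (the widths and functions below are just illustrations):
rollapply(tmp, width = 4, FUN = sum)                          # rolling sum
rollapply(tmp, width = 4, FUN = function(x) max(x) - min(x))  # rolling range
rollapply(tmp, width = 12, FUN = mean)                        # wider window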
R: Overlapping ggplots
I have a quick question about R. I am trying to make a layered histogram from some data I am pulling out of files, but I am having a hard time getting ggplot to work with me. I keep getting this error, and I haven't found much about it when searching around:
Error: ggplot2 doesn't know how to deal with data of class uneval
Execution halted
Here is a brief look at my program so far:
library("ggplot2")
ex <- '/home/Data/run1.DOC'
ex2 <- '/home/Data/run2.DOC'
...
ex <- read.table(ex, header=TRUE)
ex2 <- read.table(ex2, header=TRUE)
...
colnames(ex) <- c(1:18)
colnames(ex2) <- c(1:18)
...
Ex <- c(ex$'14')
Ex2 <- c(ex2$'14')
...
ggplot() +
  geom_histogram(data = Ex, fill = "red", alpha = 0.2) +
  geom_histogram(data = Ex2, fill = "blue", alpha = 0.2)
And the data in the files looks a bit like this:
> head(ex, 10)
      1     2      3     4      5  6   7   8     9  10    11    12
1  1:28   400   0.42   400   0.42  1   1   2  41.8   0   0.0   0.0
2  1:96  5599  39.99  5599  39.99 34  42  50 100.0 100 100.0 100.0
3  1:53   334   0.63   334   0.63  1   2   2  62.1   0   0.0   0.0
4  1:27  6932  49.51  6932  49.51 48  52  57 100.0 100 100.0 100.0
5  1:36 27562 124.15 27562 124.15 97 123 157 100.0 100 100.0 100.0
6  1:14  2340  16.71  2340  16.71 13  17  21 100.0 100 100.0  95.7
7  1:96  8202  49.71  8202  49.71 23  43  80 100.0 100 100.0 100.0
8  1:34  3950  28.21  3950  28.21 22  33  36 100.0 100 100.0 100.0
9  1:60  5563  24.62  5563  24.62 11  24  41 100.0 100  96.5  75.2
10 1:06  1646   8.11  1646   8.11  7   8  13 100.0 100  87.2  32.0
      13    14    15    16   17   18
1    0.0   0.0   0.0   0.0  0.0  0.0
2   93.6  82.9  57.9  24.3  0.0  0.0
3    0.0   0.0   0.0   0.0  0.0  0.0
4  100.0  97.1  87.1  57.1  0.0  0.0
5  100.0 100.0 100.0 100.0 88.3 71.2
6   40.0   0.0   0.0   0.0  0.0  0.0
7   81.2  66.7  54.5  47.9 29.1  0.0
8   76.4  55.7   0.0   0.0  0.0  0.0
9   57.5  35.4  26.5   4.4  0.0  0.0
10   0.0   0.0   0.0   0.0  0.0  0.0
But much larger. This means that ex and ex2 will be percentages from 0 to 100. The colnames line changes the column heads like %_above_30 to something R likes better, so I change each column name to a number. Does anyone know/see the problem here? Because I am not really getting it. Thanks!!
Maybe try combining the two data frames into one and supplying that to a single geom_histogram:
# maybe reshape it something like this (base reshape or the
# reshape package may be a better tool)
dat <- data.frame(rbind(ex, ex2),
                  colvar = factor(c(rep("ex", nrow(ex)), rep("ex2", nrow(ex2)))))
# the fill aesthetic must go inside aes(); X14 stands in for whichever
# renamed column you want to plot
ggplot(data = dat, aes(x = X14, fill = colvar)) +
  geom_histogram(position = "identity", alpha = 0.2)
This is untested, as your code isn't reproducible (please see this link on how to make a reproducible example). Here's the idea I'm talking about with a reproducible example:
library(ggplot2)
path <- "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart <- read.table(path, sep = ",", head = TRUE, row.names = 1)
fmla <- "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model <- glm(fmla, data = saheart, family = binomial(link = "logit"),
             na.action = na.exclude)
dframe <- data.frame(chd = as.factor(saheart$chd),
                     prediction = predict(model, type = "response"))
ggplot(dframe, aes(x = prediction, fill = chd)) +
  geom_histogram(position = "identity", binwidth = 0.05, alpha = 0.5)