Opening .bcp files in R

I have been trying to convert UK Charity Commission data, which is in .bcp file format, into .csv format that can then be read into R. The data I am referring to is available here: http://data.charitycommission.gov.uk/. What I am trying to do is turn these .bcp files into usable data frames that I can clean and analyse in R.
There are suggestions on how to do this through Python on this GitHub page: https://github.com/ncvo/charity-commission-extract. Unfortunately, I haven't been able to get those options to work.
Is there any syntax or package that would allow me to open these data directly in R? I haven't been able to find any.
Another option would be to read each file into R as a single character vector using readLines. I have done this; the files are delimited with #**# for columns and *##* for rows (see here: http://data.charitycommission.gov.uk/data-definition.aspx). Is there an R command that would let me create a data frame from a long character string, defining delimiters for both rows and columns?

R solution
I'm not sure whether all .bcp files are in the same format. I downloaded the dataset you mentioned and tried a solution on the smallest file, extract_aoo_ref.bcp:
library(data.table)
# read the whole file in as a single string
text <- readChar("./extract_aoo_ref.bcp",
                 nchars = file.info("./extract_aoo_ref.bcp")$size,
                 useBytes = TRUE)
# replace any literal semicolons so they cannot clash with the new
# column separator, then swap the column (#**#) and row (*##*) delimiters
text <- gsub(";", ":", text)
text <- gsub("#\\*\\*#", ";", text)
text <- gsub("\\*##\\*", "\n", text, perl = TRUE)
# read the result
result <- data.table::fread(text,
                            header = FALSE,
                            sep = ";",
                            fill = TRUE,
                            quote = "",
                            strip.white = TRUE)
head(result,10)
# V1 V2 V3 V4 V5 V6
# 1: A 1 THROUGHOUT ENGLAND AND WALES At least 10 authorities in England and Wales N NA
# 2: B 1 BRACKNELL FOREST BRACKNELL FOREST N NA
# 3: D 1 AFGHANISTAN AFGHANISTAN N 2
# 4: E 1 AFRICA AFRICA N NA
# 5: A 2 THROUGHOUT ENGLAND At least 10 authorities in England only N NA
# 6: B 2 WEST BERKSHIRE WEST BERKSHIRE N NA
# 7: D 2 ALBANIA ALBANIA N 3
# 8: E 2 ASIA ASIA N NA
# 9: A 3 THROUGHOUT WALES At least 10 authorities in Wales only Y NA
# 10: B 3 READING READING N NA
The same approach works for the trickier file, extract_charity.bcp:
head(result[,1:3],10)
# V1 V2 V3
# 1: 200000 0 HOMEBOUND CRAFTSMEN TRUST
# 2: 200001 0 PAINTERS' COMPANY CHARITY
# 3: 200002 0 THE ROYAL OPERA HOUSE BENEVOLENT FUND
# 4: 200003 0 HERGA WORLD DISTRESS FUND
# 5: 200004 0 THE WILLIAM GOLDSTEIN LAY STAFF BENEVOLENT FUND (ROYAL HOSPITAL OF ST BARTHOLOMEW)
# 6: 200005 0 DEVON AND CORNWALL ROMAN CATHOLIC DEVELOPMENT SOCIETY
# 7: 200006 0 THE HORLEY SICK CHILDREN'S FUND
# 8: 200007 0 THE HOLDENHURST OLD PEOPLE'S HOME TRUST
# 9: 200008 0 LORNA GASCOIGNE TRUST FUND
# 10: 200009 0 THE RALPH LEVY CHARITABLE COMPANY LIMITED
So it looks like it is working :)
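If you need to process several of these files, the same steps can be wrapped in a small helper. This is just a sketch, assuming every .bcp file in the dump uses the same #**# / *##* delimiters:
library(data.table)
# sketch: read any delimited .bcp file into a data.table
read_bcp <- function(path) {
  text <- readChar(path, nchars = file.info(path)$size, useBytes = TRUE)
  text <- gsub(";", ":", text)          # keep literal semicolons out of the way
  text <- gsub("#\\*\\*#", ";", text)   # column delimiter -> ";"
  text <- gsub("\\*##\\*", "\n", text)  # row delimiter -> newline
  data.table::fread(text, header = FALSE, sep = ";",
                    fill = TRUE, quote = "", strip.white = TRUE)
}
aoo <- read_bcp("./extract_aoo_ref.bcp")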

Related

Remove specific value in R or Linux

Hi, I have a tab-separated file in the terminal that has several columns, as below. The last column contains a comma followed by one or more characters.
1 100 Japan Na pa,cd
2 120 India Ca pa,ces
5 110 Japan Ap pa,cres
1 540 China Sn pa,cd
1 111 Nepal Le pa,b
I want to keep only the part of the last-column values before the comma, so the file looks like:
1 100 Japan Na pa
2 120 India Ca pa
5 110 Japan Ap pa
1 540 China Sn pa
1 111 Nepal Le pa
I have looked at sed but I cannot find a way to remove them.
Regards
In R you can read the file with a tab separator and remove the values after the comma:
result <- transform(read.table('file1.txt', sep = '\t'), V5 = sub(',.*', '', V5))
V5 is used assuming it is the fifth column whose values you want to change.
We can also read the file with base R's read.table and strip everything from the comma onward:
df1 <- read.table('file1.txt', sep = "\t")
df1$V5 <- sub("^([^,]+),.*", "\\1", df1$V5)
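A quick self-contained check of either approach, writing the sample rows to a temporary file first so the demo does not depend on file1.txt existing:
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\t100\tJapan\tNa\tpa,cd",
             "2\t120\tIndia\tCa\tpa,ces",
             "5\t110\tJapan\tAp\tpa,cres"), tmp)
df1 <- read.table(tmp, sep = "\t")
df1$V5 <- sub(",.*", "", df1$V5)  # drop the comma and everything after it
df1
#   V1  V2    V3 V4 V5
# 1  1 100 Japan Na pa
# 2  2 120 India Ca pa
# 3  5 110 Japan Ap pa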

data table lapply and additional columns in output

I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the model with the transformed data set plus some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I am planning to run the model with different specifications of the transformation.
dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
sel.col<-c("x","y")
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
I would need to call data.table again to merge the original dt with the transformed data, and pay attention to the order:
data.table(dt[, .(id, Country), by = factor],
           dt[, lapply(.SD, Winsorize), .SDcols = sel.col, by = factor])
I was hoping that I could include the additional columns in the lapply call:
dt[,.(lapply(.SD,Winsorize), id, Country),.SDcols=sel.col,by=factor]
Are there any other solutions?
Do you just need?
dt[, c(lapply(.SD,Winsorize), list(id = id, Country = Country)), .SDcols=sel.col,by=factor]
Unfortunately this method gets slow with big data. Apparently this was optimised in a recent update, but it is still quite slow.
There is no need to merge; you can assign the columns directly from the lapply call:
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
id Country x y factor
1: 1 Germany 13.116248 -0.4609152 B
2: 2 Germany -6.623404 -3.7048052 A
3: 3 USA -18.027532 22.2946805 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -12.585897 0.8255081 B
6: 6 Germany -8.816252 -12.1218135 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 6.3262951 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 15.857069 8.6422997 A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
id Country x y factor
1: 1 Germany 11.129140 -0.4609152 B
2: 2 Germany -6.623404 -1.7234191 A
3: 3 USA -17.097573 19.5642043 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -11.831968 0.8255081 B
6: 6 Germany -8.816252 -12.0116571 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 5.2261377 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 11.581528 8.6422997 A
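If you need the original dt untouched so you can try several transformation specifications (the OP's concern), one option is to winsorize a copy instead. A minimal sketch:
dt_wins <- copy(dt)  # data.table::copy(), so dt itself is not modified by reference
dt_wins[, (sel.col) := lapply(.SD, Winsorize), .SDcols = sel.col, by = factor]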

R Filling missing values with NA for a data frame

I am currently trying to create a data frame from the following lists:
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
When I try to create a data frame I get the (obvious) error that the lengths of the lists are not equal. I want to find a way to either make the lists the same length or fill the missing data frame entries with NA. After doing some searching I have not been able to find a solution.
Here are purrr (part of the tidyverse) and base R solutions, assuming you just want to fill the remaining values in each list with NA. I take the maximum length of any list as len, then for each list append rep(NA, ...) for the difference between that list's length and len.
library(tidyverse)
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
all_lists <- list(location, organization, person, date, Jobs)
len <- max(lengths(all_lists))
With purrr::map_dfc, you can map over the list of lists, tack on NAs as needed, convert to a character vector, and get a data frame of all those vectors bound together as columns in one piped call:
map_dfc(all_lists, function(l) {
  c(l, rep(NA, len - length(l))) %>%
    as.character()
})
#> # A tibble: 5 x 5
#> V1 V2 V3 V4 V5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
In base R, you can lapply the same function across the list of lists, then use Reduce to cbind the resulting lists and convert it to a data frame. Takes two steps instead of purrr's one:
cols <- lapply(all_lists, function(l) c(l, rep(NA, len - length(l))))
as.data.frame(Reduce(cbind, cols, init = NULL))
#> V1 V2 V3 V4 V5
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
For both of these, you can now set the names however you like.
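For example, with the base R result (the names below simply mirror the original list names):
df <- as.data.frame(Reduce(cbind, cols, init = NULL))
names(df) <- c("location", "organization", "person", "date", "Jobs")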
You could do:
data.frame(sapply(dyem_list, "length<-", max(lengths(dyem_list))))
location organization person date Jobs
1 USA Microsoft NULL 1989 CEO
2 Singapore University of London NULL 2001 Chairman
3 UK Boeing NULL 2018 VP of sales
4 NULL Apple NULL NULL General Manager
5 NULL NULL NULL NULL Director
Where dyem_list is the following:
dyem_list <- list(
location = list("USA","Singapore","UK"),
organization = list("Microsoft","University of London","Boeing","Apple"),
person = list(),
date = list("1989","2001","2018"),
Jobs = list("CEO","Chairman","VP of sales","General Manager","Director")
)
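Note that length<- pads a list with NULL rather than NA. If you want real NAs, one sketch is to coerce each list to a character vector before padding:
dyem_chr <- lapply(dyem_list, as.character)
len <- max(lengths(dyem_chr))
# pad each vector with NA up to the maximum length, then bind as columns
data.frame(lapply(dyem_chr, function(x) c(x, rep(NA, len - length(x)))))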

Find html table name and scrape in R

I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html. I think XML::readHTMLTable() is the right way to go, but when I try the following I get an error:
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = T, stringsAsFactors = F)
named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'
This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in "Inspect" for quite a while, but I'm not connecting the dots on how to be more precise. There doesn't seem to be a name or class for the table analogous to other examples I've found in documentation or on SO. Thoughts?
Consider using readLines() to retrieve the HTML page content and passing the result to readHTMLTable():
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)
readHTMLTable(webpage, header = T, stringsAsFactors = F) # LIST OF 3 TABLES
# $`NULL`
# Name FIPS State Numeric Code Official USPS Code
# 1 Alabama 01 AL
# 2 Alaska 02 AK
# 3 Arizona 04 AZ
# 4 Arkansas 05 AR
# 5 California 06 CA
# 6 Colorado 08 CO
# 7 Connecticut 09 CT
# 8 Delaware 10 DE
# 9 District of Columbia 11 DC
# 10 Florida 12 FL
# 11 Georgia 13 GA
# 12 Hawaii 15 HI
# 13 Idaho 16 ID
# 14 Illinois 17 IL
# ...
To return the specific data frame:
fipsdf <- readHTMLTable(webpage, header = T, stringsAsFactors = F)[[1]]
Another solution using rvest instead of XML is:
require(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table() %>%
  .[[1]]
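If you would rather select the table by its header text than by its position, here is a sketch; it assumes the "FIPS" column header text stays stable on the page:
library(rvest)
tables <- read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table()
# keep the table whose column names mention "FIPS"
fips <- Filter(function(t) any(grepl("FIPS", names(t))), tables)[[1]]
head(fips)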

R: multiply columns by rows to create country-specific index

I am trying to create a country-specific index based on the import share of certain food commodities.
I have the following data: Prices contains time-series data on commodity prices for a number of food commodities. Weights contains data on the country-specific import shares for the relevant commodities (see mock data).
What I want to do is to create a country-specific food-price index which is the sum of the price-series of imported commodities multiplied by the import share.
So in the example data the food-price index for Australia will be:
FOODct = 0.12 * WHEATt + 0.08 * SUGARt
Where c indicates country and t time.
So basically my question is: How do I multiply the columns by the rows for each country?
I have some experience with R, but in trying to solve this I seem to be punching above my weight. I also haven't found any useful pointers elsewhere, so I was hoping some of you might have good suggestions.
## Code to create mock data:
## Generate data on country weights
country<-c(rep("Australia",2),rep("Zimbabwe",3))
item<-c("Wheat","Sugar","Wheat","Sugar","Soybeans")
itemcode<-c(1,2,1,2,3)
share<-c(0.12,0.08,0.16,0.08,0.03)
weights<-data.frame(country,item,itemcode,share)
## Generate data on price index
date<-seq(as.Date("2005/1/1"),by="month",length.out=12)
Wheat<-runif(12,80,160)
Sugar<-runif(12,110,230)
Soybeans<-runif(12,60,130)
prices<-data.frame(date,Wheat,Sugar,Soybeans)
EDIT: Solution
Thanks to alexwhan for his suggestion (I can't upvote, unfortunately, due to a lack of Stack Overflow street cred), and to dnlbrky for the solution, which was easiest to implement with the original data.
## Load data.table package
require(data.table)
## Convert data to data table
prices<-data.table(prices)
weights<-data.table(weights,key="item")
## Extract names for all the food commodities
vars<-names(prices)[!names(prices) %in% "date"]
## Unstack items to create table in long format
prices<-data.table(date=prices[,date], stack(prices,vars),key="ind")
## Rename the columns
setnames(prices,c("values","ind"),c("price","item"))
## Calculate the food price index
priceindex <- weights[prices, allow.cartesian = T][, list(index = sum(share * price)),
                                                   by = list(country, date)]
## Order food price index if not done automatically
priceindex<-priceindex[order(priceindex$country,priceindex$date),]
Here's one option. There will absolutely be a neater way to do this, but it should get you going.
First, I'm going to get weights into wide format so that it's easier to work with for our purposes:
library(reshape2)
weights.c <- dcast(weights, country~item)
# country Soybeans Sugar Wheat
# 1 Australia NA 0.08 0.12
# 2 Zimbabwe 0.03 0.08 0.16
Then I've used apply to go through each row of weights.c and calculate the 'food-price index' (tell me if this is being calculated incorrectly, I think I followed the example right...).
FOOD <- as.data.frame(apply(weights.c, 1, function(x)
  as.numeric(x[2]) * prices$Soybeans +   # Soybeans share is column 2 of weights.c
  as.numeric(x[3]) * prices$Sugar +
  as.numeric(x[4]) * prices$Wheat))
Adding in the country and date identifiers:
colnames(FOOD) <- weights.c$country
FOOD$date <- prices$date
FOOD
# Australia Zimbabwe date
# 1 35.04337 39.99131 2005-01-01
# 2 38.95579 44.72377 2005-02-01
# 3 33.45708 38.50418 2005-03-01
# 4 30.42181 34.04647 2005-04-01
# 5 36.03443 39.90905 2005-05-01
# 6 46.21269 52.29347 2005-06-01
# 7 41.88694 48.15334 2005-07-01
# 8 34.47848 39.83654 2005-08-01
# 9 36.32498 40.60091 2005-09-01
# 10 33.74768 37.17185 2005-10-01
# 11 38.84855 44.87495 2005-11-01
# 12 36.45119 40.11678 2005-12-01
Hopefully this is close enough to what you're after...
I would unstack/reshape the items in the weights table, and then use data.table to join the prices and the weights.
## Generate data table for country weights:
weights <- data.table(country = c(rep("Australia", 2), rep("Zimbabwe", 3)),
                      item = c("Wheat", "Sugar", "Wheat", "Sugar", "Soybeans"),
                      itemcode = c(1, 2, 1, 2, 3),
                      share = c(0.12, 0.08, 0.16, 0.08, 0.03),
                      key = "item")
## Generate data table for price index:
prices <- data.table(date = seq(as.Date("2005/1/1"), by = "month", length.out = 12),
                     Wheat = runif(12, 80, 160),
                     Sugar = runif(12, 110, 230),
                     Soybeans = runif(12, 60, 130))
## Get column names of all the food types:
vars<-names(prices)[!names(prices) %in% "date"]
## Unstack the items and create a "long" table:
prices<-data.table(date=prices[,date], stack(prices,vars),key="ind")
## Rename the columns:
setnames(prices,c("values","ind"),c("price","item"))
prices[1:5]
## date price item
## 1: 2005-01-01 88.25818 Soybeans
## 2: 2005-02-01 71.61261 Soybeans
## 3: 2005-03-01 77.91082 Soybeans
## 4: 2005-04-01 129.05806 Soybeans
## 5: 2005-05-01 74.63005 Soybeans
## Join the weights and prices tables, multiply the share by the price, and sum by country and date:
weights[prices,allow.cartesian=T][,list(index=sum(share*price)),by=list(country,date)]
## country date index
## 1: Zimbabwe 2005-01-01 27.05711
## 2: Zimbabwe 2005-02-01 34.72842
## 3: Zimbabwe 2005-03-01 35.23615
## 4: Zimbabwe 2005-04-01 39.05027
## 5: Zimbabwe 2005-05-01 39.48388
## 6: Zimbabwe 2005-06-01 33.43677
## 7: Zimbabwe 2005-07-01 32.55172
## 8: Zimbabwe 2005-08-01 34.86790
## 9: Zimbabwe 2005-09-01 33.29748
## 10: Zimbabwe 2005-10-01 38.31180
## 11: Zimbabwe 2005-11-01 31.29709
## 12: Zimbabwe 2005-12-01 40.70930
## 13: Australia 2005-01-01 21.07165
## 14: Australia 2005-02-01 27.47660
## 15: Australia 2005-03-01 27.03025
## 16: Australia 2005-04-01 29.34917
## 17: Australia 2005-05-01 31.95188
## 18: Australia 2005-06-01 26.22890
## 19: Australia 2005-07-01 24.58945
## 20: Australia 2005-08-01 27.44728
## 21: Australia 2005-09-01 27.02199
## 22: Australia 2005-10-01 31.58282
## 23: Australia 2005-11-01 24.42326
## 24: Australia 2005-12-01 31.70109
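For comparison, here is a sketch of the same join-and-aggregate in the tidyverse (it assumes dplyr and tidyr are installed and uses the mock prices and weights tables from above):
library(dplyr)
library(tidyr)
prices %>%
  pivot_longer(-date, names_to = "item", values_to = "price") %>%  # wide -> long
  inner_join(weights, by = "item") %>%                             # attach import shares
  group_by(country, date) %>%
  summarise(index = sum(share * price), .groups = "drop")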
