Write data to Excel with certain format - r

I'm currently doing this in Perl, and I'd like to find a more efficient/faster way to do it. Any advice is appreciated!
What I'm trying to do is extract certain data from a csv/xlsx file and write it into an Excel file in a layout that Bloomberg can read.
Here is an example of the csv file:
Account.Name Source.Number Source.Name As.Of.Date CUSIP.ID Value
AR PSF30011202 DK 3/31/2016 111165194 100.00
AR PSF30011602 MOF 3/31/2016 11VVA0WE4 150.00
AR PSF30014002 OZM 3/31/2016 11VVADWF3 125.00
FI PSF30018502 FS 3/31/2016 11VVA2625 170.00
FI PSF30018102 IP 3/31/2016 11VVAFPH2 115.00
....
What I want to have in the Excel file is that if Account.Name = AR, then:
Cell A1 =Source.Name. E.g. DK.
Cell A2 =weight of Value. E.g. the weight of DK is 0.151515 (100/660).
Cell A3 = =BDH("CUSIP.ID CUSIP","PX_LAST","01/01/2000","As.Of.Date","PER=CM"). E.g. =BDH("111165194 CUSIP","PX_LAST","01/01/2000","03/31/2016","PER=CM")
Cell D1 =MOF
Cell D2 =0.227273
Cell D3 = =BDH("11VVA0WE4 CUSIP","PX_LAST","01/01/2000","03/31/2016","PER=CM")
There are two columns in between because, if DK's CUSIP is valid, A3 and below would hold the dates, B3 and below the monthly prices from Bloomberg, and C4 and below the log returns of the monthly prices (=LN(B4/B3)).
Below is what it should look like:

I don't know anything about Perl and I'm not sure what you are doing, but it looks like you are getting stock prices. Is that right? Maybe you can download what you need from Yahoo Finance and get rid of Bloomberg altogether. Take a look at the link below and see if it helps you get what you need.
http://www.financialwisdomforum.org/gummy-stuff/Yahoo-data.htm
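For completeness, if you do stick with Bloomberg's BDH formulas, here is a minimal R sketch of one way to build a sheet in the layout described above. It is only a sketch: it assumes the openxlsx package and the column names from the sample data, and the input/output file names are hypothetical, not the original Perl workflow.
library(openxlsx)
dat <- read.csv("positions.csv", stringsAsFactors = FALSE)  # hypothetical input file with the columns shown above
ar  <- subset(dat, Account.Name == "AR")
ar$Weight <- ar$Value / sum(ar$Value)       # weight of each Source.Name within AR
wb <- createWorkbook()
addWorksheet(wb, "AR")
for (i in seq_len(nrow(ar))) {
  col <- 1 + (i - 1) * 3                    # columns A, D, G, ... leaving two columns free for the BDH output
  writeData(wb, "AR", ar$Source.Name[i], startCol = col, startRow = 1)
  writeData(wb, "AR", ar$Weight[i],      startCol = col, startRow = 2)
  bdh <- sprintf('BDH("%s CUSIP","PX_LAST","01/01/2000","%s","PER=CM")',
                 ar$CUSIP.ID[i],
                 format(as.Date(ar$As.Of.Date[i], "%m/%d/%Y"), "%m/%d/%Y"))
  writeFormula(wb, "AR", bdh, startCol = col, startRow = 3)
}
saveWorkbook(wb, "bloomberg_input.xlsx", overwrite = TRUE)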

Related

Add a column of length matching that of another dataframe AND adjust the value of that column in each row depending on a filename

I have datasets formatted in a way represented by the set below:
FirstName Letter
Alexsmith A1
ThegreatAlex A6
AlexBobJones1 A7
Bobsmiles222 A1
Christopher A9
Christofer A6
I want to change it to this:
School FirstName Letter
Greenfield Alexsmith A1
Greenfield ThegreatAlex A6
Greenfield AlexBobJones1 A7
Greenfield Bobsmiles222 A1
Greenfield Christopher A9
Greenfield Christofer A6
I want to add a leftmost column indicating which school the dataset comes from. I am importing this data from csv into R to begin with, and the filenames already have the school name in them.
Is it possible to retrieve the school name from the file name? The filenames are formatted like this: SCHOOLNAME_1, SCHOOLNAME_2, etc. The numbers do not need to be retained.
My goal here is to automate this process through a loop because of how many of these datasets I will be accumulating, which is why I am starting small with this question.
I tried something like this:
School <- c(length(schoolimport))
but don't know how to add in the values of each cell
Thank you & I am happy to clarify anything
Assuming you want them all in the same data frame, my suggestion would be to use the functions purrr::map_dfr and fs::dir_ls. The files will need to be in the same format for this to work.
Put the files in their own folder, then do
library(fs)     # dir_ls()
library(purrr)  # map_dfr()
library(readr)  # read_csv()
list_of_files <- dir_ls(folder_name)
list_of_files |>
  map_dfr(read_csv, .id = "school_name")
This will return an appended data frame with the file names added as a column called 'school_name'. You could then use regular expressions to extract the school name from the file name.
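For example, assuming the combined data frame from the pipeline above is assigned to schools and the files are named like GREENFIELD_1.csv (both assumptions for illustration), a base-R regular expression along these lines could recover the school name:
# Hypothetical names: `schools` is the combined data frame, `school_name` holds the file paths
schools$School <- sub("_\\d+\\.csv$", "", basename(schools$school_name))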

How can I count the number of individuals in populations, as listed in order, from a vcf file

I would like to get the number of individuals in each population, in the order in which populations are read in, from a vcf file. The fields of my file look like this
##fileformat=VCFv4.2
##fileDate=20180425
##source="Stacks v1.45"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
##INFO=<ID=locori,Number=1,Type=Character,Description="Orientation the corresponding Stacks locus aligns in">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHALIFOUR_2003_ChHis-1 CHALIFOUR_2003_ChHis-13 CHALIFOUR_2003_ChHis-14 CHALIFOUR_2003_ChHis-15
un 1027 13_65 C T . PASS NS=69;AF=0.188;locori=p GT:DP:AD 0/1:16:9,7 0/0:39:39,0 0/0:17:17,0 0/0:39:39,0
See example file here vcf file
For example, in the file that I have linked to, I have two populations, Chalifour 2003 and Chalifour 2015. Individuals have a prefix "CHALIFOUR_2003..." that identifies this.
I would like to be able to extract something like:
Chalifour_2003* 35
Chalifour_2015* 45
With the "35" and "45" indicating the number of individuals in each population (though these numbers are made up). I don't care at all about the format of the output, I just need the numbers, and it is important that the populations are listed in the order in which they would be read into the file.
Any suggestions for avenues to try to get this information would be much appreciated.
Using the data.table package to read in the vcf file you can do the following:
library(data.table)
df <- fread("~/Downloads/ChaliNoOddsWithOuts.vcf")
samples <- colnames(df)[-c(1:9)]
table(gsub("(.*_.*)_.*","\\1", samples))
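If the counts need to come out in the order the populations appear in the header (as the question asks), one small tweak on the same idea is to tabulate a factor whose levels follow the order of first appearance:
pops <- gsub("(.*_.*)_.*", "\\1", samples)
table(factor(pops, levels = unique(pops)))  # counts in order of first appearance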
If you don't insist on using R, then this is a one-liner in bash that does the job:
grep "#CHROM" file.vcf | tr "\t" "\n" | tail -n +10 | cut -f1,2 -d'_' | uniq -c

Extracting and storing data from a very large file in R

I have a very large DAT file (16 GB). It contains information on, let's say, 1000 customers. The data looks like the sample below, where the first column represents the customer IDs:
9909814 246766 0 31/07/2012 7:00 0.03 0 0 0 0
8211675 262537 0 8/04/2013 3:00 0.52 0 0 0 0
However, the customers' data is not stored in an organized way, so I want to extract each customer's data and store it in a separate file. (I have a file that contains the customer IDs.)
For just one customer, I wrote the following code that can search through the file and extract the data. However, my problem is how to do this for all the customers while reading this big file into R.
con <- file("D:/CD_INTERVAL_READING.DAT")
open(con)
n <- 20
nk <- 100000
B <- 9909814  # customer ID for customer no. 1
customer1 <- read.table(con, sep = ",", nrow = 1)
for (i in 1:n) {
  conn <- read.table(con, sep = ",", skip = (i - 1) * nk, nrow = nk)
  ## extract just those rows that belong to a specific customer ID
  temp1 <- conn[conn$V1 == B, ]
  customer1 <- rbind(customer1, temp1)
}
customer1 <- customer1[-1, ]
library(xlsx)
write.xlsx(customer1, "D:/customer1.xlsx")
The optimal solution would probably be to import the data into a proper database, but if you really want to split the file into multiple files based on the first token, then you can use awk with this one-liner:
awk '/^/ {ofn=$1 ".txt"} ofn {print > ofn}' filetosplit.txt
It works by:
/^/ matching the start of every line,
{ofn=$1 ".txt"} setting the ofn variable to the first word (split by whitespace) with .txt appended, and
ofn {print > ofn} printing each line to the file named by ofn.
It takes me just under two minutes on my laptop to split a 1 GB file with the same format as you listed above into multiple text files. I have no idea how well that scales or if it's fast enough for you. If you want an R solution you can always wrap it into a system() call ;o)
Addendum:
Oh ... I'm guessing you are on Windows based on the path you mentioned. In that case you may need to install Cygwin to get awk.
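Following up on the system() suggestion above, a sketch of how that wrap might look from R; it assumes awk is on the PATH (e.g. via Cygwin on Windows) and uses the path from the question:
# Run the awk one-liner from R; the per-customer .txt files land in the working directory
system("awk '/^/ {ofn=$1 \".txt\"} ofn {print > ofn}' D:/CD_INTERVAL_READING.DAT")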

Create a new column for comments added to excel cell

I have data in Excel in the format shown below. The user may add comments to the Score column using the 'Insert Comment' option in Excel. I would like to extract the comments added to the Score column and put them in the 'Comments' column. Is this possible? Can you please help?
Report Component Score Comments
R1 C1 1
R2 C2 2
R3 C3 3
R4 C4 4
R5 C5 5
Here is the code I have written so far. Not sure how to proceed further. Please help.
require(readxl)
read_excel("Testfile01.xlsx")
I have yet to see this kind of functionality in read_excel, but in the meantime you could perhaps write the comments into cell contents using VBA just prior to importing the file into R.
From ExtendOffice:
Function GetComments(pRng As Range) As String
    'Updateby20140509
    If Not pRng.Comment Is Nothing Then
        GetComments = pRng.Comment.Text
    End If
End Function
You can then use the GetComments function in a worksheet cell, e.g. =GetComments(A1).
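With the Comments column populated that way (for example =GetComments(C2) filled down the column) and the file saved, it can then be imported as usual; a small sketch assuming the file name from the question:
library(readxl)
df <- read_excel("Testfile01.xlsx")  # the Comments column now holds the extracted comment text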

Download weather data by zip code tabulate area in R

I am trying to figure out how to download weather data (temperature, radiation, etc.) by ZIP Code Tabulation Area (ZCTA). Although Census data are available by ZCTA, this is not the case for weather data.
I tried to find information at http://cdo.ncdc.noaa.gov/qclcd/QCLCD?prior=N but couldn't figure it out.
Has anyone ever downloaded weather data by ZCTA? If not, has anyone had experience converting weather observation station information to ZCTA?
The National Weather Service provides two web-based APIs to extract weather forecast information from the National Digital Forecast Database (NDFD): a SOAP interface and a REST interface. Both return data in Digital Weather Markup Language (DWML), which is an XML dialect. The data elements that can be returned are listed here.
IMO the REST interface is by far the easier to use. Below is an example where we extract the forecast temperature, relative humidity, and wind speed for Zip Code 10001 (Lower Manhattan) in 3-hour increments for the next 5 days.
# NOAA NWS REST API example
# 3-hourly forecast for Lower Manhattan (Zip Code: 10001)
library(httr)
library(XML)
url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url, query = list(zipCodeList = "10001",
                                  product = "time-series",
                                  begin = format(Sys.Date(), "%Y-%m-%d"),
                                  Unit = "e",
                                  temp = "temp", rh = "rh", wspd = "wspd"))
doc <- content(response,type="text/xml") # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates,xmlValue),format="%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data,function(d)removeChildren(d,kids=list("name")))
result <- do.call(data.frame,lapply(data,function(d)xmlSApply(d,xmlValue)))
colnames(result) <- sapply(data,xmlName)
# combine into a data frame
result <- data.frame(dates,result)
head(result)
# dates temperature wind.speed humidity
# 1 2014-11-06 19:00:00 52 8 96
# 2 2014-11-06 22:00:00 50 7 86
# 3 2014-11-07 01:00:00 50 7 83
# 4 2014-11-07 04:00:00 47 11 83
# 5 2014-11-07 07:00:00 45 14 83
# 6 2014-11-07 10:00:00 50 16 61
It is possible to query multiple zip codes in a single request, but this complicates parsing the returned XML.
To get the NOAA QCLCD data into zip codes you need to use the latitude/longitude values from the station.txt file and compare them with data from the Census Bureau. This can only be done with GIS-related tools. My solution is to use a PostGIS-enabled database so you can use the ST_MakePoint function:
ST_MakePoint(longitude, latitude)
You would then need to load the ZCTA from Census Bureau into the database as well to determine which zip codes contain which stations. The ST_Contains function will help with that.
ST_Contains(zip_way, ST_MakePoint(longitude, latitude))
A full query might look something like this:
SELECT s.wban, z.zip5, s.state, s.location
FROM public.station s
INNER JOIN public.zip z
  ON ST_Contains(z.way, ST_MakePoint(s.longitude, s.latitude));
I'm obviously making assumptions about the column names, but the above should be a great starting point.
You should be able to accomplish the same tasks with QGIS (Free) or ArcGIS (expensive) too. That gets rid of the overhead of installing a PostGIS enabled database, but I'm not as familiar with the required steps in those software packages.
Weather data is only available for weather stations, and there is not a weather station for each ZCTA (the ZCTAs are much smaller than the regions covered by weather stations).
I have seen options on the NOAA website where you can enter a latitude and longitude and it will find the weather from the appropriate weather station. So if you can convert your ZCTAs of interest into lat/lon pairs (center, random corner, etc.) you could submit those to the website. But note that if you do this for a large number of ZCTAs that are close together you will be downloading redundant information. It would be better to do a one-time matching of ZCTA to weather station, then download the weather info from each station only once and merge it with the ZCTA data.
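As a rough illustration of that one-time matching step, here is a small R sketch; it assumes hypothetical data frames zcta (columns zcta, lat, lon) and stations (columns wban, lat, lon), and uses a crude planar distance rather than a proper GIS match:
# Pick the station nearest each ZCTA centroid (hypothetical column names;
# squared planar distance only, so use a real GIS tool for production work)
zcta$wban <- sapply(seq_len(nrow(zcta)), function(i) {
  d2 <- (stations$lat - zcta$lat[i])^2 + (stations$lon - zcta$lon[i])^2
  stations$wban[which.min(d2)]
})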