Would like to be able to read Google Sheets cell values into R with googlesheets package, but without any cell formatting applied (e.g. comma separators, percentage conversion, etc.).
Have tried gs_read() without specifying a range, which uses gs_read_csv(), which will "request the data from the Sheets API via the exportcsv link". Can't find a way to tell it to provide underlying cell value without formatting applied.
Similarly, tried gs_read() and specifying a range, which uses gs_read_cellfeed(). But can't find a way to indicate that I want un-formatted cell values.
Note: I'm not after the formulas in any cells, just the values without any formatting applied.
Example:
(looks like I'm not able to post images)
Here's a screenshot of an example Google Sheet:
https://www.dropbox.com/s/qff05u8nn3do33n/Screenshot%202015-07-26%2008.42.58.png?dl=0
First and third columns are numeric with no formatting applied, 2nd column applies comma separators for thousands, 4th column applies percentage formatting.
Reading this sheet with the following code:
library(googlesheets)
gs <- gs_title("GoogleSheets Test")
ws <- gs_read(gs, ws = "Sheet1")
yields:
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 4 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
Would like to be able to read a worksheet that has formatting applied (ala columns 2 and 4), but read the unformatted values (ala columns 1 and 3).
At this point, I think your best bet is to fix the imported data like so:
> ws$Number_fixed <- type.convert(gsub(',', '', ws$Number_wFormat))
> ws$Percent_fixed <- type.convert(gsub('%', '', ws$Percent_wFormat)) / 100
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 6 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
$ Number_fixed : int 123456 123457 123458
$ Percent_fixed : num 0.123 0.234 0.346
I had some hope that post-processing with functions from readr would be a decent answer, but it looks like percentages and "currency" style numbers are open issues there too.
I have opened an issue to solve this better in googlesheets, one way or another.
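The clean-up above can be wrapped in a small helper so it applies to any formatted column. This is only a sketch in base R: `parse_formatted` is a hypothetical name, not a googlesheets function, and it assumes the only formatting present is comma thousands separators and a trailing percent sign.

```r
# Hypothetical helper (not part of googlesheets): strip display formatting
# from a character vector and convert it back to numbers.
parse_formatted <- function(x) {
  is_percent <- grepl("%", x, fixed = TRUE)  # remember which entries were percentages
  cleaned <- gsub("[,%]", "", x)             # drop thousands separators and percent signs
  value <- as.numeric(cleaned)
  ifelse(is_percent, value / 100, value)     # rescale percentages to fractions
}

# Stand-in for the imported sheet, using the values shown in the question.
ws <- data.frame(
  Number_wFormat  = c("123,456", "123,457", "123,458"),
  Percent_wFormat = c("12.34%", "23.45%", "34.56%"),
  stringsAsFactors = FALSE
)
ws$Number_fixed  <- parse_formatted(ws$Number_wFormat)
ws$Percent_fixed <- parse_formatted(ws$Percent_wFormat)
str(ws)
```

Note that this returns doubles rather than integers for the comma-separated column; wrap the result in `as.integer()` if that matters.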
I don't know if I have a logical error, but my substr() call keeps returning an empty string. My initial thought was that I was cutting the string wrong; however, even with a lower starting value, I am getting an empty string in my data frame column.
I have looked at this PHP question to get some information, but didn't work: PHP: substr returns empty string
Reproducible example-
#Original Data set str() output
#'data.frame': 9245 obs. of 3 variables:
#$ Latitude : num 29.7 29.7 29.7 29.6 29.7 ...
#$ Longitude : num -82.3 -82.4 -82.3 -82.4 -82.3 ...
#$ Census Code: chr "120010011003032" "120010010004035" "120010002003009" "120010015213000" ...
#For example, even if I do this:
base::substr("120010011003032", 6, 1)
#Output : ""
#Desired output: 001100
I needed to cut census codes to generate tract information, and the tract information is usually first two being the state, next three the county, and the following six the tract.
You need base::substr("120010011003032", 1, 6) ! (EDIT: or 6, 11 per your comment)
The arguments are start, stop, not length, start; see the doc or type ?base::substr. With start greater than stop, as in (6, 1), the result is an empty string. Tip: always triple-check the R doc first. And also try copy-pasting the working examples it gives you, then see if/how they differ from yours. Or just try various argument values.
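As a sketch of the start/stop convention applied to the layout described in the question (two characters of state, three of county, six of tract), using two of the census codes from the question. The variable names are only illustrative.

```r
# Two census codes taken from the question's data.
codes <- c("120010011003032", "120010010004035")

# substr(x, start, stop): both positions are 1-based and inclusive.
state  <- substr(codes, 1, 2)    # characters 1-2
county <- substr(codes, 3, 5)    # characters 3-5
tract  <- substr(codes, 6, 11)   # characters 6-11

# If you think in terms of "start and length", convert to a stop position:
start  <- 6
len    <- 6
tract2 <- substr(codes, start, start + len - 1)

data.frame(code = codes, state, county, tract, stringsAsFactors = FALSE)
```

Because substr() is vectorised, the same calls work directly on a whole data frame column.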
base::substr("120010011003032", 6, 11)
"001100"
I am trying to make a query to use in an R package named RISmed, which will search and download relevant journal article information from the PubMed database. I want to search for two words always together, for example:
query= "gene sequencing"
search<-EUtilsSummary(query,type="esearch",db = "pubmed",mindate=2014, maxdate=2014, retmax=20)
If I use the command above, it searches for gene and sequencing separately, then for both together; that is, it captures any article whose text contains both gene and sequencing anywhere. I want to search so that "gene sequencing" is treated as a phrase, with the two words always together. How can I write that query? Would anyone please help me?
Thanks in advance !
I would try this:
query <- '"gene sequencing"[Title/Abstract]'
The PubMed search engine does accept quoted strings, and you just need to know how to preserve them within R. Using surrounding single quotes is one method. Using backslash-escaped quotes would be another. Notice that the returned value from my experiment with your code shows that backslash-escaping is how the implementers of that package do it:
> str(search)
Formal class 'EUtilsSummary' [package "RISmed"] with 6 slots
..@ db : chr "pubmed"
..@ count : num 542
..@ retmax : num 20
..@ retstart : num 0
..@ PMID : chr [1:20] "25548628" "25543043" "25542841" "25540641" ...
..@ querytranslation: chr "\"gene sequencing\"[Title/Abstract] AND 2014[EDAT] : 2014[EDAT]"
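The two quoting styles mentioned above produce exactly the same string in R; only the way it is typed (and printed) differs. A minimal base R check, with no RISmed call involved:

```r
# Same query written two ways: single-quoted, and double-quoted with escapes.
single_quoted <- '"gene sequencing"[Title/Abstract]'
escaped       <- "\"gene sequencing\"[Title/Abstract]"

identical(single_quoted, escaped)  # the two strings are the same object value

print(single_quoted)  # print() displays the backslash-escaped form
cat(single_quoted, "\n")  # cat() displays the raw characters actually stored
```

The backslashes seen in str() or print() output are display escapes, not characters in the query itself.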
This is my first post on StackOverflow and I could use a little help... Please forgive me if I am not following the correct posting protocols.
There is another example on StackOverflow that I am heavily basing my work on, but I can't quite figure out how to adapt the code. Most importantly, I am looking at the solution to the question provided.
Here is the link:
Getting the next observation from a HMM gaussian mixture distribution
Some background:
RHmm - version 2.1.0 downloaded from R Forge.
RStudio - 0.98.953
R - 3.0.2 32 bit
I am trying to figure out the following issues with my code:
How do I amend the solution from the link above (prediction of the next observation) to work with my Baum-Welch model?
Ex. hm_model <- HMMFit(obs=TWII_Train, nStates=5)
The R / RStudio session aborts when I run the Baum-Welch version of the hm_model <- HMMFit(obs=TWII_Train, dis="MIXTURE", nStates=5, nMixt=4). Can you recreate the error and propose a workaround?
Here is my R code:
library(quantmod)
library(RHmm)
getSymbols("^TWII")
TWII_Subset <- window(TWII, start=as.Date("2012-01-01"), end = as.Date("2013-04-01"))
TWII_Train <- cbind(TWII_Subset$TWII.Close - TWII_Subset$TWII.Open,
TWII_Subset$TWII.Volume)
hm_model <- HMMFit(obs=TWII_Train, nStates=5)
VitPath <- viterbi(hm_model, TWII_Train)
I'm not a user of this package and this is not really an answer, but a comment would obscure some of the structures. It appears that the "proportion" value of your model is missing (so the structures are different). The "mean" value looks like this:
$ mean :List of 5
..$ : num [1:2] 6.72 3.34e+06
..$ : num [1:2] -12.4 2420174.5
..$ : num [1:2] -2.4 1832546.5
..$ : num [1:2] -10.4 1432636.1
..$ : num [1:2] 5.02 1.96e+06
I also suspect that you should be using 2 and 5 rather than 4 and 5 for m and n. Look at the rest of the model with:
str(hm_model)
I must be missing something very basic. Hope someone can point it out. I'm trying to subset the following data frame based on a specific year and sex...
str(Bnames)
'data.frame': 258000 obs. of 4 variables:
$ X.year. : int 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
$ X.name. : Factor w/ 6782 levels "\"Aaden\"","\"Aaliyah\"",..: 3380 6632 3125 1174 2554 2449 3428 6232 2834 5517 ...
$ X.percent.: num 0.0815 0.0805 0.0501 0.0452 0.0433 ...
$ X.sex. : Factor w/ 2 levels "\"boy\"","\"girl\"": 1 1 1 1 1 1 1 1 1 1 ...
The code I have entered is
one <- subset(Bnames, X.year.==2008 & X.sex.=="boy") # I get zero rows returned
two<- subset(Bnames, X.year.==2008) # I get 2000 rows returned, which is correct
three <- subset(Bnames, X.sex.=="boy") # I get 0 rows returned
four <- subset(Bnames, X.name.=="John") # I get 0 rows returned
I don't understand. I'm using a data set that is freely available at http://plyr.had.co.nz/09-user/
If I make my own data frame by repeat sampling of c("boy","girl"), the subset works fine. Why is the code failing with the data that I started with?
The reason you are getting 0 results is that the levels of your factor columns are quoted. For instance, the X.sex. column levels are not boy and girl, but rather "boy" and "girl". This may be due to the fact that the file you imported your data.frame from had quoted fields and was read through read.table (or some equivalent function) with quoting disabled (quote=""). If that's the case, you could easily re-read the file and correct this rather annoying feature.
Anyway, to properly subset your data.frame, remember the quotes. For instance:
one <- subset(Bnames, X.year.==2008 & X.sex.=="\"boy\"")
Alternatively, you may use the ' as quote:
one <- subset(Bnames, X.year.==2008 & X.sex.=='"boy"')
If you want to get rid of the annoying quotes without having to rebuild your data.frame, just try:
Bnames[,4]<-factor(gsub('"', "", Bnames[,4]))
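To see the effect end to end, here is a small sketch on a made-up three-row stand-in for Bnames (the rows are invented for illustration; only the column names and the quoted levels mirror the question):

```r
# Toy stand-in for Bnames: the factor levels literally contain double quotes.
Bnames <- data.frame(
  X.year. = c(2008L, 2008L, 1880L),
  X.sex.  = factor(c('"boy"', '"girl"', '"boy"'))
)

levels(Bnames$X.sex.)                      # the levels include the quote characters
n_plain  <- nrow(subset(Bnames, X.sex. == "boy"))    # no level is exactly boy
n_quoted <- nrow(subset(Bnames, X.sex. == '"boy"'))  # matches the quoted level

# Strip the quotes; gsub() takes pattern, replacement, then the vector.
Bnames$X.sex. <- factor(gsub('"', "", Bnames$X.sex., fixed = TRUE))
n_fixed <- nrow(subset(Bnames, X.year. == 2008 & X.sex. == "boy"))

c(plain = n_plain, quoted = n_quoted, fixed = n_fixed)
```

Checking levels() (or str()) on a freshly imported data frame is a quick way to spot this kind of problem before subsetting.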
I am learning to use topicmodels package and R as well, and explored one of its example data set by using
str(testdata)
'data.frame': 3104 obs. of 5 variables:
$ Article_ID: int 41246 41257 41268 41279 41290 41302 41314 41333 41344 41355 ...
$ Date : chr "1-Jan-96" "2-Jan-96" "3-Jan-96" "4-Jan-96" ...
$ Title : chr "Nation's Smaller Jails Struggle To Cope With Surge in Inmates" "FEDERAL IMPASSE SADDLING STATES WITH INDECISION" "Long, Costly Prelude Does Little To Alter Plot of Presidential Race" "Top Leader of the Bosnian Serbs Now Under Attack From Within" ...
$ Subject : chr "Jails overwhelmed with hardened criminals" "Federal budget impasse affect on states" "Contenders for 1996 Presedential elections" "Bosnian Serb leader criticized from within" ...
$ Topic.Code: int 12 20 20 19 1 19 1 1 20 15 ...
If I want to create a data set in the above format in R, how do I do that?
testdata is a data.frame, one of the few fundamental R objects. You should probably start here: http://cran.r-project.org/doc/manuals/R-intro.pdf.
Some functions for creating data.frames are data.frame, read.table, read.csv. For each of these you can access their documentation by typing ?data.frame for example. Good luck.
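As a minimal sketch of the data.frame() route, the following rebuilds the first three rows shown in the question's str() output. The name my_data is arbitrary, and stringsAsFactors = FALSE is set explicitly so the text columns stay character on older R versions too.

```r
# Build a data frame with the same column names and types as testdata.
my_data <- data.frame(
  Article_ID = c(41246L, 41257L, 41268L),
  Date       = c("1-Jan-96", "2-Jan-96", "3-Jan-96"),
  Title      = c("Nation's Smaller Jails Struggle To Cope With Surge in Inmates",
                 "FEDERAL IMPASSE SADDLING STATES WITH INDECISION",
                 "Long, Costly Prelude Does Little To Alter Plot of Presidential Race"),
  Subject    = c("Jails overwhelmed with hardened criminals",
                 "Federal budget impasse affect on states",
                 "Contenders for 1996 Presedential elections"),
  Topic.Code = c(12L, 20L, 20L),
  stringsAsFactors = FALSE
)

str(my_data)  # compare against the str(testdata) output above
```

For data that already lives in a file, read.csv() or read.table() would produce the same kind of object.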