I'm trying to look at a series of unusual words taken from OCR and determine which ones merit further investigation to see where/if the OCR screwed up. I've tried a few different approaches like comparing the words to existing dictionaries and so on, but I still have a large number of words to look at manually.
What I would like to do:
Send my list of words to Google's ngrams, starting from a particular year (say 1950), and get the average ngram frequency to reduce my list to the real outliers.
What I have tried:
I thought I could use the ngramr package for this, but it can only send 12 words at a time, and I have a list of thousands of words. I was hoping I could save myself some time doing it this way. Perhaps there is some way of dividing my dataset into batches of no more than 12 words and then sending them all to this package (something like the sketch below is what I have in mind), but it doesn't seem like the package is intended for this kind of approach.
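This untested sketch shows the batching I'm imagining. It assumes ngramr's ngram() accepts a vector of up to 12 phrases plus a year_start argument and returns a Frequency column per phrase and year (worth checking against the package documentation); "unusual_words.txt" is just a stand-in for my word list.

library(ngramr)

words <- readLines("unusual_words.txt")    # placeholder: one unusual word per line

# split the word list into batches of at most 12 phrases per query
batches <- split(words, ceiling(seq_along(words) / 12))

results <- lapply(batches, function(w) {
  Sys.sleep(1)                  # be gentle with the Ngram server between queries
  ngram(w, year_start = 1950)   # should return Year / Phrase / Frequency rows
})
results <- do.call(rbind, results)

# average frequency per word since 1950; the words with the lowest averages
# are the likely OCR artefacts that merit a manual look
avg <- aggregate(Frequency ~ Phrase, data = results, FUN = mean)
avg <- avg[order(avg$Frequency), ]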
Is there a better/smarter way of doing this? Ideally I would like to avoid having to export the data into Python because I want my code to be fully reproducible by someone who is only familiar with R.
Any help would be hugely appreciated.
I am struggling a bit with an analysis I need to do. I have collected data consisting of little owl calls that were recorded along transects. I want to analyse these recordings for similarity, in order to see which recorded calls are from the same owls and which are from different owls. That way I can estimate the size of the population in my study area.
I have done a bit of research, and the package warbleR seems to be suitable for this. However, I am far from an R expert and am struggling with how to go about it. Do any of you have experience with these types of analyses and maybe have example scripts? It seems to me that I could use the function cross_correlation and maybe run a PCA; however, in the warbleR vignette I looked at, they only do this for different call types and not for the same call type from different individuals, so I am not sure whether it would work.
To be able to run analyses with warbleR, you need to input the data using the "selection_table" format. Take a look at the example data "lbh_selec_table" to get a sense of the format:
library(warbleR)
data(lbh_selec_table)   # example selection table shipped with the package
head(lbh_selec_table)
The whole point of these objects is to tell R the time location in your sound files (in seconds) of the signals you want to analyze. Take a look at this link for more details on this object structure and how to import it into R.
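As a rough illustration of the workflow described in the question, using the package's own example sound files, a sketch might look like the following. Note this is only a sketch: argument names and defaults can differ between warbleR versions, and the ordination at the end (classical MDS on 1 minus the correlation matrix) is just one way to look for clusters of similar calls.

library(warbleR)

# write the example recordings to a temporary folder so the selection table
# can point at real .wav files
data(list = c("Phae.long1", "Phae.long2", "Phae.long3", "Phae.long4",
              "lbh_selec_table"))
tuneR::writeWave(Phae.long1, file.path(tempdir(), "Phae.long1.wav"))
tuneR::writeWave(Phae.long2, file.path(tempdir(), "Phae.long2.wav"))
tuneR::writeWave(Phae.long3, file.path(tempdir(), "Phae.long3.wav"))
tuneR::writeWave(Phae.long4, file.path(tempdir(), "Phae.long4.wav"))

# build a checked selection table from the plain data frame
st <- selection_table(X = lbh_selec_table, path = tempdir())

# pairwise spectrogram cross-correlation; xc should be a matrix of peak
# correlations (check ?cross_correlation and its 'output' argument)
xc <- cross_correlation(st, wl = 300, ovlp = 90, path = tempdir())

# ordinate the similarity matrix and look for clusters that might
# correspond to individual birds
mds <- cmdscale(as.dist(1 - xc), k = 2)
plot(mds, xlab = "Dimension 1", ylab = "Dimension 2")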
I am a beginner in R, so I am not sure about the title of my question; sorry for that. I will try to explain.
My professor gave me a NetCDF atmospheric data file (18.3 MB). This file has 8 dimensions and 8 variables, and I have to work with 4 of the variables. Each variable (time, site number, urban site, pm10) has 683016 values. For example:
Urban site number: [2, 5]
Site number: [1, 2, 3, 4, 5, 6]
Time: [1-3-2012, 2-3-2012, ...] (24 hourly values are recorded each day)
pm10: [1, 2, 3, 4, 5, 6, ...] (a different value for each hourly record, with some missing values)
I have to reduce this data set to only the urban sites and 1-3-2012 (in effect, turning the spatio-temporal data into spatial data). I want my final data set to look like this:
Column 1 (time): 1-3-2012, 1-3-2012, 1-3-2012, 1-3-2012, 1-3-2012, 1-3-2012
Column 2 (urban site number): 2, 2, 2, 5, 5, 5
Column 3 (pm10 value): 1, 2, 3, NA, 4, 5
As I only know very basic R commands, I can't see how to solve this problem. I don't even understand how to find an example of this type of problem on the internet.
So please give me some suggestions, or links on what I need to learn, to solve this problem in R. Please help me out!
I think you're trying to reshape the dataset, but I'm afraid I can't tell what your current dataset looks like.
Could you elaborate on what your dataset looks like right now?
There are packages that help with reshaping, such as {reshape} or {plyr}, but I need more detail to suggest which one you should use.
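In the meantime, if the file is a plain NetCDF, a rough sketch with the ncdf4 package might look like the lines below. The variable names are only guesses based on your description, and how to pick out the day 1-3-2012 depends on how time is encoded in the file, so treat this purely as a starting point.

library(ncdf4)

nc <- nc_open("pm10_data.nc")             # placeholder file name

# the variable names below are guesses from the description above;
# check names(nc$var) and names(nc$dim) to see what they really are
time_raw <- ncvar_get(nc, "time")
site     <- ncvar_get(nc, "site_number")
urban    <- ncvar_get(nc, "urban_site")   # e.g. c(2, 5)
pm10     <- ncvar_get(nc, "pm10")
nc_close(nc)

# put the parallel vectors into one long data frame
df <- data.frame(time = time_raw, site = site, pm10 = pm10)

# keep only the urban sites; filtering on 1-3-2012 will depend on how time
# is stored (often as "hours since ..."), so that step is not shown here
df_urban <- df[df$site %in% urban, ]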
I have been messing around with R for the last year and now want to go a little deeper. I want to learn more about the ff package and other big-data packages, because I have been having trouble getting through some of the documentation.
I like to learn by doing, so let's say I have a huge CSV called data.csv and it's 300 MB. It has 5 columns: Url, PR, tweets, likes, age. I want to deduplicate the list based on URLs. Then I want to plot PR and likes on a scatter plot to see if there is any correlation. How would I go about doing that basic analysis?
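To make the target analysis concrete, this is roughly what I would write if the file fit comfortably in memory (a data.table sketch, using the column names above; 300 MB usually does fit in RAM, but I'd still like to understand the big-data route):

library(data.table)

dt <- fread("data.csv")               # read the whole CSV into memory

dt <- unique(dt, by = "Url")          # de-duplicate on the Url column

# quick look at whether PR and likes move together
plot(dt$PR, dt$likes, xlab = "PR", ylab = "likes")
cor(dt$PR, dt$likes, use = "complete.obs")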
I always get confused by the chunking in big-data workflows and how you can't load everything in at once.
What are some common problems you have run into using the ff package or big data in general?
Is there another package that works better?
Basically any information to get started using a lot of data in R would be useful.
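To show where I'm at, my (possibly wrong) understanding is that the ff route would start something like the sketch below, although I don't yet see how the de-duplication step fits in:

library(ff)

# the data stay on disk; only blocks of rows are pulled into RAM at a time
dat <- read.csv.ffdf(file = "data.csv", header = TRUE,
                     first.rows = 100000, next.rows = 100000)

# an individual ff column can be materialised as an ordinary vector with []
pr    <- dat$PR[]
likes <- dat$likes[]
plot(pr, likes, xlab = "PR", ylab = "likes")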
Thanks!
Nico
I'm wondering if there is a possibility of downloading (from a website) a subset of an original dataset stored in .RData format. The easiest way, of course, is to proceed in this manner:
con <- url("http://xxx.com/datasets/dataset.RData")
loaded <- load(con)              # load() returns the names of the restored objects
close(con)
dat <- get(loaded[1])            # assuming the file holds a single data frame
dat_sub <- dat[dat$var == "yyy", ]
However, I'm trying to speed up my code and avoid downloading unnecessary columns.
Thanks for any feedback.
Matt
There is no mechanism for that task: an .RData file has to be downloaded and loaded in its entirety before any of its contents can be accessed. There is also no mechanism for inspecting the contents of .RData files without loading them. In the past, when this has been requested, people have been advised to convert to a real database management system, which can serve only the rows and columns that are asked for.
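For illustration only, the database route could look roughly like this with RSQLite (the object, table, and column names here are placeholders, and a client-server database would be needed if the data must stay on a remote website):

library(DBI)
library(RSQLite)

# one-off conversion: load the .RData file once and store its data frame in SQLite
load("dataset.RData")                        # assume this restores a data frame called dataset
con <- dbConnect(SQLite(), "dataset.sqlite")
dbWriteTable(con, "dataset", dataset)
dbDisconnect(con)

# afterwards, pull only the rows and columns that are needed
con <- dbConnect(SQLite(), "dataset.sqlite")
dat_sub <- dbGetQuery(con,
  "SELECT var, value FROM dataset WHERE var = 'yyy'")   # 'value' is a placeholder column
dbDisconnect(con)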