I am a beginner in signal analysis. I want to extract the MFCCs of a sound, because I read that MFCCs are good features for automatic speech recognition. So I tried this in RStudio:
wl <- 512
ncep <- 13
f <- peewit@samp.rate  # sampling rate of the Wave object
mfcc.peewit <- melfcc(peewit, sr = f, wintime = wl/f, hoptime = wl/f,
                      numcep = ncep, nbands = ncep*2, fbtype = "htkmel",
                      dcttype = "t3", htklifter = TRUE, lifterexp = ncep - 1,
                      frames_in_rows = FALSE, spec_out = TRUE)
It returned a 13*30 data frame, and I am confused about the output. I thought the MFCCs would be 13 single numbers, but here I got a whole data frame. Is the data frame the MFCCs, or did I do something wrong? I also read elsewhere that the 13 in "13*30" is a discrete representation of 13 coefficients; is that correct?
Thank you for your reply in advance.
The audio signal is a time series, so there is one set of MFCC coefficients per hop. Typical hop times for speech are around 20-50 ms. So the 13 dimension is the MFCC coefficients, and the 30 dimension is time (one column per frame).
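To make the two dimensions concrete, the frame count follows from simple arithmetic. A quick sketch (the 44.1 kHz rate is an assumption for illustration; the question does not state peewit's sampling rate):

```python
# One MFCC vector is produced per hop; with wintime == hoptime the frames
# do not overlap, so the frame count is just samples / window length.
sr = 44100          # assumed sampling rate (not stated in the question)
wl = 512            # window length in samples, as in the question
hop_time = wl / sr  # the hoptime passed to melfcc, in seconds

n_samples = 30 * wl          # a signal exactly 30 hops long
n_frames = n_samples // wl   # one 13-coefficient column per hop
print(n_frames)              # -> 30: the "30" in the 13 x 30 output
```

So a longer recording, or a shorter hop time, simply yields more columns; the 13 rows stay fixed by numcep.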
I am trying to read in a tab-separated csv file using read_delim(). For some reason the function seems to change some field entries into different integer values:
# Here some example data
# This should have 3 columns and 1 row
file_string = c("Tage\tID\tVISITS\n19.02.01\t2163994407707046646\t40")
# reading that data using read_delim()
data = read_delim(file_string, delim = "\t")
view(data)
data$ID
2163994407707046656 # This should be 2163994407707046646
I totally do not understand what is happening here. If I change the column type to character, the entry stays the same. Does anyone have an explanation for this?
Happy about any help!
Your number has so many digits that it does not fit into an R double. According to the IEEE 754 specification, a double has 53 bits of precision, which is roughly 15-16 decimal digits; integers beyond 2^53 can no longer all be represented exactly. You hit that limit with as.double("2163994407707046646").
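The same rounding can be reproduced in any language that uses IEEE 754 doubles; Python's float is a 64-bit double, just like R's numeric:

```python
x = 2163994407707046646  # the ID from the question

# Integers beyond 2**53 are no longer all exactly representable as doubles.
print(2 ** 53)           # 9007199254740992
# Converting to double rounds x to the nearest representable value,
# which is exactly the value read_delim produced.
print(int(float(x)))     # 2163994407707046656
```

Reading the column as character (or as a 64-bit integer type) avoids the conversion entirely, which is why the value survives with col type character.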
I have mirrored some code to perform an analysis, and everything is working correctly (I believe). However, I am trying to understand a few lines of code related to splitting the data up into 40% testing and 60% training sets.
To my current understanding, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned 1 are pulled into the training set, and the 2's into the testing set.
Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure of what is actually being replaced. Currently, I do not believe it is the actual data being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, it seems to be accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.
To test the lines in question, I created a dataframe with 10 unique values (1 through 10).
If the data values themselves were being sampled with replacement, I would expect to see some duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers, and the data values were never duplicated. To me, this suggests the data itself is not being replaced.
If I set replace = FALSE, I get this error:
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test==1,]
testing2 <- df[test==2,]
I'd like to split my data 60/40 into training and testing, although I am not sure that this is actually happening. I think the prob argument is not doing what I expect: it does not split the data into exactly 60% and 40%. In the n=10 example, it can result in 7 training / 3 testing, or 6 training / 4 testing. With my actual larger dataset of n=2000+, it averages out to be pretty close to 60/40 (i.e., 60.3/39.7).
The way you are sampling is bound to produce a random, undesired split size unless the number of observations is huge (this is the law of large numbers at work). To make the split deterministic, decide on the number of observations for the training data and use it to sample from nrow(df):
set.seed(8)
# for a 60/40 train/test split
train_indx = sample(x = 1:nrow(df),
size = 0.6*nrow(df),
replace = FALSE)
train_df <- df[train_indx,]
test_df <- df[-train_indx,]
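For readers following along outside R, the same deterministic split can be sketched with the Python standard library (random.sample draws without replacement, like sample(..., replace = FALSE)):

```python
import random

random.seed(8)
rows = list(range(10))                                   # stand-in for the rows of df
train_idx = random.sample(rows, k=int(0.6 * len(rows)))  # exactly 6 distinct rows
test_idx = [r for r in rows if r not in train_idx]       # the remaining 4

print(len(train_idx), len(test_idx))  # -> 6 4: an exact 60/40 split every time
```

Because the training size is fixed up front, the split is always exactly 60/40, unlike the label-assignment approach where the sizes merely average out to 60/40.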
I recommend splitting the data based on Mankind_008's answer. Since I ran quite a bit of analysis based on the original code, I spent a few hours looking into what exactly it does.
The original code:
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
Answer from https://www.datacamp.com/community/tutorials/machine-learning-in-r :
"Note that the replace argument is set to TRUE: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state. This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again. The probability of choosing a 1 or a 2 should not be proportional to the weights amongst the remaining items, so you specify probability weights. Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234."
One of my main concerns was that the data values themselves were being replaced. Rather, it seems that the 1 and 2 placeholders can be assigned over and over again based on the probabilities.
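The point can be checked in a few lines; here is a Python sketch of the same scheme (random.choices samples labels with replacement, analogous to sample(2, n, replace = TRUE, prob = c(.6, .4))):

```python
import random

random.seed(1234)
rows = list(range(10))  # ten unique data rows
# Draw one group label per row, with replacement: the *labels* are reused,
# the rows themselves are never drawn more than once.
labels = random.choices([1, 2], weights=[0.6, 0.4], k=len(rows))

train = [r for r, lab in zip(rows, labels) if lab == 1]
test = [r for r, lab in zip(rows, labels) if lab == 2]

# Every row lands in exactly one set; no row is ever duplicated.
print(sorted(train + test) == rows)  # -> True
```

This is why no duplicates ever appeared across the 10 seeds: replacement applies to the label pool {1, 2}, not to the data rows.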
So I'm extracting data from a rasterbrick I made using the method from this question: How to extract data from a RasterBrick?
In addition to obtaining the data from the layer given by the date, I want to extract the data from months prior. In my best guess I do this by doing something like this:
sapply(1:nrow(pts), function(i){extract(b, cbind(pts$x[i],pts$y[i]), layer=pts$layerindex[i-1], nl=1)})
So the extraction should look at layerindex i-1, which should give the data for one month earlier. A point with layerindex = 5 should look at layer 5-1 = 4.
However, it doesn't do this and seems to give either some random number or a duplicate from months prior. What would be the correct way to go about this?
Your code is taking the value from the layer of the previous point, not the previous layer.
To see this, imagine we are looking at the point in row 2 (i = 2). Your code indicates the layer with pts$layerindex[i-1], which is pts$layerindex[1]; in other words, the layer of the point in row 1.
The fix is easy enough. For clarity I will write the function separately:
foo = function(i) extract(b, cbind(pts$x[i],pts$y[i]), layer=pts$layerindex[i]-1, nl=1)
sapply(1:nrow(pts), foo)
I have not tested it, but this should be all.
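The difference between the two index expressions is easy to see in isolation. A toy illustration (Python, 0-based indexing, but the logic carries over to R):

```python
layerindex = [5, 9, 3]  # one layer index per point (toy values)

i = 1  # the second point

# pts$layerindex[i-1]: the *previous point's* layer -- not what is wanted
wrong = layerindex[i - 1]  # 5, the layer of point 1

# pts$layerindex[i] - 1: the current point's *previous layer* -- the goal
right = layerindex[i] - 1  # 8, one layer before point 2's layer (9)

print(wrong, right)  # -> 5 8
```

The subtraction has to happen on the looked-up value, not on the subscript.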
I'm very new to R and currently working on a replication project for a meta-research course at my university. The paper examines whether having an in-home display to monitor energy consumption reduces energy usage. I have already recoded 300 lines of code, but now I ran into a problem I could not solve yet.
The source code says: bysort id expdays: egen ave15 = mean(power) if hours0105==1
I do understand what this does, but I cannot replicate it in R. id is the identifier for the examined household and expdays denotes the current day of the experiment. So ave15 is the average power consumption from midnight to 6 am, computed for every household on each day. I figured out that (EIPbasedata is the complete dataset containing hourly data)
EIPbasedata$ave15[EIPbasedata$hours0105 == 1] <- ave(EIPbasedata$power, EIPbasedata$ID, EIPbasedata$ExpDays, FUN=mean)
would probably do the job, but this gives me a warning:
number of items to replace is not a multiple of replacement length
and the results are not right either. I do not have any idea what I could do to solve this.
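For intuition, here is what that Stata line computes, sketched in plain Python with hypothetical toy data (the grouped mean is taken over, and assigned to, only the rows where the hours0105 flag is 1):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hourly records: (household id, experiment day, hours0105 flag, power)
records = [
    ("A", 1, 1, 2.0), ("A", 1, 1, 4.0), ("A", 1, 0, 9.0),
    ("B", 1, 1, 1.0), ("B", 1, 1, 3.0),
]

# Group the night-time (hours0105 == 1) rows by (id, expday) ...
groups = defaultdict(list)
for hid, day, night, power in records:
    if night == 1:
        groups[(hid, day)].append(power)

# ... and give each such row its group's mean, like egen mean(power);
# rows outside the filter get no value (Stata leaves them missing).
ave15 = [mean(groups[(hid, day)]) if night == 1 else None
         for hid, day, night, power in records]

print(ave15)  # -> [3.0, 3.0, None, 2.0, 2.0]
```

The warning in the R attempt arises because ave() is computed over all rows while the assignment targets only the filtered subset; the group means must be computed on the filtered rows, as above.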
The next thing I struggle to recode is:
xtreg ln_power0105 ihd0105 i.days0105 if exptime==4, fe vce(bootstrap, rep(200) seed(12345))
I think the right way would be using plm, but I'm not sure how to implement the if condition (days0105 is a running variable for the number of the day in the experiment and 0 if not between 0 and 6 am, ihd0105 is a dummy for having an in-home display, exptime denotes 4 am in the morning; however, I do not understand what exptime does here)
table4_1 <- plm(EIPbasedata$ln_power0105 ~ EIPbasedata$ihd0105, data=EIPbasedata, index = c("days0105"), model="within")
How do I compute the bootstrapped standard errors in plm?
I hope some expert can help me, since my R and Stata knowledge is not sufficient for this.
My lecturer provided the answer: first, I specify a subsample, which I call tmp_data here: tmp_data <- EIPbasedata[which(EIPbasedata$ExpTime == 4), ]
Then I regress on the tmp_data with as.factor(days0105) values, which is the R equivalent of i.days0105:
tmp_results <- plm(tmp_data$ln_power0105 ~ tmp_data$ihd0105 + as.factor(tmp_data$days0105), data = tmp_data, index = ("ID"), model = "within")
There are probably better and cleaner ways to do this, but I'm fine with it for now.
What is the correct program flow to write different sized data frame to the same worksheet but ensure only the most recent data values written are visible?
Here was my original sequence:
gc = pygsheets.authorize(outh_file=oauth_file)
sh = gc.open(sheet_name)
wks = sh.worksheet_by_title(wks_name)
wks.set_dataframe(df, (1, 1))
The problem with the above sequence is that if the 1st write was 3800 rows x 12 cols and the 2nd write was 2400 rows x 12 cols, the wks would still show data from the prior write for rows beyond 2400.
My 2nd solution (basically a hack just to get it to work for me):
gc = pygsheets.authorize(outh_file=oauth_file)
sh = gc.open(spreadsheet_name)
wks = sh.worksheet_by_title(sheet_name)
sh.del_worksheet(wks)
sh.add_worksheet(sheet_name, rows=len(df) + 1, cols=len(df.columns))
wks = sh.worksheet_by_title(sheet_name)
wks.set_dataframe(df, (1, 1))
The above sequence basically does what I want, but I do not like having to delete the wks (I lose all my manual formatting). I know there must be a correct way to accomplish this, but I do not know the pygsheets API very well.
Will a more advanced pygsheets user please advise on the proper program flow and methods to use?
TIA,
--Rj
fit=True will basically resize the sheet to fit your data frame, so if you want to keep the sheet at the same size, you can clear the sheet before the next write. It would be easier than your second solution. Also, if you just want to clear the range you had written earlier, you can pass a range to the clear function.
wks.set_dataframe(df, (1, 1))  # first write (e.g. 3800 rows)
wks.clear()                    # before the next write, wipe the old values
wks.set_dataframe(df, (1, 1))  # second write; no stale rows remain