Scilab Data stretching

I have a data file with 2 columns. The first column runs from 0 to 1390; the second column holds different values. (The 1st column is X pixel coordinates, the 2nd is intensity values.)
I would like to "stretch" the data so that the first column runs from 0 to 1516 and the second column is linearly interpolated at the new data points.
Is there a simple way to do this in Scilab?
Data looks like this:
0 300.333
1 289.667
2 273
...
1388 427
1389 393.667
1390 252

Interpolation
You can linearly interpolate using interpln. Following the demo implementation in the docs gives the code below.
Example code
x=[0 1 2 1388 1389 1390];
y=[300.333 289.667 273 427 393.667 252];
plot2d(x',y',[-3],"011"," ",[-10,0,1400,500]);  // plot the known points
yi=interpln([x;y],0:1390);                      // interpolate at every integer x
plot2d((0:1390)',yi',[3],"000");                // overlay the interpolated curve
Resulting plot
Extrapolation
I think you are actually thinking of extrapolation, since the new points lie outside the known measurements rather than in between.
You should determine whether you would like to fit the data with datafit. For a tutorial see here or here.

The question was how to "stretch" the y vector from 1391 values to 1517 values. That is possible with interpln, as suggested by @user1149326, but we need to stretch the x vector before interpolating:
x=[0 1 2 1388 1389 1390];
y=[300.333 289.667 273 427 393.667 252];
x2=linspace(0,1390,1517);   // 1517 sample points spread over the old x range
yi=interpln([x;y],x2);      // interpolate y at the stretched positions
x3=0:1516;                  // new x axis, 0..1516 (1517 values)
plot2d(x3',yi',[3],"000");

Related

add value to column within a data frame only for a subset

I have a data frame called APD, and I would like to assign a value to the column "Fitted_Voltage", but only for a specific subset (grouped by Serial_number). How do I do that?
In the following example I want to assign 150 to Fitted_Voltage, but only for Serial_number 912009913.
Serial_number Lot Wafer Amplification Voltage Fitted_Voltage
912009913 9 912 1878 375.3 NA
912009913 9 912 1892 376.8 NA
912009913 9 912 1900 377.9 NA
812009897 8 812 3931.1 370.5 NA
812009897 8 812 3934.8 371 NA
812009897 8 812 3939.9 372.3 NA
...
...
Finally I would like to do this automatically: I fit some data points and want to assign the fitted result to each Serial_number.
The process could be:
Fit via the function function_to_observe and do a point-wise inverse regression at a specific value of 150 for serial number 912009913:
function_to_observe(150)
This yields the result
[1] 360.6395
which shall be stored in the data frame in the column Fitted_Voltage for this single Serial_number.
Then the next Serial_number, 812009897, will be fitted, its value stored, and so on.
I know I can add the value to the column, but not limited to the subset:
APD["Fitted_Voltage"] <- Fitted_voltage <- function_to_observe(150)
Update: Following Eric Lecoutre's answer I now have:
ID<- 912009913
ID2<- 912009914
APD_result<- data.frame(Serial_Number=rep(c(ID, ID2),each=1), Fitted_Voltage=NA)
comp <- tapply(APD_result$Fitted_Voltage, APD_result$Serial_Number, function_to_observe = inverse((function(x=150) exp(exp(sum(x^0*VK[1],x^1*VK[2],x^2*VK[3],x^3*VK[4])))), 50, 375))
APD_result$Fitted_Voltage = comp[APD_result$Serial_Number]
This works very well, but I need to apply some changes, which are minor in principle but not so minor for me:
1.) The Serial_numbers have to be added automatically (here given as the two examples ID and ID2).
2.) I do not get tapply to run since I removed Voltage. Sorry for not specifying this in my previous question: the Voltage is not of interest; I only want Serial_number and the matching Fitted_Voltage in the final frame.
It is not entirely clear to me what your function_to_observe does; I assume it exploits the set of Voltage values for a given Serial_Number.
I prepared a small function that does so, with an additional argument (value).
Does the following answer your question?
df <- data.frame(Serial_Number=rep(c("a","b"),each=3), Voltage=abs(100*rnorm(6)), FittedVoltage=NA)
function_to_observe <- function(vec, value=150) {mean(vec)+value}  # toy stand-in for the real fit
comp <- tapply(df$Voltage, df$Serial_Number, function_to_observe, value=150)  # one result per serial number
df$FittedVoltage <- comp[df$Serial_Number]  # map each row's serial number to its result
The result is:
Serial_Number Voltage FittedVoltage
1 a 21.01196 205.4419
2 a 37.04815 205.4419
3 a 108.26565 205.4419
4 b 121.37657 264.3040
5 b 39.92053 264.3040
6 b 181.61485 264.3040
(Yeah, I know FittedVoltage here is totally unrelated to Voltage... I just do not understand what your 150 does here.)
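For the two follow-up points, a minimal sketch, assuming APD holds the raw data and that fit_for_serial() is a hypothetical wrapper which fits one serial number's subset and evaluates the inverse regression at 150:
ids <- unique(APD$Serial_number)                 # 1.) serial numbers collected automatically
APD_result <- data.frame(
  Serial_number  = ids,
  Fitted_Voltage = sapply(ids, function(id) {
    sub <- APD[APD$Serial_number == id, ]        # rows for this serial number only
    fit_for_serial(sub, value = 150)             # hypothetical: per-group fit + inverse regression
  })
)
This builds the final frame with only Serial_number and Fitted_Voltage, so no Voltage column is needed (point 2).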

Calculating overlapping between different datasets using R

I have 49 datasets which include different values like this.
inPUT=read.table(file="TEST.csv", sep=",", header=T, row.names=1)
class(inPUT)
[1] "data.frame"
length(inPUT)
[1] 49
head(inPUT)
GO.1 GO.2 ... GO.49
1 811 811 ... 811
2 813 814 ... 814
3 814 819 ... 817
length(inPUT$GO.1)
[1] 191
length(inPUT$GO.49)
[1] 170
I'd like to calculate the overlap between two different datasets among the total of 49 (all possible pairwise combinations). Is there an R package to calculate how two sets overlap? (I'm still new to this.) Any ideas?
Is it so that every dataset only represents one column, as in your example?
One possible option is to use the %in% operator, e.g.
mean(GO.1 %in% GO.2)
will tell you the proportion of observations in GO.1 that are also present in GO.2. If you want to calculate the overlap for all pairs, you can wrap it in a function, as sketched below.
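A small sketch of that idea, assuming the columns of inPUT are the 49 datasets:
overlap <- function(a, b) mean(a %in% b)   # share of a's values that occur in b
overlap_matrix <- sapply(inPUT, function(b) sapply(inPUT, overlap, b = b))
overlap_matrix["GO.1", "GO.2"]             # same as mean(inPUT$GO.1 %in% inPUT$GO.2)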
You can use outer to compute all pairwise combinations, and intersect to find the members that are in both sets. Note that outer needs a function that is vectorized and returns one value per pair, so wrap the call, e.g.:
outer(names(inPUT), names(inPUT),
      Vectorize(function(i, j) length(intersect(inPUT[[i]], inPUT[[j]]))))
gives the matrix of pairwise overlap counts.

How does R know that I have no entries of a certain type

I have a table where one of the variables is country of registration.
table(df$reg_country)
returns:
AR BR ES FR IT
123 202 578 642 263
Now, if I subset the original table to exclude one of the countries
df_subset<-subset(df, reg_country!='AR')
table(df_subset$reg_country)
returns:
AR BR ES FR IT
0 202 578 642 263
This second result is very surprising to me, as R seems to somehow magically know that I have removed the entries for AR.
Why does that happen?
Does it affect the size of the second data frame (df_subset)? If yes, is there a more efficient way to subset in order to minimize the size?
df$reg_country is a factor variable, which stores all possible levels in its levels attribute. Check levels(df_subset$reg_country).
Factor levels only have a significant impact on data size if you have a huge number of them, which I wouldn't expect here. However, you can use droplevels(df_subset$reg_country) to remove unused levels, as in the sketch below.
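A quick illustration of both points, assuming df_subset from the question:
levels(df_subset$reg_country)                              # "AR" is still listed as a level
df_subset$reg_country <- droplevels(df_subset$reg_country)
table(df_subset$reg_country)                               # "AR" no longer appears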

Looping within a loop in R

I'm trying to build quite a complex loop in R.
I have a set of data stored as an object called p_int (p_int is peak intensity).
For this example the structure of p_int i.e. str(p_int) is:
num [1:1599]
The size of p_int can vary i.e. [1:688], [1:1200] etc.
What I'm trying to do with p_int is to construct a complex loop to extract the monoisotopic peaks, these are peaks with certain characteristics which will be extracted into a second object: mono_iso:
Search the first eight data points in p_int. Of these eight, find the one with the greatest value (this value also needs to be above 50).
Once this result has been found, record it into mono_iso.
The loop then fixes on the position where this result is located within the large dataset. From this position it skips the next result along the dataset before doing the same for the next set of 8 results.
So something similar to this:
16 Results: 100 120 90 66 220 90 70 30 70 100 54 85 310 200 33 41
** So, to begin with, the loop would take the first 8 results:
100 120 90 66 220 90 70 30
**It would then decide which peak is the greatest:
220
**It would determine whether 220 was greater than 50
IF YES: It would record 220 into "mono_iso"
IF NO: It would move on to the next set of 8 results
**220 is greater than 50... so records into mono_iso
The loop would then place its position at 220; it would then skip the "90" and begin the same thing again for the next set of 8 results, beginning at the next data point in line, in this case at the 70:
70 30 70 100 54 85 310 200
It would then record the "310" value (highest value) and do the same thing again etc etc until the end of the set of data.
Hope this makes perfect sense. If anyone could possibly help me out into making such a loop work with R-script, I'd very much appreciate it.
Use this:
mono_iso <- aggregate(p_int,
                      by=list(group=((seq_along(p_int)-1)%/%8)+1),   # fixed blocks of 8
                      function(x) ifelse(max(x)>50, max(x), NA))$x   # block max, NA if <= 50
This will put NA for groups such that max(...)<=50. If you want to filter those out, use this:
mono_iso <- mono_iso[!is.na(mono_iso)]
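Note that the aggregate one-liner works on fixed blocks of 8 rather than the skip-one-past-the-peak rule described in the question. A literal loop version of that rule, as a sketch:
mono_iso <- c()
i <- 1
while (i + 7 <= length(p_int)) {
  window <- p_int[i:(i + 7)]           # current set of 8 results
  peak <- max(window)
  if (peak > 50) {
    mono_iso <- c(mono_iso, peak)      # record the peak
    i <- i + which.max(window) + 1     # resume one position past the peak, skipping one value
  } else {
    i <- i + 8                         # no peak above 50: move on to the next set of 8
  }
}
On the 16-value example above this records 220 and then 310, as described.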

Data dictionary packing in R

I am thinking of writing a data dictionary function in R which, taking a data frame as an argument, will do the following:
1) Create a text file which:
a. Summarises the data frame by listing the number of variables by class, number of observations, number of complete observations … etc
b. For each variable, summarise the key facts about that variable: mean, min, max, mode, number of missing observations … etc
2) Creates a pdf containing a histogram for each numeric or integer variable and a bar chart for each attribute variable.
The basic idea is to create a data dictionary of a data frame with one function.
My question is: is there a package which already does this? And if not, do people think this would be a useful function?
Thanks
There are a variety of describe functions in various packages. The one I am most familiar with is Hmisc::describe. Here's its description from its help page:
" This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed."
And an example of the output:
Hmisc::describe(work2[, c("CHOLEST","HDL")])
work2[, c("CHOLEST", "HDL")]
2 Variables 5325006 Observations
----------------------------------------------------------------------------------
CHOLEST
n missing unique Mean .05 .10 .25 .50 .75 .90
4410307 914699 689 199.4 141 152 172 196 223 250
.95
268
lowest : 0 10 19 20 31, highest: 1102 1204 1213 1219 1234
----------------------------------------------------------------------------------
HDL
n missing unique Mean .05 .10 .25 .50 .75 .90
4410298 914708 258 54.2 32 36 43 52 63 75
.95
83
lowest : -11.0 0.0 0.2 1.0 2.0, highest: 241.0 243.0 248.0 272.0 275.0
----------------------------------------------------------------------------------
Furthermore, on your point about getting histograms, the Hmisc::latex method for a describe object will produce histograms interleaved in the output illustrated above. (You do need a functioning LaTeX installation to take advantage of this.) I'm pretty sure you can find an illustration of the output either on Harrell's website or in the Amazon "Look Inside" preview of his book "Regression Modeling Strategies". The book has a ton of useful material regarding data analysis.
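A minimal usage sketch along those lines (mtcars stands in for your data frame):
library(Hmisc)
d <- describe(mtcars)               # one concise summary per variable, as illustrated above
d                                   # print the text version
latex(d, file = "dictionary.tex")   # LaTeX output with small interleaved histograms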
