I have 2 data frames (temp_inp & temp) with 5 and 50k rows respectively, and both have around 3k columns.
I want to calculate the squared Euclidean distance, sum((x1-x2)^2), between every row of data frame 1 and every row of data frame 2 (no need to take square roots or divide by nrow; I just need the rows with minimum distance).
The final output should have:
rows = rows from temp (50k)
columns = rows from temp_inp (5)
After that, I want to take the 5 rows with the minimum distances from data frame 1.
Note that approaches like rbind plus the dist function will not work due to the size of the data. What I tried is this:
temp1 <- matrix(NA_real_, nrow(temp), nrow(temp_inp))  # preallocate the result
for (i in 1:nrow(temp_inp)) {
  for (j in 1:nrow(temp)) {
    # squared distance between row j of temp and row i of temp_inp
    temp1[j, i] <- sum((temp[j, ] - temp_inp[i, ])^2)
  }
}
This is taking an insane amount of time (8 hours).
I have been racking my brain to find a vectorized version of the same code. Please help me if you have any idea, or if you know of a built-in function/package that will do this.
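A possible vectorized approach (a sketch of a fix, not the asker's code): convert both data frames to matrices and use the identity sum((x - y)^2) = sum(x^2) + sum(y^2) - 2 * sum(x * y), which collapses the whole 50k x 5 distance table into one matrix product.
A <- as.matrix(temp)      # 50k x 3k
B <- as.matrix(temp_inp)  # 5 x 3k

# sq_dist[j, i] = squared distance between row j of temp and row i of temp_inp
sq_dist <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)

# one reading of the final step: the 5 rows of temp closest to any row of temp_inp
head(order(apply(sq_dist, 1, min)), 5)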
Related
I'm trying to write a function that finds the number of times the values in a data frame are above a certain number x (in this case, 3). Basically, the data start at 1.0, increase, then drop below 1.0 again (over a span of about 150 data points). I want the function to return the number of times the values are above this threshold. I'm fairly new to R and am confused about how to go about this. Any help is appreciated. Thank you!
If your data frame is called df, then sum(df$x > 3) will return the number of rows of df where x is greater than 3.
If there are missing values in x and you want to ignore them, use sum(df$x > 3, na.rm = TRUE).
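A tiny self-contained check (the column name x and the numbers are made up for illustration):
df <- data.frame(x = c(1.0, 2.4, 3.2, 4.1, 3.5, 0.8, NA))
sum(df$x > 3, na.rm = TRUE)  # 3: the values 3.2, 4.1 and 3.5 exceed the threshold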
I have the following data frame in R:
df <- data.frame(time = c("10:01", "10:05", "10:11", "10:21"),
                 power = c(30, 32, 35, 36))
Problem: I want to calculate the energy consumption, so I need the sum of the time differences multiplied by the power. But every row holds only one timestamp, which means I need to subtract values from two different rows, and that is the part I cannot figure out. I guess I need some kind of function, but I couldn't find any hints online.
Example: it has to subtract row1$time from row2$time, and then multiply that difference by row1$power.
As I said, I do not know how to implement this in one call; the subtraction part confuses me since it takes values from different rows.
Expected output: E=662
Try this:
tmp = strptime(df$time, format="%H:%M")      # parse the clock times
df$interval = c(as.numeric(diff(tmp)), NA)   # minutes until the next reading; NA for the last row
sum(df$interval*df$power, na.rm=TRUE)        # sum of interval * power, ignoring the last row
I got 662 back.
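A slightly more defensive variant of the same idea (my sketch, same data): diff() on date-times chooses its units automatically, so pinning the units down with difftime avoids surprises if the gaps ever grow beyond minutes.
tmp <- as.POSIXct(strptime(df$time, format = "%H:%M"))
mins <- as.numeric(difftime(tmp[-1], tmp[-length(tmp)], units = "mins"))  # 4, 6, 10
sum(mins * df$power[-nrow(df)])  # 662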
I'm trying to get the SD (standard deviation) of the last 20 values for each row in a data.frame. The procedure would be something like this in Excel, but I'm trying to do it in R with dplyr.
Your question is thin and does not provide a data source, but here is an example of what you want: a data frame with 50 rows of random numbers in two columns.
The last twenty rows (the part you asked about) are selected from that data frame, and a loop then applies a summary function to each of those 20 rows.
set.seed(14)
n <- sample(100, 50)
b <- sample(200, 50)
cdf <- data.frame(n, b)  # new data frame with two columns of random numbers
nrow(cdf)

last <- nrow(cdf) - 19          # index of the first of the last 20 rows
for (i in last:nrow(cdf)) {     # loop to apply math to the last 20 rows of data
  print(mean(cdf$b[i]))         # mean() of one value just returns it; stand-in for the real statistic
  print(i)
}
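Since the question mentions dplyr and what reads like a rolling standard deviation, here is a hedged sketch with zoo::rollapply (the column name value, the made-up data, and the 20-row trailing window are all my assumptions about the Excel formula):
library(dplyr)
library(zoo)

df <- data.frame(value = rnorm(50))  # made-up data standing in for the real column

# trailing 20-row standard deviation; NA until 20 observations are available
df <- df %>%
  mutate(sd20 = rollapply(value, width = 20, FUN = sd,
                          fill = NA, align = "right"))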
I have a time-series panel data frame with a specific ID in the first column and a weekly employment status: unemployed (1), employed (0).
I have 261 variables (one column per week) and 1,000,000 observations.
I would like to count the maximum number of times '1' occurs consecutively for every row in R.
I have looked a bit at rowSums and rle(), but as far as I can tell a plain row sum is not what I need, since it is essential that the values be consecutive.
You can see an example of the structure of my data set here - just imagine more rows and columns
We can write a little helper function that returns the maximum number of times a certain value is consecutively repeated in a vector, with a handy default of val = 1 for this use case:
most_consecutive_val = function(x, val = 1) {
  with(rle(x), max(lengths[values == val]))
}
Then we can apply this function to the rows of your data frame, dropping the first column (and any other columns that shouldn't be included):
apply(your_data_frame[-1], MARGIN = 1, most_consecutive_val)
If you share some easily imported sample data, I'll be happy to help debug in case there are issues. dput is an easy way to share a copy/pasteable subset of data, for example dput(your_data[1:5, 1:10]) would be a great way to share the first 5 rows and 10 columns of your data.
If you want to avoid warnings and -Inf results in the case where there are no 1s, use Ryan's suggestion from the comments:
most_consecutive_val = function(x, val = 1) {
  with(rle(x), if (all(values != val)) 0 else max(lengths[values == val]))
}
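For a quick sanity check, here is the helper on a tiny made-up example (0 = employed, 1 = unemployed, with the ID in the first column, as in the question):
x <- c(0, 1, 1, 0, 1, 1, 1, 0)
most_consecutive_val(x)  # 3: the longest run of 1s

toy <- data.frame(id = 1:2,
                  w1 = c(1, 0), w2 = c(1, 1), w3 = c(0, 1), w4 = c(1, 1))
apply(toy[-1], MARGIN = 1, most_consecutive_val)  # 2 3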
I have a data matrix in R with 45 rows. Each row represents the value of an individual sample. I need to do a trial simulation: I want to pair up samples randomly and calculate their differences. I want a large sample (maybe 10,000) from all the possible permutations and combinations.
This is how I have managed to do it so far:
My data matrix ("data") has 45 rows and 2 columns. I selected 45 rows randomly and subtracted them from another 45 randomly selected rows.
n1 <- data[sample(nrow(data), size = 45, replace = FALSE), ] -
      data[sample(nrow(data), size = 45, replace = FALSE), ]
This gave me a random set of differences of size 45.
I made 50 such vectors (n1 to n50) and used rbind, which gave me a big data matrix of random differences.
Of course, many rows from the first random draw were paired with the same row from the second draw and cancelled out to zero. I removed those with the following code:
row_sub = apply(new, 1, function(row) all(row != 0))  # TRUE for rows with no zero entries
new.remove.zero <- new[row_sub, ]                     # keep only the non-cancelled rows
BUT, is there a cleaner way to do this? A simpler way to generate random pairs of rows, calculate their differences, and bind them together as a new matrix?
Thanks in advance.
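One cleaner sketch, under the assumption that the goal is simply a large sample of differences between distinct random rows (the object name data and the 10,000 figure come from the question; the rest is illustrative): draw all the index pairs at once and subtract in a single vectorized step, so self-pairs can be dropped up front instead of filtering zero rows afterwards.
set.seed(1)
n_pairs <- 10000

i <- sample(nrow(data), n_pairs, replace = TRUE)
j <- sample(nrow(data), n_pairs, replace = TRUE)
keep <- i != j                              # drop self-pairs before subtracting
diffs <- data[i[keep], ] - data[j[keep], ]  # all differences in one vectorized step
nrow(diffs)                                 # slightly under 10000 after the drop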