Combining information from two data sets with a loop function - R

I have two datasets: m and s. The first dataset includes the variables Frequency, p1, p2 and p3.
The second dataset includes the type of regression, the mean, and the sample size. The column names are z, mean, and samplesize, respectively.
I need to add three columns to the first dataset m as follows:
The first column m$reg1 should be m$p1 times the value of s$samplesize corresponding to s$z == 'Regression1'.
The second column m$reg2 should be m$p2 times the value of s$samplesize corresponding to s$z == 'Regression2'.
The third column m$reg3 should be m$p3 times the value of s$samplesize corresponding to s$z == 'Regression3'.
I was wondering how I can write a loop for calculating these new columns in the m dataset.
See how the datasets are created in the code below:
Frequency<-seq(1,27,1)
p1<-seq(2,28,1)
p2<-seq(10,36,1)
p3<-seq(0,26,1)
m<-data.frame(Frequency,p1,p2,p3)
z<-c('Regression1','Regression2','Regression3','Regression4')
mean<-c(2,28,1,17)
samplesize<-c(10,20,30,40)
s<-data.frame(z,mean,samplesize)

Use the same principle as we applied in this answer: first define the names of the columns or row values used to subset the tables, then perform the calculation and fill the values into a new, similarly named column.
# custom function that calculates column values
add.col <- function(i){
  # name in s$z that identifies the correct row
  reg <- paste0("Regression", i)
  # name of the m column
  p <- paste0("p", i)
  # multiply the named column from m by the corresponding samplesize in s
  return(m[, p] * s$samplesize[s$z == reg])
}
# loop through all indices
for(i in 1:3){
  # create a new column with the compound name and fill it with the appropriate values
  m[, paste0("reg", i)] <- add.col(i = i)
}

No need for a loop, if I understand your question correctly. Just do:
m$regr1 <- m$p1*s$samplesize[s$z=="Regression1"]
m$regr2 <- m$p2*s$samplesize[s$z=="Regression2"]
m$regr3 <- m$p3*s$samplesize[s$z=="Regression3"]

If you want to do a for loop, this might work as well:
desired_col <- c(2, 3, 4)  # column indices of p1, p2, p3 in m; this can be any selection
for(i in desired_col) {
  # paste0("reg", i - 1) yields reg1, reg2, reg3 to match the question
  m[[paste0("reg", i - 1)]] <- m[, i] * s[match(i, desired_col), 3]
}

Related

Count the number of rows in each column of a dataframe that satisfy a specific condition

I'm new to R, by the way, so I am sorry if this seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than the corresponding threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies each threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements in each column that respect their threshold should then be put in a vector. So for example:
Column 1: values = 1, 2, 3. Threshold = (only values lower than 3)
Column 2: values = 4, 5, 6. Threshold = (only values lower than 6)
Output: a vector (2, 2), since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
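For the example data above this returns 2 2 (two values below 3 in column a, two below 6 in column b). An equivalent sketch that avoids the explicit column indices is mapply(), which walks the columns of df and the thresholds in parallel:
# pair each column of df with its threshold and count the TRUE cases
mapply(function(col, th) sum(col < th), df, threshold)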

How to find approximately close values in two data frames with different numbers of rows?

I have two data frames, one with 24 rows x 2 columns and another with 258 rows x 2 columns. The columns are similar; I am interested in one column and want to find the values in the two data frames that are approximately close to each other.
I am trying to simulate a spectrum and compare with an experiment.
df_exp <- data.frame("Name"=c(exp,Int), "exp" = c(x1, x2, x3, ...,x258),"int"= c(y1,y2,y3,...,y258))
df_sim <- data.frame("Name"=c(sim,Int), "sim" = c(x1, x2, x3, ...,x24),"int" = c(y1,y2,y3,...,y24))
Initial values (exp column from df_exp and sim column from df_sim):
exp sim
206.0396 182.0812
207.1782 229.1183
229.0776 246.1448
232.1367 302.1135
241.1050 319.1401
246.1691 357.1769
250.0235 374.2034
... ...
I tried this r code
match(df_exp$exp[1:258], df_sim$sim[1:24], nomatch = 0)
This code gives me all zero values because there is no exact match; the numbers always differ in the decimal places. I tried rounding the numbers to zero decimal places and finding values that are close, but that is not my intent. I want to find df_exp(229.0776, 246.1691, ...) and df_sim(229.1183, 246.1448, ...) and make a new data frame with all those approximately close values. Can you please suggest a way to do this?
You can define a similarity cutoff and loop over the values of both data frames:
### define your cutoff for similarity
cutoff <- 0.01
### initialize vectors to store the similar values
similar_sim <- vector(); similar_exp <- vector()
### loop over the values of both data frames
for (sim_value in df_sim$sim) {
  for (exp_value in df_exp$exp) {
    ### if similar (difference below the cutoff), append the values to the vectors
    if ( abs(sim_value - exp_value) < cutoff ) {
      similar_sim <- append(similar_sim, sim_value)
      similar_exp <- append(similar_exp, exp_value)
    }
  }
}
### recreate a data frame with the similar values
similar_df <- data.frame(similar_sim, similar_exp)
This is the way to go if you want to keep every value of one vector that is similar to a value of the other, as your question suggests. Otherwise you can skip a loop and use range selection around a single reference value y, e.g.:
x[x < y + cutoff & x > y - cutoff]  # values of x within 'cutoff' of a reference value y
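A vectorized sketch of the same idea, using only the df_exp, df_sim, and cutoff defined above: outer() computes all pairwise differences at once, and which(..., arr.ind = TRUE) returns the indices of the close pairs.
# matrix of |sim - exp| for every pair; rows index df_sim, columns index df_exp
d <- abs(outer(df_sim$sim, df_exp$exp, "-"))
hits <- which(d < cutoff, arr.ind = TRUE)
similar_df <- data.frame(similar_sim = df_sim$sim[hits[, "row"]],
                         similar_exp = df_exp$exp[hits[, "col"]])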

Resampling with entries with same code (ID)

In R, I'm trying to resample my dataset.
The database A includes some codes in the first column (integer) and characteristics of each row as follows:
A <- as.matrix(cbind(floor(runif(1000, 1,101)), matrix(rexp(20000, rate=.1), ncol=20) ))
Some codes are repeated in the first column.
I want to randomly resample codes from the first column and create a new matrix or dataframe such that, for each entry in the resampled code vector, I get the corresponding right-hand-side columns. If several rows share the same resampled code, all of them should be included. Also, if the same code is resampled twice, all rows in A with that code should appear twice.
---EDIT---
The resampling is done with replacement. So far what I did is:
res <- resample(unique(A[,1]), size = length(unique(A[,1])) , replace = TRUE, prob= NULL)
A.new <- A[which(A[,1] %in% res),]
However, assume that two lines in A have the same code (say 2), and that the vector res selects 2 four times. In A.new, 2 will appear only twice (because there are two lines coded as 2 in A[,1]), instead of having these two lines repeated four times.
We can do it like this:
A.new <- lapply(res, function(x) A[A[, 1] == x, , drop = FALSE])
A.new <- do.call(rbind, A.new)
The first line makes a list of matrices in which each value of res creates a list item that is the subset of A for which the 1st column equals that value of res. If res contains the same number more than once, a matrix will be created for each occurrence of that value.
The second line uses rbind to condense this list into a single matrix.
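A minimal end-to-end sketch under the question's setup (the data generation is copied from the question; base sample() is assumed in place of the resample() helper used there):
set.seed(1)  # only for reproducibility of this sketch
A <- as.matrix(cbind(floor(runif(1000, 1, 101)), matrix(rexp(20000, rate = .1), ncol = 20)))
# draw codes with replacement, then stack every row of A matching each drawn code
res <- sample(unique(A[, 1]), size = length(unique(A[, 1])), replace = TRUE)
A.new <- do.call(rbind, lapply(res, function(x) A[A[, 1] == x, , drop = FALSE]))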

Method in [R] for arrays of data frames

I am looking for a best practice for storing multiple vector results of an evaluation performed at several different values. Currently, my code does something like this:
q <- 55
value <- c(0.95, 0.99, 0.995)
a <- rep(0, q)  # just initialize the vector
b <- rep(0, q)  # just initialize the vector
df <- list()    # list to hold one data frame per element of value
for(j in 1:length(value)){
  for(i in 1:q){
    a[i] <- rnorm(1, i, value[j])  # just as an example function
    b[i] <- rnorm(1, i, value[j])  # just as an example function
  }
  df[[j]] <- data.frame(a, b)
}
I am trying to find the best way to store the individual a and b vectors for each value level, so that I can iterate through the variable "value" later for graphing, and so that the value of the variable "value" (and/or a description of it) is available.
I'm not exactly sure what you're trying to do, so let me know if this is what you're looking for.
q <- 55
value <- c(sd95 = 0.95, sd99 = 0.99, sd995 = 0.995)
a <- sapply(value, function(v) {
  rnorm(q, 1:q, v)
})
In the code above, we avoid the inner loop by vectorizing. For example, rnorm(55, 1:55, 0.95) will give you 55 random normal deviates, the first drawn from a distribution with mean=1, the second from a distribution with mean=2, etc. Also, you don't need to initialize a.
sapply takes the place of the outer loop. It applies a function to each value in value and returns the three vectors of random draws as the columns of the matrix a. I've added names to the values in value, and sapply uses those as the column names in the resulting matrix a (wrap it in as.data.frame() if you need a data frame). (It would be more standard to make value a list rather than a vector with named elements. You can do that with value <- list(sd95=0.95, sd99=0.99, sd995=0.995) and the code will otherwise run the same.)
You can create multiple data frames and store them in a list as follows:
q <- list(a = 10, b = 20)
value <- list(sd95 = 0.95, sd99 = 0.99, sd995 = 0.995)
df.list <- sapply(q, function(i) {
  sapply(value, function(v) {
    rnorm(i, 1:i, v)
  })
})
This time we have two different values for q, and we wrap the sapply code from above inside another call to sapply. The inner sapply does the same thing as before, but now it gets the value of q from the outer sapply (using the dummy variable i). We're creating two matrices of draws, one called a (10 rows) and one called b (20 rows), due to the values we set in q; since their dimensions differ, the outer sapply cannot simplify them and returns them in a list called df.list.
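One hedged way to iterate over the stored results later for graphing (assuming the df.list built above; matplot() draws one line per column):
# one plot per element of q; the column names carry the value labels (sd95, sd99, sd995)
for (nm in names(df.list)) {
  draws <- df.list[[nm]]
  matplot(draws, type = "l", lty = 1, main = nm,
          xlab = "index (mean of each draw)", ylab = "random draw")
  legend("topleft", legend = colnames(draws), col = seq_len(ncol(draws)), lty = 1)
}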

Extract data using a matching matrix pair of data in R

I have two data sets with latitude, longitude, and temperature data. One data set corresponds to a geographic region of interest, with the lat/long pairs that form the boundary and contents of the region (matrix dimension = 4518 x 2).
The other data set contains lat/long and temperature data for a larger region that envelops the region of interest (matrix dimension = 10875 x 3).
My question is: How do you extract the appropriate row data (lat, long, temperature) from the 2nd data set that matches the first data set's lat/long data?
I've tried a variety of "for loops," "subset," and "unique" commands but I can't obtain the matching temperature data.
Thanks in advance!
10/31 Edit: I forgot to mention that I'm using "R" to process this data.
The lat/long data for the region of interest was provided as a list of 4,518 files containing the lat/long coordinates in the name of each file:
x <- dir()
lenx <- length(x)
g <- strsplit(x, "_")
coord1 <- matrix(NA, nrow = lenx, ncol = 1)
coord2 <- matrix(NA, nrow = lenx, ncol = 1)
for(i in 1:lenx) {
  coord1[i, 1] <- unlist(g)[2 + 3*(i - 1)]
  coord2[i, 1] <- unlist(g)[3 + 3*(i - 1)]
}
coord1 <- as.numeric(coord1)
coord2 <- as.numeric(coord2)
coord <- cbind(coord1, coord2)
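A vectorized sketch of the same extraction, assuming every file name splits into the same three "_"-separated parts with the coordinates in positions 2 and 3:
parts <- do.call(rbind, strsplit(x, "_"))  # one row of name parts per file
coord <- cbind(as.numeric(parts[, 2]), as.numeric(parts[, 3]))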
The lat/long and temperature data were obtained from a NetCDF file with temperature data for 10,875 lat/long pairs:
long <- tempcd$var[["Temp"]]$size[1]
lat <- tempcd$var[["Temp"]]$size[2]
time <- tempcd$var[["Temp"]]$size[3]
proj <- tempcd$var[["Temp"]]$size[4]
temp <- matrix(NA, nrow = lat*long, ncol = time)
lat_c <- matrix(NA, nrow = lat*long, ncol = 1)
long_c <- matrix(NA, nrow = lat*long, ncol = 1)
counter <- 1
for(i in 1:lat){
  for(j in 1:long){
    temp[counter, ] <- get.var.ncdf(precipcd, varid = "Prcp", count = c(1, 1, time, 1), start = c(j, i, 1, 1))
    counter <- counter + 1
  }
}
temp_gcm <- cbind(lat_c, long_c, temp)
So now the question is: how do you extract the values from "temp_gcm" that correspond to the lat/long data pairs in "coord"?
Noe,
I can think of a number of ways you could do this. The simplest, albeit not the most efficient, would be to make use of R's which() function, which takes a logical argument, while iterating over the data frame you want to apply the matches to. Of course, this assumes that there can be at most a single match in the larger data set. Based on your data sets, I would do something like this:
attach(temp_gcm)  # adds the temp_gcm column names to the search path
attach(coord)     # adds the coord column names to the search path
matched.temp <- numeric(nrow(coord))  # to store matching results
for (i in seq_len(nrow(coord))) {
  matched.temp[i] <- temp[which(lat_c == coord1[i] & long_c == coord2[i])]
}
# Now add the results column to the coord data frame (indexes match)
coord$temperature <- matched.temp
The function which(lat_c == coord1[i] & long_c == coord2[i]) returns a vector of all rows in the dataframe temp_gcm which satisfy lat_c and long_c matching coord1 and coord2 respectively from row i in the iteration (NOTE: I'm assuming this vector will only have length 1, i.e. there is only 1 possible match). matched.temp[i] will then be assigned the value from the column temp in the dataframe temp_gcm which satisfied the logical condition. Note that the goal in doing this is that we create a vector which has matched values that correspond by index to the rows of the dataframe coord.
I hope this helps. Note that this is a rudimentary approach, and I would advise looking up the function merge() as well as apply() to do this in a more succinct manner.
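As a hedged sketch of the merge() route mentioned above (it assumes both objects are converted to data frames with shared column names, and takes the third column of temp_gcm, i.e. the first time step, as the temperature of interest):
coord_df <- data.frame(lat = coord[, 1], long = coord[, 2])
temp_df <- data.frame(lat = temp_gcm[, 1], long = temp_gcm[, 2], temp = temp_gcm[, 3])
# inner join: keeps only the lat/long pairs present in both data frames
matched <- merge(coord_df, temp_df, by = c("lat", "long"))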
I added an additional column of zeros to use as an indicator set by an IF statement. "x" is the number of rows in temp_gcm, "y" is the number of columns (representing time steps), and "temp_s" is the standardized temperature data:
indicator <- matrix(0, nrow = x, ncol = 1)
precip_s <- cbind(precip_s, indicator)
temp_s <- cbind(temp_s, indicator)
for(aa in 1:x){
  current_lat <- latitudes[aa, 1]    # latitudes corresponding to the larger area
  current_long <- longitudes[aa, 1]  # longitudes corresponding to the larger area
  for(ab in 1:lenx){                 # lenx corresponds to nrow(coord)
    if(current_lat == coord[ab, 1] & current_long == coord[ab, 2]) {
      precip_s[aa, (y/12 + 1)] <- 1  # y/12+1 corresponds to the "indicator" column
      temp_s[aa, (y/12 + 1)] <- 1
    }
  }
}
precip_s <- precip_s[precip_s[, (y/12 + 1)] > 0, ]  # removes rows with "0"s remaining in the "indicator" column
temp_s <- temp_s[temp_s[, (y/12 + 1)] > 0, ]
precip_s <- precip_s[, -(y/12 + 1)]                 # removes the "indicator" column
temp_s <- temp_s[, -(y/12 + 1)]
