Search and replace rows of a data frame in R - r

search <- function(x,max_hp){
count <- 1
result <- matrix(NA, nrow =nrow(x), ncol = ncol(x))
for(i in 1:nrow(x)){
temp_row <- x[i,]
if(temp_row[4] < max_hp){
result[count,] <- temp_row
count <- count + 1
}
}
return(result)
}
I want to find the rows of the mtcars data frame in R that have hp > 240 using a for loop (iterating over each row of the data frame) and then return only the ones that match. I want to store each matched row in an initially empty matrix, but my code doesn't work.

I have too few points to comment, but I have a couple of points to share. First, I agree with @Otto Kässi and @seeellayewhy. I would just add that if you don't want rows where mtcars$hp is NA to remain in your result (plain logical indexing returns an all-NA row for each NA in the condition), you need to use which():
result <- mtcars[which(mtcars$hp > 240),]
Regarding substituting rows, I would just follow the above command with
result <- rbind(result,newrows)
R will complain if any attributes of the columns in newrows are different than in result, especially if any of your columns are factor data types with any difference in the levels defined.
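For reference, here is a minimal sketch of how the loop from the question could be made to work, next to the vectorized one-liner. The function name search_rows and its min_hp argument are made up for illustration. The sketch assumes the goal is hp > 240, as stated in the question text, and it records matches in a logical vector instead of filling a pre-sized matrix.

```r
# Hypothetical helper (not from the original post): keep the rows of x
# whose hp column exceeds min_hp, checking one row at a time.
search_rows <- function(x, min_hp) {
  keep <- logical(nrow(x))               # one TRUE/FALSE flag per row
  for (i in seq_len(nrow(x))) {
    # is.na() guard: a missing hp value never counts as a match
    keep[i] <- !is.na(x$hp[i]) && x$hp[i] > min_hp
  }
  x[keep, , drop = FALSE]                # return only the matching rows
}

loop_result <- search_rows(mtcars, 240)

# Vectorized equivalent from the answer above
vec_result <- mtcars[which(mtcars$hp > 240), ]
```

Both objects should hold the same rows; the loop version is only worth keeping if iterating row by row is a requirement.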

Related

Re order issue in for loop R

I am looking to reorder my data into a new dataframe (a list in the example below) in which the first observation comes first and the last observation comes second; both observations are then removed from the initial dataframe, and the process repeats.
data <- seq(1,12,1)
i <- 1
ii <- 1:length(data)
newData <- seq(1,12,1)
for (i in ii){
a <- 1
newData[i] <- data[a]
i <- i + 1
b <- as.numeric(length(data))
newData[i]<- data[b]
index <- c(a, b)
data <- data[-index]
i <- i + 1
}
I receive the error: "Error in newData[i] <- data[b] : replacement has length zero" and the loop stops at i = 8, and the list "data" is empty.
If I run the contents of the loop, but not the loop itself, I get my desired outcome both in this example and my task; but obviously I want to run the loop given the size of my data.
As MrFlick mentioned, changing the index inside a for loop has no effect on the next iteration. But given you only need every second index, you can specify that in your loop definition by using
ii <- seq(1,length(data),2)
However, you don't need a for loop to rearrange the elements of your vector data. You only need an index vector of the form (first, last, second, second last, etc.):
m = matrix(c(1:6,12:7), ncol=2)
i = as.vector(t(m))
newdata = data[i]
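The matrix above hard-codes the length 12. As a rough sketch of the same idea for a vector of any length, the index can be built by interleaving a forward and a backward sequence; the names n and idx are introduced here only for illustration.

```r
data <- seq(1, 12, 1)
n <- length(data)

# rbind() stacks the forward indices over the backward indices; reading the
# resulting 2 x n matrix column by column gives 1, n, 2, n-1, ...
# Only the first n entries are needed, after that the pairs repeat.
idx <- as.vector(rbind(seq_len(n), rev(seq_len(n))))[seq_len(n)]

newData <- data[idx]
```

For an odd-length vector the middle element ends up last, which matches what the remove-first-and-last procedure would produce.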

How to make a for loop to find the average of the columns in several data frames and add a new column?

I looked at the variations of this question, but was not able to find an answer that worked, so here goes. I have a lot of data frames, each representing a psychological index (the kind where they ask several questions, and the average of them all gives you a score on what it is that you are measuring: anger, anxiety, etc.). For this example, I will choose three of them: SA, SE, GT.
I would like to make a for loop to automatically calculate the average of the columns in each data frame, and then add a new column with that average.
I was able to make a for loop to do this for one data frame, but how do I then loop this loop to do it for all of my data frames (which is a lot more than 3)?
#This is the for loop to do it for just one data frame (SA)
avg <- c()
for(i in 1:nrow(SA)){
avg[i] <- (sum(as.numeric(SA[i,]), na.rm =T)/ncol(SA))
}
SA$avg <- avg
#This is what I tried to do for multiple:
my.list <- list(SA, SE, GT)
for(l in my.list){
avg <- c()
for(i in 1:nrow(l)){
avg[i] <- (sum(as.numeric(l[i,]), na.rm =T)/ncol(l))
}
l$avg <- avg
}
This may work for you. I've created some dummy data frames, assuming that you have the same number of observations for each psychological index. You then bash them all together into one big dataframe. The colMeans function will compute means for each column:
SA <- data.frame(SA=runif(10))
SE <- data.frame(SE=runif(10))
GT <- data.frame(GT=runif(10))
MP <- data.frame(MP=runif(10))
df <- cbind(SA, SE, GT, MP)
av <- colMeans(df, na.rm = TRUE)
If the indices have differing numbers of observations, you can combine them into a list as you did, and then use the function sapply(). Since each element of the list is a dataframe, you need to extract the actual column by using the index operator [, 1] (first column):
df <- list(SA, SE, GT, MP)
sapply(df, function(x) mean(x[,1], na.rm=TRUE))
UPDATE:
You can create a list of your dataframes again, but as you need means across rows, just use the rowMeans() function:
SA <- data.frame(matrix(runif(50), nrow=10))
SE <- data.frame(matrix(runif(80), nrow=10))
df <- list(SA, SE)
lapply(df, function(x) {x$index_means <- rowMeans(x, na.rm=TRUE); return(x) })
This will give you a list of data frames with a new column of means for each index.
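If you prefer to keep an explicit for loop, here is a small sketch of one way to do it. The loop in the question changes l, which is only a copy of each list element, so the new column is thrown away at the end of every iteration; assigning back into the list by name avoids that. The list names and the set.seed() call are assumptions made for this example. Note also that rowMeans(..., na.rm = TRUE) divides by the number of non-missing values, whereas the original loop divides by ncol().

```r
set.seed(1)  # only so the dummy data is reproducible
SA <- data.frame(matrix(runif(50), nrow = 10))
SE <- data.frame(matrix(runif(80), nrow = 10))

# A named list, so each result can be picked out by index name afterwards
my.list <- list(SA = SA, SE = SE)

for (nm in names(my.list)) {
  # Write into the list element itself, not into a loop-local copy
  my.list[[nm]]$avg <- rowMeans(my.list[[nm]], na.rm = TRUE)
}

head(my.list$SA)
```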

for loop only showing result of one case in R

I intend to fill a matrix I created that has 1000 rows and 2 columns. Here B is 1000.
resampled_ests <- matrix(NA, nrow = B, ncol = 2)
names(resampled_ests) <- c("Intercept_Est", "Slope_Est")
I want to fill it using a for loop looping from 1 to 1000.
ds <- diamonds[resampled_values[b,],]
Here, each ds (there should be 1000 versions of it in the for loop) is a data frame with 2 columns and 2000 rows, and I would like to use the lm() function to get the beta coefficients from the two columns of data.
for (b in 1:B) {
#Write code that fills in the matrix resampled_ests with coefficient estimates.
ds <- diamonds[resampled_values[b,],]
lm2 <- lm(ds$price~ds$carat, data = ds)
rowx <- coefficients(lm2)
resampled_ests <- rbind(rowx)
}
However, after I run the loop, resampled_ests, which is supposed to be a matrix of 1000 rows, only shows 1 row, i.e. 1 pair of coefficients. When I test the code outside of the loop by replacing b with numbers, I get different results, which are correct. But when I put them together in a for loop, I don't seem to be row binding all of these different pairs of coefficients. Can someone explain why the result matrix resampled_ests is only showing one result case (1 row) of data?
rbind(x) returns x because you're not binding it to anything. If you want to build a matrix row by row, you need something like
resampled_ests <- rbind(resampled_ests, rowx)
This also means you need to initialize resampled_ests before the loop.
If you're initializing it anyway, I would just make a 1000 x 2 matrix of zeros and fill in the rows in the loop. Something like...
resampled_ests <- matrix(rep(0, 2*B), nrow=B)
for (b in 1:B) {
ds <- diamonds[resampled_values[b,],]
lm2 <- lm(ds$price~ds$carat, data = ds)
rowx <- coefficients(lm2)
resampled_ests[b,] <- rowx
}
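Here is a self-contained sketch of the preallocate-and-fill pattern that can be run without the diamonds data. The data frame dat, its synthetic values, and the sizes used are all assumptions made for this example. It also uses colnames() via dimnames, since names() does not label the columns of a matrix.

```r
set.seed(42)

# Synthetic stand-in for diamonds (hypothetical values, not real data)
dat <- data.frame(carat = runif(200, 0.2, 2))
dat$price <- 500 + 4000 * dat$carat + rnorm(200, sd = 300)

B <- 100  # number of resamples (1000 in the question)

# One row of resampled row indices per bootstrap replicate
resampled_values <- matrix(sample(nrow(dat), B * 50, replace = TRUE), nrow = B)

# Preallocate the result, with column labels set through dimnames
resampled_ests <- matrix(NA_real_, nrow = B, ncol = 2,
                         dimnames = list(NULL, c("Intercept_Est", "Slope_Est")))

for (b in seq_len(B)) {
  ds <- dat[resampled_values[b, ], ]
  # Fill row b instead of overwriting the whole matrix
  resampled_ests[b, ] <- coef(lm(price ~ carat, data = ds))
}

head(resampled_ests)
```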

Split data to make train and test sets - for loop - insert variable to subset by row

I am trying to subset this data frame by pre determined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set pre determined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
# subset by row number, each row number in variables train.length.1,2etc..
train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
test.list[[i]] <- df[test.length.[i],] # place test.length.n variable here..
# save outcome to lists
}
My question is: if I have my row numbers stored in variables, how do I place the ith one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
However, that pastes it as a character string and doesn't read my train.length.n variable, as shown below:
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If I put the variable in there by itself, it works as intended. I just need it to work in a for loop.
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but it's cumbersome for lots of train/test sets, hence the for loop.
Consider staggered seq() and pass the number sequences in lapply to slice by rows. Also, for equal-length dataframes, you likely intended starts at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
n <- nrow(data_frame)
r <- rep(1:ceiling(n/chunk_size),each=chunk_size)[1:n]
sub_frames <- split(data_frame,r)
return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame into chunks of 50:
chunked_frames <- split_frame_by_chunks(df, 50)
Decide the number of train/test splits to create in the loop:
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question (i.e. the first iteration creates a train and a test set with rows 1-50 and 51-100 respectively). Use [[ ]] so that you get the data frame itself rather than a one-element list:
for(i in 1:num_splits) {
this_train <- chunked_frames[[i]]
this_test <- chunked_frames[[i+1]]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
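To address the literal question of using the ith set of row numbers inside the loop, one option is to keep the index vectors in lists rather than in separately numbered variables. The sketch below assumes the same ranges as in the question; the names train.idx and test.idx are made up for illustration. If the numbered variables have to stay, get(paste0("train.length.", i)) would look each one up by name, but a list is usually easier to work with.

```r
df <- data.frame(data = 1:200)

# Row numbers for each split, stored by position in a list
train.idx <- list(1:50, 50:100)
test.idx  <- list(50:100, 100:150)

train.list <- vector("list", length(train.idx))
test.list  <- vector("list", length(test.idx))

for (i in seq_along(train.idx)) {
  # [[i]] pulls out the ith index vector; drop = FALSE keeps a data frame
  # even though df has a single column
  train.list[[i]] <- df[train.idx[[i]], , drop = FALSE]
  test.list[[i]]  <- df[test.idx[[i]], , drop = FALSE]
}
```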

How to vectorize a for loop in R

I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.
datanew <- c()
for (i in 1:51){
for (k in 1:6){
for (m in 1:4){
sub <- subset(data,data$var1==i & data$var2==k)
sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
sub$newvar <- m
datanew <- rbind(datanew,sub)
}
}
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=100, nrow=20e2))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
multpVec <- c(rep(c(0, 1), times=c(4, ncol(mydf)-4-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
  mapply(function(i, k)
      # The function that you are `mapply`ing is:
      # rbind'ing a list of dataframes, which were subsetted by matching var1 & var2
      # and then multiplying by a value in filingstat
      do.call(rbind,
        # iterating over m
        lapply(1:4, function(m)
          # the cbind is for adding the newvar=m, at the end of the subtable
          cbind(
            # we transpose twice: first the subset to multiply our vector.
            # Then the result, to get back our original form
            t( t(subset(mydf, var1==i & mydf$var2==k)) *
              (multpVec * filingstat0711[i,k,m] + invVec)),
            # this is an argument to cbind
            "newvar"=m)
      )),
    # the two lists you are passing as arguments are the columns of the expanded grid
    ixk$i, ixk$k, SIMPLIFY=FALSE
  )
# flatten the data frame
newdf <- do.call(rbind, newdf)
Two points to note:
1. Try not to use words like data, table, df, sub etc. as object names, since they are commonly used functions. In the above code I used mydf in place of data.
2. You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation.
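As a further sketch, and not what the answer above does, the i and k iterations can be avoided entirely by looking up each row's scalar with matrix indexing into the 3-d array. The toy data frame, the array fs and the other names below are made up for illustration and are much smaller than the real data. The result is ordered by m rather than by (i, k), so re-sort it if the row order matters.

```r
set.seed(1)
n <- 20

# Toy stand-in for the real data: two key columns, an id, two value columns
# to be scaled, and a last column that must stay untouched
toy <- data.frame(var1 = sample(1:3, n, TRUE), var2 = sample(1:2, n, TRUE),
                  id = seq_len(n), x = rnorm(n), y = rnorm(n), last = 1)

# Toy stand-in for filingstat0711, indexed as [var1, var2, m]
fs <- array(seq_len(3 * 2 * 4), dim = c(3, 2, 4))

cols <- 4:(ncol(toy) - 1)  # same column range as in the question's loop

expanded <- do.call(rbind, lapply(1:4, function(m) {
  out <- toy
  # A three-column index matrix picks one array element per row of toy
  scal <- fs[cbind(toy$var1, toy$var2, m)]
  out[cols] <- out[cols] * scal  # each selected column is scaled row by row
  out$newvar <- m
  out
}))
```

This leaves only the short loop over the four values of m, with one rbind at the end instead of one per subset.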
